[00:51:19] 10serviceops, 10Operations, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), and 4 others: Create puppet role for session storage service - https://phabricator.wikimedia.org/T215883 (10mobrovac) [00:57:43] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10mobrovac) [11:14:21] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Joe) @jijiki what is the total number of items stored... [14:40:24] o/ [14:40:44] how we doing on eventgate staging deployment? did that kafka network thing get merged? [14:42:42] akosiaris: ^ ? [14:42:46] yep akosiaris merged it this morning and it seems to work \o/ [14:42:53] yup [14:42:55] works fine [14:43:29] ottomata: FYI, It's already 16:43 over here. so ... the faster we deploy, the more responsive I 'll manage to be on IRC :-) [14:43:54] the moment the kid is back in the house, you lost me :) [14:51:12] yesterday i was checking if we can upgrade docker to a newer version on kubernetes workers, it seems current version has been imported from docker-ce one? [14:51:30] it seems buster will ship docker 18.9 so that's another option [14:52:07] my main motivation for the upgrade was the runC vulnerability and possibly performance optimizations, however that also means dropping devicemapper as the docker storage [14:52:24] im unsure about implications of that especially on labs [14:52:24] ah ok! let's go! [14:52:33] akosiaris: sorry! [14:52:39] i'm trying to fix my IRC nick too but that can wait [14:52:40] ok! 
[14:53:15] CLUSTER=staging scap-helm eventgate-analytics install -n staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [14:53:17] ya? [14:53:25] yup [14:53:26] go for it [14:55:23] ok it looks like it worked [14:55:23] but [14:55:29] i can't figure out commands to inspect it... [14:55:38] Error: Get http://localhost:8080/api/v1/namespaces/kube-system/pods?labelSelector=app%3Dhelm%2Cname%3Dtiller: dial tcp [::1]:8080: connect: connection refused [14:55:47] CLUSTER=staging scap-helm eventgate-analytics status staging [14:55:51] still on Init phase [14:55:55] probably cloning the repo [14:56:05] ah ah ah right [14:56:19] Warning FailedMount 46s (x8 over 1m) kubelet, kubestage1002.eqiad.wmnet MountVolume.SetUp failed for volume "config-volume" : configmaps "config-staging" not found [14:56:21] mmm [14:56:22] ok, CLUSTER=staging not exported so it's not sure which one to query without it, got it [14:56:23] not so sure [14:56:34] config-staging? [14:56:34] hm [14:58:41] fsero: how did you see that? [14:58:57] sudo KUBECONFIG=/etc/kubernetes/admin-staging.config kubectl -n eventgate-analytics describe pod/eventgate-analytics-staging-86dc4565f8-r6q7b [14:59:11] the rest of the volumes were mounted fine [15:00:18] akosiaris: did we add the rule for gerrit? [15:00:30] yup [15:00:45] it's missing a configmap for the config volume [15:00:48] * akosiaris looking [15:00:50] yep [15:01:07] also this one is probably better since it's not admin sudo KUBECONFIG=/etc/kubernetes/eventgate-analytics-staging.config kubectl describe pod eventgate-analytics-staging-86dc4565f8-r6q7b -n eventgate-analytics [15:01:27] (and btw it misses the privileges to list configmaps and we should fix that) [15:01:29] indeed. and you don't need -n in this case [15:02:18] the sudo one logs the events tho [15:02:25] at the bottom of the output [15:03:06] so, the initContainers should finish first, right? 
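The "Init phase" mentioned above ("probably cloning the repo") corresponds to an initContainer in the chart's Deployment: a short-lived container runs to completion before the main container starts. A generic, hypothetical sketch of the pattern (container names, image, and repo URL are all made up, not taken from the actual eventgate chart):

```yaml
# Hypothetical sketch of the init-container pattern: a git clone runs
# into a shared emptyDir volume, and the service container only starts
# once the init container has exited successfully.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      initContainers:
        - name: clone-repo
          image: alpine/git              # stand-in image
          args: ["clone", "https://example.org/repo.git", "/srv/repo"]
          volumeMounts:
            - name: repo
              mountPath: /srv/repo
      containers:
        - name: example-service
          image: example/service:latest  # stand-in image
          volumeMounts:
            - name: repo
              mountPath: /srv/repo
              readOnly: true
      volumes:
        - name: repo
          emptyDir: {}
```

While an initContainer is still running (or retrying), the pod shows as `Init:0/1`, which matches the `kubectl get pods` output seen later in this session.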
[15:03:23] I see that the git clone one is still waiting PodInitializing [15:03:49] and i guess the other stuff is failing with that timeout just because the initContainer never finished [15:04:19] it will fail if initContainer doesnt end in 600 seconds [15:04:25] and then retry :) [15:04:41] the staging-config configmap is missing [15:04:52] that's weird... why ? [15:05:58] ah --dry-run --debug does create it [15:06:26] ah dammit [15:06:31] wrongly aligned --- [15:06:33] * akosiaris testing [15:08:46] oh? [15:09:00] how big is that configmap btw? [15:09:30] more than 1MiB maybe? [15:09:33] 5k [15:09:37] ok we are fine on that front [15:10:25] no not big [15:11:50] ok my fix worked [15:12:01] but we now are in a CrashLoopBackOff [15:12:28] probably because no new chart was built from the 2 changes ottomata pushed yesterday [15:14:07] oh wait [15:14:31] time="2019-02-13T15:14:23Z" level=fatal msg="Error loading config:yaml: line 25: did not find expected alphabetic or numeric character" source="main.go:207" [15:14:47] ok it's the statsd export that complains [15:14:57] and keep the pod in failure [15:15:22] ah I see the error [15:15:50] hm [15:15:58] missing single quotes btw [15:16:00] the changes I pushed yesterday should be just template related. [15:16:02] * akosiaris fixing that too [15:16:02] oh? [15:16:55] ok, purging and reinstalling my test release [15:17:54] ah great, running [15:18:06] ottomata: gimme 2 mins to upload patches [15:18:16] I missed that part during the review, sorry [15:19:11] sure! [15:23:08] ottomata: fix was https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/490342/ [15:24:18] ok great! [15:25:21] so, upgrade? [15:25:35] yes, but wait 30s [15:25:39] k [15:25:42] like [15:25:43] scap-helm eventgate-analytics upgrade staging stable/eventgate-analytics [15:25:43] ? [15:26:03] sorry, tell me again what the stable/eventgate-analytics part is? [15:26:10] ok you are good to go. 
version 0.0.2 of the chart is up [15:26:22] upgrade <...?> [15:26:28] stable => helm charts repo aka https://releases.wikimedia.org/charts [15:26:40] eventgate-analytics => chart name [15:26:40] AH k [15:26:49] hm [15:26:54] and it's damn hardcoded that's why "stable" [15:26:57] not my choice [15:27:10] ok trying upgrade [15:27:21] "staging" has no deployed releases [15:27:36] just install again then? [15:27:44] there's nothing in the list [15:27:45] OH [15:27:48] CLUSTER=staging [15:27:48] doh [15:27:50] :-) [15:28:38] eventgate-analytics-staging-b4bb564cf-xl99j 2/2 Running 0 36s [15:28:43] great [15:28:50] \o/ [15:28:53] akosiaris@deploy1001:/srv/scap-helm/eventgate$ curl 'http://kubestage1001.eqiad.wmnet:31192/?spec' [15:28:53] {"swagger":"2.0","info":{"version [15:28:54] etc etc etc [15:29:10] oh nice! [15:29:13] hm [15:29:53] so there are a couple of warnings, do they matter? [15:29:56] Warning FailedCreatePodSandBox [15:30:00] Warning MissingClusterDNS [15:30:08] the latter is fine [15:30:09] these two are not important IMHO [15:30:21] what's the first one about though? I don't remember right now [15:30:35] IPv6? [15:30:42] Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "eventgate-analytics-staging-b4bb564cf-xl99j_eventgate-analytics" network: failed to get IPv6 addresses for host side of the veth pair [15:31:09] heh? [15:31:11] wait what? [15:31:29] that shows up with a newer version of docker [15:32:05] aside q: did we get the LVS thing up? [15:32:11] i can start patches for that if not [15:32:23] ottomata: no we haven't cause first you need to deploy in production too [15:32:24] oh staging doesn't have the LVS [15:32:26] right right [15:32:28] ok [15:32:55] but that's the easy part. 
I also would like fsero to do it this time around, as a baptism of fire :-) [15:33:05] I 'll help ofc [15:33:06] hahaha [15:33:08] eh [15:33:11] count on it [15:33:21] im waiting for my tshirt [15:33:33] i just got an email saying that if we just adopt openshift/coreos, we can simplify a lot of the complexity and don't need as many k8s specialists [15:34:00] by how much ? [15:34:08] 5 [15:34:09] if it's > 50% count me in [15:34:19] by "hundreds of hours" [15:34:23] and you will also get a nice UI! [15:34:29] yes, one click updates [15:34:35] what are you thinking??? call! [15:34:41] and then we can put cloudflare in front of it [15:34:44] and we can go home [15:34:49] amirite [15:34:50] i am home [15:35:18] hahah [15:35:19] paravoid: yeah and also DoH and ESNI [15:35:21] i'm home too! [15:35:30] yippi [15:35:42] anyway, back to eventgate-analytics [15:35:48] ottomata: I guess you want to run tests? [15:37:15] after that, you can probably also deploy to production cluster as well. And we get to do the LVS stuff tomorrow [15:37:20] clusters* [15:37:22] i am home LOL [15:38:24] ok! i'm off tomorrow and friday and monday sooOOooO [15:38:29] lemme see if i can produce ya [15:39:33] the email I got recently is from an Intel salesman [15:39:41] "Intel has identified several Cloud Service Providers that we either have no direct relationship with, or a very limited one, and Wikimedia Foundation is one of them" [15:40:18] we are a cloud service provider? [15:40:28] technically yes [15:40:45] you know, the neighbours at wmcs [15:40:46] I 'll wear a t-shirt "Proud to be a Cloud Service Provider!!!" [15:41:18] jynus: I have no idea what you are talking about [15:41:19] :D [15:41:35] kubernetes is cloud? [15:41:36] akosiaris: how can I tail service logs...i guess logstash? 
[15:41:41] * akosiaris happy that eventgate-analytics went relatively smoothly [15:41:42] or can i do kubectl logs [15:41:47] ottomata: both [15:42:13] ottomata: kubectl logs -f --since 1m [15:42:29] especially for pods that have been running a long time or produce a lot of logs [15:43:08] ok i'm trying [15:43:09] KUBECONFIG=/etc/kubernetes/eventgate-analytics-staging.config kubectl logs -f --since 1m eventgate-analytics-staging-b4bb564cf-xl99j eventgate-analytics-staging [15:43:11] KUBECONFIG=/etc/kubernetes/eventgate-analytics-staging.config kubectl logs eventgate-analytics-staging-b4bb564cf-xl99j -c eventgate-analytics-staging [15:43:15] it's empty though [15:43:18] it seems to be working yeah [15:43:19] but empty [15:43:30] I guess the service-runner config has no type: stdout entry [15:43:37] is the process sending logs to stdout? [15:43:39] yeah [15:43:51] hm [15:43:59] btw about the ipv6 PodSandbox failure https://github.com/projectcalico/cni-plugin/pull/380 [15:44:01] OHHH [15:44:05] yet another reason to upgrade [15:44:08] because of the service.deploy == production [15:44:15] fsero: https://phabricator.wikimedia.org/T207804 btw [15:44:16] is overriding the default [15:44:21] which isn't set in minikube [15:44:29] this is a prereq for upgrading to newer docker [15:44:33] cause of that issue [15:45:35] great i'll add that to my radar :) [15:46:17] 10serviceops, 10Operations, 10Kubernetes, 10Patch-For-Review, 10User-fsero: Upgrade calico in production to version 2.4+ - https://phabricator.wikimedia.org/T207804 (10fsero) [15:48:34] btw I can produce a test event and it shows up in kafka yeehaw! [15:52:14] akosiaris: https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/490351/1/charts/eventgate-analytics/templates/config.yaml [15:52:15] ya? [15:52:26] oh should remove level, its set [15:52:33] globally [15:53:22] yup [15:53:25] but rest looks good [15:54:20] oh...i removed .Values.datacenter ! 
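The empty `kubectl logs` output above is consistent with the service-runner logger having no stdout stream configured. A hedged sketch of what such a logging section might look like (the exact keys and hosts in the eventgate config may differ; this is the general service-runner shape, not the deployed file):

```yaml
# Hypothetical service-runner logging config: without the stdout
# stream, nothing reaches the container's stdout and `kubectl logs`
# stays empty even though the service is running fine.
logging:
  level: info
  streams:
    - type: stdout        # makes logs visible to `kubectl logs`
    - type: gelf          # ships the same logs to logstash
      host: logstash.svc.eqiad.wmnet
      port: 12201
```

Both streams can coexist, which is why later in the session the logs show up in `kubectl logs` and in logstash at the same time.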
[15:54:22] hm [15:54:31] for the logstash hostname [15:55:27] is there a logstash discovery url? [15:56:05] not that I know of [15:56:07] hm [15:57:43] hm that sucks ok... [15:57:48] can values reference other values? [15:57:48] hehe [15:57:49] no [15:57:52] values is not a template. [15:57:52] hm [15:58:11] there is no logstash.svc.codfw.wmnet btw [15:58:16] oh! [15:58:21] so this is kind of moot :P [15:58:21] ok i should just use .eqiad everywhere? [15:58:23] great! [15:58:27] no need to deliberate great. [15:58:28] yup :-) [15:59:04] I should at some point tell you guys what maga means in greek btw [15:59:20] haha ? [15:59:53] :D [16:00:04] how do I tell when the new chart is available for upgrade? [16:01:36] scap-helm search eventgate lists the available version [16:01:57] but it's a manual process a bit. Lemme bump it for ya [16:02:29] oh ok [16:02:33] how to bump? [16:04:24] ottomata: https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/490354/ [16:04:34] merged and force puppet run on releases1001 [16:04:34] ahhhh [16:04:35] right [16:04:39] oh [16:04:44] then wait like 5 mins [16:04:52] that's the helm index thing [16:05:13] yeah we should add some CI to it and some more automation [16:05:18] we 've stalled doing it on purpose [16:05:20] akosiaris: btw i think there is a way to only reindex the one chart, so it doesn't bump all the timestamps for the other charts [16:05:35] that'd be nice [16:08:21] ok merging that etc. akosiaris [16:10:56] for the CI part it would be nice to also do a helm lint [16:11:03] do we have a phab task for that? [16:11:44] no, I don't think we have [16:11:55] we should indeed do at least a helm lint [16:12:11] i'll file one :) [16:15:05] huh...does an upgrade not create a new pod...? [16:15:25] depends highly on what gets changed [16:15:39] hm. [16:15:42] remember that sha256 trick I told you about at the deployment level? [16:15:47] nope! 
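The "values is not a template" point above is worth spelling out: Helm values files are plain YAML data, so one value cannot be composed from another (e.g. from `.Values.datacenter`); a per-environment values file simply carries the literal hostname. A hypothetical fragment (key name is made up for illustration):

```yaml
# Hypothetical values-file fragment. This does NOT work, because
# values files are not rendered as templates:
#   logstash_host: logstash.svc.{{ .Values.datacenter }}.wmnet
# Instead the per-environment values file spells the host out:
logstash_host: logstash.svc.eqiad.wmnet
```

As noted in the discussion, only logstash.svc.eqiad.wmnet exists anyway, so hardcoding eqiad everywhere was the pragmatic choice here.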
[16:16:21] https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/eventgate-analytics/templates/deployment.yaml#L21 [16:16:40] ah so a change in config.yaml will trigger a new deployment and new pods [16:16:57] note that stream-config is not there btw, so that one won't [16:17:35] and get events -w points out new pods are being created [16:18:37] get events -w... [16:19:16] no scratch that.. I am multitasking and read it wrong [16:21:06] ottomata: unless you do the checksum trick nope, however if you want to be sure you can always delete pods and fresh ones will start [16:21:19] these fresh ones will get the updated part from deployment [16:21:46] there's also --recreate-pods to helm upgrade [16:21:52] but I should test that at some point [16:22:10] it's just a delete pods under the covers :P [16:22:22] but yes you can do that as well ottomata [16:22:44] hm [16:22:54] hm so that was a config.yaml change though [16:22:59] i should add stream-config for sure [16:23:00] will do [16:23:01] but [16:23:15] this says chart eventgate-analytics-0.0.3 is deployed [16:23:20] but it's using the same pod id as before? [16:23:38] (and logs don't seem to come out) [16:24:52] 10serviceops, 10Kubernetes, 10User-fsero: add CI job into operations/deployments-charts repo that helm lint packages and perform the helm index after merge. - https://phabricator.wikimedia.org/T216049 (10fsero) [16:25:43] akosiaris: https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/490357/ [16:25:44] ya? [16:26:07] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. 
- https://phabricator.wikimedia.org/T212129 (10jijiki) @Joe I updated the table above [16:27:07] ottomata: +1ed [16:29:02] ok akosiaris i'm not sure what's up with the pod though [16:29:09] it says it is running version 0.0.3 of the chart [16:29:13] but the pod id is the same [16:29:18] and the logs aren't on stdout [16:36:14] sorry also at a meeting, will have a look in a bit [16:37:17] ok [16:44:29] ottomata: you need to repackage the helm chart and update the index [16:44:52] if you want the checksum part applied [16:45:12] fsero: right now that was just for next time, i haven't made any changes to stream-config.yaml [16:45:19] 0.0.3 had changes to config.yaml [16:45:21] which was being checksummed [16:45:23] already [16:45:40] i'm trying to figure out why I don't see those changes applied after upgrading [16:45:43] helm says that 0.0.3 is out [16:45:54] but the pod id is the same as it was before the upgrade [16:54:15] mmm https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/ae2c5f0de41217159577d6181a3178966f5b72ff this was the change right? [16:57:01] config-staging was changed 1h ago, deploy 1h ago and pod 1h ago [16:57:18] fsero yes that is the change... [16:57:31] hm [16:58:35] going to run upgrade again... [16:58:50] ok back [16:58:51] meeting finished [16:58:58] abruptly by a power outage [16:59:07] at least for me, but anyway [16:59:19] so .. helm upgrade and yet the pod was not recreated ? that's the issue? [16:59:57] yeah [17:00:19] but helm seems to have deployed the new chart version [17:00:24] eventgate-analytics-0.0.3 [17:00:56] i entered the pod and the new config is mounted [17:01:04] ! how did you enter the pod i've been trying to do that... 
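The "sha256 trick" discussed above (the deployment.yaml#L21 link) is the standard Helm idiom of hashing a rendered config template into a pod-template annotation: when the config changes, the annotation changes, the pod spec differs, and Kubernetes rolls new pods. A sketch of the idiom (generic shape per the Helm documentation, not the literal eventgate chart):

```yaml
# Sketch of the Helm checksum idiom: the rendered output of the
# config template is hashed into a pod annotation, so any config
# change alters the pod template and triggers a rollout on upgrade.
spec:
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/config.yaml") . | sha256sum }}
```

As noted in the chat, a template not covered by such an annotation (e.g. stream-config at the time) updates the mounted ConfigMap in place but does not recreate pods, which is exactly the "new chart deployed, same pod id" behavior being debugged here.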
[17:01:15] so if you just delete pod new one will get the config [17:01:26] sudo KUBECONFIG=/etc/kubernetes/admin-staging.config kubectl exec -ti eventgate-analytics-staging-b4bb564cf-xl99j bash -n eventgate-analytics [17:01:36] then cat /etc/eventgate [17:01:51] man getting the exact incantation with this is not easy! [17:01:57] haha [17:02:19] i need to run now, i'll be back in a couple of hours so please let me know how it goes [17:02:19] it's not something meant to be done much [17:02:20] ok, so maybe it did the right thing... [17:02:32] entering the pod that is [17:02:32] ottomata: if you just kubectl delete pod [17:02:33] but why didn't the pod id change? [17:02:36] fresh one will work [17:02:39] oh ya [17:02:39] hm [17:03:42] ok i guess i'll try that [17:04:04] what was the time of your last deploy? [17:04:18] 2019-02-13T16:13:01Z ? [17:05:01] i just tried upgrade again at :58 [17:05:06] but there was no new version [17:05:18] so :13 sounds right [17:05:20] there isn't gonna be a new version if nothing has changed [17:05:24] why would there be? [17:05:24] for the attempt at the 0.0.3 upgrade [17:05:26] right [17:05:35] if you do want to force a restart pass --recreate-pods [17:05:37] i just tried running upgrade again [17:05:37] but [17:05:40] the first time around [17:05:48] the first upgrade when 0.0.3 was new [17:05:52] was at :13 [17:06:41] just deleted the pod... [17:06:47] eventgate-analytics-staging-b4bb564cf-s7x98 0/2 Init:0/1 0 8s [17:06:55] new one gets created [17:07:19] and I see logs [17:07:27] {"name":"eventgate","hostname":"eventgate-analytics-staging-b4bb564cf-s7x98","pid":1,"level":40,"levelPath":"warn/service-runner","msg":"startup finished","time":"2019-02-13T17:06:49.665Z","v":0} [17:07:36] great! [17:07:38] those should also be sent to logstash per the configuration [17:07:50] ok, so we just aren't sure why the upgrade didn't recreate the pod itself [17:07:52] very strange [17:07:57] it seems the upgrade applied the configmap...? 
[17:08:08] to the running pod [17:08:14] but didn't restart the service or recreate a new pod [17:08:54] yes and I see logs in logstash too! [17:08:58] perfect! [17:15:39] ottomata: so, I think we are done for now, right ? [17:15:54] akosiaris: yes i think so thank you! everything is working great! [17:16:03] the next stuff is all mediawiki / beta related [17:16:04] oh qq [17:16:09] i am going to deploy eventgate to beta [17:16:14] i should just do that via puppet, etc? [17:16:20] or is there helm/k8s stuff there? [17:16:29] no kubernetes stuff there unfortunately [17:16:31] k [17:16:41] labs can't really support the production kubernetes way of doing things [17:16:49] but there is a puppet patch by giuseppe for it [17:16:58] profile::docker_services IIRC [17:17:03] haven't used it yet though [17:17:27] it's exactly there to serve beta for that exact purpose you just stated, but it's not really tested yet [17:18:25] anyway, I am signing off and shutting down equipment. There's a storm and I am having some weird power issues [17:21:20] ok thanks, i'll ask giuseppe about that [17:21:41] _joe_: should i look into that now or just use puppet + scap etc for beta deployment? ^^^ [17:21:57] <_joe_> sorry I wasn't reading here [17:22:53] <_joe_> ottomata: yes, and ofc with my help if needed [17:22:56] <_joe_> but, tomorrow? [17:23:14] <_joe_> and while we're here, do you need me at the meeting in a few minutes? [17:23:50] _joe_: i don't think so, all is going well atm. i'd only ask about ^^ but i can find you when i need help, maybe next week tho, i'm off til tuesday after today [17:24:10] <_joe_> oh, heh, sorry [17:24:20] <_joe_> but I have meetings starting at 7 am tomorrow [17:24:31] <_joe_> I'd like to have a 12 hour hiatus at least [17:28:09] no probs! 
:) [21:06:33] I haven't had time to look at it but I have noticed the k8s API latency alerts flapping for most of my afternoon [22:20:46] 10serviceops, 10Operations, 10Patch-For-Review: "sql" command fails with "sh: 1: mysql: not found" on mwdebug1002 - https://phabricator.wikimedia.org/T211512 (10Dzahn) 05Open→03Resolved a:03Dzahn @Krinkle The change above has been merged today. This removed the non-working sql / sqldump scripts from ca...