[07:47:31] hello people [07:47:45] one question - how should I generated the client_token for prometheus? [07:47:50] *generate [07:48:10] good thing you asked [07:48:32] you shouldn't, it's already generated. All clusters currently must share the same users due to the mess in hiera/puppet [07:48:54] up to now it hasn't been an issue since there were no other clusters, but I guess we should revisit it :-) [07:49:05] so just reuse what the other clusters use [07:49:51] akosiaris: buongiorno, grazie :D [07:50:12] I am opening a task for the namespace override thing as well [07:52:34] 10serviceops, 10Lift-Wing, 10Machine-Learning-Team: Allow namespaces to be overriden in deployment-chart's admin_ng - https://phabricator.wikimedia.org/T278208 (10elukey) [07:52:51] not sure if I have used the right terms :D [07:54:24] 10serviceops, 10Lift-Wing, 10Machine-Learning-Team, 10Kubernetes: Allow namespaces to be overriden in deployment-chart's admin_ng - https://phabricator.wikimedia.org/T278208 (10JMeybohm) [08:02:35] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10akosiaris) [08:03:18] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10akosiaris) [08:03:29] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10akosiaris) 05Open→03Resolved a:03akosiaris Added steps in the eqiad task T277741 from the action items list, I am gonna boldy resolve th... [08:17:30] akosiaris: should we start or wait for the puppetmaster situation to settle?
[08:18:25] jayme: settled [08:18:28] ;-) [08:18:36] great :) [08:18:43] let me inform -sre that we are starting [08:18:49] ack [08:24:27] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10ops-monitoring-bot) Icinga downtime set by akosiaris@cumin1001 for 1 day, 0:00:00 18 host(s) and their services with reason: Reinitialize eqia... [08:24:49] Step #1 down [08:24:51] done* [08:25:06] * jayme rebased patches [08:25:30] the pybal/prometheus/confd downtimes we still have to click, right? [08:27:31] yeah, but I added helpful links to those [08:27:52] ah, sweet! [08:27:58] we 'll figure out a good way to downtime them in the cookbook [08:29:32] okay. Will you downtime services and I do pybal/prometheus/confd? [08:30:26] services downtimed [08:30:35] jayme: +1 [08:30:44] I 'll depool eqiad now [08:30:51] ack [08:33:49] so, restbase-async doesn't need to change this time around. It's already in codfw [08:33:59] I 've striked that out in the task already [08:34:02] that's why I striked it :) [08:35:00] you did? I thought I did yesterday? damn I should be consuming something memory enhancing it seems [08:35:22] anyway, action item for later (/me consuming some memory enhancement or whatever) [08:35:36] puppet disabled across the cluster. [08:36:19] pybal/prometheus/confd downtimed until ~15:30 UTC [08:36:59] perfect. Traffic is being cut over as we speak [08:37:07] gonna wait another 5m and it should be done [08:38:00] there is a service missing in the list :o [08:38:03] zotero [08:38:20] oh dammit [08:38:34] how did that elude us? [08:38:39] did it last time too ?
[08:39:07] I would guess so [08:39:21] ok, added and reissued [08:39:23] now it's done [08:40:28] Looking at SAL we missed it last week [08:41:48] well, at least it was zotero which is only accessed by citoid which was switched over so no requests were making it to zotero anyway [08:42:07] so we got lucky it seems [08:42:54] https://grafana.wikimedia.org/d/000000519/kubernetes-overview?viewPanel=9&orgId=1&from=now-1h&to=now says we can begin [08:43:00] I 'll poweroff the master VMs [08:45:07] jayme: I 'll regenerate the kubemaster.svc.eqiad.wmnet cert now [08:45:19] okay, I'll merge the homer patch [08:49:54] akosiaris: the homer commands I can probably scope to the core routers like "cr*eqiad*" ? [08:51:20] hmm...no :) [08:51:34] yes you can [08:51:37] you should actually [08:51:55] homer "cr*codfw*" commit "T277191, adding kubernetes2017" [08:52:06] that was my homer invocation last time [08:52:19] but for "cr*eqiad" it just said "No diff" [08:52:31] "cr*eqiad*" [08:53:56] ;-) [08:54:35] oh, nono. I did "cr*eqiad*" diff and it selected both routers :) [08:54:50] yes, you want both :-) [08:55:01] you will have to answer twice yes though IIRC [08:55:19] I especially wanted a puppet run on cumin1001 before the diff ;-) [08:57:13] ok, Homer run completed successfully on 2 devices: ['cr1-eqiad.wikimedia.org', 'cr2-eqiad.wikimedia.org'] [08:58:12] perfect [08:58:20] so I guess, it's puppet merge and reimage time ? [08:58:23] akosiaris: I'll merge the worker puppet patch now and start reimaging [08:58:26] ack [08:58:32] ok, doing masters then [08:59:05] akosiaris: kill and reboot etcd first [09:00:23] indeed, good point [09:00:25] doing so [09:04:58] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` ['kubernetes1001.eqiad.wmnet', 'kubern...
[09:07:03] I'll do the ganeti instances as well [09:21:45] <_joe_> we can delete everything from etcd, if you don't want to reinitialize it, sorry late to the game [09:22:06] that's what we do :) [09:22:11] _joe_: done already [09:22:18] (reboot is just because of the kernel updates) [09:22:22] ah you mean pybal etcd ? [09:22:23] <_joe_> oh I see [09:22:27] <_joe_> no [09:22:31] we meant kubernetes etcd [09:22:34] <_joe_> I meant the k8s ones, [09:22:35] that is getting wiped [09:22:42] yeah the process is a wipe of that datastore [09:22:50] <_joe_> I thought you were going the nuclear way (remove the data files) [09:23:29] ETCDCTL_API=3 etcdctl --endpoints https://foobar.site.wmnet:2379 del "" --from-key=true [09:23:41] it's the neutron way [09:23:56] Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.15", GitCommit:"2adc8d7091e89b6e3ca8d048140618ec89b39369", GitTreeState:"clean", BuildDate:"2020-09-02T11:31:21Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"} [09:23:59] API is up and running [09:24:15] looking for issues now [09:24:46] old nodes are trying to authenticate, I see Unable to authenticate the request due to an error: [invalid bearer token, square/go-jose: error in cryptographic primitive] [09:24:58] akosiaris: did you do the controllermanager token? [09:25:04] which is when a no longer valid token is being sent to the API server, which should be from the nodes [09:25:05] yes I did [09:25:08] cool [09:25:28] ah dammit [09:25:35] oh [09:25:37] we depooled zotero but forgot to downtime it [09:25:38] lol [09:25:46] eh [09:25:48] ok ACKing it, [09:25:51] <_joe_> it's always zotero [09:26:46] dammit, I thought we would come through without an alert this time but now we got a page :-| [09:28:12] akosiaris: why the heck did we get "PROBLEM - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster1001 is CRITICAL" ? [09:28:34] obviously I've downtimed only the eqiad ones...
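For reference, the "delete everything" invocation quoted at [09:23:29] works by deleting from the empty key with --from-key=true, which covers the entire etcd v3 keyspace. A minimal sketch of the pattern; the endpoint is the placeholder hostname from the log, and the command is printed here rather than executed:

```shell
#!/bin/sh
# Sketch of the etcd v3 full-keyspace wipe quoted above.
# The endpoint is the placeholder from the log, not a real host.
ENDPOINT="https://foobar.site.wmnet:2379"

# del "" --from-key=true deletes every key >= "", i.e. the whole keyspace.
WIPE_CMD="ETCDCTL_API=3 etcdctl --endpoints ${ENDPOINT} del \"\" --from-key=true"

# Printed instead of executed; a real run could be verified afterwards with
# 'etcdctl get "" --from-key=true --keys-only' returning nothing.
echo "$WIPE_CMD"
```

This is gentler than _joe_'s "nuclear" option of removing the data files, since the cluster and its raft state stay intact while the data is emptied.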
[09:28:57] oh, I put only the eqiad ones in the list, didn't I ? [09:28:59] sigh [09:29:27] <_joe_> that's ok [09:29:37] oh wait, that shouldn't be alerting [09:29:46] kubemaster on codfw should be fine [09:29:47] <_joe_> akosiaris: disregard [09:29:47] it's codfw [09:29:58] <_joe_> it's a problem with the monitoring [09:30:04] ? [09:30:05] <_joe_> nothing should be wrong in codfw [09:30:22] <_joe_> do you want the lengthy explanation? [09:30:38] I think I have an idea. kubemaster is also used for ML ? [09:30:49] <_joe_> no, it's much lamer [09:30:59] dammit, I was hoping for a duplicate key [09:31:40] <_joe_> the two files for codfw and eqiad come from the same confd template [09:31:47] <_joe_> so the error file is the same for both [09:31:57] <_joe_> so the two alerts fire up at the same time [09:32:01] ouch :) [09:32:06] <_joe_> we'd have to duplicate the file per dc [09:32:28] and this is just for kubemaster? [09:32:56] why isn't it happening for the other services? [09:33:15] it's probably for all, but we did not get into the race condition of all nodes being depooled at the same time [09:33:49] <_joe_> ^^ [09:34:34] ok [09:36:15] jayme: I think I 'll merge the admin_ng/ change while waiting for the reimaging to finish [09:37:40] akosiaris: sure, go ahead. You might as well also sync already. The nodes should pop in one by one then [09:37:52] akosiaris: forgot this one https://gerrit.wikimedia.org/r/c/operations/puppet/+/674269 [09:39:31] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10JMeybohm) [09:41:04] jayme: ok, doing so. I 've merged the kube1017 patch as well [09:41:11] thanks [09:45:11] nodes have indeed started showing up in the api [09:47:16] cool! 
[09:47:50] akosiaris: we forgot 2017 in conftool-data as well https://gerrit.wikimedia.org/r/c/operations/puppet/+/674270 [09:48:32] lol, can't wait to get out from behind pybal [09:48:34] merging [09:48:48] do we need to add default state as well somewhere? [09:49:08] 1017 is weight:0 pooled:inactive now [09:49:12] we used to have it and then we removed it [09:49:25] it was causing its own set of issues [09:49:43] ah, okay. So I just set the weight to 10 and pool it manually [09:56:05] akosiaris: did you do helmfile sync with the -l steps or in one go? [09:56:06] omg, all cluster components seem to be running fine [09:56:26] jayme: up to calico one by one. After calico said ok, I went for the rest in 1 go [09:56:33] but nothing failed to start with [09:57:02] great [09:59:14] so, I think we are at the look for problems and if none are spotted deploy services stage [10:01:48] yep. I currently have 1015 that fails the check for successful puppet run [10:02:05] but the node itself looks fine [10:03:32] yeah, 1004 complained about it but a quick puppet run was fine. I think it's just latency on monitoring catching up with the state [10:07:29] *1005 not 1015 I meant. That one still hangs in the reboot cookbooks check [10:10:56] 1005 is green now as well [10:14:20] yeah, looks like we got a few last ones pending for 100{2,3,8,14} but it's things I am not worried about (EDAC, SSH) [10:14:24] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1003.eqiad.wmnet', 'kubernetes1002.eqiad.wmnet', 'kubernetes1010.eqiad.wmne... [10:14:50] I think we can deploy everything indeed. jayme: wanna have the honours?
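The manual pooling step mentioned at [09:49:43] would look roughly like the conftool invocation below. This is a hypothetical sketch: the selector field, object path, and weight syntax depend on the conftool-data schema, so treat it as illustrative only (printed, not executed):

```shell
#!/bin/sh
# Hypothetical conftool invocation for pooling a freshly added node.
# Field names and the weight value are assumptions from the conversation.
NODE="kubernetes1017.eqiad.wmnet"
POOL_CMD="confctl select 'name=${NODE}' set/weight=10:pooled=yes"
echo "$POOL_CMD"
```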
[10:15:50] akosiaris: we have a CRITICAL: Hosts in IPVS but unknown to PyBal: set(['kubernetes1017.eqiad.wmnet']) (it's from 1.5h ago, so I scheduled a recheck) [10:16:25] akosiaris: yeah, I can do the deploy-all :D [10:16:34] yeah that one has me puzzled too [10:16:48] I 've looked into IPVS, for the various services 1017 is pooled indeed [10:17:02] akosiaris: kubernetes*017 also do still show up as "staged" in netbox [10:17:05] pybal of course would love to remove it [10:17:29] jayme: ah good point let me fix that while you deploy [10:17:39] akosiaris: let me know how [10:17:47] or what I was missing [10:18:55] editing a web ui manually ? [10:19:03] uh [10:19:07] Staged -> Active is a manual thing IIRC [10:19:13] did not expect that :) okay [10:19:27] but!!! I found out something interesting [10:19:37] https://netbox.wikimedia.org/dcim/devices/2875/ says kubernetes2017 is a juniper [10:19:40] volans ^ [10:19:46] I think think that's true :P [10:19:54] I don't think* [10:19:59] although what do I know ? [10:20:08] unless I reimaged it as a juniper :-P [10:21:16] akosiaris: looking [10:21:17] LOL [10:21:50] I don't think we set that from any automation IIRC [10:22:05] it might be set manually by dcops upon device creation, but I need to check to confirm [10:22:24] ok, maybe then a human error.
makes sense [10:22:42] but we could add a report to check that [10:22:51] * jayme pooled all nodes [10:23:33] btw I'm working for you (wrt icinga downtime in spicerack) [10:23:38] ok, both nodes set to active now in netbox and platform attribute removed [10:24:16] usually we set Linux [10:24:42] I can do that, easy enough [10:24:55] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10Joe) [10:26:21] akosiaris: pybal diff check recovery just came in [10:26:30] \o/ [10:26:49] anyway don't worry, I'll have a look at the platform and see the current status [10:26:53] confd template is still crying for kubemaster [10:27:41] I am sensing a .err file [10:27:54] but IIRC that needs some manual action to clean the errors up, right? [10:27:56] yes, that [10:28:02] root@puppetmaster1001:/var/run/confd-template# find . [10:28:02] . [10:28:02] ./.kubemaster873550242.err [10:28:02] ./.kubemaster246259658.err [10:28:08] let me delete those [10:28:17] _joe_: why do we even keep those around btw ^ ? [10:28:22] I can't remember right now [10:29:12] <_joe_> akosiaris: because we check if the file generated is newer than the error [10:29:35] <_joe_> the problem probably is the generated file is untouched from the version before the errors? [10:30:29] I 'll need to look at the check but I can tell you deleting those fixes it [10:30:29] I've created T278221 fwiw [10:31:07] confd compilation was ok though already, those looked like stale files [10:31:32] anyway, confd recoveries came in [10:32:21] It's so nice to run kubectl get pods --all-namespaces -w and just see pod creation just scrolling by [10:32:44] eventstreams being deployed now [10:33:27] is there a way to search and cancel all the downtimes I've created in icinga? [10:42:51] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=6 [10:43:01] not nice for sure.
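The stale-error cleanup above amounts to removing the hidden .err marker files that confd leaves behind after a failed template render; because the check compares the rendered file's mtime against the marker, an old marker keeps the alert firing. A sketch against a scratch directory (on the real host the directory is /var/run/confd-template):

```shell
#!/bin/sh
# Recreate the stale-marker situation from the log in a scratch directory
# (on the real host this is /var/run/confd-template), then clean it up.
DIR="$(mktemp -d)"
touch "${DIR}/.kubemaster873550242.err" "${DIR}/.kubemaster246259658.err"

# Delete only the hidden *.err markers; this is what let the check recover.
find "$DIR" -name '.*.err' -type f -delete

# Count what's left (expect 0).
REMAINING="$(find "$DIR" -name '.*.err' -type f | wc -l)"
rmdir "$DIR"
echo "$REMAINING"
```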
A lot of click click [10:43:29] okay, that's the path I took [10:54:51] deploy went good, double checking time I think [10:57:09] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10Joe) Some data from one appserver: - httpd uses less than 1 GB of memory and 1 cpu. If we assume we'll reduce the number of workers, it can be safe to assume e.g. 600 MB a... [10:58:53] _joe_: btw, regarding https://phabricator.wikimedia.org/T278220 I got a plan to get cross fleet actual data about it. It's a couple of patches and an ok from o11y (it's a bit of data). Just not today [11:00:07] <_joe_> akosiaris: I've gathered some data, but manually [11:00:42] no need, we got systemd :) [11:00:56] <_joe_> that's what I meant with "manually" :P [11:01:32] <_joe_> there is also the problem that e.g. apache will hoard memory over time, and I think that's because it's unconstrained [11:03:39] linkrecommendation is a bit interesting. one of the containers is flapping between ready and not ready [11:03:53] probably nothing to do with the reinit, but an interesting action item [11:05:14] hmm, yeah.
readiness probe times out from time to time [11:05:18] Warning Unhealthy 83s (x120 over 30m) kubelet, kubernetes1008.eqiad.wmnet Readiness probe failed: Get http://10.64.68.207:8000/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers) [11:05:23] yup [11:05:26] interesting [11:05:41] let me create a task for that and move on to the rest [11:05:43] that's weird...should be almost a zero-cost thing [11:06:06] There was that issue with gunicorn and the 2 few workers, perhaps it's related [11:06:11] too few* [11:06:43] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10Joe) [11:06:59] yeah, sounds promising [11:07:34] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10Joe) [11:08:57] 10serviceops, 10Add-Link, 10Growth-Team: linkrecommendation flap their readiness probes too often - https://phabricator.wikimedia.org/T278223 (10akosiaris) [11:09:03] 10serviceops, 10Add-Link, 10Growth-Team: linkrecommendation flap their readiness probes too often - https://phabricator.wikimedia.org/T278223 (10akosiaris) p:05Triage→03Low [11:09:50] 10serviceops, 10Add-Link, 10Growth-Team: linkrecommendation flap their readiness probes too often - https://phabricator.wikimedia.org/T278223 (10akosiaris) Adding @kostajh and @Tgr for their information. [11:11:15] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10JMeybohm) > The goal is to pack 4 or even 5 pods in a single modern node. I recently created T277876 where I propose we should reserve some of the resources of each node... [11:13:03] I think everything is green, right ?
[11:13:10] 10serviceops, 10Lift-Wing, 10Machine-Learning-Team, 10Kubernetes: Allow k8s clusters to have their own k8s_infrastructure_users in puppet - https://phabricator.wikimedia.org/T278224 (10elukey) [11:13:20] 10serviceops, 10Lift-Wing, 10Machine-Learning-Team, 10Kubernetes: Allow k8s clusters to have their own k8s_infrastructure_users in puppet - https://phabricator.wikimedia.org/T278224 (10elukey) a:05klausman→03None [11:13:36] akosiaris: some other deployments hade timeouts during health checks as well https://logstash.wikimedia.org/goto/c5831fb2ac91c3be40ce9ed942556eb3 [11:13:41] *had [11:14:17] akosiaris: ignore me please [11:14:26] forgot to filter by cluster [11:15:10] so yeah. Everything looks pretty green [11:15:42] jayme: actually you are on to something https://logstash.wikimedia.org/goto/f678f4fd2634d1ff2fc329b247dc3943 [11:16:14] linkrecommendation is still heavily dominating that dashboard [11:16:20] yeah, but not eqiad specific [11:16:38] yeah, I don't think this has anything to do with eqiad [11:16:45] we just uncovered something while working on eqiad [11:17:36] btw, I 've said it before, but that's a pretty nice dashboard [11:17:44] elukey: ^ you will probably need it :P [11:17:48] thanks :) [11:18:16] that's top of what I'm able to do with kibana, though :D [11:18:34] 10serviceops: Productionise mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T278225 (10jijiki) p:05Triage→03Medium [11:19:27] jayme: I think it's lunch time over there. I 'd let it be for a while, then removing downtime after lunch [11:19:40] akosiaris: sounds good to me! [11:20:02] then let's see when we should pool the traffic back [11:20:06] ok, /me off for lunch [11:20:11] +1 [11:49:08] 10serviceops, 10Add-Link, 10Growth-Team: linkrecommendation flap their readiness probes too often - https://phabricator.wikimedia.org/T278223 (10kostajh) Thanks for the heads up. No idea what's causing that, but I don't think it's specific to our application code.
FWIW here is a stack trace of a call to `/he... [11:51:25] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10Joe) A typical appserver has 96 GB of memory and 48 cores. Let's assume we can use up to 85% of those with pods, which looks a bit conservative, but it's ok for our curren... [11:54:36] 10serviceops, 10Patch-For-Review: Productionise mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T278225 (10jijiki) [11:55:30] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10Joe) At 15 workers per pod, we get 5 pods per node (6 if we only reserve 5% of ram and cpu). That's more or less the maximum concurrency at which the sweet spot holds for... [12:17:13] 10serviceops, 10Discovery-Search, 10Maps, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): [OSM] Backport imposm3 to the debian channel - https://phabricator.wikimedia.org/T238753 (10MSantos) 05Open→03Resolved [13:16:10] akosiaris: health check from LVS failing would mean we had no healthy backend for linkrec, right? [13:18:06] yes, but this is weird. I mean.. why isn't this happening so much in codfw? [13:18:17] or is it ... let me check there [13:18:52] ah it's happening there too [13:19:18] however eqiad does seem to be worse off [13:19:31] and it's not even pooled [13:19:48] yeah [13:21:41] it's also something that clearly doesn't have to do with external vs internal releases [13:21:49] cause the internal ones issue that error too [13:25:43] ok, found it [13:26:06] memory usage is right on the limit and the pod is heavily throttled [13:26:07] hm? [13:26:12] oh [13:26:22] in both clusters? [13:26:25] why now and not before though...
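A hedged sketch of the kind of checks behind this diagnosis: the probe timeouts show up as pod events, and throttling can be read from the container_cpu_cfs_throttled_seconds_total metric that the cfs dashboard is built on. The namespace, label selector, and label values here are placeholders, and the commands are printed rather than executed:

```shell
#!/bin/sh
# Illustrative commands for diagnosing a throttled pod on the memory limit;
# the namespace and selector are placeholders.
NS="linkrecommendation"

# Probe failures such as "Readiness probe failed ... Client.Timeout" show
# up as events in the pod description:
DESCRIBE_CMD="kubectl -n ${NS} describe pod -l app=linkrecommendation"

# CPU throttling over the last 5 minutes, per container:
THROTTLE_QUERY='rate(container_cpu_cfs_throttled_seconds_total{namespace="linkrecommendation"}[5m])'

echo "$DESCRIBE_CMD"
echo "$THROTTLE_QUERY"
```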
[13:26:41] yes, both clusters [13:27:36] but something has definitely changed at some point https://grafana.wikimedia.org/d/CI6JRnLMz/linkrecommendation?viewPanel=127&orgId=1&var-dc=thanos&var-site=codfw&var-service=linkrecommendation&var-prometheus=k8s&var-container_name=All&from=now-7d&to=now [13:27:41] weird that this did not bite earlier [13:27:50] https://grafana.wikimedia.org/d/CI6JRnLMz/linkrecommendation?viewPanel=127&orgId=1&var-dc=thanos&var-site=codfw&var-service=linkrecommendation&var-prometheus=k8s&var-container_name=All&from=now-30d&to=now is even more telling [13:28:26] wow [13:30:51] I 'll bump limits by 50% and see what that gets us [13:31:16] aaaah I know why! [13:31:56] they changed gunicorn settings, didn't they? [13:32:18] https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/673004/2/gunicorn.conf.py#1 [13:32:24] yes. It's now 5 workers [13:32:36] and I am guessing every one of them is being initialized from scratch [13:33:46] I am gonna send a patch to preload the app [13:33:51] --preload of gunicorn [13:33:57] https://docs.gunicorn.org/en/0.16.0/configure.html#preload-app [13:35:41] +1 [13:41:09] 10serviceops, 10Add-Link, 10Growth-Team, 10Patch-For-Review: linkrecommendation flap their readiness probes too often - https://phabricator.wikimedia.org/T278223 (10akosiaris) This seems to be related to the increase of the number of workers. It was a 5x and of course it increased the memory usage of the p... [13:41:35] akosiaris: I tried preload but the app crashed badly. It will be more involved than just adding the setting to the conf file [13:42:07] akosiaris: but yeah, not your problem, we'll have a look at it [13:43:06] I did update the cfs dashboard to k8s 1.16 label values now as well https://grafana-rw.wikimedia.org/d/tn6gBadMz/jayme-container_cpu_cfs_throttled_seconds_total [13:44:06] kostajh: oh, I wasn't aware.
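The --preload idea being discussed: load the application once in the gunicorn master process so the five workers fork with shared, copy-on-write memory, instead of each worker importing the whole app from scratch. A sketch where "app:app" is a placeholder module path (printed, not executed):

```shell
#!/bin/sh
# Illustrative gunicorn invocation; "app:app" is a placeholder module path.
# The conf-file equivalent is: workers = 5 and preload_app = True
PRELOAD_CMD="gunicorn --workers 5 --preload app:app"
echo "$PRELOAD_CMD"
```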
If it's more involved we can bump the memory limit while figuring out what the app doesn't like about preloading [13:44:46] akosiaris: yeah there is some cryptic (to me) error about The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec() [13:44:59] I'll look into it more [13:45:36] oh, maybe it's just on my machine / OS. yay [13:46:06] yeah I was about to say. CoreFoundation sounds MacOSish [13:47:54] https://bugs.python.org/issue33725 heh [13:54:57] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10jijiki) Regarding reserving RAM for the node, after we complete T264604, we will have an estimation of how much memory we will need for onhost memcached. Right now we only... [13:57:14] jayme: Aside from that, I don't see anything else being an issue. Should we pool some services? Or leave it for tomorrow ? [13:58:11] akosiaris: I don't see anything either. I would say we pool something as it's still early [14:01:56] ok, let's start with some simple ones [14:03:05] I have expanded https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Prometheus with the missing steps :) [14:03:22] elukey: Thanks! [14:04:35] elukey: cool. The LVM stuff is also at https://wikitech.wikimedia.org/wiki/Prometheus#Add_filesystems_for_a_new_instance [14:04:46] (maybe just link) [14:05:10] jayme: I had to add a sed command in there to replace '-' with '--' otherwise the code didn't work [14:05:29] I updated the k8s docs, if it is horrible feel free to revert [14:06:01] if nobody opposes I'll start modifying dashboards like https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api to include the ml-serve clusters [14:06:39] oh no, it's fine. I just thought the prometheus page might be the right point to link to to not have duplicate docs [14:07:09] ahh yes yes [14:07:10] elukey: eheh, yeah. Feel free to make that variable dynamic. I was just lazy!
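The '-' to '--' sed mentioned for the LVM docs is most likely the device-mapper naming convention (an assumption, but it fits the filesystems-for-a-new-instance context): in /dev/mapper names a single '-' separates the VG and LV parts, so literal hyphens inside a name are escaped by doubling them. A small sketch:

```shell
#!/bin/sh
# Double the hyphens in a VG/LV name the way device-mapper does in
# /dev/mapper paths (assumption: this is what the docs' sed step is for).
dm_escape() {
    printf '%s\n' "$1" | sed 's/-/--/g'
}

# e.g. VG "vg-data" + LV "prometheus" -> /dev/mapper/vg--data-prometheus
dm_escape "vg-data"
# prints: vg--data
```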
[14:07:24] elukey: sure go ahead. We are planning though to do some spring cleaning on those after we are done with the reinits [14:07:41] we can probably ditch half those dashboards in fact [14:07:53] ah so I may need to wait then :D [14:08:35] akosiaris: ok, looks good. I'll make a patch to deploy it once the new image is built [14:08:51] all right so except the two tasks that I opened for namespaces and shared user configs I think that the ml-serve clusters are up! [14:09:07] thanks a lot for the patience and the support [14:11:39] kostajh: thanks! [14:12:09] elukey: thanks as well for the patience and the willingness to go through our process. I 'd like to believe it did become a bit better along the way [14:13:20] akosiaris: Tobias has also some backlog of notes to add IIRC, so we'll keep adding things as we go [14:17:50] ok [14:18:33] I 'll pool a few larger services. The 4-5 small ones don't have anything serious anyway [14:23:33] okay [14:24:17] ah finally, it's picking up a bit https://grafana.wikimedia.org/d/000000519/kubernetes-overview?viewPanel=9&orgId=1&from=now-30m&to=now [14:42:05] This has gone well, I am gonna do 1 more batch of services and leave the very large ones for the end [14:44:17] 10serviceops, 10MediaWiki-General, 10SRE, 10observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10AMooney) @lmata, This needs PET code review correct? [14:49:10] 10serviceops, 10MediaWiki-General, 10SRE, 10observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10lmata) @AMooney yes please [15:09:10] akosiaris, jayme: https://1.bp.blogspot.com/-mowX0GRZVJQ/VfjQ-1Cp85I/AAAAAAAABDc/RYST-flOr3g/s1600/IMG_5140.JPG [15:10:39] 10serviceops, 10Add-Link, 10Growth-Team, 10Patch-For-Review: linkrecommendation flap their readiness probes too often - https://phabricator.wikimedia.org/T278223 (10akosiaris) 05Open→03Resolved a:03akosiaris Deployment in eqiad worked.
https://grafana-rw.wikimedia.org/d/CI6JRnLMz/linkrecommendation?o... [15:11:06] effie: rotfl [15:11:15] effie: I got multiple ones for you ;-) [15:11:49] yes, they practically write themselves [15:48:20] Some 1600k rps in eqiad k8s. I think I am gonna pause at this for the day. [15:48:33] jayme: objections ? ^ [15:48:34] nice :) [15:48:52] No, I think it's fine to do the rest tomorrow morning [15:57:32] ok [16:11:30] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10User-jijiki: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527 (10JMeybohm) 05Open→03Resolved All nodes running kernel 4.19 now [16:11:32] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10JMeybohm) [16:12:41] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10JMeybohm) [16:12:44] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10JMeybohm) [16:13:34] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: Check/Rebuild all docker-pkg build docker images running on kubernetes - https://phabricator.wikimedia.org/T274254 (10JMeybohm) [16:13:57] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: Refactor users in production-images - https://phabricator.wikimedia.org/T274852 (10JMeybohm) 05Open→03Resolved Done and rolled out [16:14:32] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10JMeybohm) [16:14:34] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review, 10User-fsero: Set up PodSecurityPolicies in clusters - https://phabricator.wikimedia.org/T228967 (10JMeybohm) 
05Open→03Resolved This is active in all clusters now [16:14:36] 10serviceops, 10Prod-Kubernetes, 10User-fsero: Kubernetes clusters roadmap - https://phabricator.wikimedia.org/T212123 (10JMeybohm) [16:19:12] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10JMeybohm) [16:19:14] 10serviceops, 10Prod-Kubernetes, 10User-fsero: Kubernetes clusters roadmap - https://phabricator.wikimedia.org/T212123 (10JMeybohm) [16:19:16] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review, 10User-fsero: Set up PodSecurityPolicies in clusters - https://phabricator.wikimedia.org/T228967 (10JMeybohm) 05Resolved→03Open But wait, it's currently still not fully active and blocked by: T274262 This can be closed when https://gerrit.wikimedia.... [16:38:54] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) Trying to break down my current thoughts: ### Onhost memcached In terms of functionality, I don't see a difference between being a DaemonSet and running on the host its... [16:47:02] 10serviceops, 10Analytics-Radar, 10observability, 10Patch-For-Review, and 2 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (10thcipriani) [16:48:25] 10serviceops, 10Analytics-Radar, 10observability, 10Patch-For-Review, and 2 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (10thcipriani) Is this still in progress or is this work superseded by #mw-on-k8s work? 
[16:49:11] 10serviceops, 10PHP 7.2 support, 10Patch-For-Review: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (10thcipriani) [17:08:54] 10serviceops, 10MediaWiki-General, 10SRE, 10observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10AMooney) [17:21:23] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10JMeybohm) ### onhost memcached >It's still an open question how we will inject the node IP into the mcrouter configuration. it would mean we'd need to pass the host IP as an e...