[08:29:59] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10JMeybohm)
[08:32:51] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10JMeybohm)
[08:37:03] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10akosiaris) >>! In T277297#6931958, @Tgr wrote: > The timeouts are still an issue, but tha...
[09:09:03] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10JMeybohm)
[09:14:12] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10kostajh) >>! In T277297#6933465, @akosiaris wrote: >>>! In T277297#6931958, @Tgr wrote: >...
[09:19:32] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10JMeybohm)
[09:19:37] <_joe_> ok, here it is: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/670220
[09:19:55] <_joe_> 1000 lines, nice round number :P
[09:20:18] eheh
[09:21:01] but you cheated by adding some lines in mathoid
[09:22:45] <_joe_> lol yes, by complete mistake
[09:23:19] <_joe_> or better, it should've gone to another patch :P
[09:24:53] <_joe_> also not sure why it gets a -1 in CI but rake gives me a green light locally
[09:25:15] <_joe_> oh right, I removed kubeyaml
[09:25:55] <_joe_> bbiab
[09:34:21] hello folks, I have a couple of questions about https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Networking
[09:34:51] I see that I can, in theory, merge https://gerrit.wikimedia.org/r/c/operations/homer/public/+/661055 and apply it to cr1/cr2 eqiad/codfw
[09:35:06] but I suppose that it may cause alerts until the calico config is up, right?
[09:39:16] elukey: Will it fire if the BGP sessions are not established?
[09:39:53] jayme: IIRC yes, it will complain after a while if BGP sessions with peers are not up
[09:40:48] probably ok if we do homer + calico sync one right after the other
[09:41:53] elukey: the calico sync can be done before the homer run as well... the sync will work I guess, it will just fail (and retry over and over) to establish BGP
[09:43:54] jayme: makes sense, yes. I asked since knowing that the sync "works" (even failing but trying to get BGP sessions) might be a good sign before applying network rules on the routers
[09:44:38] elukey: I'm not 100% sure, but that's an easy thing to try first
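A minimal sketch of the two orderings being weighed here, as shell steps; the device query, the commit message, and the calico release label are illustrative assumptions, not values taken from this conversation:

    # Calico first: the sync succeeds and calico-node keeps retrying BGP until
    # the routers accept the sessions.
    helmfile -e ml-serve-codfw -l name=calico sync
    # Routers second (doing this step first also works, at the cost of a window
    # of "BGP session down" style alerts until the calico peers come up):
    homer "cr*-codfw*" commit "Add calico BGP peering for the new k8s workers"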
[09:49:15] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10JMeybohm)
[09:49:18] jayme: I am going to double check with Arzhel whether it is ok to merge the network change first and then deploy calico; if the alerts are not going off after a millisecond I'll follow the order in the guide (BTW really nice work, thanks to all)
[09:49:18] 10serviceops, 10MW-on-K8s, 10SRE, 10Patch-For-Review: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (10JMeybohm)
[09:55:31] ok, there is no big deal, we can keep the order
[09:58:03] cool
[09:58:06] thanks for checking
[09:58:21] I'll also update the docs
[09:58:38] qq - how should I check if calico works correctly after the steps?
[09:59:00] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki) >>! In T277711#6927861, @JMeybohm wrote: > I don't really like option 3 just because it moves parts of the software stack to the node itself and I would personally lik...
[10:03:03] elukey: "kubectl -n kube-system get deployment,daemonset" should have all calico-related stuff showing as READY and AVAILABLE
[10:03:57] the BGP part is done by the "daemonset.apps/calico-node" so it's probably enough to look at those being DESIRED == READY == AVAILABLE
[10:05:00] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) >>! In T277711#6933756, @jijiki wrote: >>>! In T277711#6927861, @JMeybohm wrote: >> I don't really like option 3 just because it moves parts of the software stack to the...
[10:06:15] jayme: nice, will use it and add to the docs if needed
[10:06:34] I am also almost able to understand what the kubectl command does
[10:07:08] maybe in one year I'll be able to operate a k8s cluster
[10:07:10] :D
[10:07:19] all right, proceeding with the config changes
[10:07:49] nice :) you will ofc need to do something like "sudo -i; kube_env admin ml" before kubectl
[10:09:43] jayme: ah, I thought it was fine to run the kubectl command from the ml masters; the above is from deploy1002, right?
[10:10:36] oh, yeah. Sure. Both should be fine
[10:10:58] perfect!
[10:15:34] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki) >>! In T277711#6933792, @Joe wrote: >>>! In T277711#6933756, @jijiki wrote: >>>>! In T277711#6927861, @JMeybohm wrote: >>> I don't really like option 3 just because it...
[10:19:37] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10JMeybohm) >>! In T277711#6933792, @Joe wrote: > That is already done in the MediaWiki chart. But that does now deploy mcrouter as a sidecar in each MW pod. AIUI this might co...
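Putting jayme's pointers from above together, a sketch of the post-sync check (runnable from deploy1002 or a control-plane host; "ml" is assumed to be whatever cluster alias kube_env exposes for the new cluster):

    sudo -i
    kube_env admin ml       # select the admin credentials for the new ml cluster
    kubectl -n kube-system get deployment,daemonset
    # Everything calico-related should show READY/AVAILABLE; for the BGP side,
    # daemonset.apps/calico-node should have DESIRED == READY == AVAILABLE.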
[10:19:48] jayme: I tried helmfile -e ml-serve-codfw -l name=calico-crds sync and got an error that calico-crds does not exist (failed to download wmf-stable/calico-ctds)
[10:21:45] elukey: hmm... typo?
[10:22:01] I just did exactly that and it worked fine :)
[10:23:58] jayme: then I am missing a step before helmfile, do I need to source some file?
[10:24:05] nope
[10:24:42] in your IRC line it says "ctds", though
[10:25:46] jayme: yes, that one is me reporting; the command before is right though
[10:26:03] are you root?
[10:26:16] I am, yes. Now with sudo -i it worked, PEBCAK
[10:26:54] ah, yeah. a7s mentioned that it does not work with "sudo -s"
[10:27:16] I think it doesn't work with me using it
[10:27:50] calico sync worked! checking
[10:30:04] all good :)
[10:30:13] nice :)
[10:30:22] proceeding with eqiad then
[10:30:24] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) >>! In T277711#6933874, @JMeybohm wrote: >>>! In T277711#6933792, @Joe wrote: >> That is already done in the MediaWiki chart. > > But that does now deploy mcrouter as a...
[10:43:24] looks like all is working, updated the docs
[10:46:25] all right, time for coredns
[11:05:22] <_joe_> jayme: say I update a configmap that is mounted in a container, and deploy it. The container will see the changes in the files, correct?
[11:07:09] <_joe_> I'm asking because I think I don't need to restart the deployments when we change something in the mcrouter configuration
[11:09:33] _joe_: yes it will. But maybe not immediately, and through some nasty symlink trickery
[11:10:03] <_joe_> jayme: uhm, so maybe inotify won't work properly
[11:10:08] <_joe_> ok, I'll make a note
[11:11:14] it usually works with symlinks as well. But ofc you can make it so it does not. I did check manually for every piece of software where I wanted that to work in the past
[11:12:15] <_joe_> yeah, it's one of the things to test
[11:12:19] <_joe_> one of the many
[12:59:21] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10akosiaris) >>! In T277297#6932358, @kostajh wrote: >>>! In T277297#6...
[13:02:14] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10kostajh) >>! In T277297#6934504, @akosiaris wrote: >>>! In T277297#6...
[13:46:03] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10kostajh) >>! In T277297#6928483, @akosiaris wrote: >>>! In T277297#6...
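On the mounted-ConfigMap question above (11:05-11:12), a sketch of what the symlink trickery looks like from inside a pod; the namespace, pod, container name, and mount path here are made-up examples:

    # kubelet materialises ConfigMap volumes through an atomically swapped ..data symlink:
    kubectl -n example-ns exec example-pod -c mcrouter -- ls -la /etc/mcrouter
    # config.json -> ..data/config.json
    # ..data      -> ..2021_03_22_11_05_00.123456789   (timestamped dir, replaced on update)
    #
    # Because the visible path is a symlink and the old file's inode never changes in place,
    # an inotify watch on the file itself can miss updates; watching the directory (and
    # re-reading when ..data is swapped) is the safer pattern. Updates also only propagate
    # on the kubelet sync period, so expect a delay of up to a minute or so, and note that
    # files mounted via subPath do not get updated at all.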
[14:19:29] jayme: (if you still have time) - I tried to execute the -n name=namespace sync for ml-serve-codfw, since I thought (looking in deployment-charts) that ml-serve-values.yaml would have overridden the list with namespaces: {}
[14:19:50] but then if I do kubectl get namespaces I see all the prod ones :(
[14:20:19] elukey: ah... I was curious what would happen, tbh. :)
[14:20:25] ahahaha okok
[14:21:03] I would have to take a look at helmfile again, but in some cases, stuff gets deep merged
[14:22:02] I was hoping that is not the case there, but unfortunately it is
[14:22:49] ah ok, so it is not getting overridden
[14:22:58] yep
[14:23:10] it is like namespaces in prod + {}
[14:23:13] sigh
[14:23:43] exactly. But I would have to chase it down to figure out the details, not before Wednesday I fear
[14:24:04] jayme: I can try to figure out what's happening, thanks anyway :)
[14:24:37] now I get the "yaml-engineering" part of k8s :D
[14:24:53] elukey: "helmfile template ..." and "helmfile build" are your "friends" (that will make you cry nevertheless) when doing so
[14:25:49] jayme: I will try this new joy
[14:26:28] ah dammit, of course they are deep merged, otherwise we would not be able to create the ci namespace in staging
[14:26:30] sigh ...
[14:27:33] yep
[14:27:51] akosiaris: I know that it was all a plan to allow me to play with yaml files and enjoy them
[14:28:06] but I thought it might be different when you set them explicitly empty
[14:28:52] Yeah, that was my suggestion
[14:29:12] elukey: anyway, just issue destroy instead of sync and don't try to populate namespaces, I'd say
[14:29:22] but otherwise... it seems like you are done?
[14:29:28] as in... you got a k8s cluster?
[14:29:50] akosiaris: the prometheus part is still to do (I am following up with Filippo), but yes, seems so!
[14:30:59] I need to find a solution for the namespace part, no? Otherwise all the beautiful tools that I need to install will not have their own namespace
[14:32:11] (I see namespaces terminating after the destroy)
[14:32:13] I am not sure that you do. I've only done the kubeflow installation process a couple of times. IIRC the istio part (what a pain) will create its own namespace, istio-system
[14:32:22] the rest... I don't remember, I fear.
[14:32:59] I am pretty sure that knative will give us emotions as well
[14:34:08] I'll finish the monitoring part and then I'll try to play a bit with the cluster, to see if everything is fine
[14:34:19] (a sort of hello world, no idea yet how to do it)
[14:37:26] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10akosiaris) >>! In T277297#6934621, @kostajh wrote: >>>! In T277297#6...
[14:37:28] knative?
[14:37:38] why on earth are you into knative already?
[14:37:43] this is even more beta than kubeflow
[14:42:17] akosiaris: how can you do elastic ml things otherwise? I am really surprised that you even ask :D
[14:42:26] akosiaris: take a look at this beauty https://www.kubeflow.org/docs/components/kfserving/kfserving/
[14:43:11] * elukey sees Alex's sad face from a distance
[14:43:56] :-o
[14:44:09] this is *on top* of knative
[14:44:26] and istio!
[14:44:36] also kfserving is still in beta afaics
[14:45:03] can we consider beta software running on top of beta software as beta? :D
[14:45:13] it's more like beta/2
[14:45:44] jayme: production-ready!
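A sketch of the "helmfile template/build" debugging suggested above, useful for seeing how the admin values files end up deep-merged; the release label name=namespaces and the working directory are assumptions, not taken from the log:

    # from the directory containing the admin helmfile for the ml clusters:
    helmfile -e ml-serve-codfw build                          # print the fully merged helmfile state, with all values files layered
    helmfile -e ml-serve-codfw -l name=namespaces template    # render the manifests that a sync would apply
    # Comparing the rendered output with ml-serve-values.yaml shows whether
    # "namespaces: {}" actually replaces the default list or just deep-merges into it.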
[14:45:53] don't forget certmanager - IIRC that's beta too
[14:46:23] I kinda hoped it was included elsewhere
[14:46:29] haha
[14:46:46] elsewhere "in the cloud" :)
[14:47:12] jayme: exactly, this way of doing things in-house is terrible
[14:47:33] jokes aside, I am wondering how gcloud manages these things
[14:47:50] I cannot imagine the pager of the people on-call
[14:48:16] apparently at least cert-manager has been stable for a while now. Nice!
[14:48:29] jayme: you just enlightened my day!
[14:55:07] O M G
[14:55:20] so that was what kfserving was about
[14:55:28] kf over knative's serving... TIL
[14:55:49] and this is, IIUC, a little part of kubeflow
[14:56:00] the training stuff brings in pipelines etc.
[15:04:01] This Kubeflow component has beta status.
[15:04:15] that's what I see in there, elukey; that's what's grabbing my attention
[15:04:17] fun fun fun fun
[15:11:28] 10serviceops, 10SRE, 10User-jijiki: Put rdb200[78] into service - https://phabricator.wikimedia.org/T255681 (10Legoktm)
[15:25:08] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10Papaul)
[15:27:47] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10Papaul) 05Resolved→03Open a:05Dzahn→03Papaul Reopen this task since i have to offline all the servers in Netbox and remove the disks
[15:32:32] <_joe_> elukey: I strongly suggest you reconsider using istio
[15:39:18] _joe_: yes I know, but kubeflow needs istio; it is not something that we can opt out of :(
[15:57:14] elukey: btw, at some point I would love a rundown of what all that infra is (yes, I know it's too early for something sensible, but I am mentioning it early enough so that ML doesn't end up being the sole team able to even have a peek into it). Whenever I peek at kubeflow there is something new in there and I am rather unaware of the architecture and components by now (plus I found out about kfserving recently, I wasn't even aware of that).
[15:57:53] hmmm TL;DR is probably something along the lines of "knowledge sharing" I guess
[16:02:05] akosiaris: +1, we'll try to document everything as we go, otherwise we'll be too siloed, I agree
[16:07:24] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) Given it has created some doubts, let me clarify: I've created a first version of the charts that implements solution 1 (and not a complete version of it, either). I did...
[16:07:51] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) a:03Joe
[16:08:10] elukey: if you do some kind of in-person rundown I would love to join
[16:08:47] <_joe_> +1
[16:11:26] of course yes! Hopefully in a month's time, after a lot of tears, Tobias and I will be able to show something
[16:11:59] <_joe_> I would also like to understand what's going to go on with abstract wikipedia, and reuse knative there at the very least
[16:14:05] 10serviceops, 10Release-Engineering-Team, 10PHP 7.2 support, 10Patch-For-Review: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (10Reedy)
[16:36:36] 10serviceops, 10Release-Engineering-Team, 10PHP 7.2 support, 10Patch-For-Review: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (10Daimona)
[18:03:29] 10serviceops, 10SRE, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Legoktm) @fgiunchedi could you re-run your analysis to see if mw1307 (10.64.0.169) is still exhibiting the issue?
[19:55:06] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10Dzahn) Sure, I wasn't sure if you prefer your own ticket or this. Either is fine with me.
[19:57:41] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) @Papaul fyi, this one is separate from T277119. I had to somehow separate them and instead by rack this is by purchase date. You will see though that...
[21:13:52] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Papaul) @Dzahn thanks for the update. I am planning on racking mw2401 to mw2411 in A5 and not in A4 since A4 is a 10G rack , i will like to keep this rack on...
[21:42:47] 10serviceops, 10MW-on-K8s, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team-TODO: Create restricted docker-registry namespace for security patched images - https://phabricator.wikimedia.org/T273521 (10Legoktm) >>! In T273521#6927841, @JMeybohm wrote: >>>! In T273521#6926308, @Legoktm wr...
[22:05:26] 10serviceops: Make sure restricted docker images show up in debmonitor - https://phabricator.wikimedia.org/T278188 (10Legoktm)
[22:25:52] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10akosiaris)
[22:50:18] dcops wants to install new appservers in a row (A5) that has not been used for mw before. They are asking if that is a problem; I am saying it's not... but it means we have to define new mcrouter/scap proxies for that... though hopefully that is it and there was nothing else needed like ACLs per row
[22:50:36] also, the reason they want it that way is... that the rack we are _not_ getting is dedicated 10G
[22:52:37] I thought we don't need a proxy per row, we just need all the proxies to be in *different* rows so we don't risk losing them at once
[22:53:07] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10Papaul) 05Open→03Resolved asset tag mgmt removed and all the disk removed also for server
[22:53:09] that is, a new row wouldn't mean we need new proxies; we could move one there if we ever needed to, but we'd keep the same total number
[22:53:13] have I got that wrong?
[22:55:26] No, I think you got it just right. In my mind I just conflated that into one thing: "each row has a proxy so we can lose some"
[23:01:47] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2249.codfw.wmnet` - mw2249.codfw.wmnet (**PASS**) - Downtime...
[23:02:41] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) @Papaul We can do that, it isn't a problem. We can use A3 and A5. Thank you
[23:32:28] beware of DNS generation conflicts that can still happen
[23:32:46] for example, if you run the decom cookbook and at the same time someone adds 30 new servers in netbox
[23:32:58] you can get a surprise diff at the DNS step
[23:33:18] just happened to me, but I could see it must be Papaul and checked with him
[23:33:31] those are our new codfw boxes
[23:34:26] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2250.codfw.wmnet` - mw2250.codfw.wmnet (**PASS**) - Downtime...
[23:51:15] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn)
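For context on the race described above; the host name comes from the bot messages, but the exact cookbook arguments are from memory and may differ (a Phabricator task id is usually passed as well):

    # sre.hosts.decommission regenerates the Netbox-driven DNS data and asks for
    # confirmation on the resulting diff. If someone racks new hosts in Netbox while
    # it runs, their records appear in the same diff; review it and confirm only
    # once the unexpected entries are accounted for.
    sudo cookbook sre.hosts.decommission mw2250.codfw.wmnet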