[08:29:59] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10JMeybohm)
[08:32:51] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10JMeybohm)
[08:37:03] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10akosiaris) >>! In T277297#6931958, @Tgr wrote: > The timeouts are still an issue, but tha...
[09:09:03] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10JMeybohm)
[09:14:12] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10kostajh) >>! In T277297#6933465, @akosiaris wrote: >>>! In T277297#6931958, @Tgr wrote: >...
[09:19:32] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10JMeybohm)
[09:19:37] <_joe_> ok, here it is: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/670220
[09:19:55] <_joe_> 1000 lines, nice round number :P
[09:20:18] eheh
[09:21:01] but you cheated by adding some lines in mathoid
[09:22:45] <_joe_> lol yes, by complete mistake
[09:23:19] <_joe_> or better, it should've gone to another patch :P
[09:24:53] <_joe_> also not sure why it gets a -1 in CI but rake gives me a green light locally
[09:25:15] <_joe_> oh right, I removed kubeyaml
[09:25:55] <_joe_> bbiab
[09:34:21] hello folks, I have a couple of questions about https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Networking
[09:34:51] I see that I can, in theory, merge https://gerrit.wikimedia.org/r/c/operations/homer/public/+/661055 and apply it to cr1/cr2 eqiad/codfw
[09:35:06] but I suppose that it may cause alerts until the calico config is up, right?
[09:39:16] elukey: Will it fire if the BGP sessions are not established?
[09:39:53] jayme: IIRC yes, it will complain after a while if BGP sessions with peers are not up
[09:40:48] probably ok if we do homer + calico sync one right after the other
[09:41:53] elukey: the calico sync can be done before the homer run as well... the sync will work I guess, it will just fail (and retry over and over) to establish BGP
[09:43:54] jayme: makes sense, yes. I asked since knowing that the sync "works" (even failing but trying to get BGP sessions) might be a good sign before applying network rules on the routers
[09:44:38] elukey: I'm not 100% sure, but that's an easy thing to try first
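A minimal sketch of the two orderings being weighed here, as shell steps; the device query, the commit message, and the calico release label are illustrative assumptions, not values taken from this conversation:

    # Calico first: the sync succeeds and calico-node keeps retrying BGP until
    # the routers accept the sessions.
    helmfile -e ml-serve-codfw -l name=calico sync
    # Routers second (doing this step first also works, at the cost of a window
    # of "BGP session down" style alerts until the calico peers come up):
    homer "cr*-codfw*" commit "Add calico BGP peering for the new k8s workers"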
[09:49:15] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10JMeybohm)
[09:49:18] jayme: I am going to double check with Arzhel whether it is ok to merge the network change first and then deploy calico; if the alerts are not going off after a millisecond I'll follow the order in the guide (BTW really nice work, thanks to all)
[09:49:18] 10serviceops, 10MW-on-K8s, 10SRE, 10Patch-For-Review: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (10JMeybohm)
[09:55:31] ok, there is no big deal, we can keep the order
[09:58:03] cool
[09:58:06] thanks for checking
[09:58:21] I'll also update the docs
[09:58:38] qq - how should I check if calico works correctly after the steps?
[09:59:00] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki) >>! In T277711#6927861, @JMeybohm wrote: > I don't really like option 3 just because it moves parts of the software stack to the node itself and I would personally lik...
[10:03:03] elukey: "kubectl -n kube-system get deployment,daemonset" should have all calico-related stuff showing as READY and AVAILABLE
[10:03:57] the BGP part is done by the "daemonset.apps/calico-node" so it's probably enough to look at those being DESIRED == READY == AVAILABLE
[10:05:00] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) >>! In T277711#6933756, @jijiki wrote: >>>! In T277711#6927861, @JMeybohm wrote: >> I don't really like option 3 just because it moves parts of the software stack to the...
[10:06:15] jayme: nice, will use it and add to the docs if needed
[10:06:34] I am also almost able to understand what the kubectl command does
[10:07:08] maybe in one year I'll be able to operate a k8s cluster
[10:07:10] :D
[10:07:19] all right, proceeding with the config changes
[10:07:49] nice :) you will ofc need to do something like "sudo -i; kube_env admin ml" before kubectl
[10:09:43] jayme: ah, I thought it was fine to run the kubectl command from the ml masters; the above is from deploy1002, right?
[10:10:36] oh, yeah. Sure. Both should be fine
[10:10:58] perfect!
[10:15:34] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki) >>! In T277711#6933792, @Joe wrote: >>>! In T277711#6933756, @jijiki wrote: >>>>! In T277711#6927861, @JMeybohm wrote: >>> I don't really like option 3 just because it...
[10:19:37] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10JMeybohm) >>! In T277711#6933792, @Joe wrote: > That is already done in the MediaWiki chart. But that does now deploy mcrouter as a sidecar in each MW pod. AIUI this might co...
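Putting jayme's pointers from above together, a sketch of the post-sync check (runnable from deploy1002 or a control-plane host; "ml" is assumed to be whatever cluster alias kube_env exposes for the new cluster):

    sudo -i
    kube_env admin ml       # select the admin credentials for the new ml cluster
    kubectl -n kube-system get deployment,daemonset
    # Everything calico-related should show READY/AVAILABLE; for the BGP side,
    # daemonset.apps/calico-node should have DESIRED == READY == AVAILABLE.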
[10:19:48] jayme: I tried helmfile -e ml-serve-codfw -l name=calico-crds sync and got an error that calico-crds does not exist (failed to download wmf-stable/calico-ctds)
[10:21:45] elukey: hmm... typo?
[10:22:01] I just did exactly that and it worked fine :)
[10:23:58] jayme: then I am missing a step before helmfile, do I need to source some file?
[10:24:05] nope
[10:24:42] in your IRC line it says "ctds", though
[10:25:46] jayme: yes, that one is me reporting; the command before is right though
[10:26:03] are you root?
[10:26:16] I am, yes. Now with sudo -i it worked, PEBCAK
[10:26:54] ah, yeah. a7s mentioned that it does not work with "sudo -s"
[10:27:16] I think it doesn't work with me using it
[10:27:50] calico sync worked! checking
[10:30:04] all good :)
[10:30:13] nice :)
[10:30:22] proceeding with eqiad then
[10:30:24] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) >>! In T277711#6933874, @JMeybohm wrote: >>>! In T277711#6933792, @Joe wrote: >> That is already done in the MediaWiki chart. > > But that does now deploy mcrouter as a...
[10:43:24] looks like all is working, updated the docs
[10:46:25] all right, time for coredns
[11:05:22] <_joe_> jayme: say I update a configmap that is mounted in a container, and deploy it. The container will see the changes in the files, correct?
[11:07:09] <_joe_> I'm asking because I think I don't need to restart the deployments when we change something in the mcrouter configuration
[11:09:33] _joe_: yes it will. But maybe not immediately, and through some nasty symlink trickery
[11:10:03] <_joe_> jayme: uhm, so maybe inotify won't work properly
[11:10:08] <_joe_> ok, I'll make a note
[11:11:14] it usually works with symlinks as well. But ofc you can make it so it does not. I did check manually for every piece of software where I wanted that to work in the past
[11:12:15] <_joe_> yeah, it's one of the things to test
[11:12:19] <_joe_> one of the many
[12:59:21] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10akosiaris) >>! In T277297#6932358, @kostajh wrote: >>>! In T277297#6...
[13:02:14] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10kostajh) >>! In T277297#6934504, @akosiaris wrote: >>>! In T277297#6...
[13:46:03] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10kostajh) >>! In T277297#6928483, @akosiaris wrote: >>>! In T277297#6...
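On the mounted-ConfigMap question above (11:05-11:12), a sketch of what the symlink trickery looks like from inside a pod; the namespace, pod, container name, and mount path here are made-up examples:

    # kubelet materialises ConfigMap volumes through an atomically swapped ..data symlink:
    kubectl -n example-ns exec example-pod -c mcrouter -- ls -la /etc/mcrouter
    # config.json -> ..data/config.json
    # ..data      -> ..2021_03_22_11_05_00.123456789   (timestamped dir, replaced on update)
    #
    # Because the visible path is a symlink and the old file's inode never changes in place,
    # an inotify watch on the file itself can miss updates; watching the directory (and
    # re-reading when ..data is swapped) is the safer pattern. Updates also only propagate
    # on the kubelet sync period, so expect a delay of up to a minute or so, and note that
    # files mounted via subPath do not get updated at all.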
[14:19:29] jayme: (if you still have time) - I tried to execute the -n name=namespace sync for ml-serve-codfw, since I thought (looking in deployment-charts) that ml-serve-values.yaml would have overridden the list with namespaces: {}
[14:19:50] but then if I do kubectl get namespaces I see all the prod ones :(
[14:20:19] elukey: ah... I was curious what would happen, tbh. :)
[14:20:25] ahahaha okok
[14:21:03] I would have to take a look at helmfile again, but in some cases, stuff gets deep merged
[14:22:02] I was hoping that is not the case there, but unfortunately it is
[14:22:49] ah ok, so it is not getting overridden
[14:22:58] yep
[14:23:10] it is like namespaces in prod + {}
[14:23:13] sigh
[14:23:43] exactly. But I would have to chase it down to figure out the details, not before Wednesday I fear
[14:24:04] jayme: I can try to figure out what's happening, thanks anyway :)
[14:24:37] now I get the "yaml-engineering" part of k8s :D
[14:24:53] elukey: "helmfile template ..." and "helmfile build" are your "friends" (that will make you cry nevertheless) when doing so
[14:25:49] jayme: I will try this new joy
[14:26:28] ah dammit, of course they are deep merged, otherwise we would not be able to create the ci namespace in staging
[14:26:30] sigh ...
[14:27:33] yep
[14:27:51] akosiaris: I know that it was all a plan to allow me to play with yaml files and enjoy them
[14:28:06] but I thought it might be different when you set them explicitly empty
[14:28:52] Yeah, that was my suggestion
[14:29:12] elukey: anyway, just issue destroy instead of sync and don't try to populate namespaces, I'd say
[14:29:22] but otherwise... it seems like you are done?
[14:29:28] as in... you got a k8s cluster?
[14:29:50] akosiaris: the prometheus part is still to do (I am following up with Filippo), but yes, seems so!
[14:30:59] I need to find a solution for the namespace part, no? Otherwise all the beautiful tools that I need to install will not have their own namespace
[14:32:11] (I see namespaces terminating after the destroy)
[14:32:13] I am not sure that you do. I've only done the kubeflow installation process a couple of times. IIRC the istio part (what a pain) will create its own namespace, istio-system
[14:32:22] the rest... I don't remember, I fear.
[14:32:59] I am pretty sure that knative will give us emotions as well
[14:34:08] I'll finish the monitoring part and then I'll try to play a bit with the cluster, to see if everything is fine
[14:34:19] (a sort of hello world, no idea yet how to do it)
[14:37:26] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10akosiaris) >>! In T277297#6934621, @kostajh wrote: >>>! In T277297#6...
[14:37:28] knative?
[14:37:38] why on earth are you into knative already?
[14:37:43] this is even more beta than kubeflow
[14:42:17] akosiaris: how can you do elastic ml things otherwise? I am really surprised that you even ask :D
[14:42:26] akosiaris: take a look at this beauty https://www.kubeflow.org/docs/components/kfserving/kfserving/
[14:43:11] * elukey sees Alex's sad face from a distance
[14:43:56] :-o
[14:44:09] this is *on top* of knative
[14:44:26] and istio!
[14:44:36] also kfserving is still in beta afaics
[14:45:03] can we consider beta software running on top of beta software as beta? :D
[14:45:13] it's more like beta/2
[14:45:44] jayme: production-ready!
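A sketch of the "helmfile template/build" debugging suggested above, useful for seeing how the admin values files end up deep-merged; the release label name=namespaces and the working directory are assumptions, not taken from the log:

    # from the directory containing the admin helmfile for the ml clusters:
    helmfile -e ml-serve-codfw build                          # print the fully merged helmfile state, with all values files layered
    helmfile -e ml-serve-codfw -l name=namespaces template    # render the manifests that a sync would apply
    # Comparing the rendered output with ml-serve-values.yaml shows whether
    # "namespaces: {}" actually replaces the default list or just deep-merges into it.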
[14:45:53] don't forget certmanager - IIRC that's beta too
[14:46:23] I kinda hoped it was included elsewhere
[14:46:29] haha
[14:46:46] elsewhere "in the cloud" :)
[14:47:12] jayme: exactly, this way of doing things in-house is terrible
[14:47:33] jokes aside, I am wondering how gcloud manages these things
[14:47:50] I cannot imagine the pager of the people on-call
[14:48:16] apparently at least cert-manager has been stable for a while now. Nice!
[14:48:29] jayme: you just enlightened my day!
[14:55:07] O M G
[14:55:20] so that was what kfserving was about
[14:55:28] kf over knative's serving... TIL
[14:55:49] and this is, IIUC, a little part of kubeflow
[14:56:00] the training stuff brings in pipelines etc.
[15:04:01] This Kubeflow component has beta status.
[15:04:15] that's what I see in there, elukey; that's what's grabbing my attention
[15:04:17] fun fun fun fun
[15:11:28] 10serviceops, 10SRE, 10User-jijiki: Put rdb200[78] into service - https://phabricator.wikimedia.org/T255681 (10Legoktm)
[15:25:08] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10Papaul)
[15:27:47] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10Papaul) 05Resolved→03Open a:05Dzahn→03Papaul Reopen this task since i have to offline all the servers in Netbox and remove the disks
[15:32:32] <_joe_> elukey: I strongly suggest you reconsider using istio
[15:39:18] _joe_: yes I know, but kubeflow needs istio; it is not something that we can opt out of :(
[15:57:14] elukey: btw, at some point I would love a rundown of what all that infra is (yes, I know it's too early for something sensible, but I am mentioning it early enough so that ML doesn't end up being the sole team able to even have a peek into it). Whenever I peek at kubeflow there is something new in there and I am rather unaware of the architecture and components by now (plus I found out about kfserving recently, I wasn't even aware of that).
[15:57:53] hmmm TL;DR is probably something along the lines of "knowledge sharing" I guess
[16:02:05] akosiaris: +1, we'll try to document everything as we go, otherwise we'll be too siloed, I agree
[16:07:24] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) Given it has created some doubts, let me clarify: I've created a first version of the charts that implements solution 1 (and not a complete version of it, either). I did...
[16:07:51] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) a:03Joe
[16:08:10] elukey: if you do some kind of in-person rundown I would love to join
[16:08:47] <_joe_> +1
[16:11:26] of course yes! Hopefully in a month's time, after a lot of tears, Tobias and I will be able to show something
[16:11:59] <_joe_> I would also like to understand what's going to go on with abstract wikipedia, and reuse knative there at the very least
[16:14:05] 10serviceops, 10Release-Engineering-Team, 10PHP 7.2 support, 10Patch-For-Review: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (10Reedy)
[16:36:36] 10serviceops, 10Release-Engineering-Team, 10PHP 7.2 support, 10Patch-For-Review: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (10Daimona)
[18:03:29] 10serviceops, 10SRE, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Legoktm) @fgiunchedi could you re-run your analysis to see if mw1307 (10.64.0.169) is still exhibiting the issue?
[19:55:06] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10Dzahn) Sure, I wasn't sure if you prefer your own ticket or this. Either is fine with me.
[19:57:41] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) @Papaul fyi, this one is separate from T277119. I had to somehow separate them and instead by rack this is by purchase date. You will see though that...
[21:13:52] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Papaul) @Dzahn thanks for the update. I am planning on racking mw2401 to mw2411 in A5 and not in A4 since A4 is a 10G rack , i will like to keep this rack on...
[21:42:47] 10serviceops, 10MW-on-K8s, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team-TODO: Create restricted docker-registry namespace for security patched images - https://phabricator.wikimedia.org/T273521 (10Legoktm) >>! In T273521#6927841, @JMeybohm wrote: >>>! In T273521#6926308, @Legoktm wr...
[22:05:26] 10serviceops: Make sure restricted docker images show up in debmonitor - https://phabricator.wikimedia.org/T278188 (10Legoktm)
[22:25:52] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10akosiaris)
[22:50:18] dcops wants to install new appservers in a row (A5) that has not been used for mw before. They are asking if that is a problem; I am saying it's not... but it means we have to define new mcrouter/scap proxies for that... though hopefully that is it and there was nothing else needed like ACLs per row
[22:50:36] also, the reason they want it that way is... that the rack we are _not_ getting is dedicated 10G
[22:52:37] I thought we don't need a proxy per row, we just need all the proxies to be in *different* rows so we don't risk losing them at once
[22:53:07] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10Papaul) 05Open→03Resolved asset tag mgmt removed and all the disk removed also for server
[22:53:09] that is, a new row wouldn't mean we need new proxies; we could move one there if we ever needed to, but we'd keep the same total number
[22:53:13] have I got that wrong?
[22:55:26] No, I think you got it just right. In my mind I just conflated that into one thing: "each row has a proxy so we can lose some"
[23:01:47] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2249.codfw.wmnet` - mw2249.codfw.wmnet (**PASS**) - Downtime...
[23:02:41] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) @Papaul We can do that, it isn't a problem. We can use A3 and A5. Thank you
[23:32:28] beware of DNS generation conflicts that can still happen
[23:32:46] for example, if you run the decom cookbook and at the same time someone adds 30 new servers in netbox
[23:32:58] you can get a surprise diff at the DNS step
[23:33:18] just happened to me, but I could see it must be Papaul and checked with him
[23:33:31] those are our new codfw boxes
[23:34:26] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2250.codfw.wmnet` - mw2250.codfw.wmnet (**PASS**) - Downtime...
[23:51:15] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn)
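For context on the race described above; the host name comes from the bot messages, but the exact cookbook arguments are from memory and may differ (a Phabricator task id is usually passed as well):

    # sre.hosts.decommission regenerates the Netbox-driven DNS data and asks for
    # confirmation on the resulting diff. If someone racks new hosts in Netbox while
    # it runs, their records appear in the same diff; review it and confirm only
    # once the unexpected entries are accounted for.
    sudo cookbook sre.hosts.decommission mw2250.codfw.wmnet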