[07:47:31] hello people [07:47:45] one question - how should I generated the client_token for prometheus? [07:47:50] *generate [07:48:10] good thing you asked [07:48:32] you shouldn't, it's already generated. All clusters currently must share the same users due to the mess in hiera/puppet [07:48:54] up to now it hasn't been an issue since there were no other clusters, but I guess we should revisit it :-) [07:49:05] so just reuse what the other clusters use [07:49:51] akosiaris: buongiorno, grazie :D [07:50:12] I am opening a task for the namespace override thing as well [07:52:34] 10serviceops, 10Lift-Wing, 10Machine-Learning-Team: Allow namespaces to be overriden in deployment-chart's admin_ng - https://phabricator.wikimedia.org/T278208 (10elukey) [07:52:51] not sure if I have used the right terms :D [07:54:24] 10serviceops, 10Lift-Wing, 10Machine-Learning-Team, 10Kubernetes: Allow namespaces to be overriden in deployment-chart's admin_ng - https://phabricator.wikimedia.org/T278208 (10JMeybohm) [08:02:35] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10akosiaris) [08:03:18] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10akosiaris) [08:03:29] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10akosiaris) 05Open→03Resolved a:03akosiaris Added steps in the eqiad task T277741 from the action items list, I am gonna boldy resolve th... [08:17:30] akosiaris: should we start or wait for the puppetmaster situation to settle?
[08:18:25] jayme: settled [08:18:28] ;-) [08:18:36] great :) [08:18:43] let me inform -sre that we are starting [08:18:49] ack [08:24:27] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10ops-monitoring-bot) Icinga downtime set by akosiaris@cumin1001 for 1 day, 0:00:00 18 host(s) and their services with reason: Reinitialize eqia... [08:24:49] Step #1 down [08:24:51] done* [08:25:06] * jayme rebased patches [08:25:30] the pybal/prometheus/confd downtimes we still have to click, right? [08:27:31] yeah, but I added helpful links to those [08:27:52] ah, sweet! [08:27:58] we 'll figure out a good way to downtime them in the cookbook [08:29:32] okay. Will you downtime services and I do pybal/prometheus/confd? [08:30:26] services downtimed [08:30:35] jayme: +1 [08:30:44] I 'll depool eqiad now [08:30:51] ack [08:33:49] so, restbase-async doesn't need to change this time around. It's already in codfw [08:33:59] I 've striked that out in the task already [08:34:02] that's why I striked it :) [08:35:00] you did? I thought I did yesterday? damn I should be consuming something memory enhancing it seems [08:35:22] anyway, action item for later (/me consuming some memory enhancement or whatever) [08:35:36] puppet disabled across the cluster. [08:36:19] pybal/prometheus/confd downtimed until ~15:30 UTC [08:36:59] perfect. Traffic is being cut over as we speak [08:37:07] gonna wait another 5m and it should be done [08:38:00] there is a service missing in the list :o [08:38:03] zotero [08:38:20] oh dammit [08:38:34] how did that elude us? [08:38:39] did it last time too ?
[08:39:07] I would guess so [08:39:21] ok, added and reissued [08:39:23] now it's done [08:40:28] Looking at SAL we missed it last week [08:41:48] well, at least it was zotero which is only accessed by citoid which was switched over so no requests were making it to zotero anyway [08:42:07] so we got lucky it seems [08:42:54] https://grafana.wikimedia.org/d/000000519/kubernetes-overview?viewPanel=9&orgId=1&from=now-1h&to=now says we can begin [08:43:00] I 'll poweroff the master VMs [08:45:07] jayme: I 'll regenerate the kubemaster.svc.eqiad.wmnet cert now [08:45:19] okay, I'll merge the homer patch [08:49:54] akosiaris: the homer commands I can probably scope to the core routers like "cr*eqiad*" ? [08:51:20] hmm...no :) [08:51:34] yes you can [08:51:37] you should actually [08:51:55] homer "cr*codfw*" commit "T277191, adding kubernetes2017" [08:52:06] that was my homer invocation last time [08:52:19] but for "cr*eqiad" it just said "No diff" [08:52:31] "cr*eqiad*" [08:53:56] ;-) [08:54:35] oh, nono. I did "cr*eqiad*" diff and it selected both routers :) [08:54:50] yes, you want both :-) [08:55:01] you will have to answer twice yes though IIRC [08:55:19] I especially wanted a puppet run on cumin1001 before the diff ;-) [08:57:13] ok, Homer run completed successfully on 2 devices: ['cr1-eqiad.wikimedia.org', 'cr2-eqiad.wikimedia.org'] [08:58:12] perfect [08:58:20] so I guess, it's puppet merge and reimage time ? [08:58:23] akosiaris: I'll merge the worker puppet patch now and start reimaging [08:58:26] ack [08:58:32] ok, doing masters then [08:59:05] akosiaris: kill and reboot etcd first [09:00:23] indeed, good point [09:00:25] doing so [09:04:58] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` ['kubernetes1001.eqiad.wmnet', 'kubern...
[09:07:03] I'll do the ganeti instances as well [09:21:45] <_joe_> we can delete everything from etcd, if you don't want to reinitialize it, sorry late to the game [09:22:06] that's what we do :) [09:22:11] _joe_: done already [09:22:18] (reboot is just because of the kernel updates) [09:22:22] ah you mean pybal etcd ? [09:22:23] <_joe_> oh I see [09:22:27] <_joe_> no [09:22:31] we meant kubernetes etcd [09:22:34] <_joe_> I meant the k8s ones, [09:22:35] that is getting wiped [09:22:42] yeah the process is a wipe of that datastore [09:22:50] <_joe_> I thought you were going the nuclear way (remove the data files) [09:23:29] ETCDCTL_API=3 etcdctl --endpoints https://foobar.site.wmnet:2379 del "" --from-key=true [09:23:41] it's the neutron way [09:23:56] Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.15", GitCommit:"2adc8d7091e89b6e3ca8d048140618ec89b39369", GitTreeState:"clean", BuildDate:"2020-09-02T11:31:21Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"} [09:23:59] API is up and running [09:24:15] looking for issues now [09:24:46] old nodes are trying to authenticate, I see Unable to authenticate the request due to an error: [invalid bearer token, square/go-jose: error in cryptographic primitive] [09:24:58] akosiaris: did you do the controllermanager token? [09:25:04] which is when a no longer valid token is being sent to the API server, which should be from the nodes [09:25:05] yes I did [09:25:08] cool [09:25:28] ah dammit [09:25:35] oh [09:25:37] we depooled zotero but forgot to downtime it [09:25:38] lol [09:25:46] eh [09:25:48] ok ACKing it, [09:25:51] <_joe_> it's always zotero [09:26:46] dammit, I thought we would come through without an alert this time but now we got a page :-| [09:28:12] akosiaris: why the heck did we get "PROBLEM - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster1001 is CRITICAL" ? [09:28:34] obviously I've downtimed only the eqiad ones...
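For reference, the "delete everything" invocation quoted at [09:23:29] works by deleting from the empty key with --from-key=true, which covers the entire etcd v3 keyspace. A minimal sketch of the pattern; the endpoint is the placeholder hostname from the log, and the command is printed here rather than executed:

```shell
#!/bin/sh
# Sketch of the etcd v3 full-keyspace wipe quoted above.
# The endpoint is the placeholder from the log, not a real host.
ENDPOINT="https://foobar.site.wmnet:2379"

# del "" --from-key=true deletes every key >= "", i.e. the whole keyspace.
WIPE_CMD="ETCDCTL_API=3 etcdctl --endpoints ${ENDPOINT} del \"\" --from-key=true"

# Printed instead of executed; a real run could be verified afterwards with
# 'etcdctl get "" --from-key=true --keys-only' returning nothing.
echo "$WIPE_CMD"
```

This is gentler than _joe_'s "nuclear" option of removing the data files, since the cluster and its raft state stay intact while the data is emptied.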
[09:28:57] oh, I put only the eqiad ones in the list, didn't I ? [09:28:59] sigh [09:29:27] <_joe_> that's ok [09:29:37] oh wait, that shouldn't be alerting [09:29:46] kubemaster on codfw should be fine [09:29:47] <_joe_> akosiaris: disregard [09:29:47] it's codfw [09:29:58] <_joe_> it's a problem with the monitoring [09:30:04] ? [09:30:05] <_joe_> nothing should be wrong in codfw [09:30:22] <_joe_> do you want the lengthy explanation? [09:30:38] I think I have an idea. kubemaster is also used for ML ? [09:30:49] <_joe_> no, it's much lamer [09:30:59] dammit, I was hoping for a duplicate key [09:31:40] <_joe_> the two files for codfw and eqiad come from the same confd template [09:31:47] <_joe_> so the error file is the same for both [09:31:57] <_joe_> so the two alerts fire up at the same time [09:32:01] ouch :) [09:32:06] <_joe_> we'd have to duplicate the file per dc [09:32:28] and this is just for kubemaster? [09:32:56] why isn't it happening for the other services? [09:33:15] it's probably for all, but we did not get into the race condition of all nodes being depooled at the same time [09:33:49] <_joe_> ^^ [09:34:34] ok [09:36:15] jayme: I think I 'll merge the admin_ng/ change while waiting for the reimaging to finish [09:37:40] akosiaris: sure, go ahead. You might as well also sync already. The nodes should pop in one by one then [09:37:52] akosiaris: forgot this one https://gerrit.wikimedia.org/r/c/operations/puppet/+/674269 [09:39:31] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10JMeybohm) [09:41:04] jayme: ok, doing so. I 've merged the kube1017 patch as well [09:41:11] thanks [09:45:11] nodes have indeed started showing up in the api [09:47:16] cool! 
[09:47:50] akosiaris: we forgot 2017 in conftool-data as well https://gerrit.wikimedia.org/r/c/operations/puppet/+/674270 [09:48:32] lol, can't wait to get out from behind pybal [09:48:34] merging [09:48:48] do we need to add default state as well somewhere? [09:49:08] 1017 is weight:0 pooled:inactive now [09:49:12] we used to have it and then we removed it [09:49:25] it was causing its own set of issues [09:49:43] ah, okay. So I just set the weight to 10 and pool it manually [09:56:05] akosiaris: did you do helmfile sync with the -l steps or in one go? [09:56:06] omg, all cluster components seem to be running fine [09:56:26] jayme: up to calico one by one. After calico said ok, I went for the rest in 1 go [09:56:33] but nothing failed to start with [09:57:02] great [09:59:14] so, I think we are at the look for problems and if none are spotted deploy services stage [10:01:48] yep. I currently have 1015 that fails the check for successful puppet run [10:02:05] but the node itself looks fine [10:03:32] yeah, 1004 complained about it but a quick puppet run was fine. I think it's just latency on monitoring catching up with the state [10:07:29] *1005 not 1015 I meant. That one still hangs in the reboot cookbooks check [10:10:56] 1005 is green now as well [10:14:20] yeah, looks like we got a few last ones pending for 100{2,3,8,14} but it's things I am not worried about (EDAC, SSH) [10:14:24] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1003.eqiad.wmnet', 'kubernetes1002.eqiad.wmnet', 'kubernetes1010.eqiad.wmne... [10:14:50] I think we can deploy everything indeed. jayme: wanna have the honours?
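The manual pooling step mentioned at [09:49:43] would look roughly like the conftool invocation below. This is a hypothetical sketch: the selector field, object path, and weight syntax depend on the conftool-data schema, so treat it as illustrative only (printed, not executed):

```shell
#!/bin/sh
# Hypothetical conftool invocation for pooling a freshly added node.
# Field names and the weight value are assumptions from the conversation.
NODE="kubernetes1017.eqiad.wmnet"
POOL_CMD="confctl select 'name=${NODE}' set/weight=10:pooled=yes"
echo "$POOL_CMD"
```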
[10:15:50] akosiaris: we have a CRITICAL: Hosts in IPVS but unknown to PyBal: set(['kubernetes1017.eqiad.wmnet']) (it's from 1.5h ago, so I scheduled a recheck) [10:16:25] akosiaris: yeah, I can do the deploy-all :D [10:16:34] yeah that one has me puzzled too [10:16:48] I 've looked into IPVS, for the various services 1017 is pooled indeed [10:17:02] akosiaris: kubernetes*017 also do still show up as "staged" in netbox [10:17:05] pybal of course would love to remove it [10:17:29] jayme: ah good point let me fix that while you deploy [10:17:39] akosiaris: let me know how [10:17:47] or what I was missing [10:18:55] editing a web ui manually ? [10:19:03] uh [10:19:07] Staged -> Active is a manual thing IIRC [10:19:13] did not expect that :) okay [10:19:27] but!!! I found out something interesting [10:19:37] https://netbox.wikimedia.org/dcim/devices/2875/ says kubernetes2017 is a juniper [10:19:40] volans ^ [10:19:46] I think think that's true :P [10:19:54] I don't think* [10:19:59] although what do I know ? [10:20:08] unless I reimaged it as a juniper :-P [10:21:16] akosiaris: looking [10:21:17] LOL [10:21:50] I don't think we set that from any automation IIRC [10:22:05] it might be set manually by dcops upon device creation, but I need to check to confirm [10:22:24] ok, maybe then a human error.
makes sense [10:22:42] but we could add a report to check that [10:22:51] * jayme pooled all nodes [10:23:33] btw I'm working for you (wrt icinga downtime in spicerack) [10:23:38] ok, both nodes set to active now in netbox and platform attribute removed [10:24:16] usually we set Linux [10:24:42] I can do that, easy enough [10:24:55] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10Joe) [10:26:21] akosiaris: pybal diff check recovery just came in [10:26:30] \o/ [10:26:49] anyway don't worry, I'll have a look at the platform and see the current status [10:26:53] confd template is still crying for kubemaster [10:27:41] I am sensing a .err file [10:27:54] but IIRC that needs some manual action to clean the errors up, right? [10:27:56] yes, that [10:28:02] root@puppetmaster1001:/var/run/confd-template# find . [10:28:02] . [10:28:02] ./.kubemaster873550242.err [10:28:02] ./.kubemaster246259658.err [10:28:08] let me delete those [10:28:17] _joe_: why do we even keep those around btw ^ ? [10:28:22] I can't remember right now [10:29:12] <_joe_> akosiaris: because we check if the file generated is newer than the error [10:29:35] <_joe_> the problem probably is the generated file is untouched from the version before the errors? [10:30:29] I 'll need to look at the check but I can tell you deleting those fixes it [10:30:29] I've created T278221 fwiw [10:31:07] confd compilation was ok though already, those looked like stale files [10:31:32] anyway, confd recoveries came in [10:32:21] It's so nice to run kubectl get pods --all-namespaces -w and just see pod creation just scrolling by [10:32:44] eventstreams being deployed now [10:33:27] is there a way to search and cancel all the downtimes I've created in icinga? [10:42:51] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=6 [10:43:01] not nice for sure.
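The stale-error cleanup above amounts to removing the hidden .err marker files that confd leaves behind after a failed template render; because the check compares the rendered file's mtime against the marker, an old marker keeps the alert firing. A sketch against a scratch directory (on the real host the directory is /var/run/confd-template):

```shell
#!/bin/sh
# Recreate the stale-marker situation from the log in a scratch directory
# (on the real host this is /var/run/confd-template), then clean it up.
DIR="$(mktemp -d)"
touch "${DIR}/.kubemaster873550242.err" "${DIR}/.kubemaster246259658.err"

# Delete only the hidden *.err markers; this is what let the check recover.
find "$DIR" -name '.*.err' -type f -delete

# Count what's left (expect 0).
REMAINING="$(find "$DIR" -name '.*.err' -type f | wc -l)"
rmdir "$DIR"
echo "$REMAINING"
```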
A lot of click click [10:43:29] okay, that's the path I took [10:54:51] deploy went good, double checking time I think [10:57:09] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10Joe) Some data from one appserver: - httpd uses less than 1 GB of memory and 1 cpu. If we assume we'll reduce the number of workers, it can be safe to assume e.g. 600 MB a... [10:58:53] _joe_: btw, regarding https://phabricator.wikimedia.org/T278220 I got a plan to get cross fleet actual data about it. It's a couple of patches and an ok from o11y (it's a bit of data). Just not today [11:00:07] <_joe_> akosiaris: I've gathered some data, but manually [11:00:42] no need, we got systemd :) [11:00:56] <_joe_> that's what I meant with "manually" :P [11:01:32] <_joe_> there is also the problem that e.g. apache will hoard memory over time, and I think that's because it's unconstrained [11:03:39] linkrecommendation is a bit interesting. one of the containers is flapping between ready and not ready [11:03:53] probably nothing to do with the reinit, but an interesting action item [11:05:14] hmm, yeah.
readiness probe times out from time to time [11:05:18] Warning Unhealthy 83s (x120 over 30m) kubelet, kubernetes1008.eqiad.wmnet Readiness probe failed: Get http://10.64.68.207:8000/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers) [11:05:23] yup [11:05:26] interesting [11:05:41] let me create a task for that and move on to the rest [11:05:43] that's weird...should be almost a zero-cost thing [11:06:06] There was that issue with gunicorn and the 2 few workers, perhaps it's related [11:06:11] too few* [11:06:43] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10Joe) [11:06:59] yeah, sounds promising [11:07:34] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10Joe) [11:08:57] 10serviceops, 10Add-Link, 10Growth-Team: linkrecommendation flap their readiness probes too often - https://phabricator.wikimedia.org/T278223 (10akosiaris) [11:09:03] 10serviceops, 10Add-Link, 10Growth-Team: linkrecommendation flap their readiness probes too often - https://phabricator.wikimedia.org/T278223 (10akosiaris) p:05Triage→03Low [11:09:50] 10serviceops, 10Add-Link, 10Growth-Team: linkrecommendation flap their readiness probes too often - https://phabricator.wikimedia.org/T278223 (10akosiaris) Adding @kostajh and @Tgr for their information. [11:11:15] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10JMeybohm) > The goal is to pack 4 or even 5 pods in a single modern node. I recently created T277876 where I propose we should reserve some of the resources of each node... [11:13:03] I think everything is green, right ?
[11:13:10] 10serviceops, 10Lift-Wing, 10Machine-Learning-Team, 10Kubernetes: Allow k8s clusters to have their own k8s_infrastructure_users in puppet - https://phabricator.wikimedia.org/T278224 (10elukey) [11:13:20] 10serviceops, 10Lift-Wing, 10Machine-Learning-Team, 10Kubernetes: Allow k8s clusters to have their own k8s_infrastructure_users in puppet - https://phabricator.wikimedia.org/T278224 (10elukey) a:05klausman→03None [11:13:36] akosiaris: some other deployments hade timeouts during health checks as well https://logstash.wikimedia.org/goto/c5831fb2ac91c3be40ce9ed942556eb3 [11:13:41] *had [11:14:17] akosiaris: ignore me please [11:14:26] forgot to filter by cluster [11:15:10] so yeah. Everything looks pretty green [11:15:42] jayme: actually you are on to something https://logstash.wikimedia.org/goto/f678f4fd2634d1ff2fc329b247dc3943 [11:16:14] linkrecommendation is still heavily dominating that dashboard [11:16:20] yeah, but not eqiad specific [11:16:38] yeah, I don't think this has anything to do with eqiad [11:16:45] we just uncovered something while working on eqiad [11:17:36] btw, I 've said it before, but that's a pretty nice dashboard [11:17:44] elukey: ^ you will probably need it :P [11:17:48] thanks :) [11:18:16] that's top of what I'm able to do with kibana, though :D [11:18:34] 10serviceops: Productionise mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T278225 (10jijiki) p:05Triage→03Medium [11:19:27] jayme: I think it's lunch time over there. I 'd let it be for a while, then removing downtime after lunch [11:19:40] akosiaris: sounds good to me! [11:20:02] then let's see when we should pool the traffic back [11:20:06] ok, /me off for lunch [11:20:11] +1 [11:49:08] 10serviceops, 10Add-Link, 10Growth-Team: linkrecommendation flap their readiness probes too often - https://phabricator.wikimedia.org/T278223 (10kostajh) Thanks for the heads up. No idea what's causing that, but I don't think it's specific to our application code.
FWIW here is a stack trace of a call to `/he... [11:51:25] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10Joe) A typical appserver has 96 GB of memory and 48 cores. Let's assume we can use up to 85% of those with pods, which looks a bit conservative, but it's ok for our curren... [11:54:36] 10serviceops, 10Patch-For-Review: Productionise mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T278225 (10jijiki) [11:55:30] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10Joe) At 15 workers per pod, we get 5 pods per node (6 if we only reserve 5% of ram and cpu). That's more or less the maximum concurrency at which the sweet spot holds for... [12:17:13] 10serviceops, 10Discovery-Search, 10Maps, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): [OSM] Backport imposm3 to the debian channel - https://phabricator.wikimedia.org/T238753 (10MSantos) 05Open→03Resolved [13:16:10] akosiaris: health check from LVS failing would mean we had no healthy backend for linkrec, right? [13:18:06] yes, but this is weird. I mean.. why isn't this happening so much in codfw? [13:18:17] or is it ... let me check there [13:18:52] ah it's happening there too [13:19:18] however eqiad does seem to be worse off [13:19:31] and it's not even pooled [13:19:48] yeah [13:21:41] it's also something that clearly doesn't have to do with external vs internal releases [13:21:49] cause the internal ones issue that error too [13:25:43] ok, found it [13:26:06] memory usage is right on the limit and the pod is heavily throttled [13:26:07] hm? [13:26:12] oh [13:26:22] in both clusters? [13:26:25] why now and not before though...
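A hedged sketch of the kind of checks behind this diagnosis: the probe timeouts show up as pod events, and throttling can be read from the container_cpu_cfs_throttled_seconds_total metric that the cfs dashboard is built on. The namespace, label selector, and label values here are placeholders, and the commands are printed rather than executed:

```shell
#!/bin/sh
# Illustrative commands for diagnosing a throttled pod on the memory limit;
# the namespace and selector are placeholders.
NS="linkrecommendation"

# Probe failures such as "Readiness probe failed ... Client.Timeout" show
# up as events in the pod description:
DESCRIBE_CMD="kubectl -n ${NS} describe pod -l app=linkrecommendation"

# CPU throttling over the last 5 minutes, per container:
THROTTLE_QUERY='rate(container_cpu_cfs_throttled_seconds_total{namespace="linkrecommendation"}[5m])'

echo "$DESCRIBE_CMD"
echo "$THROTTLE_QUERY"
```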
[13:26:41] yes, both clusters [13:27:36] but something has definitely changed at some point https://grafana.wikimedia.org/d/CI6JRnLMz/linkrecommendation?viewPanel=127&orgId=1&var-dc=thanos&var-site=codfw&var-service=linkrecommendation&var-prometheus=k8s&var-container_name=All&from=now-7d&to=now [13:27:41] weird that this did not bite earlier [13:27:50] https://grafana.wikimedia.org/d/CI6JRnLMz/linkrecommendation?viewPanel=127&orgId=1&var-dc=thanos&var-site=codfw&var-service=linkrecommendation&var-prometheus=k8s&var-container_name=All&from=now-30d&to=now is even more telling [13:28:26] wow [13:30:51] I 'll bump limits by 50% and see what that gets us [13:31:16] aaaah I know why! [13:31:56] they changed gunicorn settings, didn't they? [13:32:18] https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/673004/2/gunicorn.conf.py#1 [13:32:24] yes. It's now 5 workers [13:32:36] and I am guessing every one of them is being initialized from scratch [13:33:46] I am gonna send a patch to preload the app [13:33:51] --preload of gunicorn [13:33:57] https://docs.gunicorn.org/en/0.16.0/configure.html#preload-app [13:35:41] +1 [13:41:09] 10serviceops, 10Add-Link, 10Growth-Team, 10Patch-For-Review: linkrecommendation flap their readiness probes too often - https://phabricator.wikimedia.org/T278223 (10akosiaris) This seems to be related to the increase of the number of workers. It was a 5x and of course it increased the memory usage of the p... [13:41:35] akosiaris: I tried preload but the app crashed badly. It will be more involved than just adding the setting to the conf file [13:42:07] akosiaris: but yeah, not your problem, we'll have a look at it [13:43:06] I did update the cfs dashboard to k8s 1.16 label values now as well https://grafana-rw.wikimedia.org/d/tn6gBadMz/jayme-container_cpu_cfs_throttled_seconds_total [13:44:06] kostajh: oh, I wasn't aware.
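The --preload idea being discussed: load the application once in the gunicorn master process so the five workers fork with shared, copy-on-write memory, instead of each worker importing the whole app from scratch. A sketch where "app:app" is a placeholder module path (printed, not executed):

```shell
#!/bin/sh
# Illustrative gunicorn invocation; "app:app" is a placeholder module path.
# The conf-file equivalent is: workers = 5 and preload_app = True
PRELOAD_CMD="gunicorn --workers 5 --preload app:app"
echo "$PRELOAD_CMD"
```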
If it's more involved we can bump the memory limit while figuring out what the app doesn't like about preloading [13:44:46] akosiaris: yeah there is some cryptic (to me) error about The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec() [13:44:59] I'll look into it more [13:45:36] oh, maybe it's just on my machine / OS. yay [13:46:06] yeah I was about to say. CoreFoundation sounds MacOSish [13:47:54] https://bugs.python.org/issue33725 heh [13:54:57] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10jijiki) Regarding reserving RAM for the node, after we complete T264604, we will have an estimation of how much memory we will need for onhost memcached. Right now we only... [13:57:14] jayme: Aside from that, I don't see anything else being an issue. Should we pool some services? Or leave it for tomorrow ? [13:58:11] akosiaris: I don't see anything either. I would say we pool something as it's still early [14:01:56] ok, let's start with some simple ones [14:03:05] I have expanded https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Prometheus with the missing steps :) [14:03:22] elukey: Thanks! [14:04:35] elukey: cool. The LVM stuff is also at https://wikitech.wikimedia.org/wiki/Prometheus#Add_filesystems_for_a_new_instance [14:04:46] (maybe just link) [14:05:10] jayme: I had to add a sed command in there to replace '-' with '--' otherwise the code didn't work [14:05:29] I updated the k8s docs, if it is horrible feel free to revert [14:06:01] if nobody opposes I'll start modifying dashboards like https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api to include the ml-serve clusters [14:06:39] oh no, it's fine. I just thought the prometheus page might be the right point to link to to not have duplicate docs [14:07:09] ahh yes yes [14:07:10] elukey: eheh, yeah. Feel free to make that variable dynamic. I was just lazy!
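The '-' to '--' sed mentioned for the LVM docs is most likely the device-mapper naming convention (an assumption, but it fits the filesystems-for-a-new-instance context): in /dev/mapper names a single '-' separates the VG and LV parts, so literal hyphens inside a name are escaped by doubling them. A small sketch:

```shell
#!/bin/sh
# Double the hyphens in a VG/LV name the way device-mapper does in
# /dev/mapper paths (assumption: this is what the docs' sed step is for).
dm_escape() {
    printf '%s\n' "$1" | sed 's/-/--/g'
}

# e.g. VG "vg-data" + LV "prometheus" -> /dev/mapper/vg--data-prometheus
dm_escape "vg-data"
# prints: vg--data
```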
[14:07:24] elukey: sure go ahead. We are planning though to do some spring cleaning on those after we are done with the reinits [14:07:41] we can probably ditch half those dashboards in fact [14:07:53] ah so I may need to wait then :D [14:08:35] akosiaris: ok, looks good. I'll make a patch to deploy it once the new image is built [14:08:51] all right so except the two tasks that I opened for namespaces and shared user configs I think that the ml-serve clusters are up! [14:09:07] thanks a lot for the patience and the support [14:11:39] kostajh: thanks! [14:12:09] elukey: thanks as well for the patience and the willingness to go through our process. I 'd like to believe it did become a bit better along the way [14:13:20] akosiaris: Tobias has also some backlog of notes to add IIRC, so we'll keep adding things as we go [14:17:50] ok [14:18:33] I 'll pool a few larger services. The 4-5 small ones don't have anything serious anyway [14:23:33] okay [14:24:17] ah finally, it's picking up a bit https://grafana.wikimedia.org/d/000000519/kubernetes-overview?viewPanel=9&orgId=1&from=now-30m&to=now [14:42:05] This has gone well, I am gonna do 1 more batch of services and leave the very large ones for the end [14:44:17] 10serviceops, 10MediaWiki-General, 10SRE, 10observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10AMooney) @lmata, This needs PET code review correct? [14:49:10] 10serviceops, 10MediaWiki-General, 10SRE, 10observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10lmata) @AMooney yes please [15:09:10] akosiaris, jayme: https://1.bp.blogspot.com/-mowX0GRZVJQ/VfjQ-1Cp85I/AAAAAAAABDc/RYST-flOr3g/s1600/IMG_5140.JPG [15:10:39] 10serviceops, 10Add-Link, 10Growth-Team, 10Patch-For-Review: linkrecommendation flap their readiness probes too often - https://phabricator.wikimedia.org/T278223 (10akosiaris) 05Open→03Resolved a:03akosiaris Deployment in eqiad worked.
https://grafana-rw.wikimedia.org/d/CI6JRnLMz/linkrecommendation?o... [15:11:06] effie: rotfl [15:11:15] effie: I got multiple ones for you ;-) [15:11:49] yes, they practically write themselves [15:48:20] Some 1600k rps in eqiad k8s. I think I am gonna pause at this for the day. [15:48:33] jayme: objections ? ^ [15:48:34] nice :) [15:48:52] No, I think it's fine to do the rest tomorrow morning [15:57:32] ok [16:11:30] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10User-jijiki: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527 (10JMeybohm) 05Open→03Resolved All nodes running kernel 4.19 now [16:11:32] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10JMeybohm) [16:12:41] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10JMeybohm) [16:12:44] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10JMeybohm) [16:13:34] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: Check/Rebuild all docker-pkg build docker images running on kubernetes - https://phabricator.wikimedia.org/T274254 (10JMeybohm) [16:13:57] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: Refactor users in production-images - https://phabricator.wikimedia.org/T274852 (10JMeybohm) 05Open→03Resolved Done and rolled out [16:14:32] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10JMeybohm) [16:14:34] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review, 10User-fsero: Set up PodSecurityPolicies in clusters - https://phabricator.wikimedia.org/T228967 (10JMeybohm) 
05Open→03Resolved This is active in all clusters now [16:14:36] 10serviceops, 10Prod-Kubernetes, 10User-fsero: Kubernetes clusters roadmap - https://phabricator.wikimedia.org/T212123 (10JMeybohm) [16:19:12] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10JMeybohm) [16:19:14] 10serviceops, 10Prod-Kubernetes, 10User-fsero: Kubernetes clusters roadmap - https://phabricator.wikimedia.org/T212123 (10JMeybohm) [16:19:16] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review, 10User-fsero: Set up PodSecurityPolicies in clusters - https://phabricator.wikimedia.org/T228967 (10JMeybohm) 05Resolved→03Open But wait, it's currently still not fully active and blocked by: T274262 This can be closed when https://gerrit.wikimedia.... [16:38:54] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) Trying to break down my current thoughts: ### Onhost memcached In terms of functionality, I don't see a difference between being a DaemonSet and running on the host its... [16:47:02] 10serviceops, 10Analytics-Radar, 10observability, 10Patch-For-Review, and 2 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (10thcipriani) [16:48:25] 10serviceops, 10Analytics-Radar, 10observability, 10Patch-For-Review, and 2 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (10thcipriani) Is this still in progress or is this work superseded by #mw-on-k8s work? 
[16:49:11] 10serviceops, 10PHP 7.2 support, 10Patch-For-Review: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (10thcipriani) [17:08:54] 10serviceops, 10MediaWiki-General, 10SRE, 10observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10AMooney) [17:21:23] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10JMeybohm) ### onhost memcached >It's still an open question how we will inject the node IP into the mcrouter configuration. it would mean we'd need to pass the host IP as an e...