[03:30:56] <wikibugs>	 serviceops, DBA, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Krinkle) The regression can be seen on the MediaWiki RED dashboard as well, as a 40 percentage point drop in PHP-FPM responses that respon...
[06:07:08] <wikibugs>	 serviceops, DBA, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Marostegui) We pool and repool hosts pretty much all the time during core hours so it is sort of normal that anything can match any of tho...
[07:19:01] <wikibugs>	 serviceops, DBA, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Joe) Open→Invalid We've had all supportive services serving from codfw for the well announced rebuild of the eqiad kubernetes clus...
[07:38:37] <wikibugs>	 serviceops, DBA, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Marostegui) Thanks @Joe - I have started to repool db1141
[09:13:15] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes: Clean up/Consolidate kubernetes related dashboards - https://phabricator.wikimedia.org/T275641 (JMeybohm)
[09:17:20] <wikibugs>	 serviceops, Prod-Kubernetes, SRE, SRE-tools: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (JMeybohm)
[10:15:53] <wikibugs>	 serviceops, Prod-Kubernetes, observability, Kubernetes: Increase visibility of container/pod ressource exhaustion - https://phabricator.wikimedia.org/T266216 (JMeybohm)
[10:50:24] <wikibugs>	 serviceops, Analytics-Radar, Cassandra, ContentTranslation, and 9 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (MSantos)
[11:01:40] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (JMeybohm)
[11:06:59] <thesocialdev>	 hello everyone, I need help to clear the restbase cache for one article in order to test whether the mobileapps deployment was successful
[11:07:05] <thesocialdev>	 we used to have a procedure https://www.mediawiki.org/wiki/Wikimedia_Apps/Team/RESTBase_services_for_apps/Deployment_process#Refresh_RESTBase_cache
[11:08:10] <thesocialdev>	 But this is now obsolete or I can't do it from deployment.eqiad.wmnet, does anyone know how to properly fetch from restbase internally?
[12:13:47] <_joe_>	 thesocialdev: yes, let me update that page
[12:14:27] <_joe_>	 thesocialdev: before I edit it, can you give me the URL?
[12:14:31] <_joe_>	 so that we can test it
[12:15:17] <_joe_>	 sorry for not noticing earlier
[12:16:19] <_joe_>	 basically you need to call
[12:16:54] <_joe_>	 https://restbase.discovery.wmnet:7443 instead of the http url :)
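A minimal Python sketch of the change _joe_ describes: take a URL written for the old plain-http procedure and point it at https://restbase.discovery.wmnet:7443, keeping the path. The example URL, the helper names, and the `Cache-Control: no-cache` header are assumptions for illustration, not taken from the chat; the wiki page remains the reference for the actual procedure. The request is only built here, not sent.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request

# Internal HTTPS endpoint quoted in the chat.
RESTBASE_INTERNAL = "https://restbase.discovery.wmnet:7443"


def to_internal_https(old_url: str) -> str:
    """Swap scheme and host:port for the internal HTTPS endpoint,
    leaving the path and query string unchanged."""
    old = urlsplit(old_url)
    new = urlsplit(RESTBASE_INTERNAL)
    return urlunsplit((new.scheme, new.netloc, old.path, old.query, ""))


def build_refresh_request(old_url: str) -> Request:
    """Build (but do not send) a request against the internal endpoint."""
    # Assumption: a no-cache header is what asks for a fresh render.
    return Request(to_internal_https(old_url),
                   headers={"Cache-Control": "no-cache"})


# Hypothetical old-style URL; host, port and path are placeholders.
example = "http://restbase.example.invalid:1234/en.wikipedia.org/v1/page/mobile-html/Foo"
print(to_internal_https(example))
```

Sending it would be a matter of passing the request to `urllib.request.urlopen` from a host that can reach the internal endpoint.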
[12:50:12] <wikibugs>	 serviceops, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (hnowlan) For reference the 15s timeout is the Envoy default for upst...
[13:08:34] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (JMeybohm)
[13:08:40] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (JMeybohm) Open→Resolved a:JMeybohm It's safe to say we did this and we have tasks for follow ups (mostly from T277191)
[13:25:29] <wikibugs>	 serviceops, Prod-Kubernetes, SRE, Kubernetes: Archive/Remove deprecated calico gerrit repositories - https://phabricator.wikimedia.org/T267539 (JMeybohm) The first two repos were already read-only with an archived description. I've done so for the third one as well. @akosiaris do you have broad...
[13:28:16] <wikibugs>	 serviceops, Prod-Kubernetes, SRE, Kubernetes: Archive/Remove deprecated calico gerrit repositories - https://phabricator.wikimedia.org/T267539 (akosiaris) Open→Resolved a:akosiaris >>! In T267539#6941218, @JMeybohm wrote: > The first two repos were already read-only with an archived...
[13:28:19] <wikibugs>	 serviceops, Prod-Kubernetes, SRE, Kubernetes, User-fsero: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (akosiaris)
[13:50:48] <wikibugs>	 serviceops, Analytics-Radar, Cassandra, ContentTranslation, and 9 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (Ottomata)
[14:22:33] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes: Support multiple kubernetes versions with puppet - https://phabricator.wikimedia.org/T278329 (JMeybohm)
[14:30:43] <thesocialdev>	 _joe_: perfect, it works! thank you so much
[15:30:38] <elukey>	 qq from a k8s newbie - do we need to have tiller on the ml-clusters?
[15:31:08] <elukey>	 I also checked on kubemaster1001 (namespace kube-system) but I don't see pods for it
[15:31:24] <elukey>	 (or deployments)
[15:31:34] <jayme>	 elukey: we only use it for the service deployments
[15:31:51] <jayme>	 elukey: the admin stuff is migrated to helm 3 (which does not have/need tiller anymore)
[15:32:12] <jayme>	 so I guess you don't want tiller around
[15:33:01] <elukey>	 jayme: perfect thanks :) More basic question - I am trying to play a bit with kubectl, creating a deployment with an image from our docker registry, but I am hitting the wall of the pod security policies
[15:33:12] <elukey>	 (namely no pod is created, I see it via get events)
[15:33:30] <elukey>	 is there any RTFM somewhere on wikitech/deployment-chart that could help me? :)
[15:34:12] <jayme>	 "it's complicated" :D
[15:35:15] <elukey>	 lovely :D
[15:35:16] <jayme>	 I fear we don't have any docs on PSPs currently and it's pretty hard to understand tbh...that's why it will be removed again from k8s
[15:36:04] <elukey>	 I am asking since if I am not able to create a simple deployment/pod there is no chance that I can even start thinking about istio :D
[15:36:25] <jayme>	 let me look something up quickly
[15:38:46] <jayme>	 elukey: so the basic docs for all this are at https://v1-16.docs.kubernetes.io/docs/concepts/policy/pod-security-policy/ 
[15:39:11] <jayme>	 it might make sense to read through the example in there to get a bit of an idea how that works in general
[15:40:24] <jayme>	 we have two profiles "restricted" and "privileged" (all of that in helmfile_psp.yaml) where privileged can be used from kube-system namespace only
[15:40:46] <jayme>	 see the comments at the end of that file
[15:42:06] <jayme>	 the weird thing is: "Your" user needs to have "use" permissions for a PSP in order to be able to start a pod that the PSP's rules apply to
[15:42:46] <jayme>	 that's what the ClusterRoles "allow-privileged-psp" and "allow-restricted-psp" are for
[15:42:46] <wikibugs>	 serviceops, MW-on-K8s, Patch-For-Review: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (akosiaris) I 've just uploaded the above change for review. The idea is to use our current setup to gauge more accurately over a period of time what...
[15:44:00] <jayme>	 One of the ClusterRoles needs to be bound to "your" user, like it's done at the end of the helmfile_psp.yaml for the service account of the kube-system namespace
[15:45:45] <jayme>	 for the service namespaces that is done in the helmfile_namespaces.yaml (that also is where the deploy users get bound to the role)
[15:45:56] * elukey Bad memory access (SIGBUS) 
[15:47:11] <jayme>	 yeah, it's bad...and because nobody will ever be able to memorize that, I have this: https://phabricator.wikimedia.org/P15078
[15:48:27] <jayme>	 (which includes tiller, because I obviously need that for "our" clusters) - but it gives an idea
[15:50:39] <elukey>	 ah so you do a kubectl apply -f of that namespace?
[15:50:55] <jayme>	 yep
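A rough Python sketch of the binding jayme describes: a namespace plus a RoleBinding that lets the service accounts in that namespace "use" a PSP through an existing ClusterRole. The ClusterRole name "allow-restricted-psp" comes from the chat; the namespace name "psp-playground" and the helper functions are hypothetical, and this is not a copy of P15078 or of helmfile_namespaces.yaml. It prints a JSON `List` that could be saved to a file and passed to `kubectl apply -f`.

```python
import json

RBAC_API_GROUP = "rbac.authorization.k8s.io"


def namespace(name: str) -> dict:
    """Plain Namespace manifest."""
    return {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": name}}


def bind_psp_role(cluster_role: str, ns: str) -> dict:
    """RoleBinding granting an existing ClusterRole (which carries the
    'use' verb on a PodSecurityPolicy) to all service accounts in ns."""
    return {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": cluster_role, "namespace": ns},
        "roleRef": {
            "apiGroup": RBAC_API_GROUP,
            "kind": "ClusterRole",
            "name": cluster_role,
        },
        "subjects": [{
            "apiGroup": RBAC_API_GROUP,
            "kind": "Group",
            # Built-in group containing every service account of the namespace.
            "name": f"system:serviceaccounts:{ns}",
        }],
    }


# "psp-playground" is a made-up namespace name for illustration.
playground = "psp-playground"
manifest = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [namespace(playground),
              bind_psp_role("allow-restricted-psp", playground)],
}
print(json.dumps(manifest, indent=2))
```

Binding the role to the namespace's service accounts rather than to a human user matters for Deployments, where the pods are created by a controller and not by the person running kubectl.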
[15:51:20] <jayme>	 (now we can dump all the above into a wiki page and call it documentation, hehe :D)
[15:51:41] <elukey>	 ok so to have my playground I should create something similar, and then work on that namespace
[15:52:18] <akosiaris>	 I tend to piggyback on the default namespace when I do these things but tbh, I haven't tried with a PSP lately
[15:52:22] <jayme>	 correct. As long as you do your api calls with an admin user, you should be fine
[15:52:36] <akosiaris>	 but I think I abuse the admin account too?
[15:52:45] <akosiaris>	 yeah cause no other account has default access
[15:52:49] <jayme>	 akosiaris: default will probably not work anymore
[15:52:57] <akosiaris>	 ah dammit :P
[15:53:07] <akosiaris>	 I'll just copy yours :P
[15:53:08] <elukey>	 and the admin accounts are defined in.. ?
[15:53:15] <elukey>	 (last question for today I promise)
[15:53:24] <jayme>	 for single pods, yes - but probably not for something using replicas
[15:53:42] <akosiaris>	 on deploy1002:/etc/kubernetes/<cluster>-admin-<namespace>.config
[15:54:07] <akosiaris>	 and on the apiserver, the map is in the tokenfile that kube-apiserver is using
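As a small illustration of the two places akosiaris mentions, here is a hedged Python sketch. It assumes the tokenfile is in the upstream Kubernetes static token file format (CSV with token, user name, user uid, then an optional quoted list of groups); the sample rows, user names, and cluster name are invented and do not reflect the real files.

```python
import csv
import io


def admin_kubeconfig(cluster: str, ns: str) -> str:
    """Path pattern quoted in the chat for admin kubeconfigs on the deploy host."""
    return f"/etc/kubernetes/{cluster}-admin-{ns}.config"


def parse_token_file(text: str) -> dict:
    """Map user name -> uid and groups from static-token-file CSV text.
    The tokens themselves are deliberately not kept."""
    users = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        _token, user, uid = row[0], row[1], row[2]
        groups = row[3].split(",") if len(row) > 3 and row[3] else []
        users[user] = {"uid": uid, "groups": groups}
    return users


# Invented sample data, not real credentials or real user names.
SAMPLE = (
    'not-a-real-token,playground-admin,1001,"system:masters"\n'
    'another-fake-token,playground-deploy,1002\n'
)

print(admin_kubeconfig("mycluster", "psp-playground"))
print(parse_token_file(SAMPLE))
```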
[15:54:30] <akosiaris>	 is that what you are asking Luca? cause we can also give you the RBAC workshop
[15:54:38] <akosiaris>	 you will be happy you learnt kerberos ;)
[15:54:46] <jayme>	 hrhr
[15:56:55] <elukey>	 akosiaris: so I started with trying to use kubectl and ended up in here :D Jokes aside, if you have time for some teaching class for new k8s admins I'd really be happy :D
[15:57:12] <elukey>	 (nothing can be worse than Kerberos)
[15:57:22] <akosiaris>	 have you seen oauth2 ?
[15:57:29] <elukey>	 barely 
[15:57:32] <akosiaris>	 cause it's essentially kerberos 
[15:57:40] <akosiaris>	 the basic abstract concepts at least
[15:58:11] <akosiaris>	 but point taken
[15:58:33] <akosiaris>	 I think that Wolfgang has a workshop already, mutante already did it too
[15:59:02] <akosiaris>	 it might not be directly applicable to the ml-cluster, but it should be possible to do it locally as well
[16:05:49] <wikibugs>	 serviceops, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (BPirkle) Subscribing to follow along. May have implications for imag...
[16:07:38] <elukey>	 akosiaris: ack yes, but any info would be really nice for me and Tobias to know what to do etc.. I have to admit that I feel very ignorant and lost about k8s :D
[16:08:26] <akosiaris>	 not surprised. It took me months to start feeling ok around it
[16:08:33] <akosiaris>	 let me find that doc 
[16:08:53] <apergos>	 yeah there is a workshop all documented and stuff
[16:09:24] <akosiaris>	 elukey: I'll send you and Tobias the workshop
[16:09:45] <elukey>	 <3
[16:11:17] <apergos>	 https://wikitech.wikimedia.org/wiki/Kubernetes/Kubernetes_Workshop  
[16:11:45] <apergos>	 ah sorry, you were going to send it... but I think if it's on wikitech it's ok to just post to the channel
[16:12:14] <akosiaris>	 Ah it has moved to wikitech already
[16:12:17] <akosiaris>	 even better
[16:12:36] <akosiaris>	 elukey: you refer to it ^ then
[16:30:54] <elukey>	 wow how much time did it take to write that document??
[16:35:11] <akosiaris>	 a lot I guess
[16:35:24] <apergos>	 a long time i think, I saw early versions of it when people were being invited to try it out
[16:38:16] <akosiaris>	 all props to wkandek and mutante btw 
[16:44:04] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (aborrero)
[16:52:36] <wikibugs>	 serviceops, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (akosiaris) >>! In T277297#6941092, @hnowlan wrote: > For reference t...
[17:13:17] <wikibugs>	 serviceops, DBA, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Krinkle) @Joe Thanks, I did see the "kubernetes rebuild in eqiad" in the SAL but didn't connect the dots. Thanks.
[17:20:58] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes: Update kubernetes-client - https://phabricator.wikimedia.org/T278356 (JMeybohm)
[17:33:19] <wikibugs>	 serviceops, SRE, Patch-For-Review, User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (jijiki)
[17:33:55] <wikibugs>	 serviceops, SRE, Performance-Team (Radar), User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (jijiki)
[17:34:01] <wikibugs>	 serviceops, Performance-Team, SRE, Patch-For-Review, User-jijiki: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (jijiki)
[17:34:04] <wikibugs>	 serviceops, SRE, Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (jijiki)
[17:34:54] <wikibugs>	 serviceops, User-jijiki: Productionise mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T278225 (jijiki)
[20:31:14] <kostajh>	 I had to run `helmfile -e staging -i apply` twice in order for the new chart version to be seen as something to update, is that normal?
[20:48:45] <wikibugs>	 serviceops, DNS, SRE, Traffic, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (Dzahn)
[21:42:17] <wikibugs>	 serviceops, SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (jijiki) (sorry for not quoting) As far as connectivity goes, we can run both mcrouter and onhost memcached on a unix socket, if that is of any help. Generally speaking, we ha...
[23:29:34] <wikibugs>	 serviceops: bring 35 new mediawiki appserver in codfw into production (mw2377 and up) - https://phabricator.wikimedia.org/T278396 (Dzahn)