[03:30:56] serviceops, DBA, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Krinkle) The regression can be seen on the MediaWiki RED dashboard as well, as a 40 percentage point drop in PHP-FPM responses that respon...
[06:07:08] serviceops, DBA, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Marostegui) We pool and repool hosts pretty much all the time during core hours so it is sort of normal that anything can match any of tho...
[07:19:01] serviceops, DBA, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Joe) Open→Invalid We've had all supportive services serving from codfw for the well announced rebuild of the eqiad kubernetes clus...
[07:38:37] serviceops, DBA, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Marostegui) Thanks @Joe - I have started to repool db1141
[09:13:15] serviceops, Prod-Kubernetes, Kubernetes: Clean up/Consolidate kubernetes related dashboards - https://phabricator.wikimedia.org/T275641 (JMeybohm)
[09:17:20] serviceops, Prod-Kubernetes, SRE, SRE-tools: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (JMeybohm)
[10:15:53] serviceops, Prod-Kubernetes, observability, Kubernetes: Increase visibility of container/pod ressource exhaustion - https://phabricator.wikimedia.org/T266216 (JMeybohm)
[10:50:24] serviceops, Analytics-Radar, Cassandra, ContentTranslation, and 9 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (MSantos)
[11:01:40] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (JMeybohm)
[11:06:59] hello everyone, I need help to clear a restbase cache for one article in order to test if the mobileapps was successful
[11:07:05] we used to have a procedure https://www.mediawiki.org/wiki/Wikimedia_Apps/Team/RESTBase_services_for_apps/Deployment_process#Refresh_RESTBase_cache
[11:08:10] But this is now obsolete or I can't do it from deployment.eqiad.wmnet, does anyone know how to properly fetch restbase internally?
[12:13:47] <_joe_> thesocialdev: yes, let me update that page
[12:14:27] <_joe_> thesocialdev: before I edit it, can you give me the URL?
[12:14:31] <_joe_> so that we can test it
[12:15:17] <_joe_> sorry for not noticing earlier
[12:16:19] <_joe_> basically you need to call
[12:16:54] <_joe_> https://restbase.discovery.wmnet:7443 instead of the http url :)
[12:50:12] serviceops, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (hnowlan) For reference the 15s timeout is the Envoy default for upst...
[13:08:34] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (JMeybohm)
[13:08:40] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (JMeybohm) Open→Resolved a:JMeybohm It's safe to say we did this and we have tasks for follow ups (mostly from T277191)
[13:25:29] serviceops, Prod-Kubernetes, SRE, Kubernetes: Archive/Remove deprecated calico gerrit repositories - https://phabricator.wikimedia.org/T267539 (JMeybohm) The first two repos where already read-only with an archived description. I've done so for the third one as well. @akosiaris do you have broad...
[13:28:16] serviceops, Prod-Kubernetes, SRE, Kubernetes: Archive/Remove deprecated calico gerrit repositories - https://phabricator.wikimedia.org/T267539 (akosiaris) Open→Resolved a:akosiaris >>! In T267539#6941218, @JMeybohm wrote: > The first two repos where already read-only with an archived...
[13:28:19] serviceops, Prod-Kubernetes, SRE, Kubernetes, User-fsero: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (akosiaris)
[13:50:48] serviceops, Analytics-Radar, Cassandra, ContentTranslation, and 9 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (Ottomata)
[14:22:33] serviceops, Prod-Kubernetes, Kubernetes: Support multiple kubernetes versions with puppet - https://phabricator.wikimedia.org/T278329 (JMeybohm)
[14:30:43] _joe_: perfect, it works! thank you so much
[15:30:38] qq from a k8s newbie - do we need to have tiller on the ml-clusters?
[15:31:08] I also checked on kubemaster1001 (namespace kube-system) but I don't see pods for it
[15:31:24] (or deployments)
[15:31:34] elukey: we only use it for the service deployments
[15:31:51] elukey: the admin stuff is migrated to helm 3 (which does not have/need tiller anymore)
[15:32:12] so I guess you don't want tiller around
[15:33:01] jayme: perfect thanks :) More basic question - I am trying to play a bit with kubectl, creating a deployment with an image from our docker registry, but I am hitting the wall of the pod security policies
[15:33:12] (namely no pod is created, I see it via get events)
[15:33:30] is there any RTFM somewhere on wikitech/deployment-chart that could help me? :)
[15:34:12] "it's complicated" :D
[15:35:15] lovely :D
[15:35:16] I fear we don't have any docs on PSPs currently and it's pretty hard to understand tbh...that's why it will be removed again from k8s
[15:36:04] I am asking since if I am not able to create a simple deployment/pod there is no chance that I can even start thinking about istio :D
[15:36:25] let me look something up quickly
[15:38:46] elukey: so the basic docs for all this are at https://v1-16.docs.kubernetes.io/docs/concepts/policy/pod-security-policy/
[15:39:11] it might make sense to read through the example in there to get a bit of an idea how that works in general
[15:40:24] we have two profiles "restricted" and "privileged" (all of that in helmfile_psp.yaml) where privileged can be used from kube-system namespace only
[15:40:46] see the comments at the end of that file
[15:42:06] the weird thing is: "Your" user needs to have "use" permissions for a PSP in order to be able to start a pod that applies to the PSP's rules
[15:42:46] that's what the ClusterRoles "allow-privileged-psp" and "allow-restricted-psp" are for
[15:42:46] serviceops, MW-on-K8s, Patch-For-Review: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (akosiaris) I've just uploaded the above change for review. The idea is to use our current setup to gauge more accurately over a period of time what...
[15:44:00] One of the ClusterRoles needs to be bound to "your" user, like it's done at the end of the helmfile_psp.yaml for the service account of the kube-system namespace
[15:45:45] for the service namespaces that is done in the helmfile_namespaces.yaml (that also is where the deploy users get bound to the role)
[15:45:56] * elukey Bad memory access (SIGBUS)
[15:47:11] yeah, it's bad...and because nobody will ever be able to memorize that, I have this: https://phabricator.wikimedia.org/P15078
[15:48:27] (which includes tiller, because I obviously need that for "our" clusters) - but it gives an idea
[15:50:39] ah so you do a kubectl apply -f of that namespace?
[15:50:55] yep
[15:51:20] (now we can dump all the above into a wiki page and call it documentation, hehe :D)
[15:51:41] ok so to have my playground I should create something similar, and then work on that namespace
[15:52:18] I tend to piggyback on the default namespace when I do these things but tbh, I haven't tried with a PSP lately
[15:52:22] correct. As long as you do your api calls with an admin user, you should be fine
[15:52:36] but I think I abuse the admin account too?
[15:52:45] yeah cause no other account has default access
[15:52:49] akosiaris: default will probably not work anymore
[15:52:57] ah dammit :P
[15:53:07] I'll just copy yours :P
[15:53:08] and the admin accounts are defined in.. ?
[15:53:15] (last question for today I promise)
[15:53:24] for single pods, yes - but probably not for something using replicas
[15:53:42] on deploy1002:/etc/kubernetes/-admin-.config
[15:54:07] and on the apiserver in the tokenfile that kube-apiserver is using is the map
[15:54:30] is that what you are asking Luca? cause we can also give you the RBAC workshop
[15:54:38] you will be happy you learnt kerberos ;)
[15:54:46] hrhr
[15:56:55] akosiaris: so I started with trying to use kubectl and ended up in here :D Jokes aside, if you have time for some teaching class for new k8s admins I'd really be happy :D
[15:57:12] (nothing can be worse than Kerberos)
[15:57:22] have you seen oauth2 ?
[15:57:29] barely
[15:57:32] cause it's essentially kerberos
[15:57:40] the basic abstract concepts at least
[15:58:11] but point taken
[15:58:33] I think that Wolfgang has a workshop already, mutante already did it too
[15:59:02] it might not be directly applicable to the ml-cluster, but it should be possible to do it locally as well
[16:05:49] serviceops, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (BPirkle) Subscribing to follow along. May have implications for imag...
[16:07:38] akosiaris: ack yes, but any info would be really nice for me and Tobias to know what to do etc.. I have to admit that I feel very ignorant and lost about k8s :D
[16:08:26] not surprised. It took me months to start feeling ok around it
[16:08:33] let me find that doc
[16:08:53] yeah there is a workshop all documented and stuff
[16:09:24] elukey: I'll send you and Tobias the workshop
[16:09:45] <3
[16:11:17] https://wikitech.wikimedia.org/wiki/Kubernetes/Kubernetes_Workshop
[16:11:45] ah sorry, you were going to send it... but I think if it's on wikitech it's ok to just post to the channel
[16:12:14] Ah it has moved to wikitech already
[16:12:17] even better
[16:12:36] elukey: you refer to it ^ then
[16:30:54] wow how much time did it take to write that document??
[16:35:11] a lot I guess
[16:35:24] a long time i think, I saw early versions of it when people were being invited to try it out
[16:38:16] all props to wkandek and mutante btw
[16:44:04] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (aborrero)
[16:52:36] serviceops, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (akosiaris) >>! In T277297#6941092, @hnowlan wrote: > For reference t...
[17:13:17] serviceops, DBA, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Krinkle) @Joe Thanks, I did see the "kubernetes rebuild in eqiad" in the SAL but didn't connect the dots. Thanks.
[17:20:58] serviceops, Prod-Kubernetes, Kubernetes: Update kubernetes-client - https://phabricator.wikimedia.org/T278356 (JMeybohm)
[17:33:19] serviceops, SRE, Patch-For-Review, User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (jijiki)
[17:33:55] serviceops, SRE, Performance-Team (Radar), User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (jijiki)
[17:34:01] serviceops, Performance-Team, SRE, Patch-For-Review, User-jijiki: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (jijiki)
[17:34:04] serviceops, SRE, Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (jijiki)
[17:34:54] serviceops, User-jijiki: Productionise mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T278225 (jijiki)
[20:31:14] I had to run `helmfile -e staging -i apply` twice in order for the new chart version to be seen as something to update, is that normal?
[20:48:45] serviceops, DNS, SRE, Traffic, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (Dzahn)
[21:42:17] serviceops, SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (jijiki) (sorry for not quoting) As far as connectivity goes, we can run both mcrouter and onhost memcached on a unix socket, if that is of any help. Generally speaking, we ha...
[23:29:34] serviceops: bring 35 new mediawiki appserver in codfw into production (mw2377 and up) - https://phabricator.wikimedia.org/T278396 (Dzahn)
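Note on the PSP discussion above (15:40 to 15:45): the messages describe a ClusterRole that grants the "use" verb on a pod security policy, which then has to be bound to whichever account starts the pods. The following is a minimal Python sketch of what such RBAC objects could look like, assuming the Kubernetes 1.16 API. The ClusterRole name `allow-restricted-psp` and the PSP name `restricted` are taken from the chat; the `playground` namespace and its `default` service account are hypothetical, and the real definitions in helmfile_psp.yaml and helmfile_namespaces.yaml (or in P15078) may well differ.

```python
import json


def psp_use_cluster_role(role_name, psp_name):
    """Build a ClusterRole that grants the "use" verb on one PodSecurityPolicy."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": role_name},
        "rules": [{
            "apiGroups": ["policy"],
            "resources": ["podsecuritypolicies"],
            "verbs": ["use"],
            "resourceNames": [psp_name],
        }],
    }


def bind_role_in_namespace(role_name, namespace, service_account):
    """Build a RoleBinding that lets one service account use the
    ClusterRole, scoped to a single namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {
            "name": f"{role_name}-{service_account}",
            "namespace": namespace,
        },
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": role_name,
        },
        "subjects": [{
            "kind": "ServiceAccount",
            "name": service_account,
            "namespace": namespace,
        }],
    }


def render(objects):
    """Serialize the objects as a v1 List in JSON, a form kubectl apply -f accepts."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": objects}, indent=2)


# "playground" and "default" are hypothetical placeholders, not names from the log.
role = psp_use_cluster_role("allow-restricted-psp", "restricted")
binding = bind_role_in_namespace("allow-restricted-psp", "playground", "default")
manifest = render([role, binding])
print(manifest)
```

Binding the role to the namespace's service account rather than only to the admin user is a choice made for this sketch: pods created by a controller (for example through a Deployment) are checked against the pod's service account, which would fit the remark that the admin account works "for single pods" but "probably not for something using replicas".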