[07:40:13] All this screaming in #-operations seems a bit scary at first, but it looks as if mwdebug1001 (high load) is solely responsible for the latency spikes. Is there a way to "know" if/who is testing something there?
[08:13:27] jayme: is mwdebug responsible for these spikes? https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=1601280766718&orgId=1&to=1601367166718&viewPanel=9&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200
[08:14:46] effie was working on some mwdebug and memcached related things yesterday and said if there were errors it could be related, but I don't know if that's still true today
[08:15:07] ah
[08:16:12] jynus: I think yes. I've excluded the mwdebug hosts from the metrics and everything looks fine again
[08:16:29] cool
[08:16:30] jayme: we have started another task to separate mwdebug metrics from production
[08:16:34] oh
[08:16:36] grr
[08:16:38] jynus:
[08:16:51] effie: yeah, I saw that one. +1
[08:17:06] you two should never talk to each other publicly on a channel
[08:17:13] sorry to ping, lots of errors were worrying
[08:17:13] jayme: I meant to say jynus :p
[08:17:36] if they are expected/not real user impact, no worries on my side anymore :-)
[08:17:46] there is no user impact
[08:17:56] thanks!
[08:17:58] effie: eheh
[08:18:17] BTW, maybe related
[08:18:55] there were a few UNCACHED downtimes on catchpoint tonight, check root mail
[08:33:08] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875 (10JMeybohm)
[08:34:30] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10JMeybohm)
[09:21:25] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: envoy service proxy: Add networkpolicy egress rule for enabled listeners - https://phabricator.wikimedia.org/T264076 (10JMeybohm)
[09:21:39] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: envoy service proxy: Add networkpolicy egress rule for enabled listeners - https://phabricator.wikimedia.org/T264076 (10JMeybohm) p:05Triage→03Medium
[09:55:00] akosiaris: removing the clusterrolebinding from the admin helmfile turned out to be not so smart :/ https://paste.debian.net/hidden/4947b26e/
[09:55:34] akosiaris: at least there is some kind of transition needed
[09:55:44] argh
[09:56:18] I've just added back the default we have in eqiad to the staging cluster (as the eqiad object is not tiller-managed)
[09:56:26] ok
[09:56:33] codfw will have the same issue I guess
[09:56:56] yeah! I *think* nothing broke badly... the nodes went to NotReady ofc
[09:58:16] As it is a binding, maybe I can simply add it under a different name in codfw, then let tiller delete the original one, recreate the default version and then remove the temporary one. What do you think?
[10:04:50] jayme: ah, that's a good idea
[10:04:52] +1
[10:12:49] akosiaris: interestingly, we don't have an rbac-deploy-clusterrole release at all in eqiad. That seems strange
[10:13:35] probably because it had failed to apply?
[10:13:45] and the deploy was "rolled back"?
[10:14:00] akosiaris: probably, yes. But why don't we see any issues then?
[10:17:51] akosiaris: ah... helm brainmelt situation
[10:18:06] lol
[10:18:25] I'll keep that phrase for future use :)
[10:18:41] so the roles are *probably there* and that is why we don't see any functional issues
[10:18:53] sure, feel free to :)
[10:19:40] in codfw and staging, I think I probably had deleted the binding and applied via helmfile
[10:19:45] I remember trying to establish it as an official codename for bugs in that area but the helm devs did not seem to like it very much :)
[10:19:58] in eqiad, we just had edited the initialize_cluster.sh script to fix it
[10:20:06] which is a better approach ofc
[10:21:03] It's not only the system:node binding. I thought we were missing all of common/rbac/rbac.yaml in eqiad
[10:21:13] (because helm was insisting there is no release)
[10:21:34] so all the tiller, rsyslog and prometheus stuff
[10:22:14] anyways... will try to fix the brainmelt after lunch and then try not to break codfw after that
[10:22:49] we should add --atomic to helmfile.d/admin/**/helmfile.yaml to avoid this shit
[10:22:59] * jayme adding a task
[10:22:59] wait... how does it work then?
[10:23:18] if we are missing rsyslog and prometheus a lot of stuff should be broken
[10:23:29] it works because helm is brainmelt and created all those bindings successfully
[10:24:07] but still complains that the release fails (e.g. does not exist because there is only one version of the release and that is in a failed state)
[10:24:15] ah, that makes sense
[10:24:34] yeah... *sense* ... in the helm universe :)
[10:26:11] lol
[10:26:56] going to lunch. Meanwhile, try not to apply helmfile.d/admin to codfw :D
[10:27:05] :)
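A rough sketch of the temporary-binding transition jayme proposes above, assuming the binding in question grants the system:node ClusterRole to the system:nodes group; the object name and the commands in the comments are illustrative, not the ones actually run:

    # Hypothetical temporary ClusterRoleBinding, applied by hand so kubelets keep
    # their permissions while tiller replaces the unmanaged default binding.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: system-node-transition   # temporary name, removed again afterwards
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: system:node
    subjects:
    - apiGroup: rbac.authorization.k8s.io
      kind: Group
      name: system:nodes
    # Rough order of operations (illustrative):
    #   1. kubectl apply -f system-node-transition.yaml
    #   2. apply the admin helmfile, letting tiller delete the old unmanaged
    #      binding and recreate the default, release-managed one
    #   3. kubectl delete clusterrolebinding system-node-transition

The --atomic idea jayme mentions above could look roughly like this in the admin helmfiles, assuming a helmfile version that supports helmDefaults.atomic (which passes helm's --atomic flag so a failed apply is rolled back instead of being left half-applied):

    # Hypothetical helmfile.d/admin/**/helmfile.yaml fragment
    helmDefaults:
      atomic: true          # roll the release back automatically if an apply fails
      # cleanupOnFail: true # optionally also delete resources created by the failed apply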
[10:32:16] 10serviceops, 10Operations, 10Patch-For-Review: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10ArielGlenn) 05Resolved→03Open New failure. Here's the output: ` Sep 28 16:38:10 deneb docker-report-releng[23588]: INFO[docker-report] Building debmonitor report for...
[13:24:20] akosiaris: citoid->zotero looking better now ofc. But still some 503 UC, I guess because of slow zotero: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&from=now-15m&to=now&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s-staging&var-app=citoid&var-destination=zotero
[13:25:47] yeah, judging from https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=46&orgId=1&refresh=5m&from=now-12h&to=now 500s are not unheard of for citoid. And it does gracefully fall back
[13:26:56] so if we get something like this as a pattern https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=46&orgId=1&refresh=5m&from=now-7d&to=now, we definitely have not made things worse
[13:27:45] sweet. I'll roll out to eqiad then to have it comparable tomorrow
[13:28:40] I guess the only traffic there is service-checker as well. But at least it's running all the time and not only when I run it manually, as in staging
[15:31:50] _joe_: can I have a little of your time?
[15:31:58] <_joe_> sure
[15:32:17] <_joe_> I'm writing a presentation, so any distraction is welcome :)
[15:32:21] So, I'm struggling with trying to make api-gateway in staging connect to mobileapps in staging
[15:32:33] I can't go over HTTPS, the certs won't match
[15:32:34] <_joe_> oh, interesting problem :)
[15:32:58] but for some reason when I go over staging.svc.eqiad.wmnet:8888 it's the wrong port - HTTP is not exposed
[15:32:59] <_joe_> akosiaris, jayme ^^
[15:33:21] <_joe_> mobileapps doesn't expose non-TLS anymore
[15:33:30] but can we expose it in staging?
[15:33:31] <_joe_> so we might need to re-define the certs for staging
[15:33:56] <_joe_> I'd prefer to have every service expose the same cert in staging, for staging.svc.eqiad.wmnet tbh
[15:34:26] I have a ticket, https://phabricator.wikimedia.org/T260917, for proper certs for staging
[15:34:38] <_joe_> oh great, I was about to ask
[15:34:51] <_joe_> and btw
[15:35:03] <_joe_> we also want to have an alternative configuration for the service-proxy
[15:35:14] <_joe_> not for api-gateway, but for the other services
[15:36:02] how would you suggest I proceed at this point? just leave staging in a broken state and wait for the ability to use TLS in staging?
[15:36:13] <_joe_> so for your problem we need to do the following
[15:36:33] <_joe_> - generate a cert for staging
[15:36:49] <_joe_> - provide it in /etc/helmfile-defaults/private/$service/staging.yaml
[15:37:01] <_joe_> Pchelolo: is it ok if this gets fixed, say, tomorrow?
[15:37:38] yeah, sure. I'm off tomorrow, but I will just deploy my thing to prod today. It's not something that can break everything
[15:37:47] and then follow up on staging
[15:38:08] <_joe_> +1
[15:38:15] alrighty! thank you
[15:38:50] 10serviceops, 10Kubernetes: Support TLS for service-to-service communication in k8s staging - https://phabricator.wikimedia.org/T260917 (10Joe) I think we could try to provide an easy way to use TLS by creating a certificate for `staging.svc.{eqiad,codfw}.wmnet` and distribute it as the cert/key pair for all s...
[15:39:44] 10serviceops, 10Kubernetes: Support TLS for service-to-service communication in k8s staging - https://phabricator.wikimedia.org/T260917 (10Joe) p:05Triage→03High
[15:41:05] I can look at that tomorrow I guess (if you don't want to use it as a distraction, _joe_)
[15:42:47] hey jayme - I have a calico change I'd like to apply, am I safe to go with that in codfw? (asking based on your comment earlier)
[15:42:48] <_joe_> jayme: be my guest
[15:43:44] hnowlan: I fixed the situation on all clusters, so go ahead. Thanks for asking!
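A sketch of what the per-service private values file _joe_ mentions above might end up containing once a shared staging certificate exists (T260917); the tls/certs key names are an assumption about the chart's values schema, not the confirmed layout:

    # Hypothetical /etc/helmfile-defaults/private/<service>/staging.yaml
    # (key layout is a guess; use whatever the chart actually expects)
    tls:
      certs:
        cert: |
          -----BEGIN CERTIFICATE-----
          ... certificate issued for staging.svc.{eqiad,codfw}.wmnet,
          ... shared by every service in the staging cluster
          -----END CERTIFICATE-----
        key: |
          -----BEGIN PRIVATE KEY-----
          ... matching private key
          -----END PRIVATE KEY-----

With every staging service presenting that one name, api-gateway could reach mobileapps over https://staging.svc.eqiad.wmnet on the service's TLS port and the certificate would validate, which is the problem Pchelolo ran into above.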
[15:44:51] 10serviceops, 10Push-Notification-Service, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): Push notification service should make deletion requests to the MW API for invalid or expired subscriptions - https://phabricator.wikimedia.org/T260247 (10Mholloway)
[15:46:56] 10serviceops, 10Push-Notification-Service, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): Push notification service should make deletion requests to the MW API for invalid or expired subscriptions - https://phabricator.wikimedia.org/T260247 (10Mholloway) a:05Mholloway→03None
[16:18:23] 10serviceops, 10Maps, 10Product-Infrastructure-Team-Backlog: [OSM] Install imposm3 in Maps master - https://phabricator.wikimedia.org/T238753 (10sdkim) a:03sdkim
[16:21:04] 10serviceops, 10Maps, 10Product-Infrastructure-Team-Backlog: [OSM] Install imposm3 in Maps master - https://phabricator.wikimedia.org/T238753 (10sdkim)
[17:29:02] 10serviceops, 10Push-Notification-Service, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): Push notification service should make deletion requests to MediaWiki for invalid or expired subscriptions - https://phabricator.wikimedia.org/T260247 (10Mholloway)
[21:28:08] 10serviceops, 10Operations, 10Traffic, 10Performance-Team (Radar), 10Sustainability: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10BBlack) 05Stalled→03Resolved a:03ema This should've been closed back when T250781 closed - all purge traffic now goes via kafka queues and mul...
[22:44:42] 10serviceops, 10Machine Learning Platform, 10ORES, 10Operations, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10calbon) {F32368154} Nope, no deploys have happened recently. It has been happening every few hours since the 24th