[07:40:13] All this screaming in #-operations seems a bit scary at first, but it looks as if mwdebug1001 (high load) is solely responsible for the latency spikes. Is there a way to "know" if/who is testing something there?
[08:13:27] jayme: is mwdebug responsible for these spikes? https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=1601280766718&orgId=1&to=1601367166718&viewPanel=9&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200
[08:14:46] effie was working on some mwdebug and memcached related things yesterday and said if there were errors it could be related, but I don't know if that's still true today
[08:15:07] ah
[08:16:12] jynus: I think yes. I've excluded the mwdebug hosts from the metrics and everything looks fine again
[08:16:29] cool
[08:16:30] jayme: we have started another task to separate mwdebug metrics from production
[08:16:34] oh
[08:16:36] grr
[08:16:38] jynus:
[08:16:51] effie: yeah, I saw that one. +1
[08:17:06] you two should never talk to each other publicly on a channel
[08:17:13] sorry to ping, lots of errors were worrying
[08:17:13] jayme: I meant to say jynus :p
[08:17:36] if they are expected/not real user impact, no worries on my side anymore :-)
[08:17:46] there is no user impact
[08:17:56] thanks!
[08:17:58] effie: eheh
[08:18:17] BTW, maybe related
[08:18:55] there were a few UNCACHED downtimes on catchpoint tonight, check root mail
[08:33:08] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875 (10JMeybohm)
[08:34:30] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10JMeybohm)
[09:21:25] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: envoy service proxy: Add networkpolicy egress rule for enabled listeners - https://phabricator.wikimedia.org/T264076 (10JMeybohm)
[09:21:39] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: envoy service proxy: Add networkpolicy egress rule for enabled listeners - https://phabricator.wikimedia.org/T264076 (10JMeybohm) p:05Triage→03Medium
[09:55:00] akosiaris: removing the clusterrolebinding from the admin helmfile turned out to be not so smart :/ https://paste.debian.net/hidden/4947b26e/
[09:55:34] akosiaris: at least there is some kind of transition needed
[09:55:44] argh
[09:56:18] I've just added back the default we have in eqiad to the staging cluster (as the eqiad object is not tiller-managed)
[09:56:26] ok
[09:56:33] codfw will have the same issue I guess
[09:56:56] yeah! I *think* nothing broke badly... the nodes went to NotReady ofc
[09:58:16] As it is a binding, maybe I can simply add it under a different name in codfw, then let tiller delete the original one, recreate the default version and then remove the temporary one. What do you think?
[10:04:50] jayme: ah, that's a good idea
[10:04:52] +1
[10:12:49] akosiaris: interestingly, we don't have an rbac-deploy-clusterrole release at all in eqiad. That seems strange
[10:13:35] probably because it had failed to apply?
[10:13:45] and the deploy was "rolled back"?
[10:14:00] akosiaris: probably, yes. But why don't we see any issues then?
[10:17:51] akosiaris: ah... helm brainmelt situation
[10:18:06] lol
[10:18:25] I'll keep that phrase for future use :)
[10:18:41] so the roles are *probably there* and that is why we don't see any functional issues
[10:18:53] sure, feel free to :)
[10:19:40] in codfw and staging, I think I probably had deleted the binding and applied via helmfile
[10:19:45] I remember trying to establish it as an official codename for bugs in that area but the helm devs did not seem to like it very much :)
[10:19:58] in eqiad, we just had edited the initialize_cluster.sh script to fix it
[10:20:06] which is a better approach ofc
[10:21:03] It's not only the system:node binding. I thought we were missing all of common/rbac/rbac.yaml in eqiad
[10:21:13] (because helm was insisting there is no release)
[10:21:34] so all the tiller, rsyslog and prometheus stuff
[10:22:14] anyways... will try to fix the brainmelt after lunch and then try not to break codfw after that
[10:22:49] we should add --atomic to helmfile.d/admin/**/helmfile.yaml to avoid this shit
[10:22:59] * jayme adding a task
[10:22:59] wait... how does it work then?
[10:23:18] if we are missing rsyslog and prometheus a lot of stuff should be broken
[10:23:29] it works because helm is brainmelt and created all those bindings successfully
[10:24:07] but still complains that the release fails (e.g. does not exist because there is only one version of the release and that is in a failed state)
[10:24:15] ah, that makes sense
[10:24:34] yeah... *sense* ... in the helm universe :)
[10:26:11] lol
[10:26:56] going to lunch. Meanwhile, try not to apply helmfile.d/admin to codfw :D
[10:27:05] :)
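A rough sketch of the temporary-binding transition jayme proposes above, assuming the binding in question grants the system:node ClusterRole to the system:nodes group; the object name and the commands in the comments are illustrative, not the ones actually run:

    # Hypothetical temporary ClusterRoleBinding, applied by hand so kubelets keep
    # their permissions while tiller replaces the unmanaged default binding.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: system-node-transition   # temporary name, removed again afterwards
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: system:node
    subjects:
    - apiGroup: rbac.authorization.k8s.io
      kind: Group
      name: system:nodes
    # Rough order of operations (illustrative):
    #   1. kubectl apply -f system-node-transition.yaml
    #   2. apply the admin helmfile, letting tiller delete the old unmanaged
    #      binding and recreate the default, release-managed one
    #   3. kubectl delete clusterrolebinding system-node-transition

The --atomic idea jayme mentions above could look roughly like this in the admin helmfiles, assuming a helmfile version that supports helmDefaults.atomic (which passes helm's --atomic flag so a failed apply is rolled back instead of being left half-applied):

    # Hypothetical helmfile.d/admin/**/helmfile.yaml fragment
    helmDefaults:
      atomic: true          # roll the release back automatically if an apply fails
      # cleanupOnFail: true # optionally also delete resources created by the failed apply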
[10:32:16] 10serviceops, 10Operations, 10Patch-For-Review: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10ArielGlenn) 05Resolved→03Open New failure. Here's the output: ` Sep 28 16:38:10 deneb docker-report-releng[23588]: INFO[docker-report] Building debmonitor report for...
[13:24:20] akosiaris: citoid->zotero looking better now ofc. But still some 503 UC, I guess because of slow zotero: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&from=now-15m&to=now&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s-staging&var-app=citoid&var-destination=zotero
[13:25:47] yeah, judging from https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=46&orgId=1&refresh=5m&from=now-12h&to=now 500s are not unheard of for citoid. And it does gracefully fall back
[13:26:56] so if we get something like this as a pattern https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=46&orgId=1&refresh=5m&from=now-7d&to=now, we definitely have not made things worse
[13:27:45] sweet. I'll roll out to eqiad then to have it comparable tomorrow
[13:28:40] I guess the only traffic there is service-checker as well. But at least it's running all the time and not only when I run it manually, as in staging
[15:31:50] _joe_: can I have a little of your time?
[15:31:58] <_joe_> sure
[15:32:17] <_joe_> I'm writing a presentation, so any distraction is welcome :)
[15:32:21] So, I'm struggling with trying to make api-gateway in staging connect to mobileapps in staging
[15:32:33] I can't go over HTTPS, the certs won't match
[15:32:34] <_joe_> oh, interesting problem :)
[15:32:58] but for some reason when I go over staging.svc.eqiad.wmnet:8888 it's the wrong port - HTTP is not exposed
[15:32:59] <_joe_> akosiaris, jayme ^^
[15:33:21] <_joe_> mobileapps doesn't expose non-TLS anymore
[15:33:30] but can we expose it in staging?
[15:33:31] <_joe_> so we might need to re-define the certs for staging
[15:33:56] <_joe_> I'd prefer to have every service expose the same cert in staging, for staging.svc.eqiad.wmnet tbh
[15:34:26] I have a ticket, https://phabricator.wikimedia.org/T260917, for proper certs for staging
[15:34:38] <_joe_> oh great, I was about to ask
[15:34:51] <_joe_> and btw
[15:35:03] <_joe_> we also want to have an alternative configuration for the service-proxy
[15:35:14] <_joe_> not for api-gateway, but for the other services
[15:36:02] how would you suggest I proceed at this point? just leave staging in a broken state and wait for the ability to use TLS in staging?
[15:36:13] <_joe_> so for your problem we need to do the following
[15:36:33] <_joe_> - generate a cert for staging
[15:36:49] <_joe_> - provide it in /etc/helmfile-defaults/private/$service/staging.yaml
[15:37:01] <_joe_> Pchelolo: is it ok if this gets fixed, say, tomorrow?
[15:37:38] yeah, sure. I'm off tomorrow, but I will just deploy my thing to prod today. It's not something that can break everything
[15:37:47] and then follow up on staging
[15:38:08] <_joe_> +1
[15:38:15] alrighty! thank you
[15:38:50] 10serviceops, 10Kubernetes: Support TLS for service-to-service communication in k8s staging - https://phabricator.wikimedia.org/T260917 (10Joe) I think we could try to provide an easy way to use TLS by creating a certificate for `staging.svc.{eqiad,codfw}.wmnet` and distribute it as the cert/key pair for all s...
[15:39:44] 10serviceops, 10Kubernetes: Support TLS for service-to-service communication in k8s staging - https://phabricator.wikimedia.org/T260917 (10Joe) p:05Triage→03High
[15:41:05] I can look at that tomorrow I guess (if you don't want to use it as a distraction, _joe_)
[15:42:47] hey jayme - I have a calico change I'd like to apply, am I safe to go with that in codfw? (asking based on your comment earlier)
[15:42:48] <_joe_> jayme: be my guest
[15:43:44] hnowlan: I fixed the situation on all clusters, so go ahead. Thanks for asking!
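A sketch of what the per-service private values file _joe_ mentions above might end up containing once a shared staging certificate exists (T260917); the tls/certs key names are an assumption about the chart's values schema, not the confirmed layout:

    # Hypothetical /etc/helmfile-defaults/private/<service>/staging.yaml
    # (key layout is a guess; use whatever the chart actually expects)
    tls:
      certs:
        cert: |
          -----BEGIN CERTIFICATE-----
          ... certificate issued for staging.svc.{eqiad,codfw}.wmnet,
          ... shared by every service in the staging cluster
          -----END CERTIFICATE-----
        key: |
          -----BEGIN PRIVATE KEY-----
          ... matching private key
          -----END PRIVATE KEY-----

With every staging service presenting that one name, api-gateway could reach mobileapps over https://staging.svc.eqiad.wmnet on the service's TLS port and the certificate would validate, which is the problem Pchelolo ran into above.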
[15:44:51] 10serviceops, 10Push-Notification-Service, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): Push notification service should make deletion requests to the MW API for invalid or expired subscriptions - https://phabricator.wikimedia.org/T260247 (10Mholloway)
[15:46:56] 10serviceops, 10Push-Notification-Service, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): Push notification service should make deletion requests to the MW API for invalid or expired subscriptions - https://phabricator.wikimedia.org/T260247 (10Mholloway) a:05Mholloway→03None
[16:18:23] 10serviceops, 10Maps, 10Product-Infrastructure-Team-Backlog: [OSM] Install imposm3 in Maps master - https://phabricator.wikimedia.org/T238753 (10sdkim) a:03sdkim
[16:21:04] 10serviceops, 10Maps, 10Product-Infrastructure-Team-Backlog: [OSM] Install imposm3 in Maps master - https://phabricator.wikimedia.org/T238753 (10sdkim)
[17:29:02] 10serviceops, 10Push-Notification-Service, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): Push notification service should make deletion requests to MediaWiki for invalid or expired subscriptions - https://phabricator.wikimedia.org/T260247 (10Mholloway)
[21:28:08] 10serviceops, 10Operations, 10Traffic, 10Performance-Team (Radar), 10Sustainability: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10BBlack) 05Stalled→03Resolved a:03ema This should've been closed back when T250781 closed - all purge traffic now goes via kafka queues and mul...
[22:44:42] 10serviceops, 10Machine Learning Platform, 10ORES, 10Operations, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10calbon) {F32368154} Nope, no deploys have happened recently. It has been happening every few hours since the 24th