[09:04:30] serviceops, Operations, Prod-Kubernetes, Kubernetes: redis for docker-registry should have maxmemory-policy set to allkeys-lru - https://phabricator.wikimedia.org/T256726 (JMeybohm)
[09:42:43] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Upgrade all TLS enabled charts to v0.2 tls_helper - https://phabricator.wikimedia.org/T253396 (JMeybohm) Open→Resolved Seems it is required to try to fetch the tag list while bypassing the caches once to have the lasting reference...
[09:42:47] serviceops, Operations, Prod-Kubernetes, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[10:01:03] serviceops, Operations, Prod-Kubernetes, Kubernetes: redis for docker-registry should have maxmemory-policy set to allkeys-lru - https://phabricator.wikimedia.org/T256726 (JMeybohm) p: Triage→Low
[10:46:10] hello there
[10:46:23] docker-reporter-base-images.service and docker-reporter-releng-images.service are sad on deneb
[10:46:41] Jun 30 00:01:31 deneb docker-report-base[22599]: requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://docker-registry.wikimedia.org/v2/envoy-tls-local-proxy/tags/list
[10:46:58] and:
[10:46:58] Jun 29 14:01:18 deneb docker-report-releng[23347]: requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://docker-registry.wikimedia.org/v2/_catalog?last=releng%2Fnode10&n=100
[10:47:29] in both cases after the error message there's a log entry that sounds suspiciously like a lie:
[10:47:34] Jun 30 00:01:31 deneb docker-report-base[22599]: All images submitted correctly!
[10:47:42] Jun 29 14:01:18 deneb docker-report-releng[23347]: All images submitted correctly!
[11:08:04] that was me ema
[11:08:15] at least the 404
[11:12:17] jayme: ack
[11:13:33] Sorry for the noise... will try to figure out what I've missed. I had removed the image envoy-tls-local-proxy from the registry but obviously not completely :-/
[11:14:13] jayme: I've tried restarting docker-reporter-base-images.service but it's still failing
[11:15:00] Yeah. The docker image tags are gone but the registry still advertises https://docker-registry.wikimedia.org/v2/envoy-tls-local-proxy ...
[11:15:14] jayme: which URL is the image catalog? Perhaps it's cached in the CDN :)
[11:15:34] ema: I've restarted docker-report-releng, that should be fine
[11:16:27] ema: https://docker-registry.wikimedia.org/v2/_catalog - it definitely is cached in the CDN, but the docker registry still includes the image in the catalog response as well
[11:18:29] mmh no, it looks like we're skipping the cache:
[11:18:29] < x-cache: cp3050 miss, cp3050 pass
[11:18:44] (which makes sense I suppose)
[11:20:38] I'm inclined to push a scratch docker image as envoy-tls-local-proxy:dontuseme to fix this
[11:20:43] (temporarily)
[11:25:47] did so and restarted docker-reporter-base-images
[11:25:57] ack
[11:26:07] you might want to !log this on #-operations
[11:33:04] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Upgrade all TLS enabled charts to v0.2 tls_helper - https://phabricator.wikimedia.org/T253396 (JMeybohm) Resolved→Open This led to failing docker-reporter-base-images.service on deneb. I'm definitely missing something here...
[11:33:07] serviceops, Operations, Prod-Kubernetes, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[11:33:24] yeah. Thanks for the heads up ema
[11:34:07] np!
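As an aside to the incident above: the following is a minimal sketch, not the actual docker-report code, of the kind of check that exposes the inconsistency ema and jayme describe — the catalog still advertising an image whose tags/list already returns 404. The image name comes from the log; pagination handling and hostname choice are simplified assumptions.

```python
#!/usr/bin/env python3
"""Sketch: does the registry still advertise an image whose tags are gone?"""
import requests

# Public name from the log above; internally docker-registry.discovery.wmnet
# could be used instead (see the discussion further down).
REGISTRY = "https://docker-registry.wikimedia.org"
IMAGE = "envoy-tls-local-proxy"

# NB: the real catalog is paginated (?last=...&n=...), so a complete check
# would follow the pagination instead of reading only the first page.
catalog = requests.get(f"{REGISTRY}/v2/_catalog", params={"n": 100}, timeout=30)
catalog.raise_for_status()
advertised = IMAGE in catalog.json().get("repositories", [])

tags = requests.get(f"{REGISTRY}/v2/{IMAGE}/tags/list", timeout=30)
print(f"{IMAGE}: in catalog={advertised}, tags/list HTTP {tags.status_code}")
if advertised and tags.status_code == 404:
    print("catalog still lists an image whose tags/list 404s -- the reporter will fail")
```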
[11:34:34] do we want to have the CDN cache docker-registry actually?
[11:36:44] caching stuff like https://docker-registry.wikimedia.org/v2/envoy-tls-local-proxy/tags/list seems like a recipe for headaches to me
[11:39:10] ah but it's open to the world and we probably don't want it to melt, I see
[11:40:20] Yea. Internally (for pulling images etc.) we're using docker-registry.discovery.wmnet (which I guess is not behind the CDN). Maybe that should be used for the docker-report jobs as well
[11:41:05] correct, docker-registry.discovery.wmnet is the service itself (ie: not behind the CDN)
[11:43:04] Will take another look after lunch. Something is definitely not okay there (although avoiding the CDN would not have helped us in this specific case)
[11:43:25] enjoy your lunch!
[13:14:19] ema: we've had issues in CI with it being cached too aggressively. It's now setting way too aggressive no-cache headers in the nginx fronting it
[13:15:09] chances are that we should revisit part of the decision. As you noted caching tags/list is a recipe for pain, but caching /v2/_catalog for like 5 mins isn't
[13:17:16] code is https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/docker_registry_ha/templates/registry-http-nginx.conf.erb#12 and task was: https://phabricator.wikimedia.org/T211719
[13:33:29] akosiaris: shouldn't we use docker-registry.discovery.wmnet for anything from within our network anyway?
[13:34:53] was about to change that in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/docker/reporter/report.pp#18
[13:48:29] jayme: yes, but CI isn't part of our network. It also resides heavily in WMCS
[13:48:45] akosiaris: also I *think* the TCP 81 server is not used anymore. LVS is configured to port 443 and I only see cache miss/pass for catalog and tags
[13:49:01] akosiaris: Ah, okay.
[13:49:27] akosiaris: correction: tags get a cache hit
[13:51:41] jayme: which endpoint do you refer to?
[13:52:10] e.g. for https://docker-registry.wikimedia.org/v2/wmfdebug/tags/list I see misses only
[13:52:15] so I guess not that?
[13:52:54] akosiaris: hmm https://paste.debian.net/1154502/
[13:54:49] interesting, plus I see the header missing now
[13:55:03] I can reproduce btw
[13:55:58] that's weird, CI should have begun complaining already if we are caching tag lists
[13:56:15] And this is the reference that TCP 81 is no longer used: https://github.com/wikimedia/puppet/blob/production/hieradata/common/service.yaml#L598
[14:34:01] serviceops, Operations, Epic, Patch-For-Review: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (dancy)
[14:35:16] serviceops, Operations, Epic, Patch-For-Review: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (hashar)
[14:36:25] serviceops, Operations, Epic, Patch-For-Review: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (hashar) CI still uses Jessie based containers from docker-registry.wikimedia.org/wikimedia-jessie. The last remaining task is to have some se...
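A small sketch of the kind of spot check jayme and akosiaris are doing above: print the X-Cache and Cache-Control headers returned for the catalog and for a tags/list URL. The wmfdebug image name is the one used in the log; whether a given request shows hit, miss or pass depends on the cache node and on the headers set by the nginx template linked above, so treat the output as illustrative only.

```python
#!/usr/bin/env python3
"""Sketch: compare CDN cache behaviour for registry catalog vs. tags/list."""
import requests

URLS = [
    "https://docker-registry.wikimedia.org/v2/_catalog",
    "https://docker-registry.wikimedia.org/v2/wmfdebug/tags/list",
]

for url in URLS:
    resp = requests.get(url, timeout=30)
    print(url)
    print("  status:       ", resp.status_code)
    # X-Cache shows hit/miss/pass per cache layer; Cache-Control is what the
    # nginx fronting the registry asks the CDN (and clients) to do.
    print("  x-cache:      ", resp.headers.get("X-Cache", "<missing>"))
    print("  cache-control:", resp.headers.get("Cache-Control", "<missing>"))
```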
[14:47:26] serviceops, Operations, Kubernetes: Fix nginx config and caching for docker registry - https://phabricator.wikimedia.org/T256762 (JMeybohm)
[15:17:41] serviceops, LDAP-Access-Requests, Operations, observability, Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (AMooney) @KFrancis, it looks like the form for this work has been Approved. Can this task move forwa...
[16:21:44] ahoyhoy. Some of the helm/envoy stuff for API gateway is getting towards done at the moment and I figured it might be time to ask about service discovery. Do ye have any strong preferences or ideas about how envoy instances should find appservers? There's a fairly wide range of options as to how we could do things (including just static configs given how infrequently we actually add new mw instances)
[16:22:04] obviously static configs is not really desirable or scalable longterm :)
[16:23:03] hnowlan: I think _j.oe_ has been thinking about that, he's back from vacation tomorrow
[16:23:39] there's some thinking about that from last year in https://docs.google.com/document/d/13eDGv-uP0_QmbxjxgzQ9zycLY_uoH6CZP525mSPKK-A/edit but
[16:23:41] the tentative plan is to build a full-on xDS server, but that only part-answers the question obviously
[16:23:53] beware that half of that document is stuff that has since happened, and the other half is still pie-in-the-sky
[16:24:15] (and I don't think the 'stuff that already happened' part was updated with "and here's how we actually implemented this")
[16:29:14] ah, cool- Thanks for the details! xDS seemed like the right idea to me but I wasn't sure how far into it we were willing to go in terms of work/resources.
[16:30:55] yeah I think it's some combination of "worth the effort" and "we knew he was going to come back from vacation having written *something* from scratch without pay, maybe it'll be this"
[16:31:29] haha
[16:32:26] someday the appservers will be k8s pods and then that part gets a lot simpler, so it might just be static until then
[16:32:38] static envoy config -> static input for the xDS server
[16:33:09] or if we want to get really fancy, maybe xDS gets it from puppet
[16:33:25] depends on how far out MW-on-k8s turns out to be, I guess
[16:37:30] rzl: he joked-not-joked to me privately about writing an xDS controller before he left
[16:37:50] no I know, he didn't even pretend to be joking to me
[17:39:37] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog: mobileapps kubernetes deployment is timing out - https://phabricator.wikimedia.org/T256786 (Mholloway)
[17:44:21] serviceops, Core Platform Team, Operations, Traffic, and 3 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (aaron) >>! In T250205#6158994, @Krinkle wrote: >>>! In T250205#6154883, @aaron wrote: >> I'm not fond of the idea of not sending purges for in...
[17:49:39] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog: mobileapps kubernetes deployment is timing out - https://phabricator.wikimedia.org/T256786 (Mholloway)
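To make rzl's "static envoy config -> static input for the xDS server" remark concrete, here is a toy sketch of that idea: turning a hard-coded backend list into an EDS DiscoveryResponse an xDS server could hand to envoy. The field names follow envoy's v3 endpoint-discovery API as I understand it, the hostnames are placeholders, and a real control plane (versioning, gRPC streaming, health checking, puppet-sourced data) would be considerably more involved.

```python
#!/usr/bin/env python3
"""Toy sketch: a static appserver list rendered as an envoy v3 EDS response."""
import json

# Placeholder backends -- in the scenario discussed above, this list might be
# generated from puppet data rather than hard-coded.
APPSERVERS = [("appserver-01.example.internal", 443),
              ("appserver-02.example.internal", 443)]

TYPE_URL = "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"


def cluster_load_assignment(cluster_name, hosts):
    """Build one ClusterLoadAssignment resource for the given cluster."""
    return {
        "@type": TYPE_URL,
        "cluster_name": cluster_name,
        "endpoints": [{
            "lb_endpoints": [
                {"endpoint": {"address": {"socket_address": {
                    "address": host, "port_value": port}}}}
                for host, port in hosts
            ],
        }],
    }


if __name__ == "__main__":
    response = {
        "version_info": "1",
        "type_url": TYPE_URL,
        "resources": [cluster_load_assignment("appservers", APPSERVERS)],
    }
    print(json.dumps(response, indent=2))
```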
[18:29:43] hello people, qq - is the mediawiki errors alert WIP by someone?
[18:30:28] it seems constantly flapping, IIUC due to parsoid/wtp new logging config, but it has been flapping for a long time
[18:30:45] hmm I don't know of any work
[18:31:16] you're right that it's noisy though -- offhand I don't know if we need to fix the alert or the errors
[18:31:23] prrrrobably the errors
[18:32:10] ah the task was https://phabricator.wikimedia.org/T256459
[18:44:42] totally a hack but https://gerrit.wikimedia.org/r/c/operations/puppet/+/608708
[18:53:39] oh
[18:53:52] Krinkle wrote some patches to raise the threshold over the weekend
[18:54:11] https://gerrit.wikimedia.org/r/c/operations/puppet/+/608188 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/608189
[18:54:16] I meant to merge those
[18:54:24] ahah
[18:57:05] sorry elukey but I like Krinkle's patches better, will start with those
[18:58:04] ah there are patches, good! I thought to merge it as temporary band aid waiting for something better, please go ahead with Timo's patches
[18:58:26] thanks :)
[19:01:06] thank you :)
[19:01:08] that alert had been annoying me too
[23:42:49] serviceops, Operations, Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (Krinkle)
[23:51:45] serviceops, Operations, observability, Performance-Team (Radar), User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (CDanis) There's no alert yet for memcache NIC saturation, and I don't believe there's one for TKOs either (@elukey is that ri...
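On the flapping-alert thread above: one common anti-flapping pattern, besides simply raising the threshold as Krinkle's patches do, is to require the threshold to be breached for several consecutive samples before alerting. The sketch below illustrates that idea against the Prometheus HTTP query API; the Prometheus URL, metric name and numbers are placeholders, not the actual WMF alert definition.

```python
#!/usr/bin/env python3
"""Sketch: only flag an error-rate breach after N consecutive bad samples."""
import time
import requests

PROMETHEUS = "https://prometheus.example.org/api/v1/query"
QUERY = 'sum(rate(mediawiki_errors_total[5m]))'   # placeholder metric/query
THRESHOLD = 5.0        # errors/sec; tuned to the noise level, as in the patches above
CONSECUTIVE = 3        # require N breaches in a row before alerting

breaches = 0
while True:
    result = requests.get(PROMETHEUS, params={"query": QUERY}, timeout=30).json()
    samples = result.get("data", {}).get("result", [])
    # Instant-vector results carry [timestamp, "value"] pairs.
    value = float(samples[0]["value"][1]) if samples else 0.0
    breaches = breaches + 1 if value > THRESHOLD else 0
    if breaches >= CONSECUTIVE:
        print(f"CRITICAL: error rate {value:.1f}/s above {THRESHOLD}/s "
              f"for {breaches} consecutive checks")
    time.sleep(60)
```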