[09:04:30] serviceops, Operations, Prod-Kubernetes, Kubernetes: redis for docker-registry should have maxmemory-policy set to allkeys-lru - https://phabricator.wikimedia.org/T256726 (JMeybohm)
[09:42:43] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Upgrade all TLS enabled charts to v0.2 tls_helper - https://phabricator.wikimedia.org/T253396 (JMeybohm) Open→Resolved Seems it is required to try to fetch the tag list while bypassing the caches once to have the lasting reference...
[09:42:47] serviceops, Operations, Prod-Kubernetes, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[10:01:03] serviceops, Operations, Prod-Kubernetes, Kubernetes: redis for docker-registry should have maxmemory-policy set to allkeys-lru - https://phabricator.wikimedia.org/T256726 (JMeybohm) p: Triage→Low
[10:46:10] hello there
[10:46:23] docker-reporter-base-images.service and docker-reporter-releng-images.service are sad on deneb
[10:46:41] Jun 30 00:01:31 deneb docker-report-base[22599]: requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://docker-registry.wikimedia.org/v2/envoy-tls-local-proxy/tags/list
[10:46:58] and:
[10:46:58] Jun 29 14:01:18 deneb docker-report-releng[23347]: requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://docker-registry.wikimedia.org/v2/_catalog?last=releng%2Fnode10&n=100
[10:47:29] in both cases after the error message there's a log entry that sounds suspiciously like a lie:
[10:47:34] Jun 30 00:01:31 deneb docker-report-base[22599]: All images submitted correctly!
[10:47:42] Jun 29 14:01:18 deneb docker-report-releng[23347]: All images submitted correctly!
[11:08:04] that was me ema
[11:08:15] at least the 404
[11:12:17] jayme: ack
[11:13:33] Sorry for the noise... will try to figure out what I've missed. I had removed the image envoy-tls-local-proxy from the registry but obviously not completely :-/
[11:14:13] jayme: I've tried restarting docker-reporter-base-images.service but it's still failing
[11:15:00] Yeah. The docker image tags are gone but the registry still advertises https://docker-registry.wikimedia.org/v2/envoy-tls-local-proxy ...
[11:15:14] jayme: which URL is the image catalog? Perhaps it's cached in the CDN :)
[11:15:34] ema: I've restarted docker-report-releng, that should be fine
[11:16:27] ema: https://docker-registry.wikimedia.org/v2/_catalog - it definitely is cached in the CDN, but the docker registry still includes the image in the catalog response as well
[11:18:29] mmh no, it looks like we're skipping the cache:
[11:18:29] < x-cache: cp3050 miss, cp3050 pass
[11:18:44] (which makes sense I suppose)
[11:20:38] I'm inclined to push a scratch docker image as envoy-tls-local-proxy:dontuseme to fix this
[11:20:43] (temporarily)
[11:25:47] did so and restarted docker-reporter-base-images
[11:25:57] ack
[11:26:07] you might want to !log this on #-operations
[11:33:04] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Upgrade all TLS enabled charts to v0.2 tls_helper - https://phabricator.wikimedia.org/T253396 (JMeybohm) Resolved→Open This led to failing docker-reporter-base-images.service on deneb. I'm definitely missing something here...
[11:33:07] serviceops, Operations, Prod-Kubernetes, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[11:33:24] yeah. Thanks for the heads up ema
[11:34:07] np!
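As an aside to the incident above: the following is a minimal sketch, not the actual docker-report code, of the kind of check that exposes the inconsistency ema and jayme describe — the catalog still advertising an image whose tags/list already returns 404. The image name comes from the log; pagination handling and hostname choice are simplified assumptions.

```python
#!/usr/bin/env python3
"""Sketch: does the registry still advertise an image whose tags are gone?"""
import requests

# Public name from the log above; internally docker-registry.discovery.wmnet
# could be used instead (see the discussion further down).
REGISTRY = "https://docker-registry.wikimedia.org"
IMAGE = "envoy-tls-local-proxy"

# NB: the real catalog is paginated (?last=...&n=...), so a complete check
# would follow the pagination instead of reading only the first page.
catalog = requests.get(f"{REGISTRY}/v2/_catalog", params={"n": 100}, timeout=30)
catalog.raise_for_status()
advertised = IMAGE in catalog.json().get("repositories", [])

tags = requests.get(f"{REGISTRY}/v2/{IMAGE}/tags/list", timeout=30)
print(f"{IMAGE}: in catalog={advertised}, tags/list HTTP {tags.status_code}")
if advertised and tags.status_code == 404:
    print("catalog still lists an image whose tags/list 404s -- the reporter will fail")
```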
[11:34:34] do we want to have the CDN cache docker-registry actually?
[11:36:44] caching stuff like https://docker-registry.wikimedia.org/v2/envoy-tls-local-proxy/tags/list seems like a recipe for headaches to me
[11:39:10] ah but it's open to the world and we probably don't want it to melt, I see
[11:40:20] Yea. Internally (for pulling images etc.) we're using docker-registry.discovery.wmnet (which I guess is not behind the CDN). Maybe that should be used for the docker-report jobs as well
[11:41:05] correct, docker-registry.discovery.wmnet is the service itself (ie: not behind the CDN)
[11:43:04] Will take another look after lunch. Something is definitely not okay there (although avoiding the CDN would not have helped us in this specific case)
[11:43:25] enjoy your lunch!
[13:14:19] ema: we've had issues in CI with it being cached too aggressively. It's now setting way too aggressive no-cache headers in the nginx fronting it
[13:15:09] chances are that we should revisit part of the decision. As you noted caching tags/list is a recipe for pain, but caching /v2/_catalog for like 5 mins isn't
[13:17:16] code is https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/docker_registry_ha/templates/registry-http-nginx.conf.erb#12 and task was: https://phabricator.wikimedia.org/T211719
[13:33:29] akosiaris: shouldn't we use docker-registry.discovery.wmnet for anything from within our network anyway?
[13:34:53] was about to change that in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/docker/reporter/report.pp#18
[13:48:29] jayme: yes, but CI isn't part of our network. It also resides heavily in WMCS
[13:48:45] akosiaris: also I *think* the TCP 81 server is not used anymore. LVS is configured to port 443 and I only see cache miss/pass for catalog and tags
[13:49:01] akosiaris: Ah, okay.
[13:49:27] akosiaris: correction: tags get a cache hit
[13:51:41] jayme: which endpoint do you refer to?
[13:52:10] e.g. for https://docker-registry.wikimedia.org/v2/wmfdebug/tags/list I see misses only
[13:52:15] so I guess not that?
[13:52:54] akosiaris: hmm https://paste.debian.net/1154502/
[13:54:49] interesting, plus I see the header missing now
[13:55:03] I can reproduce btw
[13:55:58] that's weird, CI should have begun complaining already if we are caching tag lists
[13:56:15] And this is the reference that TCP 81 is no longer used: https://github.com/wikimedia/puppet/blob/production/hieradata/common/service.yaml#L598
[14:34:01] serviceops, Operations, Epic, Patch-For-Review: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (dancy)
[14:35:16] serviceops, Operations, Epic, Patch-For-Review: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (hashar)
[14:36:25] serviceops, Operations, Epic, Patch-For-Review: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (hashar) CI still uses Jessie based containers from docker-registry.wikimedia.org/wikimedia-jessie. The last remaining task is to have some se...
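A small sketch of the kind of spot check jayme and akosiaris are doing above: print the X-Cache and Cache-Control headers returned for the catalog and for a tags/list URL. The wmfdebug image name is the one used in the log; whether a given request shows hit, miss or pass depends on the cache node and on the headers set by the nginx template linked above, so treat the output as illustrative only.

```python
#!/usr/bin/env python3
"""Sketch: compare CDN cache behaviour for registry catalog vs. tags/list."""
import requests

URLS = [
    "https://docker-registry.wikimedia.org/v2/_catalog",
    "https://docker-registry.wikimedia.org/v2/wmfdebug/tags/list",
]

for url in URLS:
    resp = requests.get(url, timeout=30)
    print(url)
    print("  status:       ", resp.status_code)
    # X-Cache shows hit/miss/pass per cache layer; Cache-Control is what the
    # nginx fronting the registry asks the CDN (and clients) to do.
    print("  x-cache:      ", resp.headers.get("X-Cache", "<missing>"))
    print("  cache-control:", resp.headers.get("Cache-Control", "<missing>"))
```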
[14:47:26] serviceops, Operations, Kubernetes: Fix nginx config and caching for docker registry - https://phabricator.wikimedia.org/T256762 (JMeybohm)
[15:17:41] serviceops, LDAP-Access-Requests, Operations, observability, Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (AMooney) @KFrancis, it looks like the form for this work has been Approved. Can this task move forwa...
[16:21:44] ahoyhoy. Some of the helm/envoy stuff for API gateway is getting towards done at the moment and I figured it might be time to ask about service discovery. Do ye have any strong preferences or ideas about how envoy instances should find appservers? There's a fairly wide range of options as to how we could do things (including just static configs given how infrequently we actually add new mw instances)
[16:22:04] obviously static configs is not really desirable or scalable longterm :)
[16:23:03] hnowlan: I think _j.oe_ has been thinking about that, he's back from vacation tomorrow
[16:23:39] there's some thinking about that from last year in https://docs.google.com/document/d/13eDGv-uP0_QmbxjxgzQ9zycLY_uoH6CZP525mSPKK-A/edit but
[16:23:41] the tentative plan is to build a full-on xDS server, but that only part-answers the question obviously
[16:23:53] beware that half of that document is stuff that has since happened, and the other half is still pie-in-the-sky
[16:24:15] (and I don't think the 'stuff that already happened' part was updated with "and here's how we actually implemented this")
[16:29:14] ah, cool- Thanks for the details! xDS seemed like the right idea to me but I wasn't sure how far into it we were willing to go in terms of work/resources.
[16:30:55] yeah I think it's some combination of "worth the effort" and "we knew he was going to come back from vacation having written *something* from scratch without pay, maybe it'll be this"
[16:31:29] haha
[16:32:26] someday the appservers will be k8s pods and then that part gets a lot simpler, so it might just be static until then
[16:32:38] static envoy config -> static input for the xDS server
[16:33:09] or if we want to get really fancy, maybe xDS gets it from puppet
[16:33:25] depends on how far out MW-on-k8s turns out to be, I guess
[16:37:30] rzl: he joked-not-joked to me privately about writing an xDS controller before he left
[16:37:50] no I know, he didn't even pretend to be joking to me
[17:39:37] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog: mobileapps kubernetes deployment is timing out - https://phabricator.wikimedia.org/T256786 (Mholloway)
[17:44:21] serviceops, Core Platform Team, Operations, Traffic, and 3 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (aaron) >>! In T250205#6158994, @Krinkle wrote: >>>! In T250205#6154883, @aaron wrote: >> I'm not fond of the idea of not sending purges for in...
[17:49:39] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog: mobileapps kubernetes deployment is timing out - https://phabricator.wikimedia.org/T256786 (Mholloway)
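To make rzl's "static envoy config -> static input for the xDS server" remark concrete, here is a toy sketch of that idea: turning a hard-coded backend list into an EDS DiscoveryResponse an xDS server could hand to envoy. The field names follow envoy's v3 endpoint-discovery API as I understand it, the hostnames are placeholders, and a real control plane (versioning, gRPC streaming, health checking, puppet-sourced data) would be considerably more involved.

```python
#!/usr/bin/env python3
"""Toy sketch: a static appserver list rendered as an envoy v3 EDS response."""
import json

# Placeholder backends -- in the scenario discussed above, this list might be
# generated from puppet data rather than hard-coded.
APPSERVERS = [("appserver-01.example.internal", 443),
              ("appserver-02.example.internal", 443)]

TYPE_URL = "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"


def cluster_load_assignment(cluster_name, hosts):
    """Build one ClusterLoadAssignment resource for the given cluster."""
    return {
        "@type": TYPE_URL,
        "cluster_name": cluster_name,
        "endpoints": [{
            "lb_endpoints": [
                {"endpoint": {"address": {"socket_address": {
                    "address": host, "port_value": port}}}}
                for host, port in hosts
            ],
        }],
    }


if __name__ == "__main__":
    response = {
        "version_info": "1",
        "type_url": TYPE_URL,
        "resources": [cluster_load_assignment("appservers", APPSERVERS)],
    }
    print(json.dumps(response, indent=2))
```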
[18:29:43] hello people, qq - is the mediawiki errors alert WIP by someone?
[18:30:28] it seems constantly flapping, IIUC due to parsoid/wtp new logging config, but it has been flapping for a long time
[18:30:45] hmm I don't know of any work
[18:31:16] you're right that it's noisy though -- offhand I don't know if we need to fix the alert or the errors
[18:31:23] prrrrobably the errors
[18:32:10] ah the task was https://phabricator.wikimedia.org/T256459
[18:44:42] totally a hack but https://gerrit.wikimedia.org/r/c/operations/puppet/+/608708
[18:53:39] oh
[18:53:52] Krinkle wrote some patches to raise the threshold over the weekend
[18:54:11] https://gerrit.wikimedia.org/r/c/operations/puppet/+/608188 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/608189
[18:54:16] I meant to merge those
[18:54:24] ahah
[18:57:05] sorry elukey but I like Krinkle's patches better, will start with those
[18:58:04] ah there are patches, good! I thought to merge it as temporary band aid waiting for something better, please go ahead with Timo's patches
[18:58:26] thanks :)
[19:01:06] thank you :)
[19:01:08] that alert had been annoying me too
[23:42:49] serviceops, Operations, Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (Krinkle)
[23:51:45] serviceops, Operations, observability, Performance-Team (Radar), User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (CDanis) There's no alert yet for memcache NIC saturation, and I don't believe there's one for TKOs either (@elukey is that ri...
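On the flapping-alert thread above: one common anti-flapping pattern, besides simply raising the threshold as Krinkle's patches do, is to require the threshold to be breached for several consecutive samples before alerting. The sketch below illustrates that idea against the Prometheus HTTP query API; the Prometheus URL, metric name and numbers are placeholders, not the actual WMF alert definition.

```python
#!/usr/bin/env python3
"""Sketch: only flag an error-rate breach after N consecutive bad samples."""
import time
import requests

PROMETHEUS = "https://prometheus.example.org/api/v1/query"
QUERY = 'sum(rate(mediawiki_errors_total[5m]))'   # placeholder metric/query
THRESHOLD = 5.0        # errors/sec; tuned to the noise level, as in the patches above
CONSECUTIVE = 3        # require N breaches in a row before alerting

breaches = 0
while True:
    result = requests.get(PROMETHEUS, params={"query": QUERY}, timeout=30).json()
    samples = result.get("data", {}).get("result", [])
    # Instant-vector results carry [timestamp, "value"] pairs.
    value = float(samples[0]["value"][1]) if samples else 0.0
    breaches = breaches + 1 if value > THRESHOLD else 0
    if breaches >= CONSECUTIVE:
        print(f"CRITICAL: error rate {value:.1f}/s above {THRESHOLD}/s "
              f"for {breaches} consecutive checks")
    time.sleep(60)
```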