[02:21:05] serviceops, Data-Persistence, Patch-For-Review: Deploy instance of hoarde as artifact-cache(?) in k8s - https://phabricator.wikimedia.org/T414112#11506054 (Ottomata) > Bikeshedding of service names welcome! :) Do you prefer Artifact over Entity? Entity seems like a more common term for this concep...
[02:51:10] serviceops, Patch-For-Review: Move EXCLUDED_SERVICES attribute from sre.discovery.datacentre to service catalog - https://phabricator.wikimedia.org/T412211#11506068 (Scott_French) After some [[ https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1224041/comment/ba8bbabc_d88d266c/ | discussio...
[09:00:45] serviceops, MediaWiki-extensions-Score, Wikimedia-SVG-rendering, Upstream: Deploy LilyPond 2.24 with Cairo support to shellbox containers - https://phabricator.wikimedia.org/T385404#11506405 (AnthonyFok) a:akosiaris
[09:04:19] serviceops, Growth-Team, MW-on-K8s: Do not alert about a failed cron job when logs are already discarded - https://phabricator.wikimedia.org/T414167 (Urbanecm_WMF) NEW
[09:06:28] serviceops, Growth-Team, MW-on-K8s: Do not alert about a failed cron job when logs are already discarded - https://phabricator.wikimedia.org/T414167#11506430 (Urbanecm_WMF) I noticed [Logstash has the error logs](https://logstash.wikimedia.org/goto/46c4c819d5b06186796398e71bf1bcb6). They are very har...
[09:07:41] serviceops, Growth-Team, MW-on-K8s: Do not alert about a failed cron job when logs are already discarded - https://phabricator.wikimedia.org/T414167#11506436 (Urbanecm_WMF) Since the error is now investigated, I manually deleted it to reset the alerting.
[09:12:52] serviceops, Ceph, Data-Persistence, SRE-swift-storage: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11506446 (elukey) Tried to replicate Alex's test with the following: On registry1004 (not serving live traffic): - `sudo iptables -A INPUT -p tcp -s 10.192.32.7...
[09:29:37] serviceops, MediaWiki-extensions-Score, Wikimedia-SVG-rendering, Upstream: Deploy LilyPond 2.24 with Cairo support to shellbox containers - https://phabricator.wikimedia.org/T385404#11506503 (akosiaris) a:akosiaris→None Thanks for your work on backporting LilyPond to bookworm-backports, th...
[09:56:23] serviceops, Ceph, Data-Persistence, SRE-swift-storage: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11506561 (elukey) The problem seems an exact replica of https://github.com/distribution/distribution/issues/2225, so I tried to add the following snippet to the r...
[10:10:09] serviceops, Ceph, Data-Persistence, SRE-swift-storage: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11506589 (elukey) The only thing that I found on the docker distribution logs was: ` Jan 09 09:52:25 registry1004 docker-registry[676]: time="2026-01-09T09:52...
[10:52:30] serviceops, Ceph, Data-Persistence, SRE-swift-storage: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11506742 (elukey) The very interesting thing is that after a few tries I got: ` elukey@build2001:~$ sudo docker push docker-registry.svc.eqiad.wmnet/test/istio/b...
[10:54:56] serviceops, Ceph, SRE-swift-storage, Release-Engineering-Team (Radar): Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11506760 (elukey) To keep archives happy - I am working in T394476 to properly onboard ceph apu...
[11:17:56] serviceops, Ceph, SRE-swift-storage, Release-Engineering-Team (Radar): Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11506803 (elukey)
[11:17:59] serviceops, Patch-For-Review: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251#11506802 (elukey)
[11:30:10] serviceops, SRE Observability, MediaWiki-Platform-Team (Kanban Board): Improve MediaWiki periodic job alerts - https://phabricator.wikimedia.org/T412799#11506857 (DAlangi_WMF) Anything left to do here? Seems like the patch has been merged (and has rolled out?), and this could perhaps be resolved. I...
[11:30:48] serviceops, Ceph, Data-Persistence, SRE-swift-storage: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11506858 (elukey) I had a chat with Matthew about apus, and they confirmed that there is no explicit rate/bw limit in place for the docker-registry account. I obs...
[12:19:44] serviceops, Ceph, Data-Persistence, SRE-swift-storage: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11507159 (elukey) Since having nginx is not really needed for this test, I went back to testing with a direct push to registry1004.eqiad.wmnet:5002: ` elukey@bui...
[12:31:14] serviceops, Traffic, MediaWiki-Platform-Team (Kanban Board), OKR-Work, Patch-For-Review: api-gateway helm chart: rest routes should return retry-after when a rate limit applies. - https://phabricator.wikimedia.org/T405636#11507215 (daniel) Open→In progress
[12:31:50] serviceops, Traffic, MediaWiki-Platform-Team (Kanban Board), OKR-Work, Patch-For-Review: api-gateway helm chart: rest routes should return retry-after when a rate limit applies. - https://phabricator.wikimedia.org/T405636#11507235 (daniel) a:Hokwelum→daniel
[12:53:10] serviceops, Patch-For-Review: Move EXCLUDED_SERVICES attribute from sre.discovery.datacentre to service catalog - https://phabricator.wikimedia.org/T412211#11507303 (Blake) Based on the discussion in IRC about Scott's analysis above, it seems like we're good to remove this cookbook. I'll put together a...
[14:39:30] serviceops, Ceph, Data-Persistence, SRE-swift-storage: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11507647 (elukey) Retried the same op that led to the HTTP 500 after lunch: ` elukey@build2001:~$ sudo docker push registry1004.eqiad.wmnet:5002/test/cert-manage...
[15:20:21] serviceops, MW-on-K8s, Release-Engineering-Team (Priority Backlog 📥): mwdebug: people in the "deployment" group should be able to launch 'experimental' instances for testing purposes - https://phabricator.wikimedia.org/T324003#11507773 (MLechvien-WMF) Open→Resolved a:MLechvien-WMF Clo...
[15:32:16] serviceops, Epic, ServiceOps new: ☂️ [FY2025-26][Hypothesis] WE6.2.1 Production Readiness Checklist - https://phabricator.wikimedia.org/T400263#11507821 (MLechvien-WMF)
[15:32:49] serviceops, Epic, ServiceOps new: ☂️ [FY2025-26][Hypothesis] WE6.2.1 Production Readiness Checklist - https://phabricator.wikimedia.org/T400263#11507824 (MLechvien-WMF) p:Triage→Medium
[15:36:22] serviceops, Ceph, Data-Persistence, SRE-swift-storage: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11507843 (elukey) Tried another test, this time on build2002 (bookworm, with a more up-to-date version of dockerd). I tried to push the calico typha's image (less...
[17:13:44] serviceops, Content-Transform-Team, OKR-Work: Migrate parsoidtest functionality to kubernetes - https://phabricator.wikimedia.org/T386246#11508232 (MLechvien-WMF) a:jijiki
[17:41:59] serviceops, Ceph, Data-Persistence, SRE-swift-storage: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11508332 (elukey) The sequence of events before the blob unknown seems to be the following on the docker registry: 1) "PUT /v2/test/calico/node/blobs/uploads/......
[17:48:46] serviceops, Ceph, Data-Persistence, SRE-swift-storage: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11508364 (elukey) Looks like it worked! ` elukey@build2002:~$ sudo docker push registry1004.eqiad.wmnet:5002/test/restricted/mediawiki-webserver:2025-03-04-10595...