[07:28:03] serviceops, Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (akosiaris)
[07:37:27] good morning
[07:37:49] so, deneb has been critical in icinga for the past 3 days now without an ack
[07:38:00] the issue is a failure in docker-reporter-releng-images.service
[07:38:28] jayme: is this somehow on your/anyone's radar?
[07:43:09] it fails from time to time ( https://phabricator.wikimedia.org/T251918 )
[07:48:59] hashar: and what happens when it fails?
[07:49:11] as in, what are the consequences of this service being broken for the past 3 days?
[07:49:52] my understanding is that it is some kind of script that inspects the release engineering docker images
[07:49:57] which is run via systemd somehow
[07:49:57] serviceops, Operations: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (ema) The service failed 3 days ago due to another image this time: ` root@deneb:~# journalctl -u docker-reporter-releng-images.service | grep FAIL Jul 06 16:54:38 deneb docker-report-releng...
[07:50:17] and whenever the script fails, the systemd unit fails, causing the Icinga check to emit a notification
[07:50:46] last time it was on docker-registry.wikimedia.org/releng/bazel:0.4.0 , but the task does not have any other information.
[07:51:07] I don't know what that docker report script is doing nor do I have access to any of the logs / detailed errors.
[07:53:37] alright, thanks hashar! I've ack'ed the alert for now given that it's a known issue
[07:55:11] that script injects image information to debmonitor
[07:55:20] serviceops, Operations: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (hashar) That `releng/ci-common` image is a scratch image containing scripts shared by our base images ci-jessie, ci-stretch, ci-buster. It does not have any Debian OS layer, thus if the repo...
[07:55:44] and that releng/ci-common does not have any packages installed. It has FROM scratch and then COPY bunch of files
[07:56:29] I commented on the task about it, it is probably not a good pattern
[07:58:16] ema: Thanks for bringing this up again. I was not aware of it failing again
[08:00:30] jayme: np! http://icinga.wikimedia.org/alerts is where all the bad news are at :)
[08:04:05] serviceops, Operations: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (JMeybohm) >>! In T251918#6296048, @hashar wrote: > That `releng/ci-common` image is a scratch image containing scripts shared by our base images ci-jessie, ci-stretch, ci-buster. It does not...
[08:07:01] serviceops, Operations: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (MoritzMuehlenhoff) >>! In T251918#6296059, @JMeybohm wrote: > Maybe we can just skip images that are not debian based? Sounds good, we could simply test for the presence of /etc/debian_vers...
[08:08:50] serviceops, Operations: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (JMeybohm) a:JMeybohm >>! In T251918#6296062, @MoritzMuehlenhoff wrote: > > Sounds good, we could simply test for the presence of /etc/debian_version which is owned by the base-files pack...
[08:23:22] jayme: moritzm: sorry for the mess with the releng/ci-common scratch container
[08:23:54] I might dish it out later though. I guess it is an attempt to share common scripts/files between containers having different base images
[08:23:54] all fine, it's an actual bug in the reporter script after all
[08:24:17] it might not be ideal though. For another use case I went to use symbolic links at the root of each image pointing to a common directory
[08:24:29] and COPY those symlinks in the images (which get resolved and copy the file)
[08:25:04] it sounded like using a scratch layer might be easier, notably to express the dependency when using docker-pkg
[08:47:29] serviceops, Operations, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (elukey) @Krinkle @aaron if you have time, let's follow up on the question that I asked about what happens if a Redis shard disappears. It would be really nice...
[09:59:17] serviceops, LDAP-Access-Requests, Operations, observability, Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (jcrespo) I will get notified when this can move forward and https://gerrit.wikimedia.org/r/c/operati...
[10:26:55] serviceops, Operations, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (elukey) Another idea to add in here - recently John and Moritz needed TLS for memcached and imported memcached 1.6.6 (latest upstream) into out buster reposito...
[15:33:54] serviceops, Prod-Kubernetes: nrpe dying on kubernetes100[1,3,4] - https://phabricator.wikimedia.org/T257679 (JMeybohm)
[15:59:10] serviceops, Prod-Kubernetes: nrpe dieing on kubernetes100[1,3,4] - https://phabricator.wikimedia.org/T257679 (MoritzMuehlenhoff)
[17:40:07] serviceops, Operations, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (Krinkle) CentralAuth and ChronologyProtector are both still high-profile consumers of main stash. Both are scheduled for migration, but currently only with rel...
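Note on the 08:04-08:08 exchange in T251918 (skip images that are not Debian-based by testing for /etc/debian_version): a minimal sketch of what such a check might look like, not the actual reporter script. It assumes each image's root filesystem has already been unpacked to a local directory; the function names and the example image names are hypothetical and do not come from the log.

```python
import tempfile
from pathlib import Path
from typing import List, Mapping


def is_debian_based(rootfs: Path) -> bool:
    """Return True if the unpacked image root contains etc/debian_version."""
    return (rootfs / "etc" / "debian_version").is_file()


def images_to_report(rootfs_by_image: Mapping[str, Path]) -> List[str]:
    """Return the sorted names of images that look Debian-based.

    Images without etc/debian_version (for example FROM scratch images
    that only COPY files) are skipped instead of causing a failure.
    """
    return sorted(
        name for name, root in rootfs_by_image.items() if is_debian_based(root)
    )


if __name__ == "__main__":
    # Hypothetical demo: build two fake root filesystems in a temp directory.
    with tempfile.TemporaryDirectory() as tmp:
        debian_root = Path(tmp) / "debian"
        (debian_root / "etc").mkdir(parents=True)
        (debian_root / "etc" / "debian_version").write_text("10.4\n")

        scratch_root = Path(tmp) / "scratch"
        scratch_root.mkdir()

        print(images_to_report({
            "example/debian-based": debian_root,
            "example/from-scratch": scratch_root,
        }))
```

With a check along these lines, a scratch image such as releng/ci-common would be left out of the report rather than making the systemd unit fail.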