[06:47:08] serviceops, DBA, Phabricator, Release-Engineering-Team-TODO, and 3 others: Improve privilege separation for phabricator's config files and mysql credentials - https://phabricator.wikimedia.org/T146055 (Marostegui) Open→Resolved a:jcrespo
[08:44:55] serviceops, Operations, Traffic, conftool, Patch-For-Review: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (Joe) etcd-mirror is ok, it's just that writes need to be serialized and etcd2 is orders of magnit...
[08:49:43] serviceops, Operations, Patch-For-Review: decom releases1001 and releases2001 - https://phabricator.wikimedia.org/T260742 (hashar) @Dzahn from the Gerrit change, I -1 ed it cause we had lost https://releases.wikimedia.org/charts/ , turns out that has been moved somewhere else. So we can decom releas...
[08:58:19] serviceops: Draft a plan for upgrading kubernetes machines to buster - https://phabricator.wikimedia.org/T245272 (akosiaris) >>! In T245272#5885288, @Jdforrester-WMF wrote: >> Buster comes with docker `18.09.1+dfsg1-7.1+deb10u1`. We probably want to run extensive tests before widely using it. We 've been hol...
[08:59:12] serviceops: Draft a plan for upgrading kubernetes machines to buster - https://phabricator.wikimedia.org/T245272 (akosiaris) >>! In T245272#6405308, @MoritzMuehlenhoff wrote: > Now that stretch-backports is end-of-lifed and Stretch in LTS, there's an additional, officially supported 4.19 kernel in stretch (b...
[09:20:03] docker-reporter-releng-images.service seems to be broken on deneb FYI
[09:28:28] Thanks. I'll take a look (probably touched last ;-))
[09:36:39] hm..temporary 504 from debmonitor it seems
[11:00:22] jayme: confirming what you already knew but the combination of envoyproxy.io/scrape and prometheus.io/scrape works jsut dine
[11:00:25] *fine
[11:01:42] hnowlan: nice!
[12:41:17] jayme: issues with debmonitor? can I help?
[12:42:08] volans: no really "issues"...more "issue" in terms that it lost one request ;)
[12:42:59] *not
[12:44:20] did not really dig into it yet as the next request was okay again without human interaction.
[12:45:08] jayme: UI or debmonitor-client?
[12:45:30] volans: cli
[12:45:50] Aug 31 16:30:13 deneb docker-report-releng[6189]: ERROR:debmonitor:Failed to execute DebMonitor CLI: Failed to send the update to the DebMonitor server: 504
[12:47:27] ack, I'll have a look
[12:49:23] something, something post to docker-registry.wikimedia.org/releng/quibble-stretch-php70:0.0.44-s5
[12:50:19] yes and not only that one
[12:50:31] we have some recurring issue with 504, I'll dig a bit more
[12:50:35] debmonitor.discovery.wmnet.access.log.1:44
[12:50:36] debmonitor.discovery.wmnet.access.log.5.gz:43
[12:50:38] all the others 0
[12:51:20] so I guess is a race with something else, my first bet is the GC unless mori.tz deployed debmonitor the other day to install it on the new hosts
[13:04:34] volans: don't you log django tracebacks somewhere?
[13:05:26] that's a gateway timeout from nginx
[13:05:33] might not be in the logs at all
[13:06:12] the prod debmonitor instances are still unchanged at this point
[13:06:21] jayme: all the 504 are at 6:25... does it ring a bell? :D
[13:11:14] volans: It's more or less shortly after the switchover, but therer are successfull requests right before and after.
[13:12:59] switchover of what?
[13:14:09] service switchover...but I now see that you meant the bunch of other 504s
[13:14:29] yeah I meant that seems related to some cron.daily either local or remote
[13:14:46] "mine" was ~18:30
[13:14:59] wut?
[13:15:23] can't find it in the logs
[13:15:29] my TZ, mind fart...
[13:15:36] still
[13:15:40] did a zgrep ' 504 ' debmonitor.discovery.wmnet.access.log*
[13:15:47] 31/Aug/2020:16:30:13
[13:16:27] right
[13:16:42] somehow skipped my eye
[13:17:07] it's the only one though :D
[13:17:21] that's what I was saying :P
[13:18:04] it was just *one* in a series (during scan of releng docker images).
[13:20:07] for $reasons I don't know right now it was super slow
[13:20:08] [2020-08-31T16:30:16] [pid: 2115|app: 0|req: 14298/36649] 127.0.0.1 () {42 vars in 746 bytes} [Mon Aug 31 16:29:43 2020] POST /images/docker-registry.wikimedia.org/releng/quibble-stretch-php70:0.0.44-s5/update => generated 0 bytes in 33345 msecs (HTTP/1.1 201) 6 headers in 397 bytes (1 switches on core 0)
[13:20:14] 33s
[13:20:33] before and after requests are in the order of 300~600ms
[13:20:55] was it a completely new image?
[13:21:06] that has a large list of packages that were not in debmonitor at all before?
[13:21:53] also, switchdc coming up, my digging will have to wait
[13:22:44] volans: quite possible that it was completely new. I don't see it in the logs prior to that
[13:23:07] but I don't know anything about the image
[13:23:29] ack, I'll check a bit the DB after, my current theory is that it might have taken more time than usual to create all the objects in the DB
[13:23:40] at the same time I want to have a look at the 504 at cron.daily time
[13:24:04] sounds reasonable
[13:24:09] both :)
[14:36:23] serviceops, MW-on-K8s, Operations, TechCom-RFC, Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (AMooney)
[15:52:53] serviceops, Operations, Performance-Team, Datacenter-Switchover: Unexplained increase in save times, possibly associated with DC switchover - https://phabricator.wikimedia.org/T261763 (RLazarus)
[16:16:32] serviceops, DBA, SRE-tools, conftool, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (jcrespo)
[17:46:13] serviceops, DBA, SRE-tools, conftool, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (Volans) The context of the outdated info was confd stuck on one of the puppetmaster, so when...
[17:52:00] serviceops, Operations, Patch-For-Review: decom releases1001 and releases2001 - https://phabricator.wikimedia.org/T260742 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `releases1001.eqiad.wmnet` - releases1001.eqiad.wmnet (**PASS**) - Downtimed host on...
[18:15:40] serviceops, Operations, Platform Team Workboards (Clinic Duty Team), Release Pipeline (Blubber), Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (thcipriani)
[21:57:29] serviceops, Operations, Patch-For-Review: decom releases1001 and releases2001 - https://phabricator.wikimedia.org/T260742 (Dzahn) Stalled→Resolved
[21:57:33] serviceops, Continuous-Integration-Infrastructure, Operations, Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (Dzahn)
[22:27:21] serviceops, Operations, ops-codfw: decommission mw2135-mw2147, mw2187-mw2214 - physical / datacenter part - https://phabricator.wikimedia.org/T261524 (Dzahn) a:Dzahn→None
[22:51:21] serviceops, Operations, ops-codfw: decommission mw2135-mw2147, mw2187-mw2214 - physical / datacenter part - https://phabricator.wikimedia.org/T261524 (colewhite) p:Triage→Medium