[11:24:16] FYI, I've just deployed the new lossy Wikipedia logos. Not that I expect it to affect performance metrics in a significant way...
[12:01:59] based on a very rough estimate this could save users globally as much as 5GB/minute
[12:02:49] or 7TB per day. but that's difficult to get a reliable figure for, this is extrapolated from one minute of traffic on a single Varnish frontend
[12:05:11] the order of magnitude is probably correct, but not the actual figure
[12:29:01] quite nice
[12:31:13] Krinkle: re: slow kibana, I dug in a little a couple of days ago. There's a regression that will be fixed in 7.9.1 and the other glaring issue I could find is https://github.com/elastic/kibana/issues/76401
[12:54:16] godog: ack, I'll see if I can file some issues as well. So -next is close to latest stable?
[12:54:51] Krinkle: it is in fact the latest stable yeah, 7.9
[12:57:11] FWIW I'm using 'varnish webrequest 50x' dashboard as a testbed
[20:01:08] dpifke: would it make sense to have only one navtiming instance produce the last_handled metric at a time?
[20:01:28] e.g. the alert would query it without specifying the dc
[20:01:49] or maybe by max()-ing it
[20:02:04] and letting the inactive one just repeat the old timestamp
[20:02:23] given graphite is not active-active and that we intentionally switch off the secondary
[20:07:28] I'm not sure we can aggregate across Prometheus instances (DCs), and even if we could, I'm not sure that's a great idea from a reliability standpoint.
[20:08:09] they would both write to the same graphite instance, so only one needs to have done something, right?
[20:08:25] I'm thinking it's possible to add an "is_active" label, and then aggregate across that if we want true numbers, or select just {is_active="true"} for alerts.
[20:08:33] ah hm.. right we'd have to add these to the list of global aggregated metrics
[20:08:52] The trick is we need to send NaN in the correct place so that it doesn't extrapolate when the label changes.
[20:09:41] I have to look at what Icinga does when given a NaN. In this case, it should fail open, but I can imagine cases where the desired behavior would be the opposite.
[20:09:46] grafana does support querying multiple proms in one panel but aggregating functions over them is limited and confusing at best indeed.
[20:10:00] .. right we use prom directly, right
[20:10:04] The alert is coming from a PromQL query.
[20:11:19] It's not an actionable alert, so arguably it shouldn't require a manual silence, but DC switchovers are rare enough that it's not a huge amount of toil to do so.
[20:11:52] I don't want to add a bunch of complexity that makes it fragile.
[20:11:55] yeah, if we make them separate alerts with the dc in the title, that should suffice.
[20:12:18] is that the only alert that did/should fire?
[20:12:23] from navtiming
[20:12:51] I think so. Whatever we do for navtiming will get copied over to XHGui, ArcLamp, etc. But those alerts aren't live yet.
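
The 12:01–12:05 extrapolation in the log is just unit conversion from the rough 5GB/minute sample; a quick check of the stated figures (only the per-minute estimate comes from the log, the rest is arithmetic):

```python
# Back-of-the-envelope check of the bandwidth-savings figures quoted above.
# 5 GB/minute is the rough estimate from one minute of traffic on a single
# Varnish frontend; the daily figure follows by straight extrapolation.
gb_per_minute = 5
gb_per_day = gb_per_minute * 60 * 24   # 7200 GB/day
tb_per_day = gb_per_day / 1000         # ~7.2 TB/day, i.e. "or 7TB per day"
print(f"{gb_per_day} GB/day ~= {tb_per_day:.1f} TB/day")
```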
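For context on the 20:01–20:11 discussion, here is a minimal sketch of what the last_handled freshness alert could look like as PromQL instant queries run against the standard Prometheus HTTP API. The metric name navtiming_last_handled_seconds, the is_active label, the two-hour threshold, and the Prometheus URL are illustrative assumptions, not the actual production rule.

```python
import requests

# Option A: one alert per DC, evaluated independently by each datacenter's
# Prometheus (the "separate alerts with the dc in the title" approach from
# 20:11:55); no cross-DC aggregation is needed.
PER_DC_QUERY = 'time() - max(navtiming_last_handled_seconds) > 2 * 3600'

# Option B: only consider the active instance, using the proposed is_active
# label from 20:08:25, so the intentionally idle DC never fires.
ACTIVE_ONLY_QUERY = (
    'time() - max(navtiming_last_handled_seconds{is_active="true"}) > 2 * 3600'
)


def query_prometheus(base_url: str, promql: str) -> list:
    """Run an instant query via the standard /api/v1/query endpoint."""
    resp = requests.get(f"{base_url}/api/v1/query",
                        params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]


if __name__ == "__main__":
    # Placeholder endpoint; the real Prometheus URL is deployment-specific.
    for row in query_prometheus("http://prometheus.example.org:9090",
                                PER_DC_QUERY):
        print(row["metric"], row["value"])
```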
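And on the producer side, a sketch of the is_active-label-plus-NaN idea from 20:08, using the stock prometheus_client Python library. The gauge name, label values, and helper function are made up for illustration; the point is that publishing NaN under the non-matching label keeps the series present without letting stale values carry across an active/standby switchover.

```python
import math
import time

from prometheus_client import Gauge

# Hypothetical gauge; the real navtiming metric and label names may differ.
LAST_HANDLED = Gauge(
    "navtiming_last_handled_seconds",
    "Unix timestamp of the last event handled by this navtiming instance",
    ["is_active"],
)


def record_handled(active: bool) -> None:
    """Record a handled event, exporting the timestamp only under the label
    value matching this instance's current role.

    Sending NaN "in the correct place" (20:08:52) means the other label
    value is still exported, but as NaN, so dashboards and alerts cannot
    extrapolate an old timestamp when the active DC changes.
    """
    now = time.time()
    LAST_HANDLED.labels(is_active="true").set(now if active else math.nan)
    LAST_HANDLED.labels(is_active="false").set(math.nan if active else now)
```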