[08:08:40] phedenskog: I think the thing I miss the most about the new alerts emails is the lack of "back to normal" emails. in the past if there was a fluke overnight and the recovery emails followed minutes later, there was no need to investigate. now I need to look even at those, because there's no recovery email notification
[08:09:23] yes I also noticed, I'll ask godog.
[08:18:42] {{done}} -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/674803
[08:37:33] thanks!
[08:53:13] np, FWIW I understand the rationale for recovery emails, IME though it results in starting to ignore recovery emails for all/most cases to address flukes
[08:54:12] I guess what I'm saying is that flukes should be the exception, whereas recovery emails will be sent in all cases
[15:14:47] Ack, I think maybe perf is special/different in that regard. Our threshold usually has to be smaller than the amount of flukes we want to tolerate. There's still ways to address that (delays, smarter queries etc), and we do; but there's a point after which either you can no longer detect most regressions or find out days too late. So we've gotten used to some minimal noise, where it's not really a fluke (the change is real), but we don't care if it didn't last that long. Given user generated data.
[15:15:01] Daily variance is huge.
[15:15:30] Things will be a bit more stable as we move toward histogram bins
[15:15:53] ( godog )
[15:51:25] Krinkle: ack, thank you for the context! yeah definitely I see the value of being notified of recoveries in some cases
[15:52:56] I hope moving to AM will give us (SRE at least) plenty of opportunities to really notify only on real problems
[16:01:28] naive question, does AM have some better ideas about "multi-tenant" in general?
[16:14:45] cdanis: hitting the nail on the head there.. yeah, with their move to AM value has decreased for the most part for us. Some improvements too, but cost-benefit isn't looking as good I think now.
Certainly makes me want to reopen the conversation about having Grafana send emails directly, which we know basically does everything right without hours of manual work syncing stuff between alert and query dashes only to end up still with a subpar result.
[16:15:06] I'm all for reuse and collab but maybe this just isn't the same use case.
[16:15:46] We've always been the odd one out with SRE having to whack-a-mole our alerts since they're not direct service health indicators usually.
[16:16:09] (And the ones that are of course are fine in the new system.)
[16:17:21] Fundamentally what we do is: look for significant change and if it's persistent, investigate early and often with as few clicks as possible with as much visual context as possible
[16:18:52] Krinkle: yes I see. I think the license issue is hard: WPT comes bundled with other software and other licenses and before (I think that was removed) there was one file including all licenses. Also I'm not sure how you can change a license, seen other projects where contributors needed to agree on that license so that their contribution could be included.
[16:21:16] cdanis: I'm hopeful though that maybe AM can still work because it does seem to have tighter integration than Icinga had. I don't know if that's our creation or upstream. But right now it seems AM is given none of the useful data and only reads the manual key-value pairs set which means lots of manual work. If it could receive and make use of the implicit metadata Grafana provides that'd be huge (dash/panel title, metric name/value, dash url, drilldown url, etc)
[16:28:01] phedenskog: That's a good question. The MIT license and such don't require new/fork contributions to be under the same license so as long as you include the original license you can license new changes differently
[16:28:12] But.. I don't think Apache 2.0 works that way
[16:30:31] Ah looks like it does actually. Yeah, seems Apache is not sticky/copyleft.
[16:30:56] But yeah they probably do need to acknowledge the original license somewhere and the copyright statements from that
[17:00:13] Krinkle: I have to go shortly, but that's good feedback re: your experience with Grafana/AM, happy to discuss further here or phab or wherever
[17:01:46] cdanis: it should yeah, the main idea in my mind is to have a team associated with each alert
[17:02:10] it is still WIP but some of these ideas are outlined at https://wikitech.wikimedia.org/wiki/Alertmanager
[17:02:25] ok gotta go
[17:13:09] aye, I was typing on mobile, I meant simply "the [move to AM]" not "their [move to AM]". didn't mean to divide us.
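
The approach described at [15:14:47] and [16:17:21] (a tight threshold, a delay so that short-lived changes are ignored, and a recovery notice once the value is back to normal) could be sketched roughly as below. This is only an illustration, not how the team's Grafana or Alertmanager setup actually works: the names `Event` and `persistent_regressions`, and the parameters `baseline`, `threshold` and `min_consecutive`, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Event:
    index: int  # position of the sample that triggered the event
    kind: str   # "alert" or "recovery"


def persistent_regressions(
    samples: Iterable[float],
    baseline: float,
    threshold: float,
    min_consecutive: int,
) -> List[Event]:
    """Emit an alert only after `min_consecutive` samples in a row exceed
    baseline + threshold, and a recovery when the value drops back.

    All parameter names are illustrative assumptions, not real config keys.
    """
    events: List[Event] = []
    streak = 0       # consecutive samples above the threshold so far
    firing = False   # whether an alert is currently active
    for i, value in enumerate(samples):
        if value - baseline > threshold:
            streak += 1
            if not firing and streak >= min_consecutive:
                firing = True
                events.append(Event(i, "alert"))
        else:
            streak = 0
            if firing:
                firing = False
                events.append(Event(i, "recovery"))
    return events


# A single-sample spike at index 1 is ignored; the three-sample run
# at indexes 3..5 alerts, and the drop at index 6 reports recovery.
events = persistent_regressions(
    [100, 130, 100, 130, 130, 130, 100],
    baseline=100,
    threshold=20,
    min_consecutive=3,
)
```

Raising `min_consecutive` suppresses more flukes but delays detection, which is the trade-off raised at [15:14:47].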