[14:55:11] Hello 0lly, is there a way to turn off AlertLinting alerts? They're the top source of alerts for us: https://logstash.wikimedia.org/goto/caa96d343625283623ec846a5726aff3 . If there is a better way to deal with this LMK
[15:02:29] inflatador: hmm, looks like there may be a broader issue here. is there a task about it yet?
[15:04:29] herron T412447 has the alerts we recently wrote that seem to be triggering the linting alerts
[15:04:29] T412447: OpenSearch on K8s: Monitor and rotate TLS certificates - https://phabricator.wikimedia.org/T412447
[15:06:10] it seems like team-data-platform/probes.yaml is just missing a `# deploy-site: eqiad, codfw` tag?
[15:07:16] taavi ACK, looks like that is documented here as well https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems
[15:08:29] It might also be a query that doesn't usually have results
[15:08:44] * herron stares into the backlog of active alert AlertLintProblems
[15:08:55] maybe there's something we can do to make these less fragile as well
[15:09:24] Maybe it would be better to move this into CI? We actually know these alerts work
[15:10:15] But with the ProbeSlow/ProbeDown stuff it's probably that the metrics don't have results. Anyway, I'll get a patch up to fix as described in the docs
[15:23:59] inflatador: note that case does not apply to your case, as the metrics will always be there, even if the condition does not match. that case you linked to is for cases when you have, for example, an foobaroid_errors metric which only appears when the first error occurs
[15:24:39] taavi the alertlint error messages say `"probe_duration_seconds" metric with "address" label but there are no series matching {address=~"10\\.2\\.[12]\\.91"} in the last 1w`
[15:27:12] inflatador: yes, and if you look at the 'site' label you can see those are all on edge sites, not in the core sites (eqiad/codfw) where that metric is present. that's why I suggested filtering the alert files to the core datacenters
[15:29:06] taavi ACK, can do
[15:31:31] OK, https://gerrit.wikimedia.org/r/c/operations/alerts/+/1236766 is up for review if y'all have time to look
[15:31:52] you don't need the 'pint disable promql/series' comments with the site filter
[15:33:48] Maybe not, but I also don't want to troubleshoot linting errors for known-good alerts
[15:34:24] Let me fix my comment to make that clear though
[20:17:35] FIRING: DiskSpace: Disk space kafka-logging1002:9100:/srv 3.938% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=kafka-logging1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace