[00:19:52] I added some WIP code with an idea how to keep it organized. With a focus on SRE subteams rather than hosts; in this new world where there are no more puppet nodes.
[00:20:15] roughly like this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1159612
[00:20:54] that would be splitting up existing misc site checks for my subteam.. and a structure for other SRE subteams to put their checks
[09:03:17] godog: fwiw I noticed that a lot of pontoon prometheus boxes in cloud vps are trying (but failing) to talk to various wikiprod management IPs, is that known/expected?
[09:24:00] taavi: yes expected because pontoon runs the same configs as production
[09:24:24] mutante: thank you for putting a patch together! would you mind filing a companion task too? we'll discuss tomorrow at the team meeting
[10:00:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:04:41] FIRING: [2x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[10:05:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:41:37] evaluation failures is me playing with thanos sidecar limits, should be resolved ~soon
[10:51:35] FIRING: [2x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
[10:54:41] heh that's me as well, now the alert is working as expected
[10:55:30] I'm backing out of the "series limit" for now, will revisit
[10:56:35] FIRING: [4x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
[11:06:35] FIRING: [4x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
[11:09:41] RESOLVED: [2x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[11:11:35] RESOLVED: [4x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
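As a rough illustration of the per-subteam structure discussed at 00:19: a minimal sketch, in plain Prometheus alerting-rule terms, of one rules file per SRE subteam with a team label carried on every rule. The team name, alert name, and blackbox job/instance values are hypothetical, and the linked operations/puppet change may structure the checks differently.

```yaml
# Hypothetical sketch only: one alerting-rules file per SRE subteam, with a
# team label so routing happens per subteam rather than per host. None of the
# names below are taken from the Gerrit change.
groups:
  - name: example_subteam_site_checks
    rules:
      - alert: SiteHTTPSProbeFailed
        # probe_success is the standard blackbox-exporter result metric;
        # the job and instance values are placeholders.
        expr: probe_success{job="blackbox_https", instance="some-misc-service.example.org"} == 0
        for: 5m
        labels:
          team: example-subteam
          severity: critical
        annotations:
          summary: "HTTPS probe failing for {{ $labels.instance }}"
```

The assumption here is that Alertmanager routing keys off the team label, which is what lets checks be grouped by subteam once there are no puppet nodes to group them by.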
[12:41:09] godog: is it possible to alert on a "wildcard" metric name? like "mem-.*-utilization" ? (cc topranks)
[12:42:38] XioNoX: yes, internally prometheus treats the metric name as the __name__ label
[12:42:54] so you can select/match on that as with other labels
[12:43:45] that's good to know, thanks!
[12:44:07] sure np, XioNoX what's the context btw ?
[12:44:33] https://phabricator.wikimedia.org/T395998 and for example : https://phabricator.wikimedia.org/P77019
[12:44:49] to not have to write 100 similar alerting rules for the different types of metrics
[12:45:46] the `-utilization` metric is a percentage calculated by the device from size and allocation
[12:46:16] indeed, depending on what the metric looks like on prometheus it might make sense to rename the metric and put some of the name in labels
[12:46:37] ok
[12:46:53] at least there is a way :)
[12:47:29] mandalorian-this-is-the-way.gif
[14:09:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:14:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:45:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:55:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:30:49] godog: alright, will do!
[18:25:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[18:30:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[23:31:17] godog: for your meeting https://phabricator.wikimedia.org/T397264
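To make the 12:42 point concrete: a minimal sketch of a single alerting rule that matches every metric whose name fits a pattern, by treating __name__ as a regular label, so one rule can replace many near-identical ones (cf. T395998 / P77019). The metric pattern, threshold, and labels are hypothetical, and it assumes the device metrics land in Prometheus with underscores (e.g. mem_buffer_utilization), since classic Prometheus metric names cannot contain hyphens.

```yaml
# Sketch only: one rule covering every metric named mem_*_utilization.
groups:
  - name: example_wildcard_utilization
    rules:
      - alert: MemoryPoolUtilizationHigh
        # {__name__=~"..."} selects on the metric name like any other label;
        # label_replace copies the matched part of the name into a regular
        # "pool" label so it can be used in annotations and routing.
        expr: |
          label_replace(
            {__name__=~"mem_.*_utilization"},
            "pool", "$1", "__name__", "mem_(.*)_utilization"
          ) > 90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.pool }} memory utilization above 90% on {{ $labels.instance }}"
```

The label_replace step is optional; it is one way to get the "put some of the name in labels" effect mentioned at 12:46 without renaming the metrics at ingestion time.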