[00:19:52] I added some WIP code with an idea how to keep it organized. With a focus on SRE subteams rather than hosts; in this new world where there are no more puppet nodes.
[00:20:15] roughly like this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1159612
[00:20:54] that would be splitting up existing misc site checks for my subteam.. and a structure for other SRE subteams to put their checks
[09:03:17] godog: fwiw I noticed that a lot of pontoon prometheus boxes in cloud vps are trying (but failing) to talk to various wikiprod management IPs, is that known/expected?
[09:24:00] taavi: yes expected because pontoon runs the same configs as production
[09:24:24] mutante: thank you for putting a patch together! would you mind filing a companion task too? we'll discuss tomorrow at the team meeting
[10:00:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:04:41] FIRING: [2x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[10:05:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:41:37] evaluation failures is me playing with thanos sidecar limits, should be resolved ~soon
[10:51:35] FIRING: [2x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
[10:54:41] heh that's me as well, now the alert is working as expected
[10:55:30] I'm backing out of the "series limit" for now, will revisit
[10:56:35] FIRING: [4x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
[11:06:35] FIRING: [4x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
[11:09:41] RESOLVED: [2x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[11:11:35] RESOLVED: [4x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
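As a rough illustration of the per-subteam structure discussed at 00:19: a minimal sketch, in plain Prometheus alerting-rule terms, of one rules file per SRE subteam with a team label carried on every rule. The team name, alert name, and blackbox job/instance values are hypothetical, and the linked operations/puppet change may structure the checks differently.

```yaml
# Hypothetical sketch only: one alerting-rules file per SRE subteam, with a
# team label so routing happens per subteam rather than per host. None of the
# names below are taken from the Gerrit change.
groups:
  - name: example_subteam_site_checks
    rules:
      - alert: SiteHTTPSProbeFailed
        # probe_success is the standard blackbox-exporter result metric;
        # the job and instance values are placeholders.
        expr: probe_success{job="blackbox_https", instance="some-misc-service.example.org"} == 0
        for: 5m
        labels:
          team: example-subteam
          severity: critical
        annotations:
          summary: "HTTPS probe failing for {{ $labels.instance }}"
```

The assumption here is that Alertmanager routing keys off the team label, which is what lets checks be grouped by subteam once there are no puppet nodes to group them by.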
[12:41:09] godog: is it possible to alert on a "wildcard" metric name? like "mem-.*-utilization" ? (cc topranks)
[12:42:38] XioNoX: yes, internally prometheus treats the metric name as the __name__ label
[12:42:54] so you can select/match on that as with other labels
[12:43:45] that's good to know, thanks!
[12:44:07] sure np, XioNoX what's the context btw ?
[12:44:33] https://phabricator.wikimedia.org/T395998 and for example : https://phabricator.wikimedia.org/P77019
[12:44:49] to not have to write 100 similar alerting rules for the different types of metrics
[12:45:46] the `-utilization` metric is a percentage calculated by the device from size and allocation
[12:46:16] indeed, depending on what the metric looks like on prometheus it might make sense to rename the metric and put some of the name in labels
[12:46:37] ok
[12:46:53] at least there is a way :)
[12:47:29] mandalorian-this-is-the-way.gif
[14:09:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:14:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:45:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:55:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:30:49] godog: alright, will do!
[18:25:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[18:30:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[23:31:17] godog: for your meeting https://phabricator.wikimedia.org/T397264
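To make the 12:42 point concrete: a minimal sketch of a single alerting rule that matches every metric whose name fits a pattern, by treating __name__ as a regular label, so one rule can replace many near-identical ones (cf. T395998 / P77019). The metric pattern, threshold, and labels are hypothetical, and it assumes the device metrics land in Prometheus with underscores (e.g. mem_buffer_utilization), since classic Prometheus metric names cannot contain hyphens.

```yaml
# Sketch only: one rule covering every metric named mem_*_utilization.
groups:
  - name: example_wildcard_utilization
    rules:
      - alert: MemoryPoolUtilizationHigh
        # {__name__=~"..."} selects on the metric name like any other label;
        # label_replace copies the matched part of the name into a regular
        # "pool" label so it can be used in annotations and routing.
        expr: |
          label_replace(
            {__name__=~"mem_.*_utilization"},
            "pool", "$1", "__name__", "mem_(.*)_utilization"
          ) > 90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.pool }} memory utilization above 90% on {{ $labels.instance }}"
```

The label_replace step is optional; it is one way to get the "put some of the name in labels" effect mentioned at 12:46 without renaming the metrics at ingestion time.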