[03:20:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:30:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:44:31] FWIW refactoring ops.pp into something more understandable is T355125
[07:44:32] T355125: Refactor Prometheus/Puppet integration to support scrape configuration snippets - https://phabricator.wikimedia.org/T355125
[07:55:39] nice!
[17:28:21] ah, nice! FWIW, the "mess" I meant was more about the data flow from service catalog to points where 'sites' is used, rather than the specific code (ops.pp) being particularly messy :)
[17:36:13] hmmm ... since the probes now succeed, I guess we're now also reporting certificate expiry, which means `CertAlmostExpired` is firing for data-gateway-staging
[17:36:58] IIRC, we rotate on a shorter schedule in staging in order to catch issues that break automated rotation
[17:41:57] urandom: FYI in case someone asks about this one ^
[18:35:16] yes, that's the case [0] and isn't something we can vary per-service on the issuance side. so, it seems like excluding the service from the alert (negative match on `instance` label) is the only obvious way around it.
[18:35:16] [0] https://gerrit.wikimedia.org/g/operations/puppet/+/0b09ed770c11a36cd6e10d7929bfa5836455a402/hieradata/role/common/pki/multirootca.yaml#18
[18:41:13] swfrench-wmf: do you mean this is "normal"? i.e. the expiry is short enough to run afoul of the threshold for alerting?
[18:45:15] effectively, yeah - everything is "working as implemented" but the thresholds in CertAlmostExpired aren't really appropriate for the short expiration / rotation cycle in staging
[18:47:11] that's...unfortunate
[18:47:32] (more aligned with the default production "discovery" policy - i.e., 4w lifetime with rotation ~ 11 days out)
[18:47:39] yea
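
Editor's sketch (not part of the channel log): the interaction between a fixed expiry threshold and a rotation schedule discussed above can be worked through numerically. Only the production "discovery" figures (4w lifetime, rotation ~11 days out) come from the log. The alert threshold and the staging lifetime/rotation values below are hypothetical placeholders, since the log does not state them, and `RotationPolicy` and `days_alerting_per_cycle` are illustrative names, not anything from the Puppet or alerting repositories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RotationPolicy:
    lifetime_days: float      # validity period of a freshly issued certificate
    renew_before_days: float  # rotation happens this many days before expiry


def days_alerting_per_cycle(policy: RotationPolicy, threshold_days: float) -> float:
    """Days per rotation cycle in which remaining lifetime is below the threshold.

    With healthy rotation, remaining lifetime runs from lifetime_days (fresh
    certificate) down to renew_before_days (just before the next rotation).
    """
    upper = min(policy.lifetime_days, threshold_days)
    return max(0.0, upper - policy.renew_before_days)


# From the log: default production "discovery" policy, 4w lifetime, rotation ~11 days out.
production = RotationPolicy(lifetime_days=28, renew_before_days=11)

# HYPOTHETICAL staging values; the log only says staging rotates on a shorter schedule.
staging = RotationPolicy(lifetime_days=7, renew_before_days=3)

# HYPOTHETICAL threshold; the real CertAlmostExpired threshold is not given in the log.
THRESHOLD_DAYS = 7

for name, policy in (("production", production), ("staging", staging)):
    cycle = policy.lifetime_days - policy.renew_before_days
    firing = days_alerting_per_cycle(policy, THRESHOLD_DAYS)
    print(f"{name}: alerting {firing:g} of every {cycle:g} days")
```

Under these assumed numbers the production policy never crosses the threshold while rotation is healthy, whereas the short-lived staging certificate sits below it for its whole cycle, which is the situation that motivates excluding the staging instance from the alert.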