[03:20:40] <jinxer-wm>	 FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:30:40] <jinxer-wm>	 RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:44:31] <godog>	 FWIW refactoring ops.pp into something more understandable is T355125
[07:44:32] <stashbot>	 T355125: Refactor Prometheus/Puppet integration to support scrape configuration snippets - https://phabricator.wikimedia.org/T355125
[07:55:39] <XioNoX>	 nice!
[17:28:21] <swfrench-wmf>	 ah, nice! FWIW, the "mess" I meant was more about the data flow from service catalog to points where 'sites' is used, rather than the specific code (ops.pp) being particularly messy :)
[17:36:13] <swfrench-wmf>	 hmmm ... since the probes now succeed, I guess we're now also reporting certificate expiry, which means `CertAlmostExpired` is firing for data-gateway-staging
[17:36:58] <swfrench-wmf>	 IIRC, we rotate on a shorter schedule in staging in order to catch issues that break automated rotation
[17:41:57] <swfrench-wmf>	 urandom: FYI in case someone asks about this one ^
[18:35:16] <swfrench-wmf>	 yes, that's the case [0] and isn't something we can vary per-service on the issuance side. so, it seems like excluding the service from the alert (negative match on `instance` label) is the only obvious way around it.
[18:35:16] <swfrench-wmf>	 [0] https://gerrit.wikimedia.org/g/operations/puppet/+/0b09ed770c11a36cd6e10d7929bfa5836455a402/hieradata/role/common/pki/multirootca.yaml#18
[18:41:13] <urandom>	 swfrench-wmf: do you mean this is "normal"?  i.e. the expiry is short enough to run afoul of the threshold for alerting?
[18:45:15] <swfrench-wmf>	 effectively, yeah - everything is "working as implemented" but the thresholds in CertAlmostExpired aren't really appropriate for the short expiration / rotation cycle in staging
[18:47:11] <urandom>	 that's...unfortunate
[18:47:32] <swfrench-wmf>	 (the thresholds are more aligned with the default production "discovery" policy - i.e., 4w lifetime with rotation ~ 11 days out)
[18:47:39] <urandom>	 yea
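
The mismatch discussed above (an expiry alert tuned for long-lived certificates firing on a short rotation cycle) can be sketched as a small check. This is a rough illustration, not the actual `CertAlmostExpired` rule: the function name, the 7-day alert threshold, and the staging lifetime and rotation values are all hypothetical. Only the production "discovery" figures (4w lifetime, rotation ~11 days out) come from the conversation. It also assumes an idealized rotation that happens exactly when the remaining validity reaches the rotation lead time.

```python
from datetime import timedelta


def alert_fires_in_normal_operation(
    lifetime: timedelta,
    renew_before: timedelta,
    alert_threshold: timedelta,
) -> bool:
    """Return True if an expiry alert would fire even when rotation works.

    Idealized model (an assumption): the certificate is replaced exactly when
    its remaining validity drops to `renew_before`, so the remaining validity
    never goes below `renew_before`. An alert of the form
    "remaining < alert_threshold" then fires during healthy operation
    whenever the threshold is larger than the rotation lead time.
    """
    if renew_before >= lifetime:
        raise ValueError("renew_before must be shorter than the lifetime")
    return alert_threshold > renew_before


if __name__ == "__main__":
    # Hypothetical alert threshold; the real rule's value is not in the log.
    threshold = timedelta(days=7)

    # Production "discovery" policy as described in the conversation.
    prod = alert_fires_in_normal_operation(
        lifetime=timedelta(weeks=4),
        renew_before=timedelta(days=11),
        alert_threshold=threshold,
    )

    # Hypothetical short staging cycle, for illustration only.
    staging = alert_fires_in_normal_operation(
        lifetime=timedelta(days=7),
        renew_before=timedelta(days=3),
        alert_threshold=threshold,
    )

    print("production fires during normal rotation:", prod)
    print("staging fires during normal rotation:", staging)
```

Under these assumed numbers the production policy stays quiet while the short staging cycle trips the alert on every rotation, which is why excluding the staging service from the alert (or giving it a separate threshold) comes up as the workaround.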