[08:12:25] FIRING: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:17:12] FIRING: ThanosQueryInstantLatencyHigh: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[08:17:12] FIRING: ThanosQueryRangeLatencyHigh: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[08:22:12] RESOLVED: [2x] ThanosQueryInstantLatencyHigh: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[08:22:12] RESOLVED: ThanosQueryRangeLatencyHigh: Thanos Query has high latency for queries.
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[08:22:25] RESOLVED: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:23:25] FIRING: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:28:25] RESOLVED: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:30:12] FIRING: ThanosQueryInstantLatencyHigh: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[10:35:12] RESOLVED: [2x] ThanosQueryInstantLatencyHigh: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[13:48:25] FIRING: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:50:12] FIRING: ThanosQueryHttpRequestQueryRangeErrorRateHigh: Thanos Query is failing to handle requests.
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh
[13:50:43] ^^ looking
[13:53:25] RESOLVED: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:55:12] RESOLVED: ThanosQueryHttpRequestQueryRangeErrorRateHigh: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh
[15:44:57] Hey 0lly, if we want to add a prometheus job for blackbox autodiscovery on the dse-k8s clusters, would that be at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/prometheus/k8s.pp#46 ? Ref T414037. We aren't going to merge anything without y'all's approval, just wanted to make sure I'm looking in the right place
[15:44:58] T414037: Consider enabling Blackbox K8s autodiscovery on dse-k8s Prometheus instance - https://phabricator.wikimedia.org/T414037
[16:20:57] Hi Inflatador, I replied here about this before the break regarding TLS endpoint/certificate monitoring on the Kubernetes clusters. I think we already have something in place that should enable you to do what you're looking for, but I'm not sure whether it's sufficient for your goal.
[16:21:25] inflatador: you might want to take a look at hieradata/common/service.yaml (e.g. citoid) to see how the monitoring is set up.
[16:22:46] tappof thanks, sorry I missed your update
[16:33:53] tappof oh OK, it sounds like we just need to add config similar to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/service.yaml#216 for our service (`opensearch-ipoid`)?
[16:54:56] yes inflatador
[16:57:20] ACK, thanks again!
[17:01:02] you're welcome inflatador
[20:04:48] FIRING: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:14:48] RESOLVED: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
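[Editor's note] The exchange above points at service entries in hieradata/common/service.yaml (e.g. citoid) as the pattern for TLS endpoint/certificate monitoring. As an illustration only, a trimmed, hypothetical entry in that style might look like the sketch below; the service name comes from the chat, but every field value here is a placeholder, so the real schema and values must be checked against the existing catalog entries before sending a patch:

```yaml
# Hypothetical sketch, not a tested patch. Field names mirror what
# existing service catalog entries (such as citoid) use; verify them
# against hieradata/common/service.yaml in operations/puppet.
opensearch-ipoid:
  description: OpenSearch endpoint for iPoid (illustrative description)
  encryption: true        # TLS-terminated endpoint, so cert monitoring applies
  port: 8443              # placeholder port, not the real one
  probes:
    - type: http          # blackbox-style HTTP(S) probe, as for citoid
  sites:
    - eqiad
    - codfw
  state: production
```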