[05:44:16] topranks, tappof, following up from https://gerrit.wikimedia.org/r/c/operations/alerts/+/1282099 looks like we're still getting the AlertLintProblem alerts - `has "gnmi_interfaces_interface_state_counters_out_queue_red_drop_pkts" metric with "instance" label but there are no series matching {instance=~"cr.*"} in the last 1w` (and a couple similar)
[05:47:32] oh, I see that you sent https://gerrit.wikimedia.org/r/c/operations/alerts/+/1283993 nevermind
[06:09:55] XioNoX: I'll take a look this morning
[08:55:16] tappof: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1284561 I'm deploying this, it should be good now as `alert2002:~$ grep asw2-ulsfo /etc/icinga/objects/puppet_hosts.cfg` doesn't return anything anymore
[08:55:50] ack XioNoX
[08:56:02] but of course ping me if any issue
[09:04:48] FIRING: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:14:48] RESOLVED: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:40:25] FIRING: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:42:15] XioNoX: still getting an error: "'asw2-ulsfo' is not a valid parent for host 'mr1-ulsfo' (file '/etc/nagios/nagios_host.cfg', line 978)!"
[09:44:23] tappof: weird, that same patch removes asw2-ulsfo as a parent from mr1-ulsfo - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1284561/3/hieradata/common.yaml#b2608 could it be a race condition?
[09:48:12] maybe we can try to remove it manually and see if puppet tries to re-add it?
[09:48:26] I'm looking into it.
[10:29:41] FIRING: [2x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[10:34:41] RESOLVED: [2x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[10:35:25] RESOLVED: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:40:25] FIRING: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:10:25] RESOLVED: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:15:25] FIRING: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:25:25] RESOLVED: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:08:17] XioNoX: Sorry, I forgot to update you... it's fixed now after manually cleaning up the Icinga config file.
[12:09:11] ok! thx
[12:18:25] FIRING: SystemdUnitFailed: prometheus-ipip-exporter.service on prometheus7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:23:25] RESOLVED: SystemdUnitFailed: prometheus-ipip-exporter.service on prometheus7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:28:41] FIRING: PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[12:33:41] FIRING: [2x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[12:34:48] FIRING: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:38:41] RESOLVED: [2x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[12:44:48] RESOLVED: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:40:35] FIRING: DiskSpace: Disk space centrallog1002:9100:/srv 3.969% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=centrallog1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace