[00:50:25] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:50:40] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:50:40] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:21:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:36:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:51:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:06:40] RESOLVED: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:07:10] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:11:55] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:50:40] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:20:41] FIRING: PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus5002:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqsin%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[13:25:41] RESOLVED: PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus5002:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqsin%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[13:30:25] FIRING: [2x] SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:30:40] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:45:25] RESOLVED: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:02:25] FIRING: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:07:25] RESOLVED: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:41:25] FIRING: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:36:25] RESOLVED: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:41:55] FIRING: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:26:25] 🤔
[22:27:18] I've seen that alert before, let me fix it.
[22:27:34] I have a hypothesis as to why it happens, I'll send a patch after fixing it.
[22:36:55] RESOLVED: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:40:25] FIRING: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:42:52] This seems like a different error, debugging it shows a problem in the icinga config:
[22:42:52] > Error: Could not find any hostgroup matching 'lswtest-d8-eqiad' (config file '/etc/icinga/objects/puppet_hosts.cfg', starting on line 29767)
[22:57:48] I think that T411098 is related.
[22:57:49] T411098: Move sretest1006 to rack D8 and connect to lswtest-d8-eqiad - https://phabricator.wikimedia.org/T411098
[23:03:11] the start of that problem coincides with some changes in netbox around the same time
[23:04:20] (it's also a problem that the icinga alert "Check correctness of the icinga configuration" didn't fire in here)
[23:08:32] I think that we need to revert this change: https://netbox.wikimedia.org/extras/changelog/252684/
[23:11:28] got a patch that might fix it. running pcc
[23:11:52] Great!!
[23:14:29] heh, pcc isn't going to tell us much because it's naggen-driven
[23:14:53] have a look? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211828
[23:15:42] +1, thank you!!
[23:20:38] hmm, now we have 'lswtest-d8-eqiad' is not a valid parent for host 'sretest1006'
[23:21:28] I'm wondering if something needs to be changed on the rows.
[23:23:37] I think some more information is needed by icinga in puppet.
[23:24:13] topranks: Have any tips on what is needed to monitor a new switch?
[23:25:18] I'm guessing we'll need a new entry in l3_switches_mgmt and infra_devices.
[23:25:26] cwhite: sry this is because of me
[23:25:36] Yeah I need to figure out what I did that caused it
[23:25:50] we don’t need to monitor that, it’s only a test device
[23:26:46] topranks: I think it's caused by sretest1006 reporting that switch to icinga as its parent.
[23:26:46] Ah wait a second
[23:27:01] yeah ok no probs
[23:28:08] cwhite: do you know how this parent relationship is determined?
[23:28:24] it makes sense, we moved that host to that rack and connected it earlier
[23:31:10] * cwhite looking
[23:33:30] From the Icinga config: https://www.irccloud.com/pastebin/VhV5yeOq/
[23:34:00] * denisse investigating what generated it.
[23:35:25] RESOLVED: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:36:34] denisse: thanks yeah, I thought I'd doctored things so this wouldn't happen but whatever is building that outsmarted me
[23:37:12] I wonder if this is the cause: https://netbox.wikimedia.org/dcim/devices/2245/interfaces/
[23:39:47] yes but how exactly, I can probably sort it out if I understand the data path
[23:40:03] I've made some changes which might help but I'm sort of in the dark about how it works that out
[23:40:14] it could be using lldp also perhaps
[23:40:25] FIRING: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:40:26] parent comes from the lldp puppet fact
[23:42:21] topranks: ^^
[23:42:30] god I hate computers
[23:45:44] topranks: what are your thoughts for next step?
[23:46:36] I guess all I can do is add the test switch to monitoring and then add a downtime for it
[23:46:59] I was looking, there is no way to spoof or otherwise control LLDP
[23:49:22] sry I see you guys already added the hostgroup
[23:49:29] I had a look and it doesn't appear we have a knob that instructs icinga to completely ignore the host
[23:49:30] Icinga is still not happy?
[23:49:54] topranks: still not happy, no
[23:49:59] perhaps it needs to be live in Netbox too
[23:50:15] when I added the real switches in those rows I added this
[23:50:15] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056031
[23:50:31] I don't think the entries in monitoring.pp make a difference here though
[23:51:24] the entry in monitoring.pp did create a hostgroup, but I think the problem is there isn't an actual host to mark as the parent
[23:52:36] yeah, I think I see it, we need to have it in "common.yaml" as well
[23:52:59] I also made it 'active' not planned in netbox which will sync its location to hiera
[23:56:36] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211848
[23:57:48] with that patch merged, /me watches the next puppet run
[23:57:59] ok cool, fingers crossed that's it
[23:58:02] sorry for the hassle
[23:58:17] No worries, thanks for the patch!!
[23:58:17] this happened before and I tried to beat the system, I didn't realise LLDP was the ultimate source
[23:58:40] so there is no way to trick it about what it's connected to...
[23:58:45] you guys run a tight ship!!
[23:59:40] no worries! it's good to get a refresher on where all the various datapoints make their way into icinga :)