[00:50:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:50:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:50:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:21:40] <jinxer-wm>	 FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:36:40] <jinxer-wm>	 FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:51:40] <jinxer-wm>	 FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:06:40] <jinxer-wm>	 RESOLVED: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:07:10] <jinxer-wm>	 FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:11:55] <jinxer-wm>	 RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:50:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:20:41] <jinxer-wm>	 FIRING: PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus5002:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqsin%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[13:25:41] <jinxer-wm>	 RESOLVED: PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus5002:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqsin%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[13:30:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:30:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:45:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:02:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:07:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:41:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:36:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:41:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:26:25] <cwhite>	 🤔
[22:27:18] <denisse>	 I've seen that alert before, let me fix it. 
[22:27:34] <denisse>	 I have a hypothesis as to why it happens, I'll send a patch after fixing it. 
[22:36:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:40:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:42:52] <denisse>	 This seems like a different error; debugging it shows a problem in the icinga config:
[22:42:52] <denisse>	 > Error: Could not find any hostgroup matching 'lswtest-d8-eqiad' (config file '/etc/icinga/objects/puppet_hosts.cfg', starting on line 29767)
[22:57:48] <denisse>	 I think that T411098 is related.
[22:57:49] <stashbot>	 T411098: Move sretest1006 to rack D8 and connect to lswtest-d8-eqiad - https://phabricator.wikimedia.org/T411098
[23:03:11] <cwhite>	 the start of that problem coincides with some changes in netbox around the same time
[23:04:20] <cwhite>	 (it's also a problem that the icinga alert "Check correctness of the icinga configuration" didn't fire in here)
[23:08:32] <denisse>	 I think that we need to revert this change: https://netbox.wikimedia.org/extras/changelog/252684/
[23:11:28] <cwhite>	 got a patch that might fix it.  running pcc
[23:11:52] <denisse>	 Great!! 
[23:14:29] <cwhite>	 heh, pcc isn't going to tell us much because it's naggen-driven
[23:14:53] <cwhite>	 have a look? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211828
[23:15:42] <denisse>	 +1, thank you!!
[23:20:38] <cwhite>	 hmm, now we have 'lswtest-d8-eqiad' is not a valid parent for host 'sretest1006'
[23:21:28] <denisse>	 I'm wondering if something needs to be changed on the rows.
[23:23:37] <cwhite>	 I think some more information is needed by icinga in puppet.
[23:24:13] <cwhite>	 topranks: Have any tips on what is needed to monitor a new switch?
[23:25:18] <cwhite>	 I'm guessing we'll need a new entry in l3_switches_mgmt and infra_devices.
[23:25:26] <topranks>	 cwhite: sry this is because of me
[23:25:36] <topranks>	 Yeah I need to figure out what I did that caused it
[23:25:50] <topranks>	 we don’t need to monitor that, it’s only a test device
[23:26:46] <cwhite>	 topranks: I think it's caused by sretest1006 reporting that switch to icinga as its parent.
[23:26:46] <topranks>	 Ah wait a second
[23:27:01] <topranks>	 yeah ok no probs
[23:28:08] <topranks>	 cwhite: do you know how this parent relationship is determined?
[23:28:24] <topranks>	 it makes sense we moved that host to that rack and connected it earlier 
[23:31:10] * cwhite looking
[23:33:30] <denisse>	 From the Icinga config:  https://www.irccloud.com/pastebin/VhV5yeOq/
[23:34:00] * denisse investigating what generated it.
[23:35:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:36:34] <topranks>	 denisse: thanks yeah, I thought I'd doctored things so this wouldn't happen but whatever is building that outsmarted me 
[23:37:12] <denisse>	 I wonder if this is the cause: https://netbox.wikimedia.org/dcim/devices/2245/interfaces/
[23:39:47] <topranks>	 yes but how exactly, I can probably sort it out if I understand the data path 
[23:40:03] <topranks>	 I've made some changes which might help but I'm sort of in the dark about how it works that out 
[23:40:14] <topranks>	 it could be using lldp also perhaps 
[23:40:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:40:26] <cwhite>	 parent comes from the lldp puppet fact
[23:42:21] <cwhite>	 topranks: ^^
[23:42:30] <topranks>	 god I hate computers 
[23:45:44] <cwhite>	 topranks: what are your thoughts for next step?  
[23:46:36] <topranks>	 I guess all I can do is add the test switch to monitoring and then add a downtime for it 
[23:46:59] <topranks>	 I was looking, there is no way to spoof or otherwise control LLDP
[23:49:22] <topranks>	 sry I see you guys already added the hostgroup
[23:49:29] <cwhite>	 I had a look and it doesn't appear we have a knob that instructs icinga to completely ignore the host
[23:49:30] <topranks>	 Icinga is still not happy ?
[23:49:54] <cwhite>	 topranks: still not happy, no
[23:49:59] <topranks>	 perhaps it needs to be live in Netbox too 
[23:50:15] <topranks>	 when I added the real switches in those rows I added this
[23:50:15] <topranks>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056031
[23:50:31] <topranks>	 I don't think the entries in monitoring.pp make a difference here though 
[23:51:24] <cwhite>	 the entry in monitoring.pp did create a hostgroup, but I think the problem is there isn't an actual host to mark as the parent
[23:52:36] <topranks>	 yeah, I think I see it, we need to have it  in "common.yaml" as well 
[23:52:59] <topranks>	 I also made it 'active' not planned in netbox which will sync its location to hiera 
[23:56:36] <topranks>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211848
[23:57:48] <cwhite>	 with that patch merged, /me watches the next puppet run
[23:57:59] <topranks>	 ok cool, fingers crossed that's it 
[23:58:02] <topranks>	 sorry for the hassle 
[23:58:17] <denisse>	 No worries, thanks for the patch!!
[23:58:17] <topranks>	 this happened before and I tried to beat the system, I didn't realise LLDP was the ultimate source 
[23:58:40] <topranks>	 so there is no way to trick it about what it's connected to...
[23:58:45] <topranks>	 you guys run a tight ship!!
[23:59:40] <cwhite>	 no worries!  it's good to get a refresher on where all the various datapoints make their way into icinga :)