[00:30:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:50:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:07:57] mutante: thank you!
[09:14:24] hi folks, do we have visibility of CPU usage per core in prometheus?
[09:15:58] oh.. we got it under node_cpu_seconds_total
[09:19:27] yes vgutierrez
[09:20:11] ok.. 2nd question, what would be the way of rendering CPU usage for specific CPU cores based on a system fact?
[09:20:30] in lvs instances I'm interested in softirq CPU usage on the CPUs assigned to handle the queues of the NIC
[09:21:07] that's probably exposed on some fact
[09:21:28] or I could use the liberica-fp prometheus exporter to expose the list of CPUs used
[09:21:33] libericad[4078506]: time=2025-06-18T08:34:39.337Z level=INFO msg="forwarding cores auto-detected" forwarding_cores="[0 2 4 6 8 10 12 14]"
[09:23:02] the idea is to render something like this without hardcoding the list of CPUs: https://usercontent.irccloud-cdn.com/file/mduUL6jv/image.png
[09:25:27] (avg by (cpu) (rate(node_cpu_seconds_total{mode="softirq", instance=~"lvs.*:.*"}[5m])) * 100) shows the average percentage of CPU time spent handling softirqs, per core, on LVS instances. If you want to display this measurement only for specific CPUs, you likely need to export the list of relevant CPUs via a dedicated metric from your exporter, and then perform a join between the two metrics.
[09:26:22] yes.. I need to pick specific CPUs
[09:26:38] what would be the way of exposing that list of CPUs so I can perform the join?
[09:26:55] as a cpu label?
[09:27:26] so something like liberica_fp_forwarding_cpu_enabled{cpu="0"} 1?
[09:27:36] yes
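A minimal PromQL sketch of the join discussed above, assuming the proposed liberica_fp_forwarding_cpu_enabled metric is exported with value 1 and with instance and cpu labels that match node_exporter's (the label names are an assumption); multiplying the two vectors keeps only the series for the flagged forwarding cores:

    (avg by (instance, cpu) (rate(node_cpu_seconds_total{mode="softirq", instance=~"lvs.*:.*"}[5m])) * 100)
    * on (instance, cpu)
      liberica_fp_forwarding_cpu_enabled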
[09:35:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:10:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:23:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:28:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:38:55] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:43:55] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:44:10] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:48:55] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:12:12] tappof: I've added you to the review of the nic queue cpu exporter (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1160764)
[13:13:59] ack vgutierrez, I'll take a look
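For illustration, a hypothetical sample of the exposition output that exporter might produce, given the forwarding cores auto-detected in the libericad log above, plus a PromQL sanity check that each host exports the expected number of cores (the metric name is as proposed in the discussion, not confirmed):

    # Hypothetical /metrics output, one series per forwarding core
    # (here cores 0 2 4 6 8 10 12 14, as auto-detected above):
    #   liberica_fp_forwarding_cpu_enabled{cpu="0"} 1
    #   liberica_fp_forwarding_cpu_enabled{cpu="2"} 1
    # Sanity check: number of forwarding cores exported per host
    count by (instance) (liberica_fp_forwarding_cpu_enabled)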
[13:14:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:29:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:27:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:17:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:18:20] Hey 0lly, is it possible to override the team for `KubernetesContainerReachingMemoryLimit` alerts on a namespace basis? DPE SRE has a few apps running on wikikube and we're not getting any notifications at the moment. See https://alerts.wikimedia.org/?q=container%3Dflink-main-container for an example
[17:34:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:49:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[19:31:15] Just a heads up, merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1155335 (cc herron for visibility)
[19:31:40] ack thanks for the heads up ryankemper
[19:35:34] Is there anything I need to run to get pyrra to see the new change, or will it happen automatically with puppet running on `titan*`? It doesn't look like https://wikitech.wikimedia.org/wiki/Pyrra#Onboarding_a_new_SLO_with_Pyrra calls out any specific manual steps, so hopefully it's just the puppet patch
[19:36:15] Ah yeah, I see them appearing in https://slo.wikimedia.org/?search=wdqs now
[19:41:55] yeah, automatically along with the next puppet run
[19:44:06] Yup, everything looks good. There's still some old `wdqs-availability` SLOs hanging around; I realized I neglected to absent them, so it makes sense they're still here
[19:44:38] herron: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1161025 here's a patchset to absent the old SLO, and then the followup patch to remove the absented resource definition once puppet has been run
[19:47:39] great, lgtm!
[20:04:46] This is all done, thanks for the help! I will need to do some fixup of https://wikitech.wikimedia.org/wiki/SLO/WDQS to update it with the post-graph-split reality. Once that's done (probably by tomorrow US-morning), I'll circle back in here to see about getting the status moved from draft to approved
[20:27:47] sounds good!
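On the `KubernetesContainerReachingMemoryLimit` question above: alerts of that kind are typically built from cAdvisor and kube-state-metrics series along these lines (a hedged sketch of the general shape only; the actual WMF rule, its threshold, and the per-namespace team routing being asked about are defined elsewhere and may differ):

    # Working-set memory as a fraction of the container's memory limit;
    # the 0.9 threshold is an assumption for illustration:
    (
      container_memory_working_set_bytes{container!=""}
      / on (namespace, pod, container) group_left ()
        kube_pod_container_resource_limits{resource="memory"}
    ) > 0.9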
[23:26:25] FIRING: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:31:25] RESOLVED: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:38:56] rather impressive CPU and network rx jump on titan100[12] a bit after 23:20, correlated with the probe failures. seems like a query that touched a _lot_ of data, but ultimately returned little to the client.
[23:40:20] having a hard time correlating with a specific trigger from the apache2 logs. in any case, seems to be back to normal :)
[23:41:19] swfrench-wmf: Thank you! I'll be on the lookout in case it happens again.
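For reference, the kind of node-level PromQL that surfaces a spike like the one described on titan100[12] (standard node_exporter metrics; the instance regex is illustrative):

    # Network receive throughput in bytes/s, per host and device:
    rate(node_network_receive_bytes_total{instance=~"titan100[12]:.*", device!="lo"}[5m])
    # Overall non-idle CPU percentage, per host:
    (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle", instance=~"titan100[12]:.*"}[5m]))) * 100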