[00:28:11] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.003e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/opsvar-lag_datasource=eqiad+prometheus/opsvar-mirror_name=main-eqiad_to_main-codfw
[01:47:31] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[01:49:53] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[01:53:35] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[01:54:41] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[03:38:29] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 9376 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/opsvar-lag_datasource=eqiad+prometheus/opsvar-mirror_name=main-eqiad_to_main-codfw
[03:38:41] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1002.64 seconds
[04:03:01] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 245.07 seconds
[06:28:35] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:28:57] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.008 second response time
[06:28:59] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.eqiad.wmnet.crt]
[07:00:11] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:07:23] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[07:07:43] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.541 second response time
[10:59:03] PROBLEM - Check systemd state on ms-be1035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:08:43] RECOVERY - Check systemd state on ms-be1035 is OK: OK - running: The system is fully operational