[01:43:41] PROBLEM - MegaRAID on db1072 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[01:43:42] ACKNOWLEDGEMENT - MegaRAID on db1072 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T199636
[01:43:53] Operations, ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T199636 (ops-monitoring-bot)
[03:27:41] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 888.36 seconds
[03:41:41] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type=create_container https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:42:42] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:56:02] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 154.00 seconds
[04:43:31] PROBLEM - HHVM rendering on mw2218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:44:22] RECOVERY - HHVM rendering on mw2218 is OK: HTTP OK: HTTP/1.1 200 OK - 75395 bytes in 0.307 second response time
[05:40:39] Operations, ops-eqiad, DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T199636 (Marostegui) Can we get this disk replaced? Thanks!
[05:40:53] Operations, ops-eqiad, DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T199636 (Marostegui) p:Triage>Normal
[05:41:06] Operations, ops-eqiad, DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T199636 (Marostegui) a:Cmjohnson
[07:24:11] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 51.54, 35.78, 22.88
[07:27:01] PROBLEM - High CPU load on API appserver on mw1286 is CRITICAL: CRITICAL - load average: 58.32, 45.47, 28.74
[07:32:31] PROBLEM - High CPU load on API appserver on mw1286 is CRITICAL: CRITICAL - load average: 48.50, 44.55, 33.06
[07:32:52] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 6.33, 21.35, 23.01
[07:45:41] RECOVERY - High CPU load on API appserver on mw1286 is OK: OK - load average: 11.12, 21.98, 28.91
[11:28:41] PROBLEM - Host cp3033 is DOWN: PING CRITICAL - Packet loss = 100%
[11:29:01] RECOVERY - Host cp3033 is UP: PING OK - Packet loss = 0%, RTA = 83.65 ms
[11:47:32] PROBLEM - Host cp3033 is DOWN: PING CRITICAL - Packet loss = 100%
[11:53:31] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3033_v4, cp3033_v6
[11:53:41] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3033_v4, cp3033_v6
[11:53:41] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3033_v4, cp3033_v6
[11:53:41] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3033_v4, cp3033_v6
[11:53:42] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3033_v4, cp3033_v6
[11:53:51] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3033_v4, cp3033_v6
[11:53:51] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3033_v4, cp3033_v6
[11:53:52] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3033_v4, cp3033_v6
[11:54:02] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3033_v4, cp3033_v6
[11:54:11] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3033_v4, cp3033_v6
[11:54:11] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3033_v4, cp3033_v6
[11:54:12] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3033_v4, cp3033_v6
[11:54:12] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3033_v4, cp3033_v6
[11:54:12] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3033_v4, cp3033_v6
[11:54:21] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3033_v4, cp3033_v6
[11:54:31] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3033_v4, cp3033_v6
[12:03:41] PROBLEM - cpjobqueue endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[12:03:51] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/base (Get base CSS) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/page/metadata/{title}{/revision}{
[12:03:51] ended metadata for Video article on English Wikipedia) timed out before a response was received
[12:03:51] PROBLEM - apertium apy on scb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:03:52] PROBLEM - mathoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[12:04:01] PROBLEM - configured eth on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[12:04:11] PROBLEM - DPKG on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[12:04:11] PROBLEM - SSH on scb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:04:12] PROBLEM - pdfrender on scb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:04:12] PROBLEM - eventstreams on scb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:04:21] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received
[12:04:52] RECOVERY - mathoid endpoints health on scb2001 is OK: All endpoints are healthy
[12:05:01] RECOVERY - configured eth on scb2001 is OK: OK - interfaces up
[12:05:02] RECOVERY - SSH on scb2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[12:05:11] RECOVERY - DPKG on scb2001 is OK: All packages OK
[12:05:11] RECOVERY - pdfrender on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.075 second response time
[12:05:12] RECOVERY - eventstreams on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.101 second response time
[12:05:22] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[12:05:42] RECOVERY - cpjobqueue endpoints health on scb2001 is OK: All endpoints are healthy
[12:05:52] RECOVERY - apertium apy on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.074 second response time
[12:06:01] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[15:34:22] PROBLEM - puppet last run on scb2004 is CRITICAL: CRITICAL: Puppet has 37 failures. Last run 5 minutes ago with 37 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[cpjobqueue/deploy],Exec[chown /srv/deployment/cpjobqueue for deploy-service],Package[recommendation-api/deploy]
[15:59:51] RECOVERY - puppet last run on scb2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:47:44] (PS1) Framawiki: Create Reconstruction NS at frwikt [mediawiki-config] - https://gerrit.wikimedia.org/r/445929 (https://phabricator.wikimedia.org/T199631)
[21:53:32] (PS2) Framawiki: Create Reconstruction NS at frwikt [mediawiki-config] - https://gerrit.wikimedia.org/r/445929 (https://phabricator.wikimedia.org/T199631)
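
The 01:43 MegaRAID alert on db1072 (and the auto-filed T199636) comes from a RAID check that wraps LSI's MegaCli utility to inspect logical drive state. The sketch below shows one way such a degraded-LD detector could look; it is not the actual WMF RAID check or handler, and the MegaCli binary path and output-parsing regex are assumptions.

#!/usr/bin/env python3
"""Illustrative degraded-LD detector for MegaRAID controllers, in the spirit
of the 01:43 db1072 alert. NOT the WMF RAID check/handler: the MegaCli path
and the output parsing are assumptions."""
import re
import subprocess
import sys

MEGACLI = "/usr/sbin/megacli"  # assumption: location of the MegaCli binary


def logical_drive_states():
    # -LDInfo -Lall -aAll lists every logical drive on every adapter;
    # -NoLog keeps MegaCli from writing its own log file.
    out = subprocess.run([MEGACLI, "-LDInfo", "-Lall", "-aAll", "-NoLog"],
                         capture_output=True, text=True, check=True).stdout
    # MegaCli prints one "State : <value>" line per logical drive.
    return re.findall(r"^State\s*:\s*(\S+)", out, re.MULTILINE)


def main():
    states = logical_drive_states()
    bad = [s for s in states if s != "Optimal"]
    if bad:
        # Mirror the message shape seen in the log above.
        print("CRITICAL: %d failed LD(s) (%s)" % (len(bad), ", ".join(bad)))
        sys.exit(2)
    print("OK: %d logical drive(s) Optimal" % len(states))
    sys.exit(0)


if __name__ == "__main__":
    main()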
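
For the dbstore1002 slave-lag alerts at 03:27 and 03:56: a lag check of this kind compares the replica's reported lag against warning/critical thresholds. Below is a minimal sketch of such a probe, assuming pymysql and a .my.cnf with credentials; the thresholds are placeholders and this is not the actual WMF MariaDB check.

#!/usr/bin/env python3
"""Minimal replication-lag probe, illustrative only. Host, credentials file
and thresholds are assumptions, not the WMF production check."""
import sys
import pymysql

HOST = "dbstore1002.eqiad.wmnet"   # assumption: host named in the alert
WARN_S = 300.0                     # assumption: warning threshold (seconds)
CRIT_S = 600.0                     # assumption: critical threshold (seconds)


def slave_lag_seconds(host):
    conn = pymysql.connect(host=host, read_default_file="~/.my.cnf",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
    finally:
        conn.close()
    if not status or status["Seconds_Behind_Master"] is None:
        return None  # not a replica, or replication is broken
    return float(status["Seconds_Behind_Master"])


def main():
    lag = slave_lag_seconds(HOST)
    if lag is None:
        print("CRITICAL slave_sql_lag Replication broken or not configured")
        sys.exit(2)
    state, code = "OK", 0
    if lag >= CRIT_S:
        state, code = "CRITICAL", 2
    elif lag >= WARN_S:
        state, code = "WARNING", 1
    # Mirror the message shape seen in the 03:27 / 03:56 entries above.
    print("%s slave_sql_lag Replication lag: %.2f seconds" % (state, lag))
    sys.exit(code)


if __name__ == "__main__":
    main()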
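
The "CRITICAL - Socket timeout after 10 seconds" entries (mw2218 at 04:43, several scb2001 services around 12:04) and the Mobileapps/Restbase LVS endpoint alerts all boil down to HTTP probes with a 10-second budget. The following is a sketch of that kind of probe, not the real Icinga plugins or service-checker; the example URL, host and port are placeholder assumptions.

#!/usr/bin/env python3
"""Sketch of an HTTP health probe with a 10-second timeout, illustrating the
kind of check behind the alerts above. URLs/ports are placeholder assumptions."""
import sys
import requests

TIMEOUT_S = 10  # matches the 10-second socket timeout quoted in the alerts


def probe(url):
    try:
        resp = requests.get(url, timeout=TIMEOUT_S)
    except requests.exceptions.Timeout:
        return 2, "CRITICAL - Socket timeout after %d seconds" % TIMEOUT_S
    except requests.exceptions.RequestException as exc:
        return 2, "CRITICAL - %s" % exc
    if resp.status_code != 200:
        return 2, "CRITICAL - HTTP %d" % resp.status_code
    # Mirror the RECOVERY message shape seen in the log above.
    return 0, ("HTTP OK: HTTP/1.1 200 OK - %d bytes in %.3f second response time"
               % (len(resp.content), resp.elapsed.total_seconds()))


if __name__ == "__main__":
    # Placeholder target: the service's /_info endpoint; host and port are
    # assumptions, not the actual production check configuration.
    code, message = probe("http://mobileapps.svc.codfw.wmnet:8888/_info")
    print(message)
    sys.exit(code)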