[00:16:21] operations, HTTPS: https://wikipedia.com and similar throw certificate warning - https://phabricator.wikimedia.org/T42998#1922322 (Parent5446) [00:18:11] operations, Security: Wikipedia.com warns about bad certificate - https://phabricator.wikimedia.org/T123147#1922300 (Parent5446) Making public since the main bug this is a duplicate of is already public. [00:18:12] operations, Security: Wikipedia.com warns about bad certificate - https://phabricator.wikimedia.org/T123147#1922326 (Parent5446) [00:18:14] operations, Security: Wikipedia.com warns about bad certificate - https://phabricator.wikimedia.org/T123147#1922331 (Parent5446) [00:43:06] (PS1) Papaul: partman:Changed lv_name from root to srv [puppet] - https://gerrit.wikimedia.org/r/263158 [01:22:02] operations: add slien to jimmy alias - https://phabricator.wikimedia.org/T122927#1922403 (eliza) To clarify (as I realize I might not have been clear) I'll inquire w/Caitlin about converting jimmy@ to a Google Group after the Wikimedia 15 project, but in the meantime - adding slien@wikimedia.org to jimmy@'s... [02:02:26] RECOVERY - Last backup of the tools filesystem on labstore1001 is OK: OK - Last run for unit replicate-tools was successful [02:26:43] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 11m 19s) [02:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:33:40] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Jan 9 02:33:40 UTC 2016 (duration 6m 57s) [02:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:31:08] operations, Security: Wikipedia.com warns about bad certificate - https://phabricator.wikimedia.org/T123147#1922508 (Bawolff) Is there any actual reason to keep this secret? 
[04:00:15] RECOVERY - Last backup of the maps filesystem on labstore1001 is OK: OK - Last run for unit replicate-maps was successful [04:22:35] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: puppet fail [04:49:44] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [05:08:14] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:08:15] PROBLEM - puppet last run on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:08:35] PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:08:45] PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:08:55] PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:09:26] PROBLEM - RAID on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:09:35] PROBLEM - configured eth on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:09:54] PROBLEM - nutcracker port on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:12:15] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:13:26] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:14:34] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:16:35] RECOVERY - Disk space on mw1006 is OK: DISK OK [05:18:34] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:22:45] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:24:45] RECOVERY - Disk space on mw1006 is OK: DISK OK [05:25:54] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient [05:32:24] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:33:25] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:34:15] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient [05:34:34] RECOVERY - RAID on mw1006 is OK: OK: no RAID installed [05:34:34] RECOVERY - configured eth on mw1006 is OK: OK - interfaces up [05:34:54] RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212 [05:35:14] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:35:15] RECOVERY - Disk space on mw1006 is OK: DISK OK [05:35:15] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [05:35:35] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:35:44] RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [05:35:55] RECOVERY - DPKG on mw1006 is OK: All packages OK [05:44:05] PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:44:54] PROBLEM - RAID on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:44:56] PROBLEM - configured eth on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:45:45] PROBLEM - puppet last run on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:46:25] PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:49:54] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:50:24] RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [05:50:25] RECOVERY - DPKG on mw1006 is OK: All packages OK [05:51:14] RECOVERY - configured eth on mw1006 is OK: OK - interfaces up [05:51:54] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:58:36] PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:54] PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:59:26] PROBLEM - configured eth on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:02:54] PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:03:04] RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [06:03:05] RECOVERY - DPKG on mw1006 is OK: All packages OK [06:03:44] RECOVERY - configured eth on mw1006 is OK: OK - interfaces up [06:03:44] RECOVERY - RAID on mw1006 is OK: OK: no RAID installed [06:04:26] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [06:04:45] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:21:25] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: Puppet has 22 failures [06:30:55] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: puppet fail [06:31:25] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:55] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:04] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:25] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:25] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:35] PROBLEM - puppet last 
run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:04] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:15] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:35] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:56:45] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:05] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:05] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:57:15] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:45] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:58:15] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:45] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:55] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:17:15] PROBLEM - RAID on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:17:24] PROBLEM - configured eth on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:19:16] RECOVERY - configured eth on mw1007 is OK: OK - interfaces up [07:19:54] PROBLEM - nutcracker port on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:20:45] PROBLEM - puppet last run on mw1010 is CRITICAL: CRITICAL: Puppet has 1 failures [07:21:15] RECOVERY - RAID on mw1007 is OK: OK: no RAID installed [07:21:45] RECOVERY - nutcracker port on mw1007 is OK: TCP OK - 0.000 second response time on port 11212 [07:36:05] PROBLEM - RAID on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:45:45] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:55] PROBLEM - DPKG on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:48:05] PROBLEM - SSH on mw1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:54:04] PROBLEM - nutcracker process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:56:24] RECOVERY - SSH on mw1007 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [07:56:55] PROBLEM - configured eth on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:58:14] RECOVERY - nutcracker process on mw1007 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [07:59:05] RECOVERY - configured eth on mw1007 is OK: OK - interfaces up [08:00:05] PROBLEM - dhclient process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:02:45] PROBLEM - SSH on mw1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:03:45] PROBLEM - nutcracker port on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:35] PROBLEM - nutcracker process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:45] RECOVERY - SSH on mw1007 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [08:04:54] PROBLEM - Disk space on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:25] PROBLEM - salt-minion processes on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:05:25] PROBLEM - configured eth on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:10:54] RECOVERY - nutcracker process on mw1007 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:11:05] PROBLEM - SSH on mw1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:17:04] PROBLEM - nutcracker process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:19:05] RECOVERY - nutcracker process on mw1007 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:19:24] RECOVERY - SSH on mw1007 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [08:25:15] PROBLEM - nutcracker process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:25:35] PROBLEM - SSH on mw1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:35] RECOVERY - nutcracker port on mw1007 is OK: TCP OK - 0.000 second response time on port 11212 [08:30:24] RECOVERY - salt-minion processes on mw1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:31:35] RECOVERY - dhclient process on mw1007 is OK: PROCS OK: 0 processes with command name dhclient [08:35:04] PROBLEM - nutcracker port on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:37:54] PROBLEM - dhclient process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:37:55] RECOVERY - nutcracker process on mw1007 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:39:45] PROBLEM - puppet last run on mw1016 is CRITICAL: CRITICAL: puppet fail [08:39:54] RECOVERY - dhclient process on mw1007 is OK: PROCS OK: 0 processes with command name dhclient [08:42:24] RECOVERY - Disk space on mw1007 is OK: DISK OK [08:44:05] PROBLEM - nutcracker process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:46:05] PROBLEM - dhclient process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:49:05] PROBLEM - salt-minion processes on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:50:24] RECOVERY - nutcracker process on mw1007 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:51:05] RECOVERY - salt-minion processes on mw1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:52:14] RECOVERY - dhclient process on mw1007 is OK: PROCS OK: 0 processes with command name dhclient [08:53:34] RECOVERY - nutcracker port on mw1007 is OK: TCP OK - 0.000 second response time on port 11212 [08:56:34] PROBLEM - nutcracker process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:56:54] PROBLEM - Disk space on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:57:24] PROBLEM - salt-minion processes on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:59:25] RECOVERY - salt-minion processes on mw1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:59:54] PROBLEM - nutcracker port on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:15] PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: Puppet has 1 failures [09:00:44] PROBLEM - dhclient process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:01:44] PROBLEM - SSH on mw1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:02:15] PROBLEM - salt-minion processes on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:02:24] PROBLEM - nutcracker process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:02:45] PROBLEM - DPKG on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:02:55] PROBLEM - configured eth on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:02:55] PROBLEM - nutcracker port on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:03:05] PROBLEM - RAID on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:05:54] PROBLEM - salt-minion processes on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:08:25] RECOVERY - salt-minion processes on mw1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:08:46] RECOVERY - DPKG on mw1016 is OK: All packages OK [09:08:55] RECOVERY - configured eth on mw1016 is OK: OK - interfaces up [09:09:04] RECOVERY - nutcracker port on mw1016 is OK: TCP OK - 0.000 second response time on port 11212 [09:09:14] RECOVERY - RAID on mw1016 is OK: OK: no RAID installed [09:09:25] RECOVERY - Disk space on mw1007 is OK: DISK OK [09:09:34] RECOVERY - SSH on mw1007 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [09:09:54] RECOVERY - SSH on mw1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [09:10:04] RECOVERY - salt-minion processes on mw1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:10:25] RECOVERY - nutcracker process on mw1016 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [09:13:04] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:15:45] PROBLEM - Disk space on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:15:45] PROBLEM - SSH on mw1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:14] PROBLEM - salt-minion processes on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:25:14] RECOVERY - puppet last run on mw2118 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [09:56:54] PROBLEM - salt-minion processes on cygnus is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [10:05:24] RECOVERY - salt-minion processes on cygnus is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:14:35] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [10:14:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [10:15:55] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [10:18:09] seems transient but significant [10:20:04] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:20:45] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:20:55] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:40:05] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 895 [10:45:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 1620502 Threads: 2 Questions: 42377173 Slow queries: 17291 Opens: 60008 Flush tables: 2 Open tables: 416 Queries per second avg: 26.150 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [11:25:22] operations, Wikimedia-Mailing-lists: Need listadmin password reset for Translators-l mailing list - https://phabricator.wikimedia.org/T123163#1922639 (Az1568) NEW [11:25:34] PROBLEM - NTP on mw1007 is CRITICAL: NTP CRITICAL: No response from NTP server [11:26:29] operations, Wikimedia-Mailing-lists: Need listadmin password reset for Translators-l mailing list - 
https://phabricator.wikimedia.org/T123163#1922646 (Az1568) [11:40:05] PROBLEM - puppet last run on mw1005 is CRITICAL: CRITICAL: Puppet has 1 failures [12:06:55] RECOVERY - salt-minion processes on mw1007 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:06:55] RECOVERY - configured eth on mw1007 is OK: OK - interfaces up [12:07:06] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:07:14] RECOVERY - NTP on mw1007 is OK: NTP OK: Offset 0.008060455322 secs [12:07:14] RECOVERY - nutcracker port on mw1007 is OK: TCP OK - 0.000 second response time on port 11212 [12:08:04] RECOVERY - dhclient process on mw1007 is OK: PROCS OK: 0 processes with command name dhclient [12:08:04] RECOVERY - nutcracker process on mw1007 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:08:04] RECOVERY - DPKG on mw1007 is OK: All packages OK [12:08:35] RECOVERY - SSH on mw1007 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [12:08:35] RECOVERY - Disk space on mw1007 is OK: DISK OK [12:08:54] RECOVERY - RAID on mw1007 is OK: OK: no RAID installed [12:11:25] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:15:34] PROBLEM - RAID on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:35] PROBLEM - puppet last run on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:17:34] RECOVERY - RAID on mw1005 is OK: OK: no RAID installed [12:17:34] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 11 minutes ago with 0 failures [12:36:25] PROBLEM - puppet last run on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:36:25] PROBLEM - RAID on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:45:04] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:45] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:48:15] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:54] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:54] PROBLEM - salt-minion processes on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:54] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:54] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:26] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:50:25] PROBLEM - puppet last run on es2008 is CRITICAL: CRITICAL: puppet fail [12:58:06] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [13:03:04] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 58.33% of data above the critical threshold [5000000.0] [13:04:35] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:05:34] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212 [13:05:35] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:05:35] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [13:05:35] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up [13:07:04] RECOVERY - DPKG on mw1005 is OK: All packages OK [13:07:34] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 59.09% of data above the critical threshold [5000000.0] [13:08:24] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient [13:09:35] 
RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [13:11:15] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [13:11:45] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:11:46] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:11:46] PROBLEM - salt-minion processes on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:11:46] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:12:15] RECOVERY - Disk space on mw1005 is OK: DISK OK [13:13:15] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:13:55] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:14:46] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [13:15:14] RECOVERY - DPKG on mw1005 is OK: All packages OK [13:15:25] RECOVERY - puppet last run on es2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:15:46] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212 [13:15:46] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [13:15:46] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up [13:15:46] RECOVERY - RAID on mw1005 is OK: OK: no RAID installed [13:19:54] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:47:36] (PS1) Lokal Profil: Reducing the number of parameters passed to functions [dumps/dcat] - https://gerrit.wikimedia.org/r/263169 [13:56:44] (PS2) Lokal Profil: Reduce the number of parameters passed to functions [dumps/dcat] - 
https://gerrit.wikimedia.org/r/263169 [14:02:34] PROBLEM - puppet last run on mw2061 is CRITICAL: CRITICAL: puppet fail [14:29:06] PROBLEM - puppet last run on es2009 is CRITICAL: CRITICAL: puppet fail [14:31:35] RECOVERY - puppet last run on mw2061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:33:35] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 1 failures [14:56:06] RECOVERY - puppet last run on es2009 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [14:58:24] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:47:16] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: Puppet has 30 failures [15:59:15] PROBLEM - DPKG on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:59:55] PROBLEM - dhclient process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:59:55] PROBLEM - RAID on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:00:14] PROBLEM - Disk space on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:00:15] PROBLEM - nutcracker process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:00:35] PROBLEM - configured eth on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:00:54] PROBLEM - nutcracker port on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:00:54] PROBLEM - SSH on mw1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:00:55] PROBLEM - salt-minion processes on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:15] PROBLEM - puppet last run on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:25] PROBLEM - SSH on mw1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:05:25] PROBLEM - RAID on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:05:55] PROBLEM - nutcracker process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:55] PROBLEM - dhclient process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:06:24] PROBLEM - DPKG on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:06:24] PROBLEM - salt-minion processes on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:06:35] PROBLEM - nutcracker port on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:06:44] PROBLEM - configured eth on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:08:25] RECOVERY - salt-minion processes on mw1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:14:44] PROBLEM - salt-minion processes on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:15:35] PROBLEM - Disk space on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:18:54] PROBLEM - puppet last run on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:22:55] RECOVERY - salt-minion processes on mw1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:28:15] RECOVERY - SSH on mw1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [16:29:15] PROBLEM - salt-minion processes on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:29:24] PROBLEM - dhclient process on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:30:44] RECOVERY - DPKG on mw1002 is OK: All packages OK [16:31:04] RECOVERY - dhclient process on mw1013 is OK: PROCS OK: 0 processes with command name dhclient [16:31:04] RECOVERY - nutcracker process on mw1013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:31:14] RECOVERY - dhclient process on mw1002 is OK: PROCS OK: 0 processes with command name dhclient [16:31:15] RECOVERY - RAID on mw1002 is OK: OK: no RAID installed [16:31:24] RECOVERY - salt-minion processes on mw1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:31:25] RECOVERY - DPKG on mw1013 is OK: All packages OK [16:31:25] RECOVERY - dhclient process on mw1003 is OK: PROCS OK: 0 processes with command name dhclient [16:31:26] RECOVERY - Disk space on mw1002 is OK: DISK OK [16:31:35] RECOVERY - nutcracker process on mw1002 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:31:35] RECOVERY - nutcracker port on mw1013 is OK: TCP OK - 0.000 second response time on port 11212 [16:31:44] RECOVERY - configured eth on mw1013 is OK: OK - interfaces up [16:31:45] RECOVERY - configured eth on mw1002 is OK: OK - interfaces up [16:32:05] RECOVERY - nutcracker port on mw1002 is OK: TCP OK - 0.000 second response time on port 11212 [16:32:05] RECOVERY - SSH on mw1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [16:32:06] RECOVERY - salt-minion processes on mw1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:32:14] RECOVERY - Disk space on mw1013 is OK: DISK OK [16:32:34] RECOVERY - RAID on mw1013 is OK: OK: no RAID installed [16:35:25] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:36:25] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:47:54] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is 
currently enabled, last run 1 minute ago with 0 failures [17:24:04] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: puppet fail [17:50:45] PROBLEM - RAID on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:15] PROBLEM - DPKG on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:35] PROBLEM - nutcracker port on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:54] PROBLEM - configured eth on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:55] PROBLEM - Disk space on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:53:15] PROBLEM - dhclient process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:53:34] PROBLEM - SSH on mw1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:54:35] PROBLEM - nutcracker process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:56:05] PROBLEM - salt-minion processes on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:00:35] RECOVERY - DPKG on mw1011 is OK: All packages OK [18:00:54] RECOVERY - nutcracker process on mw1011 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [18:00:55] RECOVERY - nutcracker port on mw1011 is OK: TCP OK - 0.000 second response time on port 11212 [18:01:15] RECOVERY - RAID on mw1011 is OK: OK: no RAID installed [18:01:15] RECOVERY - configured eth on mw1011 is OK: OK - interfaces up [18:01:15] RECOVERY - Disk space on mw1011 is OK: DISK OK [18:01:36] RECOVERY - dhclient process on mw1011 is OK: PROCS OK: 0 processes with command name dhclient [18:01:55] RECOVERY - SSH on mw1011 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [18:02:24] RECOVERY - salt-minion processes on mw1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:03:45] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:29:24] PROBLEM - puppet last run on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:29:35] PROBLEM - nutcracker process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:30:34] PROBLEM - DPKG on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:30:34] PROBLEM - nutcracker port on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:30:45] PROBLEM - RAID on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:30:45] PROBLEM - configured eth on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:32:35] RECOVERY - nutcracker port on mw1008 is OK: TCP OK - 0.000 second response time on port 11212 [18:33:54] PROBLEM - Disk space on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:36:15] PROBLEM - SSH on mw1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:36:24] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 58.33% of data above the critical threshold [5000000.0]
[18:36:55] PROBLEM - dhclient process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:37:55] RECOVERY - Disk space on mw1008 is OK: DISK OK
[18:38:35] PROBLEM - salt-minion processes on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:38:55] RECOVERY - dhclient process on mw1008 is OK: PROCS OK: 0 processes with command name dhclient
[18:39:06] RECOVERY - configured eth on mw1008 is OK: OK - interfaces up
[18:40:05] RECOVERY - nutcracker process on mw1008 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[18:40:15] RECOVERY - SSH on mw1008 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[18:40:34] RECOVERY - salt-minion processes on mw1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:40:54] RECOVERY - DPKG on mw1008 is OK: All packages OK
[18:41:06] RECOVERY - RAID on mw1008 is OK: OK: no RAID installed
[18:44:35] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[18:56:15] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[19:19:36] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: puppet fail
[19:44:35] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[19:45:24] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 90 failures
[19:54:55] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: puppet fail
[19:58:25] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: puppet fail
[19:58:41] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:05:53] YuviPanda: ^ tools-checker-01 seems to be hanging?
[20:07:29] (might just be my internet borking)
[20:08:02] I tried bastion and it was ok
[20:08:12] let's check the checker
[20:08:19] yeah, it's not nfs itself that's having issues
[20:08:31] debug1: Local version string SSH-2.0-OpenSSH_6.9p1 Ubuntu-2~trusty1
[20:08:31] ssh_exchange_identification: Connection closed by remote host
[20:08:42] ^ tools-checker-01
[20:09:16] faidon@bastion-restricted-01:~$ telnet tools-checker-01.eqiad.wmflabs 22
[20:09:19] Trying 10.68.16.228...
[20:09:21] Connected to tools-checker-01.eqiad.wmflabs.
[20:09:24] Escape character is '^]'.
[20:09:26] Connection closed by foreign host.
[20:22:05] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:27:36] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:30:29] tools-checker-01 login: [1048932.288785] NFS: Server labstore1003.eqiad.wmnet reports our clientid is in use
[20:30:39] (from the console log)
[20:32:02] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.076 second response time
[20:32:44] PROBLEM - RAID on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:32:52] huh, but https://wikitech.wikimedia.org/wiki/Special:NovaAddress actually already had tools-checker-02? what on earth
[20:33:22] as in: the warning should have come from -02, not -01
[20:33:32] s/have come/came
[20:36:54] RECOVERY - RAID on mw1004 is OK: OK: no RAID installed
[20:43:05] PROBLEM - RAID on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:55:02] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:55:32] um…valhallasw`cloud ^ ?
[20:55:39] * andrewbogott is still playing catch-up
[20:55:56] PROBLEM - nutcracker port on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:56:24] PROBLEM - configured eth on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:56:40] 208.80.154.14 - - [09/Jan/2016:20:54:47 +0000] "GET /nfs/home HTTP/1.1" 499 0 "-" "check_http/v1.4.15 (nagios-plugins 1.4.15)"
[20:56:44] PROBLEM - Disk space on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:57:15] NFS is working, though.
[20:57:35] RECOVERY - RAID on mw1004 is OK: OK: no RAID installed
[20:57:38] meanwhile, mw1004 is oom
[20:57:55] RECOVERY - nutcracker port on mw1004 is OK: TCP OK - 0.000 second response time on port 11212
[20:58:05] well, was
[20:58:15] RECOVERY - configured eth on mw1004 is OK: OK - interfaces up
[20:58:36] RECOVERY - Disk space on mw1004 is OK: DISK OK
[20:59:12] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.025 second response time
[21:00:45] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:14:04] PROBLEM - puppet last run on db2048 is CRITICAL: CRITICAL: puppet fail
[21:26:45] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: puppet fail
[21:41:24] RECOVERY - puppet last run on db2048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:54:04] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:41:35] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet has 6 failures
[22:47:05] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 6 failures
[22:52:24] PROBLEM - puppet last run on mw1010 is CRITICAL: CRITICAL: Puppet has 13 failures
[23:08:34] PROBLEM - RAID on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:08:45] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:09:04] PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:09:05] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:09:35] PROBLEM - puppet last run on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:10:14] PROBLEM - configured eth on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:11:05] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[23:11:55] PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:12:54] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:14:34] PROBLEM - nutcracker port on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:14:54] RECOVERY - Disk space on mw1006 is OK: DISK OK
[23:16:25] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:17:15] PROBLEM - RAID on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:17:24] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:19:24] RECOVERY - RAID on mw1010 is OK: OK: no RAID installed
[23:19:45] PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:05] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:29:24] PROBLEM - DPKG on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:29:45] PROBLEM - RAID on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:30:05] PROBLEM - Check size of conntrack table on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:33:34] RECOVERY - RAID on mw1006 is OK: OK: no RAID installed
[23:33:35] RECOVERY - Disk space on mw1006 is OK: DISK OK
[23:33:44] RECOVERY - DPKG on mw1010 is OK: All packages OK
[23:33:54] RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[23:33:55] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[23:34:15] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:34:16] RECOVERY - Check size of conntrack table on mw1010 is OK: OK: nf_conntrack is 21 % full
[23:34:34] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 59 minutes ago with 0 failures
[23:34:54] RECOVERY - DPKG on mw1006 is OK: All packages OK
[23:35:06] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient
[23:35:14] RECOVERY - configured eth on mw1006 is OK: OK - interfaces up
[23:35:25] RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212
[23:38:04] PROBLEM - puppet last run on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:39:15] PROBLEM - RAID on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:41:14] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[23:42:55] PROBLEM - SSH on mw1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:42:55] PROBLEM - puppet last run on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:43:25] RECOVERY - DPKG on restbase1008 is OK: All packages OK
[23:43:55] PROBLEM - RAID on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:46:36] PROBLEM - SSH on mw1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:50:35] PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:50:55] PROBLEM - Check size of conntrack table on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:51:35] PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:51:46] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:51:46] PROBLEM - configured eth on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:52:05] PROBLEM - nutcracker port on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:52:35] PROBLEM - configured eth on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:52:45] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:52:45] RECOVERY - SSH on mw1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[23:53:05] PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:53:15] PROBLEM - nutcracker process on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:53:35] PROBLEM - salt-minion processes on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:54:24] PROBLEM - DPKG on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:55:25] PROBLEM - Disk space on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:55:25] PROBLEM - dhclient process on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:57:44] RECOVERY - salt-minion processes on mw1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:59:24] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:59:25] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:59:34] RECOVERY - SSH on mw1010 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)