[00:00:34] RECOVERY - DPKG on mw1010 is OK: All packages OK [00:01:04] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 53 minutes ago with 0 failures [00:01:04] RECOVERY - configured eth on mw1010 is OK: OK - interfaces up [00:01:26] RECOVERY - Check size of conntrack table on mw1010 is OK: OK: nf_conntrack is 19 % full [00:01:38] RECOVERY - Disk space on mw1010 is OK: DISK OK [00:01:38] RECOVERY - dhclient process on mw1010 is OK: PROCS OK: 0 processes with command name dhclient [00:01:38] RECOVERY - nutcracker process on mw1010 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [00:04:14] RECOVERY - RAID on mw1012 is OK: OK: no RAID installed [00:05:04] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:05:12] (PS1) Aude: Set logos for mobile login page for Wikidata and Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/263201 (https://phabricator.wikimedia.org/T123175) [00:05:46] PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:06:34] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient [00:06:34] RECOVERY - configured eth on mw1006 is OK: OK - interfaces up [00:06:35] RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212 [00:07:05] RECOVERY - Disk space on mw1006 is OK: DISK OK [00:07:15] RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [00:07:24] PROBLEM - puppet last run on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:07:24] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [00:07:42] (PS2) Aude: Set logos for mobile login page for Wikidata and Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/263201 (https://phabricator.wikimedia.org/T123175) [00:07:44] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:08:05] PROBLEM - SSH on mw1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:08:15] RECOVERY - DPKG on mw1006 is OK: All packages OK [00:08:56] RECOVERY - RAID on mw1006 is OK: OK: no RAID installed [00:09:54] PROBLEM - Check size of conntrack table on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:10:35] PROBLEM - RAID on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:11:24] PROBLEM - DPKG on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:11:34] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:11:34] PROBLEM - dhclient process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:11:36] PROBLEM - nutcracker process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:11:46] PROBLEM - SSH on mw1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:12:05] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:13:35] PROBLEM - configured eth on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:15:05] PROBLEM - configured eth on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:15:05] PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:15:05] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:15:14] PROBLEM - nutcracker port on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:15:15] PROBLEM - DPKG on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:17:06] RECOVERY - nutcracker port on mw1010 is OK: TCP OK - 0.000 second response time on port 11212 [00:23:24] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:23:55] RECOVERY - configured eth on mw1010 is OK: OK - interfaces up [00:24:35] PROBLEM - nutcracker process on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:24:35] PROBLEM - Disk space on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:24:35] PROBLEM - dhclient process on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:24:54] PROBLEM - salt-minion processes on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:26:45] RECOVERY - nutcracker process on mw1010 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [00:26:45] RECOVERY - dhclient process on mw1010 is OK: PROCS OK: 0 processes with command name dhclient [00:27:35] PROBLEM - nutcracker port on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:29:34] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:29:44] RECOVERY - nutcracker port on mw1010 is OK: TCP OK - 0.000 second response time on port 11212 [00:30:15] PROBLEM - configured eth on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:30:55] RECOVERY - Disk space on mw1010 is OK: DISK OK [00:37:16] PROBLEM - nutcracker process on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:37:16] PROBLEM - dhclient process on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:37:16] PROBLEM - Disk space on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:37:24] RECOVERY - salt-minion processes on mw1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:38:14] PROBLEM - nutcracker port on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:40:14] RECOVERY - nutcracker port on mw1010 is OK: TCP OK - 0.000 second response time on port 11212 [00:40:44] RECOVERY - dhclient process on mw1012 is OK: PROCS OK: 0 processes with command name dhclient [00:40:44] RECOVERY - Disk space on mw1012 is OK: DISK OK [00:41:25] RECOVERY - nutcracker process on mw1010 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [00:41:25] RECOVERY - Disk space on mw1010 is OK: DISK OK [00:41:25] RECOVERY - dhclient process on mw1010 is OK: PROCS OK: 0 processes with command name dhclient [00:42:04] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:42:04] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212 [00:45:45] PROBLEM - salt-minion processes on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:46:25] PROBLEM - nutcracker port on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:46:55] PROBLEM - dhclient process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:46:55] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:47:36] PROBLEM - dhclient process on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:47:36] PROBLEM - nutcracker process on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:47:36] PROBLEM - Disk space on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:48:24] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:48:24] PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:48:55] RECOVERY - dhclient process on mw1012 is OK: PROCS OK: 0 processes with command name dhclient [00:48:55] RECOVERY - Disk space on mw1012 is OK: DISK OK [00:52:44] RECOVERY - nutcracker port on mw1010 is OK: TCP OK - 0.000 second response time on port 11212 [00:55:14] PROBLEM - dhclient process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:55:14] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:58:55] PROBLEM - nutcracker port on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:02:24] RECOVERY - Disk space on mw1010 is OK: DISK OK [01:02:24] RECOVERY - dhclient process on mw1010 is OK: PROCS OK: 0 processes with command name dhclient [01:07:15] RECOVERY - nutcracker port on mw1010 is OK: TCP OK - 0.000 second response time on port 11212 [01:07:34] RECOVERY - DPKG on mw1010 is OK: All packages OK [01:07:45] RECOVERY - configured eth on mw1010 is OK: OK - interfaces up [01:08:35] PROBLEM - Disk space on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:08:35] PROBLEM - dhclient process on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:13:35] PROBLEM - nutcracker port on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:13:45] PROBLEM - DPKG on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:13:55] PROBLEM - configured eth on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:17:55] PROBLEM - RAID on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:06] PROBLEM - puppet last run on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:18:55] RECOVERY - Disk space on mw1010 is OK: DISK OK [01:18:55] RECOVERY - dhclient process on mw1010 is OK: PROCS OK: 0 processes with command name dhclient [01:19:15] PROBLEM - SSH on mw1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:19:36] RECOVERY - nutcracker port on mw1010 is OK: TCP OK - 0.000 second response time on port 11212 [01:19:45] RECOVERY - DPKG on mw1010 is OK: All packages OK [01:20:04] RECOVERY - configured eth on mw1010 is OK: OK - interfaces up [01:20:04] RECOVERY - RAID on mw1010 is OK: OK: no RAID installed [01:20:14] PROBLEM - Disk space on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:20:34] RECOVERY - Check size of conntrack table on mw1010 is OK: OK: nf_conntrack is 5 % full [01:20:45] RECOVERY - nutcracker process on mw1010 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [01:20:54] RECOVERY - SSH on mw1010 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [01:21:04] RECOVERY - salt-minion processes on mw1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:21:26] PROBLEM - DPKG on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:21:54] PROBLEM - configured eth on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:22:14] RECOVERY - Disk space on mw1015 is OK: DISK OK [01:24:35] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:29:46] RECOVERY - DPKG on mw1015 is OK: All packages OK [01:30:35] PROBLEM - Disk space on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:34:34] PROBLEM - nutcracker port on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:34:34] PROBLEM - salt-minion processes on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:36:15] PROBLEM - DPKG on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:38:24] RECOVERY - DPKG on mw1015 is OK: All packages OK [01:38:44] RECOVERY - salt-minion processes on mw1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:38:44] RECOVERY - nutcracker port on mw1015 is OK: TCP OK - 0.000 second response time on port 11212 [01:39:04] RECOVERY - Disk space on mw1015 is OK: DISK OK [01:44:55] PROBLEM - salt-minion processes on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:44:55] PROBLEM - nutcracker port on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:45:15] PROBLEM - Disk space on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:47:54] PROBLEM - dhclient process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:48:34] PROBLEM - nutcracker process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:48:45] PROBLEM - DPKG on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:01:05] RECOVERY - nutcracker process on mw1015 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:01:05] RECOVERY - SSH on mw1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [02:01:15] RECOVERY - DPKG on mw1015 is OK: All packages OK [02:01:37] RECOVERY - salt-minion processes on mw1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:01:37] RECOVERY - nutcracker port on mw1015 is OK: TCP OK - 0.000 second response time on port 11212 [02:01:37] RECOVERY - configured eth on mw1015 is OK: OK - interfaces up [02:01:56] RECOVERY - Disk space on mw1015 is OK: DISK OK [02:02:04] RECOVERY - RAID on mw1015 is OK: OK: no RAID installed [02:02:25] RECOVERY - dhclient process on mw1015 is OK: PROCS OK: 0 processes with command name dhclient [02:04:24] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 60 failures [02:06:05] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, 
last run 1 minute ago with 0 failures [02:11:45] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212 [02:11:45] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:11:54] RECOVERY - configured eth on mw1012 is OK: OK - interfaces up [02:12:15] RECOVERY - dhclient process on mw1012 is OK: PROCS OK: 0 processes with command name dhclient [02:12:15] RECOVERY - Disk space on mw1012 is OK: DISK OK [02:12:15] RECOVERY - DPKG on mw1012 is OK: All packages OK [02:12:36] RECOVERY - nutcracker process on mw1012 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:12:55] RECOVERY - SSH on mw1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [02:13:34] RECOVERY - RAID on mw1012 is OK: OK: no RAID installed [02:16:25] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:16:54] PROBLEM - RAID on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:16:55] PROBLEM - configured eth on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:17:04] PROBLEM - DPKG on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:17:44] PROBLEM - SSH on mw1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:17:44] PROBLEM - dhclient process on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:17:54] PROBLEM - nutcracker process on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:21:55] RECOVERY - nutcracker process on mw1009 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:23:05] RECOVERY - configured eth on mw1009 is OK: OK - interfaces up [02:23:15] RECOVERY - DPKG on mw1009 is OK: All packages OK [02:25:55] RECOVERY - DPKG on restbase1006 is OK: All packages OK [02:28:16] PROBLEM - nutcracker process on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:29:14] PROBLEM - salt-minion processes on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:29:14] PROBLEM - nutcracker port on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:29:25] PROBLEM - configured eth on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:29:26] PROBLEM - DPKG on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:29:54] PROBLEM - Disk space on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:15] RECOVERY - Disk space on mw1009 is OK: DISK OK [02:42:25] PROBLEM - Disk space on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:49:15] RECOVERY - puppet last run on restbase1006 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [02:52:14] RECOVERY - salt-minion processes on mw1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:58:25] PROBLEM - salt-minion processes on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:04:14] PROBLEM - puppet last run on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:05:15] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:05:44] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:05:45] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:05:45] PROBLEM - salt-minion processes on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:05:55] PROBLEM - RAID on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:05:55] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:06:34] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:07:45] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:07:54] RECOVERY - Disk space on mw1005 is OK: DISK OK [03:08:35] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:54] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [03:10:35] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient [03:13:34] RECOVERY - DPKG on mw1005 is OK: All packages OK [03:19:16] RECOVERY - salt-minion processes on mw1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:19:16] RECOVERY - nutcracker port on mw1009 is OK: TCP OK - 0.000 second response time on port 11212 [03:19:45] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:20:15] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:20:15] PROBLEM - salt-minion processes on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:20:25] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:20:34] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: Puppet has 60 failures [03:22:14] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [03:22:15] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:22:26] RECOVERY - Disk space on mw1005 is OK: DISK OK [03:23:04] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:05] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient [03:25:35] PROBLEM - salt-minion processes on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:35] PROBLEM - nutcracker port on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:26:35] RECOVERY - dhclient process on mw1009 is OK: PROCS OK: 0 processes with command name dhclient [03:27:14] PROBLEM - SSH on mw1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:27:15] PROBLEM - DPKG on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:27:36] PROBLEM - configured eth on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:27:54] PROBLEM - RAID on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:28:15] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:28:35] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:28:44] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:31:25] RECOVERY - SSH on mw1007 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [03:31:34] RECOVERY - DPKG on mw1007 is OK: All packages OK [03:31:54] RECOVERY - configured eth on mw1007 is OK: OK - interfaces up [03:32:04] RECOVERY - RAID on mw1007 is OK: OK: no RAID installed [03:32:55] PROBLEM - salt-minion processes on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:33:04] RECOVERY - Disk space on mw1005 is OK: DISK OK [03:33:04] PROBLEM - dhclient process on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:34:44] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212 [03:35:44] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:36:24] PROBLEM - puppet last run on kafka1020 is CRITICAL: CRITICAL: Puppet has 1 failures [03:36:24] PROBLEM - puppet last run on mw2181 is CRITICAL: CRITICAL: Puppet has 1 failures [03:36:34] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Puppet has 1 failures [03:36:54] PROBLEM - puppet last run on mw2072 is CRITICAL: CRITICAL: Puppet has 1 failures [03:36:54] PROBLEM - puppet last run on mw2051 is CRITICAL: CRITICAL: Puppet has 1 failures [03:36:54] PROBLEM - puppet last run on elastic1002 is CRITICAL: CRITICAL: puppet fail [03:37:24] PROBLEM - puppet last run on mw1034 is CRITICAL: CRITICAL: Puppet has 1 failures [03:37:55] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: Puppet has 1 failures [03:39:15] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:40:54] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:37] PROBLEM - RAID on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:55] PROBLEM - puppet last run on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:45:35] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:46:34] RECOVERY - nutcracker port on mw1009 is OK: TCP OK - 0.000 second response time on port 11212 [03:46:34] RECOVERY - salt-minion processes on mw1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:48:05] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 41 minutes ago with 0 failures [03:49:35] RECOVERY - Disk space on mw1005 is OK: DISK OK [03:52:45] PROBLEM - salt-minion processes on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:45] PROBLEM - nutcracker port on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:54:15] PROBLEM - puppet last run on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:55:15] RECOVERY - Disk space on mw1009 is OK: DISK OK [03:55:45] RECOVERY - SSH on mw1009 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [03:55:54] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:57:05] RECOVERY - configured eth on mw1009 is OK: OK - interfaces up [03:57:44] PROBLEM - configured eth on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:57:55] PROBLEM - DPKG on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:57:55] RECOVERY - nutcracker process on mw1009 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [03:58:56] RECOVERY - salt-minion processes on mw1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:58:56] RECOVERY - nutcracker port on mw1009 is OK: TCP OK - 0.000 second response time on port 11212 [03:59:54] RECOVERY - dhclient process on mw1009 is OK: PROCS OK: 0 processes with command name dhclient [04:01:15] RECOVERY - puppet last run on kafka1020 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [04:01:25] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [04:01:54] RECOVERY - puppet last run on mw2072 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [04:02:24] RECOVERY - puppet last run on mw1034 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [04:02:55] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [04:03:15] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: puppet fail [04:03:25] RECOVERY - puppet last run on mw2181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:03:35] PROBLEM - configured eth on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:03:55] RECOVERY - puppet last run on elastic1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:03:56] RECOVERY - puppet last run on mw2051 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [04:04:05] RECOVERY - configured eth on mw1016 is OK: OK - interfaces up [04:04:24] RECOVERY - DPKG on mw1016 is OK: All packages OK [04:04:34] PROBLEM - nutcracker process on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:04:34] PROBLEM - SSH on mw1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:05:26] PROBLEM - salt-minion processes on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:05:26] PROBLEM - nutcracker port on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:06:25] PROBLEM - dhclient process on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:06:25] PROBLEM - SSH on mw1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:06:34] RECOVERY - SSH on mw1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [04:08:05] PROBLEM - Disk space on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:17:05] PROBLEM - SSH on mw1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:19:04] PROBLEM - DPKG on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:19:25] PROBLEM - dhclient process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:20:15] PROBLEM - nutcracker process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:20:35] PROBLEM - nutcracker port on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:20:55] PROBLEM - configured eth on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:22:24] PROBLEM - salt-minion processes on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:22:34] PROBLEM - Disk space on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:22:44] RECOVERY - Disk space on mw1009 is OK: DISK OK [04:23:05] RECOVERY - dhclient process on mw1009 is OK: PROCS OK: 0 processes with command name dhclient [04:24:14] RECOVERY - nutcracker port on mw1009 is OK: TCP OK - 0.000 second response time on port 11212 [04:24:15] RECOVERY - salt-minion processes on mw1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:24:15] RECOVERY - salt-minion processes on mw1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:24:15] RECOVERY - nutcracker process on mw1016 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [04:24:25] RECOVERY - Disk space on mw1016 is OK: DISK OK [04:24:34] RECOVERY - nutcracker port on mw1016 is OK: TCP OK - 0.000 second response time on port 11212 [04:24:55] RECOVERY - configured eth on mw1016 is OK: OK - interfaces up [04:25:14] RECOVERY - DPKG on mw1016 is OK: All packages OK [04:25:15] RECOVERY - SSH on mw1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [04:25:15] RECOVERY - RAID on mw1016 is OK: OK: no RAID installed [04:25:25] RECOVERY - nutcracker process on mw1009 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [04:25:27] RECOVERY - dhclient process on mw1016 is OK: PROCS OK: 0 processes with command name dhclient [04:27:46] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [04:29:25] PROBLEM - dhclient process on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:30:35] PROBLEM - nutcracker port on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:30:35] PROBLEM - salt-minion processes on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:31:55] PROBLEM - nutcracker process on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:32:44] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [04:32:54] RECOVERY - nutcracker port on mw1009 is OK: TCP OK - 0.000 second response time on port 11212 [04:32:54] RECOVERY - salt-minion processes on mw1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:43:16] PROBLEM - nutcracker port on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:43:24] PROBLEM - salt-minion processes on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:43:45] PROBLEM - Disk space on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:52:46] RECOVERY - nutcracker process on mw1009 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [04:53:45] RECOVERY - salt-minion processes on mw1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:56:48] !log restarting HHVM on eqiad jobrunners, OOM, memleak faster than the 24h restarts [04:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:58:42] !log powercycling mw1005, mw1008, mw1009 -- unresponsive due to OOM [04:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:59:04] PROBLEM - nutcracker process on mw1009 is CRITICAL: Timeout while attempting connection [04:59:34] PROBLEM - dhclient process on mw1008 is CRITICAL: Timeout while attempting connection [04:59:34] PROBLEM - configured eth on mw1008 is CRITICAL: Timeout while attempting connection [04:59:34] PROBLEM - DPKG on mw1008 is CRITICAL: Timeout while attempting connection [04:59:44] PROBLEM - RAID on mw1008 is CRITICAL: Timeout while attempting connection [05:00:14] PROBLEM - salt-minion processes on mw1009 is CRITICAL: Timeout while attempting connection [05:00:44] PROBLEM - nutcracker process on mw1008 is CRITICAL: Connection refused by host [05:00:54] PROBLEM - puppet last 
run on mw1008 is CRITICAL: CRITICAL: puppet fail [05:00:55] RECOVERY - SSH on mw1009 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [05:00:55] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up [05:00:55] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:00:55] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:00:55] RECOVERY - dhclient process on mw1009 is OK: PROCS OK: 0 processes with command name dhclient [05:00:56] RECOVERY - RAID on mw1005 is OK: OK: no RAID installed [05:01:05] RECOVERY - Disk space on mw1005 is OK: DISK OK [05:01:14] RECOVERY - nutcracker process on mw1009 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:01:25] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 2 hours ago with 0 failures [05:01:38] RECOVERY - dhclient process on mw1008 is OK: PROCS OK: 0 processes with command name dhclient [05:01:38] RECOVERY - configured eth on mw1008 is OK: OK - interfaces up [05:01:38] RECOVERY - DPKG on mw1008 is OK: All packages OK [05:01:44] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [05:01:44] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient [05:01:45] RECOVERY - RAID on mw1008 is OK: OK: no RAID installed [05:02:15] RECOVERY - nutcracker port on mw1009 is OK: TCP OK - 0.000 second response time on port 11212 [05:02:15] RECOVERY - salt-minion processes on mw1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:02:24] RECOVERY - configured eth on mw1009 is OK: OK - interfaces up [05:02:25] RECOVERY - RAID on mw1009 is OK: OK: no RAID installed [05:02:44] RECOVERY - DPKG on mw1009 is OK: All packages OK [05:02:44] RECOVERY - DPKG on mw1005 is OK: All packages OK [05:02:44] 
RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212 [05:02:45] RECOVERY - Disk space on mw1009 is OK: DISK OK [05:02:54] RECOVERY - nutcracker process on mw1008 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:05:04] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:10:44] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:30:44] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [06:31:14] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:25] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:45] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:05] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:15] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:16] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:54] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:14] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:15] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:15] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:06] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:35] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:05] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:15] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 
8 seconds ago with 0 failures [06:57:16] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:57:54] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:54] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:57:55] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: puppet fail [06:58:15] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:15] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:15] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:16] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:34] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:14] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:24:55] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [08:05:06] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: puppet fail [08:07:15] PROBLEM - puppet last run on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:13:25] PROBLEM - SSH on mw1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:13:54] PROBLEM - DPKG on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:14:05] PROBLEM - configured eth on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:14:06] PROBLEM - nutcracker port on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:14:15] PROBLEM - RAID on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:14:26] PROBLEM - salt-minion processes on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:17:37] PROBLEM - Disk space on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:18:15] PROBLEM - nutcracker process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:18:15] PROBLEM - dhclient process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:19:34] PROBLEM - puppet last run on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:21:15] PROBLEM - RAID on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:22:25] RECOVERY - RAID on mw1007 is OK: OK: no RAID installed [08:22:35] RECOVERY - salt-minion processes on mw1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:23:36] RECOVERY - SSH on mw1007 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [08:23:36] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 39 minutes ago with 0 failures [08:23:44] RECOVERY - Disk space on mw1007 is OK: DISK OK [08:24:05] RECOVERY - DPKG on mw1007 is OK: All packages OK [08:24:15] RECOVERY - dhclient process on mw1007 is OK: PROCS OK: 0 processes with command name dhclient [08:24:15] RECOVERY - nutcracker process on mw1007 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:24:15] RECOVERY - configured eth on mw1007 is OK: OK - interfaces up [08:24:18] operations, Performance-Team, Wikimedia-General-or-Unknown, Patch-For-Review: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1923208 (ori) The time it takes each job runner to OOM has been steadily shrinking, so restarting once a day is now inadequate. 
[08:24:24] RECOVERY - nutcracker port on mw1007 is OK: TCP OK - 0.000 second response time on port 11212 [08:25:24] PROBLEM - configured eth on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:25:35] PROBLEM - SSH on mw1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:36] PROBLEM - DPKG on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:25:45] PROBLEM - RAID on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:25:55] PROBLEM - dhclient process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:26:15] PROBLEM - puppet last run on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:27:24] RECOVERY - RAID on mw1014 is OK: OK: no RAID installed [08:27:44] RECOVERY - SSH on mw1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [08:27:44] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 26 minutes ago with 0 failures [08:27:55] RECOVERY - dhclient process on mw1016 is OK: PROCS OK: 0 processes with command name dhclient [08:28:55] PROBLEM - salt-minion processes on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:28:55] PROBLEM - nutcracker process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:29:24] !log restarting hhvm on jobrunners again [08:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09, Master [08:29:54] RECOVERY - DPKG on mw1016 is OK: All packages OK [08:31:14] PROBLEM - Disk space on mw1016 is CRITICAL: Timeout while attempting connection [08:31:15] PROBLEM - nutcracker port on mw1016 is CRITICAL: Timeout while attempting connection [08:32:14] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:32:37] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 25 minutes ago with 0 failures [08:33:05] RECOVERY - salt-minion processes on mw1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:33:05] RECOVERY - nutcracker process on mw1016 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:33:14] RECOVERY - Disk space on mw1016 is OK: DISK OK [08:33:15] RECOVERY - nutcracker port on mw1016 is OK: TCP OK - 0.000 second response time on port 11212 [08:33:35] !log Attempting to isolate cause of T122069 by toggling job types on mw1169. Disabling Puppet to prevent it from clobbering config changes. [08:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09, Master [08:33:45] RECOVERY - configured eth on mw1016 is OK: OK - interfaces up [08:34:05] RECOVERY - RAID on mw1016 is OK: OK: no RAID installed [08:41:39] !log mw1169 -- disables cirrus jobs. 
[08:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09, Master [08:45:31] !log mw1168 -- disabled puppet; disabled restbase jobs [08:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09, Master [08:48:03] !log mw1167 -- disabled puppet; disabled deleteLinks and refreshLinks* jobs [08:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09, Master [08:55:11] !log mw1166 -- disabled puppet; disabled categoryMembershipChange jobs [08:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09, Master [09:18:26] PROBLEM - puppet last run on mw2117 is CRITICAL: CRITICAL: puppet fail [09:47:35] RECOVERY - puppet last run on mw2117 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:53:15] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: puppet fail [09:58:25] PROBLEM - salt-minion processes on cygnus is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [10:00:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 656 [10:06:46] RECOVERY - salt-minion processes on cygnus is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:13:05] !log disabled categoryMembershipChange on mw1165 too, then restart jobrunner / jobchron / hhvm on mw1165 and mw1164 [10:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:20:15] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:30:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 686 [10:35:06] RECOVERY - check_mysql on db1008 is OK: Uptime: 1706302 Threads: 3 Questions: 42742375 Slow queries: 17858 Opens: 60018 Flush tables: 2 Open tables: 
416 Queries per second avg: 25.049 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [12:09:36] PROBLEM - puppet last run on mw1050 is CRITICAL: CRITICAL: Puppet has 1 failures [12:32:25] RECOVERY - puppet last run on mw1050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:34:55] PROBLEM - DPKG on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:35:05] PROBLEM - puppet last run on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:35:14] PROBLEM - SSH on mw1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:15] PROBLEM - nutcracker process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:36:05] PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:36:14] PROBLEM - RAID on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:36:15] PROBLEM - configured eth on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:36:24] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:36:55] PROBLEM - dhclient process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:38:16] PROBLEM - nutcracker port on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:38:35] PROBLEM - RAID on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:38:38] oh dear [12:38:45] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:38:45] PROBLEM - SSH on mw1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:38:45] PROBLEM - DPKG on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:38:56] PROBLEM - RAID on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:39:04] PROBLEM - puppet last run on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:39:24] PROBLEM - Check size of conntrack table on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:39:25] PROBLEM - SSH on mw1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:39:26] PROBLEM - dhclient process on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:39:26] PROBLEM - salt-minion processes on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:39:35] PROBLEM - nutcracker process on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:39:35] PROBLEM - RAID on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:39:35] PROBLEM - nutcracker process on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:39:35] PROBLEM - dhclient process on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:39:45] PROBLEM - nutcracker port on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:39:54] PROBLEM - puppet last run on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:39:55] PROBLEM - salt-minion processes on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:40:04] PROBLEM - puppet last run on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:40:05] PROBLEM - dhclient process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:40:05] PROBLEM - nutcracker process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:40:15] PROBLEM - nutcracker port on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:40:16] PROBLEM - configured eth on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:40:25] PROBLEM - Disk space on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:40:34] PROBLEM - configured eth on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:40:34] PROBLEM - salt-minion processes on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:40:35] PROBLEM - DPKG on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:40:35] PROBLEM - DPKG on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:40:37] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 626m 20s) [12:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:40:55] PROBLEM - SSH on mw1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:41:24] RECOVERY - Check size of conntrack table on mw1010 is OK: OK: nf_conntrack is 0 % full [12:42:14] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212 [12:42:24] RECOVERY - nutcracker port on mw1010 is OK: TCP OK - 0.000 second response time on port 11212 [12:43:51] *phew* [12:44:04] PROBLEM - salt-minion processes on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:44:05] PROBLEM - configured eth on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:44:14] PROBLEM - RAID on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:44:24] PROBLEM - puppet last run on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:44:26] PROBLEM - DPKG on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:44:35] PROBLEM - SSH on mw1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:44:44] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:44:45] PROBLEM - dhclient process on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:44:56] PROBLEM - configured eth on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:44:56] PROBLEM - nutcracker process on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:47:15] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: puppet fail [12:47:35] PROBLEM - Check size of conntrack table on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:44] PROBLEM - Disk space on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:34] PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:44] PROBLEM - nutcracker port on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:04] PROBLEM - puppet last run on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:15] PROBLEM - RAID on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:25] PROBLEM - dhclient process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:55] PROBLEM - puppet last run on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:50:04] PROBLEM - nutcracker port on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:50:24] PROBLEM - salt-minion processes on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:50:24] PROBLEM - nutcracker process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:50:25] PROBLEM - Disk space on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:50:34] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212 [12:50:54] PROBLEM - configured eth on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:50:55] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:51:04] PROBLEM - SSH on mw1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:51:05] RECOVERY - Disk space on mw1012 is OK: DISK OK [12:51:06] PROBLEM - nutcracker port on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:51:14] PROBLEM - salt-minion processes on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:51:25] RECOVERY - dhclient process on mw1016 is OK: PROCS OK: 0 processes with command name dhclient [12:51:44] PROBLEM - configured eth on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:51:45] PROBLEM - DPKG on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:51:55] PROBLEM - RAID on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:51:55] RECOVERY - nutcracker process on mw1012 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:52:24] RECOVERY - salt-minion processes on mw1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:52:24] RECOVERY - nutcracker process on mw1016 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:52:34] PROBLEM - SSH on mw1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:52:54] PROBLEM - nutcracker process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:53:15] RECOVERY - nutcracker port on mw1008 is OK: TCP OK - 0.000 second response time on port 11212 [12:53:15] RECOVERY - salt-minion processes on mw1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:54:34] RECOVERY - dhclient process on mw1013 is OK: PROCS OK: 0 processes with command name dhclient [12:54:35] RECOVERY - nutcracker process on mw1013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:54:35] PROBLEM - DPKG on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:54:35] RECOVERY - Disk space on mw1009 is OK: DISK OK [12:54:36] PROBLEM - Disk space on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:54:45] PROBLEM - dhclient process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:54:55] RECOVERY - configured eth on mw1016 is OK: OK - interfaces up [12:55:05] PROBLEM - nutcracker process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:05] PROBLEM - RAID on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:15] PROBLEM - configured eth on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:15] PROBLEM - nutcracker port on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:15] PROBLEM - SSH on mw1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:34] PROBLEM - puppet last run on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:54] PROBLEM - salt-minion processes on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:54] PROBLEM - dhclient process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:56:46] PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:57:25] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:57:25] PROBLEM - DPKG on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:57:26] PROBLEM - Disk space on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:57:45] PROBLEM - dhclient process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:57:57] Not in the loop atm, I saw a deploy above - is this ^ related or has something died? [12:59:05] RECOVERY - nutcracker process on mw1008 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:59:34] RECOVERY - nutcracker port on mw1002 is OK: TCP OK - 0.000 second response time on port 11212 [13:00:05] RECOVERY - dhclient process on mw1008 is OK: PROCS OK: 0 processes with command name dhclient [13:00:45] PROBLEM - Disk space on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:01:04] PROBLEM - Disk space on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:01:25] PROBLEM - configured eth on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:04] RECOVERY - dhclient process on mw1016 is OK: PROCS OK: 0 processes with command name dhclient [13:02:34] PROBLEM - nutcracker process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:03:05] RECOVERY - Disk space on mw1009 is OK: DISK OK [13:03:05] PROBLEM - nutcracker process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:03:05] PROBLEM - dhclient process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:03:14] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212 [13:04:54] RECOVERY - salt-minion processes on mw1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:05:15] RECOVERY - Disk space on mw1002 is OK: DISK OK [13:05:35] PROBLEM - nutcracker process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:55] PROBLEM - nutcracker port on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:55] PROBLEM - nutcracker port on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:55] PROBLEM - salt-minion processes on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:06:14] PROBLEM - puppet last run on mw1005 is CRITICAL: CRITICAL: puppet fail [13:06:26] PROBLEM - dhclient process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:07:44] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:07:55] RECOVERY - nutcracker port on mw1002 is OK: TCP OK - 0.000 second response time on port 11212 [13:08:04] RECOVERY - nutcracker process on mw1009 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [13:09:15] PROBLEM - Disk space on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:09:25] PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:11:14] PROBLEM - salt-minion processes on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:11:25] PROBLEM - Disk space on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:13:44] RECOVERY - dhclient process on mw1002 is OK: PROCS OK: 0 processes with command name dhclient [13:14:04] RECOVERY - SSH on mw1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [13:14:04] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:14:14] PROBLEM - nutcracker port on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:14:15] PROBLEM - nutcracker process on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:14:24] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [13:14:35] PROBLEM - dhclient process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:16:24] RECOVERY - Disk space on mw1013 is OK: DISK OK [13:16:25] PROBLEM - RAID on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:16:35] RECOVERY - dhclient process on mw1016 is OK: PROCS OK: 0 processes with command name dhclient [13:17:34] PROBLEM - Disk space on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:17:35] PROBLEM - nutcracker port on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:35] PROBLEM - nutcracker process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:35] PROBLEM - salt-minion processes on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:35] PROBLEM - nutcracker process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:55] PROBLEM - DPKG on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:18:04] PROBLEM - configured eth on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:18:06] PROBLEM - Disk space on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:18:15] PROBLEM - RAID on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:18:44] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:18:45] PROBLEM - puppet last run on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:19:15] PROBLEM - SSH on mw1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:19:35] RECOVERY - nutcracker process on mw1015 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [13:19:35] RECOVERY - Disk space on mw1016 is OK: DISK OK [13:19:36] RECOVERY - nutcracker port on mw1016 is OK: TCP OK - 0.000 second response time on port 11212 [13:20:05] RECOVERY - DPKG on mw1015 is OK: All packages OK [13:20:05] RECOVERY - configured eth on mw1015 is OK: OK - interfaces up [13:20:14] RECOVERY - Disk space on mw1015 is OK: DISK OK [13:20:15] PROBLEM - SSH on mw1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:20:44] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [13:22:24] RECOVERY - dhclient process on mw1009 is OK: PROCS OK: 0 processes with command name dhclient [13:22:35] PROBLEM - Disk space on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:23:44] RECOVERY - salt-minion processes on mw1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:23:45] RECOVERY - nutcracker process on mw1016 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [13:24:55] PROBLEM - dhclient process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:25:54] PROBLEM - nutcracker process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:26:14] PROBLEM - dhclient process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:26:14] PROBLEM - nutcracker port on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:26:15] PROBLEM - DPKG on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:26:16] PROBLEM - configured eth on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:26:45] RECOVERY - DPKG on mw1016 is OK: All packages OK [13:27:15] PROBLEM - dhclient process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:27:34] RECOVERY - nutcracker process on mw1012 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [13:27:35] RECOVERY - SSH on mw1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [13:27:55] RECOVERY - nutcracker process on mw1015 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [13:28:24] RECOVERY - configured eth on mw1015 is OK: OK - interfaces up [13:28:24] RECOVERY - DPKG on mw1015 is OK: All packages OK [13:28:36] PROBLEM - dhclient process on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:55] RECOVERY - Disk space on mw1012 is OK: DISK OK [13:31:34] RECOVERY - dhclient process on mw1015 is OK: PROCS OK: 0 processes with command name dhclient [13:32:15] PROBLEM - nutcracker port on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:32:15] PROBLEM - salt-minion processes on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:32:16] PROBLEM - nutcracker process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:32:16] PROBLEM - Disk space on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:32:25] RECOVERY - nutcracker port on mw1015 is OK: TCP OK - 0.000 second response time on port 11212 [13:33:14] RECOVERY - RAID on mw1015 is OK: OK: no RAID installed [13:33:14] PROBLEM - DPKG on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:33:55] PROBLEM - nutcracker process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:34:44] RECOVERY - dhclient process on mw1002 is OK: PROCS OK: 0 processes with command name dhclient [13:35:15] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:35:25] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:25] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:36:25] RECOVERY - salt-minion processes on mw1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:36:25] RECOVERY - nutcracker port on mw1016 is OK: TCP OK - 0.000 second response time on port 11212 [13:36:25] RECOVERY - Disk space on mw1016 is OK: DISK OK [13:39:34] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:40:25] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212 [13:40:55] PROBLEM - dhclient process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:41:05] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:41:05] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:41:06] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:41:06] PROBLEM - salt-minion processes on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:41:25] RECOVERY - nutcracker port on mw1002 is OK: TCP OK - 0.112 second response time on port 11212 [13:41:25] RECOVERY - DPKG on mw1016 is OK: All packages OK [13:41:44] RECOVERY - dhclient process on mw1016 is OK: PROCS OK: 0 processes with command name dhclient [13:42:35] PROBLEM - salt-minion processes on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:35] PROBLEM - Disk space on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:36] PROBLEM - nutcracker port on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:55] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:04] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:46:45] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:47:35] PROBLEM - nutcracker port on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:47:44] PROBLEM - DPKG on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:47:55] PROBLEM - dhclient process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:48:04] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient [13:48:44] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212 [13:49:04] RECOVERY - DPKG on mw1005 is OK: All packages OK [13:49:15] RECOVERY - Disk space on mw1005 is OK: DISK OK [13:49:15] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up [13:49:15] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [13:49:15] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:49:24] RECOVERY - RAID on mw1005 is OK: OK: no RAID installed [13:49:54] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [13:52:05] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:00:54] PROBLEM - NTP on mw1008 is CRITICAL: NTP CRITICAL: No response from NTP server [14:01:24] PROBLEM - NTP on mw1014 is CRITICAL: NTP CRITICAL: No response from NTP server [14:17:35] PROBLEM - NTP on mw1002 is CRITICAL: NTP CRITICAL: No response from NTP server [14:19:35] RECOVERY - nutcracker process on mw1012 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [14:21:44] RECOVERY - NTP on mw1002 is OK: NTP OK: Offset 0.0004699230194 secs [14:22:35] RECOVERY - dhclient process on mw1002 is OK: PROCS OK: 0 processes with command name dhclient [14:22:54] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:25:54] PROBLEM - nutcracker process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:28:54] PROBLEM - dhclient process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:29:05] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:36:44] RECOVERY - NTP on mw1014 is OK: NTP OK: Offset -0.2307537794 secs [14:37:25] RECOVERY - Disk space on mw1014 is OK: DISK OK [14:43:35] PROBLEM - Disk space on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:56:44] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:56:44] PROBLEM - puppet last run on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:57:44] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:57:55] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:57:56] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:05] PROBLEM - RAID on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:25] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:58:45] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient [15:00:05] RECOVERY - Disk space on mw1005 is OK: DISK OK [15:03:45] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:04:55] RECOVERY - Disk space on mw1012 is OK: DISK OK [15:05:54] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212 [15:09:25] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:09:36] RECOVERY - NTP on mw1008 is OK: NTP OK: Offset -0.2316396236 secs [15:11:15] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:14:06] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:15:46] <_joe_> something in the jobqueue is actively killing the jobrunners [15:15:55] RECOVERY - nutcracker process on mw1012 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:16:05] <_joe_> I'm honestly not lucid enough to investigate [15:16:24] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212 [15:16:34] <_joe_> and a rolling restart doesn't seem like a potential fix if not for a short amount of time [15:16:54] RECOVERY - RAID on mw1012 is OK: OK: no RAID installed [15:16:54] RECOVERY - configured eth on mw1012 is OK: OK - interfaces up [15:16:55] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:17:15] RECOVERY - Disk space on mw1012 is OK: DISK OK [15:17:34] RECOVERY - dhclient process on mw1012 is OK: PROCS OK: 0 processes with command name dhclient [15:17:34] RECOVERY - DPKG on mw1012 is OK: All packages OK [15:17:54] RECOVERY - SSH on mw1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [15:18:55] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:19:56] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:22:15] RECOVERY - dhclient process on mw1014 is OK: PROCS OK: 0 processes with command name dhclient [15:23:54] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient [15:25:05] RECOVERY - Disk space on mw1005 is OK: DISK OK [15:28:34] PROBLEM - dhclient process on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:28:36] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212 [15:29:25] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:29:25] PROBLEM - salt-minion processes on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:31:35] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:35:05] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:35:44] RECOVERY - Disk space on mw1005 is OK: DISK OK [15:37:54] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:38:34] _joe_, morning/evening - this has been going on for a couple hours, any ideas? [15:38:35] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:41:07] <_joe_> myrcx: it's memory consumption due to some job in the jobqueue, but I'm just off of a 16 hours flight and 9 hours of jetlag, so I'm not really able to get farther than that [15:41:28] <_joe_> myrcx: nothing compromising the immediate functionality of the sites, anyways [15:41:55] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:42:06] well thats good to hear _joe_ [15:42:33] 16 hour flight? o.O [15:44:06] PROBLEM - salt-minion processes on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:44:15] PROBLEM - NTP on mw1016 is CRITICAL: NTP CRITICAL: No response from NTP server [15:45:14] PROBLEM - NTP on mw1002 is CRITICAL: NTP CRITICAL: No response from NTP server [15:46:15] RECOVERY - NTP on mw1016 is OK: NTP OK: Offset 0.003399848938 secs [15:56:46] RECOVERY - dhclient process on mw1002 is OK: PROCS OK: 0 processes with command name dhclient [15:56:46] RECOVERY - nutcracker process on mw1002 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:56:55] RECOVERY - RAID on mw1002 is OK: OK: no RAID installed [15:56:56] RECOVERY - configured eth on mw1002 is OK: OK - interfaces up [15:57:04] RECOVERY - SSH on mw1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [15:57:04] RECOVERY - nutcracker port on mw1002 is OK: TCP OK - 0.000 second response time on port 11212 [15:57:44] RECOVERY - salt-minion processes on mw1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:57:45] RECOVERY - NTP on mw1002 is OK: NTP OK: Offset -0.001879572868 secs [15:58:04] RECOVERY - DPKG on mw1002 is OK: All packages OK [15:58:15] RECOVERY - Disk space on mw1002 is OK: DISK OK [16:00:24] PROBLEM - NTP on mw1014 is CRITICAL: NTP CRITICAL: No response from NTP server [16:01:36] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:03:15] PROBLEM - RAID on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:03:25] PROBLEM - puppet last run on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:05:14] RECOVERY - RAID on mw1015 is OK: OK: no RAID installed [16:05:25] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 26 minutes ago with 0 failures [16:06:25] RECOVERY - dhclient process on mw1008 is OK: PROCS OK: 0 processes with command name dhclient [16:06:25] RECOVERY - configured eth on mw1008 is OK: OK - interfaces up [16:06:25] RECOVERY - DPKG on mw1008 is OK: All packages OK [16:06:26] RECOVERY - Disk space on mw1008 is OK: DISK OK [16:06:33] operations, Performance-Team, Wikimedia-General-or-Unknown, Patch-For-Review: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1923406 (jcrespo) :-/ https://grafana.wikimedia.org/dashboard/db/job-queue-health [16:06:34] RECOVERY - RAID on mw1008 is OK: OK: no RAID installed [16:06:35] PROBLEM - puppet last run on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:06:35] RECOVERY - SSH on mw1008 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [16:07:24] PROBLEM - RAID on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:07:25] RECOVERY - nutcracker process on mw1008 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:07:45] RECOVERY - nutcracker port on mw1008 is OK: TCP OK - 0.000 second response time on port 11212 [16:07:45] RECOVERY - salt-minion processes on mw1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:08:06] PROBLEM - SSH on mw1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:09:36] PROBLEM - configured eth on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:09:45] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:10:53] operations, Performance-Team, Wikimedia-General-or-Unknown, Patch-For-Review: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1923408 (Samtar) @jcrespo getting a **tonne** of timeout alerts for RAID etc in #wikimedia-operations for mw1008, mw1015 and mw1004 - related? [16:12:35] PROBLEM - nutcracker process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:13:45] PROBLEM - DPKG on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:14:34] RECOVERY - nutcracker process on mw1004 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:15:54] RECOVERY - DPKG on mw1004 is OK: All packages OK [16:16:34] RECOVERY - SSH on mw1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [16:22:06] PROBLEM - DPKG on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:22:55] PROBLEM - dhclient process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:22:55] PROBLEM - nutcracker process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:23:25] PROBLEM - salt-minion processes on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:24:05] RECOVERY - DPKG on mw1004 is OK: All packages OK [16:24:14] RECOVERY - configured eth on mw1004 is OK: OK - interfaces up [16:25:25] RECOVERY - salt-minion processes on mw1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:28:34] PROBLEM - NTP on mw1016 is CRITICAL: NTP CRITICAL: No response from NTP server [16:30:35] PROBLEM - DPKG on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:04] PROBLEM - salt-minion processes on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:33:34] RECOVERY - nutcracker process on mw1004 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:33:34] RECOVERY - dhclient process on mw1004 is OK: PROCS OK: 0 processes with command name dhclient [16:34:04] RECOVERY - salt-minion processes on mw1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:36:46] RECOVERY - RAID on mw1004 is OK: OK: no RAID installed [16:37:35] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet has 57 failures [16:39:04] RECOVERY - DPKG on mw1004 is OK: All packages OK [16:39:46] PROBLEM - SSH on mw1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:41:44] RECOVERY - SSH on mw1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [16:43:06] PROBLEM - RAID on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:44:24] PROBLEM - NTP on mw1009 is CRITICAL: NTP CRITICAL: No response from NTP server [16:45:05] PROBLEM - configured eth on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:45:06] PROBLEM - RAID on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:45:24] PROBLEM - configured eth on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:45:44] PROBLEM - dhclient process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:45:45] PROBLEM - DPKG on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:46:04] PROBLEM - SSH on mw1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:46:35] PROBLEM - nutcracker port on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:47:25] PROBLEM - DPKG on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:47:25] PROBLEM - Disk space on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:48:04] PROBLEM - SSH on mw1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:16] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:49:34] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:50:14] PROBLEM - nutcracker process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:50:44] RECOVERY - NTP on mw1014 is OK: NTP OK: Offset 0.005953550339 secs [16:50:44] RECOVERY - nutcracker port on mw1004 is OK: TCP OK - 0.000 second response time on port 11212 [16:51:26] RECOVERY - Disk space on mw1004 is OK: DISK OK [16:51:34] RECOVERY - Disk space on mw1014 is OK: DISK OK [16:51:34] RECOVERY - configured eth on mw1014 is OK: OK - interfaces up [16:51:44] RECOVERY - DPKG on mw1014 is OK: All packages OK [16:51:45] RECOVERY - RAID on mw1014 is OK: OK: no RAID installed [16:51:54] RECOVERY - SSH on mw1014 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [16:52:05] RECOVERY - SSH on mw1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [16:52:14] RECOVERY - SSH on mw1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [16:52:14] RECOVERY - nutcracker process on mw1012 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:52:34] RECOVERY - salt-minion processes on mw1014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:52:34] RECOVERY - dhclient process on mw1014 is OK: PROCS OK: 0 processes with command name dhclient [16:52:35] RECOVERY - nutcracker process on mw1014 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:52:44] RECOVERY - nutcracker port on mw1014 is OK: TCP OK - 0.000 second response time on port 11212 [16:53:16] RECOVERY - configured eth on mw1012 is OK: OK - interfaces up [16:53:25] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args 
^/usr/bin/python /usr/bin/salt-minion [16:54:04] RECOVERY - dhclient process on mw1012 is OK: PROCS OK: 0 processes with command name dhclient [16:54:55] PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:55:36] RECOVERY - configured eth on mw1004 is OK: OK - interfaces up [16:55:36] RECOVERY - DPKG on mw1004 is OK: All packages OK [16:56:15] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:56:55] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212 [16:57:36] RECOVERY - RAID on mw1004 is OK: OK: no RAID installed [16:58:34] PROBLEM - SSH on mw1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:58:34] PROBLEM - nutcracker process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:59:44] PROBLEM - configured eth on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:02:06] RECOVERY - DPKG on mw1009 is OK: All packages OK [17:02:14] RECOVERY - SSH on mw1009 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [17:02:15] RECOVERY - dhclient process on mw1009 is OK: PROCS OK: 0 processes with command name dhclient [17:02:35] RECOVERY - nutcracker process on mw1009 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [17:02:46] PROBLEM - SSH on mw1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:05] RECOVERY - NTP on mw1009 is OK: NTP OK: Offset 0.02224814892 secs [17:03:06] RECOVERY - nutcracker port on mw1009 is OK: TCP OK - 0.000 second response time on port 11212 [17:03:15] RECOVERY - salt-minion processes on mw1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:03:15] RECOVERY - Disk space on mw1009 is OK: DISK OK [17:03:15] RECOVERY - configured eth on mw1009 is OK: OK - interfaces up [17:03:15] RECOVERY - RAID on mw1009 is OK: OK: no RAID installed [17:03:25] 
PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:03:55] RECOVERY - configured eth on mw1012 is OK: OK - interfaces up [17:04:05] PROBLEM - RAID on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:04:36] PROBLEM - dhclient process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:05:04] PROBLEM - dhclient process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:05:04] PROBLEM - nutcracker process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:05:35] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:06:24] PROBLEM - configured eth on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:06:24] PROBLEM - DPKG on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:07:34] PROBLEM - salt-minion processes on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:07:35] PROBLEM - nutcracker port on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:10:14] PROBLEM - configured eth on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:10:16] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:10:54] RECOVERY - dhclient process on mw1012 is OK: PROCS OK: 0 processes with command name dhclient [17:12:35] PROBLEM - Disk space on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:13:44] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212 [17:17:05] PROBLEM - dhclient process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:17:06] RECOVERY - DPKG on mw1012 is OK: All packages OK [17:17:25] RECOVERY - nutcracker process on mw1012 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [17:19:55] PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:20:54] RECOVERY - Disk space on mw1012 is OK: DISK OK [17:21:34] PROBLEM - NTP on mw1005 is CRITICAL: NTP CRITICAL: No response from NTP server [17:23:27] PROBLEM - DPKG on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:23:35] PROBLEM - nutcracker process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:27:14] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:32:35] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212 [17:32:35] RECOVERY - salt-minion processes on mw1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:33:26] RECOVERY - nutcracker port on mw1010 is OK: TCP OK - 0.000 second response time on port 11212 [17:34:04] RECOVERY - configured eth on mw1010 is OK: OK - interfaces up [17:34:04] RECOVERY - DPKG on mw1010 is OK: All packages OK [17:34:04] RECOVERY - RAID on mw1010 is OK: OK: no RAID installed [17:34:25] RECOVERY - Check size of conntrack table on mw1010 is OK: OK: nf_conntrack is 13 % full [17:34:25] RECOVERY - dhclient process on mw1010 is OK: PROCS OK: 0 processes with command name dhclient [17:34:25] RECOVERY - SSH on mw1010 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [17:34:26] RECOVERY - nutcracker process on mw1010 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [17:34:26] RECOVERY - Disk space on mw1010 is OK: DISK OK [17:36:35] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:38:45] PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout 
after 10 seconds. [17:42:17] RECOVERY - dhclient process on mw1012 is OK: PROCS OK: 0 processes with command name dhclient [17:42:25] PROBLEM - dhclient process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:42:26] PROBLEM - nutcracker process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:48:25] PROBLEM - dhclient process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:03:14] RECOVERY - SSH on mw1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [18:03:15] RECOVERY - Disk space on mw1013 is OK: DISK OK [18:03:45] RECOVERY - dhclient process on mw1013 is OK: PROCS OK: 0 processes with command name dhclient [18:03:45] RECOVERY - nutcracker process on mw1013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [18:03:45] RECOVERY - RAID on mw1013 is OK: OK: no RAID installed [18:04:05] RECOVERY - nutcracker port on mw1013 is OK: TCP OK - 0.000 second response time on port 11212 [18:04:54] RECOVERY - salt-minion processes on mw1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:04:55] PROBLEM - NTP on mw1016 is CRITICAL: NTP CRITICAL: No response from NTP server [18:04:56] RECOVERY - configured eth on mw1013 is OK: OK - interfaces up [18:04:56] RECOVERY - DPKG on mw1013 is OK: All packages OK [18:07:54] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:20:54] PROBLEM - NTP on mw1004 is CRITICAL: NTP CRITICAL: No response from NTP server [18:24:06] RECOVERY - NTP on mw1005 is OK: NTP OK: Offset -0.4870939255 secs [18:31:24] RECOVERY - NTP on mw1004 is OK: NTP OK: Offset 0.001511096954 secs [18:35:25] RECOVERY - DPKG on mw1005 is OK: All packages OK [18:36:05] RECOVERY - Disk space on mw1005 is OK: DISK OK [18:36:05] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up [18:36:05] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 
1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:36:05] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [18:36:05] RECOVERY - RAID on mw1005 is OK: OK: no RAID installed [18:36:15] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [18:36:44] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient [18:37:04] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212 [18:37:05] RECOVERY - Disk space on mw1016 is OK: DISK OK [18:37:05] RECOVERY - nutcracker port on mw1016 is OK: TCP OK - 0.000 second response time on port 11212 [18:37:05] RECOVERY - nutcracker process on mw1016 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [18:37:05] RECOVERY - salt-minion processes on mw1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:38:14] RECOVERY - NTP on mw1016 is OK: NTP OK: Offset 0.008183121681 secs [18:38:14] RECOVERY - SSH on mw1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [18:38:14] RECOVERY - configured eth on mw1016 is OK: OK - interfaces up [18:38:45] RECOVERY - RAID on mw1016 is OK: OK: no RAID installed [18:38:45] RECOVERY - DPKG on mw1016 is OK: All packages OK [18:38:46] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:38:46] RECOVERY - dhclient process on mw1016 is OK: PROCS OK: 0 processes with command name dhclient [18:41:14] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [19:02:55] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 13 failures [19:03:34] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: puppet fail [19:08:46] PROBLEM - RAID on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 
seconds. [19:09:05] PROBLEM - puppet last run on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:09:24] PROBLEM - SSH on mw1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:15] PROBLEM - dhclient process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:11:55] PROBLEM - configured eth on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:12:04] PROBLEM - DPKG on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:12:05] PROBLEM - RAID on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:12:14] PROBLEM - nutcracker port on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:12:15] PROBLEM - SSH on mw1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:15] PROBLEM - DPKG on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:12:24] PROBLEM - configured eth on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:12:24] PROBLEM - Disk space on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:13:14] PROBLEM - NTP on mw1004 is CRITICAL: NTP CRITICAL: No response from NTP server [19:13:14] PROBLEM - RAID on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:13:15] PROBLEM - configured eth on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:13:15] PROBLEM - SSH on mw1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:13:15] PROBLEM - nutcracker port on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:13:24] PROBLEM - salt-minion processes on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:13:35] PROBLEM - puppet last run on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:13:57] PROBLEM - nutcracker process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:13:58] PROBLEM - salt-minion processes on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:14:04] PROBLEM - dhclient process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:14:34] PROBLEM - Disk space on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:14:34] PROBLEM - DPKG on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:17:14] PROBLEM - dhclient process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:17:15] PROBLEM - nutcracker process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:17:25] PROBLEM - nutcracker port on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:18:45] PROBLEM - salt-minion processes on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:19:25] RECOVERY - nutcracker port on mw1008 is OK: TCP OK - 0.000 second response time on port 11212 [19:19:35] RECOVERY - salt-minion processes on mw1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:20:04] RECOVERY - nutcracker process on mw1015 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:20:14] RECOVERY - dhclient process on mw1008 is OK: PROCS OK: 0 processes with command name dhclient [19:24:45] RECOVERY - configured eth on mw1015 is OK: OK - interfaces up [19:24:46] RECOVERY - Disk space on mw1015 is OK: DISK OK [19:24:54] RECOVERY - DPKG on mw1015 is OK: All packages OK [19:24:54] RECOVERY - salt-minion processes on mw1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:25:35] PROBLEM - nutcracker process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:25:44] PROBLEM - nutcracker port on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:25:45] RECOVERY - dhclient process on mw1015 is OK: PROCS OK: 0 processes with command name dhclient [19:25:46] PROBLEM - salt-minion processes on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:26:10] Can someone revive grrrit-wm ? [19:26:20] I apparently no longer have access. I guess the bot moved to a different host. [19:26:35] PROBLEM - Disk space on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:26:35] RECOVERY - nutcracker port on mw1015 is OK: TCP OK - 0.000 second response time on port 11212 [19:27:17] Leah: k8s says 'no' :( [19:27:35] RECOVERY - dhclient process on mw1002 is OK: PROCS OK: 0 processes with command name dhclient [19:28:05] I don't understand. [19:28:35] PROBLEM - dhclient process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:28:58] Leah: https://phabricator.wikimedia.org/T123167 [19:29:21] Thx. [19:29:44] RECOVERY - nutcracker process on mw1002 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:29:44] RECOVERY - configured eth on mw1002 is OK: OK - interfaces up [19:29:45] RECOVERY - nutcracker process on mw1008 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:29:54] RECOVERY - nutcracker port on mw1002 is OK: TCP OK - 0.000 second response time on port 11212 [19:29:54] RECOVERY - nutcracker port on mw1008 is OK: TCP OK - 0.000 second response time on port 11212 [19:30:04] RECOVERY - salt-minion processes on mw1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:30:44] RECOVERY - dhclient process on mw1008 is OK: PROCS OK: 0 processes with command name dhclient [19:30:53] valhallasw`cloud: I tried skimming https://wikitech.wikimedia.org/wiki/Grrrit-wm but there's so much goddamn abstraction. [19:31:07] Whatever happened to like a single Python file? Bleh. 
[19:31:24] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 8 failures [19:31:32] Leah: a) it's the test case for k8s, b) grrrit-wm has always been a tad complicated [19:31:51] Am I supposed to know what k8s is? I keep reading it as Kates. [19:32:03] Kubernetes? [19:32:10] *nod* [19:32:13] Speaking of abstraction! [19:32:52] it's shorter! ;-) [19:32:55] RECOVERY - Disk space on mw1008 is OK: DISK OK [19:33:49] Maybe I'll write a replacement bot that doesn't require Docker, SGE, and 14 other technologies. [19:35:45] Leah: and then you have no-where to run it because tool labs requires you to run something not-in-a-screen ;-) [19:35:55] RECOVERY - RAID on mw1015 is OK: OK: no RAID installed [19:36:05] PROBLEM - dhclient process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:14] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 57 minutes ago with 0 failures [19:36:14] PROBLEM - nutcracker process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:15] PROBLEM - configured eth on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:15] PROBLEM - nutcracker port on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:16] PROBLEM - nutcracker port on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:24] RECOVERY - SSH on mw1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [19:36:25] PROBLEM - salt-minion processes on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:31] Right, screen is too easy. We need more hipster technologies that nobody knows how to operate. :| [19:37:25] RECOVERY - Disk space on mw1002 is OK: DISK OK [19:37:33] Apparently bastion's fingerprint changed. I got in, but can't access tools-k8s-master-01.tools.eqiad.wmflabs. [19:37:39] I'll leave a note on the talk page. 
[19:38:04] RECOVERY - dhclient process on mw1002 is OK: PROCS OK: 0 processes with command name dhclient [19:38:05] Oh, "tool labs admin"... great. [19:38:14] RECOVERY - nutcracker process on mw1002 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:38:15] RECOVERY - nutcracker port on mw1002 is OK: TCP OK - 0.000 second response time on port 11212 [19:38:16] RECOVERY - configured eth on mw1002 is OK: OK - interfaces up [19:38:24] RECOVERY - SSH on mw1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [19:38:30] !log restarting hhvm on jobrunners again [19:38:34] <_joe_> Leah: I don't think kubernetes qualifies as "hipster technology". SGE qualifies as "crappy abandonware we abuse", on the other hand [19:38:50] <_joe_> paravoid: I don't think that would do much good [19:38:55] RECOVERY - salt-minion processes on mw1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:39:07] _joe_: I meant Docker and SGE. :P [19:39:15] PROBLEM - Disk space on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:39:15] <_joe_> paravoid: I saw mw1005 go OOM after ~1 hour [19:39:24] RECOVERY - DPKG on mw1002 is OK: All packages OK [19:40:15] RECOVERY - RAID on mw1002 is OK: OK: no RAID installed [19:40:25] PROBLEM - nutcracker process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:40:44] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [19:41:15] PROBLEM - dhclient process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:44:31] !log powercycling mw1004, mw1008, mw1012 [19:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:46:44] RECOVERY - salt-minion processes on mw1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:47:04] RECOVERY - configured eth on mw1012 is OK: OK - interfaces up [19:47:04] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:47:04] RECOVERY - RAID on mw1004 is OK: OK: no RAID installed [19:47:04] RECOVERY - RAID on mw1012 is OK: OK: no RAID installed [19:47:05] RECOVERY - Disk space on mw1012 is OK: DISK OK [19:47:05] RECOVERY - Disk space on mw1004 is OK: DISK OK [19:47:05] RECOVERY - configured eth on mw1004 is OK: OK - interfaces up [19:47:06] RECOVERY - DPKG on mw1004 is OK: All packages OK [19:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09, Master <-- wtf [19:47:15] RECOVERY - dhclient process on mw1008 is OK: PROCS OK: 0 processes with command name dhclient [19:47:15] RECOVERY - configured eth on mw1008 is OK: OK - interfaces up [19:47:24] RECOVERY - Disk space on mw1008 is OK: DISK OK [19:47:24] RECOVERY - DPKG on mw1008 is OK: All packages OK [19:47:25] RECOVERY - RAID on mw1008 is OK: OK: no RAID installed [19:47:30] https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09 Fascinating. 
[19:47:35] RECOVERY - SSH on mw1008 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [19:47:35] RECOVERY - dhclient process on mw1012 is OK: PROCS OK: 0 processes with command name dhclient [19:47:45] RECOVERY - SSH on mw1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [19:47:45] RECOVERY - DPKG on mw1012 is OK: All packages OK [19:47:46] RECOVERY - SSH on mw1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [19:47:46] RECOVERY - nutcracker process on mw1012 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:47:46] RECOVERY - dhclient process on mw1004 is OK: PROCS OK: 0 processes with command name dhclient [19:47:46] RECOVERY - nutcracker process on mw1004 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:47:55] RECOVERY - nutcracker port on mw1004 is OK: TCP OK - 0.000 second response time on port 11212 [19:47:55] RECOVERY - salt-minion processes on mw1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:47:55] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212 [19:48:34] RECOVERY - nutcracker process on mw1008 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:48:35] RECOVERY - nutcracker port on mw1008 is OK: TCP OK - 0.000 second response time on port 11212 [19:49:53] (Fixed.) 
[19:50:44] RECOVERY - NTP on mw1004 is OK: NTP OK: Offset -0.0008457899094 secs [19:51:55] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:52:04] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:58:15] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:01:45] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:07:55] (PS1) Mdann52: Fix $wgSitename for my.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/263237 (https://phabricator.wikimedia.org/T123191) [20:08:12] (CR) jenkins-bot: [V: -1] Fix $wgSitename for my.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/263237 (https://phabricator.wikimedia.org/T123191) (owner: Mdann52) [20:18:26] (PS2) Mdann52: Fix $wgSitename for my.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/263237 (https://phabricator.wikimedia.org/T123191) [20:24:02] (CR) Luke081515: [C: 1] "Looks good to me (but I can not see the new sitename)." [mediawiki-config] - https://gerrit.wikimedia.org/r/263237 (https://phabricator.wikimedia.org/T123191) (owner: Mdann52) [20:25:09] (CR) Luke081515: "To clarify: I can not see the new sitename => not possible at my computer, not a fault at the code."
[mediawiki-config] - https://gerrit.wikimedia.org/r/263237 (https://phabricator.wikimedia.org/T123191) (owner: Mdann52) [20:28:11] (CR) Alex Monk: [C: 1] Set logos for mobile login page for Wikidata and Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/263201 (https://phabricator.wikimedia.org/T123175) (owner: Aude) [20:28:29] (CR) Luke081515: [C: 1] Added noindex rule for uawikimedia's user namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/261902 (https://phabricator.wikimedia.org/T122732) (owner: Base) [20:29:58] (CR) Luke081515: [C: 1] Get rid of old unused $wgAllowed* variables [mediawiki-config] - https://gerrit.wikimedia.org/r/256853 (https://phabricator.wikimedia.org/T50493) (owner: Alex Monk) [20:30:12] (PS2) Alex Monk: Enable Wikilove extenstion on es.wikivoyage.org [mediawiki-config] - https://gerrit.wikimedia.org/r/262894 (https://phabricator.wikimedia.org/T122765) (owner: Mdann52) [20:30:14] (PS4) Alex Monk: additional import sources for kn.wikisource.org [mediawiki-config] - https://gerrit.wikimedia.org/r/262895 (https://phabricator.wikimedia.org/T122955) (owner: Mdann52) [20:31:06] (CR) Alex Monk: [C: 1] Prepare for merge of ApiSandbox into core [mediawiki-config] - https://gerrit.wikimedia.org/r/262999 (owner: Anomie) [20:33:20] (CR) Alex Monk: [C: 1] Fix $wgSitename for my.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/263237 (https://phabricator.wikimedia.org/T123191) (owner: Mdann52) [20:34:56] (PS3) Alex Monk: Enable WikiLove extension on es.wikivoyage.org [mediawiki-config] - https://gerrit.wikimedia.org/r/262894 (https://phabricator.wikimedia.org/T122765) (owner: Mdann52) [20:35:50] (PS4) Alex Monk: Enable WikiLove extension on es.wikivoyage.org [mediawiki-config] - https://gerrit.wikimedia.org/r/262894 (https://phabricator.wikimedia.org/T122765) (owner: Mdann52) [20:38:31] (CR) Alex Monk: [C: 1] Enable
WikiLove extension on es.wikivoyage.org [mediawiki-config] - https://gerrit.wikimedia.org/r/262894 (https://phabricator.wikimedia.org/T122765) (owner: Mdann52) [20:42:14] (PS5) Alex Monk: additional import sources for kn.wikisource.org [mediawiki-config] - https://gerrit.wikimedia.org/r/262895 (https://phabricator.wikimedia.org/T122955) (owner: Mdann52) [20:42:54] (CR) Alex Monk: [C: 1] additional import sources for kn.wikisource.org [mediawiki-config] - https://gerrit.wikimedia.org/r/262895 (https://phabricator.wikimedia.org/T122955) (owner: Mdann52) [20:51:42] (CR) Alex Monk: [C: -1] "see task - logo inconsistency" [mediawiki-config] - https://gerrit.wikimedia.org/r/263051 (https://phabricator.wikimedia.org/T122476) (owner: Mdann52) [21:10:16] (PS1) Tim Landscheidt: gridengine: Fix status check for gridengine-master [puppet] - https://gerrit.wikimedia.org/r/263242 [21:56:45] RECOVERY - DPKG on restbase1007 is OK: All packages OK [21:58:16] expect me to be in here late or at weird hours tomorrow, I just got in about 1.5 hours ago from long trip with (again) delays and jet lag should be loooooads of fun [21:58:45] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:00:05] RECOVERY - DPKG on restbase1005 is OK: All packages OK [22:00:11] !log restbase: 1005-1009 now on node 4.2 [22:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:51] Blocked-on-Operations, operations, RESTBase, Services: Switch RESTBase to use Node.js 4.2 - https://phabricator.wikimedia.org/T107762#1923674 (GWicke) According to logstash, the heap limit is reached less often on nodes running 4.2. Current migration status in Eqiad: - 1001 - 1004: node 0.10 - 100...
[22:05:44] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [100000000.0] [22:11:04] ^ that might be me [22:11:28] (I'm grepping lighttpd error logs for https://phabricator.wikimedia.org/T104799#1923667, and there's quite a few large files...) [22:17:24] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2577.15 ms [22:20:05] RECOVERY - puppet last run on restbase1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:20:46] gwicke: I thought we agreed to wait until next week [22:20:55] also, it's the weekend [22:21:05] plus people are flying [22:21:07] come on [22:21:27] paravoid: only 1/2 of the nodes are switched [22:22:00] I saw, but we discussed this in person and agreed to stop at one node until next week [22:22:19] I don't remember discussing stopping at one node [22:22:27] even if we hadn't, changing node on half of the restbase fleet on a sunday is not ok [22:22:40] we agreed not to convert all / most before the weekend [22:22:56] can we apply some common sense please? [22:23:02] I'm sitting on an airport lounge [22:23:06] I switched 2/9 today [22:23:10] the rest of my team is home recovering from jetlag [22:23:13] and it's a sunday [22:23:28] do you need more reasons to not do changes live in prod? 
[22:23:36] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 224.13 ms [22:23:48] paravoid: I wouldn't have switched it if I wasn't around to monitor it [22:23:59] so you can relax & travel [22:25:21] this is just reckless behavior and makes me deeply unhappy [22:25:33] we've been over this before and I thought we had reached an understanding [22:26:26] I'm honestly less happy about switching all of scb at once [22:26:37] it's a lot safer to switch one node at a time, imho [22:27:28] the mobile content service is doing okay, but if it wasn't we could have gotten an outage from that [22:28:11] that's really orthogonal to this isn't it [22:28:22] my flight is boarding in two minutes, so I'm gonna go now [22:28:24] bye [22:28:35] paravoid: have a good flight! [22:38:14] PROBLEM - SSH on mw1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:38:54] PROBLEM - puppet last run on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:54] PROBLEM - DPKG on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:55] PROBLEM - configured eth on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:39:24] PROBLEM - RAID on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:40:14] RECOVERY - SSH on mw1007 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [22:42:55] RECOVERY - configured eth on mw1007 is OK: OK - interfaces up [22:42:55] RECOVERY - DPKG on mw1007 is OK: All packages OK [22:46:15] PROBLEM - Disk space on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:46:24] PROBLEM - SSH on mw1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:46:55] PROBLEM - nutcracker port on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:47:06] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [75000000.0] [22:48:15] RECOVERY - Disk space on mw1007 is OK: DISK OK [22:48:25] PROBLEM - puppet last run on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:48:45] RECOVERY - nutcracker port on mw1007 is OK: TCP OK - 0.056 second response time on port 11212 [22:50:26] RECOVERY - SSH on mw1007 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [22:56:45] PROBLEM - SSH on mw1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:59:55] PROBLEM - configured eth on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:00:05] PROBLEM - SSH on mw1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:01:55] PROBLEM - configured eth on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:03:24] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 3437.28 ms [23:03:55] PROBLEM - DPKG on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:03:55] PROBLEM - dhclient process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:03:55] PROBLEM - nutcracker process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:04:05] RECOVERY - configured eth on mw1013 is OK: OK - interfaces up [23:04:15] PROBLEM - salt-minion processes on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:05:06] PROBLEM - Disk space on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:05:06] PROBLEM - RAID on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:05:14] PROBLEM - puppet last run on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:06:24] PROBLEM - DPKG on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:06:46] PROBLEM - RAID on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:07:25] PROBLEM - nutcracker port on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:08:25] PROBLEM - puppet last run on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:08:35] PROBLEM - salt-minion processes on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:09:05] PROBLEM - Disk space on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:09:15] PROBLEM - DPKG on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:09:54] PROBLEM - nutcracker port on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:09:55] PROBLEM - dhclient process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:10:05] PROBLEM - nutcracker process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:10:05] PROBLEM - SSH on mw1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:10:05] PROBLEM - nutcracker port on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:10:05] PROBLEM - configured eth on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:10:05] PROBLEM - RAID on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:10:25] PROBLEM - configured eth on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:10:36] PROBLEM - Disk space on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:11:25] PROBLEM - puppet last run on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:12:16] PROBLEM - SSH on mw1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:25] PROBLEM - configured eth on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:13:15] PROBLEM - nutcracker process on mw1016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:13:15] PROBLEM - salt-minion processes on mw1016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[23:13:15] PROBLEM - nutcracker port on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:13:15] PROBLEM - Disk space on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:15:05] PROBLEM - DPKG on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:15:06] PROBLEM - dhclient process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:15:25] PROBLEM - dhclient process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:15:25] PROBLEM - nutcracker process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:16:04] RECOVERY - Host mr1-ulsfo.oob is UP: PING WARNING - Packet loss = 16%, RTA = 1222.14 ms [23:17:14] PROBLEM - RAID on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:17:15] PROBLEM - dhclient process on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:17:25] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:17:36] PROBLEM - salt-minion processes on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:17:55] PROBLEM - configured eth on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:18:04] PROBLEM - nutcracker port on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:18:04] PROBLEM - Disk space on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:18:24] PROBLEM - puppet last run on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:18:24] PROBLEM - DPKG on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:18:25] PROBLEM - SSH on mw1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:18:44] PROBLEM - salt-minion processes on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:18:55] PROBLEM - nutcracker process on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:19:54] RECOVERY - nutcracker port on mw1013 is OK: TCP OK - 0.000 second response time on port 11212 [23:21:34] RECOVERY - Disk space on mw1016 is OK: DISK OK [23:21:34] RECOVERY - nutcracker port on mw1016 is OK: TCP OK - 0.000 second response time on port 11212 [23:21:35] RECOVERY - salt-minion processes on mw1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:21:35] RECOVERY - nutcracker process on mw1016 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [23:22:34] RECOVERY - SSH on mw1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [23:22:36] RECOVERY - configured eth on mw1016 is OK: OK - interfaces up [23:23:14] RECOVERY - DPKG on mw1016 is OK: All packages OK [23:23:15] RECOVERY - RAID on mw1016 is OK: OK: no RAID installed [23:23:15] RECOVERY - dhclient process on mw1016 is OK: PROCS OK: 0 processes with command name dhclient [23:23:45] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 48 minutes ago with 0 failures [23:26:05] PROBLEM - nutcracker port on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:29:14] RECOVERY - salt-minion processes on mw1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:30:15] RECOVERY - salt-minion processes on mw1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:31:55] RECOVERY - dhclient process on mw1003 is OK: PROCS OK: 0 processes with command name dhclient [23:32:06] RECOVERY - dhclient process on mw1013 is OK: PROCS OK: 0 processes with command name dhclient [23:32:06] RECOVERY - nutcracker process on mw1013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [23:32:14] RECOVERY - Disk space on mw1007 is OK: DISK OK [23:32:24] RECOVERY - SSH on mw1007 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [23:32:45] RECOVERY - nutcracker port on mw1007 is OK: TCP OK - 0.000 second response time on port 11212 [23:34:35] RECOVERY - configured eth on mw1003 is OK: OK - interfaces up [23:34:36] RECOVERY - nutcracker port on mw1003 is OK: TCP OK - 0.000 second response time on port 11212 [23:34:36] RECOVERY - Disk space on mw1003 is OK: DISK OK [23:35:04] RECOVERY - DPKG on mw1003 is OK: All packages OK [23:35:04] RECOVERY - SSH on mw1003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [23:35:04] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 48 minutes ago with 0 failures [23:35:34] RECOVERY - nutcracker process on mw1003 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [23:35:54] RECOVERY - RAID on mw1003 is OK: OK: no RAID installed [23:36:25] PROBLEM - puppet last run on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:38:24] PROBLEM - SSH on mw1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:38:25] PROBLEM - DPKG on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:38:25] PROBLEM - nutcracker process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:38:25] PROBLEM - dhclient process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:38:34] PROBLEM - Disk space on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:38:35] PROBLEM - SSH on mw1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:39:04] PROBLEM - nutcracker port on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:39:25] PROBLEM - configured eth on mw1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:39:44] PROBLEM - RAID on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:39:44] PROBLEM - salt-minion processes on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:41:45] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:41:54] RECOVERY - SSH on mw1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [23:42:15] PROBLEM - dhclient process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:42:34] PROBLEM - nutcracker process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:42:35] PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:43:25] PROBLEM - puppet last run on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:43:45] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:43:54] PROBLEM - NTP on mw1002 is CRITICAL: NTP CRITICAL: No response from NTP server [23:45:15] PROBLEM - puppet last run on mw1003 is CRITICAL: CRITICAL: Puppet has 60 failures [23:47:14] PROBLEM - RAID on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:47:35] PROBLEM - SSH on mw1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:48:05] PROBLEM - SSH on mw1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:50:04] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:50:04] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:53:14] RECOVERY - nutcracker port on mw1013 is OK: TCP OK - 0.000 second response time on port 11212 [23:53:36] RECOVERY - SSH on mw1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [23:54:04] RECOVERY - Disk space on mw1012 is OK: DISK OK [23:54:04] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:54:04] RECOVERY - configured eth on mw1012 is OK: OK - interfaces up [23:54:34] RECOVERY - dhclient process on mw1012 is OK: PROCS OK: 0 processes with command name dhclient [23:54:54] RECOVERY - dhclient process on mw1013 is OK: PROCS OK: 0 processes with command name dhclient [23:54:55] RECOVERY - nutcracker process on mw1013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [23:54:55] RECOVERY - RAID on mw1013 is OK: OK: no RAID installed [23:54:55] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [23:56:04] RECOVERY - salt-minion processes on mw1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:56:05] RECOVERY - configured eth on mw1013 is OK: OK - interfaces up [23:56:14] RECOVERY - DPKG on mw1013 is OK: All packages OK [23:56:14] RECOVERY - SSH on mw1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [23:56:14] RECOVERY - Disk space on mw1013 is OK: DISK OK [23:56:54] RECOVERY - SSH on mw1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [23:56:55] RECOVERY - nutcracker process on mw1012 is OK: PROCS OK: 1 
process with UID = 108 (nutcracker), command name nutcracker [23:56:56] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212 [23:56:56] RECOVERY - DPKG on mw1012 is OK: All packages OK [23:57:04] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 52 minutes ago with 0 failures [23:58:24] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 2081.67 ms