[00:00:00] RECOVERY - Disk space on analytics1047 is OK: DISK OK
[00:05:09] RECOVERY - configured eth on analytics1047 is OK: OK - interfaces up
[00:05:09] RECOVERY - Hadoop NodeManager on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[00:05:19] RECOVERY - salt-minion processes on analytics1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:05:59] RECOVERY - RAID on analytics1047 is OK: OK: optimal, 13 logical, 14 physical
[00:06:19] RECOVERY - Check size of conntrack table on analytics1047 is OK: OK: nf_conntrack is 0 % full
[00:06:19] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 41 minutes ago with 0 failures
[00:06:38] RECOVERY - DPKG on analytics1047 is OK: All packages OK
[00:06:39] RECOVERY - Hadoop DataNode on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[00:06:39] RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: OK: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: RUNNING
[00:06:39] RECOVERY - dhclient process on analytics1047 is OK: PROCS OK: 0 processes with command name dhclient
[00:06:51] RECOVERY - Disk space on Hadoop worker on analytics1047 is OK: DISK OK
[00:15:49] PROBLEM - RAID on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:15:49] PROBLEM - Disk space on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:16:09] PROBLEM - Check size of conntrack table on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:16:10] PROBLEM - puppet last run on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:16:29] PROBLEM - DPKG on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:16:29] PROBLEM - Hadoop DataNode on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:16:29] PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:16:30] PROBLEM - dhclient process on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:16:40] PROBLEM - Disk space on Hadoop worker on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:16:59] PROBLEM - Hadoop NodeManager on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:16:59] PROBLEM - configured eth on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:17:08] PROBLEM - salt-minion processes on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:27:19] RECOVERY - RAID on analytics1047 is OK: OK: optimal, 13 logical, 14 physical
[00:27:19] RECOVERY - Disk space on analytics1047 is OK: DISK OK
[00:27:48] RECOVERY - Check size of conntrack table on analytics1047 is OK: OK: nf_conntrack is 0 % full
[00:27:48] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures
[00:27:59] RECOVERY - DPKG on analytics1047 is OK: All packages OK
[00:27:59] RECOVERY - Hadoop DataNode on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[00:27:59] RECOVERY - dhclient process on analytics1047 is OK: PROCS OK: 0 processes with command name dhclient
[00:28:00] RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: OK: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: RUNNING
[00:28:18] RECOVERY - Disk space on Hadoop worker on analytics1047 is OK: DISK OK
[00:28:28] RECOVERY - Hadoop NodeManager on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[00:28:28] RECOVERY - configured eth on analytics1047 is OK: OK - interfaces up
[00:28:39] RECOVERY - salt-minion processes on analytics1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:15:29] !log applied Ibd302e1 to terbium for debugging broken wikidata rdf dumps
[01:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:21:38] PROBLEM - dhclient process on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:21:38] PROBLEM - Hadoop DataNode on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:21:48] PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:22:40] (PS1) Alex Monk: labs IP aliasing: Print error and continue when not able to get instances for a project [puppet] - https://gerrit.wikimedia.org/r/286263 (https://phabricator.wikimedia.org/T133946)
[01:23:58] PROBLEM - RAID on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:27:28] PROBLEM - DPKG on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:27:49] PROBLEM - Disk space on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:27:57] PROBLEM - configured eth on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:28:17] PROBLEM - salt-minion processes on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:28:18] PROBLEM - Disk space on Hadoop worker on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:28:47] PROBLEM - Hadoop NodeManager on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:28:57] PROBLEM - Check size of conntrack table on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:28:57] PROBLEM - puppet last run on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:30:40] RECOVERY - Hadoop NodeManager on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:30:47] RECOVERY - Check size of conntrack table on analytics1047 is OK: OK: nf_conntrack is 0 % full
[01:30:47] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 34 minutes ago with 0 failures
[01:35:50] PROBLEM - puppet last run on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:37:41] PROBLEM - Check size of conntrack table on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:39:01] RECOVERY - Disk space on Hadoop worker on analytics1047 is OK: DISK OK
[01:44:50] PROBLEM - Hadoop NodeManager on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:45:01] PROBLEM - Disk space on Hadoop worker on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:46:50] RECOVERY - Disk space on Hadoop worker on analytics1047 is OK: DISK OK
[01:50:02] Operations, Labs, Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2253068 (Krenair) While looking at {T99072} I found another: ```krenair@tools-bastion-03:~$ host 10.68.16.97 97.16.68.10.in-addr.arpa domain name pointer ci-jessi...
[01:52:50] PROBLEM - Disk space on Hadoop worker on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:54:41] RECOVERY - Disk space on Hadoop worker on analytics1047 is OK: DISK OK
[01:55:10] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 59 minutes ago with 0 failures
[01:55:10] RECOVERY - Check size of conntrack table on analytics1047 is OK: OK: nf_conntrack is 0 % full
[01:55:11] RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: CRITICAL: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: LOST
[01:55:11] RECOVERY - dhclient process on analytics1047 is OK: PROCS OK: 0 processes with command name dhclient
[01:55:32] RECOVERY - DPKG on analytics1047 is OK: All packages OK
[01:55:32] RECOVERY - salt-minion processes on analytics1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:00:41] PROBLEM - Disk space on Hadoop worker on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:01:20] PROBLEM - Check size of conntrack table on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:01:20] PROBLEM - puppet last run on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:01:20] PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:01:21] PROBLEM - dhclient process on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:01:50] PROBLEM - salt-minion processes on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:01:50] PROBLEM - DPKG on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:01] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[02:11:30] RECOVERY - Disk space on analytics1047 is OK: DISK OK
[02:11:30] RECOVERY - RAID on analytics1047 is OK: OK: optimal, 13 logical, 14 physical
[02:11:30] RECOVERY - salt-minion processes on analytics1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:11:30] RECOVERY - DPKG on analytics1047 is OK: All packages OK
[02:11:40] RECOVERY - configured eth on analytics1047 is OK: OK - interfaces up
[02:11:51] RECOVERY - Hadoop DataNode on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[02:12:11] RECOVERY - Hadoop NodeManager on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:12:20] RECOVERY - Disk space on Hadoop worker on analytics1047 is OK: DISK OK
[02:12:51] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 17 minutes ago with 0 failures
[02:12:51] RECOVERY - Check size of conntrack table on analytics1047 is OK: OK: nf_conntrack is 0 % full
[02:13:00] RECOVERY - dhclient process on analytics1047 is OK: PROCS OK: 0 processes with command name dhclient
[02:17:22] PROBLEM - RAID on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:20:40] RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: CRITICAL: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: LOST
[02:20:50] PROBLEM - Apache HTTP on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:22:02] PROBLEM - configured eth on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:22:11] PROBLEM - HHVM rendering on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:22:31] PROBLEM - RAID on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:22:51] PROBLEM - nutcracker port on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:23:10] PROBLEM - Check size of conntrack table on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:23:11] PROBLEM - nutcracker process on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:23:21] PROBLEM - SSH on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:23:22] PROBLEM - DPKG on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:23:32] PROBLEM - Disk space on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:23:32] PROBLEM - puppet last run on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:23:51] PROBLEM - HHVM processes on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:23:51] PROBLEM - dhclient process on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:23:52] PROBLEM - salt-minion processes on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:25:11] PROBLEM - salt-minion processes on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:25:11] PROBLEM - DPKG on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:25:11] PROBLEM - Disk space on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:25:21] PROBLEM - configured eth on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:25:40] PROBLEM - Hadoop DataNode on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:26:00] PROBLEM - Hadoop NodeManager on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:26:11] PROBLEM - Disk space on Hadoop worker on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:26:40] PROBLEM - Check size of conntrack table on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:26:40] PROBLEM - puppet last run on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:26:40] PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:26:41] PROBLEM - dhclient process on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:01] RECOVERY - nutcracker process on mw1134 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[02:28:00] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80759 MB (15% inode=99%)
[02:28:23] RECOVERY - Check size of conntrack table on analytics1047 is OK: OK: nf_conntrack is 0 % full
[02:28:23] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 33 minutes ago with 0 failures
[02:28:30] RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: OK: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: RUNNING
[02:28:31] RECOVERY - dhclient process on analytics1047 is OK: PROCS OK: 0 processes with command name dhclient
[02:28:51] RECOVERY - salt-minion processes on analytics1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:28:51] RECOVERY - DPKG on analytics1047 is OK: All packages OK
[02:28:51] RECOVERY - Disk space on analytics1047 is OK: DISK OK
[02:28:52] RECOVERY - RAID on analytics1047 is OK: OK: optimal, 13 logical, 14 physical
[02:29:02] RECOVERY - configured eth on analytics1047 is OK: OK - interfaces up
[02:29:30] RECOVERY - Hadoop DataNode on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[02:29:42] RECOVERY - Hadoop NodeManager on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:29:51] RECOVERY - Disk space on Hadoop worker on analytics1047 is OK: DISK OK
[02:33:11] PROBLEM - nutcracker process on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:41] PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:41] PROBLEM - puppet last run on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:41] PROBLEM - Check size of conntrack table on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:51] PROBLEM - dhclient process on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:35:11] PROBLEM - salt-minion processes on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:35:11] PROBLEM - DPKG on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:35:11] PROBLEM - RAID on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:35:11] PROBLEM - Disk space on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:35:21] PROBLEM - configured eth on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:35:41] PROBLEM - Hadoop DataNode on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:36:00] PROBLEM - Hadoop NodeManager on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:36:11] PROBLEM - Disk space on Hadoop worker on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:36:20] RECOVERY - RAID on mw1134 is OK: OK: no RAID installed
[02:36:32] RECOVERY - nutcracker port on mw1134 is OK: TCP OK - 0.000 second response time on port 11212
[02:36:32] RECOVERY - dhclient process on analytics1047 is OK: PROCS OK: 0 processes with command name dhclient
[02:36:32] RECOVERY - Check size of conntrack table on analytics1047 is OK: OK: nf_conntrack is 0 % full
[02:36:40] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 41 minutes ago with 0 failures
[02:36:40] RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: OK: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: RUNNING
[02:36:41] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 5.821 second response time
[02:36:51] RECOVERY - Check size of conntrack table on mw1134 is OK: OK: nf_conntrack is 0 % full
[02:37:00] RECOVERY - salt-minion processes on analytics1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:37:00] RECOVERY - DPKG on analytics1047 is OK: All packages OK
[02:37:00] RECOVERY - RAID on analytics1047 is OK: OK: optimal, 13 logical, 14 physical
[02:37:00] RECOVERY - Disk space on analytics1047 is OK: DISK OK
[02:37:00] RECOVERY - nutcracker process on mw1134 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[02:37:02] RECOVERY - SSH on mw1134 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[02:37:10] RECOVERY - configured eth on analytics1047 is OK: OK - interfaces up
[02:37:11] RECOVERY - DPKG on mw1134 is OK: All packages OK
[02:37:20] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 44 minutes ago with 0 failures
[02:37:20] RECOVERY - Disk space on mw1134 is OK: DISK OK
[02:37:22] RECOVERY - Hadoop DataNode on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[02:37:41] RECOVERY - dhclient process on mw1134 is OK: PROCS OK: 0 processes with command name dhclient
[02:37:41] RECOVERY - HHVM processes on mw1134 is OK: PROCS OK: 6 processes with command name hhvm
[02:37:42] RECOVERY - salt-minion processes on mw1134 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:37:42] RECOVERY - Hadoop NodeManager on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:37:43] RECOVERY - configured eth on mw1134 is OK: OK - interfaces up
[02:38:00] RECOVERY - HHVM rendering on mw1134 is OK: HTTP OK: HTTP/1.1 200 OK - 66385 bytes in 0.337 second response time
[02:39:52] RECOVERY - Disk space on Hadoop worker on analytics1047 is OK: DISK OK
[03:49:45] (PS1) Papaul: DHCP: changing the install to trusty to test since jessie is not detecting the disks Bug: T132976 [puppet] - https://gerrit.wikimedia.org/r/286266 (https://phabricator.wikimedia.org/T132976)
[03:52:26] (Abandoned) Papaul: DHCP: changing the install to trusty to test since jessie is not detecting the disks Bug: T132976 [puppet] - https://gerrit.wikimedia.org/r/286266 (https://phabricator.wikimedia.org/T132976) (owner: Papaul)
[04:03:00] PROBLEM - salt-minion processes on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:03:09] PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:03:21] PROBLEM - configured eth on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:03:30] PROBLEM - RAID on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:03:49] PROBLEM - puppet last run on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:03:59] PROBLEM - Hadoop NodeManager on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:03:59] PROBLEM - dhclient process on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:05:20] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 9 minutes ago with 0 failures
[04:05:31] RECOVERY - Hadoop NodeManager on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[04:05:39] RECOVERY - dhclient process on analytics1047 is OK: PROCS OK: 0 processes with command name dhclient
[04:06:00] RECOVERY - salt-minion processes on analytics1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[04:06:19] RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: OK: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: RUNNING
[04:06:39] RECOVERY - configured eth on analytics1047 is OK: OK - interfaces up
[04:06:50] RECOVERY - RAID on analytics1047 is OK: OK: optimal, 13 logical, 14 physical
[04:50:20] PROBLEM - DPKG on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:50:20] PROBLEM - salt-minion processes on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:50:40] PROBLEM - Disk space on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:50:41] PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:52:09] RECOVERY - DPKG on analytics1047 is OK: All packages OK
[04:52:09] RECOVERY - salt-minion processes on analytics1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[04:52:20] RECOVERY - Disk space on analytics1047 is OK: DISK OK
[04:52:30] RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: OK: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: RUNNING
[05:16:18] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80692 MB (15% inode=99%)
[05:23:57] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80735 MB (15% inode=99%)
[05:54:47] RECOVERY - Disk space on elastic1017 is OK: DISK OK
[05:59:05] (CR) Glaisher: add jamwiki to langlist, InitialiseSettings (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) (owner: Dzahn)
[06:16:55] !log restarting elasticsearch server elastic1028.eqiad.wmnet (T110236)
[06:16:56] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236
[06:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:29:57] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: puppet fail
[06:30:38] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:58] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[06:32:27] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:46] !log restarting elasticsearch server elastic1029.eqiad.wmnet (T110236)
[06:32:47] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236
[06:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:33:27] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:43:47] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]
[06:47:37] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[06:56:27] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:56:37] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:56:47] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:57:26] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:58:17] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:15:55] !log restarting elasticsearch server elastic1030.eqiad.wmnet (T110236)
[07:15:56] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236
[07:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:42:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]
[07:44:51] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[07:52:40] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:54:29] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[08:28:16] !log restarting elasticsearch server elastic1031.eqiad.wmnet (T110236)
[08:28:17] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236
[08:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:40:31] (CR) Alex Monk: "also needs to be in various dblists" [mediawiki-config] - https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) (owner: Dzahn)
[10:08:33] PROBLEM - puppet last run on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:08:42] PROBLEM - salt-minion processes on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:09:02] PROBLEM - DPKG on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:09:02] PROBLEM - Disk space on Hadoop worker on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:09:03] PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:09:14] PROBLEM - Disk space on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:09:15] PROBLEM - Hadoop DataNode on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:09:23] PROBLEM - RAID on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:09:33] PROBLEM - Hadoop NodeManager on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:09:43] PROBLEM - configured eth on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:10:12] PROBLEM - dhclient process on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:10:22] PROBLEM - Check size of conntrack table on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:12:43] RECOVERY - DPKG on analytics1047 is OK: All packages OK
[10:12:43] RECOVERY - Disk space on Hadoop worker on analytics1047 is OK: DISK OK
[10:13:13] RECOVERY - Hadoop NodeManager on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[10:13:44] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table bawiktionary.hitcounter doesnt exist on query. Default database: information_schema. Query: DELETE FROM bawiktionary.hitcounter
[10:13:54] RECOVERY - Check size of conntrack table on analytics1047 is OK: OK: nf_conntrack is 0 % full
[10:14:12] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures
[10:14:54] RECOVERY - Disk space on analytics1047 is OK: DISK OK
[10:14:58] * volans looking at dbstore1001 ^^^
[10:17:03] RECOVERY - RAID on analytics1047 is OK: OK: optimal, 13 logical, 14 physical
[10:17:13] RECOVERY - configured eth on analytics1047 is OK: OK - interfaces up
[10:17:43] RECOVERY - dhclient process on analytics1047 is OK: PROCS OK: 0 processes with command name dhclient
[10:18:12] RECOVERY - salt-minion processes on analytics1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:18:34] RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: OK: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: RUNNING
[10:18:43] RECOVERY - Hadoop DataNode on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[10:19:00] Operations, Discovery, Discovery-Search-Sprint, Elasticsearch, Patch-For-Review: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236#2253327 (Gehel) First restart to enable unicast completed on eqiad and codfw. Second restart to come...
[10:19:34] !log restarted slave on dbstore1001 skipping missing database T132837
[10:19:35] T132837: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837
[10:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:21:33] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[10:25:15] I'll expect it to break again later for the other missing ones, I'll keep an eye on it to fix it
[10:30:41] that was my fault, I reimaged db1038 on thursday
[10:31:08] yep, no problem :)
[10:45:30] !log Reset slave on sanitarium:3311 due to corrupted relay log after skipping query for duplicate key T132416
[10:45:31] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[10:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:47:47] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2252547 (Mtherwjs) jam.wikipedia.org redirected https://incubator.wikimedia.org/w/index.php?title=Wp/jam/Mien_Piej&redirectfrom=infopage
[11:02:36] Operations, DBA: Implement slave_run_triggers_for_rbr at sanitarium for labs filtering - https://phabricator.wikimedia.org/T121207#2253354 (Volans)
[11:40:02] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table zh_cnwiki.hitcounter doesnt exist on query. Default database: information_schema. Query: DELETE FROM zh_cnwiki.hitcounter
[11:44:33] already fixed, was the last one...
[11:45:43] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[13:41:58] !log disabled puppet on analytics1047 and scheduled downtime for the host, IO errors in the dmesg for /dev/sdd. Stopped also Hadoop daemons to remove it from the cluster temporarily (not sure how to do it properly, will write docs).
[13:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:55:46] Operations, ops-eqiad, DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2253604 (elukey)
[14:15:48] (CR) Dereckson: add jamwiki to langlist, InitialiseSettings (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) (owner: Dzahn)
[14:35:34] Puppet, Beta-Cluster-Infrastructure: Set up puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2253666 (Aklapper)
[14:55:45] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2253697 (Mtherwjs)
[15:02:58] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 660 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5211633 keys - replication_delay is 660
[15:08:47] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5202858 keys - replication_delay is 0
[15:30:02] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2253719 (Glaisher) Could someone provide the translations for the namespace names? If possible, please provide them in the format below with the E...
[15:40:07] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2253738 (10Dereckson)
[15:46:57] (03PS1) 10Dereckson: Add jam.wikipedia to RESTBase and Labs dnsrecursor [puppet] - 10https://gerrit.wikimedia.org/r/286278 (https://phabricator.wikimedia.org/T134017)
[15:48:07] (03CR) 10jenkins-bot: [V: 04-1] Add jam.wikipedia to RESTBase and Labs dnsrecursor [puppet] - 10https://gerrit.wikimedia.org/r/286278 (https://phabricator.wikimedia.org/T134017) (owner: 10Dereckson)
[15:53:39] (03PS2) 10Dereckson: Add jam.wikipedia to RESTBase and Labs dnsrecursor [puppet] - 10https://gerrit.wikimedia.org/r/286278 (https://phabricator.wikimedia.org/T134017)
[16:07:26] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 688 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5211203 keys - replication_delay is 688
[16:08:15] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2252547 (10Dereckson) Another l10n effort is needed for the upload wizard: https://commons.wikimedia.org/wiki/Special:UploadWizard?uselang=jam
[16:13:29] (03PS5) 10Dereckson: Initialize configuration for jam.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) (owner: 10Dzahn)
[16:15:16] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5206022 keys - replication_delay is 0
[16:15:35] (03CR) 10Dereckson: "PS5: HD logos, removed regular l10n namespaces, dblists" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) (owner: 10Dzahn)
[16:16:05] (03CR) 10Dereckson: "Changes planned. I forgot the HD logos in InitialiseSettings.php." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) (owner: 10Dzahn)
[16:17:23] (03PS6) 10Dereckson: Initialize configuration for jam.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) (owner: 10Dzahn)
[16:17:56] (03CR) 10Dereckson: "PS6: HD logo in config too (yes for optipng)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) (owner: 10Dzahn)
[16:20:25] moar jam
[16:24:20] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2253783 (10MF-Warburg) I'm fine with requiring the namespace translations (is there no way anymore to translate them through twn?), but the upload w...
[16:32:17] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2253793 (10Dereckson) Sure, we agree, this is not a blocker.
[18:09:38] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:10:09] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:10:19] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:11:50] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy
[18:12:09] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[18:15:18] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy
[18:17:38] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: /pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200)
[18:19:28] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy
[18:35:16] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: puppet fail
[19:00:07] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[19:27:06] (03PS1) 10Urbanecm: Add interface editor user group on pswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286285 (https://phabricator.wikimedia.org/T133472)
[19:38:38] (03PS1) 10Urbanecm: Enable user signature in VE in plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286286 (https://phabricator.wikimedia.org/T133978)
[20:19:30] (03PS1) 10Urbanecm: Enable Visual Editor on all namespaces of plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286287 (https://phabricator.wikimedia.org/T133980)
[20:19:49] (03CR) 10jenkins-bot: [V: 04-1] Enable Visual Editor on all namespaces of plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286287 (https://phabricator.wikimedia.org/T133980) (owner: 10Urbanecm)
[20:23:28] (03PS2) 10Urbanecm: Enable Visual Editor on all namespaces of plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286287 (https://phabricator.wikimedia.org/T133980)
[20:56:22] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[20:58:22] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5222062 keys - replication_delay is 0
[23:49:03] PROBLEM - DPKG on furud is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:49:11] PROBLEM - Disk space on furud is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:49:22] PROBLEM - RAID on furud is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:49:41] PROBLEM - configured eth on furud is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:49:42] PROBLEM - puppet last run on furud is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:50:01] PROBLEM - dhclient process on furud is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:50:32] PROBLEM - salt-minion processes on furud is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:50:33] PROBLEM - Check size of conntrack table on furud is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.