[00:08:15] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[00:10:34] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[00:10:35] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[00:10:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[00:14:36] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:16:45] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:17:05] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:35:40] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[00:46:00] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[00:54:10] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[01:00:29] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[01:01:52] PROBLEM - Hadoop NodeManager on analytics1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:07:32] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[01:10:31] PROBLEM - Hadoop NodeManager on analytics1042 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:17:51] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[01:24:52] PROBLEM - Hadoop NodeManager on analytics1055 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:25:41] RECOVERY - Hadoop NodeManager on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:26:01] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[01:32:18] RECOVERY - Hadoop NodeManager on analytics1042 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:33:38] RECOVERY - Hadoop NodeManager on analytics1055 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:38:58] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[01:45:08] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[01:55:18] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[02:02:58] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[02:04:39] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[02:10:49] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[02:12:49] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[02:18:58] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[02:23:40] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 09m 58s)
[02:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:27:09] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[02:30:28] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jan 1 02:30:27 UTC 2016 (duration 6m 47s)
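[Editor's note: the "% of data above the critical threshold" alerts above (the 5xx reqs/min checks) come from a Graphite-backed check. A minimal sketch of that logic is below; the metric target, lookback window, and graphite host are illustrative assumptions, not the production configuration. The render API with format=json is standard Graphite.]

```python
# Sketch: compute what fraction of recent datapoints exceed a threshold,
# the logic implied by messages like "11.11% of data above the critical
# threshold [1000.0]". Assumes a reachable Graphite render endpoint.
import requests

def percent_above(graphite_url, target, threshold, minutes=10):
    resp = requests.get(
        f"{graphite_url}/render",
        params={"target": target, "from": f"-{minutes}min", "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    series = resp.json()[0]["datapoints"]  # list of [value, timestamp] pairs
    values = [v for v, _ in series if v is not None]
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if v > threshold) / len(values)

# e.g. treat as CRITICAL when >10% of datapoints exceed 1000 req/min:
# percent_above("http://graphite1001", "reqstats.5xx", 1000.0) > 10
```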
[02:30:29] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail
[02:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:46] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[02:36:17] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[02:48:37] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[02:57:46] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[02:58:57] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[03:07:24] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[03:29:54] PROBLEM - Hadoop NodeManager on analytics1056 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[03:50:21] RECOVERY - Hadoop NodeManager on analytics1056 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[03:56:32] PROBLEM - Hadoop NodeManager on analytics1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[04:07:17] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[04:13:27] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[04:15:17] PROBLEM - Hadoop NodeManager on analytics1052 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[04:19:17] RECOVERY - Hadoop NodeManager on analytics1052 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[04:22:17] RECOVERY - Hadoop NodeManager on analytics1044 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[04:29:57] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[04:39:25] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[04:57:55] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[04:58:46] PROBLEM - Hadoop NodeManager on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:05:10] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[05:15:31] RECOVERY - Hadoop NodeManager on analytics1057 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:21:40] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[05:27:50] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[05:36:19] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[05:46:39] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[06:11:45] PROBLEM - puppet last run on mw2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:12:25] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[06:13:24] PROBLEM - puppet last run on mw2042 is CRITICAL: CRITICAL: puppet fail
[06:18:43] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[06:31:33] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:55] PROBLEM - puppet last run on mw2037 is CRITICAL: CRITICAL: Puppet has 1 failures
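[Editor's note: the flapping "Hadoop NodeManager ... PROCS CRITICAL" alerts above are a process-count check; the output format matches the Nagios check_procs plugin. A minimal sketch of the equivalent logic follows; the exact plugin invocation on the analytics hosts is an assumption.]

```python
# Sketch: count java processes whose arguments include the YARN NodeManager
# main class, mirroring the "0 processes with command name java, args
# org.apache.hadoop.yarn.server.nodemanager.NodeManager" output above.
import sys
import psutil

TARGET_ARG = "org.apache.hadoop.yarn.server.nodemanager.NodeManager"

def count_nodemanagers():
    count = 0
    for proc in psutil.process_iter(attrs=["name", "cmdline"]):
        # cmdline may be None if access to the process was denied
        if proc.info["name"] == "java" and TARGET_ARG in (proc.info["cmdline"] or []):
            count += 1
    return count

if __name__ == "__main__":
    n = count_nodemanagers()
    if n < 1:
        print(f"PROCS CRITICAL: {n} processes with args {TARGET_ARG}")
        sys.exit(2)  # Nagios exit code for CRITICAL
    print(f"PROCS OK: {n} process with args {TARGET_ARG}")
    sys.exit(0)
```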
[06:37:25] RECOVERY - puppet last run on mw2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:37:46] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[06:42:56] RECOVERY - puppet last run on mw2042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:26] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0]
[06:49:27] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[06:52:06] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[06:58:17] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[06:58:56] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:46] RECOVERY - puppet last run on mw2037 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[07:03:03] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[07:37:17] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[07:43:27] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[08:16:06] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:22:16] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[08:30:36] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:39:13] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[09:03:50] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[09:13:51] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[09:30:31] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[09:36:10] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[09:46:30] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[09:52:40] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[10:00:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 626
[10:21:18] PROBLEM - puppet last run on mw2047 is CRITICAL: CRITICAL: puppet fail
[10:26:58] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[10:30:17] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 669
[10:33:35] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[10:35:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 928702 Threads: 150 Questions: 38408383 Slow queries: 11585 Opens: 58804 Flush tables: 2 Open tables: 418 Queries per second avg: 41.357 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:48:34] RECOVERY - puppet last run on mw2047 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[13:46:12] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[13:52:22] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[15:28:04] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: puppet fail
[15:55:00] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:43:20] PROBLEM - HHVM rendering on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:45:20] RECOVERY - HHVM rendering on mw1057 is OK: HTTP OK: HTTP/1.1 200 OK - 71972 bytes in 0.115 second response time
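[Editor's note: the db1008 "SLOW_SLAVE" alerts above report replication lag via Seconds Behind Master. A minimal sketch of such a lag check follows; the credentials and the ~600s threshold are illustrative assumptions, and SHOW SLAVE STATUS / Seconds_Behind_Master are the standard MySQL facilities.]

```python
# Sketch: read replication lag the way a check_mysql-style probe would.
import pymysql

def slave_lag(host, user, password):
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
    finally:
        conn.close()
    if status is None:
        return None  # host is not configured as a replica
    # Seconds_Behind_Master is None while the SQL/IO threads are stopped
    return status["Seconds_Behind_Master"]

# e.g. lag = slave_lag("db1008", "nagios", "secret")
# alert if lag is None or above ~600s, matching the 626s/669s alerts above
```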
[19:41:51] !log Updated scholarships.wikimedia.org with latest translation data from translatewiki
[19:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:09:52] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:17:00] ^ both NFS and the home page seem to work -- not sure what's going on here
[20:30:52] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail
[20:56:35] PROBLEM - puppet last run on mw2094 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:57:46] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:14:11] (PS1) Luke081515: Remove the unblockself right from sysops at trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/261869 (https://phabricator.wikimedia.org/T122710)
[21:20:58] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.104 second response time
[21:21:38] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:21:06] Anyone around willing to try and get nodepool CI hosts talking to Jenkins again?
[22:21:53] The rake-jessie jobs are backing up in zuul and the Jenkins web ui isn't showing any of the ci-jessie-* slaves
[22:23:18] This happened on Wednesday as well and was resolved by either the nodepool service being restarted or a restart of nova-compute (both were done in close succession by andrewbogott)
[22:41:23] operations, Labs, Labs-Infrastructure, Release-Engineering-Team, and 2 others: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins - https://phabricator.wikimedia.org/T122731#1912279 (bd808) NEW
[23:03:04] operations, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure, and 3 others: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins - https://phabricator.wikimedia.org/T122731#1912305 (Unicornisaurous)
[23:04:49] bd808: what do you need?
[23:05:21] oh, i see
[23:05:39] ori: good question :) Last time there were two things in Labs restarted and we don't really know which was needed
[23:09:55] There’s no slave/cloud that matches this assignment. Did you mean ‘DebianGlue’ instead of ‘ci-jessie-wikimedia’?
[23:10:04] I wonder how it got from "ci-jessie-wikimedia" to "DebianGlue"
[23:12:16] ori: no, that's the problem. There is this magic "nodepool" service that attaches VMs from the "contintcloud" project to Jenkins
[23:12:23] and they have all gone missing
[23:12:38] https://wikitech.wikimedia.org/wiki/Nodepool
[23:12:40] yeah, I know. I was just startled by the (presumably) fuzzy matching
[23:13:05] oh, heh
[23:13:10] super fuzzy
[23:14:44] labnodepool1001.eqiad.wmnet is completely unresponsive; is that still the server?
[23:16:04] I'm not sure. Let me look in my irc logs from the other day
[23:18:28] no joy in irc logs about which host it may be
[23:37:45] !log restarting nodepool on labnodepool1001.eqiad.wmnet (T122731)
[23:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:46:05] ori: sadly no visible change at https://integration.wikimedia.org/ci/computer/
[23:49:00] legoktm: Don't suppose you know how to fix nodepool CI?
[23:49:13] oh gah
[23:49:18] I just tried to restart nodepool
[23:49:40] legoktm: With no !log? Tsk.
[23:50:39] !log restarted nodepool on labnodepool1001
[23:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:51:16] twice in 15 minutes with no change makes me think the problem is elsewhere
[23:51:24] * James_F nods.
[23:51:46] can we try restarting the nova-compute thing that andrewbogott did last time?
[23:52:15] I think that's the "brain" of the Labs networking layer
[23:52:35] oh :/
[23:53:55] Breaking Labs on a holiday probably isn
[23:54:00] 't a good idea. :-(
[23:54:16] * legoktm isn't really here either, but it's halftime right now
[23:54:46] * James_F grins.
[23:54:58] I'm definitely not sitting in the office right now. *coughs*
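[Editor's note: the symptom chased through the nodepool thread above -- no ci-jessie-* slaves visible at https://integration.wikimedia.org/ci/computer/ -- can be confirmed without the web UI, since Jenkins exposes its attached nodes at /computer/api/json. A minimal sketch follows; the "ci-jessie" prefix filter mirrors the slave names mentioned at 22:21:53, and this only diagnoses the condition rather than implementing any part of nodepool itself.]

```python
# Sketch: list the online slaves matching a name prefix via the Jenkins
# JSON API. An empty result is the "no ci-jessie-wikimedia slaves attached"
# condition described in https://phabricator.wikimedia.org/T122731.
import requests

def attached_slaves(jenkins_url, prefix="ci-jessie"):
    resp = requests.get(f"{jenkins_url}/computer/api/json", timeout=10)
    resp.raise_for_status()
    return [
        c["displayName"]
        for c in resp.json()["computer"]
        if c["displayName"].startswith(prefix) and not c["offline"]
    ]

# e.g. attached_slaves("https://integration.wikimedia.org/ci") == []
# means nodepool has stopped attaching contintcloud VMs to Jenkins.
```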