[00:08:15] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[00:10:34] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[00:10:35] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[00:10:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[00:14:36] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:16:45] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:17:05] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:35:40] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[00:46:00] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[00:54:10] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[01:00:29] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[01:01:52] PROBLEM - Hadoop NodeManager on analytics1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:07:32] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[01:10:31] PROBLEM - Hadoop NodeManager on analytics1042 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:17:51] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[01:24:52] PROBLEM - Hadoop NodeManager on analytics1055 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:25:41] RECOVERY - Hadoop NodeManager on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:26:01] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[01:32:18] RECOVERY - Hadoop NodeManager on analytics1042 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:33:38] RECOVERY - Hadoop NodeManager on analytics1055 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:38:58] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[01:45:08] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[01:55:18] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[02:02:58] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[02:04:39] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[02:10:49] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[02:12:49] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[02:18:58] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[02:23:40] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 09m 58s)
[02:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:27:09] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[02:30:28] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jan 1 02:30:27 UTC 2016 (duration 6m 47s)
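[Editor's note: the "% of data above the critical threshold" alerts above (the 5xx reqs/min checks) come from a Graphite-backed check. A minimal sketch of that logic is below; the metric target, lookback window, and graphite host are illustrative assumptions, not the production configuration. The render API with format=json is standard Graphite.]

```python
# Sketch: compute what fraction of recent datapoints exceed a threshold,
# the logic implied by messages like "11.11% of data above the critical
# threshold [1000.0]". Assumes a reachable Graphite render endpoint.
import requests

def percent_above(graphite_url, target, threshold, minutes=10):
    resp = requests.get(
        f"{graphite_url}/render",
        params={"target": target, "from": f"-{minutes}min", "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    series = resp.json()[0]["datapoints"]  # list of [value, timestamp] pairs
    values = [v for v, _ in series if v is not None]
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if v > threshold) / len(values)

# e.g. treat as CRITICAL when >10% of datapoints exceed 1000 req/min:
# percent_above("http://graphite1001", "reqstats.5xx", 1000.0) > 10
```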
[02:30:29] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail
[02:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:46] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[02:36:17] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[02:48:37] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[02:57:46] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[02:58:57] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[03:07:24] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[03:29:54] PROBLEM - Hadoop NodeManager on analytics1056 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[03:50:21] RECOVERY - Hadoop NodeManager on analytics1056 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[03:56:32] PROBLEM - Hadoop NodeManager on analytics1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[04:07:17] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[04:13:27] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[04:15:17] PROBLEM - Hadoop NodeManager on analytics1052 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[04:19:17] RECOVERY - Hadoop NodeManager on analytics1052 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[04:22:17] RECOVERY - Hadoop NodeManager on analytics1044 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[04:29:57] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[04:39:25] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[04:57:55] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[04:58:46] PROBLEM - Hadoop NodeManager on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:05:10] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[05:15:31] RECOVERY - Hadoop NodeManager on analytics1057 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:21:40] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[05:27:50] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[05:36:19] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[05:46:39] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[06:11:45] PROBLEM - puppet last run on mw2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:12:25] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[06:13:24] PROBLEM - puppet last run on mw2042 is CRITICAL: CRITICAL: puppet fail
[06:18:43] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[06:31:33] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:55] PROBLEM - puppet last run on mw2037 is CRITICAL: CRITICAL: Puppet has 1 failures
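[Editor's note: the flapping "Hadoop NodeManager ... PROCS CRITICAL" alerts above are a process-count check; the output format matches the Nagios check_procs plugin. A minimal sketch of the equivalent logic follows; the exact plugin invocation on the analytics hosts is an assumption.]

```python
# Sketch: count java processes whose arguments include the YARN NodeManager
# main class, mirroring the "0 processes with command name java, args
# org.apache.hadoop.yarn.server.nodemanager.NodeManager" output above.
import sys
import psutil

TARGET_ARG = "org.apache.hadoop.yarn.server.nodemanager.NodeManager"

def count_nodemanagers():
    count = 0
    for proc in psutil.process_iter(attrs=["name", "cmdline"]):
        # cmdline may be None if access to the process was denied
        if proc.info["name"] == "java" and TARGET_ARG in (proc.info["cmdline"] or []):
            count += 1
    return count

if __name__ == "__main__":
    n = count_nodemanagers()
    if n < 1:
        print(f"PROCS CRITICAL: {n} processes with args {TARGET_ARG}")
        sys.exit(2)  # Nagios exit code for CRITICAL
    print(f"PROCS OK: {n} process with args {TARGET_ARG}")
    sys.exit(0)
```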
[06:37:25] RECOVERY - puppet last run on mw2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:37:46] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[06:42:56] RECOVERY - puppet last run on mw2042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:26] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0]
[06:49:27] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[06:52:06] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[06:58:17] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[06:58:56] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:46] RECOVERY - puppet last run on mw2037 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[07:03:03] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[07:37:17] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[07:43:27] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[08:16:06] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:22:16] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[08:30:36] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:39:13] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[09:03:50] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[09:13:51] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[09:30:31] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[09:36:10] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[09:46:30] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[09:52:40] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[10:00:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 626
[10:21:18] PROBLEM - puppet last run on mw2047 is CRITICAL: CRITICAL: puppet fail
[10:26:58] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[10:30:17] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 669
[10:33:35] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[10:35:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 928702 Threads: 150 Questions: 38408383 Slow queries: 11585 Opens: 58804 Flush tables: 2 Open tables: 418 Queries per second avg: 41.357 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:48:34] RECOVERY - puppet last run on mw2047 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[13:46:12] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[13:52:22] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[15:28:04] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: puppet fail
[15:55:00] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:43:20] PROBLEM - HHVM rendering on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:45:20] RECOVERY - HHVM rendering on mw1057 is OK: HTTP OK: HTTP/1.1 200 OK - 71972 bytes in 0.115 second response time
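[Editor's note: the db1008 "SLOW_SLAVE" alerts above report replication lag via Seconds Behind Master. A minimal sketch of such a lag check follows; the credentials and the ~600s threshold are illustrative assumptions, and SHOW SLAVE STATUS / Seconds_Behind_Master are the standard MySQL facilities.]

```python
# Sketch: read replication lag the way a check_mysql-style probe would.
import pymysql

def slave_lag(host, user, password):
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
    finally:
        conn.close()
    if status is None:
        return None  # host is not configured as a replica
    # Seconds_Behind_Master is None while the SQL/IO threads are stopped
    return status["Seconds_Behind_Master"]

# e.g. lag = slave_lag("db1008", "nagios", "secret")
# alert if lag is None or above ~600s, matching the 626s/669s alerts above
```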
[19:41:51] !log Updated scholarships.wikimedia.org with latest translation data from translatewiki
[19:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:09:52] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:17:00] ^ both NFS and the home page seem to work -- not sure what's going on here
[20:30:52] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail
[20:56:35] PROBLEM - puppet last run on mw2094 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:57:46] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:14:11] (PS1) Luke081515: Remove the unblockself right from sysops at trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/261869 (https://phabricator.wikimedia.org/T122710)
[21:20:58] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.104 second response time
[21:21:38] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:21:06] Anyone around willing to try and get nodepool CI hosts talking to Jenkins again?
[22:21:53] The rake-jessie jobs are backing up in zuul and the Jenkins web ui isn't showing any of the ci-jessie-* slaves
[22:23:18] This happened on Wednesday as well and was resolved by either the nodepool service being restarted or a restart of nova-compute (both were done in close succession by andrewbogott)
[22:41:23] operations, Labs, Labs-Infrastructure, Release-Engineering-Team, and 2 others: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins - https://phabricator.wikimedia.org/T122731#1912279 (bd808) NEW
[23:03:04] operations, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure, and 3 others: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins - https://phabricator.wikimedia.org/T122731#1912305 (Unicornisaurous)
[23:04:49] bd808: what do you need?
[23:05:21] oh, i see
[23:05:39] ori: good question :) Last time there were two things in Labs restarted and we don't really know which was needed
[23:09:55] There’s no slave/cloud that matches this assignment. Did you mean ‘DebianGlue’ instead of ‘ci-jessie-wikimedia’?
[23:10:04] I wonder how it got from "ci-jessie-wikimedia" to "DebianGlue"
[23:12:16] ori: no, that's the problem. There is this magic "nodepool" service that attaches VMs from the "contintcloud" project to Jenkins
[23:12:23] and they have all gone missing
[23:12:38] https://wikitech.wikimedia.org/wiki/Nodepool
[23:12:40] yeah, I know. I was just startled by the (presumably) fuzzy matching
[23:13:05] oh, heh
[23:13:10] super fuzzy
[23:14:44] labnodepool1001.eqiad.wmnet is completely unresponsive; is that still the server?
[23:16:04] I'm not sure. Let me look in my irc logs from the other day
[23:18:28] no joy in irc logs about which host it may be
[23:37:45] !log restarting nodepool on labnodepool1001.eqiad.wmnet (T122731)
[23:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:46:05] ori: sadly no visible change at https://integration.wikimedia.org/ci/computer/
[23:49:00] legoktm: Don't suppose you know how to fix nodepool CI?
[23:49:13] oh gah
[23:49:18] I just tried to restart nodepool
[23:49:40] legoktm: With no !log? Tsk.
[23:50:39] !log restarted nodepool on labnodepool1001
[23:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:51:16] twice in 15 minutes with no change makes me think the problem is elsewhere
[23:51:24] * James_F nods.
[23:51:46] can we try restarting the nova-compute thing that andrewbogott did last time?
[23:52:15] I think that's the "brain" of the Labs networking layer
[23:52:35] oh :/
[23:53:55] Breaking Labs on a holiday probably isn
[23:54:00] 't a good idea. :-(
[23:54:16] * legoktm isn't really here either, but it's halftime right now
[23:54:46] * James_F grins.
[23:54:58] I'm definitely not sitting in the office right now. *coughs*
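[Editor's note: the symptom chased through the nodepool thread above -- no ci-jessie-* slaves visible at https://integration.wikimedia.org/ci/computer/ -- can be confirmed without the web UI, since Jenkins exposes its attached nodes at /computer/api/json. A minimal sketch follows; the "ci-jessie" prefix filter mirrors the slave names mentioned at 22:21:53, and this only diagnoses the condition rather than implementing any part of nodepool itself.]

```python
# Sketch: list the online slaves matching a name prefix via the Jenkins
# JSON API. An empty result is the "no ci-jessie-wikimedia slaves attached"
# condition described in https://phabricator.wikimedia.org/T122731.
import requests

def attached_slaves(jenkins_url, prefix="ci-jessie"):
    resp = requests.get(f"{jenkins_url}/computer/api/json", timeout=10)
    resp.raise_for_status()
    return [
        c["displayName"]
        for c in resp.json()["computer"]
        if c["displayName"].startswith(prefix) and not c["offline"]
    ]

# e.g. attached_slaves("https://integration.wikimedia.org/ci") == []
# means nodepool has stopped attaching contintcloud VMs to Jenkins.
```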