[00:00:51] (CR) Dzahn: "before: SSL error: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed" [puppet] - https://gerrit.wikimedia.org/r/161631 (owner: Dzahn)
[00:07:22] bblack: OK that's fine
[00:09:57] gwicke: Hey so do I see correctly that mathoid doesn't have a deployment repo? How does that work? How is its config managed? How should the mathoid config be managed?
[00:10:03] (CR) Ori.livneh: wmflib: add to_milliseconds() / to_seconds() (3 comments) [puppet] - https://gerrit.wikimedia.org/r/159692 (owner: Ori.livneh)
[00:10:41] RoanKattouw: it doesn't have a deploy repo yet
[00:10:57] OK so how does it function without one?
[00:10:57] (PS2) Ori.livneh: wmflib: add to_milliseconds() / to_seconds() [puppet] - https://gerrit.wikimedia.org/r/159692
[00:11:17] magic!
[00:11:25] RoanKattouw: I haven't looked at it in detail yet, I suspect that it's just doing a npm install in beta labs
[00:11:30] which of course won't fly in prod
[00:12:07] Oh is mathoid not in prod yet?
[00:12:11] no, not yet
[00:12:14] Ugh
[00:12:16] OK
[00:12:27] Where is it in labs?
[00:12:43] I wouldn't hold you back if you'd like to deploy it too ;)
[00:12:59] I'll probably have to
[00:13:06] I mean, it would be approximately zero extra work
[00:13:26] But the fact that mathoid isn't already deployed means there are bugs that you haven't found yet, which means this is going to be more work than I thought :|
[00:14:15] I need to do some patches to citoid too, though. It needs to not 404 on / because of the health check
[00:14:18] it's deployed in beta somewhere
[00:14:18] (PS1) Dzahn: icinga - fix syntax error for check_ssl_ldap [puppet] - https://gerrit.wikimedia.org/r/161633
[00:14:26] looking at the puppet code
[00:14:40] And I want to make it use GET query strings or maybe urlencoded POST, but not this JSON RPC thing
[00:14:56] RoanKattouw, gwicke: btw, have you seen http://meta.math.stackexchange.com/q/16809 ?
[00:15:31] (CR) Dzahn: [C: 2] icinga - fix syntax error for check_ssl_ldap [puppet] - https://gerrit.wikimedia.org/r/161633 (owner: Dzahn)
[00:15:31] no, not yet
[00:15:42] RoanKattouw: mathjax is plenty fast for us right now
[00:15:50] and SVG is very fast to render on the client
[00:16:01] hah
[00:16:07] eh, ori ^^
[00:16:12] Maybe once it matures, it'll still be faster
[00:16:19] And hopefully better designed than MathJax
[00:16:27] yeah, we'll see
[00:16:30] from an external-facing API perspective that is
[00:16:35] easy to swap one lib for another
[00:17:12] * ori nods
[00:17:30] there are quite a few nice features around mathjax though that might take a while to replicate
[00:17:30] not suggesting we abandon plans whenever some new repo trends on github. but this one looks pretty cool :)
[00:17:50] like Chromevox and mathml
[00:18:47] (a fork of chromevox can add readable annotations to the SVG, so that a screenreader can produce something that makes more sense than the raw latex)
[00:39:21] (PS1) Dzahn: add README.md to all modules [puppet] - https://gerrit.wikimedia.org/r/161634
[00:41:14] (CR) Dzahn: [C: -2] "meh, that's not what i meant yet" [puppet] - https://gerrit.wikimedia.org/r/161634 (owner: Dzahn)
[00:42:08] RECOVERY - Certificate expiration on virt1000 is OK: SSL_CERT OK - X.509 certificate for virt1000.wikimedia.org from RapidSSL CA valid until Jan 22 21:31:13 2015 GMT (expires in 125 days)
[00:42:20] ^ yay
[00:42:45] have a good weekend, everybody
[00:42:59] have a good weekend
[00:44:16] (CR) Dzahn: "17:44 <+icinga-wm> RECOVERY - Certificate expiration on virt1000 is OK: SSL_CERT OK - X.509 certificate for virt1000.wikimedia.org from Ra" [puppet] - https://gerrit.wikimedia.org/r/161631 (owner: Dzahn)
[01:19:36] PROBLEM - CI tmpfs disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 27 MB (5% inode=99%):
[02:08:24] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3616 MB (3% inode=99%):
[02:19:34] !log LocalisationUpdate completed (1.24wmf20) at 2014-09-20 02:19:34+00:00
[02:19:49] Logged the message, Master
[02:33:34] !log LocalisationUpdate completed (1.24wmf21) at 2014-09-20 02:33:34+00:00
[02:33:41] Logged the message, Master
[02:46:05] !log LocalisationUpdate completed (1.24wmf22) at 2014-09-20 02:46:05+00:00
[02:46:11] Logged the message, Master
[03:00:03] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:46:00] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Sep 20 03:46:00 UTC 2014 (duration 45m 59s)
[03:46:07] Logged the message, Master
[04:11:56] (PS1) Ori.livneh: update dotfiles [puppet] - https://gerrit.wikimedia.org/r/161642
[04:12:53] (CR) Ori.livneh: [C: 2] update dotfiles [puppet] - https://gerrit.wikimedia.org/r/161642 (owner: Ori.livneh)
[05:25:53] PROBLEM - Disk space on ms1001 is CRITICAL: DISK CRITICAL - free space: /export 2284794 MB (3% inode=99%):
[06:28:13] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:42] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Epic puppet fail
[06:29:02] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Epic puppet fail
[06:29:52] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:02] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:03] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 3 failures
[06:30:12] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:23] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:23] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:34] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:52] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:54] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:55] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:55] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:02] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:02] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:12] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:12] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:25] PROBLEM - puppet last run on amssq46 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:32] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:42] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:13] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:45:23] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:45:23] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:45:23] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:45:43] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:45:54] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:46:07] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:46:16] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:46:16] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:46:16] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:46:17] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:46:17] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:46:26] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:46:46] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures
[06:46:46] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures
[06:46:47] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:47:06] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:47:06] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:47:19] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:47:19] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:47:38] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures
[06:47:46] RECOVERY - puppet last run on amssq46 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:58:26] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:06:17] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:06:39] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:07:33] getting lots of 503 errors on en.wiki and commons
[07:07:33] PROBLEM - HTTPS on cp4012 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[07:07:33] PROBLEM - HTTPS on cp4005 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[07:07:34] PROBLEM - RAID on cp4014 is CRITICAL: Timeout while attempting connection
[07:07:34] PROBLEM - DPKG on cp4010 is CRITICAL: Timeout while attempting connection
[07:07:34] PROBLEM - puppet last run on cp4008 is CRITICAL: Timeout while attempting connection
[07:07:34] PROBLEM - Varnish HTCP daemon on cp4014 is CRITICAL: Timeout while attempting connection
[07:07:34] PROBLEM - Varnish HTCP daemon on cp4018 is CRITICAL: Timeout while attempting connection
[07:07:35] PROBLEM - DPKG on cp4020 is CRITICAL: Timeout while attempting connection
[07:07:35] PROBLEM - RAID on cp4004 is CRITICAL: Timeout while attempting connection
[07:07:36] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 68827 bytes in 9.921 second response time
[07:08:13] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:08:23] PROBLEM - Disk space on cp4002 is CRITICAL: Timeout while attempting connection
[07:08:23] PROBLEM - puppet last run on cp4003 is CRITICAL: Timeout while attempting connection
[07:08:23] PROBLEM - puppet last run on lvs4002 is CRITICAL: Timeout while attempting connection
[07:08:23] PROBLEM - check configured eth on cp4015 is CRITICAL: Timeout while attempting connection
[07:08:23] PROBLEM - check if dhclient is running on cp4015 is CRITICAL: Timeout while attempting connection
[07:08:24] PROBLEM - RAID on cp4018 is CRITICAL: Timeout while attempting connection
[07:08:24] PROBLEM - DPKG on bast4001 is CRITICAL: Timeout while attempting connection
[07:08:25] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [500.0]
[07:08:35] PROBLEM - Varnish traffic logger on cp4014 is CRITICAL: Timeout while attempting connection
[07:08:35] PROBLEM - RAID on lvs4002 is CRITICAL: Timeout while attempting connection
[07:09:03] PROBLEM - Host cp4015 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:09:03] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 54%, RTA = 110.76 ms
[07:09:14] RECOVERY - Disk space on cp4002 is OK: DISK OK
[07:09:14] RECOVERY - check if dhclient is running on cp4015 is OK: PROCS OK: 0 processes with command name dhclient
[07:09:14] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 403 seconds ago with 0 failures
[07:09:14] RECOVERY - check configured eth on cp4015 is OK: NRPE: Unable to read output
[07:09:23] RECOVERY - DPKG on bast4001 is OK: All packages OK
[07:09:23] RECOVERY - RAID on cp4018 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:09:24] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 74.05 ms
[07:09:33] RECOVERY - HTTPS on cp4012 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 487 days)
[07:09:34] RECOVERY - HTTPS on cp4005 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 487 days)
[07:09:34] RECOVERY - Varnish HTCP daemon on cp4014 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd
[07:09:35] RECOVERY - DPKG on cp4010 is OK: All packages OK
[07:09:35] RECOVERY - Varnish traffic logger on cp4014 is OK: PROCS OK: 2 processes with command name varnishncsa
[07:09:35] RECOVERY - Varnish HTCP daemon on cp4018 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd
[07:09:35] RECOVERY - DPKG on cp4020 is OK: All packages OK
[07:09:35] RECOVERY - RAID on lvs4002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:09:36] RECOVERY - RAID on cp4014 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:09:37] RECOVERY - RAID on cp4004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:11:35] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100%
[07:11:35] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100%
[07:11:35] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100%
[07:11:35] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100%
[07:12:14] PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[07:12:19] uhoh
[07:12:37] PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:12:38] wonder if anyone's awake
[07:12:41] PROBLEM - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out
[07:12:44] PROBLEM - LVS HTTP IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out
[07:12:48] PROBLEM - LVS HTTP IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out
[07:12:54] RECOVERY - Host cp4003 is UP: PING OK - Packet loss = 0%, RTA = 119.42 ms
[07:12:54] PROBLEM - Host cp4004 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:54] PROBLEM - Host cp4009 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:54] PROBLEM - Host cp4008 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:54] PROBLEM - Host cp4019 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:55] PROBLEM - Host lvs4001 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:55] PROBLEM - Host cp4005 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:56] PROBLEM - Host cp4010 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:56] PROBLEM - Host cp4002 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:57] PROBLEM - Host cp4001 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:57] PROBLEM - Host lvs4002 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:58] PROBLEM - Host cp4020 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:58] PROBLEM - Host cp4006 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:59] PROBLEM - Host cp4013 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:59] RECOVERY - Host cp4018 is UP: PING WARNING - Packet loss = 50%, RTA = 119.23 ms
[07:12:59] people should be getting pages by now?
[07:13:07] hopefully
[07:13:10] RECOVERY - Host cp4009 is UP: PING WARNING - Packet loss = 28%, RTA = 121.52 ms
[07:13:10] RECOVERY - Host cp4004 is UP: PING WARNING - Packet loss = 28%, RTA = 122.20 ms
[07:13:10] RECOVERY - Host cp4017 is UP: PING WARNING - Packet loss = 37%, RTA = 121.30 ms
[07:13:10] RECOVERY - Host cp4005 is UP: PING WARNING - Packet loss = 44%, RTA = 122.25 ms
[07:13:18] RECOVERY - Host cp4014 is UP: PING WARNING - Packet loss = 50%, RTA = 121.92 ms
[07:13:18] RECOVERY - Host cp4008 is UP: PING WARNING - Packet loss = 50%, RTA = 122.96 ms
[07:13:18] RECOVERY - Host lvs4001 is UP: PING WARNING - Packet loss = 50%, RTA = 121.91 ms
[07:13:18] RECOVERY - Host cp4006 is UP: PING WARNING - Packet loss = 50%, RTA = 121.35 ms
[07:13:18] RECOVERY - Host cp4020 is UP: PING WARNING - Packet loss = 54%, RTA = 121.64 ms
[07:13:19] RECOVERY - Host cp4001 is UP: PING WARNING - Packet loss = 54%, RTA = 121.64 ms
[07:13:19] RECOVERY - Host cp4013 is UP: PING WARNING - Packet loss = 73%, RTA = 121.80 ms
[07:13:20] RECOVERY - Host cp4019 is UP: PING WARNING - Packet loss = 73%, RTA = 121.05 ms
[07:13:20] RECOVERY - Host cp4010 is UP: PING WARNING - Packet loss = 80%, RTA = 120.79 ms
[07:13:21] RECOVERY - Host lvs4002 is UP: PING WARNING - Packet loss = 73%, RTA = 120.67 ms
[07:13:32] yes
[07:13:35] PROBLEM - RAID on cp4015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:13:38] :)
[07:13:38] ah... no pages cause phone is in the other room and I'm still in pjs
[07:13:44] PROBLEM - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out
[07:14:15] RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 699 bytes in 0.970 second response time
[07:14:23] RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 699 bytes in 1.535 second response time
[07:14:28] RECOVERY - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22391 bytes in 0.817 second response time
[07:14:42] no pages on phone. hrm
[07:14:52] RECOVERY - LVS HTTP IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22341 bytes in 1.296 second response time
[07:14:56] RECOVERY - LVS HTTP IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 677 bytes in 1.593 second response time
[07:15:00] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 100.98 ms
[07:15:12] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[07:15:22] RECOVERY - RAID on cp4015 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:15:22] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 68860 bytes in 4.567 second response time
[07:16:05] RECOVERY - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 68749 bytes in 7.090 second response time
[07:16:29] RECOVERY - puppet last run on ssl3001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[07:16:40] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Epic puppet fail
[07:16:40] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Epic puppet fail
[07:16:41] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Epic puppet fail
[07:17:31] hi folks, is this something we should tweet from @wikimedia/@wikipedia? https://wikitech.wikimedia.org/wiki/Incident_response#Communicating_with_the_public
[07:17:35] happy to help
[07:17:41] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:18:51] PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 10.7821244706
[07:20:51] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:21:51] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:22:50] PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 8.09920889831
[07:23:06] PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 16.438700339
[07:23:18] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 2 failures
[07:24:03] are folks still getting errors now?
[07:26:09] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[07:26:43] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[07:26:59] RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 3.56611737288
[07:27:01] RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: 1.81911928571
[07:27:09] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[07:28:18] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[07:29:18] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[07:29:18] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[07:29:18] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[07:30:18] RECOVERY - puppet last run on lvs4003 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures
[07:31:48] Eloquence: Seems good to me
[07:32:04] looks good to me (i was gettting lots of 503s before)
[07:33:26] kaldari, haeb - thanks. this outage should only have affected traffic that hit our san francisco data center so limited global user impact
[07:34:19] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[07:35:48] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[07:35:52] ori, thanks for the en.wp post
[07:36:58] Eloquence: i figured.. but that's still a continent or two, right? a tweet at least from @wikimedia would have been defendable, i think
[07:37:18] RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 0.894191101695
[07:37:49] defensible! :)
[07:39:11] (PS1) Mark Bergsma: Move ulsfo traffic to eqiad [dns] - https://gerrit.wikimedia.org/r/161646
[07:39:52] ori: defendable is defendable: https://en.wiktionary.org/wiki/defendable ! ;)
[07:41:04] that it may be; but it is not, alas, defensible
[07:41:44] (CR) Mark Bergsma: [C: 2] Move ulsfo traffic to eqiad [dns] - https://gerrit.wikimedia.org/r/161646 (owner: Mark Bergsma)
[07:43:12] What I don't understand is why the server admin log is not used nowadays
[07:43:26] it is?
[07:43:52] <_joe_> Nemo_bis: ?
[07:43:53] never seen it in use for events recently
[07:44:22] <_joe_> it is used for manual operations on the cluster, usually
[07:44:30] just a lot more of those are now automated
[07:44:39] https://wikitech.wikimedia.org/wiki/Incident_documentation is very useful but only after some days; during events one has to scan all the logs
[07:45:19] <_joe_> oh so you mean _during_ outages
[07:45:39] yes, once upon a time it was very useful for communication during incidents
[07:45:52] but in the last several years the communication about events has decreased a lot
[07:46:09] maybe it's less needed now because users already tell each other there's a problem on twitter, dunno
[07:46:59] there's also https://wikitech.wikimedia.org/wiki/Incident_response#Logging ...
[07:47:04] so, in this case, there have been no actions other than my traffic move just now
[08:00:23] (PS1) Ori.livneh: hhvm: disable internal stat collection [puppet] - https://gerrit.wikimedia.org/r/161647
[08:03:13] <_joe_> ori: I'm going out now, but just set hhvm.stats = false would be enough
[08:03:51] _joe_: except some of the configuration keys don't do anything, and some of the others are changing -- so i'd rather not risk having this become voodoo
[08:04:33] <_joe_> mmmh, ok, it's easy to put them back
[08:04:52] <_joe_> (also, I guess we should try to look at 3.3.0 for metrics to work again)
[08:05:04] <_joe_> but, good night!
[08:06:20] good night
[08:10:41] ori: do you need that merged now?
[08:11:17] not critical but would help if issues recur in the weekend
[08:11:35] do I need to babysit it on all boxes?
nah, it's safe™
[08:11:58] i'll stick around while it rolls out
[08:12:10] but it's a revert to known-good config
[08:12:12] (CR) Mark Bergsma: [C: 2] hhvm: disable internal stat collection [puppet] - https://gerrit.wikimedia.org/r/161647 (owner: Ori.livneh)
[08:12:24] thanks
[08:14:51] just tested on one, seems fine indeed
[08:15:53] /bits.wikimedia.org/skins-1.5/common/images/ajax-loader.gif
[08:15:56] cool, thanks again
[08:16:01] can somone pls tell me the new url?
[08:16:32] is the grafic located on a other .bits /dir ?
[08:18:25] ah, https://commons.wikimedia.org/wiki/File:Ajax-loader.gif. its the same :=)
[08:47:09] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[10:51:57] (PS1) Anomie: 'securepoll-create-poll' for sysop on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/161653
[10:54:09] (CR) Anomie: "Planned for Monday SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/161653 (owner: Anomie)
[10:56:50] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:08:00] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:17:19] (PS3) Yuvipanda: nagios_common: Refactor custom command definitions [puppet] - https://gerrit.wikimedia.org/r/161478
[11:42:39] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:44:02] (Abandoned) Yuvipanda: quarry: Use mysql-server-5.6 instead of 5.5 [puppet] - https://gerrit.wikimedia.org/r/153784 (owner: Yuvipanda)
[11:44:15] (Abandoned) Yuvipanda: toollabs: Increase space on /var for all nodes [puppet] - https://gerrit.wikimedia.org/r/143098 (owner: Yuvipanda)
[11:46:13] (Abandoned) Yuvipanda: Implement last command (per greg-g) [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/150082 (owner: Yuvipanda)
[13:44:33] querycachetwo is not unique...
yay -.- [13:48:30] RECOVERY - CI tmpfs disk space on lanthanum is OK: DISK OK [14:41:42] heh, ganglia marks large amounts of free disk space in red, and small amounts in green. As though a surplus of free space was a crisis [14:48:36] (03PS1) 10Florianschmidtwelzow: Add REL1_24 as branch in ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161666 [16:56:30] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 5.1GB (= 5.0GB critical): /srv/deployment/ocg/output 3903439758B: /srv/deployment/ocg/postmortem 1604914B: ocg_job_status 12299 msg: ocg_render_job_queue 0 msg [17:00:28] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Puppet has 1 failures [17:05:17] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 5.1GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12301 msg: ocg_render_job_queue 0 msg [17:15:34] (03PS1) 1001tonythomas: Corrected the exim regex expression [puppet] - 10https://gerrit.wikimedia.org/r/161679 [17:18:37] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:30:18] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 14.4GB (= 5.0GB critical): /srv/deployment/ocg/output 3964604775B: /srv/deployment/ocg/postmortem 1611320B: ocg_job_status 12303 msg: ocg_render_job_queue 0 msg [17:42:28] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 22.2GB (= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1611320B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [17:44:37] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 23.4GB (= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1611320B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [17:47:38] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 14.2GB (= 
5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [17:50:37] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 27.2GB (= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1611320B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [17:52:09] (03PS1) 10Gerrit Patch Uploader: Remove "MW" and "C" namespace alias on ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161684 (https://bugzilla.wikimedia.org/70564) [17:52:15] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161684 (https://bugzilla.wikimedia.org/70564) (owner: 10Gerrit Patch Uploader) [17:52:38] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 16.0GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [17:55:38] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 13.1GB (= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1640275B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [18:02:47] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 20.8GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [18:07:58] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 22.4GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [18:08:57] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 16.8GB (= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1640275B: ocg_job_status 12304 msg: 
ocg_render_job_queue 0 msg [18:10:59] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 23.2GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [18:12:18] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures [18:13:22] (03PS2) 10Steinsplitter: Remove "MW" and "C" namespace alias on ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161684 (https://bugzilla.wikimedia.org/70564) (owner: 10Gerrit Patch Uploader) [18:18:07] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 21.6GB (= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1645462B: ocg_job_status 12305 msg: ocg_render_job_queue 0 msg [18:19:07] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 25.3GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12305 msg: ocg_render_job_queue 0 msg [18:22:17] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 25.3GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12305 msg: ocg_render_job_queue 0 msg [18:24:17] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 25.4GB (= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1645462B: ocg_job_status 12305 msg: ocg_render_job_queue 0 msg [18:27:18] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 25.3GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12305 msg: ocg_render_job_queue 0 msg [18:30:27] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 5.8GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12305 msg: ocg_render_job_queue 0 msg 
[18:30:28] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[18:31:17] RECOVERY - OCG health on ocg1003 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1255714B: ocg_job_status 12305 msg: ocg_render_job_queue 0 msg
[18:32:27] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 25.4GB (>= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1648026B: ocg_job_status 12306 msg: ocg_render_job_queue 0 msg
[18:35:27] RECOVERY - OCG health on ocg1002 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1677105B: ocg_job_status 12306 msg: ocg_render_job_queue 0 msg
[18:41:59] greg-g: awake ?
[19:25:38] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: Epic puppet fail
[19:33:08] for a config change (operations/mediawiki-config) there is a slot in SWAT needed, right?
[19:37:17] FlorianSW: it is the best way to get something merged without having luck :)
[19:38:15] JohnLewis: ok, thanks for the info :)
[19:44:48] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[20:04:59] (03PS1) 10Andrew Bogott: Remove virt1006 from the nova-compute pool [puppet] - 10https://gerrit.wikimedia.org/r/161745
[20:06:13] (03PS1) 10Krinkle: contint: Increase CPU monitoring sensitivity from 95-99 to 90-97 [puppet] - 10https://gerrit.wikimedia.org/r/161746
[20:06:54] (03PS3) 10Krinkle: contint: Ensure nodejs-legacy is installed [puppet] - 10https://gerrit.wikimedia.org/r/159226
[20:11:05] (03PS4) 10Krinkle: contint: Ensure nodejs-legacy is installed [puppet] - 10https://gerrit.wikimedia.org/r/159226
[20:22:26] (03PS5) 10Krinkle: contint: Ensure nodejs-legacy is installed [puppet] - 10https://gerrit.wikimedia.org/r/159226
[20:22:54] (03PS6) 10Krinkle: contint: Ensure nodejs-legacy is installed [puppet] - 10https://gerrit.wikimedia.org/r/159226
[20:22:56] (03CR) 10Andrew Bogott: [C: 032] Remove virt1006 from the nova-compute pool [puppet] - 10https://gerrit.wikimedia.org/r/161745 (owner: 10Andrew Bogott)
[20:23:44] (03PS1) 10Krinkle: contint: Package 'php5-parsekit' is absent on Trusty, don't require it [puppet] - 10https://gerrit.wikimedia.org/r/161748
[20:26:59] (03PS2) 10Krinkle: labmon: Increase CPU sensitivity for contint from 95-99 to 90-97 [puppet] - 10https://gerrit.wikimedia.org/r/161746
[20:29:56] !log moved all VMs off of virt1006, disabled compute service
[20:30:03] Logged the message, Master
[20:30:36] !log rebooting virt1006 to make good and sure it doesn't spontaneously re-enter the compute pool
[20:30:42] Logged the message, Master
[20:32:59] PROBLEM - DPKG on virt1006 is CRITICAL: Connection refused by host
[20:32:59] PROBLEM - SSH on virt1006 is CRITICAL: Connection refused
[20:33:07] PROBLEM - check if dhclient is running on virt1006 is CRITICAL: Connection refused by host
[20:33:19] PROBLEM - puppet last run on virt1006 is CRITICAL: Connection refused by host
[20:33:19] PROBLEM - Disk space on virt1006 is CRITICAL: Connection refused by host
[20:33:28] PROBLEM - RAID on virt1006 is CRITICAL: Connection refused by host
[20:33:28] PROBLEM - check configured eth on virt1006 is CRITICAL: Connection refused by host
[20:34:29] (03CR) 10Andrew Bogott: [C: 032] labmon: Increase CPU sensitivity for contint from 95-99 to 90-97 [puppet] - 10https://gerrit.wikimedia.org/r/161746 (owner: 10Krinkle)
[20:36:49] ACKNOWLEDGEMENT - DPKG on virt1006 is CRITICAL: Connection refused by host andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:36:49] ACKNOWLEDGEMENT - Disk space on virt1006 is CRITICAL: Connection refused by host andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:36:49] ACKNOWLEDGEMENT - NTP on virt1006 is CRITICAL: NTP CRITICAL: No response from NTP server andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:36:49] ACKNOWLEDGEMENT - RAID on virt1006 is CRITICAL: Connection refused by host andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:36:49] ACKNOWLEDGEMENT - SSH on virt1006 is CRITICAL: Connection refused andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:36:50] ACKNOWLEDGEMENT - check configured eth on virt1006 is CRITICAL: Connection refused by host andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:36:50] ACKNOWLEDGEMENT - check if dhclient is running on virt1006 is CRITICAL: Connection refused by host andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:36:51] ACKNOWLEDGEMENT - puppet last run on virt1006 is CRITICAL: Connection refused by host andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:57:10] (03CR) 10Aaron Schulz: "I don't really get the hyperthreading/threads bit. There is of course less context switching with less threads. On the other hand parsing " [puppet] - 10https://gerrit.wikimedia.org/r/161473 (owner: 10GWicke)
[21:05:49] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: Epic puppet fail
[21:23:58] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[22:28:32] !log Reloading Zuul to deploy I0170766cfc06b8e6
[22:28:38] Logged the message, Master
[23:03:40] (03CR) 10GWicke: "> I don't really get the hyperthreading/threads bit. There is of course less context switching with less threads. On the other hand parsin" [puppet] - 10https://gerrit.wikimedia.org/r/161473 (owner: 10GWicke)
[23:20:18] (03CR) 10GWicke: "I should add that the linux run queue can include processes blocked on disk IO. Jobs are typically blocking on network though, which means" [puppet] - 10https://gerrit.wikimedia.org/r/161473 (owner: 10GWicke)
[23:21:10] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[23:37:29] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
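GWicke's point in the 23:20:18 review comment, that the Linux run queue (and hence the load average) can include processes blocked on disk IO, can be observed directly: Linux counts tasks in uninterruptible sleep (state `D`, typically disk-blocked) toward the load average, while tasks blocked on the network usually sleep interruptibly (state `S`) and are not counted. A rough sketch assuming a Linux `/proc` filesystem (the D-state scan is illustrative, not any code from the change under review):

```python
import glob
import os


def parse_loadavg(text):
    """Parse the 1/5/15-minute averages from a /proc/loadavg line."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def count_d_state():
    """Count tasks in uninterruptible sleep ('D'), e.g. blocked on disk IO.

    These tasks inflate the Linux load average even though they use no CPU,
    which is why load alone can mislead when sizing worker/thread counts.
    """
    count = 0
    for stat in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(stat) as f:
                # Field layout: pid (comm) state ...; comm may contain spaces,
                # so split after the closing parenthesis.
                fields = f.read().rsplit(")", 1)[1].split()
            if fields[0] == "D":
                count += 1
        except OSError:
            continue  # process exited while we were scanning
    return count


if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as f:
        print(parse_loadavg(f.read()), "D-state tasks:", count_d_state())
```

A load average close to the core count can therefore mean either CPU saturation or many disk-blocked tasks; comparing it against the D-state count distinguishes the two.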