[00:00:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 31547 seconds ago, expected 28800 [00:02:24] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:03:23] set this task to UBN! btw, https://phabricator.wikimedia.org/T85293 [00:04:41] spagewmf: got your email. sounds like the right thing to do. [00:05:19] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 31847 seconds ago, expected 28800 [00:05:34] robla: thx. I'll be around for at least 2 hours to monitor. quiddity can test [00:06:14] I'm curious about how it got into an unusable state in the first place, though [00:07:07] spagewmf: was this just a bug from last week's deployment (or earlier) that went unnoticed until now? [00:08:30] YuviPanda: that sucks :-( [00:08:50] robla: yeah, I’ll probably poke around for another hour before giving up. [00:09:02] at least it’s not an NFS outage, so things that are currently working mostly continue to work :) [00:09:14] robla: it's fallout from the jQuery update. Collaboration team is responsiblt for Editor Engagement extensions, but we didn't notice failures until curators reported them. [00:10:08] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 32147 seconds ago, expected 28800 [00:15:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 32447 seconds ago, expected 28800 [00:20:07] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 32747 seconds ago, expected 28800 [00:20:30] !log spage Synchronized php-1.25wmf13/extensions/PageTriage/modules/ext.pageTriage.views.toolbar/ext.pageTriage.delete.js: Unbreak page curation (duration: 00m 06s) [00:20:36] Logged the message, Master [00:21:13] spagewmf, looks good. [00:21:29] quiddity: OK, now for enwiki [00:25:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 33047 seconds ago, expected 28800 [00:25:14] robla: and yay, just managed to fix that UBN task :) [00:26:11] !log spage Synchronized php-1.25wmf12/extensions/PageTriage/modules/ext.pageTriage.views.toolbar/ext.pageTriage.delete.js: Unbreak page curation on enwiki for Xmas (duration: 00m 05s) [00:26:17] Logged the message, Master [00:26:32] quiddity: care to curate some on enwiki? [00:27:43] spagewmf: You done deploying? [00:28:21] hoo: if quiddity gives the thumbs up, yes. [00:29:29] anyone, How_to_deploy_code's link for " the last two hour's worth of exceptions and misc. fatals" shows none at all. Are we that good or is the graph wrong? [00:29:31] spagewmf, done. it works. HUGE thanks. [00:29:48] it's a Festivus miracle! hoo^ we're done [00:29:52] I'll go let them know on the talkpage, and then crawl back under a blanket. [00:30:02] spagewmf: I'd rather check that manually on fluorine [00:30:21] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 33347 seconds ago, expected 28800 [00:30:24] Page Triage roasting n00bs on an open fire, Curation Toolbar nipping at your nose [00:30:40] (03CR) 10Hoo man: [C: 032] Fix Bug54847.php for broken hashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181710 (owner: 10Hoo man) [00:30:48] (03Merged) 10jenkins-bot: Fix Bug54847.php for broken hashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181710 (owner: 10Hoo man) [00:31:02] jenkins is fast... like no one is working :P [00:31:10] hoo: will do. 
I added that graph link but I don't get ganglia [00:32:12] !log hoo Synchronized wmf-config/Bug54847.php: Fix for invalid hashes (this prevented some people from logging in) (duration: 00m 05s) [00:32:14] Logged the message, Master [00:35:16] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 33647 seconds ago, expected 28800 [00:40:15] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 33947 seconds ago, expected 28800 [00:44:24] (03PS1) 10Yuvipanda: Bump version number and add python3-ldap dependency [software/shinkengen] - 10https://gerrit.wikimedia.org/r/181772 [00:44:59] (03CR) 10Yuvipanda: [C: 032 V: 032] Bump version number and add python3-ldap dependency [software/shinkengen] - 10https://gerrit.wikimedia.org/r/181772 (owner: 10Yuvipanda) [00:45:14] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 34247 seconds ago, expected 28800 [00:46:36] (03PS1) 10Yuvipanda: Minor cleanup [software/shinkengen] - 10https://gerrit.wikimedia.org/r/181773 [00:47:15] (03CR) 10Yuvipanda: [C: 032 V: 032] Minor cleanup [software/shinkengen] - 10https://gerrit.wikimedia.org/r/181773 (owner: 10Yuvipanda) [00:50:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 34547 seconds ago, expected 28800 [00:55:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 34847 seconds ago, expected 28800 [01:00:17] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 35147 seconds ago, expected 28800 [01:05:19] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 35447 seconds ago, expected 28800 [01:10:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 35747 seconds ago, expected 28800 [01:15:11] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 36047 seconds ago, expected 28800 [01:20:12] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 36348 seconds ago, expected 28800 [01:25:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 36648 seconds ago, expected 28800 [01:30:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 36947 seconds ago, expected 28800 [01:32:20] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: puppet fail [01:34:54] (03PS2) 10Gage: Strongswan: Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 [01:35:19] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 37247 seconds ago, expected 28800 [01:35:51] (03PS3) 10Gage: Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 [01:38:59] (03PS1) 10Yuvipanda: beta: Add monitoring for mediawiki app servers [puppet] - 10https://gerrit.wikimedia.org/r/181775 [01:40:16] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 37547 seconds ago, expected 28800 [01:45:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 37847 seconds ago, expected 28800 [01:46:02] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:47:02] (03PS2) 10Yuvipanda: beta: Add monitoring for mediawiki app servers [puppet] - 10https://gerrit.wikimedia.org/r/181775 [01:50:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 38147 seconds ago, expected 28800 [01:54:34] (03PS3) 10Yuvipanda: beta: Add monitoring for mediawiki app servers [puppet] - 10https://gerrit.wikimedia.org/r/181775 [01:54:42] 
(03CR) 10Yuvipanda: [C: 032] beta: Add monitoring for mediawiki app servers [puppet] - 10https://gerrit.wikimedia.org/r/181775 (owner: 10Yuvipanda) [01:55:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 38447 seconds ago, expected 28800 [01:57:43] YuviPanda: If beta runs production-like caching proxies (and it does AFAIK) that will only test the proxies [01:58:43] hoo: the code I just merged? [01:58:53] hoo: no, because it hits deployment-mediawiki* instances directly [01:58:55] yep, it tests external urls [01:58:58] really? [01:59:01] hoo: yup [01:59:01] * hoo looks again [01:59:17] hoo: if you look at the definition for check_http_url_for_string [01:59:28] git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/75/181775/2 && git checkout FETCH_HEAD [01:59:29] gah [01:59:31] command_line $USER1$/check_http -H $ARG1$ -I $HOSTADDRESS$ -u $ARG2$ -s $ARG3$ [01:59:33] hoo: ^ [01:59:36] $HOSTADDRESS$ [01:59:37] :D [02:00:03] Ah :) I assumed it was just generic checking [02:00:07] So, nevermind [02:00:11] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 38747 seconds ago, expected 28800 [02:01:00] hoo: and I immediately find that mediawiki03 is deadish [02:01:01] http://shinken.wmflabs.org/host/deployment-mediawiki03 [02:01:28] that thing asks me for a login, and I doubt I should give it my ldap [02:01:34] hoo: yup, ‘guest/guest' [02:01:38] (Terrible, I know) [02:03:41] Nice... probably that thing is not pooled (in whatever way beta uses load balancing... also LVS?) [02:03:51] there’s no LVS in labs, I think [02:04:02] hoo: I restarted HHVM there, and it’s back up [02:04:20] mh... So, they do that on the varnish level? [02:04:26] Should also work for few hosts [02:05:11] hoo: yeah, I think so. [02:05:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 39047 seconds ago, expected 28800 [02:05:21] hoo: I think LVS doesn’t work on labs properly because of the way networking is done [02:05:53] Quite possible... in production it does some "weird" things for performance reasons [02:06:31] hoo: yup. munges with source / dest addresses or something like that [02:06:34] That boron thing is annoying... I'm *so* close to just acknowledging it :D [02:06:42] (Just kidding) [02:06:47] heh [02:06:50] I’ve no idea what boron is [02:06:53] sounds fracky [02:07:01] yes, it is [02:07:19] right [02:07:29] hmm, now to figure out what else to monitor [02:07:34] uploada.wm.o for beta, I guess [02:07:43] not sure where exactly that’s served from. php or varnish [02:08:28] Hopefully varnish... 
everything else would be a performance nightmare [02:08:34] heh [02:08:43] but yeah, yay to better monitoring of betacluster :D [02:10:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 39348 seconds ago, expected 28800 [02:15:11] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 39647 seconds ago, expected 28800 [02:20:15] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 39947 seconds ago, expected 28800 [02:25:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 40247 seconds ago, expected 28800 [02:30:15] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 40547 seconds ago, expected 28800 [02:35:15] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 40847 seconds ago, expected 28800 [02:40:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 41147 seconds ago, expected 28800 [02:45:12] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 41447 seconds ago, expected 28800 [02:50:22] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 41747 seconds ago, expected 28800 [02:55:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 42047 seconds ago, expected 28800 [03:00:11] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 42347 seconds ago, expected 28800 [03:05:15] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 42648 seconds ago, expected 28800 [03:10:11] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 42948 seconds ago, expected 28800 [03:11:19] ssh to boron is timing out, but it pings. checking console.. [03:15:12] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 43248 seconds ago, expected 28800 [03:15:33] PROBLEM - puppet last run on search1002 is CRITICAL: CRITICAL: puppet fail [03:16:00] it responds on console but i can't login. icinga says its other services are ok. 
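
The command definition quoted earlier in this exchange, check_http -H $ARG1$ -I $HOSTADDRESS$ -u $ARG2$ -s $ARG3$, is the detail that settles the question: -I makes the probe connect to each app server's own address, while -H only supplies the virtual host name, so the request bypasses the caching proxies and exercises the backend directly. A minimal Python sketch of that behaviour; the host, URL and search string below are placeholders, not the real beta cluster check arguments:

    #!/usr/bin/env python3
    """Rough equivalent of the check_http_url_for_string command above:
    connect to the backend's own IP (-I) while sending the public Host
    header (-H), fetch a URL (-u) and require a string in the body (-s)."""
    import http.client
    import sys

    def check_url_for_string(host_address, host_header, url, needle, timeout=10):
        # Connecting straight to the monitored instance is what bypasses
        # the varnish layer sitting in front of beta.
        conn = http.client.HTTPConnection(host_address, 80, timeout=timeout)
        conn.request("GET", url, headers={"Host": host_header})
        resp = conn.getresponse()
        body = resp.read().decode("utf-8", errors="replace")
        if resp.status == 200 and needle in body:
            print("OK: found %r via %s%s" % (needle, host_address, url))
            return 0
        print("CRITICAL: status %s or string missing" % resp.status)
        return 2

    if __name__ == "__main__":
        # Placeholder arguments, for illustration only.
        sys.exit(check_url_for_string("deployment-mediawiki01.eqiad.wmflabs",
                                      "en.wikipedia.beta.wmflabs.org",
                                      "/wiki/Main_Page", "Wikipedia"))

This per-backend probing is why the check could flag deployment-mediawiki03 directly, rather than waiting for the proxy layer to notice anything.
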
[03:16:49] thank you though [03:16:51] oop [03:20:14] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 43547 seconds ago, expected 28800 [03:21:35] ACKNOWLEDGEMENT - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 43547 seconds ago, expected 28800 Jeff Gage were aware of the problem but unable to ssh in to investigate [03:29:36] RECOVERY - puppet last run on search1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:33:54] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:23] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:02] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:14] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:55] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 2 failures [06:38:16] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:45:25] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:45:46] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:22] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:02] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:48:55] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:50:54] PROBLEM - Disk space on analytics1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 78236 MB (4% inode=99%): /var/lib/hadoop/data/e 73588 MB (3% inode=99%): /var/lib/hadoop/data/g 79871 MB (4% inode=99%): /var/lib/hadoop/data/i 76424 MB (4% inode=99%): /var/lib/hadoop/data/k 72924 MB (3% inode=99%): /var/lib/hadoop/data/a 80913 MB (4% inode=99%): [10:33:20] PROBLEM - Host cp1054 is DOWN: CRITICAL - Plugin timed out after 15 seconds [10:33:50] RECOVERY - Host cp1054 is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [11:05:01] PROBLEM - Host ms-be2014 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:12] PROBLEM - Host mw1060 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:12] PROBLEM - Host db1006 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:12] PROBLEM - Host mw1142 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:12] PROBLEM - Host ms-fe1002 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:12] PROBLEM - Host search1016 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:13] PROBLEM - Host stat1002 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:13] PROBLEM - Host mw1234 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:14] PROBLEM - Host search-prefix.svc.eqiad.wmnet is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:20] PROBLEM - Host mw1006 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:41] RECOVERY - Host mw1060 is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms [11:05:42] RECOVERY - Host ms-be2014 is UP: PING OK - Packet loss = 0%, RTA = 43.17 ms [11:05:42] RECOVERY - Host db1006 is UP: PING OK - Packet loss = 0%, RTA = 2.10 ms [11:05:42] RECOVERY - Host ms-fe1002 is 
UP: PING OK - Packet loss = 0%, RTA = 2.01 ms [11:05:42] RECOVERY - Host mw1234 is UP: PING OK - Packet loss = 0%, RTA = 1.45 ms [11:05:50] RECOVERY - Host stat1002 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [11:05:55] RECOVERY - Host mw1006 is UP: PING OK - Packet loss = 0%, RTA = 1.84 ms [11:06:03] RECOVERY - Host search1016 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [11:06:03] RECOVERY - Host mw1142 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [11:06:57] search-prefix.svc.eqiad.wmnet paged, feels like a false positive, looking [11:08:20] RECOVERY - Host search-prefix.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 2.89 ms [11:09:28] what the ... [11:10:15] <_joe_> what's search-prefix ffs? [11:10:34] <_joe_> akosiaris: it was 1 month (almost) we didn't have a page [11:10:45] <_joe_> it seems one row was out? [11:10:53] and we got one in the best day possible [11:10:57] <_joe_> yeah [11:11:04] <_joe_> my whole family is mocking me [11:12:25] looks like it got unlucky in the plugin timeout crossfire [11:12:59] <_joe_> Can't initialize ipvs: No space left on device [11:13:04] <_joe_> on lvs1003 [11:13:05] <_joe_> shit [11:13:22] ok, I was about to say it mustn't be a row going down [11:13:24] <_joe_> no sorry, forgot sudo [11:13:32] a ... few [11:13:34] phew [11:13:52] some of these boxes are on row A, some on row B [11:14:07] even row D [11:14:19] <_joe_> so... neon? [11:14:39] I think so [11:15:09] <_joe_> meh [11:16:21] <_joe_> load average: 300.66, 208.42, 195.32 [11:17:32] <_joe_> ok, I'll go. merry christmas to all the opsens in the world [11:18:20] hahah bye _joe_ [11:18:41] <_joe_> (and check_ganglia should be burnt. with fire. like yesterday.) [11:19:06] wanted: human timestamp in icinga.log, I can't do unix timestamp conversion in my head [11:22:11] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [11:30:46] I don't think it was neon [11:32:13] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [11:34:12] paravoid: network? [11:34:37] hrm, dunno [11:38:54] I think it got unlucky going into HARD state by timing out twice in icinga, together with the other flood, possibly because check_ssl and check_ganglia were hammering neon [11:39:23] this might be the case, but that doesn't explain the 503s spike at the same time [11:41:13] indeed, that's a single datapoint 8 minutes later after recovery tho [11:41:38] not 8, more like 5 [11:41:59] right [11:42:00] anyway [11:42:02] whatever :) [11:42:15] it works now [11:43:06] hehe indeed, lunch almost ready here, enjoy! [12:38:37] (03PS1) 10Yuvipanda: beta: Add HHVM queue size monitoring [puppet] - 10https://gerrit.wikimedia.org/r/181787 [12:50:21] hmm, just got missing styles because of 503 [12:50:47] Nikerabbit: how many times? that's a frequent bug [12:50:56] Nemo_bis: just once of course [12:51:17] No visible spikes in https://gdash.wikimedia.org/dashboards/reqerror/ :( [12:58:20] (03PS2) 10Yuvipanda: beta: Add HHVM queue size monitoring [puppet] - 10https://gerrit.wikimedia.org/r/181787 [13:16:52] (03PS1) 10Tim Landscheidt: Fix motd on Trusty instances [puppet] - 10https://gerrit.wikimedia.org/r/181789 [15:03:26] meh, I just wanted to write a bug report about a graphite graph, but then the page must reload and all gdash/graphite images show up as "Bad Gateway: The proxy server received an invalid response from an upstream server."... 
well played [15:03:30] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.009 second response time [15:10:20] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.011 second response time [17:32:04] PROBLEM - HHVM rendering on mw1239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:08] PROBLEM - Apache HTTP on mw1239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:12] PROBLEM - HHVM queue size on mw1239 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [80.0] [17:44:12] (03PS1) 10Nemo bis: Permanently enable unregistered users editing on it.m.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 [17:55:21] (03CR) 10Glaisher: Permanently enable unregistered users editing on it.m.wikipedia.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 (owner: 10Nemo bis) [18:10:25] (03CR) 10Nemo bis: Permanently enable unregistered users editing on it.m.wikipedia.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 (owner: 10Nemo bis) [18:28:54] (03CR) 10Florianschmidtwelzow: [C: 031] "+1 to solve the Task with this patch hopefully before 31. dezember to not run into caching issues (like the time where we activated it: T7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 (owner: 10Nemo bis) [19:09:09] (03CR) 10Ori.livneh: [C: 031] beta: Add HHVM queue size monitoring [puppet] - 10https://gerrit.wikimedia.org/r/181787 (owner: 10Yuvipanda) [19:10:05] (03CR) 10Yuvipanda: [C: 032] beta: Add HHVM queue size monitoring [puppet] - 10https://gerrit.wikimedia.org/r/181787 (owner: 10Yuvipanda) [19:43:07] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:43:07] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:00:14] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:03:30] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [20:03:30] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
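
The "HTTP 5xx req/min" and "HHVM queue size" alerts above are graphite-threshold checks: they pull the last few minutes of a metric and go WARNING or CRITICAL when more than some fraction of the datapoints sit above a threshold, which is what messages like "44.44% of data above the critical threshold [80.0]" are reporting. A rough sketch of that logic, assuming a standard Graphite render API; the URL, metric name, time window and warning/critical fractions are illustrative, not the production settings:

    #!/usr/bin/env python3
    """Sketch of a 'percentage of datapoints above threshold' check."""
    import json
    import urllib.request

    def percent_over(graphite_url, target, threshold, minutes=10):
        url = ("%s/render?target=%s&from=-%dmin&format=json"
               % (graphite_url, target, minutes))
        with urllib.request.urlopen(url, timeout=10) as resp:
            datapoints = json.load(resp)[0]["datapoints"]  # [[value, ts], ...]
        values = [v for v, _ts in datapoints if v is not None]
        if not values:
            return None  # no data at all is its own alert state
        return 100.0 * sum(1 for v in values if v > threshold) / len(values)

    if __name__ == "__main__":
        pct = percent_over("http://graphite.example.org", "reqstats.5xx", 500.0)
        if pct is None:
            print("UNKNOWN: no datapoints returned")
        elif pct >= 20.0:   # example critical fraction
            print("CRITICAL: %.2f%% of data above the critical threshold [500.0]" % pct)
        elif pct >= 1.0:    # example warning fraction
            print("WARNING: %.2f%% of data above the threshold [500.0]" % pct)
        else:
            print("OK: Less than 1.00% above the threshold [500.0]")

Checking a fraction of recent datapoints rather than a single sample is what keeps a lone 503 spike, like the one discussed earlier, from paging anyone.
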
[20:44:48] <_joe_> !log restarting hhvm on mw1239, stuck in HPHP::is_valid_var_name probably after trying to call ini_set [20:44:56] Logged the message, Master [20:46:40] RECOVERY - Apache HTTP on mw1239 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [20:46:41] RECOVERY - HHVM rendering on mw1239 is OK: HTTP OK: HTTP/1.1 200 OK - 66324 bytes in 0.107 second response time [20:57:20] RECOVERY - HHVM queue size on mw1239 is OK: OK: Less than 30.00% above the threshold [10.0] [21:39:40] PROBLEM - Disk space on analytics1014 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 74986 MB (3% inode=99%): /var/lib/hadoop/data/f 78079 MB (4% inode=99%): /var/lib/hadoop/data/h 80222 MB (4% inode=99%): /var/lib/hadoop/data/j 74536 MB (3% inode=99%): /var/lib/hadoop/data/l 80356 MB (4% inode=99%): /var/lib/hadoop/data/b 81786 MB (4% inode=99%): [22:38:47] (03PS1) 10Yuvipanda: shinken: Add ssh checks for all monitored hosts [puppet] - 10https://gerrit.wikimedia.org/r/181807 [22:50:43] hmm [22:50:44] interesting [22:50:53] port 22 is supposedly open by default to anywhere in the network [22:51:05] but of course, everything fails... [22:51:05] hmm [22:52:19] YuviPanda: port 22 just hates you
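
The closing exchange is about the ssh checks added in https://gerrit.wikimedia.org/r/181807: port 22 is expected to be reachable from anywhere inside the labs network, yet the new checks all fail. The check itself is simple, in the spirit of the standard check_ssh plugin: open TCP port 22 and read the SSH banner. A minimal sketch, with an example host name rather than a real instance:

    #!/usr/bin/env python3
    """Open port 22 and read the SSH banner, check_ssh style."""
    import socket
    import sys

    def check_ssh(host, port=22, timeout=10):
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                banner = sock.recv(256).decode("ascii", errors="replace").strip()
        except OSError as exc:
            # A filtered port and a dead sshd both land here, which is what
            # the failing checks in the log would be reporting.
            print("CRITICAL: cannot reach %s:%d (%s)" % (host, port, exc))
            return 2
        if banner.startswith("SSH-"):
            print("OK: %s answered with %r" % (host, banner))
            return 0
        print("WARNING: unexpected banner from %s: %r" % (host, banner))
        return 1

    if __name__ == "__main__":
        sys.exit(check_ssh("deployment-mediawiki03.eqiad.wmflabs"))

If the distinction matters when debugging, a connection timeout here usually points at filtering between the monitoring host and the instance rather than at sshd being down, which fits the "port 22 is supposedly open by default, but everything fails" puzzlement.
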