[00:00:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 31547 seconds ago, expected 28800 [00:02:24] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:03:23] set this task to UBN! btw, https://phabricator.wikimedia.org/T85293 [00:04:41] spagewmf: got your email. sounds like the right thing to do. [00:05:19] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 31847 seconds ago, expected 28800 [00:05:34] robla: thx. I'll be around for at least 2 hours to monitor. quiddity can test [00:06:14] I'm curious about how it got into an unusable state in the first place, though [00:07:07] spagewmf: was this just a bug from last week's deployment (or earlier) that went unnoticed until now? [00:08:30] YuviPanda: that sucks :-( [00:08:50] robla: yeah, I’ll probably poke around for another hour before giving up. [00:09:02] at least it’s not an NFS outage, so things that are currently working mostly continue to work :) [00:09:14] robla: it's fallout from the jQuery update. Collaboration team is responsiblt for Editor Engagement extensions, but we didn't notice failures until curators reported them. [00:10:08] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 32147 seconds ago, expected 28800 [00:15:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 32447 seconds ago, expected 28800 [00:20:07] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 32747 seconds ago, expected 28800 [00:20:30] !log spage Synchronized php-1.25wmf13/extensions/PageTriage/modules/ext.pageTriage.views.toolbar/ext.pageTriage.delete.js: Unbreak page curation (duration: 00m 06s) [00:20:36] Logged the message, Master [00:21:13] spagewmf, looks good. [00:21:29] quiddity: OK, now for enwiki [00:25:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 33047 seconds ago, expected 28800 [00:25:14] robla: and yay, just managed to fix that UBN task :) [00:26:11] !log spage Synchronized php-1.25wmf12/extensions/PageTriage/modules/ext.pageTriage.views.toolbar/ext.pageTriage.delete.js: Unbreak page curation on enwiki for Xmas (duration: 00m 05s) [00:26:17] Logged the message, Master [00:26:32] quiddity: care to curate some on enwiki? [00:27:43] spagewmf: You done deploying? [00:28:21] hoo: if quiddity gives the thumbs up, yes. [00:29:29] anyone, How_to_deploy_code's link for " the last two hour's worth of exceptions and misc. fatals" shows none at all. Are we that good or is the graph wrong? [00:29:31] spagewmf, done. it works. HUGE thanks. [00:29:48] it's a Festivus miracle! hoo^ we're done [00:29:52] I'll go let them know on the talkpage, and then crawl back under a blanket. [00:30:02] spagewmf: I'd rather check that manually on fluorine [00:30:21] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 33347 seconds ago, expected 28800 [00:30:24] Page Triage roasting n00bs on an open fire, Curation Toolbar nipping at your nose [00:30:40] (03CR) 10Hoo man: [C: 032] Fix Bug54847.php for broken hashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181710 (owner: 10Hoo man) [00:30:48] (03Merged) 10jenkins-bot: Fix Bug54847.php for broken hashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181710 (owner: 10Hoo man) [00:31:02] jenkins is fast... like no one is working :P [00:31:10] hoo: will do. 
I added that graph link but I don't get ganglia [00:32:12] !log hoo Synchronized wmf-config/Bug54847.php: Fix for invalid hashes (this prevented some people from logging in) (duration: 00m 05s) [00:32:14] Logged the message, Master [00:35:16] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 33647 seconds ago, expected 28800 [00:40:15] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 33947 seconds ago, expected 28800 [00:44:24] (03PS1) 10Yuvipanda: Bump version number and add python3-ldap dependency [software/shinkengen] - 10https://gerrit.wikimedia.org/r/181772 [00:44:59] (03CR) 10Yuvipanda: [C: 032 V: 032] Bump version number and add python3-ldap dependency [software/shinkengen] - 10https://gerrit.wikimedia.org/r/181772 (owner: 10Yuvipanda) [00:45:14] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 34247 seconds ago, expected 28800 [00:46:36] (03PS1) 10Yuvipanda: Minor cleanup [software/shinkengen] - 10https://gerrit.wikimedia.org/r/181773 [00:47:15] (03CR) 10Yuvipanda: [C: 032 V: 032] Minor cleanup [software/shinkengen] - 10https://gerrit.wikimedia.org/r/181773 (owner: 10Yuvipanda) [00:50:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 34547 seconds ago, expected 28800 [00:55:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 34847 seconds ago, expected 28800 [01:00:17] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 35147 seconds ago, expected 28800 [01:05:19] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 35447 seconds ago, expected 28800 [01:10:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 35747 seconds ago, expected 28800 [01:15:11] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 36047 seconds ago, expected 28800 [01:20:12] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 36348 seconds ago, expected 28800 [01:25:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 36648 seconds ago, expected 28800 [01:30:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 36947 seconds ago, expected 28800 [01:32:20] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: puppet fail [01:34:54] (03PS2) 10Gage: Strongswan: Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 [01:35:19] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 37247 seconds ago, expected 28800 [01:35:51] (03PS3) 10Gage: Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 [01:38:59] (03PS1) 10Yuvipanda: beta: Add monitoring for mediawiki app servers [puppet] - 10https://gerrit.wikimedia.org/r/181775 [01:40:16] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 37547 seconds ago, expected 28800 [01:45:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 37847 seconds ago, expected 28800 [01:46:02] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:47:02] (03PS2) 10Yuvipanda: beta: Add monitoring for mediawiki app servers [puppet] - 10https://gerrit.wikimedia.org/r/181775 [01:50:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 38147 seconds ago, expected 28800 [01:54:34] (03PS3) 10Yuvipanda: beta: Add monitoring for mediawiki app servers [puppet] - 10https://gerrit.wikimedia.org/r/181775 [01:54:42] 
(03CR) 10Yuvipanda: [C: 032] beta: Add monitoring for mediawiki app servers [puppet] - 10https://gerrit.wikimedia.org/r/181775 (owner: 10Yuvipanda) [01:55:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 38447 seconds ago, expected 28800 [01:57:43] YuviPanda: If beta runs production-like caching proxies (and it does AFAIK) that will only test the proxies [01:58:43] hoo: the code I just merged? [01:58:53] hoo: no, because it hits deployment-mediawiki* instances directly [01:58:55] yep, it tests external urls [01:58:58] really? [01:59:01] hoo: yup [01:59:01] * hoo looks again [01:59:17] hoo: if you look at the definition for check_http_url_for_string [01:59:28] git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/75/181775/2 && git checkout FETCH_HEAD [01:59:29] gah [01:59:31] command_line $USER1$/check_http -H $ARG1$ -I $HOSTADDRESS$ -u $ARG2$ -s $ARG3$ [01:59:33] hoo: ^ [01:59:36] $HOSTADDRESS$ [01:59:37] :D [02:00:03] Ah :) I assumed it was just generic checking [02:00:07] So, nevermind [02:00:11] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 38747 seconds ago, expected 28800 [02:01:00] hoo: and I immediately find that mediawiki03 is deadish [02:01:01] http://shinken.wmflabs.org/host/deployment-mediawiki03 [02:01:28] that thing asks me for a login, and I doubt I should give it my ldap [02:01:34] hoo: yup, ‘guest/guest' [02:01:38] (Terrible, I know) [02:03:41] Nice... probably that thing is not pooled (in whatever way beta uses load balancing... also LVS?) [02:03:51] there’s no LVS in labs, I think [02:04:02] hoo: I restarted HHVM there, and it’s back up [02:04:20] mh... So, they do that on the varnish level? [02:04:26] Should also work for few hosts [02:05:11] hoo: yeah, I think so. [02:05:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 39047 seconds ago, expected 28800 [02:05:21] hoo: I think LVS doesn’t work on labs properly because of the way networking is done [02:05:53] Quite possible... in production it does some "weird" things for performance reasons [02:06:31] hoo: yup. munges with source / dest addresses or something like that [02:06:34] That boron thing is annoying... I'm *so* close to just acknowledging it :D [02:06:42] (Just kidding) [02:06:47] heh [02:06:50] I’ve no idea what boron is [02:06:53] sounds fracky [02:07:01] yes, it is [02:07:19] right [02:07:29] hmm, now to figure out what else to monitor [02:07:34] uploada.wm.o for beta, I guess [02:07:43] not sure where exactly that’s served from. php or varnish [02:08:28] Hopefully varnish... 
everything else would be a performance nightmare [02:08:34] heh [02:08:43] but yeah, yay to better monitoring of betacluster :D [02:10:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 39348 seconds ago, expected 28800 [02:15:11] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 39647 seconds ago, expected 28800 [02:20:15] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 39947 seconds ago, expected 28800 [02:25:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 40247 seconds ago, expected 28800 [02:30:15] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 40547 seconds ago, expected 28800 [02:35:15] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 40847 seconds ago, expected 28800 [02:40:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 41147 seconds ago, expected 28800 [02:45:12] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 41447 seconds ago, expected 28800 [02:50:22] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 41747 seconds ago, expected 28800 [02:55:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 42047 seconds ago, expected 28800 [03:00:11] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 42347 seconds ago, expected 28800 [03:05:15] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 42648 seconds ago, expected 28800 [03:10:11] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 42948 seconds ago, expected 28800 [03:11:19] ssh to boron is timing out, but it pings. checking console.. [03:15:12] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 43248 seconds ago, expected 28800 [03:15:33] PROBLEM - puppet last run on search1002 is CRITICAL: CRITICAL: puppet fail [03:16:00] it responds on console but i can't login. icinga says its other services are ok. 
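
The command definition quoted earlier in this exchange, check_http -H $ARG1$ -I $HOSTADDRESS$ -u $ARG2$ -s $ARG3$, is the detail that settles the question: -I makes the probe connect to each app server's own address, while -H only supplies the virtual host name, so the request bypasses the caching proxies and exercises the backend directly. A minimal Python sketch of that behaviour; the host, URL and search string below are placeholders, not the real beta cluster check arguments:

    #!/usr/bin/env python3
    """Rough equivalent of the check_http_url_for_string command above:
    connect to the backend's own IP (-I) while sending the public Host
    header (-H), fetch a URL (-u) and require a string in the body (-s)."""
    import http.client
    import sys

    def check_url_for_string(host_address, host_header, url, needle, timeout=10):
        # Connecting straight to the monitored instance is what bypasses
        # the varnish layer sitting in front of beta.
        conn = http.client.HTTPConnection(host_address, 80, timeout=timeout)
        conn.request("GET", url, headers={"Host": host_header})
        resp = conn.getresponse()
        body = resp.read().decode("utf-8", errors="replace")
        if resp.status == 200 and needle in body:
            print("OK: found %r via %s%s" % (needle, host_address, url))
            return 0
        print("CRITICAL: status %s or string missing" % resp.status)
        return 2

    if __name__ == "__main__":
        # Placeholder arguments, for illustration only.
        sys.exit(check_url_for_string("deployment-mediawiki01.eqiad.wmflabs",
                                      "en.wikipedia.beta.wmflabs.org",
                                      "/wiki/Main_Page", "Wikipedia"))

This per-backend probing is why the check could flag deployment-mediawiki03 directly, rather than waiting for the proxy layer to notice anything.
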
[03:16:49] thank you though [03:16:51] oop [03:20:14] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 43547 seconds ago, expected 28800 [03:21:35] ACKNOWLEDGEMENT - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet last ran 43547 seconds ago, expected 28800 Jeff Gage were aware of the problem but unable to ssh in to investigate [03:29:36] RECOVERY - puppet last run on search1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:33:54] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:23] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:02] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:14] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:55] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 2 failures [06:38:16] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:45:25] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:45:46] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:22] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:02] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:48:55] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:50:54] PROBLEM - Disk space on analytics1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 78236 MB (4% inode=99%): /var/lib/hadoop/data/e 73588 MB (3% inode=99%): /var/lib/hadoop/data/g 79871 MB (4% inode=99%): /var/lib/hadoop/data/i 76424 MB (4% inode=99%): /var/lib/hadoop/data/k 72924 MB (3% inode=99%): /var/lib/hadoop/data/a 80913 MB (4% inode=99%): [10:33:20] PROBLEM - Host cp1054 is DOWN: CRITICAL - Plugin timed out after 15 seconds [10:33:50] RECOVERY - Host cp1054 is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [11:05:01] PROBLEM - Host ms-be2014 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:12] PROBLEM - Host mw1060 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:12] PROBLEM - Host db1006 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:12] PROBLEM - Host mw1142 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:12] PROBLEM - Host ms-fe1002 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:12] PROBLEM - Host search1016 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:13] PROBLEM - Host stat1002 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:13] PROBLEM - Host mw1234 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:14] PROBLEM - Host search-prefix.svc.eqiad.wmnet is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:20] PROBLEM - Host mw1006 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:05:41] RECOVERY - Host mw1060 is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms [11:05:42] RECOVERY - Host ms-be2014 is UP: PING OK - Packet loss = 0%, RTA = 43.17 ms [11:05:42] RECOVERY - Host db1006 is UP: PING OK - Packet loss = 0%, RTA = 2.10 ms [11:05:42] RECOVERY - Host ms-fe1002 is 
UP: PING OK - Packet loss = 0%, RTA = 2.01 ms [11:05:42] RECOVERY - Host mw1234 is UP: PING OK - Packet loss = 0%, RTA = 1.45 ms [11:05:50] RECOVERY - Host stat1002 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [11:05:55] RECOVERY - Host mw1006 is UP: PING OK - Packet loss = 0%, RTA = 1.84 ms [11:06:03] RECOVERY - Host search1016 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [11:06:03] RECOVERY - Host mw1142 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [11:06:57] search-prefix.svc.eqiad.wmnet paged, feels like a false positive, looking [11:08:20] RECOVERY - Host search-prefix.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 2.89 ms [11:09:28] what the ... [11:10:15] <_joe_> what's search-prefix ffs? [11:10:34] <_joe_> akosiaris: it was 1 month (almost) we didn't have a page [11:10:45] <_joe_> it seems one row was out? [11:10:53] and we got one in the best day possible [11:10:57] <_joe_> yeah [11:11:04] <_joe_> my whole family is mocking me [11:12:25] looks like it got unlucky in the plugin timeout crossfire [11:12:59] <_joe_> Can't initialize ipvs: No space left on device [11:13:04] <_joe_> on lvs1003 [11:13:05] <_joe_> shit [11:13:22] ok, I was about to say it mustn't be a row going down [11:13:24] <_joe_> no sorry, forgot sudo [11:13:32] a ... few [11:13:34] phew [11:13:52] some of these boxes are on row A, some on row B [11:14:07] even row D [11:14:19] <_joe_> so... neon? [11:14:39] I think so [11:15:09] <_joe_> meh [11:16:21] <_joe_> load average: 300.66, 208.42, 195.32 [11:17:32] <_joe_> ok, I'll go. merry christmas to all the opsens in the world [11:18:20] hahah bye _joe_ [11:18:41] <_joe_> (and check_ganglia should be burnt. with fire. like yesterday.) [11:19:06] wanted: human timestamp in icinga.log, I can't do unix timestamp conversion in my head [11:22:11] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [11:30:46] I don't think it was neon [11:32:13] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [11:34:12] paravoid: network? [11:34:37] hrm, dunno [11:38:54] I think it got unlucky going into HARD state by timing out twice in icinga, together with the other flood, possibly because check_ssl and check_ganglia were hammering neon [11:39:23] this might be the case, but that doesn't explain the 503s spike at the same time [11:41:13] indeed, that's a single datapoint 8 minutes later after recovery tho [11:41:38] not 8, more like 5 [11:41:59] right [11:42:00] anyway [11:42:02] whatever :) [11:42:15] it works now [11:43:06] hehe indeed, lunch almost ready here, enjoy! [12:38:37] (03PS1) 10Yuvipanda: beta: Add HHVM queue size monitoring [puppet] - 10https://gerrit.wikimedia.org/r/181787 [12:50:21] hmm, just got missing styles because of 503 [12:50:47] Nikerabbit: how many times? that's a frequent bug [12:50:56] Nemo_bis: just once of course [12:51:17] No visible spikes in https://gdash.wikimedia.org/dashboards/reqerror/ :( [12:58:20] (03PS2) 10Yuvipanda: beta: Add HHVM queue size monitoring [puppet] - 10https://gerrit.wikimedia.org/r/181787 [13:16:52] (03PS1) 10Tim Landscheidt: Fix motd on Trusty instances [puppet] - 10https://gerrit.wikimedia.org/r/181789 [15:03:26] meh, I just wanted to write a bug report about a graphite graph, but then the page must reload and all gdash/graphite images show up as "Bad Gateway: The proxy server received an invalid response from an upstream server."... 
well played [15:03:30] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.009 second response time [15:10:20] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.011 second response time [17:32:04] PROBLEM - HHVM rendering on mw1239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:08] PROBLEM - Apache HTTP on mw1239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:12] PROBLEM - HHVM queue size on mw1239 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [80.0] [17:44:12] (03PS1) 10Nemo bis: Permanently enable unregistered users editing on it.m.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 [17:55:21] (03CR) 10Glaisher: Permanently enable unregistered users editing on it.m.wikipedia.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 (owner: 10Nemo bis) [18:10:25] (03CR) 10Nemo bis: Permanently enable unregistered users editing on it.m.wikipedia.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 (owner: 10Nemo bis) [18:28:54] (03CR) 10Florianschmidtwelzow: [C: 031] "+1 to solve the Task with this patch hopefully before 31. dezember to not run into caching issues (like the time where we activated it: T7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 (owner: 10Nemo bis) [19:09:09] (03CR) 10Ori.livneh: [C: 031] beta: Add HHVM queue size monitoring [puppet] - 10https://gerrit.wikimedia.org/r/181787 (owner: 10Yuvipanda) [19:10:05] (03CR) 10Yuvipanda: [C: 032] beta: Add HHVM queue size monitoring [puppet] - 10https://gerrit.wikimedia.org/r/181787 (owner: 10Yuvipanda) [19:43:07] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:43:07] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:00:14] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:03:30] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [20:03:30] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
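
The "HTTP 5xx req/min" and "HHVM queue size" alerts above are graphite-threshold checks: they pull the last few minutes of a metric and go WARNING or CRITICAL when more than some fraction of the datapoints sit above a threshold, which is what messages like "44.44% of data above the critical threshold [80.0]" are reporting. A rough sketch of that logic, assuming a standard Graphite render API; the URL, metric name, time window and warning/critical fractions are illustrative, not the production settings:

    #!/usr/bin/env python3
    """Sketch of a 'percentage of datapoints above threshold' check."""
    import json
    import urllib.request

    def percent_over(graphite_url, target, threshold, minutes=10):
        url = ("%s/render?target=%s&from=-%dmin&format=json"
               % (graphite_url, target, minutes))
        with urllib.request.urlopen(url, timeout=10) as resp:
            datapoints = json.load(resp)[0]["datapoints"]  # [[value, ts], ...]
        values = [v for v, _ts in datapoints if v is not None]
        if not values:
            return None  # no data at all is its own alert state
        return 100.0 * sum(1 for v in values if v > threshold) / len(values)

    if __name__ == "__main__":
        pct = percent_over("http://graphite.example.org", "reqstats.5xx", 500.0)
        if pct is None:
            print("UNKNOWN: no datapoints returned")
        elif pct >= 20.0:   # example critical fraction
            print("CRITICAL: %.2f%% of data above the critical threshold [500.0]" % pct)
        elif pct >= 1.0:    # example warning fraction
            print("WARNING: %.2f%% of data above the threshold [500.0]" % pct)
        else:
            print("OK: Less than 1.00% above the threshold [500.0]")

Checking a fraction of recent datapoints rather than a single sample is what keeps a lone 503 spike, like the one discussed earlier, from paging anyone.
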
[20:44:48] <_joe_> !log restarting hhvm on mw1239, stuck in HPHP::is_valid_var_name probably after trying to call ini_set [20:44:56] Logged the message, Master [20:46:40] RECOVERY - Apache HTTP on mw1239 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [20:46:41] RECOVERY - HHVM rendering on mw1239 is OK: HTTP OK: HTTP/1.1 200 OK - 66324 bytes in 0.107 second response time [20:57:20] RECOVERY - HHVM queue size on mw1239 is OK: OK: Less than 30.00% above the threshold [10.0] [21:39:40] PROBLEM - Disk space on analytics1014 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 74986 MB (3% inode=99%): /var/lib/hadoop/data/f 78079 MB (4% inode=99%): /var/lib/hadoop/data/h 80222 MB (4% inode=99%): /var/lib/hadoop/data/j 74536 MB (3% inode=99%): /var/lib/hadoop/data/l 80356 MB (4% inode=99%): /var/lib/hadoop/data/b 81786 MB (4% inode=99%): [22:38:47] (03PS1) 10Yuvipanda: shinken: Add ssh checks for all monitored hosts [puppet] - 10https://gerrit.wikimedia.org/r/181807 [22:50:43] hmm [22:50:44] interesting [22:50:53] port 22 is supposedly open by default to anywhere in the network [22:51:05] but of course, everything fails... [22:51:05] hmm [22:52:19] YuviPanda: port 22 just hates you
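
The closing exchange is about the ssh checks added in https://gerrit.wikimedia.org/r/181807: port 22 is expected to be reachable from anywhere inside the labs network, yet the new checks all fail. The check itself is simple, in the spirit of the standard check_ssh plugin: open TCP port 22 and read the SSH banner. A minimal sketch, with an example host name rather than a real instance:

    #!/usr/bin/env python3
    """Open port 22 and read the SSH banner, check_ssh style."""
    import socket
    import sys

    def check_ssh(host, port=22, timeout=10):
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                banner = sock.recv(256).decode("ascii", errors="replace").strip()
        except OSError as exc:
            # A filtered port and a dead sshd both land here, which is what
            # the failing checks in the log would be reporting.
            print("CRITICAL: cannot reach %s:%d (%s)" % (host, port, exc))
            return 2
        if banner.startswith("SSH-"):
            print("OK: %s answered with %r" % (host, banner))
            return 0
        print("WARNING: unexpected banner from %s: %r" % (host, banner))
        return 1

    if __name__ == "__main__":
        sys.exit(check_ssh("deployment-mediawiki03.eqiad.wmflabs"))

If the distinction matters when debugging, a connection timeout here usually points at filtering between the monitoring host and the instance rather than at sshd being down, which fits the "port 22 is supposedly open by default, but everything fails" puzzlement.
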