[00:00:29] (CR) Dzahn: [C: 2] racktables - apache, load mod_headers [puppet] - https://gerrit.wikimedia.org/r/160168 (owner: Dzahn)
[00:05:08] (CR) Dzahn: "fixed after I5a3032eb907a29ce5" [puppet] - https://gerrit.wikimedia.org/r/160164 (owner: Dzahn)
[00:12:35] (PS1) Ori.livneh: labs: update /data/project/apache/common-local -> /srv/mediawiki [puppet] - https://gerrit.wikimedia.org/r/160170
[00:13:00] ^ mutante, it's after five on a friday, but are you up for reviewing a small beta-only change? :)
[00:20:09] mutante is afk currently
[00:20:26] gwicke: ah, thanks
[00:23:57] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Epic puppet fail
[00:25:27] (PS2) Ori.livneh: labs: update /data/project/apache/common-local -> /srv/mediawiki [puppet] - https://gerrit.wikimedia.org/r/160170
[00:25:41] (CR) Ori.livneh: [C: 2 V: 2] "Applied in Labs; did the right thing." [puppet] - https://gerrit.wikimedia.org/r/160170 (owner: Ori.livneh)
[00:42:40] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[01:13:27] (CR) Dzahn: [C: 2] "just an image" [puppet] - https://gerrit.wikimedia.org/r/160157 (owner: Dzahn)
[01:41:12] !log ori Synchronized php-1.24wmf20/extensions/Flow: Update flow for I4da934dfe (duration: 00m 08s)
[01:41:19] Logged the message, Master
[01:45:31] !log ori Synchronized php-1.24wmf20/extensions/Flow: Update flow for I4da934dfe (duration: 00m 06s)
[01:45:38] Logged the message, Master
[01:45:42] !log ori Synchronized php-1.24wmf21/extensions/Flow: Update flow for I4da934dfe (duration: 00m 06s)
[01:45:48] Logged the message, Master
[01:46:05] ebernhardson: did wmf20 twice because i forgot to update the submodule the first time around
[02:05:58] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3875 MB (3% inode=99%):
[02:18:28] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:19:57] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:20:57] PROBLEM - puppet last run on mw1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:22:28] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:31:09] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:32:58] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:34:48] RECOVERY - puppet last run on mw1112 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[02:37:08] RECOVERY - puppet last run on mw1047 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[02:38:27] !log LocalisationUpdate completed (1.24wmf20) at 2014-09-13 02:38:26+00:00
[02:38:33] Logged the message, Master
[02:39:12] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[02:41:00] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[02:49:07] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[02:51:18] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[03:00:58] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:11:40] !log LocalisationUpdate completed (1.24wmf21) at 2014-09-13 03:11:40+00:00
[03:11:47] Logged the message, Master
[03:20:49] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:47:58] I'm seeing random database errors via runJobs.php in exception.log:
[03:48:05] 2014-09-12 20:45:29 mw1053 metawiki: [ef65e9e7] /rpc/RunJobs.php?wiki=metawiki&type=cirrusSearchLinksUpdate&maxtime=30&maxmem=300M Exception from line 1216 of /srv/mediawiki/php-1.24wmf20/includes/db/Database.php: A database error has occurred. Did you forget to run maintenance/update.php after upgrading? See: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script
[03:51:25] !log global rename for Trevor Parscal --> Trevor Parscal (WMF) looks stuck on metawiki and mswiki, in queued state for both but showJobs.php says the jobs are active and claimed
[03:51:31] Logged the message, Master
[04:17:04] interesting
[04:22:04] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Sep 13 04:22:04 UTC 2014 (duration 22m 3s)
[04:22:09] Logged the message, Master
[04:41:49] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[04:42:27] !log global rename for Trevor Parscal (WMF) unstuck itself, yay
[04:42:33] Logged the message, Master
[04:42:45] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:51:08] hah
[04:54:33] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[04:57:03] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Epic puppet fail
[04:59:04] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[05:07:34] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[05:09:35] PROBLEM - puppet last run on cp4019 is CRITICAL: Timeout while attempting connection
[05:09:54] PROBLEM - puppet last run on cp4001 is CRITICAL: Timeout while attempting connection
[05:11:24] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 2 failures
[05:14:27] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:15:33] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[05:21:29] (CR) Legoktm: "Is this actually dependent upon the androidsdk patch?" [puppet] - https://gerrit.wikimedia.org/r/153784 (owner: Yuvipanda)
[05:26:53] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[05:27:43] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[05:27:55] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[05:28:45] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[05:29:43] RECOVERY - puppet last run on lvs4003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:19:44] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:19:44] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:19:45] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[06:23:34] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:26:06] PROBLEM - SSH on pdf3 is CRITICAL: Server answer:
[06:27:05] RECOVERY - SSH on pdf3 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[06:28:34] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:34] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:46] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:46] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Epic puppet fail
[06:29:05] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Epic puppet fail
[06:29:27] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:37] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:29:44] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:44] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:44] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:55] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:30:04] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:04] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:04] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:30:04] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:14] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:14] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:34] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:44] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:45] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:45] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:54] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:55] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:55] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:04] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[06:33:54] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:57] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out.
[06:36:04] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out.
[06:36:14] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[06:36:20] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.137
[06:37:54] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:37:55] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[06:38:54] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:41:44] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:42:05] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[06:42:05] RECOVERY - ElasticSearch health check on logstash1002 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[06:44:57] RECOVERY - Disk space on ms1004 is OK: DISK OK
[06:45:14] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:45:34] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:45:35] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:45:45] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:45:46] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:45:46] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:45:46] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:45:54] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:45:54] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures
[06:46:05] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:46:06] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:46:19] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures
[06:46:21] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:46:24] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:46:37] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:46:37] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:46:44] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:46:56] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:46:56] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:47:05] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:47:44] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:47:58] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:47:59] PROBLEM - Disk space on ms1004 is CRITICAL: DISK CRITICAL - free space: / 2 MB (0% inode=94%): /var/lib/ureadahead/debugfs 2 MB (0% inode=94%):
[06:48:05] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:48:15] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:51:05] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:56:35] PROBLEM - puppet last run on ssl1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:59:05] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out.
[07:00:45] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[07:01:05] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out.
[07:11:45] RECOVERY - ElasticSearch health check on logstash1003 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[07:13:54] RECOVERY - puppet last run on ssl1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[08:18:59] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:33:11] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[09:38:10] PROBLEM - puppet last run on amssq37 is CRITICAL: CRITICAL: Epic puppet fail
[09:56:32] RECOVERY - puppet last run on amssq37 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[10:15:43] PROBLEM - Disk space on elastic1009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20122 MB (3% inode=99%):
[11:44:23] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:57:55] (CR) Calak: [C: 1] Remove 'renameuser' right from bureaucrats on CentralAuth wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/160158 (owner: Legoktm)
[13:06:33] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:21:04] PROBLEM - puppet last run on amssq57 is CRITICAL: CRITICAL: Epic puppet fail
[14:40:35] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[15:22:01] (CR) JanZerebecki: "The nda group is not restricted to people not employed by the WMF. If someone should get access to things like logstash and servermon and " [puppet] - https://gerrit.wikimedia.org/r/159419 (owner: Dzahn)
[16:15:14] (CR) Filippo Giunchedi: "LGTM, minor comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/153783 (owner: Ori.livneh)
[16:16:07] (CR) Filippo Giunchedi: "LGTM, modulo a pending comment in modules/salt/manifests/minion.pp" [puppet] - https://gerrit.wikimedia.org/r/153727 (owner: Ori.livneh)
[17:18:27] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:31:53] (CR) Hoo man: [C: -1] "Had a quick look" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/155753 (owner: 01tonythomas)
[17:35:47] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[17:45:44] (PS41) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753
[17:59:29] (PS42) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753
[18:22:35] MatmaRex: gifs are in scope, therfore no need to nominete new one.
[18:22:59] huh?
[18:39:34] Steinsplitter: ooooh, you referred to the commons deletion request i started. that was a very confusing message to see in this channel.
[18:44:16] yes, you are not in -common ;)
[18:45:47] (PS1) Ori.livneh: mediawiki: add some in-line documentation [puppet] - https://gerrit.wikimedia.org/r/160225
[18:45:55] Steinsplitter: *nominate
[18:51:37] MatmaRex: and pls dont edit in the MW namespace there :) thanks.
[18:52:42] Steinsplitter: are you implying i did something wrong in commons' MediaWiki namespace?
[18:53:22] no. but there was enough drama. so pls don't do it again.
[18:53:27] (why are we discussing this on this channel? please pm me if you want to continue.)
[18:53:52] Steinsplitter: are you implying i caused any of the drama? jesus
[18:54:29] if somebody has problems with my edits, i'd love them to tell me
[19:54:24] (CR) Ori.livneh: [C: 2] mediawiki: add some in-line documentation [puppet] - https://gerrit.wikimedia.org/r/160225 (owner: Ori.livneh)
[20:18:04] (PS1) Ori.livneh: hhvm: add comment about translation cache [puppet] - https://gerrit.wikimedia.org/r/160231
[20:21:42] (PS1) Ori.livneh: misc::maintenance: clean-up [puppet] - https://gerrit.wikimedia.org/r/160232
[20:35:08] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=1537 [critical =500]: recurring_gc_contribs_missed=0: recurring_gc_failures_missed=0: recurring_gc_jobs_required=962: recurring_gc_schedule_sanity=0
[20:39:08] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Epic puppet fail
[20:40:08] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=2197 [critical =500]: recurring_gc_contribs_missed=0: recurring_gc_failures_missed=0: recurring_gc_jobs_required=962: recurring_gc_schedule_sanity=0
[20:45:08] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=506 [critical =500]: recurring_gc_contribs_missed=0: recurring_gc_failures_missed=0: recurring_gc_jobs_required=962: recurring_gc_schedule_sanity=0
[20:50:08] RECOVERY - check_fundraising_jobs on db1025 is OK: OK missing_thank_yous=0: recurring_gc_contribs_missed=0: recurring_gc_failures_missed=0: recurring_gc_jobs_required=962: recurring_gc_schedule_sanity=0
[20:53:39] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[20:58:18] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[21:07:07] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[22:57:23] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:00:08] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[23:03:22] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:04:22] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[23:07:36] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:08:23] RECOVERY - DPKG on fenari is OK: All packages OK
[23:20:42] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:22:24] PROBLEM - puppet last run on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:23:34] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:29:22] RECOVERY - puppet last run on fenari is OK: OK: Puppet is currently enabled, last run 4250 seconds ago with 0 failures
[23:33:44] PROBLEM - check configured eth on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:33:44] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
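The recurring "HTTP 5xx req/min on tungsten" alerts in this log report what fraction of recent Graphite datapoints sit above a fixed error rate: CRITICAL fires when more than about 1% of points exceed 500 errors/min, and the recovery text quotes the 250/min threshold. The sketch below, in Python, is only an illustration of that percentage-over-threshold logic; the thresholds and message wording are taken from the alert text above, while the function name, defaults, and structure are assumptions, not the actual Graphite check plugin.

    # Illustrative percentage-over-threshold check, inferred from the alert text above.
    # Not the real check plugin; names and defaults here are hypothetical.

    def evaluate_5xx(datapoints, warn=250.0, crit=500.0, percent=1.0):
        """Return (status, message) for a series of per-minute 5xx request rates."""
        valid = [v for v in datapoints if v is not None]  # Graphite may return nulls
        if not valid:
            return "UNKNOWN", "no valid datapoints"
        over_crit = 100.0 * sum(v > crit for v in valid) / len(valid)
        over_warn = 100.0 * sum(v > warn for v in valid) / len(valid)
        if over_crit > percent:
            return "CRITICAL", "%.2f%% of data above the critical threshold [%.1f]" % (over_crit, crit)
        if over_warn > percent:
            return "WARNING", "%.2f%% of data above the warning threshold [%.1f]" % (over_warn, warn)
        return "OK", "Less than %.2f%% above the threshold [%.1f]" % (percent, warn)

    # 1 of 15 datapoints above 500/min -> 6.67%, matching the CRITICAL messages in this log.
    print(evaluate_5xx([120, 130, 90, 80, 640, 110, 95, 100, 105, 99, 130, 140, 101, 97, 102]))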
[23:34:32] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[23:35:03] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Epic puppet fail
[23:36:55] RECOVERY - DPKG on fenari is OK: All packages OK
[23:36:55] RECOVERY - check configured eth on fenari is OK: NRPE: Unable to read output
[23:37:53] PROBLEM - SSH on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:38:12] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:38:52] RECOVERY - SSH on fenari is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[23:39:32] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4775 bytes in 5.523 second response time
[23:41:33] PROBLEM - puppet last run on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:42:33] RECOVERY - puppet last run on fenari is OK: OK: Puppet is currently enabled, last run 5040 seconds ago with 0 failures
[23:42:33] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:44:03] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:45:42] PROBLEM - puppet last run on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:46:02] PROBLEM - Disk space on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:46:02] PROBLEM - check configured eth on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:47:03] PROBLEM - check if dhclient is running on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:48:02] PROBLEM - nutcracker process on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:48:02] PROBLEM - nutcracker port on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:48:55] RECOVERY - nutcracker port on fenari is OK: TCP OK - 0.000 second response time on port 11212
[23:49:54] RECOVERY - nutcracker process on fenari is OK: PROCS OK: 1 process with UID = 116 (nutcracker), command name nutcracker
[23:49:54] RECOVERY - Disk space on fenari is OK: DISK OK
[23:49:54] RECOVERY - check if dhclient is running on fenari is OK: PROCS OK: 0 processes with command name dhclient
[23:49:54] RECOVERY - check configured eth on fenari is OK: NRPE: Unable to read output
[23:50:43] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[23:51:07] RECOVERY - DPKG on fenari is OK: All packages OK
[23:51:53] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[23:53:43] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:54:23] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[23:55:12] PROBLEM - check configured eth on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:55:12] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:55:12] PROBLEM - nutcracker process on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:55:22] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[23:58:04] RECOVERY - nutcracker process on fenari is OK: PROCS OK: 1 process with UID = 116 (nutcracker), command name nutcracker
[23:58:04] RECOVERY - check configured eth on fenari is OK: NRPE: Unable to read output
[23:59:47] is fenari having issues?
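Most of the fenari alerts at the end of this log are not individual services failing; they are the NRPE agent on the host not answering within its 10-second limit, which is why otherwise unrelated checks (RAID, DPKG, disk space, dhclient, nutcracker) flap together and why the closing question is about the host rather than any one service. The sketch below is a hypothetical Python illustration of that timeout pattern only: the real check_nrpe plugin speaks the NRPE protocol on TCP port 5666, and the default hostname here is just a placeholder.

    # Hypothetical illustration of why "CHECK_NRPE: Socket timeout after 10 seconds."
    # shows up for many different checks at once: every check on the host goes through
    # the same NRPE agent, so an unresponsive host times out all of them together.
    # This is NOT the real check_nrpe plugin; it only shows the connect-with-timeout pattern.
    import socket
    import sys

    NRPE_PORT = 5666        # default NRPE port
    TIMEOUT_SECONDS = 10.0  # matches the "after 10 seconds" in the alerts above

    def probe_nrpe(host):
        """Try to reach the NRPE agent; map failures to Nagios-style exit states."""
        try:
            with socket.create_connection((host, NRPE_PORT), timeout=TIMEOUT_SECONDS):
                return 0, "OK: NRPE agent on %s reachable" % host
        except socket.timeout:
            return 2, "CHECK_NRPE: Socket timeout after %d seconds." % TIMEOUT_SECONDS
        except OSError as exc:
            return 2, "CHECK_NRPE: connection to %s failed: %s" % (host, exc)

    if __name__ == "__main__":
        status, message = probe_nrpe(sys.argv[1] if len(sys.argv) > 1 else "localhost")
        print(message)
        sys.exit(status)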