[00:25:05] PROBLEM - check_mysql on db1008 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [00:25:06] PROBLEM - check_mysql on db78 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [00:30:05] PROBLEM - check_mysql on db1008 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [00:30:06] PROBLEM - check_mysql on db78 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [00:35:05] PROBLEM - check_mysql on db1008 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [00:35:06] RECOVERY - check_mysql on db78 is OK: Uptime: 4789387 Threads: 2 Questions: 71676574 Slow queries: 63783 Opens: 90753 Flush tables: 2 Open tables: 64 Queries per second avg: 14.965 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [00:40:02] i'm going to deploy a typo fix to prod to unbreak the IRC log stream: https://gerrit.wikimedia.org/r/#/c/82558/ [00:40:05] RECOVERY - check_mysql on db1008 is OK: Uptime: 3471671 Threads: 1 Questions: 66194960 Slow queries: 53745 Opens: 61753 Flush tables: 2 Open tables: 64 Queries per second avg: 19.067 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [00:41:45] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [00:41:45] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [00:41:45] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [00:47:31] ori-l: that seems like a poster child for static analysis? [00:49:42] !log olivneh synchronized php-1.22wmf15/includes/RecentChange.php 'Fix for bug 53720' [00:49:48] Logged the message, Master [00:49:55] ^ Krenair [00:50:13] ty ori-l [00:50:17] can you confirm the fix? [00:51:24] ori-l, looks fixed to me. 
[00:51:30] cool, thanks [01:14:15] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [01:16:55] (PS1) Ori.livneh: statsd module: provision Ganglia backend support [operations/puppet] - https://gerrit.wikimedia.org/r/82563 [01:17:25] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:29:43] (PS1) Dzahn: fixes for wikitravel links and updates. add a trim() when unserializing API data to fix parsing for a lot of wikis sending whitespace [operations/debs/wikistats] - https://gerrit.wikimedia.org/r/82564 [01:32:06] (PS2) Dzahn: fixes for wikitravel links and updates. add a trim() when unserializing API data to fix parsing for a lot of wikis sending whitespace [operations/debs/wikistats] - https://gerrit.wikimedia.org/r/82564 [01:32:45] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [01:51:45] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [02:06:47] !log LocalisationUpdate completed (1.22wmf15) at Wed Sep 4 02:06:47 UTC 2013 [02:06:55] Logged the message, Master [02:11:48] !log LocalisationUpdate completed (1.22wmf14) at Wed Sep 4 02:11:48 UTC 2013 [02:11:54] Logged the message, Master [02:21:58] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Sep 4 02:21:58 UTC 2013 [02:22:04] Logged the message, Master [02:26:45] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [02:32:45] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [02:40:45] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [02:43:45] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [02:52:45] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [02:53:45] PROBLEM - 
Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [02:56:45] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [02:56:45] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [02:56:45] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [02:59:45] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [02:59:46] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [03:02:45] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [03:02:45] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [03:05:45] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [03:08:45] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [03:08:45] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [03:09:45] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [03:09:45] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [03:13:45] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [04:13:45] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [04:38:52] (PS1) Ori.livneh: Delete old 'sysctlfile' module & related detritus [operations/puppet] - https://gerrit.wikimedia.org/r/82571 [05:50:45] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [05:50:45] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [05:52:24] Jamesofur: have you reported a bug? 
[05:52:53] not yet, but I think Tilman was going to then I'd add on :) [05:52:59] he was very nice about offering :D [06:26:21] !log truncated fact_values puppet table and reset auto increment to start at 1, puppet was broken on all hosts, see http://projects.puppetlabs.com/issues/9225 [06:26:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:26:27] Logged the message, Master [06:27:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [06:52:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:53:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time [07:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [07:31:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:33:15] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: No successful Puppet run in the last 10 hours [07:33:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [07:34:15] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: No successful Puppet run in the last 10 hours [07:38:45] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: No successful Puppet run in the last 10 hours [07:38:45] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: No successful Puppet run in the last 10 hours [07:39:45] PROBLEM - Puppet freshness on ms-be4 is CRITICAL: No successful Puppet run in the last 10 hours [07:40:45] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: No successful Puppet run in the last 10 hours [07:40:45] PROBLEM - Puppet 
freshness on ms-fe3 is CRITICAL: No successful Puppet run in the last 10 hours [07:41:45] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: No successful Puppet run in the last 10 hours [07:41:45] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: No successful Puppet run in the last 10 hours [07:44:45] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: No successful Puppet run in the last 10 hours [07:46:45] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: No successful Puppet run in the last 10 hours [07:47:45] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: No successful Puppet run in the last 10 hours [07:49:45] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: No successful Puppet run in the last 10 hours [07:52:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:52:45] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: No successful Puppet run in the last 10 hours [07:53:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.145 second response time [07:57:45] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: No successful Puppet run in the last 10 hours [07:58:45] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: No successful Puppet run in the last 10 hours [08:01:31] (CR) Faidon Liambotis: [C: 2] Delete old 'sysctlfile' module & related detritus [operations/puppet] - https://gerrit.wikimedia.org/r/82571 (owner: Ori.livneh) [08:08:32] (CR) Faidon Liambotis: [C: -1] "LGTM but I won't pretend I've reviewed the Javascript code, nor that I intend to :) If you want a code review for that, maybe we should so" [operations/puppet] - https://gerrit.wikimedia.org/r/82563 (owner: Ori.livneh) [08:10:41] RECOVERY - Puppet freshness on ms-be9 is OK: puppet ran at Wed Sep 4 08:10:38 UTC 2013 [08:10:51] RECOVERY - Puppet freshness on ms-be3 is OK: puppet ran at Wed Sep 4 08:10:43 UTC 2013 [08:12:01] PROBLEM - Puppet freshness on sq36 is 
CRITICAL: No successful Puppet run in the last 10 hours [08:13:01] RECOVERY - Puppet freshness on ms-be11 is OK: puppet ran at Wed Sep 4 08:12:56 UTC 2013 [08:14:51] RECOVERY - Puppet freshness on ms-be12 is OK: puppet ran at Wed Sep 4 08:14:42 UTC 2013 [08:15:51] RECOVERY - Puppet freshness on ms-be5 is OK: puppet ran at Wed Sep 4 08:15:49 UTC 2013 [08:17:51] RECOVERY - Puppet freshness on ms-fe2 is OK: puppet ran at Wed Sep 4 08:17:41 UTC 2013 [08:20:56] (PS1) ArielGlenn: one more protoproxy -> nginx change [operations/puppet] - https://gerrit.wikimedia.org/r/82590 [08:22:01] RECOVERY - Puppet freshness on ms-be10 is OK: puppet ran at Wed Sep 4 08:21:58 UTC 2013 [08:22:16] (CR) ArielGlenn: [C: 2] one more protoproxy -> nginx change [operations/puppet] - https://gerrit.wikimedia.org/r/82590 (owner: ArielGlenn) [08:26:21] (CR) Faidon Liambotis: "(2 comments)" [operations/debs/stud] (wikimedia) - https://gerrit.wikimedia.org/r/82428 (owner: Mark Bergsma) [08:26:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:26:51] RECOVERY - Puppet freshness on ms-be7 is OK: puppet ran at Wed Sep 4 08:26:45 UTC 2013 [08:27:01] RECOVERY - Puppet freshness on ms-fe4 is OK: puppet ran at Wed Sep 4 08:26:55 UTC 2013 [08:28:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [08:32:11] RECOVERY - Puppet freshness on ms-fe1 is OK: puppet ran at Wed Sep 4 08:32:02 UTC 2013 [08:32:11] RECOVERY - Puppet freshness on ms-be6 is OK: puppet ran at Wed Sep 4 08:32:02 UTC 2013 [08:35:11] RECOVERY - Puppet freshness on ms-be8 is OK: puppet ran at Wed Sep 4 08:35:05 UTC 2013 [08:35:51] RECOVERY - Puppet freshness on ms-be1 is OK: puppet ran at Wed Sep 4 08:35:50 UTC 2013 [08:37:45] (CR) Faidon Liambotis: "(1 comment)" [operations/debs/stud] (wikimedia) - https://gerrit.wikimedia.org/r/82427 (owner: Mark Bergsma) [08:37:51] 
RECOVERY - Puppet freshness on ms-be4 is OK: puppet ran at Wed Sep 4 08:37:41 UTC 2013 [08:38:54] RECOVERY - Puppet freshness on ms-fe3 is OK: puppet ran at Wed Sep 4 08:38:46 UTC 2013 [08:38:54] RECOVERY - Puppet freshness on ms-be2 is OK: puppet ran at Wed Sep 4 08:38:51 UTC 2013 [08:49:18] morning (still) [08:51:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:53:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [08:57:45] (PS1) ArielGlenn: nginx sites with donotify need to use nginx module, not generic defn [operations/puppet] - https://gerrit.wikimedia.org/r/82592 [08:57:54] morning. [08:58:18] hashar, you want to look at that change ^^ as it affects localssl? [08:59:22] apergos: hey [09:00:02] no clue =) [09:00:25] don't we use localssl in labs? [09:00:28] IIRC beta generates several nginx sites [09:01:38] apergos: the varnish caches have role::protoproxy::ssl::beta [09:02:09] ok well let's keep an eye on those, though I think it's going to be fine [09:02:15] which create a resource 'protoproxy' for each of the possible domains [09:02:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:02:29] the proxy backend being set to 127.0.0.1 [09:02:32] yup [09:02:33] so yeah slightly different I guess [09:02:58] all right I'm going to get this merged [09:03:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [09:03:47] (CR) ArielGlenn: [C: 2] nginx sites with donotify need to use nginx module, not generic defn [operations/puppet] - https://gerrit.wikimedia.org/r/82592 (owner: ArielGlenn) [09:04:08] seems localssl is another way to do what is in role::protoproxy::ssl::beta hehe [09:06:54] RECOVERY - Puppet freshness on ssl1005 is OK: puppet ran at Wed Sep 4 09:06:51 UTC 2013 
[09:09:54] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Wed Sep 4 09:09:53 UTC 2013 [09:12:52] RECOVERY - Puppet freshness on ssl1001 is OK: puppet ran at Wed Sep 4 09:12:44 UTC 2013 [09:13:22] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:14:02] RECOVERY - Puppet freshness on amssq47 is OK: puppet ran at Wed Sep 4 09:14:01 UTC 2013 [09:14:52] RECOVERY - Puppet freshness on ssl1 is OK: puppet ran at Wed Sep 4 09:14:47 UTC 2013 [09:15:52] RECOVERY - Puppet freshness on ssl1003 is OK: puppet ran at Wed Sep 4 09:15:48 UTC 2013 [09:16:32] RECOVERY - Puppet freshness on ssl4 is OK: puppet ran at Wed Sep 4 09:16:24 UTC 2013 [09:18:22] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:19:52] RECOVERY - Puppet freshness on ssl1007 is OK: puppet ran at Wed Sep 4 09:19:48 UTC 2013 [09:21:02] RECOVERY - Puppet freshness on ssl1006 is OK: puppet ran at Wed Sep 4 09:20:54 UTC 2013 [09:22:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:02] RECOVERY - Puppet freshness on ssl3001 is OK: puppet ran at Wed Sep 4 09:22:59 UTC 2013 [09:23:12] RECOVERY - Puppet freshness on ssl1002 is OK: puppet ran at Wed Sep 4 09:23:04 UTC 2013 [09:23:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [09:26:12] RECOVERY - Puppet freshness on ssl1004 is OK: puppet ran at Wed Sep 4 09:26:06 UTC 2013 [09:26:22] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:27:12] PROBLEM - Puppet freshness on stafford is CRITICAL: No successful Puppet run in the last 10 hours [09:27:12] hashar: Do you know if I made my request with the needed information, or does stuff have to be added in https://rt.wikimedia.org/Ticket/Display.html?id=5710 ? 
I could not find an "access request procedure" [09:27:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:28:12] RECOVERY - Puppet freshness on ssl3003 is OK: puppet ran at Wed Sep 4 09:28:06 UTC 2013 [09:28:52] RECOVERY - Puppet freshness on ssl1008 is OK: puppet ran at Wed Sep 4 09:28:46 UTC 2013 [09:29:12] RECOVERY - Puppet freshness on ssl1009 is OK: puppet ran at Wed Sep 4 09:29:07 UTC 2013 [09:29:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [09:30:12] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [09:30:22] RECOVERY - Puppet freshness on ssl3002 is OK: puppet ran at Wed Sep 4 09:30:12 UTC 2013 [09:30:22] RECOVERY - Puppet freshness on ssl3 is OK: puppet ran at Wed Sep 4 09:30:12 UTC 2013 [09:30:25] siebrand: do you happen to know which db ori-l sends event logging events to? [09:31:35] eventlogging::service::consumer { 'mysql-db1047': [09:31:36] ah [09:32:12] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Wed Sep 4 09:32:03 UTC 2013 [09:33:22] RECOVERY - Puppet freshness on ssl2 is OK: puppet ran at Wed Sep 4 09:33:13 UTC 2013 [09:38:50] siebrand: sorry no clue, if ops-requests is not the proper queue, it will get redirected. [09:39:10] hashar: k. 
tx [09:39:38] mysql:wikiadmin@db1047 [(none)]> use log [09:39:38] ERROR 1044 (42000): Access denied for user 'wikiadmin'@'10.64.%' to database 'log' [09:39:39] (PS1) ArielGlenn: nginx module should expect enable true, not 'true' [operations/puppet] - https://gerrit.wikimedia.org/r/82593 [09:39:39] =( [09:40:50] (CR) ArielGlenn: [C: 2] nginx module should expect enable true, not 'true' [operations/puppet] - https://gerrit.wikimedia.org/r/82593 (owner: ArielGlenn) [09:45:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:46:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [09:51:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:53:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [09:59:33] PROBLEM - RAID on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:00:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:01:23] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[10:09:01] (CR) Mark Bergsma: "(1 comment)" [operations/debs/stud] (wikimedia) - https://gerrit.wikimedia.org/r/82428 (owner: Mark Bergsma) [10:10:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [10:25:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:27:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.194 second response time [10:29:43] (PS1) Akosiaris: Only backup /var/lib/mailman if defined [operations/puppet] - https://gerrit.wikimedia.org/r/82595 [10:30:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:31:22] (CR) Akosiaris: [C: 2] Only backup /var/lib/mailman if defined [operations/puppet] - https://gerrit.wikimedia.org/r/82595 (owner: Akosiaris) [10:33:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.829 second response time [10:36:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:36:31] RECOVERY - Puppet freshness on sodium is OK: puppet ran at Wed Sep 4 10:36:20 UTC 2013 [10:37:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.357 second response time [10:42:31] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [10:42:31] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [10:42:31] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [10:49:41] (PS1) ArielGlenn: dns account not needed on fenari any more [operations/puppet] - https://gerrit.wikimedia.org/r/82598 [10:50:39] (CR) ArielGlenn: [C: 2] dns account not needed on fenari any more 
[operations/puppet] - https://gerrit.wikimedia.org/r/82598 (owner: ArielGlenn) [10:50:52] are you also cleaning up the account manually? [10:51:30] I planned to, yes [10:51:41] just now I am running puppet to see if there is anything else wrong over there [10:51:52] cool [10:51:55] thanks for cleaning up my mess :) [10:52:01] no worries [10:52:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:53:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.145 second response time [10:57:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:58:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.420 second response time [11:02:21] RECOVERY - Puppet freshness on fenari is OK: puppet ran at Wed Sep 4 11:02:15 UTC 2013 [11:09:18] RECOVERY - Puppet freshness on pdf1 is OK: puppet ran at Wed Sep 4 11:09:10 UTC 2013 [11:25:38] RECOVERY - Puppet freshness on mw1126 is OK: puppet ran at Wed Sep 4 11:25:34 UTC 2013 [11:32:58] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [11:54:00] re [11:54:46] (PS1) Akosiaris: Whitespace cleanup (mostly) [operations/puppet] - https://gerrit.wikimedia.org/r/82601 [12:01:02] (CR) Akosiaris: [C: 2] Whitespace cleanup (mostly) [operations/puppet] - https://gerrit.wikimedia.org/r/82601 (owner: Akosiaris) [12:05:37] RECOVERY - Disk space on ms-be1 is OK: DISK OK [12:06:27] RECOVERY - search indices - check lucene status page on search1022 is OK: HTTP OK: HTTP/1.1 200 OK - 56465 bytes in 0.009 second response time [12:06:37] RECOVERY - search indices - check lucene status page on search1016 is OK: HTTP OK: HTTP/1.1 200 OK - 53551 bytes in 0.010 second response time [12:08:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[12:12:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [12:12:56] PROBLEM - DPKG on search19 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:21:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [12:38:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:39:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.582 second response time [12:45:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:32] iii am doomed [12:45:33] :( [12:46:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.804 second response time [12:53:22] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:57:22] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:12:39] E: Failed to fetch http://ubuntu.wikimedia.org/ubuntu/pool/main/libg/libgcrypt11/libgcrypt11_1.5.0-3ubuntu0.1_amd64.deb: 404 Not Found [13:12:40] uhhu [13:12:54] apt-get update? [13:13:34] looks like cow builder does not update :/ [13:14:06] --update [13:15:17] oh why can't I reproduce issues [13:21:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [13:30:16] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[13:32:16] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:35:45] (PS3) Ottomata: Turn on automatic pulling for geowiki repository [operations/puppet] - https://gerrit.wikimedia.org/r/82409 (owner: QChris) [13:35:55] (CR) Ottomata: [C: 2 V: 2] Turn on automatic pulling for geowiki repository [operations/puppet] - https://gerrit.wikimedia.org/r/82409 (owner: QChris) [13:38:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:35] (CR) Ottomata: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/82410 (owner: QChris) [13:42:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.340 second response time [13:46:16] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:56:34] I sent a lengthy email to the ops list regarding a crazy issue I am facing with git :] [13:56:54] and no clue how to debug it properly :( [14:06:50] (PS1) Ottomata: + more comment doc in analytics role classes [operations/puppet] - https://gerrit.wikimedia.org/r/82607 [14:07:00] (CR) Ottomata: [C: 2 V: 2] + more comment doc in analytics role classes [operations/puppet] - https://gerrit.wikimedia.org/r/82607 (owner: Ottomata) [14:10:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:11:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [14:13:19] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:19:03] (PS1) Ottomata: Installing Hive, Oozie and Hue servers on analytics1027 [operations/puppet] - https://gerrit.wikimedia.org/r/82611 [14:21:16] paravoid, can you help me out with a pinning problem? 
I have a local repository and I want to mark it as higher priority than upstream repos… but can't figure out how to refer to the local repo in preferences.d [14:29:48] (PS1) Hashar: contint: move iptables under module [operations/puppet] - https://gerrit.wikimedia.org/r/82613 [14:33:19] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:33:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:36] (PS1) Hashar: contint: prevents access to Zuul and git daemon [operations/puppet] - https://gerrit.wikimedia.org/r/82614 [14:40:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.511 second response time [14:43:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time [14:50:43] (CR) QChris: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/82410 (owner: QChris) [14:51:29] PROBLEM - RAID on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:53:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [14:55:26] (CR) Ottomata: [C: 2 V: 2] Installing Hive, Oozie and Hue servers on analytics1027 [operations/puppet] - https://gerrit.wikimedia.org/r/82611 (owner: Ottomata) [14:56:45] k [14:57:49] .wmnet ? [14:58:06] haha, so used to typing pmtpa.wmflabs [14:58:09] :) [14:58:14] i have shortcuts for wmnet [14:58:49] so, basically….puppet should do everything? :) [14:58:58] here we go?! 
[14:59:15] so, just an overview while that runs [14:59:19] go puppet go (plus that's another host off the puppet not running list) [14:59:28] hehe [14:59:36] hive is a sql engine built on mapreduce [14:59:50] the cool bit, is it lets you define tables based on any filetype loaded into hdfs [14:59:54] and then query them with sql [14:59:59] it consists of: [15:00:05] hive-server2 [15:00:05] hive-metastore [15:00:05] mysql db [15:00:13] hive clients interact with hive-server2 [15:00:27] hive-server2 talks to hive-metastore which has a configurable db backend [15:00:41] puppet should install mysql and set up the dbs and then do metastore and then do hive server2 [15:00:46] cool [15:01:21] oh i need to put the whole sockpuppet ca thing [15:01:22] oops [15:01:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:02:49] there we go [15:02:55] ok [15:02:57] more stuff [15:03:05] oozie is a fancy job scheduler for hadoop [15:03:24] it can launch predefined jobs triggered by when data is available in hdfs [15:03:37] it has oozie-server and also a mysql db [15:03:40] puppet should set all that up too [15:03:59] aaand, hue is a nice little web GUI to all of these hadoop services [15:04:06] hdfs, oozie, hive, pig, + more [15:04:17] it has a hue server, and a db backend [15:04:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [15:04:24] right now puppet will leave the default backend in place, sqlite [15:04:32] but it is possible to make hue use mysql [15:04:35] we can do that later if we need to [15:04:35] cool [15:04:39] :) [15:05:12] yay, good so far… :) [15:07:19] RECOVERY - Puppet freshness on analytics1027 is OK: puppet ran at Wed Sep 4 15:07:18 UTC 2013 [15:07:29] PROBLEM - RAID on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
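The install order described above (mysql and its dbs first, then the metastore, then hive-server2) can be sketched in Puppet with explicit `require` relationships. This is only a minimal illustration, not the actual cdh4 module: the resource layout is hypothetical, and the package and service names (`mysql-server`, `mysql`, `hive-metastore`, `hive-server2`) are assumptions about what the hosts use.

```puppet
# Hypothetical sketch of the ordering discussed in the log; the real
# cdh4 module may structure these resources differently.

# Database backend for the metastore (assumed package/service names).
package { 'mysql-server':
  ensure => installed,
}

service { 'mysql':
  ensure  => running,
  require => Package['mysql-server'],
}

# The metastore needs its database to be up before it starts.
package { 'hive-metastore':
  ensure => installed,
}

service { 'hive-metastore':
  ensure  => running,
  require => [Package['hive-metastore'], Service['mysql']],
}

# hive-server2 talks to the metastore, so it comes last.
package { 'hive-server2':
  ensure => installed,
}

service { 'hive-server2':
  ensure  => running,
  require => [Package['hive-server2'], Service['hive-metastore']],
}
```

Without relationships like these, Puppet may apply the resources in an order that fails on the first run and only succeeds on a later one.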
[15:10:16] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:10:31] paravoid: gmetric.js didn't have a license when i created that patch :P https://github.com/jbuchbinder/node-gmetric/issues/12 [15:14:05] ottomata: see that ? [15:16:16] RECOVERY - Disk space on analytics1027 is OK: DISK OK [15:16:16] (PS2) Andrew Bogott: Labsdebrepo fixes: [operations/puppet] - https://gerrit.wikimedia.org/r/82532 [15:16:39] oooo [15:16:44] looks good? or did I miss something? [15:16:46] RECOVERY - RAID on analytics1027 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:17:11] gonna run puppet again LeslieCarr [15:17:12] just to see [15:17:23] missed hive failing to install [15:17:27] ah ok [15:17:47] (CR) Andrew Bogott: [C: 2] Labsdebrepo fixes: [operations/puppet] - https://gerrit.wikimedia.org/r/82532 (owner: Andrew Bogott) [15:17:48] probably will fix itself second time and is an ordering issue [15:18:56] hmm that def there is [15:19:00] err: /Stage[main]/Cdh4::Hue/User[hue]/groups: change from to hive,ssl-cert failed: Could not set groups on user[hue]: Execution of '/usr/sbin/usermod -G hive,ssl-cert hue' returned 6: usermod: group 'ssl-cert' does not exist [15:19:02] i should make hue require hive/oozie first [15:19:59] hmm ssl-cert? [15:20:05] hm [15:21:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.092 second response time [15:22:19] hm. [15:22:29] gotta figure out what package ssl-cert group is created by [15:22:34] thought it would have been openssl [15:22:35] (PS2) Ori.livneh: statsd module: provision Ganglia backend support [operations/puppet] - https://gerrit.wikimedia.org/r/82563 [15:23:21] (CR) Ori.livneh: "PS2 adds license info to the header of gmetric.js; I'm cool w/pushing it too." 
[operations/puppet] - https://gerrit.wikimedia.org/r/82563 (owner: Ori.livneh) [15:24:25] could be the ssl-cert package [15:24:41] http://ubuntuforums.org/showthread.php?t=1175286 [15:25:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:37] (PS1) Ottomata: Installing ssl-cert package to make sure ssl-cert group is created for Hue SSL. [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/82620 [15:25:38] yeah [15:25:46] LeslieCarr: ^ [15:26:03] (CR) Ottomata: [C: 2 V: 2] Installing ssl-cert package to make sure ssl-cert group is created for Hue SSL. [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/82620 (owner: Ottomata) [15:26:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.188 second response time [15:26:42] cool [15:27:16] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:27:24] (PS1) Ottomata: Updating modules/cdh4 with ssl-cert package install change for Hue [operations/puppet] - https://gerrit.wikimedia.org/r/82621 [15:27:34] something is totally weird with analytics1014, will check that out after this [15:27:44] (CR) Ottomata: [C: 2 V: 2] Updating modules/cdh4 with ssl-cert package install change for Hue [operations/puppet] - https://gerrit.wikimedia.org/r/82621 (owner: Ottomata) [15:28:08] ok, let's try that now [15:29:26] RECOVERY - NTP on analytics1027 is OK: NTP OK: Offset -0.01433777809 secs [15:29:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.706 second response time [15:34:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:34:52] akosiaris, I'm inheriting from a class that's in the same module /and/ the same
file, yet lint still says 'class inherits across namespaces.' What am I missing? [15:34:55] yay [15:35:45] looks happy [15:36:34] andrewbogott: well for starters inheritance is considered harmful. You sure it is not a typo ? [15:36:43] could i have a look ? [15:37:11] akosiaris, I didn't write the original code so I'm reluctant to rebuild it from the ground up… there's already dangerously large amounts of refactor in this one patch :) [15:37:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.574 second response time [15:37:20] Latest version is https://gerrit.wikimedia.org/r/#/c/77332/ [15:37:35] in platform.pp [15:40:02] yeah, LeslieCarr that looks way better [15:40:21] ok, Leslie, help me test this [15:40:22] hue [15:40:23] um [15:40:33] run this locally: [15:40:34] ssh -N analytics1001.wikimedia.org -L 8888:analytics1027.eqiad.wmnet:8888 [15:40:39] then http://localhost:8888 [15:40:57] class base::platform::dell-c2100 inherits base::platform::generic::dell [15:41:01] ohhh wait it is ssl now hm [15:41:16] it is unhappy because it is inheriting from a class one level down [15:41:37] (PS1) Manybubbles: Setup logrotation for elasticsearch. [operations/puppet] - https://gerrit.wikimedia.org/r/82623 [15:41:37] i 'd say don't touch for now [15:41:39] ah yeah [15:41:40] run that [15:41:40] and [15:41:44] https://localhost:8888/accounts/login/?next=/ [15:41:44] ottomata: got a login prompt [15:41:48] ah great! [15:41:55] use your shell username and ldap pw [15:42:11] akosiaris, I've tried rearranging it so that the inherited class is in a higher scope, lint still complains. [15:42:59] e.g.
renaming base::platform::generic::dell to base::platform::generic-dell or even base::platform-generic-dell [15:43:36] class base::platform::dell::c2100 inherits base::platform::dell { [15:43:40] this makes it happy [15:43:59] but the puppet-lint parser is known to have problems [15:44:12] so i really suggest avoiding it for now [15:44:23] ok, happy to :) [15:44:40] LeslieCarr: can you log in? [15:44:42] I stripped out tabs and fixed a bunch of other formatting issues but… got tired of aligning things [15:45:14] (CR) MZMcBride: "What's needed to get this merged and deployed?" [operations/puppet] - https://gerrit.wikimedia.org/r/78944 (owner: QChris) [15:45:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:35] Well it is a step forward. [15:46:12] (PS1) Hashar: contint: publish Zuul git over git protocol [operations/puppet] - https://gerrit.wikimedia.org/r/82625 [15:46:29] ottomata: gerrit pw [15:46:30] ? [15:47:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.546 second response time [15:47:51] ja [15:47:53] that should work [15:48:04] LeslieCarr [15:48:17] woot [15:48:27] great! [15:48:46]