[00:25:05] PROBLEM - check_mysql on db1008 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [00:25:06] PROBLEM - check_mysql on db78 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [00:30:05] PROBLEM - check_mysql on db1008 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [00:30:06] PROBLEM - check_mysql on db78 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [00:35:05] PROBLEM - check_mysql on db1008 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [00:35:06] RECOVERY - check_mysql on db78 is OK: Uptime: 4789387 Threads: 2 Questions: 71676574 Slow queries: 63783 Opens: 90753 Flush tables: 2 Open tables: 64 Queries per second avg: 14.965 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [00:40:02] i'm going to deploy a typo fix to prod to unbreak the IRC log stream: https://gerrit.wikimedia.org/r/#/c/82558/ [00:40:05] RECOVERY - check_mysql on db1008 is OK: Uptime: 3471671 Threads: 1 Questions: 66194960 Slow queries: 53745 Opens: 61753 Flush tables: 2 Open tables: 64 Queries per second avg: 19.067 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [00:41:45] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [00:41:45] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [00:41:45] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [00:47:31] ori-l: that seems like a poster child for static analysis? [00:49:42] !log olivneh synchronized php-1.22wmf15/includes/RecentChange.php 'Fix for bug 53720' [00:49:48] Logged the message, Master [00:49:55] ^ Krenair [00:50:13] ty ori-l [00:50:17] can you confirm the fix? [00:51:24] ori-l, looks fixed to me. 
[00:51:30] cool, thanks [01:14:15] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [01:16:55] (03PS1) 10Ori.livneh: statsd module: provision Ganglia backend support [operations/puppet] - 10https://gerrit.wikimedia.org/r/82563 [01:17:25] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:29:43] (03PS1) 10Dzahn: fixes for wikitravel links and updates. add a trim() when unserializing API data to fix parsing for a lot of wikis sending whitespace [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/82564 [01:32:06] (03PS2) 10Dzahn: fixes for wikitravel links and updates. add a trim() when unserializing API data to fix parsing for a lot of wikis sending whitespace [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/82564 [01:32:45] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [01:51:45] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [02:06:47] !log LocalisationUpdate completed (1.22wmf15) at Wed Sep 4 02:06:47 UTC 2013 [02:06:55] Logged the message, Master [02:11:48] !log LocalisationUpdate completed (1.22wmf14) at Wed Sep 4 02:11:48 UTC 2013 [02:11:54] Logged the message, Master [02:21:58] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Sep 4 02:21:58 UTC 2013 [02:22:04] Logged the message, Master [02:26:45] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [02:32:45] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [02:40:45] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [02:43:45] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [02:52:45] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [02:53:45] PROBLEM - 
Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [02:56:45] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [02:56:45] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [02:56:45] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [02:59:45] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [02:59:46] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [03:02:45] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [03:02:45] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [03:05:45] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [03:08:45] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [03:08:45] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [03:09:45] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [03:09:45] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [03:13:45] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [04:13:45] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [04:38:52] (03PS1) 10Ori.livneh: Delete old 'sysctlfile' module & related detritus [operations/puppet] - 10https://gerrit.wikimedia.org/r/82571 [05:50:45] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [05:50:45] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [05:52:24] Jamesofur: have you reported a bug? 
[05:52:53] not yet, but I think Tilman was going to then I'd add on :) [05:52:59] he was very nice about offering :D [06:26:21] !log truncated fact_values puppet table and reset auto increment to start at 1, puppet was broken on all hosts, see http://projects.puppetlabs.com/issues/9225 [06:26:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:26:27] Logged the message, Master [06:27:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [06:52:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:53:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time [07:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [07:31:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:33:15] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: No successful Puppet run in the last 10 hours [07:33:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [07:34:15] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: No successful Puppet run in the last 10 hours [07:38:45] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: No successful Puppet run in the last 10 hours [07:38:45] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: No successful Puppet run in the last 10 hours [07:39:45] PROBLEM - Puppet freshness on ms-be4 is CRITICAL: No successful Puppet run in the last 10 hours [07:40:45] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: No successful Puppet run in the last 10 hours [07:40:45] PROBLEM - Puppet 
freshness on ms-fe3 is CRITICAL: No successful Puppet run in the last 10 hours [07:41:45] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: No successful Puppet run in the last 10 hours [07:41:45] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: No successful Puppet run in the last 10 hours [07:44:45] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: No successful Puppet run in the last 10 hours [07:46:45] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: No successful Puppet run in the last 10 hours [07:47:45] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: No successful Puppet run in the last 10 hours [07:49:45] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: No successful Puppet run in the last 10 hours [07:52:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:52:45] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: No successful Puppet run in the last 10 hours [07:53:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.145 second response time [07:57:45] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: No successful Puppet run in the last 10 hours [07:58:45] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: No successful Puppet run in the last 10 hours [08:01:31] (03CR) 10Faidon Liambotis: [C: 032] Delete old 'sysctlfile' module & related detritus [operations/puppet] - 10https://gerrit.wikimedia.org/r/82571 (owner: 10Ori.livneh) [08:08:32] (03CR) 10Faidon Liambotis: [C: 04-1] "LGTM but I won't pretend I've reviewed the Javascript code, nor that I intend to :) If you want a code review for that, maybe we should so" [operations/puppet] - 10https://gerrit.wikimedia.org/r/82563 (owner: 10Ori.livneh) [08:10:41] RECOVERY - Puppet freshness on ms-be9 is OK: puppet ran at Wed Sep 4 08:10:38 UTC 2013 [08:10:51] RECOVERY - Puppet freshness on ms-be3 is OK: puppet ran at Wed Sep 4 08:10:43 UTC 2013 [08:12:01] PROBLEM - Puppet freshness on sq36 is 
CRITICAL: No successful Puppet run in the last 10 hours [08:13:01] RECOVERY - Puppet freshness on ms-be11 is OK: puppet ran at Wed Sep 4 08:12:56 UTC 2013 [08:14:51] RECOVERY - Puppet freshness on ms-be12 is OK: puppet ran at Wed Sep 4 08:14:42 UTC 2013 [08:15:51] RECOVERY - Puppet freshness on ms-be5 is OK: puppet ran at Wed Sep 4 08:15:49 UTC 2013 [08:17:51] RECOVERY - Puppet freshness on ms-fe2 is OK: puppet ran at Wed Sep 4 08:17:41 UTC 2013 [08:20:56] (03PS1) 10ArielGlenn: one more protoproxy -> nginx change [operations/puppet] - 10https://gerrit.wikimedia.org/r/82590 [08:22:01] RECOVERY - Puppet freshness on ms-be10 is OK: puppet ran at Wed Sep 4 08:21:58 UTC 2013 [08:22:16] (03CR) 10ArielGlenn: [C: 032] one more protoproxy -> nginx change [operations/puppet] - 10https://gerrit.wikimedia.org/r/82590 (owner: 10ArielGlenn) [08:26:21] (03CR) 10Faidon Liambotis: "(2 comments)" [operations/debs/stud] (wikimedia) - 10https://gerrit.wikimedia.org/r/82428 (owner: 10Mark Bergsma) [08:26:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:26:51] RECOVERY - Puppet freshness on ms-be7 is OK: puppet ran at Wed Sep 4 08:26:45 UTC 2013 [08:27:01] RECOVERY - Puppet freshness on ms-fe4 is OK: puppet ran at Wed Sep 4 08:26:55 UTC 2013 [08:28:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [08:32:11] RECOVERY - Puppet freshness on ms-fe1 is OK: puppet ran at Wed Sep 4 08:32:02 UTC 2013 [08:32:11] RECOVERY - Puppet freshness on ms-be6 is OK: puppet ran at Wed Sep 4 08:32:02 UTC 2013 [08:35:11] RECOVERY - Puppet freshness on ms-be8 is OK: puppet ran at Wed Sep 4 08:35:05 UTC 2013 [08:35:51] RECOVERY - Puppet freshness on ms-be1 is OK: puppet ran at Wed Sep 4 08:35:50 UTC 2013 [08:37:45] (03CR) 10Faidon Liambotis: "(1 comment)" [operations/debs/stud] (wikimedia) - 10https://gerrit.wikimedia.org/r/82427 (owner: 10Mark Bergsma) [08:37:51] 
RECOVERY - Puppet freshness on ms-be4 is OK: puppet ran at Wed Sep 4 08:37:41 UTC 2013 [08:38:54] RECOVERY - Puppet freshness on ms-fe3 is OK: puppet ran at Wed Sep 4 08:38:46 UTC 2013 [08:38:54] RECOVERY - Puppet freshness on ms-be2 is OK: puppet ran at Wed Sep 4 08:38:51 UTC 2013 [08:49:18] morning (still) [08:51:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:53:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [08:57:45] (03PS1) 10ArielGlenn: nginx sites with donotify need to use nginx module, not generic defn [operations/puppet] - 10https://gerrit.wikimedia.org/r/82592 [08:57:54] morning. [08:58:18] hashar, you want to look at that change ^^ as it affects localssl? [08:59:22] apergos: hey [09:00:02] no clue =) [09:00:25] don't we use localssl in labs? [09:00:28] IIRC beta generates several nginx sites [09:01:38] apergos: the varnish caches have role::protoproxy::ssl::beta [09:02:09] ok well let's keep an eye on those, though I think it's going to be fine [09:02:15] which create a resource 'protoproxy' for each of the possible domains [09:02:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:02:29] the proxy backend being set to 127.0.0.1 [09:02:32] yup [09:02:33] so yeah slightly different I guess [09:02:58] all right I'm going to get this merged [09:03:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [09:03:47] (03CR) 10ArielGlenn: [C: 032] nginx sites with donotify need to use nginx module, not generic defn [operations/puppet] - 10https://gerrit.wikimedia.org/r/82592 (owner: 10ArielGlenn) [09:04:08] seems localssl is another way to do what is in role::protoproxy::ssl::beta hehe [09:06:54] RECOVERY - Puppet freshness on ssl1005 is OK: puppet ran at Wed Sep 4 09:06:51 UTC 2013 
[09:09:54] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Wed Sep 4 09:09:53 UTC 2013 [09:12:52] RECOVERY - Puppet freshness on ssl1001 is OK: puppet ran at Wed Sep 4 09:12:44 UTC 2013 [09:13:22] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:14:02] RECOVERY - Puppet freshness on amssq47 is OK: puppet ran at Wed Sep 4 09:14:01 UTC 2013 [09:14:52] RECOVERY - Puppet freshness on ssl1 is OK: puppet ran at Wed Sep 4 09:14:47 UTC 2013 [09:15:52] RECOVERY - Puppet freshness on ssl1003 is OK: puppet ran at Wed Sep 4 09:15:48 UTC 2013 [09:16:32] RECOVERY - Puppet freshness on ssl4 is OK: puppet ran at Wed Sep 4 09:16:24 UTC 2013 [09:18:22] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:19:52] RECOVERY - Puppet freshness on ssl1007 is OK: puppet ran at Wed Sep 4 09:19:48 UTC 2013 [09:21:02] RECOVERY - Puppet freshness on ssl1006 is OK: puppet ran at Wed Sep 4 09:20:54 UTC 2013 [09:22:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:02] RECOVERY - Puppet freshness on ssl3001 is OK: puppet ran at Wed Sep 4 09:22:59 UTC 2013 [09:23:12] RECOVERY - Puppet freshness on ssl1002 is OK: puppet ran at Wed Sep 4 09:23:04 UTC 2013 [09:23:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [09:26:12] RECOVERY - Puppet freshness on ssl1004 is OK: puppet ran at Wed Sep 4 09:26:06 UTC 2013 [09:26:22] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:27:12] PROBLEM - Puppet freshness on stafford is CRITICAL: No successful Puppet run in the last 10 hours [09:27:12] hashar: Do you know if I made my request with needed information, or does stuff have to be added in https://rt.wikimedia.org/Ticket/Display.html?id=5710 ? 
I could not find an "access request procedure" [09:27:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:28:12] RECOVERY - Puppet freshness on ssl3003 is OK: puppet ran at Wed Sep 4 09:28:06 UTC 2013 [09:28:52] RECOVERY - Puppet freshness on ssl1008 is OK: puppet ran at Wed Sep 4 09:28:46 UTC 2013 [09:29:12] RECOVERY - Puppet freshness on ssl1009 is OK: puppet ran at Wed Sep 4 09:29:07 UTC 2013 [09:29:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [09:30:12] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [09:30:22] RECOVERY - Puppet freshness on ssl3002 is OK: puppet ran at Wed Sep 4 09:30:12 UTC 2013 [09:30:22] RECOVERY - Puppet freshness on ssl3 is OK: puppet ran at Wed Sep 4 09:30:12 UTC 2013 [09:30:25] siebrand: do you happen to know which db ori-l sends event logging events to? [09:31:35] eventlogging::service::consumer { 'mysql-db1047': [09:31:36] ah [09:32:12] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Wed Sep 4 09:32:03 UTC 2013 [09:33:22] RECOVERY - Puppet freshness on ssl2 is OK: puppet ran at Wed Sep 4 09:33:13 UTC 2013 [09:38:50] siebrand: sorry no clue, if ops-requests is not the proper queue, it will get redirected. [09:39:10] hashar: k. 
tx [09:39:38] mysql:wikiadmin@db1047 [(none)]> use log [09:39:38] ERROR 1044 (42000): Access denied for user 'wikiadmin'@'10.64.%' to database 'log' [09:39:39] (03PS1) 10ArielGlenn: nginx module should expect enable true, not 'true' [operations/puppet] - 10https://gerrit.wikimedia.org/r/82593 [09:39:39] =( [09:40:50] (03CR) 10ArielGlenn: [C: 032] nginx module should expect enable true, not 'true' [operations/puppet] - 10https://gerrit.wikimedia.org/r/82593 (owner: 10ArielGlenn) [09:45:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:46:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [09:51:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:53:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [09:59:33] PROBLEM - RAID on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:00:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:01:23] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[10:09:01] (03CR) 10Mark Bergsma: "(1 comment)" [operations/debs/stud] (wikimedia) - 10https://gerrit.wikimedia.org/r/82428 (owner: 10Mark Bergsma) [10:10:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [10:25:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:27:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.194 second response time [10:29:43] (03PS1) 10Akosiaris: Only backup /var/lib/mailman if defined [operations/puppet] - 10https://gerrit.wikimedia.org/r/82595 [10:30:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:31:22] (03CR) 10Akosiaris: [C: 032] Only backup /var/lib/mailman if defined [operations/puppet] - 10https://gerrit.wikimedia.org/r/82595 (owner: 10Akosiaris) [10:33:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.829 second response time [10:36:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:36:31] RECOVERY - Puppet freshness on sodium is OK: puppet ran at Wed Sep 4 10:36:20 UTC 2013 [10:37:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.357 second response time [10:42:31] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [10:42:31] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [10:42:31] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [10:49:41] (03PS1) 10ArielGlenn: dns account not needed on fenari any more [operations/puppet] - 10https://gerrit.wikimedia.org/r/82598 [10:50:39] (03CR) 10ArielGlenn: [C: 032] dns account not needed on fenari any more 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/82598 (owner: 10ArielGlenn) [10:50:52] are you also cleaning up the account manually? [10:51:30] I planned to, yes [10:51:41] just now I am running puppet to see if there is anything else wrong over there [10:51:52] cool [10:51:55] thanks for cleaning up my mess :) [10:52:01] no worries [10:52:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:53:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.145 second response time [10:57:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:58:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.420 second response time [11:02:21] RECOVERY - Puppet freshness on fenari is OK: puppet ran at Wed Sep 4 11:02:15 UTC 2013 [11:09:18] RECOVERY - Puppet freshness on pdf1 is OK: puppet ran at Wed Sep 4 11:09:10 UTC 2013 [11:25:38] RECOVERY - Puppet freshness on mw1126 is OK: puppet ran at Wed Sep 4 11:25:34 UTC 2013 [11:32:58] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [11:54:00] re [11:54:46] (03PS1) 10Akosiaris: Whitespace cleanup (mostly) [operations/puppet] - 10https://gerrit.wikimedia.org/r/82601 [12:01:02] (03CR) 10Akosiaris: [C: 032] Whitespace cleanup (mostly) [operations/puppet] - 10https://gerrit.wikimedia.org/r/82601 (owner: 10Akosiaris) [12:05:37] RECOVERY - Disk space on ms-be1 is OK: DISK OK [12:06:27] RECOVERY - search indices - check lucene status page on search1022 is OK: HTTP OK: HTTP/1.1 200 OK - 56465 bytes in 0.009 second response time [12:06:37] RECOVERY - search indices - check lucene status page on search1016 is OK: HTTP OK: HTTP/1.1 200 OK - 53551 bytes in 0.010 second response time [12:08:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[12:12:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [12:12:56] PROBLEM - DPKG on search19 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:21:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [12:38:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:39:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.582 second response time [12:45:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:32] iii am doomed [12:45:33] :( [12:46:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.804 second response time [12:53:22] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:57:22] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:12:39] E: Failed to fetch http://ubuntu.wikimedia.org/ubuntu/pool/main/libg/libgcrypt11/libgcrypt11_1.5.0-3ubuntu0.1_amd64.deb: 404 Not Found [13:12:40] uhhu [13:12:54] apt-get update? [13:13:34] looks like cow builder does not update :/ [13:14:06] --update [13:15:17] oh why can't I reproduce issues [13:21:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [13:30:16] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[13:32:16] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:35:45] (03PS3) 10Ottomata: Turn on automatic pulling for geowiki repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/82409 (owner: 10QChris) [13:35:55] (03CR) 10Ottomata: [C: 032 V: 032] Turn on automatic pulling for geowiki repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/82409 (owner: 10QChris) [13:38:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:35] (03CR) 10Ottomata: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/82410 (owner: 10QChris) [13:42:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.340 second response time [13:46:16] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:56:34] I sent a lengthy email to the ops list regarding a crazy issue I am facing with git :] [13:56:54] and no clue how to debug it properly :( [14:06:50] (03PS1) 10Ottomata: + more comment doc in analytics role classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/82607 [14:07:00] (03CR) 10Ottomata: [C: 032 V: 032] + more comment doc in analytics role classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/82607 (owner: 10Ottomata) [14:10:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:11:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [14:13:19] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:19:03] (03PS1) 10Ottomata: Installing Hive, Oozie and Hue servers on analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/82611 [14:21:16] paravoid, can you help me out with a pinning problem? 
I have a local repository and I want to mark it as higher priority than upstream repos… but can't figure out how to refer to the local repo in preferences.d [14:29:48] (03PS1) 10Hashar: contint: move iptables under module [operations/puppet] - 10https://gerrit.wikimedia.org/r/82613 [14:33:19] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:33:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:36] (03PS1) 10Hashar: contint: prevents access to Zuul and git daemon [operations/puppet] - 10https://gerrit.wikimedia.org/r/82614 [14:40:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.511 second response time [14:43:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time [14:50:43] (03CR) 10QChris: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/82410 (owner: 10QChris) [14:51:29] PROBLEM - RAID on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:53:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [14:55:26] (03CR) 10Ottomata: [C: 032 V: 032] Installing Hive, Oozie and Hue servers on analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/82611 (owner: 10Ottomata) [14:56:45] k [14:57:49] .wmnet ? [14:58:06] haha, so used to typing pmtpa.wmflabs [14:58:09] :) [14:58:14] i have shortcuts for wmnet [14:58:49] so, basically….puppet should do everything? :) [14:58:58] here we go?! 
[14:59:15] so, just an overview while that runs [14:59:19] go puppet go (plus that's another host off the puppet not running list) [14:59:28] hehe [14:59:36] hive is a sql engine built on mapreduce [14:59:50] the cool bit, is it lets you define tables based on any filetype loaded into hdfs [14:59:54] and then query them with sql [14:59:59] it consists of: [15:00:05] hive-server2 [15:00:05] hive-metastore [15:00:05] mysql db [15:00:13] hive clients interact with hive-server2 [15:00:27] hive-server2 talks to hive-metastore which has a configurable db backend [15:00:41] puppet should install mysql and set up the dbs and then do metastore and then do hive server2 [15:00:46] cool [15:01:21] oh i need to put the whole sockpuppet ca thing [15:01:22] oops [15:01:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:02:49] there we go [15:02:55] ok [15:02:57] more stuff [15:03:05] oozie is a fancy job scheduler for hadoop [15:03:24] it can launch predefined jobs triggered by when data is available in hdfs [15:03:37] it has oozie-server and also a mysql db [15:03:40] puppet should set all that up too [15:03:59] aaand, hue is a nice little web GUI to all of these hadoop services [15:04:06] hdfs, oozie, hive, pig, + more [15:04:17] it has a hue server, and a db backend [15:04:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [15:04:24] right now puppet will leave the default backend in place, sqlite [15:04:32] but it is possible to make hue use mysql [15:04:35] we can do that later if we need to [15:04:35] cool [15:04:39] :) [15:05:12] yay, good so far… :) [15:07:19] RECOVERY - Puppet freshness on analytics1027 is OK: puppet ran at Wed Sep 4 15:07:18 UTC 2013 [15:07:29] PROBLEM - RAID on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
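The setup order described in the overview above (MySQL and its databases first, then hive-metastore, then hive-server2, with oozie-server also backed by a MySQL db) can be sketched as a small dependency graph. This is only an illustrative Python sketch, not the Puppet manifest being applied: the component names come from the conversation, while the `DEPENDS_ON` mapping and the `install_order` helper are hypothetical, and placing Hue after the services it fronts is an assumption. It relies on the standard-library `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Each key lists the components that must be set up before it.
# Names follow the chat; the mapping itself is a hypothetical illustration.
DEPENDS_ON = {
    "mysql": set(),
    "hive-metastore": {"mysql"},          # metastore uses a configurable db backend
    "hive-server2": {"hive-metastore"},   # hive-server2 talks to the metastore
    "oozie-server": {"mysql"},            # oozie also has a mysql db
    # Assumption: Hue is applied after the services it provides a GUI for.
    "hue": {"hive-server2", "oozie-server"},
}


def install_order(depends_on):
    """Return one valid order in which every component follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(" -> ".join(install_order(DEPENDS_ON)))
```

Several orders can satisfy the same constraints (for example, oozie-server may come before or after the Hive components), so the sketch only guarantees that each component appears after everything it depends on.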
[15:10:16] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:10:31] paravoid: gmetric.js didn't have a license when i created that patch :P https://github.com/jbuchbinder/node-gmetric/issues/12 [15:14:05] ottomata: see that ? [15:16:16] RECOVERY - Disk space on analytics1027 is OK: DISK OK [15:16:16] (03PS2) 10Andrew Bogott: Labsdebrepo fixes: [operations/puppet] - 10https://gerrit.wikimedia.org/r/82532 [15:16:39] oooo [15:16:44] looks good? or did I miss something? [15:16:46] RECOVERY - RAID on analytics1027 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:17:11] gonna run puppet again LeslieCarr [15:17:12] just to see [15:17:23] missed hive failing to install [15:17:27] ah ok [15:17:47] (03CR) 10Andrew Bogott: [C: 032] Labsdebrepo fixes: [operations/puppet] - 10https://gerrit.wikimedia.org/r/82532 (owner: 10Andrew Bogott) [15:17:48] probably will fix itself second time and is an ordering issue [15:18:56] hmm that def there is [15:19:00] err: /Stage[main]/Cdh4::Hue/User[hue]/groups: change from to hive,ssl-cert failed: Could not set groups on user[hue]: Execution of '/usr/sbin/usermod -G hive,ssl-cert hue' returned 6: usermod: group 'ssl-cert' does not exist [15:19:02] i should make hue require hive/oozie first [15:19:59] hmm ssl-cert? [15:20:05] hm [15:21:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.092 second response time [15:22:19] hm. [15:22:29] gotta figure out what package ssl-cert group is created by [15:22:34] thought it would have been openssl [15:22:35] (03PS2) 10Ori.livneh: statsd module: provision Ganglia backend support [operations/puppet] - 10https://gerrit.wikimedia.org/r/82563 [15:23:21] (03CR) 10Ori.livneh: "PS2 adds license info to the header of gmetric.js; I'm cool w/pushing it too." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/82563 (owner: 10Ori.livneh) [15:24:25] could be the ssl-cert package [15:24:41] http://ubuntuforums.org/showthread.php?t=1175286 [15:25:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:37] (03PS1) 10Ottomata: Installing ssl-cert package to make sure ssl-cert group is created for Hue SSL. [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/82620 [15:25:38] yeah [15:25:46] LeslieCarr: ^ [15:26:03] (03CR) 10Ottomata: [C: 032 V: 032] Installing ssl-cert package to make sure ssl-cert group is created for Hue SSL. [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/82620 (owner: 10Ottomata) [15:26:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.188 second response time [15:26:42] cool [15:27:16] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:27:24] (03PS1) 10Ottomata: Updating modules/cdh4 with ssl-cert package install change for Hue [operations/puppet] - 10https://gerrit.wikimedia.org/r/82621 [15:27:34] something is totally weird with analytics1014, will check that out after this [15:27:44] (03CR) 10Ottomata: [C: 032 V: 032] Updating modules/cdh4 with ssl-cert package install change for Hue [operations/puppet] - 10https://gerrit.wikimedia.org/r/82621 (owner: 10Ottomata) [15:28:08] ok, let's try that now [15:29:26] RECOVERY - NTP on analytics1027 is OK: NTP OK: Offset -0.01433777809 secs [15:29:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.706 second response time [15:34:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:34:52] akosiaris, I'm inheriting from a class that's in the same module /and/ the same 
file, yet lint still says 'class inherits across namespaces.' What am I missing? [15:34:55] yay [15:35:45] looks happy [15:36:34] andrewbogott: well for starters inheritance is considered harmful. You sure it is not a typo ? [15:36:43] could i have a look ? [15:37:11] akosiaris, I didn't write the original code so I'm reluctant to rebuild it from the ground up… there's already dangerously large amounts of refactor in this one patch :) [15:37:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.574 second response time [15:37:20] Latest version is https://gerrit.wikimedia.org/r/#/c/77332/ [15:37:35] in platform.pp [15:40:02] yeah, LeslieCarr that looks way better [15:40:21] ok, Leslie, help me test this [15:40:22] hue [15:40:23] um [15:40:33] run this locally: [15:40:34] ssh -N analytics1001.wikimedia.org -L 8888:analytics1027.eqiad.wmnet:8888 [15:40:39] then http://localhost:8888 [15:40:57] class base::platform::dell-c2100 inherits base::platform::generic::dell [15:41:01] ohhh wait it is ssl now hm [15:41:16] it is unhappy because it is inheriting from a class one level down [15:41:37] (03PS1) 10Manybubbles: Setup logrotation for elasticsearch. [operations/puppet] - 10https://gerrit.wikimedia.org/r/82623 [15:41:37] i 'd say don't touch for now [15:41:39] ah yeah [15:41:40] run that [15:41:40] and [15:41:44] https://localhost:8888/accounts/login/?next=/ [15:41:44] ottomata: got a login prompt [15:41:48] ah great! [15:41:55] use your shell username and ldap pw [15:42:11] akosiaris, I've tried rearranging it so that the inherited class is in a higher scope, lint still complains. [15:42:59] e.g. 
renaming base::platform::generic::dell to base::platform::generic-dell or even base::platform-generic-dell [15:43:36] class base::platform::dell::c2100 inherits base::platform::dell { [15:43:40] this makes it happy [15:43:59] but the puppet-lint parser is known to have problems [15:44:12] so i really suggest avoiding it for now [15:44:23] ok, happy to :) [15:44:40] LeslieCarr: can you log in? [15:44:42] I stripped out tabs and fixed a bunch of other formatting issues but… got tired of aligning things [15:45:14] (03CR) 10MZMcBride: "What's needed to get this merged and deployed?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/78944 (owner: 10QChris) [15:45:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:35] Well it is a step forward. [15:46:12] (03PS1) 10Hashar: contint: publish Zuul git over git protocol [operations/puppet] - 10https://gerrit.wikimedia.org/r/82625 [15:46:29] ottomata: gerrit pw [15:46:30] ? [15:47:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.546 second response time [15:47:51] ja [15:47:53] that should work [15:48:04] LeslieCarr [15:48:17] woot [15:48:27] great! [15:48:46] (03CR) 10Chad: "The usual, bug someone in ops" [operations/puppet] - 10https://gerrit.wikimedia.org/r/78944 (owner: 10QChris) [15:49:18] cool, ummm, i think we are done LeslieCarr! [15:49:20] thank you! [15:50:46] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [15:51:03] huzzah [15:53:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:54:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [15:57:16] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:58:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:39] heya LeslieCarr, if you have one more second, i'm looking into this icinga alert [15:59:48] thinking it might be an nrpe problem? not sure [16:00:11] if I run this from neon: [16:00:13] /usr/lib/nagios/plugins/check_nrpe -H analytics1014.eqiad.wmnet -c check_dpkg [16:00:28] should I get the output from running /usr/local/lib/nagios/plugins/check_dpkg on analytics1014? [16:00:40] yeah [16:00:40] right now I just get NRPE: Unable to read output [16:00:43] but I get that from other hosts too [16:01:17] hrm [16:01:46] and nrpe is running properly on 1014... [16:02:34] i wonder what could have changed [16:03:10] (03PS2) 10Hashar: contint: publish Zuul git over git protocol [operations/puppet] - 10https://gerrit.wikimedia.org/r/82625 [16:03:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [16:04:24] for some reason check_dpkg doesn't have +x on it [16:04:37] which is the problem i think [16:05:10] (03PS1) 10Ori.livneh: Log NavigationTiming data to Ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/82627 [16:06:18] hmmm yeah i wondered about that, but it doesn't on other hosts either [16:06:19] like stat1002 [16:06:25] oh [16:06:28] but it has the same problem on stat1002 [16:06:28] ah [16:06:29] hm [16:07:07] still same result after chmod 755 [16:07:16] PROBLEM - DPKG on analytics1014 is CRITICAL: NRPE: Unable to read output [16:08:19] hrm [16:08:39] /usr/local/lib/nagios/plugins/check_dpkg: 9: .: Can't open /usr/local/lib/nagios/plugins/utils.sh [16:09:14] <^d> !log copied ssh_host_key for gerrit from manganese to ytterbium, restarted gerrit on ytterbium [16:09:17] that was via nrpe? [16:09:19] or just locally? 
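For context on what check_nrpe is expected to relay here: a Nagios-style check reports its result as one line of output plus an exit status (conventionally 0 for OK, 2 for CRITICAL). The sketch below is illustrative only and is not the check_dpkg plugin deployed on these hosts. The function name and the parsing rules are assumptions; the two message strings are the ones that appear in the alerts in this log.

```python
# Illustrative sketch only; NOT the check_dpkg plugin from this log.
# The parsing rules below are an assumption about what "broken" means.

# Conventional Nagios plugin exit codes, defined inline.
STATE_OK = 0
STATE_CRITICAL = 2


def evaluate_dpkg_listing(listing):
    """Return (exit_code, message) for the text printed by `dpkg -l`."""
    broken = []
    in_table = False
    for line in listing.splitlines():
        if line.startswith("+++-"):
            # The header block of `dpkg -l` ends at the +++-==== rule.
            in_table = True
            continue
        if not in_table or not line.strip():
            continue
        fields = line.split()
        flags = fields[0]
        name = fields[1] if len(fields) > 1 else "?"
        # Second flag is the package state: U (unpacked), F (half-configured)
        # and H (half-installed) suggest an interrupted install.
        # A third flag of R means reinstallation is required.
        if (len(flags) > 1 and flags[1] in "UFH") or flags[2:3] == "R":
            broken.append(name)
    if broken:
        return STATE_CRITICAL, "DPKG CRITICAL dpkg reports broken packages"
    return STATE_OK, "All packages OK"
```

A wrapper would print the message and exit with the code. If the plugin cannot run at all, it prints nothing, which would be consistent with the "Unable to read output" result seen above.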
[16:09:45] hm [16:09:51] bash /usr/local/lib/nagios/plugins/check_dpkg [16:09:53] also does that [16:09:55] but prints [16:09:58] All packages OK [16:10:22] bash != sh [16:10:42] change the shebang [16:10:45] locally [16:10:52] i gotta get going [16:10:59] can check this out when i get to the dc [16:11:13] ottomata: http://serverfault.com/a/441675 [16:11:20] bbiab [16:11:23] ok cool, s'ok [16:11:28] i think it's because utils.sh isn't there [16:11:34] but that is part of percona icinga stuff? [16:11:37] (03CR) 10Hashar: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/82625 (owner: 10Hashar) [16:12:21] ottomata: it's in /usr/lib/nagios/plugins/utils.sh [16:12:33] no local [16:12:38] yeah but # This program is part of percona-nagios-checks (http://code.google.com/p/percona-nagios-checks/) [16:12:45] anyway, i think i don't need it [16:13:17] RECOVERY - DPKG on analytics1014 is OK: All packages OK [16:13:20] * ori-l shrugs [16:13:29] yeah [16:13:37] dunno if it is better to change to bash and live with error [16:13:45] or remove the source line for utils.sh [16:15:11] probably the latter [16:15:14] yeah [16:15:16] think so too [16:15:20] (03PS1) 10Ottomata: Removing use of utils.sh from check_dpkg nrpe plugin. [operations/puppet] - 10https://gerrit.wikimedia.org/r/82628 [16:15:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:15:34] (03CR) 10Ottomata: [C: 032 V: 032] Removing use of utils.sh from check_dpkg nrpe plugin.
[operations/puppet] - 10https://gerrit.wikimedia.org/r/82628 (owner: 10Ottomata) [16:16:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [16:19:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.468 second response time [16:26:03] (03PS1) 10Ottomata: nrpe plugin files need to be executable [operations/puppet] - 10https://gerrit.wikimedia.org/r/82631 [16:26:29] heya paravoid,could you check that one real quick? ^ [16:26:37] (03CR) 10Chad: [C: 031] Setup logrotation for elasticsearch. [operations/puppet] - 10https://gerrit.wikimedia.org/r/82623 (owner: 10Manybubbles) [16:26:39] not sure if 0555 would make the most sense there [16:26:43] maybe 0544 [16:26:56] but i'm not sure what user executes nrpe plugins as [16:39:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:39:58] (03PS7) 10Andrew Bogott: Move base class and subclasses into a 'base' module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77332 [16:41:03] (03CR) 10Andrew Bogott: [C: 032] Move base class and subclasses into a 'base' module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77332 (owner: 10Andrew Bogott) [16:43:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.204 second response time [16:45:17] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:45:44] springle_away: can you look at https://gerrit.wikimedia.org/r/#/c/57536/ when you get the chance? 
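On the 0555 versus 0544 question raised above, a small sketch that decodes the two candidate modes may help. It only interprets the permission bits; which user actually runs the NRPE plugins is left open in the log and is not assumed here. The helper name who_can_execute is made up for this example.

```python
import stat


def who_can_execute(mode):
    """Return which permission classes have the execute bit set in mode."""
    classes = []
    if mode & stat.S_IXUSR:
        classes.append("owner")
    if mode & stat.S_IXGRP:
        classes.append("group")
    if mode & stat.S_IXOTH:
        classes.append("other")
    return classes


for mode in (0o555, 0o544):
    # S_IFREG marks a regular file so filemode() renders a leading "-".
    print(oct(mode), stat.filemode(stat.S_IFREG | mode), who_can_execute(mode))
```

With 0544 only the file's owner keeps the execute bit, so that mode would work only if the plugins are run by the user that owns them. 0555 lets any user execute them.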
[16:46:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:45] !log moved base.pp into the 'base' module [16:49:17] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:52:27] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:53:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.196 second response time [16:53:57] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:54:27] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:54:27] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:54:28] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:54:28] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [16:54:37] PROBLEM - swift-container-replicator on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [16:54:37] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [16:54:37] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [16:54:47] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:07] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [16:55:47] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [16:55:57] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.718 second response time [16:56:17] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.803 second response time [16:56:27] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 
bytes in 4.855 second response time [16:56:27] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.430 second response time [16:56:27] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.135 second response time [16:56:37] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.046 second response time [16:56:37] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 60686 bytes in 1.262 second response time [16:57:15] (03PS2) 10Ottomata: nrpe plugin files need to be executable [operations/puppet] - 10https://gerrit.wikimedia.org/r/82631 [16:57:27] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.046 second response time [16:57:33] (03CR) 10Ottomata: [C: 032 V: 032] nrpe plugin files need to be executable [operations/puppet] - 10https://gerrit.wikimedia.org/r/82631 (owner: 10Ottomata) [16:58:25] ottomata: can you merge https://gerrit.wikimedia.org/r/#/c/82563/ by any chance? 
faidon was ok with it save for the lack of license info, which i've since added [16:58:33] plus https://gerrit.wikimedia.org/r/#/c/82627/ [16:59:47] PROBLEM - swift-container-replicator on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:00:10] (03PS3) 10Ottomata: statsd module: provision Ganglia backend support [operations/puppet] - 10https://gerrit.wikimedia.org/r/82563 (owner: 10Ori.livneh) [17:00:21] (03CR) 10Ottomata: [C: 032 V: 032] statsd module: provision Ganglia backend support [operations/puppet] - 10https://gerrit.wikimedia.org/r/82563 (owner: 10Ori.livneh) [17:00:27] hot [17:00:27] thanks [17:00:49] (03PS2) 10Ori.livneh: Log NavigationTiming data to Ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/82627 [17:01:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:01:43] the other one seems fine, right ori-l? [17:01:50] to merge? [17:01:55] should be, yes [17:02:15] (03CR) 10Ottomata: [C: 032 V: 032] Log NavigationTiming data to Ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/82627 (owner: 10Ori.livneh) [17:02:38] done! [17:02:53] weee, let's see if this works [17:03:06] i mean, let's confirm that it works like it did in dev [17:03:13] hehe [17:04:06] puppet is sooooo slow [17:04:27] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:44] Anyone here who knows how to revive morebots? [17:04:59] yeah being really slow for me too [17:05:28] RECOVERY - DPKG on ms-be1011 is OK: All packages OK [17:05:32] andrewbogott: hrm. it's on labs now no? [17:05:45] the labs bot runs on a different host... [17:05:47] RECOVERY - DPKG on ms-be1001 is OK: All packages OK [17:05:52] I don't think I've ever meddled with the one that lives here. 
[17:06:07] it used to be on rackspace, on the same host that powers wikitech-static [17:06:12] but i *think* ryan moved it [17:07:27] RECOVERY - DPKG on analytics1027 is OK: All packages OK [17:08:18] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.100 second response time [17:08:18] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:08:27] RECOVERY - DPKG on analytics1005 is OK: All packages OK [17:08:27] RECOVERY - DPKG on mw31 is OK: All packages OK [17:09:27] RECOVERY - DPKG on analytics1025 is OK: All packages OK [17:09:27] RECOVERY - DPKG on search28 is OK: All packages OK [17:09:37] PROBLEM - DPKG on cp1061 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:09:47] RECOVERY - DPKG on search1019 is OK: All packages OK [17:10:07] RECOVERY - DPKG on analytics1012 is OK: All packages OK [17:10:57] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [17:10:57] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:07] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [17:11:17] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [17:11:27] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [17:11:27] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:27] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:37] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:37] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [17:11:37] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [17:11:37] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [17:12:15] <^d> Hmm, that's the second time now. 
[17:12:19] ottomata: O.o https://dpaste.de/mjcqN/ [17:12:27] PROBLEM - RAID on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:12:27] RECOVERY - DPKG on ms-be1012 is OK: All packages OK [17:12:37] RECOVERY - DPKG on analytics1023 is OK: All packages OK [17:12:39] that's puppet [17:12:46] god knows what it's doing [17:12:55] heh [17:13:12] <^d> Puppet does what it wants. [17:13:15] <^d> Puppet don't care. [17:13:57] PROBLEM - swift-container-replicator on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:13:59] puppet is an asshole [17:14:17] RECOVERY - DPKG on mw1057 is OK: All packages OK [17:14:17] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [17:14:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.421 second response time [17:14:27] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.070 second response time [17:14:27] RECOVERY - DPKG on lanthanum is OK: All packages OK [17:15:07] PROBLEM - swift-container-replicator on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:15:27] RECOVERY - DPKG on searchidx1001 is OK: All packages OK [17:15:28] PROBLEM - Swift HTTP on ms-fe3 is CRITICAL: Connection timed out [17:15:47] PROBLEM - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is CRITICAL: Connection timed out [17:16:03] I just merged a somewhat drastic puppet refactor, so watching these alerts nervously...
[17:16:17] RECOVERY - DPKG on analytics1014 is OK: All packages OK [17:16:27] PROBLEM - DPKG on analytics1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:16:46] anecdotally it's been like this for a few days now [17:16:47] PROBLEM - swift-container-replicator on ms-be9 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:16:57] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.922 second response time [17:16:57] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.076 second response time [17:16:59] i don't have hard evidence to back that up, but that's been my impression [17:17:17] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [17:17:17] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.070 second response time [17:17:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:28] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.068 second response time [17:17:31] ooooh, i know what we should do [17:17:37] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [17:17:37] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 60686 bytes in 0.274 second response time [17:17:40] PROBLEM - DPKG on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:17:47] RECOVERY - DPKG on labstore3 is OK: All packages OK [17:17:57] PROBLEM - RAID on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:17:58] notice: Finished catalog run in 526.43 seconds --> graphite / ganglia [17:18:02] OK Who had Calamari for lunch? 
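The idea floated above, sending the "Finished catalog run in N seconds" figure to graphite or ganglia, could be sketched roughly as follows. This is a guess at one way to do it, not something that was deployed: the metric name puppet.catalog_run is hypothetical, and the output uses the usual StatsD timer line format (name:value|ms).

```python
import re

# Matches the duration in puppet's "Finished catalog run" notice.
FINISHED_RE = re.compile(r"Finished catalog run in (\d+(?:\.\d+)?) seconds")


def catalog_run_metric(log_line, metric="puppet.catalog_run"):
    """Turn a puppet 'Finished catalog run' line into a StatsD timer line.

    Returns None when the line does not contain a run duration.
    The default metric name is a placeholder chosen for this example.
    """
    match = FINISHED_RE.search(log_line)
    if match is None:
        return None
    millis = int(round(float(match.group(1)) * 1000))
    return "%s:%d|ms" % (metric, millis)


print(catalog_run_metric("notice: Finished catalog run in 526.43 seconds"))
# prints: puppet.catalog_run:526430|ms
```

The resulting string could then be sent over UDP to a StatsD daemon, which would forward it to whichever backend it is configured with.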
[17:18:17] RECOVERY - DPKG on db1027 is OK: All packages OK [17:18:27] PROBLEM - DPKG on cp1062 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:18:28] RECOVERY - DPKG on analytics1009 is OK: All packages OK [17:18:47] RECOVERY - RAID on ms-fe3 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:18:51] I got an error accessing something at Wikisource - Generated Wed, 04 Sep 2013 17:17:31 GMT by sq53.wikimedia.org (squid/2.7.STABLE9) [17:19:00] In that it's not finding a URL [17:19:17] PROBLEM - Disk space on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:19:27] PROBLEM - swift-container-replicator on ms-be11 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:19:37] RECOVERY - DPKG on analytics1008 is OK: All packages OK [17:19:47] RECOVERY - DPKG on search19 is OK: All packages OK [17:19:50] greg-g: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Miscellaneous+eqiad&h=vanadium.eqiad.wmnet&jr=&js=&v=2809&m=exception&vl=errors&ti=Exceptions [17:19:57] PROBLEM - swift-container-replicator on ms-be12 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:20:17] RECOVERY - DPKG on analytics1019 is OK: All packages OK [17:20:27] RECOVERY - DPKG on ms-be1009 is OK: All packages OK [17:20:27] RECOVERY - DPKG on ms-be1007 is OK: All packages OK [17:21:04] it's swift [17:21:07] RECOVERY - Disk space on ms-fe3 is OK: DISK OK [17:21:14] AaronSchulz, paravoid ^^ [17:21:27] RECOVERY - Swift HTTP on ms-fe3 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.071 second response time [17:21:27] RECOVERY - DPKG on ms-fe3 is OK: All packages OK [17:21:28] lots of: Exception from line 985 of /usr/local/apache/common-local/php-1.22wmf14/includes/filebackend/SwiftFileBackend.php: Got InvalidResponseException exception. 
[17:22:27] RECOVERY - DPKG on virt2 is OK: All packages OK [17:22:27] PROBLEM - swift-container-replicator on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:22:27] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 7.633 second response time [17:22:27] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:27] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:37] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:47] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:47] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:57] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:07] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [17:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.763 second response time [17:23:27] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:27] RECOVERY - DPKG on analytics1017 is OK: All packages OK [17:23:27] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [17:23:47] RECOVERY - DPKG on analytics1018 is OK: All packages OK [17:23:47] PROBLEM - DPKG on db1045 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:24:02] can someone confirm that they're looking into this? i don't know the first thing about swift [17:24:26] * andrewbogott isn't [17:24:27] RECOVERY - DPKG on ms-be1003 is OK: All packages OK [17:24:37] PROBLEM - Swift HTTP on ms-fe3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:50] bd808: hey, are you in the office? [17:25:16] ori-l: Nope. I get in on Sunday [17:25:18] locationist! hmpf! 
[17:25:27] RECOVERY - Swift HTTP on ms-fe3 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 1.199 second response time [17:25:28] RECOVERY - DPKG on analytics1020 is OK: All packages OK [17:25:37] RECOVERY - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 2.030 second response time [17:26:17] error and exception logs contain only Swift o_0 [17:26:17] RECOVERY - DPKG on analytics1024 is OK: All packages OK [17:26:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:57] RECOVERY - DPKG on snapshot3 is OK: All packages OK [17:27:17] RECOVERY - DPKG on search1009 is OK: All packages OK [17:27:27] RECOVERY - DPKG on analytics1006 is OK: All packages OK [17:27:54] Can anyone give me the one line summary of what's broken? [17:28:17] PROBLEM - RAID on ms-fe1 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:28:27] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:27] RECOVERY - DPKG on analytics1016 is OK: All packages OK [17:28:59] apergos ? [17:29:07] RECOVERY - RAID on ms-fe1 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [17:29:18] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.063 second response time [17:29:32] Thanks [17:29:37] RECOVERY - DPKG on searchidx2 is OK: All packages OK [17:29:44] Qcoder00: we're figuring it out; calm down [17:30:32] * Qcoder00 puts up a picture of an octopus with menacing spanners... "Please Stand By!"
[17:31:26] http://ganglia.wikimedia.org/latest/?c=Swift%20pmtpa&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [17:31:31] hrm [17:31:37] RECOVERY - DPKG on analytics1013 is OK: All packages OK [17:32:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.835 second response time [17:32:27] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:58] seems to be getting better: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Miscellaneous+eqiad&h=vanadium.eqiad.wmnet&jr=&js=&v=2809&m=exception&vl=errors&ti=Exceptions [17:33:59] drdee/ottomata: i am taking down an1007. i have a few ideas but it could be a couple of days. I know it does shit but wanted to let you know [17:34:17] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.250 second response time [17:34:57] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.629 second response time [17:35:17] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.411 second response time [17:35:17] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.531 second response time [17:35:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:27] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.142 second response time [17:35:37] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.438 second response time [17:35:37] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 60686 bytes in 0.282 second response time [17:35:41] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.239 second response time [17:35:47] RECOVERY - Apache HTTP on mw1160 is OK: HTTP 
OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.451 second response time [17:36:27] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.654 second response time [17:36:29] (03PS4) 10Andrew Bogott: Turn on pluginsync. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77378 [17:37:09] No idea if this is related or helpful, but bawolff published an RFC about changing the default gallery mode this morning. [17:37:15] It gave tips on trying out the new packed and packed-hover options on existing pages. [17:37:22] Those options use thumbnail sizes that differ from normal and could cause more swift traffic. [17:37:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.229 second response time [17:37:57] PROBLEM - DPKG on ms-fe2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:38:17] PROBLEM - RAID on ms-fe2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:38:17] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [17:38:21] bd808: link to RFC?
[17:38:27] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [17:38:27] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:28] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [17:38:37] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [17:38:37] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [17:38:37] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [17:38:57] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [17:39:07] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:17] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.060 second response time [17:39:27] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [17:39:47] RECOVERY - DPKG on ms-fe2 is OK: All packages OK [17:40:07] RECOVERY - RAID on ms-fe2 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:40:13] * AaronSchulz sighs at https://commons.wikimedia.org/wiki/Special:NewFiles [17:40:19] PissedPanda: https://commons.wikimedia.org/wiki/Commons:Requests_for_comment/Changing_default_gallery_mode [17:41:04] binasher: are you around? [17:41:30] * AaronSchulz wants to know what's eating that cpu [17:41:39] AaronSchulz: hey [17:41:44] yeah do it! cmjohnson1! thanks! [17:41:47] oh [17:42:02] (03PS1) 10Ori.livneh: StatsD on hafnium: load Ganglia backend [operations/puppet] - 10https://gerrit.wikimedia.org/r/82640 [17:42:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:30] ottomata: can you do that one too by any chance? one-liner. 
[17:42:52] (03CR) 10Ottomata: [C: 032 V: 032] StatsD on hafnium: load Ganglia backend [operations/puppet] - 10https://gerrit.wikimedia.org/r/82640 (owner: 10Ori.livneh) [17:42:57] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.097 second response time [17:42:57] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.083 second response time [17:43:04] thanks otto [17:43:11] yup! [17:43:16] AaronSchulz: i think paravoid and mark are looking at swift [17:43:17] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [17:43:17] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.070 second response time [17:43:17] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.095 second response time [17:43:27] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.314 second response time [17:43:36] hmm, ms-be7 and some recovered [17:43:37] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.341 second response time [17:43:37] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.042 second response time [17:43:37] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 60686 bytes in 0.171 second response time [17:43:39] just me, paravoid's not here [17:43:42] i restarted ms-be7 [17:43:50] that would explain :) [17:43:50] swift-object and swift-container [17:43:59] not sure yet what's going on [17:44:20] annoyingly, the python version that's running has been replaced by a newer version [17:44:38] making debugging/profiling harder [17:47:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.557 second response time 
[17:47:28] mark: MW errors seem to have stopped [17:47:46] weird [17:47:49] some backends are still in trouble [17:48:17] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.062 second response time [17:48:39] !log demon synchronized php-1.22wmf15/extensions/RSS [17:50:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:51:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.161 second response time [17:52:27] PROBLEM - RAID on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:55:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:47] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:56:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [17:58:54] !log demon synchronized php-1.22wmf15/extensions/CirrusSearch [18:00:47] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [18:01:17] PROBLEM - RAID on ms-fe1 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:01:28] PROBLEM - RAID on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:02:07] RECOVERY - RAID on ms-fe1 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [18:03:35] hmmm, got a icinga RAID q [18:03:37] RECOVERY - swift-container-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [18:03:38] maybe RobH? 
[18:03:59] (PS1) Cmjohnson: adding dns entries for virt1008/9 & making changes to an1007 for testing [operations/puppet] - https://gerrit.wikimedia.org/r/82642 [18:04:05] icinga uses /usr/local/bin/check_raid.py to check up on raid status [18:04:19] it looks like the analytics dells have a hw raid device attached? not sure [18:04:27] RECOVERY - swift-container-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [18:04:46] whatever is happening, the check_raid.py is failing because it is defaulting to using MegaCli64 rather than mdadm [18:04:48] RECOVERY - swift-container-replicator on ms-be9 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [18:05:07] RECOVERY - swift-container-replicator on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [18:05:27] RECOVERY - swift-container-replicator on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [18:05:31] I'm restarting some of these container replicators fyi [18:06:57] RECOVERY - swift-container-replicator on ms-be12 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [18:06:57] RECOVERY - swift-container-replicator on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [18:06:58] ottomata: they do...they have h310 controller cards and are h/w raided [18:07:31] oh hm [18:07:53] hm [18:07:54] WARNING: Parse error processing MegaCli64 output [18:08:15] !log restarting proxy server on ms-fe1 [18:08:39] argh [18:08:47] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.064 second response time [18:09:17] apergos: will you do the rest too?
[18:09:40] there dont seem to be rest, according to icinga I should have got them all [18:09:49] RECOVERY - swift-container-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [18:10:18] let's see how that looks [18:10:29] did you see ganglia? [18:10:38] looking now [18:11:28] hmm I can go through and restart the other services on some of these [18:11:35] did you see I was debugging those processes? [18:11:41] 5 and 8 it looks like [18:12:04] and 5 and 1 and 2 [18:12:38] ah 2 [18:12:59] PROBLEM - Puppet freshness on sq36 is CRITICAL: No successful Puppet run in the last 10 hours [18:12:59] let's see how that is for a minute [18:14:35] I did not see you were debugging, sorry about that [18:16:10] (CR) Cmjohnson: [C: 2 V: 2] adding dns entries for virt1008/9 & making changes to an1007 for testing [operations/puppet] - https://gerrit.wikimedia.org/r/82642 (owner: Cmjohnson) [18:17:54] hmm, cmjohnson1, it seems the MegaCli64 output is different when running as root than when not [18:18:01] and when it is running as root it has extra output [18:18:09] Adapter 0 -- Virtual Drive Information: [18:18:10] Adapter 0: No Virtual Drive Configured. [18:19:05] * cmjohnson1 goes to look [18:20:18] ottomata: doesn't think seem to be logical to you? [18:21:50] yeah its not, [18:21:57] ottomata if you look at the partman recipe it's software raid 1 [18:22:11] for analytics dells? [18:22:19] yep [18:22:24] yeah makes sense, there are /dev/md* drives [18:22:39] is that wrong then? that's how they have always been [18:22:54] that's how they always been [18:23:21] i mean, should they be using the hw raid there? [18:24:26] i don't know who wanted this cfg. Just cuz they're h/w raid capable it's not best option for everything.
[18:26:23] right ok [18:26:28] yeah i don't think we need it [18:26:31] this is probably what we want [18:26:37] these are datanodes, and jbod is basically better [18:26:44] the only thing we want raid for is for OS disk redundancy [18:26:51] so ok [18:26:57] this is a problem with the check-raid.py script then [18:27:09] if it detects hw raid capable, then it will try to check with megacli [18:28:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:30:11] i'm not really sure what the write behavior there [18:30:40] right* [18:30:42] :p [18:33:02] !log demon synchronized php-1.22wmf15/extensions/CirrusSearch [18:36:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.414 second response time [18:39:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:39:34] (PS1) Cmjohnson: adding dns entries for for virt1008/9 [operations/dns] - https://gerrit.wikimedia.org/r/82646 [18:39:37] Ryan_Lane, can you revive morebots, and/or tell me (again) how to do it? [18:39:45] sure [18:39:47] it's on wikitech [18:39:48] err [18:39:50] wikitech-static [18:39:54] maybe we should move it to labs? [18:40:12] we dont maintain logging on static in outage [18:40:17] so no reason to maintain the bot there [18:40:20] indeed [18:40:21] Hm… sure, actually, I'll move it right now. [18:40:22] imo [18:40:26] andrewbogott: cool, thanks [18:40:31] let me get you the password [18:40:38] just update docs ;] [18:40:38] is it on fenari? [18:40:51] morebots? no [18:40:55] the bot is currently residing on wikitech-static server, which is a rackspace server off cluster. [18:41:02] let me get you the whole config, in fact [18:41:08] so when cluster crashes, we have docs to revive it =] [18:41:14] andrewbogott: which tool do you want me to add it to?
[18:41:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.676 second response time [18:42:17] RobH: oh, rackspace finally added the image creation back :) [18:42:25] they say not to use it for backup, but it's better than nothing [18:42:39] it'll make images on a schedule [18:42:48] oh... that doesn't suck [18:42:57] indeed [18:42:57] yea its not best for backup i guess if it was primary backup [18:43:04] but for us its a backup of a static page [18:43:06] so its cool. [18:43:06] yep [18:43:34] i'll login to the interface later today and see about setting it up (unless you beat me to it) [18:43:45] oh, I need to upgrade wikitech-static [18:43:59] RobH: nah. go for it [18:44:10] Ryan_Lane, the 'morebots' tool can run it. [18:44:16] ok. cool [18:44:30] we just need to add a conf in ~/confs [18:44:50] ok. cool [18:44:53] I'll add it there [18:45:40] (CR) Cmjohnson: [C: 2 V: 2] adding dns entries for for virt1008/9 [operations/dns] - https://gerrit.wikimedia.org/r/82646 (owner: Cmjohnson) [18:45:51] Ryan_Lane: not clear in interface, is this billed for? [18:45:57] or simply something allowed? [18:45:58] it is not [18:46:03] cool [18:46:18] well, maybe it is for disk space used? I doubt it, though [18:46:35] they're pretty good about telling you when something costs money [18:46:36] heh, you can schedule a daily image [18:46:44] yeah [18:46:59] nothing greater than daily, so i'm going to put daily and keep 7 days (default) sound good? [18:47:14] does it sync at scheduled times with wikitech or stream? [18:47:25] (i dont wanna schedule this at same time i image if can help it) [18:47:46] heh, nm, they dont let you set the time to do the image creation [18:47:51] so its set for 7 days, daily. [18:48:27] yep [18:48:38] sounds good [18:50:10] Ryan_Lane: hey :) are you familiar with the way we apply iptables rules in puppet ?
I am wondering why we have a bunch of classes requiring each other [18:50:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:33] you should look at the new ferm support [18:50:38] the old iptables stuff sucks [18:52:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.957 second response time [18:52:49] Ryan_Lane: can't invest time figuring out ferm right now :-] [18:53:04] (PS3) Hashar: contint: publish Zuul git over git protocol [operations/puppet] - https://gerrit.wikimedia.org/r/82625 [18:53:05] I guess we have a bunch of requires to ensure the rules are applied in a predictable order [18:54:08] (CR) Hashar: "GIT_DAEMON_BASE_PATH set to the same value as GIT_DAEMON_DIRECTORY would strip the /git/ from the URL." [operations/puppet] - https://gerrit.wikimedia.org/r/82625 (owner: Hashar) [18:54:13] are you already using the old iptables support? [18:54:18] yup [18:54:20] ah [18:54:30] yes, it's to ensure they are added in a particular order [18:54:36] that got setup before I started :] [18:54:55] and I noticed the default rule is to accept. But I am too afraid to make it deny by default huhu [18:55:19] maybe you could introduce me to ferm next week? [18:55:27] I haven't used it [18:55:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:55:48] deny by default is not easy in the old system [18:55:55] ferm should make it much easier [18:56:09] so I fall back to explicit deny, not ideal though [18:56:17] (PS1) Ori.livneh: Bugfixes to StatsD Ganglia backend [operations/puppet] - https://gerrit.wikimedia.org/r/82648 [18:56:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [18:56:39] ottomata or Ryan_Lane, got a second for ^?
[18:56:43] RobH, half-assed docs: https://wikitech.wikimedia.org/wiki/Admin_Logs [18:57:15] heh, its a start =] [18:57:27] ori-l: ? [18:57:41] the changeset (https://gerrit.wikimedia.org/r/#/c/82648/) [19:00:07] andrewbogott: I added production-logbot.py [19:00:14] and did chmod o-r * [19:00:16] on the confs [19:00:32] ok, lemme try to start it up [19:02:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:03:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [19:06:19] PROBLEM - DPKG on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:07:19] RECOVERY - DPKG on analytics1014 is OK: All packages OK [19:09:17] (PS1) CSteipp: Enable Global AbuseFilter for more wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82649 [19:10:16] morebots, what's up? [19:10:16] I am a logbot running on tools-exec-07. [19:10:16] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [19:10:16] To log a message, type !log . [19:10:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:10:50] !log Moved morebots from wikitech-static to toollabs [19:10:53] Logged the message, Master [19:11:10] Ryan_Lane, did you kill the cron or whatever that was managing the bot on wikitech-static? [19:11:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.689 second response time [19:11:24] I purged the package [19:11:32] OK, then I think we're done. Thanks. [19:11:35] cool [19:11:36] thank you :) [19:14:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:42] (CR) Andrew Bogott: [C: 2] Turn on pluginsync.
[operations/puppet] - https://gerrit.wikimedia.org/r/77378 (owner: Andrew Bogott) [19:16:26] !log turning on pluginsync, which will (temporarily) make puppet even slower [19:16:29] Logged the message, Master [19:17:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [19:18:17] PROBLEM - RAID on analytics1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:20:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.109 second response time [19:26:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:27:37] PROBLEM - Puppet freshness on stafford is CRITICAL: No successful Puppet run in the last 10 hours [19:35:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.213 second response time [19:39:10] (PS1) Hashar: misc varnish conf for doc.wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/82653 [19:40:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.983 second response time [19:41:30] !log upgraded wikitech-static to 1.22wmf15 [19:41:33] Logged the message, Master [19:43:31] (PS1) Manybubbles: Add elasticsearch plugins.
[operations/puppet] - https://gerrit.wikimedia.org/r/82673 [19:44:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.411 second response time [19:50:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [19:57:12] ugh 214 DatabaseBase->reportQueryError('Deadlock found ...', 1213, 'INSERT INTO `r...', 'RecentChange::s...', false) [19:58:26] binasher, springle_away, everything's deadlocked ^^ :o [19:58:48] all on Commons [19:59:48] (CR) MaxSem: [C: 2] Provide new normalised forms for IPv6 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/81868 (owner: MaxSem) [20:01:32] (PS2) Ottomata: Bugfixes to StatsD Ganglia backend [operations/puppet] - https://gerrit.wikimedia.org/r/82648 (owner: Ori.livneh) [20:01:38] (CR) Ottomata: [C: 2 V: 2] Bugfixes to StatsD Ganglia backend [operations/puppet] - https://gerrit.wikimedia.org/r/82648 (owner: Ori.livneh) [20:03:32] (CR) MaxSem: [V: 2] Provide new normalised forms for IPv6 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/81868 (owner: MaxSem) [20:06:37] PROBLEM - MySQL Slave Delay on db38 is CRITICAL: CRIT replication delay 320 seconds [20:06:57] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 334 seconds [20:07:07] PROBLEM - MySQL Replication Heartbeat on db38 is CRITICAL: CRIT replication delay 342 seconds [20:24:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:43:16] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [20:43:16] PROBLEM - Puppet freshness on
analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [20:51:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.459 second response time [21:00:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:01:59] scapping... [21:02:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.047 second response time [21:04:36] PROBLEM - MySQL Slave Delay on db36 is CRITICAL: CRIT replication delay 3107 seconds [21:05:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:06:36] RECOVERY - MySQL Slave Delay on db36 is OK: OK replication delay 0 seconds [21:08:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.337 second response time [21:10:57] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment [21:11:00] Logged the message, Master [21:12:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:13:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.691 second response time [21:23:33] !log maxsem Finished syncing Wikimedia installation... 
: Weekly mobile deployment [21:23:36] Logged the message, Master [21:28:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:32:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [21:33:16] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [21:36:44] (PS1) Dzahn: add careers and jobs DNS entries as requested in RT #5709 [operations/dns] - https://gerrit.wikimedia.org/r/82750 [21:45:30] (PS1) CSteipp: Enable XFO: SAMEORIGIN for enwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82751 [21:46:57] (CR) Catrope: [C: 1] Enable XFO: SAMEORIGIN for enwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82751 (owner: CSteipp) [22:00:07] (PS1) Ryan Lane: Add service IP to production gerrit server [operations/puppet] - https://gerrit.wikimedia.org/r/82753 [22:04:30] (CR) Ryan Lane: [C: 2] Add service IP to production gerrit server [operations/puppet] - https://gerrit.wikimedia.org/r/82753 (owner: Ryan Lane) [22:12:06] (PS1) Ryan Lane: Add dhcp for virt12/15 [operations/puppet] - https://gerrit.wikimedia.org/r/82754 [22:13:27] (PS1) Ryan Lane: Make virt 12-15 compute nodes [operations/puppet] - https://gerrit.wikimedia.org/r/82755 [22:13:50] (CR) Ryan Lane: [C: 2] Add dhcp for virt12/15 [operations/puppet] - https://gerrit.wikimedia.org/r/82754 (owner: Ryan Lane) [22:17:29] (CR) Ryan Lane: [C: 2] Make virt 12-15 compute nodes [operations/puppet] - https://gerrit.wikimedia.org/r/82755 (owner: Ryan Lane) [22:22:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:23:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.247 second response time [22:27:52] PROBLEM - Puppetmaster HTTPS on
stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:29:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.240 second response time [22:30:16] (PS1) Ori.livneh: Transfer responsibility for socket to ganglia.js [operations/puppet] - https://gerrit.wikimedia.org/r/82758 [22:30:21] Ryan_Lane: that one [22:30:43] i tested/'deployed' it by puppetd --disable on the target node and copying the file into place manually [22:30:57] so it should be safe; it merely makes puppet agree with what's on disk. [22:32:04] heh. ugh on the Gmetric/gmetric vars :D [22:32:43] dude, it's so bad [22:32:53] i don't know how that guy posted it to github with his real name on it [22:32:58] :D [22:33:18] the node.js community is node.js's greatest enemy [22:33:29] (CR) Ryan Lane: [C: 2] Transfer responsibility for socket to ganglia.js [operations/puppet] - https://gerrit.wikimedia.org/r/82758 (owner: Ori.livneh) [22:34:07] ori-l: and to think, now it has your name associated with it ;) [22:34:32] heh [22:35:02] thanks, btw! [22:36:05] yw [22:40:50] ^d: so, the IP is added to the host [22:41:52] it's just a matter of changing DNS [22:41:55] I'm going to push that change in [22:43:03] (PS1) Ryan Lane: Switch gerrit to the service IP [operations/dns] - https://gerrit.wikimedia.org/r/82761 [22:44:57] <^d> Mmk. Have you at least reviewed my puppet changes? [22:45:08] <^d> I think that's it, then we can just merge when our window opens. [22:46:04] <^d> Whoops, it's got a path conflict, lemme resolve that now. [22:51:24] (PS6) Chad: Switch Gerrit from manganese to ytterbium [operations/puppet] - https://gerrit.wikimedia.org/r/81374 [22:52:52] PROBLEM - DPKG on mw1061 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:52:53] which puppet changes?
[22:53:33] <^d> Ryan_Lane: ^ [22:53:36] (PS1) Chad: Remove obsolete sudo setup [operations/puppet] - https://gerrit.wikimedia.org/r/82763 [22:53:39] <^d> 81374 [22:53:40] https://gerrit.wikimedia.org/r/#/c/81374/6/manifests/role/cache.pp,unified [22:53:40] ? [22:53:43] what's that? [22:54:07] <^d> That's for the misc. varnish cache that's going to sit in front of gerrit at some point. [22:54:22] <^d> Doesn't do anything atm, but didn't want it pointing to a dead box. [22:54:24] ah. I already added the IP this change is going to conflict [22:54:35] <^d> I rebased on top of yours [22:54:44] ah. ok [22:55:02] PROBLEM - Apache HTTP on mw1061 is CRITICAL: Connection refused [22:55:35] <^d> https://gerrit.wikimedia.org/r/#/c/81374/6/manifests/role/gerrit.pp looks more complicated than it is. Net effect is moving replication from production::old to production and removing ytterbium replication. [22:55:35] (CR) Ryan Lane: [C: 2] Switch Gerrit from manganese to ytterbium [operations/puppet] - https://gerrit.wikimedia.org/r/81374 (owner: Chad) [22:55:39] <^d> :) [22:55:52] RECOVERY - DPKG on mw1061 is OK: All packages OK [22:55:54] I guess I should have waited [22:55:56] heh [22:56:02] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [22:56:05] I didn't merge on sockpuppet yet [22:56:22] RECOVERY - search indices - check lucene status page on search1018 is OK: HTTP OK: HTTP/1.1 200 OK - 60696 bytes in 0.009 second response time [22:57:25] <^d> I'm going to stop puppet on manganese for a bit anyway. Want to bring gerrit down without puppet bringing it up until dns switches. [22:57:38] <^d> Also, I should force a replication to ytterbium one last time. [22:59:08] yep [22:59:51] <^d> Ok, that's done. [23:00:32] <^d> How do I stop puppet again?
[23:01:06] puppetd --disable [23:01:20] <^d> Too easy :) [23:01:35] <^d> !log stopped puppet on manganese [23:01:38] Logged the message, Master [23:01:58] hopefully that works anyway [23:01:59] :) [23:03:32] <^d> Ok, I think we can merge the dns change and on sockpuppet. [23:04:52] <^d> !log apache stopped on manganese [23:04:55] Logged the message, Master [23:05:05] hahaha [23:05:09] you need to start gerrit [23:05:20] <^d> I already started it on ytterbium. [23:05:28] <^d> Oh, duh. [23:05:30] <^d> So you can merge. [23:05:31] so, yeah, puppet-merge doesn't work [23:05:33] ;) [23:05:47] <^d> Ok, apache back. [23:05:50] also... [23:05:54] I need to merge the DNS change too [23:06:23] (03CR) 10Ryan Lane: [C: 032] Switch gerrit to the service IP [operations/dns] - 10https://gerrit.wikimedia.org/r/82761 (owner: 10Ryan Lane) [23:06:47] ok. done [23:06:54] you may want to do one last sync [23:06:57] then shut it off [23:07:16] and you'll want to run puppet on ytterbium :) [23:07:21] <^d> Yeah [23:07:23] otherwise I think we're good [23:09:22] well, I'm logged into gerrit [23:09:26] I'd imagine it's working :) [23:09:45] <^d> Replication is going one last time. [23:11:03] <^d> Puppet still churning. [23:12:59] <^d> Ok, everything's up on ytterbium afaict. [23:13:07] <^d> Forcing a replication from the new box to make sure it all goes out ok. [23:13:18] <^d> Reject hostKey [23:13:20] <^d> Dammit gerrit. [23:13:39] :D [23:14:16] ^d: thanks for the whisky btw :) [23:14:26] <^d> You're welcome :) [23:17:47] <^d> !log restarting gerrit one last time [23:17:50] Logged the message, Master [23:19:32] <^d> The hell? [23:20:49] weird page [23:21:07] service is down. [23:22:44] <^d> I keep getting the "reject hostKey" crap from replication. [23:22:46] <^d> Queue backs up. [23:22:57] <^d> But I can ssh to the boxes in question :\ [23:23:29] the authorized_keys file only has one entry [23:23:44] ah [23:23:47] /home/gerrit2? 
[23:24:21] <^d> /var/lib/gerrit2 [23:24:26] wrong homedir [23:24:39] <^d> This is identical to how it was on manganese. [23:24:51] I just updated the homedir [23:24:52] <^d> Oh, wait, homedir is set wrong? [23:24:56] <^d> Ahhh, that might do it. [23:25:24] you may want to restart the service :) [23:25:45] <^d> Did. [23:26:12] didn't you have "gerritslave" as the replication user [23:26:22] 318 # Setup the `gerritslave` account on any host that wants to receive [23:26:26] 319 # replication. [23:26:42] <^d> The destinations hadn't changed :) [23:27:18] <^d> It's working now. [23:27:22] is grrrit-wm down? [23:27:28] perhaps because of the switchover [23:27:42] <^d> It's on a new box. [23:27:54] oh, no, grrrit-wm isn't down, I saw a message go past now [23:27:55] in -dev [23:27:57] so nevermind [23:28:04] <^d> Ryan_Lane: Chrome's yelling at me and won't let me connect to the UI. [23:28:11] I was wondering about the bot, not gerrit. [23:28:26] firefox isn't working either [23:28:42] <^d> Unable to make a secure connection to the server. This may be a problem with the server, or it may be requiring a client authentication certificate that you don't have. [23:28:42] <^d> Error code: ERR_SSL_PROTOCOL_ERROR [23:28:51] (PS1) Matthias Mullie: Enable Flow on labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82766 [23:29:21] gerrit works for me [23:29:51] logging in, listing changes .. [23:29:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:30:23] hm [23:30:37] apache needed a restart [23:30:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.456 second response time [23:31:07] it's working now [23:31:48] (CR) Matthias Mullie: [C: -1] "@todo: Parsoid config?"
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82766 (owner: Matthias Mullie) [23:32:16] <^d> Seems fine [23:38:29] <^d> Well out of ~4300 replication jobs, only 21 failed. Not as bad as I feared. [23:44:09] so, we good to go? :) [23:44:31] yo ^d; is it already possible to delete gerrit repos? [23:44:56] <^d> Yeah mostly. [23:45:08] awesome! can I do it? [23:45:23] <^d> Prolly not, I think it's Admins+ops. [23:45:36] <^d> You have to manually clean up gitblit & github afterwards too :\ [23:46:31] <^d> Send me an e-mail. There's a couple others I promised to delete for someone, I'll do them as a batch tomorrow. [23:46:56] k, i will create an inventory list first, but thanks! [23:48:11] <^d> Ryan_Lane: Yeah, I think so. I've got a minor todo list for the few that failed replication, but I think we're done. [23:54:56] <^d> Ryan_Lane: Thanks for all your help. We may tweak things down the road to really take advantage of the new hardware :) [23:56:07] ^d: cool. sounds good. [23:56:11] good job :) [23:59:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds