[00:29:25] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [00:41:26] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [01:13:35] PROBLEM - Puppet freshness on db1011 is CRITICAL: Last successful Puppet run was Wed 06 Aug 2014 23:13:18 UTC [01:13:35] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 06 Aug 2014 17:10:26 UTC [01:20:35] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [02:16:48] !log LocalisationUpdate completed (1.24wmf15) at 2014-08-07 02:15:45+00:00 [02:16:58] Logged the message, Master [02:20:26] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures [02:21:25] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures [02:22:25] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures [02:22:25] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [02:23:25] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [02:24:25] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 1 failures [02:24:25] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures [02:24:25] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [02:24:55] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: Puppet has 1 failures [02:26:25] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 1 failures [02:26:25] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [02:28:25] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures [02:28:55] !log LocalisationUpdate completed (1.24wmf16) at 2014-08-07 02:27:52+00:00 [02:29:01] Logged the message, Master [02:29:15] PROBLEM - puppet last run on lvs4002 
is CRITICAL: CRITICAL: Puppet has 1 failures [02:30:25] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures [02:31:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [02:33:25] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [02:34:25] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [02:34:25] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [02:34:25] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [02:34:25] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [02:34:55] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: Puppet has 1 failures [02:35:25] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 1 failures [02:35:25] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 1 failures [02:35:25] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 1 failures [02:36:25] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 1 failures [02:43:05] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: Puppet has 1 failures [03:03:22] ^ all of the above seem to be due to "apt-get update" failures from ulsfo, if anyone has a clue about that [03:04:45] doesn't seem to be our repo, seems to be some issue from ulsfo to security.ubuntu.com. I'm running one now and it's going ridiculously slowly. [03:05:06] probably ubuntu's problem, with whichever security mirror ulsfo happens to pick? 
[03:06:45] hmmm no scratch that, problems with xfer speed from ubuntu.wm.o as well [03:07:35] public traffic stats don't seem to have dropped off or anything in ulsfo, though, so I don't think it's a huge general network issue there [03:09:55] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Aug 7 03:08:49 UTC 2014 (duration 8m 48s) [03:10:01] Logged the message, Master [03:11:26] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [03:13:15] RECOVERY - Puppet freshness on db1011 is OK: puppet ran at Thu Aug 7 03:13:04 UTC 2014 [03:14:35] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 06 Aug 2014 17:10:26 UTC [03:21:35] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [03:33:45] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [03:37:25] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [03:43:59] (PS1) Springle: Make live MariaDB labsdb config changes stick. [operations/puppet] - https://gerrit.wikimedia.org/r/152209 [03:48:08] !log labsdb1001 restart [03:48:13] Logged the message, Master [03:48:36] PROBLEM - Host labsdb1001 is DOWN: PING CRITICAL - Packet loss = 100% [03:50:25] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [03:51:25] RECOVERY - Host labsdb1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [03:51:25] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [03:51:40] (CR) Springle: [C: 2] Make live MariaDB labsdb config changes stick. 
[operations/puppet] - https://gerrit.wikimedia.org/r/152209 (owner: Springle) [03:54:48] (PS1) Springle: Move MariaDB 10 labsdbs instances back to port 3306 after upgrade. [operations/puppet] - https://gerrit.wikimedia.org/r/152210 [03:55:35] PROBLEM - Puppet freshness on cp4020 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 01:54:46 UTC [03:55:48] (CR) Springle: [C: 2] Move MariaDB 10 labsdbs instances back to port 3306 after upgrade. [operations/puppet] - https://gerrit.wikimedia.org/r/152210 (owner: Springle) [03:56:35] PROBLEM - Puppet freshness on cp4009 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 01:55:46 UTC [03:57:25] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [03:58:26] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [03:58:35] PROBLEM - Puppet freshness on cp4011 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 01:58:24 UTC [03:58:35] PROBLEM - Puppet freshness on cp4016 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 01:58:24 UTC [03:59:35] PROBLEM - Puppet freshness on cp4010 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 01:58:49 UTC [04:00:35] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 01:59:30 UTC [04:00:35] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 02:00:05 UTC [04:00:35] PROBLEM - Puppet freshness on cp4015 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 01:59:35 UTC [04:01:35] PROBLEM - Puppet freshness on cp4017 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 02:00:36 UTC [04:02:35] PROBLEM - Puppet freshness on cp4007 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 02:02:17 UTC [04:04:35] PROBLEM - Puppet freshness on lvs4002 is CRITICAL: Last successful Puppet run was Thu 07 Aug 
2014 02:03:38 UTC [04:05:25] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [04:09:35] PROBLEM - Puppet freshness on cp4003 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 02:08:42 UTC [04:09:35] PROBLEM - Puppet freshness on cp4004 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 02:09:13 UTC [04:10:35] PROBLEM - Puppet freshness on cp4014 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 02:09:43 UTC [04:10:35] PROBLEM - Puppet freshness on lvs4003 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 02:10:14 UTC [04:11:35] PROBLEM - Puppet freshness on cp4001 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 02:10:34 UTC [04:11:35] PROBLEM - Puppet freshness on cp4005 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 02:10:44 UTC [04:12:35] PROBLEM - Puppet freshness on cp4018 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 02:11:30 UTC [04:14:25] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [04:15:35] RECOVERY - Puppet freshness on cp4009 is OK: puppet ran at Thu Aug 7 04:15:26 UTC 2014 [04:16:25] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 1 failures [04:19:35] PROBLEM - Puppet freshness on lvs4004 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 02:19:21 UTC [04:22:25] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures [04:23:15] RECOVERY - Puppet freshness on lvs4002 is OK: puppet ran at Thu Aug 7 04:23:12 UTC 2014 [04:23:25] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures [04:27:55] RECOVERY - Puppet freshness on cp4003 is OK: puppet ran at Thu Aug 7 04:27:46 UTC 2014 [04:29:35] RECOVERY - Puppet freshness on cp4005 is OK: puppet ran at Thu Aug 7 04:29:28 UTC 2014 [04:30:25] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 
failures [04:30:26] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [04:31:26] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [04:31:26] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures [04:35:15] RECOVERY - Puppet freshness on cp4020 is OK: puppet ran at Thu Aug 7 04:35:12 UTC 2014 [04:37:26] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [04:37:46] RECOVERY - Puppet freshness on cp4016 is OK: puppet ran at Thu Aug 7 04:37:44 UTC 2014 [04:38:15] RECOVERY - Puppet freshness on cp4011 is OK: puppet ran at Thu Aug 7 04:38:14 UTC 2014 [04:38:45] RECOVERY - Puppet freshness on cp4015 is OK: puppet ran at Thu Aug 7 04:38:40 UTC 2014 [04:38:55] RECOVERY - Puppet freshness on cp4010 is OK: puppet ran at Thu Aug 7 04:38:45 UTC 2014 [04:39:45] RECOVERY - Puppet freshness on cp4017 is OK: puppet ran at Thu Aug 7 04:39:35 UTC 2014 [04:40:15] RECOVERY - Puppet freshness on cp4013 is OK: puppet ran at Thu Aug 7 04:40:06 UTC 2014 [04:40:25] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [04:40:26] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [04:40:26] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [04:40:26] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [04:40:35] PROBLEM - Puppet freshness on db1010 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 02:39:52 UTC [04:41:35] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 02:40:28 UTC [04:41:45] RECOVERY - Puppet freshness on cp4007 is OK: puppet ran at Thu Aug 
7 04:41:37 UTC 2014 [04:42:25] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [04:42:26] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [04:43:26] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [04:46:25] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [04:48:06] RECOVERY - Puppet freshness on cp4014 is OK: puppet ran at Thu Aug 7 04:48:02 UTC 2014 [04:48:15] RECOVERY - Puppet freshness on cp4004 is OK: puppet ran at Thu Aug 7 04:48:07 UTC 2014 [04:49:05] RECOVERY - Puppet freshness on cp4001 is OK: puppet ran at Thu Aug 7 04:48:59 UTC 2014 [04:49:52] (CR) Yurik: [C: 1] Log when Internet.org in X-Analytics with proxy tag (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/151687 (owner: Dr0ptp4kt) [04:50:05] RECOVERY - Puppet freshness on cp4018 is OK: puppet ran at Thu Aug 7 04:50:00 UTC 2014 [04:50:26] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [04:50:26] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [04:51:26] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [04:52:26] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [04:54:25] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [04:54:26] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [04:57:26] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 1 failures [04:58:26] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 26 
seconds ago with 0 failures [04:59:25] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [05:00:35] RECOVERY - Puppet freshness on db1010 is OK: puppet ran at Thu Aug 7 05:00:27 UTC 2014 [05:01:25] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures [05:05:15] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [05:05:25] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 1 failures [05:05:26] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [05:05:26] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [05:06:25] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 1 failures [05:06:25] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures [05:07:26] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [05:08:26] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures [05:09:35] PROBLEM - Puppet freshness on cp4019 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 03:08:25 UTC [05:11:26] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures [05:12:26] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [05:15:25] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [05:15:26] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [05:15:35] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 06 Aug 2014 17:10:26 UTC [05:16:26] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 1 failures [05:17:26] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 1 failures [05:18:26] RECOVERY - puppet last run on cp4002 is 
OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [05:19:26] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [05:19:26] RECOVERY - Puppet freshness on lvs4001 is OK: puppet ran at Thu Aug 7 05:19:24 UTC 2014 [05:20:26] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Thu Aug 7 05:20:19 UTC 2014 [05:21:26] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [05:21:55] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [05:22:25] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [05:22:25] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [05:22:35] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [05:23:25] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures [05:23:25] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [05:25:26] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [05:29:06] RECOVERY - Puppet freshness on cp4019 is OK: puppet ran at Thu Aug 7 05:29:02 UTC 2014 [05:30:15] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: Puppet has 1 failures [05:30:25] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [05:30:26] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [05:32:26] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [05:37:26] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: 
Puppet has 1 failures [05:37:26] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [05:38:25] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [05:39:26] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [05:39:30] !log labsdb1002 restart [05:39:35] Logged the message, Master [05:39:46] PROBLEM - Host labsdb1002 is DOWN: PING CRITICAL - Packet loss = 100% [05:40:26] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [05:43:25] RECOVERY - Host labsdb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [05:43:26] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures [05:44:26] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [05:45:56] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: Puppet has 1 failures [05:46:26] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [05:46:26] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures [05:47:25] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 1 failures [05:47:25] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [05:48:25] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [05:50:26] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures [05:50:26] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [05:51:26] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [05:52:25] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is 
currently enabled, last run 54 seconds ago with 0 failures [05:54:26] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [05:55:26] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [05:56:26] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [05:58:26] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:01:26] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:03:25] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures [06:05:26] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [06:06:26] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 1 failures [06:07:26] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:08:26] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:11:35] PROBLEM - Puppet freshness on lvs4003 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 02:10:14 UTC [06:14:25] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:15:26] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:15:26] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:16:25] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:20:26] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:20:35] PROBLEM - Puppet freshness on lvs4004 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 02:19:21 UTC [06:23:26] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:26:05] (PS1) Springle: process monitoring for labsdb [operations/puppet] - 
https://gerrit.wikimedia.org/r/152216 [06:27:01] (CR) Springle: [C: 2] process monitoring for labsdb [operations/puppet] - https://gerrit.wikimedia.org/r/152216 (owner: Springle) [06:27:26] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:06] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:06] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:25] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:35] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:45] <_joe_> whoa [06:30:26] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:35] PROBLEM - puppet last run on mw1068 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:35] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:55] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:06] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:12] <_joe_> wtf [06:31:15] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:16] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:16] PROBLEM - puppet last run on mw1088 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:35] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:26] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:35:06] yeah [06:35:26] I've been looking (again), and I think it is a general network issue [06:35:48] my current thinking is that we're out of bandwidth on a private MPLS link between ulsfo and eqiad [06:36:25] so most of the world is reaching ulsfo fine, but ulsfo isn't reaching eqiad so great [06:36:26] RECOVERY - puppet last run on 
cp4011 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:36:34] which kinda sucks since that's its cache source :P [06:37:26] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:37:28] (as I've said before!) I don't have a good grasp on a lot of the details of our network setup [06:38:11] but from what I can tell looking at router configs, basically cr1-ulsfo has a 2Gbps MPLS link to eqiad, and cr2-ulsfo has a separate 10Gbps link to eqiad with a different provider. [06:38:49] and for some reason a large chunk of the traffic prefers the 2Gbps link, and it's saturated in one direction [06:39:35] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [06:39:38] (the outbound direction... and I'm not really sure why ulsfo outbound to eqiad would be big, seeing as it should be mostly caching data *from* eqiad) [06:40:02] _joe_: ^ [06:40:35] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:41:05] maybe someone's done something recently to generate a ton more ulsfo -> eqiad traffic. maybe some crazy monitoring traffic or something? 
[06:42:25] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:42:38] (I'm also not really sure why the traffic likes being on the smaller interface via cr1 and/or if that's intentional for some reason or other) [06:42:58] mark: ^ (if you're awake) [06:43:26] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:45:35] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:46:05] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:46:06] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:46:06] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:46:25] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:46:25] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:46:25] RECOVERY - puppet last run on mw1088 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:46:25] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:46:35] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:46:35] RECOVERY - puppet last run on mw1068 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:46:35] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:46:35] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:46:55] RECOVERY - puppet last run on mw1009 is OK: 
OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:47:36] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:48:16] <_joe_> bblack: I figured it was a network issue tonight as only ulsfo servers were failing [06:48:39] when I saw it earlier I thought perhaps it was just related to apt-get fetches, but it's really not [06:48:55] it's causing slowdowns for any eqiad<->ulsfo traffic [06:48:57] <_joe_> what were the failures due to? [06:49:03] <_joe_> mmmh [06:49:06] the puppet failures are apt-get failures [06:49:22] (because fetching from carbon is super-slow and times out) [06:50:37] I'm tempted to try to divert the traffic to the 10G link somehow, but I'd be stabbing in the dark on doing that right and not making things worse. [06:51:22] <_joe_> oh I can't really help you [06:52:35] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:53:44] good morning [06:54:26] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:40] hey [07:02:26] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures [07:02:35] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures [07:05:35] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 1 failures [07:05:35] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [07:07:26] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 1 failures [07:09:26] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures [07:14:24] hey [07:16:15] (PS1) Giuseppe Lavagetto: geodns move all traffic from ulsfo to eqiad [operations/dns] - https://gerrit.wikimedia.org/r/152219 [07:16:35] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 06 Aug 2014 17:10:26 UTC [07:17:35] PROBLEM - puppet 
last run on cp4005 is CRITICAL: CRITICAL: Puppet has 1 failures [07:19:35] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 05:19:24 UTC [07:23:35] PROBLEM - Puppet freshness on cp4006 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 05:23:17 UTC [07:23:35] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [07:28:35] PROBLEM - Puppet freshness on cp4004 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 05:27:51 UTC [07:28:37] (CR) QChris: [C: 1] Log when Internet.org in X-Analytics with proxy tag [operations/puppet] - https://gerrit.wikimedia.org/r/151687 (owner: Dr0ptp4kt) [07:29:35] PROBLEM - Puppet freshness on cp4019 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 05:29:02 UTC [07:30:35] PROBLEM - Puppet freshness on cp4018 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 05:29:37 UTC [07:36:35] PROBLEM - Puppet freshness on cp4012 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 05:36:08 UTC [07:39:05] RECOVERY - Puppet freshness on lvs4004 is OK: puppet ran at Thu Aug 7 07:38:57 UTC 2014 [07:39:06] RECOVERY - Puppet freshness on cp4012 is OK: puppet ran at Thu Aug 7 07:39:02 UTC 2014 [07:39:24] !log Set OSPF metric 1000 on cr2-eqiad:xe-5/2/2 (GTT link) [07:39:30] Logged the message, Master [07:40:49] RECOVERY - Puppet freshness on lvs4001 is OK: puppet ran at Thu Aug 7 07:40:43 UTC 2014 [07:41:19] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [07:41:29] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [07:41:49] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [07:41:49] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 
failures [07:41:59] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [07:42:16] (PS1) Springle: Move s4 commonswiki api traffic away from db1042. Blocking a schema change and leading to max_connections. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/152222 [07:42:19] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [07:42:29] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [07:42:30] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [07:42:44] (PS2) Giuseppe Lavagetto: geodns move uploads traffic from ulsfo to eqiad [operations/dns] - https://gerrit.wikimedia.org/r/152219 [07:42:55] <_joe_> mark: ^^ [07:42:58] (CR) Springle: [C: 2] Move s4 commonswiki api traffic away from db1042. Blocking a schema change and leading to max_connections. 
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/152222 (owner: Springle) [07:43:09] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [07:43:30] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 358 seconds [07:43:41] yes, go ahead [07:44:00] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [07:44:20] !log springle Synchronized wmf-config/db-eqiad.php: move s4 api traffic to db1056 (duration: 00m 07s) [07:44:26] Logged the message, Master [07:44:29] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [07:44:32] (CR) Giuseppe Lavagetto: [C: 2] geodns move uploads traffic from ulsfo to eqiad [operations/dns] - https://gerrit.wikimedia.org/r/152219 (owner: Giuseppe Lavagetto) [07:44:34] (CR) Mark Bergsma: [C: 2] geodns move uploads traffic from ulsfo to eqiad [operations/dns] - https://gerrit.wikimedia.org/r/152219 (owner: Giuseppe Lavagetto) [07:45:29] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay -1 seconds [07:46:39] PROBLEM - Puppet freshness on cp4003 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 05:46:21 UTC [07:48:39] PROBLEM - Puppet freshness on cp4001 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 05:48:28 UTC [07:48:39] PROBLEM - Puppet freshness on cp4014 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 05:48:02 UTC [07:50:39] PROBLEM - Puppet freshness on bast4001 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 05:49:29 UTC [07:52:19] RECOVERY - Puppet freshness on lvs4003 is OK: puppet ran at Thu Aug 7 07:52:13 UTC 2014 [07:52:29] RECOVERY - Puppet freshness on cp4019 is OK: puppet ran at Thu Aug 7 07:52:23 UTC 2014 [07:52:30] RECOVERY - Puppet freshness on bast4001 is OK: puppet ran at Thu Aug 7 07:52:23 
UTC 2014 [07:52:39] RECOVERY - Puppet freshness on cp4018 is OK: puppet ran at Thu Aug 7 07:52:34 UTC 2014 [07:53:09] RECOVERY - puppet last run on lvs4003 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [07:53:29] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [07:53:59] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [07:54:19] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [07:54:30] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [07:56:20] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [07:56:36] <_joe_> that looks better [08:01:30] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [08:02:30] RECOVERY - Puppet freshness on cp4006 is OK: puppet ran at Thu Aug 7 08:02:27 UTC 2014 [08:03:19] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [08:04:00] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [08:04:59] RECOVERY - Puppet freshness on cp4003 is OK: puppet ran at Thu Aug 7 08:04:54 UTC 2014 [08:05:39] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [08:06:09] RECOVERY - Puppet freshness on cp4014 is OK: puppet ran at Thu Aug 7 08:06:00 UTC 2014 [08:06:30] RECOVERY - Puppet freshness on cp4004 is OK: puppet ran at Thu Aug 7 08:06:25 UTC 2014 [08:06:59] RECOVERY - Puppet freshness on cp4001 is OK: puppet ran at Thu Aug 7 08:06:51 UTC 2014 [08:07:29] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 46 
seconds ago with 0 failures [08:07:30] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [08:07:39] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [08:08:39] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [08:08:50] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [08:32:32] !log Jenkins: switching [https://integration.wikimedia.org/ci/job/analytics-libcidr/|analytics-libcdr job] from https://github.com/wmf-analytics/libcidr/ to https://gerrit.wikimedia.org/r/analytics/libcidr [08:32:38] Logged the message, Master [08:35:29] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 302 seconds [08:35:29] PROBLEM - MySQL Slave Delay on db1042 is CRITICAL: CRIT replication delay 304 seconds [08:40:30] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay -1 seconds [08:40:30] RECOVERY - MySQL Slave Delay on db1042 is OK: OK replication delay 0 seconds [08:48:22] (CR) Alexandros Kosiaris: "Sigh, my fault. Thanks for fixing it Daniel" [operations/puppet] - https://gerrit.wikimedia.org/r/148836 (owner: Physikerwelt) [08:49:50] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [09:07:08] akosiaris: morning, have you seen my question from yesterday ? [09:10:21] Notice: Finished catalog run in 4364.03 seconds [09:10:21] \O/ [09:10:26] (that is on labs) [09:10:29] <_joe_> hashar: uh? [09:10:47] that is the first puppet run on a Jenkins slave in labs :-] [09:10:54] it installs the whole world [09:11:30] <_joe_> and puppet does ensure => inefficient [09:13:23] !log Jenkins: polling a new Jenkins slave using Trusty integration-slave1006-trusty [10.68.17.223] with 4 CPU. 
Copy pasted from 1004-trusty [09:13:28] Logged the message, Master [09:14:40] (PS2) Alexandros Kosiaris: TXT records for google verification [operations/dns] - https://gerrit.wikimedia.org/r/152159 (owner: Dzahn) [09:15:45] matanya: about backup::client ? yes. leave it as is, no reason to mess with it [09:16:24] akosiaris: the question came from mutante. he wanted to put firewall on ytterbium [09:16:39] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 06 Aug 2014 17:10:26 UTC [09:19:16] (CR) Alexandros Kosiaris: "I updated the change to have all of the required domains on RT #8068. The single exception being wmflabs.org which is not managed via git/" [operations/dns] - https://gerrit.wikimedia.org/r/152159 (owner: Dzahn) [09:20:04] matanya: he can do it then, no reason not to [09:20:14] mutante: ^ :) [09:20:22] (CR) Alexandros Kosiaris: [C: 2] "I updated the change to have all of the required domains on RT #8068. The single exception being wmflabs.org which is not managed via git/" [operations/dns] - https://gerrit.wikimedia.org/r/152159 (owner: Dzahn) [09:21:02] (CR) Reza: [C: 1] "Wonderful!" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/152042 (https://bugzilla.wikimedia.org/69171) (owner: Calak) [09:23:39] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [09:24:22] (PS1) Hashar: contint: rubygems is provided by ruby since Trusty [operations/puppet] - https://gerrit.wikimedia.org/r/152229 [09:29:03] (CR) Hashar: [V: 1] "Cherry picked on integration puppet master. That got rid of the issue on the Trusty instance and the Precise instances are happy." 
[operations/puppet] - https://gerrit.wikimedia.org/r/152229 (owner: Hashar) [09:39:26] greetings [09:39:58] <_joe_> ciao godog [09:41:35] ciao _joe_ [09:54:59] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 19 data above and 9 below the confidence bounds [10:05:40] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [10:28:40] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [10:29:04] (CR) Alexandros Kosiaris: [C: 2] contint: rubygems is provided by ruby since Trusty [operations/puppet] - https://gerrit.wikimedia.org/r/152229 (owner: Hashar) [10:29:47] akosiaris: cool, thx for doing all the TXT records for google verification [10:30:10] also, no more messing with file uploads into docroot [10:31:23] (CR) Dzahn: "matanya, i saw some update on this in backlog. not -1 anymore?" [operations/puppet] - https://gerrit.wikimedia.org/r/133517 (owner: Dzahn) [10:32:13] (CR) Matanya: [C: 1] "yes, it was fixed, feel free to merge, per alex." [operations/puppet] - https://gerrit.wikimedia.org/r/133517 (owner: Dzahn) [10:34:15] mutante: :-) [10:34:42] akosiaris: woo [10:34:44] Yes, thanks :) [10:35:02] I think there's a couple of google files laying around that we should probably tidy up too [10:35:39] true, yes [10:35:47] Reedy: yup.. was next on my list.. Not sure how to do it however. James mentioned they are not in git ? [10:36:01] let me check [10:36:47] i remember uploading one for wikivoyage [10:37:08] I only found 2 on tin [10:37:16] wikivoyage.org [10:37:16] wikipedia.org [10:37:29] it's possible that's all there is [10:37:40] yup [10:38:23] yeah, looks to be [10:38:26] What's the mywotf83102384bfcd1e52152.html again? [10:38:41] heh ? [10:38:51] first time I see that [10:39:10] myWTF ? 
[10:39:27] i do remember it [10:40:08] https://www.mywot.com/wiki/Verify_your_website [10:40:14] "Web of Trust" [10:40:46] "After you have verified your site with WOT, you may remove the META tag, generated Keyword, or the uploaded HTML file." [10:40:47] DELETE [10:40:52] cant do " This choice requires having FTP access." :) [10:45:02] !log iron, bast1001 - installed package upgrades [10:45:07] Logged the message, Master [10:45:55] Reedy: https://gerrit.wikimedia.org/r/#/c/152154/ ? [10:46:13] all, can we kill tarin? [10:46:29] tampa poolcounter [10:46:40] (PS1) Giuseppe Lavagetto: ssl proxies: use ssl_ciphersuite [operations/puppet] - https://gerrit.wikimedia.org/r/152248 [10:46:53] mutante: will check [10:47:47] thanks Reedy [10:48:25] added Aaron [10:48:49] Ugh [10:48:56] I really should use mosh or something [10:48:58] ssh dropping [10:49:33] '10.64.0.179', [10:49:33] '10.64.16.152' [10:49:51] helium and potassium [10:50:06] ack, both eqiad [10:50:34] (CR) Reedy: [C: 1] "LGTM. We have no PMTPA config for PoolCounter, and using it would just be a pointless extended RTT." [operations/puppet] - https://gerrit.wikimedia.org/r/152154 (owner: Dzahn) [10:50:39] :) [10:54:25] (PS2) Giuseppe Lavagetto: ssl proxies: use ssl_ciphersuite [operations/puppet] - https://gerrit.wikimedia.org/r/152248 [10:54:54] (PS2) Dzahn: enable firewall on ytterbium [operations/puppet] - https://gerrit.wikimedia.org/r/133517 [10:55:17] mutante: How many servers are left now? 
:) [10:56:20] Reedy: 9 [10:56:30] lol [10:56:31] per misc_pmtpa dsh group [10:56:44] fucking serious SSH [10:56:47] well, under 10 :) [10:57:06] <_joe_> mutante: I'd try a ping sweep of the pmtpa subnets ;) [10:57:58] https://chrome.google.com/webstore/detail/mosh/ooiklbnjmhbcgemelgfhaeaocllobloj [10:58:52] _joe_: i actually did a couple months ago to compare to dsh groups, afair everything else was network equipment (and nfs) [10:59:09] but never hurts to do again, yea [11:00:06] Reedy: /puppet$ git log | grep mosh [11:00:21] mutante: isn't tarin also a ganglia aggregator for tampa ? [11:00:29] I just checked it's on bast1001 [11:00:40] (CR) Giuseppe Lavagetto: "Sorry, I was a little cryptic. We're adding a dedicated virtualhost on all mediawiki appservers for toollabs, just to make a redirect. I a" [operations/puppet] - https://gerrit.wikimedia.org/r/151523 (https://bugzilla.wikimedia.org/60238) (owner: Tim Landscheidt) [11:01:02] akosiaris: you once said "let's keep it until the very end".. but i hope that end has come since the appservers are gone [11:01:22] that is what I remember... [11:01:23] heh [11:01:30] so 9 servers in pmtpa... hmmm [11:01:46] can we kill old mail and dns servers yet? [11:01:50] that is the main question to remove more [11:01:52] lol, pdf2 and pdf3 [11:02:22] Reedy: indeed, i guess now that OCG is here .. ?! [11:02:26] mchenry we probably can [11:02:36] it's still not "production ready" afaik [11:02:38] although I might be missing something [11:02:55] dns, well the moment there is no service in tampa, yes [11:02:58] removing mchenry would be fun [11:03:03] but there still is the PDF service, right ? 
[11:03:08] (CR) Giuseppe Lavagetto: "To better formulate my objection: can someone give me a reason why this virtual host cannot be set up on the same server that serves tool" [operations/puppet] - https://gerrit.wikimedia.org/r/151523 (https://bugzilla.wikimedia.org/60238) (owner: Tim Landscheidt) [11:03:46] akosiaris: yes, the old pdf service is still what is default (ie: vast majority of use) [11:04:43] greg-g: thought so, thanks [11:05:34] * greg-g nods [11:05:34] (CR) Dzahn: [C: 2] "adds firewall on gerrit server, we have holes for http,https and gerrit's high ssh port" [operations/puppet] - https://gerrit.wikimedia.org/r/133517 (owner: Dzahn) [11:06:27] let's see what breaks now... [11:06:34] :p [11:07:07] is anyone working on the mail bounce issue? [11:07:24] cuz i have a hundred or so bounces from polonium to rlane@ due to it not existing [11:07:31] nope [11:07:37] sounds like it got deleted by oit [11:07:59] ok, iptables rules are there now [11:08:01] ryan's legacy permeates [11:08:02] gerrit still ok ? [11:08:21] <^d> Is it ever? [11:08:38] heh [11:09:07] hrmm, whatever the heck is firing them off needs to be corrected, i imagine to coren's email from ryans? [11:09:23] mutante: seems so [11:09:29] for www-data@graphite-test.eqiad.wmflabs; Thu, 07 Aug 2014 11:05:02 +0000 [11:09:39] who is running the graphite tests? [11:09:54] in labs [11:10:24] I am removing rlane from root@wikimedia.org [11:10:35] ahh, thats it then, cool [11:10:59] akosiaris: thx! [11:11:00] not sure that will be all of it though [11:11:09] we'll know soon enough ;] [11:11:09] (PS3) Giuseppe Lavagetto: ssl proxies: use ssl_ciphersuite [operations/puppet] - https://gerrit.wikimedia.org/r/152248 [11:11:09] this is ryan we are talking about [11:11:57] !log removed rlane from root@wikimedia.org and usability@wikimedia.org [11:12:03] Logged the message, Master [11:12:22] hmm and now laner@wikimedia.org no longer makes sense... cause it points to rlane... 
removing that too [11:12:34] usability? haha [11:12:57] (CR) Dzahn: [C: 1] "let's have MaxSem confirm , merge at Wikimania" [operations/puppet] - https://gerrit.wikimedia.org/r/117673 (owner: Matanya) [11:13:00] !log removed laner@wikimedia.org entirely. It pointed to rlane@wikimedia.org which no longer exists [11:13:06] Logged the message, Master [11:14:29] (CR) Dzahn: "ACCEPT tcp -- anywhere anywhere tcp dpt:29418" [operations/puppet] - https://gerrit.wikimedia.org/r/133517 (owner: Dzahn) [11:15:03] (PS1) Hoo man: No longer allow voyage 'crats to usermerge [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/152255 [11:15:31] (CR) Dzahn: [C: 2] "just removing tabs but touches all icinga check commands" [operations/puppet] - https://gerrit.wikimedia.org/r/152155 (owner: Dzahn) [11:16:09] robh: seems like that removing ryan from root@wikimedia.org was sufficient for my inbox. I don't dare think what might crawl up in labs however [11:16:27] \o/ [11:16:40] mutante, as if I know:P [11:17:39] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 06 Aug 2014 17:10:26 UTC [11:18:17] MaxSem: heh, not the ferm stuff itself, more like "does vumi need anything else besides redis port" [11:18:37] yeah, exactly what I mean;) [11:18:51] ok:) [11:18:57] (CR) Legoktm: [C: 1] No longer allow voyage 'crats to usermerge [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/152255 (owner: Hoo man) [11:19:17] I haven't actually run it when developing for it, I did it the hardcore TDD way:P [11:19:20] i can look at netstat or tcpdump later [11:19:28] heh:) [11:20:20] (CR) Dzahn: "we should check if it really doesnt use any other port (netstat,tcpdump)" [operations/puppet] - https://gerrit.wikimedia.org/r/117673 (owner: Matanya) [11:22:06] (PS4) Giuseppe Lavagetto: ssl proxies: use ssl_ciphersuite [operations/puppet] - https://gerrit.wikimedia.org/r/152248 [11:22:08] (CR) Hoo 
man: [C: 2] "We just decided that this is a necessary step needed for SUL finalization and given how powerful these tools are we should do this now. Th" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/152255 (owner: Hoo man) [11:22:23] (Merged) jenkins-bot: No longer allow voyage 'crats to usermerge [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/152255 (owner: Hoo man) [11:23:01] (CR) Chad: "Can we undeploy the extension then since nobody uses it?" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/152255 (owner: Hoo man) [11:23:08] <^d> hoo: ^ [11:23:30] ^d: we need it for global user merge in a few weeks [11:23:40] https://bugzilla.wikimedia.org/show_bug.cgi?id=68844 [11:23:52] <^d> Isn't global user merge a different extension? [11:24:00] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [11:24:18] it uses the underlying code from UserMerge [11:24:58] <^d> bleh [11:25:15] https://gerrit.wikimedia.org/r/#/c/144644/15/LocalUserMergeJob.php,cm [11:25:38] !log hoo Synchronized wmf-config/InitialiseSettings.php: I53f76a35ac - No longer allow voyage 'crats to usermerge (duration: 00m 15s) [11:25:40] who the hell did local modifications on tin, btw? [11:25:41] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [11:25:43] Logged the message, Master [11:25:56] hoo: what are you doing on tin? 
;) [11:26:43] ACKNOWLEDGEMENT - DPKG on pdf2 is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [11:26:43] ACKNOWLEDGEMENT - Disk space on pdf2 is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [11:26:43] ACKNOWLEDGEMENT - RAID on pdf2 is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [11:26:43] ACKNOWLEDGEMENT - check configured eth on pdf2 is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [11:26:43] ACKNOWLEDGEMENT - check if dhclient is running on pdf2 is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [11:26:44] ACKNOWLEDGEMENT - puppet disabled on pdf2 is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [11:26:44] ACKNOWLEDGEMENT - puppet last run on pdf2 is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [11:26:45] ACKNOWLEDGEMENT - DPKG on pdf3 is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [11:26:45] ACKNOWLEDGEMENT - Disk space on pdf3 is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [11:26:46] ACKNOWLEDGEMENT - RAID on pdf3 is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [11:26:46] ACKNOWLEDGEMENT - check configured eth on pdf3 is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [11:26:47] ACKNOWLEDGEMENT - check if dhclient is running on pdf3 is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [11:26:47] ACKNOWLEDGEMENT - puppet disabled on pdf3 is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [11:26:48] ACKNOWLEDGEMENT - puppet last run on pdf3 is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [11:26:52] greg-g: removing a blocker for SUL finalization which we agree to remove now [11:27:00] * greg-g nods [11:27:00] * agreed, even [11:27:04] <_joe_> thanks mutante [11:27:10] <^d> greg-g: On tin to deploy stuff, obvs :) 
[11:27:14] hoo: what's the diff you see on tin? [11:27:21] ^d: shhh [11:27:31] greg-g: # deleted: docroot/wikivoyage.org/mywotf83102384bfcd1e52152.html [11:27:39] mutante: ^ [11:27:39] that's what git status gives [11:27:50] That was me [11:27:55] No idea why it was in git [11:27:59] ah, he was just asking me about it :) [11:27:59] (CR) Giuseppe Lavagetto: [C: 1] "Compilation results:" [operations/puppet] - https://gerrit.wikimedia.org/r/152248 (owner: Giuseppe Lavagetto) [11:28:06] Reedy: pf :P [11:28:18] Reedy: commit/push? :) [11:28:23] greg-g: yep, that was the web of trust thing [11:28:25] yeah [11:28:29] ssh keeps dying [11:28:31] <_joe_> mutante: whenever you have time, have a look @ https://gerrit.wikimedia.org/r/152248 [11:28:31] switched to mosh [11:28:32] :/ [11:28:38] works for me [11:28:39] but the client doesn't seem to want to key forward [11:28:44] do we support mosh on bastions/tin? [11:28:49] I thought it was inherently less secure? [11:28:50] yeah [11:28:55] <_joe_> we should then extend this to all ssl hosts I guess [11:28:56] it's installed on bast1001 at least.... [11:29:01] hmmm [11:29:15] mutante: do you understand what wikimedia.community is on RT #8068 ? [11:29:33] git grep on the dns repo is particularly helpful [11:29:40] mosh doesn't do key forwarding. Otherwise it is awesome. [11:29:44] check /puppet$ git log | grep mosh [11:29:50] for mosh on labs discussion. afair [11:30:03] mosh's versions are incompatible also [11:30:17] _joe_: ok, want me to make a similar one for a misc. service to test first? [11:30:29] Non-authoritative answer: [11:30:29] Name: Wikimedia.community [11:30:29] Address: 208.80.154.224 [11:30:29] and it doesn't do proxycommand either [11:30:50] boooo [11:30:57] Reedy: I 'll be damned... [11:31:10] We seriously bought that? [11:31:10] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 1 failures [11:31:12] Hahahaha [11:31:23] <^d> What a retarded domain name. 
[11:31:29] <^d> Oh wait, all the new gTLDs are retarded. [11:31:31] well.. it points to our servers... [11:31:35] <^d> Including .wiki [11:31:40] so yeah we must have bought it... [11:32:00] well "acquired" it [11:32:56] <^d> What a dumb thing to acquire. [11:33:39] ah.. it is a symlink to wikimedia.org... that explains it [11:34:41] s/.org/.com/ [11:36:18] heh [11:40:01] (PS1) Reedy: Remove mywotf file. Apparently not needed after verified [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/152263 [11:40:21] (CR) Reedy: [C: 2] Remove mywotf file. Apparently not needed after verified [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/152263 (owner: Reedy) [11:40:25] (Merged) jenkins-bot: Remove mywotf file. Apparently not needed after verified [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/152263 (owner: Reedy) [11:41:16] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 09:40:17 UTC [11:43:36] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.001 second response time [11:48:37] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [11:52:56] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 30 data above and 0 below the confidence bounds [11:52:56] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 30 data above and 0 below the confidence bounds [11:53:36] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.007 second response time [11:53:56] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [500.0] [11:55:24] argg. back. 
got disconnected [11:56:10] akosiaris: wikimedia.community is an actual domain [11:56:33] wikimedia.community: symbolic link to `wikimedia.com' [11:56:36] ^ in dns [11:57:45] wikimedia.community has address 208.80.154.224 [11:58:06] oh, all too late , i see [11:58:12] just got back online [11:59:47] A couple weeks ago I was logging refused queries and checking whois, and there *are* a bunch of others we've registered to our nameservers but never added to our dns [11:59:56] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Thu Aug 7 11:59:52 UTC 2014 [12:00:40] bblack: yea, needs improved workflow. we dont have the budget anymore [12:00:53] in the past ops used to register them [12:01:26] we need some kind of notification when to add them [12:01:46] I think it's legal that ends up doing it, based on whois. We could ask them to file an RT when they do. [12:02:45] mutante: Whereabouts are you? [12:02:54] they said they dont have the bandwidth for RT , in the past [12:03:08] maybe now legalpad phabricator's "herald" could send it [12:03:50] Reedy: Garden Room, at the entrance across art gallery [12:18:12] (CR) Dzahn: [C: 1] Redirect c[sz].wikimedia.org to http://www.wikimedia.cz [operations/puppet] - https://gerrit.wikimedia.org/r/147485 (owner: Reedy) [12:18:18] (PS2) Dzahn: Redirect c[sz].wikimedia.org to http://www.wikimedia.cz [operations/puppet] - https://gerrit.wikimedia.org/r/147485 (owner: Reedy) [12:18:27] (CR) Dzahn: [C: 1] Redirect c[sz].wikimedia.org to http://www.wikimedia.cz [operations/puppet] - https://gerrit.wikimedia.org/r/147485 (owner: Reedy) [12:20:10] !log restarting Zuul [12:20:15] Logged the message, Master [12:21:25] (PS1) Alexandros Kosiaris: wikimedia.community: Add Google webmasters tools verification [operations/dns] - https://gerrit.wikimedia.org/r/152269 [12:21:36] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.006 second response 
time [12:23:25] !log Zuul upgraded labs branch to match production (i.e. have same version of Zuul cloner) [12:23:29] (CR) Dzahn: "in this case it's correct that it's not protocol relative, wikimedia.cz does not support https" [operations/puppet] - https://gerrit.wikimedia.org/r/147485 (owner: Reedy) [12:23:33] Logged the message, Master [12:23:56] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [500.0] [12:23:57] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 30 data above and 0 below the confidence bounds [12:23:57] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 30 data above and 0 below the confidence bounds [12:24:36] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.009 second response time [12:25:30] (CR) Dzahn: "not sure about that, 34 domains are symlinks to wikimedia.com, that is like the default symlink, you would add this for all of them" [operations/dns] - https://gerrit.wikimedia.org/r/152269 (owner: Alexandros Kosiaris) [12:28:18] mutante: not sure either... hence the change... [12:29:00] akosiaris: yea.. we would add wrong verification codes for the other 30 domains.. [12:29:11] but on the other hand, if they are not added in webmaster tools anyways.. [12:29:30] we can always unlink and make it an actual copy of wikimedia.com [12:30:26] i've often been annoyed how it all symlinks to the single file. 
[12:30:33] but never wanted to take the time to untangle it [12:30:56] s/all/most [12:32:23] I am annoyed by wikimedia.com tbh [12:32:33] i wanted to change it to wikimedia.org [12:32:33] domain creep [12:32:41] to be the link target [12:33:05] i dunno why we picked .com [12:33:39] (PS1) RobH: smokeping was sending as rob, changing [operations/puppet] - https://gerrit.wikimedia.org/r/152273 [12:34:26] well it is a mostly empty zone [12:34:36] probably for that [12:35:00] yes, to not add all the extra things [12:35:02] hmm [12:35:11] and of course I don't understand why we want to have google webmaster's tools for wikimedia.community [12:35:45] it even points to Domain not configured [12:35:46] (CR) RobH: [C: 2] smokeping was sending as rob, changing [operations/puppet] - https://gerrit.wikimedia.org/r/152273 (owner: RobH) [12:36:01] so we are not even serving it ... [12:36:30] akosiaris: probably they want us to add it to Apache config [12:36:36] and add some redirect [12:36:56] questions for James though [12:37:41] mutante: you mean like at some point in the future? [12:37:44] could be [12:37:52] matanya: you happen to know ? [12:38:14] akosiaris: yea, apache config deployment is a bit slow sometimes :p [12:38:43] ah, it might even be a change that needs to be converted now [12:38:51] that apache config is in puppet [12:40:45] akosiaris: aaah, here's the thing, it's for mail [12:40:46] exim: add wikimedia.community to wikimedia_domains [12:41:00] so not being in http doesnt mean we dont "serve" it [12:41:11] might want a redirect though for good measure [12:41:12] <_joe_> btw [12:41:23] <_joe_> for sites/things that do not need mediawiki [12:41:34] <_joe_> we shouldn't use the appservers for redirects [12:41:41] <_joe_> it's stupid and inefficient [12:42:18] hrmm [12:42:22] wtf, where did mint go [12:42:30] was toolserver, even when decom should show in racktables... [12:42:32] _joe_: .. but instead? varnish? 
[12:42:33] in decom rack [12:42:41] mutante: sure, but what does that have to do with google webmasters tools ? [12:42:47] <_joe_> mutante: whatever, I'd use varnish or better nginx [12:43:03] akosiaris: true, no idea [12:43:07] <_joe_> or any other webserver we have lying around [12:43:23] _joe_: i see [12:43:28] <_joe_> we're caching those redirects on varnish anyway [12:43:29] springle: can I close RT #8032 ? [12:43:39] _joe_: true [12:43:43] seems like everything is back to normal [12:44:01] _joe_: i think it has been brought up in the past but nobody wanted to convert all the apache config to varnish config [12:44:16] and then Tim wrote the generator [12:44:47] ACKNOWLEDGEMENT - DPKG on dobson is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [12:44:47] ACKNOWLEDGEMENT - Disk space on dobson is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [12:44:47] ACKNOWLEDGEMENT - RAID on dobson is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [12:44:47] ACKNOWLEDGEMENT - check configured eth on dobson is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [12:44:47] ACKNOWLEDGEMENT - check if dhclient is running on dobson is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [12:44:48] ACKNOWLEDGEMENT - puppet disabled on dobson is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [12:44:48] ACKNOWLEDGEMENT - puppet last run on dobson is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [12:44:49] ACKNOWLEDGEMENT - DPKG on dobson is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [12:44:49] ACKNOWLEDGEMENT - Disk space on dobson is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [12:44:50] ACKNOWLEDGEMENT - RAID on dobson is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [12:44:50] ACKNOWLEDGEMENT - check configured eth on dobson is CRITICAL: 
Connection refused by host daniel_zahn no NRPE on old distro [12:44:51] ACKNOWLEDGEMENT - check if dhclient is running on dobson is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [12:44:51] ACKNOWLEDGEMENT - puppet disabled on dobson is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [12:44:52] ACKNOWLEDGEMENT - puppet last run on dobson is CRITICAL: Connection refused by host daniel_zahn no NRPE on old distro [12:45:09] bblack: what are the chances of me easily and successfully compiling our varnish on trusty ? [12:45:10] akosiaris: oh, yes [12:45:14] springle: thanks [12:45:22] <_joe_> mutante: I've seen a change to create a virtual host for toolserver.org [12:45:26] <_joe_> on the mw appservers [12:45:31] <_joe_> that doesn't make sense [12:45:44] _joe_: oh, i think i saw your comment there, yea [12:46:21] akosiaris: probably pretty decent [12:46:46] ok. giving it a go then and hopefully I 'll not have to pester you :-) [12:46:50] it's a "git buildpackage" thing for the debian package, needs some various pre-reqs installed (which are listed in the metadata) [12:47:01] nice [12:47:11] _joe_: it's just how we have always done it in the past, we actually moved stuff from misc. apaches to the cluster, thinking it's better to have the redirects on cluster [12:47:38] akosiaris: there's a debian/README.WMF that shows my usual gbp command invocation [12:47:45] (Abandoned) RobH: sunfire hosts decommed, removed from dns [operations/dns] - https://gerrit.wikimedia.org/r/118480 (owner: Matanya) [12:47:57] and you want to use the branch 3.0.5-plus-wm for our current prod package [12:48:55] bblack: cool cause I would go for 3.0.6.. thanks [12:49:34] yeah 3.0.6 is on hold. upstream has released a 3.0.6rc1 and I merged that into there so far, but waiting for them to release 3.0.6 for real before we make the switch. 
[12:50:54] <_joe_> mutante: mmm i'm quite convinced it's not a good idea, but we may discuss it (btw, this is the first time I see a new virtualhost added to the configuration) [12:51:58] <_joe_> mutante: would you care to double-check https://gerrit.wikimedia.org/r/#/c/152248/ ? [12:52:16] _joe_: yea, that was just explaining the history, not even an opinion [12:52:25] _joe_: we can test it with something .misc first ? [12:52:35] <_joe_> well, yeah :) [12:52:45] <_joe_> like, gerrit? [12:52:54] Bugzilla [12:53:06] i can try and copy your change for that [12:54:04] hold on, one other change before.. [12:54:41] <_joe_> mutante: I'm preparing bugzilla right now [12:54:51] ok, cool [12:55:14] (PS1) Dzahn: fix etherpad-lite monitoring [operations/puppet] - https://gerrit.wikimedia.org/r/152279 [12:55:20] (CR) Alexandros Kosiaris: [C: 2] osm: Move postgres datadir to /srv/postgres [operations/puppet] - https://gerrit.wikimedia.org/r/152013 (owner: Alexandros Kosiaris) [12:58:03] mutante: I was pretty sure I had fixed that... sigh [12:58:04] sorry [12:58:09] (CR) Dzahn: [C: 2] "PROCS CRITICAL: 0 processes with regex args '^node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js'" [operations/puppet] - https://gerrit.wikimedia.org/r/152279 (owner: Dzahn) [12:58:43] akosiaris: no worries, i just saw it after icinga was cleaner again [12:58:58] the ACKs for pdf boxes had expired and stuff [13:00:37] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Puppet has 2 failures [13:00:44] ^ that is me [13:01:40] (PS1) Alexandros Kosiaris: postgres: enclose data_directory in single quotes [operations/puppet] - https://gerrit.wikimedia.org/r/152281 [13:02:31] ACKNOWLEDGEMENT - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 06 Aug 2014 17:10:26 UTC daniel_zahn 17:12 _joe_: stopped the jobrunner on mw1053, was running in fcgi mode unpuppetized and with a broken vhost. 
Fixed it, it started spawning exceptions. DO NOT enable puppet again [13:02:47] (03PS1) 10Giuseppe Lavagetto: bugzilla: use ssl_ciphersuite [operations/puppet] - 10https://gerrit.wikimedia.org/r/152282 [13:02:50] (03CR) 10Alexandros Kosiaris: [C: 032] postgres: enclose data_directory in single quotes [operations/puppet] - 10https://gerrit.wikimedia.org/r/152281 (owner: 10Alexandros Kosiaris) [13:03:57] ACKNOWLEDGEMENT - RAID on holmium is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) daniel_zahn #7239: holmium failed raid, replaced disk slot 0 [13:04:05] _joe_: well most new domains for redirections are usually just added to the virtualhost in redirects.conf but the toolserver one doesn't follow the patter of just domain with some exact path as exception, so this seems the usual way to me. [13:04:50] <_joe_> jzerebecki: mh mutante told me it's common. [13:05:09] <_joe_> then again, what's wrong at hosting the redirects on the same server that will serve the content? [13:05:14] <_joe_> isn't it better? [13:05:52] it would be another workflow for parked domains with just redirects. 
[13:06:19] <_joe_> I see a ton of reasons why it's better [13:07:54] <_joe_> so, maybe someone in ops will agree with you, I'm not reverting my -1 on the basis that "we always did that" [13:08:15] i definitely agree with joe [13:08:20] <_joe_> putting those redirects directly on the servers that will eventually serve that content makes much more sense [13:08:37] i see no reason why we'd want to have that on the main wiki platform [13:09:37] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Puppet has 8 failures [13:09:42] mark: I don't mind that much, but it would be breaking the pattern of keeping all the redirect only domains together [13:10:22] (03CR) 10Mark Bergsma: [C: 04-1] "I also don't see any reason why this should live on the main wiki platform instead of on the tools web server(s)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151523 (https://bugzilla.wikimedia.org/60238) (owner: 10Tim Landscheidt) [13:11:37] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [13:11:47] mark: besides we do not have any domains besides those below wmflabs. and wmflabs.org. on labs [13:11:56] so? [13:12:24] (03CR) 10Dzahn: [C: 04-1] "SSLCipherSuite appears in more than 1 place, lemme amend..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/152282 (owner: 10Giuseppe Lavagetto) [13:12:41] why does it matter if it's the first domain or the 26th? [13:15:21] i don't think that does, but it would be another way to do something we already do one way that works. there is no code review, no configuration management for those labs domains. only cloudadmins can view and edit them via nova. 
[13:17:00] <_joe_> mutante: thanks :) [13:18:34] yeah, putting it on the wiki platform is one way to do it that works [13:18:40] (PS2) Dzahn: bugzilla: use ssl_ciphersuite [operations/puppet] - https://gerrit.wikimedia.org/r/152282 (owner: Giuseppe Lavagetto) [13:18:48] but we're not hosting a bunch of other unrelated sites there on it either eh, even though we could [13:19:07] sure, there's a bunch of legacy stuff on there still [13:19:15] because that setup originates from over 10 years ago [13:19:20] I hear a bunch of users complaining about swift errors when uploading files on commons [13:19:41] but that doesn't mean that in 2014 it still makes sense to put new unrelated stuff on there [13:19:47] godog: ping [13:19:47] mark: ping detected, please leave a message! [13:19:53] haha [13:20:05] godog: please have a look at those swift errors [13:20:31] mark: yep I'm here [13:20:36] RECOVERY - etherpad_lite_process_running on zirconium is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [13:20:51] akosiaris: ^ [13:21:22] bawolff: where are you? I'm in the garden room at the search table [13:21:45] godog: I'm currently in frobisher 6 for gwtoolset stuff [13:21:53] I can come over there [13:22:17] although I don't really know what's happening beyond I heard some users complain on #wikimedia-commons [13:22:33] oh ok, let me come over there [13:23:16] godog: It's a big discussion though, not really a good place to have off topic discussion [13:24:01] mark: what would be a better way to do those legacy things? like doing those other unrelated domains with redirects to sites not even hosted by wmf directly in varnish instead of apache?
[13:24:16] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [13:24:22] no, properly using a web server like nginx or apache [13:24:31] cached by a varnish in front if needed [13:26:17] _joe_: how about the combination of apache-2.2 and compat mode "strong"? on BZ i'd do that, just not on the cluster [13:27:26] <_joe_> mutante: it fails :) [13:27:29] <_joe_> on purpose [13:28:01] ah, ok, so before the reasoning was "let's already list them, and once apache is being upgraded it will just work" [13:28:34] (CR) Dzahn: [C: 1] "let's go and use this to test before making the cluster change. (i'll retab the template in a minute :p)" [operations/puppet] - https://gerrit.wikimedia.org/r/152282 (owner: Giuseppe Lavagetto) [13:29:03] mark: seems sensible. then should anyone wanting to change or add to those redirects be required to implement that improvement before any existing redirect only domains are changed or added? [13:29:19] yes [13:30:32] godog: Seems like moved files are also disappearing [13:30:34] why can't this simply be added as a new vhost on the existing tools.wmflabs servers? [13:30:38] e.g.
https://upload.wikimedia.org/wikipedia/commons/0/06/John_Sinclair_2006_poetry_reading.jpg [13:33:51] bawolff: yeah I was chatting about that with aaron yesterday, possible due to some older bug during move but shouldn't happen for new images [13:36:09] ACKNOWLEDGEMENT - Host tantalum is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #7947: reclaim tantalum [13:40:11] (03PS1) 10Mark Bergsma: Move ulsfo upload traffic from eqiad to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/152291 [13:40:38] (03CR) 10Mark Bergsma: [C: 032] Move ulsfo upload traffic from eqiad to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/152291 (owner: 10Mark Bergsma) [13:42:05] ACKNOWLEDGEMENT - Disk space on praseodymium is CRITICAL: DISK CRITICAL - free space: /mnt/tmp 0 MB (0% inode=99%): daniel_zahn cassandra test host [13:42:55] ACKNOWLEDGEMENT - Disk space on dataset1001 is CRITICAL: DISK CRITICAL - free space: /data 705111 MB (1% inode=99%): daniel_zahn RT #7922 fix disk space check on dataset1001 [13:46:50] ACKNOWLEDGEMENT - check configured eth on labstore1001 is CRITICAL: bond0 reporting no carrier. 
daniel_zahn RT #7657 virt1002/labstore1001 network exhaustion [13:47:40] ACKNOWLEDGEMENT - puppet last run on dataset2 is CRITICAL: CRITICAL: Puppet has 3 failures daniel_zahn RT #7923 fix puppet run on dataset2 [13:48:57] ACKNOWLEDGEMENT - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00334448160535 daniel_zahn RT #7924, RT #7779 [13:51:31] ACKNOWLEDGEMENT - Disk space on labstore1001 is CRITICAL: DISK CRITICAL - free space: /exp/dumps 0 MB (0% inode=99%): daniel_zahn RT #8090: labstore1001 - disk space [13:51:45] (03CR) 10coren: [C: 04-2] "Besides the technical-ish objections raised by others, I don't agree with the principle of keeping redirect to individual tools alive inde" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151523 (https://bugzilla.wikimedia.org/60238) (owner: 10Tim Landscheidt) [13:52:57] PROBLEM - ElasticSearch health check on elastic1018 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.40 [13:53:16] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 11:53:02 UTC [13:56:28] !log starting elasticsearch on elastic1018 [13:56:33] Logged the message, Master [13:57:26] !log Temporarily set max connections to swift from cp1049 backend varnish from 1000 to 2000 [13:57:32] Logged the message, Master [13:58:27] robh: these sound like broken hardware, dont they. [sdb] Sense Key : Medium Error [current] , sd 6:0:1:0: [sdb] Add. Sense: Unrecovered read error ... [13:58:41] yep [13:58:43] one of the elastic search boxes [13:58:46] lemme make ticket [13:58:50] sound like an ssd death [13:58:59] another m320 bites the dust [13:59:22] that sucks though cuz we're already fairly constrained on elastic at the moment iirc [13:59:23] oh well [14:00:13] hmm, i just restarted the elastic service there, now "is running. 
status: yellow: timed_out" [14:00:56] it wont work with a dead ssd [14:01:02] it needs both iirc [14:01:08] (or maybe its raid1, i dont recall) [14:02:47] <^d> mutante: Yellow's fine, it's just reallocating stuff since a server crapped out [14:02:49] <^d> Doing it's job :) [14:02:58] <^d> its, even. [14:03:16] <^d> { [14:03:16] <^d> "cluster_name" : "production-search-eqiad", [14:03:16] <^d> "status" : "yellow", [14:03:18] <^d> "timed_out" : false, [14:03:20] <^d> "number_of_nodes" : 18, [14:03:22] <^d> "number_of_data_nodes" : 18, [14:03:24] <^d> "active_primary_shards" : 2035, [14:03:26] <^d> "active_shards" : 6028, [14:03:26] ^d: should i depool it or nah? [14:03:28] <^d> "relocating_shards" : 0, [14:03:30] <^d> "initializing_shards" : 16, [14:03:32] <^d> "unassigned_shards" : 60 [14:03:34] <^d> } [14:03:42] in pybal that is [14:04:15] <^d> pybal should fail the health check and depool on its own. [14:04:16] PROBLEM - ElasticSearch health check on elastic1018 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.40 [14:04:33] ah, there we go :p [14:04:45] it seems it doesnt [14:04:50] 18 { 'host': 'elastic1018.eqiad.wmnet', 'weight': 30, 'enabled': True } [14:05:30] !log depooled elastic1018 - service wasnt running and signs of broken hardware (SSD) [14:05:37] Logged the message, Master [14:06:18] PROBLEM - check if dhclient is running on elastic1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:06:26] PROBLEM - check configured eth on elastic1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:06:46] PROBLEM - DPKG on elastic1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:06:46] PROBLEM - puppet last run on elastic1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:06:56] PROBLEM - RAID on elastic1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:06:56] PROBLEM - Disk space on elastic1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
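(Editor's aside on the cluster health paste at 14:03.) A minimal Python sketch of how that response could be read, assuming the usual Elasticsearch meaning of the status colours: green means every shard is allocated, yellow means all primaries are active but some replicas are not, red means at least one primary is missing. The field names and values are copied from the paste; the `summarize_health` helper is hypothetical and is not part of Elasticsearch or of any tooling mentioned in the channel.

```python
import json

# Cluster health exactly as pasted in the channel at 14:03.
HEALTH_JSON = """
{
  "cluster_name": "production-search-eqiad",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 18,
  "number_of_data_nodes": 18,
  "active_primary_shards": 2035,
  "active_shards": 6028,
  "relocating_shards": 0,
  "initializing_shards": 16,
  "unassigned_shards": 60
}
"""


def summarize_health(health):
    """Hypothetical helper: turn a cluster health dict into a one-line summary."""
    # Shards that exist in the cluster state but are not serving yet.
    pending = health["initializing_shards"] + health["unassigned_shards"]
    if health["status"] == "green":
        state = "all shards allocated"
    elif health["status"] == "yellow":
        state = "all primaries active, %d replica shard(s) not yet allocated" % pending
    else:
        state = "at least one primary shard is not active"
    return "%s: %s (%s)" % (health["cluster_name"], health["status"], state)


health = json.loads(HEALTH_JSON)
print(summarize_health(health))
```

Read this way, the paste matches ^d's comment: the cluster is still serving from primaries while replicas that lived on the failed node are being rebuilt elsewhere.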
[14:06:56] PROBLEM - puppet disabled on elastic1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:58] disabling those notifications [14:11:49] well, now elastic1018 died while we talked about it [14:12:09] <^d> !log elastic1018: blacklisted from shard allocation since it's dead [14:12:15] Logged the message, Master [14:13:36] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Thu Aug 7 14:13:26 UTC 2014 [14:14:22] mutante: stop talking about servers :-] [14:15:13] "unhandled error code" are the best errors :) [14:15:23] cmjohnson1: hehe, ok [14:22:16] bblack: http://apt.wikimedia.org/pending/ [14:22:35] akosiaris: I'd hold on that just a little, since -wm7 has storage bugs [14:22:49] bblack: it's just for testing on labs [14:23:01] ok [14:23:22] if I can definitively find the bug, I'll make a -wm8 today [14:23:28] btw the trusty1 version qualifier is due to reprepro not liking the same package twice [14:23:55] sure, no worries. Turns out it was exceptionally easy to build thanks to your notes [14:23:56] yeah [14:23:56] !log shutting down elastic1018 [14:24:01] Logged the message, Master [14:24:04] great! [14:25:00] ACKNOWLEDGEMENT - Host elastic1018 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #8091: elastic1018 - broken SSD [14:27:22] (CR) Legoktm: [C: 1] Remove unnecessary file_exists checks for skin requires [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/152119 (owner: Bartosz Dziewoński) [14:27:32] (CR) Legoktm: [C: 1] "Now necessary." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/152120 (owner: Bartosz Dziewoński) [14:29:01] matanya: ugh, mailman_queue_size check doesn't work right yet, output = null [14:29:30] Uh... Beta Labs says Vector is installed, LOL [14:29:52] "Whoops! The default skin for your wiki ($wgDefaultSkin), vector, is not available."
[14:30:00] (03CR) 10Dzahn: [C: 031] decom tarin (pmtpa poolcounter) [operations/puppet] - 10https://gerrit.wikimedia.org/r/152154 (owner: 10Dzahn) [14:30:06] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [14:30:08] (03PS2) 10Dzahn: decom tarin (pmtpa poolcounter) [operations/puppet] - 10https://gerrit.wikimedia.org/r/152154 [14:30:18] bawolff: should be getting better and better over the next two hours or so, errors are back under threshold [14:30:42] Yay! [14:30:50] I'm going to cp and paste that to VP [14:35:23] <_joe_> bawolff: https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22) shows both the outage and the resolution [14:38:03] (03CR) 10Dzahn: [C: 032] planet.wikimedia.org -- fix https redirects to http [operations/puppet] - 10https://gerrit.wikimedia.org/r/149311 (https://bugzilla.wikimedia.org/68554) (owner: 10Chmarkine) [14:39:02] (03CR) 10Dzahn: [V: 032] planet.wikimedia.org -- fix https redirects to http [operations/puppet] - 10https://gerrit.wikimedia.org/r/149311 (https://bugzilla.wikimedia.org/68554) (owner: 10Chmarkine) [14:40:49] (03PS1) 10Hoo man: Allow "hoo" to sudo into datasets [operations/puppet] - 10https://gerrit.wikimedia.org/r/152724 [14:41:48] (03PS1) 10Alexandros Kosiaris: akosiaris .dotfiles updates [operations/puppet] - 10https://gerrit.wikimedia.org/r/152726 [14:43:14] !log uploaded varnish_3.0.5plus~x-wm7trusty1 on apt.wikimedia.org (for usage in trusty labs machines, notably cxserver) [14:43:20] Logged the message, Master [14:48:39] (03CR) 10JanZerebecki: [C: 031] "Looks good, will need the DNS change after that." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/147485 (owner: 10Reedy) [14:52:27] (03PS2) 10Alexandros Kosiaris: akosiaris .dotfiles updates [operations/puppet] - 10https://gerrit.wikimedia.org/r/152726 [14:53:47] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Puppet has 1 failures [14:54:36] RECOVERY - Puppet freshness on labsdb1005 is OK: puppet ran at Thu Aug 7 14:54:30 UTC 2014 [14:55:47] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [14:56:10] Bah, messing with my bouncer, sorry for spam [15:07:22] chrismcmahon: something screwed up Beta Labs [15:08:07] PROBLEM - MySQL Processlist on db1062 is CRITICAL: CRIT 2 unauthenticated, 0 locked, 1 copy to table, 80 statistics [15:08:28] StevenW: it was MatmaRex [15:08:40] It should be un-exploded [15:09:00] Considering we're going to use it for usability testing during Wikimania starting tomorrow... [15:09:07] RECOVERY - MySQL Processlist on db1062 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 2 statistics [15:09:38] StevenW: stay cool [15:09:40] our top men are on it [15:09:42] :) [15:09:46] phew [15:09:58] apparnetly, https://gerrit.wikimedia.org/r/#/c/152120/ needs to be merged [15:10:32] Let's get it fixed up [15:11:02] (03PS2) 10Reedy: Remove unnecessary file_exists checks for skin requires [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/152119 (owner: 10Bartosz Dziewoński) [15:11:07] (03CR) 10Reedy: [C: 032] Remove unnecessary file_exists checks for skin requires [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/152119 (owner: 10Bartosz Dziewoński) [15:11:18] (03PS3) 10Krinkle: Add skin requires for Vector and MonoBook [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/152120 (owner: 10Bartosz Dziewoński) [15:11:21] (03Merged) 10jenkins-bot: Remove unnecessary file_exists checks for skin requires [operations/mediawiki-config] - 
10https://gerrit.wikimedia.org/r/152119 (owner: 10Bartosz Dziewoński) [15:11:46] (03CR) 10Reedy: [C: 032] Add skin requires for Vector and MonoBook [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/152120 (owner: 10Bartosz Dziewoński) [15:11:50] (03Merged) 10jenkins-bot: Add skin requires for Vector and MonoBook [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/152120 (owner: 10Bartosz Dziewoński) [15:12:16] (03CR) 10Ori.livneh: [C: 04-1] "We don't want a critical alert whenever someone runs an HHVM REPL to debug an issue on the app server, so either the upper limit should be" (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/152080 (owner: 10Giuseppe Lavagetto) [15:12:54] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 13s) [15:13:00] Logged the message, Master [15:14:05] (03CR) 10Ori.livneh: mediawiki: basic HHVM monitoring (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/152081 (owner: 10Giuseppe Lavagetto) [15:14:51] (03CR) 10Ori.livneh: [C: 031] mediawiki: move testwiki to HAT [DO_NOT_MERGE] [operations/puppet] - 10https://gerrit.wikimedia.org/r/152082 (owner: 10Giuseppe Lavagetto) [15:14:54] c'mon jenkins [15:15:17] (03CR) 10Alexandros Kosiaris: [C: 032] akosiaris .dotfiles updates [operations/puppet] - 10https://gerrit.wikimedia.org/r/152726 (owner: 10Alexandros Kosiaris) [15:15:39] 15:14:02 15:14:02 scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="eowiki" --list-file="/a/common/wmf-config/extension-list" --output="/tmp/tmp.if3iYJSuEE" ' returned non-zero exit status 255 (duration: 00m 05s) [15:17:22] MatmaRex: Are we missing a couple of extension-list additions? [15:17:23] akosiaris: how would you feel about adding a 'snippets' subdirectory in the apache module for bits of configuration that serve a common use-case and have been reviewed? 
it wouldn't be utilized by puppet at all; it'd be there purely for the benefit of human readers [15:17:43] as documentation ? [15:17:50] yeah, pretty much [15:18:06] I 'd like it. but probably under doc/snippets [15:18:13] or something similar so that it's clear [15:18:29] 15:18:20 PHP Warning: require_once(/mnt/srv/scap-stage-dir/php-master/skins/Vector/Vector.php): failed to open stream: No such file or directory in /mnt/srv/common-local/wmf-config/CommonSettings.php on line 518 [15:18:29] 15:18:20 Warning: require_once(/mnt/srv/scap-stage-dir/php-master/skins/Vector/Vector.php): failed to open stream: No such file or directory in /mnt/srv/common-local/wmf-config/CommonSettings.php on line 518 [15:18:29] 15:18:20 PHP Fatal error: require_once(): Failed opening required '/mnt/srv/scap-stage-dir/php-master/skins/Vector/Vector.php' (include_path='/mnt/srv/scap-stage-dir/php-master:/usr/local/lib/php:/usr/share/php') in /mnt/srv/common-local/wmf-config/CommonSettings.php on line 518 [15:18:29] 15:18:20 Fatal error: require_once(): Failed opening required '/mnt/srv/scap-stage-dir/php-master/skins/Vector/Vector.php' (include_path='/mnt/srv/scap-stage-dir/php-master:/usr/local/lib/php:/usr/share/php') in /mnt/srv/common-local/wmf-config/CommonSettings.php on line 518 [15:18:36] but yeah I like it [15:18:54] cool :) [15:19:13] (03CR) 10Giuseppe Lavagetto: "Agreed - I'll set the alert if no process is running." [operations/puppet] - 10https://gerrit.wikimedia.org/r/152080 (owner: 10Giuseppe Lavagetto) [15:20:50] (03PS1) 10Dzahn: decom tantalum, former OCG QA box [operations/puppet] - 10https://gerrit.wikimedia.org/r/152739 [15:21:12] Empty vector folder [15:21:28] (03CR) 10Giuseppe Lavagetto: mediawiki: basic HHVM monitoring (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/152081 (owner: 10Giuseppe Lavagetto) [15:22:07] Same for monobook [15:22:09] Scap running for labs [15:22:41] Reedy: do we need to add skins to extensions-list? 
[15:22:49] We've got some added already [15:22:55] https://noc.wikimedia.org/conf/highlight.php?file=extension-list [15:22:57] look at hte bottom [15:24:34] (03PS1) 10Legoktm: Add MonoBook & Vector to extensions-list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/152741 [15:24:36] Reedy: ^ [15:27:19] (03CR) 10Krinkle: "Be careful since these don't exist in production (mediawiki make branch doesn't have these yet). It'll work on beta though because it incl" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/152741 (owner: 10Legoktm) [15:27:28] ACKNOWLEDGEMENT - Host platinum is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT ##8096: status of platinum ? [15:28:55] legoktm: -labs [15:28:56] ;) [15:29:12] ah ok [15:29:15] wait [15:29:17] why not prod though? [15:29:29] because core still includes them [15:29:36] it might work ok [15:29:38] but it won't next thursday? [15:29:50] Reedy: core doesn't include them [15:29:57] Reedy: Well, it does. [15:30:00] but CommonSettings also does [15:30:04] in prod, [15:30:17] (03PS2) 10Legoktm: Add MonoBook & Vector to extensions-list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/152741 [15:30:25] (03PS3) 10Legoktm: Add MonoBook & Vector to extensions-list-labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/152741 [15:30:29] Yeah, let's keep it safe for now [15:30:43] I'll move it around as necessary as usual [15:30:55] beta should be nearly fixd [15:30:58] scap is just about done [15:31:12] (03PS2) 10Chad: Adding missing Swift dependencies [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/151108 [15:35:24] (03PS1) 10RobH: setting up dns for protactinium [operations/dns] - 10https://gerrit.wikimedia.org/r/152747 [15:36:57] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Epic puppet fail [15:37:20] !log reedy Synchronized php-1.24wmf15/maintenance/findMissingFiles.php: (no message) (duration: 00m 17s) [15:37:24] 
Logged the message, Master [15:38:06] !log reedy Synchronized php-1.24wmf16/maintenance/findMissingFiles.php: (no message) (duration: 00m 20s) [15:38:10] Logged the message, Master [15:39:48] (03PS1) 10Ori.livneh: HHVM: fix Apache config for status site [operations/puppet] - 10https://gerrit.wikimedia.org/r/152753 [15:40:03] (03CR) 10Giuseppe Lavagetto: "I decided the way to go is:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152080 (owner: 10Giuseppe Lavagetto) [15:40:27] (03CR) 10Ori.livneh: "Better, IMO: https://gerrit.wikimedia.org/r/#/c/152079/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152079 (owner: 10Giuseppe Lavagetto) [15:41:10] (03PS3) 10Jforrester: SpecialCite is now CiteThisPage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/149597 [15:42:45] (03CR) 10RobH: [C: 032] setting up dns for protactinium [operations/dns] - 10https://gerrit.wikimedia.org/r/152747 (owner: 10RobH) [15:43:01] StevenW: That should be beta fixed [15:43:13] (03CR) 10Ori.livneh: "... https://gerrit.wikimedia.org/r/#/c/152753/ , rather" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152079 (owner: 10Giuseppe Lavagetto) [15:43:26] Thanks! [15:44:43] (03PS4) 10Jforrester: SpecialCite is now CiteThisPage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/149597 [15:44:58] (03CR) 10Reedy: [C: 032] Add MonoBook & Vector to extensions-list-labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/152741 (owner: 10Legoktm) [15:45:49] (03Merged) 10jenkins-bot: Add MonoBook & Vector to extensions-list-labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/152741 (owner: 10Legoktm) [15:45:57] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:46:25] heh ? [15:46:34] labsdb1003 should not have recovered... [15:46:35] wtf ? 
[15:46:54] !log reedy Synchronized wmf-config/extension-list-labs: (no message) (duration: 00m 13s) [15:46:58] Logged the message, Master [15:47:08] akosiaris: if puppet is running it considers it a run [15:47:22] akosiaris: so if puppet has just been invoked but hasn't failed yet it's A-OK [15:47:23] (03PS1) 10Reedy: Remove old extension-list files [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/152756 [15:47:30] ori: Error 400 on SERVER: Duplicate declaration: Nrpe::Monitor_service[mysqld] is already declared in file /etc/puppet/manifests/role/db.pp:97; cannot redeclare at /etc/puppet/modules/mariadb/manifests/monitor_process.pp:13 on node labsdb1003.eqiad.wmnet [15:47:34] so nope... [15:47:49] (03CR) 10Reedy: [C: 032] Remove old extension-list files [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/152756 (owner: 10Reedy) [15:47:51] the previous yaml was very very thorough [15:48:08] !log zirconium - attempt to fix apache site setup manually [15:48:13] Logged the message, Master [15:48:14] (03Merged) 10jenkins-bot: Remove old extension-list files [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/152756 (owner: 10Reedy) [15:48:29] ori: http://p.defau.lt/?vxsQz911y0Il7EZYVGAkLw [15:48:34] so something is fishy here... 
[15:49:57] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Epic puppet fail [15:51:08] (03CR) 10Reedy: [C: 04-1] SpecialCite is now CiteThisPage (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/149597 (owner: 10Jforrester) [15:53:15] (03CR) 10Dzahn: "this results in situations like this: sites-enabled has one file, sites-available has another file, none are symlinks and the content is d" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140218 (owner: 10Ori.livneh) [15:53:36] (03PS1) 10Reedy: Point wgSiteMatrixFile at full path (not /apache symlink) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/152758 [15:56:34] * andrewbogott is looking at that labsdb1003 crit [15:56:50] (03PS1) 10RobH: protactinium install params [operations/puppet] - 10https://gerrit.wikimedia.org/r/152761 [15:58:57] (03PS1) 10Yuvipanda: quarry: Allow Cross Domain access to outputs for everyone [operations/puppet] - 10https://gerrit.wikimedia.org/r/152764 [16:01:37] (03PS1) 10Andrew Bogott: Remove duplicate nrpe::monitor_service def [operations/puppet] - 10https://gerrit.wikimedia.org/r/152766 [16:02:00] springle: ^ will fix the labsdb1003 puppet failure. 
[16:03:16] PROBLEM - Puppet freshness on cp1049 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 14:02:27 UTC [16:04:06] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:05:15] (03CR) 10RobH: [C: 032] protactinium install params [operations/puppet] - 10https://gerrit.wikimedia.org/r/152761 (owner: 10RobH) [16:05:46] hm, and now labsdb1003 reports 'recovery' despite still being broken :( [16:05:55] I guess I'll spend some time on puppet monitoring [16:08:03] (03PS1) 10Dzahn: fix Apache site setup for contacts.wm [operations/puppet] - 10https://gerrit.wikimedia.org/r/152771 [16:08:06] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Epic puppet fail [16:08:52] (03CR) 10Jgreen: [C: 031] decom tantalum, former OCG QA box [operations/puppet] - 10https://gerrit.wikimedia.org/r/152739 (owner: 10Dzahn) [16:09:24] !log krinkle Synchronized php-1.24wmf16/extensions/GlobalCssJs/GlobalCssJs.hooks.php: 4bbf4e0ed92f9a09 (duration: 00m 05s) [16:09:30] Logged the message, Master [16:10:16] PROBLEM - Puppet freshness on cp1048 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 14:09:33 UTC [16:10:37] (03CR) 10Dzahn: "root@zirconium:/# file /etc/apache2/sites-enabled/contacts.wikimedia.org" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152771 (owner: 10Dzahn) [16:12:02] (03PS2) 10Dzahn: fix Apache site setup for contacts.wm [operations/puppet] - 10https://gerrit.wikimedia.org/r/152771 [16:12:05] (03PS1) 10Alexandros Kosiaris: puppet agent: disable usecacheonfailure [operations/puppet] - 10https://gerrit.wikimedia.org/r/152773 [16:13:07] (03CR) 10Dzahn: [C: 032] fix Apache site setup for contacts.wm [operations/puppet] - 10https://gerrit.wikimedia.org/r/152771 (owner: 10Dzahn) [16:14:11] (03CR) 10Alexandros Kosiaris: [C: 032] puppet agent: disable usecacheonfailure [operations/puppet] - 10https://gerrit.wikimedia.org/r/152773 (owner: 10Alexandros Kosiaris) [16:15:07] 
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [16:16:42] (03CR) 10Dzahn: "Notice: /Stage[main]/Apache/File[/etc/apache2/sites-enabled/contacts.wikimedia.org]/ensure: removed" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152771 (owner: 10Dzahn) [16:18:16] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 85 data above and 9 below the confidence bounds [16:20:14] (03CR) 10Jgreen: [C: 031 V: 032] nagios plugin to check OCG server health [operations/puppet] - 10https://gerrit.wikimedia.org/r/152168 (owner: 10Jgreen) [16:20:38] (03CR) 10Jgreen: [C: 032 V: 031] nagios plugin to check OCG server health [operations/puppet] - 10https://gerrit.wikimedia.org/r/152168 (owner: 10Jgreen) [16:21:53] (03PS2) 10Jgreen: puppetize check_ocg_health nagios check [operations/puppet] - 10https://gerrit.wikimedia.org/r/152180 [16:23:19] if an icinga check returns 3 (aka 'unknown') does that clear the previous state, or just leave things as they were? 
[16:24:23] it should clear it, 3 is a separate state [16:24:50] ok, so if something goes critical, then is unknown, then critical again… we'll get two warnings [16:24:53] icinga will move it to the "purple" section [16:24:55] in the web ui [16:25:06] Although I can't think of a time when I've seen 'unknown' come up in IRC, so maybe that never happens [16:25:32] i think the IRC bot doesnt care about unknowns even though Icinga handles them separately [16:25:48] hm [16:25:50] and we dont get pages for them either [16:25:54] (03PS5) 10Jforrester: SpecialCite is now CiteThisPage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/149597 [16:25:58] but you can see them in web [16:26:06] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [16:27:19] (03PS1) 10Dzahn: fix apache site setup for planet languages [operations/puppet] - 10https://gerrit.wikimedia.org/r/152777 [16:27:57] hm, nope, irc and web interface both say 'OK' [16:28:00] (03PS2) 10Dzahn: fix apache site setup for planet languages [operations/puppet] - 10https://gerrit.wikimedia.org/r/152777 [16:28:05] even though it's not. So I guess we aren't hitting 'unknown' [16:28:18] There's not any way to have a test exist with "Ask me later"? [16:28:22] *exit [16:28:54] (03CR) 10Dzahn: [C: 032] "root@zirconium:/etc/apache2/sites-enabled# file *planet*" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152777 (owner: 10Dzahn) [16:29:16] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [16:29:16] (03PS1) 10Yuvipanda: quarry: Serve JSON output with appropriate MIME type [operations/puppet] - 10https://gerrit.wikimedia.org/r/152779 [16:29:43] (03CR) 10Jgreen: [C: 032 V: 031] puppetize check_ocg_health nagios check [operations/puppet] - 10https://gerrit.wikimedia.org/r/152180 (owner: 10Jgreen) [16:31:01] andrewbogott: 0 (OK), 1 (WARN), 2 (CRIT), 3 (UNKNOWN).. that's it.. 
not sure how you mean "ask me later" [16:31:22] 'ask me later' like, this is a bad time to run the test. [16:31:47] you can change the interval at which icinga runs the check [16:31:51] I guess I can have it return 'unknown' in that case but that will cause flapping [16:31:55] or put logic into the plugin itself [16:31:57] to check the time first [16:31:58] !log temporarily disabling icinga notifications for ocg100[123] ocg service check [16:32:04] Logged the message, Master [16:32:08] It's a race -- changing the interval won't work unless the test can itself reschedule itself for later. [16:32:23] (03CR) 10Jforrester: SpecialCite is now CiteThisPage (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/149597 (owner: 10Jforrester) [16:32:28] I assume having the test block will cause bad things [16:32:55] andrewbogott: you can say that it is only critical if it has been critical X times in a row [16:33:17] like, "for 3 check intervals" [16:34:48] andrewbogott: http://nagios.sourceforge.net/docs/3_0/flapping.html [16:37:14] (03PS1) 10Jgreen: fix variable reassignment error in ocg::nagios::check [operations/puppet] - 10https://gerrit.wikimedia.org/r/152782 [16:37:20] (03PS2) 10Yuvipanda: quarry: Serve JSON output with appropriate MIME type [operations/puppet] - 10https://gerrit.wikimedia.org/r/152779 [16:37:56] (03CR) 10JanZerebecki: [C: 04-1] "Yea I think it makes sense to not artificially prolong the life of the domain if it is scheduled to go away. As an alternate to adding it " [operations/puppet] - 10https://gerrit.wikimedia.org/r/151523 (https://bugzilla.wikimedia.org/60238) (owner: 10Tim Landscheidt) [16:38:16] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 14:37:31 UTC [16:40:39] Coren_WM2014: merge https://gerrit.wikimedia.org/r/#/c/152764/ and https://gerrit.wikimedia.org/r/#/c/152779/? 
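(Editor's aside on the exit-code discussion at 16:23-16:33.) The four plugin return codes listed in the channel, and the suggestion to alert only after a state has been seen "X times in a row", can be sketched roughly as below. This is a deliberate simplification written for illustration: the `notifications` function and its `max_attempts` parameter are hypothetical names, and the logic is not Icinga's actual soft/hard state machine.

```python
# Plugin exit codes as listed in the channel: 0 (OK), 1 (WARN), 2 (CRIT), 3 (UNKNOWN).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def notifications(results, max_attempts=3):
    """Return (index, code) pairs where a notification would fire.

    Hypothetical rule: notify once a non-OK code has been returned
    max_attempts times in a row; any change of code restarts the count.
    """
    alerts = []
    streak = 0
    previous = OK
    for i, code in enumerate(results):
        if code == OK:
            streak = 0          # recovery resets the counter
        elif code == previous:
            streak += 1         # same problem state again
        else:
            streak = 1          # new problem state, start counting
        if streak == max_attempts:
            alerts.append((i, code))
        previous = code
    return alerts


# A check that goes critical, briefly unknown, then critical again.
history = [CRITICAL, UNKNOWN, CRITICAL, CRITICAL, CRITICAL, OK]
print(notifications(history, max_attempts=1))
print(notifications(history, max_attempts=3))
```

With `max_attempts=1` every state change produces a notification, which is the flapping andrewbogott is worried about; requiring three identical results in a row collapses the same history into a single alert.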
[16:41:17] (03PS2) 10Jgreen: fix variable reassignment error in ocg::nagios::check [operations/puppet] - 10https://gerrit.wikimedia.org/r/152782 [16:41:35] andrewbogott: ^ as well, when you're not firefighting [16:49:30] (03PS1) 10Dzahn: remove chwiki.wordpress from French planet feed [operations/puppet] - 10https://gerrit.wikimedia.org/r/152786 [16:50:47] (03CR) 10Dzahn: [C: 031] "sorry, but this is not for soccer (see all the images on http://fr.planet.wikimedia.org/)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152786 (owner: 10Dzahn) [16:52:20] <^d> mutante: I don't see a single wiki-related post at all. [16:53:26] something about commons categories.. [16:53:40] "Pas encore de catégorie Commons, mais une sélection rapide disponible sur mon Flickr" [16:53:58] not yet a commons category.. but my flickr stuff [16:55:06] #monkeyselfie [16:55:30] (03CR) 10Jgreen: [C: 032 V: 031] fix variable reassignment error in ocg::nagios::check [operations/puppet] - 10https://gerrit.wikimedia.org/r/152782 (owner: 10Jgreen) [16:56:16] <^d> odder: We need a single purpose twitter account for this :) [16:56:51] (03CR) 10Filippo Giunchedi: [C: 031] HHVM: fix Apache config for status site [operations/puppet] - 10https://gerrit.wikimedia.org/r/152753 (owner: 10Ori.livneh) [16:57:50] ^d: please review https://gerrit.wikimedia.org/r/#/c/97190/2/templates/varnish/errorpage.inc.vcl.erb :) [16:58:43] * yuvipanda pokes andrewbogott for CR again :D [16:59:00] yuvipanda: um, ok, still staring at puppet races :( [16:59:03] <^d> mutante: existing comments are right, should be amended :) [16:59:06] andrewbogott: ah, ok :( [16:59:22] godog: and https://gerrit.wikimedia.org/r/#/c/152001/ :P [17:03:29] (03CR) 10Alexandros Kosiaris: [C: 04-2] "monitoring should be done in role classes and not module classes anyway. 
If anything should be fixed (which it should) is the module, not " [operations/puppet] - 10https://gerrit.wikimedia.org/r/152766 (owner: 10Andrew Bogott) [17:04:04] (03PS1) 10Dzahn: fix apache site setup for planet [operations/puppet] - 10https://gerrit.wikimedia.org/r/152789 [17:04:57] (03CR) 10Dzahn: [C: 032] fix apache site setup for planet [operations/puppet] - 10https://gerrit.wikimedia.org/r/152789 (owner: 10Dzahn) [17:09:00] (03PS1) 10Tim Landscheidt: beta: Fix IP mapping for stream.wmflabs.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/152791 [17:09:56] (03PS1) 10Dzahn: fix site name for planet default site [operations/puppet] - 10https://gerrit.wikimedia.org/r/152792 [17:10:30] (03CR) 10Dzahn: [C: 032] fix site name for planet default site [operations/puppet] - 10https://gerrit.wikimedia.org/r/152792 (owner: 10Dzahn) [17:10:45] (03PS1) 10Jgreen: make sure nagios ocg plugin gets installed in ocg::nagios::check [operations/puppet] - 10https://gerrit.wikimedia.org/r/152794 [17:11:22] (03CR) 10jenkins-bot: [V: 04-1] make sure nagios ocg plugin gets installed in ocg::nagios::check [operations/puppet] - 10https://gerrit.wikimedia.org/r/152794 (owner: 10Jgreen) [17:12:08] (03CR) 10Dzahn: [C: 04-1] "+ HAS_TAB=1" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152794 (owner: 10Jgreen) [17:12:24] jenkins:) [17:12:46] (03CR) 10Tim Landscheidt: "After merging this, dnsmasq needs to be restarted on virt$SOMETHING as part of the OpenStack network service. The last time this caused a" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152791 (owner: 10Tim Landscheidt) [17:16:08] (03CR) 10Nemo bis: "I'm not sure, isn't this related to the french-speaking wiki-photo-expeditions group?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152786 (owner: 10Dzahn) [17:16:34] (03CR) 10Andrew Bogott: "ok -- we need Sean for a proper fix. 
It's not obvious to me how tasks should be divided up between role::db::labsdb and role::mariadb::la" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152766 (owner: 10Andrew Bogott) [17:17:46] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Thu Aug 7 17:17:37 UTC 2014 [17:17:53] (03PS1) 10Alexandros Kosiaris: Revert "process monitoring for labsdb" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152796 [17:18:08] (03CR) 10Dzahn: "Notice: /Stage[main]/Apache/File[/etc/apache2/sites-enabled/50--etc-apache2-sites-enabled-planet-wikimedia-org.conf]/ensure: removed" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152792 (owner: 10Dzahn) [17:19:07] (03PS2) 10Jgreen: make sure nagios ocg plugin gets installed in ocg::nagios::check [operations/puppet] - 10https://gerrit.wikimedia.org/r/152794 [17:19:29] (03CR) 10Alexandros Kosiaris: "Yes, I remember having that discussion with Sean. He had some pretty convincing argument for that but it eludes my memory now" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152766 (owner: 10Andrew Bogott) [17:20:15] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "process monitoring for labsdb" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152796 (owner: 10Alexandros Kosiaris) [17:20:53] (03CR) 10Jgreen: [C: 032 V: 031] "does it have a tab!??!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152794 (owner: 10Jgreen) [17:21:30] (03PS10) 10Dr0ptp4kt: Log when Internet.org in X-Analytics with proxy tag [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 [17:22:01] (03CR) 10Alexandros Kosiaris: "After running a git bisect I identified the culprit commit and reverted at" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152766 (owner: 10Andrew Bogott) [17:22:26] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 4 unmerged changes in puppet (dir /var/lib/git/operations/puppet). 
[17:22:30] (03PS11) 10Dr0ptp4kt: Log when Internet.org in X-Analytics with proxy tag [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 [17:23:13] (03CR) 10Dr0ptp4kt: "Updated to use case-insensitive regex. See latest PS." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 (owner: 10Dr0ptp4kt) [17:23:35] heh ? puppet-merge says no changes to merge... while there clearly are some.. [17:23:41] gerrit troubles maybe ? [17:24:03] niah, me being tired ... [17:24:05] forget it [17:24:29] (03CR) 10Nemo bis: "That is, https://commons.wikimedia.org/wiki/Category:Sport_event_coverage_by_Wikimedia_CH" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152786 (owner: 10Dzahn) [17:25:17] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [17:25:49] andrewbogott_afk: I have identified the source of the "race conditions" on the puppet run check. Should be fixed in https://gerrit.wikimedia.org/r/#/c/152773/ [17:28:37] (03PS1) 10Jgreen: grr. hate virtual resources. [operations/puppet] - 10https://gerrit.wikimedia.org/r/152797 [17:29:16] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 15:28:39 UTC [17:29:20] (03CR) 10Pleclown: [C: 04-1] "Everything is Wikimedia related." [operations/puppet] - 10https://gerrit.wikimedia.org/r/152786 (owner: 10Dzahn) [17:31:51] (03CR) 10Jgreen: [C: 032 V: 031] grr. hate virtual resources. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/152797 (owner: 10Jgreen) [17:34:13] (03PS1) 10Jgreen: change from icinga to nagios-plugins as require for ocg plugin [operations/puppet] - 10https://gerrit.wikimedia.org/r/152799 [17:37:38] (03CR) 10Jgreen: [C: 032 V: 031] change from icinga to nagios-plugins as require for ocg plugin [operations/puppet] - 10https://gerrit.wikimedia.org/r/152799 (owner: 10Jgreen) [17:46:11] (03CR) 10Pleclown: "I've changed the settings of the rss feed (was set to full text instead of summary, don't know why)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/152786 (owner: 10Dzahn) [17:52:26] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 99 data above and 0 below the confidence bounds [17:52:26] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 29 data above and 0 below the confidence bounds [17:52:26] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 29 data above and 0 below the confidence bounds [17:58:46] RECOVERY - Puppet freshness on cp1048 is OK: puppet ran at Thu Aug 7 17:58:37 UTC 2014 [18:00:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [18:00:46] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Epic puppet fail [18:00:57] ^ that's me [18:01:12] (everything's fine!) 
[18:01:57] RECOVERY - Puppet freshness on cp1049 is OK: puppet ran at Thu Aug 7 18:01:55 UTC 2014 [18:02:46] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [18:03:42] (03PS2) 10Ori.livneh: HHVM: set a 5s graceful shutdown timeout [operations/puppet] - 10https://gerrit.wikimedia.org/r/152001 [18:03:49] (03CR) 10Ori.livneh: [C: 032] HHVM: set a 5s graceful shutdown timeout [operations/puppet] - 10https://gerrit.wikimedia.org/r/152001 (owner: 10Ori.livneh) [18:04:02] (03CR) 10Ori.livneh: [V: 032] HHVM: set a 5s graceful shutdown timeout [operations/puppet] - 10https://gerrit.wikimedia.org/r/152001 (owner: 10Ori.livneh) [18:15:17] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [18:21:16] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 16:20:38 UTC [18:47:29] !starting the process of fixing upload cache sizes, there will be periodic slim 5xx spikes... [18:48:26] (03PS1) 10Jgreen: grr, redo ocg nagios check foo, was all wrong [operations/puppet] - 10https://gerrit.wikimedia.org/r/152805 [18:51:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [18:51:26] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Epic puppet fail [18:53:25] (03PS2) 10Jgreen: grr, redo ocg nagios check foo, was all wrong [operations/puppet] - 10https://gerrit.wikimedia.org/r/152805 [18:54:46] logbot dead I guess? [18:55:13] oh, no, I just failed to actually say "log" [18:55:25] !log starting the process of fixing upload cache sizes, there will be periodic slim 5xx spikes... [18:55:29] Logged the message, Master [18:57:58] ACKNOWLEDGEMENT - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 6.7647994323e-68 Jeff Gage same old problem host. service is ok. 
[19:00:36] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Thu Aug 7 19:00:25 UTC 2014 [19:01:16] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [19:03:30] (03CR) 10Jgreen: [C: 032 V: 031] grr, redo ocg nagios check foo, was all wrong [operations/puppet] - 10https://gerrit.wikimedia.org/r/152805 (owner: 10Jgreen) [19:07:57] PROBLEM - DPKG on analytics1021 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:12:57] RECOVERY - DPKG on analytics1021 is OK: All packages OK [19:19:15] !log rebooting analytics1021 for kernel upgrade [19:19:21] Logged the message, Master [19:20:29] (03CR) 10JanZerebecki: [C: 031] "Good, probably first deploy and test I2bf2a0d8b4f7064615ac31eff78f3237eb26298c ." [operations/puppet] - 10https://gerrit.wikimedia.org/r/152248 (owner: 10Giuseppe Lavagetto) [19:21:48] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1012 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 16.0 [19:21:49] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1018 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 12.0 [19:22:37] PROBLEM - OCG health check on ocg1002 is CRITICAL: The command defined for service OCG health check does not exist [19:22:38] PROBLEM - OCG health check on ocg1003 is CRITICAL: The command defined for service OCG health check does not exist [19:22:50] kafka broker alerts are me, not a problem [19:23:27] PROBLEM - OCG health check on ocg1001 is CRITICAL: The command defined for service OCG health check does not exist [19:23:35] grr. 
[19:23:57] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on analytics1012 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 16.0 Jeff Gage analytics1021 reboot [19:23:57] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on analytics1018 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 12.0 Jeff Gage analytics1021 reboot [19:29:20] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 15:28:39 UTC [19:32:42] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [19:33:50] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1018 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [19:34:58] (03CR) 10Yurik: [C: 04-1] Log when Internet.org in X-Analytics with proxy tag (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 (owner: 10Dr0ptp4kt) [19:35:09] dr0ptp4kt, sorry :) [19:36:02] (03PS3) 10JanZerebecki: bugzilla: use ssl_ciphersuite [operations/puppet] - 10https://gerrit.wikimedia.org/r/152282 (owner: 10Giuseppe Lavagetto) [19:36:50] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1012 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [19:37:18] (03CR) 10JanZerebecki: [C: 031] "Removed unmerged, not needed dependency." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/152282 (owner: 10Giuseppe Lavagetto) [19:37:20] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 5118.00827509 [19:45:52] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:02:41] (03PS1) 10Jgreen: another stab at nrpe ocg health check config [operations/puppet] - 10https://gerrit.wikimedia.org/r/152812 [20:03:21] (03CR) 10jenkins-bot: [V: 04-1] another stab at nrpe ocg health check config [operations/puppet] - 10https://gerrit.wikimedia.org/r/152812 (owner: 10Jgreen) [20:03:23] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet has 10 failures [20:05:58] (03PS2) 10Jgreen: another stab at nrpe ocg health check config [operations/puppet] - 10https://gerrit.wikimedia.org/r/152812 [20:07:04] (03CR) 10Jgreen: [C: 032 V: 031] another stab at nrpe ocg health check config [operations/puppet] - 10https://gerrit.wikimedia.org/r/152812 (owner: 10Jgreen) [20:13:27] (03CR) 10Dr0ptp4kt: Log when Internet.org in X-Analytics with proxy tag (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 (owner: 10Dr0ptp4kt) [20:13:45] (03PS1) 10Jgreen: remove ^check_ from name, it gets prepended later for greater redundancy [operations/puppet] - 10https://gerrit.wikimedia.org/r/152814 [20:15:43] (03CR) 10Jgreen: [C: 032 V: 031] remove ^check_ from name, it gets prepended later for greater redundancy [operations/puppet] - 10https://gerrit.wikimedia.org/r/152814 (owner: 10Jgreen) [20:22:19] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [20:34:13] PROBLEM - Puppet freshness on analytics1032 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 20:30:24 UTC [20:34:24] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:36:13] PROBLEM - Puppet 
freshness on analytics1032 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 20:30:24 UTC [20:37:30] i'll check out analytics1032 [20:38:13] PROBLEM - Puppet freshness on analytics1032 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 20:30:24 UTC [20:38:13] PROBLEM - Puppet freshness on virt1009 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 18:37:41 UTC [20:42:38] ACKNOWLEDGEMENT - Puppet freshness on analytics1032 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 20:30:24 UTC Jeff Gage cause not yet known [20:44:10] RECOVERY - Puppet freshness on analytics1032 is OK: puppet ran at Thu Aug 7 20:44:00 UTC 2014 [20:45:30] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [20:51:29] (03PS1) 10Jgreen: include nrpe on ocg100[*] [operations/puppet] - 10https://gerrit.wikimedia.org/r/152822 [20:53:34] (03PS2) 10Jgreen: include nrpe on ocg100[*] [operations/puppet] - 10https://gerrit.wikimedia.org/r/152822 [20:55:34] (03CR) 10Jgreen: [C: 032 V: 032] include nrpe on ocg100[*] [operations/puppet] - 10https://gerrit.wikimedia.org/r/152822 (owner: 10Jgreen) [20:57:10] PROBLEM - puppet last run on amssq39 is CRITICAL: CRITICAL: Epic puppet fail [20:57:30] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:13:20] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 19:13:06 UTC [21:16:11] RECOVERY - puppet last run on amssq39 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [21:23:11] Someone restart jenkins? [21:29:35] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 15:28:39 UTC [21:43:38] Deskana: you're spamming with your |Away stuff a lot tonight :) [21:44:34] Probably a bad conference connection. 
[21:44:50] my gosh it was bad today [21:52:59] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Thu Aug 7 21:52:48 UTC 2014 [22:04:17] what happened greg-g [22:06:07] ? [22:06:44] odder: more context would be helpful :) [22:06:52] oh, re wifi? [22:07:22] conference wifi is just generally bad for geek events unless you bring in real ops people to setup a real network [22:07:24] 23:44 greg-g: my gosh it was bad today [22:07:48] yeah, that's re the conf wifi (was in reply to marktraceur ) [22:08:25] it's Wikimania, go meet people! [22:08:29] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 69 seconds ago with 0 failures [22:09:21] odder: HACKATHON! [22:09:30] *now* it's wikimania [22:15:11] greg-g: How's the Wi-Fi at Wikimania? :) [22:18:02] the same :) [22:21:10] See? no difference :) [22:22:14] odder: but! hackathon implies a need to be connected :) wikimania doesn't (inherently, but people will be a-twittering) [22:23:16] wikimania needs editing and reading... [22:29:07] and videos [22:34:46] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [22:39:36] PROBLEM - Puppet freshness on virt1009 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 18:37:41 UTC [22:41:40] (03CR) 10BBlack: [C: 031] "Unfortunately I don't think there's any better way to do it at this time (restarting dnsmasq for these changes via Nova)." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/152791 (owner: 10Tim Landscheidt) [22:47:12] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [22:59:20] (03CR) 10Andrew Bogott: [C: 032] quarry: Allow Cross Domain access to outputs for everyone [operations/puppet] - 10https://gerrit.wikimedia.org/r/152764 (owner: 10Yuvipanda) [22:59:57] (03CR) 10Andrew Bogott: [C: 032] quarry: Serve JSON output with appropriate MIME type [operations/puppet] - 10https://gerrit.wikimedia.org/r/152779 (owner: 10Yuvipanda) [23:00:48] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [23:12:24] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [23:26:24] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Puppet has 1 failures [23:28:45] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds [23:30:37] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 15:28:39 UTC [23:33:45] (03CR) 10Mattflaschen: [C: 031] "The discussion on tewiki supports enabling it. We should do so during one of our next windows." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/151639 (https://bugzilla.wikimedia.org/69103) (owner: 10Phuedx) [23:33:48] PROBLEM - Puppet freshness on db1011 is CRITICAL: Last successful Puppet run was Thu 07 Aug 2014 21:32:54 UTC [23:43:49] bblack, could you merge a small patch in the next 15 min? [23:47:16] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [23:53:22] RECOVERY - Puppet freshness on db1011 is OK: puppet ran at Thu Aug 7 23:53:17 UTC 2014