[00:23:36] (PS2) Ori.livneh: add `keyholder` module for managing a shared ssh-agent [puppet] - https://gerrit.wikimedia.org/r/165779 [00:51:13] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 307 seconds [00:51:23] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 325 seconds [00:52:14] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds [00:52:23] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [00:55:02] (PS1) Dzahn: add a delay between updates for wikisite/editthis [debs/wikistats] - https://gerrit.wikimedia.org/r/166162 (https://bugzilla.wikimedia.org/59742) [00:56:53] (CR) jenkins-bot: [V: -1] add a delay between updates for wikisite/editthis [debs/wikistats] - https://gerrit.wikimedia.org/r/166162 (https://bugzilla.wikimedia.org/59742) (owner: Dzahn) [01:02:39] (PS2) Dzahn: add a delay between updates for wikisite/editthis [debs/wikistats] - https://gerrit.wikimedia.org/r/166162 (https://bugzilla.wikimedia.org/59742) [01:05:51] (PS3) Dzahn: add a delay between updates for wikisite/editthis [debs/wikistats] - https://gerrit.wikimedia.org/r/166162 (https://bugzilla.wikimedia.org/59742) [01:06:16] (CR) Dzahn: [C: 2] add a delay between updates for wikisite/editthis [debs/wikistats] - https://gerrit.wikimedia.org/r/166162 (https://bugzilla.wikimedia.org/59742) (owner: Dzahn) [01:06:22] (CR) Dzahn: [V: 2] add a delay between updates for wikisite/editthis [debs/wikistats] - https://gerrit.wikimedia.org/r/166162 (https://bugzilla.wikimedia.org/59742) (owner: Dzahn) [01:11:49] (PS4) Ori.livneh: puppetmaster Apache template - retab [puppet] - https://gerrit.wikimedia.org/r/153987 (owner: Dzahn) [01:11:52] (CR) Ori.livneh: [C: 1] puppetmaster Apache template - retab [puppet] - https://gerrit.wikimedia.org/r/153987 (owner: Dzahn) [01:17:10] (PS1) Dzahn: add update cronjobs
for wikisite/editthis [puppet] - https://gerrit.wikimedia.org/r/166164 (https://bugzilla.wikimedia.org/59742) [01:18:35] (CR) Dzahn: [C: 2] "labs-only and now has a delay" [puppet] - https://gerrit.wikimedia.org/r/166164 (https://bugzilla.wikimedia.org/59742) (owner: Dzahn) [01:19:43] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave] [01:26:53] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 [01:47:44] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail [02:06:04] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [02:17:18] !log LocalisationUpdate completed (1.25wmf2) at 2014-10-11 02:17:18+00:00 [02:17:28] Logged the message, Master [02:29:32] !log LocalisationUpdate completed (1.25wmf3) at 2014-10-11 02:29:32+00:00 [02:29:38] Logged the message, Master [03:12:43] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave] [03:14:43] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 [03:31:55] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: Puppet has 1 failures [03:32:44] PROBLEM - puppet last run on ms-be2001 is CRITICAL: CRITICAL: Puppet has 1 failures [03:32:53] PROBLEM - puppet last run on ms-be2008 is CRITICAL: CRITICAL: Puppet has 1 failures [03:35:01] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Oct 11 03:35:00 UTC 2014 (duration 34m 59s) [03:35:09] Logged the message, Master [03:50:01] RECOVERY
- puppet last run on ms-be2001 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [03:50:13] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [03:51:13] RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [04:24:34] PROBLEM - Swift HTTP backend on ms-fe2003 is CRITICAL: Connection timed out [04:24:43] PROBLEM - Host pollux is DOWN: CRITICAL - Time to live exceeded (208.80.153.43) [04:24:43] PROBLEM - Host achernar is DOWN: CRITICAL - Time to live exceeded (208.80.153.42) [04:24:54] PROBLEM - Swift HTTP backend on ms-fe2004 is CRITICAL: Connection timed out [04:25:43] PROBLEM - Host bast2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.5) [04:25:44] PROBLEM - Host baham is DOWN: CRITICAL - Time to live exceeded (208.80.153.13) [04:25:44] PROBLEM - Host labcontrol2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.14) [04:25:44] PROBLEM - Host install2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.4) [04:26:11] RECOVERY - Host bast2001 is UP: PING OK - Packet loss = 0%, RTA = 35.87 ms [04:26:11] RECOVERY - Host pollux is UP: PING OK - Packet loss = 0%, RTA = 34.58 ms [04:26:11] RECOVERY - Host labcontrol2001 is UP: PING OK - Packet loss = 0%, RTA = 36.32 ms [04:26:11] RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 34.79 ms [04:26:11] RECOVERY - Host install2001 is UP: PING OK - Packet loss = 0%, RTA = 40.39 ms [04:26:56] PROBLEM - Host cr1-codfw is DOWN: CRITICAL - Time to live exceeded (208.80.153.192) [04:28:14] PROBLEM - Host cr2-codfw is DOWN: CRITICAL - Time to live exceeded (208.80.153.193) [04:28:14] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:1:d6ae:52ff:feac:4dc8 [04:28:26] PROBLEM - Host pollux is DOWN: CRITICAL - Time to live exceeded (208.80.153.43) [04:28:26] PROBLEM - Host labcontrol2001 is DOWN: 
CRITICAL - Time to live exceeded (208.80.153.14) [04:28:26] PROBLEM - Host baham is DOWN: CRITICAL - Time to live exceeded (208.80.153.13) [04:28:26] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: CRITICAL - Time to live exceeded (208.80.153.15) [04:28:26] PROBLEM - Host install2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.4) [04:28:26] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:d6ae:52ff:fead:5610 [04:28:26] PROBLEM - Host bast2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.5) [04:28:27] PROBLEM - Host acamar is DOWN: CRITICAL - Time to live exceeded (208.80.153.12) [04:28:33] PROBLEM - LVS HTTP IPv4 on ms-fe.svc.codfw.wmnet is CRITICAL: Connection timed out [04:28:41] PROBLEM - Host ns1-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::e [04:28:41] PROBLEM - Host ms-fe2003 is DOWN: PING CRITICAL - Packet loss = 100% [04:28:41] PROBLEM - Host ms-fe2004 is DOWN: PING CRITICAL - Packet loss = 100% [04:28:41] PROBLEM - Host db2003 is DOWN: PING CRITICAL - Packet loss = 100% [04:28:41] PROBLEM - Host db2034 is DOWN: PING CRITICAL - Packet loss = 100% [04:28:41] PROBLEM - Host db2005 is DOWN: PING CRITICAL - Packet loss = 100% [04:28:41] PROBLEM - Recursive DNS on 208.80.153.42 is CRITICAL: CRITICAL - Plugin timed out while executing system call [04:28:43] PROBLEM - Host ms-be2011 is DOWN: PING CRITICAL - Packet loss = 100% [04:28:43] PROBLEM - Host lvs2003 is DOWN: PING CRITICAL - Packet loss = 100% [04:28:43] PROBLEM - Host ms-be2007 is DOWN: PING CRITICAL - Packet loss = 100% [04:28:43] PROBLEM - Host db2017 is DOWN: PING CRITICAL - Packet loss = 100% [04:32:13] PROBLEM - Host 208.80.153.42 is DOWN: CRITICAL - Time to live exceeded (208.80.153.42) [04:32:23] PROBLEM - Host 208.80.153.12 is DOWN: CRITICAL - Time to live exceeded (208.80.153.12) [04:32:58] PROBLEM - Host ms-fe.svc.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [04:39:35] well, that's awesome [04:41:23] I was 
going to re-route ns1, since that's the only remotely-interesting thing depending on codfw [04:41:44] but it seems last time paravoid dealt with this, he left it in as a low-pref backup route, so it's already working [04:48:54] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 34.45 ms [04:48:54] RECOVERY - Host acamar is UP: PING OK - Packet loss = 0%, RTA = 34.35 ms [04:48:54] RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 36.40 ms [04:48:54] RECOVERY - Host ms-be2006 is UP: PING OK - Packet loss = 0%, RTA = 34.47 ms [04:48:54] RECOVERY - Host ms-be2008 is UP: PING OK - Packet loss = 0%, RTA = 34.29 ms [04:48:54] RECOVERY - Host db2030 is UP: PING OK - Packet loss = 0%, RTA = 35.92 ms [04:48:55] RECOVERY - Host db2028 is UP: PING OK - Packet loss = 0%, RTA = 36.31 ms [04:51:19] PROBLEM - puppet last run on lvs2005 is CRITICAL: CRITICAL: puppet fail [04:51:29] RECOVERY - Host cr1-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.25 ms [04:51:29] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: puppet fail [04:51:29] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: Puppet has 69 failures [04:51:29] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: puppet fail [04:51:30] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: puppet fail [04:51:30] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: puppet fail [04:51:30] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: Puppet has 3 failures [04:51:30] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: puppet fail [04:51:39] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: puppet fail [04:51:40] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: puppet fail [04:51:40] PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: puppet fail [04:51:40] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail [04:51:40] PROBLEM - puppet last run on db2030 is CRITICAL: 
CRITICAL: puppet fail [04:51:40] PROBLEM - puppet last run on db2033 is CRITICAL: CRITICAL: puppet fail [04:51:40] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: puppet fail [04:51:59] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: puppet fail [04:51:59] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: puppet fail [04:51:59] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: puppet fail [04:52:00] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: puppet fail [04:52:00] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: Puppet has 11 failures [04:52:17] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: puppet fail [04:52:17] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail [04:52:17] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: puppet fail [04:52:17] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: puppet fail [04:52:28] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: puppet fail [04:52:28] PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: puppet fail [04:52:28] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: puppet fail [04:52:28] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: puppet fail [04:52:28] PROBLEM - puppet last run on ms-be2002 is CRITICAL: CRITICAL: puppet fail [04:52:37] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: puppet fail [04:52:37] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail [04:52:37] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 11 failures [04:52:38] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: Puppet has 25 failures [04:52:50] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [04:53:57] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [04:53:58] 
RECOVERY - puppet last run on db2028 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [04:53:58] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [04:54:18] RECOVERY - puppet last run on db2012 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [04:55:18] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [04:55:38] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [04:56:38] RECOVERY - puppet last run on db2017 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [04:56:57] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [04:57:28] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [04:57:37] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [04:57:48] RECOVERY - puppet last run on lvs2005 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [04:57:48] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [04:58:47] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [04:59:08] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [04:59:58] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [05:00:09] RECOVERY - puppet last run on db2033 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [05:02:48] RECOVERY - puppet last run on ms-be2002 is OK: OK: Puppet is currently 
enabled, last run 48 seconds ago with 0 failures [05:03:37] RECOVERY - puppet last run on ms-fe2004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [05:03:47] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [05:03:58] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [05:04:07] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [05:04:08] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [05:04:49] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [05:05:17] RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [05:05:38] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [05:05:58] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [05:07:07] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [05:07:07] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [05:07:17] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [05:07:38] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [05:07:38] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [05:08:47] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 
failures [05:11:07] PROBLEM - Swift HTTP backend on ms-fe2002 is CRITICAL: Connection timed out [05:11:07] PROBLEM - Swift HTTP backend on ms-fe2001 is CRITICAL: Connection timed out [05:11:07] PROBLEM - Swift HTTP backend on ms-fe2003 is CRITICAL: Connection timed out [05:11:07] PROBLEM - LVS HTTP IPv4 on ms-fe.svc.codfw.wmnet is CRITICAL: Connection timed out [05:12:58] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 2 failures [05:14:07] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: Puppet has 4 failures [05:14:30] (PS5) KartikMistry: WIP: apertium service configuration for Beta [puppet] - https://gerrit.wikimedia.org/r/165485 [05:16:18] PROBLEM - LVS HTTP IPv4 on ms-fe.svc.codfw.wmnet is CRITICAL: Connection timed out [05:16:29] PROBLEM - Host baham is DOWN: CRITICAL - Time to live exceeded (208.80.153.13) [05:16:29] PROBLEM - Host pollux is DOWN: CRITICAL - Time to live exceeded (208.80.153.43) [05:16:29] PROBLEM - Host install2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.4) [05:16:29] PROBLEM - Host bast2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.5) [05:16:29] PROBLEM - Host achernar is DOWN: CRITICAL - Time to live exceeded (208.80.153.42) [05:16:40] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:d6ae:52ff:fead:5610 [05:16:41] PROBLEM - Host 208.80.153.42 is DOWN: CRITICAL - Time to live exceeded (208.80.153.42) [05:17:12] PROBLEM - Host cr2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [05:17:12] PROBLEM - Host ms-be2008 is DOWN: PING CRITICAL - Packet loss = 100% [05:17:12] PROBLEM - Host db2038 is DOWN: PING CRITICAL - Packet loss = 100% [05:17:13] PROBLEM - Host ms-be2003 is DOWN: PING CRITICAL - Packet loss = 100% [05:17:13] PROBLEM - Host ms-be2011 is DOWN: PING CRITICAL - Packet loss = 100% [05:17:13] PROBLEM - Host acamar is DOWN: PING CRITICAL - Packet loss = 100% [05:17:13] PROBLEM - Host ms-be2006 is DOWN: PING CRITICAL -
Packet loss = 100% [05:18:39] RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 52.62 ms [05:18:39] RECOVERY - Host 208.80.153.42 is UP: PING OK - Packet loss = 0%, RTA = 52.04 ms [05:20:29] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: puppet fail [05:21:09] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: Puppet has 1 failures [05:21:28] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: puppet fail [05:22:53] PROBLEM - Swift HTTP backend on ms-fe2002 is CRITICAL: Connection timed out [05:22:54] PROBLEM - Swift HTTP backend on ms-fe2001 is CRITICAL: Connection timed out [05:22:54] PROBLEM - Swift HTTP backend on ms-fe2003 is CRITICAL: Connection timed out [05:22:54] PROBLEM - LVS HTTP IPv4 on ms-fe.svc.codfw.wmnet is CRITICAL: Connection timed out [05:22:57] PROBLEM - puppet last run on ms-be2009 is CRITICAL: Timeout while attempting connection [05:23:49] PROBLEM - Host 208.80.153.12 is DOWN: CRITICAL - Time to live exceeded (208.80.153.12) [05:23:49] PROBLEM - Host cr1-codfw is DOWN: CRITICAL - Time to live exceeded (208.80.153.192) [05:23:49] PROBLEM - Host ns1-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::e [05:23:58] PROBLEM - Host bast2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.5) [05:23:58] PROBLEM - Host achernar is DOWN: CRITICAL - Time to live exceeded (208.80.153.42) [05:23:58] PROBLEM - Host acamar is DOWN: CRITICAL - Time to live exceeded (208.80.153.12) [05:23:58] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:d6ae:52ff:fead:5610 [05:23:58] PROBLEM - Host cr2-codfw is DOWN: CRITICAL - Time to live exceeded (208.80.153.193) [05:23:58] PROBLEM - Host labcontrol2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.14) [05:23:58] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: CRITICAL - Time to live exceeded (208.80.153.15) [05:23:59] PROBLEM - Host install2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.4) 
[05:23:59] PROBLEM - Host pollux is DOWN: CRITICAL - Time to live exceeded (208.80.153.43) [05:24:00] PROBLEM - Host baham is DOWN: CRITICAL - Time to live exceeded (208.80.153.13) [05:24:08] PROBLEM - Host db2005 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:09] PROBLEM - Host db2019 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:09] PROBLEM - Host db2029 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:09] PROBLEM - Host db2039 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:09] PROBLEM - Host lvs2003 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:18] PROBLEM - Host 208.80.153.42 is DOWN: CRITICAL - Time to live exceeded (208.80.153.42) [05:24:18] PROBLEM - Host db2017 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:18] PROBLEM - Host db2012 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:18] PROBLEM - Host ms-be2007 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:18] PROBLEM - Host ms-be2009 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:19] PROBLEM - Host ms-be2006 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:19] PROBLEM - Host ms-be2008 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:20] PROBLEM - Host ms-be2004 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:20] PROBLEM - Host lvs2005 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:21] PROBLEM - Host db2037 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:21] PROBLEM - Host db2011 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:22] PROBLEM - Host db2028 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:22] PROBLEM - Host ms-be2003 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:23] PROBLEM - Host db2033 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:23] PROBLEM - Host ms-be2001 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:24] PROBLEM - Host ms-fe2002 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:24] PROBLEM - Host ms-fe2004 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:25] PROBLEM - 
Host ms-be2005 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:25] PROBLEM - Host ms-be2012 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:26] PROBLEM - Host db2030 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:26] PROBLEM - Host ms-be2002 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:27] PROBLEM - Host db2038 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:27] PROBLEM - Host ms-be2010 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:28] PROBLEM - Host ms-be2011 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:28] PROBLEM - Host db2002 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:39] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:1:d6ae:52ff:feac:4dc8 [05:24:49] PROBLEM - Host ms-fe2001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:49] PROBLEM - Host db2034 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:49] PROBLEM - Host lvs2001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:49] PROBLEM - Host db2007 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:49] PROBLEM - Host lvs2006 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:49] PROBLEM - Host db2001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:49] PROBLEM - Host db2035 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:50] PROBLEM - Host db2023 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:50] PROBLEM - Host db2016 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:51] PROBLEM - Host lvs2002 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:51] PROBLEM - Host ms-fe2003 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:52] PROBLEM - Host db2003 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:52] PROBLEM - Host db2004 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:53] PROBLEM - Host db2009 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:53] PROBLEM - Host 
ms-fe.svc.codfw.wmnet is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:55] PROBLEM - Host lvs2004 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:55] PROBLEM - Host db2018 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:55] PROBLEM - Host db2010 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:24:55] PROBLEM - Host db2036 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:28:28] RECOVERY - Host ms-fe2004 is UP: PING OK - Packet loss = 0%, RTA = 53.39 ms [05:28:28] RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 52.41 ms [05:28:28] RECOVERY - Host ms-be2006 is UP: PING OK - Packet loss = 0%, RTA = 53.80 ms [05:28:28] RECOVERY - Host lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 52.19 ms [05:28:28] RECOVERY - Host db2001 is UP: PING OK - Packet loss = 0%, RTA = 52.73 ms [05:29:59] RECOVERY - Host ms-fe.svc.codfw.wmnet is UP: PING OK - Packet loss = 0%, RTA = 51.74 ms [05:30:49] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail [05:30:49] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: puppet fail [05:30:58] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: puppet fail [05:31:09] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: puppet fail [05:31:09] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 24 failures [05:31:09] PROBLEM - puppet last run on ms-be2002 is CRITICAL: CRITICAL: Puppet has 5 failures [05:31:18] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [05:31:18] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail [05:31:18] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: puppet fail [05:31:19] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail [05:31:19] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: puppet fail [05:31:19] PROBLEM - puppet last run on db2034 is 
CRITICAL: CRITICAL: puppet fail [05:31:28] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail [05:31:28] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: puppet fail [05:31:28] PROBLEM - puppet last run on db2005 is CRITICAL: CRITICAL: puppet fail [05:31:53] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: puppet fail [05:32:09] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [05:36:58] PROBLEM - Swift HTTP backend on ms-fe2004 is CRITICAL: Connection timed out [05:37:19] PROBLEM - Host baham is DOWN: CRITICAL - Time to live exceeded (208.80.153.13) [05:37:20] PROBLEM - Host bast2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.5) [05:37:20] PROBLEM - Host achernar is DOWN: CRITICAL - Time to live exceeded (208.80.153.42) [05:37:20] PROBLEM - Host install2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.4) [05:37:28] PROBLEM - Host labcontrol2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.14) [05:37:39] RECOVERY - Host install2001 is UP: PING OK - Packet loss = 0%, RTA = 54.06 ms [05:37:42] RECOVERY - Host achernar is UP: PING OK - Packet loss = 0%, RTA = 52.03 ms [05:37:42] RECOVERY - Host labcontrol2001 is UP: PING OK - Packet loss = 0%, RTA = 51.79 ms [05:37:42] RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 51.75 ms [05:38:08] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 92 seconds ago with 0 failures [05:38:09] RECOVERY - Host bast2001 is UP: PING OK - Packet loss = 0%, RTA = 52.55 ms [05:38:58] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:1:d6ae:52ff:feac:4dc8 [05:39:51] RECOVERY - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is UP: PING OK - Packet loss = 0%, RTA = 52.88 ms [05:40:39] PROBLEM - puppet last run on db2019 is CRITICAL: CRITICAL: puppet fail [05:42:29] RECOVERY - puppet last run on ms-be2002 is OK: 
OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [05:43:11] RECOVERY - puppet last run on ms-fe2004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [05:43:28] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [05:43:38] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [05:43:48] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [05:43:49] RECOVERY - puppet last run on db2019 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [05:43:49] RECOVERY - puppet last run on db2005 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [05:44:49] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [05:45:28] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [05:46:09] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [05:46:48] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [05:46:49] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [05:47:39] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [05:48:09] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [05:48:19] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [05:53:42] (PS1) Glaisher: Add several domains to wgCopyUploadsDomains for commons
[mediawiki-config] - https://gerrit.wikimedia.org/r/166176 (https://bugzilla.wikimedia.org/71195) [05:56:29] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [05:57:39] PROBLEM - Host install2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.4) [05:57:48] PROBLEM - Swift HTTP backend on ms-fe2001 is CRITICAL: Connection timed out [05:57:49] PROBLEM - LVS HTTP IPv4 on ms-fe.svc.codfw.wmnet is CRITICAL: Connection timed out [05:57:52] PROBLEM - Swift HTTP backend on ms-fe2003 is CRITICAL: Connection timed out [05:57:58] PROBLEM - Host pollux is DOWN: CRITICAL - Time to live exceeded (208.80.153.43) [05:58:08] PROBLEM - Host achernar is DOWN: CRITICAL - Time to live exceeded (208.80.153.42) [05:58:11] PROBLEM - Host baham is DOWN: CRITICAL - Time to live exceeded (208.80.153.13) [05:58:11] PROBLEM - Host bast2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.5) [05:58:13] PROBLEM - Host ms-be2008 is DOWN: PING CRITICAL - Packet loss = 100% [05:58:13] PROBLEM - Host db2018 is DOWN: PING CRITICAL - Packet loss = 100% [05:58:13] PROBLEM - Host db2011 is DOWN: PING CRITICAL - Packet loss = 100% [05:58:13] PROBLEM - Host db2028 is DOWN: PING CRITICAL - Packet loss = 100% [05:58:13] PROBLEM - Host lvs2005 is DOWN: PING CRITICAL - Packet loss = 100% [05:58:13] PROBLEM - Host db2004 is DOWN: PING CRITICAL - Packet loss = 100% [05:58:14] PROBLEM - Host ms-be2005 is DOWN: PING CRITICAL - Packet loss = 100% [05:58:14] PROBLEM - Host ms-fe2002 is DOWN: PING CRITICAL - Packet loss = 100% [05:58:15] PROBLEM - Host ms-be2006 is DOWN: PING CRITICAL - Packet loss = 100% [05:58:15] PROBLEM - Host db2030 is DOWN: PING CRITICAL - Packet loss = 100% [05:58:16] PROBLEM - Host db2016 is DOWN: PING CRITICAL - Packet loss = 100% [05:58:16] PROBLEM - Host ms-fe2004 is DOWN: PING CRITICAL - Packet loss = 100% [05:58:17] PROBLEM - Host db2034 is DOWN: PING CRITICAL - Packet loss = 100% [05:58:21] PROBLEM -
Host db2023 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:58:21] PROBLEM - Host ms-be2002 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:58:30] PROBLEM - Host acamar is DOWN: CRITICAL - Time to live exceeded (208.80.153.12) [05:58:38] PROBLEM - Host 208.80.153.12 is DOWN: CRITICAL - Time to live exceeded (208.80.153.12) [05:58:39] PROBLEM - Host cr1-codfw is DOWN: CRITICAL - Time to live exceeded (208.80.153.192) [05:58:39] PROBLEM - Host labcontrol2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.14) [05:58:58] PROBLEM - Host cr2-codfw is DOWN: CRITICAL - Time to live exceeded (208.80.153.193) [05:58:59] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: CRITICAL - Time to live exceeded (208.80.153.15) [05:59:00] PROBLEM - Host db2037 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:59:00] PROBLEM - Host db2017 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:59:00] PROBLEM - Host ms-be2007 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:59:00] PROBLEM - Host ms-be2012 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:59:00] PROBLEM - Host db2005 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:59:01] PROBLEM - Host ms-be2010 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:59:01] PROBLEM - Host ms-be2003 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:59:02] PROBLEM - Host db2019 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:59:02] PROBLEM - Host ms-be2001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:59:03] PROBLEM - Host ms-be2004 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:59:03] PROBLEM - Host ms-be2011 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:59:04] PROBLEM - Host db2002 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:59:04] PROBLEM - Host ms-be2009 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:59:05] PROBLEM - Host db2007 is DOWN: CRITICAL - Plugin timed out after 15 seconds [05:59:05] 
PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:d6ae:52ff:fead:5610 [05:59:08] PROBLEM - Host ns1-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::e [05:59:09] PROBLEM - Host db2038 is DOWN: PING CRITICAL - Packet loss = 100% [05:59:09] PROBLEM - Host db2033 is DOWN: PING CRITICAL - Packet loss = 100% [05:59:10] PROBLEM - Host 208.80.153.42 is DOWN: CRITICAL - Time to live exceeded (208.80.153.42) [05:59:28] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:1:d6ae:52ff:feac:4dc8 [05:59:38] PROBLEM - Host ms-fe.svc.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [05:59:47] PROBLEM - Host db2029 is DOWN: PING CRITICAL - Packet loss = 100% [05:59:47] PROBLEM - Host ms-fe2001 is DOWN: PING CRITICAL - Packet loss = 100% [05:59:47] PROBLEM - Host db2010 is DOWN: PING CRITICAL - Packet loss = 100% [05:59:47] PROBLEM - Host lvs2001 is DOWN: PING CRITICAL - Packet loss = 100% [05:59:47] PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [05:59:48] PROBLEM - Host db2009 is DOWN: PING CRITICAL - Packet loss = 100% [05:59:48] PROBLEM - Host ms-fe2003 is DOWN: PING CRITICAL - Packet loss = 100% [05:59:49] PROBLEM - Host db2001 is DOWN: PING CRITICAL - Packet loss = 100% [05:59:49] PROBLEM - Host db2012 is DOWN: PING CRITICAL - Packet loss = 100% [05:59:59] RECOVERY - Host ms-fe2004 is UP: PING OK - Packet loss = 0%, RTA = 52.41 ms [05:59:59] RECOVERY - Host lvs2005 is UP: PING OK - Packet loss = 0%, RTA = 52.09 ms [05:59:59] RECOVERY - Host ms-fe2003 is UP: PING OK - Packet loss = 0%, RTA = 52.57 ms [05:59:59] RECOVERY - Host db2033 is UP: PING OK - Packet loss = 0%, RTA = 51.96 ms [06:00:00] RECOVERY - Host ms-fe2001 is UP: PING OK - Packet loss = 0%, RTA = 53.04 ms [06:00:00] RECOVERY - Host db2016 is UP: PING OK - Packet loss = 0%, RTA = 52.23 ms [06:00:00] RECOVERY - Host db2023 is UP: PING OK - Packet loss = 0%, RTA = 51.68 ms [06:00:09] RECOVERY 
- Host labcontrol2001 is UP: PING OK - Packet loss = 0%, RTA = 54.03 ms [06:00:09] RECOVERY - Host bast2001 is UP: PING OK - Packet loss = 0%, RTA = 54.23 ms [06:00:09] RECOVERY - Host ms-be2011 is UP: PING OK - Packet loss = 0%, RTA = 52.62 ms [06:00:09] RECOVERY - Host lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 51.95 ms [06:00:09] RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 52.01 ms [06:01:10] !log put ms-fe.svc.codfw.wmnet into downtime for the next two days, because I'm tired of getting paged about it :p [06:01:17] Logged the message, Master [06:01:28] RECOVERY - Host 208.80.153.12 is UP: PING OK - Packet loss = 0%, RTA = 52.48 ms [06:01:49] RECOVERY - Host cr1-codfw is UP: PING OK - Packet loss = 0%, RTA = 54.11 ms [06:01:58] RECOVERY - Host 2620:0:860:2:d6ae:52ff:fead:5610 is UP: PING OK - Packet loss = 0%, RTA = 53.30 ms [06:03:28] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: Puppet has 1 failures [06:04:08] PROBLEM - Swift HTTP backend on ms-fe2002 is CRITICAL: Connection timed out [06:04:08] PROBLEM - Swift HTTP backend on ms-fe2001 is CRITICAL: Connection timed out [06:04:08] PROBLEM - Swift HTTP backend on ms-fe2003 is CRITICAL: Connection timed out [06:05:19] PROBLEM - Host ms-fe2001 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:19] PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:19] PROBLEM - Host db2010 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:19] PROBLEM - Host lvs2001 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:19] PROBLEM - Host db2012 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:20] PROBLEM - Host db2009 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:20] PROBLEM - Host ms-fe2003 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:21] PROBLEM - Host db2001 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:21] PROBLEM - Host lvs2004 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:22] PROBLEM - Host db2036 is DOWN: PING CRITICAL - Packet loss = 100% 
[06:05:22] PROBLEM - Host lvs2006 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:23] PROBLEM - Host db2035 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:23] PROBLEM - Host lvs2003 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:24] PROBLEM - Host db2003 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:24] PROBLEM - Host db2039 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:29] RECOVERY - Host db2009 is UP: PING OK - Packet loss = 0%, RTA = 51.73 ms [06:05:29] RECOVERY - Host db2012 is UP: PING OK - Packet loss = 0%, RTA = 51.69 ms [06:05:29] RECOVERY - Host lvs2006 is UP: PING OK - Packet loss = 0%, RTA = 51.69 ms [06:05:29] RECOVERY - Host db2003 is UP: PING OK - Packet loss = 0%, RTA = 51.69 ms [06:05:29] RECOVERY - Host lvs2003 is UP: PING OK - Packet loss = 0%, RTA = 51.73 ms [06:05:30] RECOVERY - Host db2039 is UP: PING OK - Packet loss = 0%, RTA = 51.73 ms [06:05:30] RECOVERY - Host db2010 is UP: PING OK - Packet loss = 0%, RTA = 51.73 ms [06:05:31] RECOVERY - Host db2036 is UP: PING OK - Packet loss = 0%, RTA = 51.75 ms [06:05:31] RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 51.64 ms [06:05:32] RECOVERY - Host db2035 is UP: PING OK - Packet loss = 0%, RTA = 51.70 ms [06:05:32] RECOVERY - Host ms-fe2003 is UP: PING OK - Packet loss = 0%, RTA = 51.72 ms [06:05:33] RECOVERY - Host db2001 is UP: PING OK - Packet loss = 0%, RTA = 51.71 ms [06:05:33] RECOVERY - Host lvs2004 is UP: PING OK - Packet loss = 0%, RTA = 51.71 ms [06:05:34] RECOVERY - Host lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 51.81 ms [06:05:34] RECOVERY - Host ms-fe2001 is UP: PING OK - Packet loss = 0%, RTA = 52.58 ms [06:08:00] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: puppet fail [06:08:39] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:10:20] RECOVERY - Host ms-fe.svc.codfw.wmnet is UP: PING OK - Packet loss = 0%, RTA = 52.39 ms [06:17:40] RECOVERY - puppet last run on pollux is OK: OK: Puppet 
is currently enabled, last run 35 seconds ago with 0 failures [06:18:21] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:19:20] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:21:51] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:25:00] RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:28:31] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:29:31] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:40] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:50] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:00] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:11] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:11] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:11] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:11] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:20] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:20] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:50] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:51] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:11] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:21] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet 
has 1 failures [06:31:21] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:30] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:31] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:40] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:50] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:45:00] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:45:21] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:45:31] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:45:31] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:45:40] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:45:41] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:45:50] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:46:02] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:46:13] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:46:14] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:46:15] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:46:22] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 51 seconds ago 
with 0 failures [06:46:26] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:46:43] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:46:52] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:47:02] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures [06:47:03] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 68 seconds ago with 0 failures [06:47:23] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:47:23] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [08:24:43] (03PS1) 10Christopher Johnson (WMDE): added git fetch --tags service [puppet] - 10https://gerrit.wikimedia.org/r/166181 [08:25:26] (03CR) 10jenkins-bot: [V: 04-1] added git fetch --tags service [puppet] - 10https://gerrit.wikimedia.org/r/166181 (owner: 10Christopher Johnson (WMDE)) [08:32:41] (03CR) 10Christopher Johnson (WMDE): "This addresses a problem that I noticed with the use of tags in the phabricator module. 
If I tag a commit and change the puppet manifest " [puppet] - 10https://gerrit.wikimedia.org/r/166181 (owner: 10Christopher Johnson (WMDE)) [08:36:25] (03PS2) 10Christopher Johnson (WMDE): added git fetch --tags service [puppet] - 10https://gerrit.wikimedia.org/r/166181 [08:37:07] (03CR) 10jenkins-bot: [V: 04-1] added git fetch --tags service [puppet] - 10https://gerrit.wikimedia.org/r/166181 (owner: 10Christopher Johnson (WMDE)) [09:28:12] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "There are a few implementation details I'd like to see fixed, but more fundamentally:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/166181 (owner: 10Christopher Johnson (WMDE)) [11:12:30] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail [11:33:01] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [11:39:29] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [11:57:40] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [12:46:19] (03PS1) 10Glaisher: Create Oriya Wikisource (orwikisource) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166186 (https://bugzilla.wikimedia.org/71875) [12:46:38] (03CR) 10jenkins-bot: [V: 04-1] Create Oriya Wikisource (orwikisource) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166186 (https://bugzilla.wikimedia.org/71875) (owner: 10Glaisher) [13:12:11] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Add "recommended article" and "featured list" badge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166144 (https://bugzilla.wikimedia.org/70268) (owner: 10Bene) [14:35:37] (03PS2) 10Glaisher: Create Oriya Wikisource (orwikisource) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166186 (https://bugzilla.wikimedia.org/71875) [14:57:32] (03PS2) 10Nemo bis: Enhanced recent changes: explicitly disable by default [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/124292 (https://bugzilla.wikimedia.org/35785) [14:58:22] (03PS3) 10Nemo bis: Enhanced recent changes: explicitly disable by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/124292 (https://bugzilla.wikimedia.org/35785) [15:01:35] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: puppet fail [15:20:56] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:24:38] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: Puppet has 1 failures [15:27:27] is there a tool to create the changelog of a wmf version? (like https://www.mediawiki.org/wiki/MediaWiki_1.25/wmf2/Changelog) [15:35:34] yes, somewhere [15:37:20] in tools release or so I think [15:37:21] one second [15:37:55] mediawiki/tools/release is the repo [15:38:13] make-deploy-notes/make-deploy-notes is the script [15:38:16] FlorianSW: ^ [15:38:42] hoo: thanks :) then i can create the wmf3 changelog, i hope :) [15:39:11] Probably, if you have the things checked out as needed [15:39:25] s/checked out/available/ [15:39:35] You'll probably need all git repos [15:40:09] hoo: yes [15:40:16] s/create/try create [15:41:56] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [15:45:44] aww crap [15:45:47] legoktm: ping [15:47:58] (03CR) 10KartikMistry: WIP: apertium service configuration for Beta (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/165485 (owner: 10KartikMistry) [15:49:47] Reedy: There? [15:49:47] hmph, i wanted to look up that script, and noticed https://git.wikimedia.org/ is down. (504 Gateway Time-out) [15:51:31] MatmaRex: if it is on git.wikimedia, normally it is on github, too?! :) [15:51:55] yeah. i was just going to look there, but got distracted and then hoo did it :) [15:52:01] heh :) [15:52:12] :P [15:52:29] thanks anyway :) [16:00:23] greg-g: Around at any chance? 
[16:06:01] hoo, MatmaRex: thanks for your help :) -> https://www.mediawiki.org/wiki/MediaWiki_1.25/wmf3/Changelog [16:06:28] Nice :) [16:06:47] :) [16:06:56] PROBLEM - Disk space on ms-be1013 is CRITICAL: DISK CRITICAL - free space: / 1222 MB (2% inode=85%): [16:07:16] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave] [16:09:09] !log hoo Synchronized php-1.25wmf3/extensions/CentralAuth/: Deploying forgotten backport from Thursday: SpecialCentralAutoLogin: Fix getting files after file layout change (duration: 00m 08s) [16:09:17] Logged the message, Master [16:11:12] Ok, I think we got everything done now that got forgotten during Thursdays train :P :D [16:11:26] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 [16:13:51] hoo: \o/ :D [16:25:36] (03PS1) 10Ori.livneh: mediawiki::monitoring::webserver: tidy [puppet] - 10https://gerrit.wikimedia.org/r/166196 [16:30:15] hoo: it didn't get deployed o.O? 
[16:30:25] legoktm: Well, now it did [16:30:26] :P [16:30:33] :/ [16:36:05] PROBLEM - puppet last run on mw1093 is CRITICAL: CRITICAL: Puppet has 1 failures [16:41:30] (03CR) 10Odisha1: [C: 031] Create Oriya Wikisource (orwikisource) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166186 (https://bugzilla.wikimedia.org/71875) (owner: 10Glaisher) [16:54:16] RECOVERY - puppet last run on mw1093 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [16:55:17] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:59:16] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 54224 bytes in 2.293 second response time [17:26:15] PROBLEM - puppet last run on mw1174 is CRITICAL: CRITICAL: Puppet has 1 failures [17:44:16] RECOVERY - puppet last run on mw1174 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [17:53:16] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:18:45] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 54303 bytes in 1.083 second response time [19:16:19] (03PS1) 10coren: Labs: fix innatention typo in firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/166211 [19:16:25] andrewbogott: ^^ [19:17:24] (03CR) 10Andrew Bogott: [C: 032] "this mistake was so obvious as to be invisible" [puppet] - 10https://gerrit.wikimedia.org/r/166211 (owner: 10coren) [19:22:46] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:45] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 54303 bytes in 0.444 second response time [19:40:06] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:13] (03PS4) 10BryanDavis: iegreview: Create module and role for deployment [puppet] - 10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) 
[20:08:03] (03CR) 10BryanDavis: "Addressed comments from mutante and added config for Parsoid integration." [puppet] - 10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) (owner: 10BryanDavis) [20:08:19] (03PS4) 10BryanDavis: iegreview: Apply role to zirconium and configure varnish [puppet] - 10https://gerrit.wikimedia.org/r/165232 (https://bugzilla.wikimedia.org/71597) [20:14:46] (03PS5) 10BryanDavis: iegreview: Provision iegreview application [puppet] - 10https://gerrit.wikimedia.org/r/165232 (https://bugzilla.wikimedia.org/71597) [20:15:23] (03PS1) 10coren: Labs: Moar verbose in firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/166221 [20:15:33] andrewbogott: ^^ that still won't work, but we'll get a good idea why. [20:17:55] (03PS2) 10BryanDavis: ocg: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163375 [20:21:21] (03CR) 10Ori.livneh: iegreview: Create module and role for deployment (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) (owner: 10BryanDavis) [20:22:50] (03CR) 10Ori.livneh: [C: 031] iegreview: Put iegreview.wikimedia.org behind misc-web-lb.eqiad [dns] - 10https://gerrit.wikimedia.org/r/165236 (https://bugzilla.wikimedia.org/71597) (owner: 10BryanDavis) [20:23:41] (03CR) 10Ori.livneh: [C: 031] iegreview: Provision iegreview application [puppet] - 10https://gerrit.wikimedia.org/r/165232 (https://bugzilla.wikimedia.org/71597) (owner: 10BryanDavis) [20:23:45] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 54258 bytes in 0.501 second response time [20:23:50] (03PS2) 10BryanDavis: role::mathoid: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163374 [20:24:57] (03PS2) 10BryanDavis: role::parsoid: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163373 [20:25:32] (03PS2) 10BryanDavis: mwprof: Convert to 
package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163372 [20:25:41] Coren, andrewbogott: since you're both around, do you mind if I merge a few of Bryan's deployment::target -> package <|provider==trebuchet|>'s changes? They're no-ops, I can babysit them, and I know how to troubleshoot. [20:26:08] ori: I'm breaking up the chain so you can pick and choose :) [20:26:30] (03PS2) 10BryanDavis: gdash: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163371 [20:26:33] yeah it sucks that no one got to those yet, you e-mailed the list and everything [20:27:09] I imagine few besides you and me care ;) [20:27:27] ori: I think that's fine, I should be on for an hour or so. although I can't complain to understand the context [20:27:42] thanks, i'll be done in less than that [20:29:43] (03CR) 10Ori.livneh: [C: 032] gdash: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163371 (owner: 10BryanDavis) [20:33:24] (03PS2) 10Ori.livneh: role::deployment::test: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163360 (owner: 10BryanDavis) [20:33:41] (03CR) 10Ori.livneh: [C: 032 V: 032] role::deployment::test: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163360 (owner: 10BryanDavis) [20:33:52] (03PS2) 10Ori.livneh: scholarships: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163361 (owner: 10BryanDavis) [20:33:59] (03CR) 10Ori.livneh: [C: 032 V: 032] scholarships: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163361 (owner: 10BryanDavis) [20:34:34] (03PS2) 10BryanDavis: eventlogging: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163370 [20:35:04] (03PS3) 10Ori.livneh: eventlogging: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163370 (owner: 10BryanDavis) 
[20:35:10] (03CR) 10Ori.livneh: [C: 032 V: 032] eventlogging: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163370 (owner: 10BryanDavis) [20:36:17] (03PS2) 10Ori.livneh: kibana: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163362 (owner: 10BryanDavis) [20:36:23] (03CR) 10Ori.livneh: [C: 032 V: 032] kibana: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163362 (owner: 10BryanDavis) [20:36:35] (03PS2) 10Ori.livneh: role::logstash: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163363 (owner: 10BryanDavis) [20:36:47] (03CR) 10Ori.livneh: [C: 032 V: 032] role::logstash: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163363 (owner: 10BryanDavis) [20:37:53] (03PS2) 10BryanDavis: role::servermon: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163369 [20:38:24] (03PS2) 10BryanDavis: role::performance: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163368 [20:38:25] PROBLEM - puppet last run on vanadium is CRITICAL: CRITICAL: Puppet has 1 failures [20:39:17] (03PS2) 10BryanDavis: role::librenms: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163367 [20:40:28] (03PS2) 10BryanDavis: role::analytics: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163366 [20:41:05] (03PS2) 10BryanDavis: role::ci::slave: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163365 [20:41:37] (03PS2) 10BryanDavis: role::elasticsearch: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163364 [20:41:38] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: Puppet has 1 failures [20:43:27] (03PS2) 10BryanDavis: Remove deployment::target [puppet] - 
10https://gerrit.wikimedia.org/r/163376 [20:44:07] PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: Puppet has 1 failures [20:45:23] (03PS1) 10Ori.livneh: Fix-ups for 3d3c9144 and 8dc45e87af [puppet] - 10https://gerrit.wikimedia.org/r/166243 [20:45:53] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-ups for 3d3c9144 and 8dc45e87af [puppet] - 10https://gerrit.wikimedia.org/r/166243 (owner: 10Ori.livneh) [20:47:14] ori: ugh. do I need to fix the rest of them like that? [20:47:29] if they make the same mistake [20:47:36] RECOVERY - puppet last run on vanadium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:47:44] I bet they do. [20:47:45] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [20:47:46] but i won't get to any more of those today [20:47:55] yeah. no worries [20:48:09] they all applied correctly [20:48:15] (with that patch on top) [20:48:15] RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:48:16] I'll look at the rest and fix them up if needed [20:48:22] andrewbogott: all done, thanks [20:50:35] (03PS3) 10BryanDavis: role::elasticsearch: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163364 [20:52:32] (03PS3) 10BryanDavis: role::analytics: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163366 [20:56:35] (03PS3) 10BryanDavis: role::ci::slave: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163365 [20:57:53] (03PS3) 10BryanDavis: role::performance: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163368 [20:59:21] (03PS3) 10BryanDavis: role::librenms: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163367 [20:59:34] (03PS1) 10coren: Labs: Make more than one partition [puppet] - 
10https://gerrit.wikimedia.org/r/166255 [20:59:36] (03PS1) 10coren: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - 10https://gerrit.wikimedia.org/r/166256 [20:59:56] Oh, gah. My git repo is desync'ed. [21:00:27] (03PS3) 10BryanDavis: role::servermon: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163369 [21:00:33] (03Abandoned) 10coren: Labs: Make more than one partition [puppet] - 10https://gerrit.wikimedia.org/r/166255 (owner: 10coren) [21:00:47] (03Abandoned) 10coren: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - 10https://gerrit.wikimedia.org/r/166256 (owner: 10coren) [21:01:54] (03PS3) 10BryanDavis: mwprof: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163372 [21:04:52] (03PS3) 10BryanDavis: role::parsoid: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163373 [21:06:32] (03PS1) 10coren: Labs: Make swap a primary partition in images [puppet] - 10https://gerrit.wikimedia.org/r/166257 [21:06:33] andrewbogott: ^^ [21:07:25] (03PS3) 10BryanDavis: role::mathoid: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163374 [21:07:27] (03CR) 10Andrew Bogott: [C: 032] Labs: Make swap a primary partition in images [puppet] - 10https://gerrit.wikimedia.org/r/166257 (owner: 10coren) [21:11:01] (03PS3) 10BryanDavis: ocg: Convert to package<|provider==trebuchet|> [puppet] - 10https://gerrit.wikimedia.org/r/163375 [21:41:56] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:01:53] (03PS5) 10BryanDavis: iegreview: Create module and role for deployment [puppet] - 10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) [22:02:23] (03CR) 10BryanDavis: "Updates for Ori's comments in patch set 5" (038 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) (owner: 10BryanDavis) [22:03:38] (03PS6) 10BryanDavis: iegreview: Create module and role for deployment [puppet] - 10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) [22:06:49] (03PS7) 10BryanDavis: iegreview: Create module and role for deployment [puppet] - 10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) [22:16:47] (03CR) 10BryanDavis: [C: 04-1] "Do not merge until all other patches in the series have been merged:" [puppet] - 10https://gerrit.wikimedia.org/r/163376 (owner: 10BryanDavis) [22:17:01] (03PS3) 10BryanDavis: Remove deployment::target [puppet] - 10https://gerrit.wikimedia.org/r/163376 [22:28:36] (03PS1) 10coren: Labs: fixes to firstboot.sh so that it works [puppet] - 10https://gerrit.wikimedia.org/r/166261 [22:28:58] ^^ andrewbogott: another (stupid) bug bites the dust. [22:29:22] (03CR) 10Andrew Bogott: [C: 032] Labs: fixes to firstboot.sh so that it works [puppet] - 10https://gerrit.wikimedia.org/r/166261 (owner: 10coren) [23:12:56] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 54237 bytes in 4.876 second response time [23:21:22] !log logstash not showing any udp2log events after 2014-10-10T01:42:22.000Z [23:21:29] Logged the message, Master [23:22:36] Reedy: Can you log into logstash1001 and restart the logstash service please? I'm on a laptop with no pros ssh key. [23:22:40] *prod [23:22:41] ja [23:23:13] stop: Unknown instance: [23:23:13] logstash start/running, process 17106 [23:23:38] * Reedy looks if it's running on the other boxes [23:23:46] !log Started logstash on logstash1001 [23:23:51] Logged the message, Master [23:24:02] Might want to tail /var/log/logstash.log and make sure it stays running after initializing [23:24:32] The instance on logstash1001 is the most important one. 
It's the only ip that udp2log is sending to [23:25:10] * bd808 wants to design a better transport that has some redundancy… someday [23:25:57] logstash 994 216 1.8 2667616 297156 ? Sl Sep29 37498:57 /usr/bin/java -Xms128m -Xmx128m -Djava.io.tmpdir=/var/lib/logstash -jar /opt/logstash/logstash.jar agent -f /etc/logstash/conf.d --log /var/log/logstash/logstash.log --filterworkers 1 [23:25:57] logstash 17106 22.3 1.3 3162384 223876 ? SNsl 23:23 0:35 /usr/bin/java -Xms128m -Xmx128m -Djava.io.tmpdir=/var/lib/logstash -Xmx500m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -jar /opt/logstash/vendor/jar/jruby-complete-1.7.11.jar -I/opt/logstash/lib /opt/logstash/lib/logstash/runner.rb agent -f /etc/logstash/conf.d -l [23:25:57] /var/log/logstash/logstash.log --log /var/log/logstash/logstash.log --filterworkers 1 [23:26:15] Wonder why there's 2 with very different parameters [23:26:53] jesus. OCG is spammy [23:27:15] yeah it's turned up pretty loud. [23:27:32] That's ok as long as we can keep up with the traffic though. [23:27:47] 2 logstash processes doesn't sound right [23:27:47] /var/log/logstash/logstash.log [23:27:53] * bd808 nods [23:28:03] Looks like it's having redis related issues [23:28:13] yeah that's sort of expected [23:28:14] {:timestamp=>"2014-10-11T23:27:29.641000+0000", :message=>"A plugin had an unrecoverable error. 
Will restart this plugin.\n Plugin: \"127.0.0.1\", data_type=>\"list\", key=>\"logstash\", name=>\"default\">\n Error: ERR operation not permitted", :level=>:error} [23:28:25] It shouldn't hurt anything [23:28:46] {:timestamp=>"2014-10-11T23:28:31.502000+0000", :message=>"UDP listener died", :exception=>#, :backtrace=>["org/jruby/ext/socket/RubyUDPSocket.java:160:in `bind'", "/opt/logstash/lib/logstash/inputs/udp.rb:69:in `udp_listener'", "/opt/logstash/lib/logstash/inputs/udp.rb:50:in `run'", "/opt/logstash/lib/logstash/pipeline.rb:163:in `inputworker'", "/opt/logstash/lib/logstash/pipeline. [23:28:46] rb:157:in `start_input'"], :level=>:warn} [23:28:47] hmm [23:28:54] let me stop it, and kill that other process too [23:28:58] we added that input but never started using it and apparently never configured it correctly either [23:29:12] sounds like a good plan [23:29:51] init.d or service for starting? ;) [23:30:03] serice [23:30:07] *service [23:30:15] Is there an init.d? [23:30:26] That would be from the new deb packages [23:30:35] and likely not what we want [23:31:07] -rwxr-xr-x 1 root root 3462 Jun 24 12:41 /etc/init.d/logstash [23:32:06] oh… maybe that's right? [23:32:19] I don't see a custom upstart script in puppet [23:33:27] It's getting events now at least [23:33:29] (03CR) 10Ori.livneh: [C: 031] "I left a few asshole comments in-line that you may take or leave as you like. This looks good overall." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) (owner: 10BryanDavis) [23:33:42] !log killed both logstash events on logstash1001. Started logstash again after [23:33:49] Logged the message, Master [23:39:15] !log killed both logstash events on logstash100[23]. 
Started logstash again after [23:39:25] Logged the message, Master [23:41:20] bd808: looks to be behaving now [23:41:37] Reedy: tanks [23:41:41] thanks even [23:42:23] I wonder if I "broke" it [23:42:38] 2014-10-10T01:42:22.000Z [23:43:09] hmm, no, I did the upgrades on the 8th [23:50:00] Reedy: Maybe when Daniel merged the puppet changes? [23:51:19] ah, looks like it might be [23:51:30] Oct 10 2:32 AM