[00:47:35] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 12 hours old. [01:21:57] PROBLEM - RAID on analytics1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:22:57] RECOVERY - RAID on analytics1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [01:26:06] PROBLEM - RAID on analytics1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:28:06] RECOVERY - RAID on analytics1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [01:41:33] PROBLEM - RAID on analytics1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:43:34] RECOVERY - RAID on analytics1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [02:02:36] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet last ran 2606152 seconds ago, expected 14400 [02:05:36] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [02:13:27] !log LocalisationUpdate completed (1.25wmf1) at 2014-10-05 02:13:26+00:00 [02:13:41] Logged the message, Master [02:22:47] !log LocalisationUpdate completed (1.25wmf2) at 2014-10-05 02:22:47+00:00 [02:22:55] Logged the message, Master [02:55:58] (03PS1) 10Ori.livneh: Enable LuaSandbox profiling when `forceprofile` is true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164750 [02:57:04] ori: :D [02:58:12] jackmcbarn: do you think that's an acceptable trade-off? [02:58:34] ori: definitely good for now, but do we know if it affects performance yet? [02:58:55] no, i've been swamped and haven't gotten to it [02:59:29] (03CR) 10Jackmcbarn: [C: 031] "I like this for now, but in the future, I'd like to see it always enabled (unless we discover that it is bad for performance)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164750 (owner: 10Ori.livneh) [02:59:55] (03PS2) 10Ori.livneh: Enable LuaSandbox profiling when `forceprofile` is true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164750 [03:00:02] (03CR) 10Ori.livneh: [C: 032] Enable LuaSandbox profiling when `forceprofile` is true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164750 (owner: 10Ori.livneh) [03:00:09] (03Merged) 10jenkins-bot: Enable LuaSandbox profiling when `forceprofile` is true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164750 (owner: 10Ori.livneh) [03:01:08] ori: one other thing maybe for a follow-up. i think we should have it always enabled on jobrunners, since they sometimes exhibit weird behavior that can't be reproduced on the user-facing servers [03:01:59] !log ori Synchronized wmf-config/CommonSettings.php: I707b5754: Enable LuaSandbox profiling when is true (duration: 00m 07s) [03:02:05] Logged the message, Master [03:02:09] jackmcbarn: sure, +1 [03:02:38] ori: i notice that the cpu limit doubler patch got reverted though, since its logic for detecting jobrunners didn't work. do we have a good way to do that? [03:19:55] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Oct 5 03:19:55 UTC 2014 (duration 19m 53s) [03:20:04] Logged the message, Master [05:48:01] Carmela: just the tip. [05:48:46] No MSG. [05:49:07] Heh. 
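The change merged above gates LuaSandbox profiling behind the `forceprofile` request flag, so only requests that explicitly ask for profiling pay the overhead. A minimal sketch of what such a conditional could look like in wmf-config/CommonSettings.php — the Scribunto option name and the sampling period below are assumptions for illustration, not the actual I707b5754 diff:

```php
// Sketch only: turn on Scribunto LuaSandbox profiling when the request
// carries forceprofile, so ordinary page views run without profiling.
// The 'profilerPeriod' knob and the 2 ms period are assumed values.
if ( isset( $_GET['forceprofile'] ) || isset( $_POST['forceprofile'] ) ) {
	$wgScribuntoEngineConf['luasandbox']['profilerPeriod'] = 2;
}
```

jackmcbarn's follow-up idea — leaving it always on for jobrunners — would amount to dropping the conditional for that server group, which hinges on the open question at the end of the exchange: a reliable way to detect jobrunners in configuration.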
[06:28:37] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: puppet fail [06:29:27] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:30] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:46] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:46] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:47] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:47] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:56] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:56] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 3 failures [06:29:57] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:57] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:57] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:57] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:06] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:06] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:06] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:07] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:17] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:17] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:27] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:47] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:44:53] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:44:58] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [06:45:33] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [06:45:38] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:45:38] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:45:48] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:45:49] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:45:49] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:45:49] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:45:50] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:46:00] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:46:20] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:46:20] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is 
currently enabled, last run 9 seconds ago with 0 failures [06:46:20] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [06:46:21] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:46:28] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:46:38] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:46:50] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:46:59] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 0 hours old. [06:47:18] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:47:19] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [08:03:33] (03PS25) 10Catrope: Citoid puppetization [puppet] - 10https://gerrit.wikimedia.org/r/163068 [08:20:12] (03PS26) 10Catrope: Citoid puppetization [puppet] - 10https://gerrit.wikimedia.org/r/163068 [08:20:14] (03PS1) 10Catrope: Add citoid module to sca1001 and sca1002 [puppet] - 10https://gerrit.wikimedia.org/r/164758 [08:20:16] (03PS1) 10Catrope: Add LVS for citoid [puppet] - 10https://gerrit.wikimedia.org/r/164759 [11:00:45] (03PS1) 10QChris: Ensure that the namenode directory exists before starting the namenode [puppet/cdh] - 10https://gerrit.wikimedia.org/r/164761 [11:01:52] (03PS1) 10QChris: Declare namenode directory only once [puppet] - 10https://gerrit.wikimedia.org/r/164762 [11:01:54] (03PS1) 10QChris: Declare datanode's mount directories only once [puppet] - 10https://gerrit.wikimedia.org/r/164763 [11:22:00] RECOVERY - Host ns1-v4 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [11:23:05] !log adding static route for ns1 to rubidium (ns0) on cr1-eqiad to temporarily redirect its traffic while the codfw is offline [11:23:13] Logged the message, Master [11:32:00] RECOVERY - Host db2012 is UP: PING OK - Packet loss = 0%, RTA = 52.56 ms [11:32:00] RECOVERY - Host db2002 is UP: PING OK - Packet loss = 0%, RTA = 52.71 ms [11:32:00] RECOVERY - Host lvs2004 is UP: PING OK - Packet loss = 0%, RTA = 52.97 ms [11:32:00] RECOVERY - Host ms-be2006 is UP: PING OK - Packet loss = 0%, RTA = 53.11 ms [11:32:00] RECOVERY - Host db2011 is UP: PING OK - Packet loss = 0%, RTA = 53.13 ms [11:33:40] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:33:40] PROBLEM - Host ps1-d3-pmtpa is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:33:40] PROBLEM - Host ps1-c1-pmtpa is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:33:40] PROBLEM - Host ps1-d2-pmtpa is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:33:40] PROBLEM - Host ps1-c3-pmtpa is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:33:50] PROBLEM - Host ps1-c2-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [11:34:00] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: puppet fail [11:34:00] PROBLEM - puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail [11:34:00] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: puppet fail [11:34:09] 
PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: puppet fail [11:34:19] PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [11:34:39] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: puppet fail [11:34:39] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: puppet fail [11:34:39] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: puppet fail [11:34:39] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: puppet fail [11:34:40] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: puppet fail [11:34:40] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail [11:34:40] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: puppet fail [11:34:41] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: puppet fail [11:34:41] PROBLEM - puppet last run on ms-be2008 is CRITICAL: CRITICAL: puppet fail [11:34:49] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: puppet fail [11:34:49] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: puppet fail [11:34:50] PROBLEM - puppet last run on ms-be2001 is CRITICAL: CRITICAL: puppet fail [11:34:50] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: puppet fail [11:35:00] PROBLEM - puppet last run on db2005 is CRITICAL: CRITICAL: puppet fail [11:35:11] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: puppet fail [11:35:11] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: puppet fail [11:35:12] PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: puppet fail [11:35:12] PROBLEM - puppet last run on ms-be2002 is CRITICAL: CRITICAL: puppet fail [11:35:12] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: puppet fail [11:35:12] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: puppet fail [11:35:12] PROBLEM - puppet last run on db2033 is CRITICAL: CRITICAL: puppet fail [11:35:12] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: puppet fail [11:35:13] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: puppet fail [11:35:13] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: puppet fail [11:35:14] PROBLEM - puppet last run on lvs2005 is CRITICAL: CRITICAL: puppet fail [11:35:14] PROBLEM - puppet last run on db2019 is CRITICAL: CRITICAL: puppet fail [11:35:15] PROBLEM - puppet last run on ms-be2012 is CRITICAL: CRITICAL: puppet fail [11:35:22] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail [11:35:23] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: puppet fail [11:35:23] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail [11:35:42] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [11:35:54] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [11:36:12] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [11:37:03] RECOVERY - puppet last run on db2017 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [11:37:03] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [11:37:23] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [11:38:12] RECOVERY - puppet last run on lvs2005 is OK: OK: Puppet is currently enabled, 
last run 28 seconds ago with 0 failures [11:38:42] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [11:39:54] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [11:40:14] RECOVERY - puppet last run on db2033 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [11:41:18] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [11:42:11] RECOVERY - puppet last run on ms-be2002 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [11:42:50] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [11:43:34] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [11:43:34] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [11:43:38] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [11:43:58] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [11:44:12] RECOVERY - puppet last run on db2005 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [11:44:13] RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [11:44:28] RECOVERY - puppet last run on db2019 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [11:45:00] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [11:45:29] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [11:46:38] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [11:47:09] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [11:47:38] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [11:47:38] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [11:48:28] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [11:48:59] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [11:50:18] RECOVERY - puppet last run on ms-be2001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [11:50:19] RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [11:50:19] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [11:50:38] RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [11:51:19] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [11:51:59] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [11:54:59] 
RECOVERY - Host ps1-d2-pmtpa is UP: PING WARNING - Packet loss = 93%, RTA = 30.34 ms [11:54:59] RECOVERY - Host ps1-c3-pmtpa is UP: PING WARNING - Packet loss = 93%, RTA = 29.33 ms [11:54:59] RECOVERY - Host ps1-d3-pmtpa is UP: PING WARNING - Packet loss = 93%, RTA = 39.98 ms [11:54:59] RECOVERY - Host ps1-d1-pmtpa is UP: PING WARNING - Packet loss = 93%, RTA = 37.28 ms [11:54:59] RECOVERY - Host ps1-c1-pmtpa is UP: PING WARNING - Packet loss = 93%, RTA = 29.61 ms [11:55:14] RECOVERY - Host ps1-c2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 30.00 ms [11:56:18] PROBLEM - Host achernar is DOWN: CRITICAL - Time to live exceeded (208.80.153.42) [11:56:28] PROBLEM - Host install2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.4) [11:56:38] PROBLEM - Host acamar is DOWN: CRITICAL - Time to live exceeded (208.80.153.12) [11:56:38] PROBLEM - Host labcontrol2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.14) [11:56:38] PROBLEM - Host bast2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.5) [11:56:38] PROBLEM - Host pollux is DOWN: CRITICAL - Time to live exceeded (208.80.153.43) [11:56:49] PROBLEM - Host baham is DOWN: CRITICAL - Time to live exceeded (208.80.153.13) [11:56:49] PROBLEM - Host ns1-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::e [11:56:59] PROBLEM - Host ms-be2008 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:00] PROBLEM - Host ms-be2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:00] PROBLEM - Host db2033 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:00] PROBLEM - Host db2037 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:00] PROBLEM - Host db2039 is DOWN: PING CRITICAL - Packet loss = 100% [11:59:28] PROBLEM - Host cr1-codfw is DOWN: CRITICAL - Time to live exceeded (208.80.153.192) [11:59:39] PROBLEM - Host cr2-codfw is DOWN: CRITICAL - Time to live exceeded (208.80.153.193) [12:04:46] (03CR) 10Krinkle: "@Hashar: I'm aware. That's why I'm using 94, not 99. 
I'm still work-in-progress on this, but I'll probably end up using "xvdb-run --auto-s" [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [12:11:13] RECOVERY - Host pollux is UP: PING OK - Packet loss = 0%, RTA = 52.11 ms [12:11:13] RECOVERY - Host db2017 is UP: PING OK - Packet loss = 0%, RTA = 52.05 ms [12:11:13] RECOVERY - Host db2007 is UP: PING OK - Packet loss = 0%, RTA = 52.02 ms [12:11:13] RECOVERY - Host ms-be2003 is UP: PING OK - Packet loss = 0%, RTA = 52.06 ms [12:11:13] RECOVERY - Host ms-be2005 is UP: PING OK - Packet loss = 0%, RTA = 52.19 ms [12:12:53] PROBLEM - Host ps1-d3-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [12:12:53] PROBLEM - Host ps1-d1-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [12:12:53] PROBLEM - Host ps1-c3-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [12:12:53] PROBLEM - Host ps1-c1-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [12:13:14] PROBLEM - Host ps1-c2-pmtpa is DOWN: CRITICAL - Plugin timed out after 15 seconds [12:13:14] PROBLEM - Host ps1-d2-pmtpa is DOWN: CRITICAL - Plugin timed out after 15 seconds [12:13:23] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: puppet fail [12:13:24] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: puppet fail [12:13:33] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail [12:13:33] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: puppet fail [12:13:34] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: puppet fail [12:13:43] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: puppet fail [12:13:53] PROBLEM - puppet last run on lvs2005 is CRITICAL: CRITICAL: puppet fail [12:13:59] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: puppet fail [12:13:59] PROBLEM - puppet last run on ms-be2001 is CRITICAL: CRITICAL: puppet fail [12:13:59] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: puppet fail [12:13:59] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: puppet fail [12:13:59] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: puppet fail [12:13:59] PROBLEM - puppet last run on db2005 is CRITICAL: CRITICAL: puppet fail [12:13:59] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: puppet fail [12:14:02] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: puppet fail [12:14:03] PROBLEM - puppet last run on db2019 is CRITICAL: CRITICAL: puppet fail [12:14:03] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: puppet fail [12:14:03] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: puppet fail [12:14:03] PROBLEM - puppet last run on ms-be2008 is CRITICAL: CRITICAL: puppet fail [12:14:13] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: puppet fail [12:14:13] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: puppet fail [12:14:14] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: puppet fail [12:14:14] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail [12:14:14] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: puppet fail [12:14:14] PROBLEM - puppet last run on db2033 is CRITICAL: CRITICAL: puppet fail [12:14:14] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: puppet fail [12:14:14] PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: puppet fail [12:14:14] PROBLEM - puppet last run on ms-be2002 is CRITICAL: CRITICAL: puppet fail [12:14:15] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: puppet fail [12:14:23] PROBLEM - 
puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail [12:14:24] PROBLEM - puppet last run on ms-be2012 is CRITICAL: CRITICAL: puppet fail [12:14:24] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail [12:14:54] RECOVERY - Host cr1-codfw is UP: PING OK - Packet loss = 0%, RTA = 56.20 ms [12:15:03] RECOVERY - Host cr2-codfw is UP: PING OK - Packet loss = 0%, RTA = 52.76 ms [12:16:21] (03CR) 10Krinkle: "Note that live-1.5 is still actively used by lots of w/ symlinks in docroot/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162520 (owner: 10MaxSem) [12:16:24] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:17:25] RECOVERY - puppet last run on db2017 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:17:44] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [12:18:27] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [12:19:06] RECOVERY - puppet last run on lvs2005 is OK: OK: Puppet is currently enabled, last run 64 seconds ago with 0 failures [12:19:06] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [12:19:24] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [12:19:44] RECOVERY - puppet last run on db2033 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [12:20:41] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [12:22:51] RECOVERY - puppet last run on ms-be2002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [12:23:22] RECOVERY - puppet last run on db2005 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [12:23:25] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [12:23:36] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [12:23:51] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [12:24:01] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [12:24:32] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:24:41] RECOVERY - puppet last run on db2019 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [12:24:44] RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [12:25:31] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [12:26:01] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:27:01] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:27:12] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [12:27:22] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 
failures [12:28:05] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [12:28:22] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:28:22] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:29:52] RECOVERY - puppet last run on ms-be2001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [12:29:52] RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [12:29:52] RECOVERY - NTP peers on achernar is OK: NTP OK: Offset -0.009409 secs [12:30:11] RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [12:30:42] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [12:31:31] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:31:32] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:37:13] (03PS1) 10Calak: Prevent search engines from indexing user pages and all talk pages on ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164766 (https://bugzilla.wikimedia.org/71663) [15:22:40] (03PS1) 10Calak: Add namespace alias on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164773 (https://bugzilla.wikimedia.org/71668) [15:27:54] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [15:34:24] https://gerrit.wikimedia.org/r/#/c/164773/1/wmf-config/InitialiseSettings.php gerrit is not good at rtl it seems [15:34:35] (see the 'عن' => 106, part) [15:37:33] (03PS2) 10Calak: Enable Echo for Persian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164491 (https://bugzilla.wikimedia.org/71669) (owner: 10Reza) [15:40:14] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [16:39:51] !log restore ns1 routing to codfw [16:39:58] Logged the message, Master [17:48:49] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 38.12 ms [17:48:58] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0, unused: 0 [17:49:02] RECOVERY - Host ps1-c1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 38.36 ms [17:49:02] RECOVERY - Host ps1-c2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 33.99 ms [17:49:02] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 35.57 ms [17:49:02] RECOVERY - Host ps1-c3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 35.74 ms [17:49:19] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 33.32 ms [18:14:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [18:32:00] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [19:11:18] (03PS2) 10Calak: Add namespace alias on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164773 (https://bugzilla.wikimedia.org/71668) [19:22:32] legoktm: hi! around ? [19:23:01] hey [19:23:51] great. any thoughts on when we can schedule to get bouncehandler isntalled in prod ? 
[19:24:20] I meant the wiki side -- as per https://gerrit.wikimedia.org/r/#/c/155753/49/manifests/role/mail.pp we planned to get that into loginwiki [19:24:53] hmm [19:25:14] I think hoo pointed that we use loginwiki since thats where we have most of them [19:25:56] It's the Wiki i'd choose, but we certainly need to make sure stuff isn't going to go wrong if a user doesn't exist there [19:26:06] probably this is blocked by SULF [19:26:14] SULF ? [19:26:19] SUL finalisation [19:26:26] (https://www.mediawiki.org/wiki/SUL_finalisation) [19:26:30] but you probably wont want to wait... so bind against CA to check accounts match? [19:26:39] (If you haven't yet, dunno) [19:27:23] it already tries to use CA if possible [19:27:40] I am fetching the code. one sec [19:27:59] https://github.com/wikimedia/mediawiki-extensions-BounceHandler/blob/master/includes/BounceHandlerActions.php#L79 [19:28:01] Ok... what if we have local user enwiki:foo and that one gets an emailed bounced [19:28:16] but loginwiki:foo (global account) belongs to somebody else [19:28:23] https://www.mediawiki.org/wiki/SUL_finalisation [19:28:24] err [19:28:26] $caUser = CentralAuthUser::getInstance( $user ); [19:28:27] if ( $caUser->isAttached( $this->wikiId ) ) { [19:28:33] that's in BounceHandlerActions [19:29:07] legoktm: that's guards against the other way round (not attached on loginwiki, but on enwiki) [19:29:07] the code is currently assuming if CA is enabled, it is finalised. Needs some tweaking to remove that assumption... [19:29:10] not very likely [19:29:12] hoo: that can happen ? multiple users with same ? [19:29:13] (for us, at least) [19:29:21] tonythomas: Right now, sadly, yes [19:29:33] oh ! [19:29:47] hoo: isn't that what we care about? that the account is attached on enwiki? [19:29:51] legoktm: there is an else case there though [19:30:09] legoktm: both accounts need to be attached... loginwiki and enwiki [19:30:21] and we need to check that before taking action [19:30:40] hoo: anyawy, the first phase will be only log based though [19:30:58] $wgBounceHandlerUnconfirmUsers = false; [19:30:58] still should be tested [19:31:06] why does loginwiki attachment matter? [19:31:07] we don't want this to fatal or something in production [19:31:41] legoktm: $this->wikiId is not the id of the current wiki (wfWikiId()), but the one where the bounce came from? [19:31:58] yes [19:32:02] ohhhhh [19:32:04] hoo: Jeff and I found that the current labs design would make it difficult for us to test a webserver- mx mode [19:32:05] blareghad [19:32:14] I see now. [19:32:17] I meant 'beta' [19:32:30] tonythomas: What mode? [19:32:32] so we produced the same in labs - found it working alright [19:33:27] "to test a webserver- mx mode" [19:33:30] what is meant by that? [19:33:58] hoo: like we have bouncehandler installed in beta wiki - and we need the exim configurations to be placed on the mail server outlet of beta - which turns out to be polonium. [19:34:16] it sends mails through production? [19:34:19] remember our realm switches :( That wont happen -- as polonium is always configured in prodcution mode [19:34:31] Ok, I didn't expect that [19:34:55] even if it dont - the email should bounce back from a remote mx right ? it digs for the wikimedia domain and hits to polonium [19:35:01] and gets stuck there [19:36:09] tonythomas: I see the problem and can't really come up with a fix offhand [19:36:24] (despite of moving that away from polonium into a labs instance) [19:37:12] yeah. 
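hoo's concern above is that, before SUL finalisation, a bounce received for enwiki:Foo could be attributed to a different person who owns loginwiki:Foo. A rough sketch of the "attached on both wikis" guard being proposed, reusing the BounceHandlerActions calls quoted above — the action and helper names and the exact CentralAuth signatures are assumptions, not the deployed code:

```php
// Sketch of the extra guard discussed above: only act on a bounce if the
// same global account is attached both on the wiki the bounce came from
// ($this->wikiId) and on the wiki doing the processing (loginwiki).
$caUser = CentralAuthUser::getInstance( $user );
if ( $caUser->isAttached( $this->wikiId ) && $caUser->isAttached( wfWikiID() ) ) {
	// Hypothetical action; in the first phase this would only log,
	// since $wgBounceHandlerUnconfirmUsers is false.
	$this->handleFailingRecipient( $user );
} else {
	// Local names collide across wikis: never unconfirm, just record it.
	wfDebugLog( 'BounceHandler',
		'Skipping ' . $user->getName() . ': account not attached on both wikis' );
}
```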
we did that :) [19:37:20] Awesome [19:37:24] and its working fine there [19:37:33] the bounces gets registered into the db [19:37:48] and the user gets un-subscribed on exceeding the limit too [19:38:58] hoo: the difference - as Jeff tellls is between dig mx wikimedia.org and for beta emails dig mx deployment.wikimedia.beta.wmflabs.org [19:40:20] tonythomas: Ok, but how is that a problem? [19:40:42] polonium or lead will do it for prod. and whatever instance does it for beta [19:42:31] hoo: once the remote gmail mx that produce the bounce search for mx of deployment.wikimedia.beta.wmflabs.org and wont be able to find the mx [19:42:47] as you can see from the terminal output -- there is no mx shown afaik [19:42:48] because there's none set [19:42:53] yep [19:43:23] but if you still bounce to whatever the mx should be, it works [19:43:24] ? [19:44:11] I think the remote mx should lookup for the failing domain and pass the bounce to the mx of that domain [19:44:20] Yes [19:44:33] so... someone would need to update beta's DNS [19:44:34] that step should fail, if the remote mx is not able to find [19:44:41] it will [19:45:04] hoo: that would be great. any idea where the beta emails go through ? [19:45:07] the mail server ? [19:47:04] yeah. it goes through polonium [19:47:12] I just tested with a test email [19:47:55] hoo: https://dpaste.de/h7mb#L20,21 [19:48:47] and since its return address is wiki-deploymentwiki-blah-@deployment.wikimedia.beta.wmflabs.org the remote mx fails to route the email back to polonium. [19:49:13] tonythomas: I guess it could still bounce to something else, if the dns for deployment.wikimedia.beta.wmflabs.org were ok [19:49:22] ok as in set up for that [19:50:28] yeah. if deployment.wikimedia.beta.wmflabs.org would resolve to show up polonium and we have the role { labs } configuration in polonium -- then this should work [19:51:08] no, that shouldn't go via polonium [19:51:17] in fact it can't (w/o messing a lot of other stuffs) [19:51:22] the curl also wont work for exampel [19:51:27] * example [19:52:26] curl wont work ? [19:53:05] * tonythomas wish we had a testwiki in production :D [19:53:44] tonythomas: polinium wont be able to connect to the beta instances [19:53:59] not even via http ? [19:54:34] Probably not, no [19:54:49] (03PS4) 10Physikerwelt: Re-enable all Math modes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/139421 (https://bugzilla.wikimedia.org/66587) (owner: 10Reedy) [19:54:52] it can only connect to machines within the production cluster [19:55:02] (I guess, but it's probably that way) [19:55:20] hoo: that makes it almost impossible to test on beta :( [19:55:44] (03PS5) 10Physikerwelt: Re-enable all Math modes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/139421 (https://bugzilla.wikimedia.org/66587) (owner: 10Reedy) [19:55:45] No, just create a polonium-equivalent in beta and use that [19:55:50] I don't know what would block that [19:56:01] except that it's quite some work [19:56:05] maybe [19:56:11] hoo: yeah. that would be great. [19:56:29] and make beta be mail serve-d through that one ? [19:56:42] not necessary outgoing mail, I guess [19:56:49] but it should at least be in the return path [19:57:07] yeah. we will need the mx records for that hostname [19:57:43] so that someone looking up for deployment.wikimedia.beta.wmflabs.org should find that mx [19:57:47] I guess that can be done... virt1000 is the dns server for beta [19:57:54] yeah. 
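The routing gap being described is purely a DNS one: the remote MX that generated the bounce looks up an MX record for the return-path domain, and for the beta domain there is nothing to find. A small illustration of that lookup (only the presence or absence of records matters; the counts are what the conversation reports, not guaranteed values):

```php
// Illustrates the failure mode discussed above: the production domain
// publishes MX records (handled by polonium/lead per the discussion),
// while the beta return-path domain publishes none, so a remote MX has
// nowhere to deliver the bounce.
$prodMx = dns_get_record( 'wikimedia.org', DNS_MX ) ?: array();
$betaMx = dns_get_record( 'deployment.wikimedia.beta.wmflabs.org', DNS_MX ) ?: array();

echo "production MX records: " . count( $prodMx ) . "\n"; // > 0: bounces route back
echo "beta MX records: " . count( $betaMx ) . "\n";       // 0 at the time of this log: bounces are lost
```

The fix sketched in the conversation — an MX record for deployment.wikimedia.beta.wmflabs.org pointing at a polonium-equivalent inside beta, served from virt1000 — would make the second lookup succeed.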
[19:57:55] maybe that can even be done via wikitech [19:57:56] no idea [19:58:18] hoo: any idea who I should ping on a Sunday ? [19:58:59] Probably no one [19:59:19] hmm. sundays :| [19:59:54] we actually discussed about this earlier though - thought it would be difficult - and thought of limitting our tests only to labs [20:01:04] That's not up to me to decide... if Jeff is ok with that, maybe [20:01:46] so things to do: 1) Get the extension into prod, 2) Configure it for prod, 3) Set up the exim in prod to make use of the extension [20:02:07] 1. Step needs someone to say it's ok to ge tthat extension deployed (greg?) [20:02:13] true. Configuring is done - I think [20:02:26] 1) is the tough job [20:02:53] its been through the sec-review before getting into beta - but no other reviews [20:03:19] Nemo_bis: was going through the bugs today. [20:03:54] tonythomas: Ok, so I guess it needs the perf. one and then you can poke greg [20:04:46] yeah. I will add that in the bug [20:20:32] tonythomas, I would be interested in knowing how many emails we send daily on average [20:21:06] We have the exim stats in ganglia but I never undestood how "true" they are [20:21:45] Nemo_bis: few hundreds in a minute ? [20:22:17] exim stats say something like a hundred per second [20:22:47] oh. [20:23:10] we where discussing on the possbility of having a polonium-equivallent in beta [20:23:19] so that we can test the extension happily [20:27:03] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [20:53:30] (03CR) 10Ebrahim: [C: 031] Enable Echo for Persian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164491 (https://bugzilla.wikimedia.org/71669) (owner: 10Reza) [20:54:29] (03CR) 10Ebrahim: [C: 031] Add namespace alias on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164773 (https://bugzilla.wikimedia.org/71668) (owner: 10Calak) [20:54:39] (03CR) 10Ebrahim: [C: 031] Prevent search engines from indexing user pages and all talk pages on ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164766 (https://bugzilla.wikimedia.org/71663) (owner: 10Calak) [20:59:49] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [21:14:16] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:14:56] hey ops, wikpedia: timeout on accessing wikipedia (via esams) [21:15:44] looks fine over here [21:16:35] hm, my traceroute/ping shows only one response of 10 [21:17:13] 100% over 15 packets here [21:17:17] ho se4598_2 having same problem when try to visit dewiki and enwiki [21:17:28] I'd blame your or your ISPs connectivity [21:17:37] What ISPs do you have? [21:17:54] DTAG mobile and wire connection, germany [21:18:16] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [21:18:22] DTAG mobile works for me as well [21:18:33] Unitymedia, Germany [21:18:33] mw.org, too :/ [21:18:39] wow, all german :D [21:18:40] two tracerts: http://pastebin.com/raw.php?i=0yz7zLJF [21:19:10] * FlorianSW waiting for tracert... [21:19:39] maybe the route via telia has a problem/is overloaded? [21:20:25] se4598_2 my goes over adm-b5-link.telia.net too... still waiting for finish :) [21:21:14] FlorianSW, yeah, I canceled my second route, b/c clearly beyond finish. first stopped normally [21:21:28] hoo: have you a tracert? 
[21:21:40] se4598_2 i hope it finish sometimes ;) But i think no :( [21:22:00] FlorianSW: I do, but am busy atm [21:22:25] hoo ok, just to can compare, bc it's working for you :) [21:22:54] all the wikis appear to have fallen over for me as well [21:23:00] !log Bypassed Wikibase restrictions and set https://www.wikidata.org/wiki/Q183 back to old serialization format [21:23:00] * FlorianSW wait's the last 5 hops [21:23:05] Logged the message, Master [21:23:08] aude: ^ [21:23:11] bits cache network drop, see https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Bits+caches+esams&m=cpu_report&s=by+name&mc=2&g=network_report [21:23:23] Coren? ^ [21:23:37] op on duty? [21:23:39] Maybe it should be superprotected [21:23:43] (for real) [21:23:46] yeah, something is up [21:23:48] se4598_2: http://pastebin.com/raw.php?i=KLM2Ce4T [21:24:15] i pinged a couple of folks, will page if no one replies [21:24:26] looks like the same as se4598_2's one [21:24:28] se4598_2: seems to be recovering [21:24:34] just as we speak, graph goes up [21:24:43] ori, recovering for me [21:25:08] ori same for me (at least on dewiki) [21:27:24] oh ffs [21:27:30] zend segfault [21:27:33] arrrg [21:28:04] yeah [21:28:17] hoo: what's the story with https://www.wikidata.org/wiki/Q183 ? [21:28:39] ori: It's to large for our new serialization format [21:28:46] it kind of worked in the old [21:28:50] but it's still awry [21:28:59] and it's causing mayor troubles [21:29:06] (also in the Wikipedias) [21:29:14] what sort of major troubles? [21:29:24] ori: Pages can't be edited/ re-parsed [21:29:28] articles that reference it not being editable :D [21:29:46] and watchlists fataling and "fun" like that [21:30:45] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:32:42] hoo: is there a plan for fixing it? [21:33:10] ori: yes, but nothing we can do in even a week [21:33:17] so I'm just going to reset it [21:33:24] and maybe then super protect even [21:33:45] would that make pages that reference it editable again? [21:33:47] as every edit will kill it again (if an edit makes it through) [21:33:50] ori: Yes [21:34:02] cool; do you need any help from me? [21:35:09] No, I don't think so [21:35:22] I guess these are the segfaults we saw before [21:35:35] so I'll just search for a revision that is so much older [21:35:45] one that renders [21:36:07] when stuff is ok again, we can revert back [21:41:04] * hoo cries [21:41:12] stupid php 5.3 [21:42:10] now I found a revision that doesn't segfault... but it's oom [21:42:22] * hoo goes back further [21:42:47] I found one [21:42:53] going to hell for that [21:59:48] Bleh, why is it shit hits fans when I'm eating? [22:00:31] 'nything I can do to help? [22:00:31] Coren: because you do such a good job, life feels the need to throw stuff at you when you're not here? [22:01:08] * Coren reads backlog. [22:01:14] I have +staff if you need to superprotect. [22:01:48] * Coren idly wonders if that segfaults hhvm too. [22:01:48] Coren: I guess that would be handy... I could also give my real name account +sysadmin, I guess? [22:02:04] I don't think +sysadmin has superprotect. [22:02:14] It does not last I checked. [22:02:15] but i can arrange that [22:02:24] can we just make sure admins don't touch it? [22:02:31] legoktm: Ok, will do [22:02:35] legoktm: That's what superprotect /does/ [22:02:38] using superprotect is just going to kick off the drahmaz [22:02:50] blehr [22:02:55] Coren: I meant socially, not technically. 
[22:03:04] legoktm: Not with an explanation. "Breats wikimedia because bug. Don't touch until bug is fix. ktxbai. [22:03:29] Coren: people will still get upset over it :P [22:03:40] I don't think so [22:03:42] super protect + wikidata item for Germany == web rage :( [22:03:52] bd808: Do you think so? [22:03:59] This would be the best case ever :p [22:04:09] No, seriously, nobody is issane enough to not understand that's a bug that breaks the wiki. [22:04:15] hoo: I don't think so [22:04:16] yeah [22:04:17] yeah [22:04:19] Super protect is first used on dewiki - second use - protecting the 'German' article on Wikidata! :o [22:04:30] *Germany [22:04:31] it's an unfortunate coincidence and that's it [22:04:37] JohnLewis: editing is waas broken for the last 3 months [22:04:49] Coren: haha, you're underestimating them [22:05:09] Will someone do it, or shall I do it myself [22:05:09] ori: we know it is - but the users won't :p [22:05:22] (I don't really mind much doing it myself) [22:05:23] hoo: urg yeah true [22:05:28] MatmaRex: I'm honestly not worried. I have a lot of tech cred with the dewiki folks; if I say "this is technical, things will break if this is touched" they'll beleive me. [22:05:28] but if hoo does it himself as a volunteer sysadmin, people will be less upset since it's not "omg WMF evil" [22:05:36] Coren: have you seen the patch that fixed the order of protection options after superprotect was introduced, that got four -1's from random "community members"? [22:06:05] MatmaRex: not random - users who are dewiki edits I believe [22:06:10] that was hilarious [22:06:19] hoo: Is the serialization bug something that can be fixed quicker if you get some help? [22:06:50] Coren: sure, but that's probably not the folks who are going to be making a fuss :> [22:06:52] bd808: Not sure what kind of help that would be [22:06:58] But, we need a solution now [22:07:08] if some admin edits in accident, we're screwed again [22:07:15] (i'd still superprotect that wikidata page, i'd just follow that up with getting some popcorn) [22:07:24] I'm about to superprotect with "Editing this item will break mediawiki because of a crash-causing bug; this is a temporary safeguard until the bug is fixed." [22:07:33] if protecting this protect us - I'm in favour of it. [22:07:38] Coren: That would be awesome [22:07:39] Coren: link to the bug number [22:07:46] MatmaRex: Ref? [22:07:48] Bug number is in the edit history [22:07:48] or link to something [22:07:52] (well one of the many) [22:07:55] https://bugzilla.wikimedia.org/show_bug.cgi?id=71519 [22:07:55] but link [22:07:56] it was also segfaulting [22:07:57] one sec [22:08:01] and running out of memory [22:08:04] couple of bugs [22:08:14] could be tracking bug about Q183 probs :S [22:08:17] yeah, 71519 is probably the best one [22:08:25] I'll also put a message on the talk page. [22:08:32] Add a "See [[bugzilla:71519]]", I suppose [22:08:38] And send an email to Philippe [22:08:46] I'll sign that off as Wikidata-Dev after [22:08:49] (Because I just use +staff) [22:09:14] Coren: I can also get +sysadmin on my non-community works acc. and then do it that way [22:09:20] not sure that's better process-wise [22:09:48] hoo: I'm ops on duty; I'me exactly the right person to intervene (and get any flak) [22:09:49] surely sysop-only is fine unless you can't trust wikidata admins to behave? [22:09:57] Coren: Ok, thanks [22:09:59] hey folks [22:10:00] wasup? 
[22:10:08] hmm [22:10:10] Lydia_WMDE: Will give you a summary in a bit [22:10:15] actually, protecting a page insert a null revision [22:10:25] will that revision use the old (good) or new (broken) format? [22:10:27] MC8: It'd be fine except that an error breaks the wiki. [22:10:29] hoo: I poked her about it since y'know :) [22:10:59] !log WD:Q183 was frozen on version 120566337, see bug 71519 (and others) [22:11:05] Logged the message, Master [22:11:09] That version is pretty old [22:11:26] but it was the first that actually worked w/o hitting the f... [22:11:32] why does this need superprotect? [22:11:47] Lydia_WMDE: If an admin edits, serialization changes again and it will fail again [22:12:01] ^ that which I was going to say in easier terms :p [22:12:04] don't forget to make yourself a userpage, Coren [22:12:23] Also we probably want to revert back to pre-trouble at some point [22:12:25] beurocrats then and tell them? [22:12:28] MC8: I'm doing the emergency communication now, I'll do so right after. [22:12:56] Coren: Ah, doh [22:13:06] Coren: I did it for you to help you :) [22:13:12] we froze it on an even older version now [22:13:18] but I don't think that's needed [22:13:36] (see the revision size as a estimate of the size) [22:13:42] hoo: That's not an issue - it's "somewhere in the past that doesn't break" only until the issue is fixed -- which revision is immaterial so long as it doesn't explode. [22:13:47] Ok [22:13:57] i disagree tbh [22:14:38] I would also favor to have it on 120566337 which should be ok-ish [22:14:52] if users on-wiki can't fix it it needs to be frozen at an acceptable revision at least [22:14:58] I only went back further because I thought it also fails, but that was a cache [22:16:04] hoo: do we know what the actual issue is? [22:16:16] Lydia_WMDE: About which of the bugs? [22:16:22] Lydia_WMDE: I'm more than happy to help the community find a better version, so long as things don't break. This is an emergency protection measure. [22:16:27] Coren: Ok [22:16:58] If you update the protection now, the revision you create will be the oldid 120566337 [22:17:03] hoo: all of the ones causing this protection [22:17:16] that is the newest I could find that doesn't make stuff go insane [22:17:55] Lydia_WMDE: Apparently the fact that the item can't be handled by neither client nor repo broke editing (and watchlists and ...) [22:18:09] ok [22:18:11] No idea why that happened now and not when this actually started coming up on Thursday [22:18:13] why can it not be handled? [22:18:23] Handled how? [22:18:29] Oh by PHP [22:18:39] well... mostly size [22:19:03] sigh [22:19:04] ok [22:19:05] New DataModel and new serialization is less performant then old one, that's why this broke [22:19:21] even some of the old serialized version were running OOM [22:19:22] ok can you write this all down in an email to tech? [22:19:35] Internal tech or public? [22:19:43] public is good i think [22:19:52] Ok [22:19:57] hoo: Can a preview show in advance if a revision would break or not? [22:20:01] hoo: thanks! [22:20:16] Coren: You can use &oldid with the revision and that should do it [22:20:23] but is not 100%, sadly [22:21:44] !log Updated gerrit's hooks-bugzilla to 6e1e659 (with hooks-its at a421db4) [22:21:50] Logged the message, Master [22:22:00] https://www.wikidata.org/w/index.php?title=Talk:Q183&diff=162221367&oldid=162220138 <-- please double check for accuracy? [22:22:29] Coren: Can you then please just do some kind of re-protection? 
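Superprotect is an ordinary MediaWiki restriction level that only one group holds the matching right for. A rough sketch of how such a level is typically wired up in site configuration — the group name and wiring below are assumptions for illustration, not the actual Wikimedia settings:

```php
// Sketch: register a restriction level and grant the matching right to a
// single group. Pages protected at this level (as Q183 is about to be)
// can no longer be edited even by ordinary sysops.
$wgRestrictionLevels[] = 'superprotect';
$wgGroupPermissions['staff']['superprotect'] = true; // 'staff' group assumed
```

Applying or updating such a protection through the normal protect action is also what creates the null revision hoo asks for just below.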
[22:22:34] I just need a new null-revsion [22:22:44] but obviously can't make a new one myself [22:22:44] hoo, kk [22:22:57] hoo: No. Q183 500s [22:23:05] Awesome [22:23:31] Is there a VP-equivalent on Wikidata? [22:23:37] VP? [22:23:44] Village Pump [22:23:45] WD:Project chat [22:23:48] oh sure [22:23:54] I guess Lydia_WMDE can post there? [22:23:58] salient only has one L btw [22:24:20] i'd rather not as i had no say in it and don't have all the details [22:24:32] Lydia_WMDE: Ok, fair point [22:24:45] I'm doing a post there. [22:25:26] I'm getting a 500 from https://www.wikidata.org/wiki/Q183 as logged in user with hhvm enabled. [22:25:39] bd808: HHVM doesn't exist on WD right now [22:26:04] what the heck [22:26:37] hoo: Can you roll back to the known good rev? [22:26:52] Coren: Yeah, will have to :S [22:27:08] Sorry for the further trouble [22:27:10] My view of https://www.wikidata.org/wiki/Special:Version says different. I have global js that is setting the cookie to use hhvm cluster. [22:27:19] bd808: We killed it [22:27:28] ask ori... it was causing to much pain [22:27:36] (03PS1) 10QChris: Update hooks-bugzilla to 6e1e659eedc8719a2a0ea0906266738a18c7aa42 [gerrit/plugins] - 10https://gerrit.wikimedia.org/r/164879 [22:27:43] hoo: I get Q183 again. Your doing? [22:27:49] hoo: what was? [22:27:51] !log Q183 is on revision 116786096 again, please don't alter this further! [22:27:55] Logged the message, Master [22:27:59] Ok, no more experiments [22:28:05] this one is good, so keep [22:28:06] it [22:28:22] !log Q183 superprotected as a safeguard [22:28:28] Logged the message, Master [22:28:41] ori: We disabled HHVM on Wikidata because of the huge amount of issues [22:28:53] There we go. I go finish my meal now, but I'm keeping an eye on the channel. Ping me if you need further help. [22:29:17] hoo: there was one issue, IIRC, and it was supposed to be fixed last tuesday, by a wikidata deploy [22:29:25] if there are a "huge amount of issues", you guys aren't filing bugs [22:29:43] ori: Well, there is more in the logs then we have on bugzilla I think [22:29:56] but I'm not sure [22:31:50] I don't know that there is any way right now to disable the hhvm cluster for a particular wiki. It would take a varnish patch to ignore the hhvm=true cookie when the hostname matched some blacklist. [22:31:51] HHVM isn't really optional -- we're in the middle of switching to it. it's OK to call a time-out to fix some issue but I expect some diligence with respect to reporting issues :/ [22:32:11] bd808: we disabled the beta feature; the code for the beta feature unsets the cookie onbeforepagedisplay [22:32:20] so your global script and wikimediaevents are duking it out on each page load [22:32:27] ah [22:32:30] ori: We've been doing a lot lately [22:32:47] aude will know the exact status [22:32:47] i know! [22:32:58] ok. i'm sure we'll work it out. [22:33:05] no stress, thanks for jumping on this issue. [22:35:52] (03PS1) 10QChris: Linkify Phabricator Task references in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/164880 [22:45:05] (03PS1) 10QChris: Make gerrit set PATCH_TO_REVIEW status only in bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/164881 [22:47:10] * Coren is back. [22:47:44] error logs look okish [22:53:59] All of that said, this is just a dam over the flood; this bug is going to bite us in the ass with other items sooner rather than later. 
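Earlier in this exchange, bd808 and ori work out why bd808 still landed on the HHVM cluster: his global JS sets the hhvm=true routing cookie while the now-disabled beta feature clears it on every page view. A minimal sketch of a BeforePageDisplay handler doing that clearing — the hook name and WebResponse call are MediaWiki's, everything else (class, helper, cookie semantics) is an assumption, not the WikimediaEvents code:

```php
// Sketch of the behaviour ori describes: when the HHVM beta feature is off,
// expire the 'hhvm' cookie that Varnish uses to route a browser to the HHVM
// app servers, so stale opt-ins stop fighting scripts that set it.
class HhvmCookieHooks {
	public static function onBeforePageDisplay( OutputPage $out, Skin $skin ) {
		if ( !self::betaFeatureEnabled( $out->getUser() ) ) { // hypothetical check
			$out->getRequest()->response()->setcookie( 'hhvm', '', time() - 86400 );
		}
		return true;
	}

	private static function betaFeatureEnabled( User $user ) {
		return false; // placeholder: the real check would consult BetaFeatures
	}
}
$wgHooks['BeforePageDisplay'][] = 'HhvmCookieHooks::onBeforePageDisplay';
```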
[22:54:35] Yep [22:54:58] We had such troubles before (on smaller scale) [22:55:18] it's always solvable, but it requires major changes to a lot of layers so nothing we can do in a blink [23:01:04] Incidentally, if I'm going to be asked why "I don't trust admins", the reason is simple: It's not a question of trust but of ability to fix. [23:02:36] Yep... also protection in Wikibase is not well enough visible to avoid accidental changes (especially using scripts and the API) [23:02:39] which admins do [23:44:37] (03PS1) 10Ori.livneh: mediawiki: install `perf` on Trusty app servers [puppet] - 10https://gerrit.wikimedia.org/r/164883