[00:07:37] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [00:25:27] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0] [00:40:40] (CR) Krinkle: [C: 1] Removed "refreshLinks" from $wgJobBackoffThrottling [mediawiki-config] - https://gerrit.wikimedia.org/r/210857 (owner: Aaron Schulz) [01:15:47] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [01:35:17] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (91196s 90000s) [01:38:37] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [01:56:17] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [01:56:27] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [01:58:26] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 3 below the confidence bounds [02:17:27] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [02:21:48] !log l10nupdate Synchronized php-1.26wmf5/cache/l10n: (no message) (duration: 06m 24s) [02:22:04] Logged the message, Master [02:26:58] !log LocalisationUpdate completed (1.26wmf5) at 2015-05-18 02:25:54+00:00 [02:27:03] Logged the message, Master [02:42:21] !log l10nupdate Synchronized php-1.26wmf6/cache/l10n: (no message) (duration: 05m 35s) [02:42:30] Logged the message, Master [02:46:55] !log LocalisationUpdate completed (1.26wmf6) at 2015-05-18 02:45:52+00:00 [02:47:00] Logged the message, Master [03:35:38] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [03:37:46] RECOVERY - HTTP error ratio anomaly 
detection on graphite1001 is OK No anomaly detected [03:51:53] operations, database: investigate performance_schema for wmf prod - https://phabricator.wikimedia.org/T99485#1292599 (Springle) NEW a:jcrespo [03:59:48] operations, database: switch to innodb tables for replication state - https://phabricator.wikimedia.org/T99486#1292611 (Springle) NEW a:jcrespo [04:05:24] operations, Phabricator: m3 set max_allowed_packet to 33554432 or greater - https://phabricator.wikimedia.org/T98339#1292618 (Springle) a:Springle>jcrespo [04:11:37] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [04:20:23] operations, Phabricator: m3 set max_allowed_packet to 33554432 or greater - https://phabricator.wikimedia.org/T98339#1292629 (Springle) If we change this, check mysqldump command for m3 backups on dbstore1001. [04:21:27] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [04:25:16] operations, HHVM, Patch-For-Review: investigate HHVM mysqlExtension::ConnectTimeout - https://phabricator.wikimedia.org/T98489#1292630 (Springle) p:Normal>High [04:25:38] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:25:46] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:27:17] RECOVERY - RAID on mw1107 is OK no RAID installed [04:28:56] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [04:36:06] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [04:41:07] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [05:03:57] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (12590 90000s) [05:18:54] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon May 18 05:17:50 UTC 2015 (duration 17m 49s) [05:19:00] Logged the message, Master [05:20:08] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [05:23:28] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [05:39:07] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100% [05:39:47] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [05:50:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [05:53:49] hmm, cologneblue is missing on mw.o [05:55:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [06:00:36] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 88.63 ms [06:00:47] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 89.21 ms [06:01:15] (PS1) Florianschmidtwelzow: Remove gather-hidelist from AvailableRights [mediawiki-config] - https://gerrit.wikimedia.org/r/211663 (https://phabricator.wikimedia.org/T94652) [06:03:25] (CR) Legoktm: "Is the extension deployed on meta?" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/211663 (https://phabricator.wikimedia.org/T94652) (owner: Florianschmidtwelzow) [06:16:49] (CR) Jalexander: [C: -1] "Please do not merge, this extension is not installed on meta and without it being specifically added to availablerights it is not able to " [mediawiki-config] - https://gerrit.wikimedia.org/r/211663 (https://phabricator.wikimedia.org/T94652) (owner: Florianschmidtwelzow) [06:30:27] PROBLEM - puppet last run on logstash1006 is CRITICAL puppet fail [06:31:17] PROBLEM - puppet last run on mw1100 is CRITICAL Puppet has 1 failures [06:31:17] PROBLEM - puppet last run on analytics1030 is CRITICAL Puppet has 1 failures [06:31:48] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 2 failures [06:31:48] PROBLEM - puppet last run on cp4004 is CRITICAL Puppet has 1 failures [06:32:17] PROBLEM - puppet last run on mw1003 is CRITICAL Puppet has 1 failures [06:32:47] PROBLEM - puppet last run on mw1042 is CRITICAL Puppet has 1 failures [06:32:48] PROBLEM - puppet last run on lvs2004 is CRITICAL Puppet has 1 failures [06:33:26] PROBLEM - puppet last run on db2065 is CRITICAL Puppet has 1 failures [06:33:56] PROBLEM - puppet last run on mw1065 is CRITICAL Puppet has 1 failures [06:35:37] PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 1 failures [06:35:37] PROBLEM - puppet last run on mw2143 is CRITICAL Puppet has 1 failures [06:35:37] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures [06:35:53] (PS5) Giuseppe Lavagetto: hiera: use the proxy backend, rationalize the hierarchy [puppet] - https://gerrit.wikimedia.org/r/207129 [06:38:27] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [06:45:27] RECOVERY - puppet last run on mw1003 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:45:57] RECOVERY - puppet last run on lvs2004 is OK Puppet is currently enabled, last run 
20 seconds ago with 0 failures [06:45:58] RECOVERY - puppet last run on mw1100 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:46:07] RECOVERY - puppet last run on analytics1030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:37] RECOVERY - puppet last run on db2065 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:46:37] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:46:47] RECOVERY - puppet last run on logstash1006 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:47:06] RECOVERY - puppet last run on mw1065 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:47:17] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:17] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:47:17] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:37] RECOVERY - puppet last run on mw1042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:07] (CR) Giuseppe Lavagetto: [C: 2] "I tested this with the puppet compiler and it results to be a no-op - the only changes are related to ganglia_class apparently." 
[puppet] - https://gerrit.wikimedia.org/r/207129 (owner: Giuseppe Lavagetto) [06:48:17] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:04] operations, Hackathon-Lyon-2015, Wikimedia-Site-requests, I18n, Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1292743 (jcrespo) [06:51:37] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [06:54:37] PROBLEM - puppet last run on mw1229 is CRITICAL puppet fail [06:54:52] (PS1) Giuseppe Lavagetto: Revert "hiera: use the proxy backend, rationalize the hierarchy" [puppet] - https://gerrit.wikimedia.org/r/211666 [06:55:01] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Revert "hiera: use the proxy backend, rationalize the hierarchy" [puppet] - https://gerrit.wikimedia.org/r/211666 (owner: Giuseppe Lavagetto) [06:55:08] PROBLEM - puppet last run on mw1032 is CRITICAL puppet fail [06:55:09] <_joe_> fu.ck [06:55:17] PROBLEM - puppet last run on mw1185 is CRITICAL puppet fail [06:55:26] PROBLEM - puppet last run on mw2153 is CRITICAL puppet fail [06:55:27] PROBLEM - puppet last run on mw2209 is CRITICAL puppet fail [06:55:27] PROBLEM - puppet last run on mw2175 is CRITICAL puppet fail [06:55:27] PROBLEM - puppet last run on mw2140 is CRITICAL puppet fail [06:55:27] PROBLEM - puppet last run on mw2204 is CRITICAL puppet fail [06:55:27] PROBLEM - puppet last run on mw2144 is CRITICAL puppet fail [06:55:27] PROBLEM - puppet last run on mw2201 is CRITICAL puppet fail [06:55:28] PROBLEM - puppet last run on mw2205 is CRITICAL puppet fail [06:55:28] PROBLEM - puppet last run on mw2009 is CRITICAL puppet fail [06:55:29] PROBLEM - puppet last run on elastic1023 is CRITICAL puppet fail [06:55:29] PROBLEM - puppet last run on ganeti2006 is CRITICAL puppet fail [06:55:36] PROBLEM - puppet last run on elastic1017 is CRITICAL puppet fail [06:55:37] 
PROBLEM - puppet last run on logstash1005 is CRITICAL puppet fail [06:55:38] <_joe_> that's me :( [06:55:47] PROBLEM - puppet last run on mw1043 is CRITICAL puppet fail [06:55:47] PROBLEM - puppet last run on mw1105 is CRITICAL puppet fail [06:55:47] PROBLEM - puppet last run on mw1033 is CRITICAL puppet fail [06:55:56] PROBLEM - puppet last run on mw2091 is CRITICAL puppet fail [06:55:57] PROBLEM - puppet last run on mw1225 is CRITICAL puppet fail [06:56:17] PROBLEM - puppet last run on mw1122 is CRITICAL puppet fail [06:56:17] PROBLEM - puppet last run on mw1024 is CRITICAL puppet fail [06:56:28] PROBLEM - puppet last run on iodine is CRITICAL puppet fail [06:56:36] PROBLEM - puppet last run on mw1077 is CRITICAL puppet fail [06:56:36] PROBLEM - puppet last run on mw1209 is CRITICAL puppet fail [06:56:48] <_joe_> I can't get why [06:56:57] PROBLEM - puppet last run on mw2188 is CRITICAL puppet fail [06:56:57] PROBLEM - puppet last run on mw2185 is CRITICAL puppet fail [06:56:57] PROBLEM - puppet last run on mw2180 is CRITICAL puppet fail [06:56:57] PROBLEM - puppet last run on mw2148 is CRITICAL puppet fail [06:57:05] <_joe_> Alex's compiler was telling me everything was ok, wtf [06:57:06] PROBLEM - puppet last run on mw2156 is CRITICAL puppet fail [06:57:07] PROBLEM - puppet last run on mw2157 is CRITICAL puppet fail [06:57:07] PROBLEM - puppet last run on mw2141 is CRITICAL puppet fail [06:57:07] PROBLEM - puppet last run on mw2069 is CRITICAL puppet fail [06:57:08] PROBLEM - puppet last run on mw2010 is CRITICAL puppet fail [06:57:08] PROBLEM - puppet last run on mw2063 is CRITICAL puppet fail [06:57:08] PROBLEM - puppet last run on ganeti2002 is CRITICAL puppet fail [06:57:08] PROBLEM - puppet last run on ganeti2004 is CRITICAL puppet fail [06:57:09] PROBLEM - puppet last run on ganeti2003 is CRITICAL puppet fail [06:57:09] PROBLEM - puppet last run on mw1142 is CRITICAL puppet fail [06:57:10] PROBLEM - puppet last run on mw1231 is CRITICAL puppet fail 
[06:57:36] PROBLEM - puppet last run on mw2075 is CRITICAL puppet fail [06:57:46] PROBLEM - puppet last run on mw1071 is CRITICAL puppet fail [06:57:57] PROBLEM - puppet last run on mw1215 is CRITICAL puppet fail [06:58:16] PROBLEM - puppet last run on mw1193 is CRITICAL puppet fail [06:58:16] PROBLEM - puppet last run on mw1253 is CRITICAL puppet fail [06:58:17] PROBLEM - puppet last run on elastic1003 is CRITICAL puppet fail [06:58:17] PROBLEM - puppet last run on mw1021 is CRITICAL puppet fail [06:58:18] PROBLEM - puppet last run on mw1086 is CRITICAL puppet fail [06:58:28] PROBLEM - puppet last run on mw1203 is CRITICAL puppet fail [06:58:37] PROBLEM - puppet last run on mw1090 is CRITICAL puppet fail [06:58:37] PROBLEM - puppet last run on mw1241 is CRITICAL puppet fail [06:58:37] PROBLEM - puppet last run on mw2158 is CRITICAL puppet fail [06:58:37] PROBLEM - puppet last run on mw2195 is CRITICAL puppet fail [06:58:38] PROBLEM - puppet last run on mw2137 is CRITICAL puppet fail [06:58:46] PROBLEM - puppet last run on mw2129 is CRITICAL puppet fail [06:58:46] PROBLEM - puppet last run on mw2128 is CRITICAL puppet fail [06:58:46] PROBLEM - puppet last run on mw2078 is CRITICAL puppet fail [06:58:46] PROBLEM - puppet last run on mw1030 is CRITICAL puppet fail [06:58:46] PROBLEM - puppet last run on mw2060 is CRITICAL puppet fail [06:58:47] PROBLEM - puppet last run on mw2021 is CRITICAL puppet fail [06:58:47] PROBLEM - puppet last run on mw2019 is CRITICAL puppet fail [06:58:47] PROBLEM - puppet last run on mw2067 is CRITICAL puppet fail [06:58:48] PROBLEM - puppet last run on mw2040 is CRITICAL puppet fail [06:58:49] PROBLEM - puppet last run on mw1037 is CRITICAL puppet fail [06:58:49] PROBLEM - puppet last run on mw1135 is CRITICAL puppet fail [06:58:56] PROBLEM - puppet last run on mw1220 is CRITICAL puppet fail [06:58:58] PROBLEM - puppet last run on mw1104 is CRITICAL puppet fail [06:59:07] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: 
Socket timeout after 10 seconds. [06:59:17] PROBLEM - puppet last run on mw2018 is CRITICAL puppet fail [06:59:17] PROBLEM - puppet last run on mw2039 is CRITICAL puppet fail [06:59:36] PROBLEM - puppet last run on restbase1003 is CRITICAL puppet fail [06:59:36] PROBLEM - puppet last run on mw1154 is CRITICAL puppet fail [06:59:46] PROBLEM - puppet last run on cp3007 is CRITICAL puppet fail [06:59:46] PROBLEM - puppet last run on cp3038 is CRITICAL puppet fail [06:59:46] PROBLEM - puppet last run on cp3017 is CRITICAL puppet fail [06:59:57] PROBLEM - puppet last run on mw1158 is CRITICAL puppet fail [07:00:07] PROBLEM - puppet last run on mw1047 is CRITICAL puppet fail [07:00:07] PROBLEM - puppet last run on mw1018 is CRITICAL puppet fail [07:00:17] PROBLEM - puppet last run on lanthanum is CRITICAL puppet fail [07:00:17] PROBLEM - puppet last run on db2055 is CRITICAL Puppet has 8 failures [07:00:17] PROBLEM - puppet last run on mw2211 is CRITICAL puppet fail [07:00:17] PROBLEM - puppet last run on mw2142 is CRITICAL puppet fail [07:00:17] PROBLEM - puppet last run on mw1128 is CRITICAL puppet fail [07:00:18] PROBLEM - puppet last run on mw2131 is CRITICAL puppet fail [07:00:18] PROBLEM - puppet last run on mw2101 is CRITICAL puppet fail [07:00:19] PROBLEM - puppet last run on mw1073 is CRITICAL puppet fail [07:00:19] PROBLEM - puppet last run on mw2110 is CRITICAL puppet fail [07:00:26] PROBLEM - puppet last run on mw2070 is CRITICAL Puppet has 8 failures [07:00:26] PROBLEM - puppet last run on mw2055 is CRITICAL puppet fail [07:00:26] PROBLEM - puppet last run on db2017 is CRITICAL Puppet has 6 failures [07:00:27] PROBLEM - puppet last run on mw2077 is CRITICAL puppet fail [07:00:27] PROBLEM - puppet last run on elastic1025 is CRITICAL puppet fail [07:00:56] PROBLEM - puppet last run on lvs3002 is CRITICAL puppet fail [07:01:18] PROBLEM - puppet last run on cp3048 is CRITICAL Puppet has 4 failures [07:02:46] (Abandoned) Florianschmidtwelzow: Remove 
gather-hidelist from AvailableRights [mediawiki-config] - https://gerrit.wikimedia.org/r/211663 (https://phabricator.wikimedia.org/T94652) (owner: Florianschmidtwelzow) [07:02:58] PROBLEM - puppet last run on cp4010 is CRITICAL Puppet has 3 failures [07:04:47] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [07:05:17] PROBLEM - puppet last run on wtp2016 is CRITICAL Puppet has 6 failures [07:05:26] PROBLEM - puppet last run on eventlog2001 is CRITICAL Puppet has 5 failures [07:06:56] PROBLEM - puppet last run on db2047 is CRITICAL Puppet has 1 failures [07:06:57] PROBLEM - puppet last run on mw2176 is CRITICAL Puppet has 8 failures [07:06:57] PROBLEM - puppet last run on db2010 is CRITICAL Puppet has 5 failures [07:10:47] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:41] RECOVERY - puppet last run on logstash1005 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [07:13:11] PROBLEM - puppet last run on mw2084 is CRITICAL Puppet has 7 failures [07:13:21] RECOVERY - puppet last run on mw1030 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:14:01] PROBLEM - puppet last run on mw2024 is CRITICAL Puppet has 2 failures [07:14:01] PROBLEM - puppet last run on mw2047 is CRITICAL Puppet has 4 failures [07:14:02] RECOVERY - puppet last run on ganeti2004 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [07:14:02] RECOVERY - puppet last run on ganeti2006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:14:10] RECOVERY - puppet last run on ganeti2002 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [07:14:10] RECOVERY - puppet last run on ganeti2003 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [07:14:10] RECOVERY - puppet last run on elastic1023 is OK Puppet is currently enabled, last run 35 seconds ago with 
0 failures [07:14:11] RECOVERY - puppet last run on elastic1017 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [07:14:31] RECOVERY - puppet last run on elastic1003 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [07:14:40] RECOVERY - puppet last run on mw1229 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [07:14:40] RECOVERY - puppet last run on mw1043 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [07:14:41] RECOVERY - puppet last run on mw1033 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [07:14:41] RECOVERY - puppet last run on mw1105 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [07:14:41] RECOVERY - puppet last run on mw2201 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [07:14:50] RECOVERY - puppet last run on mw2091 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [07:14:51] RECOVERY - puppet last run on mw1032 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [07:14:51] RECOVERY - puppet last run on mw1215 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [07:14:51] PROBLEM - puppet last run on mw2052 is CRITICAL Puppet has 1 failures [07:14:51] PROBLEM - puppet last run on mw2120 is CRITICAL Puppet has 4 failures [07:14:51] RECOVERY - puppet last run on mw2148 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [07:15:01] RECOVERY - puppet last run on mw1225 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [07:15:01] RECOVERY - puppet last run on mw1071 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [07:15:02] RECOVERY - puppet last run on iodine is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:15:11] RECOVERY - puppet last run on mw1122 is OK Puppet is currently enabled, last run 44 
seconds ago with 0 failures [07:15:11] RECOVERY - puppet last run on mw2180 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [07:15:11] RECOVERY - puppet last run on mw2144 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [07:15:11] RECOVERY - puppet last run on mw2205 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [07:15:11] RECOVERY - puppet last run on mw2209 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [07:15:21] RECOVERY - puppet last run on mw1086 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [07:15:21] RECOVERY - puppet last run on mw2157 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [07:15:30] RECOVERY - puppet last run on mw1185 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:15:30] RECOVERY - puppet last run on mw1209 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [07:15:30] RECOVERY - puppet last run on mw2156 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:15:30] PROBLEM - puppet last run on mw2182 is CRITICAL Puppet has 1 failures [07:15:31] RECOVERY - puppet last run on mw1024 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [07:15:31] RECOVERY - puppet last run on mw2175 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [07:15:32] RECOVERY - puppet last run on mw2140 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:15:41] RECOVERY - puppet last run on mw2069 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [07:15:41] RECOVERY - puppet last run on mw2010 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [07:15:42] RECOVERY - puppet last run on mw1037 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [07:15:42] RECOVERY - puppet last run on mw2040 is 
OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [07:15:42] RECOVERY - puppet last run on mw2063 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [07:15:42] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [07:15:42] RECOVERY - puppet last run on mw1142 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:15:43] RECOVERY - puppet last run on mw1231 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:15:43] RECOVERY - puppet last run on mw1220 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:15:50] RECOVERY - puppet last run on mw1077 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:15:50] RECOVERY - puppet last run on mw2204 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [07:16:00] RECOVERY - puppet last run on mw1241 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:16:00] RECOVERY - puppet last run on mw1090 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [07:16:01] RECOVERY - puppet last run on mw2128 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [07:16:01] RECOVERY - puppet last run on mw2153 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:16:01] RECOVERY - puppet last run on mw1104 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [07:16:10] RECOVERY - puppet last run on cp3007 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [07:16:11] RECOVERY - puppet last run on mw1154 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [07:16:11] RECOVERY - puppet last run on mw1203 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:16:12] RECOVERY - puppet last run on mw2009 is OK Puppet is currently enabled, last run 1 
minute ago with 0 failures [07:16:30] RECOVERY - puppet last run on mw1253 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [07:16:30] RECOVERY - puppet last run on mw2075 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:16:30] RECOVERY - puppet last run on mw1021 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [07:16:31] RECOVERY - puppet last run on mw2188 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:16:31] RECOVERY - puppet last run on db2047 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [07:16:31] RECOVERY - puppet last run on mw2141 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:16:31] RECOVERY - puppet last run on mw2185 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:16:40] RECOVERY - puppet last run on mw1158 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [07:16:40] PROBLEM - puppet last run on mw1199 is CRITICAL Puppet has 5 failures [07:16:41] RECOVERY - puppet last run on db2055 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [07:16:50] RECOVERY - puppet last run on db2010 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [07:16:50] PROBLEM - puppet last run on cp4020 is CRITICAL puppet fail [07:16:50] RECOVERY - puppet last run on mw1193 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:17:01] RECOVERY - puppet last run on mw2176 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:17:10] RECOVERY - puppet last run on restbase1003 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:17:22] RECOVERY - puppet last run on mw2060 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:17:22] RECOVERY - puppet last run on db2017 is OK Puppet is currently enabled, last run 36 seconds ago 
with 0 failures [07:17:22] RECOVERY - puppet last run on mw2021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:17:22] RECOVERY - puppet last run on mw2055 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [07:17:22] RECOVERY - puppet last run on mw2070 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [07:17:22] RECOVERY - puppet last run on mw2019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:17:23] RECOVERY - puppet last run on eventlog2001 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [07:17:24] RECOVERY - puppet last run on mw2067 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:17:24] RECOVERY - puppet last run on elastic1025 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [07:17:30] RECOVERY - puppet last run on mw1073 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [07:17:30] RECOVERY - puppet last run on cp3048 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [07:17:30] RECOVERY - puppet last run on mw2137 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [07:17:40] RECOVERY - puppet last run on mw1047 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:17:41] RECOVERY - puppet last run on mw1128 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [07:17:41] RECOVERY - puppet last run on mw2078 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:17:51] RECOVERY - puppet last run on mw2158 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:17:51] RECOVERY - puppet last run on mw2195 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:18:01] RECOVERY - puppet last run on cp4010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:18:02] 
RECOVERY - puppet last run on cp3017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:18:02] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:18:02] RECOVERY - puppet last run on mw2039 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [07:18:02] RECOVERY - puppet last run on mw2084 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [07:18:02] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:18:04] (CR) Mobrovac: "Ok, the deploy repo has been updated to the latest mathoid code in Iea3c5e6002123b38ad52214b26189d59569ac287 so this should be good to go " [puppet] - https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: Ori.livneh) [07:18:10] RECOVERY - puppet last run on mw2142 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [07:18:11] RECOVERY - puppet last run on mw2052 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [07:18:11] RECOVERY - puppet last run on mw2120 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [07:18:11] RECOVERY - puppet last run on mw2131 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [07:18:12] RECOVERY - puppet last run on mw1199 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:18:31] RECOVERY - puppet last run on wtp2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:18:40] RECOVERY - puppet last run on lanthanum is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [07:18:41] RECOVERY - puppet last run on mw1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:18:50] RECOVERY - puppet last run on mw2182 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:19:01] 
RECOVERY - puppet last run on cp3038 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:19:01] RECOVERY - puppet last run on mw2211 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:19:01] RECOVERY - puppet last run on mw2024 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [07:19:02] RECOVERY - puppet last run on mw2047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:19:02] RECOVERY - puppet last run on mw2077 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [07:19:10] RECOVERY - puppet last run on mw2110 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:19:21] RECOVERY - puppet last run on mw2101 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [07:20:01] RECOVERY - puppet last run on lvs3002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:20:48] RECOVERY - RAID on mw1107 is OK no RAID installed [07:21:59] PROBLEM - puppet last run on mw2103 is CRITICAL puppet fail [07:21:59] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 5 minutes ago with 0 failures [07:25:25] (PS1) Jcrespo: Set max_allowed_packet on m3 (phabricator) to 32M (current: 16M) [puppet] - https://gerrit.wikimedia.org/r/211669 (https://phabricator.wikimedia.org/T98339) [07:30:21] (CR) Santhosh: [C: 1] CX: Enable 'cxstats' campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/211116 (owner: KartikMistry) [07:34:19] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [07:34:19] RECOVERY - puppet last run on cp4020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:41:49] RECOVERY - puppet last run on mw2103 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [07:43:18] (PS2) Jcrespo: Set max_allowed_packet on m3 (phabricator) 
to 32M (current: 16M) [puppet] - 10https://gerrit.wikimedia.org/r/211669 (https://phabricator.wikimedia.org/T98339) [07:51:11] (03PS1) 10KartikMistry: CX: Add wikis for deployment on 20150518 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211671 (https://phabricator.wikimedia.org/T98454) [07:51:17] (03CR) 10jenkins-bot: [V: 04-1] CX: Add wikis for deployment on 20150518 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211671 (https://phabricator.wikimedia.org/T98454) (owner: 10KartikMistry) [07:53:35] (03PS2) 10KartikMistry: CX: Add wikis for deployment on 20150518 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211671 (https://phabricator.wikimedia.org/T98454) [07:54:40] (03CR) 10Springle: [C: 031] Set max_allowed_packet on m3 (phabricator) to 32M (current: 16M) [puppet] - 10https://gerrit.wikimedia.org/r/211669 (https://phabricator.wikimedia.org/T98339) (owner: 10Jcrespo) [07:54:46] 6operations, 6Phabricator, 5Patch-For-Review: m3 set max_allowed_packet to 33554432 or greater - https://phabricator.wikimedia.org/T98339#1292795 (10jcrespo) Once the patch is applied, the second part of the fix will be to execute `SET GLOBAL max_allowed_packet = 32M;` from slaves to masters on db2012.codfw.... [08:01:39] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:07] (03CR) 10Jcrespo: [C: 032] Set max_allowed_packet on m3 (phabricator) to 32M (current: 16M) [puppet] - 10https://gerrit.wikimedia.org/r/211669 (https://phabricator.wikimedia.org/T98339) (owner: 10Jcrespo) [08:04:19] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [08:05:50] (03PS1) 10Filippo Giunchedi: es-tool: wait longer before enabling replication [puppet] - 10https://gerrit.wikimedia.org/r/211672 (https://phabricator.wikimedia.org/T99500) [08:06:38] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:13:10] RECOVERY - RAID on mw1107 is OK no RAID installed [08:13:10] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:20:46] (03CR) 10Giuseppe Lavagetto: [C: 031] Set HHVM mysql connection timeout to 3s [puppet] - 10https://gerrit.wikimedia.org/r/211155 (https://phabricator.wikimedia.org/T98489) (owner: 10BryanDavis) [08:22:53] mobrovac: bonjour :} [08:23:08] mobrovac: are you handling the beta cluster VE/RestBase path config issue? https://phabricator.wikimedia.org/T99496 [08:23:10] buongiorno mr hashar [08:24:01] (03CR) 10Giuseppe Lavagetto: "I'm not sure how much of a burden on the puppetmasters would be having this running for every puppet run would cause. I guess quite a lot." [puppet] - 10https://gerrit.wikimedia.org/r/210926 (owner: 10Faidon Liambotis) [08:24:08] hashar: for now i'm looking into why RB keeps failing there, the path config should be easy enough to fix after that [08:24:13] (famous last words...) [08:27:56] ahah [08:28:03] mobrovac: mind if I assign the task to you ? [08:28:12] oh you did [08:28:14] awesome [08:28:15] hashar: self-assigned already [08:28:16] :P [08:33:18] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [08:38:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:39:28] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:39:58] 6operations, 10Wikimedia-DNS, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1292858 (10Bawolff) >>! In T99216#1290616, @80686 wrote: > good point about the privacy policy. My suggestion is that we point at the WMF pri... 
[08:43:58] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193 [08:44:39] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:44:58] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:45:19] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 22.22% of data above the critical threshold [20000.0] [08:45:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 22.22% of data above the critical threshold [20000.0] [08:45:39] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:45:48] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:46:09] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 22.22% of data above the critical threshold [20000.0] [08:46:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:47:28] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:47:40] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:48:18] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0] [08:48:59] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0 [08:49:19] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [08:49:38] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [08:50:05] mmhh ulsfo problems? 
[08:50:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [08:50:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [08:50:40] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [08:50:48] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [08:50:59] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [08:51:30] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [08:51:49] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [08:52:08] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [08:52:30] <_joe_> godog: likely so, yes [08:53:09] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:57:23] (03PS1) 10Filippo Giunchedi: depool ulsfo due to traffic issues [dns] - 10https://gerrit.wikimedia.org/r/211681 [08:57:40] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [08:58:59] _joe_ akosiaris ^ thoughts ? I think we should depool [08:59:53] <_joe_> still showing packet loss? [09:00:09] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:09] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:00:30] eqiad-ulsfo yeah, smokeping sees loss [09:00:38] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [09:00:49] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:00:51] <_joe_> yeah we should [09:00:58] yep, depooling [09:01:07] 6operations, 6Phabricator, 5Patch-For-Review: m3 set max_allowed_packet to 33554432 or greater - https://phabricator.wikimedia.org/T98339#1292901 (10jcrespo) @chasemp Changes has been applied to m3, unless the application maintains persistent connections/pool, it should already fix the issue, can you check i... [09:01:10] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] depool ulsfo due to traffic issues [dns] - 10https://gerrit.wikimedia.org/r/211681 (owner: 10Filippo Giunchedi) [09:01:39] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:01:49] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 13.33% of data above the critical threshold [500.0] [09:01:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:02:03] !log loss on ulsfo-eqiad, depooled ulsfo [09:02:08] Logged the message, Master [09:02:09] RECOVERY - Router interfaces on cr2-ulsfo is OK host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 [09:02:09] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 22.22% of data above the critical threshold [20000.0] [09:02:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:02:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:02:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is 
CRITICAL 11.11% of data above the critical threshold [20000.0] [09:02:49] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:02:49] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:03:08] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:03:29] RECOVERY - RAID on mw1107 is OK no RAID installed [09:03:29] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 53 minutes ago with 0 failures [09:03:46] <_joe_> I'll look at mw1107 in a few [09:04:33] 6operations, 6Phabricator, 5Patch-For-Review: m3 set max_allowed_packet to 33554432 or greater - https://phabricator.wikimedia.org/T98339#1292907 (10mmodell) @jcrespo: phabricator no longer complains about max_allowed_packet, so it seems to be resolved. It does still complain about php post_max_size, but tha... 
[09:04:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:05:19] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [09:05:49] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [09:06:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [09:06:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [09:06:29] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [09:06:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [09:06:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0] [09:07:10] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [09:07:28] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193 [09:07:29] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [09:07:29] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [09:07:30] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [09:07:58] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [09:08:17] 6operations, 6Phabricator, 5Patch-For-Review: m3 set max_allowed_packet to 33554432 or greater - https://phabricator.wikimedia.org/T98339#1292916 (10jcrespo) 5Open>3Resolved Great! Yes, open if needed a separate task, as that is not db-related. 
[09:08:48] PROBLEM - HTTPS on cp4010 is CRITICAL: Return code of 110 is out of bounds [09:09:18] PROBLEM - HTTPS on cp4017 is CRITICAL: Return code of 110 is out of bounds [09:09:28] PROBLEM - HTTPS on cp4016 is CRITICAL: Return code of 110 is out of bounds [09:09:39] PROBLEM - HTTPS on cp4006 is CRITICAL: Return code of 255 is out of bounds [09:10:20] RECOVERY - HTTPS on cp4017 is OK: SSLXNN OK - 36 OK [09:10:48] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0 [09:10:49] RECOVERY - HTTPS on cp4016 is OK: SSLXNN OK - 36 OK [09:11:08] PROBLEM - puppet last run on cp4014 is CRITICAL puppet fail [09:11:19] RECOVERY - HTTPS on cp4006 is OK: SSLXNN OK - 36 OK [09:11:19] RECOVERY - HTTPS on cp4010 is OK: SSLXNN OK - 36 OK [09:11:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 22.22% of data above the critical threshold [20000.0] [09:12:09] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 22.22% of data above the critical threshold [20000.0] [09:12:30] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 22.22% of data above the critical threshold [20000.0] [09:12:30] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:12:30] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 22.22% of data above the critical threshold [20000.0] [09:12:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 22.22% of data above the critical threshold [20000.0] [09:12:49] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 22.22% of data above the critical threshold [20000.0] [09:12:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:13:19] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold 
[20000.0] [09:14:30] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:14:39] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:14:58] PROBLEM - puppet last run on cp4005 is CRITICAL Puppet has 2 failures [09:14:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:18:59] PROBLEM - puppet last run on db2058 is CRITICAL puppet fail [09:19:19] PROBLEM - puppet last run on cp4016 is CRITICAL puppet fail [09:20:20] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:20:20] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:20:40] PROBLEM - puppet last run on cp4010 is CRITICAL puppet fail [09:25:05] 6operations, 7database: switch to innodb tables for replication state - https://phabricator.wikimedia.org/T99486#1292951 (10jcrespo) p:5Normal>3Low I will set this to low, as **current procedure does work** and it may not be strictly necessary until (if) a switch to GTID-based replication happens (which ha... 
[09:25:19] RECOVERY - RAID on mw1107 is OK no RAID installed [09:25:19] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [09:25:29] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [09:25:59] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [09:27:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [09:27:20] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [09:27:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [09:27:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [09:27:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [09:27:59] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [09:27:59] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [09:27:59] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [09:28:08] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [09:28:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [09:28:28] RECOVERY - puppet last run on cp4005 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [09:28:29] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [09:28:29] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0] [09:35:40] RECOVERY - puppet 
last run on db2058 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:37:33] RECOVERY - puppet last run on cp4010 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:37:49] RECOVERY - puppet last run on cp4016 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:42:10] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:42:10] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:42:35] (03PS1) 10Muehlenhoff: Use 3.19 on jessie by default (Bug: T97411) [puppet] - 10https://gerrit.wikimedia.org/r/211688 [09:48:48] RECOVERY - RAID on mw1107 is OK no RAID installed [09:55:20] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:00:18] RECOVERY - RAID on mw1107 is OK no RAID installed [10:00:19] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 36 minutes ago with 0 failures [10:03:09] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [10:37:49] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0] [10:43:25] (03PS1) 10Jcrespo: Depooling db1063 for configuration and restart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211693 [10:46:49] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:49:19] PROBLEM - configured eth on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:49:59] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:50:48] RECOVERY - configured eth on mw1107 is OK - interfaces up [10:51:38] RECOVERY - RAID on mw1107 is OK no RAID installed [10:51:39] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 14 minutes ago with 0 failures [10:56:59] (03CR) 10Springle: [C: 031] Depooling db1063 for configuration and restart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211693 (owner: 10Jcrespo) [10:58:31] (03CR) 10Jcrespo: [C: 032] Depooling db1063 for configuration and restart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211693 (owner: 10Jcrespo) [11:02:39] is there anything going on in the cluster now? I would like to fix graphoid service (nodejs) https://phabricator.wikimedia.org/T99349 [11:04:29] springle, ^ [11:06:38] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:07:17] !log depooling db1063 from cluster for maintenance [11:07:22] Logged the message, Master [11:07:40] yurik, ^just a restart, etc. [11:08:03] jynus, so should be ok to git deploy sync a sca1x [11:08:09] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:08:18] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 31 minutes ago with 0 failures [11:10:54] !log jynus Synchronized wmf-config/db-eqiad.php: depool db1063 (duration: 01m 00s) [11:10:57] Logged the message, Master [11:11:19] RECOVERY - RAID on mw1107 is OK no RAID installed [11:13:33] !log deployed graphoid update to fix https://phabricator.wikimedia.org/T99349 [11:13:39] Logged the message, Master [11:17:40] (03PS2) 10Jforrester: Enable a test of the VisualEditor A/B testing framework [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205778 [11:19:27] (03PS3) 10Jforrester: Enable a test of the VisualEditor A/B testing framework [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205778 (https://phabricator.wikimedia.org/T90666) [11:27:55] (03PS1) 10Jforrester: Disable VisualEditor A/B test pilot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211696 [11:27:57] (03PS1) 10Jforrester: Enable A/B test of VisualEditor for new accounts on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211697 (https://phabricator.wikimedia.org/T90666) [11:27:59] (03PS1) 10Jforrester: Disable A/B test of VisualEditor for new accounts on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211698 (https://phabricator.wikimedia.org/T90666) [11:40:58] (03PS17) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) [11:41:25] (03PS18) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) [11:41:50] <_joe_> akosiaris: if you have a moment, care to give a look? 
^^ [11:54:37] yurik, sorry I didn't answer, yeah, that is on a different set of machines [11:58:22] (03PS1) 10Muehlenhoff: Update to 3.19.7 [debs/linux] - 10https://gerrit.wikimedia.org/r/211701 [11:59:59] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:00:09] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:02:42] <_joe_> I can't get why this ^^ happens [12:03:19] RECOVERY - RAID on mw1107 is OK no RAID installed [12:03:39] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 22 minutes ago with 0 failures [12:03:40] greg-g: My change to the deployment calendar – https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=159633&oldid=159628 – awaits your approval. Mostly it's as early as we could make it to be as far from the weekend as possible. Hope it's OK. [12:04:05] (03PS1) 10Muehlenhoff: Update to 3.19.8 [debs/linux] - 10https://gerrit.wikimedia.org/r/211702 [12:10:19] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:11:09] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [12:13:29] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [12:17:55] (03PS1) 10Jcrespo: Repooling of db1063 after reststart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211706 [12:19:14] (03PS2) 10Giuseppe Lavagetto: confd: create module [puppet] - 10https://gerrit.wikimedia.org/r/208399 [12:27:27] godog: can you review, https://gerrit.wikimedia.org/r/211371 and https://gerrit.wikimedia.org/r/210914 [12:27:34] godog: beta only [12:27:51] (03PS1) 10Aude: Enable Wikibase arbitrary access on enwikivoyage, fawiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211712 (https://phabricator.wikimedia.org/T98249) [12:29:48] PROBLEM - puppet last run on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:31:18] RECOVERY - puppet last run on mw1040 is OK Puppet is currently enabled, last run 8 minutes ago with 0 failures [12:32:33] (03CR) 10Manybubbles: [C: 04-1] "Why not catch the timeout exception?" [puppet] - 10https://gerrit.wikimedia.org/r/211672 (https://phabricator.wikimedia.org/T99500) (owner: 10Filippo Giunchedi) [12:34:43] (03PS1) 10KartikMistry: CX: Add all languages in source [puppet] - 10https://gerrit.wikimedia.org/r/211713 (https://phabricator.wikimedia.org/T98946) [12:37:38] (03PS1) 10Aude: Enable Graph extension on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211714 [12:37:48] 6operations, 7database: On a maintenance window, upgrade db063 to 14.04 and its MariaDB package to 10.0.16 - https://phabricator.wikimedia.org/T99520#1293342 (10jcrespo) 3NEW a:3jcrespo [12:38:36] springle, ^happy now :-) ? 
[12:39:17] jynus: haha cool [12:40:31] 6operations, 7database: On a maintenance window, upgrade db1063 to 14.04 and its MariaDB package to 10.0.16 - https://phabricator.wikimedia.org/T99520#1293352 (10jcrespo) [12:41:28] PROBLEM - configured eth on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:42:09] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:42:19] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:45] (03PS2) 10Jcrespo: Repooling of db1063 after reststart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211706 [12:47:59] RECOVERY - configured eth on mw1107 is OK - interfaces up [12:48:28] (03CR) 10Springle: [C: 031] Repooling of db1063 after reststart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211706 (owner: 10Jcrespo) [12:48:58] RECOVERY - RAID on mw1107 is OK no RAID installed [12:50:38] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 31 minutes ago with 0 failures [12:53:09] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]BR [12:54:58] (03CR) 10Jcrespo: [C: 032] Repooling of db1063 after reststart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211706 (owner: 10Jcrespo) [12:55:29] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:57:08] RECOVERY - RAID on mw1107 is OK no RAID installed [13:00:04] aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150518T1300). 
[13:01:01] !log jynus Synchronized wmf-config/db-eqiad.php: repool db1063 (duration: 00m 17s) [13:01:07] Logged the message, Master [13:17:44] (03CR) 10Aude: [C: 032] Enable Wikibase arbitrary access on enwikivoyage, fawiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211712 (https://phabricator.wikimedia.org/T98249) (owner: 10Aude) [13:17:56] (03CR) 10Aude: [C: 032] Enable Graph extension on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211714 (owner: 10Aude) [13:24:09] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:24:18] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:25:22] (03Merged) 10jenkins-bot: Enable Wikibase arbitrary access on enwikivoyage, fawiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211712 (https://phabricator.wikimedia.org/T98249) (owner: 10Aude) [13:25:25] (03Merged) 10jenkins-bot: Enable Graph extension on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211714 (owner: 10Aude) [13:26:59] !log jynus Synchronized wmf-config/db-eqiad.php: repool db1063 after warmup period (duration: 01m 01s) [13:27:02] Logged the message, Master [13:27:19] RECOVERY - RAID on mw1107 is OK no RAID installed [13:27:30] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 29 minutes ago with 0 failures [13:31:46] !log aude Synchronized php-1.26wmf6/extensions/Wikidata: Fix rdf dump script (duration: 03m 23s) [13:31:49] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:31:49] (03PS4) 10Filippo Giunchedi: Beta: CX: Enable all languages in source and target [puppet] - 10https://gerrit.wikimedia.org/r/211371 (https://phabricator.wikimedia.org/T98946) (owner: 10KartikMistry) [13:31:50] Logged the message, Master [13:31:54] (03PS3) 10Filippo Giunchedi: Beta: CX: Add Hebrew (he) as target language [puppet] - 
10https://gerrit.wikimedia.org/r/210914 (https://phabricator.wikimedia.org/T99082) (owner: 10KartikMistry) [13:31:59] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Beta: CX: Add Hebrew (he) as target language [puppet] - 10https://gerrit.wikimedia.org/r/210914 (https://phabricator.wikimedia.org/T99082) (owner: 10KartikMistry) [13:32:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Beta: CX: Enable all languages in source and target [puppet] - 10https://gerrit.wikimedia.org/r/211371 (https://phabricator.wikimedia.org/T98946) (owner: 10KartikMistry) [13:32:19] PROBLEM - puppet last run on mw2168 is CRITICAL Puppet has 1 failures [13:32:20] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:32:30] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:32:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:32:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:32:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:32:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:33:09] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:33:13] kart_: the two patches conflict, I'm going to merge https://gerrit.wikimedia.org/r/#/c/210914/ since that's already submitted, please rebase the other [13:33:19] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:33:19] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:33:39] PROBLEM - Varnishkafka Delivery 
Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:34:29] !log aude Synchronized wmf-config/InitialiseSettings.php: Enable arbitrary access on enwikivoyage, fawiki, and hewiki, and graph extension everywhere (duration: 00m 57s) [13:34:33] Logged the message, Master [13:34:57] is someone working on mw1107? [13:35:32] godog: thanks a lot. [13:36:06] kart_: np, it'd have been better if I merged the other first but meh :) [13:36:20] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [13:36:20] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [13:36:20] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [13:36:29] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [13:36:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [13:36:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [13:36:49] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [13:36:59] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [13:37:38] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [13:38:12] !log aude Synchronized wmf-config/InitialiseSettings-labs.php: Remove beta-specific Graph settings (duration: 01m 46s) [13:38:18] Logged the message, Master [13:39:05] mark hi, any updates on granting me access to sca cluster - https://phabricator.wikimedia.org/T98371 ? 
[13:39:24] ok, my repool seems ok, going for lunch now [13:39:34] yurik: no but we'll discuss it in the ops meeting again today [13:39:43] there wasn't one last week which is probably why this is delayed, sorry [13:39:53] np, thx [13:40:27] mark: jynus workign on mw1107? [13:40:45] Timeout, server mw1107.eqiad.wmnet not responding. [13:40:53] when syncing [13:41:23] <_joe_> aude: something strange is surely happening on that server [13:41:36] yep [13:41:50] i have trouble to login there [13:42:08] aude, no [13:42:13] jynus: ok [13:45:48] RECOVERY - RAID on mw1107 is OK no RAID installed [13:45:49] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:50:49] RECOVERY - puppet last run on mw2168 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:50:59] 7Blocked-on-Operations, 10RESTBase, 5Patch-For-Review: Enable group 1 wikis in RESTBase - https://phabricator.wikimedia.org/T93452#1293528 (10mobrovac) Deployment scheduled for [2015-05-19](https://wikitech.wikimedia.org/wiki/Deployments#Tuesday.2C.C2.A0May.C2.A019), @fgiunchedi will assist us. [13:51:55] (03PS1) 10Ottomata: Log frontend parsoid varnish requests to kafka via varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/211720 (https://phabricator.wikimedia.org/T99372) [13:55:41] 6operations, 10Wikimedia-Site-requests: Move the Nourmande Wikipedia from nrm to nrf - https://phabricator.wikimedia.org/T25216#1293542 (10Verdy_p) Bur nrf is just for Guernsey and Jersey (Norman with lots of changes borrowed from English, plus historic forms), not for continental Norman that has at least two... [13:58:29] hey bblack, yt? you mentioned something about wanting to change the way the parsoid cache stuff was set up, right? 
[13:59:22] (03PS5) 10KartikMistry: Beta: CX: Enable all languages in source and target [puppet] - 10https://gerrit.wikimedia.org/r/211371 (https://phabricator.wikimedia.org/T98946) [14:00:07] godog: done ^^ [14:00:17] (03CR) 10Mobrovac: Log frontend parsoid varnish requests to kafka via varnishkafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/211720 (https://phabricator.wikimedia.org/T99372) (owner: 10Ottomata) [14:00:58] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0] [14:01:19] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:01:19] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [14:01:39] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [14:01:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Beta: CX: Enable all languages in source and target [puppet] - 10https://gerrit.wikimedia.org/r/211371 (https://phabricator.wikimedia.org/T98946) (owner: 10KartikMistry) [14:01:49] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [14:01:49] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [14:01:49] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [14:01:58] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [14:02:07] kart_: cool, merged [14:02:08] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [14:02:09] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the 
critical threshold [20000.0] [14:02:09] hey! ULSFO what gives?! [14:02:18] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [14:02:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [14:02:48] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [14:03:09] ottomata: there were problems this morning too, it is depooled btw [14:03:21] (03CR) 10Ottomata: "Yuri mentioned that _parsoid might be a bad name for this topic, as this cache cluster does more than parsoid. I want to talk to bblack a" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/211720 (https://phabricator.wikimedia.org/T99372) (owner: 10Ottomata) [14:04:10] (03CR) 10KartikMistry: [C: 04-1] "To be merged after, https://gerrit.wikimedia.org/r/#/c/211671/" [puppet] - 10https://gerrit.wikimedia.org/r/211713 (https://phabricator.wikimedia.org/T98946) (owner: 10KartikMistry) [14:05:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [14:05:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [14:05:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [14:05:19] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [14:05:29] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [14:05:29] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [14:05:29] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [14:05:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK 
Less than 1.00% above the threshold [0.0] [14:05:49] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [14:05:59] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0] [14:06:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [14:06:19] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:06:28] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [14:06:49] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [14:07:14] <_joe_> !log restarting HHVM on mw1107 - memory leak probably happening [14:07:21] Logged the message, Master [14:07:49] RECOVERY - RAID on mw1107 is OK no RAID installed [14:07:59] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [14:12:19] PROBLEM - puppet last run on cp3009 is CRITICAL puppet fail [14:13:09] mobrovac: yt? [14:13:32] ottomata: yup [14:13:48] yurik: mentioned you wrote some system to log stuff from nodejs services [14:13:51] just curious, what's that? [14:14:40] ottomata: i didn't write it, i just use it in the node.js service template [14:14:51] ottomata: it's just a log-relay thin thingy [14:15:32] 6operations, 6Labs: Andrew needs to get paged for labs cluster icinga alerts. - https://phabricator.wikimedia.org/T99524#1293563 (10Andrew) 3NEW a:3Andrew [14:15:39] and we use it to send stuff to logstash [14:15:41] or stdout [14:15:44] or a file [14:15:53] structured logs, that is [14:16:12] interesting, what kind of logs is it used for? [14:16:15] operational stuff? 
[14:16:50] yes [14:17:09] anything from trace to fatal i-m-committing-suicide ones [14:17:27] and you can adjust the level, ofc [14:17:30] hm, ok, cool, interesting, that seems like the right thing to do then. [14:17:58] this is neat because it lets devs see stuff on the console and ops in logstash in prod [14:17:59] i'm interested in seeing a kafka + hadoop/other based generic eventlogging type system soon [14:18:04] yeah [14:18:23] yeah, i know, we need it badly as well [14:18:32] but, no resources for this now [14:18:46] the services team still values sleep time :) [14:19:29] PROBLEM - puppet last run on dbstore2002 is CRITICAL Puppet has 1 failures [14:19:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 22.22% of data above the critical threshold [20000.0] [14:19:38] PROBLEM - puppet last run on db2060 is CRITICAL Puppet has 1 failures [14:19:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 22.22% of data above the critical threshold [20000.0] [14:20:09] PROBLEM - puppet last run on mw1064 is CRITICAL Puppet has 1 failures [14:20:09] PROBLEM - puppet last run on mw1112 is CRITICAL Puppet has 1 failures [14:20:20] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 22.22% of data above the critical threshold [20000.0] [14:20:28] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 22.22% of data above the critical threshold [20000.0] [14:20:28] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 22.22% of data above the critical threshold [20000.0] [14:20:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 22.22% of data above the critical threshold [20000.0] [14:20:39] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 22.22% of data above the critical threshold [20000.0] [14:20:48] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] 
[14:20:49] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 22.22% of data above the critical threshold [20000.0] [14:20:58] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 22.22% of data above the critical threshold [20000.0] [14:21:02] (03PS2) 10KartikMistry: CX: Add languages for deployment 20150518 and all source languages [puppet] - 10https://gerrit.wikimedia.org/r/211713 (https://phabricator.wikimedia.org/T98946) [14:21:09] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 22.22% of data above the critical threshold [20000.0] [14:21:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 22.22% of data above the critical threshold [20000.0] [14:21:48] 6operations, 7HHVM: HHVM 3.6 leaks memory - https://phabricator.wikimedia.org/T99525#1293578 (10Joe) 3NEW [14:22:11] 6operations, 7HHVM: HHVM 3.6 leaks memory - https://phabricator.wikimedia.org/T99525#1293585 (10Joe) [14:22:27] nice... [14:22:51] <_joe_> akosiaris: very nice, and now finding someone to work on this will be even funnier [14:24:06] (03CR) 10GWicke: "Are we confident that the new render code is producing a result that looks as intended? 
Before the last deploy we tested this manually on " [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) [14:24:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [14:27:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [14:27:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [14:27:10] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [14:27:18] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [14:27:19] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [14:27:30] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [14:27:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [14:27:58] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [14:28:08] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0] [14:28:10] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [14:28:29] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [14:28:55] (03PS1) 10Ottomata: Use udp2log::rsyncd class to set up rsyncd on fluorine [puppet] - 10https://gerrit.wikimedia.org/r/211721 (https://phabricator.wikimedia.org/T99245) [14:30:59] RECOVERY - puppet last run on cp3009 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [14:32:42] (03CR) 10Ottomata: [C: 032] Use udp2log::rsyncd class to set up rsyncd on 
fluorine [puppet] - 10https://gerrit.wikimedia.org/r/211721 (https://phabricator.wikimedia.org/T99245) (owner: 10Ottomata) [14:34:39] RECOVERY - puppet last run on dbstore2002 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:35:19] RECOVERY - puppet last run on mw1064 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:35:29] !log disabling puppet on labnet1001 to debug dnsmasq [14:35:33] Logged the message, Master [14:35:39] PROBLEM - puppet last run on es1002 is CRITICAL Puppet has 1 failures [14:36:29] RECOVERY - puppet last run on db2060 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:36:59] RECOVERY - puppet last run on mw1112 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:37:32] 6operations, 10Analytics-EventLogging: Allow eventlogging ZeroMQ traffic to be consumed inside of the Analytics VLAN. - https://phabricator.wikimedia.org/T99246#1293621 (10Ottomata) Thank you! confirmed. [14:37:42] (03CR) 10BBlack: [C: 04-1] "All of the bnx2x num_queues stuff (which is the bulk of it) can be left down in the caches-only stanza as well. That's only for hosts whe" [puppet] - 10https://gerrit.wikimedia.org/r/211688 (owner: 10Muehlenhoff) [14:38:17] (03CR) 10BBlack: [C: 031] "Sounds good in theory to me, assuming it's the right INI name for it." [puppet] - 10https://gerrit.wikimedia.org/r/211155 (https://phabricator.wikimedia.org/T98489) (owner: 10BryanDavis) [14:39:24] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1293625 (10Paladox) [14:40:15] (03Abandoned) 10BBlack: Reverts 3x changes (logo paths + removal of wmgUseBits), sets wmgUseBits = true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209722 (owner: 10BBlack) [14:41:33] bblack: what do you thikn should happen to the parsoidcache cluster? 
[14:41:42] will those requests move to an existing varnishcluster? [14:42:49] ottomata: I think that's a conversation we need to have soon, because I think not everybody involved is on the same page. [14:43:05] but parsoidcache is an existing varnishcluster :) [14:43:18] ha, yes, buuuuuut, i mean, one of the more usual ones [14:43:26] text, misc, upload, etc. [14:44:11] so basically the parsoidcache varnishes were a special thing just for parsoid, not for other services [14:44:37] 7Blocked-on-Operations, 5Patch-For-Review, 3Search-and-Discovery-Research-and-Data-Sprint: Create rsync connector to fluorine - https://phabricator.wikimedia.org/T98383#1293629 (10Ottomata) 5Resolved>3Open Done! [14:44:55] aye [14:45:12] when other services started appearing, they were put there "temporarily" while the longer-term plan was to have them run through restbase rather than through varnish directly as separate services at the varnish level [14:45:31] and then restbase deployed, and yet still new services are getting deployed outside of restbase, via the parsoidcache cluster [14:45:34] hm, will things like restbase and other services always have varnish in front? [14:46:00] I've heard that we're still on-track for those services going through restbase, but then I hear comments from people working on those services like they're totally unaware of that plan... [14:46:13] restbase does have varnish in front of it, on the text clusters [14:46:22] the idea is that the other services flow through restbase, and thus through that. [14:46:26] ah ok, on text. cool [14:46:34] would parsoid be moved to text then too? [14:46:38] or will that stay in parsoid? [14:46:40] cluster [14:46:40] https://phabricator.wikimedia.org/T96688 [14:46:43] no either [14:46:46] err neither [14:47:06] oh? [14:47:39] ok, so the reason i'm asking, is in order to consume logs for those http requests into hadoop, they need a vanrishkafka instance, which needs a topic name. 
[14:47:44] it's not that the old *oid services will move to the text cluster. it's that they become part of the restbase API, which happens to live in the text cluster [14:47:55] the more webrequest_ topics we have, the more overhead there is on the analytics side of things (more oozie jobs to deal with, etc.) [14:48:28] bblack: that's fine, i'm just curious as to where the http requests for these services will first hit [14:48:37] if they all end up hitting text frontend because restbase is in text [14:48:41] then there is nothing I need to do :) [14:49:00] manybubbles, marktraceur, ^d, thcipriani: Who wants to SWAT this morning? [14:49:01] right [14:49:05] * anomie is busy catching up [14:49:12] will parsoidcache cluster likely go away then? [14:49:16] unless we want to keep adding more hacks to make the parsoidcache situation more-livable in the interim. [14:49:33] well, IF the intention is that these requests will definitely go to text cluster [14:49:34] but honestly if this situation's going to keep dragging out forever, it should be a separate cluster or something :P [14:49:49] then there's no reason for me not to just have the parsoidcache varnishkafka also log to the webrequest_text topic [14:50:02] it would look totally different when the data moves anyways [14:50:04] even now that would be fine [14:50:10] what would? [14:50:14] the URLs and such [14:50:22] oh? [14:50:22] (03PS3) 10KartikMistry: CX: Add wikis for deployment on 20150518 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211671 (https://phabricator.wikimedia.org/T98454) [14:50:32] well, i don't mind that so much, as long as we are collecting the data I think [14:50:41] yuri is asking for those to be in hadoop [14:50:43] also, I'm not sure it's a great idea to log all parsoidcache requests, but I'm not sure....
[14:50:49] ottomata: we should have an RB entry point for graphoid in the next days [14:51:12] code at https://github.com/wikimedia/restbase/pull/247, puppet patch to follow later today [14:51:13] (03PS3) 10KartikMistry: CX: Add languages for deployment 20150518 and all source languages [puppet] - 10https://gerrit.wikimedia.org/r/211713 (https://phabricator.wikimedia.org/T98946) [14:51:32] then why is yuri still saying things like "parsoidcache seems like a bad name for a generic services cluster" and asking for logging endpoints there for his service, etc? [14:51:38] gwicke: cool, that means graphoid requests will show up at text varnishes rather than parsoidcache varnishes? [14:51:43] (03PS4) 10KartikMistry: CX: Add wikis for deployment on 20150518 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211671 (https://phabricator.wikimedia.org/T98454) [14:51:56] bblack: yuri just wants the requests logged [14:52:00] (03PS4) 10KartikMistry: CX: Add languages for deployment 20150518 and all source languages [puppet] - 10https://gerrit.wikimedia.org/r/211713 (https://phabricator.wikimedia.org/T98946) [14:52:07] ottomata: yes, once the extension points to that [14:52:17] i think maybe he is not on the same page as you and the services folks about what is to happen to parsoidcache [14:52:22] RB also gives him metrics etc [14:52:24] he just knows that there is more than parsoid behind that cluster [14:52:27] so it is a bad name [14:52:29] RECOVERY - puppet last run on es1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:52:55] ottomata: it is a special parsoid varnish [14:53:02] with huge storage etc [14:53:11] gwicke: is taht expected to stay around? [14:53:20] or will that also move behind restbase? [14:53:23] errrrr [14:53:26] the other services were only added temporarily, as it was the path of least resistance at that point [14:53:59] ok, yeah q: is parsoidcache planned to stick around and serve only parsoid requests? 
[14:54:01] is that the intention? [14:54:12] (03PS1) 10Mobrovac: Beta: RESTBase: Switch to deployment-logstash1's IP address [puppet] - 10https://gerrit.wikimedia.org/r/211724 (https://phabricator.wikimedia.org/T99506) [14:54:12] ottomata: the intention is to switch the old parsoid varnishes off once the services are migrated off it [14:54:12] and other services will move elsewhere (text varnish/RB) [14:54:14] ? [14:54:21] oh. [14:54:33] (03PS2) 10Mobrovac: Beta: RESTBase: Switch to deployment-logstash1's IP address [puppet] - 10https://gerrit.wikimedia.org/r/211724 (https://phabricator.wikimedia.org/T99506) [14:54:34] Did someone claim SWAT or should I get off the bench? [14:54:41] so, all services are expected to be served via RB via varnish text? [14:54:57] pretty much, yes [14:54:59] including parsoid? [14:55:06] parsoid is already served via RB [14:55:10] OH! [14:55:14] well [14:55:21] then adding varnishkafka to parsoidcache will not help yuri at all :) [14:55:24] but it still exists on parsoidcache as well, and we still can't kill that yet, right? [14:55:40] marktraceur: No one claimed it yet that I've seen. [14:55:48] we'll stop updating those caches as soon as OCG has switched too [14:55:48] I'll do it then [14:55:58] which should be this week [14:56:06] yurik: still around? [14:56:27] gwicke: many parsoid reqs are coming via text varnish cluster now then? [14:56:31] at that point we'll tell the remaining users to migrate [14:57:01] ottomata: yes, all VE requests for example [14:57:12] * marktraceur pings AaronSchulz et Krinkle et kart_ for SWAT in ~5 minutes [14:58:00] hm! 
[14:58:14] (03PS5) 10KartikMistry: CX: Add wikis for deployment on 20150518 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211671 (https://phabricator.wikimedia.org/T98454) [14:58:39] (03PS5) 10KartikMistry: CX: Add languages for deployment 20150518 and all source languages [puppet] - 10https://gerrit.wikimedia.org/r/211713 (https://phabricator.wikimedia.org/T98946) [14:59:55] ottomata: yuri is looking for graphoid requests, which aren't exposed via text yet [15:00:05] manybubbles, anomie, ^d, thcipriani, AaronSchulz, Krinkle, kart_: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150518T1500). Please do the needful. [15:00:09] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0] [15:00:18] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 22.22% of data above the critical threshold [20000.0] [15:00:32] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 22.22% of data above the critical threshold [20000.0] [15:00:58] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [15:00:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [15:00:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [15:00:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 22.22% of data above the critical threshold [20000.0] [15:01:18] PROBLEM - puppet last run on cp3022 is CRITICAL puppet fail [15:01:19] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [15:01:19] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical 
threshold [20000.0] [15:01:19] PROBLEM - puppet last run on mw2019 is CRITICAL Puppet has 1 failures [15:01:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 22.22% of data above the critical threshold [20000.0] [15:01:39] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [15:01:48] PROBLEM - puppet last run on mw2018 is CRITICAL Puppet has 1 failures [15:01:49] PROBLEM - puppet last run on mw2142 is CRITICAL Puppet has 1 failures [15:01:49] PROBLEM - puppet last run on mw1154 is CRITICAL Puppet has 1 failures [15:02:19] PROBLEM - puppet last run on terbium is CRITICAL Puppet has 1 failures [15:02:48] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [15:03:08] PROBLEM - puppet last run on palladium is CRITICAL Puppet has 1 failures [15:03:17] geez what's up with ulsfo? [15:03:27] marktraceur: pong [15:03:35] paravoid: are there network problems in ulsfo? [15:03:40] yes [15:03:41] * Krinkle is here [15:03:48] ok [15:03:50] OK, two of three is a good start [15:03:59] not loss that I can see of, though, just latency [15:04:12] I don't understand why kafka can't handle that well tbh [15:04:18] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [15:04:18] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [15:04:38] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [15:04:38] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [15:04:47] kart_: You only have one config patch so you'll go first. [15:04:52] marktraceur: I'm here. 
[15:04:55] (03Abandoned) 10Ottomata: Log frontend parsoid varnish requests to kafka via varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/211720 (https://phabricator.wikimedia.org/T99372) (owner: 10Ottomata) [15:04:56] marktraceur: go ahead. [15:04:59] My clocks both say it's time to go, so we're going! [15:05:05] * marktraceur pokes jouncebot [15:05:08] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0] [15:05:15] Oh, it didn't ping me. [15:05:15] godog: around? [15:05:15] kart_: ping detected, please leave a message! [15:05:19] paravoid: i think we can tune it better, but it would have to take more memory on the varnishes then [15:05:28] paravoid: it could also be just that the drerr check can't handle zero data points from zero traffic? [15:05:29] that happesn when the buffers fill up waiting for message AKCs [15:05:31] ACKs [15:05:49] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [15:05:56] (03CR) 10MarkTraceur: [C: 032] CX: Add wikis for deployment on 20150518 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211671 (https://phabricator.wikimedia.org/T98454) (owner: 10KartikMistry) [15:05:59] bblack: as in the data wouldn't be getting to statsd? [15:05:59] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [15:05:59] no clients -> no vk stats data to look at -> fakes out this icinga check and makes it think things are amiss? [15:06:03] (03Merged) 10jenkins-bot: CX: Add wikis for deployment on 20150518 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211671 (https://phabricator.wikimedia.org/T98454) (owner: 10KartikMistry) [15:06:17] ottomata: no, as in the data is empty right now and should be, and maybe the checks aren't sane in that scenario [15:06:19] how large are those buffers? 
[15:06:21] godog: https://gerrit.wikimedia.org/r/#/c/211713 [15:06:34] godog: Please :) [15:06:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [15:07:10] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [15:07:35] kart_: sure, I'll merge that [15:07:38] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [15:07:48] (03PS6) 10Filippo Giunchedi: CX: Add languages for deployment 20150518 and all source languages [puppet] - 10https://gerrit.wikimedia.org/r/211713 (https://phabricator.wikimedia.org/T98946) (owner: 10KartikMistry) [15:07:49] hm, bblack, i doubt it, since these are coming from the rate of delivery errors. each time vk outputs to that stats.json file, it outputs the number of drerrs it has ever seen. the rate is the change in that number [15:07:53] !log marktraceur Synchronized wmf-config/InitialiseSettings.php: [SWAT] [config] Add wikis for deployment on 2015-05-18 (duration: 00m 29s) [15:07:54] kart_: Test away! [15:07:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] CX: Add languages for deployment 20150518 and all source languages [puppet] - 10https://gerrit.wikimedia.org/r/211713 (https://phabricator.wikimedia.org/T98946) (owner: 10KartikMistry) [15:07:58] Logged the message, Master [15:07:59] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [15:08:00] so, if no clients, taht means no more drerrs, and the rate would be 0 [15:08:03] which is what it is supposed to be [15:08:13] kart_: merged [15:08:29] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [15:09:01] marktraceur: done. thanks! [15:09:04] godog: thanks! [15:09:09] Sweet. [15:09:12] Krinkle is next. 
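[editor's note] ottomata's point at 15:07:49 is that the alert is driven by the change in a cumulative drerr counter between two stats dumps, so an idle host should report a rate of 0. A minimal sketch of that arithmetic follows. It is not varnishkafka's or the Icinga check's actual code; the function name `counter_rates` and the restart handling are assumptions made only for illustration.

```python
from typing import List


def counter_rates(samples: List[int]) -> List[int]:
    """Convert successive readings of a cumulative counter into per-interval deltas.

    `samples` stands in for the drerr totals read from consecutive stats dumps
    (hypothetical input; the real pipeline is not shown in the log).
    """
    rates = []
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            # Normal case: the counter only grows, so the rate is the difference.
            rates.append(cur - prev)
        else:
            # Assumption: a smaller value means the process restarted and the
            # counter began again from zero, so the new reading is the delta.
            rates.append(cur)
    return rates


if __name__ == "__main__":
    # No traffic: the counter stays flat, so every interval has a rate of 0.
    print(counter_rates([100, 100, 100]))
    # A burst of errors followed by a quiet interval.
    print(counter_rates([100, 150, 150]))
```

Under this reading, a depooled site with no clients produces flat counters and therefore zero rates, which matches ottomata's argument that missing traffic alone should not trip the check.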
[15:09:40] oh whoa, paravoid, i'm looking at varnishkafka.log on cp4010, this looks different than the esams drerrs we ahve sometimes [15:09:56] this looks like vk can't even talk to kafka brokers [15:10:04] KAFKAERR: Kafka error (-195): analytics1012.eqiad.wmnet:9092/bootstrap: Failed to connect to broker at analytics1012.eqiad.wmnet:9092: Connection refused [15:10:13] analytics1018.eqiad.wmnet:9092/bootstrap: Metadata request failed: Local: Broker transport failure [15:10:23] analytics1018.eqiad.wmnet:9092/18: Metadata request failed: Local: Message timed out [15:10:38] 6operations, 10ops-eqiad, 10fundraising-tech-ops: barium has a failed HDD - https://phabricator.wikimedia.org/T93899#1293710 (10Cmjohnson) 5Open>3Resolved Fixed...resolving [15:10:39] marktraceur: OK. [15:10:48] Just gotta wait for the merge... [15:12:31] 6operations, 10ops-codfw, 5Patch-For-Review: Set up missing PDUs in codfw and eqiad - https://phabricator.wikimedia.org/T84416#1293714 (10Cmjohnson) It has come to my attention that are humidity setting are too high. They are set in eqiad at 12% and I had them set at 25%. This needs to be updated. Other th... [15:13:26] 7Blocked-on-Operations, 5Patch-For-Review, 3Search-and-Discovery-Research-and-Data-Sprint: Create rsync connector to fluorine - https://phabricator.wikimedia.org/T98383#1293715 (10Ottomata) 5Open>3Resolved [15:13:31] 6operations, 10fundraising-tech-ops: upgrade tellurium.frack.eqiad.wmnet to Trusty - https://phabricator.wikimedia.org/T95294#1293717 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson This has been completed. 
[15:16:19] RECOVERY - puppet last run on palladium is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:16:40] RECOVERY - puppet last run on mw1154 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:17:09] RECOVERY - puppet last run on terbium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:17:58] RECOVERY - puppet last run on mw2019 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:18:19] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:18:19] RECOVERY - puppet last run on mw2142 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:18:44] ottomata: did you roll out the fluorine -> stat1002 rsync after all? [15:19:09] I got NMS alerts, both servers had their network saturated [15:19:19] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [15:19:26] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=fluorine.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+eqiad [15:19:30] RECOVERY - puppet last run on cp3022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:19:43] I'm not happy with using rsync as a logging pipeline [15:19:44] paravoid: yes, sorry, forgot we were discussing that. i ran an rsync manually to rsync the current data [15:20:18] paravoid: not really sure why. it only rsyncs the already rotated logs [15:20:25] the archived logs [15:20:32] so that they are availble on an analysis host for processing [15:20:41] the logging hosts aren't good for processing [15:21:02] and i don't think we want to maintain udp2log (or kafka) consumption on analysis hosts if we don't have to [15:21:08] why not? 
[15:21:14] those machines should have their computation and disk io available for analysis [15:21:49] but also, this is just how things have always been done with udp2log, i'm not making an arch decision here, just rsyncing one more log for researchers. [15:21:50] how is consuming logs that you need to consume anyway computationally/disk io intensive [15:21:53] it's less spiky [15:22:20] you rsync once a day and save to disk. the rest of the time you don't have the running consumer process to compete with [15:22:24] i think the danger is more the 'other' stuff being computationally intensive and spiking. [15:22:26] Krinkle: Syncing wmf5, then I'll go into wmf6 [15:22:30] k [15:22:41] and also manage. [15:22:45] especially with udp2log [15:22:54] if folks on stat1002 are doing intensive stuff [15:22:56] we can lose packets [15:23:08] problem is, right now with that rsync of yours, fluorine's network was saturated, so we lost some udp2log messages for sure [15:23:15] hehe [15:23:30] !log marktraceur Synchronized php-1.26wmf5/includes/Title.php: [SWAT] [wmf5] Log callers that trigger Title::newFromText $text type warning (duration: 00m 15s) [15:23:36] Logged the message, Master [15:23:57] well, hm, at least we can maybe make that predictable? or, can I throttle the rsync somehow? [15:24:03] Krinkle: I guess it's not super testable, huh? [15:24:29] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [15:24:41] !log marktraceur Synchronized php-1.26wmf6/includes/Title.php: [SWAT] [wmf6] Log callers that trigger Title::newFromText $text type warning (duration: 00m 46s) [15:24:45] Logged the message, Master [15:24:54] I ran around a few random pages and nothing looked obviously broken but if you can test it, now is the time [15:25:15] paravoid: should I add --bwlimit to rsync command?
[15:25:30] sure [15:25:56] oh, i can add that to the daemon module configs [15:25:58] that would be better [15:26:05] the udp2log daemons share a common puppet config [15:26:08] lemme look into that [15:26:12] udp2log rsync daemons* [15:26:19] AaronSchulz: You around yet? [15:26:49] just consuming (while filtering) the log stream on the stat hosts feels a lot cleaner to me, but ymmv, whatever :) [15:26:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0] [15:27:16] paravoid: suggestion for a good throttle limit? [15:27:44] infinite [15:28:17] if you're going to make the user-serving systems deal with generating that flood, the stats systems should at least bother to consume it all :P [15:28:19] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [15:28:51] marktraceur: Yeah, I'll keep an eye on the logs for that one [15:28:55] (CR) Filippo Giunchedi: "my reasoning was that the read timeout is an effect of ES getting busy and give it a little time to settle, on timeout we would simply ret" [puppet] - https://gerrit.wikimedia.org/r/211672 (https://phabricator.wikimedia.org/T99500) (owner: Filippo Giunchedi) [15:28:57] (is for me and Krenair) [15:30:18] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0] [15:31:48] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [15:32:40] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [15:34:12] hah, bblack i do not understand your comment at all, but perhaps I do not need to :) [15:34:29] Yay, merged. [15:35:56] ottomata: what I mean is if throttling means tossing events out, why did we generate them?
[15:36:00] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 22.22% of data above the critical threshold [20000.0] [15:36:19] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [15:36:21] !log marktraceur Synchronized php-1.26wmf6/includes/: [SWAT] [wmf6] resourceloader: Don't cache minification of user.tokens (duration: 00m 19s) [15:36:22] Krinkle: Sync'd, test away (if you can...) [15:36:27] Logged the message, Master [15:36:28] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [15:36:39] AaronSchulz: If you're not here then I'm pushing you to the evening SWAT [15:36:54] marktraceur: testing.. [15:37:01] throttling does not mean tossing out events [15:37:07] bblack, throttling is for rsyncing of archived files [15:37:19] so that the rsync doesn't saturate the nic and cause lost events [15:37:48] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [15:37:55] marktraceur: Verified. [15:38:07] On mediawiki.org,