[00:07:56] (03CR) 10Ori.livneh: [C: 032] Set HHVM mysql connection timeout to 3s on canary servers [puppet] - 10https://gerrit.wikimedia.org/r/211155 (https://phabricator.wikimedia.org/T98489) (owner: 10BryanDavis)
[00:20:18] !log ori Synchronized php-1.26wmf6/includes/jobqueue/JobQueueGroup.php: 1e43c05283: Revert "Undefer push() in lazyPush() temporarily" (duration: 00m 12s)
[00:20:29] Logged the message, Master
[00:37:05] !log ori Synchronized php-1.26wmf6/includes/MediaWiki.php: b13721b5cb: Pass a message key to MalformedTitleException constructor (duration: 00m 12s)
[00:37:14] Logged the message, Master
[00:38:54] !log ori Synchronized php-1.26wmf7/includes/MediaWiki.php: adacd7b35c: Pass a message key to MalformedTitleException constructor (duration: 00m 11s)
[00:39:03] Logged the message, Master
[00:41:03] PROBLEM - HHVM rendering on mw1063 is CRITICAL - Socket timeout after 10 seconds
[00:41:12] PROBLEM - Apache HTTP on mw1063 is CRITICAL - Socket timeout after 10 seconds
[00:42:33] RECOVERY - HHVM rendering on mw1063 is OK: HTTP OK: HTTP/1.1 200 OK - 66095 bytes in 0.181 second response time
[00:42:42] RECOVERY - Apache HTTP on mw1063 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time
[01:32:54] twentyafterfour: thanks!
(i had not added it to dsh groups yet )
[01:36:22] got this issue on mira now:
[01:36:24] E: Unable to locate package trebuchet-trigger
[01:36:57] [ERROR ] An un-handled exception was caught by salt's global exception handler:
[01:37:01] that too
[02:25:04] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (90487s 90000s)
[02:38:06] !log l10nupdate Synchronized php-1.26wmf6/cache/l10n: (no message) (duration: 09m 36s)
[02:38:19] Logged the message, Master
[02:45:21] !log LocalisationUpdate completed (1.26wmf6) at 2015-05-21 02:44:18+00:00
[02:45:28] Logged the message, Master
[03:06:14] !log l10nupdate Synchronized php-1.26wmf7/cache/l10n: (no message) (duration: 06m 13s)
[03:06:24] Logged the message, Master
[03:10:52] !log LocalisationUpdate completed (1.26wmf7) at 2015-05-21 03:09:49+00:00
[03:10:58] Logged the message, Master
[03:58:49] (03PS1) 10Ori.livneh: Make performance.wikimedia.org HTTPS-only [puppet] - 10https://gerrit.wikimedia.org/r/212496
[04:01:07] (03PS2) 10Ori.livneh: Make performance.wikimedia.org HTTPS-only [puppet] - 10https://gerrit.wikimedia.org/r/212496
[04:01:13] (03CR) 10Ori.livneh: [C: 032 V: 032] Make performance.wikimedia.org HTTPS-only [puppet] - 10https://gerrit.wikimedia.org/r/212496 (owner: 10Ori.livneh)
[04:05:11] (03PS1) 10Ori.livneh: Typo fix for Iaba1aee51: apache::mod::headers, not header [puppet] - 10https://gerrit.wikimedia.org/r/212497
[04:05:16] (03CR) 10jenkins-bot: [V: 04-1] Typo fix for Iaba1aee51: apache::mod::headers, not header [puppet] - 10https://gerrit.wikimedia.org/r/212497 (owner: 10Ori.livneh)
[04:05:22] (03PS2) 10Ori.livneh: Typo fix for Iaba1aee51: apache::mod::headers, not header [puppet] - 10https://gerrit.wikimedia.org/r/212497
[04:05:27] (03CR) 10Ori.livneh: [C: 032 V: 032] Typo fix for Iaba1aee51: apache::mod::headers, not header [puppet] - 10https://gerrit.wikimedia.org/r/212497 (owner: 10Ori.livneh)
[04:14:43] PROBLEM - puppet last run on ms-be1018 is CRITICAL Puppet has 1 failures
[04:31:24] RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures
[04:53:43] (03PS1) 10Ori.livneh: Unset $wgDiff, so we stop shelling out to diff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212499
[04:54:01] (03CR) 10Ori.livneh: [C: 032] Unset $wgDiff, so we stop shelling out to diff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212499 (owner: 10Ori.livneh)
[04:54:08] (03Merged) 10jenkins-bot: Unset $wgDiff, so we stop shelling out to diff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212499 (owner: 10Ori.livneh)
[04:55:02] PROBLEM - Translation cache space on mw1161 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[04:55:03] PROBLEM - Translation cache space on mw1033 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[04:55:06] !log ori Synchronized wmf-config/CommonSettings.php: Ia5239c1e: Unset $wgDiff, so we stop shelling out to diff (duration: 00m 12s)
[04:55:13] PROBLEM - Translation cache space on mw1201 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[04:55:13] PROBLEM - Translation cache space on mw1240 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 92%
[04:55:13] PROBLEM - Translation cache space on mw1257 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[04:55:17] Logged the message, Master
[04:55:22] PROBLEM - Translation cache space on mw1107 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[04:55:23] PROBLEM - Translation cache space on mw1173 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 92%
[04:55:23] PROBLEM - Translation cache space on mw1070 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[04:55:23] PROBLEM - Translation cache space on mw1211 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[04:55:23] PROBLEM - Translation cache space on mw1097 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[04:55:23] PROBLEM - Translation cache space on mw1048 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 92%
[04:55:23] PROBLEM - Translation cache space on mw1258 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[04:55:24] PROBLEM - Translation cache space on mw1216 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 97%
[04:55:24] PROBLEM - Translation cache space on mw1214 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[04:55:25] PROBLEM - Translation cache space on mw1036 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[04:55:25] PROBLEM - Translation cache space on mw1080 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[04:55:26] PROBLEM - Translation cache space on mw1057 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 92%
[04:55:26] PROBLEM - Translation cache space on mw1104 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[04:55:27] PROBLEM - Translation cache space on mw1238 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[04:55:32] PROBLEM - Translation cache space on mw1024 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 92%
[04:55:32] PROBLEM - Translation cache space on mw1045 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[04:55:32] PROBLEM - Translation cache space on mw1055 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[04:55:33] PROBLEM - Translation cache space on mw1088 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[04:55:33] PROBLEM - Translation cache space on mw1053 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[04:55:33] PROBLEM - Translation cache space on mw1151 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[04:55:44] PROBLEM - Translation cache space on mw1219 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[04:55:45] PROBLEM - Translation cache space on mw1081 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[04:55:45] PROBLEM - Translation cache space on mw1255 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[04:55:46] PROBLEM - Translation cache space on mw1225 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[04:55:46] PROBLEM - Translation cache space on mw1079 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[04:55:47] PROBLEM - Translation cache space on mw1176 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 97%
[04:55:47] PROBLEM - Translation cache space on mw1043 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[04:55:48] PROBLEM - Translation cache space on mw1022 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[04:55:48] PROBLEM - Translation cache space on mw1182 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 97%
[04:55:49] PROBLEM - Translation cache space on mw1204 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[04:55:49] PROBLEM - Translation cache space on mw1146 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[04:55:50] PROBLEM - Translation cache space on mw1202 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[04:55:50] PROBLEM - Translation cache space on mw1058 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[04:55:51] PROBLEM - Translation cache space on mw1066 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[04:55:51] PROBLEM - Translation cache space on mw1050 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[04:55:52] PROBLEM - Translation cache space on mw1217 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 98%
[04:55:52] PROBLEM - Translation cache space on mw1224 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 97%
[04:55:53] PROBLEM - Translation cache space on mw1037 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[04:56:04] PROBLEM - Translation cache space on mw1164 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[04:56:05] PROBLEM - Translation cache space on mw1241 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:05] PROBLEM - Translation cache space on mw1082 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 97%
[04:56:06] PROBLEM - Translation cache space on mw1100 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 97%
[04:56:06] PROBLEM - Translation cache space on mw1148 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[04:56:07] PROBLEM - Translation cache space on mw1064 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[04:56:07] PROBLEM - Translation cache space on mw1030 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 97%
[04:56:08] PROBLEM - Translation cache space on mw1027 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 97%
[04:56:08] PROBLEM - Translation cache space on mw1207 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 98%
[04:56:09] PROBLEM - Translation cache space on mw1126 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[04:56:12] PROBLEM - Translation cache space on mw1131 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[04:56:12] PROBLEM - Translation cache space on mw1139 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[04:56:12] PROBLEM - Translation cache space on mw1133 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[04:56:13] PROBLEM - Translation cache space on mw1247 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:13] PROBLEM - Translation cache space on mw1073 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[04:56:13] PROBLEM - Translation cache space on mw1046 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 98%
[04:56:14] PROBLEM - Translation cache space on mw1067 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[04:56:14] PROBLEM - Translation cache space on mw1019 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 98%
[04:56:14] PROBLEM - Translation cache space on mw1195 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 98%
[04:56:14] PROBLEM - Translation cache space on mw1094 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[04:56:26] ummm.
[04:56:32] PROBLEM - Translation cache space on mw1068 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:32] PROBLEM - Translation cache space on mw1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:32] PROBLEM - Translation cache space on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:32] PROBLEM - Translation cache space on mw1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:32] PROBLEM - Translation cache space on mw1060 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:42] PROBLEM - Translation cache space on mw1130 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[04:56:42] RECOVERY - Translation cache space on mw1161 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:56:43] PROBLEM - Translation cache space on mw1149 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:43] PROBLEM - Translation cache space on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:43] PROBLEM - Translation cache space on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:43] RECOVERY - Translation cache space on mw1033 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:56:43] PROBLEM - Translation cache space on mw1105 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:44] PROBLEM - Translation cache space on mw1095 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:52] PROBLEM - Translation cache space on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:52] PROBLEM - Translation cache space on mw1026 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:52] PROBLEM - Translation cache space on mw1025 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:52] PROBLEM - Translation cache space on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:53] PROBLEM - Translation cache space on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:53] PROBLEM - Translation cache space on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:53] PROBLEM - Translation cache space on mw1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:53] PROBLEM - Translation cache space on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:54] PROBLEM - Translation cache space on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:54] PROBLEM - Translation cache space on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:55] PROBLEM - Translation cache space on mw1109 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:55] RECOVERY - Translation cache space on mw1201 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:02] RECOVERY - Translation cache space on mw1240 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:02] RECOVERY - Translation cache space on mw1257 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:02] PROBLEM - Translation cache space on mw1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:03] RECOVERY - Translation cache space on mw1173 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:03] RECOVERY - Translation cache space on mw1211 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:03] RECOVERY - Translation cache space on mw1097 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:03] RECOVERY - Translation cache space on mw1258 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:03] they'll recover
[04:57:09] ok
[04:57:14] PROBLEM - Translation cache space on mw1102 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:15] PROBLEM - Translation cache space on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:15] RECOVERY - Translation cache space on mw1053 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:22] RECOVERY - Translation cache space on mw1029 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:22] RECOVERY - Translation cache space on mw1170 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:22] RECOVERY - Translation cache space on mw1209 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:22] PROBLEM - Translation cache space on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:22] RECOVERY - Translation cache space on mw1162 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:22] RECOVERY - Translation cache space on mw1054 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:23] RECOVERY - Translation cache space on mw1108 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:23] RECOVERY - Translation cache space on mw1227 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:24] RECOVERY - Translation cache space on mw1106 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:24] RECOVERY - Translation cache space on mw1039 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:25] RECOVERY - Translation cache space on mw1075 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:25] RECOVERY - Translation cache space on mw1085 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:36] RECOVERY - Translation cache space on mw1218 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:37] PROBLEM - Translation cache space on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:37] RECOVERY - Translation cache space on mw1128 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:38] RECOVERY - Translation cache space on mw1041 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:38] RECOVERY - Translation cache space on mw1023 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:39] RECOVERY - Translation cache space on mw1241 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:39] PROBLEM - Translation cache space on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:40] RECOVERY - Translation cache space on mw1119 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:40] RECOVERY - Translation cache space on mw1206 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:42] RECOVERY - Translation cache space on mw1103 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:42] RECOVERY - Translation cache space on mw1061 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:42] RECOVERY - Translation cache space on mw1077 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:42] RECOVERY - Translation cache space on mw1056 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:43] RECOVERY - Translation cache space on mw1032 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:43] RECOVERY - Translation cache space on mw1132 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:44] RECOVERY - Translation cache space on mw1040 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:44] RECOVERY - Translation cache space on mw1098 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:45] RECOVERY - Translation cache space on mw1052 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:56] RECOVERY - Translation cache space on mw1195 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:57] RECOVERY - Translation cache space on mw1071 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:57] RECOVERY - Translation cache space on mw1094 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:58] RECOVERY - Translation cache space on mw1129 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:58] RECOVERY - Translation cache space on mw1044 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:57:59] RECOVERY - Translation cache space on mw1076 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:02] RECOVERY - Translation cache space on mw1125 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:02] RECOVERY - Translation cache space on mw1096 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:02] RECOVERY - Translation cache space on mw1035 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:02] RECOVERY - Translation cache space on mw1068 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:02] RECOVERY - Translation cache space on mw1138 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:03] RECOVERY - Translation cache space on mw1083 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:03] RECOVERY - Translation cache space on mw1038 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:03] RECOVERY - Translation cache space on mw1142 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:04] RECOVERY - Translation cache space on mw1018 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:04] RECOVERY - Translation cache space on mw1092 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:05] RECOVERY - Translation cache space on mw1060 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:12] RECOVERY - Translation cache space on mw1121 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:12] RECOVERY - Translation cache space on mw1145 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:13] RECOVERY - Translation cache space on mw1149 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:13] RECOVERY - Translation cache space on mw1112 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:13] RECOVERY - Translation cache space on mw1105 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:13] RECOVERY - Translation cache space on mw1101 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:13] RECOVERY - Translation cache space on mw1095 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:22] RECOVERY - Translation cache space on mw1130 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:22] RECOVERY - Translation cache space on mw1110 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:23] RECOVERY - Translation cache space on mw1026 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:23] RECOVERY - Translation cache space on mw1021 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:23] RECOVERY - Translation cache space on mw1025 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:23] RECOVERY - Translation cache space on mw1086 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:23] RECOVERY - Translation cache space on mw1089 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:23] RECOVERY - Translation cache space on mw1144 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:23] RECOVERY - Translation cache space on mw1047 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:24] RECOVERY - Translation cache space on mw1069 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:32] RECOVERY - Translation cache space on mw1122 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:32] RECOVERY - Translation cache space on mw1109 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:33] RECOVERY - Translation cache space on mw1065 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:42] RECOVERY - Translation cache space on mw1117 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:42] RECOVERY - Translation cache space on mw1093 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:42] RECOVERY - Translation cache space on mw1072 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:42] RECOVERY - Translation cache space on mw1143 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:43] RECOVERY - Translation cache space on mw1078 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:43] RECOVERY - Translation cache space on mw1150 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:43] RECOVERY - Translation cache space on mw1120 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:43] RECOVERY - Translation cache space on mw1107 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:44] RECOVERY - Translation cache space on mw1070 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:44] RECOVERY - Translation cache space on mw1048 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:45] RECOVERY - Translation cache space on mw1116 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:45] RECOVERY - Translation cache space on mw1102 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:52] RECOVERY - Translation cache space on mw1049 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:52] RECOVERY - Translation cache space on mw1080 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:53] RECOVERY - Translation cache space on mw1104 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:53] RECOVERY - Translation cache space on mw1024 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:58:53] PROBLEM - Translation cache space on mw1007 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[04:58:53] RECOVERY - Translation cache space on mw1115 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:59:02] RECOVERY - Translation cache space on mw1151 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:59:03] RECOVERY - Translation cache space on mw1022 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:59:03] RECOVERY - Translation cache space on mw1146 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:59:04] RECOVERY - Translation cache space on mw1127 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:59:04] RECOVERY - Translation cache space on mw1137 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:59:14] RECOVERY - Translation cache space on mw1136 is OK: HHVM_TC_SPACE OK TC sizes are OK
[04:59:22] PROBLEM - Translation cache space on mw1010 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[04:59:22] PROBLEM - Translation cache space on mw1011 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[04:59:23] PROBLEM - Translation cache space on mw1004 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[04:59:33] PROBLEM - Translation cache space on mw1002 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[04:59:33] PROBLEM - Translation cache space on mw1001 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[04:59:42] PROBLEM - Translation cache space on mw1003 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 97%
[04:59:43] PROBLEM - Translation cache space on mw1008 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[04:59:52] PROBLEM - Translation cache space on mw1015 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[04:59:52] PROBLEM - Translation cache space on mw1006 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[04:59:55] that's the job runners
[04:59:55] so who gets to restart machines gracefully instead of letting them all crash at the same time? :)
[05:00:02] s/machines/hhvm/
[05:00:03] PROBLEM - Translation cache space on mw1009 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[05:00:12] PROBLEM - Translation cache space on mw1012 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[05:00:13] PROBLEM - Translation cache space on mw1005 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[05:00:22] PROBLEM - Translation cache space on mw1016 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[05:00:23] PROBLEM - Translation cache space on mw1014 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[05:00:23] PROBLEM - Translation cache space on mw1013 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[05:01:33] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0]
[05:01:53] RECOVERY - Translation cache space on mw1009 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:02:03] RECOVERY - Translation cache space on mw1012 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:02:03] RECOVERY - Translation cache space on mw1005 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:02:03] RECOVERY - Translation cache space on mw1016 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:02:04] RECOVERY - Translation cache space on mw1014 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:02:10] we didn't use to have alerts for this
[05:02:12] RECOVERY - Translation cache space on mw1013 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:02:19] so we'd just get the 5xx spike and shrug it off
[05:02:22] RECOVERY - Translation cache space on mw1007 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:02:43] RECOVERY - Translation cache space on mw1010 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:02:52] RECOVERY - Translation cache space on mw1011 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:02:52] RECOVERY - Translation cache space on mw1004 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:02:52] ori: they didn't used to all crash at the same moment, but they have starting in the last week
[05:03:02] RECOVERY - Translation cache space on mw1002 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:03:02] RECOVERY - Translation cache space on mw1001 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:03:12] RECOVERY - Translation cache space on mw1003 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:03:13] RECOVERY - Translation cache space on mw1008 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:03:13] RECOVERY - Translation cache space on mw1015 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:03:14] RECOVERY - Translation cache space on mw1006 is OK: HHVM_TC_SPACE OK TC sizes are OK
[05:03:34] ebernhardson|zzz: per https://phabricator.wikimedia.org/T99525 i suspect APC is somehow counting toward available space
[05:04:16] ls
[05:04:25] ...wrong window
[05:05:09] ori: i mean, i dunno what do to with it exactly. But i emailed ops list about 225 hhvm instances core dumping in a 2 minute span during a swat window, _joe_'s response was to set up those warnings
[05:05:40] i'm not sure the alerts help any
[05:06:10] we just have to figure out why the cold cache is growing quicker than before
[05:06:12] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (12921 90000s)
[05:06:28] the theory was to run a rolling restart, to avoid the 5xx spike. I would have assumed that mattered but maybe not :)
[05:07:19] depool/restart/repool
[05:07:53] the rolling restart doesn't answer the question, IMO. yes, the TC cache will fill up eventually, and yes, at that point HHVM has to be restarted or it will crash. but it used to happen so infrequently so as to not constitute a problem. something changed.
[05:08:05] 3.6
[05:08:17] we'll have to do rolling restarts eventually anyway if we want to do repo auth
[05:08:25] but that is blocked on pybal getting etcd integration
[05:09:44] ori: also, unsetting $wgDiff causes the Echo unit tests to fail. I'm a bit dubious of that patch
[05:10:15] [a0b7f7ea] [no req] MWException from line 196 of /vagrant/mediawiki/extensions/Echo/includes/DiffParser.php: Positional error: left
[05:10:59] * ebernhardson|zzz is filing a bug against echo for that now
[05:11:17] not happening in prod. buggy test that assumes too much?
[05:11:49] ori: this test runs a normal page edit
[05:13:12] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[05:16:44] trying to repro locally
[05:20:32] ori: hmm, i might just have something odd in my local config. I put a patch into gerrit to run the DiscussionParserTest (which uses real revision content that has had problems in prod before) and it seems ok
[05:20:42] err, to run it with $wgDiff = false
[05:20:59] yes, the tests pass for me
[05:25:23] PROBLEM - puppet last run on ms-be1017 is CRITICAL Puppet has 1 failures
[05:40:03] (03PS1) 10KartikMistry: CX: Fix language codes [puppet] - 10https://gerrit.wikimedia.org/r/212504
[05:42:04] RECOVERY - puppet last run on ms-be1017 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:18:35] 6operations, 6Labs, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1300776 (10yuvipanda)
[06:18:43] 6operations, 10Wikimedia-DNS: Consider DNSSec - https://phabricator.wikimedia.org/T26413#1300778 (10jeblad) Seems like there is no DNSSEC in place http://dnssec-debugger.verisignlabs.com/no.wikipedia.org Perhaps it is time to consider this?
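The depool/restart/repool sequence mentioned in the conversation above can be sketched as a small shell loop. This is only a hedged illustration of the idea, not the actual Wikimedia tooling: the `depool`, `restart_hhvm`, and `repool` helpers below are hypothetical placeholders for whatever load-balancer (pybal/conftool) and service-manager commands production really uses.

```shell
#!/bin/sh
# Rolling-restart sketch. The three helpers are placeholders (assumptions):
# in a real setup they would talk to the load balancer and, via ssh, to the
# host's service manager; here they only echo what they would do.
depool()       { echo "depool $1"; }
restart_hhvm() { echo "restart hhvm on $1"; }
repool()       { echo "repool $1"; }

rolling_restart() {
  for host in "$@"; do
    depool "$host"        # take the server out of rotation first
    restart_hhvm "$host"  # restarting HHVM resets the filled translation cache
    repool "$host"        # put it back before touching the next server
  done
}

rolling_restart mw1001 mw1002 mw1003
```

Restarting one host at a time keeps the rest of the pool serving traffic, which is the point made at 05:06:28: a rolling restart avoids the 5xx spike that results when every HHVM process hits its translation-cache limit and crashes at once.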
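The DNSSEC question raised in the T26413 comment above can be checked from a shell: a signed zone returns RRSIG records when queried with `dig +dnssec`, while an unsigned zone returns none. A minimal sketch follows; the `has_dnssec` helper is a name I made up for illustration, not an existing tool, and it only inspects text piped to it.

```shell
# Decide whether dig output on stdin shows DNSSEC signatures.
# A signed zone answers `dig +dnssec <name> A` with RRSIG records
# alongside the A records; an unsigned zone has no RRSIG lines.
has_dnssec() {
  grep -q 'RRSIG'
}

# Typical use (needs network access, so it is only shown as a comment):
#   dig +dnssec no.wikipedia.org A | has_dnssec && echo signed || echo unsigned
```

This is roughly what the dnssec-debugger service linked in the comment automates, with the addition of validating the signature chain up to the root.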
[06:23:10] 6operations, 6Labs, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1300793 (10yuvipanda) (labvirt1005 is empty because of T97521 - it hasn't been repooled yet)
[06:28:12] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu May 21 06:27:09 UTC 2015 (duration 27m 8s)
[06:28:19] Logged the message, Master
[06:30:12] PROBLEM - puppet last run on mw1039 is CRITICAL puppet fail
[06:30:45] (03PS2) 10KartikMistry: CX: Fix language codes [puppet] - 10https://gerrit.wikimedia.org/r/212504
[06:30:53] PROBLEM - puppet last run on labsdb1003 is CRITICAL Puppet has 1 failures
[06:31:43] PROBLEM - puppet last run on cp3037 is CRITICAL Puppet has 1 failures
[06:32:03] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 1 failures
[06:32:03] PROBLEM - puppet last run on cp3016 is CRITICAL Puppet has 1 failures
[06:32:13] PROBLEM - puppet last run on mw2015 is CRITICAL Puppet has 1 failures
[06:32:24] PROBLEM - puppet last run on cp3042 is CRITICAL Puppet has 1 failures
[06:34:07] (03PS2) 10KartikMistry: CX: Enable Content Translation for 20150521 planned wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212281 (https://phabricator.wikimedia.org/T98741)
[06:34:43] PROBLEM - puppet last run on mw1189 is CRITICAL Puppet has 1 failures
[06:34:43] PROBLEM - puppet last run on mw1003 is CRITICAL Puppet has 1 failures
[06:35:33] PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 1 failures
[06:35:33] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 1 failures
[06:36:02] PROBLEM - puppet last run on mw1088 is CRITICAL Puppet has 1 failures
[06:36:04] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 1 failures
[06:43:00] <_joe_> !log cleaning up the bytecode caches of a few appservers
[06:43:06] Logged the message, Master
[06:44:53] RECOVERY - puppet last run on mw1003 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[06:45:42] RECOVERY - puppet last run on cp3016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:45:43] RECOVERY - puppet last run on mw2015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:02] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:46:12] RECOVERY - puppet last run on labsdb1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:12] RECOVERY - puppet last run on mw1088 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:13] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:46:42] RECOVERY - puppet last run on mw1189 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:02] RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:47:23] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:47:32] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:47:32] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:48:43] RECOVERY - puppet last run on mw1039 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:09:22] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL host 198.35.26.192, interfaces up: 64, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/3: down - ! TiNet {#1065}
[07:16:03] RECOVERY - Router interfaces on cr1-ulsfo is OK host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0
[07:18:53] (03CR) 10Alexandros Kosiaris: [C: 032] CX: Fix language codes [puppet] - 10https://gerrit.wikimedia.org/r/212504 (owner: 10KartikMistry)
[07:21:50] <_joe_> !log cleaning the bytecode cache database everywhere
[07:21:59] Logged the message, Master
[07:36:54] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster::logstash. Fix out of module dependencies [puppet] - 10https://gerrit.wikimedia.org/r/209270 (owner: 10Alexandros Kosiaris)
[07:49:53] !log uploaded to apt.wikimedia.org trusty-wikimedia distribution jessie-wikimedia: php-luasandbox_2.0.9
[07:50:02] Logged the message, Master
[07:51:40] (03PS2) 10Alexandros Kosiaris: Remove the unneeded priorites in filenames [puppet] - 10https://gerrit.wikimedia.org/r/212305
[07:52:12] (03CR) 10Alexandros Kosiaris: [C: 032] Remove the unneeded priorites in filenames [puppet] - 10https://gerrit.wikimedia.org/r/212305 (owner: 10Alexandros Kosiaris)
[08:03:13] PROBLEM - Host analytics1036 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:17] (03PS1) 10KartikMistry: CX: Corrected language code based on wgLanguageCode setting [puppet] - 10https://gerrit.wikimedia.org/r/212514
[08:15:51] (03CR) 10Filippo Giunchedi: [C: 031] "minor nit but LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/212322 (owner: 10EBernhardson)
[08:20:50] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 6Scrum-of-Scrums, 5Patch-For-Review: Jenkins is using php-luasandbox 1.9-1 for zend unit tests; precise should be upgraded to 2.0-8 or equivalent - https://phabricator.wikimedia.org/T88798#1300839 (10akosiaris) 5Open>3Resolv...
[08:21:07] finally...
[08:23:49] <_joe_> \o/ [08:24:16] <_joe_> (also \o/ for dodging that grenade) [08:28:06] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1300851 (10fgiunchedi) re: what do to with errors I think we should: * be strict about what keys get created in graphite, to avoid an explosion of metrics * not lose information > If any of %m or %s loo... [08:42:56] (03PS2) 10KartikMistry: CX: Enable 'cxstats' campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211116 [08:44:59] (03CR) 10Santhosh: [C: 031] CX: Enable 'cxstats' campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211116 (owner: 10KartikMistry) [08:50:04] (03CR) 10Mobrovac: [C: 031] Beta: updated graphoid to the new api endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212480 (owner: 10Yurik) [08:57:29] 6operations, 6Labs, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1300898 (10Andrew) yep, we can do this on labvirt1005 at any time. We can cold-migrate some test instances there to test the suspend/resume issue. Well, ok, actually, let's make s... [08:58:23] 6operations, 6Labs, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1300899 (10Andrew) And, btw, there's no short-term plan to repool labvirt1005 -- we've always planned to keep a server empty as a backup, and since labvirt1005 was down during the m... [09:37:22] (03PS1) 10Giuseppe Lavagetto: puppet/self: try to fix kafkian puppet syntax [puppet] - 10https://gerrit.wikimedia.org/r/212523 [09:37:35] <_joe_> mobrovac: ^^ I'll try this on the beta master [09:37:54] k [09:38:43] <_joe_> mobrovac: try puppet now? [09:39:00] ... [09:39:48] _joe_: ok, this works [09:40:06] <_joe_> lol [09:40:15] <_joe_> we have another problem on the beta puppetmaster btw [09:40:21] ah? 
[09:40:31] <_joe_> yeah nothing that should bother you [09:41:26] <_joe_> lemme amend the commit message :) [09:50:53] (03PS2) 10Giuseppe Lavagetto: puppet/self: fix the kafkian puppet syntax [puppet] - 10https://gerrit.wikimedia.org/r/212523 [09:51:21] <_joe_> (I spent 3 minutes founding the problem and writing a fix, then I spent like 15 minutes apologizing for it) [09:51:48] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet/self: fix the kafkian puppet syntax [puppet] - 10https://gerrit.wikimedia.org/r/212523 (owner: 10Giuseppe Lavagetto) [10:01:20] (03PS1) 10Filippo Giunchedi: initial debian packaging [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/212528 (https://phabricator.wikimedia.org/T99771) [10:02:03] <_joe_> godog: you're my hero <3 [10:02:26] haha luckily https://wiki.debian.org/Python/LibraryStyleGuide is fairly comprehensive [10:03:59] <_joe_> godog: did you use 0.3.3 from pip? [10:05:16] good evening. It seems the static tree for 1.26wmf7 isn't set up, any www.mediawiki.org page has a missing "Powered by MediaWiki" icon at the bottom because //www.mediawiki.org/static/1.26wmf7/resources/assets/poweredby_mediawiki_88x31.png doesn't exist [10:06:14] tin:/srv/mediawiki/docroot/mediawiki/static has earlier trees like 1.26wmf6, but no 1.26wmf7 [10:07:53] _joe_: from github [10:09:40] and if I ask for ?debug=1, I get a ton of 404s for resources in /static/1.26wmf7 [10:09:41] <_joe_> godog: uhm, the copyright notice should've changed there [10:09:42] (03PS1) 10KartikMistry: CX: Add languages for deployment on 20150521 [puppet] - 10https://gerrit.wikimedia.org/r/212529 (https://phabricator.wikimedia.org/T98741) [10:11:08] (03CR) 10KartikMistry: [C: 04-1] "Only to merge after, https://gerrit.wikimedia.org/r/#/c/212281/ is done." 
[puppet] - 10https://gerrit.wikimedia.org/r/212529 (https://phabricator.wikimedia.org/T98741) (owner: 10KartikMistry) [10:12:39] (03CR) 10Giuseppe Lavagetto: "LGTM, we'll probably need to update the package a bit when 0.4 is released in a few days, but it's going to be minor tweaks anyway." [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/212528 (https://phabricator.wikimedia.org/T99771) (owner: 10Filippo Giunchedi) [10:23:49] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 6Scrum-of-Scrums, 5Patch-For-Review: Jenkins is using php-luasandbox 1.9-1 for zend unit tests; precise should be upgraded to 2.0-8 or equivalent - https://phabricator.wikimedia.org/T88798#1301041 (10Anomie) 5Resolved>3Open... [10:28:41] (03PS2) 10Filippo Giunchedi: es-tool: try harder to enable replication [puppet] - 10https://gerrit.wikimedia.org/r/211672 (https://phabricator.wikimedia.org/T99005) [10:28:52] robh: I filed https://phabricator.wikimedia.org/T99886 about the missing /static/1.26wmf7 assigned it twentyafterfour It seems high priority [10:31:48] (03CR) 10Muehlenhoff: "Looks good to me." [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/212528 (https://phabricator.wikimedia.org/T99771) (owner: 10Filippo Giunchedi) [10:50:16] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1301078 (10Aklapper) > not sure how much overlap there is with security I think I don't leak too much info by saying in public that currently the number of tasks associated with the... [10:50:23] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1301079 (10Aklapper) My vague guess of DBs and tables (might not be complete at all) is (in //db.table// format): * phabricator_maniphest.maniphest_transaction (complete log of any k... 
[10:50:48] (03PS1) 10Giuseppe Lavagetto: puppet/self: use the appropriate override [puppet] - 10https://gerrit.wikimedia.org/r/212532 [10:50:56] godog: if you're free, https://gerrit.wikimedia.org/r/#/c/212514/ :) [10:51:36] kart_: I'll take a look in ~15 [10:53:11] godog: thanks! [11:02:45] spagewmf: looking into the symlink issue [11:04:19] twentyafterfour: thanks [11:12:43] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [11:13:04] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 43.80 ms [11:22:33] RECOVERY - Host analytics1036 is UP: PING OK - Packet loss = 0%, RTA = 1.86 ms [11:27:44] PROBLEM - puppet last run on analytics1036 is CRITICAL puppet fail [11:30:43] PROBLEM - High load average on labstore1001 is CRITICAL 87.50% of data above the critical threshold [24.0] [11:31:10] !log troubleshooting analytics1036, includes reboots [11:31:16] Logged the message, Master [11:33:42] PROBLEM - Host analytics1036 is DOWN: PING CRITICAL - Packet loss = 100% [11:33:51] kart_: good to merge in production too? [11:36:36] godog: yes. need fix of correct language names.
Only merge, https://gerrit.wikimedia.org/r/#/c/212514/ :) [11:36:55] (03PS2) 10Filippo Giunchedi: CX: Corrected language code based on wgLanguageCode setting [puppet] - 10https://gerrit.wikimedia.org/r/212514 (owner: 10KartikMistry) [11:37:01] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] CX: Corrected language code based on wgLanguageCode setting [puppet] - 10https://gerrit.wikimedia.org/r/212514 (owner: 10KartikMistry) [11:37:20] kart_: ack, it is merged [11:37:28] (03PS1) 1020after4: 1.26wmf7 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212535 [11:38:38] (03PS2) 1020after4: 1.26wmf7 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212535 (https://phabricator.wikimedia.org/T99886) [11:39:32] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [11:40:53] (03PS4) 10Giuseppe Lavagetto: confd: create module [puppet] - 10https://gerrit.wikimedia.org/r/208399 (https://phabricator.wikimedia.org/T97974) [11:41:36] (03CR) 10jenkins-bot: [V: 04-1] confd: create module [puppet] - 10https://gerrit.wikimedia.org/r/208399 (https://phabricator.wikimedia.org/T97974) (owner: 10Giuseppe Lavagetto) [11:44:40] (03CR) 1020after4: [C: 032] 1.26wmf7 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212535 (https://phabricator.wikimedia.org/T99886) (owner: 1020after4) [11:44:46] (03Merged) 10jenkins-bot: 1.26wmf7 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212535 (https://phabricator.wikimedia.org/T99886) (owner: 1020after4) [11:46:27] !log twentyafterfour Started scap: 1.26wmf7 symlinks [11:46:32] Logged the message, Master [11:48:16] godog: thanks! 
[11:49:14] !log I'm investigating some inconsistencies in symlinks in /srv/mediawiki, ref https://phabricator.wikimedia.org/T99886 [11:49:21] Logged the message, Master [11:51:43] !log twentyafterfour Finished scap: 1.26wmf7 symlinks (duration: 05m 16s) [11:51:48] Logged the message, Master [12:02:22] RECOVERY - Host analytics1036 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [12:02:58] Who knows about svn-private? [12:06:15] PROBLEM - Host analytics1036 is DOWN: PING CRITICAL - Packet loss = 100% [12:06:45] Is that where wmf-config used to be? [12:06:58] (03CR) 10Manybubbles: [C: 031] es-tool: try harder to enable replication [puppet] - 10https://gerrit.wikimedia.org/r/211672 (https://phabricator.wikimedia.org/T99005) (owner: 10Filippo Giunchedi) [12:27:30] (03CR) 10Filippo Giunchedi: "yeah UNRELEASED is intended, will change it upon upload to internal apt repo" [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/212528 (https://phabricator.wikimedia.org/T99771) (owner: 10Filippo Giunchedi) [12:32:43] (03CR) 10Muehlenhoff: [C: 031] initial debian packaging [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/212528 (https://phabricator.wikimedia.org/T99771) (owner: 10Filippo Giunchedi) [13:18:27] 6operations, 10ops-eqiad: analytics1036 can't talk cross row? - https://phabricator.wikimedia.org/T99845#1301309 (10faidon) I spent quite some time trying to troubleshoot this. It looks like the server doesn't get responses to its ARP requests for the gateway (VRRP) IP. IPv6 works; pinging .2/.3 works; setting... [13:28:43] (03PS1) 10BBlack: Fix up varnish director-level retries [puppet] - 10https://gerrit.wikimedia.org/r/212543 (https://phabricator.wikimedia.org/T99839) [13:29:13] 6operations, 10ops-eqiad: analytics1036 can't talk cross row? - https://phabricator.wikimedia.org/T99845#1301317 (10Ottomata) I turned hyperthreading on, that's all. But, seeing as I was in bios, maybe I twiddled something accidentally that could have caused this. So weird!
[13:29:25] (03CR) 10jenkins-bot: [V: 04-1] Fix up varnish director-level retries [puppet] - 10https://gerrit.wikimedia.org/r/212543 (https://phabricator.wikimedia.org/T99839) (owner: 10BBlack) [13:31:54] (03PS2) 10BBlack: Fix up varnish director-level retries [puppet] - 10https://gerrit.wikimedia.org/r/212543 (https://phabricator.wikimedia.org/T99839) [13:37:03] PROBLEM - puppet last run on cp3039 is CRITICAL puppet fail [13:38:27] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1301320 (10Ottomata) > some of these we might need to aggregate while receiving the metrics to avoid fetching all metrics every time (e.g. to get total 5xx) Not sure what you mean by this, can you elabor... [13:52:51] 6operations, 10Traffic: Sanitize varnish director-level retries - https://phabricator.wikimedia.org/T99839#1301329 (10BBlack) @joe came up with some math so that we don't have to specify the whole size of the ring, but still get X% of odds of trying all backends' health before giving up. Patch defaulting all... [13:54:22] RECOVERY - puppet last run on cp3039 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [13:58:17] (03CR) 10Ottomata: [C: 031] statistics: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211351 (owner: 10Dzahn) [14:13:11] (03CR) 10BBlack: [C: 032] "Testing in prod with puppet disabled..." [puppet] - 10https://gerrit.wikimedia.org/r/212543 (https://phabricator.wikimedia.org/T99839) (owner: 10BBlack) [14:13:26] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1301375 (10fgiunchedi) >>! In T83580#1301320, @Ottomata wrote: >> some of these we might need to aggregate while receiving the metrics to avoid fetching all metrics every time (e.g. to get total 5xx) > N... 
[14:17:42] (03PS1) 10BBlack: trivial fixup for 5db2d5c [puppet] - 10https://gerrit.wikimedia.org/r/212544 [14:17:58] (03CR) 10BBlack: [C: 032 V: 032] trivial fixup for 5db2d5c [puppet] - 10https://gerrit.wikimedia.org/r/212544 (owner: 10BBlack) [14:31:50] (03PS1) 10BBlack: Revert director-level retries changes [puppet] - 10https://gerrit.wikimedia.org/r/212547 [14:32:20] (03CR) 10BBlack: [C: 04-1] "Staging this JIC, hopefully will abandon." [puppet] - 10https://gerrit.wikimedia.org/r/212547 (owner: 10BBlack) [14:32:53] RECOVERY - Host analytics1036 is UP: PING OK - Packet loss = 0%, RTA = 2.73 ms [14:35:43] (03PS5) 10Giuseppe Lavagetto: confd: create module [puppet] - 10https://gerrit.wikimedia.org/r/208399 (https://phabricator.wikimedia.org/T97974) [14:37:53] !log enabling puppet on caches for varnish retries changes... [14:37:58] Logged the message, Master [14:39:22] PROBLEM - puppet last run on cp3044 is CRITICAL Puppet has 2 failures [14:39:23] PROBLEM - puppet last run on cp1071 is CRITICAL Puppet has 1 failures [14:39:37] PROBLEM - puppet last run on cp1051 is CRITICAL Puppet has 1 failures [14:39:41] bah [14:39:45] re-disabling.... [14:40:33] PROBLEM - Host analytics1036 is DOWN: PING CRITICAL - Packet loss = 100% [14:41:11] that an1036 issue is wwweirdd [14:42:47] (03Abandoned) 10BBlack: Revert director-level retries changes [puppet] - 10https://gerrit.wikimedia.org/r/212547 (owner: 10BBlack) [14:43:03] (03PS1) 10BBlack: another trivial fixup for 5db2d5c [puppet] - 10https://gerrit.wikimedia.org/r/212549 [14:43:05] (03PS1) 10BBlack: Revert director-level retries changes... [puppet] - 10https://gerrit.wikimedia.org/r/212550 [14:43:25] (03CR) 10BBlack: [C: 032 V: 032] another trivial fixup for 5db2d5c [puppet] - 10https://gerrit.wikimedia.org/r/212549 (owner: 10BBlack) [14:43:54] (03CR) 10BBlack: [C: 04-1] "again, staging for reversion..."
[puppet] - 10https://gerrit.wikimedia.org/r/212550 (owner: 10BBlack) [14:44:13] RECOVERY - Host analytics1036 is UP: PING OK - Packet loss = 0%, RTA = 2.42 ms [14:45:33] PROBLEM - puppet last run on cp1058 is CRITICAL Puppet has 1 failures [14:45:53] PROBLEM - puppet last run on cp3017 is CRITICAL Puppet has 2 failures [14:49:02] RECOVERY - puppet last run on cp1058 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:50:32] PROBLEM - configured eth on analytics1036 is CRITICAL: eth3 reporting no carrier. [14:51:02] RECOVERY - puppet last run on cp3017 is OK Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:51:37] hello an1036! [14:51:44] paravoid: what's happenin? [14:51:59] I'm trying stuff... [14:52:53] !next [14:53:44] jouncebot, next [14:53:45] In 0 hour(s) and 6 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150521T1500) [14:53:53] RECOVERY - configured eth on analytics1036 is OK - interfaces up [14:54:48] not doing anything are you twentyafterfour? [14:54:54] (on tin) [14:55:53] (03PS4) 10Alex Monk: Enable a test of the VisualEditor A/B testing framework [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205778 (https://phabricator.wikimedia.org/T90666) (owner: 10Jforrester) [14:56:04] (03CR) 10Alex Monk: [C: 032] Enable a test of the VisualEditor A/B testing framework [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205778 (https://phabricator.wikimedia.org/T90666) (owner: 10Jforrester) [14:56:10] (03Merged) 10jenkins-bot: Enable a test of the VisualEditor A/B testing framework [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205778 (https://phabricator.wikimedia.org/T90666) (owner: 10Jforrester) [14:56:13] Whee.
[14:56:34] doesn't look like it [14:56:44] PROBLEM - Host analytics1036 is DOWN: PING CRITICAL - Packet loss = 100% [14:58:33] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/205778/ - VE A/B test on enwiki (duration: 00m 11s) [14:58:39] Logged the message, Master [14:58:43] PROBLEM - puppet last run on cp4006 is CRITICAL Puppet has 2 failures [14:59:52] apparently due to the infelicities of how we check puppet agent enable + last-run status, we'll see a few false puppetfail alerts on caches now [15:00:03] RECOVERY - puppet last run on cp3044 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:00:03] RECOVERY - puppet last run on cp1071 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, Krenair, James_F: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150521T1500). [15:00:04] !log krenair Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/205778/ - enable VE A/B test (duration: 00m 14s) [15:00:08] James_F, ^ please test [15:00:10] Logged the message, Master [15:00:33] RECOVERY - puppet last run on cp4006 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:00:33] PROBLEM - puppet last run on cp3046 is CRITICAL Puppet has 2 failures [15:00:33] PROBLEM - puppet last run on cp3038 is CRITICAL Puppet has 2 failures [15:00:34] PROBLEM - puppet last run on cp3015 is CRITICAL Puppet has 2 failures [15:00:34] PROBLEM - puppet last run on cp3018 is CRITICAL Puppet has 2 failures [15:00:47] ^ those are not real :P [15:01:22] Who's deploying? [15:01:28] Krenair: ? [15:01:32] me [15:01:39] Krenair: cool. [15:02:14] akosiaris, hi, any updates on the geo server? 
[15:02:23] RECOVERY - puppet last run on cp3038 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:02:23] RECOVERY - puppet last run on cp3018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:02:23] RECOVERY - puppet last run on cp3046 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:02:23] RECOVERY - puppet last run on cp3015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:03:41] James_F, I guess it's probably not trivial to test given the 50% chance of triggering it [15:04:05] Krenair: Yeah. [15:04:06] JohnFLewis, that error count link is .. interesting.. we should probably remove it [15:04:16] should be fine though [15:04:24] PROBLEM - Translation cache space on mw1063 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 97% [15:04:49] kart_, are these just trivial config changes or do we need changes to other infrastructure or schema changes or anything? [15:05:07] Krenair: normal config. no other changes. [15:05:22] PROBLEM - Translation cache space on mw1208 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 98% [15:05:25] (03CR) 10Alex Monk: [C: 032] CX: Enable 'cxstats' campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211116 (owner: 10KartikMistry) [15:05:32] (03Merged) 10jenkins-bot: CX: Enable 'cxstats' campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211116 (owner: 10KartikMistry) [15:06:14] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/211116/ (duration: 00m 16s) [15:06:14] kart_, please test [15:06:20] Logged the message, Master [15:06:26] Krenair: yeah... [15:07:36] Krenair: go ahead, I need to update code after that in next deployment. 
[15:08:48] (03CR) 10Alex Monk: [C: 032] CX: Enable Content Translation for 20150521 planned wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212281 (https://phabricator.wikimedia.org/T98741) (owner: 10KartikMistry) [15:08:54] PROBLEM - puppet last run on cp1051 is CRITICAL Puppet has 1 failures [15:08:55] (03Merged) 10jenkins-bot: CX: Enable Content Translation for 20150521 planned wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212281 (https://phabricator.wikimedia.org/T98741) (owner: 10KartikMistry) [15:09:30] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/212281/ (duration: 00m 10s) [15:09:31] kart_, ^ [15:09:36] Logged the message, Master [15:10:24] yep [15:10:42] RECOVERY - puppet last run on cp1051 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:11:15] Krenair: thanks! [15:11:21] (03CR) 10Filippo Giunchedi: "nice! any idea where I could test this without bringing up a 3 node ES cluster?" [puppet] - 10https://gerrit.wikimedia.org/r/211672 (https://phabricator.wikimedia.org/T99005) (owner: 10Filippo Giunchedi) [15:11:40] (03PS2) 10KartikMistry: CX: Add languages for deployment on 20150521 [puppet] - 10https://gerrit.wikimedia.org/r/212529 (https://phabricator.wikimedia.org/T98741) [15:12:22] RECOVERY - Translation cache space on mw1208 is OK: HHVM_TC_SPACE OK TC sizes are OK [15:12:26] (03CR) 10KartikMistry: [C: 031] CX: Add languages for deployment on 20150521 [puppet] - 10https://gerrit.wikimedia.org/r/212529 (https://phabricator.wikimedia.org/T98741) (owner: 10KartikMistry) [15:12:44] godog: akosiaris can you merge, https://gerrit.wikimedia.org/r/#/c/212529/ ? [15:14:02] Krenair: Well, I just created two accounts and one got VE and the other didn't, so… [15:14:09] James_F, sounds good! [15:14:21] I commented on the task, sent the email, etc. [15:14:35] Krenair: Thanks. 
:-) [15:16:33] RECOVERY - Translation cache space on mw1063 is OK: HHVM_TC_SPACE OK TC sizes are OK [15:17:06] Krenair: sorry for the belated response, no I'm not touching tin right now [15:17:11] ok. Who can merge 212529 :) [15:17:24] any Opsen around? [15:19:28] kart_: I'll merge it [15:19:33] 6operations, 10Traffic: Sanitize varnish director-level retries - https://phabricator.wikimedia.org/T99839#1301579 (10BBlack) 5Open>3Resolved ^ above change + a couple follow-on syntax fixups nits is deployed, doesn't seem to have broken anything, and hopefully improves some of the less-ideal situations fr... [15:19:54] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] CX: Add languages for deployment on 20150521 [puppet] - 10https://gerrit.wikimedia.org/r/212529 (https://phabricator.wikimedia.org/T98741) (owner: 10KartikMistry) [15:20:17] godog: thanks! [15:20:18] kart_: done [15:20:21] 6operations, 10Traffic: Reboot caches for kernel 3.19.6 globally - https://phabricator.wikimedia.org/T96854#1301586 (10BBlack) Fixed up director-level retries in T99839, tested another upload cache reboot without depool, still same spike behavior. So that wasn't it... [15:20:24] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [15:20:59] ^ 5xx alert there is from my mini-spike referenced in the wikibugs line just above it [15:22:50] (03PS4) 10Dzahn: statistics: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211351 [15:23:57] (03CR) 10Dzahn: [C: 032] statistics: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211351 (owner: 10Dzahn) [15:24:39] Krenair: let me know when SWAT is finished. [15:24:44] (03PS2) 10Dzahn: site.pp: small lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/212232 [15:24:45] it is finished [15:24:49] sorry I didn't make that clear [15:25:00] Krenair: cool. I can start cx deployment then. [15:25:10] ... didn't we just do that? 
[15:25:20] (03PS3) 10Dzahn: site.pp: small lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/212232 [15:25:30] oh, there's a separate thing? [15:25:49] Krenair: yes. [15:25:58] Krenair: cxserver and ContentTranslation [15:26:21] (03CR) 10Dzahn: [C: 032] site.pp: small lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/212232 (owner: 10Dzahn) [15:26:30] We made it little complex ;) (mw-config, puppet, cxserver, ContentTranslation ext) [15:27:25] fun [15:28:27] (03PS6) 10Giuseppe Lavagetto: confd: create module [puppet] - 10https://gerrit.wikimedia.org/r/208399 (https://phabricator.wikimedia.org/T97974) [15:29:05] ottomata: I'm giving up on an1036 for now, let's check it out again when cmjohnson1 is at the DC [15:30:25] ok, wow, thanks paravoid [15:30:30] that crazy huh? [15:30:50] yeah [15:30:53] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:32:26] !log removed max-registration properties from 2015 board elections on metawiki and votewiki per my comment on T97924 [15:32:32] Logged the message, Master [15:34:50] (03PS5) 10Dzahn: sshd: don't use NIST key exchange protocols [puppet] - 10https://gerrit.wikimedia.org/r/185321 [15:35:07] (03PS6) 10Dzahn: sshd: set Message Authentication Code ciphers [puppet] - 10https://gerrit.wikimedia.org/r/185329 [15:35:09] (03CR) 10Muehlenhoff: [C: 032] sshd: set Message Authentication Code ciphers [puppet] - 10https://gerrit.wikimedia.org/r/185329 (owner: 10Dzahn) [15:37:32] (03CR) 10Dzahn: [C: 032] snapshot: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211348 (owner: 10Dzahn) [15:37:45] (03Abandoned) 10BBlack: refactor varnish::logging for icinga check [puppet] - 10https://gerrit.wikimedia.org/r/208659 (owner: 10BBlack) [15:37:56] (03PS4) 10Dzahn: snapshot: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211348 [15:39:02] (03CR) 10Dzahn: [C: 032] sshd: set Message Authentication Code ciphers [puppet] - 10https://gerrit.wikimedia.org/r/185329 (owner: 
10Dzahn) [15:43:32] (03CR) 10Muehlenhoff: [C: 032] sshd: don't use NIST key exchange protocols [puppet] - 10https://gerrit.wikimedia.org/r/185321 (owner: 10Dzahn) [15:43:51] (03PS6) 10Dzahn: sshd: don't use NIST key exchange protocols [puppet] - 10https://gerrit.wikimedia.org/r/185321 [15:44:48] (03CR) 10Dzahn: [C: 032] sshd: don't use NIST key exchange protocols [puppet] - 10https://gerrit.wikimedia.org/r/185321 (owner: 10Dzahn) [15:45:10] moritzm: thank you for those reviews [15:48:10] andrewbogott: you once said on the patch to disable agent forwarding that you would like to keep it for the labs migration. How about noawadays? [15:50:11] (03PS7) 10Giuseppe Lavagetto: confd: create module [puppet] - 10https://gerrit.wikimedia.org/r/208399 (https://phabricator.wikimedia.org/T97974) [15:53:45] (03CR) 10Ori.livneh: confd: create module (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/208399 (https://phabricator.wikimedia.org/T97974) (owner: 10Giuseppe Lavagetto) [15:55:15] strange... [15:55:21] I loaded https://www.mediawiki.org/static/1.26wmf7/resources/assets/poweredby_mediawiki_88x31.png and got 404 a couple of times [15:55:24] <_joe_> ori: thanks for taking the time :) [15:55:24] then it just worked [15:55:44] <_joe_> ori: tbh, it's not even working now anyways :P [15:56:51] !log Updated cxserver [15:56:56] Logged the message, Master [16:00:04] kart_: Dear anthropoid, the time has come. Please deploy Content Translation Deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150521T1600). 
[16:02:11] (03PS8) 10Giuseppe Lavagetto: confd: create module [puppet] - 10https://gerrit.wikimedia.org/r/208399 (https://phabricator.wikimedia.org/T97974) [16:02:50] (03CR) 10jenkins-bot: [V: 04-1] confd: create module [puppet] - 10https://gerrit.wikimedia.org/r/208399 (https://phabricator.wikimedia.org/T97974) (owner: 10Giuseppe Lavagetto) [16:02:53] (03CR) 10Giuseppe Lavagetto: confd: create module (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/208399 (https://phabricator.wikimedia.org/T97974) (owner: 10Giuseppe Lavagetto) [16:04:03] (03PS9) 10Giuseppe Lavagetto: confd: create module [puppet] - 10https://gerrit.wikimedia.org/r/208399 (https://phabricator.wikimedia.org/T97974) [16:04:43] RECOVERY - Keyholder SSH agent on mira is OK Keyholder is armed with all configured keys. [16:04:57] !log armed keyholder on mira [16:05:03] Logged the message, Master [16:06:31] jouncebot: yes sir. [16:10:39] !log kartik Started scap: Update ContentTranslation [16:10:45] Logged the message, Master [16:15:46] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1301712 (10csteipp) >>! In T99295#1301078, @Aklapper wrote: >> not sure how much overlap there is with security > > I think I don't leak too much info by saying in public that curre... 
[16:16:02] Krenair, I am going to revert https://wikitech.wikimedia.org/w/index.php?title=Add_a_wiki&diff=160100&oldid=158643 now that i know what should be done and put a proper explanation [16:16:22] jynus, thanks [16:16:49] it should be done for all wikis, as there are things opt-in and some opt-out [16:17:56] 6operations, 10Deployment-Systems: need package 'trebuchet-trigger' for trusty - https://phabricator.wikimedia.org/T99919#1301717 (10Dzahn) 3NEW [16:19:22] 6operations, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1301727 (10Dzahn) [16:19:25] 6operations, 10Deployment-Systems: need package 'trebuchet-trigger' for trusty - https://phabricator.wikimedia.org/T99919#1301728 (10Dzahn) [16:20:51] 6operations, 10Deployment-Systems: need package 'trebuchet-trigger' for trusty - https://phabricator.wikimedia.org/T99919#1301738 (10cscott) Oh, hey, I was just talking about trebuchet. There were some quoting fixes I contributed upstream, which I don't think we've deployed yet: https://github.com/trebuchet-d... [16:24:43] 6operations, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1301755 (10Dzahn) meanwhile mira has: - lots of packages pulled in via the deployment server role which is now applied - fonts, ocaml, ..tex, libxml etc etc ... - translation cache space:... [16:26:22] if everything works as expected, there should not be access issues by default, but it is more painful to do some tasks afterwards (involves long-running-queries) [16:28:11] 6operations, 10Wikimedia-DNS, 5Patch-For-Review: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1301763 (10RobH) a:3Yana Since this is actually awaiting @Yana to tell us about content, I'm going to assign it to her (rather than leave it up for grabs.) Yana: Once you guys ha... 
[16:28:20] 6operations, 10Wikimedia-DNS: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1301772 (10RobH) [16:34:04] !log kartik Finished scap: Update ContentTranslation (duration: 23m 25s) [16:34:10] Logged the message, Master [16:34:23] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [16:35:20] 6operations, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1301792 (10Dzahn) ``` Error: salt-call deploy.deployment_server_init returned 1 instead of one of [0] Error: /Stage[main]/Deployment::Deployment_server/Exec[eventual_consistency_deployment_... [16:38:42] PROBLEM - High load average on labstore1001 is CRITICAL 75.00% of data above the critical threshold [24.0] [16:42:22] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [16:43:38] (03PS1) 10Dzahn: scap: ensure /home/l10nupdate/.ssh exists [puppet] - 10https://gerrit.wikimedia.org/r/212569 (https://phabricator.wikimedia.org/T95436) [16:44:40] (03PS2) 10Dzahn: scap: ensure /home/l10nupdate/.ssh exists [puppet] - 10https://gerrit.wikimedia.org/r/212569 (https://phabricator.wikimedia.org/T95436) [16:46:52] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [16:47:04] (03PS3) 10Dzahn: scap: ensure /home/l10nupdate/.ssh exists [puppet] - 10https://gerrit.wikimedia.org/r/212569 (https://phabricator.wikimedia.org/T95436) [16:49:51] (03PS4) 10Dzahn: scap: ensure /home/l10nupdate/.ssh exists [puppet] - 10https://gerrit.wikimedia.org/r/212569 (https://phabricator.wikimedia.org/T95436) [16:50:36] (03PS5) 10Dzahn: scap: ensure /home/l10nupdate/.ssh exists [puppet] - 10https://gerrit.wikimedia.org/r/212569 (https://phabricator.wikimedia.org/T95436) [16:51:21] (03CR) 10Dzahn: [C: 032] scap: ensure /home/l10nupdate/.ssh exists [puppet] - 10https://gerrit.wikimedia.org/r/212569 
(https://phabricator.wikimedia.org/T95436) (owner: 10Dzahn) [16:57:18] !log dist-upgrade on mw1123 [16:57:24] Logged the message, Master [16:58:52] 6operations, 10Wikimedia-DNS: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1301824 (10Yana) Thanks @RobH! Will do. [17:01:43] !log mw1123: apt-get autoclean, rebooting for kernel upgrade [17:01:50] Logged the message, Master [17:04:52] PROBLEM - Host mw1123 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:36] icinga-wm: you just lag [17:05:52] RECOVERY - Host mw1123 is UP: PING OK - Packet loss = 0%, RTA = 1.84 ms [17:06:23] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.236 second response time [17:06:43] RECOVERY - Translation cache space on mw1123 is OK: HHVM_TC_SPACE OK TC sizes are OK [17:06:49] 6operations, 7HHVM: mw1123 has defunct unkillable hhvm process - https://phabricator.wikimedia.org/T99594#1301832 (10Dzahn) ran apt-get dist-upgrade which installed 3.13.0-53, ran apt-get autoclean rebooted 10:07 < icinga-wm> RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 4... [17:10:43] 6operations, 7HHVM: mw1123 has defunct unkillable hhvm process - https://phabricator.wikimedia.org/T99594#1301836 (10Dzahn) 5Open>3Resolved a:3Dzahn from /var/log/apt/history.log: linux-image-generic:amd64 (3.13.0.24.28, 3.13.0.53.60) Linux mw1123 3.13.0-53-generic [17:13:14] @seen yuvipanda under_the_brooklyn_bridge [17:13:14] mutante: I have never seen yuvipanda under_the_brooklyn_bridge [17:13:34] hehe [17:14:31] @seen a_diamond_in_the_flesh [17:14:31] ori: I have never seen a_diamond_in_the_flesh [17:16:16] Lorde? i had to look it up [17:17:19] Deployment_server/Exec[eventual_consistency_deployment_server_init]/returns: [WARNING ] Although 'dmidecode' was found in path, the current user cannot execute it. Grains output might not be accurate. 
[17:18:12] Exec[eventual_consistency_deployment_server_init]/returns: [ERROR ] An un-handled exception was caught by salt's global exception handler: [17:18:16] hrmmm [17:19:06] OSError: [Errno 2] No such file or directory [17:20:55] 6operations, 10Deployment-Systems: errors reported by "eventual_consistency_deployment_server_init" on new deploy server - https://phabricator.wikimedia.org/T99928#1301923 (10Dzahn) 3NEW [17:22:08] 6operations, 10Deployment-Systems: errors reported by "eventual_consistency_deployment_server_init" on new deploy server - https://phabricator.wikimedia.org/T99928#1301931 (10Dzahn) [17:22:10] 6operations, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1301930 (10Dzahn) [17:26:23] 6operations, 10ops-eqiad: ssh connection to some management servers fails, a hard reset may be needed - https://phabricator.wikimedia.org/T99805#1301947 (10Dzahn) seems like this is needed: "press and hold the System Identification Button for 15 seconds to reset the iDRAC " "It does a soft reboot of the iDRAC... 
[17:30:02] PROBLEM - High load average on labstore1001 is CRITICAL 100.00% of data above the critical threshold [24.0] [17:32:43] (03CR) 10Dzahn: [C: 032] quarry: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211354 (owner: 10Dzahn) [17:38:07] (03PS13) 10Ottomata: Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [17:38:43] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [17:38:45] (03CR) 10jenkins-bot: [V: 04-1] Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [17:40:56] (03PS14) 10Ottomata: Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [17:41:36] (03CR) 10jenkins-bot: [V: 04-1] Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [17:42:03] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1302044 (10Ottomata) Comments addressed in latest patchset: https://gerrit.wikimedia.org/r/#/c/212041/ I added `purge` as a valid method. Does varnish have any other special methods we should count?... [17:45:21] (03PS15) 10Ottomata: Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [17:46:00] (03CR) 10jenkins-bot: [V: 04-1] Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [17:58:59] (03CR) 10Manybubbles: "Beta? Its a four node cluster at deployment-elastic0[4567].eqiad.wmflabs. 
It won't hit the timeout though because there are so many fewer " [puppet] - 10https://gerrit.wikimedia.org/r/211672 (https://phabricator.wikimedia.org/T99005) (owner: 10Filippo Giunchedi) [18:07:03] PROBLEM - High load average on labstore1001 is CRITICAL 66.67% of data above the critical threshold [24.0] [18:10:33] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [18:35:13] PROBLEM - High load average on labstore1001 is CRITICAL 100.00% of data above the critical threshold [24.0] [19:01:43] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [19:05:12] PROBLEM - puppet last run on es2003 is CRITICAL Puppet has 1 failures [19:08:13] PROBLEM - puppet last run on db2034 is CRITICAL Puppet has 1 failures [19:10:52] PROBLEM - DPKG on gallium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:11:33] PROBLEM - puppet last run on multatuli is CRITICAL puppet fail [19:12:23] PROBLEM - DPKG on lanthanum is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:14:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 13.33% of data above the critical threshold [500.0] [19:15:42] what's with dpkg here, someone doing a salt update? ^ [19:15:43] RECOVERY - DPKG on lanthanum is OK: All packages OK [19:16:15] i just looked at lanthanum. not sure what was wrong but it looked like alex fixed it [19:16:46] apt-get upgrade said 1 not fully installed or removed, but when i tried to find it with dpkg -l nothing looked odd [19:17:46] it's normal that we get some DPKG+puppet reports like that when someone does manual apt-get install/upgrade stuff for applying e.g. 
sec fixes, and background puppet runs on the host conflict [19:18:02] was just curious who/what it was if it was about to spread to tons more hosts :) [19:21:04] "2015-05-21 19:03:11 upgrade fuse" [19:21:09] bblack: I'm installing the fuse security fixes [19:21:13] RECOVERY - DPKG on gallium is OK: All packages OK [19:21:17] ah ok [19:21:32] oh yeah duh i could have checked dpkg.log [19:22:09] although I'm not sure what could've caused this? the icinga check running while an update is in action? [19:22:22] there were no installation errors on those two hosts [19:22:29] yeah, a race [19:22:34] RECOVERY - puppet last run on es2003 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [19:23:53] RECOVERY - puppet last run on db2034 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [19:24:13] PROBLEM - High load average on labstore1001 is CRITICAL 87.50% of data above the critical threshold [24.0] [19:25:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [19:27:43] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [19:28:31] 10Ops-Access-Requests, 6operations: Create a Phabricator project for 'Partnerships' - https://phabricator.wikimedia.org/T99945#1302380 (10SVentura) 3NEW [19:28:53] RECOVERY - puppet last run on multatuli is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:34:32] PROBLEM - High load average on labstore1001 is CRITICAL 75.00% of data above the critical threshold [24.0] [19:42:24] 6operations, 10ops-eqiad: analytics1028, Replace system board, raid card - Disks OK - https://phabricator.wikimedia.org/T99947#1302419 (10Cmjohnson) 3NEW a:3Cmjohnson [19:45:03] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [19:50:33] 6operations, 10ops-eqiad: analytics1028, Replace system board, raid card - Disks OK - https://phabricator.wikimedia.org/T99947#1302438 
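[Editor's note] The DPKG-alert race just discussed (a manual apt-get run colliding with a background puppet run or the icinga dpkg check reading package state mid-upgrade) can be avoided by serializing both sides on one advisory lock. A minimal sketch, assuming a shared lock path chosen for illustration (not the actual WMF setup):

```shell
#!/bin/sh
# Sketch: take an advisory lock around package work so any checker or
# automation run that takes the same lock cannot interleave with it.
# The lock path is an illustrative assumption.
LOCK=/tmp/demo-upgrade.lock

# In place of "apt-get upgrade", hold the lock around a placeholder command:
flock "$LOCK" sh -c 'echo "lock held: safe to upgrade"'
```

In practice apt/dpkg already serialize among themselves via /var/lib/dpkg/lock; the race here was the monitoring check observing half-applied state, so the check itself would also need to honor the same lock for this to help.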
(10Cmjohnson) Requested new parts Congratulations: Work Order SR911482109 was successfully submitted. [19:51:39] (03CR) 10Merlijn van Deen: [C: 04-1] "This completely breaks the workflow for Windows users. ProxyCommand is basically impossible with PuTTY, and agent forwarding or HBA is the" [puppet] - 10https://gerrit.wikimedia.org/r/199936 (owner: 10Chad) [19:53:19] <_joe_> valhallasw: seriously? admin a prod cluster from windows? *uhm* [19:53:42] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [19:53:52] <_joe_> I rephrase: keep any private key for something valuable on a windows machine? :) [19:53:53] _joe_: unless I'm mistaken, those puppet manifests are also used by labs [19:54:04] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1146174 (10RobH) The order for these has been placed, and is tracked via https://rt.wikimedia.org/Ticket/Display.html?id=9337 I'll resolve this once the order has arrived and/or the new... [19:54:06] <_joe_> valhallasw: yeah, they are [19:54:36] <_joe_> valhallasw: so you say toolabs users need to ssh with agent forwarding? [19:54:59] _joe_: no, that's not what I say. I say they need to be able to ssh within labs, somehow. [19:55:11] HBA would be another solution to the problem [19:55:14] <_joe_> ok fair enough [19:55:17] within tool labs* [19:55:23] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [19:55:30] HBA has the advantage it also works over mosh [19:55:32] <_joe_> yeah I wasn't thinking of "labs user from windows" case [19:55:32] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1302451 (10RobH) The SSDs have an ETA of 2015-05-27. The servers have an ETA of 2015-06-17. [19:55:54] _joe_: also, I'm not sure why you're so afraid of Windows. 
[19:56:38] <_joe_> valhallasw: it's a prime platform for malware, and it's obscure enough to its users to make it particularly hard to notice you've been owned [19:57:13] _joe_: a rooted linux desktop would also not be noticed by normal users [19:57:14] <_joe_> valhallasw: I don't assume my same level of opsec is required from a labs user, though. [19:59:07] <_joe_> valhallasw: it's easier to, and you know it, I hope. That applies to os x as well, the area of the system which operates at ring0 is incredibly smaller than in windows. Also, statistically, windows ownage is much more common. Please note I'm not dealing with the "NSA wants your keys" scenario [20:00:04] statistically, 83.4% of stats are made up [20:00:25] _joe_: from experience, I can tell you it's not. I could spot it from a ls binary that acted weird, but if I had not used the console, I would have no clue. [20:00:25] <_joe_> valhallasw: but again, to log into a labs VM from windows seems a perfectly legit use case [20:00:26] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 6Scrum-of-Scrums, 5Patch-For-Review: Jenkins is using php-luasandbox 1.9-1 for zend unit tests; precise should be upgraded to 2.0-8 or equivalent - https://phabricator.wikimedia.org/T88798#1302458 (10akosiaris) 5Open>3Resolv... [20:00:33] <_joe_> so I agree with your comment [20:00:57] But if the WMF has a policy to only allow prod access from linux/os x machines, fine with me. [20:01:13] <_joe_> valhallasw: nope, it's just my 2 cents [20:02:45] <_joe_> valhallasw: my point was I wasn't willing to compromise on security for toollabs because someone wanted to use windows. [20:03:12] <_joe_> but labs *users*, that's a different use case and I think your comment is important [20:03:31] 'security for prod', I assume? but yes, that's fair. [20:03:34] https://monkeyswithbuttons.wordpress.com/2010/10/01/ssh-proxycommand-and-putty/ [20:03:39] just curious [20:03:43] doesn't this work ? 
[20:03:44] _joe_: the commit message says "Per sshd_config(5) this can't stop a malicious shell user" so how much of a security impact is it? [20:03:53] * akosiaris no windows user [20:04:09] akosiaris: yes, it does, but you basically need to repeat most of that for each host you're connecting to [20:04:10] <_joe_> legoktm: agent forwarding? [20:04:24] _joe_: yeah. [20:04:28] akosiaris: which, in the case of tool labs, can be one of the 30-or-so exec hosts [20:04:29] disabling it [20:04:54] akosiaris: basically, you can't do the Host: *.eqiad trick one would use in .ssh/config [20:05:12] valhallasw: ah, so it doesn't completely break the workflow, but rather makes it very very inconvenient [20:05:12] akosiaris: but as noted, the preferred solution would be to enable HBA and disable agent forwarding [20:05:33] akosiaris: right, let me clarify that [20:05:49] valhallasw: thanks! [20:06:28] <_joe_> legoktm: agent forwarding means having your auth agent forwarding requests around the cluster, and it can theoretically be tampered with [20:06:35] legoktm: it's about a user not being able to hijack someone else's ssh-agent and gaining access where he shouldn't have [20:06:49] <_joe_> it's also possible to hijack your key for other local users [20:06:53] <_joe_> that ^^ [20:06:57] not key! agent [20:07:06] <_joe_> yeah sorry [20:07:21] <_joe_> the key is obviously on your computer [20:07:56] ok, so if a malicious user bypassed the no agent forwarding part, they'd just be opening themselves up to getting pwnd? [20:08:18] legoktm: to help you understand. A root user right now on a bastion host, can use anyone's agent to connect to any machine that "anyone" can connect to [20:08:45] and it has nothing to do with "root" but permission on a /tmp/agent-random string directory [20:09:25] so, it's about legitimate users not opening themselves to attack [20:09:54] if the user that bypasses the no-agent-forwarding part is malicious.. well too bad for him... 
and even worse for us because we got a malicious user on our hosts [20:10:28] ok, that makes sense. thanks for explaining :) [20:11:31] (03CR) 10Merlijn van Deen: "Let me clarify what I mean with 'completely breaks the workflow': on tool labs, it's quite common to do the following:" [puppet] - 10https://gerrit.wikimedia.org/r/199936 (owner: 10Chad) [20:35:29] _joe_, on a side note, is there a way to make ssh-agent prompt every time it's requested to sign something? [20:36:11] <_joe_> I'm not sure I understand your question, but I suppose the answer is "no" as far as I know [20:40:23] _joe_: basically, keepass asks me 'A client has requested to use SSH key X, do you want to allow this?' before actually doing the signing, and refuses to sign if I answer 'no'. Agent hijacking would be pretty obvious in that case (although still not entirely preventable, so not forwarding is still safer); [20:41:26] <_joe_> valhallasw: I don't think the standard unix ssh-agent allows that [20:41:49] <_joe_> maybe if you integrate it with some desktop thing... I dunno. [20:42:53] yea, I'm not sure how it would work in a pure terminal setting [20:42:55] i use keepass on linux but i dont have that feature i think [20:43:01] it's keepassx from Debian [20:43:17] mutante: it's a keepass plugin called 'keeagent' [20:43:24] ah! [20:43:28] <_joe_> mutante: that would only work if you use something like gnome-keyring or something [20:43:39] mutante: but I'm not sure if the linux version supports it (I'm on windows) [20:44:23] probably not then [20:45:10] http://lechnology.com/software/keeagent/ [20:45:39] ssh-agent (KeeAgent only works in Client mode on Linux/Mac.) [20:46:44] https://github.com/dlech/KeeAgent/tree/master/debian [20:47:01] I guess that means it just calls ssh-agent to register/de-register the key when you unlock/lock the database? [20:47:59] "Client: KeeAgent will act as an SSH agent client. For example, you can use this mode to load keys stored in KeePass into Pageant. 
If there is not SSH agent running, you will get an error when you try to load keys." [21:01:39] (03PS1) 10Dereckson: Namespace configuration for office. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212685 (https://phabricator.wikimedia.org/T99860) [21:05:04] (03PS1) 10Dzahn: bump version to 0.5.6-1), build for trusty [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/212707 (https://phabricator.wikimedia.org/T99919) [21:06:56] (03PS2) 10Dzahn: bump version to 0.5.6-1, build for trusty [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/212707 (https://phabricator.wikimedia.org/T99919) [21:07:17] (03CR) 10Dzahn: [C: 04-2] bump version to 0.5.6-1, build for trusty [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/212707 (https://phabricator.wikimedia.org/T99919) (owner: 10Dzahn) [21:29:23] PROBLEM - puppet last run on db1067 is CRITICAL puppet fail [21:35:44] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 5 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#1302630 (10BBlack) Just to touch base on this issue, in the [[ http://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-... [21:46:45] RECOVERY - puppet last run on db1067 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [21:58:58] 6operations: Meta task for various security updates - https://phabricator.wikimedia.org/T96545#1302727 (10MoritzMuehlenhoff) 947 hosts (precise and trusty) have been updated for the local root privilege escalation in fuse (CVE-2015-3202). 
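[Editor's note] Circling back to the ssh thread above: the per-host ProxyCommand pain valhallasw describes, and the `Host *.eqiad` wildcard trick that PuTTY cannot replicate, can be sketched as an OpenSSH client config. Hostnames below are illustrative assumptions, not the actual WMF/labs bastion names:

```
# ~/.ssh/config sketch: one wildcard stanza instead of repeating a
# ProxyCommand for each of the ~30 exec hosts (hostnames hypothetical).
Host *.eqiad.wmflabs
    ProxyCommand ssh -W %h:%p bastion.example.wmflabs.org
    # Keep agent forwarding off: -W tunnels the TCP connection through
    # the bastion without ever exposing the agent socket there.
    ForwardAgent no
```

On valhallasw's later question about a confirming agent: plain OpenSSH does support per-use confirmation via `ssh-add -c`, which makes the agent prompt (through an `SSH_ASKPASS` helper) before each signature, though that requires a graphical or otherwise configured askpass program.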
[22:07:05] (03PS3) 10Dzahn: bump version to 0.5.6-1, build for trusty [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/212707 (https://phabricator.wikimedia.org/T99919) [22:20:27] 6operations, 10Deployment-Systems, 5Patch-For-Review: need package 'trebuchet-trigger' for trusty - https://phabricator.wikimedia.org/T99919#1302773 (10Dzahn) I have @carbon:~# ls /home/dzahn/trebuchet* /home/dzahn/trebuchet-trigger_0.5.6-1_all.deb /home/dzahn/trebuchet-trigger_0.5.6-1_i386.build /home... [22:21:05] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1302774 (10BBlack) We have some hints now that we may get to move on an ECDSA solution sometime in June. More details later on, but just noting for planning/estimation. [22:21:40] (03Abandoned) 10Tim Landscheidt: Labs: Include public IPs in ferm's $INTERNAL [puppet] - 10https://gerrit.wikimedia.org/r/210853 (https://phabricator.wikimedia.org/T96924) (owner: 10Tim Landscheidt) [22:23:17] 6operations, 10Deployment-Systems, 5Patch-For-Review: need package 'trebuchet-trigger' for trusty - https://phabricator.wikimedia.org/T99919#1302787 (10Dzahn) >>! In T99919#1301738, @cscott wrote: > So it would be helpful to me to know the mechanism of getting latest trebuchet packaged and deployed on our in... [22:27:17] (03PS1) 10Dereckson: Configure import sources for hif.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212721 (https://phabricator.wikimedia.org/T99826) [22:46:12] (03PS1) 10Ori.livneh: Fix travis and coverall configuration [debs/pybal] - 10https://gerrit.wikimedia.org/r/212722 [22:46:35] (03CR) 10Ori.livneh: [C: 032] "Cosmetic changes, tested with my local fork." 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/212722 (owner: 10Ori.livneh) [22:46:52] (03CR) 10jenkins-bot: [V: 04-1] Fix travis and coverall configuration [debs/pybal] - 10https://gerrit.wikimedia.org/r/212722 (owner: 10Ori.livneh) [22:48:48] (03PS2) 10Ori.livneh: Fix travis and coverall configuration [debs/pybal] - 10https://gerrit.wikimedia.org/r/212722 [22:49:28] (03CR) 10Ori.livneh: [C: 032] Fix travis and coverall configuration [debs/pybal] - 10https://gerrit.wikimedia.org/r/212722 (owner: 10Ori.livneh) [22:49:43] (03CR) 10jenkins-bot: [V: 04-1] Fix travis and coverall configuration [debs/pybal] - 10https://gerrit.wikimedia.org/r/212722 (owner: 10Ori.livneh) [22:50:19] (03PS1) 10Dereckson: Enable NewUserMessage on hif.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212723 (https://phabricator.wikimedia.org/T99824) [22:52:20] (03PS3) 10Ori.livneh: Fix travis and coverall configuration [debs/pybal] - 10https://gerrit.wikimedia.org/r/212722 [22:52:38] (03CR) 10Ori.livneh: [C: 032] Fix travis and coverall configuration [debs/pybal] - 10https://gerrit.wikimedia.org/r/212722 (owner: 10Ori.livneh) [22:52:50] (03PS1) 10Dereckson: Enable NewUserMessage on sa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212724 (https://phabricator.wikimedia.org/T99879) [22:52:56] (03Merged) 10jenkins-bot: Fix travis and coverall configuration [debs/pybal] - 10https://gerrit.wikimedia.org/r/212722 (owner: 10Ori.livneh) [23:00:04] RoanKattouw, ^d, rmoen, Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150521T2300). Please do the needful. [23:00:06] i. [23:00:19] Hi. 
[23:06:55] !log ori Synchronized php-1.26wmf7/includes: da79b19b88: Defer some updates in doEditUpdates() (duration: 00m 16s) [23:07:05] Logged the message, Master [23:08:18] !log ori Synchronized php-1.26wmf6/includes: 7238213e6d: Defer some updates in doEditUpdates() (duration: 00m 16s) [23:08:23] Logged the message, Master [23:11:55] greg-g: around? [23:11:55] hoo: You sent me a contentless ping. This is a contentless pong. Please provide a bit of information about what you want and I will respond when I am around. [23:12:25] meh... y u no IRC away :P [23:12:53] PROBLEM - HHVM rendering on mw1228 is CRITICAL - Socket timeout after 10 seconds [23:13:26] hoo: with hl, IRC works very well in async mode. [23:13:32] PROBLEM - Apache HTTP on mw1228 is CRITICAL - Socket timeout after 10 seconds [23:14:12] Dereckson: Sure, but using away info (or an |away or _away nick name) is nice :P [23:16:03] PROBLEM - HHVM busy threads on mw1228 is CRITICAL 70.00% of data above the critical threshold [115.2] [23:16:12] speaking about away, is there anyone to deploy SWAT patches this evening? [23:16:43] PROBLEM - HHVM queue size on mw1228 is CRITICAL 77.78% of data above the critical threshold [80.0] [23:20:17] Dereckson: oO [23:20:20] No one showed up yet? [23:20:28] Right. [23:20:29] If they're easy, I guess I could jump in [23:21:02] Patches are config changes, without any database creation or sophisticated stuff. [23:21:42] I guess I can do that [23:21:47] There is one of them with a high priority, 212721, the others are priority normal. [23:22:48] I'll go through all that are listed on wikitech... if I consider something too scary, I'll just skip it [23:23:03] Fine. 
[23:24:45] (03CR) 10Hoo man: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211408 (https://phabricator.wikimedia.org/T99315) (owner: 10Dereckson) [23:25:45] (03Merged) 10jenkins-bot: Site name configuration on ast.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211408 (https://phabricator.wikimedia.org/T99315) (owner: 10Dereckson) [23:26:53] !log hoo Synchronized wmf-config/InitialiseSettings.php: Site name configuration on ast.wiktionary (duration: 00m 12s) [23:27:02] Logged the message, Master [23:27:02] Please verify [23:28:01] Works. [23:28:04] :) [23:28:46] (03PS2) 10Hoo man: Configure import sources for hif.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212721 (https://phabricator.wikimedia.org/T99826) (owner: 10Dereckson) [23:28:55] (03CR) 10Hoo man: [C: 032] Configure import sources for hif.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212721 (https://phabricator.wikimedia.org/T99826) (owner: 10Dereckson) [23:29:01] (03Merged) 10jenkins-bot: Configure import sources for hif.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212721 (https://phabricator.wikimedia.org/T99826) (owner: 10Dereckson) [23:30:03] !log hoo Synchronized wmf-config/InitialiseSettings.php: Configure import sources for hif.wikipedia (duration: 00m 12s) [23:30:08] Logged the message, Master [23:30:23] Can you verify that? [23:30:45] Guess it's easier if I do [23:31:09] Looks good :) [23:31:37] Thank you. [23:31:52] You check that with mweval? 
[23:32:54] No, I checked with my volunteer account [23:32:59] stewards can use import on every wiki [23:33:24] (03PS2) 10Hoo man: Enable NewUserMessage on hif.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212723 (https://phabricator.wikimedia.org/T99824) (owner: 10Dereckson) [23:33:42] RECOVERY - Apache HTTP on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.060 second response time [23:34:17] (03CR) 10Hoo man: [C: 032] Enable NewUserMessage on hif.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212723 (https://phabricator.wikimedia.org/T99824) (owner: 10Dereckson) [23:34:20] (03Merged) 10jenkins-bot: Enable NewUserMessage on hif.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212723 (https://phabricator.wikimedia.org/T99824) (owner: 10Dereckson) [23:34:42] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 66032 bytes in 0.544 second response time [23:35:22] !log hoo Synchronized wmf-config/InitialiseSettings.php: Enable NewUserMessage on hif.wikipedia (duration: 00m 14s) [23:35:28] Logged the message, Master [23:35:48] Testing. [23:36:07] :) [23:38:13] The discussion about https://gerrit.wikimedia.org/r/212724 has only been started yesterday, I guess it would be better to wait until Monday [23:38:42] PROBLEM - puppet last run on db1072 is CRITICAL Puppet has 1 failures [23:38:45] https://sa.wikipedia.org/w/index.php?title=%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A4%BF%E0%A4%AA%E0%A5%80%E0%A4%A1%E0%A4%BF%E0%A4%AF%E0%A4%BE:%E0%A4%B5%E0%A4%BF%E0%A4%9A%E0%A4%BE%E0%A4%B0%E0%A4%AE%E0%A4%A3%E0%A5%8D%E0%A4%A1%E0%A4%AA%E0%A4%AE%E0%A5%8D_%28%E0%A4%A8%E0%A4%AF%E0%A4%B0%E0%A5%82%E0%A4%AA%E0%A5%80%E0%A4%95%E0%A4%B0%E0%A4%A3%E0%A4%AE%E0%A5%8D%29&action=history&uselang=en [23:39:29] Indeed, sa.wikipedia seems to have an active community enough to get more advices. 
[23:39:41] Ok :) [23:40:33] RECOVERY - HHVM queue size on mw1228 is OK Less than 30.00% above the threshold [10.0] [23:40:43] Can you take care of the phabricator tickets? [23:40:54] Yes, already commented. [23:41:33] RECOVERY - HHVM busy threads on mw1228 is OK Less than 30.00% above the threshold [76.8] [23:43:18] For hif., it's not yet triggered, but this extension also requires configuration that isn't completed on wiki yet. It's enabled in Special:Version, so as far as we know now, it seems to be okay. [23:43:41] ok [23:48:30] Dereckson: Still around? [23:48:41] I just noticed I forgot https://gerrit.wikimedia.org/r/212685 [23:48:59] Still around. [23:49:03] Ok [23:49:17] (03CR) 10Hoo man: [C: 032] Namespace configuration for office. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212685 (https://phabricator.wikimedia.org/T99860) (owner: 10Dereckson) [23:49:24] (03Merged) 10jenkins-bot: Namespace configuration for office. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212685 (https://phabricator.wikimedia.org/T99860) (owner: 10Dereckson) [23:50:19] !log hoo Synchronized wmf-config/InitialiseSettings.php: Re-enable subpages for the template namespace on officewiki (duration: 00m 13s) [23:50:24] Logged the message, Master [23:50:35] guillom: could you test if it's fine now (for example in a Template:Quux/doc if you see the link to Template:Quux)? ^ [23:50:47] I can't verify that... I guess you can't verify that either :P [23:52:47] Oh, 23:52:31 -!- guillom is away: afk, screen detached. Messages are logged. [23:53:21] Well, I'm leaving a message on the bug asking for confirmation that it works in this case. [23:53:38] That should be good enough, yes [23:54:40] And with that, we've finished. [23:54:41] Thanks for the deploy. 
[23:54:47] You're welcome [23:55:33] RECOVERY - puppet last run on db1072 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:57:52] By the way, 212723 works / https://hif.wikipedia.org/wiki/sadasya_ke_baat:Deasyzor [23:58:58] :)