[00:00:02] PROBLEM - puppet last run on mw2182 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:00:02] PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:00:02] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:00:02] PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:00:22] PROBLEM - puppet last run on mw1247 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:00:22] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:00:22] PROBLEM - puppet last run on mw2161 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:00:31] robh: ^ could be related to access request change? [00:00:40] shitttttts [00:00:41] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:00:48] maybe cuz i merged too close together [00:00:49] i'm trying on a random one now [00:00:55] one adds shell user [00:00:58] the other adds group [00:01:08] expects the first argument to be a hash, got "" which is of type String [00:01:09] and i pushed them live too close together that causes race condition [00:01:11] PROBLEM - puppet last run on mw2157 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:01:12] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:01:12] PROBLEM - puppet last run on mw2247 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:01:20] i see [00:01:21] PROBLEM - puppet last run on mw1283 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:01:21] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:01:21] PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:01:21] PROBLEM - puppet last run on mw2240 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:01:21] PROBLEM - puppet last run on mw2151 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:01:22] PROBLEM - puppet last run on mw2281 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:01:22] PROBLEM - puppet last run on mw2280 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:01:26] its going to fail on all systems with 'deployment' users [00:01:31] PROBLEM - puppet last run on mw2266 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:01:39] at least i think thats causing it [00:01:41] PROBLEM - puppet last run on mw2221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:01:41] PROBLEM - puppet last run on mw1326 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:01:43] im also logging into one and trying a run [00:01:51] PROBLEM - puppet last run on mw2265 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:01:51] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:01:51] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:02:01] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:02:01] PROBLEM - puppet last run on mw2285 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:02:03] puppet failing isnt end of times, better to figure it out than just blind revert [00:02:12] PROBLEM - puppet last run on mw1239 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:02:21] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:02:21] PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:02:22] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:02:25] yeah.... [00:02:31] PROBLEM - puppet last run on mw2255 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:02:31] its for the user add [00:02:32] PROBLEM - puppet last run on mw2258 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:02:41] so im going to revert the second of the patchsets, the one adding to deployment group [00:02:41] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:02:46] mutante: sound reasonable? [00:02:51] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:02:51] PROBLEM - puppet last run on mw2172 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:02:52] PROBLEM - puppet last run on mw2147 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:14] robh: it's something different [00:03:22] PROBLEM - puppet last run on mw1311 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:22] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:23] oh? [00:03:34] robh: on https://gerrit.wikimedia.org/r/#/c/422081/1/modules/admin/data/data.yaml samwilson starts the beginning of the line [00:03:40] but the other users are indented [00:03:42] PROBLEM - puppet last run on mw2137 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:42] PROBLEM - puppet last run on mw2169 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:42] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:42] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:42] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:42] PROBLEM - puppet last run on mw2248 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:51] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:51] PROBLEM - puppet last run on mw2162 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:51] PROBLEM - puppet last run on mw2156 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:03:52] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:55] ohhhh [00:03:58] indeed, shittttt [00:03:59] it's at the same level of the group name [00:04:00] ok, fixing [00:04:11] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:04:11] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:04:11] PROBLEM - puppet last run on mw2288 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:04:11] PROBLEM - puppet last run on mw2277 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:04:12] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:04:31] (CR) Krinkle: "Note that this is /not/ merging www with *.wikipedia.org, it is merging www with (root) wikipedia.org, which is mainly just a redirect to " [puppet] - https://gerrit.wikimedia.org/r/398396 (owner: EddieGP) [00:04:52] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:04:52] PROBLEM - puppet last run on mw2287 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:05:01] PROBLEM - puppet last run on mw1346 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:05:02] PROBLEM - puppet last run on mw2225 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:05:09] (PS1) RobH: fixing samwilson user entry [puppet] - https://gerrit.wikimedia.org/r/422084 (https://phabricator.wikimedia.org/T189414) [00:05:12] PROBLEM - puppet last run on mw1310 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:05:14] mutante: ^ cr? [00:05:21] PROBLEM - puppet last run on mw2253 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:05:21] PROBLEM - puppet last run on mw2290 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:05:31] PROBLEM - puppet last run on mw2251 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:05:31] PROBLEM - puppet last run on mw2177 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:05:41] PROBLEM - puppet last run on mw2214 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:05:42] PROBLEM - puppet last run on mw2223 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:05:49] yes yes icinga we know im fixing it [00:06:11] (PS2) RobH: fixing samwilson user entry [puppet] - https://gerrit.wikimedia.org/r/422084 (https://phabricator.wikimedia.org/T189414) [00:06:21] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:06:21] PROBLEM - puppet last run on mw2220 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:06:21] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:06:22] PROBLEM - puppet last run on mw2271 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:06:31] PROBLEM - puppet last run on mw1345 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:06:31] PROBLEM - puppet last run on mw2176 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:06:32] PROBLEM - puppet last run on mw2279 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:06:37] (CR) RobH: [C: 2] fixing samwilson user entry [puppet] - https://gerrit.wikimedia.org/r/422084 (https://phabricator.wikimedia.org/T189414) (owner: RobH) [00:06:41] PROBLEM - puppet last run on mw1309 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:06:41] PROBLEM - puppet last run on mw1339 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:06:42] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:06:42] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:06:44] robh: Looks like a unit test would help here. We have some for admin, right? [00:06:49] yep! [00:06:51] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:06:58] im going to document this in a task once its fixed [00:07:01] PROBLEM - puppet last run on labweb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:07:01] PROBLEM - puppet last run on mw1274 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:07:01] PROBLEM - puppet last run on mw1322 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:07:02] kk :) [00:07:02] because its an easy mistake to make [00:07:12] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:07:21] PROBLEM - puppet last run on mw2252 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:07:21] PROBLEM - puppet last run on mw2274 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:07:21] PROBLEM - puppet last run on mw1276 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:07:22] PROBLEM - puppet last run on mw2246 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:07:22] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:07:22] PROBLEM - puppet last run on mw2218 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:07:22] PROBLEM - puppet last run on mw2262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:07:31] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:07:38] ok fix is merged, running now on random mw system [00:08:13] and it adds user just fine now ;D [00:08:16] 24 second puppet run. [00:08:22] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:08:38] we dont have any other deployments this evening [00:08:42] PROBLEM - puppet last run on mw2283 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:08:42] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:08:48] so its likely ok for it to just wait for them to call in and get updates properly [00:08:51] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:08:51] PROBLEM - puppet last run on mw2245 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:08:52] PROBLEM - puppet last run on mw1287 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:08:52] PROBLEM - puppet last run on mw1332 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:08:52] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:09:01] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:09:11] PROBLEM - puppet last run on mw2244 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:09:11] PROBLEM - puppet last run on mw2226 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:09:11] PROBLEM - puppet last run on mw2233 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:09:12] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:09:21] PROBLEM - puppet last run on mw2286 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:09:41] PROBLEM - puppet last run on mw2153 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:09:48] but i dislike seeing that spam... 
i think i shall cumin puppet run them for failed runs only [00:09:51] PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:09:51] PROBLEM - puppet last run on mw2160 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:09:52] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:09:52] PROBLEM - puppet last run on mwlog2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:10:01] PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:10:57] 1290 hosts will be targeted: [00:11:11] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:11:15] robh: sorry, back. got a distraction in RL [00:11:21] PROBLEM - puppet last run on mw2236 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:11:21] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:11:21] PROBLEM - puppet last run on mw2259 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:11:21] PROBLEM - puppet last run on mw2138 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:11:22] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:11:29] robh: i think it's ok to just wait.. but you can kill the bot [00:11:32] RECOVERY - puppet last run on mw2221 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:11:36] ok, there they go. 
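The indentation slip discussed above (a user entry starting at column 0 of data.yaml instead of being indented under its parent key) is the kind of mistake a small pre-merge check could catch. A minimal sketch, assuming the file's only legitimate top-level keys are `groups` and `users`; that key set, the function name, and the sample data are all hypothetical and are not taken from the actual admin module or its existing tests:

```python
import re

# Assumption: only these keys may legitimately start at column 0.
ALLOWED_TOP_LEVEL = {"groups", "users"}

# A mapping key that starts at column 0, e.g. "users:" or "bob:".
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z0-9_.-]+):")


def stray_top_level_keys(text, allowed=ALLOWED_TOP_LEVEL):
    """Return (line_number, key) pairs for unexpected keys at column 0."""
    stray = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = TOP_LEVEL_KEY.match(line)
        if match and match.group(1) not in allowed:
            stray.append((number, match.group(1)))
    return stray


# Hypothetical sample: "bob" lost its indentation and became a top-level key.
SAMPLE = """\
groups:
  deployment:
    members: [alice]
users:
  alice:
    uid: 1000
bob:
    uid: 1001
"""

print(stray_top_level_keys(SAMPLE))
```

This is a plain text scan rather than a YAML parse, so it only catches keys that drift to column 0; a fuller check would load the file and verify that every user and group entry is a mapping.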
[00:11:36] and bring it back once most are recovered [00:11:41] PROBLEM - puppet last run on mw2174 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:12:00] oh wait 2221 is the one i ran ;D [00:12:01] PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:12:11] mutante: whats the proper way to do that? [00:12:12] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:12:15] remove bot i mean [00:12:21] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:12:51] PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:13:50] robh: systemctl stop ircecho on einsteinium [00:16:21] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [00:16:41] RECOVERY - puppet last run on mw1339 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [00:18:41] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:20:02] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [00:20:12] RECOVERY - puppet last run on mw1310 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:20:41] RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:21:21] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:21:31] RECOVERY - puppet last run on mw2176 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [00:21:41] RECOVERY - 
puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:21:51] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:21:51] RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:22:22] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:22:53] Operations, Continuous-Integration-Config: add ci test for admin module indentation - https://phabricator.wikimedia.org/T190766#4083108 (RobH) p:Triage>Normal [00:23:22] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:23:41] RECOVERY - puppet last run on mw2169 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:23:41] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:23:51] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:23:52] RECOVERY - puppet last run on mw2195 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [00:24:11] RECOVERY - puppet last run on mw2233 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:24:11] RECOVERY - puppet last run on mw2277 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:24:51] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:24:52] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:25:01] RECOVERY - puppet last run on mw2179 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [00:25:01] RECOVERY - puppet last run on mw2225 is OK: OK: 
Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:25:22] RECOVERY - puppet last run on mw1247 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:25:22] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [00:25:22] RECOVERY - puppet last run on mw2161 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [00:25:22] RECOVERY - puppet last run on mw2177 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:25:42] RECOVERY - puppet last run on mw2223 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:26:12] RECOVERY - puppet last run on mw2241 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:26:12] RECOVERY - puppet last run on mw2247 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:26:21] RECOVERY - puppet last run on mw1323 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [00:26:21] RECOVERY - puppet last run on mw1283 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [00:26:21] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:26:21] RECOVERY - puppet last run on mw2151 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:26:21] RECOVERY - puppet last run on mw2138 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [00:26:41] RECOVERY - puppet last run on mw1326 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:26:41] RECOVERY - puppet last run on mw2174 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:27:01] RECOVERY - puppet last run on labweb1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:27:01] RECOVERY - 
puppet last run on naos is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [00:27:01] RECOVERY - puppet last run on mw2285 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [00:27:01] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:27:02] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:27:12] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:27:12] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:27:31] RECOVERY - puppet last run on mw2258 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:27:31] RECOVERY - puppet last run on mw2255 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:27:51] RECOVERY - puppet last run on mwdebug2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:27:52] RECOVERY - puppet last run on mw2147 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [00:28:01] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:28:21] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:28:22] RECOVERY - puppet last run on mw1311 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [00:28:22] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:28:41] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:28:41] RECOVERY - puppet last run on mw1314 is OK: OK: Puppet is currently enabled, last run 2 minutes ago 
with 0 failures [00:28:41] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [00:28:41] RECOVERY - puppet last run on mw2248 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:28:41] RECOVERY - puppet last run on mw2222 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:28:51] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:28:51] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:28:51] RECOVERY - puppet last run on mw1332 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:28:52] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:28:52] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:29:11] RECOVERY - puppet last run on mw2288 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:29:12] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [00:29:21] RECOVERY - puppet last run on mw2286 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:29:42] RECOVERY - puppet last run on mw1316 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:29:52] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:29:52] RECOVERY - puppet last run on mw2287 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:30:01] RECOVERY - puppet last run on mw2182 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:30:01] RECOVERY - puppet last run on mw1346 is OK: OK: Puppet is currently 
enabled, last run 19 seconds ago with 0 failures [00:30:02] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:30:02] RECOVERY - puppet last run on mw2229 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:30:21] RECOVERY - puppet last run on mw2290 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:31:11] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:31:11] RECOVERY - puppet last run on mw2157 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:31:12] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:31:21] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:31:21] RECOVERY - puppet last run on mw2240 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:31:21] RECOVERY - puppet last run on mw2220 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:31:21] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:31:21] RECOVERY - puppet last run on mw2259 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [00:31:22] RECOVERY - puppet last run on mw2281 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:31:22] RECOVERY - puppet last run on mw2280 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:31:31] RECOVERY - puppet last run on mw1345 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [00:31:31] RECOVERY - puppet last run on mw2279 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:31:32] RECOVERY - puppet last run on 
mw2266 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [00:31:41] RECOVERY - puppet last run on mw1309 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [00:31:42] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [00:31:42] RECOVERY - puppet last run on mw2265 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:31:51] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:32:01] RECOVERY - puppet last run on mw1274 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:32:01] RECOVERY - puppet last run on mw1322 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [00:32:21] RECOVERY - puppet last run on mw1239 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:32:21] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:32:21] RECOVERY - puppet last run on mw2212 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:32:21] RECOVERY - puppet last run on mw2274 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [00:32:22] RECOVERY - puppet last run on mw2246 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:32:22] RECOVERY - puppet last run on mw2262 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:32:22] RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:32:42] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:32:51] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures 
[00:32:51] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:33:41] RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:33:41] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:33:51] RECOVERY - puppet last run on mw1263 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [00:33:51] RECOVERY - puppet last run on mw2162 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:33:51] RECOVERY - puppet last run on mw2156 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:33:51] RECOVERY - puppet last run on mw2245 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [00:34:11] RECOVERY - puppet last run on mw2244 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [00:34:11] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:34:11] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:34:41] RECOVERY - puppet last run on mw2153 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:34:51] RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:34:51] RECOVERY - puppet last run on mw2160 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [00:34:52] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:34:52] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:35:21] RECOVERY - puppet last run on mw2253 is OK: OK: Puppet is currently enabled, last 
run 2 minutes ago with 0 failures [00:35:22] RECOVERY - puppet last run on mw2251 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:35:41] RECOVERY - puppet last run on mw2214 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:36:21] RECOVERY - puppet last run on mw2236 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:36:21] RECOVERY - puppet last run on mw2271 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:37:01] RECOVERY - puppet last run on mw2250 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [00:37:12] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:37:21] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:37:21] RECOVERY - puppet last run on mw2252 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:37:21] RECOVERY - puppet last run on mw1276 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:37:21] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:37:22] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:37:22] RECOVERY - puppet last run on mw2218 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:37:32] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:37:51] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:38:41] RECOVERY - puppet last run on mw2283 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:39:01] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet 
is currently enabled, last run 4 minutes ago with 0 failures [00:39:11] RECOVERY - puppet last run on mw2226 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:39:12] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:39:52] RECOVERY - puppet last run on mwlog2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [01:02:52] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 61229 MB (12% inode=99%) [01:05:03] Operations, Ops-Access-Requests, Ops-Access-Reviews, Patch-For-Review: Requesting access to terbium/maintenance-log-readers for bmansurov - https://phabricator.wikimedia.org/T189285#4083240 (bmansurov) Thank you, all. [01:15:52] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 61002 MB (12% inode=99%) [01:27:52] RECOVERY - Disk space on elastic1025 is OK: DISK OK [02:00:09] !log l10nupdate@tin LocalisationUpdate failed: git pull of core failed [02:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:31] Operations, Traffic, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4083351 (Krinkle) @Vgutierrez I've seen the conversion to mtail going on, and look forward to using Prometheus in the ResourceLoader dashboards. Howev... [02:41:11] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 61707 MB (12% inode=99%) [02:44:11] RECOVERY - Disk space on elastic1025 is OK: DISK OK [02:46:12] Operations, Traffic, Patch-For-Review, Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#4083365 (BBlack) Digging into the timestamps of the final entries of various types (for load.php slowlog entr...
[02:57:48] !log Fix retention rules for Whisper files on graphite2001 and graphite1001 per T179622 (/var/lib/carbon/whisper/ve/*) [02:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:57:54] T179622: Update our Graphite metrics for current retention config - https://phabricator.wikimedia.org/T179622 [03:26:12] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 830.00 seconds [03:39:11] (PS4) KartikMistry: apertium: New upstream release [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/419351 (https://phabricator.wikimedia.org/T189075) [03:39:22] (CR) jerkins-bot: [V: -1] apertium: New upstream release [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/419351 (https://phabricator.wikimedia.org/T189075) (owner: KartikMistry) [03:42:42] (CR) KartikMistry: "recheck" [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/419351 (https://phabricator.wikimedia.org/T189075) (owner: KartikMistry) [03:43:10] (CR) jerkins-bot: [V: -1] apertium: New upstream release [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/419351 (https://phabricator.wikimedia.org/T189075) (owner: KartikMistry) [03:56:21] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 233.22 seconds [04:29:21] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures.
Failed resources (up to 3 shown): Exec[cdh::hadoop::directory /user/spark/share],Exec[cdh::hadoop::directory /user/spark/applicationHistory] [04:53:18] (Abandoned) Ori.livneh: redis: prohibit commands CONFIG, SLAVEOF and DEBUG by default [puppet] - https://gerrit.wikimedia.org/r/251800 (owner: Ori.livneh) [04:53:41] (Abandoned) Ori.livneh: Drop support for the legacy configuration format [debs/pybal] - https://gerrit.wikimedia.org/r/317823 (owner: Ori.livneh) [04:59:21] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [05:35:37] (PS3) KartikMistry: apertium: Add apertium-separable package [puppet] - https://gerrit.wikimedia.org/r/421833 (https://phabricator.wikimedia.org/T189075) [05:51:05] (PS2) Giuseppe Lavagetto: nagios: Remove 'krinkle' from cloud/cvn contact group [puppet] - https://gerrit.wikimedia.org/r/421475 (owner: Krinkle) [05:52:50] (CR) Giuseppe Lavagetto: [C: 2] nagios: Remove 'krinkle' from cloud/cvn contact group [puppet] - https://gerrit.wikimedia.org/r/421475 (owner: Krinkle) [05:59:30] !log Manually purge https://en.wikipedia.org/static/images/project-logos/nds_nlwiki-1.5x.png – T190051 [05:59:35] !log Manually purge https://en.wikipedia.org/static/images/project-logos/nds_nlwiki-2x.png – T190051 [05:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:36] T190051: Please update nds-nl Wikipedia logo - https://phabricator.wikimedia.org/T190051 [05:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:32] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0 [06:08:02] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [06:09:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL:
CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [06:14:02] (CR) Giuseppe Lavagetto: [C: -1] "Overall seems to go in the right direction, but I think it could be simplified further." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/421842 (https://phabricator.wikimedia.org/T189891) (owner: Filippo Giunchedi) [06:18:12] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [06:29:20] (PS6) Elukey: cassandra: upgrade version 2.2 package settings for aqs [puppet] - https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) [06:34:51] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 61134 MB (12% inode=99%) [06:37:21] PROBLEM - HHVM rendering on mw2188 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:38:12] RECOVERY - HHVM rendering on mw2188 is OK: HTTP OK: HTTP/1.1 200 OK - 76005 bytes in 0.305 second response time [06:49:32] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 61745 MB (12% inode=99%) [06:51:39] dcausse: o/ - disk alerts are still flapping, just to triple check, are we still good? [06:53:41] mmm restbase2007 down?
[06:53:41] PROBLEM - Host restbase2007 is DOWN: PING CRITICAL - Packet loss = 100% [06:53:47] yep [06:56:32] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 62185 MB (12% inode=99%) [06:56:39] trying to access the console but not a lot of luck [06:59:23] !log powercycle restbase2007 (no ssh, vsp not available via mgmt console) [06:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:01] RECOVERY - Host restbase2007 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [07:02:41] the cassandra instances are booting [07:04:11] PROBLEM - cassandra-b CQL 10.192.16.177:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.177 and port 9042: Connection refused [07:04:11] PROBLEM - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [07:04:21] PROBLEM - cassandra-a SSL 10.192.16.176:7001 on restbase2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [07:04:31] PROBLEM - cassandra-a CQL 10.192.16.176:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.176 and port 9042: Connection refused [07:04:46] yes we know it [07:04:52] PROBLEM - cassandra-c CQL 10.192.16.178:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.178 and port 9042: Connection refused [07:05:01] PROBLEM - cassandra-b SSL 10.192.16.177:7001 on restbase2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [07:17:29] (PS1) Elukey: statistics::packages: deploy 'python3-protobuf' only on stretch [puppet] - https://gerrit.wikimedia.org/r/422098 [07:18:36] restbase2007's instances still in STARTING mode [07:18:59] (CR) Elukey: [C: 2] statistics::packages: deploy 'python3-protobuf' only on stretch [puppet] - https://gerrit.wikimedia.org/r/422098 (owner: Elukey) [07:19:02] RECOVERY - cassandra-b SSL 10.192.16.177:7001 on restbase2007 is OK: SSL OK -
Certificate restbase2007-b valid until 2018-08-17 16:12:09 +0000 (expires in 143 days) [07:19:11] RECOVERY - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is OK: SSL OK - Certificate restbase2007-c valid until 2018-08-17 16:12:10 +0000 (expires in 143 days) [07:19:22] RECOVERY - cassandra-a SSL 10.192.16.176:7001 on restbase2007 is OK: SSL OK - Certificate restbase2007-a valid until 2018-08-17 16:12:08 +0000 (expires in 143 days) [07:19:41] RECOVERY - cassandra-a CQL 10.192.16.176:9042 on restbase2007 is OK: TCP OK - 3.068 second response time on 10.192.16.176 port 9042 [07:19:52] RECOVERY - cassandra-c CQL 10.192.16.178:9042 on restbase2007 is OK: TCP OK - 0.036 second response time on 10.192.16.178 port 9042 [07:20:11] RECOVERY - cassandra-b CQL 10.192.16.177:9042 on restbase2007 is OK: TCP OK - 0.036 second response time on 10.192.16.177 port 9042 [07:20:45] better now [07:22:01] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [07:26:21] RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:29:19] elukey: yes reindex still ongoing, wikidata is processing so we are near the end, sorry about that, I'll talk to guillaume to see how we can adjust the alerting during reindex [07:29:57] dcausse: nono I am ok with those, just wanted to make sure that things were under control :) [07:30:00] thanks!
[07:31:01] RECOVERY - Disk space on elastic1019 is OK: DISK OK [07:33:43] (CR) Elukey: [C: 2] cassandra: upgrade version 2.2 package settings for aqs [puppet] - https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) (owner: Elukey) [07:33:48] (PS7) Elukey: cassandra: upgrade version 2.2 package settings for aqs [puppet] - https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) [07:34:19] rebase-snipered by another code review of mine [07:41:12] Operations, Traffic, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4083535 (Vgutierrez) >>! In T184942#4081704, @Pchelolo wrote: > - API Summary dashboard - this one uses `varnish.$dc.backends.be_{backend}` metric and... [07:47:41] RECOVERY - Disk space on elastic1027 is OK: DISK OK [07:48:27] (PS2) Elukey: role::aqs: enable jmx agent [puppet] - https://gerrit.wikimedia.org/r/421878 (https://phabricator.wikimedia.org/T184795) [07:49:00] (CR) Elukey: [C: 2] role::aqs: enable jmx agent [puppet] - https://gerrit.wikimedia.org/r/421878 (https://phabricator.wikimedia.org/T184795) (owner: Elukey) [07:53:42] (PS2) Jcrespo: dump_section.py: Allow manual run of the backup [puppet] - https://gerrit.wikimedia.org/r/421931 [07:54:15] (CR) jerkins-bot: [V: -1] dump_section.py: Allow manual run of the backup [puppet] - https://gerrit.wikimedia.org/r/421931 (owner: Jcrespo) [07:59:12] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 [07:59:31] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [08:05:03] Operations, Prod-Kubernetes, Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462#4083574 (akosiaris) [08:07:16] Operations,
Prod-Kubernetes, Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462#4083581 (akosiaris) [08:07:21] Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Validate whether the (implemented) standardized application environment works as expected - https://phabricator.wikimedia.org/T184923#4083578 (akosiaris) Open>Resolved a:akosiaris * Network policy has been validated * statsd_p... [08:08:30] (PS1) Alexandros Kosiaris: ci: Add kubernetes deployment classes to CI [puppet] - https://gerrit.wikimedia.org/r/422100 (https://phabricator.wikimedia.org/T184924) [08:11:47] !log reboot aqs1005 for kernel + openjdk-8 + cassandra upgrade [08:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:11] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=aqs1005.eqiad.wmnet [08:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:03] !log add more weight to ms-be204[0-3] - T189633 [08:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:09] T189633: rack/setup/install ms-be204[0-3] - https://phabricator.wikimedia.org/T189633 [08:33:36] !log reboot aqs1006 for kernel + openjdk-8 + cassandra upgrade [08:33:36] !log kartik@tin Started deploy [cxserver/deploy@a6b029f]: Update cxserver to 9e8ebda (Fix etag parsing and T188403) [08:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:46] T188403: CX2: Fix issues adapting images - https://phabricator.wikimedia.org/T188403 [08:36:45] !log kartik@tin Finished deploy [cxserver/deploy@a6b029f]: Update cxserver to 9e8ebda (Fix etag parsing and T188403) (duration: 03m 09s) [08:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:44] (PS3) Jcrespo: dump_section.py: Allow manual
run of the backup [puppet] - https://gerrit.wikimedia.org/r/421931 [08:42:17] (CR) jerkins-bot: [V: -1] dump_section.py: Allow manual run of the backup [puppet] - https://gerrit.wikimedia.org/r/421931 (owner: Jcrespo) [08:50:04] (PS4) Jcrespo: dump_section.py: Allow manual run of the backup [puppet] - https://gerrit.wikimedia.org/r/421931 [08:50:34] (PS4) Filippo Giunchedi: puppetmaster: install keypair for 'puppet' when running as CA [puppet] - https://gerrit.wikimedia.org/r/421842 (https://phabricator.wikimedia.org/T189891) [08:50:36] (PS2) Filippo Giunchedi: Revert "hieradata: use puppetmaster2001 as ca_server" [puppet] - https://gerrit.wikimedia.org/r/421917 (https://phabricator.wikimedia.org/T189891) [08:50:38] (PS2) Filippo Giunchedi: Revert "cache: depool puppetmaster1001 from config-master.w.o" [puppet] - https://gerrit.wikimedia.org/r/421919 (https://phabricator.wikimedia.org/T184562) [08:51:26] (CR) Filippo Giunchedi: "Thanks for the review!"
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/421842 (https://phabricator.wikimedia.org/T189891) (owner: Filippo Giunchedi) [08:54:31] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler03/10679/" [puppet] - https://gerrit.wikimedia.org/r/421860 (https://phabricator.wikimedia.org/T184561) (owner: Filippo Giunchedi) [08:55:16] (CR) Alexandros Kosiaris: "PCC happy at https://puppet-compiler.wmflabs.org/compiler03/10680/contint1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/422100 (https://phabricator.wikimedia.org/T184924) (owner: Alexandros Kosiaris) [09:02:17] (PS1) Elukey: role::prometheus::analytics: poll cassandra aqs metrics [puppet] - https://gerrit.wikimedia.org/r/422103 (https://phabricator.wikimedia.org/T184795) [09:09:47] !log reboot aqs1007 for kernel + cassandra upgrades [09:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:23] (PS7) Rduran: Create tests skeleton [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/420746 [09:12:26] (PS7) Rduran: Refactor and test the main OSC run method [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/421340 [09:14:33] (PS5) Jcrespo: dump_section.py: Allow manual run of the backup [puppet] - https://gerrit.wikimedia.org/r/421931 [09:25:26] !log uploaded mtail-3.0.0~rc5-1 to apt.w.o for jessie-wikimedia [09:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:52] !log reboot aqs1008 for kernel + cassandra upgrades [09:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:16] (PS6) Jcrespo: dump_section.py: Allow manual run of the backup [puppet] - https://gerrit.wikimedia.org/r/421931 [09:36:51] (CR) jerkins-bot: [V: -1] dump_section.py: Allow manual run of the backup [puppet] - https://gerrit.wikimedia.org/r/421931 (owner: Jcrespo) [09:38:45] (PS7) Jcrespo: dump_section.py: Allow manual run
of the backup [puppet] - https://gerrit.wikimedia.org/r/421931 [09:41:29] Operations, Traffic, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4083723 (fgiunchedi) >>! In T184942#4083351, @Krinkle wrote: > I did find `fmt_inm` as `varnish_resourceloader_inm` via [varnishrls.mtail](https://gith... [09:43:16] (PS1) Alexandros Kosiaris: nrpe: Pass ensure from monitor_service to nrpe::check [puppet] - https://gerrit.wikimedia.org/r/422106 [09:44:56] !log reboot aqs1009 for kernel + cassandra upgrades [09:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:55] (CR) Ema: [C: 1] nrpe: Pass ensure from monitor_service to nrpe::check [puppet] - https://gerrit.wikimedia.org/r/422106 (owner: Alexandros Kosiaris) [09:48:57] (CR) Jcrespo: [C: 2] dump_section.py: Allow manual run of the backup [puppet] - https://gerrit.wikimedia.org/r/421931 (owner: Jcrespo) [09:50:33] Operations, Cassandra, Services (doing), User-Eevans, User-Elukey: Test/upload new cassandra 2.2.6 package (wmf3) - https://phabricator.wikimedia.org/T189529#4083750 (elukey) Open>Resolved Deployed to the aqs cluster together with the jmx agent, as far as I can see all good! Thanks a... [09:54:23] (CR) Giuseppe Lavagetto: [C: 1] "This is obviously correct; actually I'm not fully sure of the implications on our codebase. Did someone check?"
[puppet] - https://gerrit.wikimedia.org/r/422106 (owner: Alexandros Kosiaris) [09:57:29] (PS1) Filippo Giunchedi: base: alert on edac correctable errors [puppet] - https://gerrit.wikimedia.org/r/422110 (https://phabricator.wikimedia.org/T183177) [09:59:09] (CR) Filippo Giunchedi: [C: 1] role::prometheus::analytics: poll cassandra aqs metrics [puppet] - https://gerrit.wikimedia.org/r/422103 (https://phabricator.wikimedia.org/T184795) (owner: Elukey) [10:00:10] (CR) Elukey: [C: 2] role::prometheus::analytics: poll cassandra aqs metrics [puppet] - https://gerrit.wikimedia.org/r/422103 (https://phabricator.wikimedia.org/T184795) (owner: Elukey) [10:00:14] (PS2) Elukey: role::prometheus::analytics: poll cassandra aqs metrics [puppet] - https://gerrit.wikimedia.org/r/422103 (https://phabricator.wikimedia.org/T184795) [10:04:25] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler03/10683/" [puppet] - https://gerrit.wikimedia.org/r/422110 (https://phabricator.wikimedia.org/T183177) (owner: Filippo Giunchedi) [10:25:08] (CR) Filippo Giunchedi: "This will add about 1500 exported resources, it doesn't sound like much to add to puppetdb and icinga so I think it won't be a problem. We" [puppet] - https://gerrit.wikimedia.org/r/422110 (https://phabricator.wikimedia.org/T183177) (owner: Filippo Giunchedi) [10:32:28] (PS1) Filippo Giunchedi: base: enable exporting SMART metrics by default [puppet] - https://gerrit.wikimedia.org/r/422112 (https://phabricator.wikimedia.org/T86552) [10:36:43] Operations, Traffic: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4083947 (ema) >>! In T190450#4074126, @Krinkle wrote: > I'm not sure why this would've changed recently. I suppose we can try to track it down. However, i...
[10:36:53] Operations, Traffic: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4083949 (ema) p:Triage>Normal [10:41:42] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:41:46] (PS1) Jcrespo: mariadb backups: Skip x1 and misc hosts this week [puppet] - https://gerrit.wikimedia.org/r/422115 (https://phabricator.wikimedia.org/T183177) [10:42:24] (PS1) Muehlenhoff: Record new emaila address for Matt Flaschen [puppet] - https://gerrit.wikimedia.org/r/422116 [10:42:39] (CR) Jcrespo: [C: 2] mariadb backups: Skip x1 and misc hosts this week [puppet] - https://gerrit.wikimedia.org/r/422115 (https://phabricator.wikimedia.org/T183177) (owner: Jcrespo) [10:42:51] (PS2) Muehlenhoff: Record new email address for Matt Flaschen [puppet] - https://gerrit.wikimedia.org/r/422116 [10:45:27] (PS3) Muehlenhoff: Record new email address for Matt Flaschen [puppet] - https://gerrit.wikimedia.org/r/422116 [10:46:06] (CR) Muehlenhoff: [C: 2] Record new email address for Matt Flaschen [puppet] - https://gerrit.wikimedia.org/r/422116 (owner: Muehlenhoff) [10:50:30] !log reboot labtestvirt2002 to test if it would boot or not [10:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:12] PROBLEM - Host labtestvirt2002 is DOWN: PING CRITICAL - Packet loss = 100% [10:52:51] RECOVERY - Host labtestvirt2002 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [10:55:11] PROBLEM - configured eth on labtestvirt2002 is CRITICAL: eth1 reporting no carrier.
[10:55:21] I get "fatal internal server error" from git pull (getting things from gerrit.wikimedia.org) and same with git fetch [10:55:32] fatal: internal server error [10:55:32] remote: internal server error [10:55:32] fatal: protocol error: bad pack header [10:56:28] Amir1: looks like T190676 [10:56:28] T190676: git pull fails for MW core with "fatal: protocol error: bad pack header" when local branches point to remote branches that no more exist - https://phabricator.wikimedia.org/T190676 [10:57:36] godog: yeah but it's resolved :/ [10:58:46] Operations, Gerrit: git pull fails for MW core with "fatal: protocol error: bad pack header" when local branches point to remote branches that no more exist - https://phabricator.wikimedia.org/T190676#4084023 (Ladsgroup) Resolved>Open We can't tell anyone to run "git remote prune origin" when see... [11:06:42] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [11:11:14] (CR) Giuseppe Lavagetto: [C: 1] puppetmaster: install keypair for 'puppet' when running as CA [puppet] - https://gerrit.wikimedia.org/r/421842 (https://phabricator.wikimedia.org/T189891) (owner: Filippo Giunchedi) [11:11:43] (PS5) Filippo Giunchedi: puppetmaster: install keypair for 'puppet' when running as CA [puppet] - https://gerrit.wikimedia.org/r/421842 (https://phabricator.wikimedia.org/T189891) [11:12:03] (PS1) Jcrespo: Revert "mariadb backups: Skip x1 and misc hosts this week" [puppet] - https://gerrit.wikimedia.org/r/422122 [11:12:09] (PS2) Jcrespo: Revert "mariadb backups: Skip x1 and misc hosts this week" [puppet] - https://gerrit.wikimedia.org/r/422122 [11:12:58] (CR) Filippo Giunchedi: [C: 2] puppetmaster: install keypair for 'puppet' when running as CA [puppet] - https://gerrit.wikimedia.org/r/421842 (https://phabricator.wikimedia.org/T189891) (owner: Filippo Giunchedi) [11:13:24] Operations, Traffic: How is Varnish errorpage
enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4084046 (zeljkofilipin) >>! In T190450#4081937, @Ragesoss wrote: > It might make sense to patch the gem to return empty string for 404s, which would resto... [11:13:42] Operations, Traffic, User-zeljkofilipin: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4084049 (zeljkofilipin) [11:14:07] (PS1) Jcrespo: mariadb backups: Execute tar in parallel [puppet] - https://gerrit.wikimedia.org/r/422123 (https://phabricator.wikimedia.org/T189383) [11:14:43] (CR) jerkins-bot: [V: -1] mariadb backups: Execute tar in parallel [puppet] - https://gerrit.wikimedia.org/r/422123 (https://phabricator.wikimedia.org/T189383) (owner: Jcrespo) [11:16:38] (PS1) Filippo Giunchedi: puppetmaster: swap ssl/server with server/ssl in ca_server [puppet] - https://gerrit.wikimedia.org/r/422124 [11:16:44] (CR) Alexandros Kosiaris: [C: -2] "It's actually not so correct and now that I relook at my change, IIRC I did this on purpose. The reasoning being that we don't want the fi" [puppet] - https://gerrit.wikimedia.org/r/422106 (owner: Alexandros Kosiaris) [11:16:52] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/var/lib/puppet/ssl/server/certs/puppet.pem],File[/var/lib/puppet/ssl/server/private_keys/puppet.pem] [11:17:02] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures.
Failed resources (up to 3 shown): File[/var/lib/puppet/ssl/server/certs/puppet.pem],File[/var/lib/puppet/ssl/server/private_keys/puppet.pem] [11:17:02] that's me ^ fixing [11:17:33] (CR) Filippo Giunchedi: [C: 2] puppetmaster: swap ssl/server with server/ssl in ca_server [puppet] - https://gerrit.wikimedia.org/r/422124 (owner: Filippo Giunchedi) [11:17:35] (PS2) Jcrespo: mariadb backups: Execute tar in parallel [puppet] - https://gerrit.wikimedia.org/r/422123 (https://phabricator.wikimedia.org/T189383) [11:18:17] (PS2) Alexandros Kosiaris: ci: Refactor pipeline deps using separate CI role [puppet] - https://gerrit.wikimedia.org/r/421973 (https://phabricator.wikimedia.org/T188936) (owner: Dduvall) [11:18:25] (CR) Alexandros Kosiaris: [C: 2] ci: Refactor pipeline deps using separate CI role [puppet] - https://gerrit.wikimedia.org/r/421973 (https://phabricator.wikimedia.org/T188936) (owner: Dduvall) [11:19:00] (CR) Alexandros Kosiaris: [V: 2 C: 2] ci: Refactor pipeline deps using separate CI role [puppet] - https://gerrit.wikimedia.org/r/421973 (https://phabricator.wikimedia.org/T188936) (owner: Dduvall) [11:19:52] (PS3) Jcrespo: mariadb backups: Execute tar in parallel [puppet] - https://gerrit.wikimedia.org/r/422123 (https://phabricator.wikimedia.org/T189383) [11:21:51] RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:22:02] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:23:52] (CR) Jcrespo: [C: 2] mariadb backups: Execute tar in parallel [puppet] - https://gerrit.wikimedia.org/r/422123 (https://phabricator.wikimedia.org/T189383) (owner: Jcrespo) [11:27:07] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4084098 (MarcoAurelio) @Joe Thanks for having a look.
I however don't know what to do to fix this. Any tips? Thanks! [11:32:41] (03PS4) 10MarcoAurelio: Disable AbuseFilter from collecting IP addresses on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421864 (https://phabricator.wikimedia.org/T188862) [11:33:22] 10Operations, 10Gerrit: git pull fails for MW core with "fatal: protocol error: bad pack header" when local branches point to remote branches that no more exist - https://phabricator.wikimedia.org/T190676#4084113 (10Aklapper) I propose to decline this task as this is an upstream issue. [11:36:29] !log installing ICU security updates [11:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:24] (03PS1) 10Arturo Borrero Gonzalez: wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) [11:42:00] (03CR) 10jerkins-bot: [V: 04-1] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) (owner: 10Arturo Borrero Gonzalez) [11:44:43] (03PS2) 10Arturo Borrero Gonzalez: wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) [11:45:18] (03CR) 10jerkins-bot: [V: 04-1] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) (owner: 10Arturo Borrero Gonzalez) [11:52:53] 10Operations, 10Gerrit: git pull fails for MW core with "fatal: protocol error: bad pack header" when local branches point to remote branches that no more exist - https://phabricator.wikimedia.org/T190676#4084154 (10Paladox) This has nothing to do with gerrit now, the problem lies in git not doing this behavio... 
[11:55:04] (03PS3) 10Arturo Borrero Gonzalez: wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) [11:55:39] (03CR) 10jerkins-bot: [V: 04-1] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) (owner: 10Arturo Borrero Gonzalez) [12:14:56] (03CR) 10Catrope: [C: 031] Configure maps source for localized labels [puppet] - 10https://gerrit.wikimedia.org/r/420315 (https://phabricator.wikimedia.org/T112948) (owner: 10Sbisson) [12:18:36] (03PS6) 10Sbisson: Configure maps source for localized labels [puppet] - 10https://gerrit.wikimedia.org/r/420315 (https://phabricator.wikimedia.org/T112948) [12:25:11] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 32, down: 1, dormant: 0, excluded: 1, unused: 0 [12:26:32] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100% [12:28:12] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 36, down: 0, dormant: 0, excluded: 1, unused: 0 [12:28:51] (03PS1) 10Elukey: eventlogging: move alarms from graphite to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) [12:31:41] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 87.56 ms [12:35:39] (03PS4) 10Arturo Borrero Gonzalez: wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) [12:36:15] (03CR) 10jerkins-bot: [V: 04-1] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) (owner: 10Arturo Borrero Gonzalez) [12:43:28] hey folks, is it allowed to query a hiera key when setting another hiera key in the .yaml file? 
i.e https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/e6e274023030d53ff590b1341b5bcda83754cc8c%5E%21/#F4 [12:45:32] arturo: I am not sure if it works or not but it seems a bit confusing :( [12:46:11] elukey: alternative? I would like to avoid having the same values all over different files [12:47:26] perhaps I could simply get rid of the intermediate variable [12:49:10] arturo: could be an option, it seems a use case for a more global hiera variable [12:49:28] ok thanks elukey [12:59:05] btw I'm going to failover the ca/private server back to puppetmaster1001, disabling puppet for a short while fleetwide [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180327T1300). [13:00:04] Daimona: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:18] doh, right on time [13:00:28] ok nevermind I'll hold off for swat, just in case [13:00:32] :) [13:00:37] I can SWAT today [13:00:44] Hi :) [13:00:59] zeljkof: nice, can you let me know when done with swat and I'll resume? thanks! [13:01:18] godog: sure [13:01:28] 10Operations, 10Gerrit: git pull fails for MW core with "fatal: protocol error: bad pack header" when local branches point to remote branches that no more exist - https://phabricator.wikimedia.org/T190676#4084291 (10daniel) @Ladsgroup I agree that it's a bug and should be fixed. But if I understand correctly,... [13:01:31] Daimona: hi! 
[13:01:32] PROBLEM - Disk space on elastic1026 is CRITICAL: DISK CRITICAL - free space: /srv 62269 MB (12% inode=99%) [13:02:13] 10Operations, 10Goal, 10Patch-For-Review, 10User-Elukey, 10User-fgiunchedi: Stop using jmx_exporter deployed via scap in favour of Debian package - https://phabricator.wikimedia.org/T181728#4084292 (10elukey) So `lsof -X / | grep jmx_prometheus` on restbase* shows only ``` java 17424 cassand... [13:02:48] Daimona: I'll review, merge and deploy your changes one by one to mwdebug1002 [13:02:59] Alright, thanks [13:03:04] I'll let you know as soon as each patch is there for testing [13:03:20] do you know how to test there? (I can find link to docs) [13:03:31] (03CR) 10Filippo Giunchedi: "Hosts that will be affected:" [puppet] - 10https://gerrit.wikimedia.org/r/422112 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [13:03:50] (03PS5) 10Zfilipin: Enable AbuseFilter runtime profile on more Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420672 (https://phabricator.wikimedia.org/T175954) (owner: 10Daimona Eaytoy) [13:03:51] Nope [13:04:12] Daimona: this is the documentation https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [13:04:44] Many thanks [13:04:52] the important thing is to install the browser plugin, then in the plugin user interface, you select mwdebug1002 and go to a page, your request will be served by mwdebug1002 [13:05:52] Alright [13:05:56] let me know if you have any questions [13:05:56] I just installed the extension [13:06:14] you have to enable it (there is on/off button) [13:06:24] and then select a host, we only use mwdebug1002 [13:06:25] So basically I just have to use the wiki with that plugin switched on? 
[13:06:32] on mwdebug1002 [13:06:39] yes, on and pointing to mwdebug1002 [13:06:46] Alright [13:07:03] but not yet, nothing is there yet, will be in a few minutes [13:07:09] Of course :) [13:07:11] Just a note [13:07:23] Which should I select amongst profile, readonly and log [13:07:24] ? [13:08:08] you can ignore those checkboxes :) [13:08:12] I never click them [13:09:00] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420672 (https://phabricator.wikimedia.org/T175954) (owner: 10Daimona Eaytoy) [13:09:12] Oh nice, many thanks :) [13:09:19] Daimona: this ^ means the commit will get merged soon [13:09:36] you can see jobs running here https://integration.wikimedia.org/zuul/ [13:09:38] Yeah [13:09:42] I'll wait for jenkins [13:09:47] look for gerrit number 420672 [13:10:40] (03Merged) 10jenkins-bot: Enable AbuseFilter runtime profile on more Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420672 (https://phabricator.wikimedia.org/T175954) (owner: 10Daimona Eaytoy) [13:10:44] (03CR) 10Ottomata: "Ooo, thank you." 
[puppet] - 10https://gerrit.wikimedia.org/r/422098 (owner: 10Elukey) [13:10:54] (03CR) 10jenkins-bot: Enable AbuseFilter runtime profile on more Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420672 (https://phabricator.wikimedia.org/T175954) (owner: 10Daimona Eaytoy) [13:11:04] this ^ means it is merged, I will deploy it in a minute or so [13:11:24] Alright, I'll wait for the log [13:11:54] Daimona: 420672 is at mwdebug1002, please test and let me know if I can deploy it [13:12:10] (03PS8) 10Muehlenhoff: Allow to selectively run time servers on Chrony [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) [13:12:30] (03CR) 10Muehlenhoff: Allow to selectively run time servers on Chrony (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [13:13:33] zeljkof: I can't see any problem, however this one should probably be checked on grafana [13:15:12] (03CR) 10Ottomata: [C: 031] "Some nits, but +1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [13:15:25] Daimona: do you have a link? what should I check for? [13:15:54] Profiling stats should be manually added here https://grafana.wikimedia.org/dashboard/db/mediawiki-abusefilter-profiling?orgId=1 [13:16:00] A visual review should be enough for this one [13:16:25] Daimona: I'm not really familiar with grafana [13:16:37] I am not sure how to add it, or how to check it [13:17:11] do you know who can check it? [13:17:19] Nor I am. I asked to kindly add missing wikis on phab task [13:17:34] To dmaza [13:17:41] RECOVERY - Disk space on elastic1026 is OK: DISK OK [13:18:07] (03CR) 10Elukey: eventlogging: move alarms from graphite to prometheus (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [13:18:25] Daimona: can you access grafana? 
maybe it is there, but I do not see it [13:18:34] (03PS9) 10Muehlenhoff: Allow to selectively run time servers on Chrony [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) [13:18:56] No, I don't have admin access [13:18:58] (03CR) 10Muehlenhoff: "Also added some Package statements to ensure that ntp is removed before chrony is installed (since the two conflict on the dpkg level)" [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [13:19:22] I can see enwiki, ptwiki, wikidata, commons, mediawiki, meta, testwiki, eswiki, but that is all [13:19:35] Yeah, right now [13:19:42] no, that is eswikibooks [13:19:45] not eswiki [13:20:01] es, de and it wikis should be added as well [13:20:01] no de, es, or it wikis [13:20:21] I don't know how to do that [13:20:24] They should be manually added, basically by copying existing graphs, I think [13:20:47] We may do it later, as some data will be recorded [13:21:06] (03PS2) 10Elukey: eventlogging: move alarms from graphite to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) [13:21:12] can I deploy this without grafana changes? or should I revert it until grafana is updated? [13:22:02] 10Operations, 10Traffic, 10User-zeljkofilipin: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4084338 (10BBlack) So, recapping a bit what's already been mentioned above: the proximate cause of behavior change was the AE:gzip... [13:22:20] (03CR) 10Filippo Giunchedi: "Some puppet-y comments but LGTM otherwise." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [13:22:27] Daimona: can I deploy this without grafana changes? or should I revert it until grafana is updated? 
[13:22:29] I think you can deploy, when this was first enabled on some wikis grafana hadn't been available for some time [13:22:50] So basically it'll just record data without showing, nothing bad [13:22:59] Daimona: ok, deploying then, please remind at the task that the graphs should be added [13:23:01] I'll also ping dmaza on phab [13:23:07] Daimona: deploying [13:23:10] Yeah, I will right now [13:24:07] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:420672|Enable AbuseFilter runtime profile on more Wikis (T175954)]] (duration: 00m 58s) [13:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:15] T175954: Enable AbuseFilter runtime measurement (Grafana) on more wikis of other languages - https://phabricator.wikimedia.org/T175954 [13:24:19] Daimona: ^ this means it is deployed [13:24:36] please turn off the browser plugin and check the wikis [13:24:47] I will review/merge/deploy the second patch [13:24:58] Thanks, I'm checking [13:25:22] it took us 24 minutes for the first patch, we have to speed it up if more than 2-3 patches should be deployed today (the swat is 60 minutes) [13:25:35] Sounds like everything is fine [13:25:38] Yeah, indeed [13:25:42] (03PS4) 10Zfilipin: Enable $wgAbuseFilterProfile on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420687 (https://phabricator.wikimedia.org/T190137) (owner: 10Daimona Eaytoy) [13:26:03] The backports are already tested, being in wmf.26 [13:26:47] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420687 (https://phabricator.wikimedia.org/T190137) (owner: 10Daimona Eaytoy) [13:28:00] (03Merged) 10jenkins-bot: Enable $wgAbuseFilterProfile on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420687 (https://phabricator.wikimedia.org/T190137) (owner: 10Daimona Eaytoy) [13:28:11] PROBLEM - Disk space on elastic1020 is CRITICAL: DISK CRITICAL - free space: /srv 61862 MB (12% inode=99%) 
[13:28:22] (03CR) 10jenkins-bot: Enable $wgAbuseFilterProfile on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420687 (https://phabricator.wikimedia.org/T190137) (owner: 10Daimona Eaytoy) [13:28:52] Ready for testing [13:29:05] Daimona: 420687 is at mwdebug1002 [13:29:36] Testing [13:29:58] Seems fine [13:30:12] Daimona: ok, deploying [13:30:20] ^ [13:30:48] !log deactivate/clean iridium.eqiad.wmnet -- decom'd [13:30:49] (03CR) 10Elukey: [C: 031] base: enable exporting SMART metrics by default [puppet] - 10https://gerrit.wikimedia.org/r/422112 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [13:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:29] !log zfilipin@tin Synchronized wmf-config/abusefilter.php: SWAT: [[gerrit:420687|Enable $wgAbuseFilterProfile on itwiki (T190137)]] (duration: 00m 57s) [13:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:37] T190137: Enable AbuseFilter per-filter profiling on Italian Wikipedia & monitor if there is a performance impact - https://phabricator.wikimedia.org/T190137 [13:31:54] Daimona: 420687 is deployed, please check [13:32:20] (03PS3) 10Zfilipin: Change wording for AbuseFilter global block durations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421691 (https://phabricator.wikimedia.org/T190602) (owner: 10Daimona Eaytoy) [13:32:34] Checked, everything is fine [13:32:57] This one I won't probably be able to test online. However, very low risk and tested locally [13:33:11] PROBLEM - Disk space on elastic1020 is CRITICAL: DISK CRITICAL - free space: /srv 60944 MB (12% inode=99%) [13:33:50] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421691 (https://phabricator.wikimedia.org/T190602) (owner: 10Daimona Eaytoy) [13:34:13] Daimona: 421691 is not testable? not at mwdebug1002, or when deployed? 
[13:34:24] not by me [13:34:35] I don't have enough rights on involved wikis [13:35:10] (03Merged) 10jenkins-bot: Change wording for AbuseFilter global block durations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421691 (https://phabricator.wikimedia.org/T190602) (owner: 10Daimona Eaytoy) [13:35:19] Daimona: ok, in that case I'll deploy, but please ask relevant people to check if it works for them [13:35:26] Of course [13:36:11] PROBLEM - Disk space on elastic1020 is CRITICAL: DISK CRITICAL - free space: /srv 61823 MB (12% inode=99%) [13:37:13] !log zfilipin@tin Synchronized wmf-config/abusefilter.php: SWAT: [[gerrit:421691|Change wording for AbuseFilter global block durations (T190602)]] (duration: 00m 57s) [13:37:20] Daimona: 421691 is deployed [13:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:20] T190602: Block duration defaults to 2 hours - https://phabricator.wikimedia.org/T190602 [13:37:34] Nice, thanks [13:37:37] 10Operations, 10Traffic, 10User-zeljkofilipin: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4084401 (10Gilles) Sure, the behavior can be reverted in these ways, but the rationale to have more human-friendly error pages shoul... [13:37:54] Daimona: CI is still running for 420045 [13:38:12] PROBLEM - Disk space on elastic1020 is CRITICAL: DISK CRITICAL - free space: /srv 60778 MB (12% inode=99%) [13:38:12] (03CR) 10jenkins-bot: Change wording for AbuseFilter global block durations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421691 (https://phabricator.wikimedia.org/T190602) (owner: 10Daimona Eaytoy) [13:38:26] Yeah it usually takes quite long [13:40:09] BTW, as I was saying, this and the next one have been already fixed (and tested) in wmf.26, so they don't need any further testing [13:42:43] Daimona: we have a problem [13:42:56] What? 
[13:43:06] oh, no, I made a mistake [13:43:17] Pheeew :-D [13:43:21] I or did? [13:43:25] let me check something [13:43:30] Fine [13:43:41] no, I don't think I made a mistake [13:43:45] so [13:44:01] why are patches for wmf.25 created? [13:44:08] when all wikis are at wmf.26? [13:44:15] see https://tools.wmflabs.org/versions/ [13:45:19] I know [13:45:32] I thought we shouldn't leave the version bugged [13:45:58] Daimona: ok, but I have no place to deploy it [13:46:11] Yeah right [13:46:12] so how is this scheduled for SWAT? by mistake? [13:46:35] Misunderstanding [13:46:52] Should've been set up for last week [13:47:21] ok, I'll revert the commit I have already merged [13:47:34] I guess that concludes this SWAT window then :) [13:47:45] Daimona: thanks for deploying with #releng :) [13:47:59] (that's short for #wikimedia-releng) [13:48:11] Indeed [13:48:15] Many thanks :) [13:49:33] Daimona: was this your first swat? [13:49:39] Yeah [13:49:51] it went fairly well then :) [13:50:06] I think so :) [13:50:09] hopefully nothing broken that we did not notice [13:50:12] The first of many, I hope [13:50:27] I think it's almost impossible, they were all low risk changes [13:50:36] you would be surprised...
:D [13:51:33] I hope not :D [13:53:00] (03PS10) 10Muehlenhoff: Allow to selectively run time servers on Chrony [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) [13:54:34] !log ppchelko@tin Started restart [restbase/deploy@e19bad9]: Restart to verify that misterious deploy timeouts still happen [13:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:18] RECOVERY - Disk space on elastic1020 is OK: DISK OK [14:01:58] !log ppchelko@tin Started deploy [restbase/deploy@e19bad9]: Deploy without feed check to verify that misterious deploy timeouts still happen [14:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:54] (03PS11) 10Muehlenhoff: Allow to selectively run time servers on Chrony [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) [14:04:19] !log EU SWAT finished [14:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:36] godog: done! [14:04:42] zeljkof: awesome, thanks! 
[14:05:54] (03PS3) 10Ema: VCL: use hfp only for uncacheable candidates for conditional requests [puppet] - 10https://gerrit.wikimedia.org/r/421542 (https://phabricator.wikimedia.org/T180712) [14:11:20] (03PS12) 10Muehlenhoff: Allow to selectively run time servers on Chrony [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) [14:12:50] !log ppchelko@tin Finished deploy [restbase/deploy@e19bad9]: Deploy without feed check to verify that misterious deploy timeouts still happen (duration: 10m 52s) [14:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:25] 10Operations, 10Traffic, 10User-zeljkofilipin: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4084478 (10BBlack) Yeah, I guess I was taking it as implicitly true that it's correct for MW to desire content-free 404s in the case... [14:14:36] (03CR) 10Ori.livneh: "Ping?" [puppet] - 10https://gerrit.wikimedia.org/r/421981 (owner: 10Ori.livneh) [14:16:59] (03CR) 10Filippo Giunchedi: [C: 031] Allow to selectively run time servers on Chrony [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [14:18:02] (03CR) 10Imarlier: "> Ping?" [puppet] - 10https://gerrit.wikimedia.org/r/421981 (owner: 10Ori.livneh) [14:20:50] (03CR) 10Ori.livneh: "Not sure I follow. Are you saying that restarting it now would cause 5 minutes of data to be lost irretrievably, whereas the data that is " [puppet] - 10https://gerrit.wikimedia.org/r/421981 (owner: 10Ori.livneh) [14:22:47] (03CR) 10Imarlier: "> Not sure I follow. 
Are you saying that restarting it now would" [puppet] - 10https://gerrit.wikimedia.org/r/421981 (owner: 10Ori.livneh) [14:36:20] !log rebooting labpuppetmaster1002 for T189115 [14:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:42] (03PS1) 10Jcrespo: mariadb backups: Execute 2 backups concurrently [puppet] - 10https://gerrit.wikimedia.org/r/422152 (https://phabricator.wikimedia.org/T189384) [14:37:26] (03CR) 10jerkins-bot: [V: 04-1] mariadb backups: Execute 2 backups concurrently [puppet] - 10https://gerrit.wikimedia.org/r/422152 (https://phabricator.wikimedia.org/T189384) (owner: 10Jcrespo) [14:37:53] (03PS1) 10Catrope: Use maps-beta to serve maps in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422154 [14:38:26] PROBLEM - Host labpuppetmaster1002 is DOWN: CRITICAL - Host Unreachable (208.80.155.120) [14:39:12] (03PS1) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) [14:39:36] RECOVERY - Host labpuppetmaster1002 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [14:39:46] (03CR) 10jerkins-bot: [V: 04-1] mtail: Provide ttfb histogram for varnishbackend [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [14:41:37] PROBLEM - Keyholder SSH agent on labpuppetmaster1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [14:42:27] (03PS1) 10Muehlenhoff: Update to 1.1.0h [debs/openssl11] - 10https://gerrit.wikimedia.org/r/422157 [14:44:46] RECOVERY - Keyholder SSH agent on labpuppetmaster1002 is OK: OK: Keyholder is armed with all configured keys. 
[14:44:47] (03CR) 10Muehlenhoff: [C: 032] Update to 1.1.0h [debs/openssl11] - 10https://gerrit.wikimedia.org/r/422157 (owner: 10Muehlenhoff) [14:45:47] !log rebooting labpuppetmaster1001 for T189115 [14:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:09] 10Operations, 10DNS, 10Mail, 10Traffic, 10Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4084586 (10RobH) Tim emailed about this a couple of weeks ago, and I sent out another email to them regarding this. Hopefully it gets some movement soon. [14:52:20] (03PS1) 10Ottomata: Install nrpe check for Kafka consumer lag by checking burrow [puppet] - 10https://gerrit.wikimedia.org/r/422163 (https://phabricator.wikimedia.org/T189611) [14:52:56] (03CR) 10jerkins-bot: [V: 04-1] Install nrpe check for Kafka consumer lag by checking burrow [puppet] - 10https://gerrit.wikimedia.org/r/422163 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [14:53:00] (03PS2) 10Jcrespo: mariadb backups: Execute 2 backups concurrently [puppet] - 10https://gerrit.wikimedia.org/r/422152 (https://phabricator.wikimedia.org/T189384) [14:53:03] 10Operations, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#4084601 (10elukey) [14:53:28] (03CR) 10jerkins-bot: [V: 04-1] mariadb backups: Execute 2 backups concurrently [puppet] - 10https://gerrit.wikimedia.org/r/422152 (https://phabricator.wikimedia.org/T189384) (owner: 10Jcrespo) [15:01:27] (03PS1) 10Muehlenhoff: Update def_buf_8k.patch to 1.1.0h [debs/openssl11] - 10https://gerrit.wikimedia.org/r/422164 [15:01:32] (03PS3) 10Jcrespo: mariadb backups: Execute 2 backups concurrently [puppet] - 10https://gerrit.wikimedia.org/r/422152 (https://phabricator.wikimedia.org/T189384) [15:03:13] (03PS2) 10Alexandros Kosiaris: ci: Add kubernetes deployment classes to CI [puppet] - 
10https://gerrit.wikimedia.org/r/422100 (https://phabricator.wikimedia.org/T184924) [15:03:15] (03PS1) 10Alexandros Kosiaris: calico: Use the kubelet specific kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/422165 [15:06:52] ok I'm going to failover the puppet CA back to eqiad shortly, and stopping puppet fleetwide while doing so [15:07:03] please LMK if that's not ok [15:09:42] 10Operations, 10Traffic, 10User-zeljkofilipin: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4084655 (10Ragesoss) I think the new behavior isn't a significant problem; it'll only really be an issue for anyone who was relying... [15:10:12] !log stop puppet fleetwide for CA failover - T189891 [15:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:18] T189891: Failover puppet ca service from eqiad to codfw - https://phabricator.wikimedia.org/T189891 [15:12:06] (03CR) 10Muehlenhoff: [C: 032] Update def_buf_8k.patch to 1.1.0h [debs/openssl11] - 10https://gerrit.wikimedia.org/r/422164 (owner: 10Muehlenhoff) [15:13:46] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [15:13:56] looking ^ [15:14:30] (03PS3) 10Filippo Giunchedi: Revert "hieradata: use puppetmaster2001 as ca_server" [puppet] - 10https://gerrit.wikimedia.org/r/421917 (https://phabricator.wikimedia.org/T189891) [15:15:29] (03CR) 10Filippo Giunchedi: [C: 032] Revert "hieradata: use puppetmaster2001 as ca_server" [puppet] - 10https://gerrit.wikimedia.org/r/421917 (https://phabricator.wikimedia.org/T189891) (owner: 10Filippo Giunchedi) [15:16:35] (03CR) 10Jcrespo: [C: 032] mariadb backups: Execute 2 backups concurrently [puppet] - 10https://gerrit.wikimedia.org/r/422152 
(https://phabricator.wikimedia.org/T189384) (owner: 10Jcrespo) [15:16:41] (03PS4) 10Jcrespo: mariadb backups: Execute 2 backups concurrently [puppet] - 10https://gerrit.wikimedia.org/r/422152 (https://phabricator.wikimedia.org/T189384) [15:17:18] jynus: almost done with my change btw, reenabling puppet shortly [15:17:53] I am programming, so not affected really [15:18:05] if I can merge, even if I cannot deploy? [15:18:42] jynus: yeah puppet-merge works fine, puppet agent are stopped tho [15:18:48] that's ok [15:19:29] (03CR) 10Vgutierrez: "Re-run tests after https://gerrit.wikimedia.org/r/#/c/422169/ is merged" [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [15:20:56] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for RI-maintained services - https://phabricator.wikimedia.org/T189524#4084744 (10Tgr) I don't think Icinga is applicable to reading lists. It's normally used to check whether services ar... 
[15:21:24] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4084760 (10Papaul) [15:21:26] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4084758 (10Papaul) 05Open>03Resolved Complete [15:23:29] !log reenable puppet fleetwide for CA failover - T189891 [15:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:34] T189891: Failover puppet ca service from eqiad to codfw - https://phabricator.wikimedia.org/T189891 [15:30:26] 10Puppet: Investigate wrong location for /srv/private post-receive hook in puppetmaster::gitclone - https://phabricator.wikimedia.org/T190157#4084895 (10fgiunchedi) See also the related review that spew this task, we should make sure commits are not enabled on `/srv/private` for non-ca hosts. [15:35:38] (03PS2) 10Filippo Giunchedi: Revert "Move config-master to codfw" [dns] - 10https://gerrit.wikimedia.org/r/421918 (https://phabricator.wikimedia.org/T184562) [15:35:41] !log Bumping operations-puppet-tests-docker job to docker-registry.wikimedia.org/releng/operations-puppet:0.3.1 | https://gerrit.wikimedia.org/r/#/c/422169/ | ping vgutierrez [15:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:48] (03CR) 10Filippo Giunchedi: [C: 032] Revert "Move config-master to codfw" [dns] - 10https://gerrit.wikimedia.org/r/421918 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [15:36:21] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [15:37:04] (03CR) 10jerkins-bot: [V: 04-1] mtail: Provide ttfb histogram for varnishbackend [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [15:37:25] (03PS3) 10Filippo Giunchedi: 
Revert "cache: depool puppetmaster1001 from config-master.w.o" [puppet] - 10https://gerrit.wikimedia.org/r/421919 (https://phabricator.wikimedia.org/T184562) [15:38:13] (03CR) 10Filippo Giunchedi: [C: 032] Revert "cache: depool puppetmaster1001 from config-master.w.o" [puppet] - 10https://gerrit.wikimedia.org/r/421919 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [15:38:22] (03CR) 10Jcrespo: [C: 04-1] "ddl method works if the alter is succesful, but if it encounters a problem (e.g. the alter fails, it throws an exception):" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 (owner: 10Rduran) [15:41:04] (03PS2) 10Ottomata: Install nrpe check for Kafka consumer lag by checking burrow [puppet] - 10https://gerrit.wikimedia.org/r/422163 (https://phabricator.wikimedia.org/T189611) [15:41:51] (03CR) 10jerkins-bot: [V: 04-1] Install nrpe check for Kafka consumer lag by checking burrow [puppet] - 10https://gerrit.wikimedia.org/r/422163 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [15:41:56] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [15:42:03] (03PS2) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) [15:42:05] (03CR) 10Jcrespo: [C: 04-1] "The percona method now works, although it would be nice to print the debug information about that, both on success and on error (I think p" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 (owner: 10Rduran) [15:42:15] (03PS2) 10Alexandros Kosiaris: calico: Use the kubelet specific kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/422165 [15:42:30] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] calico: Use the kubelet specific kubeconfig 
[puppet] - 10https://gerrit.wikimedia.org/r/422165 (owner: 10Alexandros Kosiaris) [15:44:08] (03CR) 10Jcrespo: [C: 04-1] "BTW, to simulate an error, I run "add column X int" twice, the first one will be successful, the second will fail because you cannot have " [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 (owner: 10Rduran) [15:44:44] (03PS3) 10Ottomata: Install nrpe check for Kafka consumer lag by checking burrow [puppet] - 10https://gerrit.wikimedia.org/r/422163 (https://phabricator.wikimedia.org/T189611) [15:47:42] (03PS1) 10Filippo Giunchedi: puppetmaster: use puppetmaster1001 as test server [puppet] - 10https://gerrit.wikimedia.org/r/422175 [15:48:16] (03CR) 10Filippo Giunchedi: [C: 032] puppetmaster: use puppetmaster1001 as test server [puppet] - 10https://gerrit.wikimedia.org/r/422175 (owner: 10Filippo Giunchedi) [15:48:26] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:07] that's me ^ [15:51:50] (03PS1) 10Muehlenhoff: Update symbols for 1.1.0h [debs/openssl11] - 10https://gerrit.wikimedia.org/r/422177 [15:53:26] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:41] (03PS5) 10Arturo Borrero Gonzalez: [WIP] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) [15:55:12] (03CR) 10jerkins-bot: [V: 04-1] [WIP] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) (owner: 10Arturo Borrero Gonzalez) [15:57:05] (03PS1) 10Chad: Adding reviewers-by-blame plugin [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422178 (https://phabricator.wikimedia.org/T101131) [15:58:14] (03PS2) 10Filippo Giunchedi: wmnet: point esams puppet to eqiad [dns] - 
10https://gerrit.wikimedia.org/r/421060 (https://phabricator.wikimedia.org/T184562) [15:58:30] (03CR) 10Filippo Giunchedi: [C: 032] wmnet: point esams puppet to eqiad [dns] - 10https://gerrit.wikimedia.org/r/421060 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [15:58:55] !log point esams puppet agent traffic to eqiad [15:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] godog, moritzm, and _joe_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180327T1600). [16:00:04] stephanebisson: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:06] (03CR) 10Paladox: [C: 031] Adding reviewers-by-blame plugin [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422178 (https://phabricator.wikimedia.org/T101131) (owner: 10Chad) [16:00:32] hello [16:00:55] (03CR) 10Paladox: [C: 031] Adding reviewers-by-blame plugin (031 comment) [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422178 (https://phabricator.wikimedia.org/T101131) (owner: 10Chad) [16:03:22] (03PS1) 10Chad: Adding motd plugin [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/422179 (https://phabricator.wikimedia.org/T190810) [16:03:56] (03CR) 10Paladox: [C: 031] Adding motd plugin [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/422179 (https://phabricator.wikimedia.org/T190810) (owner: 10Chad) [16:05:08] stephanebisson: I'm looking at your patch btw [16:05:35] stephanebisson: what's the plan for testing/validating after merge? [16:06:03] godog: I'll do a deployment on maps-test* servers and check it there. 
[16:07:06] stephanebisson: kk, I'm running the puppet compiler to validate the change [16:07:35] godog: please share the command you are using so I can also do it in the future [16:08:32] stephanebisson: for sure, the thing I'm talking about is https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler [16:08:39] specifically point #2 [16:08:45] great [16:09:56] stephanebisson: so I think the change will also go on maps (non -test) too, but I'm verifying that, is that known/expected? [16:10:06] (03PS2) 10Chad: Adding reviewers-by-blame plugin [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422178 (https://phabricator.wikimedia.org/T101131) [16:10:41] godog: yes, it's the goal after it's tested on the test servers [16:11:44] stephanebisson: I think that the concern is getting the puppet change on maps while testing on maps-test [16:12:12] indeed, what elukey said [16:12:18] godog: would that change have any effect before I do a deployment? [16:13:05] stephanebisson: I don't know, puppet runs on its own schedule every 30 min more or less [16:13:35] ok, abort mission [16:14:07] haha ok, thanks for participating [16:14:07] (03CR) 10Paladox: [C: 031] Adding reviewers-by-blame plugin [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422178 (https://phabricator.wikimedia.org/T101131) (owner: 10Chad) [16:14:28] godog: I need to push that config change on a very precise schedule, I'll investigate how that can be done [16:14:53] stephanebisson: FWIW this patch is out of scope for puppet swat at the moment I think, https://wikitech.wikimedia.org/wiki/PuppetSWAT [16:15:06] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 59477 MB (12% inode=99%) [16:15:29] stephanebisson: finding SREs who know maps would be better to help with the deployment [16:16:19] 10Operations, 10ops-codfw, 10Traffic: cp2006, cp2010, cp2017: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4085174 
(10ema) [16:16:38] (03PS2) 10Filippo Giunchedi: Point wikimedia.org and eqiad puppet to eqiad [dns] - 10https://gerrit.wikimedia.org/r/421061 (https://phabricator.wikimedia.org/T184562) [16:16:46] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6 [16:16:46] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6 [16:16:46] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6 [16:16:47] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6 [16:16:47] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6 [16:16:55] I usually work with Guillaume but he is on vacation how. I think storing this file in puppet is just not ideal. [16:17:00] *now [16:17:26] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp5002_v6 [16:17:38] stephanebisson: I see, then yeah he can help for sure [16:17:46] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 66 ESP OK [16:17:46] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 66 ESP OK [16:17:47] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK [16:17:47] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK [16:17:47] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 66 ESP OK [16:18:16] 10Operations, 10ops-codfw, 10Traffic: cp2006, cp2010, cp2017: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4076372 (10ema) Same issue on cp2017 today. Host depooled. ``` 6 | Sep-28-2015 | 20:10:59 | ECC Uncorr Err | Memory | Uncorrectable memory error ; OEM Ev... 
[16:18:24] (03CR) 10Filippo Giunchedi: [C: 032] Point wikimedia.org and eqiad puppet to eqiad [dns] - 10https://gerrit.wikimedia.org/r/421061 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [16:18:27] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 66 ESP OK [16:18:45] !log point eqiad puppet traffic to eqiad [16:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:56] the IPsec spam is due to cp2017's reboot taking longer because of memory issues -> T190540 [16:19:57] T190540: cp2006, cp2010, cp2017: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540 [16:22:19] (03CR) 10Nuria: eventlogging: move alarms from graphite to prometheus (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [16:22:59] (03Abandoned) 10Sbisson: Configure maps source for localized labels [puppet] - 10https://gerrit.wikimedia.org/r/420315 (https://phabricator.wikimedia.org/T112948) (owner: 10Sbisson) [16:23:04] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), 10User-Eevans: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4085196 (10faidon) a:05Eevans>03RobH [16:25:47] 10Operations, 10Gerrit: git pull fails for MW core with "fatal: protocol error: bad pack header" when local branches point to remote branches that no more exist - https://phabricator.wikimedia.org/T190676#4085208 (10Ladsgroup) 05Open>03declined If you think so I'm fine but please document this somewhere so... 
[16:26:36] (03CR) 10Elukey: eventlogging: move alarms from graphite to prometheus (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [16:27:12] 10Operations, 10HHVM, 10User-Elukey, 10User-notice: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4085213 (10Ladsgroup) It's safe to say I'm a Persian Wikipedia community member (with 56K edits). I will test it and let you know ASAP [16:28:05] (03CR) 10Ottomata: "Hmm, looks good! https://puppet-compiler.wmflabs.org/compiler03/10694/" [puppet] - 10https://gerrit.wikimedia.org/r/422163 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [16:31:20] (03PS1) 10Ladsgroup: Enable Translate extension in amwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422181 (https://phabricator.wikimedia.org/T180879) [16:31:32] (03CR) 10jerkins-bot: [V: 04-1] Enable Translate extension in amwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422181 (https://phabricator.wikimedia.org/T180879) (owner: 10Ladsgroup) [16:33:29] (03CR) 10Chad: [V: 032 C: 032] Adding reviewers-by-blame plugin [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422178 (https://phabricator.wikimedia.org/T101131) (owner: 10Chad) [16:33:59] (03PS2) 10Ladsgroup: Enable Translate extension in amwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422181 (https://phabricator.wikimedia.org/T180879) [16:36:38] (03PS1) 10Chad: Adding motd plugin [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422182 [16:36:52] (03CR) 10Catrope: [C: 032] Use maps-beta to serve maps in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422154 (owner: 10Catrope) [16:36:54] (03PS2) 10Chad: Adding motd plugin [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422182 [16:37:28] (03CR) 10Elukey: [C: 031] Install nrpe check for Kafka 
consumer lag by checking burrow [puppet] - 10https://gerrit.wikimedia.org/r/422163 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [16:38:08] (03Merged) 10jenkins-bot: Use maps-beta to serve maps in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422154 (owner: 10Catrope) [16:38:23] (03CR) 10jenkins-bot: Use maps-beta to serve maps in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422154 (owner: 10Catrope) [16:39:34] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), 10User-Eevans: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4085264 (10RobH) a:05RobH>03Eevans @eevens: restbase2009 was ordered with 5 Intel 1.6TB S3610 SSDs, not spinni... [16:39:58] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), 10User-Eevans: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4085273 (10RobH) a:05Eevans>03RobH [16:41:54] 10Operations, 10Analytics, 10Traffic: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843#4085283 (10Nuria) [16:42:18] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), 10User-Eevans: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4085294 (10RobH) I'm a bit confused on which system will get the replacement SSDs installed? I'm guessing its rest... 
[16:43:06] !log uploaded openssl 1.1.0h for jessie-wikimedia to apt.wikimedia.org [16:43:07] 10Operations, 10Analytics, 10Traffic: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843#4085301 (10Nuria) [16:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:37] 10Operations, 10Analytics, 10Traffic: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843#4085283 (10Nuria) Adding folks from traffic. [16:44:49] 10Operations, 10Analytics, 10Traffic: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843#4085324 (10Nuria) Seems like this is an "unofficial mirror". Looping legal [16:49:51] (03PS4) 10Ottomata: Install nrpe check for Kafka consumer lag by checking burrow [puppet] - 10https://gerrit.wikimedia.org/r/422163 (https://phabricator.wikimedia.org/T189611) [16:51:15] !log Running rsync catch up job for dumps from ms1001 to labstore1007 [16:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:56] 10Operations, 10Analytics, 10Traffic: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843#4085283 (10Jdlrobson) Note there is also https://ru.wiki.ng/ doing exactly the same thing (different host) [16:56:56] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 61113 MB (12% inode=99%) [16:58:45] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), 10User-Eevans: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4085392 (10Eevans) >>! In T189822#4085294, @RobH wrote: > I'm a bit confused on which system will get the replaceme... [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: It is that lovely time of the day again! 
You are hereby commanded to deploy Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180327T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:56] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 60800 MB (12% inode=99%) [17:00:58] (03CR) 10Chad: [V: 032 C: 032] Adding motd plugin [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422182 (owner: 10Chad) [17:01:05] (03PS4) 10Imarlier: [WIP] coal: be smarter about consuming from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) [17:01:27] (03PS5) 10Imarlier: coal: be smarter about consuming from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) [17:05:06] (03PS1) 10Chad: Move motd to proper location [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422184 [17:05:08] (03CR) 10Chad: [C: 032] Move motd to proper location [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422184 (owner: 10Chad) [17:05:11] (03CR) 10Chad: [V: 032 C: 032] Move motd to proper location [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422184 (owner: 10Chad) [17:08:58] (03CR) 10Ottomata: [C: 032] Install nrpe check for Kafka consumer lag by checking burrow [puppet] - 10https://gerrit.wikimedia.org/r/422163 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [17:10:49] krinkle: elukey: https://gerrit.wikimedia.org/r/421933 appears to address the issues that we've been having with coal, if you guys want to take a look. I'm running it by hand on both hafnium and on graphite1001 itself (just writing to whisper files in my own home dir) in order to verify that data continues to flow as expected, but I haven't seen anything that looks off. 
[17:13:00] RECOVERY - Disk space on elastic1019 is OK: DISK OK [17:13:13] (03PS1) 10Bstorm: toolforge: Add tmpreaper with a custom config to web nodes [puppet] - 10https://gerrit.wikimedia.org/r/422186 (https://phabricator.wikimedia.org/T190185) [17:14:47] (03PS2) 10Bstorm: toolforge: Add tmpreaper with a custom config to web nodes [puppet] - 10https://gerrit.wikimedia.org/r/422186 (https://phabricator.wikimedia.org/T190185) [17:20:01] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), 10User-Eevans: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4085471 (10Eevans) >>! In T189822#4085294, @RobH wrote: > The SSDs we get these days from Intels line are: https:/... [17:22:42] (03PS1) 10Muehlenhoff: Update to 1.0.2o [debs/openssl] - 10https://gerrit.wikimedia.org/r/422189 [17:24:08] (03CR) 10Muehlenhoff: [C: 032] Update to 1.0.2o [debs/openssl] - 10https://gerrit.wikimedia.org/r/422189 (owner: 10Muehlenhoff) [17:33:33] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), 10User-Eevans: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4085516 (10RobH) >>! In T189822#4085471, @Eevans wrote: > > It could be any of: > > - restbase1010 > - restbase10... [17:35:10] (03PS1) 10Chad: Add webhooks and go-import to plugin test deps [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422190 [17:35:20] (03CR) 10Chad: [V: 032 C: 032] Add webhooks and go-import to plugin test deps [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422190 (owner: 10Chad) [17:35:53] 10Operations, 10Ops-Access-Requests, 10Ops-Access-Reviews, 10Research, and 3 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4024896 (10Nuria) ping @DYNKM , this query will not run as written. please work with us to improve it . cc @JAllem... 
[17:36:08] (03CR) 10Catrope: [C: 031] Enable Flow on euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421606 (https://phabricator.wikimedia.org/T190500) (owner: 10Urbanecm) [17:38:14] 10Operations, 10Ops-Access-Requests, 10Ops-Access-Reviews, 10Research, and 3 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4085551 (10DYNKM) @Nuria sorry for the bother; I worked out an alternate/actually faster way of doing it anyhoo! [17:41:12] (03PS1) 10Ottomata: Add mirror_name and host as labels for mirror maker prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422192 (https://phabricator.wikimedia.org/T189611) [17:41:51] (03CR) 10jerkins-bot: [V: 04-1] Add mirror_name and host as labels for mirror maker prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422192 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [17:42:21] RECOVERY - Disk space on elastic1027 is OK: DISK OK [17:42:53] (03PS2) 10Ottomata: Add mirror_name and host as labels for mirror maker prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422192 (https://phabricator.wikimedia.org/T189611) [17:43:23] (03CR) 10jerkins-bot: [V: 04-1] Add mirror_name and host as labels for mirror maker prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422192 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [17:43:56] (03PS3) 10Ottomata: Add mirror_name and host as labels for mirror maker prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422192 (https://phabricator.wikimedia.org/T189611) [17:48:17] (03CR) 10Ottomata: [C: 032] "Looks right https://puppet-compiler.wmflabs.org/compiler03/10700/kafka1020.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/422192 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [17:51:57] !log disable 2FA from User:Céréales Killer [17:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:54] (03PS1) 10Bstorm: 
dynamicproxy: run logrotate hourly [puppet] - 10https://gerrit.wikimedia.org/r/422197 (https://phabricator.wikimedia.org/T190218) [17:55:55] !log rebooting labsdb1006 for T189115 [17:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:49] (03CR) 10Chad: [V: 032 C: 032] Adding motd plugin [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/422179 (https://phabricator.wikimedia.org/T190810) (owner: 10Chad) [17:59:29] !log demon@tin Started deploy [gerrit/gerrit@4910e7c]: motd plugin [17:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:40] !log demon@tin Finished deploy [gerrit/gerrit@4910e7c]: motd plugin (duration: 00m 11s) [17:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180327T1800) [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:02:34] (03PS1) 10Bstorm: wiki replicas: record grants to be added to index-management user [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650) [18:03:40] !log rebooting labsdb1007 for T189115 [18:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:04] (03PS1) 10Ottomata: Can't set labels on metric without name set [puppet] - 10https://gerrit.wikimedia.org/r/422201 (https://phabricator.wikimedia.org/T189611) [18:08:48] (03CR) 10Ottomata: [C: 032] Can't set labels on metric without name set [puppet] - 10https://gerrit.wikimedia.org/r/422201 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [18:12:50] (03PS2) 10Ppchelko: Disable redis queue for cirrusSearch jobs for test wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) [18:18:08] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint, and 3 others: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#4085808 (10Pnorman) a:03Pnorman Going to take another look at a couple of things here [18:21:44] (03PS2) 10Ori.livneh: coal: add a simple systemd watchdog notifier; set WatchdogSec=60 [puppet] - 10https://gerrit.wikimedia.org/r/421981 [18:22:21] (03PS1) 10Jdlrobson: Rollout VirtualPageViews (stage 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422206 (https://phabricator.wikimedia.org/T189906) [18:22:53] (03PS2) 10Jdlrobson: Rollout VirtualPageViews (stage 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422206 (https://phabricator.wikimedia.org/T189906) [18:23:00] (03CR) 10Jdlrobson: "ps2 adds polish to blacklist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422206 (https://phabricator.wikimedia.org/T189906) (owner: 10Jdlrobson) [18:23:36] robh / dr0ptp4kt: can either of you please help me login to the noc@wiki account? can't get past the 2nd factor auth code. please and thank you :) [18:25:24] (03PS2) 10Bstorm: wiki replicas: record grants and set user for maintain_indexes script [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650) [18:25:50] (03CR) 10jerkins-bot: [V: 04-1] wiki replicas: record grants and set user for maintain_indexes script [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [18:27:03] bearloga: noc@ isnt an account [18:27:05] its an alias to root [18:27:38] you mean noc@wikimedia.org right? 
[18:27:42] robh: yup yup [18:28:03] yeah, i mean, someone in oit could have made a noc@ inbox, but it is overridden on our mail servers to be an alias [18:28:21] and our alias file overrides everything else as authoritative (is my understanding) [18:28:41] so when someone emails noc, it also goes to either the root alias or just ops team... checking... [18:29:12] yeah, noc goes to root ;] [18:29:23] and root is pretty much the sre team [18:29:32] bearloga: What did you need to log in to the account for? [18:29:48] (continuing in PMs) [18:34:51] (03PS1) 10Bstorm: wiki-replicas: add user for index maintenance script [labs/private] - 10https://gerrit.wikimedia.org/r/422209 [18:35:21] (03CR) 10Ori.livneh: "Refactored the notifier into a simple standalone class. I think this CL would improve reliability quite a bit. I would personally choose t" [puppet] - 10https://gerrit.wikimedia.org/r/421981 (owner: 10Ori.livneh) [18:36:16] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), 10User-Eevans: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4054767 (10mobrovac) @RobH as a first step, could we ensure that we have no spare Intel disks laying around that we... 
[18:42:05] !log branching 1.31.0-wmf.27 [18:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:54] (03CR) 10Krinkle: "The main issue right now is that coal stops consuming from Kafka after a few hours of running, and we still haven't figured out the root c" [puppet] - 10https://gerrit.wikimedia.org/r/421981 (owner: 10Ori.livneh) [18:47:50] (03CR) 10Krinkle: coal: add a simple systemd watchdog notifier; set WatchdogSec=60 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421981 (owner: 10Ori.livneh) [18:48:39] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), 10User-Eevans: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4085923 (10RobH) >>! In T189822#4085872, @mobrovac wrote: > do we have any non-Samsung disks laying around that we... [19:00:04] twentyafterfour: #bothumor I � Unicode. All rise for MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180327T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:01:03] (03PS3) 10Jcrespo: Revert "mariadb backups: Skip x1 and misc hosts this week" [puppet] - 10https://gerrit.wikimedia.org/r/422122 [19:02:37] (03CR) 10Krinkle: coal: be smarter about consuming from Kafka (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [19:04:04] (03CR) 10Krinkle: coal: be smarter about consuming from Kafka (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [19:05:38] (03CR) 10Jcrespo: [C: 032] Revert "mariadb backups: Skip x1 and misc hosts this week" [puppet] - 10https://gerrit.wikimedia.org/r/422122 (owner: 10Jcrespo) [19:13:09] (03PS1) 10Ottomata: Add check_prometheus alerts for Kafka MirrorMaker instances. 
[puppet] - 10https://gerrit.wikimedia.org/r/422230 (https://phabricator.wikimedia.org/T189611) [19:13:42] (03CR) 10jerkins-bot: [V: 04-1] Add check_prometheus alerts for Kafka MirrorMaker instances. [puppet] - 10https://gerrit.wikimedia.org/r/422230 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [19:13:51] (03PS6) 10Imarlier: coal: be smarter about consuming from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) [19:14:01] (03CR) 10Imarlier: coal: be smarter about consuming from Kafka (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [19:16:25] (03PS2) 10Ottomata: Add check_prometheus alerts for Kafka MirrorMaker instances. [puppet] - 10https://gerrit.wikimedia.org/r/422230 (https://phabricator.wikimedia.org/T189611) [19:16:58] (03CR) 10jerkins-bot: [V: 04-1] Add check_prometheus alerts for Kafka MirrorMaker instances. [puppet] - 10https://gerrit.wikimedia.org/r/422230 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [19:18:18] jouncebot: next [19:18:19] In 3 hour(s) and 41 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180327T2300) [19:18:55] twentyafterfour: did i miss it ? if so i'm sorry [19:20:45] (03PS3) 10Ottomata: Add check_prometheus alerts for Kafka MirrorMaker instances. [puppet] - 10https://gerrit.wikimedia.org/r/422230 (https://phabricator.wikimedia.org/T189611) [19:21:33] (03PS7) 10Imarlier: coal: be smarter about consuming from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) [19:21:47] (03PS4) 10Ottomata: Add check_prometheus alerts for Kafka MirrorMaker instances. [puppet] - 10https://gerrit.wikimedia.org/r/422230 (https://phabricator.wikimedia.org/T189611) [19:22:33] twentyafterfour: just checked calendar. 
12-14 now :) [19:25:32] (03CR) 10Imarlier: coal: be smarter about consuming from Kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [19:25:36] (03PS5) 10Ottomata: Add check_prometheus alerts for Kafka MirrorMaker instances. [puppet] - 10https://gerrit.wikimedia.org/r/422230 (https://phabricator.wikimedia.org/T189611) [19:25:38] (03PS1) 10Sbisson: Make 'style' and 'storage id' available to maps services [puppet] - 10https://gerrit.wikimedia.org/r/422239 (https://phabricator.wikimedia.org/T112948) [19:26:24] (03PS6) 10Ottomata: Add check_prometheus alerts for Kafka MirrorMaker instances. [puppet] - 10https://gerrit.wikimedia.org/r/422230 (https://phabricator.wikimedia.org/T189611) [19:26:26] (03CR) 10jerkins-bot: [V: 04-1] Make 'style' and 'storage id' available to maps services [puppet] - 10https://gerrit.wikimedia.org/r/422239 (https://phabricator.wikimedia.org/T112948) (owner: 10Sbisson) [19:27:57] (03PS2) 10Sbisson: Make 'style' and 'storage id' available to maps services [puppet] - 10https://gerrit.wikimedia.org/r/422239 (https://phabricator.wikimedia.org/T112948) [19:32:05] (03PS7) 10Ottomata: Add check_prometheus alerts for Kafka MirrorMaker instances. [puppet] - 10https://gerrit.wikimedia.org/r/422230 (https://phabricator.wikimedia.org/T189611) [19:32:10] (03CR) 10Ottomata: "Looks good https://puppet-compiler.wmflabs.org/compiler03/10704/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/422230 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [19:32:29] (03CR) 10Dzahn: "really, we want these in roles too now? 
i mean fine with me, i just remember that we used to say they should stay on node level" [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [19:33:02] (03PS1) 10Gergő Tisza: Enable TemplateStyle on all Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422246 (https://phabricator.wikimedia.org/T189838) [19:33:47] (03CR) 10Ottomata: [C: 032] Add check_prometheus alerts for Kafka MirrorMaker instances. [puppet] - 10https://gerrit.wikimedia.org/r/422230 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [19:38:01] (03PS1) 10Dzahn: Revert "deployment_server: allow rsyncing of /srv/ to new server" [puppet] - 10https://gerrit.wikimedia.org/r/422250 [19:40:52] (03PS1) 10Ottomata: Fix path to check_kafka_consumer_lag nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/422251 (https://phabricator.wikimedia.org/T189611) [19:41:36] !log deploy1001 - deleting /srv and letting puppet recreate it, so _not_ rsyncing manually from tin but just a clean version of what puppet pulls in (T175288) [19:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:42] T175288: setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288 [19:43:47] (03CR) 10Ottomata: [C: 032] Fix path to check_kafka_consumer_lag nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/422251 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [19:44:12] (03PS2) 10Dzahn: Revert "deployment_server: allow rsyncing of /srv/ to new server" [puppet] - 10https://gerrit.wikimedia.org/r/422250 [19:45:39] (03CR) 10Dzahn: [C: 032] "instead of a 1:1 copy of tin we are opting for a clean version of just what puppet pulls in" [puppet] - 10https://gerrit.wikimedia.org/r/422250 (owner: 10Dzahn) [19:48:10] mutante: Have we added it as a co-master yet to scap? 
[19:48:24] Once we do, stuff should get back into sync nicely [19:48:39] no_justification: i think that is the problem i have right now [19:48:48] i get a bunch of scap errors because i'm not active [19:48:57] but i want to have the clean /srv again that i had before [19:48:59] Wait, you're trying to run that on....deploy1001? [19:49:05] did somebody else do that before me? heh [19:49:11] NO WE SHOULD NOT DO THAT [19:49:21] We need to run it from the *active* master [19:49:26] what? run puppet? [19:49:30] since that's all i'm doing [19:49:35] No no, I meant running scap stuff [19:49:47] i meant that puppet shows me scap errors when i run puppet [19:49:50] That's how we end up with a hosed /srv/mediawiki again :) [19:49:50] like this: [19:49:52] Ahhhhhh [19:50:04] Execution of '/usr/bin/scap deploy --init' returned 70: 19:42:03 deploy failed: Failed to acquire lock "/var/lock/scap-global-lock"; owner is "root"; reason is "Not the active deployment server, use tin.eqiad.wmnet" [19:50:29] so "add as co-master" you say? [19:53:01] no_justification: which of these options is really better: a) /srv is an exact copy /srv on tin b) /srv is only what puppet pulls in via deployment_server role and there is no manual rsync at all so it's clean [19:53:18] the diff was mostly mediawiki-staging when i looked [19:53:25] but also something in mediawiki [19:55:55] puppet is also still running.. it's still doing stuff.. and /srv is growing despite the errors [19:56:20] i'll report when it's done. so far it is 5G [19:57:18] mediawiki-staging and mediawiki will get sync'd properly from next master sync [19:57:26] (03PS1) 10Ottomata: Fix prometheus_url for mirror maker alert [puppet] - 10https://gerrit.wikimedia.org/r/422258 (https://phabricator.wikimedia.org/T189611) [19:57:34] The biggest issue is in the php-* directories, we haven't automated checkout of those yet on new provision [19:57:51] no_justification: ok, so i deleted /srv and letting puppet recreate it from scratch.. 
because before i thought rsyncing from tin was good.. [19:57:54] (03PS2) 10Ottomata: Fix prometheus_url for mirror maker alert [puppet] - 10https://gerrit.wikimedia.org/r/422258 (https://phabricator.wikimedia.org/T189611) [19:58:09] I'd run puppet first, then rsync [19:58:19] arr. that's what i did yesterday :) [19:58:22] Might miss some puppet stages that do a Creates or Provides or something [19:58:29] but that's the _other_ option [19:58:30] Actually.... [19:58:41] 1) puppet, 2) `scap sync` for mw [19:58:52] The scap3-style deployments in deployments/ *should* provision properly [19:58:56] ok, but no "quickdatacopy" [19:58:58] then [19:59:02] which i did [19:59:07] (03CR) 10Ottomata: [C: 032] Fix prometheus_url for mirror maker alert [puppet] - 10https://gerrit.wikimedia.org/r/422258 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [19:59:17] I would skip the quickdatacopy and let scap do it [19:59:19] mutante: mergeing yours ok? [19:59:25] Revert "deployment_server .. [19:59:34] ottomata: yes please, sorry [19:59:38] np [20:00:16] that was to remove the "quickdatacopy" thing but it wasn't auto-sync anyways [20:01:40] It's also new branch day (cc twentyafterfour) which probably isn't best time to mess with scap masters [20:01:45] i ran puppet on tin, it's fine and removed the ferm rule [20:02:32] no_justification: it's ok, I'm coordinating with mutante on this [20:02:33] Notice: Applied catalog in 1278.52 seconds [20:02:34] to test things [20:02:41] OK making sure [20:02:50] Didn't want y'all competing for rsync [20:03:03] I cut the branch already on tin, gonna do `scap prep` and the rest on the new server [20:03:16] i'm running puppet a second time [20:03:32] it should be much faster than 1278 seconds [20:03:35] yes, 37 [20:03:41] For a new deployment server, running puppet a few times might be necessary [20:03:46] ack [20:04:17] PROBLEM - puppet last run on deploy1001 is CRITICAL: CRITICAL: Puppet has 41 failures. 
Last run 41 seconds ago with 41 failures. Failed resources (up to 3 shown): Scap_source[changeprop/deploy],Scap_source[citoid/deploy],Scap_source[cpjobqueue/deploy],Scap_source[cxserver/deploy] [20:04:54] What's failing? [20:05:06] looks like the puppet scap provider [20:05:16] Execution of '/usr/bin/scap deploy --init' returned 70: 20:04:18 You have set `git_rev` to "origin/master" without setting `git_upstream_submodules=True`. This could lead to unexpected behavior. [20:05:16] The deploy bit? It shouldn't try to deploy unless it's active imho [20:05:19] those are large repos, no? maybe timing out? [20:05:41] Ah init automatically. That's newish [20:05:47] size of /srv on tin: 38G size of /srv on deploy1001: 16G [20:05:53] it was more yesterday [20:05:58] Added since last time we had a reimage [20:05:59] no_justification: yeah deploy init is to clone the repos on the deploy master [20:06:14] Deploy init is after clone [20:06:18] mutante: the mediawiki branches are missing [20:06:31] so _submodules_ are causing issues? heh [20:06:44] wouldn't be the first time for that [20:06:46] That's expected. [20:06:55] Need to run from tin to push them over [20:06:58] I can check out the mediawiki branches fresh [20:07:05] no_justification: just run scap sync? [20:07:08] on tin? [20:07:26] so should I do scap prep first or second [20:07:35] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), 10User-Eevans: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4086219 (10mobrovac) Ah I see, thnx for info @RobH. I agree, these wouldn't make good candidates for what we need.... [20:07:47] Start with a scap sync file read me [20:07:53] i think the puppet part on deploy1001 is now done. it stays at around 40 seconds runtime and does the same thing .. 
for now before other scap steps [20:08:15] mutante: no_justification: ok I'm gonna try syncing a file on tin [20:08:22] Yeah, just a single file [20:08:25] takes hands off of it [20:08:26] We want to force a co-master sync [20:08:34] We don't really care about the rest of the build [20:08:50] (plus don't wanna fuck with destinations accidentally pulling from a bad deploy1001) [20:09:37] !log twentyafterfour@tin Synchronized README: (no justification provided) (duration: 00m 52s) [20:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:52] ok that didn't sync the branches [20:10:05] Well shitnuggests. co master should sync everything :\ [20:10:17] meanwhile going to fix my gerrit change to switch the servers .. needs manual rebase [20:10:20] I think I uses the same exclusion as the proxy sync [20:10:28] That's....ok [20:10:29] Meh [20:10:35] Maybe a full sync then [20:11:16] !log twentyafterfour@tin Started scap: Sync to co-masters to initialize deploy1001.eqiad.wmnet [20:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:45] (03PS2) 10Dzahn: switch deployment server from tin to deploy1001 [puppet] - 10https://gerrit.wikimedia.org/r/420914 (https://phabricator.wikimedia.org/T175288) [20:12:07] cannot delete non-empty directory: php-1.31.0-wmf.25/cache/l10n [20:12:09] cannot delete non-empty directory: php-1.31.0-wmf.25/cache/l10n [20:12:11] cannot delete non-empty directory: php-1.31.0-wmf.25/cache [20:12:13] that doesn't look good [20:12:24] (03CR) 10jerkins-bot: [V: 04-1] switch deployment server from tin to deploy1001 [puppet] - 10https://gerrit.wikimedia.org/r/420914 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [20:12:29] That's fine [20:12:41] why's it trying to delete ? 
[20:12:51] Because the directory is empty on the master [20:12:58] wmf.25 is a "cleaned" branch [20:13:11] But it won't delete on targets because...well...it's not empty there [20:13:31] is it possible that some of these changes here: https://gerrit.wikimedia.org/r/#/c/420914/ are needed before the others [20:13:34] * no_justification manually deletes it everywhere [20:13:51] like can it really be a single change [20:14:05] Honestly, I don't think so right now [20:14:11] I'd rather run deploy1001 as a co-master [20:14:13] (like naos) [20:14:14] For now [20:14:21] Then when it's all dandy, switch over [20:14:28] does it need to be a master here https://gerrit.wikimedia.org/r/#/c/420914/2/hieradata/common/scap/dsh.yaml [20:14:41] that sounds better, yes [20:14:46] yeah that's what controls the co-master sync I think [20:14:51] That's what I thought we were doing [20:15:05] Yeah, so let's add it to that second stanza in dsh.yaml [20:15:25] ok [20:15:58] meanwhile scap is buisy updating localisationcache for the next 20 minutes ;) [20:16:18] I'd say abort, but you're gonna want that later anyway [20:16:19] :p [20:16:28] yeah I'll just let it go [20:16:29] (03PS1) 10Dzahn: add deploy1001 to scap masters in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/422334 (https://phabricator.wikimedia.org/T175288) [20:17:03] twentyafterfour: But yeah, that cache directory spam is harmless. There's a task, and I have a fix in mind for it (the core part just went out recently, requires a second patch to clean.py to make use of it) [20:17:17] (it sounds scarier than it is) [20:17:26] ^ +1 ? 
[20:17:39] (03PS1) 10Ottomata: Use scalar in dropped messages prometheus check for mirror maker [puppet] - 10https://gerrit.wikimedia.org/r/422335 (https://phabricator.wikimedia.org/T189611) [20:17:59] scap::dsh::scap_masters [20:18:00] (03CR) 10Ottomata: [V: 032 C: 032] Use scalar in dropped messages prometheus check for mirror maker [puppet] - 10https://gerrit.wikimedia.org/r/422335 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [20:18:04] dsh .. [20:18:44] but there is also "hosts" in that file [20:19:50] Yeah should also go in hosts bit [20:20:19] Where do we define the active bit again? tin.yaml? [20:20:41] (03PS2) 10Dzahn: add deploy1001 to scap masters and hosts in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/422334 (https://phabricator.wikimedia.org/T175288) [20:20:44] https://gerrit.wikimedia.org/r/#/c/420914/2/modules/profile/manifests/mediawiki/deployment/server.pp [20:21:00] Oh yeah, deployment_server [20:21:03] deployment_server: deploy1001.eqiad.wmnet [20:21:08] https://gerrit.wikimedia.org/r/#/c/420914/2/hieradata/common.yaml [20:21:10] common.yaml [20:21:15] and that other place as default [20:21:29] Gotcha. 
So yeah let's do the dsh.yaml changes first to get it properly setup as a sync target/co-master [20:21:34] Then let it run like that for awhile [20:21:44] *Then* we can swap the deployment_server [20:21:49] (03CR) 10Nuria: [C: 031] Rollout VirtualPageViews (stage 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422206 (https://phabricator.wikimedia.org/T189906) (owner: 10Jdlrobson) [20:22:10] (03CR) 10Dzahn: [C: 04-2] switch deployment server from tin to deploy1001 [puppet] - 10https://gerrit.wikimedia.org/r/420914 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [20:22:20] (03PS3) 10Dzahn: add deploy1001 to scap masters and hosts in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/422334 (https://phabricator.wikimedia.org/T175288) [20:23:01] (03CR) 10Dzahn: [C: 032] add deploy1001 to scap masters and hosts in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/422334 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [20:23:47] ok, merged. added tp scap proxies and scap hosts [20:25:19] added to mediawiki-installation dsh group. ran puppet [20:26:47] PROBLEM - High CPU load on API appserver on mw1315 is CRITICAL: CRITICAL - load average: 114.29, 36.87, 22.53 [20:28:28] (03PS1) 10Chad: Gerrit: Remove old gerrit.war location in homedir [puppet] - 10https://gerrit.wikimedia.org/r/422336 [20:28:46] RECOVERY - High CPU load on API appserver on mw1315 is OK: OK - load average: 30.41, 29.97, 21.63 [20:28:49] * twentyafterfour wonders what that icinga alert about? [20:29:06] mw1315 has a high load :\ [20:29:11] Could be anything or nothing [20:29:13] my first scap sync is almost done [20:29:30] mw1315 load probably unrelated? [20:29:34] co-master should work now if mutante's change is out :) [20:29:42] Most likely [20:29:58] it's hhvm and scap that is running on mw1315 [20:30:10] yeah nothing else tripping so ... 
[20:30:20] if we had a real problem there would be more alerts [20:30:28] ack @ unrelated [20:30:41] I'm gonna run scap sync one more time in one minute when this first sync is finished [20:31:16] scap-cdb-rebuild: 98% [20:31:19] (03CR) 10Paladox: Gerrit: Remove old gerrit.war location in homedir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422336 (owner: 10Chad) [20:32:46] !log twentyafterfour@tin Finished scap: Sync to co-masters to initialize deploy1001.eqiad.wmnet (duration: 21m 30s) [20:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:29] !log twentyafterfour@tin Started scap: 2nd Sync to co-masters to initialize deploy1001.eqiad.wmnet [20:34:32] (03CR) 10Chad: Gerrit: Remove old gerrit.war location in homedir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422336 (owner: 10Chad) [20:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:54] ok it appears to be syncing deploy1001 now [20:47:20] !log twentyafterfour@tin Finished scap: 2nd Sync to co-masters to initialize deploy1001.eqiad.wmnet (duration: 12m 50s) [20:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:16] ok co-master sync appears to have worked [20:49:28] mutante: anything else to do before I resume the train? [20:49:36] cc mutante [20:49:47] erm cc no_justification [20:50:08] What's puppet saying on deploy1001 now? 
[20:50:16] (03PS2) 10Bstorm: wiki-replicas: add user for index maintenance script [labs/private] - 10https://gerrit.wikimedia.org/r/422209 [20:52:04] (03CR) 10Bstorm: [V: 032] wiki-replicas: add user for index maintenance script [labs/private] - 10https://gerrit.wikimedia.org/r/422209 (owner: 10Bstorm) [20:52:09] (03CR) 10Bstorm: [V: 032 C: 032] wiki-replicas: add user for index maintenance script [labs/private] - 10https://gerrit.wikimedia.org/r/422209 (owner: 10Bstorm) [20:52:18] running puppet again, hold on [20:52:58] twentyafterfour: much better now! [20:53:00] Notice: /Stage[main]/Scap::Master/Git::Clone[operations/mediawiki-config]/File[/srv/mediawiki-staging]/owner: owner changed '996' to 'mwdeploy' [20:53:08] applied catalog in 20 sec. all green, no errors [20:53:33] the size of /srv is now 36G as well [20:53:56] 36741276 /srv/ [20:54:00] 37824584 . [20:54:07] yeah I think we are all good [20:54:23] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:54:26] I just ran scap prep on deploy1001 so it's got a new branch that tin doesn't have [20:54:58] keyholder is armed and active too [20:55:49] ok syncing the new branch [20:55:51] !log twentyafterfour@deploy1001 scap failed: LockFailedError Failed to acquire lock "/var/lock/scap-global-lock"; owner is "root"; reason is "Not the active deployment server, use tin.eqiad.wmnet" (duration: 00m 00s) [20:55:54] nope [20:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:05] scap failed: LockFailedError Failed to acquire lock "/var/lock/scap-global-lock"; owner is "root"; reason is "Not the active deployment server, use tin.eqiad.wmnet" (duration: 00m 00s) [20:56:19] so we still need to merge the other changes? [20:59:53] so.. 
we did dsh.yaml [21:00:01] and we still have scap.yaml [21:00:22] that has scap::deployment_server: [21:00:36] I'm not sure what controls that lock file [21:00:39] but this is not common.yaml [21:00:45] it might be manually placed [21:02:02] ops/puppet:/modules/profile/manifests/mediawiki/deployment/server.pp [21:02:06] luine 103 [21:02:41] yea, but that's the default for common.yaml in Hiera [21:02:48] and the actual switch of the active server, right [21:02:54] (03PS1) 10Dzahn: bastionhost: add MOTD warning of imminent bast1001 shutdown [puppet] - 10https://gerrit.wikimedia.org/r/422339 (https://phabricator.wikimedia.org/T186623) [21:03:06] let me rebase my other change again [21:03:19] (03CR) 10jerkins-bot: [V: 04-1] bastionhost: add MOTD warning of imminent bast1001 shutdown [puppet] - 10https://gerrit.wikimedia.org/r/422339 (https://phabricator.wikimedia.org/T186623) (owner: 10Dzahn) [21:05:49] (03PS3) 10Dzahn: switch deployment server from tin to deploy1001 [puppet] - 10https://gerrit.wikimedia.org/r/420914 (https://phabricator.wikimedia.org/T175288) [21:07:24] https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/mediawiki/deployment/server.pp#L104 [21:07:29] mutante: ^ [21:07:52] i'm changing that so it's just adding deploy1001 and not removing tin as well [21:07:53] yeah actually your change looks good [21:08:21] $main_deployment_server = hiera('scap::deployment_server'), [21:08:46] (03PS4) 10Dzahn: switch deployment server from tin to deploy1001 [puppet] - 10https://gerrit.wikimedia.org/r/420914 (https://phabricator.wikimedia.org/T175288) [21:09:47] hieradata/common.yaml [21:11:07] (03PS1) 1020after4: Switch scap master server to deploy1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/422340 [21:11:10] (03PS5) 10Dzahn: switch deployment server from tin to deploy1001 [puppet] - 10https://gerrit.wikimedia.org/r/420914 (https://phabricator.wikimedia.org/T175288) [21:11:37] mutante: https://gerrit.wikimedia.org/r/#/c/422340/ 
[21:11:50] twentyafterfour: you think it should be separate step? [21:11:59] not necessarily [21:12:12] you can incorporate it into the same patch if you'd rather [21:12:36] it's already in it [21:12:41] oh? [21:12:49] what did I miss [21:12:52] hmm [21:13:18] check out https://gerrit.wikimedia.org/r/#/c/420914/ one more time now [21:13:26] I was looking at https://gerrit.wikimedia.org/r/#/c/422334/ heheh [21:13:38] it changes common.yaml and some comments and the default in another place that it would fall back if Hiera failed [21:13:49] ah :) [21:13:52] (03Abandoned) 1020after4: Switch scap master server to deploy1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/422340 (owner: 1020after4) [21:15:08] (03CR) 1020after4: [C: 031] "looks good, let's merge it :)" [puppet] - 10https://gerrit.wikimedia.org/r/420914 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [21:15:33] (03PS2) 10Chad: Gerrit: Simplify directory structure [puppet] - 10https://gerrit.wikimedia.org/r/422336 [21:16:12] (03CR) 10Dzahn: [C: 032] switch deployment server from tin to deploy1001 [puppet] - 10https://gerrit.wikimedia.org/r/420914 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [21:17:24] running puppet on tin and deploy1001 [21:17:33] it should switch the motd [21:17:57] it changed master_rsync on both [21:18:52] that includes an rsync for /srv/patches [21:19:12] no_justification: cc: ^ [21:19:23] no puppet issues [21:19:36] Sounds good! [21:22:03] tin now has the "do NOT use this server" motd as it should [21:22:31] !log deployment_server has been switched to deploy1001.eqiad.wmnet. 
tin is not the active server anymore as of right now [21:22:34] !log twentyafterfour@deploy1001 scap failed: LockFailedError Failed to acquire lock "/var/lock/scap-global-lock"; owner is "root"; reason is "Not the active deployment server, use tin.eqiad.wmnet" (duration: 00m 00s) [21:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:50] mutante: it didn't remove the lock :-/ [21:23:41] meh.. but it added it on tin [21:24:24] mutante: the puppet code doesn't remove the file, needs manual removal [21:24:34] # Lock the passive servers, leave untouched the active one. [21:24:43] just wanted to check if it's a puppet bug before just deleting [21:24:49] ok! [21:25:12] it's intentional because that lock could be manually placed during maintenance [21:25:19] so we don't want puppet to always remove it [21:25:44] !log deploy100 rm /var/lock/scap-global-lock to switch to active server, puppet code only adds lock file to inactive servers (T175288) [21:25:47] try again [21:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:49] T175288: setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288 [21:25:53] it would be nice if it would remove when it contains the default content on the default server [21:26:08] !log twentyafterfour@deploy1001 Started scap: Deploy 1.31.0-wmf.27 to test wikis [21:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:26] I think the thinking was also we wanted manual checking before making active if it had been passive [21:26:36] Cf: last time we reimaged and hosed servers [21:26:43] yeah [21:26:51] !log twentyafterfour@deploy1001 scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="cawikibooks" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" 
--output="/tmp/tmp.738JVwJRDN" ' returned non-zero exit status 127 (duration: 00m 43s) [21:26:53] sounds likely from the code comment too, ack [21:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:01] maybe we should make the file not owned by root though, so a deployer could remove the lock? [21:27:04] erg [21:27:22] 1:26:51 Last output: [21:27:25] /usr/local/bin/mwscript: line 25: php5: command not found [21:27:25] Could make it mwdeploy owned [21:27:26] Perhaps [21:27:36] why php5? [21:27:40] duh.. and there we go.. with the stretch issues [21:27:46] this is stretch now with php7 [21:27:57] There's a workaround [21:28:02] PHP=php7 ........ [21:28:14] shouldn't it just use /usr/bin/php ? [21:28:20] Because that points to hhvm [21:28:23] mwscript should not hardcode php5 [21:28:25] And makes message rebuilding slow [21:28:26] anymore now [21:28:29] It does [21:28:32] I had to revert, remember? [21:28:50] /etc/alternatives/php -> /usr/bin/hhvm [21:29:01] On stretch I don't think we should do that ^ [21:29:25] And I think we should use alternates on distros below stretch to point to php5 [21:29:34] Well, except php5 eol :\ [21:29:45] So it gets complicated :( [21:30:29] Easy (but hacky) solution: would be to have a symlink on stretch for php5->php7 [21:30:33] maybe symblink php5 on stretch to either php 7 or hhvm [21:30:39] Or vice verswa [21:31:13] mwscript. ah we knew this would bite us right? 
[21:31:19] Use alternates to point to php vs hhvm binary, but use a symlink so scripts that /do/ hardcode can avoid hhvm [21:31:26] We tried to fix it last week [21:32:02] fwiw we are banking on /etc/alternatives/php pointing to hhvm on all stretch for now [21:32:09] (see: appservers upgrade, etc) [21:32:15] hhvm performance on the cli is atrocious, and I don't think it's a good use of time to debug it considering the (hopefully small) window after Php5 is dead but before php7 is everywhere and we're forced on hhvm [21:32:42] I don't see /why/ we'd want that though. Stretch should point to php7 for cli operations [21:32:50] We shouldn't even bother with hhvm on stretch [21:33:15] that's not the migration plan though [21:33:39] it's hhvm for everything except dumps (first) and then other things little by little [21:34:31] Cli operations on deploy masters & work hosts (terbium, wasat) needs to be happening now though :( [21:34:35] This is going to keep being an issue. [21:34:38] how bad is hhvm cli, is it seconds or minutes difference for scap? [21:34:46] Minutes. [21:34:48] Many minutes [21:34:50] oh [21:35:42] we could do something gross like using logic in mwscript to decide which one to run depending on the link target ;) [21:35:43] can't mwscript check to see if there's a php5 or a php7 and choose one? [21:35:48] ayup [21:35:49] Yes [21:35:55] That's probably easiest [21:37:26] is terbium stretch already? [21:37:33] But it forces the migration sorta. [21:37:48] So when we tried to swap php5 -> php, we got hhvm as default. [21:37:49] apergos: no [21:38:01] so we're really just talking about deploy1001 right? [21:38:02] deploy1001 was first [21:38:13] I'm not suggesting we move appservers or anything else. I'm just saying for the cli, I want php5 or php7, hhvm be damned [21:38:18] There's a reason we forced php5 for years now [21:38:22] yea and then deploy2001 to replace wasat [21:38:26] eh. naos!! [21:38:54] I'm saying, I don't think we can commit to that yet. 
for deployxxxx it's probably ok [21:39:11] That's what I'm talking about though..... [21:39:27] deploy* and then whatever new name for wasat/terbium is [21:39:27] so let's leave aside terbium and other such cases [21:39:32] for now. [21:39:44] I mostly care about deploy masters. [21:40:32] so right now mwscript is php5 on all boxes, is that right? [21:41:04] Yes. But mwscript isn't only in use on deploy* [21:41:10] I unedrstand that [21:41:25] but it's not using hhvm anywhere, correct? [21:41:41] Nope [21:42:36] are the deployxxx boxes the first ones using it where php7 is installed? [21:42:42] sorry for these basic questions [21:43:04] apergos: yes afaik [21:43:07] ok [21:43:29] so for right now, adding logic that says "if there's php7 installed, use that, otherwise use php5" seems fine [21:43:30] BUT [21:44:09] let's put a huge red flag all over this so that well before the next case (terbium) we've figured out what to do (hhvm vs php7) [21:44:19] where be "we" I mean you but also the ops side of the migration [21:44:22] (03CR) 10Paladox: [C: 031] "untested though looks good." [puppet] - 10https://gerrit.wikimedia.org/r/422336 (owner: 10Chad) [21:44:25] *by we [21:44:43] okay? [21:44:52] works for me [21:45:19] I'll remember to bring this up to joe, mor itz tomorrow (mm today :-P) [21:45:55] maybe it will be a topic for discussion in our subteam meeting too. don't want to block anything. [21:47:42] /src/ops/puppet/modules/scap/files/foreachwikiindblist [21:48:01] /src/ops/puppet/modules/scap/files/mwscript [21:49:08] I guess those are the only two places that need to change? [21:49:10] later when the deployhosts have been all sorted out, it would be nice to know the uses of mwscript on terbium [21:49:26] i assume we want the codfw deploy server to also be stretch asap then [21:49:31] so at least we dont have a mix there? 
[21:49:56] mutante: yeah I think so [21:49:57] the opposite would be to keep it as a backup to fall back to [21:49:59] (03PS1) 10EBernhardson: Upgrade enwiki search ranking model to prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422347 (https://phabricator.wikimedia.org/T187148) [21:50:10] also, git-lfs [21:52:06] (03PS1) 10Chad: mwscript: Detect php across distros [puppet] - 10https://gerrit.wikimedia.org/r/422348 [21:52:14] mutante, twentyafterfour, apergos ^^^^ [21:53:01] (03PS2) 10Chad: mwscript: Detect php across distros [puppet] - 10https://gerrit.wikimedia.org/r/422348 [21:55:04] no_justification: doesn't work on deploy1001 [21:55:16] it sets PHP_BIN to 'php' not 'php7' [21:56:31] there is php7.0 [21:56:33] no php7 [21:56:34] oh it's php7.0 [21:56:38] yeah that [21:57:08] other than that it looks good [21:57:20] (03PS3) 1020after4: mwscript: Detect php across distros [puppet] - 10https://gerrit.wikimedia.org/r/422348 (owner: 10Chad) [21:58:02] thumbs up [21:58:26] (03CR) 1020after4: [C: 031] mwscript: Detect php across distros [puppet] - 10https://gerrit.wikimedia.org/r/422348 (owner: 10Chad) [21:58:50] we don't have +2 [21:59:36] apergos or mutante, mind merging that, then I'll get the train started again [21:59:42] Oh [21:59:48] So just if php5 else php [22:00:03] I figured stretch would keep a php7 [22:00:10] it's named php7.0 [22:00:12] stretch has php7.0 [22:00:13] not php7 [22:00:22] and that's in all the paths too btw [22:00:24] Ahhh, I see now [22:00:26] On your edit [22:00:40] going to pass to mutante though because [22:00:44] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4086547 (10kchapman) @Tgr perhaps I was not as clear as I could have been. The other issue we see is there should be multiple RFCs broken out for that.... 
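The interpreter selection discussed above for mwscript (use php5 where it is installed, otherwise the php7.0 binary that stretch ships) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the content of change 422348: the function name `detect_php_bin`, the candidate order, and the final fallback to plain `php` are hypothetical. Only the `PHP_BIN` variable name and the binary names php5 and php7.0 come from the conversation.

```shell
#!/bin/sh
# Hypothetical helper (not the merged mwscript change): print the first
# candidate interpreter name that can be found on PATH.
detect_php_bin() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Assumed order: php5 where it still exists, then php7.0 (the name used on
# stretch; there is no "php7" binary), and plain "php" only as a last
# resort, since the log notes it may resolve to hhvm via /etc/alternatives.
PHP_BIN=$(detect_php_bin php5 php7.0 php) || {
    echo "no PHP interpreter found" >&2
    PHP_BIN=""
}
```

A wrapper script could then invoke `"$PHP_BIN"` instead of a hardcoded `php5`, which is the failure seen in the `line 25: php5: command not found` output earlier in the log.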
[22:00:53] I'm not going to be around to babysit at 1 am (sorry, strictly moonlighting here) [22:02:32] (03CR) 10ArielGlenn: [C: 031] mwscript: Detect php across distros [puppet] - 10https://gerrit.wikimedia.org/r/422348 (owner: 10Chad) [22:02:49] (03CR) 10Dzahn: [C: 032] mwscript: Detect php across distros [puppet] - 10https://gerrit.wikimedia.org/r/422348 (owner: 10Chad) [22:03:24] thanks apergos and mutante [22:03:38] 👍 [22:04:12] I'm pingable for a while yet but mostly checked out [22:06:49] (03PS5) 10Chad: Disable AbuseFilter from collecting IP addresses on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421864 (https://phabricator.wikimedia.org/T188862) (owner: 10MarcoAurelio) [22:06:56] (03CR) 10Chad: [C: 032] Disable AbuseFilter from collecting IP addresses on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421864 (https://phabricator.wikimedia.org/T188862) (owner: 10MarcoAurelio) [22:08:08] (03Merged) 10jenkins-bot: Disable AbuseFilter from collecting IP addresses on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421864 (https://phabricator.wikimedia.org/T188862) (owner: 10MarcoAurelio) [22:08:22] (03CR) 10jenkins-bot: Disable AbuseFilter from collecting IP addresses on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421864 (https://phabricator.wikimedia.org/T188862) (owner: 10MarcoAurelio) [22:10:51] (03CR) 10Nuria: eventlogging: move alarms from graphite to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [22:12:03] !log demon@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: beta-only sync (duration: 02m 32s) [22:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:51] and now somebody can reinstall deployment-tin.eqiad.wmflabs /me hides [22:14:05] :) [22:14:06] !log demon@deploy1001 Synchronized wmf-config/abusefilter.php: beta-only sync (duration: 00m 53s) [22:14:10] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:30] is mwscript updated now? [22:14:34] the timestamp looks old [22:15:04] ehm.. running puppet [22:15:12] try now [22:15:22] better [22:15:28] !log twentyafterfour@deploy1001 Started scap: Deploy 1.31.0-wmf.27 to test wikis [22:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:54] (03PS2) 10Dzahn: decom and remove remnants of tin.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) [22:22:02] Decom already? [22:23:16] it's obsolete now. Time for it to go the way of the ✆ and the ✇ [22:23:51] :) [22:23:53] (03PS1) 10Andrew Bogott: Change cname for deployment.eqiad.wmnet and deployment.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/422349 [22:26:40] (03PS2) 10Dzahn: Change cname for deployment.eqiad.wmnet and deployment.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/422349 (https://phabricator.wikimedia.org/T175288) (owner: 10Andrew Bogott) [22:26:57] (03CR) 10Dzahn: [C: 032] Change cname for deployment.eqiad.wmnet and deployment.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/422349 (https://phabricator.wikimedia.org/T175288) (owner: 10Andrew Bogott) [22:27:17] no_justification: no, just preparing changes. 
no merge [22:28:21] !log DNS - switching deployment service name to deploy1001 (T175288) [22:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:28] T175288: setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288 [22:29:30] https://i.imgur.com/q46L4QH.jpg [22:35:20] (03PS8) 10Imarlier: coal: be smarter about consuming from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) [22:48:13] 10Operations, 10hardware-requests, 10Release-Engineering-Team (Watching / External): eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#4086641 (10Dzahn) [22:48:19] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4086639 (10Dzahn) 05Open>03Resolved There has been a deploy from it, and all the changes above, incl. DNS service name, Muku... [22:49:02] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4086643 (10Dzahn) [22:50:53] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 60.68, 25.37, 15.61 [22:50:55] 10Operations, 10hardware-requests: hardware request for tin replacement - https://phabricator.wikimedia.org/T184481#4086649 (10Dzahn) 05Open>03Resolved Closing as duplicate of T174452 and T175288 [22:52:13] PROBLEM - High CPU load on API appserver on mw1286 is CRITICAL: CRITICAL - load average: 60.21, 27.24, 17.10 [22:52:43] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4086666 (10Tgr) >>! In T66214#4086547, @kchapman wrote: > The other issue we see is there should be multiple RFCs broken out for that. 
Perhaps that mean... [22:52:53] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 18.75, 21.49, 15.41 [22:53:11] Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#4086670 (Dzahn) done! this was a duplicate of T175288 (see all the details for this there) T184481 [22:53:13] RECOVERY - High CPU load on API appserver on mw1286 is OK: OK - load average: 28.70, 24.08, 16.64 [22:53:17] Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#4086672 (Dzahn) Open>Resolved [22:53:19] Operations, Packaging, Scap, Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4086673 (Dzahn) [22:53:58] Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#3911588 (Dzahn) It has been replaced by deploy1001.eqiad.wmnet on stretch with PHP7. [22:56:28] !log twentyafterfour@deploy1001 Finished scap: Deploy 1.31.0-wmf.27 to test wikis (duration: 41m 00s) [22:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:43] Operations, Packaging, Scap, Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4086682 (Dzahn) tin has been replaced by deploy1001 which is running stretch [deploy1001:~] $ apt-cache show git-lfs Package: git-lfs Version: 2.... [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180327T2300). Please do the needful. [23:00:05] ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:01:00] \o [23:01:47] jdlrobson: swat [23:02:16] (CR) EBernhardson: [C: 2] Upgrade enwiki search ranking model to prod [mediawiki-config] - https://gerrit.wikimedia.org/r/422347 (https://phabricator.wikimedia.org/T187148) (owner: EBernhardson) [23:03:25] \o [23:03:30] (Merged) jenkins-bot: Upgrade enwiki search ranking model to prod [mediawiki-config] - https://gerrit.wikimedia.org/r/422347 (https://phabricator.wikimedia.org/T187148) (owner: EBernhardson) [23:03:32] hey ebernhardson [23:04:28] twentyafterfour: unstaged changes on deployment1001, wikiversions.json [23:04:44] jdlrobson: hey. I'll ship your's in a moment. Looks easy enough and already deployed elsewhere [23:07:14] (CR) EBernhardson: [C: 2] Rollout VirtualPageViews (stage 3) [mediawiki-config] - https://gerrit.wikimedia.org/r/422206 (https://phabricator.wikimedia.org/T189906) (owner: Jdlrobson) [23:07:19] (PS3) EBernhardson: Rollout VirtualPageViews (stage 3) [mediawiki-config] - https://gerrit.wikimedia.org/r/422206 (https://phabricator.wikimedia.org/T189906) (owner: Jdlrobson) [23:07:24] ebernhardson: will take me some time to test though fyi [23:07:25] (CR) EBernhardson: [C: 2] Rollout VirtualPageViews (stage 3) [mediawiki-config] - https://gerrit.wikimedia.org/r/422206 (https://phabricator.wikimedia.org/T189906) (owner: Jdlrobson) [23:08:01] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: T187148: Update enwiki search ranking model (duration: 00m 54s) [23:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:06] T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin - https://phabricator.wikimedia.org/T187148 [23:08:15] (CR) jenkins-bot: Upgrade enwiki search ranking model to prod [mediawiki-config] - https://gerrit.wikimedia.org/r/422347 (https://phabricator.wikimedia.org/T187148) (owner: EBernhardson) [23:08:48] (Merged) jenkins-bot: 
Rollout VirtualPageViews (stage 3) [mediawiki-config] - https://gerrit.wikimedia.org/r/422206 (https://phabricator.wikimedia.org/T189906) (owner: Jdlrobson) [23:09:04] (PS1) Andrew Bogott: keystone-paste.ini: Remove deprecated extension filters [puppet] - https://gerrit.wikimedia.org/r/422352 (https://phabricator.wikimedia.org/T187954) [23:09:12] (PS2) Andrew Bogott: keystone-paste.ini: Remove deprecated extension filters [puppet] - https://gerrit.wikimedia.org/r/422352 (https://phabricator.wikimedia.org/T187954) [23:10:05] jdlrobson: you're live on mwdebug1001 [23:10:10] on it! [23:12:52] (CR) jenkins-bot: Rollout VirtualPageViews (stage 3) [mediawiki-config] - https://gerrit.wikimedia.org/r/422206 (https://phabricator.wikimedia.org/T189906) (owner: Jdlrobson) [23:12:55] none of you is on tin, right ?:) [23:13:36] mutante: nope. i would also hope that would throw up a big error and not work as well :) [23:14:41] great:) i like seeing the new host used without issues [23:15:00] ebernhardson: GO for it! Sync away. [23:18:10] jdlrobson: on it's way [23:18:28] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: T189906: (duration: 00m 55s) [23:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:35] T189906: Roll out VirtualPageViews to all Wikipedia wikis - https://phabricator.wikimedia.org/T189906 [23:20:01] jdlrobson: should be out [23:21:19] \o/ [23:21:23] lets see if the graphs agree [23:27:36] ebernhardson: getting some weirdness but need some more time [23:29:51] jdlrobson: i'm going to have to run to grab a train in 10 minutes. Might need to find someone else to help, or revert? [23:30:18] it's okay (and doesn't necessarily need a revert) ebernhardson thanks anyhow [23:38:48] jdlrobson: I can help with revert if needed [23:39:06] i think we're good MaxSem . 
it's just awfully quiet on a graph i expected a big jump on [23:39:10] it appears to be working how i expect [23:39:56] thanks though :) [23:44:03] PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100% [23:49:13] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms