[00:00:01] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.023 second response time [00:00:06] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:00:36] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Could not fetch url http://mobileapps.svc.eqiad.wmnet:8888/en.wikipedia.org/v1/page/mobile-sections-lead/Altrincham: Timeout on connection while downloading http://mobileapps.svc.eqiad.wmnet:8888/en.wikipedia.org/v1/page/mobile-se [00:00:47] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [00:01:00] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [00:01:27] PROBLEM - Apache HTTP on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:01:39] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 71492 bytes in 5.507 second response time [00:01:46] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [00:01:46] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:02:07] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 4.303 second response time [00:02:37] RECOVERY - HHVM rendering on mw1284 is OK: HTTP OK: HTTP/1.1 200 OK - 71491 bytes in 8.803 second response time [00:02:37] RECOVERY - Apache HTTP on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.059 second response time [00:02:56] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [00:03:32] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.127 second response time [00:03:33] PROBLEM - Apache HTTP on mw1284 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:03:57] RECOVERY - Apache HTTP on mw1287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.866 second response time [00:04:06] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 71491 bytes in 0.181 second response time [00:04:18] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:04:39] PROBLEM - Apache HTTP on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:05:02] ^ what's up, API again? [00:05:23] PROBLEM - Apache HTTP on mw1198 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.014 second response time [00:05:44] !log restarted hhvm on mw1194,mw1197,mw1198 [00:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:06:10] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:06:13] PROBLEM - HHVM rendering on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:06:16] RECOVERY - HHVM rendering on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 71492 bytes in 4.160 second response time [00:06:27] bblack: yes, looks like it [00:06:48] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Could not fetch url http://mobileapps.svc.codfw.wmnet:8888/en.wikipedia.org/v1/page/most-read/2016/01/01: Timeout on connection while downloading http://mobileapps.svc.codfw.wmnet:8888/en.wikipedia.org/v1/page/most-read/2016/01/0 [00:06:51] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 71491 bytes in 0.220 second response time [00:07:03] bblack: i'll restart the others too, ack? 
[00:07:08] PROBLEM - Apache HTTP on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:08:00] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.078 second response time [00:08:13] the ones where i did have recovered ^ [00:08:16] mutante: should we just roll through the whole set via salt, maybe batched and delayed? [00:08:23] checking ganglia [00:08:38] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.474 second response time [00:08:47] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 71491 bytes in 8.370 second response time [00:08:52] oh, wait, that one recovered before i did anything to it [00:08:56] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:09:06] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [00:09:12] the memory pattern hasn't hit them all, yet [00:09:16] PROBLEM - HHVM rendering on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:09:30] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.235 second response time [00:09:35] looks like about 16 are currently affected and unrestarted, though [00:09:36] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:10:06] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [00:10:46] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [00:10:46] PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:11:08] mutante: let's try something like jo.e's memory checking restart? 
[00:11:09] bblack: i'm running it via salt [00:11:15] ok [00:11:18] RECOVERY - Apache HTTP on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 3.865 second response time [00:11:36] PROBLEM - HHVM rendering on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:11:38] RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 71491 bytes in 0.132 second response time [00:12:28] i was about to just use the same command ori used earlier just plus batch [00:12:36] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.291 second response time [00:12:39] looked at joe's change, it's cron [00:13:07] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:13:08] PROBLEM - HHVM rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:13:26] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 6.947 second response time [00:13:27] ah, yes [00:13:39] PROBLEM - HHVM rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:13:57] not cron, he had one that looked at actual mem usage, let me see... [00:14:06] yes, both [00:14:06] have you started on salt already? 
[00:14:36] PROBLEM - Apache HTTP on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:14:37] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:14:40] in that moment, yes, but it wasn't good yet [00:14:46] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71492 bytes in 4.696 second response time [00:14:47] PROBLEM - HHVM rendering on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:14:47] PROBLEM - HHVM rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:14:47] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:14:49] /bin/ps -C hhvm -o pmem= | awk \'{sum+=$1} END { if (sum <= 50.0) exit 1 }\' && /usr/sbin/service hhvm restart [00:14:57] RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.217 second response time [00:15:06] PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:15:09] ^ salting that would only restart ones using 50% of ram [00:15:28] ok, ack [00:15:38] PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:15:47] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.209 second response time [00:15:55] PROBLEM - Apache HTTP on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:15:56] PROBLEM - Apache HTTP on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:16:16] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 71492 bytes in 4.772 second response time [00:16:19] tries to find the right escaping for brackets [00:16:27] uses awk heh [00:16:45] PROBLEM - Apache HTTP on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:17:07] RECOVERY - Apache HTTP on mw1287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.071 second response time 
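A minimal sketch of the memory check pasted above, assuming `ps -C hhvm -o pmem=` prints one percentage per line with no header. `mem_over_threshold` is a hypothetical helper name that does not appear in the log, and the restart is left commented out so nothing is restarted by accident.

```shell
# Hypothetical helper: read %MEM values (one per line) on stdin and
# succeed (exit 0) only when their sum is strictly above the limit in $1.
mem_over_threshold() {
    # "$1" in -v is the shell argument; $1 inside the awk program is the
    # first field of each input line.
    awk -v limit="$1" '{ sum += $1 } END { if (sum <= limit) exit 1 }'
}

# Usage sketch, mirroring the one-liner from the log (kept commented out):
# /bin/ps -C hhvm -o pmem= | mem_over_threshold 50.0 && /usr/sbin/service hhvm restart
```

With the 50.0 cutoff quoted in the log, a host whose hhvm processes sum to 50% of memory or less is left alone, because the `&&` only fires when the check exits 0.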
[00:17:09] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 71491 bytes in 0.121 second response time [00:17:21] RECOVERY - HHVM rendering on mw1283 is OK: HTTP OK: HTTP/1.1 200 OK - 71492 bytes in 8.223 second response time [00:17:37] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 71492 bytes in 5.501 second response time [00:18:06] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:18:07] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.052 second response time [00:18:17] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 71490 bytes in 0.120 second response time [00:18:30] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 9.937 second response time [00:18:36] PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:19:08] RECOVERY - Apache HTTP on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 3.959 second response time [00:19:26] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [00:19:38] PROBLEM - HHVM rendering on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:20:00] PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:20:00] I think I've figured it out [00:20:10] i can't seem to escape the ( [00:20:13] please do [00:20:36] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:20:57] PROBLEM - HHVM rendering on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:21:06] RECOVERY - Apache HTTP on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.585 second response time [00:21:16] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 5.649 second response time [00:21:48] ok doing a 
test-run now with just an echo in place of restart [00:21:58] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:22:00] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Could not fetch url http://mobileapps.svc.eqiad.wmnet:8888/en.wikipedia.org/v1/page/media/Cat: Timeout on connection while downloading http://mobileapps.svc.eqiad.wmnet:8888/en.wikipedia.org/v1/page/media/Cat: /{domain}/v1/page/featured/{yyyy}/{mm}/{dd} (r [00:22:01] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:22:08] RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 71490 bytes in 0.609 second response time [00:22:38] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:22:44] !log restarting hhvm on nodes where it's using >50% mem [00:22:47] bleh [00:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:22:53] !log restarting hhvm on *API* nodes where it's using >50% mem [00:22:58] PROBLEM - Apache HTTP on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.009 second response time [00:22:58] PROBLEM - Apache HTTP on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.011 second response time [00:22:58] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:23:16] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [00:23:29] PROBLEM - Apache HTTP on mw1283 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.013 second response time [00:23:29] PROBLEM - HHVM rendering on mw1282 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.013 second response time [00:23:36] PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:23:56] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [00:24:07] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [00:24:08] done [00:24:17] mutante: for reference: [00:24:17] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.048 second response time [00:24:33] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.052 second response time [00:24:33] RECOVERY - HHVM rendering on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 71491 bytes in 0.198 second response time [00:24:34] cmd.run '/bin/ps -C hhvm -o pmem= | awk '"'"'{sum+=$1} END { if (sum <= 40.0) exit 1 }'"'"' && do_something' [00:24:38] so in the cron it was ps -C hhvm -o etimes= | head -n 1 ) > 259200 to determine how long they've been running, but not memory usage [00:24:58] PROBLEM - HHVM rendering on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:03] my log is wrong, my cutoff was 40% heh [00:25:06] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:06] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 71490 bytes in 0.306 second response time [00:25:06] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 71491 bytes in 5.261 second response time [00:25:07] 
RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.066 second response time [00:25:07] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 71491 bytes in 0.346 second response time [00:25:09] either one is in the right ballpark [00:25:13] bblack: ah, thanks [00:25:17] *nod* [00:25:37] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.216 second response time [00:25:37] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.415 second response time [00:25:40] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.033 second response time [00:25:51] PROBLEM - HHVM rendering on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:56] PROBLEM - HHVM rendering on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:58] i was stuck at Syntax error: "(" unexpected [00:26:07] RECOVERY - HHVM rendering on mw1284 is OK: HTTP OK: HTTP/1.1 200 OK - 71493 bytes in 2.161 second response time [00:26:07] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.350 second response time [00:26:07] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 71493 bytes in 2.162 second response time [00:26:28] yeah that's because \' doesn't work right within the ' we need for the salt cmd.run on the CLI [00:26:46] PROBLEM - Apache HTTP on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:55] but within a single-quoted string, we can insert a single-quote correctly by ending the single quote and concatenating a double-quoted string containing a single-quote :) [00:26:57] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:57] PROBLEM - HHVM rendering on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:27:15] 'Don'"'"'t shoot the 
messenger!' [00:27:16] PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:27:17] :p ok, i just tried double quotes around the whole cmd [00:27:22] hahaa, ok [00:27:27] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [00:27:28] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 71491 bytes in 2.277 second response time [00:27:36] looks like it's hitting some of the others now [00:27:37] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.644 second response time [00:27:43] will wait a few mins and re-run the salt again to pick up the new ones [00:28:09] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [00:28:16] PROBLEM - HHVM rendering on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:28:20] RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 71503 bytes in 0.157 second response time [00:28:47] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.927 second response time [00:29:01] bblack: now i wonder if https://gerrit.wikimedia.org/r/#/c/315938/1/modules/role/manifests/mediawiki/webserver.pp should be amended to use actual memory usage instead of just "time running" [00:29:29] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 3.664 second response time [00:29:31] maybe [00:29:38] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 71518 bytes in 1.518 second response time [00:29:46] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 4.671 second response time [00:29:47] this particular problem ramps in too fast for it to be daily (or longer) cron check though [00:30:17] looks like some cases that are obviously-bad in ganglia slip through the 40% filter, will try 
30% next time around [00:31:12] !log restarting hhvm on API nodes where it's using >30% mem [00:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:31:27] RECOVERY - HHVM rendering on mw1203 is OK: HTTP OK: HTTP/1.1 200 OK - 71518 bytes in 4.380 second response time [00:31:27] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [00:32:58] PROBLEM - HHVM rendering on mw1191 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.018 second response time [00:33:06] PROBLEM - Apache HTTP on mw1201 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.049 second response time [00:33:38] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 71517 bytes in 0.200 second response time [00:34:35] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [00:35:21] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.218 second response time [00:35:37] RECOVERY - HHVM rendering on mw1191 is OK: HTTP OK: HTTP/1.1 200 OK - 71519 bytes in 2.436 second response time [00:35:38] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.103 second response time [00:36:41] looks promising, Icinga shows none of them anymore [00:36:50] well, 1 [00:36:56] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [00:37:32] just mw1281 [00:38:17] PROBLEM - Apache HTTP on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:39:27] PROBLEM - HHVM rendering on mw1281 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.009 second response time [00:39:42] !log restarted hhvm on mw1281 (was at 47.7% usage) [00:39:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:40:49] yeah I think they'll keep popping up periodically [00:41:13] I'm tempted to start a screen session with a while loop doing the memory-restart-salt every 10-15 minutes [00:41:57] with > 50% ? yea, that doesnt sound bad [00:42:04] even the ones we've already restarted, ganglia shows the memory ramping back in again [00:42:07] RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 71487 bytes in 0.652 second response time [00:43:27] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.297 second response time [00:43:36] mutante: starting one up with -S api-hhvm-restarts, with a 10 minute while loop and 40% cutoff [00:44:32] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [00:44:33] bblack: sounds good [00:44:52] i see that in ganglia too yea [00:45:04] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [00:47:00] mutante: it's running now (first salt going out now from the loop) [00:47:40] ack [00:47:42] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:48:09] !log restarting API hhvms with >40% mem usage via salt every 10 minutes in a loop from here forward. screen session on neodymium, named api-hhvm-restarts [00:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:48:38] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:49:46] I'm detached from it, got to run for a bit. I think this will keep things relatively in check until it subsides. call if you need me! 
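The quoting trick and the periodic loop described above can be sketched together. This is an illustration only: the actual salt invocation and target used in the screen session are not shown in the log, so `sh -c` stands in for the remote shell, `printf` fakes the `ps` output, and `echo over-threshold` replaces the restart, much like the echo test run mentioned earlier. `check_cmd` and `run_periodically` are hypothetical names, and the loop is bounded here, whereas the one in the log ran indefinitely.

```shell
# The inner command needs literal single quotes around the awk program.
# Inside an outer single-quoted string, each one is written as '"'"'
# (close the quote, add a double-quoted ', reopen the quote).
# printf fakes two hhvm processes at 30% and 25%; echo replaces the restart.
check_cmd='printf "30\n25\n" | awk '"'"'{ sum += $1 } END { if (sum <= 40.0) exit 1 }'"'"' && echo over-threshold'

# Hypothetical helper: run command string $3 a total of $1 times,
# sleeping $2 seconds between runs.
run_periodically() {
    runs="$1"
    pause="$2"
    cmd="$3"
    i=0
    while [ "$i" -lt "$runs" ]; do
        # A non-zero exit only means "below threshold", so it is ignored.
        sh -c "$cmd" || true
        i=$((i + 1))
        if [ "$i" -lt "$runs" ]; then
            sleep "$pause"
        fi
    done
}

# Dry run: two passes with no pause.
run_periodically 2 0 "$check_cmd"
```

For the 10-minute interval mentioned in the log the pause would be 600, and the loop would be started inside a named session such as `screen -S api-hhvm-restarts` so it survives detaching.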
[00:49:47] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:51:55] bblack: thank you, ok [00:55:48] (CR) Dzahn: "do we want to make it check actual memory usage with the "/bin/ps -C hhvm -o pmem= | awk .." instead of time running?" [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) (owner: Giuseppe Lavagetto) [00:57:49] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:59:41] Operations: sftp gives bogus "Couldn't stat remote file: No such file or directory" - https://phabricator.wikimedia.org/T146509#2724323 (Dzahn) a:Mattflaschen-WMF [01:45:25] (PS1) Yuvipanda: docker: Don't subscribe service to package [puppet] - https://gerrit.wikimedia.org/r/316507 [01:46:04] (PS2) Yuvipanda: docker: Don't subscribe service to package [puppet] - https://gerrit.wikimedia.org/r/316507 [01:46:11] (CR) Yuvipanda: [C: 2 V: 2] docker: Don't subscribe service to package [puppet] - https://gerrit.wikimedia.org/r/316507 (owner: Yuvipanda) [02:03:01] (PS1) Yuvipanda: toollabs: Don't mount /srv in docker builders [puppet] - https://gerrit.wikimedia.org/r/316511 [02:03:34] (PS2) Yuvipanda: toollabs: Don't mount /srv in docker builders [puppet] - https://gerrit.wikimedia.org/r/316511 [02:04:58] (CR) Yuvipanda: [C: 2 V: 2] toollabs: Don't mount /srv in docker builders [puppet] - https://gerrit.wikimedia.org/r/316511 (owner: Yuvipanda) [02:06:40] (PS1) Alex Monk: deployment-prep: Make LVS config compatible with new requirements [puppet] - https://gerrit.wikimedia.org/r/316512 [02:09:51] Operations, Gerrit, Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2724169 (demon) >>!
In T148478#2724184, @Dzahn wrote: > On Oct 13 the log4j.properties were merged in https://gerrit.wikimedia.org/r/#/c/315571/ and there is no... [02:30:15] (PS4) Huji: Reverting votewiki back to en Change was meant to be temporary. fawiki elections have been since over [mediawiki-config] - https://gerrit.wikimedia.org/r/316291 (https://phabricator.wikimedia.org/T148352) [02:30:56] is there some kind of scap going on? [02:31:10] (PS5) Huji: Reverting votewiki back to en [mediawiki-config] - https://gerrit.wikimedia.org/r/316291 (https://phabricator.wikimedia.org/T148352) [02:31:11] nevermind! [02:31:12] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.22) (duration: 10m 20s) [02:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:37:02] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Oct 18 02:37:01 UTC 2016 (duration 5m 49s) [02:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:37:57] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:42:57] Krenair , isn't change tracking intended to keep changes that are done, to be on-topic?
[02:43:11] Operations, Traffic: Extend check_sslxnn to check OCSP Stapling - https://phabricator.wikimedia.org/T148490#2724429 (BBlack) [03:03:48] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [03:04:49] (PS1) Dzahn: gerrit: log4j config to log to files [puppet] - https://gerrit.wikimedia.org/r/316517 [03:06:44] (CR) Dzahn: [C: 2] "taken from an example specifically for gerrit, adjusted pathes to /var/log/gerrit/, commented the Apache part" [puppet] - https://gerrit.wikimedia.org/r/316517 (owner: Dzahn) [03:06:59] (PS2) Dzahn: gerrit: log4j config to log to files [puppet] - https://gerrit.wikimedia.org/r/316517 [03:09:58] (CR) Dzahn: "thanks to https://gerrit.wikimedia.org/r/#/c/315846/ we dont have to restart gerrit" [puppet] - https://gerrit.wikimedia.org/r/316517 (owner: Dzahn) [03:15:13] !log restarting gerrit for logging config change [03:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:18:13] !log gerrit has logs now in /var/log/gerrit/ [03:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:19:34] !log restarted grrrit-wm [03:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:22:00] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [03:22:00] Operations, Gerrit, Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2724455 (Dzahn) >>! In T148478#2724425, @demon wrote: > Ugh, then we need to remove that until we've got acceptable logging config, mea culpa. I made log4j log t...
[03:28:44] (PS2) Dzahn: tcpircbot: improve firewall rule setup [puppet] - https://gerrit.wikimedia.org/r/316497 [03:47:39] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:24:40] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Testing on Production, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2724464 (AndyRussG) Another option: In CentralNotice's `BannerRenderer`, instead of [[ https://gi... [04:28:00] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:31:50] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Testing on Production, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2724469 (awight) I'm not sure pulling the revision will reduce complexity. If we do it that way, w... [05:42:48] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [05:44:37] (CR) Yuvipanda: [C: -1] "I don't understand what advantage using 'profiles' here gives us over using classes, and I see several disadvantages:" [puppet] - https://gerrit.wikimedia.org/r/315717 (https://phabricator.wikimedia.org/T147181) (owner: Giuseppe Lavagetto) [05:46:56] !log a l v a r o m o l i n a y j e m s e a m a n [05:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:47:20] !log a l v a r o m o l i n a y a l e x z s e a m a n [05:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:47:45] !log d a x i s m y e n e m i e [05:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:57:11] logbot should have been take measures on who is allowed to post to sal lol [05:58:08] (CR) Giuseppe Lavagetto: "1 - hiera calls are all at the top, and are explicit vs the old implicit lookup. Just to remembere: a class param can be defined either v" [puppet] - https://gerrit.wikimedia.org/r/315717 (https://phabricator.wikimedia.org/T147181) (owner: Giuseppe Lavagetto) [05:58:52] not this again [05:59:03] should we revert the sal page or thats not something advisable to do as i only see the bot in the history of the page? [06:00:16] AlexZ [06:00:43] oh I fixed it already [06:02:36] i could too though as i already had it open but wanted to confirm the above point [06:03:25] yes it should be fine to edit it to remove things like that [06:03:59] The only thing that can't be fixed right now is the twitter feed [06:04:15] which is unfortunate https://twitter.com/wikimediatech/ [06:04:27] login to twitter and delete the tweets? [06:04:47] I don't have access to it and don't know who does.
[06:05:52] otherwise i smell breaking news headlines telling wmf has been h@x4d lol [06:06:28] enen though its false [06:07:03] It was worse the other day, so thankfully no one has caught on quite yet. [06:07:59] the other day? [06:08:40] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:09:58] https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&oldid=900185 [06:12:29] oh god [06:13:31] (CR) Giuseppe Lavagetto: "Also, having only explicit hiera calls and avoiding implicit hiera params lookup was one of the things we agreed on at the puppet session " [puppet] - https://gerrit.wikimedia.org/r/315717 (https://phabricator.wikimedia.org/T147181) (owner: Giuseppe Lavagetto) [06:26:11] (PS2) Marostegui: db-equiad: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/316296 (https://phabricator.wikimedia.org/T147305) [06:28:35] (CR) Jcrespo: [C: -1] db-equiad: Depool db1064 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/316296 (https://phabricator.wikimedia.org/T147305) (owner: Marostegui) [06:30:30] (PS3) Marostegui: db-equiad: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/316296 (https://phabricator.wikimedia.org/T147305) [06:41:42] (CR) Muehlenhoff: "Looks good to me, but let's only add eventlog2001 when actually setup." [puppet] - https://gerrit.wikimedia.org/r/316497 (owner: Dzahn) [06:45:58] (CR) Alexandros Kosiaris: [C: 1] tcpircbot: improve firewall rule setup [puppet] - https://gerrit.wikimedia.org/r/316497 (owner: Dzahn) [06:46:36] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2724539 (yuvipanda) Hello! thanks for writing up the detailed proposal. There are several concerns / points of confusion for me, so I'll try to lay th...
[06:48:16] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2724551 (yuvipanda) > Any node in site.pp must include exactly one role, with the "role()" function. No exceptions allowed. Does this include labs to... [06:50:32] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2724553 (yuvipanda) > Almost all hiera definitions should go in the "role" heirarchy. Only exceptions can be shared data structures that can be used b... [06:51:39] (CR) Alexandros Kosiaris: "Ah just saw the comment about some AAAA records not being there yet. Well ferm will actually fail if AAAA are missing. It will return" [puppet] - https://gerrit.wikimedia.org/r/316497 (owner: Dzahn) [06:54:31] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2724555 (yuvipanda) > Labs and production roles might differ without harm; so for instance a production role might include standard and base::firewall... [06:55:21] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2724556 (Joe) So @yuvi first things to make clear: - our current roles are mostly profiles in this definition. - The one role: one machine rule is ba... 
[07:04:12] (PS1) Yuvipanda: puppetmaster: Normalize per project hiera path between private/public [puppet] - https://gerrit.wikimedia.org/r/316526 [07:05:19] (CR) Alexandros Kosiaris: [C: 1] postgresql - ensure that correct locale is used to initialise postgres cluster [puppet] - https://gerrit.wikimedia.org/r/316342 (owner: Gehel) [07:10:01] Operations, Traffic: repeated 503 errors for 90 minutes now - https://phabricator.wikimedia.org/T146451#2724568 (Joe) For the record, yesterday's problem is different from the one we had before; it's still a memleak but of a different nature. if no ticket is open for that, I'll open one this morning. [07:10:27] (PS2) Yuvipanda: puppetmaster: Normalize per project hiera path between private/public [puppet] - https://gerrit.wikimedia.org/r/316526 [07:10:32] (CR) Yuvipanda: [C: 2 V: 2] puppetmaster: Normalize per project hiera path between private/public [puppet] - https://gerrit.wikimedia.org/r/316526 (owner: Yuvipanda) [07:15:44] (PS1) Yuvipanda: puppetmaster: Match ops/puppet for labs projects' common.yaml [puppet] - https://gerrit.wikimedia.org/r/316527 [07:17:39] (PS2) Yuvipanda: puppetmaster: Match ops/puppet for labs projects' common.yaml [puppet] - https://gerrit.wikimedia.org/r/316527 [07:17:45] (CR) Yuvipanda: [C: 2 V: 2] puppetmaster: Match ops/puppet for labs projects' common.yaml [puppet] - https://gerrit.wikimedia.org/r/316527 (owner: Yuvipanda) [07:35:47] (PS1) Alexandros Kosiaris: icinga: Enable apache::mod::cgi [puppet] - https://gerrit.wikimedia.org/r/316528 [07:36:22] (PS3) Jcrespo: mariadb: add mariadb::service for all 10.0 mariadb::packages_wmf [puppet] - https://gerrit.wikimedia.org/r/316332 [07:41:09] Operations, Analytics-Kanban, Traffic, Patch-For-Review: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2724580 (elukey) I tried to compare a 
Miley Cyrus link logging correctly a 400 (and Timestamp:Resp) with a "ba... [07:41:44] (03PS3) 10Giuseppe Lavagetto: role::scb: use role::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/316346 [07:42:17] <_joe_> mobrovac: I'm merging this ^^ [07:42:28] yay [07:42:33] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] role::scb: use role::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/316346 (owner: 10Giuseppe Lavagetto) [07:42:52] <_joe_> running it on scb2001 now [07:44:20] <_joe_> noop! [07:47:54] (03CR) 10Jcrespo: [C: 031] db-equiad: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316296 (https://phabricator.wikimedia.org/T147305) (owner: 10Marostegui) [07:48:32] (03CR) 10Jcrespo: "Check ongoing traffic, vslow/dumps can take hours to depool every time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316296 (https://phabricator.wikimedia.org/T147305) (owner: 10Marostegui) [07:49:49] (03CR) 10Elukey: "Tests:" [puppet] - 10https://gerrit.wikimedia.org/r/316306 (https://phabricator.wikimedia.org/T148412) (owner: 10Elukey) [07:51:03] (03CR) 10Marostegui: [C: 032] db-equiad: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316296 (https://phabricator.wikimedia.org/T147305) (owner: 10Marostegui) [07:51:34] (03Merged) 10jenkins-bot: db-equiad: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316296 (https://phabricator.wikimedia.org/T147305) (owner: 10Marostegui) [07:52:40] (03PS1) 10Ema: Fix bashisms [puppet] - 10https://gerrit.wikimedia.org/r/316529 (https://phabricator.wikimedia.org/T95064) [07:53:29] <_joe_> ema: I love bashisms :) [07:54:46] :) [07:57:16] !log marostegui@mira Synchronized wmf-config/db-eqiad.php: Depool db1064 as it needs an ALTER table and pool db1068 temporarily to serve vslow and dump service - T147305 (duration: 02m 53s) [07:57:17] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305 [07:57:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:58:10] <_joe_> ema: I'm pretty sure all of my shell code is full of bashisms [07:58:13] 07Puppet, 13Patch-For-Review: Bashisms in various /bin/sh scripts - https://phabricator.wikimedia.org/T95064#1179518 (10MoritzMuehlenhoff) We should also add a CI test, otherwise these will inevitably creep back in over time... [07:58:26] <_joe_> both conscious and unconscious [08:00:05] _joe_: I'm perfectly happy with bash as long as the script uses /bin/bash ;) [08:00:18] <_joe_> ema: ah right! [08:00:19] agree [08:02:16] (03PS2) 10Giuseppe Lavagetto: role::mediawiki::webserver: convert to role::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/316347 [08:02:51] (03CR) 10Jcrespo: "I have no issue with this, but I would like Riccardo or Giuseppe's opinion on modifying an imported script (we discussed this on the offsi" [puppet] - 10https://gerrit.wikimedia.org/r/316529 (https://phabricator.wikimedia.org/T95064) (owner: 10Ema) [08:05:12] (03CR) 10Jcrespo: "To clarify, I am +1 eventlogging_sync.init and can do it now (I did not create the script, but I maintain it). I can apply that right now." [puppet] - 10https://gerrit.wikimedia.org/r/316529 (https://phabricator.wikimedia.org/T95064) (owner: 10Ema) [08:05:31] jynus: good point, which one are externals/imported? [08:05:48] I will fix files/mariadb/eventlogging_sync.init now [08:05:56] the imported one is mysql, volans [08:06:03] the right db module is mariadb [08:06:23] however, mysql is apparently used by icinga to install client libraries, etc. 
[08:06:44] I haven't check the others [08:08:44] whatever we do, better discuss it on T95064 [08:08:44] T95064: Bashisms in various /bin/sh scripts - https://phabricator.wikimedia.org/T95064 [08:09:04] I love gerrit, but I hate it as a discussion platform [08:09:16] yeah [08:12:28] 07Puppet, 13Patch-For-Review: Bashisms in various /bin/sh scripts - https://phabricator.wikimedia.org/T95064#1179518 (10jcrespo) @scfc note at least some of those are 3rd party, and we may (not sure yet) decide to keep compatibility with upstream, and it should be reported there anyway. That is at least true f... [08:13:10] (03CR) 10Giuseppe Lavagetto: [C: 032] "PCC thinks this is fine." [puppet] - 10https://gerrit.wikimedia.org/r/316347 (owner: 10Giuseppe Lavagetto) [08:17:20] (03PS1) 10Jcrespo: mariadb: Fix bashism on eventlogging.init [puppet] - 10https://gerrit.wikimedia.org/r/316532 (https://phabricator.wikimedia.org/T95064) [08:17:47] (03PS2) 10Jcrespo: mariadb: Fix bashism on eventlogging.init [puppet] - 10https://gerrit.wikimedia.org/r/316532 (https://phabricator.wikimedia.org/T95064) [08:18:55] (03CR) 10Jcrespo: [C: 032] mariadb: Fix bashism on eventlogging.init [puppet] - 10https://gerrit.wikimedia.org/r/316532 (https://phabricator.wikimedia.org/T95064) (owner: 10Jcrespo) [08:20:11] (03PS2) 10Jcrespo: Fix bashisms [puppet] - 10https://gerrit.wikimedia.org/r/316529 (https://phabricator.wikimedia.org/T95064) (owner: 10Ema) [08:20:45] 07Puppet, 13Patch-For-Review: Bashisms in various /bin/sh scripts - https://phabricator.wikimedia.org/T95064#1179518 (10Volans) For 3rd party or we keep them as is because are 3rd party or when modifying them we also add a changelog entry in the header to keep track of the changes. In all cases we should repo... 
[08:21:51] Puppet, Patch-For-Review: Bashisms in various /bin/sh scripts - https://phabricator.wikimedia.org/T95064#2724599 (jcrespo) To prove I am not blocking this for the sake of blocking, I fixed this on: * mysql_wmf * mysql_multi_instance * eventlogging_sync [08:25:39] em, jynus thanks because you forced me to open the task I had in my backlog from the offsite ;) [08:25:48] Operations, Continuous-Integration-Config, Operations-Software-Development: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2724601 (Volans) [08:25:49] ema [08:26:38] so, I fixed mine, I really do not care about the rest, I do not use them or maintain them, so I will go with the group decision [08:26:58] aka it is all volans' problem now :-) [08:27:10] hehe [08:27:15] aka implied +1? [08:27:49] Puppet, Patch-For-Review: Bashisms in various /bin/sh scripts - https://phabricator.wikimedia.org/T95064#2724618 (Volans) FYI: I've opened the related task T148494 that was actually on my backlog. [08:27:58] Operations, Ops-Access-Requests: Access to stat1002, stat1003, stat1004) for user pmiazga - https://phabricator.wikimedia.org/T148472#2724620 (elukey) p:Triage>Normal [08:33:25] Operations, Ops-Access-Requests: Access to stat1002, stat1003, stat1004) for user pmiazga - https://phabricator.wikimedia.org/T148472#2724062 (elukey) These are the groups available: https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups I think that `analytics-privatedata-users` shoul... 
[08:34:57] Operations, Continuous-Integration-Config, Operations-Software-Development: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2724636 (ema) p:Triage>Normal [08:37:31] Operations, Ops-Access-Requests: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2724142 (elukey) Hi Piotr, welcome :) Have you completed all the steps in https://wikitech.wikimedia.org/wiki/Production_shell_access#New_users and take? Thanks! [08:37:38] Operations, Ops-Access-Requests: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2724639 (elukey) p:Triage>Normal [08:38:34] Operations, Ops-Access-Requests, Services (blocked): Access request: #mediawiki_security and fluorine for Petr - https://phabricator.wikimedia.org/T148473#2724642 (elukey) p:Triage>Normal [08:38:50] Operations, Ops-Access-Requests, Services (blocked): Access to fluorine for Petr - https://phabricator.wikimedia.org/T148475#2724643 (elukey) p:Triage>Normal [08:39:01] Operations, Ops-Access-Requests, Services (blocked): Access request: #mediawiki_security for Petr. 
- https://phabricator.wikimedia.org/T148476#2724658 (10elukey) p:05Triage>03Normal [08:40:54] (03PS2) 10Muehlenhoff: Configure tin for installation with jessie [puppet] - 10https://gerrit.wikimedia.org/r/315469 [08:42:45] !log clone db1052 -> db1053, will perform maintenance (db restarts, reboots on both) at the same time [08:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:44:44] (03CR) 10Muehlenhoff: [C: 032] Configure tin for installation with jessie [puppet] - 10https://gerrit.wikimedia.org/r/315469 (owner: 10Muehlenhoff) [08:45:41] 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access to fluorine for Petr - https://phabricator.wikimedia.org/T148475#2724106 (10elukey) The request looks good to me, the `mw-log-readers` group is probably the best fit for this. https://wikitech.wikimedia.org/wiki/Production_shell_access#Addit... [08:45:56] ignore my last log [08:52:07] (03PS2) 10Gehel: postgresql - ensure that correct locale is used to initialise postgres cluster [puppet] - 10https://gerrit.wikimedia.org/r/316342 [08:53:20] (03CR) 10Gehel: [C: 032] postgresql - ensure that correct locale is used to initialise postgres cluster [puppet] - 10https://gerrit.wikimedia.org/r/316342 (owner: 10Gehel) [08:53:56] !log Deploying ALTER table on S4 commonswiki (db1064 — last host) - T147305 [08:53:57] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305 [08:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:56:57] !log reimaging tin to jessie [08:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:05:05] (03PS4) 10Filippo Giunchedi: standard: add prometheus node_exporter in codfw [puppet] - 10https://gerrit.wikimedia.org/r/310519 (https://phabricator.wikimedia.org/T140646) [09:05:12] (03CR) 10Ema: [C: 031] Tune varnishkafka-webrequest parameters [puppet] - 10https://gerrit.wikimedia.org/r/316306 
(https://phabricator.wikimedia.org/T148412) (owner: 10Elukey) [09:06:44] (03PS9) 10Gehel: Maps - cleanup postgres user creation [puppet] - 10https://gerrit.wikimedia.org/r/315271 (https://phabricator.wikimedia.org/T147194) [09:07:39] 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access request: #mediawiki_security for Petr. - https://phabricator.wikimedia.org/T148476#2724121 (10elukey) Looks good to me, waiting one day just in case. I'll coordinate with him once online. [09:08:41] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:11:01] (03CR) 10Filippo Giunchedi: "PCC for hosts that would have node-exporter added and those that have it already https://puppet-compiler.wmflabs.org/4405/" [puppet] - 10https://gerrit.wikimedia.org/r/310519 (https://phabricator.wikimedia.org/T140646) (owner: 10Filippo Giunchedi) [09:11:52] (03PS4) 10Elukey: Tune varnishkafka-webrequest parameters [puppet] - 10https://gerrit.wikimedia.org/r/316306 (https://phabricator.wikimedia.org/T148412) [09:12:39] any objections in going ahead with https://gerrit.wikimedia.org/r/#/c/310519 ? node-exporter in codfw [09:13:06] <_joe_> nope [09:13:18] <_joe_> I guess that if clause goes away soon, right? [09:13:19] <_joe_> :P [09:13:34] that's my hope too! 
[09:16:36] (03CR) 10Filippo Giunchedi: [C: 032] standard: add prometheus node_exporter in codfw [puppet] - 10https://gerrit.wikimedia.org/r/310519 (https://phabricator.wikimedia.org/T140646) (owner: 10Filippo Giunchedi) [09:18:08] (03CR) 10Elukey: [C: 032] Tune varnishkafka-webrequest parameters [puppet] - 10https://gerrit.wikimedia.org/r/316306 (https://phabricator.wikimedia.org/T148412) (owner: 10Elukey) [09:18:12] (03PS5) 10Elukey: Tune varnishkafka-webrequest parameters [puppet] - 10https://gerrit.wikimedia.org/r/316306 (https://phabricator.wikimedia.org/T148412) [09:18:55] !log upgrade nodejs to 4.6.0 on maps2* servers [09:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:25:49] !log varnishkafka restarting in upload/misc/maps with new settings (https://gerrit.wikimedia.org/r/316306) [09:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:27:44] (03PS1) 10Gehel: maps - change structure of slaves according to https://gerrit.wikimedia.org/r/#/c/315271/ [labs/private] - 10https://gerrit.wikimedia.org/r/316536 (https://phabricator.wikimedia.org/T147194) [09:28:07] (03CR) 10Gehel: [C: 032 V: 032] maps - change structure of slaves according to https://gerrit.wikimedia.org/r/#/c/315271/ [labs/private] - 10https://gerrit.wikimedia.org/r/316536 (https://phabricator.wikimedia.org/T147194) (owner: 10Gehel) [09:28:59] !log reimaging mw1168 to Debian Jessie (MW Jobrunner) [09:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:31:45] (03CR) 10Gehel: "Since the labs private repo needs to be updated for this change, the puppet compiler can't compile both the current and the changed versio" [puppet] - 10https://gerrit.wikimedia.org/r/315271 (https://phabricator.wikimedia.org/T147194) (owner: 10Gehel) [09:35:15] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:16] PROBLEM - 
puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Group[prometheus-node-exporter],Service[prometheus-node-exporter] [09:37:44] sad_trombone.wav [09:38:45] ah the "failure while writing changes to /etc/group" error [09:40:09] 06Operations, 06Performance-Team, 10Thumbor: Thumbor doesn't handle page redirects like Mediawiki does - https://phabricator.wikimedia.org/T148410#2724696 (10Gilles) Having looked at both Thumbor and rewrite.py, it's a lot cleaner to do this in rewrite.py. rewrite.py already has logic to reconstruct which wi... [09:40:26] PROBLEM - puppet last run on mw2159 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[jobrunner] [09:44:17] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Group[prometheus-node-exporter],Service[prometheus-node-exporter] [09:47:03] mhh I remember we've ran into this issue before, no luck on phabricator yet [09:47:46] T145793 [09:47:47] T145793: Prometheus puppet manifest fail on Trusty instance deployment-zotero1 groupadd: failure while writing changes to /etc/group - https://phabricator.wikimedia.org/T145793 [10:03:46] RECOVERY - puppet last run on mw2159 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:05:05] 06Operations, 10MediaWiki-API: Api cluster issues - https://phabricator.wikimedia.org/T148448#2724719 (10Aklapper) @Zppix: For future reference, please follow https://mediawiki.org/wiki/How_to_report_a_bug to actually describe the "api cluster issues" in your bug report - thanks a lot! [10:10:25] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:11:23] <_joe_> kraz... [10:11:39] <_joe_> what was that again? 
[10:11:59] <_joe_> irc broadcast in codfw [10:12:08] <_joe_> so nothing I did before [10:12:55] error: could not load image from https://wikitech.wikimedia.org/w/images/thumb/4/4d/Infrastructure_overview.png/1024px-Infrastructure_overview.png [10:15:14] <_joe_> arseny92: that url is a 404 [10:15:17] https://wikitech.wikimedia.org/wiki/Category:Wikimedia_infrastructure [10:15:21] https://wikitech.wikimedia.org/wiki/File:Infrastructure_overview.png [10:15:50] <_joe_> heh [10:16:03] https://wikitech.wikimedia.org/w/images/4/4d/Infrastructure_overview.png [10:16:19] <_joe_> now why wikitech is not recreating the thumb is the good question [10:16:24] <_joe_> let me check the logs [10:17:19] the last 3 revisions lack a thumbs as you can see [10:17:21] _joe_: known problem,let me find the bug [10:17:29] <_joe_> oh ok [10:17:36] I merged the first part of the fix yesterday [10:17:55] <_joe_> arseny92: so it seems it's a known problem [10:18:01] https://phabricator.wikimedia.org/T145811 [10:18:05] ok lol [10:18:06] <_joe_> moritzm: has to do with firejailing imagemagick? [10:18:23] <_joe_> exactly :) [10:18:51] indirectly, wikitech was refactored to move away from mediawiki classes so the firejail wrapper and config are no longer provided/stale [10:19:05] <_joe_> sigh [10:19:12] <_joe_> anyways [10:19:32] arseny92: should be fixed today or tomorrow, I merged the first part of the fix yesterday, the second part is still TBD [10:19:37] <_joe_> arseny92: I hope this addresses your bug report :) [10:20:08] ok nvm i see the task references the same file so [10:28:49] _joe , moritzm , funny as just today morning i was trying to find something like this big , thanks [10:30:44] i.e. 
this https://en.wikipedia.org/wiki/Wikipedia:Help_desk#About_Logo_in_pa.wikipedia.org [10:31:38] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [10:32:46] arseny92, that issue is https://phabricator.wikimedia.org/T148497 [10:32:58] and my wild guess is that it's a bigger problem reported in https://phabricator.wikimedia.org/T148467 [10:33:12] namely that some JPEG files are served as binary garbage. [10:33:29] No idea if _joe_ or others want to investigate that ^^ [10:33:46] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [10:34:06] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [10:34:54] andre__ yes just seen it in releng chan [10:40:08] !log mw1168.eqiad.wmnet back in service after reimage (MW Jobrunner) [10:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:40:23] mw1169 is the last man standing [10:49:27] \o/ tin also running jessie now [10:53:09] <_joe_> :)) [10:54:28] niceeee [10:54:36] andre__: i'm getting it now (compared to this morning), looks like you can't export the developer tools results in either chrome or edge apart from the network load times [10:54:46] all the image files are loading fine for me directly [10:56:35] (03PS1) 10Jcrespo: Add new module proxysql [puppet] - 10https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) [10:58:10] andre__ issue might be on caching or cdn as only happens to some [10:59:13] bt welll setting the logo to a generated one is a bad idea [11:01:42] arseny92, why is it a bad idea? [11:02:06] p858snake|L2: I still cannot reproduce but I don't have Mac OS or Windows around here :( [11:07:27] didn't you read my reply on enwiki? [11:08:21] or you'd say things work differently? 
My message also has relevant links [11:13:16] (03PS2) 10Giuseppe Lavagetto: role::mediawiki::webserver: restart hhvm routinely [puppet] - 10https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) [11:13:55] (03PS1) 10Filippo Giunchedi: Introduce mtail module [puppet] - 10https://gerrit.wikimedia.org/r/316543 (https://phabricator.wikimedia.org/T147923) [11:13:57] (03PS1) 10Filippo Giunchedi: centralserver: add mtail for kernel messages [puppet] - 10https://gerrit.wikimedia.org/r/316544 (https://phabricator.wikimedia.org/T147923) [11:14:28] 06Operations, 10Ops-Access-Requests: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2724889 (10pmiazga) Hey @elukey Yes, I completed all the steps from wikitech page. I'm using two WikiMedia ssh keys (that are different from my prrivate key), one for Wikitech/... [11:18:05] (03CR) 10Marostegui: Add new module proxysql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) (owner: 10Jcrespo) [11:18:55] andre__ ^ [11:25:58] (03PS1) 10Muehlenhoff: Also provide imagemagick wrapper in openstack::nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/316545 (https://phabricator.wikimedia.org/T145811) [11:31:03] (03PS1) 10Filippo Giunchedi: prometheus: account for public hosts when generating targets [puppet] - 10https://gerrit.wikimedia.org/r/316546 [11:35:25] (03CR) 10Jcrespo: "See comment." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) (owner: 10Jcrespo) [11:35:59] (03PS3) 10Giuseppe Lavagetto: role::mediawiki::webserver: restart hhvm routinely [puppet] - 10https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) [11:37:00] (03CR) 10Marostegui: [C: 031] Add new module proxysql [puppet] - 10https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) (owner: 10Jcrespo) [11:37:32] (03CR) 10Marostegui: Add new module proxysql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) (owner: 10Jcrespo) [11:37:47] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:37:49] !log reimaging mw1169 to Debian Jessie (MW Jobrunner) [11:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:39:01] (03PS2) 10Filippo Giunchedi: prometheus: expand domain search list [puppet] - 10https://gerrit.wikimedia.org/r/316546 [11:43:57] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [11:44:55] PROBLEM - puppet last run on db1052 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:46:22] (03PS3) 10Filippo Giunchedi: prometheus: expand domain search list [puppet] - 10https://gerrit.wikimedia.org/r/316546 (https://phabricator.wikimedia.org/T140646) [11:48:16] (03PS1) 10Gehel: maps - adding dummy monitoring password for postgresql [labs/private] - 10https://gerrit.wikimedia.org/r/316549 (https://phabricator.wikimedia.org/T147194) [11:48:47] (03PS1) 10Muehlenhoff: hhvm: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/316550 [11:50:15] (03CR) 10Gehel: [C: 032 V: 032] maps - adding dummy monitoring password for postgresql [labs/private] - 10https://gerrit.wikimedia.org/r/316549 (https://phabricator.wikimedia.org/T147194) (owner: 10Gehel) [11:50:34] (03CR) 10Marostegui: [C: 031] mariadb: add mariadb::service for all 10.0 mariadb::packages_wmf [puppet] - 10https://gerrit.wikimedia.org/r/316332 (owner: 10Jcrespo) [11:55:32] !log removed /etc/mysql/conf.d/research-client.cnf from stat1002 (root:root perms, not supposed to be there but only on stat1003) [11:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:56:50] (03CR) 10Faidon Liambotis: [C: 031] "LGTM — I'm not 100% sure if the acknowledgement will be welcome, might be worth asking Filippo + DBAs (the main consumers of this), as wel" [puppet] - 10https://gerrit.wikimedia.org/r/304026 (https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [11:57:16] (03CR) 10Faidon Liambotis: [C: 031] "LGTM. 
It's a hack that I don't love but trying to be pragmatic :)" [puppet] - 10https://gerrit.wikimedia.org/r/303992 (https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [11:59:57] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:02:17] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [12:09:09] RECOVERY - puppet last run on db1052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:10:22] (03CR) 10Jcrespo: "@Faidon Actually, see my comment here: https://phabricator.wikimedia.org/T95064#2724592 and https://phabricator.wikimedia.org/T95064#27245" [puppet] - 10https://gerrit.wikimedia.org/r/304026 (https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [12:11:20] jynus: are you sure you commented on the right changset? [12:11:40] arg [12:11:45] it sounds like your comments are applicable to https://gerrit.wikimedia.org/r/#/c/316529/ [12:11:48] sorry [12:11:56] yes [12:12:10] which I haven't commented on yet :) [12:12:24] too many tabs open [12:12:36] for that one, I was about to raise the 3rd-party code issue, then clicked on the referenced task and saw others did before me [12:12:44] he he :-) [12:12:51] we are at least on sync there [12:13:12] I am clearly on desync with gerrit [12:13:37] the other one you just replied to is volans' work to open phab tasks automatically on disk failures [12:14:01] no blocker to me [12:14:06] I was aware of it [12:14:45] to be fair, when there is disk throughput issues [12:15:00] the RAID tends to be very silent [12:15:34] as it is right now, an alert will be raised and then acknowledged shortly thereafter [12:15:45] and a task will be opened that won't have the #DBA tag initially [12:16:09] I documented this here: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Caused_by_hardware [12:16:09] so I'm kind of afraid that you guys (for example) 
might miss it, and thus miss the fact that one of our servers has a broken disk [12:16:36] yes, I share the sentiment, I do not think it is too bad- we can refine it later [12:16:43] alright then [12:16:58] on the link above^ [12:17:18] is an example of predictive failures creating throughput issues [12:17:26] but icinga not alerting [12:17:37] but that is not our fault, it is the RAID [12:17:56] so the gerrit behaviour is in no way problematic [12:20:23] (03CR) 10Jcrespo: "Ignore my last comment, that was for a different review. I am ok with the behaviour (I have not reviewed the patch, however). Problems wil" [puppet] - 10https://gerrit.wikimedia.org/r/304026 (https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [12:21:09] I wonder if that could cause extra load on icinga? [12:21:45] nah [12:28:08] paravoid, BTW, https://gerrit.wikimedia.org/r/316541 (no need for review, just FYI) [12:28:31] 06Operations, 10ops-eqiad, 10netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724967 (10faidon) [12:28:40] jynus: I saw! [12:29:03] hopefully that is the future/foundations of our app-level load balancing and failover [12:32:27] (03CR) 10Faidon Liambotis: [C: 04-1] "Cool! Minor nits inline." 
(3 comments) [puppet] - https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) (owner: Jcrespo) [12:32:46] Operations, ops-eqiad, netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724981 (faidon) [12:38:11] (CR) Jcrespo: Add new module proxysql (2 comments) [puppet] - https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) (owner: Jcrespo) [12:39:23] (PS2) Jcrespo: Add new module proxysql [puppet] - https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) [12:42:24] (CR) Alexandros Kosiaris: [C: -1] "looks fine, minor comments inline" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/315271 (https://phabricator.wikimedia.org/T147194) (owner: Gehel) [12:44:25] (PS5) BBlack: Add ssl_stapling_proxy patch [software/nginx] (wmf-1.11.4) - https://gerrit.wikimedia.org/r/315982 (https://phabricator.wikimedia.org/T93927) [12:44:27] (PS8) BBlack: nginx (1.11.4-1+wmf3) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - https://gerrit.wikimedia.org/r/311689 [12:44:29] (PS4) BBlack: control: depend on libssl11-dev [software/nginx] (wmf-1.11.4) - https://gerrit.wikimedia.org/r/316329 [12:44:31] (PS1) BBlack: Lua module: OpenSSL-1.1 compat fixup [software/nginx] (wmf-1.11.4) - https://gerrit.wikimedia.org/r/316553 [12:45:24] (Abandoned) BBlack: Lua module: OpenSSL-1.1 compat fixup [software/nginx] (wmf-1.11.4) - https://gerrit.wikimedia.org/r/316340 (owner: BBlack) [12:48:48] (PS10) Gehel: Maps - cleanup postgres user creation [puppet] - https://gerrit.wikimedia.org/r/315271 (https://phabricator.wikimedia.org/T147194) [12:49:38] (PS3) Volans: Monitoring: avoid NRPE limit for RAID get status [puppet] - https://gerrit.wikimedia.org/r/303992 (https://phabricator.wikimedia.org/T142085) [12:50:00] (CR) jenkins-bot: [V: -1] Maps - 
cleanup postgres user creation [puppet] - https://gerrit.wikimedia.org/r/315271 (https://phabricator.wikimedia.org/T147194) (owner: Gehel) [12:51:18] (PS11) Gehel: Maps - cleanup postgres user creation [puppet] - https://gerrit.wikimedia.org/r/315271 (https://phabricator.wikimedia.org/T147194) [12:52:25] !log mw1169 back in service after reimage (MW Jobrunner) [12:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:52:41] aaaaaaaand all the job runners are running Jessie now \o/ [12:52:48] \o/ [12:52:57] yay! [12:53:07] nice work elukey! [12:53:13] and everyone else :) [12:53:17] (CR) Volans: [C: 2] Monitoring: avoid NRPE limit for RAID get status [puppet] - https://gerrit.wikimedia.org/r/303992 (https://phabricator.wikimedia.org/T142085) (owner: Volans) [12:53:43] volans provided the necessary magic to complete the job in a short time :D [12:55:21] Whos up for swat? [12:56:15] Operations, Discovery, WMDE-Analytics-Engineering, Wikidata, and 2 others: wdqs - move metric collections to diamond - https://phabricator.wikimedia.org/T146468#2725024 (elukey) p:Triage>Low [12:56:20] elukey: my pleasure :) [12:57:33] Operations, Wikimedia-Apache-configuration: Font list resource doesn't have a "Content-type: text/plain;charset=utf-8" header - https://phabricator.wikimedia.org/T146421#2725031 (elukey) p:Triage>Low [12:57:51] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2725032 (elukey) p:Triage>Normal [12:59:01] Operations, Gerrit, Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2725033 (elukey) p:Triage>Normal [12:59:51] Operations, ops-codfw, DBA: es2014 raid controler temporary failure - https://phabricator.wikimedia.org/T148434#2725037 (elukey) p:Triage>High [13:01:07] 
Operations, ops-eqiad: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421#2725039 (elukey) p:Triage>Normal [13:01:34] uh [13:01:35] (PS1) Alexandros Kosiaris: sca: Fully qualify the lvs::realserver inclusion [puppet] - https://gerrit.wikimedia.org/r/316556 [13:01:36] Operations, ops-eqiad: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421#2722348 (elukey) @Cmjohnson Hi! How should we proceed? [13:01:53] Operations, Prometheus-metrics-monitoring: Consider moving prometheus from VM to baremetal - https://phabricator.wikimedia.org/T148408#2725043 (elukey) p:Triage>Normal [13:02:05] Operations, ops-esams: cp3009: memory scrubbing error - https://phabricator.wikimedia.org/T148422#2725044 (elukey) p:Triage>Normal [13:02:20] Operations, ops-esams: cp3009: memory scrubbing error - https://phabricator.wikimedia.org/T148422#2722360 (elukey) @mark Hi! How should we proceed? [13:02:51] Operations, Analytics-Cluster, hardware-requests: Decommission analytics1026 and analytics1015 - https://phabricator.wikimedia.org/T147313#2725047 (elukey) p:Triage>Normal [13:02:59] Operations, hardware-requests: Decommission db1019 - https://phabricator.wikimedia.org/T147309#2725048 (elukey) p:Triage>Normal [13:03:20] Operations, ops-esams: cp3009: memory scrubbing error - https://phabricator.wikimedia.org/T148422#2725049 (mark) @elukey: we're decommissionining a whole bunch of machines from that same batch, we can probably swap memory next time i'm there. 
[13:03:32] (PS3) Jcrespo: Add new module proxysql [puppet] - https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) [13:04:00] Operations, ops-eqiad: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421#2725051 (Cmjohnson) @elukey, The h/w log is not reporting any DIMM error at the moment please depool this server so I can do some testing but yes most likely DIMM. Thanks [13:04:26] (CR) Alexandros Kosiaris: [C: 1] Maps - cleanup postgres user creation [puppet] - https://gerrit.wikimedia.org/r/315271 (https://phabricator.wikimedia.org/T147194) (owner: Gehel) [13:04:37] akosiaris: thanks! [13:04:43] (CR) jenkins-bot: [V: -1] Add new module proxysql [puppet] - https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) (owner: Jcrespo) [13:04:54] gehel: :-) [13:05:06] (CR) Jcrespo: "I will not fix the /tmp on the module, as it will make it fail for users that do not know how to use it. I will fix it on the role. See al" [puppet] - https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) (owner: Jcrespo) [13:05:24] Operations, ops-eqiad, procurement: eqiad: replenish 1GE copper SFP stock - https://phabricator.wikimedia.org/T148508#2725052 (faidon) [13:05:40] akosiaris: I'm tempted to rewrite this postgres module with custom types instead of this mess of exec and augeas... but I'm not sure I have the courage... [13:05:47] (PS4) Jcrespo: Add new module proxysql [puppet] - https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) [13:06:37] EU swat cancelled as i see no one around? [13:06:42] gehel: I am with you on this one. This postgres module is a constant source of pain [13:07:08] make sense given it was created for a specific purpose and it's now trying to become generic [13:07:35] yeah, it looks like something we are trying to push further than what it was supposed to do... 
[13:08:06] I would wait a bit though. We might be able to get the future parser this quarter, which should make the effort way easier [13:08:19] Yeah for future parser! [13:08:31] arseny92: it was cancelled yesterday due to api cluster issues, the last notice I've seen was "No SWAT/deploys until Ops give the all clear" [13:09:16] (PS5) Jcrespo: Add new module proxysql [puppet] - https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) [13:09:43] (PS6) Jcrespo: Add new module proxysql [puppet] - https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) [13:09:49] dcausse Greg only did put the cancelled label on yesterday's deployment window [13:09:54] (CR) Alexandros Kosiaris: [C: 2] sca: Fully qualify the lvs::realserver inclusion [puppet] - https://gerrit.wikimedia.org/r/316556 (owner: Alexandros Kosiaris) [13:10:18] ie the evening swt [13:11:06] arseny92: I know but this is last info I have and the releng team is not around [13:11:59] Operations, Ops-Access-Requests: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2725078 (elukey) This request is also related to T148472 in which access to stat* boxes is requested. I am going to move this ticket into the 3 days waiting period to let other... [13:12:02] (PS7) Jcrespo: Add new module proxysql [puppet] - https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) [13:13:49] (CR) Jcrespo: [C: 2] "I am going to deploy this now, we can continue the discussion on the dbproxy role." [puppet] - https://gerrit.wikimedia.org/r/316541 (https://phabricator.wikimedia.org/T148500) (owner: Jcrespo) [13:22:07] Operations, Ops-Access-Requests: Access to stat1002, stat1003, stat1004) for user pmiazga - https://phabricator.wikimedia.org/T148472#2725105 (pmiazga) @elukey I'm following the Onboarding guide and it says I need also access to stat1003. 
@dr0ptp4kt could you confirm? [13:26:06] (CR) Ottomata: "Hm, am find with the liquidprompt change, but since I just imported it from an external repo, I'd rather not change it. This one isn't a " [puppet] - https://gerrit.wikimedia.org/r/316529 (https://phabricator.wikimedia.org/T95064) (owner: Ema) [13:30:29] Operations, Ops-Access-Requests: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2724142 (AlexMonk-WMF) First this ticket needs to identify groups (or, in the absence of any existing one, specific rights on specific servers - https://wikitech.wikimedia.org/... [13:30:58] (PS3) Ema: Port varnishrls to Varnish 4 VSL API [puppet] - https://gerrit.wikimedia.org/r/295207 (https://phabricator.wikimedia.org/T131353) [13:31:38] (PS1) Jcrespo: New labs::db::proxy role to load balance and failover replicas [puppet] - https://gerrit.wikimedia.org/r/316558 (https://phabricator.wikimedia.org/T148500) [13:32:20] (PS2) Jcrespo: Create labs::db::proxy role to load balance and failover replicas [puppet] - https://gerrit.wikimedia.org/r/316558 (https://phabricator.wikimedia.org/T148500) [13:32:55] Operations, Wikimedia-Apache-configuration: Font list resource doesn't have a "Content-type: text/plain;charset=utf-8" header - https://phabricator.wikimedia.org/T146421#2725118 (elukey) The `conf` file does not have an extension and I think that Apache does not have any instruction about the MIME type t... 
[13:36:26] (CR) jenkins-bot: [V: -1] Create labs::db::proxy role to load balance and failover replicas [puppet] - https://gerrit.wikimedia.org/r/316558 (https://phabricator.wikimedia.org/T148500) (owner: Jcrespo) [13:39:03] Operations, hardware-requests: eqiad/codfw: swift frontend hardware refresh - https://phabricator.wikimedia.org/T148510#2725129 (fgiunchedi) [13:40:03] (PS3) Jcrespo: Create labs::db::proxy role to load balance and failover replicas [puppet] - https://gerrit.wikimedia.org/r/316558 (https://phabricator.wikimedia.org/T148500) [13:48:42] (PS4) Jcrespo: mariadb: add mariadb::service for all 10.0 mariadb::packages_wmf [puppet] - https://gerrit.wikimedia.org/r/316332 [13:50:44] (CR) Jcrespo: [C: 2] mariadb: add mariadb::service for all 10.0 mariadb::packages_wmf [puppet] - https://gerrit.wikimedia.org/r/316332 (owner: Jcrespo) [13:53:10] !log applying new init.d script on all mariadb 10 servers [13:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:06] PROBLEM - cassandra-b CQL 10.64.32.195:9042 on restbase1008 is CRITICAL: Connection refused [13:57:17] PROBLEM - cassandra-b service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [13:57:27] PROBLEM - cassandra-b SSL 10.64.32.195:7001 on restbase1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [13:58:00] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [13:58:25] !log restarting and upgrading db2049 and es2019 to test new config [13:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:57] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: Connection refused [14:00:16] RECOVERY - cassandra-b CQL 10.64.32.195:9042 on restbase1008 is OK: 
TCP OK - 0.000 second response time on port 9042 [14:00:18] RECOVERY - cassandra-b service on restbase1008 is OK: OK - cassandra-b is active [14:00:26] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [14:00:30] RECOVERY - cassandra-b SSL 10.64.32.195:7001 on restbase1008 is OK: SSL OK - Certificate restbase1008-b valid until 2017-09-12 15:33:41 +0000 (expires in 329 days) [14:00:45] (PS2) Gehel: service::node: allow log level to be configured [puppet] - https://gerrit.wikimedia.org/r/316554 (https://phabricator.wikimedia.org/T148116) [14:01:43] (CR) jenkins-bot: [V: -1] service::node: allow log level to be configured [puppet] - https://gerrit.wikimedia.org/r/316554 (https://phabricator.wikimedia.org/T148116) (owner: Gehel) [14:04:10] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 500 (expecting: 200) [14:04:16] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 500 (expecting: 200) [14:04:57] PROBLEM - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is CRITICAL: Connection refused [14:05:38] PROBLEM - cassandra-b service on restbase1010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [14:05:38] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [14:05:47] PROBLEM - cassandra-b SSL 10.64.0.115:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused 
[14:05:57] (PS3) Gehel: service::node: allow log level to be configured [puppet] - https://gerrit.wikimedia.org/r/316554 (https://phabricator.wikimedia.org/T148116) [14:06:06] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 500 (expecting: 200) [14:06:58] (CR) jenkins-bot: [V: -1] service::node: allow log level to be configured [puppet] - https://gerrit.wikimedia.org/r/316554 (https://phabricator.wikimedia.org/T148116) (owner: Gehel) [14:07:07] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.001 second response time on port 9042 [14:07:25] Operations, hardware-requests: codfw/eqiad: 2x systems for prometheus - https://phabricator.wikimedia.org/T148513#2725201 (fgiunchedi) [14:07:31] (PS4) Gehel: service::node: allow log level to be configured [puppet] - https://gerrit.wikimedia.org/r/316554 (https://phabricator.wikimedia.org/T148116) [14:07:58] Operations, Prometheus-metrics-monitoring: Consider moving prometheus from VM to baremetal - https://phabricator.wikimedia.org/T148408#2725216 (fgiunchedi) [14:08:56] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [14:09:07] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 329 days) [14:09:16] (CR) BBlack: [C: 2 V: 2] "This works, I've sniffed the OCSP requests to upstream while testing it with a client, etc. 
We won't actually configure this during initi" [software/nginx] (wmf-1.11.4) - https://gerrit.wikimedia.org/r/315982 (https://phabricator.wikimedia.org/T93927) (owner: BBlack) [14:09:26] (CR) BBlack: [C: 2 V: 2] control: depend on libssl11-dev [software/nginx] (wmf-1.11.4) - https://gerrit.wikimedia.org/r/316329 (owner: BBlack) [14:09:32] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [14:09:32] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [14:10:12] (CR) BBlack: [C: 2 V: 2] "It compiles, ship it? There's still a compile warning in the ssl session stuff here, but it looks relatively-harmless, and I don't think w" [software/nginx] (wmf-1.11.4) - https://gerrit.wikimedia.org/r/316553 (owner: BBlack) [14:10:57] (CR) BBlack: [C: 2 V: 2] nginx (1.11.4-1+wmf3) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - https://gerrit.wikimedia.org/r/311689 (owner: BBlack) [14:10:57] RECOVERY - cassandra-b service on restbase1010 is OK: OK - cassandra-b is active [14:11:09] RECOVERY - cassandra-b SSL 10.64.0.115:7001 on restbase1010 is OK: SSL OK - Certificate restbase1010-b valid until 2017-09-12 15:33:58 +0000 (expires in 329 days) [14:12:43] (CR) BBlack: [C: 1] role::mediawiki::webserver: restart hhvm routinely [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) (owner: Giuseppe Lavagetto) [14:12:57] RECOVERY - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is OK: TCP OK - 0.001 second response time on port 9042 [14:15:50] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:16:44] (CR) Elukey: [C: 1] Port varnishrls to Varnish 4 VSL API [puppet] - https://gerrit.wikimedia.org/r/295207 (https://phabricator.wikimedia.org/T131353) (owner: Ema) [14:17:19] oh, db1069 could be me [14:17:25] (PS4) Ema: Port varnishrls to Varnish 4 VSL API [puppet] - https://gerrit.wikimedia.org/r/295207 (https://phabricator.wikimedia.org/T131353) [14:17:32] (CR) Ema: [C: 2 V: 2] Port varnishrls to Varnish 4 VSL API [puppet] - https://gerrit.wikimedia.org/r/295207 (https://phabricator.wikimedia.org/T131353) (owner: Ema) [14:17:50] and analytics1003, probably, too [14:18:04] !log uploading nginx-1.11.4+wmf3 to carbon jessie-wikimedia - T144523 [14:18:05] T144523: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523 [14:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:44] what is going on? [14:18:46] urandom: ? [14:19:14] (PS6) Filippo Giunchedi: Point to a folder firejailed thumbor can actually write to [puppet] - https://gerrit.wikimedia.org/r/315234 (owner: Gilles) [14:19:28] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:19:36] godog: cass pbs ^^^ [14:20:23] mobrovac: taking a look [14:21:45] !log upgrading nginx on cp4001 (cache_misc ulsfo) as prod canary [14:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:22:17] jynus: Good morning if you have a minute can you please poweroff es2015 so i can swap the CPU thanks [14:22:35] good morning, papaul [14:22:56] I am in the middle of something, could you wait some time? [14:23:08] jynus: ok [14:23:11] or maybe marostegui can handle it faster? 
[14:24:13] papaul: I will take care of it [14:24:31] marostegui: thanks [14:24:38] thank you very much, marostegui [14:24:48] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:24:58] I am restarting those servers, and later I have to fix db1069 and analytics [14:25:04] (PS1) Gehel: service::node: Adding minimal test [puppet] - https://gerrit.wikimedia.org/r/316560 [14:25:26] (Abandoned) Gehel: service::node: allow log level to be configured [puppet] - https://gerrit.wikimedia.org/r/316554 (https://phabricator.wikimedia.org/T148116) (owner: Gehel) [14:27:45] papaul: I have downtimed it and now stopping mysql [14:29:07] papaul: The server is now off [14:29:43] marostegui: thanks [14:33:33] (PS1) Jcrespo: Revert "mariadb: add mariadb::service for all 10.0 mariadb::packages_wmf" [puppet] - https://gerrit.wikimedia.org/r/316562 [14:34:14] Operations, Cassandra, Services: some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2725296 (fgiunchedi) [14:34:24] (PS2) Jcrespo: Revert "mariadb: add mariadb::service for all 10.0 mariadb::packages_wmf" [puppet] - https://gerrit.wikimedia.org/r/316562 [14:34:45] I am going to do a full revert- it is going to be much easier [14:34:56] mobrovac: ^ [14:35:37] (CR) Jcrespo: [C: 2] Revert "mariadb: add mariadb::service for all 10.0 mariadb::packages_wmf" [puppet] - https://gerrit.wikimedia.org/r/316562 (owner: Jcrespo) [14:37:22] godog: sigh [14:38:13] (CR) Filippo Giunchedi: [C: 2] Point to a folder firejailed thumbor can actually write to [puppet] - https://gerrit.wikimedia.org/r/315234 (owner: Gilles) [14:39:41] Operations, Cassandra, Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2725308 (GWicke) 
p:Triage>Unbreak! a:Eevans [14:40:09] mobrovac: sigh indeed, unsurprisingly some high gc times too at the time [14:40:20] !log Shutting down es2015 for hardware maintenance - T147769 [14:40:21] T147769: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769 [14:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:40:26] (I know, with some delay) [14:40:48] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [14:41:14] (PS7) Filippo Giunchedi: Point to a folder firejailed thumbor can actually write to [puppet] - https://gerrit.wikimedia.org/r/315234 (owner: Gilles) [14:42:54] (PS1) Jcrespo: Remove managing the service from the package installation [puppet/mariadb] - https://gerrit.wikimedia.org/r/316563 [14:43:15] !log upgrading nginx on all cache_misc @ ulsfo - T144523 [14:43:16] T144523: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523 [14:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:32] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [14:45:20] (CR) Jcrespo: [C: 2] Remove managing the service from the package installation [puppet/mariadb] - https://gerrit.wikimedia.org/r/316563 (owner: Jcrespo) [14:49:00] Operations, ops-codfw, DBA: db2037: Disk in predictive failure - https://phabricator.wikimedia.org/T148373#2725318 (elukey) p:Triage>High [14:49:41] Operations, Beta-Cluster-reproducible, User-Joe: Update confd package - https://phabricator.wikimedia.org/T147204#2725319 (elukey) p:Triage>Low [14:49:51] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:51:15] Operations, ops-eqiad: 
ms-be1017 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T148016#2725321 (elukey) p:Triage>Normal [14:51:53] Operations, ops-codfw: lvs2002 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T148017#2725322 (elukey) p:Triage>Normal [14:52:07] Operations, Services (next), User-mobrovac: Investigate better protection modes for electron render service (xvfb setuid) - https://phabricator.wikimedia.org/T143336#2725323 (GWicke) p:Triage>High [14:52:58] jouncebot: next [14:52:58] In 1 hour(s) and 7 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161018T1600) [14:54:08] !log rsync tools from labstore1001 to labstore1004 [14:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:30] Operations, Release-Engineering-Team, Beta-Cluster-reproducible, Patch-For-Review: mwscript on jessie mediawiki fails - https://phabricator.wikimedia.org/T146286#2725337 (elukey) p:Triage>Normal [14:54:40] !log upgrading nginx on all cache_misc @ codfw - T144523 [14:54:41] T144523: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523 [14:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:55:06] Operations, ops-eqiad: ms-be1025 network down - https://phabricator.wikimedia.org/T148391#2725340 (elukey) p:Triage>Normal [14:55:33] (PS1) Jcrespo: Add mariadb::service selectively, only on single-instance dbs [puppet] - https://gerrit.wikimedia.org/r/316566 [14:56:29] (PS4) Giuseppe Lavagetto: role::mediawiki::webserver: restart hhvm routinely [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) [15:00:43] (PS1) Milimetric: Correct and simplify EventLogging monitoring [puppet] - https://gerrit.wikimedia.org/r/316567 (https://phabricator.wikimedia.org/T147321) [15:01:35] 
(PS2) Filippo Giunchedi: centralserver: add mtail for kernel messages [puppet] - https://gerrit.wikimedia.org/r/316544 (https://phabricator.wikimedia.org/T147923) [15:03:07] (PS5) Giuseppe Lavagetto: role::mediawiki::webserver: restart hhvm routinely [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) [15:05:16] (CR) Marostegui: [C: 1] Create labs::db::proxy role to load balance and failover replicas [puppet] - https://gerrit.wikimedia.org/r/316558 (https://phabricator.wikimedia.org/T148500) (owner: Jcrespo) [15:05:57] (PS38) 20after4: Scap swat command [mediawiki-config] - https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [15:06:47] !log upgrading nginx on all remaining cache_misc (eqiad, esams) - T144523 [15:06:48] T144523: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523 [15:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:24] (CR) Ottomata: "Couple of nits" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/316567 (https://phabricator.wikimedia.org/T147321) (owner: Milimetric) [15:11:17] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2725420 (Andrew) Just to confirm my understanding... the proposal is basically to rename everything currently called a role (and the role:: class hier... 
[15:12:43] (CR) Alexandros Kosiaris: [C: 1] service::node: Adding minimal test [puppet] - https://gerrit.wikimedia.org/r/316560 (owner: Gehel) [15:14:36] (PS6) Giuseppe Lavagetto: role::mediawiki::webserver: restart hhvm routinely [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) [15:14:50] marostegui: the CPU swap is complete now i am updating the BIOS and the IDRAC firmware will ping you when done [15:15:01] (CR) Mobrovac: service::node: Adding minimal test (1 comment) [puppet] - https://gerrit.wikimedia.org/r/316560 (owner: Gehel) [15:15:05] papaul: Awesome - thanks. Will you power on the server once you are done? [15:15:23] yes [15:15:27] Thanks! [15:15:32] marostegui: yw [15:16:40] !log upgrading puppetmaster on labtestcontrol2001 to trusty/3.8.5 [15:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:09] Operations, Traffic, netops, Patch-For-Review: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2725483 (BBlack) @faidon - re: eqiad recdns IPv4 - I've uploaded DNS and puppet patches to switch that IP (by turning on the new IP first in parallel with the old... 
[15:20:11] Puppet, Labs: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2725484 (Andrew) This appears to be fixed in 3.8.5 [15:22:45] Puppet, Labs: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2725499 (Andrew) [15:23:20] (CR) Filippo Giunchedi: Log when HTTP status codes from Mediawiki and Thumbor are different (1 comment) [puppet] - https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918) (owner: Gilles) [15:24:14] Operations, ops-codfw, DBA, Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769#2725515 (Papaul) a:Cmjohnson>Papaul [15:24:35] Operations, Traffic, netops, Patch-For-Review: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2725529 (BBlack) Oh, one other minor thing: the eqiad recdns IPv6 is already in the correct subnet for where it's at (matches with changing the IPv4 as the patche... 
[15:25:17] (CR) Alexandros Kosiaris: [C: 2] icinga: Enable apache::mod::cgi [puppet] - https://gerrit.wikimedia.org/r/316528 (owner: Alexandros Kosiaris) [15:25:21] (PS2) Alexandros Kosiaris: icinga: Enable apache::mod::cgi [puppet] - https://gerrit.wikimedia.org/r/316528 [15:25:23] (CR) Alexandros Kosiaris: [V: 2] icinga: Enable apache::mod::cgi [puppet] - https://gerrit.wikimedia.org/r/316528 (owner: Alexandros Kosiaris) [15:27:56] (PS7) Giuseppe Lavagetto: role::mediawiki::webserver: restart hhvm routinely [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) [15:28:58] (CR) jenkins-bot: [V: -1] role::mediawiki::webserver: restart hhvm routinely [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) (owner: Giuseppe Lavagetto) [15:29:53] (PS8) Giuseppe Lavagetto: role::mediawiki::webserver: restart hhvm routinely [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) [15:32:28] Operations, ops-codfw, DBA, Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769#2725636 (Papaul) Below are the step taken to troubleshoot this issue. 1- Swapped CPU 1 to CPU2 2 - Update BIOS... [15:33:02] (CR) Filippo Giunchedi: [C: 1] "LGTM, I'm fine with auto-ack since I get notified via emails of new tasks anyways." [puppet] - https://gerrit.wikimedia.org/r/304026 (https://phabricator.wikimedia.org/T142085) (owner: Volans) [15:33:20] marostegui: system is back up [15:34:12] PROBLEM - cxserver endpoints health on scb1004 is CRITICAL: Timeout while attempting connection [15:34:16] papaul: I can see it - thanks! 
[15:34:52] PROBLEM - dhclient process on scb1004 is CRITICAL: Timeout while attempting connection [15:35:23] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: Timeout while attempting connection [15:35:23] Operations, ops-codfw, DBA, Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769#2725681 (Marostegui) MySQL is back up and replicating Thanks @Papaul [15:35:47] PROBLEM - mathoid endpoints health on scb1004 is CRITICAL: Timeout while attempting connection [15:36:05] PROBLEM - Check size of conntrack table on scb1004 is CRITICAL: Timeout while attempting connection [15:36:05] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: Timeout while attempting connection [15:36:09] (PS7) Alexandros Kosiaris: icinga: Remove event_profiling_enabled [puppet] - https://gerrit.wikimedia.org/r/315085 [15:36:11] (PS7) Alexandros Kosiaris: icinga: retry_check_interval => retry_interval [puppet] - https://gerrit.wikimedia.org/r/315087 [15:36:13] (PS3) Alexandros Kosiaris: ntp: Update neon specific ACLs to be more generic [puppet] - https://gerrit.wikimedia.org/r/315255 [15:36:15] (PS7) Alexandros Kosiaris: icinga: normal_check_interval => check_interval [puppet] - https://gerrit.wikimedia.org/r/315086 [15:36:17] (PS3) Alexandros Kosiaris: Replace neon with einsteinium where applicable [puppet] - https://gerrit.wikimedia.org/r/315257 [15:36:19] (PS4) Alexandros Kosiaris: icinga: Kill /etc/icinga/puppet_hostextinfo.cfg [puppet] - https://gerrit.wikimedia.org/r/315242 [15:36:21] (PS4) Alexandros Kosiaris: naggen2: Kill hostextinfo support [puppet] - https://gerrit.wikimedia.org/r/315243 [15:36:23] (PS4) Alexandros Kosiaris: Remove absented /etc/icinga/puppet_hostextinfo.cfg entry [puppet] - https://gerrit.wikimedia.org/r/315244 [15:36:25] (PS4) Alexandros Kosiaris: icinga: Remove the last vestiges of 
hostextinfo [puppet] - https://gerrit.wikimedia.org/r/315245 [15:36:27] (PS1) Alexandros Kosiaris: icinga: Amend init script [puppet] - https://gerrit.wikimedia.org/r/316571 [15:36:32] PROBLEM - DPKG on scb1004 is CRITICAL: Timeout while attempting connection [15:36:32] PROBLEM - ores on scb1004 is CRITICAL: Connection timed out [15:36:52] PROBLEM - Disk space on scb1004 is CRITICAL: Timeout while attempting connection [15:36:52] PROBLEM - ores uWSGI web app on scb1004 is CRITICAL: Timeout while attempting connection [15:37:12] PROBLEM - puppet last run on scb1004 is CRITICAL: Timeout while attempting connection [15:37:12] PROBLEM - MD RAID on scb1004 is CRITICAL: Timeout while attempting connection [15:37:24] PROBLEM - salt-minion processes on scb1004 is CRITICAL: Timeout while attempting connection [15:38:05] PROBLEM - apertium apy on scb1004 is CRITICAL: Connection timed out [15:38:13] PROBLEM - changeprop endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:38:15] Puppet, Labs: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2725683 (Andrew) p:Triage>Low [15:38:33] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:38:39] (PS9) Giuseppe Lavagetto: role::mediawiki::webserver: restart hhvm routinely [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) [15:38:53] PROBLEM - configured eth on scb1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:39:45] RECOVERY - MD RAID on scb1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [15:39:45] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:39:48] (CR) Madhuvishy: [C: 1] bdsync backup setup for labstore [puppet] - https://gerrit.wikimedia.org/r/315595 (owner: Rush) [15:40:03] RECOVERY - salt-minion processes on scb1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:40:04] RECOVERY - dhclient process on scb1004 is OK: PROCS OK: 0 processes with command name dhclient [15:40:33] RECOVERY - apertium apy on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 5632 bytes in 0.014 second response time [15:41:24] RECOVERY - Check size of conntrack table on scb1004 is OK: OK: nf_conntrack is 0 % full [15:41:33] RECOVERY - configured eth on scb1004 is OK: OK - interfaces up [15:41:44] RECOVERY - DPKG on scb1004 is OK: All packages OK [15:42:04] RECOVERY - Disk space on scb1004 is OK: DISK OK [15:42:24] PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. Failed resources (up to 3 shown): Package[changeprop/deploy],Package[cxserver/deploy],Package[mobileapps/deploy],Package[citoid/deploy] [15:47:29] !log eqiad-prod: ms-be1022 to weight 3000 T136631 [15:47:31] T136631: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631 [15:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:49:12] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2725732 (bd808) >>! In T147718#2725420, @Andrew wrote: > But, to echo @yuvi, I still don't understand why this 'profiles must have no params and only... 
[15:49:14] (PS10) Giuseppe Lavagetto: role::mediawiki::webserver: restart hhvm routinely [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) [15:49:35] (CR) Alexandros Kosiaris: [C: 2] icinga: Amend init script [puppet] - https://gerrit.wikimedia.org/r/316571 (owner: Alexandros Kosiaris) [15:49:39] (PS2) Alexandros Kosiaris: icinga: Amend init script [puppet] - https://gerrit.wikimedia.org/r/316571 [15:49:41] (CR) Alexandros Kosiaris: [V: 2] icinga: Amend init script [puppet] - https://gerrit.wikimedia.org/r/316571 (owner: Alexandros Kosiaris) [15:50:23] arseny92: ping? [15:58:27] (CR) Madhuvishy: labsdb: maintain-views option to purge unspecified views (4 comments) [puppet] - https://gerrit.wikimedia.org/r/316487 (owner: Rush) [15:59:48] (CR) Rush: [C: 2] bdsync backup setup for labstore [puppet] - https://gerrit.wikimedia.org/r/315595 (owner: Rush) [15:59:53] (PS18) Rush: bdsync backup setup for labstore [puppet] - https://gerrit.wikimedia.org/r/315595 [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161018T1600). 
[16:00:16] (CR) Alexandros Kosiaris: "Note that restrict in https://www.eecis.udel.edu/~mills/ntp/html/accopt.html says" [puppet] - https://gerrit.wikimedia.org/r/315255 (owner: Alexandros Kosiaris) [16:00:33] no patches afaics [16:02:08] PROBLEM - salt-minion processes on scb1003 is CRITICAL: Timeout while attempting connection [16:03:27] PROBLEM - apertium apy on scb1003 is CRITICAL: Connection timed out [16:04:19] PROBLEM - changeprop endpoints health on scb1003 is CRITICAL: Timeout while attempting connection [16:04:49] PROBLEM - DPKG on scb1003 is CRITICAL: Timeout while attempting connection [16:04:49] PROBLEM - ores on scb1003 is CRITICAL: Connection timed out [16:05:09] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: Timeout while attempting connection [16:05:19] PROBLEM - configured eth on scb1003 is CRITICAL: Timeout while attempting connection [16:05:20] PROBLEM - Disk space on scb1003 is CRITICAL: Timeout while attempting connection [16:05:20] PROBLEM - ores uWSGI web app on scb1003 is CRITICAL: Timeout while attempting connection [16:05:24] (PS3) Rush: labstore: bdsync backup 'test' drbd volume from secondary [puppet] - https://gerrit.wikimedia.org/r/316414 [16:05:39] PROBLEM - cxserver endpoints health on scb1003 is CRITICAL: Timeout while attempting connection [16:06:04] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:06:33] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy [16:07:20] Operations, Monitoring, Prometheus-metrics-monitoring: Evaluate prometheus snmp_exporter for Torrus PDUs metrics use case - https://phabricator.wikimedia.org/T148541#2725758 (fgiunchedi) [16:07:31] Operations, Monitoring, Prometheus-metrics-monitoring: Evaluate prometheus snmp_exporter for Torrus PDUs metrics use case - https://phabricator.wikimedia.org/T148541#2725770 (fgiunchedi) p:Triage>Normal [16:09:17] (CR) Rush: [C: 
2] labstore: bdsync backup 'test' drbd volume from secondary [puppet] - https://gerrit.wikimedia.org/r/316414 (owner: Rush) [16:09:51] Operations, ops-codfw, DBA, Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769#2725772 (Cmjohnson) Thanks @Papaul, let me know if the error returns and where. [16:10:07] mmmm is scb1003 a reimage? [16:10:46] Operations, Traffic, netops, Patch-For-Review: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2725773 (faidon) The network hardware I can configure in one swoop, so don't worry about that. Not sure if the PDUs/iDRACs/iLOs have any DNS configured whatsoever... [16:11:56] arseny92: brion: two patches of yours were skipped with the yesterday issue, 316069 - Repair text track attributes and 316291 - Reverting votewiki back to en. We resumed the normal deployment so you can plan them again for next SWAT windows. 
[16:12:24] woot [16:16:31] Operations, ops-codfw, DBA: db2037: Disk in predictive failure - https://phabricator.wikimedia.org/T148373#2725805 (Papaul) a:Papaul [16:17:14] (PS1) Rush: labstore: block_sync use bash instead of sh [puppet] - https://gerrit.wikimedia.org/r/316577 [16:19:40] (PS2) Rush: labsdb: maintain-views option to purge unspecified views [puppet] - https://gerrit.wikimedia.org/r/316487 [16:19:57] (PS2) Gehel: service::node: Adding minimal test [puppet] - https://gerrit.wikimedia.org/r/316560 [16:20:36] (CR) Rush: [C: 2] labstore: block_sync use bash instead of sh [puppet] - https://gerrit.wikimedia.org/r/316577 (owner: Rush) [16:21:16] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2701526 (BBlack) FWIW, even though some aspects seem painful, I think the general direction of the proposal seems better than where we're at today. I... [16:21:29] Operations, RESTBase, RESTBase-Cassandra, Services (done): secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329#2725815 (GWicke) [16:21:52] (PS3) Rush: labsdb: maintain-views option to purge unspecified views [puppet] - https://gerrit.wikimedia.org/r/316487 [16:22:10] (CR) Rush: [C: 2 V: 2] labsdb: maintain-views option to purge unspecified views [puppet] - https://gerrit.wikimedia.org/r/316487 (owner: Rush) [16:22:15] Operations, RESTBase, RESTBase-Cassandra, Services (done): secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329#1159901 (GWicke) Tentatively moved to our done status. @eevans, please close or move back to the backlog if there is more to be done here. 
[16:23:19] (PS11) Giuseppe Lavagetto: role::mediawiki::webserver: restart hhvm routinely [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) [16:24:31] <_joe_> jesus, I can't seem to get it right. [16:24:35] RECOVERY - apertium apy on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 5632 bytes in 0.014 second response time [16:25:21] (PS1) Rush: labsdb: add include for passwords to views manifest [puppet] - https://gerrit.wikimedia.org/r/316578 [16:25:37] RECOVERY - Disk space on scb1003 is OK: DISK OK [16:25:37] RECOVERY - DPKG on scb1003 is OK: All packages OK [16:25:43] (PS12) Giuseppe Lavagetto: role::mediawiki::webserver: restart hhvm routinely [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) [16:26:59] (PS7) Volans: Monitoring: add event handler for RAID checks [puppet] - https://gerrit.wikimedia.org/r/304026 (https://phabricator.wikimedia.org/T142085) [16:27:06] RECOVERY - salt-minion processes on scb1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:27:13] (PS2) Rush: labsdb: add include for passwords to views manifest [puppet] - https://gerrit.wikimedia.org/r/316578 [16:27:46] RECOVERY - configured eth on scb1003 is OK: OK - interfaces up [16:28:05] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Testing on Production, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2725855 (AndyRussG) Another option, I think, would be to add to CN some lightweight monitoring of... [16:28:11] Operations, ops-codfw, DBA: es2014 raid controler temporary failure - https://phabricator.wikimedia.org/T148434#2722693 (Papaul) @jcrespo there is not enough information to troubleshoot or creating a case with the information provide. 
[16:28:57] (CR) Andrew Bogott: [C: 1] labsdb: add include for passwords to views manifest [puppet] - https://gerrit.wikimedia.org/r/316578 (owner: Rush) [16:28:59] (CR) Rush: [C: 2] labsdb: add include for passwords to views manifest [puppet] - https://gerrit.wikimedia.org/r/316578 (owner: Rush) [16:32:57] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:35:17] RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:35:18] RECOVERY - changeprop endpoints health on scb1004 is OK: All endpoints are healthy [16:35:48] PROBLEM - ores on scb2004 is CRITICAL: Connection refused [16:36:07] RECOVERY - cxserver endpoints health on scb1004 is OK: All endpoints are healthy [16:36:37] RECOVERY - mathoid endpoints health on scb1004 is OK: All endpoints are healthy [16:36:38] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [16:37:00] PROBLEM - puppet last run on scb2004 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 3 minutes ago with 5 failures. 
Failed resources (up to 3 shown): Package[changeprop/deploy],Package[cxserver/deploy],Package[mobileapps/deploy],Package[citoid/deploy] [16:37:02] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [16:37:17] RECOVERY - ores on scb1004 is OK: HTTP OK: HTTP/1.0 200 OK - 2822 bytes in 0.011 second response time [16:37:27] RECOVERY - ores uWSGI web app on scb1004 is OK: ● uwsgi-ores.service - uwsgi-ores uwsgi app [16:38:17] PROBLEM - changeprop endpoints health on scb2004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.36, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:38:37] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.36, port=1970): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:39:17] PROBLEM - cxserver endpoints health on scb2004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.36, port=8080): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:39:19] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:39:43] (PS1) Rush: labsdb: maintain-views allow specification of a single table [puppet] - https://gerrit.wikimedia.org/r/316584 [16:39:52] (PS2) Rush: labsdb: maintain-views allow specification of a single table [puppet] - https://gerrit.wikimedia.org/r/316584 [16:40:07] PROBLEM - mathoid endpoints health on scb2004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.36, port=10042): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:40:27] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.36, port=8888): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:40:29] Operations, ops-codfw, DBA: db2037: Disk in predictive failure - https://phabricator.wikimedia.org/T148373#2725880 (Papaul) Will Have the replacement disk on site tomorrow. Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms you... [16:41:01] (CR) jenkins-bot: [V: -1] labsdb: maintain-views allow specification of a single table [puppet] - https://gerrit.wikimedia.org/r/316584 (owner: Rush) [16:41:13] Operations, Ops-Access-Requests: Access to stat1002, stat1003, stat1004) for user pmiazga - https://phabricator.wikimedia.org/T148472#2725886 (dr0ptp4kt) Yes, please grant access to stat1003. 
[16:42:07] (PS3) Rush: labsdb: maintain-views allow specification of a single table [puppet] - https://gerrit.wikimedia.org/r/316584 [16:42:07] Operations, ops-codfw, DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2725894 (jcrespo) [16:42:10] Operations, ops-codfw, DBA: es2014 raid controler temporary failure - https://phabricator.wikimedia.org/T148434#2725892 (jcrespo) Open>Resolved a:jcrespo [16:42:41] (PS13) Giuseppe Lavagetto: role::mediawiki::webserver: restart hhvm routinely [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) [16:42:50] PROBLEM - puppet last run on elastic1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:43:16] (CR) jenkins-bot: [V: -1] labsdb: maintain-views allow specification of a single table [puppet] - https://gerrit.wikimedia.org/r/316584 (owner: Rush) [16:44:38] (PS4) Rush: labsdb: maintain-views allow specification of a single table [puppet] - https://gerrit.wikimedia.org/r/316584 [16:44:49] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:48:00] (PS5) Rush: labsdb: maintain-views allow specification of a single table [puppet] - https://gerrit.wikimedia.org/r/316584 [16:50:17] (CR) Rush: [C: 2] labsdb: maintain-views allow specification of a single table [puppet] - https://gerrit.wikimedia.org/r/316584 (owner: Rush) [16:53:30] (PS2) Jcrespo: Add mariadb::service selectively, only on single-instance dbs [puppet] - https://gerrit.wikimedia.org/r/316566 [16:53:45] (CR) jenkins-bot: [V: -1] Add mariadb::service selectively, only on single-instance dbs [puppet] - https://gerrit.wikimedia.org/r/316566 (owner: Jcrespo) [16:54:23] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2725920 (yuvipanda) Putting aside how labs will work with this, I am trying to understand why explicit hiera lookups with arbitrary, non puppet enforc... [16:55:37] (PS3) Jcrespo: Add mariadb::service selectively, only on single-instance dbs [puppet] - https://gerrit.wikimedia.org/r/316566 [16:56:55] (CR) Jcrespo: [C: 2] Add mariadb::service selectively, only on single-instance dbs [puppet] - https://gerrit.wikimedia.org/r/316566 (owner: Jcrespo) [16:59:45] (CR) Giuseppe Lavagetto: [C: 2] role::mediawiki::webserver: restart hhvm routinely [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) (owner: Giuseppe Lavagetto) [16:59:53] (PS14) Giuseppe Lavagetto: role::mediawiki::webserver: restart hhvm routinely [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) [17:00:05] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161018T1700). Please do the needful. 
[17:00:17] better not, releng won't be happy [17:00:18] Operations, Ops-Access-Requests: Access to stat1002, stat1003, stat1004) for user pmiazga - https://phabricator.wikimedia.org/T148472#2724062 (Nuria) @pmiazga : no, you do not need access to 1003, things might have changed since that guide was written. Do not worry, you are set for data access. Please d... [17:00:50] (CR) Giuseppe Lavagetto: [V: 2] role::mediawiki::webserver: restart hhvm routinely [puppet] - https://gerrit.wikimedia.org/r/315938 (https://phabricator.wikimedia.org/T147773) (owner: Giuseppe Lavagetto) [17:01:02] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:01:20] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Testing on Production, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2725964 (AndyRussG) @aaron, thanks for improving and merging the `checkKeys` patch!!!! I have som... 
[17:01:20] (PS2) Madhuvishy: labstore: Mount maps share simultaneously from labstore1003 and 1001 [puppet] - https://gerrit.wikimedia.org/r/316482 (https://phabricator.wikimedia.org/T147657) [17:01:41] !log upgrading nginx on cache_maps - T144523 [17:01:42] T144523: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523 [17:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:02:07] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:03:09] (PS2) Dzahn: Update doc comment on mediawiki::packages::php5 [puppet] - https://gerrit.wikimedia.org/r/308904 (owner: Legoktm) [17:03:18] RECOVERY - cxserver endpoints health on scb2004 is OK: All endpoints are healthy [17:03:27] RECOVERY - puppet last run on db2012 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:03:38] (CR) Madhuvishy: labstore: Mount maps share simultaneously from labstore1003 and 1001 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/316482 (https://phabricator.wikimedia.org/T147657) (owner: Madhuvishy) [17:03:42] RECOVERY - puppet last run on scb2004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:04:02] (CR) Dzahn: [C: 2] "only comments" [puppet] - https://gerrit.wikimedia.org/r/308904 (owner: Legoktm) [17:04:19] RECOVERY - mathoid endpoints health on scb2004 is OK: All endpoints are healthy [17:04:39] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [17:04:57] RECOVERY - ores on scb2004 is OK: HTTP OK: HTTP/1.0 200 OK - 2822 bytes in 0.086 second response time [17:04:57] RECOVERY - changeprop endpoints health on scb2004 is OK: All endpoints are healthy [17:05:27] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [17:05:52] (PS1) Jcrespo: Add mariadb::services to labs::db [puppet] - 
https://gerrit.wikimedia.org/r/316590 [17:06:31] (PS2) Dzahn: Add redirect for toolserver sulinfo tool [puppet] - https://gerrit.wikimedia.org/r/316076 (owner: Dereckson) [17:06:57] RECOVERY - puppet last run on elastic1029 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:08:22] (CR) Dzahn: [C: 2] Add redirect for toolserver sulinfo tool [puppet] - https://gerrit.wikimedia.org/r/316076 (owner: Dereckson) [17:09:55] (CR) Dzahn: [C: -1] "Eh, the message says "install" but the code says "ensure absent"." [puppet] - https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: Paladox) [17:12:05] (CR) Dzahn: [C: -1] "> ferm will actually fail if AAAA are missing" [puppet] - https://gerrit.wikimedia.org/r/316497 (owner: Dzahn) [17:14:18] Operations, netops, Goal, Patch-For-Review: Decomission palladium - https://phabricator.wikimedia.org/T147320#2726109 (Dzahn) >>! In T147320#2688855, @akosiaris wrote: > Setting to stalled for say.. 2 weeks ? The 2 weeks are over today, exactly 14 days afer that comment. [17:14:25] Operations, Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2726111 (Dzahn) [17:15:16] Operations, netops, Goal, Patch-For-Review: Decomission palladium - https://phabricator.wikimedia.org/T147320#2688855 (Dzahn) @joe ok with you if i go ahead and now merge https://gerrit.wikimedia.org/r/#/c/315891/ and actually shutdown palladium? 
[17:15:16] (PS1) Jcrespo: Use mariadb::service; prevent puppet from managing mysql symlinks [puppet] - https://gerrit.wikimedia.org/r/316595 [17:17:17] (CR) Dzahn: [C: -1] "missing change in site.pp" [puppet] - https://gerrit.wikimedia.org/r/315879 (owner: Dzahn) [17:17:30] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 707 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3108533 keys - replication_delay is 707 [17:20:15] (CR) Ottomata: [C: 1] "If you say so! :)" [puppet] - https://gerrit.wikimedia.org/r/316595 (owner: Jcrespo) [17:24:22] (PS1) Jcrespo: Add mariadb::service and changes for new package [puppet] - https://gerrit.wikimedia.org/r/316598 [17:24:43] (CR) Rush: [C: 1] "seems good, thanks for walking me through it" [puppet] - https://gerrit.wikimedia.org/r/316482 (https://phabricator.wikimedia.org/T147657) (owner: Madhuvishy) [17:27:22] (PS1) Giuseppe Lavagetto: role::mediawiki::webserver: fix path for run-no-puppet [puppet] - https://gerrit.wikimedia.org/r/316599 [17:27:46] (CR) Giuseppe Lavagetto: [C: 2 V: 2] role::mediawiki::webserver: fix path for run-no-puppet [puppet] - https://gerrit.wikimedia.org/r/316599 (owner: Giuseppe Lavagetto) [17:34:10] !log stopping mysql, cloning db1064->db1053; upgrading [17:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:35:41] (PS1) BryanDavis: Update .mailmap to de-duplicate authors [puppet] - https://gerrit.wikimedia.org/r/316601 [17:36:46] (CR) BryanDavis: Update .mailmap to de-duplicate authors (1 comment) [puppet] - https://gerrit.wikimedia.org/r/316601 (owner: BryanDavis) [17:38:34] (PS2) Milimetric: Correct and simplify EventLogging monitoring [puppet] - https://gerrit.wikimedia.org/r/316567 (https://phabricator.wikimedia.org/T147321) [17:39:00] (CR) Milimetric: Correct and simplify EventLogging monitoring (2 comments) [puppet] - 
https://gerrit.wikimedia.org/r/316567 (https://phabricator.wikimedia.org/T147321) (owner: Milimetric) [17:39:00] Operations, DBA, Labs, Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2726213 (chasemp) [17:40:52] Blocked-on-Operations, Operations, DBA, Labs, and 2 others: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2726240 (chasemp) [17:40:58] Operations, DBA, Labs, Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2726242 (chasemp) [17:41:02] Blocked-on-Operations, DBA, Labs, Labs-Infrastructure: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2400728 (chasemp) Open>Resolved I'm resolving this as I believe {T148560}, {T147302} and {T59617} are now the relevant work tasks. Big tha... [17:41:18] Operations, Commons, media-storage, Regression: Some JPGs are being served as text - https://phabricator.wikimedia.org/T148497#2726243 (Aklapper) Tentatively adding #operations, feel free to remove again if this happens / happened in a higher layer. [17:41:33] Operations, Ops-Access-Requests: Access to stat1002, stat1003, stat1004) for user pmiazga - https://phabricator.wikimedia.org/T148472#2726245 (dr0ptp4kt) @nuria, the page with Reading Web onboarding is at https://www.mediawiki.org/wiki/Reading/Web/Team/Onboarding. The way it's worded is "Apply for access... 
[17:41:48] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3080731 keys - replication_delay is 0 [17:42:18] (PS2) Andrew Bogott: Update .mailmap to de-duplicate authors [puppet] - https://gerrit.wikimedia.org/r/316601 (owner: BryanDavis) [17:43:28] (PS2) BBlack: eqiad recdns IP fix: add new address (.254) [puppet] - https://gerrit.wikimedia.org/r/315929 (https://phabricator.wikimedia.org/T143915) [17:44:01] PROBLEM - puppet last run on db1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:46:28] RECOVERY - puppet last run on db1036 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:46:31] (PS3) Madhuvishy: labstore: Mount maps share simultaneously from labstore1003 and 1001 [puppet] - https://gerrit.wikimedia.org/r/316482 (https://phabricator.wikimedia.org/T147657) [17:46:42] (CR) Madhuvishy: [C: 2 V: 2] labstore: Mount maps share simultaneously from labstore1003 and 1001 [puppet] - https://gerrit.wikimedia.org/r/316482 (https://phabricator.wikimedia.org/T147657) (owner: Madhuvishy) [17:47:37] (CR) Andrew Bogott: [C: 2] Update .mailmap to de-duplicate authors [puppet] - https://gerrit.wikimedia.org/r/316601 (owner: BryanDavis) [17:47:42] (PS3) Andrew Bogott: Update .mailmap to de-duplicate authors [puppet] - https://gerrit.wikimedia.org/r/316601 (owner: BryanDavis) [17:52:41] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Testing on Production, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2726303 (aaron) touchCheckKey() makes a key be seen as invalid if it was created before the curren... 
[17:52:45] Operations, Performance-Team, Thumbor: Separate 404s into their own log - https://phabricator.wikimedia.org/T145632#2726304 (Gilles) Open>Resolved [17:52:48] Operations, Performance-Team, Thumbor, Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2726305 (Gilles) [17:53:09] Operations, Ops-Access-Requests: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2726307 (dr0ptp4kt) @AlexMonk-WMF thanks for providing more context. It's for https://wikitech.wikimedia.org/wiki/How_to_deploy_code and to jump to stats servers. AFAIK, full-s... [17:53:21] (PS4) Mdann52: Localisation of Babel categories on nap.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/263342 (https://phabricator.wikimedia.org/T123188) [17:54:05] (CR) Mdann52: "New commit to resolve merge conflicts, per recent discussion can this now be finally merged?" [mediawiki-config] - https://gerrit.wikimedia.org/r/263342 (https://phabricator.wikimedia.org/T123188) (owner: Mdann52) [17:55:26] (CR) Gilles: Log when HTTP status codes from Mediawiki and Thumbor are different (1 comment) [puppet] - https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918) (owner: Gilles) [17:55:38] !log warming up elastic@codfw from wasat.codfw.wmnet [17:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:57:01] (PS1) Giuseppe Lavagetto: base::resolving: allow defining additional search domains in labs [puppet] - https://gerrit.wikimedia.org/r/316602 [17:57:03] (PS1) Giuseppe Lavagetto: compiler: add hiera lookups for the project [puppet] - https://gerrit.wikimedia.org/r/316603 [17:57:59] <_joe_> yuvipanda: care to take a look? 
^ [17:58:15] <_joe_> or andrewbogott / madhuvishy [17:58:23] <_joe_> I need this to keep the puppet compiler working [17:58:29] <_joe_> both these changes, actually [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161018T1800). Please do the needful. [18:00:04] James_F, dcausse, brion, and Dereckson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:05] I can SWAT. [18:00:35] _joe_: I'm looking [18:00:46] Dereckson: space for yesterday's patch? [18:00:59] woot [18:01:02] o/ [18:01:04] kart_: yes, there is [18:01:30] you can add it to the deployment table [18:02:25] Dereckson: added. Thanks. [18:02:43] dcausse: any order of preference? [18:02:46] Dereckson: could you do one of mine at the very end? ([config] 315298 [cirrus] Activate BM25 on top 10 wikis: Step 2) [18:03:03] ok [18:03:06] the two first are just for lab and can be deployed [18:03:28] (PS2) Dereckson: Elastic@deployment-prep: force the number of replicas to 1 max [mediawiki-config] - https://gerrit.wikimedia.org/r/315104 (https://phabricator.wikimedia.org/T147777) (owner: DCausse) [18:03:43] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/315104 (https://phabricator.wikimedia.org/T147777) (owner: DCausse) [18:03:52] jouncebot: now [18:03:53] For the next 0 hour(s) and 56 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161018T1800) [18:04:17] jouncebot: tomorrow [18:04:19] (Merged) jenkins-bot: Elastic@deployment-prep: force the number of replicas to 1 max [mediawiki-config] - https://gerrit.wikimedia.org/r/315104 (https://phabricator.wikimedia.org/T147777) (owner: DCausse) [18:04:21] oh well [18:04:34] (PS3) 
Dereckson: Elastic@deployment-prep: Remove deployment-elastic08 from the cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/315249 (https://phabricator.wikimedia.org/T147777) (owner: DCausse) [18:04:45] (PS2) Dzahn: pmacct: move firewall, standard include to role [puppet] - https://gerrit.wikimedia.org/r/315879 [18:04:47] jouncebot: next [18:04:47] In 4 hour(s) and 55 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161018T2300) [18:05:00] (CR) Andrew Bogott: [C: 2] compiler: add hiera lookups for the project [puppet] - https://gerrit.wikimedia.org/r/316603 (owner: Giuseppe Lavagetto) [18:05:02] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/315249 (https://phabricator.wikimedia.org/T147777) (owner: DCausse) [18:05:39] (Merged) jenkins-bot: Elastic@deployment-prep: Remove deployment-elastic08 from the cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/315249 (https://phabricator.wikimedia.org/T147777) (owner: DCausse) [18:05:42] (CR) Andrew Bogott: [C: 2] base::resolving: allow defining additional search domains in labs [puppet] - https://gerrit.wikimedia.org/r/316602 (owner: Giuseppe Lavagetto) [18:06:01] _joe_: they look fine to me — I'm going to leave the submit and merge to you because my gerrit is super slow for some reason [18:06:05] (maybe yours is too :/ ) [18:06:43] <_joe_> andrewbogott: heh indeed [18:07:13] andrewbogott: do you mean jenkins/zuul or actual gerrit [18:07:44] actual gerrit [18:08:02] like when I try to change views there's a 30-second or so pause [18:08:08] interesting, i cant confirm that [18:08:12] yet [18:08:23] seems fast to me [18:09:22] it had this slow down yesterday where i had to restart it, but now it seems normal [18:09:58] could just be my browser acting up [18:10:16] is it chrome? [18:10:23] i am on firefox [18:10:30] Hi raynor. Welcome to Wikimedia. 
[18:11:17] mutante: ff [18:11:30] Well... scap isn't willing to sync-master from mira. [18:11:32] odd..olk [18:11:33] ok [18:12:24] Hello Dereckson :) [18:12:47] andrewbogott i doint think it is your browser [18:12:50] l10nupdate job succeeded from tin this night. [18:13:08] I did expereienced a sudden sloweness erlyer but only for 1-3 secs [18:13:12] then normal [18:13:25] then it loaded, and is working now on a 80mbps connection [18:13:25] tin: Improperly owned -0:0- files in /srv/mediawiki-staging [18:14:42] yep, but my main issue is a scap sync-file in mira is stucked at sync-masters stage, with rsync as high cpu level [18:15:10] something else could be currently pulling code [18:15:26] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2726378 (10Gilles) I've just realized that this can't work for another reason: the pids are namespaced under firejail, so there's no way to ad... [18:16:28] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 13Patch-For-Review: pay-lvs1003/pay-lvs1004 hardware swap for pay-lvs1001/pay-lvs1002 - https://phabricator.wikimedia.org/T147932#2726394 (10Cmjohnson) @Jgreen any reason to keep this task open? [18:18:42] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2726398 (10Paladox) Gerrit become slow again on 18/10/2016 around 6:50pm to 7:10pm bst time. 
[18:19:14] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2726403 (10Paladox) [18:19:19] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2726405 (10Gilles) Well, the above was blocked by the sys/fs thing, but since pid is shown as "3" it obviously doesn't work any better with th... [18:19:48] now i see what Andrew means.. hmmm [18:19:58] (03PS1) 10Urbanecm: Create a new namespace "Príloha" for skwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316608 (https://phabricator.wikimedia.org/T148563) [18:20:00] LOL, gerrits gone down again [18:20:06] all of a sudden i got that delay when viewing a diff [18:20:20] paladox: no, it did not [18:20:42] it's that weird delay but then it's back after a couple seconds [18:20:48] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2726414 (10Gilles) I think that the namespaced pid is the real dealbreaker here. I don't think it's possible to assign the subprocesses to a d... [18:20:53] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2724169 (10Paladox) [18:21:02] mutante oh gerrit was slow [18:21:03] for me [18:21:10] 06Operations: Can't scap from mira - https://phabricator.wikimedia.org/T148566#2726416 (10Dereckson) [18:21:12] More slower [18:21:19] mutante i have the white screen [18:21:22] loading page [18:21:26] on gerrit [18:22:34] 06Operations: Can't scap from mira - https://phabricator.wikimedia.org/T148566#2726429 (10Dereckson) [18:25:15] Well I'm considering to try it from tin, and I'll maintain the same /srv/mediawiki-stating state on tin and mira to avoid issues for future deployment. 
[18:25:32] paladox: i'm trying to look at the logs, but i'm on a bus [18:25:35] hold on [18:25:39] mutante thanks [18:26:27] com.google.gerrit.sshd.GerritServerSession : Exception caught [18:26:27] 2940 java.io.IOException: Connection reset by peer [18:26:39] Oh [18:26:48] paladox: but.. gerrit ui works for me [18:26:53] doesnt it for you? [18:26:53] That will at least allow to know if the issue is from mira or general. [18:26:54] Yep [18:26:56] works again [18:27:02] Just a few minutes ago it was slow [18:27:04] ok, so i did _not_ restart it this time [18:27:05] then started working [18:27:07] yes, ack [18:27:53] mutante is this https://groups.google.com/forum/#!topic/repo-discuss/vCIxr79W8to the error your getting? [18:28:27] Or https://bugs.chromium.org/p/gerrit/issues/detail?id=2113 [18:28:59] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2726435 (10Dzahn) ack, it became slow but this time i did not restart the service and it was just working again after a little while. in the logs we have now, i se... [18:29:54] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), 15User-mobrovac: Restrict outgoing network connections from Electron render service - https://phabricator.wikimedia.org/T148567#2726439 (10GWicke) [18:30:16] !log dereckson@mira Synchronized wmf-config/CirrusSearch-labs.php: Elastic@deployment-prep: force the number of replicas to 1 max (no-op in prod, labs only) (duration: 01m 18s) [18:30:23] 06Operations: Can't scap from mira - https://phabricator.wikimedia.org/T148566#2726453 (10Dereckson) 05Open>03stalled Seems to work now. I'll reopen it or invalid it according evolution. 
[18:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:31:03] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2726455 (10Dzahn) >>! In T148478#2726398, @Paladox wrote: > Gerrit become slow again on 18/10/2016 around 6:50pm to 7:10pm bst time. Let's talk in UTC time. Like t... [18:31:49] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2726457 (10Paladox) The problem started on 17:06pm utc which was reported by @Andrew last report was 18:21 pm utc. [18:32:03] !log dereckson@mira Synchronized wmf-config/LabsServices.php: Elastic@deployment-prep: Remove deployment-elastic08 from the cluster (no-op in prod, labs only) (duration: 00m 47s) [18:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:32:15] paladox: i dont know, that error is from gerrit's sshd,, not the web part [18:32:24] well, scap sync. now works, a little slower than usual, but works [18:32:38] James_F: ping? [18:33:01] paladox: yes, more like the first one [18:33:17] from mina [18:33:19] 06Operations, 10Ops-Access-Requests: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2726477 (10AlexMonk-WMF) >>! In T148477#2726307, @dr0ptp4kt wrote: > For future reference, what would be the best way to word the request to reflect that? deployment access [18:33:21] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2726478 (10Paladox) >>! In T148478#2726435, @Dzahn wrote: > ack, it became slow but this time i did not restart the service and it was just working again after a li... 
[18:33:27] mutante yep, and this https://bugs.chromium.org/p/gerrit/issues/detail?id=3685 one [18:33:45] Okay config done (I'll do James_F if/when there, and dcausse at the end) for now. Let's do the code backports. [18:33:46] "Apache MINA says version 2.0.14 should resolve this issue--anyone want to test and verify this?" [18:33:54] from upstream bug report [18:34:24] Operations, Ops-Access-Requests: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2726485 (AlexMonk-WMF) also note deployment access involves sudo privileges, and so requires more than just the three day waiting period as ops have to discuss it at their week... [18:34:58] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on tin is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging [18:37:23] Reedy: ping? [18:37:30] Hi [18:37:45] Hi. Could you login to mira and check /srv/mediawiki-staging/php-1.28.0-wmf.22 ? [18:38:07] Is my umask wrong because we don't have moving home dirs?
[18:38:37] when I do git operations (git fetch, git status) in this directory, it uses the Git repository of /srv/mediawiki-staging [18:39:19] but there is a .git folder [18:40:26] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2726213 (10AlexMonk-WMF) (a script run would also handle wikimania2017wiki_p and tcywiki_p which are currently missing) [18:40:32] reedy@mira:/srv/mediawiki-staging/php-1.28.0-wmf.22$ git remote -v [18:40:32] origin https://gerrit.wikimedia.org/r/p/operations/mediawiki-config.git (fetch) [18:40:32] origin https://gerrit.wikimedia.org/r/p/operations/mediawiki-config.git (push) [18:40:58] Dereckson: So do the other branches [18:41:06] :S [18:41:24] It didn't do this whenever I last tried [18:42:04] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2726213 (10chasemp) >>! In T148560#2726515, @AlexMonk-WMF wrote: > (a script run would also handle wikimania2017wiki_p and tcywiki_p which are currently m... [18:42:51] ori was the last one to deploy [18:42:52] same in an old branch on tin: [18:42:55] dereckson@tin:/srv/mediawiki-staging/php-1.28.0-wmf.14$ git remote -v [18:42:58] origin https://gerrit.wikimedia.org/r/p/operations/mediawiki-config.git (fetch) [18:43:26] do we have a task for the outage yesterday? i need to reference it in a response to a user [18:43:29] Somebodies changed something [18:43:33] MatmaRex: Not yet AFAIK [18:43:40] grumble [18:43:42] ori: Was mira alright last time you deployed [18:43:49] anything i could link and say "go here for more information:? [18:43:53] just use git remote to update the origins, and wait to see if it was a one-off fluke? 
[18:43:55] i guess there's greg's email [18:44:04] ori deployed from tin [18:44:06] Operations, Traffic: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2726535 (BBlack) I've found one counterpoint recently, making a mathematically-backed-up claim that we don't have to worry about AES-128 batch attacks so much in the specific case of G... [18:44:33] Why did ori deploy from tin with the big "do not use this server" banner? [18:44:35] MatmaRex: last said incident report was still writing [18:44:50] Operations, Ops-Access-Requests: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2726539 (dr0ptp4kt) Thanks much - appreciate it! [18:44:56] Of course, everything is wrong there too [18:46:36] tin was reimagined today [18:46:39] 08:56 moritzm: reimaging tin to jessie [18:46:42] reimaged [18:46:46] The MediaWiki script file "/srv/mediawiki-staging/php-1.28.0-wmf.22/maintenance/eval.php" does not exist. [18:46:53] reedy@mira:/srv/mediawiki-staging/php-1.28.0-wmf.22$ git rev-parse --git-dir [18:46:53] /srv/mediawiki-staging/.git [18:47:26] ugh do we have to rebuild the staging repositories on mira/tin? [18:47:28] Krenair: there are only directories in maintenance/, not files [18:48:01] rebuilding shouldn't be necessary, just change the origin with `git remote set-url origin ...` and checkout the appropriate thing [18:48:29] ebernhardson: and the security patches? [18:48:33] guys where has the private repository gone?
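The symptom reported above (Git commands inside php-1.28.0-wmf.22 resolving to /srv/mediawiki-staging/.git even though a .git folder is present) is consistent with how Git discovers repositories: a .git directory that is not a valid repository is skipped and the search continues in the parent directory. Whether that is what happened on mira is an assumption, since the log does not show the folder's contents. A minimal shell sketch, using a hypothetical helper named `git_owner`, that tells the two cases apart:

```shell
# git_owner DIR (hypothetical helper, not part of scap or the deployment tooling)
# Prints "own" if DIR is the top level of a Git work tree, "parent" if Git
# resolves DIR to a repository further up the tree, and "none" if no
# repository is found at all.
git_owner() {
    # --show-prefix prints DIR's path relative to the work tree's top level;
    # it is empty when DIR itself is the top level.
    prefix=$(git -C "$1" rev-parse --show-prefix 2>/dev/null) || { echo none; return 0; }
    if [ -z "$prefix" ]; then
        echo own
    else
        echo parent
    fi
}
```

Running `git_owner /srv/mediawiki-staging/php-1.28.0-wmf.22` and getting "parent" would match the `git rev-parse --git-dir` output quoted in the log.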
[18:49:14] Dereckson: Yeah, do not deploy anything [18:49:43] jouncebot: freeze [18:49:46] maybe someone should touch the scap lock [18:49:52] 06Operations, 10Cassandra, 06Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2726547 (10Eevans) p:05Unbreak!>03High [18:50:47] Dereckson: I think the tin reimage is the cause [18:51:47] Dereckson: I lost some chat log. SWAT is on hold? [18:51:54] the first sync-masters step after a reimage will be really really slow [18:51:59] I updated the deployment table to sort patches between synced (the two labs config ones for dcausse), merged but not synced, not merged [18:52:15] kart_: yes, we've an issue [18:52:23] hey there are rsync processed on mira [18:52:34] could be still getting the code? [18:52:44] Dereckson: okay. [18:52:54] you mean this Dereckson? [18:52:55] root 1227 0.0 0.0 12584 2244 ? Ss Sep23 0:00 /usr/bin/rsync --daemon --no-detach [18:53:04] no [18:53:22] there were rsync processes at high level cpu sooner [18:53:53] 18:14:42 < Dereckson> yep, but my main issue is a scap sync-file in mira is stucked at sync-masters stage, with rsync as high cpu level [18:53:59] 18:15:10 < Dereckson> something else could be currently pulling code [18:54:17] (03PS1) 10Madhuvishy: labstore misc: Add maps project ips as role param [puppet] - 10https://gerrit.wikimedia.org/r/316609 [18:54:41] that would be expected for the first sync after tin was reimaged. I think the last time we did this dance the initial sync took something like 20 minutes. 
[18:55:38] bd808: that's coherent with observed behavior this time too, :11 → :30 [18:56:23] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726582 (10Paladox) [18:56:29] (03CR) 10Madhuvishy: [C: 032] labstore misc: Add maps project ips as role param [puppet] - 10https://gerrit.wikimedia.org/r/316609 (owner: 10Madhuvishy) [18:57:32] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2724169 (10Paladox) [18:57:34] 06Operations: Can't scap from mira - https://phabricator.wikimedia.org/T148566#2726585 (10Dereckson) 05stalled>03Invalid A 20 minutes delay is expected after a deployment server reimaging. 18:54:41 < bd808> that would be expected for the first sync after tin was reimaged. I think the last time we did this d... [18:57:38] 06Operations, 06Labs, 13Patch-For-Review: Move maps share to labstore1003 - https://phabricator.wikimedia.org/T147657#2726587 (10madhuvishy) Maps migration details Announcement: The /data/project/maps NFS share is undergoing maintenance, starting 9 AM PST (16:00 GMT) and will be unavailable for a short win... [18:57:43] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:57:44] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2724169 (10Paladox) [18:58:12] Dereckson: delay maybe expected. 
But both staging are broked [18:59:50] 06Operations: Incomplete /srv/mediawiki-staging state on deployment servers - https://phabricator.wikimedia.org/T148571#2726593 (10Dereckson) [19:00:56] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2726610 (10Gilles) For completion I also tried cgexec, which doesn't fare any better, probably because the noblacklist option doesn't quite wo... [19:02:07] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2726611 (10GWicke) [19:02:33] Krenair: what's the procedure, a root touch or a puppet change? <-- 18:49 < Krenair> maybe someone should touch the scap lock [19:03:00] It's a good idea to avoid automated processes like l10nupdate to try to sync. [19:03:11] It's not gonna for a few hours [19:03:34] Dereckson, I can't get anyone to answer the damn question [19:03:50] just touch it manually [19:03:56] not hard to do [19:04:17] The question is whether it *should* be done [19:04:27] I'd say "yes" [19:04:29] yes, do it [19:05:15] -rw-r--r-- 1 krenair wikidev 0 Oct 18 19:04 /var/lock/scap [19:05:19] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 13Patch-For-Review: pay-lvs1003/pay-lvs1004 hardware swap for pay-lvs1001/pay-lvs1002 - https://phabricator.wikimedia.org/T147932#2726630 (10Jgreen) Nope! [19:05:26] yuvipanda! 
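Touching /var/lock/scap, as done above, only blocks deploys if the tooling checks for that file before syncing. The log does not show how scap performs that check, so the following is a sketch under that assumption, with a hypothetical function name `scap_locked`, of the kind of guard a wrapper script could use:

```shell
# scap_locked [LOCKFILE] (hypothetical helper; scap's real check is not shown in the log)
# Succeeds (exit status 0) when the lock file exists.
# The default path is the one touched in the log.
scap_locked() {
    [ -e "${1:-/var/lock/scap}" ]
}

# Example guard before an automated sync (left commented out on purpose):
# if scap_locked; then echo "deploys are frozen" >&2; exit 1; fi
```

Removing the file afterwards would lift the freeze under the same assumption.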
[19:06:47] (that was mira, ostriches has tin's) [19:10:15] (PS2) Giuseppe Lavagetto: base::resolving: allow defining additional search domains in labs [puppet] - https://gerrit.wikimedia.org/r/316602 [19:11:09] Operations, Traffic: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2726639 (BBlack) Circling back to the broader plans and ideas here: === FF/NSS ChaPoly/AES pref hacks === Firefox/NSS don't seem to be making any quick progress on ChaPoly prioritiz... [19:12:22] (PS1) BryanDavis: Remove ~ori/.hushlogin [puppet] - https://gerrit.wikimedia.org/r/316613 [19:14:00] Reedy: bd808 has the answer so ^ [19:14:21] (PS1) Muehlenhoff: Add patch for CVE-2016-7042 [debs/linux44] - https://gerrit.wikimedia.org/r/316614 [19:15:14] (PS2) Giuseppe Lavagetto: compiler: add hiera lookups for the project [puppet] - https://gerrit.wikimedia.org/r/316603 [19:15:17] (CR) Giuseppe Lavagetto: [V: 2] compiler: add hiera lookups for the project [puppet] - https://gerrit.wikimedia.org/r/316603 (owner: Giuseppe Lavagetto) [19:15:31] Operations, DBA, Labs, Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2726654 (chasemp) a:chasemp>None [19:15:47] (PS1) Reedy: Let's disable l10nupdate completely until we have /srv/mediawiki-staging back [puppet] - https://gerrit.wikimedia.org/r/316615 [19:16:05] Operations, DBA, Labs, Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2726213 (chasemp) I'm not sure what the right steps are to do this through Puppet so I'm hoping to connect with one of the #DBA folks to knock it out so... [19:16:05] ^ _joe_ wonder if we should do that until we're well in the clear [19:17:37] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [19:18:36] Reedy, _joe_: Yes, yes we should [19:19:26] (CR) BryanDavis: [C: 1] Let's disable l10nupdate completely until we have /srv/mediawiki-staging back [puppet] - https://gerrit.wikimedia.org/r/316615 (owner: Reedy) [19:20:03] (CR) Chad: [C: 1] Let's disable l10nupdate completely until we have /srv/mediawiki-staging back [puppet] - https://gerrit.wikimedia.org/r/316615 (owner: Reedy) [19:20:06] (CR) Chad: [C: 1] Remove ~ori/.hushlogin [puppet] - https://gerrit.wikimedia.org/r/316613 (owner: BryanDavis) [19:20:11] (CR) Giuseppe Lavagetto: [C: 2] Let's disable l10nupdate completely until we have /srv/mediawiki-staging back [puppet] - https://gerrit.wikimedia.org/r/316615 (owner: Reedy) [19:20:18] (PS2) Giuseppe Lavagetto: Let's disable l10nupdate completely until we have /srv/mediawiki-staging back [puppet] - https://gerrit.wikimedia.org/r/316615 (owner: Reedy) [19:20:22] (CR) Giuseppe Lavagetto: [V: 2] Let's disable l10nupdate completely until we have /srv/mediawiki-staging back [puppet] - https://gerrit.wikimedia.org/r/316615 (owner: Reedy) [19:21:03] (CR) Muehlenhoff: [C: 2] Add patch for CVE-2016-7042 [debs/linux44] - https://gerrit.wikimedia.org/r/316614 (owner: Muehlenhoff) [19:21:40] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:30:07] Operations, Traffic: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2726702 (BBlack) Also relevant to the above is Mozilla's current recommendations at https://wiki.mozilla.org/Security/Server_Side_TLS . In a nutshell: * Their Modern-only config's hi...
[19:30:31] 06Operations: Incomplete /srv/mediawiki-staging state on deployment servers - https://phabricator.wikimedia.org/T148571#2726703 (10Dereckson) [19:31:45] 06Operations: Incomplete /srv/mediawiki-staging state on deployment servers - https://phabricator.wikimedia.org/T148571#2726713 (10demon) p:05Triage>03Unbreak! a:03demon [19:31:58] 06Operations: Incomplete /srv/mediawiki-staging state on deployment servers - https://phabricator.wikimedia.org/T148571#2726716 (10Dereckson) `/var/lock/scap` has been touched on both deployment servers, `l10nupdate` disabled. [19:32:10] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: Incomplete /srv/mediawiki-staging state on deployment servers - https://phabricator.wikimedia.org/T148571#2726717 (10demon) [19:32:28] Dereckson: Thanks for filing that [19:35:19] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726725 (10Paladox) The error described by @dzahn is fixed here https://gerrit-review.googlesource.com/#/c/87435/ and will probably be released in gerr... [19:36:25] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726726 (10demon) That's also not what caused a slowdown. We've been suffering from that ever since we upgraded past 2.8.x [19:36:34] (03PS2) 10Ori.livneh: Remove ~ori/.hushlogin [puppet] - 10https://gerrit.wikimedia.org/r/316613 (owner: 10BryanDavis) [19:36:39] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: Incomplete /srv/mediawiki-staging state on deployment servers - https://phabricator.wikimedia.org/T148571#2726593 (10Dzahn) Restoring /srv/mediawiki-staging from Bacula backup, source mira, destination mira /srv/mediawiki-restore, job is current... [19:36:57] (03CR) 10Ori.livneh: [C: 032 V: 032] "Yeah, makes sense. Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/316613 (owner: 10BryanDavis) [19:40:02] bd808: did I make a mistake somewhere that you were able to attribute to not seeing the MOTD, or did you just notice that by coincidence? (Thank you either way.) [19:40:31] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:40:40] ori: You deployed from tin [19:40:51] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726747 (10Paladox) @demon we could disable gc? And return to cron? [19:41:03] (03CR) 10Ottomata: [C: 032] Correct and simplify EventLogging monitoring [puppet] - 10https://gerrit.wikimedia.org/r/316567 (https://phabricator.wikimedia.org/T147321) (owner: 10Milimetric) [19:41:08] (03PS3) 10Ottomata: Correct and simplify EventLogging monitoring [puppet] - 10https://gerrit.wikimedia.org/r/316567 (https://phabricator.wikimedia.org/T147321) (owner: 10Milimetric) [19:41:09] https://p.defau.lt/?q0b9QAr9qENqCWPtQ_mCnQ [19:41:10] (03CR) 10Ottomata: [V: 032] Correct and simplify EventLogging monitoring [puppet] - 10https://gerrit.wikimedia.org/r/316567 (https://phabricator.wikimedia.org/T147321) (owner: 10Milimetric) [19:41:12] ^ missed that [19:41:24] blergh. [19:41:25] thanks [19:41:29] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:41:43] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726750 (10demon) That doesn't even make sense. I'm talking about the JVM garbage collection. [19:42:14] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726753 (10Paladox) Oh, carn't we disable that? 
[19:42:31] Operations, Gerrit, Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726754 (demon) No. [19:44:13] ori: could have happened to anyone. we need stronger safety nets. But the hushlogin didn't help [19:46:38] Operations, Gerrit, Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726768 (Paladox) Can we setup heap to catch the crash so that we get some type of log that it's jvm gc? http://stackoverflow.com/questions/35262175... [19:47:34] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:48:30] Should we setup heap for java on cobalt?, so that we can see if it is jvm gc causing the problem? [19:48:34] http://stackoverflow.com/questions/35262175/jvm-garbage-collector-suddenly-consumes-100-cpu-after-running-for-several-hours [19:51:02] Not now. [19:51:06] This isn't the time. [19:51:31] PROBLEM - puppet last run on analytics1035 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [19:51:38] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2726789 (10GWicke) [19:54:47] (03PS1) 10Paladox: Enable jvm heap log to debug gerrit slowing down [puppet] - 10https://gerrit.wikimedia.org/r/316622 (https://phabricator.wikimedia.org/T148478) [19:55:01] (03PS2) 10Paladox: Enable jvm heap log to debug gerrit slowing down [puppet] - 10https://gerrit.wikimedia.org/r/316622 (https://phabricator.wikimedia.org/T148478) [19:55:03] (03PS3) 10Paladox: Enable jvm heap log to debug gerrit slowing down [puppet] - 10https://gerrit.wikimedia.org/r/316622 (https://phabricator.wikimedia.org/T148478) [19:55:13] Ok [19:55:21] (03CR) 10Paladox: [C: 04-1] "For another time." [puppet] - 10https://gerrit.wikimedia.org/r/316622 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [19:56:46] (03CR) 10Chad: [C: 04-1] "The creation of a file in the manifest (and the erb) are not needed. We should also dump this in /var/lib/gerrit2/review_site/logs/ instea" [puppet] - 10https://gerrit.wikimedia.org/r/316622 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [19:56:58] (03CR) 10Dereckson: "JVM should be written in uppercase (commit message, log, config)" [puppet] - 10https://gerrit.wikimedia.org/r/316622 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [19:57:10] Sorry that i did ^^ when you wrote not now [19:57:12] I didnt see [19:57:13] it [19:57:18] Until i uploaded it [19:57:51] but you made 3 PS? [19:58:10] any updates on what is the verdict on SWAT? [19:58:18] Reedy i accidentially clicked rebase twice [19:58:28] Since it was loading the rebase toggly twice [19:58:28] Nikerabbit: No swat. 
[19:58:32] Nikerabbit: we resumed normal deployment today, but we've currently a little issue [19:58:38] toggly = toggle [19:58:42] Dereckson: Little, hah. [19:58:50] How do you rebase a patch twice? [19:58:51] Moar like huge issue. All deploys halted [19:59:00] Reedy click the button twice [19:59:04] When it's loading [19:59:08] ostriches: little in resolution time [19:59:13] Not quite ;-) [19:59:24] It's gonna take me a bit to ensure everything's hunky dory [19:59:28] ok [19:59:38] Well when it shows the yellow box above saying loading and your on the rebase screen [19:59:42] I accidently did it. [20:00:30] Nikerabbit: so yesterday issue is solved, but a new incident occured and we lost /srv/mediawiki-staging [20:01:06] (well solved, root cause hasn't identified precisely) [20:01:25] hmm okay [20:01:34] (03PS4) 10Paladox: Enable jvm heap log to debug gerrit slowing down [puppet] - 10https://gerrit.wikimedia.org/r/316622 (https://phabricator.wikimedia.org/T148478) [20:01:47] it's an ops-have-to-restore-many-GBs-worth-of-backups size issue [20:01:49] (03PS5) 10Paladox: Enable JVM heap log to debug gerrit slowing down [puppet] - 10https://gerrit.wikimedia.org/r/316622 (https://phabricator.wikimedia.org/T148478) [20:01:51] (03CR) 10Paladox: [C: 04-1] Enable JVM heap log to debug gerrit slowing down [puppet] - 10https://gerrit.wikimedia.org/r/316622 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [20:01:59] (03CR) 10Paladox: "> The creation of a file in the manifest (and the erb) are not" [puppet] - 10https://gerrit.wikimedia.org/r/316622 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [20:02:06] (03CR) 10Paladox: "> JVM should be written in uppercase (commit message, log, config)" [puppet] - 10https://gerrit.wikimedia.org/r/316622 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [20:03:52] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:09:05] 
06Operations, 10Deployment-Systems, 06Release-Engineering-Team: Incomplete /srv/mediawiki-staging state on deployment servers - https://phabricator.wikimedia.org/T148571#2726872 (10Dereckson) Current status: @demon will need some time to ensure the staging folder and everything are sane again. Meanwhile, de... [20:12:06] Dereckson: Root cause we knows. [20:12:17] This happened before :( [20:12:43] For yesterday 503s or /srv/mediawiki-staging? Nikerabbit question was about the first I think. [20:13:29] Oh for today's [20:13:31] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:13:51] ah, the 503s from yesterday, which is hhvm processes suddenly growing in memory, root cause is not known, but remedial measures taken and seem to be holding steady-ish [20:14:09] so swat would be ok but... now there's the deployment server issue [20:15:16] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [20:19:33] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:20:00] 06Operations, 06Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#2726926 (10Matanya) a:05Matanya>03None [20:20:18] I got to test the new survery yay, i wonder who picked me LOL [20:21:33] I was hand picked but not sure by who [20:22:10] paladox xD [20:22:25] Im testing it for the public [20:22:30] 1hr it will take [20:22:31] to answer [20:22:33] survey says, wmf is doing more surveys [20:22:50] god damnit i knew someone was gonna make a reference to family fued [20:22:54] feud* [20:23:59] LOL [20:24:07] "Have you heard of the Wikimedia Foundation?", has anyone herd of wmf [20:24:10] ? [20:24:19] paladox: what's that [20:24:30] a question on the survey [20:25:15] What in gods name is WMF? 
xD [20:26:03] Can you move the offtopic chatter elsewhere. Kthx [20:26:38] Sorry [20:27:35] Reedy sorry just having a bit of fun before stuff starts exploding again [20:28:05] See the topic [20:28:23] I'm aware [20:28:47] but you're not [20:28:55] You keep doing it [20:30:02] serious stuff. [20:30:21] It keeps hidding the topic [20:30:23] in my client [20:30:28] Doesn't mean it changes [20:30:30] It scrolls to the middle [20:30:42] paladox they are right it does say serious stuff [20:30:49] Yep [20:31:08] For IRC help, join #freenode. [20:31:21] Reedy: y so srs? [20:31:50] But i was talking about wmf, i wasent talking about anything else. [20:32:08] If you look at my question, it was a joke about wmf, i wasent trying to get off topic [20:32:14] just thought it would be funny [20:32:16] sorry for that [20:32:22] ^ [20:32:39] This is also not the channel for that. [20:32:44] There is a Wikimedia social/offtopic channel recently created I remember [20:33:01] We are aware we are now in the wrong and we are deeply sorry... [20:33:04] Dereckson #wikimedia-offtopic [20:33:33] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:33:54] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Attempt to assign to a reserved variable name: 'trusted' on node gallium.wikimedia.org [20:33:57] this is killing me [20:34:03] it's a damn heisenbug [20:34:03] I honestly forgot this was -operations, i thought this was the offtopic chan i forgot i didnt join that chan. [20:34:11] akosiaris: We could rename it ;-) [20:34:26] ostriches: gallium ? [20:34:30] don't we already ? [20:34:35] :P [20:34:42] well migrating it but anyway [20:35:09] Are they always reserved var name errors or are they a variety? [20:35:14] I meant rename the variable to trusted_for_real or something ;-) [20:35:26] If trusted is reserved! 
[20:35:51] the reserved variable name for 'trusted' is some weird race bug that only happens when for some reason the yaml cache is not looked up (presumably due to expiration?) but the puppetdb cache is used instead [20:35:58] gah [20:36:07] there's a task for that in puppetlabs jira, lemme find it [20:36:11] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:37:05] https://tickets.puppetlabs.com/browse/PDB-949 [20:37:24] closed WONTFIX [20:37:38] Heh, puppet recovered! [20:37:45] yes, I ran puppet manually [20:37:45] Mission accomplished ;-) [20:37:50] but I did not have to [20:37:57] it would have recovered anyway in the next run [20:38:37] maybe upgrading puppetdb to 3.2 would help [20:38:40] not sure though [20:38:53] that bug is so inconclusive [20:41:43] yes the next run clears it up every tie. so it's just icinga clutter as far as we are concerned >_< [20:42:00] that bug rpeort and the comments (and also the linked bugs) are sure discouraging [20:44:05] uh just hit the issue of missing thumbs on uploads on a local testwiki , this to anyone who thought it to be some cache service [20:47:15] arseny92: locally it should redirect to /w/thumb.php [20:57:02] (PS1) Milimetric: Remove unnecessary ops restriction from piwik ldap [puppet] - https://gerrit.wikimedia.org/r/316702 [21:01:00] Operations, MediaWiki-Email: Email system - https://phabricator.wikimedia.org/T148588#2727044 (Jalexander) Adding ops for now because it seems like it could be something stuck on the server side but may not be. Some clarifications from James when I chatted to him: He received notifications through the e...
[21:01:45] Operations, Cassandra, Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2727046 (Eevans) [21:03:14] Operations, MediaWiki-Email: Email system - https://phabricator.wikimedia.org/T148588#2727029 (Legoktm) This is most likely a dupe of {T134886}. [21:08:52] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:19:23] Operations, Cassandra, Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2727117 (Eevans) [21:19:44] on-topic should i enter my email when i finished the survey? [21:19:45] We may have some additional questions about this survey and how we might improve it. If you like to be contacted about your responses to this test survey, please share your email address below: [21:22:03] Operations, Cassandra, Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2727137 (Eevans) [21:23:25] Operations, Commons, media-storage, Regression: Some JPGs are being served as text - https://phabricator.wikimedia.org/T148497#2727139 (Storkk) All reporters on COM:AN also now report it working. [21:26:22] (PS1) Muehlenhoff: Add kernel-wedge to package list [puppet] - https://gerrit.wikimedia.org/r/316707 [21:31:17] (CR) Alexandros Kosiaris: [C: 1] Add kernel-wedge to package list [puppet] - https://gerrit.wikimedia.org/r/316707 (owner: Muehlenhoff) [21:31:44] (CR) Muehlenhoff: [C: 2] Add kernel-wedge to package list [puppet] - https://gerrit.wikimedia.org/r/316707 (owner: Muehlenhoff) [21:34:44] (CR) Alexandros Kosiaris: [C: -2] "that's a logical OR, not a logical AND.
So it's the sum of people that are either in the wmf group, ops group or nda group, not the inters" [puppet] - https://gerrit.wikimedia.org/r/316702 (owner: Milimetric) [21:35:13] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:37:03] !log added Dpatrick to WMF LDAP group [21:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:41:05] (CR) Milimetric: "That's a little weird... Wouldn't everyone in the ops group also be in the wmf group? Same with nda. Also, this is confusing because it " [puppet] - https://gerrit.wikimedia.org/r/316702 (owner: Milimetric) [21:41:29] (CR) Alexandros Kosiaris: [C: 1] decom palladium from puppet, install_server, network constants [puppet] - https://gerrit.wikimedia.org/r/315891 (https://phabricator.wikimedia.org/T147320) (owner: Dzahn) [21:42:16] Operations, netops, Goal, Patch-For-Review: Decomission palladium - https://phabricator.wikimedia.org/T147320#2727215 (akosiaris) FWIW, I am fine with that. But probably send one last heads up to ops@. [21:42:59] akosiaris: also, don't we want it to actually be wmf and nda?
Because people aren't automatically added to the nda group and that's important for access to things like piwik and pivot [21:43:05] Dereckson i mean the 404 error for thumbs [21:43:17] (both of which, if what you say is true, use "nda or wmf") [21:44:46] * bawolff is in the nda group but not the wmf group, I think ;) [21:45:01] I don't think the groups are uniformly applied [21:45:11] at all [21:45:30] bawolff: right, that's ok, still we should have a way to say nda & wmf in ldap config [21:45:43] (CR) Dzahn: "No, there can be volunteer ops who are not in WMF, and the "nda" group purpose is "people who signed NDAs but are not WMF employees"" [puppet] - https://gerrit.wikimedia.org/r/316702 (owner: Milimetric) [21:45:49] (CR) Alexandros Kosiaris: "no, ops would not be in NDA group. neither would people in the wmf group be in the NDA group. ops would be in both nda and wmf groups. But" [puppet] - https://gerrit.wikimedia.org/r/316702 (owner: Milimetric) [21:47:00] bawolff: that would be correct for somebody who is a volunteer [21:47:07] bawolff: Lol, uniformity. You must be new here! [21:47:14] bawolff: then you got hired , right? [21:47:23] after you already had an NDA [21:47:24] yep [21:47:31] so that is how it makes sense [21:47:43] do you have a wikitech user with a @wikimedia.org email?
[21:48:11] My wikitech user is still using my gmail account, I think [21:48:24] so technically that is your "volunteer account" [21:48:30] * bawolff was not complaining so much as pointing out that the groups are all over the place [21:48:31] now you can say i also have a WMF account [21:48:32] or not [21:49:15] the "nda" group is correct mostly, the problem is that onboarding workflows never have "add to wmf once hired" [21:49:26] there are tickets for onboarding stuff [21:49:35] but it's a cross-team issue [21:49:41] (Abandoned) Milimetric: Remove unnecessary ops restriction from piwik ldap [puppet] - https://gerrit.wikimedia.org/r/316702 (owner: Milimetric) [21:50:13] milimetric: what issue is that you are witnessing and trying to solve ? maybe we can help [21:50:51] akosiaris: people keep reporting that they can't login to piwik, and also sort of relatedly, I figured it'd be good to restrict access to only people within nda [21:52:34] I could be in the nda group, despite also being in the wmf group [21:52:36] milimetric: piwik has 2 different username/password forms. One is from apache and it's practically Basic Authentication. That one create a popup window. The other one is an HTML form. [21:52:38] I have a volunteer NDA [21:52:39] Operations, Cassandra, Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2727223 (Eevans) [21:52:48] which one are people complaining about ? [21:52:52] or are they about both ? [21:53:08] akosiaris: yeah, no worries about the second one, I got that one, people had trouble with ldap [21:53:16] NDA thats the info confientendality right?
[21:53:29] yes [21:53:37] i accidently signed both english ones xD [21:53:45] so im set :P [21:54:07] milimetric: then probably onboarding process fail [21:54:08] Zppix: you can't anymore, now they are handled on an external server managed by legal [21:54:08] I think you're talking about something else Zppix [21:54:17] akosiaris: yep, will follow up individually with people [21:54:23] Probably... because what i signed was on phab [21:54:31] Zppix: and on Phabricator you couldn't, you must be in a specific group to sign the NDA on /legalpad [21:54:34] akosiaris: thanks for explaining how that works [21:54:41] Operations, ChangeProp, Services (doing), User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2727233 (Pchelolo) Since we are updating to `4.6.0` maybe it worths going for `4.6.1` right away? It's a new security release that came out today: https://github.... [21:54:48] milimetric: yw [21:54:58] Zppix: old Phabricator NDA was https://phabricator.wikimedia.org/L2 [21:54:59] Dereckson i don't know then im probably am thinking of something completely different [21:55:14] Dereckson, 'old'? [21:55:16] Dereckson anyone can sign it on phab [21:55:17] either way i ident'd to WMF wether its the NDA or not [21:55:30] Krenair: it's supersed by individual agreements now [21:55:42] do i have to resign then Dereckson? [21:57:27] You realise signing it doesn't do anything? [21:57:45] Then we must be talking about 2 different things xD [21:57:51] Zppix: you only have to sign what you're invited to sign actually, when you need to access to some resources [21:57:52] im gonna shut up before i confuse myself even more [21:58:26] Zppix did you sign https://phabricator.wikimedia.org/L3 ?
[21:58:27] l3 l4 and l32 is all that i signed by the way [21:58:41] Dereckson like http://i.imgur.com/8E4CGCl.png [21:59:14] Zppix you wont need 3, since you currently doint have access to prod server [21:59:17] just guessing [21:59:26] Or don't bother signing any? [21:59:30] As they won't get you anything [22:00:11] But they could prevent you from re signing in the future [22:00:19] if you do ever access any prod machines [22:00:29] or any of the other ones you signed [22:00:49] Krenair: legal now uses a SaaS solution at an external provider to manage contracts and signatures, that replaces L2 [22:01:14] Dereckson, Cobblestone? [22:01:16] yes [22:01:20] Reedy enwiki requires it for some rights i eventually when the time comes want to request. [22:01:29] Sure [22:01:37] Dereckson, I signed my NDA before L2, I've never been required to sign again [22:01:37] But doesn't mean there's any point siging it now [22:01:42] Reedy so its better to have it done and over with then to forget about it [22:01:47] if conversation includes term "NDA" without further explanation, assume total confusion each time [22:01:47] Not really [22:01:50] Nor, that you'll get them [22:01:52] Possibly because I'm a current contractor, but if so then someone messed up [22:02:12] Reedy, eh i like doing things while i think about it rather then forget its a personal pref of mine [22:02:20] Why? [22:02:22] It doesn't help you [22:02:36] If you requested thoserights, and hadn't signed, it'd be asked of you at the time [22:03:01] Reedy, it doesnt hurt me either [22:03:23] what are you trying to fix? [22:03:36] mutante nothing if your meaning what im discussing [22:04:44] ok, then dont even worry about it, it's a pandora's box [22:05:35] or open ticket and put legal on it [22:05:39] there isnt just one "NDA" [22:06:29] (PS1) Yuvipanda: labs: Attempt to debug maps nfs client issue [puppet] - https://gerrit.wikimedia.org/r/316708 [22:07:22] Zppix: Have you arranged your funeral too?
So it's over with and you can forget about it? [22:07:39] (CR) jenkins-bot: [V: -1] labs: Attempt to debug maps nfs client issue [puppet] - https://gerrit.wikimedia.org/r/316708 (owner: Yuvipanda) [22:07:44] Operations, ChangeProp, Services (doing), User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2727240 (MoritzMuehlenhoff) @Pchelolo : This doesn't affect us, we're linking dynamically against Debian's copy of c-ares and the fix for that is already deployed... [22:08:02] that's for the surviving relatives so that they have something to keep them busy instead of staring at the mental hole he left behind [22:08:12] (PS2) Yuvipanda: labs: Attempt to debug maps nfs client issue [puppet] - https://gerrit.wikimedia.org/r/316708 [22:08:15] (actually Reedy, if you care about that, plan the funerals is important to do before to die, because afterwards it's too late) [22:08:25] apergos: good poin [22:08:25] t [22:08:48] (so more important than sign random documents beforehand) [22:08:58] apergos agreed [22:09:23] (CR) jenkins-bot: [V: -1] labs: Attempt to debug maps nfs client issue [puppet] - https://gerrit.wikimedia.org/r/316708 (owner: Yuvipanda) [22:09:39] And the financial burden?
[22:10:17] (PS3) Yuvipanda: labs: Attempt to debug maps nfs client issue [puppet] - https://gerrit.wikimedia.org/r/316708 [22:11:40] funerals are a rip-off [22:12:13] it all just gets buried, or goes up in flames [22:12:54] (CR) Yuvipanda: [C: 2] labs: Attempt to debug maps nfs client issue [puppet] - https://gerrit.wikimedia.org/r/316708 (owner: Yuvipanda) [22:15:53] (PS1) Yuvipanda: Revert "labs: Attempt to debug maps nfs client issue" [puppet] - https://gerrit.wikimedia.org/r/316710 [22:16:00] (CR) Yuvipanda: [C: 2 V: 2] Revert "labs: Attempt to debug maps nfs client issue" [puppet] - https://gerrit.wikimedia.org/r/316710 (owner: Yuvipanda) [22:17:07] (PS1) Madhuvishy: maps: Add maps mount to nfsmounts yaml config [puppet] - https://gerrit.wikimedia.org/r/316711 [22:18:15] Hi when i got to https://commons.wikimedia.org/wiki/File:FEZ_trial_gameplay_HD.webm [22:18:21] Im getting this error in the console [22:18:30] SEC7133: The security certificate protecting this url is insecure: data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAABkEAIAAACY3hF0AAAAZklEQVR4AWP6//9TxIJcJgYwGEHUH4bHIOrN/2IQ9ZqhGMH7/4IhHMR78t8KQlmDqMcQ3oP/chBKHkwxgHn3oYJg6j9U7s5/JjR9CMNeMIQh8V5DHAGx9v+H//0g3heG1Xj9MEoBABg9NcOxKunPAAAAAElFTkSuQmCC [22:18:47] is commons using internal certs or did we go back to using the normal certs? [22:19:01] (CR) Madhuvishy: [C: 2] maps: Add maps mount to nfsmounts yaml config [puppet] - https://gerrit.wikimedia.org/r/316711 (owner: Madhuvishy) [22:19:36] https://phabricator.wikimedia.org/F4623717 [22:20:42] It's not a URL [22:20:50] Try using a decent browser? [22:21:53] Reedy but ie is supported by wikimedia [22:22:00] And? [22:22:01] it works in Edge without those errors in the log [22:22:05] Doesn't mean you should use it [22:22:14] Reedy well dosen't mean you shoulden fix it either [22:22:20] paladox: file a bug against Edge in Edge's bug tracker?
:) [22:22:23] I don't think ther's anything to fix [22:22:29] It's not a URL [22:22:35] andre__ hi, this is ie not edge [22:22:36] It's an image base64 encoded in CSS [22:22:41] It's not a WMF bug [22:22:56] yes, but it worked before. Not sure the date it started [22:23:00] though [22:23:13] paladox: file a bug against IE in IE's bug tracker? :) [22:23:24] Ok [22:24:24] Im also going to report it against https://developer.microsoft.com/en-us/microsoft-edge/platform/issues/ [22:24:28] also used for IE [22:24:45] I'm also afraid it's offtopic for #operations. [22:28:30] Operations, Commons, media-storage, Regression: Some JPGs are being served as text - https://phabricator.wikimedia.org/T148497#2727272 (Nick) I can no longer replicate the issues I encountered earlier either. [22:28:46] !log demon@mira Synchronized README: Bringing co-masters back in sync (duration: 13m 10s) [22:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:30:53] (PS8) Alexandros Kosiaris: icinga: Remove event_profiling_enabled [puppet] - https://gerrit.wikimedia.org/r/315085 [22:30:55] (PS8) Alexandros Kosiaris: icinga: retry_check_interval => retry_interval [puppet] - https://gerrit.wikimedia.org/r/315087 [22:30:57] (PS4) Alexandros Kosiaris: ntp: Update neon specific ACLs to be more generic [puppet] - https://gerrit.wikimedia.org/r/315255 [22:30:59] (PS8) Alexandros Kosiaris: icinga: normal_check_interval => check_interval [puppet] - https://gerrit.wikimedia.org/r/315086 [22:31:01] (PS4) Alexandros Kosiaris: Replace neon with einsteinium where applicable [puppet] - https://gerrit.wikimedia.org/r/315257 [22:31:03] (PS5) Alexandros Kosiaris: icinga: Kill /etc/icinga/puppet_hostextinfo.cfg [puppet] - https://gerrit.wikimedia.org/r/315242 [22:31:05] (PS5) Alexandros Kosiaris: naggen2: Kill hostextinfo support [puppet] - https://gerrit.wikimedia.org/r/315243 [22:31:07] (PS5) Alexandros Kosiaris:
Remove absented /etc/icinga/puppet_hostextinfo.cfg entry [puppet] - https://gerrit.wikimedia.org/r/315244 [22:31:09] (PS5) Alexandros Kosiaris: icinga: Remove the last vestiges of hostextinfo [puppet] - https://gerrit.wikimedia.org/r/315245 [22:31:11] (PS1) Alexandros Kosiaris: nrpe: Update nrpe allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/316712 [22:38:13] !log demon@mira Started scap: bringing full cluster back into sync [22:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:40:34] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on tin is OK: Files ownership is ok. [22:41:24] Operations, Traffic, Browser-Support-Internet-Explorer, HTTPS, Upstream: Visting https://commons.wikimedia.org/wiki/File:FEZ_trial_gameplay_HD.webm on Internet Explorer shows errors - https://phabricator.wikimedia.org/T148595#2727296 (Paladox) [22:41:34] (PS2) Alexandros Kosiaris: nrpe: Update nrpe allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/316712 [22:41:38] (CR) Alexandros Kosiaris: [C: 2 V: 2] nrpe: Update nrpe allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/316712 (owner: Alexandros Kosiaris) [22:42:41] Operations, Traffic, Browser-Support-Internet-Explorer, HTTPS, Upstream: Visting https://commons.wikimedia.org/wiki/File:FEZ_trial_gameplay_HD.webm on Internet Explorer shows errors - https://phabricator.wikimedia.org/T148595#2727296 (Paladox) [22:43:57] It's likely not a Traffic, Operations or HTTPS issue [22:44:09] (PS1) Madhuvishy: maps: Enable root squash for maps on misc server [puppet] - https://gerrit.wikimedia.org/r/316713 [22:45:52] (CR) Madhuvishy: [C: 2] maps: Enable root squash for maps on misc server [puppet] - https://gerrit.wikimedia.org/r/316713 (owner: Madhuvishy) [22:46:31] Operations, Deployment-Systems, Release-Engineering-Team: Incomplete /srv/mediawiki-staging state on deployment servers -
https://phabricator.wikimedia.org/T148571#2727314 (demon) Ok, tin/mira are back in use and look sane. Doing a final scap to get everything back in sync, but we should be good to g... [22:47:34] ostriches: hey, your scap pull will push in production 3 untested merged changes to wmf22 [22:48:24] can't abort now [22:48:35] people shouldn't merge if they can't sync and test [22:48:38] it'll be rete [22:48:57] one is JS only [22:49:03] one is low risk [22:50:13] ostriches: it's me, I've seen the issue during a SWAT windows, where I cr+2 the backports to wmf22, it's when I was going to git fetch on /srv/mediawiki-staging/php-1.28.0-wmf.22 I've seen the state was corrupted [22:50:35] should've reverted then:) [22:53:03] Once scap is done, I'll ping the relevant developers to test that. And I'm watching fatalmonitor. [22:53:18] paladox , Reedy , andre__ , i don't have those ssl issues [22:53:41] and yes trying that exact same filepage [22:53:53] I didn't see it in Chrome either [22:54:01] arseny92: It's not an ssl issue, and it's not a url [22:54:46] 40% on sync-apaches. No files changes, but mtimes all need updating ;-) [22:55:40] kart_: brion: ping? [22:55:47] Dereckson: yo [22:56:03] brion: https://gerrit.wikimedia.org/r/#/c/316069 is live on mw1099 [22:56:08] sweet! lemme test [22:56:38] Dereckson: confirmed fixed [22:56:43] thanks [22:57:20] * brion does the dance of joy [22:57:35] you're welcome [22:58:40] Nikerabbit: ping? [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161018T2300). [23:00:04] kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:10] Lies, no swat. [23:00:18] 96% [23:00:34] :( [23:01:00] are there any evening SWATs this week?
[23:01:05] yes [23:01:12] Can go tomorrow or thurs. [23:01:22] OK [23:01:24] Just not now, I'm almost done fixing tin/mira but I'm too paranoid to unblock [23:01:33] kaldari: it's only we have an ongoing issue, https://phabricator.wikimedia.org/T148571 [23:01:45] that's fine. I was just asking since yesterdays evening SWAT was also cancelled [23:01:54] Unrelated issues :( [23:01:58] #notourweek [23:03:27] !log demon@mira Finished scap: bringing full cluster back into sync (duration: 25m 13s) [23:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:03:36] That went faster than I'd hoped. [23:04:29] Ok, I'm gonna dip out now, I'm burnt out. Please ping me (irc or e-mail) if things fall over again. [23:04:57] !log This full scap pulled three changes of the EU SWAT: [[gerrit:316069] TimedMediaHandler, [[gerrit:316585]] MobileFrontend, [[gerrit:315901]] ULS [23:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:05:31] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.351 second response time [23:07:22] brion change has been tested, Krinkle change for MF we already see a decrease of Undefined property: MobilePage::$revisionTimestamp in fatalmonitor (371 before scap, 220 when I started this line, 199 now). kart change still to test. [23:07:42] so that looks good [23:08:04] arseny92 Internet Explorer 11? 
[23:08:33] paladox yes just like yours [23:09:26] Ok [23:09:41] Maybe a new bug introduced in a windows insider build [23:12:24] Operations, Traffic, Browser-Support-Internet-Explorer, HTTPS, Upstream: Visting https://commons.wikimedia.org/wiki/File:FEZ_trial_gameplay_HD.webm on Internet Explorer shows errors - https://phabricator.wikimedia.org/T148595#2727371 (Aklapper) It only shows errors in the developer console. [23:13:04] Operations, Traffic, Browser-Support-Internet-Explorer, HTTPS, Upstream: Visting https://commons.wikimedia.org/wiki/File:FEZ_trial_gameplay_HD.webm on Internet Explorer shows errors - https://phabricator.wikimedia.org/T148595#2727373 (Paladox) yes. [23:13:36] Operations, Traffic, Browser-Support-Internet-Explorer, HTTPS, Upstream: Visting https://commons.wikimedia.org/wiki/File:FEZ_trial_gameplay_HD.webm on Internet Explorer shows errors - https://phabricator.wikimedia.org/T148595#2727374 (Paladox) [23:14:17] Operations, Traffic, Wikimedia-General-or-Unknown, Browser-Support-Internet-Explorer, Upstream: Visting [[c:File:FEZ_trial_gameplay_HD.webm]] in IE11 shows errors in developer console about insecure data:image/png;base64 "URL" - https://phabricator.wikimedia.org/T148595#2727375 (Aklapper) p:...
[23:15:25] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:18:26] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.109 second response time [23:18:50] Operations, Traffic, Wikimedia-General-or-Unknown, Browser-Support-Internet-Explorer, Upstream: Visting [[c:File:FEZ_trial_gameplay_HD.webm]] in IE11 shows errors in developer console about insecure data:image/png;base64 "URL" - https://phabricator.wikimedia.org/T148595#2727381 (Paladox) [23:56:52] Operations, Cassandra, Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2727424 (Eevans)