[00:07:03] (03PS1) 10Odder: Create logo for the Kabiye Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344562 [00:09:11] (03CR) 10Odder: "All three files have been optimized with optipng -o7 as usual." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344562 (owner: 10Odder) [00:11:41] (03PS2) 10Odder: Create logo for the Kabiye Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344562 (https://phabricator.wikimedia.org/T160868) [00:23:54] 06Operations, 10Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3127287 (10GoranSMilovanovic) @RobH Groups: (1) reseachers, (2) analytics-wmde, (3) analytics-users; personal e-mail: goran.s.milovanovic@gmail... [00:27:32] 06Operations, 10LDAP-Access-Requests, 06WMDE-Analytics-Engineering, 10Wikidata, 15User-Addshore: Add goransm to ldap/wmde group - https://phabricator.wikimedia.org/T160924#3127292 (10GoranSMilovanovic) @MoritzMuehlenhoff 24 March, 2017: NDA signed. Of course I understand you will need a confirmation from... [01:15:07] !log catrope@tin Started scap: Wikidata cherry-picks (with i18n) [01:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:35] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 82414.892438 Seconds [01:29:35] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 82414.901972 Seconds [01:29:55] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 82436.061309 Seconds [01:30:05] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 82481.035156 Seconds [01:30:25] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 82500.20455 Seconds [01:30:25] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 82500.206969 Seconds [01:37:15] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [01:40:10] !log catrope@tin Finished scap: Wikidata cherry-picks (with i18n) (duration: 25m 03s) [01:40:15] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [01:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:45:56] !log bacula - on helium, attempt to start bacula-director process, attempt to fix permissions on key files as codified in director.pp [01:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:46:25] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [01:54:25] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 3.282055 Seconds [01:54:25] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 3.292665 Seconds [01:55:05] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 44.309131 Seconds [01:55:35] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 24.783112 Seconds [01:55:37] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 24.805346 Seconds [01:55:55] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 45.344678 Seconds [02:01:03] Uhmm, dewiki seems down for me [02:01:03] Request from 94.134.245.179 via cp3033 cp3033, Varnish 
XID 735457820 [02:01:03] Error: 503, Service Unavailable at Fri, 24 Mar 2017 02:00:09 GMT [02:05:25] PROBLEM - HHVM rendering on mw1256 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.848 second response time [02:05:35] PROBLEM - HHVM rendering on mw1237 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.895 second response time [02:05:35] PROBLEM - HHVM rendering on mw1189 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.998 second response time [02:05:35] PROBLEM - HHVM rendering on mw1245 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.123 second response time [02:05:45] PROBLEM - HHVM rendering on mw1253 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.100 second response time [02:05:45] PROBLEM - HHVM rendering on mw1203 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.229 second response time [02:05:45] PROBLEM - HHVM rendering on mw1286 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.344 second response time [02:05:55] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:05:56] PROBLEM - HHVM rendering on mw1287 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.866 second response time [02:06:05] PROBLEM - HHVM rendering on mw1284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.765 second response time [02:06:05] PROBLEM - HHVM rendering on mw1185 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.748 second response time [02:06:05] PROBLEM - HHVM rendering on mw1211 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.668 second response time [02:06:05] PROBLEM - HHVM rendering on mw1236 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.562 second response time [02:06:25] PROBLEM - HHVM rendering on mw1246 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.259 second response time [02:06:25] RECOVERY - HHVM rendering on mw1256 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.075 second response time [02:06:25] RECOVERY - HHVM rendering on mw1237 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.071 second response time [02:06:35] RECOVERY - HHVM rendering on mw1203 is OK: HTTP OK: HTTP/1.1 200 OK - 78733 bytes in 0.121 second response time [02:06:35] RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.075 second response time [02:06:45] PROBLEM - HHVM rendering on mw1178 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.794 second response time [02:07:05] RECOVERY - HHVM rendering on mw1236 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.071 second response time [02:07:15] PROBLEM - HHVM rendering on mw1266 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.940 second response time [02:07:15] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [02:07:15] eddiegp: ^ that's probably why [02:07:25] PROBLEM - HHVM rendering on mw1241 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.034 second response time [02:07:25] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.098 
second response time [02:07:35] RECOVERY - HHVM rendering on mw1253 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.070 second response time [02:07:45] PROBLEM - HHVM rendering on mw1171 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.580 second response time [02:07:54] Jep, I just saw this. [02:07:55] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [02:07:55] RECOVERY - HHVM rendering on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.084 second response time [02:07:55] RECOVERY - HHVM rendering on mw1185 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.077 second response time [02:08:05] PROBLEM - HHVM rendering on mw1270 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.886 second response time [02:08:05] RECOVERY - HHVM rendering on mw1266 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.087 second response time [02:08:15] PROBLEM - HHVM rendering on mw1254 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.456 second response time [02:08:25] RECOVERY - HHVM rendering on mw1245 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.073 second response time [02:08:35] PROBLEM - HHVM rendering on mw1179 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.255 second response time [02:08:35] RECOVERY - HHVM rendering on mw1178 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.077 second response time [02:08:35] PROBLEM - HHVM rendering on mw1295 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50597 bytes in 7.899 second response time [02:08:55] RECOVERY - HHVM rendering on mw1284 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.070 second response time [02:08:55] RECOVERY - HHVM rendering on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.079 second response time [02:08:55] RECOVERY - HHVM rendering on mw1211 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.074 second response time [02:08:56] PROBLEM - HHVM rendering on mw1249 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.864 second response time [02:08:56] PROBLEM - HHVM rendering on mw1272 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.004 second response time [02:09:05] PROBLEM - HHVM rendering on mw1207 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.021 second response time [02:09:05] RECOVERY - HHVM rendering on mw1254 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.072 second response time [02:09:05] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50624 bytes in 8.431 second response time [02:09:15] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 500 (expecting: 200) [02:09:15] PROBLEM - HHVM rendering on mw1201 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.112 second response time [02:09:15] RECOVERY - HHVM rendering on mw1241 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.071 second response time [02:09:15] RECOVERY - HHVM rendering on mw1246 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.073 second response time [02:09:25] PROBLEM - HHVM rendering on mw1242 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server 
Error - 50620 bytes in 7.695 second response time [02:09:27] RECOVERY - HHVM rendering on mw1179 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.094 second response time [02:09:27] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 78733 bytes in 0.119 second response time [02:09:35] PROBLEM - HHVM rendering on mw1226 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.025 second response time [02:09:35] RECOVERY - HHVM rendering on mw1171 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.073 second response time [02:09:55] RECOVERY - HHVM rendering on mw1249 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.097 second response time [02:09:56] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 78742 bytes in 0.322 second response time [02:09:56] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:09:56] PROBLEM - HHVM rendering on mw1258 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.746 second response time [02:10:04] hi [02:10:05] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [02:10:05] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 78732 bytes in 0.110 second response time [02:10:15] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [02:10:15] RECOVERY - HHVM rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.088 second response time [02:10:17] any ideas on cause? [02:10:22] saw this: [02:10:23] Mar 24 02:08:40 mw1185 hhvm[18031]: Fatal error: Class undefined: Wikibase\Client\Hooks\NoLangLinkHandler in /srv/mediawiki/php-1.29.0-wmf....n line 99 [02:10:26] PROBLEM - HHVM rendering on mw1256 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.001 second response time [02:10:35] PROBLEM - HHVM rendering on mw1282 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.876 second response time [02:10:40] not sure yet if it's actually over [02:10:45] PROBLEM - HHVM rendering on mw1253 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.961 second response time [02:10:45] PROBLEM - HHVM rendering on mw1273 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.845 second response time [02:10:55] RECOVERY - HHVM rendering on mw1258 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.073 second response time [02:10:55] RECOVERY - HHVM rendering on mw1272 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.099 second response time [02:10:55] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.074 second response time [02:10:56] the maps replag a bit before is weird too, I wonder if there was some codfw<->eqiad net slowdown that caused tertiary fallout? 
[02:10:56] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [02:11:15] PROBLEM - HHVM rendering on mw1229 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.931 second response time [02:11:25] PROBLEM - HHVM rendering on mw1176 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.123 second response time [02:11:25] RECOVERY - HHVM rendering on mw1256 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.077 second response time [02:11:25] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.070 second response time [02:11:25] PROBLEM - HHVM rendering on mw1224 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.984 second response time [02:11:25] PROBLEM - HHVM rendering on mw1210 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.036 second response time [02:11:25] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 500 (expecting: 200) [02:11:33] on a random one, mw1185, status of HHVM is "active (running)" [02:11:33] the first and most-persistent of the spam is on rendering [02:11:35] RECOVERY - HHVM rendering on mw1273 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.093 second response time [02:11:40] but it still shows those fatals too [02:11:45] PROBLEM - HHVM rendering on mw1294 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50597 bytes in 8.740 second response time [02:11:45] PROBLEM - HHVM rendering on mw1222 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.488 second response time [02:11:57] !log Removing xhgui.results entries from before 1 December 2016 in MongoDB on tungsten (T161196) [02:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:04] T161196: tungsten is out of space on /srv - https://phabricator.wikimedia.org/T161196 [02:12:05] PROBLEM - HHVM rendering on mw1211 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.980 second response time [02:12:15] RECOVERY - HHVM rendering on mw1176 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.083 second response time [02:12:15] PROBLEM - HHVM rendering on mw1247 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 9.273 second response time [02:12:15] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.069 second response time [02:12:15] RECOVERY - HHVM rendering on mw1210 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.077 second response time [02:12:35] Aye, what's going on? [02:12:35] RECOVERY - HHVM rendering on mw1253 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.086 second response time [02:12:35] RECOVERY - HHVM rendering on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 78733 bytes in 0.118 second response time [02:12:35] RECOVERY - HHVM rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.083 second response time [02:12:36] "wikidata cherrypicks" was last deploy update not too long before, but seems unlikely to be related? 
[02:12:56] PROBLEM - HHVM rendering on mw1173 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.710 second response time [02:12:59] yes, it does, because i think that error message above ends in "wikidata" something [02:13:05] PROBLEM - HHVM rendering on mw1267 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.855 second response time [02:13:05] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.073 second response time [02:13:05] RECOVERY - HHVM rendering on mw1247 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.073 second response time [02:13:09] /php-1.29.0-wmf.17/extensions/Wikid.. [02:13:14] hmmm [02:13:15] PROBLEM - HHVM rendering on mw1170 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.962 second response time [02:13:15] PROBLEM - HHVM rendering on mw1206 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.112 second response time [02:13:15] PROBLEM - HHVM rendering on mw1254 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.367 second response time [02:13:17] and "Wikibase" is wikidata [02:13:18] RoanKattouw: ^ What did you deploy? :) [02:13:25] PROBLEM - HHVM rendering on mw1242 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.772 second response time [02:13:25] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.088 second response time [02:13:28] I was referring to: [02:13:30] 01:40 < logmsgbot> !log catrope@tin Finished scap: Wikidata cherry-picks (with i18n) (duration: 25m 03s) [02:13:36] but there must have been more info earlier on [02:13:55] RECOVERY - HHVM rendering on mw1267 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.085 second response time [02:14:05] PROBLEM - HHVM rendering on mw1244 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.164 second response time [02:14:05] PROBLEM - HHVM rendering on mw1221 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.982 second response time [02:14:05] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.082 second response time [02:14:05] RECOVERY - HHVM rendering on mw1170 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.092 second response time [02:14:05] RECOVERY - HHVM rendering on mw1254 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.071 second response time [02:14:06] PROBLEM - HHVM rendering on mw1236 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.230 second response time [02:14:15] RECOVERY - HHVM rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.065 second response time [02:14:21] Checking tin bblack [02:14:23] Fatal error: Class undefined: Wikibase\Client\Hooks\NoLangLinkHandler in /srv/mediawiki/php-1.29.0-wmf.17/extensions/Wikidata/extensions/Wikibase/client/includes/Hooks/MagicWordHookHandlers.php on line 99 [02:14:24] php/ git log [02:14:25] mit f95d9135e48b8329328efc2af46b8d8100c79089 [02:14:25] Author: Roan Kattouw [02:14:25] Date: Thu Mar 23 17:46:46 2017 -0700 [02:14:25] Update git submodules [02:14:25] [02:14:25] * Update extensions/Wikidata from branch 'wmf/1.29.0-wmf.17' [02:14:25] - Fix .gitreview [02:14:26] [02:14:26] Change-Id: Ia8d25041eee648caa22c8fd34eaacd26035ac0d8 [02:14:27] commit 1f32a896a6207bf78d9e97a9fdb49818256e0f3a [02:14:27] Author: Roan Kattouw [02:14:28] Date: Thu Mar 
23 17:44:42 2017 -0700 [02:14:28] Update git submodules [02:14:31] That is the most prevelant failure [02:14:35] PROBLEM - HHVM rendering on mw1297 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50597 bytes in 7.775 second response time [02:14:45] PROBLEM - HHVM rendering on mw1199 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.990 second response time [02:14:45] PROBLEM - HHVM rendering on mw1193 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.031 second response time [02:14:45] PROBLEM - HHVM rendering on mw1298 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50597 bytes in 8.145 second response time [02:14:45] RECOVERY - HHVM rendering on mw1173 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.098 second response time [02:14:59] https://gerrit.wikimedia.org/r/#/c/342294/ [02:15:01] the timing is also right when it started, yes [02:15:03] Ouch [02:15:04] "Port Wikidata filter to new RCFilters system" [02:15:05] RECOVERY - HHVM rendering on mw1236 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.071 second response time [02:15:15] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [02:15:25] PROBLEM - HHVM rendering on mw1250 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.877 second response time [02:15:28] RoanKattouw: hey:) [02:15:30] Hmm did I screw up the autoloader file? [02:15:35] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 78733 bytes in 0.136 second response time [02:15:38] https://github.com/wikimedia/mediawiki-extensions-Wikidata/commit/ed58ad75ebc33ebbf675ce676108e81c22081329 [02:15:45] PROBLEM - HHVM rendering on mw1253 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.875 second response time [02:16:04] logstash is broken? 
[02:16:05] PROBLEM - HHVM rendering on mw1208 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.048 second response time [02:16:05] PROBLEM - HHVM rendering on mw1181 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 4.190 second response time [02:16:15] PROBLEM - HHVM rendering on mw1188 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 4.895 second response time [02:16:15] RECOVERY - HHVM rendering on mw1250 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.093 second response time [02:16:15] PROBLEM - HHVM rendering on mw1180 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 6.514 second response time [02:16:16] PROBLEM - HHVM rendering on mw1246 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 6.512 second response time [02:16:25] PROBLEM - HHVM rendering on mw1293 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50597 bytes in 7.782 second response time [02:16:25] PROBLEM - HHVM rendering on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 6.811 second response time [02:16:35] PROBLEM - HHVM rendering on mw1202 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.553 second response time [02:16:35] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 78732 bytes in 0.104 second response time [02:16:35] RECOVERY - HHVM rendering on mw1253 is OK: HTTP OK: HTTP/1.1 200 OK - 78732 bytes in 0.108 second response time [02:16:35] PROBLEM - HHVM rendering on mw1295 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50597 bytes in 7.839 second response time [02:16:41] Is anyone trying to revert? 
[02:16:43] If not, I'm doing it now [02:16:54] Please [02:16:55] please do, +1 :) [02:16:55] RECOVERY - HHVM rendering on mw1244 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.070 second response time [02:16:55] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 78733 bytes in 0.119 second response time [02:16:55] RECOVERY - HHVM rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.071 second response time [02:16:57] i was going to ask if we can, please [02:17:05] RECOVERY - HHVM rendering on mw1188 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.074 second response time [02:17:05] RECOVERY - HHVM rendering on mw1180 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.082 second response time [02:17:12] I'm on mobile so go right ahead [02:17:15] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 500 (expecting: 200) [02:17:15] PROBLEM - HHVM rendering on mw1247 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.718 second response time [02:17:15] RECOVERY - HHVM rendering on mw1246 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.074 second response time [02:17:15] RECOVERY - HHVM rendering on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.093 second response time [02:17:25] RECOVERY - HHVM rendering on mw1263 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.087 second response time [02:17:25] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 78732 bytes in 0.104 second response time [02:17:35] PROBLEM - HHVM rendering on mw1198 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.407 second response time [02:17:35] PROBLEM - HHVM rendering on mw1215 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.151 second response time [02:17:35] PROBLEM - HHVM rendering on mw1268 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.200 second response time [02:17:36] RoanKattouw: 'Wikibase\\NoLangLinkHandler' => $baseDir . '/extensions/Wikibase/client/includes/Hooks/NoLangLinkHandler.php', [02:17:45] PROBLEM - HHVM rendering on mw1257 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.537 second response time [02:17:54] RoanKattouw: Looks like wrong NS [02:17:55] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:17:55] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:17:56] RECOVERY - HHVM rendering on mw1211 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.081 second response time [02:17:57] 06Operations: bacula-director not running on helium - https://phabricator.wikimedia.org/T161281#3127354 (10Dzahn) [02:18:05] PROBLEM - HHVM rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.958 second response time [02:18:05] PROBLEM - HHVM rendering on mw1184 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.960 second response time [02:18:05] PROBLEM - HHVM rendering on mw1185 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.878 second response time [02:18:05] RECOVERY - HHVM rendering on mw1247 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.070 second response time [02:18:15] PROBLEM - HHVM rendering on mw1229 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.166 second response time [02:18:15] PROBLEM - HHVM rendering on mw1248 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.166 second response time [02:18:15] PROBLEM - HHVM rendering on mw1266 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.134 second response time [02:18:15] PROBLEM - HHVM rendering on mw1281 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.870 second response time [02:18:16] PROBLEM - HHVM rendering on mw1241 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.793 second response time [02:18:25] PROBLEM - HHVM rendering on mw1252 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.782 second response time [02:18:25] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.094 second response time [02:18:25] RECOVERY - HHVM rendering on mw1268 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.073 second response time [02:18:25] RECOVERY - HHVM rendering on mw1215 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.089 second response time [02:18:25] RECOVERY - HHVM rendering on mw1297 is OK: HTTP OK: HTTP/1.1 200 OK - 78733 bytes in 0.122 second response time [02:18:26] FWIW, the net impact to user traffic seems much smaller than all this log spam would indicate :) [02:18:26] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 78733 bytes in 0.125 second response time [02:18:26] PROBLEM - HHVM rendering on mw1256 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.796 second response time [02:18:35] PROBLEM - HHVM rendering on mw1285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.741 second response time [02:18:35] RECOVERY - HHVM rendering on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.077 second response time [02:18:45] PROBLEM - HHVM rendering on mw1286 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.377 second response time [02:18:45] PROBLEM - HHVM rendering on mw1228 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.297 second response time [02:18:45] PROBLEM - HHVM rendering on mw1294 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50597 bytes in 8.380 second response time [02:18:51] weeeee [02:18:51] interesting part is that one user came and reported that de.wp was down [02:18:55] RECOVERY - HHVM rendering on mw1184 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes 
in 0.102 second response time [02:18:55] RECOVERY - HHVM rendering on mw1185 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.080 second response time [02:18:55] PROBLEM - HHVM rendering on mw1173 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.722 second response time [02:18:56] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:18:57] BEFORE the icinga started talking about anything [02:19:03] and then shortly after it started [02:19:05] RECOVERY - HHVM rendering on mw1181 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.083 second response time [02:19:05] RECOVERY - HHVM rendering on mw1248 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.076 second response time [02:19:13] enwiki main page is down [02:19:15] RECOVERY - HHVM rendering on mw1241 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.073 second response time [02:19:25] RECOVERY - HHVM rendering on mw1256 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.069 second response time [02:19:25] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.073 second response time [02:19:35] RECOVERY - HHVM rendering on mw1257 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.068 second response time [02:19:36] RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.082 second response time [02:19:36] RECOVERY - HHVM rendering on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.079 second response time [02:19:36] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.089 second response time [02:19:36] Reedy: But my commit didn't touch that, did it? [02:19:36] also, no pages [02:19:45] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [02:19:45] RECOVERY - HHVM rendering on mw1173 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.083 second response time [02:19:45] PROBLEM - HHVM rendering on mw1177 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.586 second response time [02:19:55] https://github.com/wikimedia/mediawiki-extensions-Wikidata/commit/ed58ad75ebc33ebbf675ce676108e81c22081329#diff-8779bcd8178ea875ab61c9c360551636R357 [02:19:55] RECOVERY - HHVM rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.093 second response time [02:19:55] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [02:20:01] The main page for some but all Wikipedias is down for me. However, it seems to work on other browsers/logged out? [02:20:03] some pages render, sometimes, it's kinda random [02:20:05] RECOVERY - HHVM rendering on mw1266 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.091 second response time [02:20:05] revert staged on tin [02:20:06] Seemingly not [02:20:09] testing on mwdebug now [02:20:13] and yes, anonymous get some cache benefits.. 
[02:20:14] jay8g: Logged out will likely hit the cache [02:20:15] RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 78732 bytes in 0.105 second response time [02:20:17] (vs logged-in) [02:20:18] Krinkle: Just deploy it [02:20:37] cached/anonymous main_page on enwiki is still ok for me, but may vary [02:20:45] PROBLEM - HHVM rendering on mw1240 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.938 second response time [02:20:45] RECOVERY - HHVM rendering on mw1177 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.074 second response time [02:20:45] PROBLEM - HHVM rendering on mw1214 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.945 second response time [02:20:45] PROBLEM - HHVM rendering on mw1255 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.853 second response time [02:20:45] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [02:20:56] PROBLEM - HHVM rendering on mw1227 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.739 second response time [02:20:56] RoanKattouw: https://github.com/wikimedia/mediawiki-extensions-Wikidata/commit/ed58ad75ebc33ebbf675ce676108e81c22081329#diff-b2bf7ff075a09dba6330a56d29ec97fbR99 [02:21:01] I guess that means that line is the wrong NS [02:21:05] PROBLEM - HHVM rendering on mw1207 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.731 second response time [02:21:11] fucking l10n update has the lock [02:21:14] (and it's been down for a while now) [02:21:15] PROBLEM - HHVM rendering on mw1180 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.761 second response time [02:21:22] Yeah, looks to be that [02:21:25] PROBLEM - HHVM rendering on mw1289 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.789 second response time [02:21:30] Krinkle: And a great example why we should disable l10nupdate [02:21:35] RECOVERY - HHVM rendering on mw1240 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.070 second response time [02:21:35] PROBLEM - HHVM rendering on mw1271 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.695 second response time [02:21:36] RECOVERY - HHVM rendering on mw1255 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.076 second response time [02:21:45] PROBLEM - HHVM rendering on mw1253 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.705 second response time [02:21:45] PROBLEM - HHVM rendering on mw1178 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.683 second response time [02:21:45] PROBLEM - HHVM rendering on mw1238 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.764 second response time [02:21:45] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:21:53] Don't bother testing on mwdebug, I did test it there and didn't catch this [02:21:54] is there a sane thing to do at that point other than wait on l10nupdate? 
[02:22:05] PROBLEM - HHVM rendering on mw1249 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.956 second response time [02:22:05] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.082 second response time [02:22:05] RECOVERY - HHVM rendering on mw1180 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.088 second response time [02:22:13] Kill l10nupdate? [02:22:15] RECOVERY - HHVM rendering on mw1252 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.076 second response time [02:22:15] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.072 second response time [02:22:25] !log Hard-killed all l10nupdate processes and rm'ed scap lock [02:22:25] RECOVERY - HHVM rendering on mw1271 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.076 second response time [02:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:22:32] syncing now [02:22:35] RECOVERY - HHVM rendering on mw1238 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.077 second response time [02:22:35] RECOVERY - HHVM rendering on mw1178 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.091 second response time [02:22:45] PROBLEM - HHVM rendering on mw1257 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.743 second response time [02:22:45] PROBLEM - HHVM rendering on mw1286 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.726 second response time [02:22:45] PROBLEM - HHVM rendering on mw1209 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.733 second response time [02:22:45] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [02:22:55] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.083 second response time [02:22:55] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.079 second response time [02:22:55] RECOVERY - HHVM rendering on mw1249 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.102 second response time [02:22:55] PROBLEM - HHVM rendering on mw1258 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.737 second response time [02:23:05] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [02:23:23] scap is not doing anything [02:23:25] PROBLEM - HHVM rendering on mw1246 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.699 second response time [02:23:25] PROBLEM - HHVM rendering on mw1195 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.816 second response time [02:23:31] bd808: Reedy [02:23:35] RECOVERY - HHVM rendering on mw1253 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.073 second response time [02:23:35] RECOVERY - HHVM rendering on mw1257 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.090 second response time [02:23:35] PROBLEM - HHVM rendering on mw1245 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.873 second response time [02:23:35] RECOVERY - HHVM rendering on mw1209 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.086 second response time [02:23:35] RECOVERY - HHVM rendering on mw1214 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.078 second response time [02:23:41] Reedy: sorry the anchor in that link isn't working for me, can you quote the line you're referring to? 
[02:23:43] I see the pony pig but stalled output otherwise [02:23:44] killed [02:23:45] PROBLEM - HHVM rendering on mw1191 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.770 second response time [02:23:45] PROBLEM - HHVM rendering on mw1213 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.778 second response time [02:23:45] PROBLEM - HHVM rendering on mw1177 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.709 second response time [02:23:50] RoanKattouw: I left a comment on gerrit [02:23:55] RECOVERY - HHVM rendering on mw1258 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.087 second response time [02:23:57] Thanks [02:24:05] PROBLEM - HHVM rendering on mw1221 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.707 second response time [02:24:05] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50624 bytes in 7.749 second response time [02:24:05] okay, rsync-masters got it now [02:24:15] RECOVERY - HHVM rendering on mw1246 is OK: HTTP OK: HTTP/1.1 200 OK - 78731 bytes in 0.068 second response time [02:24:15] PROBLEM - HHVM rendering on mw1266 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 8.133 second response time [02:24:15] PROBLEM - HHVM rendering on mw1188 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.979 second response time [02:24:16] 06Operations: bacula-director not running on helium - https://phabricator.wikimedia.org/T161281#3127367 (10Dzahn) [02:24:25] PROBLEM - HHVM rendering on mw1230 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.729 second response time [02:24:25] PROBLEM - HHVM rendering on mw1280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50620 bytes in 7.622 second response time [02:24:25] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 78711 bytes in 3.214 second response time [02:24:25] RECOVERY - HHVM rendering on mw1245 is OK: HTTP OK: HTTP/1.1 200 OK - 78707 bytes in 0.069 second response time [02:24:32] !log Killed l10nupdate on tin, was blocking emergency pushes [02:24:35] RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 78707 bytes in 0.079 second response time [02:24:35] RECOVERY - HHVM rendering on mw1191 is OK: HTTP OK: HTTP/1.1 200 OK - 78709 bytes in 0.128 second response time [02:24:35] RECOVERY - HHVM rendering on mw1213 is OK: HTTP OK: HTTP/1.1 200 OK - 78707 bytes in 0.090 second response time [02:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:45] RECOVERY - HHVM rendering on mw1177 is OK: HTTP OK: HTTP/1.1 200 OK - 78707 bytes in 0.089 second response time [02:24:48] !log krinkle@tin Synchronized php-1.29.0-wmf.17/extensions/Wikidata: revert (duration: 02m 34s) [02:24:51] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/344564/1/extensions/Wikibase/client/includes/Hooks/MagicWordHookHandlers.php [02:24:55] RECOVERY - HHVM rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 78708 bytes in 0.118 second response time [02:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:55] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 78720 bytes in 1.859 second response time [02:24:59] Reedy: well spotted [02:25:00] You moved the file it was in, so that changed the NS [02:25:04] !log All apaches are back up 
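For context on the "wrong NS" diagnosis above: PHP resolves an unqualified class name against the namespace of the calling file, so code that moves into a new namespace silently changes which class its bare references point at. A minimal sketch of that failure mode, using the class and namespace names from the fatal and the autoloader entry quoted above; the method names (doNoExternalLangLinks, handle) are made up for illustration, and the actual fix is the Gerrit change linked above:

```php
<?php
// Hypothetical sketch of the failure mode discussed above; simplified, not
// the actual Wikibase source.

namespace Wikibase {
	// The handler kept its old namespace when its file moved, so the
	// autoloader still registers it as \Wikibase\NoLangLinkHandler.
	class NoLangLinkHandler {
		public static function handle( $parser ) {
			// ... real implementation lives in the Wikibase client ...
		}
	}
}

namespace Wikibase\Client\Hooks {
	class MagicWordHookHandlers {
		public function doNoExternalLangLinks( $parser ) {
			// BROKEN: an unqualified name resolves against the *current*
			// namespace, so this looks for
			// \Wikibase\Client\Hooks\NoLangLinkHandler, which is undefined:
			// NoLangLinkHandler::handle( $parser );

			// FIXED: fully qualify the reference (or add a `use` statement).
			\Wikibase\NoLangLinkHandler::handle( $parser );
		}
	}
}
```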
[02:25:05] RECOVERY - HHVM rendering on mw1266 is OK: HTTP OK: HTTP/1.1 200 OK - 78707 bytes in 0.090 second response time [02:25:05] RECOVERY - HHVM rendering on mw1188 is OK: HTTP OK: HTTP/1.1 200 OK - 78707 bytes in 0.074 second response time [02:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:25:12] I still wonder why I got white pages and 503 error pages about 5 minutes before this flood of errors. [02:25:15] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 78707 bytes in 0.099 second response time [02:25:15] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 78707 bytes in 0.072 second response time [02:25:25] So it was a bug in the code, not the infrastructure [02:25:29] Will fix when I get home [02:25:30] Yup [02:25:30] eddiegp: it can take a while for the automated monitoring to start alerting [02:25:37] https://en.wikipedia.org/wiki/Main_Page appears to have just recovered [02:25:40] Plus, people hitting hte right pages [02:25:49] works for me again [02:25:59] thanks for revert [02:26:14] Thanks Krinkle [02:26:37] And thanks Reedy for finding the bug, it's also broken in master right now so I'll fix it after dinner [02:26:49] !log Reminder to incident doc writer: Logstash was (and is) not responsive serving Kibana-rendered errors about logstash Service unavailable [02:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:50] Just commented on T158360 about l10nupdate prolonging the outage [02:27:50] T158360: RFC: Disabling LocalisationUpdate on WMF wikis - https://phabricator.wikimedia.org/T158360 [02:28:13] !log Reminder to incident doc writer: Was difficult figuring out what the last "real" patch was, scap message for SAL is manually written (not says which commit in which repo), and git log contains noise from security patches. We need simple revert options from the flat git tree at /srv/mediawiki [02:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:48] Krinkle: Don't we have a magic "undo last deploy" [02:28:55] next question: why fatals resulted in blank pages instead of wikimedia errors? [02:29:08] Maybe, but I am not yet familiar with it. and didnt have the time to start learning it now [02:29:18] MaxSem: indeed, good point. [02:29:18] was it the actual page URLs that were returning the 500s? [02:29:25] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [02:29:35] bblack: Yes, all urls were blanking for me. It also took a surprisingly long time to render (1-5 seconds) [02:29:35] bblack, yes. like enwiki main page [02:29:36] bblack: Ask eddiegp as he was seeing it [02:29:43] but possibly due to increased load from cache misses? [02:29:51] although that shoudln't be the case [02:30:02] I meant as opposed to a sub-resource on the page blocking rendering [02:30:18] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.17) (duration: 08m 35s) [02:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:35] For me it was https://de.wikipedia.org/wiki/Wikipedia:Hauptseite and https://en.wikipedia.org/wiki/Main_Page [02:30:44] it took 1-5s to get a blank page from PHP. Which may be fine if the underlying erorr was e.g. related to DB code, but it seems like it should've been an immediate failure from mw init. 
[02:30:45] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[02:31:13] !log Reverted patch - https://gerrit.wikimedia.org/r/#/c/344569/
[02:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:35] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3729850 keys, up 11 hours 32 minutes - replication_delay is 0
[02:31:53] Only got white pages at en.wp, but I only tried to load that page once or twice after error flood came in. Before the error flood I loaded de.wp a few times, sometimes getting 503 (see 03:01) and sometimes white pages.
[02:32:21] some of the 500 are on the page itself, yes: "http_status":"500","response_size":50115,"http_method":"GET","uri_host":"en.wikipedia.org","uri_path":"/wiki/Main_Page"
[02:32:40] the blank pages I saw in browser, they never really finished loading
[02:33:52] why was almost all the spam just for rendering?
[02:33:59] I was talking about what the browser shows, yeah. I didn't look in HTTP response header, with 503 I meant a HTML WMF error page telling me a 503
[02:34:28] maybe the icinga checks use different paths for rendering vs appservers/api
[02:34:55] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 615 MB (0% inode=97%)
[02:35:55] RECOVERY - Disk space on stat1002 is OK: DISK OK
[02:36:02] bblack: I see in logstash now (spotty responsive again, kind of) that there were also about as many fatals as there were DB-lagged warnings
[02:36:20] scratch that, about 10x more db lag warnings
[02:36:29] 06Operations: bacula-director not running on helium - https://phabricator.wikimedia.org/T161281#3127380 (10Dzahn)
[02:36:30] oh, I'm just confused by the icinga output
[02:36:41] 500K db lag warnings in 1 hour, 32K fatals about wikidata class missing
[02:37:02] Fatal error: unknown exception
[02:37:13] "HHVM rendering on mw1280" does not mean the rendering cluster (as in thumbnail rendering)
[02:37:25] Yeah, it means html rendering
[02:37:26] it means "hhvm on the main appservers cluster is failing at rendering the main page"
[02:37:29] confusing!
[02:37:46] I think what you're looking for is what we typically call scaling?
[02:37:54] the "scalers"
[02:38:03] yeah but the actual cluster name is "rendering"
[02:38:13] e.g. in LVS config:
[02:38:14] rendering:
[02:38:14] description: "MediaWiki thumbnail rendering cluster, rendering.svc.%{::site}.wmnet"
[02:38:18] class: low-traffic
[02:38:35] Hm.. did you pick that name? :P
[02:38:40] I doubt it :)
[02:38:53] we like reusing things in different context
[02:39:00] or similar ones
[02:39:02] keeps people on their toes
[02:39:09] "ES is down!"
[02:39:15] Yeah, should be consistent. This is the first I hear of rendering, but then again, I don't see much of LVS
[02:39:18] "ElasticSearch or External Storage?"
[02:39:21] ecma script?
[02:39:26] Just say labs labs labs
[02:39:40] I prefer nocnocnoc
[02:39:47] https://github.com/wikimedia/puppet/blob/f29a5df66173396536bc96bc762359278b1b511c/conftool-data/node/eqiad.yaml
[02:39:51] imagescaler and videoscaler
[02:39:58] uh, also Use of ChangesListSpecialPageFilters hook (used in \Wikibase\Client\Hooks\ChangesListSpecialPageHookHandlers::onChangesListSpecialPageFilters) was deprecated in MediaWiki 1.29. [Called from ChangesListSpecialPage::getCustomFilters in /srv/mediawiki/php-1.29.0-wmf.17/includes/specialpage/ChangesListSpecialPage.php at line 723]
[02:40:16] lol
[02:40:20] deprecated is fiiine :P
[02:40:39] bblack: wait, one LVS for 'rendering', that is imagescalers, right? not including videoscalers?
[02:41:03] https://github.com/wikimedia/puppet/blob/6eb9f0383ce436b145a22bae7498145b1942ffc3/hieradata/role/common/mediawiki/imagescaler.yaml#L1-L4
[02:41:21] Reedy, not fine as long as it spams MY LOGS
[02:41:31] yeah if I filter out the inbetween lines and skip to what's relevant, in that LVS config paste I had:
[02:41:34] rendering:
[02:41:36] description: "MediaWiki thumbnail rendering cluster, rendering.svc.%{::site}.wmnet"
[02:41:39] class: low-traffic
[02:41:39] hi
[02:41:42] conftool:
[02:41:44] cluster: imagescaler
[02:41:47] service: apache2
[02:41:47] what's going on?
[02:41:54] so yeah "rendering.svc" == "imagescaler" cluster
[02:42:25] aude: https://gerrit.wikimedia.org/r/#/c/344569/ outage - Roan deployed a patch for RCFilters with a broken class autoloader :/ - fixed now.
[02:42:35] outage indirectly via wikidata
[02:42:56] ugh
[02:43:08] still unsure whether the outage was triggered by the scap deploy from him with 10min delay, or whether it was the start of l10nupdate making it active. I don't think it was triggered by the magic word this time, so probably just a coincidence that l10nupdate just started to run
[02:43:23] bug bug o
[02:43:24] maybe
[02:43:25] o/
[02:43:29] magic words are scary :o
[02:44:11] Not autoloader, just a wrong class reference
[02:44:16] aude: https://gerrit.wikimedia.org/r/#/c/344564/1/extensions/Wikibase/client/includes/Hooks/MagicWordHookHandlers.php
[02:44:28] fuck yeah, php namespaces
[02:44:29] The file moved namespaces and we didn't update the call
[02:44:36] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 1821.116551 Seconds
[02:44:37] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 1821.116266 Seconds
[02:44:49] if https://gerrit.wikimedia.org/r/#/c/344564/ went out in swat ?
[02:45:05] It did
[02:45:06] magic words require localisation cache update stuff
[02:45:11] Not sure what code path that's in yet and if there's a connection with l10nupdate, will look when I'm home
[02:45:25] it's best these kinds of changes go out with the train
[02:45:30] anyways, in the public 5xx graphs the error rate from this never went over ~0.02% of text-cluster requests impacted. but that's because it's drowned in working anon hits. the impact was certainly higher for logged-in fetches.
[02:45:35] in my experience, and still are scary *
[02:45:55] * aude doesn't understand l10nupdate entirey
[02:45:59] entirely*
[02:46:00] aude: Well, it would've still been broken anyway
[02:46:04] yeah :/
[02:46:35] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds
[02:46:36] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds
[02:46:39] is there a ticket for this/
[02:46:41] ?
[02:46:52] Nope
[02:46:56] ok
[02:46:59] Roan was going to fix master when he gets to a computer
[02:47:26] ok
[02:47:33] someone should write an incident report on wikitech (not it!)
[02:47:45] I do keep wondering how many problems proper static analysis would find :)
[02:47:59] find/prevent
[02:51:30] 06Operations: bacula-director not running on helium - https://phabricator.wikimedia.org/T161281#3127397 (10Dzahn) A puppet run changes /var/lib/puppet/ssl/private_keys/helium.eqiad.wmnet.pem back to be owned by puppet:puppet. Scratch the part about "Unable to connect to MySQL server", that was from March 17th...
[02:56:49] bblack: yeah, we should have a separate apache http response tracking
[02:56:55] varnish is completely separate
[02:57:48] Also, why did this not break locally during development? Even if it's not tested, I assume it worked locally for the person writing it?
[02:58:39] Since this was a change to Wikibase, I suspected perhaps a difference between how we test Wikibase locally vs. how it works once plugged into Wikidata extension with composer etc.
[02:58:55] We normally don't locally verify the result of a cherry-pick of a Wikidata import.
[02:59:38] Reedy: Also, why did scap not prevent this deployment if it broke on every page view?
[02:59:47] Did it though? :)
[02:59:57] this is the second time scap did not catch a blatantly obvious error on every request
[03:00:05] that a simple local http request would've caught
[03:00:12] I've certainly seen scap say no to deploys on beta
[03:00:19] I'd probably suggest filing a scap/releng task
[03:00:28] Reedy: Yeah, but it's too picky probably about which error levels and channels
[03:00:41] Right, and if that's the case... That's just stupid :)
[03:00:44] I thought it was fixed that it now considers all hhvm/apache/mediawiki logs
[03:00:52] But it still ignores a lot
[03:01:26] I'd rather have a simple prepromote check that verifies enwiki/MainPage HTTP 200 than catch a 20% increase in PHP notices.
[03:01:29] I really don't know
[03:01:39] We should probably have both
[03:01:41] it seems scap warnings are heavily invested into catching the latter, not the former wider net.
[03:01:51] Yeah, T121597
[03:01:51] T121597: Implement MediaWiki pre-promote checks - https://phabricator.wikimedia.org/T121597
[03:01:54] A bit slower to deploy, to catch these sorts of errors seems a win-win
[03:02:23] wouldn't even be slow
[03:02:36] just a dozen or so local http fetches against a canary, or even tin itself.
[03:02:49] i'm wondering why this needed to go in swat?
[03:02:57] be back in an hour
[03:02:58] You'd have to ask Roan :)
[03:02:58] o/
[03:03:10] we deployed a new version of wikibase this week with the train
[03:03:19] I can only presume it's blocking him on something
[03:03:21] so should be in sync with whatever core has
[03:03:23] :/
[03:06:57] ok, i'm off for tonight
[03:07:14] please make a phabricator ticket if there is anything that we can look at or help with
[03:13:18] here, beginning of incident report. maybe you can add stuff https://wikitech.wikimedia.org/wiki/Incident_documentation/20170323-wikibase
[03:13:32] once it's fixed in master
[03:34:41] Krinkle: This was not related to the cherry-pick and composer, and it worked in local testing and on mwdebug1002, it just seems to be in a code path not directly related to the change (but that got moved as part of the change)
[03:34:53] Apparently canaries didn't catch it
[03:35:41] aude: Our change missed the cut and it seemed simple, so we SWATted it.
Didn't realize it contained a PHP fatal of course :/ [03:36:49] I also should have checked fatalmonitor after scapping, that would likely have caught it [03:37:15] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [03:37:15] Ooh, I see, the error only happened when (re)parsing a page with the {{noexternallanglinks}} magic word in it [03:38:29] Krinkle: So to answer your question it's very possible that the error didn't even trigger during the canary window [03:40:15] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [03:41:44] aude: Also Reedy is right, we have a beta feature release on Tuesday that'll now be somewhat broken (or at least will behave suboptimally) until this patch gets there on the train on Thursday [03:43:09] Krinkle: So because of the way it was triggered, it wasn't a "blatantly obvious error on every request" in dev/beta/mwdebug, because it only errored on pages that have that magic word, which probably includes the main pages of most Wikipedias [03:43:25] Given that I thought I was changing Special:Recentchanges, that's the page I tested in mwdebug [03:43:32] That said I should have monitored fatalmonitor post-deploy [03:48:05] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [03:49:05] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3725008 keys, up 11 hours 33 minutes - replication_delay is 0 [03:49:18] Oh, also! I just read backscroll again and noticed there were no user reports of breakage until 20 mins after the scap finished, and no automated errors until 25 mins after. So I think what happened is my deploy broke parsing the Main Page, but it was still in parser cache, and ~20 mins after my deploy is when the pcache TTL expired (main pages usually use [03:49:18] things like {{CURRENTDAY}} so they have TTL of 1h), and that's when 1) parsing Main Page started fataling but also 2) pcache for Main Page remained empty, so every Main Page request would try to parse it, which is why it took several seconds to fail and caused a lot of DB load [03:49:53] So I don't think canary/scap could have caught this [03:50:52] It's possible that I could have caught it by looking at fatalmonitor if there were non-Main Page failures, but I don't think there are that many pages that use {{noexternallanglinks}} [03:55:45] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [03:56:35] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3724171 keys, up 12 hours 57 minutes - replication_delay is 0 [05:00:05] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:29:05] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [05:41:05] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:55:00] RoanKattouw: Interesting, yeah, so only with that magic word [05:55:26] RoanKattouw: But if it's a magic word, that's a fairly visible feature of an extension code-wise, is there not a single phpunit test that covers that in some way? [06:03:09] No :( [06:03:27] Our change factored out its processing though so it's much easier to add one now [06:03:48] Also it's not one that produces output, it's a behavioral one, so I guess it's harder to test [06:04:14] I think we did add tests for the other magic word ({{WBREPONAME}}) [06:07:45] Right, so less of a reason to actually test it [06:08:02] RoanKattouw: and I guess it's also unfortunate that the class load only triggers when the word is used, not when it is registered [06:08:12] in that case a simple parser test would've fixed it [06:08:19] too lazily loaded/resolved [06:08:24] but for good reason I suppose [06:08:36] anyway, good reason to increase code coverage for the sake of code coverage [06:09:05] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:09:10] equals('
', {{noexternallanglinks]]) [06:09:45] PROBLEM - puppet last run on elastic1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:10:04] !log Removing xhgui.results entries before 1-Dec-2016 finished. Running xhgui->command(compact=>results) now. T161196 [06:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:13] T161196: tungsten is out of space on /srv - https://phabricator.wikimedia.org/T161196 [06:31:15] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:38:45] RECOVERY - puppet last run on elastic1019 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:39:05] <_joe_> Krinkle: can you document that process on wikitech so that next time we know what to do to safely free some space? [06:42:02] (03PS2) 10Giuseppe Lavagetto: role::memcached: convert to use profile::multidc::redis [puppet] - 10https://gerrit.wikimedia.org/r/344480 [06:55:05] PROBLEM - Check systemd state on tungsten is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:58:55] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:59:15] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [07:02:57] (03PS1) 10Marostegui: db-eqiad.php: Restore original weight for db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344577 (https://phabricator.wikimedia.org/T137191) [07:04:19] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore original weight for db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344577 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [07:06:00] (03Merged) 10jenkins-bot: db-eqiad.php: Restore original weight for db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344577 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [07:07:02] (03CR) 10jenkins-bot: db-eqiad.php: Restore original weight for db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344577 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [07:07:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore original weight for db1070, db1071 and db1082 - T137191 (duration: 00m 43s) [07:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:35] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [07:12:53] _joe_: Yeah, although something went wrong I think. My secondary connection to mongodb stopped responding as well. [07:13:25] It steadily kept on compacting, reducing disk space and about 10-20% CPU, and then it just stopped [07:13:34] I don't have access to the server so can't restart it or check what happened [07:15:28] actually, 5-7% not even 10-20, misread the graph [07:15:37] https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=tungsten&var-network=eth0&from=1490333780114&to=now [07:15:45] started around 6:00 UTC [07:15:54] and then sometime in the last 20min it died I guess.. 
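To put rough numbers on RoanKattouw's parser-cache explanation above: a sketch of the timeline in relative minutes, with the Main Page's last pre-deploy parse time treated as an assumption since it is not in the log.

```python
from datetime import timedelta

PCACHE_TTL = timedelta(hours=1)            # main pages use {{CURRENTDAY}}-style words
deploy             = timedelta(0)           # t = 0: scap finishes
first_user_reports = timedelta(minutes=20)  # per the log: ~20 min after scap
first_auto_errors  = timedelta(minutes=25)  # per the log: ~25 min after scap

# Assumption: the Main Page was last parsed ~40 minutes before the deploy.
# Its cached HTML then keeps being served until t = +20 min; once the entry
# expires, every request re-attempts (and fails) the parse, which is both why
# breakage surfaced late and why the failed re-parses piled load onto the DBs.
assumed_last_parse = timedelta(minutes=-40)
fatal_visible_from = max(deploy, assumed_last_parse + PCACHE_TTL)
print(f"fatal becomes user-visible around t = +{fatal_visible_from}")
```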
[07:18:25] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 18247.746923 Seconds [07:18:25] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 18247.771794 Seconds [07:19:05] RECOVERY - Check systemd state on tungsten is OK: OK - running: The system is fully operational [07:19:05] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 18289.490994 Seconds [07:21:07] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [07:21:25] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [07:21:25] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [07:24:03] Okay, it's back up [07:26:29] !log upgrading twisted to 16.2.0 on lvs4003 and lvs4004 (ulsfo secondaries) T160433 [07:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:35] T160433: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433 [07:27:55] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [07:30:42] <_joe_> Krinkle: you don't have access? [07:30:48] <_joe_> I thought it was perf-roots [07:35:00] !log cirrus: refresh comp suggest indices in elastic@codfw [07:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:41] (03PS1) 10Marostegui: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344579 (https://phabricator.wikimedia.org/T73563) [07:36:09] (03CR) 10DCausse: [C: 031] "I think we are ready to reenable this one" [puppet] - 10https://gerrit.wikimedia.org/r/344171 (owner: 10DCausse) [07:37:52] _joe_: Yeah, not yet unfortunately [07:38:29] _joe_: volans confirmed with tcpdump a few hours ago that sporadically app servers are still sending data to tungsten despite me disabling that code path 7 hours ago [07:38:45] it used to send 1:10,000 sampled with mt_rand, now it's a fixed if for X-Wikimedia-Debug:profile [07:38:47] https://phabricator.wikimedia.org/T161286#3127521 ( [07:38:57] I gotta go now, will be back in 10 hours or so [07:39:06] Maybe you can take a peek at what might be causing this? 
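For context on the tungsten discussion above: the cleanup logged earlier (delete xhgui.results documents older than 1 Dec 2016, then compact the collection so MongoDB actually returns the space) amounts to something like the following pymongo sketch. The connection string, database name and timestamp field are assumptions, not taken from the actual setup on tungsten.

```python
#!/usr/bin/env python3
"""Hedged sketch of the xhgui cleanup: remove old profiling documents, then
compact the collection to reclaim disk. Connection details and the timestamp
field name (meta.request_ts) are assumed."""
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local mongod
db = client["xhgui"]

cutoff = datetime(2016, 12, 1)
result = db.results.delete_many({"meta.request_ts": {"$lt": cutoff}})
print(f"removed {result.deleted_count} profiles older than {cutoff:%Y-%m-%d}")

# compact blocks the collection while it runs, which fits the long, slow
# grind (and eventual stall) described above -- run it off-peak and watch it.
print(db.command("compact", "results"))
```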
[07:42:17] !log installing git updates on trusty [07:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:23] (03PS2) 10Marostegui: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344579 (https://phabricator.wikimedia.org/T73563) [07:45:16] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344579 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [07:46:35] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344579 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [07:46:47] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344579 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [07:48:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1056, repool db1059 T160415 - T73563 (duration: 00m 43s) [07:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:29] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [07:48:29] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563 [07:49:05] !log Deploy schema change s4 on db1069 and db1056 - T160415 - T73563 [07:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:33] (03CR) 10Giuseppe Lavagetto: [C: 032] role::memcached: convert to use profile::multidc::redis [puppet] - 10https://gerrit.wikimedia.org/r/344480 (owner: 10Giuseppe Lavagetto) [08:01:10] !log upgrading twisted to 16.2.0 on lvs4001 and lvs4002 (ulsfo primaries) T160433 [08:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:17] T160433: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433 [08:05:05] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.467 second response time [08:08:35] PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/redis/replica/] [08:10:05] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.308 second response time [08:10:55] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [08:15:55] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [08:17:39] (03CR) 10DCausse: [C: 031] Update mwgrep for elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344044 (https://phabricator.wikimedia.org/T161055) (owner: 10EBernhardson) [08:23:37] !log Deploy schema change s4 db2019 (codfw master) - T160415 [08:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:44] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [08:23:51] (03PS1) 10Giuseppe Lavagetto: compiler-update-facts: remove 'trusted:' facts lines [puppet] - 10https://gerrit.wikimedia.org/r/344582 [08:23:53] (03PS1) 10Giuseppe Lavagetto: role::puppet_compiler: fix broken inclusion of a removed class [puppet] - 10https://gerrit.wikimedia.org/r/344583 [08:25:03] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] "Long overdue..." [puppet] - 10https://gerrit.wikimedia.org/r/344582 (owner: 10Giuseppe Lavagetto) [08:25:55] (03CR) 10Giuseppe Lavagetto: [C: 032] role::puppet_compiler: fix broken inclusion of a removed class [puppet] - 10https://gerrit.wikimedia.org/r/344583 (owner: 10Giuseppe Lavagetto) [08:40:02] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:50:10] (03PS3) 10Gehel: Re-enable updateSuggesterIndex cron [puppet] - 10https://gerrit.wikimedia.org/r/344171 (owner: 10DCausse) [08:52:01] (03CR) 10Gehel: [C: 032] Re-enable updateSuggesterIndex cron [puppet] - 10https://gerrit.wikimedia.org/r/344171 (owner: 10DCausse) [08:52:53] _joe_: I see 2 commits from you waiting to be merged on puppetmaster... should I merge them with mine? [08:53:03] <_joe_> gehel: sorry, yes [08:53:15] _joe_: no problem, merging... [08:53:15] <_joe_> those are both for labs and I just forgot to puppet-merge, sorry [08:54:00] _joe_: btw, any chance you'd have time to have a look at https://gerrit.wikimedia.org/r/#/c/342248/ ? [08:54:23] <_joe_> gehel: I was about to say [08:54:40] <_joe_> if I didn't work 7AM-9.30PM yesterday, I would've [08:55:00] <_joe_> I hope I can finish the Big Redis Refactor today [08:55:04] _joe_: again, nothing urgent there, but I'd like to get it out of the way at some point... [08:55:13] <_joe_> yeah I totally undestand [08:55:21] (03CR) 10Alexandros Kosiaris: "Actually this was done in Ic959aea0ae3693ca66651cfd3e428aaa3348fc74" [puppet] - 10https://gerrit.wikimedia.org/r/344582 (owner: 10Giuseppe Lavagetto) [08:56:57] _joe_: I'm sure you do! Thanks! 
[08:57:11] <_joe_> akosiaris: so we have two copies of the same thing [08:57:20] <_joe_> and instructions on wikitech are wrong [08:57:21] <_joe_> LOL [08:58:43] (03CR) 10Muehlenhoff: Visualdiff: Install mediawiki::packages::fonts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344551 (owner: 10Subramanya Sastry) [09:00:28] RECOVERY - bacula director process on helium is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-dir [09:01:53] (03PS2) 10Muehlenhoff: Remove obsolete Hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/344386 [09:03:45] (03PS2) 10Gehel: postgresql - drop support for postgis 1.5 [puppet] - 10https://gerrit.wikimedia.org/r/344176 [09:04:25] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3127690 (10Marostegui) [09:05:56] (03CR) 10Gehel: postgresql - drop support for postgis 1.5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344176 (owner: 10Gehel) [09:06:41] 06Operations, 10Traffic, 13Patch-For-Review, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#3127709 (10fgiunchedi) I took another quick look at this, part of the churn is the periodic varnis... [09:07:04] (03CR) 10Muehlenhoff: [C: 032] Remove obsolete Hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/344386 (owner: 10Muehlenhoff) [09:07:08] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [09:07:18] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [09:08:51] 06Operations: bacula-director not running on helium - https://phabricator.wikimedia.org/T161281#3127711 (10akosiaris) Ah it's puppet agent managing the permissions of it's directory more aggresively than previous versions (3.8 vs 3.4). The issue was triggered when the 2 changes above were merged. On a regular st... 
[09:08:58] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [09:09:38] (03CR) 10Gehel: "Puppet compiler indicates that the change is reasonable: https://puppet-compiler.wmflabs.org/5895/" [puppet] - 10https://gerrit.wikimedia.org/r/344176 (owner: 10Gehel) [09:10:18] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [09:11:55] (03PS1) 10Filippo Giunchedi: conftool: add prometheus100[34] [puppet] - 10https://gerrit.wikimedia.org/r/344585 (https://phabricator.wikimedia.org/T148408) [09:14:37] (03PS1) 10Marostegui: sanitarium.my.cnf: Set slave_type_conversions [puppet] - 10https://gerrit.wikimedia.org/r/344586 (https://phabricator.wikimedia.org/T73563) [09:15:08] (03CR) 10Alexandros Kosiaris: postgresql - drop support for postgis 1.5 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/344176 (owner: 10Gehel) [09:15:21] (03CR) 10Gehel: Update mwgrep for elasticsearch 5.x (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344044 (https://phabricator.wikimedia.org/T161055) (owner: 10EBernhardson) [09:16:12] (03CR) 10Gehel: postgresql - drop support for postgis 1.5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344176 (owner: 10Gehel) [09:16:52] (03PS3) 10Gehel: postgresql - drop support for postgis 1.5 [puppet] - 10https://gerrit.wikimedia.org/r/344176 [09:17:20] (03CR) 10Gehel: "So much nicer now!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344176 (owner: 10Gehel) [09:17:21] (03CR) 10Alexandros Kosiaris: [C: 031] postgresql - drop support for postgis 1.5 [puppet] - 10https://gerrit.wikimedia.org/r/344176 (owner: 10Gehel) [09:19:01] (03CR) 10Marostegui: [C: 032] "Looks good: https://puppet-compiler.wmflabs.org/5897/" [puppet] - 10https://gerrit.wikimedia.org/r/344586 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [09:19:15] 06Operations, 06Performance-Team, 10Thumbor: Thumbor inexplicably 504s intermittently on files that render fine later - https://phabricator.wikimedia.org/T150746#3127728 (10Gilles) [09:19:27] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3127729 (10Gehel) We managed to crash that server again, with the same test (stress + bonnie). @Papaul is running a full H/W diagnostic. Server... [09:21:31] 06Operations, 10ops-eqiad, 13Patch-For-Review: db1057 does not react to powercycle/powerdown/powerup commands - https://phabricator.wikimedia.org/T160435#3127733 (10Marostegui) @Cmjohnson I might be asking something silly, but I thought I would ask just in case. Is it doable to take some spare pieces from so... 
[09:21:57] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: db1057 does not react to powercycle/powerdown/powerup commands - https://phabricator.wikimedia.org/T160435#3127734 (10Marostegui) [09:22:04] (03PS3) 10Gehel: Update mwgrep for elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344044 (https://phabricator.wikimedia.org/T161055) (owner: 10EBernhardson) [09:22:22] (03CR) 10Filippo Giunchedi: [C: 032] conftool: add prometheus100[34] [puppet] - 10https://gerrit.wikimedia.org/r/344585 (https://phabricator.wikimedia.org/T148408) (owner: 10Filippo Giunchedi) [09:22:26] (03PS2) 10Filippo Giunchedi: conftool: add prometheus100[34] [puppet] - 10https://gerrit.wikimedia.org/r/344585 (https://phabricator.wikimedia.org/T148408) [09:22:48] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:23:26] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] conftool: add prometheus100[34] [puppet] - 10https://gerrit.wikimedia.org/r/344585 (https://phabricator.wikimedia.org/T148408) (owner: 10Filippo Giunchedi) [09:27:19] (03CR) 10Gehel: [C: 031] "Puppet compiler looks happy." [puppet] - 10https://gerrit.wikimedia.org/r/344044 (https://phabricator.wikimedia.org/T161055) (owner: 10EBernhardson) [09:27:34] (03CR) 10Gehel: Update mwgrep for elasticsearch 5.x (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344044 (https://phabricator.wikimedia.org/T161055) (owner: 10EBernhardson) [09:31:05] (03CR) 10Gehel: Allow search clusters to reindex from eachother (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson) [09:37:10] 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#3127748 (10fgiunchedi) [09:38:52] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3127764 (10Gilles) > Accidental UI popups are major point of frustration for some. We're not proposing no delay, because if yo... [09:39:05] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [09:39:30] !log pool prometheus100[34] - T148408 [09:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:37] T148408: Put prometheus baremetal servers in service - https://phabricator.wikimedia.org/T148408 [09:40:15] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3702806 keys, up 17 hours 24 minutes - replication_delay is 0 [09:40:56] 06Operations, 10LDAP-Access-Requests, 06WMDE-Analytics-Engineering, 10Wikidata, 15User-Addshore: Add goransm to ldap/wmde group - https://phabricator.wikimedia.org/T160924#3127766 (10Addshore) >>! In T160924#3125565, @MoritzMuehlenhoff wrote: > (Generally when onboarding someone new, feel free to simply... [09:42:13] (03CR) 10Gilles: "Can you tell if the http traffic is PUTs and/or GETs?" 
[puppet] - 10https://gerrit.wikimedia.org/r/343263 (https://phabricator.wikimedia.org/T160670) (owner: 10Gilles) [09:50:45] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [10:07:15] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [10:10:15] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [10:10:20] (03CR) 10Alexandros Kosiaris: [C: 031] Performance Grafana alerts [puppet] - 10https://gerrit.wikimedia.org/r/342431 (https://phabricator.wikimedia.org/T156245) (owner: 10Gilles) [10:16:05] RECOVERY - swift codfw-prod object availability on graphite1001 is OK: OK: Less than 1.00% under the threshold [95.0] [10:23:24] (03PS1) 10Muehlenhoff: Bump ABI to 4 [debs/linux44] - 10https://gerrit.wikimedia.org/r/344591 [10:25:03] 06Operations, 05Prometheus-metrics-monitoring: Upgrade mysqld_exporter to 0.9.0 - https://phabricator.wikimedia.org/T147476#3127864 (10fgiunchedi) 05Open>03declined Superseded by {T161296} [10:25:05] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: MySQL monitoring with prometheus - https://phabricator.wikimedia.org/T143896#3127868 (10fgiunchedi) [10:40:50] (03PS2) 10Alexandros Kosiaris: Assign the kubernetes pod IPs in DNS [dns] - 10https://gerrit.wikimedia.org/r/341794 [10:44:58] (03CR) 10Alexandros Kosiaris: [C: 031] "IPv6 added, seems fine to me" [dns] - 10https://gerrit.wikimedia.org/r/341794 (owner: 10Alexandros Kosiaris) [10:53:09] 06Operations, 10DBA, 05DC-Switchover-Prep-Q3-2016-17, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3127910 (10jcrespo) So s/both/master/ ? [11:02:06] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Put prometheus baremetal servers in service - https://phabricator.wikimedia.org/T148408#3127933 (10fgiunchedi) codfw/eqiad are now serving prometheus queries with baremetal hw only, next week I'll decom the ganeti VMs in... [11:12:25] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:15:30] (03PS1) 10Filippo Giunchedi: aptrepo: add post-2015 HP MCP public key [puppet] - 10https://gerrit.wikimedia.org/r/344595 [11:18:30] !log upgrade grafana to 4.2.0 on labmon1001 - T161193 [11:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:38] T161193: Upgrade to Grafana 4.2.0 - https://phabricator.wikimedia.org/T161193 [11:24:20] 06Operations, 10DBA, 05DC-Switchover-Prep-Q3-2016-17, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3127963 (10jcrespo) Actually, I am not sure that is needed- things flopped because ALL s... 
[11:40:25] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [11:42:49] (03CR) 10Muehlenhoff: [C: 032] Bump ABI to 4 [debs/linux44] - 10https://gerrit.wikimedia.org/r/344591 (owner: 10Muehlenhoff) [11:54:26] (03PS1) 10Alexandros Kosiaris: base::expose_puppet_certs: Also provide a keypair [puppet] - 10https://gerrit.wikimedia.org/r/344603 (https://phabricator.wikimedia.org/T161281) [12:13:46] !log Start first run of pt-table-checksum on s5 (dewiki) - T161294 [12:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:53] T161294: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294 [12:22:07] (03PS1) 10Marostegui: db-eqiad.php: Repool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344605 (https://phabricator.wikimedia.org/T73563) [12:23:00] (03PS1) 10Alexandros Kosiaris: Update bacula::client to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344606 (https://phabricator.wikimedia.org/T161281) [12:25:06] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344605 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [12:29:09] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344605 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [12:29:22] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344605 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [12:30:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1056 T160415 - T73563 (duration: 00m 44s) [12:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:26] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [12:30:26] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563 [12:49:55] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:50:55] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 500 (expecting: 200) [12:50:56] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 500 (expecting: 200) [12:51:05] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:51:25] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused [12:51:45] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 500 (expecting: 200) [12:51:45] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 500 (expecting: 200) [12:51:45] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 500 (expecting: 200) [12:52:15] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:52:35] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:52:45] PROBLEM - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:52:45] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:52:55] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [12:53:05] PROBLEM - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.55 and port 9042: Connection refused [12:53:05] PROBLEM - cassandra-b service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [13:03:05] RECOVERY - cassandra-b service on restbase2009 is OK: OK - cassandra-b is active [13:03:45] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [13:04:35] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [13:04:45] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [13:04:45] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [13:04:45] RECOVERY - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-b valid until 2017-09-12 15:36:09 +0000 (expires in 172 days) [13:04:55] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [13:04:55] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [13:04:55] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [13:04:55] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [13:05:05] RECOVERY - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.55 port 9042 [13:05:35] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [13:05:45] PROBLEM - Host db1057 is DOWN: PING CRITICAL - Packet loss = 100% [13:05:55] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [13:07:06] db1057 came back from 
downtime [13:09:35] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:09:55] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [13:35:32] (03CR) 10Gehel: [C: 04-1] base::expose_puppet_certs: Also provide a keypair (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/344603 (https://phabricator.wikimedia.org/T161281) (owner: 10Alexandros Kosiaris) [13:36:35] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [13:37:15] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [13:39:35] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:40:15] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [13:43:34] 06Operations, 10hardware-requests: Rename labtestmetal2001 - https://phabricator.wikimedia.org/T161265#3126920 (10MoritzMuehlenhoff) The old hostname is still shown in servermon: https://servermon.wikimedia.org/hosts/ [13:44:25] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 641850 [13:47:42] !log installing freetype security updates on trusty [13:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:10] (03CR) 10Alexandros Kosiaris: base::expose_puppet_certs: Also provide a keypair (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/344603 (https://phabricator.wikimedia.org/T161281) (owner: 10Alexandros Kosiaris) [13:58:29] (03PS2) 10Alexandros Kosiaris: base::expose_puppet_certs: Also provide a keypair [puppet] - 10https://gerrit.wikimedia.org/r/344603 (https://phabricator.wikimedia.org/T161281) [13:58:31] (03PS2) 10Alexandros Kosiaris: Update bacula::client to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344606 (https://phabricator.wikimedia.org/T161281) [14:01:57] (03PS1) 10Muehlenhoff: Adapt debdeploy grain to rename of nova::api role [puppet] - 10https://gerrit.wikimedia.org/r/344611 [14:03:03] (03PS3) 10Alexandros Kosiaris: base::expose_puppet_certs: Also provide a keypair [puppet] - 10https://gerrit.wikimedia.org/r/344603 (https://phabricator.wikimedia.org/T161281) [14:03:05] (03PS3) 10Alexandros Kosiaris: Update bacula::client to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344606 (https://phabricator.wikimedia.org/T161281) [14:03:28] (03PS1) 10Andrew Bogott: Labtestvirt2002: Try a different partman config [puppet] - 10https://gerrit.wikimedia.org/r/344612 [14:03:44] (03PS1) 10Hashar: contint: install PhantomJS from backport [puppet] - 10https://gerrit.wikimedia.org/r/344613 (https://phabricator.wikimedia.org/T137112) [14:04:17] (03CR) 10jerkins-bot: [V: 04-1] base::expose_puppet_certs: Also provide a keypair [puppet] - 10https://gerrit.wikimedia.org/r/344603 (https://phabricator.wikimedia.org/T161281) (owner: 10Alexandros Kosiaris) [14:05:00] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/344603 (https://phabricator.wikimedia.org/T161281) (owner: 10Alexandros Kosiaris) [14:05:24] (03CR) 10Andrew Bogott: [C: 032] Labtestvirt2002: Try a different partman config [puppet] - 10https://gerrit.wikimedia.org/r/344612 (owner: 10Andrew Bogott) [14:06:35] RECOVERY - 
Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [14:06:55] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [14:07:15] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [14:08:08] (03PS1) 10Muehlenhoff: Adapt debdeploy grain to rename of nova::controller role [puppet] - 10https://gerrit.wikimedia.org/r/344614 [14:08:25] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 590522 [14:08:43] ^ andrewbogott nova-fullstack finally dropped 7 instances I think [14:09:03] ok, I'll have a look at the dead instances [14:09:35] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:09:55] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [14:10:07] (03PS1) 10Muehlenhoff: Adapt debdeploy grain to rename of nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/344615 [14:10:15] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [14:10:34] (03PS2) 10Hashar: contint: install PhantomJS from backport [puppet] - 10https://gerrit.wikimedia.org/r/344613 (https://phabricator.wikimedia.org/T137112) [14:13:34] (03CR) 10Muehlenhoff: [C: 031] contint: install PhantomJS from backport [puppet] - 10https://gerrit.wikimedia.org/r/344613 (https://phabricator.wikimedia.org/T137112) (owner: 10Hashar) [14:15:05] PROBLEM - check_puppetrun on alnilam is CRITICAL: CRITICAL: Puppet has 17 failures [14:15:17] (03CR) 10Hashar: [V: 031] "I have forgotten to actually add the package. 
Cherry picked it on the CI puppet master:" [puppet] - 10https://gerrit.wikimedia.org/r/344613 (https://phabricator.wikimedia.org/T137112) (owner: 10Hashar) [14:16:01] (03PS4) 10Gehel: postgresql - drop support for postgis 1.5 [puppet] - 10https://gerrit.wikimedia.org/r/344176 [14:20:05] PROBLEM - check_puppetrun on alnilam is CRITICAL: CRITICAL: Puppet has 17 failures [14:20:40] looking ^^^ [14:21:26] (03CR) 10Gehel: [C: 032] postgresql - drop support for postgis 1.5 [puppet] - 10https://gerrit.wikimedia.org/r/344176 (owner: 10Gehel) [14:25:15] RECOVERY - check_puppetrun on alnilam is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [14:26:26] (03PS1) 10Andrew Bogott: Nova compute: Update kernel advice for new installs [puppet] - 10https://gerrit.wikimedia.org/r/344619 [14:28:04] (03CR) 10Muehlenhoff: [C: 031] Nova compute: Update kernel advice for new installs [puppet] - 10https://gerrit.wikimedia.org/r/344619 (owner: 10Andrew Bogott) [14:29:24] (03CR) 10Andrew Bogott: [C: 032] Nova compute: Update kernel advice for new installs [puppet] - 10https://gerrit.wikimedia.org/r/344619 (owner: 10Andrew Bogott) [14:31:31] (03PS1) 10Giuseppe Lavagetto: role::memcached: add hack (temporarily) for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/344620 [14:31:33] (03PS1) 10Giuseppe Lavagetto: profile::redis: add local master and slave profiles [puppet] - 10https://gerrit.wikimedia.org/r/344621 [14:31:35] (03PS1) 10Giuseppe Lavagetto: role::jobqueue_redis: convert slaves to role::jobqueue_redis::slave [puppet] - 10https://gerrit.wikimedia.org/r/344622 [14:31:37] (03PS1) 10Giuseppe Lavagetto: redis: cleanup unused modules/files [puppet] - 10https://gerrit.wikimedia.org/r/344623 [14:31:39] (03PS1) 10Giuseppe Lavagetto: redis::monitoring: convert ores to nrpe, cleanup [puppet] - 10https://gerrit.wikimedia.org/r/344624 [14:33:04] (03PS2) 10Giuseppe Lavagetto: role::memcached: add hack (temporarily) for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/344620 [14:33:44] godog: , yt? [14:34:07] (03CR) 10Giuseppe Lavagetto: [C: 032] role::memcached: add hack (temporarily) for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/344620 (owner: 10Giuseppe Lavagetto) [14:35:45] ottomata: yup [14:36:17] was curious about trying out https://github.com/Quantiply/grafana-plugins/tree/master/features/druid [14:36:40] godog: hola [14:36:51] do we already use any 3rd party grafana plugins [14:36:52] ? [14:37:04] urandom: hola! [14:37:15] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [14:37:26] ottomata: yeah we use simple-json-datasource IIRC [14:38:02] urandom: I was updating T160759 as restbase2001-a is in trouble :( [14:38:02] T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759 [14:38:09] oh cool godog that looks pretty simple [14:38:16] godog: saw that; yeah [14:38:45] PROBLEM - puppet last run on restbase2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[cassandra-a] [14:39:07] (03PS8) 10Gehel: maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) [14:40:10] ottomata: yeah I'm not sure what's the best path to actually distribute it, either scap o deb package I suppose [14:40:15] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [14:40:40] urandom: not sure if that error is recoverable or we'd be better off starting from scratch [14:40:50] godog: the json one is just a git clone, which might be easier in this case [14:40:54] i can fork to gerrit [14:41:35] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [14:41:37] godog: it's a commitlog segment, so removing it is enough (should be enough) [14:41:55] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [14:42:20] godog: it would represent possibly uncommitted data at the time of the OOM, though it is 0 bytes in size, so maybe there is none [14:42:31] urandom: nice, I see it is coming up again [14:42:37] godog: either way, it would be replicated [14:42:51] ottomata: yeah looks simple enough [14:43:15] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2017-09-12 15:13:25 +0000 (expires in 172 days) [14:43:17] it's a head scratcher for sure, but nothing very serious [14:43:35] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042 [14:43:40] <_joe_> urandom: so just removing the file was enough? [14:43:52] _joe_: yes [14:44:06] it was totally empty anyway [14:44:28] which is wrong, that is the "corruption" it was complaining about, but certainly nothing to be recovered from it [14:46:28] _joe_: and fyi, this is what i was referring to in the meeting yesterday [14:47:19] when these do happen, they're happening in codfw because that is where the updates happen, not where client requests are answered from [14:47:52] so it would definitely be worth moving updates to eqiad during the switch [14:48:18] 06Operations: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433#3128265 (10MoritzMuehlenhoff) This has been fixed in mapreduce 2.9.0 last month, but it will take some time until we can upgrade to a CDH release which is based on Hadoop 2.9 [14:53:59] (03PS1) 10Muehlenhoff: Direct output of accountcheck to standard root mails [puppet] - 10https://gerrit.wikimedia.org/r/344625 [14:54:38] (03PS2) 10Ema: varnish: remove varnish::monitoring::ganglia [puppet] - 10https://gerrit.wikimedia.org/r/337002 [14:55:07] (03PS1) 10Alexandros Kosiaris: Update bacula::storage to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344626 (https://phabricator.wikimedia.org/T161281) [14:55:09] (03PS1) 10Alexandros Kosiaris: Update bacula::director to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344627 (https://phabricator.wikimedia.org/T161281) [14:55:18] (03CR) 10jerkins-bot: [V: 04-1] Direct output of accountcheck to standard root mails [puppet] - 10https://gerrit.wikimedia.org/r/344625 (owner: 10Muehlenhoff) [14:57:24] (03PS2) 10Muehlenhoff: Direct output of accountcheck to standard root mails [puppet] - 10https://gerrit.wikimedia.org/r/344625 [15:00:58] (03PS5) 10Filippo Giunchedi: 
Performance Grafana alerts [puppet] - 10https://gerrit.wikimedia.org/r/342431 (https://phabricator.wikimedia.org/T156245) (owner: 10Gilles) [15:03:33] (03PS2) 10Alexandros Kosiaris: Update bacula::director to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344627 (https://phabricator.wikimedia.org/T161281) [15:04:28] (03CR) 10Muehlenhoff: [C: 032] Direct output of accountcheck to standard root mails [puppet] - 10https://gerrit.wikimedia.org/r/344625 (owner: 10Muehlenhoff) [15:04:30] (03CR) 10Filippo Giunchedi: [C: 032] Performance Grafana alerts [puppet] - 10https://gerrit.wikimedia.org/r/342431 (https://phabricator.wikimedia.org/T156245) (owner: 10Gilles) [15:06:39] (03PS1) 10Andrew Bogott: Labtestvirt2002: Try yet another partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/344628 [15:06:44] RECOVERY - puppet last run on restbase2001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:07:06] (03PS1) 10Giuseppe Lavagetto: profile::redis::multidc: re-introduce the ipsec guard on trusty [puppet] - 10https://gerrit.wikimedia.org/r/344629 [15:08:43] (03PS2) 10Giuseppe Lavagetto: profile::redis::multidc: re-introduce the ipsec guard on trusty [puppet] - 10https://gerrit.wikimedia.org/r/344629 [15:08:51] (03CR) 10Andrew Bogott: [C: 032] Labtestvirt2002: Try yet another partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/344628 (owner: 10Andrew Bogott) [15:08:56] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::redis::multidc: re-introduce the ipsec guard on trusty [puppet] - 10https://gerrit.wikimedia.org/r/344629 (owner: 10Giuseppe Lavagetto) [15:09:55] (03PS3) 10Giuseppe Lavagetto: profile::redis::multidc: re-introduce the ipsec guard on trusty [puppet] - 10https://gerrit.wikimedia.org/r/344629 [15:10:00] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::redis::multidc: re-introduce the ipsec guard on trusty [puppet] - 10https://gerrit.wikimedia.org/r/344629 (owner: 10Giuseppe Lavagetto) [15:10:15] <_joe_> jesus, gerrit [15:11:04] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_grafana_alert] [15:11:25] that's me ^ checking [15:13:56] (03PS3) 10Subramanya Sastry: Visualdiff: Install mediawiki::packages::fonts [puppet] - 10https://gerrit.wikimedia.org/r/344551 [15:14:14] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [15:15:10] (03CR) 10jerkins-bot: [V: 04-1] Visualdiff: Install mediawiki::packages::fonts [puppet] - 10https://gerrit.wikimedia.org/r/344551 (owner: 10Subramanya Sastry) [15:15:14] (03PS3) 10Muehlenhoff: Direct output of accountcheck to standard root mails [puppet] - 10https://gerrit.wikimedia.org/r/344625 [15:16:48] (03CR) 10Subramanya Sastry: Visualdiff: Install mediawiki::packages::fonts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344551 (owner: 10Subramanya Sastry) [15:17:20] (03PS4) 10Muehlenhoff: Direct output of accountcheck to standard root mails [puppet] - 10https://gerrit.wikimedia.org/r/344625 [15:17:40] (03PS2) 10Filippo Giunchedi: aptrepo: add post-2015 HP MCP public key [puppet] - 10https://gerrit.wikimedia.org/r/344595 [15:17:42] (03PS1) 10Filippo Giunchedi: nagios_common: omit check_grafana_alert extension [puppet] - 10https://gerrit.wikimedia.org/r/344631 [15:18:11] (03CR) 10Muehlenhoff: [V: 032 C: 032] Direct output of accountcheck to standard root mails [puppet] - 10https://gerrit.wikimedia.org/r/344625 (owner: 10Muehlenhoff) [15:18:35] (03PS2) 10Filippo Giunchedi: nagios_common: omit check_grafana_alert extension [puppet] - 10https://gerrit.wikimedia.org/r/344631 [15:20:51] (03PS2) 10Giuseppe Lavagetto: profile::redis: add local master and slave profiles [puppet] - 10https://gerrit.wikimedia.org/r/344621 [15:21:45] (03CR) 10Filippo Giunchedi: [C: 032] nagios_common: omit check_grafana_alert extension [puppet] - 10https://gerrit.wikimedia.org/r/344631 (owner: 10Filippo Giunchedi) [15:23:02] (03CR) 10Muehlenhoff: [C: 031] "Looks fine. PCC also fine: http://puppet-compiler.wmflabs.org/5905/" [puppet] - 10https://gerrit.wikimedia.org/r/344551 (owner: 10Subramanya Sastry) [15:23:54] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:25:54] (03PS2) 10Alexandros Kosiaris: Update bacula::storage to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344626 (https://phabricator.wikimedia.org/T161281) [15:25:56] (03PS3) 10Alexandros Kosiaris: Update bacula::director to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344627 (https://phabricator.wikimedia.org/T161281) [15:27:24] PROBLEM - Host labtestvirt2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:27:25] !log running unscheduled ALTER TABLE on arbcom_cswiki.archive T104756 [15:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:31] T104756: core - archive database table - database schema is inconsistent with update patch - https://phabricator.wikimedia.org/T104756 [15:27:54] RECOVERY - Host labtestvirt2002 is UP: PING OK - Packet loss = 0%, RTA = 36.07 ms [15:28:23] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3128323 (10madhuvishy) @Cmjohnson Bumping this, we'd like to move forward with this soon if possible :) Thanks! 
[15:29:54] PROBLEM - salt-minion processes on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [15:30:04] PROBLEM - dhclient process on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [15:30:04] PROBLEM - DPKG on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [15:30:14] PROBLEM - configured eth on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [15:30:14] PROBLEM - Disk space on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [15:30:14] PROBLEM - kvm ssl cert on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [15:30:44] PROBLEM - nova-compute process on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [15:32:04] RECOVERY - dhclient process on labtestvirt2002 is OK: PROCS OK: 0 processes with command name dhclient [15:32:04] RECOVERY - DPKG on labtestvirt2002 is OK: All packages OK [15:32:14] RECOVERY - configured eth on labtestvirt2002 is OK: OK - interfaces up [15:32:14] RECOVERY - Disk space on labtestvirt2002 is OK: DISK OK [15:32:14] RECOVERY - kvm ssl cert on labtestvirt2002 is OK: Cert /etc/ssl/localcerts/labvirt-star.codfw.wmnet.crt will not expire for at least 90 days [15:34:34] (03CR) 10Hashar: [V: 031] "We would merge that with Moritz on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/344613 (https://phabricator.wikimedia.org/T137112) (owner: 10Hashar) [15:34:53] (03CR) 10Alexandros Kosiaris: [C: 031] aptrepo: add post-2015 HP MCP public key [puppet] - 10https://gerrit.wikimedia.org/r/344595 (owner: 10Filippo Giunchedi) [15:35:29] (03PS1) 10Gehel: archiva - also generate sha1sum for .zip file [puppet] - 10https://gerrit.wikimedia.org/r/344634 [15:36:19] (03CR) 10EBernhardson: [C: 031] archiva - also generate sha1sum for .zip file [puppet] - 10https://gerrit.wikimedia.org/r/344634 (owner: 10Gehel) [15:38:30] 06Operations, 10Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3128339 (10RobH) Thanks! Please note we still have to have legal sign off on this task. (We unfortunately cannot accept simple user confirmati... 
[15:38:40] 06Operations, 10Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3128340 (10RobH) p:05Triage>03Normal [15:39:17] (03PS4) 10Alexandros Kosiaris: base::expose_puppet_certs: Also provide a keypair [puppet] - 10https://gerrit.wikimedia.org/r/344603 (https://phabricator.wikimedia.org/T161281) [15:39:18] (03PS4) 10Alexandros Kosiaris: Update bacula::client to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344606 (https://phabricator.wikimedia.org/T161281) [15:39:20] (03PS3) 10Alexandros Kosiaris: Update bacula::storage to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344626 (https://phabricator.wikimedia.org/T161281) [15:39:22] (03PS4) 10Alexandros Kosiaris: Update bacula::director to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344627 (https://phabricator.wikimedia.org/T161281) [15:39:45] (03CR) 10Ottomata: [C: 031] archiva - also generate sha1sum for .zip file [puppet] - 10https://gerrit.wikimedia.org/r/344634 (owner: 10Gehel) [15:40:05] (03CR) 10Gehel: [C: 032] archiva - also generate sha1sum for .zip file [puppet] - 10https://gerrit.wikimedia.org/r/344634 (owner: 10Gehel) [15:41:22] (03CR) 10Alexandros Kosiaris: [C: 031] "https://puppet-compiler.wmflabs.org/5906/ seems pretty content with this patch set series." [puppet] - 10https://gerrit.wikimedia.org/r/344627 (https://phabricator.wikimedia.org/T161281) (owner: 10Alexandros Kosiaris) [15:41:23] 06Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation, 06Labs, 10hardware-requests: Eqiad: Hardware request for labstore1006/7, dataset1002/3 - https://phabricator.wikimedia.org/T161311#3128341 (10ArielGlenn) [15:41:34] PROBLEM - Host labtestvirt2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:18] 06Operations, 10OCG-General, 10Reading-Community-Engagement, 06Reading-Web-Backlog, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2799650 (10Aklapper) [15:42:39] 06Operations, 10Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3127143 (10RobH) We need a few things for this to be granted: [] - please determine (perhaps with @gehel) exactly what groups you should be added to for access [] - review and sign th... [15:42:43] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3128372 (10ArielGlenn) See {T161311}. 
[15:43:47] 06Operations, 10Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3128374 (10RobH) p:05Triage>03Normal [15:43:54] RECOVERY - Host labtestvirt2002 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [15:44:27] (03PS5) 10Alexandros Kosiaris: base::expose_puppet_certs: Also provide a keypair [puppet] - 10https://gerrit.wikimedia.org/r/344603 (https://phabricator.wikimedia.org/T161281) [15:44:30] (03PS5) 10Alexandros Kosiaris: Update bacula::client to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344606 (https://phabricator.wikimedia.org/T161281) [15:44:32] (03PS4) 10Alexandros Kosiaris: Update bacula::storage to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344626 (https://phabricator.wikimedia.org/T161281) [15:44:33] (03PS5) 10Alexandros Kosiaris: Update bacula::director to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344627 (https://phabricator.wikimedia.org/T161281) [15:44:48] let's merge and see the world burn [15:45:28] 06Operations, 10Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3127143 (10RobH) [15:45:55] (03PS2) 10Eevans: Mandatory Cassandra client encryption [puppet] - 10https://gerrit.wikimedia.org/r/342904 (https://phabricator.wikimedia.org/T111113) [15:46:04] PROBLEM - dhclient process on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [15:46:04] PROBLEM - DPKG on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [15:46:08] (03PS3) 10Filippo Giunchedi: aptrepo: add post-2015 HP MCP public key [puppet] - 10https://gerrit.wikimedia.org/r/344595 [15:46:14] PROBLEM - configured eth on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [15:46:14] PROBLEM - Disk space on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [15:46:14] PROBLEM - kvm ssl cert on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [15:46:40] (03CR) 10Alexandros Kosiaris: [C: 032] base::expose_puppet_certs: Also provide a keypair [puppet] - 10https://gerrit.wikimedia.org/r/344603 (https://phabricator.wikimedia.org/T161281) (owner: 10Alexandros Kosiaris) [15:46:57] (03CR) 10Alexandros Kosiaris: [C: 032] Update bacula::client to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344606 (https://phabricator.wikimedia.org/T161281) (owner: 10Alexandros Kosiaris) [15:47:18] (03CR) 10Alexandros Kosiaris: [C: 032] Update bacula::storage to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344626 (https://phabricator.wikimedia.org/T161281) (owner: 10Alexandros Kosiaris) [15:47:31] (03CR) 10Alexandros Kosiaris: [C: 032] Update bacula::director to use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/344627 (https://phabricator.wikimedia.org/T161281) (owner: 10Alexandros Kosiaris) [15:47:54] (03CR) 10Filippo Giunchedi: [C: 032] aptrepo: add post-2015 HP MCP public key [puppet] - 10https://gerrit.wikimedia.org/r/344595 (owner: 10Filippo Giunchedi) [15:48:07] (03PS1) 10EBernhardson: [WIP] Update logstash plugins for 5.x [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/344637 [15:48:09] (03PS4) 10Filippo Giunchedi: aptrepo: add post-2015 HP MCP public key [puppet] - 10https://gerrit.wikimedia.org/r/344595 [15:48:35] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] aptrepo: add post-2015 HP MCP public key [puppet] - 10https://gerrit.wikimedia.org/r/344595 
(owner: 10Filippo Giunchedi) [15:49:04] damn [15:49:09] and of course it backfired [15:49:14] (03CR) 10Eevans: [C: 031] "Ready to be merged; Scheduled for Puppet SWAT on 2017-03-28" [puppet] - 10https://gerrit.wikimedia.org/r/342904 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [15:49:31] (03PS3) 10Giuseppe Lavagetto: profile::redis: add local master and slave profiles [puppet] - 10https://gerrit.wikimedia.org/r/344621 [15:49:59] andrewbogott: ^silencing labtestvirt2002 spam above [15:50:11] I guess this took me more than two hours :( [15:50:15] Did you downtime it again? [15:50:50] andrewbogott: yeah [15:51:14] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:51:14] PROBLEM - puppet last run on conf2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:51:24] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:51:34] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:51:54] PROBLEM - puppet last run on etcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:24] PROBLEM - puppet last run on conf2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:25] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:34] PROBLEM - puppet last run on conf2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:16] ^ that is "Invalid relationship: File[/etc/bacula/bacula-fd.conf] { require => Exec[concat-bacula-keypair] }, because Exec[concat-bacula-keypair] doesn't seem to be in the catalog" [15:53:19] 06Operations, 10Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3128413 (10Gehel) * group for access: [[ https://github.com/wikimedia/puppet/blob/production/modules/admin/data/data.yaml#L334 | maps-admin ]], [[ https://github.com/wikimedia/puppet/... [15:53:24] PROBLEM - puppet last run on uranium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:32] yes [15:53:34] it's me [15:53:37] fixing now [15:53:39] ok :) [15:53:48] (03PS1) 10Alexandros Kosiaris: bacula::client: remove Exec['concat-bacula-keypair'] dependency [puppet] - 10https://gerrit.wikimedia.org/r/344639 [15:53:56] 06Operations, 10Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3128415 (10Gehel) [15:53:57] damn relationship validation happens on the agent, not the master [15:54:04] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:54:05] so the compiler could not catch that [15:54:19] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] bacula::client: remove Exec['concat-bacula-keypair'] dependency [puppet] - 10https://gerrit.wikimedia.org/r/344639 (owner: 10Alexandros Kosiaris) [15:54:22] 06Operations, 10Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3128417 (10RobH) @Gehel, I'll drop legal an email for @RStallman-legalteam to comment on this task, they handle the NDA confirmations. [15:55:03] 06Operations, 10Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3128421 (10Gehel) I am very much sponsoring Paul for his SSH access to the maps cluster. [15:55:34] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:55:44] PROBLEM - puppet last run on heze is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula/sd-keypair] [15:55:54] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:56:34] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:56:34] PROBLEM - puppet last run on etcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:56:34] PROBLEM - puppet last run on fermium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:56:56] runs puppet on terbium and sees it's related to the bacula change [15:57:34] (03PS5) 10Andrew Bogott: Bootstrapvz: Simplify and update [puppet] - 10https://gerrit.wikimedia.org/r/343208 [15:57:44] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:58:35] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::redis: add local master and slave profiles [puppet] - 10https://gerrit.wikimedia.org/r/344621 (owner: 10Giuseppe Lavagetto) [15:58:43] (03PS4) 10Giuseppe Lavagetto: profile::redis: add local master and slave profiles [puppet] - 10https://gerrit.wikimedia.org/r/344621 [15:58:54] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:59:03] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::redis: add local master and slave profiles [puppet] - 10https://gerrit.wikimedia.org/r/344621 (owner: 10Giuseppe Lavagetto) [15:59:15] (03CR) 10Andrew Bogott: [C: 032] Bootstrapvz: Simplify and update [puppet] - 10https://gerrit.wikimedia.org/r/343208 (owner: 10Andrew Bogott) [15:59:24] PROBLEM - Check systemd state on conf1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:59:34] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair],Etcd_user[root],Etcd_role[guest] [15:59:54] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair],Service[bacula-fd] [15:59:55] 06Operations, 10Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3128432 (10RobH) [16:00:05] 06Operations, 06Labs: Instance creation fails before first puppet run around 1% of the time - https://phabricator.wikimedia.org/T160908#3128433 (10chasemp) Took about a day and half to leak 7 instances https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1486132582.09&target=cactiStyle(log(server... [16:00:14] (03PS5) 10Giuseppe Lavagetto: profile::redis: add local master and slave profiles [puppet] - 10https://gerrit.wikimedia.org/r/344621 [16:00:24] PROBLEM - Check systemd state on cobalt is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:00:34] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:00:34] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:00:45] PROBLEM - Check systemd state on bast4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:00:54] PROBLEM - Check systemd state on etcd1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:01:04] RECOVERY - dhclient process on labtestvirt2002 is OK: PROCS OK: 0 processes with command name dhclient [16:01:04] RECOVERY - DPKG on labtestvirt2002 is OK: All packages OK [16:01:04] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:01:14] RECOVERY - configured eth on labtestvirt2002 is OK: OK - interfaces up [16:01:14] RECOVERY - Disk space on labtestvirt2002 is OK: DISK OK [16:01:14] RECOVERY - kvm ssl cert on labtestvirt2002 is OK: Cert /etc/ssl/localcerts/labvirt-star.codfw.wmnet.crt will not expire for at least 90 days [16:01:24] PROBLEM - puppet last run on etcd1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair],Etcd_user[root] [16:01:24] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:01:34] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:02:25] PROBLEM - puppet last run on meitnerium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:02:34] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:02:54] PROBLEM - Check systemd state on meitnerium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
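The "Invalid relationship" message quoted a few lines up is the key to this burst of catalog failures: a resource still carried require => Exec['concat-bacula-keypair'] after the refactor had removed that exec, and Puppet only validates such references when the agent turns the compiled catalog into a dependency graph, which is why the puppet-compiler run looked clean while agent runs failed fleet-wide. A reduced illustration of the failure mode (not the actual module code; resource bodies trimmed):

```puppet
# The refactor renamed the exec...
exec { 'create-/etc/bacula-keypair':
    command => '/bin/true',
}

# ...but a stale reference to the old name survived elsewhere. The master
# compiles this catalog without complaint; the agent aborts with
#   Invalid relationship: File[/etc/bacula/bacula-fd.conf]
#   { require => Exec[concat-bacula-keypair] }, because
#   Exec[concat-bacula-keypair] doesn't seem to be in the catalog
file { '/etc/bacula/bacula-fd.conf':
    content => '# trimmed',
    require => Exec['concat-bacula-keypair'],
}
```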
[16:03:03] (03PS1) 10Alexandros Kosiaris: bacula: Specify the correct filename for the cert file [puppet] - 10https://gerrit.wikimedia.org/r/344641 [16:03:14] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:03:23] (03PS2) 10Alexandros Kosiaris: bacula: Specify the correct filename for the cert file [puppet] - 10https://gerrit.wikimedia.org/r/344641 [16:03:24] PROBLEM - puppet last run on auth2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:03:33] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] bacula: Specify the correct filename for the cert file [puppet] - 10https://gerrit.wikimedia.org/r/344641 (owner: 10Alexandros Kosiaris) [16:03:34] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:03:34] PROBLEM - Check systemd state on auth2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:03:44] PROBLEM - Check systemd state on rutherfordium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:04:08] <_joe_> akosiaris: merging your change too [16:04:44] RECOVERY - nova-compute process on labtestvirt2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [16:04:44] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:04:45] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational [16:04:54] _joe_: thanks [16:05:08] ok that fixed the issue, recoveries should start arriving now [16:05:14] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:05:34] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:05:44] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:05:58] (03PS2) 10Giuseppe Lavagetto: role::jobqueue_redis: convert slaves to role::jobqueue_redis::slave [puppet] - 10https://gerrit.wikimedia.org/r/344622 [16:06:28] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3128450 (10GWicke) There is a performance dashboard at https://grafana.wikimedia.org/dashboard/db/reading-web-page-previews?fro... [16:06:44] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:06:54] PROBLEM - Check systemd state on iron is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:06:54] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:07:14] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [16:07:24] PROBLEM - puppet last run on rdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:07:37] (03CR) 10Giuseppe Lavagetto: [C: 032] role::jobqueue_redis: convert slaves to role::jobqueue_redis::slave [puppet] - 10https://gerrit.wikimedia.org/r/344622 (owner: 10Giuseppe Lavagetto) [16:08:04] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:08:14] PROBLEM - puppet last run on etcd1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:08:24] PROBLEM - Check systemd state on etcd1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:08:34] PROBLEM - Check systemd state on bast2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:08:34] PROBLEM - Check systemd state on conf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:08:34] PROBLEM - Check systemd state on tin is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:08:34] PROBLEM - puppet last run on mc1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:08:34] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational [16:08:44] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:08:44] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:08:49] <_joe_> uhm apparently I screwed up too [16:08:54] PROBLEM - Check systemd state on pollux is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:08:54] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:09:14] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:09:14] PROBLEM - puppet last run on mc2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:09:20] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3128457 (10Cmjohnson) @madhuvishy . Sorry I was away on vacation. When do you want to do this next week. I would prefer if we coul... [16:09:24] PROBLEM - Check systemd state on lithium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:09:24] PROBLEM - bacula director process on helium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-dir [16:09:34] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:10:04] PROBLEM - Check systemd state on install2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:10:14] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:10:14] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [16:10:24] PROBLEM - Check systemd state on helium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:10:24] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:10:24] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:10:25] PROBLEM - Check systemd state on conf1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:10:25] PROBLEM - puppet last run on mc1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:10:25] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3128458 (10RobH) a:05RobH>03Ottomata Oh, I failed to see the entire raid5 comment in the initial request until now. Raid 5 is horrible for write, and I'm pretty... [16:10:34] RECOVERY - Check systemd state on tin is OK: OK - running: The system is fully operational [16:10:44] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:10:44] RECOVERY - Check systemd state on rutherfordium is OK: OK - running: The system is fully operational [16:10:54] RECOVERY - Check systemd state on pollux is OK: OK - running: The system is fully operational [16:10:56] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:11:04] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:11:04] PROBLEM - Check systemd state on install1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:11:24] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:11:24] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [16:11:28] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3128462 (10RobH) a:03Ottomata @Ottomata: So we've discussed the raid level, but I realize now I never got the overall capacity requirement? Not the disk layout, bu... [16:11:34] PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:11:34] PROBLEM - puppet last run on mc2019 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:11:34] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:11:49] ottomata1: Sorry about the confusion on the stat replacements, but I realized I never got minimum capacity, just raid discussion [16:12:05] ?[1;31mError: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Profile::Redis::Instance[6379] is already declared in file /etc/puppet/modules/profile/manifests/redis/multidc_instance.pp:25; cannot redeclare at /etc/puppet/modules/profile/manifests/redis/instance.pp:27 on node mc1009.eqiad.wmnet?[0m [16:12:07] _joe_: ^ [16:12:12] I think that's you [16:12:12] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3128465 (10madhuvishy) @Cmjohnson Yup no problem! :) Just to confirm, can we move both servers to row B then? [16:12:13] but as soon as I have that from you (im sure its not much of your time since you likely had to figure it out already ;) then i'll get them quoted! [16:12:14] PROBLEM - puppet last run on mc2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:12:24] PROBLEM - puppet last run on mc2022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:12:30] robh: k will comment shortly [16:12:44] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:12:44] PROBLEM - Check systemd state on bast3002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:12:54] PROBLEM - Check systemd state on dbstore1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:13:04] PROBLEM - puppet last run on mc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:13:13] 06Operations, 10Cassandra, 06Services (doing): Upload cassandra-tools-wmf 1.0.1-1 Debian package to apt.w.o - https://phabricator.wikimedia.org/T161239#3128468 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This is done! ``` root@install1002:~# reprepro -C backports --ignore=wrongdistribution include... [16:13:21] urandom: ^ \o/ [16:13:34] PROBLEM - Check systemd state on etcd1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:13:44] PROBLEM - puppet last run on mc1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:13:44] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:13:45] PROBLEM - puppet last run on mc1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:13:45] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:13:54] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:13:54] PROBLEM - puppet last run on mc1013 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:13:54] PROBLEM - puppet last run on mc1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:13:54] PROBLEM - puppet last run on mc1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:13:54] PROBLEM - puppet last run on mc1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:14:04] PROBLEM - puppet last run on etcd1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:14:04] PROBLEM - puppet last run on mc1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:14:04] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [16:14:04] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:14:04] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:14:05] PROBLEM - puppet last run on mc1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:14:14] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:14:14] godog: thank you; \o/ [16:14:14] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:14:25] PROBLEM - puppet last run on mc1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:14:26] RECOVERY - puppet last run on uranium is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:14:26] PROBLEM - puppet last run on mc1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:14:44] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:15:14] PROBLEM - puppet last run on mc2033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:15:34] RECOVERY - Check systemd state on auth2001 is OK: OK - running: The system is fully operational [16:15:44] PROBLEM - puppet last run on mc2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:15:44] RECOVERY - puppet last run on heze is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:15:44] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:15:54] RECOVERY - Check systemd state on dbstore1001 is OK: OK - running: The system is fully operational [16:15:54] RECOVERY - Check systemd state on etcd1002 is OK: OK - running: The system is fully operational [16:15:54] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:15:54] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[bacula-director] [16:15:55] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [16:15:55] RECOVERY - puppet last run on etcd1005 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:16:04] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:16:04] RECOVERY - Check systemd state on install1002 is OK: OK - running: The system is fully operational [16:16:04] RECOVERY - Check systemd state on install2002 is OK: OK - running: The system is fully operational [16:16:05] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:16:14] RECOVERY - puppet last run on etcd1006 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:16:24] RECOVERY - puppet last run on auth2001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:16:27] RECOVERY - puppet last run on etcd1002 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:16:27] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [16:16:27] RECOVERY - Check systemd state on conf1001 is OK: OK - running: The system is fully operational [16:16:27] RECOVERY - Check systemd state on etcd1006 is OK: OK - running: The system is fully operational [16:16:27] PROBLEM - puppet last run on mc2034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:16:27] RECOVERY - Check systemd state on conf1003 is OK: OK - running: The system is fully operational [16:16:27] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:16:34] RECOVERY - Check systemd state on etcd1005 is OK: OK - running: The system is fully operational [16:16:34] RECOVERY - Check systemd state on bast2001 is OK: OK - running: The system is fully operational [16:16:34] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:16:34] RECOVERY - Check systemd state on conf1002 is OK: OK - running: The system is fully operational [16:16:35] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:16:35] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational [16:16:38] (03PS1) 10Giuseppe Lavagetto: profile::redis::multidc_instance: fully qualify puppet nonsense [puppet] - 10https://gerrit.wikimedia.org/r/344644 [16:16:44] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:16:44] RECOVERY - Check systemd state on bast3002 is OK: OK - running: The system is fully operational [16:16:44] PROBLEM - puppet last run on rdb2005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:16:44] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [16:16:44] RECOVERY - Check systemd state on bast4001 is OK: OK - running: The system is fully operational [16:17:11] 06Operations, 10Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3128472 (10Gehel) [16:17:29] 06Operations, 06Commons, 06Multimedia, 10media-storage: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704#2509075 (10Thibaut120094) Same problem here: ``` Error undeleting file: The file "mwstore://local-multiwrite/local-public/9/9f/Chocolate(bgFFF... [16:17:34] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:17:44] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:17:44] PROBLEM - Check systemd state on conf2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:17:44] PROBLEM - puppet last run on wezen is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:17:44] PROBLEM - Check systemd state on etcd1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:17:54] PROBLEM - Check systemd state on etcd1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:18:04] PROBLEM - Check systemd state on etcd1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:18:14] PROBLEM - Check systemd state on wezen is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:18:14] PROBLEM - puppet last run on mc2031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:18:24] PROBLEM - puppet last run on mc2023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:18:24] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 13216 [16:18:34] PROBLEM - Check systemd state on dubnium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:18:34] PROBLEM - Check systemd state on conf2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:18:39] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::redis::multidc_instance: fully qualify puppet nonsense [puppet] - 10https://gerrit.wikimedia.org/r/344644 (owner: 10Giuseppe Lavagetto) [16:18:44] PROBLEM - Check systemd state on fermium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
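The redis-related "Catalog fetch fail" alerts on the mc* and rdb* hosts trace back to the duplicate-declaration error quoted at 16:12:05: Profile::Redis::Instance[6379] ended up declared twice in the same catalog. The fix titled "profile::redis::multidc_instance: fully qualify puppet nonsense" points at older Puppet's relative name resolution (removed in Puppet 4): an unqualified reference such as redis::instance, used anywhere under the profile namespace, is looked up with the enclosing namespaces prepended first, so it can silently bind to profile::redis::instance instead of the top-level redis module. A reduced sketch of that trap, not the actual profile code:

```puppet
# Hypothetical wrapper define living in the profile namespace.
define profile::redis::instance () {
    # BROKEN: unqualified. Relative lookup tries "profile::redis::instance"
    # before the top-level "redis::instance", so this binds back to this very
    # define and compilation aborts with
    #   Duplicate declaration: Profile::Redis::Instance[...] is already declared
    redis::instance { $title: }
}

define profile::redis::instance_fixed () {
    # FIXED: the leading "::" pins the reference to the top-level redis module.
    ::redis::instance { $title: }
}
```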
[16:19:04] RECOVERY - Check systemd state on etcd1004 is OK: OK - running: The system is fully operational [16:19:14] RECOVERY - puppet last run on conf2002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [16:19:24] RECOVERY - puppet last run on conf2003 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:19:34] RECOVERY - puppet last run on conf2001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:19:34] RECOVERY - puppet last run on dubnium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:19:34] RECOVERY - Check systemd state on dubnium is OK: OK - running: The system is fully operational [16:19:34] RECOVERY - puppet last run on etcd1003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:19:34] RECOVERY - Check systemd state on conf2001 is OK: OK - running: The system is fully operational [16:19:35] RECOVERY - puppet last run on fermium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:19:44] RECOVERY - Check systemd state on fermium is OK: OK - running: The system is fully operational [16:19:44] RECOVERY - Check systemd state on conf2003 is OK: OK - running: The system is fully operational [16:19:44] RECOVERY - Check systemd state on etcd1003 is OK: OK - running: The system is fully operational [16:19:54] RECOVERY - Check systemd state on etcd1001 is OK: OK - running: The system is fully operational [16:19:54] PROBLEM - Check systemd state on bromine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:19:54] RECOVERY - puppet last run on etcd1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:20:05] PROBLEM - Check systemd state on serpens is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:20:24] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:20:24] RECOVERY - puppet last run on mc1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:20:24] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:20:34] PROBLEM - Check systemd state on wasat is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:20:44] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:20:44] RECOVERY - puppet last run on mc1011 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:21:04] PROBLEM - Check systemd state on mira is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:21:04] PROBLEM - Check systemd state on seaborgium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:21:44] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:22:24] RECOVERY - puppet last run on mc1016 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:22:54] RECOVERY - puppet last run on mc1018 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:23:04] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:23:14] RECOVERY - Check systemd state on wezen is OK: OK - running: The system is fully operational [16:23:24] RECOVERY - puppet last run on rdb2001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:23:24] RECOVERY - Check systemd state on lithium is OK: OK - running: The system is fully operational [16:23:24] RECOVERY - puppet last run on meitnerium is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:23:25] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [16:23:25] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:23:25] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:23:34] RECOVERY - Check systemd state on wasat is OK: OK - running: The system is fully operational [16:23:34] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:23:44] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:23:44] RECOVERY - puppet last run on rdb2005 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:23:44] RECOVERY - puppet last run on wezen is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [16:23:44] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:23:44] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:23:54] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:23:54] RECOVERY - Check systemd state on iron is OK: OK - running: The system is fully operational [16:23:54] RECOVERY - Check systemd state on bromine is OK: OK - running: The system is fully operational [16:23:54] RECOVERY - Check systemd state on meitnerium is OK: OK - running: The system is fully operational [16:23:54] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:23:55] RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:23:58] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3128477 (10Cmjohnson) @madhuvishy You want both moved to row B but separate racks? If they need to be connected to each other that... 
[16:24:04] RECOVERY - Check systemd state on mira is OK: OK - running: The system is fully operational [16:24:04] RECOVERY - Check systemd state on seaborgium is OK: OK - running: The system is fully operational [16:24:04] RECOVERY - Check systemd state on serpens is OK: OK - running: The system is fully operational [16:24:14] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:24:14] RECOVERY - puppet last run on mc2021 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:24:24] RECOVERY - puppet last run on mc2022 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:24:24] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:24:24] RECOVERY - puppet last run on mc1005 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:24:44] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:24:54] RECOVERY - puppet last run on mc1014 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:24:54] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:25:04] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:25:34] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational [16:26:05] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [16:26:14] RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:26:24] RECOVERY - puppet last run on mc2023 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:26:34] RECOVERY - puppet last run on mc2019 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:26:54] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:27:04] RECOVERY - puppet last run on mc1012 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:27:05] RECOVERY - puppet last run on mc1004 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:27:05] RECOVERY - puppet last run on mc1003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:27:14] RECOVERY - puppet last run on mc2031 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:27:14] RECOVERY - puppet last run on mc2033 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:27:24] RECOVERY - puppet last run on mc2034 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:27:44] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:27:44] RECOVERY - puppet last run on mc2024 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:27:44] RECOVERY - puppet last run on mc1009 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:27:54] RECOVERY - puppet last run on mc1013 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:27:54] RECOVERY - puppet last 
run on mc1008 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:28:14] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:28:14] RECOVERY - puppet last run on mc2028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:28:34] RECOVERY - puppet last run on mc1015 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:29:15] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3128493 (10madhuvishy) @Cmjohnson Yes I think separate racks for reliability. We currently have labstore1004 and labstore1005 set u... [16:30:38] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3128506 (10chasemp) It's similar to what we are doing with labstore1004/1005, or same model. I believe those are in neighboring ra... [16:31:24] PROBLEM - Redis replication status tcp_6379 on rdb2002 is CRITICAL: Return code of 255 is out of bounds [16:32:14] PROBLEM - Redis replication status tcp_6380 on rdb2002 is CRITICAL: Return code of 255 is out of bounds [16:32:14] PROBLEM - Redis replication status tcp_6381 on rdb2002 is CRITICAL: Return code of 255 is out of bounds [16:32:22] (03PS1) 10Giuseppe Lavagetto: profile::redis::slave: fix parameter extraction [puppet] - 10https://gerrit.wikimedia.org/r/344647 [16:32:23] <_joe_> that's expected ^^ [16:32:24] PROBLEM - Redis replication status tcp_6378 on rdb2002 is CRITICAL: Return code of 255 is out of bounds [16:32:25] (03PS1) 10Cmjohnson: T160640 Adding dhcpd entries and netboot.cfg for new swift servers ms-be1028-39 [puppet] - 10https://gerrit.wikimedia.org/r/344646 [16:32:26] 06Operations, 06DC-Ops: Change cdentinger's icinga sms gateway to Sprint - https://phabricator.wikimedia.org/T161112#3128507 (10RobH) I just noticed this today, if your phone changed, did the phone number stay the same? (I'm guessing yes since you didn't state otherwise.) [16:32:33] <_joe_> and this ^^ is the solution [16:32:58] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::redis::slave: fix parameter extraction [puppet] - 10https://gerrit.wikimedia.org/r/344647 (owner: 10Giuseppe Lavagetto) [16:33:47] (03PS1) 10Alexandros Kosiaris: bacula: Fix typos in client.erb [puppet] - 10https://gerrit.wikimedia.org/r/344648 [16:34:15] (03CR) 10Alexandros Kosiaris: [C: 032] bacula: Fix typos in client.erb [puppet] - 10https://gerrit.wikimedia.org/r/344648 (owner: 10Alexandros Kosiaris) [16:34:20] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] bacula: Fix typos in client.erb [puppet] - 10https://gerrit.wikimedia.org/r/344648 (owner: 10Alexandros Kosiaris) [16:35:45] (03PS2) 10Cmjohnson: T160640 Adding dhcpd entries and netboot.cfg for new swift servers ms-be1028-39 [puppet] - 10https://gerrit.wikimedia.org/r/344646 [16:36:18] shit [16:36:29] i broke icinga, or it was broken and i didnt check the config in advance, fixing [16:36:35] (its not down, just failed its config check) [16:37:32] something shifted in performance team [16:37:40] that wasnt me... 
[16:37:44] (03PS3) 10Cmjohnson: T160640 Adding dhcpd entries and netboot.cfg for new swift servers ms-be1028-39 [puppet] - 10https://gerrit.wikimedia.org/r/344646 [16:38:18] its been failed over an hour [16:38:22] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3128520 (10Ottomata) [16:39:09] (03CR) 10Cmjohnson: [C: 032] T160640 Adding dhcpd entries and netboot.cfg for new swift servers ms-be1028-39 [puppet] - 10https://gerrit.wikimedia.org/r/344646 (owner: 10Cmjohnson) [16:39:10] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3080357 (10Ottomata) stat1003 is using about 3.2T space right now, and I don't expect it to grow much. If we can get something with at least 4T storage capacity, 6T... [16:39:21] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3128523 (10Ottomata) a:05Ottomata>03RobH [16:39:31] (03PS4) 10Cmjohnson: T160640 Adding dhcpd entries and netboot.cfg for new swift servers ms-be1028-39 [puppet] - 10https://gerrit.wikimedia.org/r/344646 [16:39:33] (03PS1) 10Giuseppe Lavagetto: profile::redis::slave: fix parameter extraction (again) [puppet] - 10https://gerrit.wikimedia.org/r/344650 [16:39:51] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::redis::slave: fix parameter extraction (again) [puppet] - 10https://gerrit.wikimedia.org/r/344650 (owner: 10Giuseppe Lavagetto) [16:40:54] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:41:00] (03PS1) 10Papaul: DNS/Decom Remove mgmt DNS entries for ms-fe200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/344651 [16:43:10] (03PS1) 10Giuseppe Lavagetto: profile::redis::slave: use auth settings in the instances [puppet] - 10https://gerrit.wikimedia.org/r/344652 [16:43:33] (03PS1) 10Alexandros Kosiaris: Fix references to the performance-team [puppet] - 10https://gerrit.wikimedia.org/r/344653 [16:43:34] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::redis::slave: use auth settings in the instances [puppet] - 10https://gerrit.wikimedia.org/r/344652 (owner: 10Giuseppe Lavagetto) [16:44:00] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3128531 (10Ottomata) @robh, see the section under **Disks** in the task description. The storage requirements aren't so strict, but as usual, more is better. stat10... [16:44:35] (03CR) 10Cmjohnson: [V: 032 C: 032] T160640 Adding dhcpd entries and netboot.cfg for new swift servers ms-be1028-39 [puppet] - 10https://gerrit.wikimedia.org/r/344646 (owner: 10Cmjohnson) [16:44:48] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3128538 (10RobH) Ok, we can do 4 * 4TB to hit 8TB in raid10. That comes out to more like 7.4TB usable. I'll also ask for quotes for 4 * 6TB to see what the price di... [16:44:55] ottomata: thanks! 
I'll get the quote request in for stat1003 replacement now [16:44:59] (03PS2) 10Alexandros Kosiaris: Fix references to the performance-team [puppet] - 10https://gerrit.wikimedia.org/r/344653 [16:45:09] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix references to the performance-team [puppet] - 10https://gerrit.wikimedia.org/r/344653 (owner: 10Alexandros Kosiaris) [16:45:24] RECOVERY - Redis replication status tcp_6378 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6378 has 1 databases (db0) with 15 keys, up 38 seconds - replication_delay is 0 [16:45:40] (03PS5) 10Cmjohnson: T160640 Adding dhcpd entries and netboot.cfg for new swift servers ms-be1028-39 [puppet] - 10https://gerrit.wikimedia.org/r/344646 [16:46:02] (03CR) 10Cmjohnson: [V: 032 C: 032] T160640 Adding dhcpd entries and netboot.cfg for new swift servers ms-be1028-39 [puppet] - 10https://gerrit.wikimedia.org/r/344646 (owner: 10Cmjohnson) [16:46:14] RECOVERY - Redis replication status tcp_6380 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6380 has 1 databases (db0) with 3688544 keys, up 1 minutes 34 seconds - replication_delay is 0 [16:46:14] RECOVERY - Redis replication status tcp_6381 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6381 has 1 databases (db0) with 3682455 keys, up 1 minutes 35 seconds - replication_delay is 0 [16:47:24] RECOVERY - Redis replication status tcp_6379 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6379 has 1 databases (db0) with 8385151 keys, up 2 minutes 40 seconds - replication_delay is 0 [16:49:17] (03PS1) 10Alexandros Kosiaris: bacula: Fix typo with TLS key [puppet] - 10https://gerrit.wikimedia.org/r/344655 [16:49:33] (03PS2) 10Alexandros Kosiaris: bacula: Fix typo with TLS key [puppet] - 10https://gerrit.wikimedia.org/r/344655 [16:49:39] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] bacula: Fix typo with TLS key [puppet] - 10https://gerrit.wikimedia.org/r/344655 (owner: 10Alexandros Kosiaris) [16:51:34] (03PS1) 10Andrew Bogott: bootstrap_vz: Handle initial hostname a bit better [puppet] - 10https://gerrit.wikimedia.org/r/344656 (https://phabricator.wikimedia.org/T160908) [16:51:39] chasemp: ^ [16:52:14] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:53:58] andrewbogott: I think it's cool, going to think about it a minute over lunch as I'm on my way out if you don't mind. thanks for doing that [16:54:06] ok! [16:57:21] 06Operations, 06DC-Ops: Change cdentinger's icinga sms gateway to Sprint - https://phabricator.wikimedia.org/T161112#3128566 (10RobH) 05Open>03Resolved Done and live! 
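The stat1003 sizing exchange above (four 4 TB drives in RAID 10, quoted as "more like 7.4TB usable") is mirroring arithmetic plus the usual decimal-to-binary unit slip: half the raw capacity goes to mirrors, and the remaining 8 decimal terabytes read as roughly 7.3 TiB in the binary units most tools display, before filesystem overhead:

\[ \text{usable} \approx \frac{4 \times 4\ \mathrm{TB}}{2} = 8\ \mathrm{TB} = \frac{8 \times 10^{12}\ \mathrm{B}}{2^{40}\ \mathrm{B/TiB}} \approx 7.3\ \mathrm{TiB} \]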
[17:00:53] (03PS1) 10Andrew Bogott: Further attempt to get partman working for labtestvirt2002 [puppet] - 10https://gerrit.wikimedia.org/r/344657 [17:02:24] RECOVERY - bacula director process on helium is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-dir [17:02:24] RECOVERY - Check systemd state on helium is OK: OK - running: The system is fully operational [17:02:50] (03CR) 10Andrew Bogott: [C: 032] Further attempt to get partman working for labtestvirt2002 [puppet] - 10https://gerrit.wikimedia.org/r/344657 (owner: 10Andrew Bogott) [17:02:54] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:04:24] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [17:06:14] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:07:54] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:11:31] (03PS1) 10Filippo Giunchedi: nagios: s/contactgroup_members/members/ [puppet] - 10https://gerrit.wikimedia.org/r/344658 [17:11:49] akosiaris: ^ [17:13:04] (03CR) 10Alexandros Kosiaris: [C: 031] nagios: s/contactgroup_members/members/ [puppet] - 10https://gerrit.wikimedia.org/r/344658 (owner: 10Filippo Giunchedi) [17:13:17] (03PS2) 10Filippo Giunchedi: nagios: s/contactgroup_members/members/ [puppet] - 10https://gerrit.wikimedia.org/r/344658 [17:13:25] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] nagios: s/contactgroup_members/members/ [puppet] - 10https://gerrit.wikimedia.org/r/344658 (owner: 10Filippo Giunchedi) [17:13:43] merging right away, it isn't like we're doing checks on icinga config [17:15:08] 06Operations, 13Patch-For-Review: bacula-director not running on helium - https://phabricator.wikimedia.org/T161281#3128597 (10akosiaris) 05Open>03Resolved A long tail of commits and fixes later, the puppet modules are freak of that hideous keypair generation in the module and rather rely on base::expose_p... [17:15:26] (03PS1) 10Ema: bgp: log with util.log instead of printing to stdout [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/344659 [17:16:14] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [17:17:11] oh oh [17:17:57] yeah that's me [17:21:37] 06Operations, 10Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3128626 (10RStallman-legalteam) Confirming that Paul Norman has a current contract with WMF which includes NDA language. This will cover the NDA requirement until the contract expires. 
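The icinga configuration error at 17:16 and the two follow-up patches ("nagios: s/contactgroup_members/members/" and "nagios: use members for wikitech-static contactgroup") appear to hinge on a small Nagios/Icinga 1.x object-syntax distinction: in a contactgroup definition, members lists contact names, while contactgroup_members pulls in other contactgroups, so listing contacts under contactgroup_members makes the pre-flight config check fail when it cannot find groups by those names. A minimal before/after, with made-up contact names:

```
# Broken: contact names given where contactgroups are expected
define contactgroup {
    contactgroup_name     wikitech-static
    alias                 wikitech-static admins
    contactgroup_members  alice, bob    ; these are contacts, not groups
}

# Fixed
define contactgroup {
    contactgroup_name     wikitech-static
    alias                 wikitech-static admins
    members               alice, bob
}
```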
[17:23:57] (03PS1) 10Filippo Giunchedi: nagios: use members for wikitech-static contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/344662 [17:24:57] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] nagios: use members for wikitech-static contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/344662 (owner: 10Filippo Giunchedi) [17:25:16] (03PS2) 10Filippo Giunchedi: nagios: use members for wikitech-static contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/344662 [17:25:23] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] nagios: use members for wikitech-static contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/344662 (owner: 10Filippo Giunchedi) [17:26:15] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [17:27:44] (03PS1) 10Cmjohnson: T160640 Adding dns entries for production new swift servers ms-be1028-1036 [dns] - 10https://gerrit.wikimedia.org/r/344663 [17:28:00] (03CR) 10jerkins-bot: [V: 04-1] T160640 Adding dns entries for production new swift servers ms-be1028-1036 [dns] - 10https://gerrit.wikimedia.org/r/344663 (owner: 10Cmjohnson) [17:28:02] (03PS1) 10Andrew Bogott: Add hiera settings for labtestvirt2002 [puppet] - 10https://gerrit.wikimedia.org/r/344664 [17:29:57] 06Operations, 10DNS, 06Discovery, 06Labs, and 3 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256#3128645 (10MaxSem) These subdomains should just be removed. They made sense in the times of HTTP when browsers had a hard li... [17:30:37] 06Operations, 10Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3128646 (10RobH) [17:30:41] (03CR) 10Andrew Bogott: [C: 032] Add hiera settings for labtestvirt2002 [puppet] - 10https://gerrit.wikimedia.org/r/344664 (owner: 10Andrew Bogott) [17:31:15] 06Operations, 10Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3127143 (10RobH) [17:32:26] (03PS2) 10Cmjohnson: T160640 Adding dns entries for production new swift servers ms-be1028-1036 [dns] - 10https://gerrit.wikimedia.org/r/344663 [17:32:55] RECOVERY - salt-minion processes on labtestvirt2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:34:25] (03CR) 10Cmjohnson: [C: 032] T160640 Adding dns entries for production new swift servers ms-be1028-1036 [dns] - 10https://gerrit.wikimedia.org/r/344663 (owner: 10Cmjohnson) [17:38:26] there is a performance grafana alert about enwiki.js [17:38:57] enwiki-mobile.js [17:39:31] not sure how to respond to that, I do not see any deployment happening lately [17:45:12] (03PS1) 10Andrew Bogott: One more tweak to labvirt2002 partman [puppet] - 10https://gerrit.wikimedia.org/r/344666 [17:47:29] (03CR) 10Andrew Bogott: [C: 032] One more tweak to labvirt2002 partman [puppet] - 10https://gerrit.wikimedia.org/r/344666 (owner: 10Andrew Bogott) [17:50:07] 06Operations, 10DBA, 05DC-Switchover-Prep-Q3-2016-17, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3128745 (10jcrespo) I will set this up on s6-codfw as a test, but next week. 
[17:50:09] !log restart elasticsearch on relforge100[12] to test reindex api over https [17:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:25] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 474 [18:08:20] (03PS3) 10EBernhardson: Allow search clusters to reindex from eachother [puppet] - 10https://gerrit.wikimedia.org/r/344517 [18:08:29] (03CR) 10EBernhardson: Allow search clusters to reindex from eachother (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson) [18:09:17] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3128793 (10Ottomata) ​Ok! [18:23:05] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_refinery_source] [18:23:50] (03PS1) 10Andrew Bogott: Format labtestvirt drives with xfs [puppet] - 10https://gerrit.wikimedia.org/r/344672 [18:26:10] (03CR) 10Andrew Bogott: [C: 032] Format labtestvirt drives with xfs [puppet] - 10https://gerrit.wikimedia.org/r/344672 (owner: 10Andrew Bogott) [18:42:06] 06Operations, 13Patch-For-Review: bacula-director not running on helium - https://phabricator.wikimedia.org/T161281#3128884 (10Dzahn) thank you very much :) {meme, src="seal-of-approval", below="PROCS OK: 1 process with UID = 110 (bacula), command name 'bacula-dir'", above="helium bacula director process" } [18:46:18] (03PS2) 10Rush: bootstrap_vz: Handle initial hostname a bit better [puppet] - 10https://gerrit.wikimedia.org/r/344656 (https://phabricator.wikimedia.org/T160908) (owner: 10Andrew Bogott) [18:46:40] (03CR) 10Rush: [C: 031] "as I understand it this works and without making all things ordered we should give some headroom for this operation :)" [puppet] - 10https://gerrit.wikimedia.org/r/344656 (https://phabricator.wikimedia.org/T160908) (owner: 10Andrew Bogott) [18:47:41] (03PS3) 10Andrew Bogott: bootstrap_vz: Handle initial hostname a bit better [puppet] - 10https://gerrit.wikimedia.org/r/344656 (https://phabricator.wikimedia.org/T160908) [18:48:09] (03CR) 10Andrew Bogott: [C: 032] bootstrap_vz: Handle initial hostname a bit better [puppet] - 10https://gerrit.wikimedia.org/r/344656 (https://phabricator.wikimedia.org/T160908) (owner: 10Andrew Bogott) [18:51:15] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:54:18] (03PS4) 10Dzahn: Visualdiff: Install mediawiki::packages::fonts [puppet] - 10https://gerrit.wikimedia.org/r/344551 (owner: 10Subramanya Sastry) [18:55:05] (03CR) 10Dzahn: "PS4: jenkins-bot did not verify because of tabs ( ERROR tab character found (hard_tabs)), replaced with spaces on line 6/7" [puppet] - 10https://gerrit.wikimedia.org/r/344551 (owner: 10Subramanya Sastry) [18:55:31] (03CR) 10jerkins-bot: [V: 04-1] Visualdiff: Install mediawiki::packages::fonts [puppet] - 10https://gerrit.wikimedia.org/r/344551 (owner: 10Subramanya Sastry) [18:56:15] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:57:23] (03PS5) 10Dzahn: Visualdiff: Install mediawiki::packages::fonts [puppet] - 10https://gerrit.wikimedia.org/r/344551 (owner: 10Subramanya Sastry) [19:00:48] (03PS1) 10Andrew Bogott: bootstrap_vs: Don't install cloud.cfg [puppet] - 10https://gerrit.wikimedia.org/r/344676 [19:04:02] (03CR) 10Dzahn: [C: 032] "verified now. also checked it's not affecting prod parsoid. http://puppet-compiler.wmflabs.org/5908/" [puppet] - 10https://gerrit.wikimedia.org/r/344551 (owner: 10Subramanya Sastry) [19:04:11] (03PS6) 10Dzahn: Visualdiff: Install mediawiki::packages::fonts [puppet] - 10https://gerrit.wikimedia.org/r/344551 (owner: 10Subramanya Sastry) [19:07:15] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [19:08:05] RECOVERY - puppet last run on labtestvirt2002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [19:10:15] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [19:10:49] (03CR) 10Andrew Bogott: [C: 032] bootstrap_vs: Don't install cloud.cfg [puppet] - 10https://gerrit.wikimedia.org/r/344676 (owner: 10Andrew Bogott) [19:10:56] (03PS2) 10Andrew Bogott: bootstrap_vs: Don't install cloud.cfg [puppet] - 10https://gerrit.wikimedia.org/r/344676 [19:11:00] (03PS1) 10Andrew Bogott: Labtest: Add labtestvirt2002 to scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/344678 [19:12:05] PROBLEM - DPKG on ruthenium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:12:15] ^ because it's installing stuff [19:12:55] (03PS2) 10Andrew Bogott: Labtest: Add labtestvirt2002 to scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/344678 [19:13:57] (03CR) 10Andrew Bogott: [C: 032] Labtest: Add labtestvirt2002 to scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/344678 (owner: 10Andrew Bogott) [19:16:05] RECOVERY - DPKG on ruthenium is OK: All packages OK [19:18:10] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/P5128" [puppet] - 10https://gerrit.wikimedia.org/r/344551 (owner: 10Subramanya Sastry) [19:18:21] subbu: all your fonts are belong to ruthenium - https://phabricator.wikimedia.org/P5128 [19:23:25] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 23 probes of 426 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [19:28:25] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 426 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [19:33:43] !log krinkle@tin Synchronized wmf-config/StartProfiler.php: touch - T161286 - hhvm cache maybe? 
(duration: 00m 43s) [19:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:50] T161286: App servers still sending handful of profiles XHGui - despite code being disabled - https://phabricator.wikimedia.org/T161286 [19:43:07] (03PS1) 10Krinkle: StartProfiler: Add hostname in xhgui record [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344683 (https://phabricator.wikimedia.org/T161286) [19:46:06] (03CR) 10Legoktm: [C: 031] StartProfiler: Add hostname in xhgui record [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344683 (https://phabricator.wikimedia.org/T161286) (owner: 10Krinkle) [19:48:04] I don't trust having app servers send data to tungsten over the weekend. This was supposed to be fixed yesterday and it's still happening. [19:48:11] (03CR) 10Krinkle: [C: 032] StartProfiler: Add hostname in xhgui record [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344683 (https://phabricator.wikimedia.org/T161286) (owner: 10Krinkle) [19:49:38] (03Merged) 10jenkins-bot: StartProfiler: Add hostname in xhgui record [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344683 (https://phabricator.wikimedia.org/T161286) (owner: 10Krinkle) [19:49:45] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:49:47] (03CR) 10jenkins-bot: StartProfiler: Add hostname in xhgui record [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344683 (https://phabricator.wikimedia.org/T161286) (owner: 10Krinkle) [19:50:59] legoktm: staging on mwdebug1001 now [19:52:04] (03CR) 10Smalyshev: [C: 031] Allow search clusters to reindex from eachother (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson) [19:52:07] legoktm: I can see it being called, but not being included. odd - https://performance.wikimedia.org/xhgui/run/view?id=58d578ca3f3dfa19490cbe34 [19:53:28] Ah, it's stored but not inthe view, that's fine [19:53:29] uhh [19:53:33] ok [19:55:06] !log krinkle@tin Synchronized wmf-config/StartProfiler.php: T161286 - include hostname (duration: 00m 49s) [19:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:16] legoktm: trimmed the db again to start fresh [19:55:17] T161286: App servers still sending handful of profiles XHGui - despite code being disabled - https://phabricator.wikimedia.org/T161286 [19:57:14] now we wait? [19:57:31] legoktm: Just got one [19:57:36] legoktm: I don't get it though.. [19:57:47] legoktm: This one again doesns't include hostname [19:58:20] so the code changes aren't taking effect somehow [20:01:49] legoktm: Ah, I think I got it. A hunch anyway. Symlinks. HHVM doesn't resolve symlinks to determine whether the files changed [20:01:57] we symlink StartProfiler from config to $IP, right? [20:02:00] * Krinkle tries another touch there [20:02:05] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:03:49] !log krinkle@tin Synchronized php-1.29.0-wmf.17/StartProfiler.php: touch - T161286 - (symlink) (duration: 00m 42s) [20:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:56] T161286: App servers still sending handful of profiles XHGui - despite code being disabled - https://phabricator.wikimedia.org/T161286 [20:07:15] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [20:11:12] (03PS2) 10Hashar: jenkins: tweak log permissions [puppet] - 10https://gerrit.wikimedia.org/r/342210 [20:18:45] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:20:45] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:20:45] RECOVERY - nova-network process on labtestnet2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-network [20:20:55] (03CR) 10Hashar: "The group_only permission is to secure the log files a bit but that means admins can no more read them unless they use sudo -u jenkins.. " [puppet] - 10https://gerrit.wikimedia.org/r/342210 (owner: 10Hashar) [20:29:05] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [20:43:09] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3129177 (10Nirzar) >We're not proposing no delay, because if you're mousing across things that would create needless requests i... [20:44:12] 06Operations, 10hardware-requests: Rename labtestmetal2001 - https://phabricator.wikimedia.org/T161265#3129178 (10Andrew) I must've done the cleanup out of order... I removed things again and they seem to be actually gone now. [20:47:45] PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:48:15] RECOVERY - Disk space on tungsten is OK: DISK OK [20:48:45] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [20:49:46] 06Operations, 10Gerrit, 07Beta-Cluster-reproducible, 13Patch-For-Review, 07Upstream: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#3129183 (10Paladox) This https://github.com/eclipse/jgit/commit/4ddd4a3d1 looks like a fix for this. Fixed in gerrit 2.13.7+. [20:53:01] 06Operations, 06Performance-Team: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3129188 (10Krinkle) [21:02:38] (03PS1) 10Andrew Bogott: Nova.conf compute_monitors=virt_driver [puppet] - 10https://gerrit.wikimedia.org/r/344689 (https://phabricator.wikimedia.org/T161006) [21:02:44] PROBLEM - puppet last run on wtp1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:02:57] (03PS2) 10Andrew Bogott: Nova.conf compute_monitors=virt_driver [puppet] - 10https://gerrit.wikimedia.org/r/344689 (https://phabricator.wikimedia.org/T161006) [21:08:46] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3129247 (10Andrew) The current state of this is: I rebooted labvirt1001 and it got better. 
I've migrated a handful of tools exec nodes back to labvirt1001 and I'm going t... [21:12:45] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3129255 (10dr0ptp4kt) [21:13:28] Hi there [21:13:42] we've encountered some weird behavior on ro.wiki [21:13:58] an article just shows a blank page: no error, no nothing [21:14:19] it was because of the infobox - i've been able to work around it somehow [21:14:31] but I was hoping I could get some relevant information from the logs [21:14:41] the page is https://ro.wikipedia.org/wiki/Thomas_Mann [21:15:01] hmm, "Eroare Lua: not enough memory. [21:15:16] yeah, after I removed all params from the infobox [21:15:31] it is showing in https://ro.wikipedia.org/w/index.php?title=Thomas_Mann&oldid=10758157 [21:15:33] but checkout https://ro.wikipedia.org/w/index.php?title=Thomas_Mann&oldid=10758157 [21:15:38] yeah :) [21:15:44] RECOVERY - puppet last run on db1079 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [21:15:52] :) [21:21:14] (03PS1) 10Andrew Bogott: Nova: Remove unused rsync server [puppet] - 10https://gerrit.wikimedia.org/r/344691 [21:24:34] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:29:04] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:30:44] RECOVERY - puppet last run on wtp1022 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:32:24] 06Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation, 06Labs, 10hardware-requests: Eqiad: Hardware request for labstore1006/7, dataset1002/3 - https://phabricator.wikimedia.org/T161311#3129356 (10ArielGlenn) [21:33:34] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:34:44] PROBLEM - puppet last run on wtp1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:48:20] (03PS1) 10EBernhardson: Update logstash plugins for 5.x [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/344704 [21:48:42] (03Abandoned) 10EBernhardson: [WIP] Update logstash plugins for 5.x [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/344637 (owner: 10EBernhardson) [21:52:03] (03PS2) 10EBernhardson: Update logstash plugins for 5.x [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/344704 [21:57:04] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [21:59:47] so, any chance to find out what the logs are saying about [[:ro:Thomas Mann]]? [22:01:02] strainu: :( "PHP Fatal Error: unknown exception" [22:01:44] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [22:01:50] very explicit :) [22:01:53] thanks anyway [22:02:59] hhvm error.log on one of the machines says same thing, no context [22:08:54] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:09:21] strainu: https://phabricator.wikimedia.org/P5130 [22:09:47] oh actually thats just me doing it wrong, sec [22:12:29] getRedirectedFrom() is on Article, not WikiPage… [22:17:58] yeah, buth $this->page can be an Article [22:18:06] perhaps it's not seen in the correct namespace? [22:20:17] ebernhardson says he did something wrong, so perhaps he used the wrong factory [22:20:48] yea passing an Article there works find, the code just didn't make it clear that a different implementation of Page object was expected. Still trying but havn't triggered the exception yet (but can from the web) [22:22:12] are you loading the old revision? [22:22:19] trying to convince it to [22:23:04] I can revert to that version for testing if you want me to [22:24:44] https://ro.wikipedia.org/wiki/Special:ExpandTemplates fails too with this text [22:25:16] which one? the current one or the one with issues? [22:25:36] the Infocaseta Scriitor [22:26:01] yeah, the issue is with the DoB [22:26:02] https://ro.wikipedia.org/wiki/Special:ExpandTemplates: Title: "Thomas Mann", content "{{Infocaseta Scriitor|nume=foo}}" [22:26:02] if there is a specific bit of wikitext, just put it in a user namespace sandbox or something [22:26:27] there are Lua errors of out of memory [22:26:32] so probably related to that [22:26:45] that sounds pretty plausible for ending up at an unknown exception [22:27:21] well, the thing is I removed the params one by one [22:27:29] and with a single param, I hit the issue [22:27:41] with no params, I get the Lua memory error [22:30:48] I don't think this can be reproduced in a userpage, it needs a valid Q-item [22:31:34] it doesn't reproduce in https://ro.wikipedia.org/wiki/Mihai_Eminescu [22:31:50] despite a similar quantity of information [22:36:54] RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [22:38:55] ok got the exception, but its still same exception :P will take a minute to find the right spot ... 
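For reference, the Special:ExpandTemplates check used above can also be driven through the action API, which makes it easier to bisect a failing infobox parameter by parameter from a script. A minimal sketch with the requests library, using the title and wikitext quoted in the conversation:

    import requests

    # Expand the failing template the same way Special:ExpandTemplates does,
    # via the action=expandtemplates API module on ro.wikipedia.org.
    resp = requests.get(
        "https://ro.wikipedia.org/w/api.php",
        params={
            "action": "expandtemplates",
            "title": "Thomas Mann",
            "text": "{{Infocaseta Scriitor|nume=foo}}",
            "prop": "wikitext",
            "format": "json",
        },
        timeout=30,
    )
    print(resp.json())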
[22:40:04] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [22:43:14] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [22:44:14] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3677939 keys, up 1 days 6 hours - replication_delay is 0 [22:45:04] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.407 second response time [22:45:04] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 298 seconds ago with 0 failures [22:49:14] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [22:50:04] PROBLEM - Disk space on ms-be2005 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdk1 is not accessible: Input/output error [22:50:04] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.657 second response time [22:50:14] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3677641 keys, up 1 days 6 hours - replication_delay is 0 [23:01:04] PROBLEM - MegaRAID on ms-be2005 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [23:01:05] ACKNOWLEDGEMENT - MegaRAID on ms-be2005 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T161358 [23:01:08] 06Operations, 10ops-codfw: Degraded RAID on ms-be2005 - https://phabricator.wikimedia.org/T161358#3129630 (10ops-monitoring-bot) [23:02:04] RECOVERY - Disk space on ms-be2005 is OK: DISK OK [23:03:28] sorry, I have to step out before I fall asleep with the computer in my arms [23:03:45] ebernhardson: if you find anything, just leave it here, I'll check the logs tomorrow [23:03:55] thanks for your help to you and Platonides ! [23:04:10] you are welcome, strainu :) [23:04:23] we should probably create a phab task [23:04:28] to keep track of this [23:10:04] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [23:12:14] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdk1] [23:15:04] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 296 seconds ago with 0 failures [23:16:48] nothing useful for you ... it happens inside Scribunto_LuaSandboxInterpreter::callFunction() when attempting to call something from Wikibase, but i don't think Wikibase itself has anything to do with it. 
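The "Eroare Lua: not enough memory" output and the exception inside Scribunto_LuaSandboxInterpreter::callFunction() both point at the per-parse LuaSandbox memory limit being exhausted while the Wikibase-driven infobox runs. A rough sketch of the Scribunto settings involved; the values here are illustrative defaults, and the production limits live in operations/mediawiki-config rather than in this snippet:

    $wgScribuntoDefaultEngine = 'luasandbox';
    // Per-parse resource limits enforced by the LuaSandbox engine.
    $wgScribuntoEngineConf['luasandbox']['memoryLimit'] = 50 * 1024 * 1024; // bytes of Lua memory per parse
    $wgScribuntoEngineConf['luasandbox']['cpuLimit']    = 7;                // seconds of CPU time per parse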
[23:27:05] (03PS1) 10Jforrester: Enable wgCiteResponsiveReferences on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344722 (https://phabricator.wikimedia.org/T161307) [23:34:34] (03PS2) 10Dzahn: mediawiki::maintenance: convert to profile/role (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/342777 [23:36:59] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::maintenance: convert to profile/role (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/342777 (owner: 10Dzahn) [23:42:21] (03PS1) 10Jforrester: Set wgOOUIEditPage false everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344724 [23:45:50] (03CR) 10jerkins-bot: [V: 04-1] Set wgOOUIEditPage false everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344724 (owner: 10Jforrester) [23:47:54] PROBLEM - puppet last run on mw1298 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:48:51] 06Operations: 404 loading images from Virgin Media - https://phabricator.wikimedia.org/T161360#3129706 (10Platonides) [23:49:31] 06Operations, 10Traffic: 404 loading images from Virgin Media - https://phabricator.wikimedia.org/T161360#3129683 (10Platonides) [23:58:15] (03PS2) 10Jforrester: Set wgOOUIEditPage false everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344724
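Both config patches at the end of the log, wgCiteResponsiveReferences on cawiki and wgOOUIEditPage off everywhere, follow the usual wmf-config pattern: a wmg* switch in InitialiseSettings.php keyed by wiki database name, later copied onto the corresponding $wg variable in CommonSettings.php. A sketch of that shape only; the wmg key names and values here are assumptions, and the patches themselves are authoritative:

    // InitialiseSettings.php style: per-wiki overrides keyed by database name,
    // with 'default' as the fallback for all other wikis.
    'wmgCiteResponsiveReferences' => [
        'default' => false,
        'cawiki'  => true,   // T161307
    ],
    'wmgOOUIEditPage' => [
        'default' => false,  // switched off everywhere
    ],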