[00:02:06] greg-g: Can I deploy an additional VE fix despite the SWAT having ended? [00:02:12] * greg-g nods [00:02:19] rmoen: sorry, I was out with a migraine all day :( [00:02:28] boo ;( [00:02:36] reorg postpartum [00:02:46] ori: you better believe it [00:03:04] RoanKattouw: are you going for the RB switch too? [00:03:24] Yes [00:03:39] cool, I'll stand by then [00:04:56] (03CR) 10Catrope: [C: 032] Use /api/rest_v1/ entry point for VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206316 (https://phabricator.wikimedia.org/T95229) (owner: 10GWicke) [00:05:02] (03Merged) 10jenkins-bot: Use /api/rest_v1/ entry point for VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206316 (https://phabricator.wikimedia.org/T95229) (owner: 10GWicke) [00:05:07] (03CR) 10Dzahn: [C: 032] monitoring: selector outside a resource [puppet] - 10https://gerrit.wikimedia.org/r/195523 (owner: 10Matanya) [00:05:16] PROBLEM - salt-minion processes on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:06:55] RECOVERY - salt-minion processes on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:06:59] !log catrope Synchronized wmf-config/InitialiseSettings.php: Temp disable direct RESTbase on enwiki (duration: 00m 17s) [00:07:03] Logged the message, Master [00:07:21] (03CR) 10Dzahn: "Catrope: how's this?" [puppet] - 10https://gerrit.wikimedia.org/r/167413 (owner: 10Ori.livneh) [00:07:33] !log catrope Synchronized wmf-config/CommonSettings.php: Use same-domain entry point for RESTbase (duration: 00m 13s) [00:07:37] Logged the message, Master [00:08:06] (03CR) 10Dzahn: "has path conflict" [puppet] - 10https://gerrit.wikimedia.org/r/167413 (owner: 10Ori.livneh) [00:08:19] (03CR) 10Catrope: "Why are you asking me? I don't remember ever having had anything to do with mathoid... unless I puppetized it together with citoid?" [puppet] - 10https://gerrit.wikimedia.org/r/167413 (owner: 10Ori.livneh) [00:08:37] gwicke: OK that first change is out now [00:08:44] gwicke: Let me know when you want me to turn direct RB back on on enwiki [00:09:18] RoanKattouw: looking good on mediawiki.org [00:09:40] so +1 for proceeding from me [00:09:52] (03CR) 10Dzahn: [C: 04-1] adding pending deployment ganglia group and setting it to default [puppet] - 10https://gerrit.wikimedia.org/r/159167 (owner: 10RobH) [00:10:57] (03CR) 10Dzahn: "has a path conflict. (and let's keep ServerAdmin option)" [puppet] - 10https://gerrit.wikimedia.org/r/165924 (owner: 10Ori.livneh) [00:11:06] (03CR) 10GWicke: [C: 031] Load HTML directly from RESTBase for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206319 (https://phabricator.wikimedia.org/T95229) (owner: 10GWicke) [00:11:14] mutante: ServerAdmin is set by the apache module for all vhosts [00:11:40] (03CR) 10Dzahn: "path conflict. @ Reedy does this still make sense?" [puppet] - 10https://gerrit.wikimedia.org/r/169944 (https://bugzilla.wikimedia.org/39482) (owner: 10Reedy) [00:12:59] (03CR) 10GWicke: "Added Alex, who IIRC puppetized this initially, and Marko and Giuseppe, who are currently generalizing the service puppet setup." [puppet] - 10https://gerrit.wikimedia.org/r/167413 (owner: 10Ori.livneh) [00:13:12] (03CR) 10BBlack: "PS13 is just whitespace cleanup + rebase. Testing this in beta now..." [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T88813) (owner: 10Nuria) [00:13:41] ori: ok, fair.
a bit confused why Apache config is not in the Apache module [00:15:18] RoanKattouw: I can do the next one if you +2 it [00:17:58] (03PS6) 10Dzahn: Allow OCG machines in Beta to be jenkins slaves. [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10Cscott) [00:17:59] Hold on I gotta sync something else first [00:18:39] (03CR) 10Dzahn: "cscott: is this still desired? rebased it. is this role used in labs?" [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10Cscott) [00:20:27] mutante: thanks for resurrecting that btw [00:21:39] !log catrope Synchronized php-1.26wmf2/extensions/VisualEditor: Fix RESTbase revid bug (duration: 00m 17s) [00:21:46] Logged the message, Master [00:21:57] !log catrope Synchronized php-1.26wmf3/extensions/VisualEditor: Fix RESTbase revid bug (duration: 00m 18s) [00:22:00] Logged the message, Master [00:23:03] (03CR) 10Dzahn: "ori, _joe_: here's more to resurrect :)" [puppet] - 10https://gerrit.wikimedia.org/r/179027 (owner: 10Ori.livneh) [00:24:42] (03CR) 10Dzahn: [C: 032] apache: Mute warnings about right-to-left relationships [puppet] - 10https://gerrit.wikimedia.org/r/201884 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [00:25:54] (03CR) 10Dzahn: [C: 031] Use require_package for python-redis [puppet] - 10https://gerrit.wikimedia.org/r/202093 (owner: 10Gergő Tisza) [00:27:56] (03CR) 10Dzahn: [C: 031] Tools: Fix bigbrother's patterns for web service types [puppet] - 10https://gerrit.wikimedia.org/r/201996 (https://phabricator.wikimedia.org/T94496) (owner: 10Tim Landscheidt) [00:30:04] (03CR) 10Dzahn: [C: 04-1] "deployment-prep already has it:" [puppet] - 10https://gerrit.wikimedia.org/r/201942 (https://phabricator.wikimedia.org/T95107) (owner: 10Gergő Tisza) [00:33:08] (03CR) 10Dzahn: "realistically that's rejected then" [puppet] - 10https://gerrit.wikimedia.org/r/181949 (owner: 10Hoo man) [00:36:46] gwicke: So James_F|Away says mw.org is broken [00:37:07] (03CR) 10Dzahn: [C: 031] "what could go wrong? it's just staging?" [puppet] - 10https://gerrit.wikimedia.org/r/200795 (owner: 1020after4) [00:37:17] gwicke: https://www.mediawiki.org/api/rest_v1/page/html/VisualEditor%2FDesign is a 500 [00:37:22] gwicke: Because it's a subpage [00:37:31] gwicke: Probably something going on with the Apache redirect decoding %2F? [00:37:36] * RoanKattouw reverts [00:37:56] hmm, it's not touching Apache [00:38:02] that would have to be Varnish [00:38:34] bblack: is Varnish decoding percent encoding by default? [00:38:36] (03PS1) 10Catrope: Revert "Use /api/rest_v1/ entry point for VE" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206340 [00:38:53] {"type":"https://restbase.org/errors/internal_error","method":"get","detail":"Error: Invalid revision: changelog","uri":"/www.mediawiki.org/v1/page/html/VisualEditor/changelog"} [00:38:59] (03PS2) 10Yuvipanda: Classify staging-cache-.* machines. [puppet] - 10https://gerrit.wikimedia.org/r/200795 (owner: 1020after4) [00:39:07] (03CR) 10Yuvipanda: [C: 032 V: 032] Classify staging-cache-.* machines. 
[puppet] - 10https://gerrit.wikimedia.org/r/200795 (owner: 1020after4) [00:39:26] http://rest.wikimedia.org/www.mediawiki.org/v1/page/html/VisualEditor%2FDesign clearly works [00:39:50] this must be specific to text varnishes [00:39:57] (03PS2) 10Catrope: Revert "Use /api/rest_v1/ entry point for VE" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206340 [00:40:06] (03CR) 10Catrope: [C: 032] Revert "Use /api/rest_v1/ entry point for VE" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206340 (owner: 10Catrope) [00:40:11] (03Merged) 10jenkins-bot: Revert "Use /api/rest_v1/ entry point for VE" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206340 (owner: 10Catrope) [00:40:12] Yeah we didn't have this bug before [00:40:29] But redirects often have this bug of decoding %2F on the way, I've seen that happen in other cases [00:41:15] the weird thing is that it's passing through Varnish in both cases [00:47:24] !log catrope Synchronized wmf-config/CommonSettings.php: Revert RESTbase URL change (duration: 00m 13s) [00:47:28] Logged the message, Master [00:49:37] (03CR) 10Dzahn: [C: 04-1] "read the ticket one more time i think it should default to Apache2 license. also because puppet itself switched to using it https://puppet" [puppet] - 10https://gerrit.wikimedia.org/r/183862 (owner: 10Rush) [01:02:56] PROBLEM - nova-compute process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:06:06] RECOVERY - nova-compute process on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [01:20:37] (03PS1) 10GWicke: Don't normalize the path for requests to restbase [puppet] - 10https://gerrit.wikimedia.org/r/206345 [01:24:15] (03PS2) 10GWicke: Don't normalize the path for requests to restbase [puppet] - 10https://gerrit.wikimedia.org/r/206345 [01:25:43] (03PS3) 10GWicke: Don't normalize the path for requests to restbase [puppet] - 10https://gerrit.wikimedia.org/r/206345 [01:26:26] 6operations, 10Traffic, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1233075 (10ori) MediaWiki's default expectation is that static assets and text are served from the same host. It actually takes quite a bit of hackery to convince it to use bits. So (happily... [01:35:30] (03CR) 10BBlack: [C: 032] Don't normalize the path for requests to restbase [puppet] - 10https://gerrit.wikimedia.org/r/206345 (owner: 10GWicke) [01:36:02] (03PS1) 10Ori.livneh: varnish: implement 'do_gzip' cluster option for mobile/text frontend, too [puppet] - 10https://gerrit.wikimedia.org/r/206348 [01:36:34] ^ bblack. (patch only introduces the cluster option and leaves it set to false) [01:51:45] PROBLEM - salt-minion processes on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:51:46] PROBLEM - nova-compute process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
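The %2F breakage discussed above ([00:37] onward) comes down to path normalization: if a cache or proxy layer percent-decodes the request path before forwarding it, a title like VisualEditor%2FDesign reaches RESTBase as VisualEditor/Design and the extra segment gets parsed as something else (hence the "Invalid revision: changelog" error). A minimal Python sketch of the effect, purely illustrative; the actual fix was the VCL change in https://gerrit.wikimedia.org/r/206345, not this code:

```python
from urllib.parse import quote, unquote

# A page title containing a literal slash must stay percent-encoded in the
# URL path so RESTBase sees it as one segment, not two.
title = "VisualEditor/Design"
encoded = quote(title, safe="")          # 'VisualEditor%2FDesign'
path = f"/api/rest_v1/page/html/{encoded}"

# What a normalizing proxy effectively does before forwarding the request:
normalized = unquote(path)               # '/api/rest_v1/page/html/VisualEditor/Design'

print(path)        # one title segment -> works
print(normalized)  # two segments -> the trailing part is misread as a revision
```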
[01:53:15] RECOVERY - salt-minion processes on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:53:16] RECOVERY - nova-compute process on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [01:55:45] (03PS1) 10Ori.livneh: Handle 204s consistently across Varnish roles [puppet] - 10https://gerrit.wikimedia.org/r/206351 [01:57:18] (03CR) 10Ori.livneh: Handle 204s consistently across Varnish roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/206351 (owner: 10Ori.livneh) [02:05:06] PROBLEM - dhclient process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:06:36] RECOVERY - dhclient process on labvirt1006 is OK: PROCS OK: 0 processes with command name dhclient [02:25:33] !log l10nupdate Synchronized php-1.26wmf2/cache/l10n: (no message) (duration: 06m 35s) [02:25:46] Logged the message, Master [02:30:01] !log LocalisationUpdate completed (1.26wmf2) at 2015-04-24 02:28:58+00:00 [02:30:05] Logged the message, Master [02:37:15] PROBLEM - salt-minion processes on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:37:15] PROBLEM - nova-compute process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:16] RECOVERY - salt-minion processes on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:42:16] RECOVERY - nova-compute process on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [02:44:30] !log l10nupdate Synchronized php-1.26wmf3/cache/l10n: (no message) (duration: 06m 00s) [02:44:37] Logged the message, Master [02:48:15] !log LocalisationUpdate completed (1.26wmf3) at 2015-04-24 02:47:12+00:00 [02:48:19] Logged the message, Master [03:14:18] (03CR) 10Springle: "Just a few questions:" [puppet] - 10https://gerrit.wikimedia.org/r/206145 (owner: 10Aaron Schulz) [03:21:06] PROBLEM - salt-minion processes on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:21:06] PROBLEM - nova-compute process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
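The recurring labvirt1006 alerts above come from an NRPE check that counts processes whose argument list matches a regex (hence "PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion"); the CRITICALs here are NRPE socket timeouts rather than the process actually being gone. A rough Python equivalent of what such a check counts, assuming the third-party psutil package is available (a sketch, not the actual Nagios plugin):

```python
import re
import psutil  # third-party; assumed installed for this sketch

def count_matching_procs(pattern: str) -> int:
    """Count processes whose full command line matches the given regex."""
    regex = re.compile(pattern)
    count = 0
    for proc in psutil.process_iter(attrs=["cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if regex.search(cmdline):
            count += 1
    return count

if __name__ == "__main__":
    n = count_matching_procs(r"^/usr/bin/python /usr/bin/salt-minion")
    status = "OK" if n >= 1 else "CRITICAL"
    print(f"PROCS {status}: {n} process(es) with regex args ^/usr/bin/python /usr/bin/salt-minion")
```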
[03:25:55] RECOVERY - salt-minion processes on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:25:55] RECOVERY - nova-compute process on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [03:39:56] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0] [03:41:06] PROBLEM - HHVM busy threads on mw1148 is CRITICAL 40.00% of data above the critical threshold [86.4] [03:42:55] PROBLEM - HHVM busy threads on mw1206 is CRITICAL 40.00% of data above the critical threshold [115.2] [03:44:15] PROBLEM - HHVM busy threads on mw1202 is CRITICAL 40.00% of data above the critical threshold [115.2] [03:44:25] PROBLEM - HHVM rendering on mw1129 is CRITICAL - Socket timeout after 10 seconds [03:44:25] PROBLEM - HHVM rendering on mw1144 is CRITICAL - Socket timeout after 10 seconds [03:44:35] PROBLEM - Apache HTTP on mw1119 is CRITICAL - Socket timeout after 10 seconds [03:44:36] PROBLEM - Apache HTTP on mw1208 is CRITICAL - Socket timeout after 10 seconds [03:44:36] PROBLEM - Apache HTTP on mw1136 is CRITICAL - Socket timeout after 10 seconds [03:44:36] PROBLEM - HHVM rendering on mw1133 is CRITICAL - Socket timeout after 10 seconds [03:44:45] PROBLEM - HHVM rendering on mw1121 is CRITICAL - Socket timeout after 10 seconds [03:44:46] PROBLEM - HHVM rendering on mw1204 is CRITICAL - Socket timeout after 10 seconds [03:44:55] PROBLEM - Apache HTTP on mw1191 is CRITICAL - Socket timeout after 10 seconds [03:44:55] PROBLEM - Apache HTTP on mw1145 is CRITICAL - Socket timeout after 10 seconds [03:44:55] PROBLEM - HHVM busy threads on mw1140 is CRITICAL 60.00% of data above the critical threshold [86.4] [03:44:55] PROBLEM - HHVM rendering on mw1116 is CRITICAL - Socket timeout after 10 seconds [03:45:06] PROBLEM - HHVM rendering on mw1114 is CRITICAL - Socket timeout after 10 seconds [03:45:06] PROBLEM - HHVM rendering on mw1145 is CRITICAL - Socket timeout after 10 seconds [03:45:07] PROBLEM - Apache HTTP on mw1133 is CRITICAL - Socket timeout after 10 seconds [03:45:15] PROBLEM - HHVM rendering on mw1143 is CRITICAL - Socket timeout after 10 seconds [03:45:15] PROBLEM - HHVM rendering on mw1128 is CRITICAL - Socket timeout after 10 seconds [03:45:15] PROBLEM - HHVM rendering on mw1125 is CRITICAL - Socket timeout after 10 seconds [03:45:15] PROBLEM - Apache HTTP on mw1206 is CRITICAL - Socket timeout after 10 seconds [03:45:16] PROBLEM - HHVM rendering on mw1136 is CRITICAL - Socket timeout after 10 seconds [03:45:25] PROBLEM - Apache HTTP on mw1125 is CRITICAL - Socket timeout after 10 seconds [03:45:25] PROBLEM - Apache HTTP on mw1120 is CRITICAL - Socket timeout after 10 seconds [03:45:25] PROBLEM - HHVM rendering on mw1203 is CRITICAL - Socket timeout after 10 seconds [03:45:25] PROBLEM - Apache HTTP on mw1193 is CRITICAL - Socket timeout after 10 seconds [03:45:26] PROBLEM - Apache HTTP on mw1203 is CRITICAL - Socket timeout after 10 seconds [03:45:26] PROBLEM - Apache HTTP on mw1127 is CRITICAL - Socket timeout after 10 seconds [03:45:35] PROBLEM - Apache HTTP on mw1130 is CRITICAL - Socket timeout after 10 seconds [03:45:36] PROBLEM - Apache HTTP on mw1128 is CRITICAL - Socket timeout after 10 seconds [03:45:45] PROBLEM - Apache HTTP on mw1144 is CRITICAL - Socket timeout after 10 seconds [03:45:45] PROBLEM - Apache HTTP on mw1201 is CRITICAL - Socket timeout after 10 seconds [03:45:45] PROBLEM - Apache HTTP on mw1205 is CRITICAL - Socket timeout after 10 
seconds [03:45:45] PROBLEM - Apache HTTP on mw1143 is CRITICAL - Socket timeout after 10 seconds [03:45:46] PROBLEM - HHVM rendering on mw1205 is CRITICAL - Socket timeout after 10 seconds [03:45:46] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL - Socket timeout after 10 seconds [03:45:49] PROBLEM - HHVM rendering on mw1139 is CRITICAL - Socket timeout after 10 seconds [03:45:50] PROBLEM - HHVM rendering on mw1190 is CRITICAL - Socket timeout after 10 seconds [03:45:50] PROBLEM - HHVM rendering on mw1146 is CRITICAL - Socket timeout after 10 seconds [03:45:50] PROBLEM - Apache HTTP on mw1132 is CRITICAL - Socket timeout after 10 seconds [03:45:50] PROBLEM - HHVM rendering on mw1202 is CRITICAL - Socket timeout after 10 seconds [03:45:50] PROBLEM - Apache HTTP on mw1229 is CRITICAL - Socket timeout after 10 seconds [03:45:50] PROBLEM - Apache HTTP on mw1142 is CRITICAL - Socket timeout after 10 seconds [03:45:55] PROBLEM - Apache HTTP on mw1123 is CRITICAL - Socket timeout after 10 seconds [03:45:55] PROBLEM - Apache HTTP on mw1146 is CRITICAL - Socket timeout after 10 seconds [03:45:55] PROBLEM - Apache HTTP on mw1139 is CRITICAL - Socket timeout after 10 seconds [03:45:56] PROBLEM - HHVM rendering on mw1130 is CRITICAL - Socket timeout after 10 seconds [03:45:56] PROBLEM - HHVM rendering on mw1119 is CRITICAL - Socket timeout after 10 seconds [03:45:56] PROBLEM - HHVM rendering on mw1126 is CRITICAL - Socket timeout after 10 seconds [03:46:15] PROBLEM - Apache HTTP on mw1230 is CRITICAL - Socket timeout after 10 seconds [03:46:15] PROBLEM - Apache HTTP on mw1137 is CRITICAL - Socket timeout after 10 seconds [03:46:15] PROBLEM - Apache HTTP on mw1190 is CRITICAL - Socket timeout after 10 seconds [03:46:15] PROBLEM - Apache HTTP on mw1198 is CRITICAL - Socket timeout after 10 seconds [03:46:16] PROBLEM - Apache HTTP on mw1131 is CRITICAL - Socket timeout after 10 seconds [03:46:16] PROBLEM - Apache HTTP on mw1116 is CRITICAL - Socket timeout after 10 seconds [03:46:16] PROBLEM - Apache HTTP on mw1114 is CRITICAL - Socket timeout after 10 seconds [03:46:25] PROBLEM - HHVM rendering on mw1230 is CRITICAL - Socket timeout after 10 seconds [03:46:25] PROBLEM - Apache HTTP on mw1204 is CRITICAL - Socket timeout after 10 seconds [03:46:25] PROBLEM - HHVM rendering on mw1201 is CRITICAL - Socket timeout after 10 seconds [03:46:25] PROBLEM - Apache HTTP on mw1126 is CRITICAL - Socket timeout after 10 seconds [03:46:25] PROBLEM - HHVM queue size on mw1128 is CRITICAL 40.00% of data above the critical threshold [80.0] [03:46:26] PROBLEM - HHVM rendering on mw1200 is CRITICAL - Socket timeout after 10 seconds [03:46:26] PROBLEM - Apache HTTP on mw1202 is CRITICAL - Socket timeout after 10 seconds [03:46:27] PROBLEM - HHVM rendering on mw1137 is CRITICAL - Socket timeout after 10 seconds [03:46:27] PROBLEM - Apache HTTP on mw1129 is CRITICAL - Socket timeout after 10 seconds [03:46:28] PROBLEM - Apache HTTP on mw1195 is CRITICAL - Socket timeout after 10 seconds [03:46:28] PROBLEM - HHVM rendering on mw1142 is CRITICAL - Socket timeout after 10 seconds [03:46:29] PROBLEM - HHVM rendering on mw1147 is CRITICAL - Socket timeout after 10 seconds [03:46:29] PROBLEM - HHVM rendering on mw1198 is CRITICAL - Socket timeout after 10 seconds [03:46:30] PROBLEM - HHVM rendering on mw1196 is CRITICAL - Socket timeout after 10 seconds [03:46:46] PROBLEM - Apache HTTP on mw1207 is CRITICAL - Socket timeout after 10 seconds [03:46:46] PROBLEM - HHVM rendering on mw1229 is CRITICAL - Socket timeout after 
10 seconds [03:46:46] PROBLEM - HHVM rendering on mw1193 is CRITICAL - Socket timeout after 10 seconds [03:46:47] PROBLEM - HHVM rendering on mw1123 is CRITICAL - Socket timeout after 10 seconds [03:46:47] PROBLEM - HHVM rendering on mw1195 is CRITICAL - Socket timeout after 10 seconds [03:46:47] PROBLEM - HHVM rendering on mw1131 is CRITICAL - Socket timeout after 10 seconds [03:46:47] PROBLEM - HHVM rendering on mw1120 is CRITICAL - Socket timeout after 10 seconds [03:46:47] PROBLEM - HHVM rendering on mw1124 is CRITICAL - Socket timeout after 10 seconds [03:46:55] PROBLEM - HHVM rendering on mw1138 is CRITICAL - Socket timeout after 10 seconds [03:46:55] PROBLEM - HHVM rendering on mw1140 is CRITICAL - Socket timeout after 10 seconds [03:46:55] PROBLEM - HHVM rendering on mw1127 is CRITICAL - Socket timeout after 10 seconds [03:46:56] PROBLEM - Apache HTTP on mw1124 is CRITICAL - Socket timeout after 10 seconds [03:47:07] PROBLEM - Apache HTTP on mw1147 is CRITICAL - Socket timeout after 10 seconds [03:47:15] PROBLEM - Apache HTTP on mw1140 is CRITICAL - Socket timeout after 10 seconds [03:47:26] PROBLEM - HHVM queue size on mw1137 is CRITICAL 80.00% of data above the critical threshold [80.0] [03:47:26] PROBLEM - HHVM rendering on mw1148 is CRITICAL - Socket timeout after 10 seconds [03:47:27] PROBLEM - HHVM rendering on mw1134 is CRITICAL - Socket timeout after 10 seconds [03:47:36] PROBLEM - Apache HTTP on mw1148 is CRITICAL - Socket timeout after 10 seconds [03:47:36] PROBLEM - Apache HTTP on mw1134 is CRITICAL - Socket timeout after 10 seconds [03:47:45] PROBLEM - HHVM rendering on mw1207 is CRITICAL - Socket timeout after 10 seconds [03:47:45] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.107 second response time [03:47:45] PROBLEM - HHVM rendering on mw1192 is CRITICAL - Socket timeout after 10 seconds [03:47:46] PROBLEM - Apache HTTP on mw1115 is CRITICAL - Socket timeout after 10 seconds [03:47:55] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 1.648 second response time [03:47:55] PROBLEM - Apache HTTP on mw1192 is CRITICAL - Socket timeout after 10 seconds [03:47:56] PROBLEM - Apache HTTP on mw1135 is CRITICAL - Socket timeout after 10 seconds [03:48:05] PROBLEM - HHVM rendering on mw1117 is CRITICAL - Socket timeout after 10 seconds [03:48:05] PROBLEM - HHVM rendering on mw1191 is CRITICAL - Socket timeout after 10 seconds [03:48:06] PROBLEM - Apache HTTP on mw1117 is CRITICAL - Socket timeout after 10 seconds [03:48:07] PROBLEM - HHVM queue size on mw1117 is CRITICAL 60.00% of data above the critical threshold [80.0] [03:48:16] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.137 second response time [03:48:16] PROBLEM - HHVM busy threads on mw1189 is CRITICAL 60.00% of data above the critical threshold [115.2] [03:48:35] PROBLEM - HHVM rendering on mw1115 is CRITICAL - Socket timeout after 10 seconds [03:48:35] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 3 below the confidence bounds [03:48:46] PROBLEM - Apache HTTP on mw1199 is CRITICAL - Socket timeout after 10 seconds [03:48:55] PROBLEM - Apache HTTP on mw1197 is CRITICAL - Socket timeout after 10 seconds [03:48:56] RECOVERY - LVS HTTP IPv4 on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 16531 bytes in 0.118 second response time [03:49:05] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 
440 bytes in 0.046 second response time [03:49:06] PROBLEM - HHVM rendering on mw1199 is CRITICAL - Socket timeout after 10 seconds [03:49:41] meh? [03:50:26] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.155 second response time [03:50:45] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.183 second response time [03:50:46] PROBLEM - HHVM queue size on mw1137 is CRITICAL 100.00% of data above the critical threshold [80.0] [03:50:55] RECOVERY - HHVM rendering on mw1130 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.151 second response time [03:50:55] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 1.479 second response time [03:51:07] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.077 second response time [03:51:15] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.946 second response time [03:51:15] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.098 second response time [03:51:15] PROBLEM - HHVM busy threads on mw1231 is CRITICAL 40.00% of data above the critical threshold [115.2] [03:51:16] RECOVERY - HHVM rendering on mw1121 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.220 second response time [03:51:16] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.460 second response time [03:51:16] RECOVERY - HHVM rendering on mw1133 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 2.224 second response time [03:51:16] RECOVERY - HHVM rendering on mw1191 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 4.112 second response time [03:51:17] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 69381 bytes in 2.100 second response time [03:51:17] RECOVERY - HHVM rendering on mw1196 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 2.356 second response time [03:51:18] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.819 second response time [03:51:18] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.577 second response time [03:51:25] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 7.448 second response time [03:51:26] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.048 second response time [03:51:26] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.375 second response time [03:51:26] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.371 second response time [03:51:26] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.405 second response time [03:51:35] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.158 second response time [03:51:36] PROBLEM - HHVM busy threads on mw1189 is CRITICAL 60.00% of data above the critical threshold [115.2] [03:51:45] RECOVERY - HHVM rendering on mw1131 is OK: HTTP OK: HTTP/1.1 200 OK - 69381 bytes in 5.380 second response time [03:51:45] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.505 second response time [03:51:45] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.055 second response 
time [03:51:45] RECOVERY - HHVM rendering on mw1127 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 1.024 second response time [03:51:46] RECOVERY - HHVM rendering on mw1123 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 9.028 second response time [03:51:46] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 3.401 second response time [03:51:46] RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 9.781 second response time [03:51:47] RECOVERY - HHVM rendering on mw1124 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 9.768 second response time [03:51:47] RECOVERY - HHVM rendering on mw1136 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.235 second response time [03:51:48] RECOVERY - HHVM rendering on mw1128 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 6.389 second response time [03:51:48] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.931 second response time [03:51:55] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.373 second response time [03:51:55] RECOVERY - HHVM rendering on mw1138 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 8.107 second response time [03:51:55] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 8.372 second response time [03:51:55] RECOVERY - HHVM rendering on mw1125 is OK: HTTP OK: HTTP/1.1 200 OK - 69381 bytes in 8.751 second response time [03:51:56] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.058 second response time [03:51:56] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.182 second response time [03:51:56] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.060 second response time [03:51:57] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.337 second response time [03:51:57] RECOVERY - HHVM rendering on mw1203 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 2.375 second response time [03:51:58] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.152 second response time [03:52:05] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.390 second response time [03:52:05] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [03:52:05] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.819 second response time [03:52:05] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.565 second response time [03:52:15] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.073 second response time [03:52:15] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.099 second response time [03:52:16] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.179 second response time [03:52:16] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.285 second response time [03:52:16] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.069 second response time [03:52:16] RECOVERY - HHVM rendering on mw1205 is OK: HTTP OK: 
HTTP/1.1 200 OK - 69373 bytes in 0.734 second response time [03:52:16] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.161 second response time [03:52:17] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.211 second response time [03:52:17] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 69380 bytes in 0.238 second response time [03:52:18] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.223 second response time [03:52:18] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.206 second response time [03:52:19] RECOVERY - HHVM rendering on mw1134 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.200 second response time [03:52:19] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 69381 bytes in 0.231 second response time [03:52:25] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.074 second response time [03:52:25] PROBLEM - HHVM queue size on mw1139 is CRITICAL 100.00% of data above the critical threshold [80.0] [03:52:26] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.063 second response time [03:52:26] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.071 second response time [03:52:26] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.071 second response time [03:52:26] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.077 second response time [03:52:26] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.074 second response time [03:52:35] RECOVERY - HHVM rendering on mw1192 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.170 second response time [03:52:35] RECOVERY - HHVM rendering on mw1129 is OK: HTTP OK: HTTP/1.1 200 OK - 69380 bytes in 0.218 second response time [03:52:35] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.226 second response time [03:52:35] RECOVERY - HHVM rendering on mw1144 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.241 second response time [03:52:36] RECOVERY - HHVM rendering on mw1126 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.603 second response time [03:52:36] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.074 second response time [03:52:45] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.056 second response time [03:52:45] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [03:52:45] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.049 second response time [03:52:46] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.071 second response time [03:52:46] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.084 second response time [03:52:46] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.076 second response time [03:52:46] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.036 second response time [03:52:47] RECOVERY - Apache HTTP on mw1204 
is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [03:52:47] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [03:52:55] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.066 second response time [03:52:55] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.084 second response time [03:52:55] RECOVERY - HHVM rendering on mw1200 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.185 second response time [03:52:55] RECOVERY - HHVM rendering on mw1137 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.210 second response time [03:52:55] RECOVERY - HHVM rendering on mw1117 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.233 second response time [03:52:56] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.180 second response time [03:52:56] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.169 second response time [03:52:57] RECOVERY - HHVM rendering on mw1147 is OK: HTTP OK: HTTP/1.1 200 OK - 69380 bytes in 0.226 second response time [03:52:57] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.146 second response time [03:52:58] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.246 second response time [03:52:58] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.270 second response time [03:52:59] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.108 second response time [03:52:59] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.193 second response time [03:53:05] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.088 second response time [03:53:06] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.068 second response time [03:53:06] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.074 second response time [03:53:16] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.146 second response time [03:53:16] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.187 second response time [03:53:16] RECOVERY - HHVM rendering on mw1120 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.196 second response time [03:53:16] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 69381 bytes in 0.205 second response time [03:53:16] PROBLEM - HHVM queue size on mw1147 is CRITICAL 80.00% of data above the critical threshold [80.0] [03:54:08] what was that mess? [03:55:26] PROBLEM - HHVM queue size on mw1140 is CRITICAL 60.00% of data above the critical threshold [80.0] [03:55:26] PROBLEM - HHVM queue size on mw1230 is CRITICAL 33.33% of data above the critical threshold [80.0] [03:55:56] PROBLEM - HHVM busy threads on mw1195 is CRITICAL 40.00% of data above the critical threshold [115.2] [03:56:06] PROBLEM - HHVM busy threads on mw1121 is CRITICAL 40.00% of data above the critical threshold [86.4] [03:56:19] ^d: logstash fatalmonitor seems to be full of "Search backend error during full_text search for ..." 
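The HHVM "busy threads" and "queue size" alerts above fire on a percentage-of-datapoints rule: the check looks at recent samples and goes CRITICAL when more than a given fraction of them exceed the threshold, which is what "40.00% of data above the critical threshold [86.4]" means (the real check also distinguishes warning and critical thresholds, which is why the OK lines show a different bracketed value). A simplified sketch of that evaluation with made-up sample data; the production check reads the series from Graphite rather than taking a list:

```python
def percent_over(samples, threshold):
    """Fraction of datapoints strictly above the threshold, as a percentage."""
    if not samples:
        return 0.0
    over = sum(1 for value in samples if value > threshold)
    return 100.0 * over / len(samples)

def evaluate(samples, critical_threshold, critical_percent=30.0):
    pct = percent_over(samples, critical_threshold)
    if pct > critical_percent:
        return f"CRITICAL {pct:.2f}% of data above the critical threshold [{critical_threshold}]"
    return f"OK Less than {critical_percent:.2f}% above the threshold [{critical_threshold}]"

# Hypothetical busy-thread samples for one appserver over the check window.
print(evaluate([40, 95, 120, 130, 60], critical_threshold=86.4))
```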
[03:56:35] PROBLEM - HHVM busy threads on mw1140 is CRITICAL 80.00% of data above the critical threshold [86.4] [03:56:35] PROBLEM - HHVM busy threads on mw1145 is CRITICAL 40.00% of data above the critical threshold [86.4] [03:56:53] All the failing searches seem to be for 'Sju svarta be-hå (1954 film)' [03:57:06] RECOVERY - HHVM queue size on mw1230 is OK Less than 30.00% above the threshold [10.0] [03:57:15] PROBLEM - HHVM busy threads on mw1122 is CRITICAL 60.00% of data above the critical threshold [86.4] [03:57:36] PROBLEM - HHVM queue size on mw1199 is CRITICAL 40.00% of data above the critical threshold [80.0] [03:57:36] RECOVERY - HHVM busy threads on mw1195 is OK Less than 30.00% above the threshold [76.8] [03:57:46] RECOVERY - HHVM busy threads on mw1231 is OK Less than 30.00% above the threshold [76.8] [03:58:05] PROBLEM - HHVM busy threads on mw1137 is CRITICAL 40.00% of data above the critical threshold [86.4] [03:58:06] PROBLEM - HHVM busy threads on mw1221 is CRITICAL 40.00% of data above the critical threshold [115.2] [03:58:15] RECOVERY - HHVM busy threads on mw1145 is OK Less than 30.00% above the threshold [57.6] [03:58:15] RECOVERY - HHVM queue size on mw1147 is OK Less than 30.00% above the threshold [10.0] [03:58:35] PROBLEM - HHVM queue size on mw1206 is CRITICAL 40.00% of data above the critical threshold [80.0] [03:58:36] PROBLEM - HHVM busy threads on mw1134 is CRITICAL 40.00% of data above the critical threshold [86.4] [03:58:43] connection errors to 10.64.48.19 and 10.64.48.25 too [03:58:45] PROBLEM - HHVM busy threads on mw1208 is CRITICAL 33.33% of data above the critical threshold [115.2] [03:59:05] RECOVERY - HHVM queue size on mw1137 is OK Less than 30.00% above the threshold [10.0] [03:59:16] RECOVERY - HHVM queue size on mw1199 is OK Less than 30.00% above the threshold [10.0] [03:59:26] PROBLEM - HHVM busy threads on mw1201 is CRITICAL 40.00% of data above the critical threshold [115.2] [03:59:26] RECOVERY - HHVM busy threads on mw1121 is OK Less than 30.00% above the threshold [57.6] [03:59:46] RECOVERY - HHVM busy threads on mw1137 is OK Less than 30.00% above the threshold [57.6] [03:59:46] RECOVERY - HHVM busy threads on mw1221 is OK Less than 30.00% above the threshold [76.8] [03:59:55] PROBLEM - HHVM busy threads on mw1140 is CRITICAL 40.00% of data above the critical threshold [86.4] [03:59:56] RECOVERY - HHVM busy threads on mw1189 is OK Less than 30.00% above the threshold [76.8] [04:00:16] RECOVERY - HHVM queue size on mw1206 is OK Less than 30.00% above the threshold [10.0] [04:00:26] RECOVERY - HHVM busy threads on mw1134 is OK Less than 30.00% above the threshold [57.6] [04:00:26] RECOVERY - HHVM queue size on mw1140 is OK Less than 30.00% above the threshold [10.0] [04:00:26] RECOVERY - HHVM busy threads on mw1208 is OK Less than 30.00% above the threshold [76.8] [04:00:37] RECOVERY - HHVM busy threads on mw1122 is OK Less than 30.00% above the threshold [57.6] [04:01:15] RECOVERY - HHVM busy threads on mw1148 is OK Less than 30.00% above the threshold [57.6] [04:01:16] RECOVERY - HHVM busy threads on mw1201 is OK Less than 30.00% above the threshold [76.8] [04:01:16] RECOVERY - HHVM busy threads on mw1206 is OK Less than 30.00% above the threshold [76.8] [04:01:25] RECOVERY - HHVM queue size on mw1128 is OK Less than 30.00% above the threshold [10.0] [04:01:36] RECOVERY - HHVM queue size on mw1117 is OK Less than 30.00% above the threshold [10.0] [04:02:26] RECOVERY - HHVM queue size on mw1139 is OK Less than 30.00% above the threshold 
[10.0] [04:02:36] RECOVERY - HHVM busy threads on mw1202 is OK Less than 30.00% above the threshold [76.8] [04:03:16] RECOVERY - HHVM busy threads on mw1140 is OK Less than 30.00% above the threshold [57.6] [04:08:16] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [04:16:56] PROBLEM - HHVM busy threads on mw1134 is CRITICAL 40.00% of data above the critical threshold [86.4] [04:18:56] PROBLEM - HHVM busy threads on mw1122 is CRITICAL 33.33% of data above the critical threshold [86.4] [04:18:56] PROBLEM - HHVM busy threads on mw1142 is CRITICAL 60.00% of data above the critical threshold [86.4] [04:19:16] PROBLEM - HHVM rendering on mw1192 is CRITICAL - Socket timeout after 10 seconds [04:19:36] PROBLEM - HHVM rendering on mw1137 is CRITICAL - Socket timeout after 10 seconds [04:19:36] PROBLEM - Apache HTTP on mw1126 is CRITICAL - Socket timeout after 10 seconds [04:19:56] PROBLEM - HHVM queue size on mw1121 is CRITICAL 40.00% of data above the critical threshold [80.0] [04:20:26] PROBLEM - HHVM rendering on mw1203 is CRITICAL - Socket timeout after 10 seconds [04:20:26] PROBLEM - Apache HTTP on mw1203 is CRITICAL - Socket timeout after 10 seconds [04:20:46] PROBLEM - HHVM rendering on mw1148 is CRITICAL - Socket timeout after 10 seconds [04:20:46] PROBLEM - HHVM rendering on mw1139 is CRITICAL - Socket timeout after 10 seconds [04:20:56] PROBLEM - Apache HTTP on mw1148 is CRITICAL - Socket timeout after 10 seconds [04:20:56] PROBLEM - Apache HTTP on mw1123 is CRITICAL - Socket timeout after 10 seconds [04:20:56] PROBLEM - Apache HTTP on mw1139 is CRITICAL - Socket timeout after 10 seconds [04:20:56] PROBLEM - Apache HTTP on mw1134 is CRITICAL - Socket timeout after 10 seconds [04:21:06] PROBLEM - Apache HTTP on mw1115 is CRITICAL - Socket timeout after 10 seconds [04:21:06] PROBLEM - HHVM rendering on mw1119 is CRITICAL - Socket timeout after 10 seconds [04:21:06] PROBLEM - HHVM rendering on mw1126 is CRITICAL - Socket timeout after 10 seconds [04:21:15] PROBLEM - Apache HTTP on mw1192 is CRITICAL - Socket timeout after 10 seconds [04:21:15] PROBLEM - Apache HTTP on mw1137 is CRITICAL - Socket timeout after 10 seconds [04:21:15] PROBLEM - Apache HTTP on mw1198 is CRITICAL - Socket timeout after 10 seconds [04:21:16] PROBLEM - Apache HTTP on mw1119 is CRITICAL - Socket timeout after 10 seconds [04:21:17] PROBLEM - HHVM rendering on mw1228 is CRITICAL - Socket timeout after 10 seconds [04:21:25] PROBLEM - Apache HTTP on mw1195 is CRITICAL: Connection timed out [04:21:25] PROBLEM - Apache HTTP on mw1116 is CRITICAL - Socket timeout after 10 seconds [04:21:25] PROBLEM - Apache HTTP on mw1129 is CRITICAL - Socket timeout after 10 seconds [04:21:25] PROBLEM - HHVM rendering on mw1117 is CRITICAL - Socket timeout after 10 seconds [04:21:26] PROBLEM - HHVM rendering on mw1142 is CRITICAL - Socket timeout after 10 seconds [04:21:26] PROBLEM - HHVM rendering on mw1198 is CRITICAL - Socket timeout after 10 seconds [04:21:26] PROBLEM - HHVM rendering on mw1147 is CRITICAL - Socket timeout after 10 seconds [04:21:27] PROBLEM - HHVM rendering on mw1196 is CRITICAL - Socket timeout after 10 seconds [04:21:35] PROBLEM - Apache HTTP on mw1117 is CRITICAL - Socket timeout after 10 seconds [04:21:36] PROBLEM - HHVM rendering on mw1116 is CRITICAL - Socket timeout after 10 seconds [04:21:45] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 33.33% of data above the critical threshold [500.0] [04:21:46] PROBLEM - HHVM rendering on mw1131 is CRITICAL - Socket 
timeout after 10 seconds [04:21:47] PROBLEM - HHVM rendering on mw1145 is CRITICAL - Socket timeout after 10 seconds [04:21:47] PROBLEM - HHVM rendering on mw1195 is CRITICAL: Connection timed out [04:21:55] PROBLEM - Apache HTTP on mw1206 is CRITICAL - Socket timeout after 10 seconds [04:21:56] PROBLEM - HHVM rendering on mw1140 is CRITICAL - Socket timeout after 10 seconds [04:21:56] PROBLEM - HHVM rendering on mw1127 is CRITICAL - Socket timeout after 10 seconds [04:21:56] PROBLEM - HHVM rendering on mw1115 is CRITICAL - Socket timeout after 10 seconds [04:22:06] PROBLEM - Apache HTTP on mw1120 is CRITICAL: Connection timed out [04:22:06] PROBLEM - HHVM busy threads on mw1138 is CRITICAL 40.00% of data above the critical threshold [86.4] [04:22:15] PROBLEM - Apache HTTP on mw1147 is CRITICAL - Socket timeout after 10 seconds [04:22:15] PROBLEM - Apache HTTP on mw1127 is CRITICAL - Socket timeout after 10 seconds [04:22:15] PROBLEM - Apache HTTP on mw1140 is CRITICAL - Socket timeout after 10 seconds [04:22:26] PROBLEM - Apache HTTP on mw1144 is CRITICAL - Socket timeout after 10 seconds [04:22:35] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.719 second response time [04:22:35] PROBLEM - HHVM rendering on mw1134 is CRITICAL - Socket timeout after 10 seconds [04:22:35] PROBLEM - HHVM rendering on mw1202 is CRITICAL - Socket timeout after 10 seconds [04:22:35] PROBLEM - HHVM rendering on mw1146 is CRITICAL - Socket timeout after 10 seconds [04:22:35] PROBLEM - Apache HTTP on mw1142 is CRITICAL - Socket timeout after 10 seconds [04:22:36] PROBLEM - Apache HTTP on mw1146 is CRITICAL - Socket timeout after 10 seconds [04:22:45] PROBLEM - HHVM queue size on mw1148 is CRITICAL 60.00% of data above the critical threshold [80.0] [04:22:46] PROBLEM - HHVM rendering on mw1144 is CRITICAL - Socket timeout after 10 seconds [04:22:46] PROBLEM - HHVM rendering on mw1129 is CRITICAL - Socket timeout after 10 seconds [04:22:55] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.531 second response time [04:22:56] PROBLEM - HHVM queue size on mw1123 is CRITICAL 60.00% of data above the critical threshold [80.0] [04:22:57] PROBLEM - Apache HTTP on mw1131 is CRITICAL - Socket timeout after 10 seconds [04:22:57] RECOVERY - HHVM rendering on mw1147 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.258 second response time [04:23:05] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 4.019 second response time [04:23:05] PROBLEM - Apache HTTP on mw1202 is CRITICAL - Socket timeout after 10 seconds [04:23:15] PROBLEM - HHVM rendering on mw1206 is CRITICAL - Socket timeout after 10 seconds [04:23:17] PROBLEM - HHVM rendering on mw1135 is CRITICAL - Socket timeout after 10 seconds [04:23:17] PROBLEM - Apache HTTP on mw1145 is CRITICAL - Socket timeout after 10 seconds [04:23:27] PROBLEM - HHVM rendering on mw1120 is CRITICAL - Socket timeout after 10 seconds [04:23:35] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 4.072 second response time [04:23:35] PROBLEM - HHVM rendering on mw1114 is CRITICAL - Socket timeout after 10 seconds [04:23:36] PROBLEM - HHVM queue size on mw1231 is CRITICAL 33.33% of data above the critical threshold [80.0] [04:23:42] uhhh [04:23:45] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.073 second response time [04:23:46] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 
301 Moved Permanently - 440 bytes in 5.533 second response time [04:23:46] RECOVERY - HHVM rendering on mw1203 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 5.671 second response time [04:23:55] PROBLEM - Apache HTTP on mw1130 is CRITICAL - Socket timeout after 10 seconds [04:24:05] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.072 second response time [04:24:06] PROBLEM - HHVM busy threads on mw1228 is CRITICAL 80.00% of data above the critical threshold [115.2] [04:24:15] PROBLEM - Apache HTTP on mw1132 is CRITICAL - Socket timeout after 10 seconds [04:24:15] RECOVERY - HHVM rendering on mw1134 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 9.446 second response time [04:24:16] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.064 second response time [04:24:16] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.033 second response time [04:24:26] PROBLEM - HHVM rendering on mw1130 is CRITICAL - Socket timeout after 10 seconds [04:24:32] we're aware [04:24:35] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.875 second response time [04:24:37] PROBLEM - Apache HTTP on mw1135 is CRITICAL - Socket timeout after 10 seconds [04:24:45] RECOVERY - HHVM rendering on mw1137 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 3.001 second response time [04:24:45] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.910 second response time [04:24:45] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.448 second response time [04:24:46] PROBLEM - Apache HTTP on mw1114 is CRITICAL - Socket timeout after 10 seconds [04:24:46] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 69380 bytes in 3.113 second response time [04:24:46] PROBLEM - HHVM rendering on mw1133 is CRITICAL - Socket timeout after 10 seconds [04:24:47] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.508 second response time [04:24:55] PROBLEM - HHVM rendering on mw1132 is CRITICAL - Socket timeout after 10 seconds [04:24:56] PROBLEM - Apache HTTP on mw1191 is CRITICAL - Socket timeout after 10 seconds [04:24:56] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 3.571 second response time [04:24:56] PROBLEM - Apache HTTP on mw1138 is CRITICAL - Socket timeout after 10 seconds [04:24:56] PROBLEM - HHVM queue size on mw1117 is CRITICAL 80.00% of data above the critical threshold [80.0] [04:24:56] PROBLEM - Apache HTTP on mw1196 is CRITICAL - Socket timeout after 10 seconds [04:25:05] RECOVERY - HHVM rendering on mw1120 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 3.064 second response time [04:25:06] RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.775 second response time [04:25:06] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.726 second response time [04:25:16] PROBLEM - Apache HTTP on mw1133 is CRITICAL - Socket timeout after 10 seconds [04:25:17] PROBLEM - HHVM rendering on mw1128 is CRITICAL - Socket timeout after 10 seconds [04:25:17] PROBLEM - HHVM rendering on mw1138 is CRITICAL - Socket timeout after 10 seconds [04:25:17] PROBLEM - HHVM rendering on mw1143 is CRITICAL - Socket timeout after 10 seconds [04:25:25] PROBLEM - HHVM rendering on mw1136 is 
CRITICAL - Socket timeout after 10 seconds [04:25:26] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.629 second response time [04:25:26] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.560 second response time [04:25:26] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.086 second response time [04:25:26] PROBLEM - Apache HTTP on mw1125 is CRITICAL - Socket timeout after 10 seconds [04:25:35] PROBLEM - Apache HTTP on mw1197 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 3.062 second response time [04:25:45] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.605 second response time [04:25:46] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.076 second response time [04:25:46] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.205 second response time [04:25:46] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.237 second response time [04:25:46] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 69381 bytes in 2.753 second response time [04:25:47] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.076 second response time [04:25:47] !log Did a cluster-wide 'service hhvm restart'. [04:25:51] Logged the message, Master [04:25:55] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.498 second response time [04:25:56] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.913 second response time [04:25:56] RECOVERY - HHVM rendering on mw1126 is OK: HTTP OK: HTTP/1.1 200 OK - 69380 bytes in 0.346 second response time [04:25:56] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 69381 bytes in 0.326 second response time [04:25:57] RECOVERY - HHVM rendering on mw1129 is OK: HTTP OK: HTTP/1.1 200 OK - 69380 bytes in 0.258 second response time [04:25:57] RECOVERY - HHVM rendering on mw1192 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.325 second response time [04:25:57] RECOVERY - HHVM rendering on mw1144 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.509 second response time [04:25:57] RECOVERY - HHVM rendering on mw1130 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.547 second response time [04:26:06] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [04:26:06] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.079 second response time [04:26:06] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.080 second response time [04:26:15] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.807 second response time [04:26:16] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.644 second response time [04:26:16] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.106 second response time [04:26:16] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.127 second response time [04:26:16] RECOVERY - HHVM rendering on mw1133 is OK: HTTP OK: HTTP/1.1 200 
OK - 69381 bytes in 0.587 second response time [04:26:17] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.111 second response time [04:26:25] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.213 second response time [04:26:25] RECOVERY - HHVM rendering on mw1196 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.223 second response time [04:26:26] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.523 second response time [04:26:26] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.292 second response time [04:26:26] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.048 second response time [04:26:26] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.448 second response time [04:26:26] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.046 second response time [04:26:35] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 69374 bytes in 6.836 second response time [04:26:36] RECOVERY - HHVM rendering on mw1131 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.691 second response time [04:26:45] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.504 second response time [04:26:45] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 69380 bytes in 0.252 second response time [04:26:46] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.070 second response time [04:26:46] RECOVERY - HHVM rendering on mw1128 is OK: HTTP OK: HTTP/1.1 200 OK - 69380 bytes in 0.580 second response time [04:26:46] RECOVERY - HHVM rendering on mw1127 is OK: HTTP OK: HTTP/1.1 200 OK - 69381 bytes in 0.422 second response time [04:26:46] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.437 second response time [04:26:46] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 69381 bytes in 0.423 second response time [04:26:47] RECOVERY - HHVM rendering on mw1138 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.541 second response time [04:26:55] RECOVERY - HHVM rendering on mw1136 is OK: HTTP OK: HTTP/1.1 200 OK - 69381 bytes in 0.570 second response time [04:26:56] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.087 second response time [04:27:06] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.300 second response time [04:27:06] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.173 second response time [04:27:26] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 69373 bytes in 0.366 second response time [04:27:46] PROBLEM - HHVM queue size on mw1199 is CRITICAL 40.00% of data above the critical threshold [80.0] [04:27:56] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.042 second response time [04:27:56] PROBLEM - HHVM busy threads on mw1231 is CRITICAL 40.00% of data above the critical threshold [115.2] [04:27:57] RECOVERY - HHVM rendering on mw1117 is OK: HTTP OK: HTTP/1.1 200 OK - 69381 bytes in 0.678 second response time [04:28:26] PROBLEM - HHVM busy threads on mw1135 is CRITICAL 40.00% of data above the critical threshold 
[86.4] [04:28:26] PROBLEM - HHVM busy threads on mw1207 is CRITICAL 40.00% of data above the critical threshold [115.2] [04:28:36] PROBLEM - HHVM busy threads on mw1119 is CRITICAL 60.00% of data above the critical threshold [86.4] [04:28:45] RECOVERY - HHVM queue size on mw1231 is OK Less than 30.00% above the threshold [10.0] [04:28:46] PROBLEM - HHVM busy threads on mw1205 is CRITICAL 60.00% of data above the critical threshold [115.2] [04:28:46] PROBLEM - HHVM busy threads on mw1128 is CRITICAL 40.00% of data above the critical threshold [86.4] [04:28:56] PROBLEM - HHVM queue size on mw1196 is CRITICAL 60.00% of data above the critical threshold [80.0] [04:29:15] PROBLEM - HHVM queue size on mw1139 is CRITICAL 40.00% of data above the critical threshold [80.0] [04:29:26] PROBLEM - HHVM busy threads on mw1117 is CRITICAL 40.00% of data above the critical threshold [86.4] [04:29:26] PROBLEM - HHVM busy threads on mw1195 is CRITICAL 40.00% of data above the critical threshold [115.2] [04:29:27] RECOVERY - HHVM queue size on mw1199 is OK Less than 30.00% above the threshold [10.0] [04:29:37] PROBLEM - HHVM busy threads on mw1229 is CRITICAL 40.00% of data above the critical threshold [115.2] [04:29:56] PROBLEM - HHVM queue size on mw1117 is CRITICAL 60.00% of data above the critical threshold [80.0] [04:29:56] PROBLEM - HHVM busy threads on mw1198 is CRITICAL 60.00% of data above the critical threshold [115.2] [04:30:06] PROBLEM - HHVM busy threads on mw1145 is CRITICAL 40.00% of data above the critical threshold [86.4] [04:30:06] PROBLEM - HHVM busy threads on mw1189 is CRITICAL 40.00% of data above the critical threshold [115.2] [04:30:07] PROBLEM - HHVM busy threads on mw1203 is CRITICAL 40.00% of data above the critical threshold [115.2] [04:30:07] RECOVERY - HHVM busy threads on mw1207 is OK Less than 30.00% above the threshold [76.8] [04:30:26] PROBLEM - HHVM busy threads on mw1138 is CRITICAL 40.00% of data above the critical threshold [86.4] [04:30:26] RECOVERY - HHVM busy threads on mw1134 is OK Less than 30.00% above the threshold [57.6] [04:30:46] PROBLEM - HHVM queue size on mw1122 is CRITICAL 40.00% of data above the critical threshold [80.0] [04:30:55] PROBLEM - HHVM busy threads on mw1143 is CRITICAL 33.33% of data above the critical threshold [86.4] [04:30:55] RECOVERY - HHVM queue size on mw1139 is OK Less than 30.00% above the threshold [10.0] [04:30:56] PROBLEM - HHVM busy threads on mw1126 is CRITICAL 40.00% of data above the critical threshold [86.4] [04:31:03] feel free to ping me in case we should send out a tweet per https://wikitech.wikimedia.org/wiki/Incident_response#Communicating_with_the_public [04:31:06] RECOVERY - HHVM busy threads on mw1117 is OK Less than 30.00% above the threshold [57.6] [04:31:06] RECOVERY - HHVM busy threads on mw1195 is OK Less than 30.00% above the threshold [76.8] [04:31:16] RECOVERY - HHVM queue size on mw1123 is OK Less than 30.00% above the threshold [10.0] [04:31:17] RECOVERY - HHVM busy threads on mw1229 is OK Less than 30.00% above the threshold [76.8] [04:31:36] PROBLEM - HHVM queue size on mw1126 is CRITICAL 40.00% of data above the critical threshold [80.0] [04:31:45] (came here because i noticed API errors disrupting citoid etc. 
while editing, 20-30 min ago) [04:31:45] PROBLEM - HHVM busy threads on mw1140 is CRITICAL 40.00% of data above the critical threshold [86.4] [04:31:46] RECOVERY - HHVM busy threads on mw1189 is OK Less than 30.00% above the threshold [76.8] [04:31:46] RECOVERY - HHVM busy threads on mw1135 is OK Less than 30.00% above the threshold [57.6] [04:31:46] RECOVERY - HHVM busy threads on mw1203 is OK Less than 30.00% above the threshold [76.8] [04:32:04] !log ori Synchronized php-1.26wmf1/includes/filerepo/file/LocalFile.php: Short-circuit LocalFile::loadExtraFromDB in attempt to mitigate outage (duration: 00m 14s) [04:32:06] RECOVERY - HHVM busy threads on mw1205 is OK Less than 30.00% above the threshold [76.8] [04:32:06] RECOVERY - HHVM busy threads on mw1138 is OK Less than 30.00% above the threshold [57.6] [04:32:06] RECOVERY - HHVM busy threads on mw1128 is OK Less than 30.00% above the threshold [57.6] [04:32:09] Logged the message, Master [04:32:16] RECOVERY - HHVM queue size on mw1196 is OK Less than 30.00% above the threshold [10.0] [04:32:26] RECOVERY - HHVM busy threads on mw1122 is OK Less than 30.00% above the threshold [57.6] [04:32:26] RECOVERY - HHVM queue size on mw1122 is OK Less than 30.00% above the threshold [10.0] [04:32:26] RECOVERY - HHVM busy threads on mw1142 is OK Less than 30.00% above the threshold [57.6] [04:32:35] RECOVERY - HHVM busy threads on mw1228 is OK Less than 30.00% above the threshold [76.8] [04:32:35] RECOVERY - HHVM busy threads on mw1143 is OK Less than 30.00% above the threshold [57.6] [04:32:46] RECOVERY - HHVM queue size on mw1148 is OK Less than 30.00% above the threshold [10.0] [04:32:56] RECOVERY - HHVM busy threads on mw1231 is OK Less than 30.00% above the threshold [76.8] [04:33:16] RECOVERY - HHVM queue size on mw1126 is OK Less than 30.00% above the threshold [10.0] [04:33:16] RECOVERY - HHVM queue size on mw1117 is OK Less than 30.00% above the threshold [10.0] [04:33:16] RECOVERY - HHVM busy threads on mw1198 is OK Less than 30.00% above the threshold [76.8] [04:33:25] RECOVERY - HHVM busy threads on mw1140 is OK Less than 30.00% above the threshold [57.6] [04:33:26] RECOVERY - HHVM busy threads on mw1145 is OK Less than 30.00% above the threshold [57.6] [04:33:26] RECOVERY - HHVM queue size on mw1121 is OK Less than 30.00% above the threshold [10.0] [04:33:35] RECOVERY - HHVM busy threads on mw1119 is OK Less than 30.00% above the threshold [57.6] [04:33:56] PROBLEM - HHVM queue size on mw1136 is CRITICAL 40.00% of data above the critical threshold [80.0] [04:34:16] RECOVERY - HHVM busy threads on mw1126 is OK Less than 30.00% above the threshold [57.6] [04:35:36] RECOVERY - HHVM queue size on mw1136 is OK Less than 30.00% above the threshold [10.0] [04:42:49] !log killing LocalFile::loadExtraFromDB wholesale on s4 [04:42:55] Logged the message, Master [04:43:36] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [04:46:47] PROBLEM - HHVM busy threads on mw1130 is CRITICAL 50.00% of data above the critical threshold [86.4] [04:47:54] !log ori Synchronized php-1.26wmf2/includes/filerepo/file/LocalFile.php: Short-circuit LocalFile::loadExtraFromDB in attempt to mitigate outage (duration: 00m 12s) [04:47:58] Logged the message, Master [04:48:21] ^^ springle [04:48:41] ok [04:50:47] PROBLEM - HHVM busy threads on mw1131 is CRITICAL 80.00% of data above the critical threshold [86.4] [04:52:37] PROBLEM - HHVM busy threads on mw1192 is CRITICAL 40.00% of data above the critical threshold [115.2] 
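For context on the 04:32 !log entry above: the mitigation was a live hack that makes LocalFile::loadExtraFromDB() return early, so that expensive img_metadata reads stop hammering the s4 database slaves (the same method is "killed wholesale on s4" a few minutes later). The patch that was actually synced from tin is not shown in the log; the snippet below is only a minimal, self-contained sketch of the short-circuit pattern, with simplified class and property names standing in for MediaWiki's real LocalFile internals.

```php
<?php
// Sketch only: NOT the patch that was synced. It illustrates the
// "short-circuit a lazy DB loader" pattern described in the !log message,
// using simplified stand-in names rather than MediaWiki's actual LocalFile.
class ExampleFile {
	/** @var array|null Lazily loaded extra metadata (e.g. img_metadata) */
	private $extraData = null;

	private function loadExtraFromDB(): array {
		// Live hack: return a cheap default instead of querying the database,
		// so a flood of metadata loads cannot saturate the DB slaves.
		return [];
		// The original, expensive SELECT against the image table would run
		// here; it is left unreachable for the duration of the incident.
	}

	public function getExtraData(): array {
		if ( $this->extraData === null ) {
			$this->extraData = $this->loadExtraFromDB();
		}
		return $this->extraData;
	}
}
```

As the later !log entries show, the same hack was applied to php-1.26wmf2 at 04:47, undone as a no-op on the inactive 1.26wmf1 copy at 05:44, and reverted on wmf2 at 08:09.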
[04:52:37] PROBLEM - HHVM busy threads on mw1114 is CRITICAL 80.00% of data above the critical threshold [86.4] [04:52:46] PROBLEM - Apache HTTP on mw1143 is CRITICAL - Socket timeout after 10 seconds [04:52:56] PROBLEM - Apache HTTP on mw1148 is CRITICAL - Socket timeout after 10 seconds [04:52:56] PROBLEM - Apache HTTP on mw1139 is CRITICAL - Socket timeout after 10 seconds [04:52:56] PROBLEM - HHVM busy threads on mw1126 is CRITICAL 100.00% of data above the critical threshold [86.4] [04:53:05] PROBLEM - HHVM rendering on mw1119 is CRITICAL - Socket timeout after 10 seconds [04:53:25] PROBLEM - HHVM rendering on mw1191 is CRITICAL - Socket timeout after 10 seconds [04:53:26] PROBLEM - HHVM rendering on mw1121 is CRITICAL - Socket timeout after 10 seconds [04:53:26] PROBLEM - Apache HTTP on mw1126 is CRITICAL - Socket timeout after 10 seconds [04:53:26] PROBLEM - HHVM rendering on mw1137 is CRITICAL - Socket timeout after 10 seconds [04:53:26] PROBLEM - HHVM rendering on mw1198 is CRITICAL - Socket timeout after 10 seconds [04:53:26] PROBLEM - Apache HTTP on mw1191 is CRITICAL - Socket timeout after 10 seconds [04:53:35] PROBLEM - HHVM busy threads on mw1137 is CRITICAL 100.00% of data above the critical threshold [86.4] [04:53:35] PROBLEM - Apache HTTP on mw1145 is CRITICAL - Socket timeout after 10 seconds [04:53:36] PROBLEM - Apache HTTP on mw1194 is CRITICAL - Socket timeout after 10 seconds [04:53:36] PROBLEM - Apache HTTP on mw1207 is CRITICAL - Socket timeout after 10 seconds [04:53:46] PROBLEM - HHVM rendering on mw1120 is CRITICAL - Socket timeout after 10 seconds [04:53:46] PROBLEM - Apache HTTP on mw1206 is CRITICAL - Socket timeout after 10 seconds [04:53:46] PROBLEM - HHVM queue size on mw1121 is CRITICAL 60.00% of data above the critical threshold [80.0] [04:53:55] PROBLEM - HHVM rendering on mw1145 is CRITICAL - Socket timeout after 10 seconds [04:53:55] PROBLEM - HHVM rendering on mw1123 is CRITICAL - Socket timeout after 10 seconds [04:53:56] PROBLEM - HHVM rendering on mw1128 is CRITICAL - Socket timeout after 10 seconds [04:53:56] PROBLEM - HHVM rendering on mw1127 is CRITICAL - Socket timeout after 10 seconds [04:54:05] PROBLEM - Apache HTTP on mw1125 is CRITICAL - Socket timeout after 10 seconds [04:54:06] PROBLEM - Apache HTTP on mw1197 is CRITICAL - Socket timeout after 10 seconds [04:54:06] PROBLEM - Apache HTTP on mw1203 is CRITICAL - Socket timeout after 10 seconds [04:54:06] PROBLEM - Apache HTTP on mw1130 is CRITICAL - Socket timeout after 10 seconds [04:54:06] PROBLEM - HHVM rendering on mw1203 is CRITICAL - Socket timeout after 10 seconds [04:54:06] PROBLEM - Apache HTTP on mw1127 is CRITICAL - Socket timeout after 10 seconds [04:54:16] PROBLEM - HHVM queue size on mw1201 is CRITICAL 40.00% of data above the critical threshold [80.0] [04:54:16] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.122 second response time [04:54:25] PROBLEM - Apache HTTP on mw1128 is CRITICAL - Socket timeout after 10 seconds [04:54:26] PROBLEM - HHVM rendering on mw1189 is CRITICAL - Socket timeout after 10 seconds [04:54:35] PROBLEM - HHVM rendering on mw1202 is CRITICAL - Socket timeout after 10 seconds [04:54:35] PROBLEM - HHVM rendering on mw1190 is CRITICAL - Socket timeout after 10 seconds [04:54:35] PROBLEM - HHVM rendering on mw1139 is CRITICAL - Socket timeout after 10 seconds [04:54:35] PROBLEM - HHVM rendering on mw1148 is CRITICAL - Socket timeout after 10 seconds [04:54:36] PROBLEM - HHVM rendering on mw1194 is 
CRITICAL - Socket timeout after 10 seconds [04:54:37] PROBLEM - HHVM rendering on mw1207 is CRITICAL - Socket timeout after 10 seconds [04:54:37] PROBLEM - Apache HTTP on mw1123 is CRITICAL - Socket timeout after 10 seconds [04:54:45] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 7.418 second response time [04:54:45] PROBLEM - HHVM rendering on mw1130 is CRITICAL - Socket timeout after 10 seconds [04:54:46] PROBLEM - HHVM rendering on mw1192 is CRITICAL - Socket timeout after 10 seconds [04:54:46] PROBLEM - HHVM rendering on mw1126 is CRITICAL - Socket timeout after 10 seconds [04:54:47] PROBLEM - Apache HTTP on mw1192 is CRITICAL - Socket timeout after 10 seconds [04:54:47] PROBLEM - HHVM queue size on mw1190 is CRITICAL 40.00% of data above the critical threshold [80.0] [04:54:47] PROBLEM - Apache HTTP on mw1198 is CRITICAL - Socket timeout after 10 seconds [04:54:47] PROBLEM - Apache HTTP on mw1190 is CRITICAL - Socket timeout after 10 seconds [04:54:56] RECOVERY - HHVM rendering on mw1137 is OK: HTTP OK: HTTP/1.1 200 OK - 69345 bytes in 1.777 second response time [04:55:05] PROBLEM - Apache HTTP on mw1208 is CRITICAL - Socket timeout after 10 seconds [04:55:06] PROBLEM - HHVM busy threads on mw1147 is CRITICAL 100.00% of data above the critical threshold [86.4] [04:55:07] PROBLEM - HHVM rendering on mw1197 is CRITICAL - Socket timeout after 10 seconds [04:55:15] PROBLEM - HHVM rendering on mw1208 is CRITICAL - Socket timeout after 10 seconds [04:55:15] PROBLEM - HHVM rendering on mw1206 is CRITICAL - Socket timeout after 10 seconds [04:55:16] PROBLEM - Apache HTTP on mw1121 is CRITICAL - Socket timeout after 10 seconds [04:55:16] PROBLEM - Apache HTTP on mw1138 is CRITICAL - Socket timeout after 10 seconds [04:55:25] PROBLEM - HHVM busy threads on mw1132 is CRITICAL 80.00% of data above the critical threshold [86.4] [04:55:25] RECOVERY - HHVM rendering on mw1123 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.264 second response time [04:55:26] RECOVERY - HHVM rendering on mw1120 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 7.951 second response time [04:55:26] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 33.33% of data above the critical threshold [500.0] [04:55:26] RECOVERY - HHVM rendering on mw1127 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 1.551 second response time [04:55:35] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.239 second response time [04:55:36] RECOVERY - HHVM rendering on mw1128 is OK: HTTP OK: HTTP/1.1 200 OK - 69345 bytes in 9.094 second response time [04:55:36] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.050 second response time [04:55:36] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.990 second response time [04:55:45] RECOVERY - HHVM rendering on mw1203 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.266 second response time [04:55:45] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.422 second response time [04:55:55] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.070 second response time [04:55:56] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.164 second response time [04:56:05] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.230 second response time [04:56:05] RECOVERY 
- HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.299 second response time [04:56:05] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 69345 bytes in 0.349 second response time [04:56:05] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.533 second response time [04:56:06] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.096 second response time [04:56:06] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.088 second response time [04:56:06] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.093 second response time [04:56:07] RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.231 second response time [04:56:07] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.280 second response time [04:56:16] RECOVERY - HHVM rendering on mw1192 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.218 second response time [04:56:16] RECOVERY - HHVM rendering on mw1130 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.330 second response time [04:56:16] RECOVERY - HHVM rendering on mw1126 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.564 second response time [04:56:17] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.073 second response time [04:56:17] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.059 second response time [04:56:17] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.072 second response time [04:56:25] PROBLEM - HHVM queue size on mw1148 is CRITICAL 66.67% of data above the critical threshold [80.0] [04:56:35] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [04:56:36] RECOVERY - HHVM rendering on mw1191 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.195 second response time [04:56:36] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.081 second response time [04:56:36] RECOVERY - HHVM rendering on mw1121 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.325 second response time [04:56:36] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.437 second response time [04:56:45] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.210 second response time [04:56:45] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.242 second response time [04:56:46] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.074 second response time [04:56:46] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.222 second response time [04:56:46] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.076 second response time [04:56:46] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.117 second response time [04:56:46] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.100 second response time [04:56:47] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.067 second response time [04:56:47] 
RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.075 second response time [04:56:56] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.096 second response time [04:57:05] RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 69337 bytes in 0.419 second response time [04:57:25] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.071 second response time [05:01:17] PROBLEM - HHVM busy threads on mw1139 is CRITICAL 40.00% of data above the critical threshold [86.4] [05:01:46] PROBLEM - HHVM busy threads on mw1201 is CRITICAL 60.00% of data above the critical threshold [115.2] [05:01:46] PROBLEM - HHVM busy threads on mw1133 is CRITICAL 60.00% of data above the critical threshold [86.4] [05:02:06] PROBLEM - HHVM busy threads on mw1204 is CRITICAL 60.00% of data above the critical threshold [115.2] [05:02:36] PROBLEM - HHVM busy threads on mw1138 is CRITICAL 60.00% of data above the critical threshold [86.4] [05:02:46] PROBLEM - HHVM busy threads on mw1124 is CRITICAL 40.00% of data above the critical threshold [86.4] [05:03:26] RECOVERY - HHVM busy threads on mw1201 is OK Less than 30.00% above the threshold [76.8] [05:03:26] RECOVERY - HHVM busy threads on mw1133 is OK Less than 30.00% above the threshold [57.6] [05:03:36] RECOVERY - HHVM busy threads on mw1147 is OK Less than 30.00% above the threshold [57.6] [05:03:46] RECOVERY - HHVM busy threads on mw1137 is OK Less than 30.00% above the threshold [57.6] [05:03:56] PROBLEM - HHVM busy threads on mw1136 is CRITICAL 33.33% of data above the critical threshold [86.4] [05:04:26] RECOVERY - HHVM busy threads on mw1124 is OK Less than 30.00% above the threshold [57.6] [05:04:26] RECOVERY - HHVM queue size on mw1201 is OK Less than 30.00% above the threshold [10.0] [05:04:45] RECOVERY - HHVM busy threads on mw1139 is OK Less than 30.00% above the threshold [57.6] [05:04:56] RECOVERY - HHVM queue size on mw1148 is OK Less than 30.00% above the threshold [10.0] [05:05:05] RECOVERY - HHVM queue size on mw1190 is OK Less than 30.00% above the threshold [10.0] [05:05:26] RECOVERY - HHVM busy threads on mw1130 is OK Less than 30.00% above the threshold [57.6] [05:05:26] RECOVERY - HHVM busy threads on mw1204 is OK Less than 30.00% above the threshold [76.8] [05:05:36] RECOVERY - HHVM busy threads on mw1132 is OK Less than 30.00% above the threshold [57.6] [05:05:45] RECOVERY - HHVM busy threads on mw1136 is OK Less than 30.00% above the threshold [57.6] [05:05:46] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Apr 24 05:04:42 UTC 2015 (duration 4m 41s) [05:05:52] Logged the message, Master [05:05:55] thank you, logmsgbot [05:06:05] RECOVERY - HHVM busy threads on mw1138 is OK Less than 30.00% above the threshold [57.6] [05:06:06] RECOVERY - HHVM busy threads on mw1131 is OK Less than 30.00% above the threshold [57.6] [05:06:16] RECOVERY - HHVM busy threads on mw1114 is OK Less than 30.00% above the threshold [57.6] [05:06:16] RECOVERY - HHVM busy threads on mw1192 is OK Less than 30.00% above the threshold [76.8] [05:06:36] RECOVERY - HHVM busy threads on mw1126 is OK Less than 30.00% above the threshold [57.6] [05:07:25] RECOVERY - HHVM queue size on mw1121 is OK Less than 30.00% above the threshold [10.0] [05:10:23] (03PS1) 10BBlack: temporarily block Special:Export [puppet] - 10https://gerrit.wikimedia.org/r/206354 [05:12:25] RECOVERY - HTTP 5xx req/min on graphite1001 is 
OK Less than 1.00% above the threshold [250.0] [05:18:52] !log ori Synchronized wmf-config/InitialiseSettings.php: $wgExportAllowHistory default false, $wgExportMaxHistory default 1000 -> 10 (duration: 00m 16s) [05:19:00] Logged the message, Master [05:27:56] PROBLEM - HHVM busy threads on mw1222 is CRITICAL 60.00% of data above the critical threshold [115.2] [05:28:04] !log nuked https://commons.wikimedia.org/wiki/User:Niteshift/MVneu/2015_April_21-20 [05:28:10] Logged the message, Master [05:32:56] PROBLEM - HHVM busy threads on mw1222 is CRITICAL 50.00% of data above the critical threshold [115.2] [05:33:10] it's mw1222 again [05:34:45] RECOVERY - HHVM busy threads on mw1222 is OK Less than 30.00% above the threshold [76.8] [05:35:39] !log restart hhvm on mw1222; locked up in pthread_cond_wait, backtrace: https://phabricator.wikimedia.org/P552 [05:35:42] Logged the message, Master [05:41:23] (03Abandoned) 10BBlack: temporarily block Special:Export [puppet] - 10https://gerrit.wikimedia.org/r/206354 (owner: 10BBlack) [05:44:18] !log ori Synchronized php-1.26wmf1/includes/filerepo/file/LocalFile.php: Undo local hack on version that is inactive (1.26wmf1). No-op. (duration: 00m 17s) [05:44:24] Logged the message, Master [06:29:45] PROBLEM - puppet last run on elastic1022 is CRITICAL puppet fail [06:30:27] PROBLEM - puppet last run on db1059 is CRITICAL Puppet has 4 failures [06:31:06] PROBLEM - HHVM busy threads on mw1223 is CRITICAL 33.33% of data above the critical threshold [115.2] [06:31:15] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 1 failures [06:31:26] PROBLEM - puppet last run on db2064 is CRITICAL Puppet has 1 failures [06:31:35] PROBLEM - puppet last run on holmium is CRITICAL Puppet has 1 failures [06:34:56] PROBLEM - puppet last run on mw1065 is CRITICAL Puppet has 1 failures [06:35:05] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures [06:35:15] PROBLEM - puppet last run on mw2013 is CRITICAL Puppet has 1 failures [06:35:36] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 1 failures [06:35:56] PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 1 failures [06:35:56] PROBLEM - puppet last run on mw2093 is CRITICAL Puppet has 1 failures [06:36:06] PROBLEM - HHVM busy threads on mw1235 is CRITICAL 40.00% of data above the critical threshold [115.2] [06:36:06] PROBLEM - puppet last run on mw1025 is CRITICAL Puppet has 1 failures [06:36:45] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures [06:36:55] PROBLEM - puppet last run on mw2003 is CRITICAL Puppet has 1 failures [06:40:05] !log nuked http://commons.wikimedia.org/wiki/User:Niteshift/MVneu/2015_April_21-30 [06:40:10] Logged the message, Master [06:45:16] RECOVERY - puppet last run on holmium is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:45:55] RECOVERY - puppet last run on db1059 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [06:46:16] RECOVERY - puppet last run on mw2093 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:46:26] RECOVERY - puppet last run on mw1025 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:46:27] RECOVERY - HHVM busy threads on mw1223 is OK Less than 30.00% above the threshold [76.8] [06:46:36] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:46:45] RECOVERY - puppet last run on elastic1022 is OK Puppet is currently enabled, last run 1 minute ago 
with 0 failures [06:46:55] RECOVERY - puppet last run on db2064 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:56] RECOVERY - puppet last run on mw1065 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:05] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:47:05] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:47:15] RECOVERY - puppet last run on mw2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:15] RECOVERY - puppet last run on mw2003 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:47:36] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:47:56] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:55] RECOVERY - HHVM busy threads on mw1235 is OK Less than 30.00% above the threshold [76.8] [07:10:35] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 64 data above and 8 below the confidence bounds [07:16:53] (03CR) 10Aaron Schulz: "Yeah I don't want to lower it too much at once. I don't think it can be set at session level. Do you mean changing it globaly via admin qu" [puppet] - 10https://gerrit.wikimedia.org/r/206145 (owner: 10Aaron Schulz) [07:17:55] PROBLEM - nova-compute process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:17:55] PROBLEM - salt-minion processes on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:19:25] RECOVERY - salt-minion processes on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:19:25] RECOVERY - nova-compute process on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [07:29:16] PROBLEM - dhclient process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:32:26] RECOVERY - dhclient process on labvirt1006 is OK: PROCS OK: 0 processes with command name dhclient [07:33:48] <_joe_> sigh labs [07:35:41] (03CR) 10Springle: "That's changed relatively recently; innodb_lock_wait_timeout is a session variable since MySQL (and MariaDB) 5.5:" [puppet] - 10https://gerrit.wikimedia.org/r/206145 (owner: 10Aaron Schulz) [07:59:28] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1233245 (10Nemo_bis) [08:05:38] 6operations, 10MediaWiki-Debug-Logging, 6Release-Engineering, 6Security-Team, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1233253 (10Joe) @Andrew, @fgiunchedi is no one working on this? It is an old ticket, marked as high priority and it's unassigned. 
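Related to the 05:18 configuration sync above (and to the Special:Export Varnish block that was drafted at 05:10 and abandoned at 05:41): tightening the export limits is an ordinary wmf-config settings change. The actual diff is not shown in the log; the sketch below assumes the usual per-wiki settings map in wmf-config/InitialiseSettings.php, where each setting is keyed by wiki dbname with a 'default' fallback.

```php
<?php
// Hedged sketch of the shape of the 05:18 change; the values come from the
// !log summary, but the surrounding structure is an assumption about wmf-config.
$wgConf->settings += [
	'wgExportAllowHistory' => [
		'default' => false, // disallow full-history exports via Special:Export
	],
	'wgExportMaxHistory' => [
		'default' => 10, // was 1000: cap the revisions returned per page
	],
];
```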
[08:09:34] !log tstarling Synchronized php-1.26wmf2/includes/filerepo/file/LocalFile.php: reverting live hack (duration: 00m 16s) [08:09:37] Logged the message, Master [08:20:49] (03PS1) 10KartikMistry: CX: Add ceb, it and war in source languages [puppet] - 10https://gerrit.wikimedia.org/r/206360 [08:29:24] (03PS2) 10KartikMistry: CX: Add ceb, it and war in source languages [puppet] - 10https://gerrit.wikimedia.org/r/206360 (https://phabricator.wikimedia.org/T97114) [08:32:48] (03PS1) 10Filippo Giunchedi: install-server: shrink cassandra raid0 minimum partition size [puppet] - 10https://gerrit.wikimedia.org/r/206361 (https://phabricator.wikimedia.org/T90955) [08:33:20] (03PS2) 10Filippo Giunchedi: install-server: shrink cassandra raid0 minimum partition size [puppet] - 10https://gerrit.wikimedia.org/r/206361 (https://phabricator.wikimedia.org/T90955) [08:33:26] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install-server: shrink cassandra raid0 minimum partition size [puppet] - 10https://gerrit.wikimedia.org/r/206361 (https://phabricator.wikimedia.org/T90955) (owner: 10Filippo Giunchedi) [08:34:57] (03PS1) 10Alexandros Kosiaris: WIP: service: Use a package require instead of a class [puppet] - 10https://gerrit.wikimedia.org/r/206363 [08:45:04] (03CR) 10Alexandros Kosiaris: [C: 032] CX: Add ceb, it and war in source languages [puppet] - 10https://gerrit.wikimedia.org/r/206360 (https://phabricator.wikimedia.org/T97114) (owner: 10KartikMistry) [08:47:05] !log deployed parsoid/deploy 8b5de6aba / I4d55f6d50: Bump src to d2135c6b69 for deploy [08:47:12] Logged the message, Master [08:48:13] (03PS2) 10Alexandros Kosiaris: service: Use a package require instead of a class [puppet] - 10https://gerrit.wikimedia.org/r/206363 [08:52:39] (03CR) 10Alexandros Kosiaris: [C: 032] "Catalogcompiler says OK" [puppet] - 10https://gerrit.wikimedia.org/r/206363 (owner: 10Alexandros Kosiaris) [08:54:07] PROBLEM - Parsoid on wtp1001 is CRITICAL: HTTP CRITICAL - No data received from host [08:54:20] (03CR) 10Alexandros Kosiaris: "Yeah, I am taking over this. Plan is to move mathoid to service::node anyway, will probably modify heavily this change or if it turns out " [puppet] - 10https://gerrit.wikimedia.org/r/167413 (owner: 10Ori.livneh) [08:55:06] akosiaris: you can abandon it if it ends up getting in your way [08:55:41] ori: I did say that in the comment, didn't I ? [08:55:46] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.004 second response time [08:56:05] ori: you 're making me second guess myself man, not cool [08:56:06] :P [08:56:14] akosiaris: durr. i just read the grrit-wm bit [08:56:21] which abbreviated the crucial piece [08:56:33] ok, no worries [08:56:33] you know me, cutting corners [08:56:43] :) thanks [08:59:29] (03PS5) 10Merlijn van Deen: Extend Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/206118 [08:59:53] <_joe_> !log restarting parsoid cluster-wide [09:00:00] Logged the message, Master [09:04:28] * _joe_ send SIGSLEEP to ori [09:04:43] <_joe_> toghether with big thanks for all the work you did today [09:07:49] 6operations: Upgrade salt to 2014.7 (investigating) - https://phabricator.wikimedia.org/T88971#1233324 (10ArielGlenn) So as people will know there was a salt bug which affected trebuchet, stalling this process; new packages were released a couple days ago in the ppas so I'll be testing with those and hopefully w... 
[09:09:51] <_joe_> !log parsoid restart done [09:09:57] Logged the message, Master [10:03:04] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1233386 (10Visdaviva) Hi, the community members are working on the logo. It should be up on Commons soon. While migrating the content in the incubator,... [10:15:39] 6operations, 7Documentation: Create documentation on the requesting/allocation of virtual machines in the misc cluster - https://phabricator.wikimedia.org/T97072#1233395 (10Aklapper) [10:22:31] (03PS3) 10Filippo Giunchedi: install-server: partman for dm-cache [puppet] - 10https://gerrit.wikimedia.org/r/200134 (https://phabricator.wikimedia.org/T88994) [10:22:41] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install-server: partman for dm-cache [puppet] - 10https://gerrit.wikimedia.org/r/200134 (https://phabricator.wikimedia.org/T88994) (owner: 10Filippo Giunchedi) [10:25:53] PROBLEM - Cassanda CQL query interface on cerium is CRITICAL: Connection refused [10:26:11] PROBLEM - Cassandra database on cerium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [10:28:26] that's me [10:31:57] !log nova migrated a couple of etcd's project VMs [10:32:01] Logged the message, Master [10:35:26] 6operations, 10MediaWiki-Debug-Logging, 6Release-Engineering, 6Security-Team, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1233411 (10fgiunchedi) no I don't think anyone is working on this, I mostly worked on it when on clinic duty, my plate is full alrea... [10:50:14] (03PS1) 10Filippo Giunchedi: gdash: add metrics queued to graphite dashboard [puppet] - 10https://gerrit.wikimedia.org/r/206370 [10:50:26] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: add metrics queued to graphite dashboard [puppet] - 10https://gerrit.wikimedia.org/r/206370 (owner: 10Filippo Giunchedi) [10:57:22] (03CR) 10ArielGlenn: "Here are some first comments. Note that I am not at alll reviewing the dependencies." [dumps/html/deploy] - 10https://gerrit.wikimedia.org/r/204964 (https://phabricator.wikimedia.org/T94457) (owner: 10GWicke) [10:58:55] (03CR) 10ArielGlenn: "CSteipp, I added you to get eyeballs from the security side of things; at least at the first deployment it would be good to have the code " [dumps/html/deploy] - 10https://gerrit.wikimedia.org/r/204964 (https://phabricator.wikimedia.org/T94457) (owner: 10GWicke) [11:04:35] PROBLEM - Cassanda CQL query interface on praseodymium is CRITICAL: Connection refused [11:04:55] PROBLEM - Cassandra database on praseodymium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [11:30:30] (03CR) 10Glaisher: [C: 04-1] "see phab" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [11:30:48] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1233450 (10Glaisher) @Mjbmr Did the community approve to using File:Wikipedia-logo-v2-hi.svg as the logo? (and we usually do all the configuration in on... [11:35:51] 6operations, 10Datasets-General-or-Unknown: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503#1233459 (10ArielGlenn) The current approach is mean to be temporary (serving off of francium). 
[11:40:10] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1233461 (10Mjbmr) @Glaisher No, but translation on the logo with Hindi Wikipedia's logo is same. I know "we usually do all the configuration in one patc... [11:42:20] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1233464 (10Glaisher) @Mjbmr My point is, did the community approve to use the text on Wikipedia-logo-v2-hi.svg for https://commons.wikimedia.org/wiki/Fi... [11:43:55] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1233465 (10Mjbmr) @Glaisher No, but I'm trying to help to create the project as soon as possible. btw did I submit that patch? [11:47:34] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1233467 (10aude) hmmm, we'll have to populate the sites table in order to get the wikibase stuff working, and update it also on wikidata. https://wikit... [11:49:31] 6operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1233470 (10aude) [11:49:32] (03CR) 10Mjbmr: "It can be always uploaded locally in case community wanted to change the logo without having to submit a ticket." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [11:52:32] (03CR) 10Steinsplitter: [C: 031] "i don't see a issue here. standard procedure." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [11:52:55] (03CR) 10Glaisher: "It would be simpler if we use this change for doing all the configuration. As for the logo, we should ask the community to override the fi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [11:53:21] (03CR) 10Steinsplitter: "ignore my comment, misread...." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [11:55:59] 6operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1233473 (10Glaisher) @Visdaviva Since we already have a patch with a link to the Commons file, to make things simple, if you want to uploa... [11:58:19] 6operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1233475 (10Mjbmr) btw, patch https://gerrit.wikimedia.org/r/206083 must be merged first and pushed to wmf/1.26wmf2 branch of mediawiki. 
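The back-and-forth about the logo above comes down to a small wmf-config detail: per-wiki logos are just another entry in the per-dbname settings map, and the community can later override it on-wiki instead of filing another ticket (as Mjbmr notes). Below is a hedged sketch of what such an entry typically looks like; the new wiki's dbname and the final logo URL are placeholders, since both are still being settled in change 206300.

```php
<?php
// Illustrative only: '<newwiki>' and both URLs are placeholders, not the
// values from change 206300. The layout assumes the usual wgLogo settings
// map in wmf-config/InitialiseSettings.php.
$wgConf->settings['wgLogo'] = [
	'default'   => '/static/images/project-logos/wikipedia.png', // hypothetical default
	'<newwiki>' => '//upload.wikimedia.org/<path-to-chosen-commons-file>/135px-Logo.png',
];
```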
[11:58:42] (03PS1) 10Phuedx: Enable Browse experiment on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206375 (https://phabricator.wikimedia.org/T94739) [12:16:39] (03PS4) 10Glaisher: Create Wikipedia Konkani [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [12:16:45] (03CR) 10jenkins-bot: [V: 04-1] Create Wikipedia Konkani [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [12:18:25] (03PS5) 10Glaisher: Create Wikipedia Konkani [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [12:18:30] (03CR) 10jenkins-bot: [V: 04-1] Create Wikipedia Konkani [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [12:20:55] (03PS6) 10Glaisher: Create Wikipedia Konkani [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [12:21:27] (03CR) 10Glaisher: [C: 04-1] "Still needs to be added to MediaWiki core languages list and wikiversions.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [12:21:51] 6operations, 10Graphoid, 6Services, 10service-template-node, and 2 others: Deploy graphoid service into production - https://phabricator.wikimedia.org/T90487#1233501 (10mobrovac) [12:28:27] 6operations, 10hardware-requests: order new array for dataset1001 - https://phabricator.wikimedia.org/T93118#1233503 (10ArielGlenn) I did some back of the napkin-style calculations. With a new array of 12 2TB disks we get 18T say from raid, regular dumps plus pgecounts plus misc grow about 4T a year, we have a... [12:29:05] 6operations, 10hardware-requests: order new array for dataset1001 - https://phabricator.wikimedia.org/T93118#1233504 (10ArielGlenn) 5stalled>3Open a:5ArielGlenn>3RobH [12:29:16] (03CR) 10Mjbmr: Create Wikipedia Konkani (036 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [12:29:17] 6operations, 10Traffic: increase misc-web-lb cp pool from 2 to 3 systems? - https://phabricator.wikimedia.org/T86718#1233506 (10faidon) a:5mark>3None [12:29:50] 7Blocked-on-Operations, 6operations, 5Patch-For-Review: Install nodejs, nginx and other dependencies on francium - https://phabricator.wikimedia.org/T94457#1233509 (10ArielGlenn) I've made some comments on the gerrit changeset. [12:34:45] (03CR) 10Glaisher: "Sure. I don't see a problem with continuing on this." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/131914 (https://bugzilla.wikimedia.org/48618) (owner: 10Withoutaname) [12:37:29] (03PS4) 10Gergő Tisza: Set has_ganglia=false for labs [puppet] - 10https://gerrit.wikimedia.org/r/201942 (https://phabricator.wikimedia.org/T95107) [12:38:30] 6operations, 10Mathoid-General-or-Unknown, 6Services: Standardise Mathoid's deployment - https://phabricator.wikimedia.org/T97124#1233514 (10mobrovac) 3NEW [12:39:12] (03CR) 10Gergő Tisza: "> deployment-prep already has it:" [puppet] - 10https://gerrit.wikimedia.org/r/201942 (https://phabricator.wikimedia.org/T95107) (owner: 10Gergő Tisza) [12:40:03] 6operations, 10Mathoid-General-or-Unknown, 6Services: Standardise Mathoid's deployment - https://phabricator.wikimedia.org/T97124#1233521 (10mobrovac) @Physikerwelt Am I correctly understanding that https://github.com/physikerwelt/mathoid-server is simply an updated version of https://gerrit.wikimedia.org/r/... [12:40:21] (03PS7) 10Glaisher: Create Wikipedia Konkani [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [12:40:31] (03CR) 10Glaisher: "Fixed. Thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [13:14:41] 6operations, 10Traffic, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1233568 (10BBlack) Note also ori's couple of related varnish patches here for a potential path towards various solutions: https://gerrit.wikimedia.org/r/#/c/206351/ https://gerrit.wikimedia.... [13:17:07] 6operations, 10Traffic: increase misc-web-lb cp pool from 2 to 3 systems? - https://phabricator.wikimedia.org/T86718#1233573 (10BBlack) The plan being bandied about at this point is to block this on the dissolution of bits-cluster ( T95448 ), and reuse the 4x bits machines at eqiad, ulsfo, and esams as a globa... [13:21:56] 6operations, 10MediaWiki-Debug-Logging, 6Release-Engineering, 6Security-Team, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1233579 (10Anomie) Is there anything that actually //needs// doing besides just removing the 'sample' from the 'api' entry in wmgMon... [13:37:30] 7Puppet, 6Labs, 5Patch-For-Review: Labs: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/gmond_memcached.py] - https://phabricator.wikimedia.org/T95107#1233591 (10scfc) The whole `has_ganglia` business feels … complicated to me. In `manifests/site.pp`... [13:39:34] 6operations, 10Traffic: Investigate Vary:Accept-Encoding issues on cache clusters - https://phabricator.wikimedia.org/T97128#1233592 (10BBlack) 3NEW [13:40:24] 6operations, 10MediaWiki-DjVu, 10MediaWiki-General-or-Unknown, 6Multimedia, 7Availability: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1233599 (10Aklapper) > these can be attributed to img_metadata queries about Djvu files, all coming from s... [13:53:17] (03CR) 10Cscott: "I put this on the back burner for a while; maybe I should resurrect it. 
We should check with hashar first that the auto-deploy-to-beta st" [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10Cscott) [13:58:49] 6operations, 10Mathoid-General-or-Unknown, 6Services: Standardise Mathoid's deployment - https://phabricator.wikimedia.org/T97124#1233630 (10mobrovac) PR for bringing mathoid-server up to date with the #service-template-node is available [here](https://github.com/physikerwelt/mathoid-server/pull/5). [13:59:51] (03PS1) 10BBlack: sanitize Accept-Encoding for cache efficiency T97128 [puppet] - 10https://gerrit.wikimedia.org/r/206387 [14:00:38] 6operations, 10Traffic, 5Patch-For-Review: Investigate Vary:Accept-Encoding issues on cache clusters - https://phabricator.wikimedia.org/T97128#1233636 (10BBlack) I want to read and investigate a little further first, and also test this in isolated places before cluster-wide, but the patch above is the gener... [14:06:41] <^d> bd808: All of those search errors have tasks filed already [14:07:11] <^d> https://phabricator.wikimedia.org/T94814 tracks 3 subtasks [14:10:11] <^d> bd808: Ah, the one you spotted is a dupe, resolved as such [14:11:44] (03PS1) 10Alexandros Kosiaris: Enabled the ganglia diskstat plugin for labvirt100X [puppet] - 10https://gerrit.wikimedia.org/r/206388 [14:12:56] (03CR) 10Mobrovac: "Now tracked at https://phabricator.wikimedia.org/T97124" [puppet] - 10https://gerrit.wikimedia.org/r/167413 (owner: 10Ori.livneh) [14:15:06] PROBLEM - nova-compute process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:15:17] (03CR) 10Ottomata: [C: 031] "I guess so?" [puppet] - 10https://gerrit.wikimedia.org/r/206036 (owner: 10Dzahn) [14:16:45] RECOVERY - nova-compute process on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:17:16] (03CR) 10Alexandros Kosiaris: [C: 032] Enabled the ganglia diskstat plugin for labvirt100X [puppet] - 10https://gerrit.wikimedia.org/r/206388 (owner: 10Alexandros Kosiaris) [14:18:22] <^d> _joe_: I killed an ugly single-use template for you :) https://gerrit.wikimedia.org/r/#/c/206132/ [14:20:30] <_joe_> ^d: eheh [14:24:27] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Allow rsync traffic between analytics VLAN and eventlog1001 - https://phabricator.wikimedia.org/T96934#1233675 (10Ottomata) a:3akosiaris [14:24:29] !log dist-upgrade (including kernel upgrade to 3.13.0-49-generic) on labvirt1006, rebooting [14:24:33] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Allow rsync traffic between analytics VLAN and eventlog1001 - https://phabricator.wikimedia.org/T96934#1229708 (10Ottomata) Alex, can you help? [14:24:34] Logged the message, Master [14:25:16] ottomata: yeah, OK, gimme a few mins, debugging another issue [14:25:59] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Add icinga-wm bot to #wikimedia-analytics - https://phabricator.wikimedia.org/T96928#1233680 (10Ottomata) Coooool! Thanks! [14:27:07] PROBLEM - Host labvirt1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:28:16] akosiaris: thanks [14:29:46] RECOVERY - Host labvirt1006 is UPING OK - Packet loss = 0%, RTA = 1.25 ms [14:33:12] (03CR) 10GWicke: "> Things like maxConcurrency=50, concurrency:3 should be in a config file, along with paths for the dumps, temp directories, urls etc. 
It" [dumps/html/deploy] - 10https://gerrit.wikimedia.org/r/204964 (https://phabricator.wikimedia.org/T94457) (owner: 10GWicke) [14:36:44] 6operations, 5Patch-For-Review: Upgrade xenon, cerium and praseodymium to jessie - https://phabricator.wikimedia.org/T90955#1233723 (10fgiunchedi) 5Open>3Resolved machines reinstalled as jessie [14:38:04] 6operations, 7Graphite, 5Patch-For-Review: revisit what percentiles are calculated by statsite - https://phabricator.wikimedia.org/T88662#1233735 (10fgiunchedi) pending building a new statsite package from latest upstream git [14:49:58] (03CR) 10Filippo Giunchedi: Graphoid: service deployment on SCA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/206105 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac) [14:50:05] PROBLEM - puppet last run on db2065 is CRITICAL puppet fail [14:53:24] (03CR) 10Filippo Giunchedi: [C: 031] Graphoid: LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/206106 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac) [14:55:06] thnx godog for the reviews [14:55:27] (and for updating to jessie the test boxes) [14:55:27] (03CR) 10Hashar: "That seems to be the cause of exceptions on the beta cluster such as :" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [14:57:34] mobrovac: no problem! I didn't bring up cassandra yet though, if you want to give bootstrapping a go now that a node isn't self-including in seeds [14:58:09] urandom: ^^ [14:58:10] :P [15:06:46] RECOVERY - puppet last run on db2065 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:08:07] !log dist-upgrade (including kernel upgrade to 3.13.0-49-generic) on labvirt1005, rebooting [15:08:12] Logged the message, Master [15:09:56] (03CR) 10Mobrovac: Graphoid: service deployment on SCA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/206105 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac) [15:12:32] (03PS1) 10Alexandros Kosiaris: Move the ganglia diskstat plugin to role::nova::compute [puppet] - 10https://gerrit.wikimedia.org/r/206395 [15:13:06] PROBLEM - Host labvirt1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:52] 6operations, 10Traffic, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1233844 (10BBlack) [15:13:53] 6operations, 10procurement, 7Varnish: Purchase SSDs for legacy bits cache machines for re-use in other clusters - https://phabricator.wikimedia.org/T95449#1233845 (10BBlack) [15:15:33] 6operations, 10procurement, 7Varnish: Purchase SSDs for legacy bits cache machines for re-use in other clusters - https://phabricator.wikimedia.org/T95449#1233849 (10BBlack) 5Open>3Invalid a:3BBlack Change of plans, see: https://phabricator.wikimedia.org/T86718#1233573 [15:16:27] 6operations, 10Traffic, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1233855 (10BBlack) [15:16:28] 6operations, 10Traffic: increase misc-web-lb cp pool from 2 to 3 systems? 
- https://phabricator.wikimedia.org/T86718#1233854 (10BBlack) [15:16:36] RECOVERY - Host labvirt1005 is UPING OK - Packet loss = 0%, RTA = 2.21 ms [15:23:04] (03CR) 10Alexandros Kosiaris: [C: 032] Move the ganglia diskstat plugin to role::nova::compute [puppet] - 10https://gerrit.wikimedia.org/r/206395 (owner: 10Alexandros Kosiaris) [15:29:10] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [15:29:23] (03PS1) 10Ottomata: Set dfs_datanode_hdfs_blocks_metadata_enabled in both labs and production for Impala use [puppet] - 10https://gerrit.wikimedia.org/r/206398 [15:29:45] (03CR) 10Ottomata: [C: 032 V: 032] Set dfs_datanode_hdfs_blocks_metadata_enabled in both labs and production for Impala use [puppet] - 10https://gerrit.wikimedia.org/r/206398 (owner: 10Ottomata) [15:29:48] (03PS1) 10BryanDavis: monolog: configure wgDebugLogFile for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206399 (https://phabricator.wikimedia.org/T97138) [15:30:08] akosiaris: ok if i merged diskstat thing? [15:30:27] ottomata: yeah, sorry [15:30:28] thanks! [15:41:07] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Allow rsync traffic between analytics VLAN and eventlog1001 - https://phabricator.wikimedia.org/T96934#1233894 (10akosiaris) 5Open>3Resolved Indeed the analytics ACLs needed a hole for rsync traffic. Done and tested the TCP connection, resolving [15:41:59] !log dist-upgrade (including kernel upgrade to 3.13.0-49-generic) on labvirt1003, rebooting [15:43:01] Logged the message, Master [15:44:40] (03CR) 10Tim Landscheidt: "Also the references to Class['::labsdebrepo'] need to be changed to Class['::labs_debrepo']." [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [15:46:10] PROBLEM - Host labvirt1003 is DOWN: PING CRITICAL - Packet loss = 100% [15:48:33] 6operations, 10MediaWiki-Debug-Logging, 6Release-Engineering, 6Security-Team, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1233906 (10bd808) My patch in {rOMWC2680380cba022787f19c783a4535d8794ffda8d8} restores unsampled xff logs to fluorine. I left api sa... [15:49:20] RECOVERY - Host labvirt1003 is UPING OK - Packet loss = 0%, RTA = 1.24 ms [15:52:42] 6operations, 10Traffic: increase misc-web-lb cp pool from 2 to 3 systems? - https://phabricator.wikimedia.org/T86718#1233910 (10BBlack) p:5High>3Normal [15:53:27] (03CR) 10Chad: [C: 032] monolog: configure wgDebugLogFile for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206399 (https://phabricator.wikimedia.org/T97138) (owner: 10BryanDavis) [15:53:31] (03Merged) 10jenkins-bot: monolog: configure wgDebugLogFile for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206399 (https://phabricator.wikimedia.org/T97138) (owner: 10BryanDavis) [15:54:10] <^d> Someone has uncommitted changes on tin, who be it? [15:54:27] ^d: ori from the outage battle last night [15:54:39] <^d> I figured from that yeah [15:54:40] <^d> Worth committing or tossing? [15:55:05] toss [15:55:06] it was a hot patch to try and mellow db crushing [15:55:06] (hi) [15:55:11] o/ ori [15:55:12] <^d> okie dokie [15:55:18] thanks [15:55:26] \o [15:55:48] ori: for the next time... what host can I run varnishtop from? [15:55:57] or is that roots only? 
[15:56:14] !log demon Synchronized wmf-config/: logging cleanup, mostly for labs (duration: 00m 21s) [15:56:18] Logged the message, Master [15:56:20] <^d> bd808: ^^ [15:56:28] sweet [15:57:24] bd808: yes (roots) [15:57:35] k. [15:58:00] it doesn't aggregate; you have to run it on a specific varnish host you're curious about [15:58:09] doesn't aggregate data from multiple instances, that is [15:58:13] right. [15:58:15] (03PS4) 10Chad: logstash: Convert Elasticsearch on logstash100[1-3] to client [puppet] - 10https://gerrit.wikimedia.org/r/205971 (https://phabricator.wikimedia.org/T96814) (owner: 10BryanDavis) [15:58:52] it's just peeking at the inprocess ring buffer that varnish tosses logs into [16:00:02] (03CR) 10BryanDavis: [C: 04-1] "Hold until T96692 is done and the indices are moved." [puppet] - 10https://gerrit.wikimedia.org/r/205971 (https://phabricator.wikimedia.org/T96814) (owner: 10BryanDavis) [16:03:48] bd808: not in-process, it uses shm (an ipc mechanism) [16:04:02] *nod* [16:04:02] but yeah, it's local [16:06:23] !log dist-upgrade (including kernel upgrade to 3.13.0-49-generic) on labvirt1004, rebooting [16:06:27] Logged the message, Master [16:06:53] oh, no jouncebot :( [16:07:27] greg-g: it got sick yesterday and I haven't made it better yet [16:08:05] bd808: from the labs craziness or other sickness? [16:08:29] I can kick it if need be [16:08:37] * greg-g thinks he remembers how [16:10:02] * greg-g does the new host key id dance [16:14:27] 7Puppet, 6Labs: Fix Puppet timestamp updater for wikitech - https://phabricator.wikimedia.org/T97082#1233956 (10hashar) My bad I thought notify would still emit a message to the client but it is only on the puppet master :( [16:15:11] bd808: https://phabricator.wikimedia.org/P555 [16:15:30] PROBLEM - Host labvirt1004 is DOWN: PING CRITICAL - Packet loss = 100% [16:18:51] RECOVERY - Host labvirt1004 is UPING OK - Packet loss = 0%, RTA = 2.13 ms [16:21:07] greg-g: it's running, but not reading the wiki or somehow otherwise braindead [16:21:14] :( [16:25:02] (03CR) 10Glaisher: Enable Extension:Shorturl on sa wiki projects (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201216 (https://phabricator.wikimedia.org/T94660) (owner: 10Shanmugamp7) [16:26:41] 7Puppet, 6Labs: Fix Puppet timestamp updater for wikitech - https://phabricator.wikimedia.org/T97082#1233968 (10scfc) No, no, the blame lies with me :-). You proposed to keep the `notify` resources, //I// opted for `notice` functions. I'll probably fix this by server side functions. But testing is a bitch,... [16:40:34] (03PS1) 10Ottomata: Make sure icedtea-7-jre-jamvm is absent on analytics nodes [puppet] - 10https://gerrit.wikimedia.org/r/206405 [16:44:32] (03CR) 10Ottomata: [C: 032] Make sure icedtea-7-jre-jamvm is absent on analytics nodes [puppet] - 10https://gerrit.wikimedia.org/r/206405 (owner: 10Ottomata) [16:56:41] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [16:59:50] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [17:00:51] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
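Earlier in this stretch, change 206399 ("monolog: configure wgDebugLogFile for beta cluster", T97138) was merged at 15:53 and went out with the 15:56 "logging cleanup" sync above. The patch itself is not reproduced in the log, so the realm check and destination below are illustrative assumptions rather than the deployed values; $wgDebugLogFile itself is a stock MediaWiki setting.

```php
<?php
// Hedged sketch only: the realm variable and log path are assumed
// placeholders; the real values live in change 206399.
if ( $wmfRealm === 'labs' ) {
	$wgDebugLogFile = '/data/project/logs/mediawiki-debug.log'; // hypothetical path
}
```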
[17:01:03] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Moritz Muehlenhoff in ops - https://phabricator.wikimedia.org/T94717#1234068 (10Dzahn) [17:02:58] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Moritz Muehlenhoff in ops - https://phabricator.wikimedia.org/T94717#1170954 (10Dzahn) Moritz reported he is still lacking permissions in Phabricator. For example he can't see T90968 and it tells him he is not in the Ops group. added that t... [17:08:31] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Moritz Muehlenhoff in ops - https://phabricator.wikimedia.org/T94717#1234097 (10Dzahn) 5Open>3Resolved done by @chasemp [17:08:56] bd808: always creating work for me re. Mailing lists :) [17:09:44] Query me with the relevant emails + a distinction between close or delete with the lists please :) [17:12:42] (03CR) 10Milimetric: [C: 031] "Nice. Thanks very much Brandon." [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T88813) (owner: 10Nuria) [17:19:34] Something seems to be weird with mw1125 (or else all the weird is preferentially hitting mw1125), see T97145. [17:19:54] (03Abandoned) 10Dzahn: contacts: remove role from zirconium and delete it [puppet] - 10https://gerrit.wikimedia.org/r/205457 (https://phabricator.wikimedia.org/T90679) (owner: 10Dzahn) [17:22:22] (03CR) 10Bmansurov: [C: 031] Enable Browse experiment on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206375 (https://phabricator.wikimedia.org/T94739) (owner: 10Phuedx) [17:22:25] (03CR) 10Dzahn: [C: 031] Set has_ganglia=false for labs [puppet] - 10https://gerrit.wikimedia.org/r/201942 (https://phabricator.wikimedia.org/T95107) (owner: 10Gergő Tisza) [17:22:50] YuviPanda: do you agree we should just disable Ganglia globally on labs? ^ [17:23:01] see the linked bug for details how it causes issues [17:28:03] mutante: it’s already dead [17:28:17] (03PS5) 10Yuvipanda: Set has_ganglia=false for labs [puppet] - 10https://gerrit.wikimedia.org/r/201942 (https://phabricator.wikimedia.org/T95107) (owner: 10Gergő Tisza) [17:28:18] let me merge it [17:28:26] (03CR) 10Yuvipanda: [C: 032 V: 032] Set has_ganglia=false for labs [puppet] - 10https://gerrit.wikimedia.org/r/201942 (https://phabricator.wikimedia.org/T95107) (owner: 10Gergő Tisza) [17:29:20] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [17:29:31] mutante: thanks for poking at it :) [17:30:00] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [17:30:08] YuviPanda: thanks for merging it [17:31:17] mutante: yw :) [17:36:14] 7Puppet, 6Labs, 5Patch-For-Review: Labs: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/gmond_memcached.py] - https://phabricator.wikimedia.org/T95107#1234154 (10Dzahn) should be disabled now. is the error gone? 
[17:36:55] 6operations, 5Patch-For-Review: contacts.wikimedia.org drupal unpuppetized / retire contacts - https://phabricator.wikimedia.org/T90679#1234163 (10Dzahn) manually added bacula job to archive pool to keep it for longer than 60 days [17:37:39] 6operations, 10Wikimedia-Mailing-lists: Close mwapi-team@lists.wikimedia.org list - https://phabricator.wikimedia.org/T97148#1234165 (10RobH) [17:37:46] 6operations, 10Wikimedia-Mailing-lists: Close mwapi-team@lists.wikimedia.org list - https://phabricator.wikimedia.org/T97148#1234168 (10RobH) 5Open>3Resolved a:3RobH [17:40:31] (03PS5) 10Tim Landscheidt: move misc/labsdebrepo out of misc to module [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [17:43:10] (03PS2) 10Yuvipanda: ircecho: Restart if any of the files involved change [puppet] - 10https://gerrit.wikimedia.org/r/201924 [17:43:22] (03CR) 10Yuvipanda: [C: 032 V: 032] ircecho: Restart if any of the files involved change [puppet] - 10https://gerrit.wikimedia.org/r/201924 (owner: 10Yuvipanda) [17:46:05] (03CR) 10Tim Landscheidt: "I tested this successfully on Toolsbeta and tried to test it for *quarry, and it seemed to work (*quarry couldn't install some packages, b" [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [17:48:13] 6operations, 5Patch-For-Review: contacts.wikimedia.org drupal unpuppetized / retire contacts - https://phabricator.wikimedia.org/T90679#1234205 (10Dzahn) >>! In T90679#1102456, @AKoval_WMF wrote: > Yes, for our purposes now, Asana has replaced Civi. > However, I'm sure we'd appreciate a data dump just in case... [17:51:53] Coren, YuviPanda, outage report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150422-LabsOutage please edit as you see fit [17:52:06] I’m not sure what the action items are other than “don’t use broken kernels" [17:52:12] * Coren chuckles. [17:52:27] Well, there is one obvious one: "upgrade all affected hosts" which is already complete. [17:53:30] I wonder if quicker escalation of the shinken flaps would’ve caught it before the tools migration started [17:53:37] for some definition of ‘escalation’ and ‘quicker' [17:54:30] YuviPanda: yes, the fact that the deployment flaps were happening in a channel that I wasn’t in... [17:54:39] and no one mentioned them to me until the next morning… [17:54:48] yeah [17:54:51] not sure how to fix that [17:54:58] not putting those in this channel, no :P [17:55:05] too many elipses mean I’m going to go eat lunch [17:55:15] :D [17:55:20] A factor that hindered debugging is also that the symptoms were - initially - consistent with network overloading which was easy to dismiss as a side effect of migrations. [17:57:18] Coren: how’s the plan to switch NFS to labstore1002 going? idmapd off everywhere? [17:57:37] YuviPanda: That was completely put on hold since yesterday. :-( [17:57:50] heh [17:58:02] YuviPanda: It should be. I still have a couple of instances that either have puppet disabled or self-hosted I need to fix manually -- that's what I'm doing now. [17:58:13] ~30 left or so. [17:58:50] All of tools and deployment-prep are idmap-free, and about 80% of the others. [18:00:39] andrewbogott: Will do a small fix; the root cause is https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917 with the other being a symptom. 
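Coren's point above about instances needing to be idmap-free before the labstore switch suggests a quick per-instance check. A rough sketch, assuming the Ubuntu-era /etc/default/nfs-common file and NEED_IDMAPD key (both assumptions; the real cleanup was done through puppet and manual fixes as described above):

    # Rough per-instance check: is rpc.idmapd still running, and is
    # NEED_IDMAPD still switched on? The config path and key are assumptions
    # about how nfs-common was configured on Ubuntu instances of that era.
    import subprocess

    def idmapd_process_running():
        # pgrep exits 0 when at least one matching process exists.
        return subprocess.call(["pgrep", "-x", "rpc.idmapd"]) == 0

    def idmapd_enabled_in_config(path="/etc/default/nfs-common"):
        try:
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if line.startswith("NEED_IDMAPD=") and "yes" in line.lower():
                        return True
        except IOError:
            pass
        return False

    if __name__ == "__main__":
        if idmapd_process_running() or idmapd_enabled_in_config():
            print("NOT idmap-free yet")
        else:
            print("idmap-free")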
[18:01:54] (03PS1) 10Dzahn: delete contacts.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/206415 (https://phabricator.wikimedia.org/T90679) [18:02:06] (03CR) 10Yuvipanda: [C: 04-1] move misc/labsdebrepo out of misc to module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [18:04:20] (03PS1) 10Tim Landscheidt: gridengine: Use proper syntax for template variables [puppet] - 10https://gerrit.wikimedia.org/r/206417 [18:04:38] (03CR) 10Dzahn: "how to "find all the hosts with that role explictly selected in wikitech" in general? is there an easy way for that? i've been wondering b" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [18:05:51] andrewbogott: I wonder if it's worthwhile to puppetize a hard check for kernel version and refuse to deploy nova if it's an affected kernel? [18:06:22] (03CR) 10Dzahn: [C: 032] varnish: delete contacts from misc-web config [puppet] - 10https://gerrit.wikimedia.org/r/206413 (https://phabricator.wikimedia.org/T90679) (owner: 10Dzahn) [18:08:53] (03PS2) 10Yuvipanda: gridengine: Use proper syntax for template variables [puppet] - 10https://gerrit.wikimedia.org/r/206417 (owner: 10Tim Landscheidt) [18:08:58] (03PS1) 10Ottomata: Add ferm rules for impala daemons [puppet] - 10https://gerrit.wikimedia.org/r/206420 [18:09:01] (03CR) 10Yuvipanda: [C: 032 V: 032] gridengine: Use proper syntax for template variables [puppet] - 10https://gerrit.wikimedia.org/r/206417 (owner: 10Tim Landscheidt) [18:09:36] (03PS1) 10Dzahn: contacts: delete Apache template, adjust role [puppet] - 10https://gerrit.wikimedia.org/r/206421 (https://phabricator.wikimedia.org/T90679) [18:09:42] (03CR) 10jenkins-bot: [V: 04-1] Add ferm rules for impala daemons [puppet] - 10https://gerrit.wikimedia.org/r/206420 (owner: 10Ottomata) [18:11:56] (03PS2) 10Ottomata: Add ferm rules for impala daemons [puppet] - 10https://gerrit.wikimedia.org/r/206420 (https://phabricator.wikimedia.org/T96329) [18:12:39] (03PS2) 10Dzahn: contacts: delete Apache template, adjust role [puppet] - 10https://gerrit.wikimedia.org/r/206421 (https://phabricator.wikimedia.org/T90679) [18:13:18] 6operations, 7HHVM: Switch HAT appservers to trusty's ICU - https://phabricator.wikimedia.org/T86096#1234299 (10matmarex) ICU 54 (5.4) has already been released, would be nice if we could go straight to that and skip some updateCollation.php the next time we upgrade. There were no existing Debian/Ubuntu packag... 
[18:13:32] (03CR) 10Dzahn: [C: 032] contacts: delete Apache template, adjust role [puppet] - 10https://gerrit.wikimedia.org/r/206421 (https://phabricator.wikimedia.org/T90679) (owner: 10Dzahn) [18:15:28] (03CR) 10Thcipriani: [C: 032] beta: expose $wmfUdp2logDest to wmfLabsOverrideSettings() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206414 (https://phabricator.wikimedia.org/T97138) (owner: 10BryanDavis) [18:16:17] (03Merged) 10jenkins-bot: beta: expose $wmfUdp2logDest to wmfLabsOverrideSettings() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206414 (https://phabricator.wikimedia.org/T97138) (owner: 10BryanDavis) [18:20:50] (03PS3) 10Ottomata: Add ferm rules for impala daemons [puppet] - 10https://gerrit.wikimedia.org/r/206420 (https://phabricator.wikimedia.org/T96329) [18:20:59] (03CR) 10Ottomata: [C: 032 V: 032] Add ferm rules for impala daemons [puppet] - 10https://gerrit.wikimedia.org/r/206420 (https://phabricator.wikimedia.org/T96329) (owner: 10Ottomata) [18:31:30] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [18:33:44] (03PS1) 10Ottomata: Open more ports for impala, include impalad on hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/206423 (https://phabricator.wikimedia.org/T96329) [18:34:27] ottomata: re: impala i'm disappointed that no one has made a chevy joke yet [18:35:14] google has, in that it often includes chevy things in my search results [18:35:21] (03PS1) 10BryanDavis: beta: Import $wmfUdp2logDest global in wmfLabsSettings() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206424 (https://phabricator.wikimedia.org/T95789) [18:35:25] beyond that, no one has made an antelope joke yet either [18:35:37] :( [18:35:47] (03CR) 10Ottomata: [C: 032] Open more ports for impala, include impalad on hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/206423 (https://phabricator.wikimedia.org/T96329) (owner: 10Ottomata) [18:36:47] (03CR) 10BryanDavis: beta: Import $wmfUdp2logDest global in wmfLabsSettings() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206424 (https://phabricator.wikimedia.org/T95789) (owner: 10BryanDavis) [18:36:54] thcipriani: ^ [18:37:03] "this time for sure!" [18:37:07] * thcipriani looks [18:37:34] (03CR) 10Tim Landscheidt: [C: 04-1] "*argl* Yeah, we need to keep misc::labsdebrepo for the migration period. You can find a list of all instances with the Semantic MediaWik" [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [18:38:35] * bd808 adds "rewrite multiversion config system" to line 739 of things-to-do-before-i-die.txt [18:39:32] (03PS2) 10BryanDavis: beta: Import $wmfUdp2logDest global in wmfLabsSettings() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206424 (https://phabricator.wikimedia.org/T97138) [18:43:15] (03CR) 10Yuvipanda: "You can actually query it from labs instances, with plain LDAP. Let me see if I can find the query I used." 
[puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [18:46:16] (03CR) 10Thcipriani: [C: 032] beta: Import $wmfUdp2logDest global in wmfLabsSettings() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206424 (https://phabricator.wikimedia.org/T97138) (owner: 10BryanDavis) [18:46:23] (03Merged) 10jenkins-bot: beta: Import $wmfUdp2logDest global in wmfLabsSettings() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206424 (https://phabricator.wikimedia.org/T97138) (owner: 10BryanDavis) [18:54:40] (03PS2) 10Phuedx: Enable Browse experiment on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206375 (https://phabricator.wikimedia.org/T94739) [18:54:52] PROBLEM - puppet last run on ms-be2005 is CRITICAL puppet fail [18:54:52] PROBLEM - puppet last run on ms-be1006 is CRITICAL puppet fail [18:55:01] PROBLEM - puppet last run on mc1012 is CRITICAL puppet fail [18:55:11] PROBLEM - puppet last run on ms-be1009 is CRITICAL puppet fail [18:55:12] PROBLEM - puppet last run on ms-be2008 is CRITICAL puppet fail [18:55:20] PROBLEM - puppetmaster https on palladium is CRITICAL - Socket timeout after 10 seconds [18:55:30] PROBLEM - puppet last run on ms-be1014 is CRITICAL puppet fail [18:55:40] the puppet failures are likely fallout from the puppetmaster https failure on palladium [18:55:41] PROBLEM - puppet last run on ms-be1007 is CRITICAL puppet fail [18:55:42] PROBLEM - puppet last run on ms-be1004 is CRITICAL puppet fail [18:55:42] PROBLEM - puppet last run on sodium is CRITICAL puppet fail [18:55:42] PROBLEM - puppet last run on ms-be1015 is CRITICAL puppet fail [18:55:42] PROBLEM - puppet last run on helium is CRITICAL puppet fail [18:55:42] PROBLEM - puppet last run on ms-be2006 is CRITICAL puppet fail [18:55:51] PROBLEM - puppet last run on virt1007 is CRITICAL puppet fail [18:56:00] PROBLEM - puppet last run on ms-fe1001 is CRITICAL puppet fail [18:56:02] !log restarted apache2 on palladium [18:56:08] Logged the message, Master [18:56:11] PROBLEM - puppet last run on ms-be3004 is CRITICAL puppet fail [18:56:11] PROBLEM - puppet last run on mc1003 is CRITICAL puppet fail [18:56:11] PROBLEM - puppet last run on ms-be1010 is CRITICAL puppet fail [18:56:20] PROBLEM - puppet last run on ms-be1002 is CRITICAL puppet fail [18:56:21] PROBLEM - puppet last run on ms-be1003 is CRITICAL puppet fail [18:56:21] PROBLEM - puppet last run on ms-be1011 is CRITICAL puppet fail [18:56:21] PROBLEM - puppet last run on mc1013 is CRITICAL puppet fail [18:56:21] PROBLEM - puppet last run on gadolinium is CRITICAL puppet fail [18:56:40] PROBLEM - puppet last run on db1048 is CRITICAL puppet fail [18:56:41] PROBLEM - puppet last run on mw1256 is CRITICAL puppet fail [18:56:42] PROBLEM - puppet last run on ms-be1008 is CRITICAL puppet fail [18:56:42] PROBLEM - puppet last run on mc1004 is CRITICAL puppet fail [18:56:50] PROBLEM - puppet last run on wtp1019 is CRITICAL puppet fail [18:56:51] PROBLEM - puppet last run on wtp1017 is CRITICAL puppet fail [18:56:51] PROBLEM - puppet last run on mw2114 is CRITICAL puppet fail [18:56:51] PROBLEM - puppet last run on elastic1028 is CRITICAL puppet fail [18:56:51] PROBLEM - puppet last run on bast1001 is CRITICAL puppet fail [18:56:51] PROBLEM - puppet last run on mw1239 is CRITICAL puppet fail [18:56:51] PROBLEM - puppet last run on elastic1026 is CRITICAL puppet fail [18:56:52] PROBLEM - puppet last run on strontium is CRITICAL puppet fail [18:57:01] PROBLEM - puppet last run on mw1240 is CRITICAL puppet fail [18:57:01] PROBLEM - puppet last run 
on elastic1020 is CRITICAL puppet fail [18:57:01] PROBLEM - puppet last run on ms-be1005 is CRITICAL puppet fail [18:57:01] PROBLEM - puppet last run on wtp1023 is CRITICAL puppet fail [18:57:02] PROBLEM - puppet last run on mw1200 is CRITICAL puppet fail [18:57:02] PROBLEM - puppet last run on mw2143 is CRITICAL puppet fail [18:57:02] PROBLEM - puppet last run on db2041 is CRITICAL puppet fail [18:57:02] PROBLEM - puppet last run on mw2176 is CRITICAL puppet fail [18:57:03] PROBLEM - puppet last run on db2034 is CRITICAL puppet fail [18:57:03] PROBLEM - puppet last run on labcontrol2001 is CRITICAL puppet fail [18:57:04] PROBLEM - puppet last run on wtp1003 is CRITICAL puppet fail [18:57:04] PROBLEM - puppet last run on mw2205 is CRITICAL puppet fail [18:57:10] PROBLEM - puppet last run on mw2092 is CRITICAL puppet fail [18:57:10] PROBLEM - puppet last run on mw2102 is CRITICAL puppet fail [18:57:10] PROBLEM - puppet last run on californium is CRITICAL puppet fail [18:57:10] PROBLEM - puppet last run on snapshot1003 is CRITICAL puppet fail [18:57:11] PROBLEM - puppet last run on db1042 is CRITICAL puppet fail [18:57:20] PROBLEM - puppet last run on es1005 is CRITICAL puppet fail [18:57:20] PROBLEM - puppet last run on mw1179 is CRITICAL puppet fail [18:57:21] PROBLEM - puppet last run on db1064 is CRITICAL puppet fail [18:57:21] PROBLEM - puppet last run on mw1122 is CRITICAL puppet fail [18:57:21] PROBLEM - puppet last run on mw1170 is CRITICAL puppet fail [18:57:21] PROBLEM - puppet last run on pollux is CRITICAL puppet fail [18:57:22] PROBLEM - puppet last run on ms-fe2004 is CRITICAL puppet fail [18:57:22] PROBLEM - puppet last run on heze is CRITICAL puppet fail [18:57:22] PROBLEM - puppet last run on ytterbium is CRITICAL puppet fail [18:57:23] PROBLEM - puppet last run on mw2136 is CRITICAL puppet fail [18:57:30] PROBLEM - puppet last run on mw2104 is CRITICAL puppet fail [18:57:30] PROBLEM - puppet last run on elastic1019 is CRITICAL puppet fail [18:57:31] PROBLEM - puppet last run on mw2170 is CRITICAL puppet fail [18:57:31] PROBLEM - puppet last run on wtp2005 is CRITICAL puppet fail [18:57:31] PROBLEM - puppet last run on mw2099 is CRITICAL puppet fail [18:57:31] PROBLEM - puppet last run on baham is CRITICAL puppet fail [18:57:32] PROBLEM - puppet last run on db2036 is CRITICAL puppet fail [18:57:32] PROBLEM - puppet last run on mw2132 is CRITICAL puppet fail [18:57:40] PROBLEM - puppet last run on virt1001 is CRITICAL puppet fail [18:57:41] PROBLEM - puppet last run on mw1222 is CRITICAL puppet fail [18:57:41] PROBLEM - puppet last run on mw1233 is CRITICAL puppet fail [18:57:41] PROBLEM - puppet last run on elastic1023 is CRITICAL puppet fail [18:57:41] PROBLEM - puppet last run on mw1055 is CRITICAL puppet fail [18:57:41] PROBLEM - puppet last run on db1023 is CRITICAL puppet fail [18:57:42] PROBLEM - puppet last run on db1031 is CRITICAL puppet fail [18:57:42] PROBLEM - puppet last run on db1073 is CRITICAL puppet fail [18:57:43] PROBLEM - puppet last run on mw1068 is CRITICAL puppet fail [18:57:50] PROBLEM - puppet last run on wtp2012 is CRITICAL puppet fail [18:57:50] PROBLEM - puppet last run on db2042 is CRITICAL puppet fail [18:57:51] PROBLEM - puppet last run on mw2182 is CRITICAL puppet fail [18:57:51] PROBLEM - puppet last run on mw2124 is CRITICAL puppet fail [18:57:51] PROBLEM - puppet last run on analytics1010 is CRITICAL puppet fail [18:57:51] PROBLEM - puppet last run on mw2173 is CRITICAL puppet fail [18:57:51] PROBLEM - puppet last run on mw1230 is CRITICAL 
puppet fail [18:57:52] PROBLEM - puppet last run on es1009 is CRITICAL puppet fail [18:57:52] PROBLEM - puppet last run on mw1238 is CRITICAL puppet fail [18:58:01] PROBLEM - puppet last run on mw2157 is CRITICAL puppet fail [18:58:01] PROBLEM - puppet last run on mw1246 is CRITICAL puppet fail [18:58:01] PROBLEM - puppet last run on mw2110 is CRITICAL puppet fail [18:58:01] PROBLEM - puppet last run on tmh1001 is CRITICAL puppet fail [18:58:01] PROBLEM - puppet last run on mw1169 is CRITICAL puppet fail [18:58:02] PROBLEM - puppet last run on mw1127 is CRITICAL puppet fail [18:58:02] PROBLEM - puppet last run on db1034 is CRITICAL puppet fail [18:58:03] PROBLEM - puppet last run on mw2203 is CRITICAL puppet fail [18:58:03] PROBLEM - puppet last run on mw2190 is CRITICAL puppet fail [18:58:04] PROBLEM - puppet last run on analytics1002 is CRITICAL puppet fail [18:58:04] PROBLEM - puppet last run on db1060 is CRITICAL puppet fail [18:58:11] PROBLEM - puppet last run on mw2214 is CRITICAL puppet fail [18:58:11] PROBLEM - puppet last run on mw2178 is CRITICAL puppet fail [18:58:11] PROBLEM - puppet last run on antimony is CRITICAL puppet fail [18:58:11] PROBLEM - puppet last run on cp4012 is CRITICAL puppet fail [18:58:12] PROBLEM - puppet last run on mw2121 is CRITICAL puppet fail [18:58:18] !log restarted puppetmaster on palladium as well [18:58:21] PROBLEM - puppet last run on db2037 is CRITICAL puppet fail [18:58:21] Logged the message, Master [18:58:22] PROBLEM - puppet last run on mw1183 is CRITICAL puppet fail [18:58:22] PROBLEM - puppet last run on mw1136 is CRITICAL puppet fail [18:58:30] PROBLEM - puppet last run on rdb1004 is CRITICAL puppet fail [18:58:30] PROBLEM - puppet last run on elastic1025 is CRITICAL puppet fail [18:58:31] PROBLEM - puppet last run on mw1107 is CRITICAL puppet fail [18:58:31] PROBLEM - puppet last run on mw1217 is CRITICAL puppet fail [18:58:31] PROBLEM - puppet last run on mw2040 is CRITICAL puppet fail [18:58:40] PROBLEM - puppet last run on es2004 is CRITICAL puppet fail [18:58:51] PROBLEM - puppet last run on analytics1024 is CRITICAL puppet fail [18:59:01] PROBLEM - puppet last run on mw1141 is CRITICAL puppet fail [18:59:01] PROBLEM - puppet last run on mw1225 is CRITICAL puppet fail [18:59:01] PROBLEM - puppet last run on mw1134 is CRITICAL puppet fail [18:59:10] PROBLEM - puppet last run on analytics1035 is CRITICAL puppet fail [18:59:11] PROBLEM - puppet last run on mw2145 is CRITICAL puppet fail [18:59:21] PROBLEM - puppet last run on mw1126 is CRITICAL puppet fail [18:59:21] PROBLEM - puppet last run on db1005 is CRITICAL puppet fail [18:59:21] PROBLEM - puppet last run on cp1045 is CRITICAL Puppet has 2 failures [18:59:22] PROBLEM - puppet last run on mw1232 is CRITICAL puppet fail [18:59:22] PROBLEM - puppet last run on mw1184 is CRITICAL puppet fail [18:59:22] PROBLEM - puppet last run on mw1028 is CRITICAL puppet fail [18:59:22] PROBLEM - puppet last run on mw1203 is CRITICAL puppet fail [18:59:23] PROBLEM - puppet last run on rdb1003 is CRITICAL puppet fail [18:59:23] PROBLEM - puppet last run on db1028 is CRITICAL puppet fail [18:59:24] PROBLEM - puppet last run on mw1121 is CRITICAL puppet fail [18:59:30] PROBLEM - puppet last run on holmium is CRITICAL puppet fail [18:59:30] PROBLEM - puppet last run on mw1125 is CRITICAL puppet fail [18:59:30] PROBLEM - puppet last run on mw1029 is CRITICAL puppet fail [18:59:30] PROBLEM - puppet last run on mw1253 is CRITICAL puppet fail [18:59:30] PROBLEM - puppet last run on mw1079 is CRITICAL puppet 
fail [18:59:31] PROBLEM - puppet last run on cp1065 is CRITICAL Puppet has 28 failures [18:59:31] PROBLEM - puppet last run on mw1012 is CRITICAL puppet fail [18:59:32] PROBLEM - puppet last run on mw1115 is CRITICAL puppet fail [18:59:32] PROBLEM - puppet last run on elastic1024 is CRITICAL puppet fail [18:59:33] PROBLEM - puppet last run on mw2097 is CRITICAL puppet fail [18:59:33] PROBLEM - puppet last run on mw2042 is CRITICAL puppet fail [18:59:34] PROBLEM - puppet last run on es2006 is CRITICAL puppet fail [18:59:34] PROBLEM - puppet last run on mw2062 is CRITICAL puppet fail [18:59:40] (03CR) 10Yuvipanda: "> ldapsearch -x -D cn=proxyagent,ou=profile,dc=wikimedia,dc=org -b ou=hosts,dc=wikimedia,dc=org -W puppetClass=role::deployment::salt_mas" [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [18:59:46] PROBLEM - puppet last run on mw1151 is CRITICAL puppet fail [18:59:46] PROBLEM - puppet last run on mw1120 is CRITICAL puppet fail [18:59:47] PROBLEM - puppet last run on mw1035 is CRITICAL puppet fail [18:59:47] PROBLEM - puppet last run on mw1180 is CRITICAL puppet fail [18:59:51] PROBLEM - puppet last run on mw1061 is CRITICAL puppet fail [18:59:51] PROBLEM - puppet last run on es2009 is CRITICAL puppet fail [18:59:51] PROBLEM - puppet last run on mw1047 is CRITICAL puppet fail [18:59:51] PROBLEM - puppet last run on mw1104 is CRITICAL puppet fail [18:59:51] PROBLEM - puppet last run on mw1096 is CRITICAL puppet fail [18:59:51] PROBLEM - puppet last run on analytics1001 is CRITICAL puppet fail [18:59:51] PROBLEM - puppet last run on lvs1004 is CRITICAL puppet fail [18:59:51] PROBLEM - puppet last run on es1010 is CRITICAL puppet fail [18:59:52] PROBLEM - puppet last run on analytics1039 is CRITICAL puppet fail [18:59:52] PROBLEM - puppet last run on cp1072 is CRITICAL Puppet has 27 failures [18:59:53] PROBLEM - puppet last run on virt1011 is CRITICAL puppet fail [18:59:53] PROBLEM - puppet last run on restbase1004 is CRITICAL Puppet has 27 failures [18:59:54] PROBLEM - puppet last run on analytics1015 is CRITICAL puppet fail [18:59:54] PROBLEM - puppet last run on mw2112 is CRITICAL puppet fail [19:00:06] PROBLEM - puppet last run on elastic1013 is CRITICAL puppet fail [19:00:06] PROBLEM - puppet last run on mw1040 is CRITICAL puppet fail [19:00:07] PROBLEM - puppet last run on db1026 is CRITICAL puppet fail [19:00:10] PROBLEM - puppet last run on cp1069 is CRITICAL Puppet has 27 failures [19:00:10] PROBLEM - puppet last run on ganeti1003 is CRITICAL puppet fail [19:00:10] PROBLEM - puppet last run on virt1010 is CRITICAL puppet fail [19:00:11] PROBLEM - puppet last run on cp1059 is CRITICAL Puppet has 30 failures [19:00:11] PROBLEM - puppet last run on lvs2005 is CRITICAL puppet fail [19:00:11] PROBLEM - puppet last run on mw1254 is CRITICAL puppet fail [19:00:11] PROBLEM - puppet last run on mw1117 is CRITICAL puppet fail [19:00:12] PROBLEM - puppet last run on mw1242 is CRITICAL puppet fail [19:00:12] PROBLEM - puppet last run on mw1150 is CRITICAL puppet fail [19:00:13] PROBLEM - puppet last run on wtp2014 is CRITICAL puppet fail [19:00:13] PROBLEM - puppet last run on mw1119 is CRITICAL puppet fail [19:00:14] PROBLEM - puppet last run on mw1158 is CRITICAL puppet fail [19:00:14] PROBLEM - puppet last run on mw1095 is CRITICAL puppet fail [19:00:15] PROBLEM - puppet last run on analytics1040 is CRITICAL puppet fail [19:00:26] PROBLEM - puppet last run on mc2016 is CRITICAL Puppet has 8 failures [19:00:26] PROBLEM - puppet last run on mw2012 is CRITICAL puppet 
fail [19:00:27] PROBLEM - puppet last run on mw2022 is CRITICAL puppet fail [19:00:27] PROBLEM - puppet last run on mc2012 is CRITICAL Puppet has 1 failures [19:00:28] PROBLEM - puppet last run on ganeti2004 is CRITICAL Puppet has 10 failures [19:00:28] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.684 second response time [19:00:30] PROBLEM - puppet last run on ocg1002 is CRITICAL puppet fail [19:00:30] PROBLEM - puppet last run on cp1051 is CRITICAL Puppet has 2 failures [19:00:30] PROBLEM - puppet last run on mw1208 is CRITICAL puppet fail [19:00:30] PROBLEM - puppet last run on wtp2001 is CRITICAL puppet fail [19:00:31] PROBLEM - puppet last run on cp1071 is CRITICAL Puppet has 27 failures [19:00:31] PROBLEM - puppet last run on mw1013 is CRITICAL puppet fail [19:00:38] bye icinga-wm! [19:00:55] there was a quick "12:00 RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.684 second response time" there [19:01:00] so we're due for a flood of RECOVERY [19:01:10] well, were [19:01:23] yep, already on neon to kill the bot tmp [19:01:32] but it killed itself by flooding :) [19:02:44] PROBLEM - puppet last run on cp3040 is CRITICAL Puppet has 3 failures [19:02:44] RECOVERY - puppet last run on elastic1010 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:02:44] RECOVERY - puppet last run on cp1049 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:02:44] RECOVERY - puppet last run on cp1070 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:02:50] PROBLEM - puppet last run on cp3005 is CRITICAL Puppet has 1 failures [19:02:51] PROBLEM - puppet last run on mw1221 is CRITICAL puppet fail [19:03:16] i'll bring it back when most recoveries are over [19:03:34] sees the puppetmaster working [19:07:35] uhm [19:07:40] I wasn't watching, what's all this? [19:07:51] also, I don't think we restart any puppetmaster service there, just apache2? [19:07:57] but I'm not 100% sure [19:08:07] global puppet fail but it is recovering already [19:08:11] I think there is a puppetmaster service defined, but starting it screws up the apache2 variant [19:08:16] didnt even restart anything but watches puppetmaster log and neon log [19:08:49] I mean the above re: 18:58 < ori> !log restarted puppetmaster on palladium as well [19:08:57] I think in the past, that screwed me [19:09:14] ah:) yes, in the past i have restarted apache only [19:09:20] because it was a fail in mod_passenger [19:09:40] I think "service puppetmaster" is something else we don't use and shouldn't be starting, basically [19:09:43] which is confusing as hell [19:10:24] yes, and yes [19:10:29] ack [19:10:30] root@palladium:~# service puppetmaster status * master is not running [19:10:41] i didn't see the recoveries happening after the apache2 restart [19:10:44] but i think i was just impatient [19:10:57] but tail -f /var/log/syslog shows me it's compiling lots of catalogs [19:12:16] it's more that neon takes a while to go through all the notifications. it's already sending out recoveries before it's done sending out all the fails [19:13:35] (03CR) 10Tim Landscheidt: "Indeed the query and now I have a place to find it if I need it again :-)." 
[puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [19:14:53] 7Puppet, 6Labs, 5Patch-For-Review: Labs: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/gmond_memcached.py] - https://phabricator.wikimedia.org/T95107#1234619 (10Tgr) Can't test right now, but I did test the patch in a self-hosted puppetmaster when I... [19:20:22] RECOVERY - puppet last run on ms-be1016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:20:31] RECOVERY - puppet last run on wtp1021 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [19:20:40] RECOVERY - puppet last run on zirconium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:20:50] RECOVERY - puppet last run on db2035 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [19:20:51] RECOVERY - puppet last run on analytics1034 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:20:51] RECOVERY - puppet last run on analytics1036 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:20:51] RECOVERY - puppet last run on mw1182 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:20:52] RECOVERY - puppet last run on ms-be2010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:21:01] RECOVERY - puppet last run on cp1051 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:21:01] RECOVERY - puppet last run on ms-be1017 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [19:21:02] RECOVERY - puppet last run on mw2135 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:21:02] RECOVERY - puppet last run on mw2089 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:21:02] RECOVERY - puppet last run on mw2098 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:21:02] RECOVERY - puppet last run on mw2025 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:21:11] RECOVERY - puppet last run on wtp1024 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:21:21] RECOVERY - puppet last run on neon is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:21:30] RECOVERY - puppet last run on labvirt1005 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:21:30] RECOVERY - puppet last run on elastic1031 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [19:21:31] RECOVERY - puppet last run on mw2181 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:21:31] RECOVERY - puppet last run on logstash1003 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [19:21:40] RECOVERY - puppet last run on mc1009 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [19:21:41] RECOVERY - puppet last run on mw1196 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:21:41] RECOVERY - puppet last run on mw1221 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:21:41] RECOVERY - puppet last run on analytics1019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:23:34] <^d> bd808: I'm going to work up a live hack pending a fix for the Cirrus bug [19:23:42] <^d> So we can bail earlier without hammering Elastic [19:27:01] (03CR) 10Hashar: Initial 
Debian packaging (031 comment) [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/203961 (https://phabricator.wikimedia.org/T89142) (owner: 10Hashar) [19:27:17] (03PS9) 10Hashar: Initial Debian packaging [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/203961 (https://phabricator.wikimedia.org/T89142) [19:27:48] "There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). " [19:28:10] (03CR) 10Hashar: "Fix up whitespaces issue in debian/postinst" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/203961 (https://phabricator.wikimedia.org/T89142) (owner: 10Hashar) [19:29:31] (03PS3) 10Hashar: Support spaces in Gearman functions names [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/205564 [19:29:41] (03PS2) 10Hashar: wmf2: patch to support spaces in Gearman functions [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/205571 [19:33:35] (03CR) 10Hashar: "I have refreshed the files on http://people.wikimedia.org/~hashar/debs/nodepool/ using PS2." [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/205571 (owner: 10Hashar) [19:40:44] !log demon Synchronized php-1.26wmf2/extensions/CirrusSearch/includes/Searcher.php: debugging (duration: 00m 17s) [19:40:50] Logged the message, Master [19:42:28] !log demon Synchronized php-1.26wmf2/extensions/CirrusSearch/includes/Searcher.php: undo debugging (duration: 00m 14s) [19:42:31] Logged the message, Master [19:42:52] heya, anyone familiar with linux cgroups here? [19:42:57] paravoid: ? [19:43:58] ori, i am reverting y'day night's deploy .. corruptions being reported .. i'll cherry pick tim's patch on top of that. [19:44:08] or maybe not do that either right now. [19:44:12] subbu: got it, thanks [19:46:14] (03CR) 10Andrew Bogott: [C: 032 V: 032] Initial Debian packaging [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/203961 (https://phabricator.wikimedia.org/T89142) (owner: 10Hashar) [19:48:11] (03PS1) 10Aaron Schulz: Removed unused "max threads" stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206434 [19:48:25] 6operations, 10Wikimedia-Labs-wikitech-interface: Can not log into wikitech.wikimedia.org - https://phabricator.wikimedia.org/T96240#1234739 (10Andrew) [19:48:29] (03CR) 10Aaron Schulz: [C: 032] Removed unused "max threads" stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206434 (owner: 10Aaron Schulz) [19:48:35] (03Merged) 10jenkins-bot: Removed unused "max threads" stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206434 (owner: 10Aaron Schulz) [19:49:21] (03CR) 10Jdlrobson: [C: 031] "Matt flaschen - correct. I can't +2 however... able to help?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205474 (owner: 10Jdlrobson) [19:50:34] 6operations, 6Labs: OOM on virt1000 - https://phabricator.wikimedia.org/T88256#1234749 (10Andrew) 5Open>3Resolved This was probably https://phabricator.wikimedia.org/T96256 -- it doesn't seem to be happening anymore. [19:52:30] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
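The ldapsearch one-liner quoted in the review comments above (searching ou=hosts for a puppetClass value) can also be run from a script. A sketch using the ldap3 library; the LDAP server hostname, the interactive password prompt, the requested attributes and the example role name are placeholders, while the proxyagent bind DN, the search base and the puppetClass filter come from the pasted command:

    # Sketch of the quoted ldapsearch done with the ldap3 library instead.
    # Server hostname, password handling, requested attributes and the
    # example role name are placeholders; the proxyagent bind DN, the
    # ou=hosts search base and the puppetClass attribute come from the
    # command pasted in the review comments above.
    import getpass
    from ldap3 import Server, Connection, SUBTREE

    def hosts_with_puppet_class(puppet_class):
        server = Server("ldap.example.wikimedia.org")  # placeholder host
        conn = Connection(server,
                          user="cn=proxyagent,ou=profile,dc=wikimedia,dc=org",
                          password=getpass.getpass("LDAP password: "),
                          auto_bind=True)
        conn.search("ou=hosts,dc=wikimedia,dc=org",
                    "(puppetClass=%s)" % puppet_class,
                    search_scope=SUBTREE,
                    attributes=["puppetClass"])
        return [entry.entry_dn for entry in conn.entries]

    if __name__ == "__main__":
        for dn in hosts_with_puppet_class("role::example::foo"):  # placeholder role
            print(dn)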
[19:52:33] 7Puppet, 6Labs, 5Patch-For-Review: Labs: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/gmond_memcached.py] - https://phabricator.wikimedia.org/T95107#1234762 (10Dzahn) p:5Triage>3Normal [19:52:47] !log aaron Synchronized wmf-config/db-eqiad.php: Removed unused "max threads" stuff (duration: 00m 14s) [19:52:48] !log revert parsoid deploy to 3311936a [19:52:52] Logged the message, Master [19:52:54] Logged the message, Master [19:53:10] !log aaron Synchronized wmf-config/db-codfw.php: Removed unused "max threads" stuff (duration: 00m 15s) [19:53:13] Logged the message, Master [19:53:52] (03CR) 10Hashar: [C: 04-2] "Not meant to be merged, that is applied to the Debian package as a quilt patch." [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/205564 (owner: 10Hashar) [19:56:50] (03CR) 10Andrew Bogott: [C: 032 V: 032] wmf2: patch to support spaces in Gearman functions [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/205571 (owner: 10Hashar) [19:57:34] hashar: did I merge the right one of those? [19:58:43] 6operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Isolation, 7Nodepool, and 2 others: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1234813 (10hashar) We now have a preliminary Debian package which is good enough. We will improv... [19:59:43] 6operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Isolation, 7Nodepool: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1234818 (10hashar) p:5Normal>3Low [20:00:35] ori, do you want me to cherry-pick tim's patch? but, i don't like the 1 retry part of it. [20:01:03] if it is not critical, i would like to investigate which of our patches is causing the corruption. [20:01:10] and defer those fixes to monday. [20:01:54] subbu: that's fine. We are going to also attack this from another angle, which is to set strict wall-clock timeouts for PHP requests. [20:02:10] k [20:04:28] (03PS1) 10Ori.livneh: HHVM: Limit wall execution time of FCGI reqs to 290s [puppet] - 10https://gerrit.wikimedia.org/r/206440 [20:04:37] ^ bblack gwicke [20:05:02] i'd rather have this in place for the weekend than not [20:05:17] i'll be back in 30, will deploy then if you guys are cool with it [20:05:32] (i'd try it on mw1017 first regardless) [20:06:32] (03CR) 10BBlack: [C: 031] "+1 on the general principle, don't ask me about the tech specifics of the patch :)" [puppet] - 10https://gerrit.wikimedia.org/r/206440 (owner: 10Ori.livneh) [20:08:11] (03CR) 10GWicke: [C: 031] "+1 for starting with 290s. I didn't fully check the details or test the patch." [puppet] - 10https://gerrit.wikimedia.org/r/206440 (owner: 10Ori.livneh) [20:10:42] (03PS1) 10coren: Refuse to install nova-compute on broken kernels [puppet] - 10https://gerrit.wikimedia.org/r/206441 (https://phabricator.wikimedia.org/T97152) [20:11:10] hmm, hello moritzm :) [20:11:17] (03PS2) 10coren: Refuse to install nova-compute on broken kernels [puppet] - 10https://gerrit.wikimedia.org/r/206441 (https://phabricator.wikimedia.org/T97152) [20:11:26] andrewbogott: ^^ [20:11:32] have you any experience with cgroups? 
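The plan above is to cap the wall-clock (not CPU) time of PHP/HHVM FCGI requests at 290 seconds; the exact HHVM setting is not shown in the log. Purely as a conceptual sketch of a wall-clock limit, here is the same idea expressed with SIGALRM in Python:

    # Conceptual sketch only: the change above caps the *wall-clock* time of
    # HHVM FCGI requests at 290s; the exact HHVM knob is not shown in the
    # log, so this is not that configuration. It just demonstrates aborting a
    # handler after N seconds of real time (not CPU time) via SIGALRM.
    import signal
    import time

    class WallClockTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise WallClockTimeout("request exceeded wall-clock limit")

    def run_with_wall_clock_limit(func, seconds=290):
        signal.signal(signal.SIGALRM, _on_alarm)
        signal.alarm(seconds)   # fires after `seconds` of real time
        try:
            return func()
        finally:
            signal.alarm(0)     # always clear the pending alarm

    if __name__ == "__main__":
        # Demo with a 2-second limit and a handler that sleeps too long.
        try:
            run_with_wall_clock_limit(lambda: time.sleep(5), seconds=2)
        except WallClockTimeout as exc:
            print("aborted:", exc)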
[20:11:50] (03PS1) 10Aaron Schulz: [WIP] Set lock_wait_timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206442 [20:13:32] (03PS3) 10coren: Refuse to install nova-compute on broken kernels [puppet] - 10https://gerrit.wikimedia.org/r/206441 (https://phabricator.wikimedia.org/T97152) [20:13:37] (03CR) 10Andrew Bogott: [C: 04-1] "Looks good. One request: virt1012 is running 3.13.0-46-generic, so let's make 46 the minimum required so we don't break puppet on an exi" [puppet] - 10https://gerrit.wikimedia.org/r/206441 (https://phabricator.wikimedia.org/T97152) (owner: 10coren) [20:14:04] hashar or YuviPanda do you have a phab link for me regarding split dns for public IPs? [20:14:08] Ohright. -46 not -49 [20:14:28] andrewbogott: I don't think it is documented any where on wikitech [20:14:40] (03PS4) 10coren: Refuse to install nova-compute on broken kernels [puppet] - 10https://gerrit.wikimedia.org/r/206441 (https://phabricator.wikimedia.org/T97152) [20:14:53] hashar: isn’t there a phab ticket for it? [20:14:56] If not I’ll make one [20:14:59] ah I have misread [20:15:00] looking [20:15:03] but I think there were two already last I looked [20:15:27] https://phabricator.wikimedia.org/T95288 [20:15:32] ""Designate should support split horizon resolution to yield private IP of instances behind a public DNS entry"" [20:15:45] which has some doc about our current dnsmasq aliasing and some history [20:15:47] (03CR) 10Andrew Bogott: [C: 032] Refuse to install nova-compute on broken kernels [puppet] - 10https://gerrit.wikimedia.org/r/206441 (https://phabricator.wikimedia.org/T97152) (owner: 10coren) [20:16:11] I have no idea how your are going to be able to fix it though :/ [20:16:29] andrewbogott: You meant to +2? [20:16:43] yep, I’m testing now [20:16:47] were you ready? :) [20:16:58] Yeah, I just didn't expect it. :-) [20:18:06] The only question is versioncmp() which puppet only documents as "compares the same way package resources do" which isn't particularily documented. But examples seem to point at it understanding x.y.z-w correctly. :-) [20:18:27] I guess I can do it wrong on purpose next week and see if it fails :) [20:18:48] That would, actually, be a good idea. [20:21:26] yep [20:21:43] hashar: This is the top-level tracking bug: https://phabricator.wikimedia.org/T97163 you can add that as a blocker to whatever nodepool bug you have. [20:21:46] And then, check my work? [20:22:00] PROBLEM - puppet last run on virt1009 is CRITICAL puppet fail [20:22:01] It’s quite a long dependency chain but each step is already 80% done :) [20:22:24] andrewbogott: ^^ fails on precise? [20:22:26] well well [20:22:32] Oh, idiot! [20:22:51] I only do the package {} if trust and good kernel, not on 'not trusty' [20:22:51] * andrewbogott looks at that patch again [20:23:00] oh [20:23:05] andrewbogott: excellent! Thank you very much [20:23:09] I guess that could be important [20:23:25] that sounds like an Epic project inside the Epic project :) [20:23:36] andrewbogott: Doing fixy now. [20:23:55] I am off, have a good week-end everyone [20:24:40] PROBLEM - puppet last run on virt1002 is CRITICAL puppet fail [20:26:43] (03PS1) 10coren: Fix test for Trusty version for nova-compute [puppet] - 10https://gerrit.wikimedia.org/r/206445 [20:26:48] andrewbogott: ^^ [20:26:55] bblack: if you have a moment, could you review and comment on https://phabricator.wikimedia.org/T95288 ? thx [20:27:11] Ah, whitespace. 
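The versioncmp() question above boils down to whether "3.13.0-46-generic" style strings are compared numerically, component by component, the way package versions are. This is not Puppet's implementation, just an illustration of that comparison and of why naive string comparison is not good enough in general:

    # Not Puppet's versioncmp(), just the comparison the broken-kernel guard
    # relies on: split "x.y.z-w-flavour" strings into numeric components and
    # compare those. Plain string comparison happens to work for -46 vs -49
    # but falls over as soon as component widths differ.
    import re

    def version_key(version):
        # "3.13.0-49-generic" -> (3, 13, 0, 49); non-numeric parts dropped.
        return tuple(int(p) for p in re.findall(r"\d+", version))

    def kernel_at_least(running, minimum):
        return version_key(running) >= version_key(minimum)

    if __name__ == "__main__":
        print(kernel_at_least("3.13.0-49-generic", "3.13.0-46-generic"))   # True
        print(kernel_at_least("3.13.0-44-generic", "3.13.0-46-generic"))   # False
        # Where raw string comparison goes wrong:
        print("3.13.0-100-generic" >= "3.13.0-99-generic")                 # False!
        print(kernel_at_least("3.13.0-100-generic", "3.13.0-99-generic"))  # True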
[20:27:24] (03CR) 10jenkins-bot: [V: 04-1] Fix test for Trusty version for nova-compute [puppet] - 10https://gerrit.wikimedia.org/r/206445 (owner: 10coren) [20:28:44] andrewbogott: what kind of commentary do you want there? :) [20:29:00] PROBLEM - puppet last run on mw1042 is CRITICAL puppet fail [20:29:09] icinga-wm: not again? [20:29:11] PROBLEM - puppet last run on virt1006 is CRITICAL puppet fail [20:29:22] bblack: I want you to say “Sure, /etc/hosts is a way better solution, why didn’t I think of that?” so I can fix the problem without having to learn anything new. [20:29:31] (03PS2) 10coren: Fix test for Trusty version for nova-compute [puppet] - 10https://gerrit.wikimedia.org/r/206445 [20:29:47] andrewbogott: fwiw, in case it's not apparent, in that puppet snippet with $nova_dnsmasq_aliases, the hostname keys are just arbitrary labels. dnsmasq is translating public_ip to private_ip regardless of which hostname was queried. [20:29:53] bblack: or, alternatively, you could say “That’s trivial to implemnet in pdns, let me write a patch…” [20:30:17] andrewbogott: That one should even have correct puupet syntax. :-) ^^ [20:30:32] PROBLEM - puppet last run on virt1003 is CRITICAL puppet fail [20:30:34] andrewbogott: Also, can do split-horizon if you want. I speak mostly bind but pdns is no stranger. [20:30:37] bblack: ah, good point. So your fix is more resilient than just an hostname alias [20:31:35] well more generic, in any case [20:31:41] PROBLEM - puppet last run on virt1001 is CRITICAL puppet fail [20:31:53] powerdns can do anything in theory, they even let you load Lua code to muck with responses [20:31:53] (03PS2) 10Ori.livneh: HHVM: Limit wall execution time of FCGI reqs to 290s [puppet] - 10https://gerrit.wikimedia.org/r/206440 [20:32:01] PROBLEM - puppet last run on virt1004 is CRITICAL puppet fail [20:32:05] (03PS1) 10Ottomata: Use cgcreate to create a CPU cgroup for Impalad [puppet/cdh] - 10https://gerrit.wikimedia.org/r/206446 [20:32:12] RECOVERY - puppet last run on mw1042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:32:12] (03CR) 10Ori.livneh: [C: 032 V: 032] HHVM: Limit wall execution time of FCGI reqs to 290s [puppet] - 10https://gerrit.wikimedia.org/r/206440 (owner: 10Ori.livneh) [20:32:22] * dancecat always loves to see where lua is used as glue [20:32:31] PROBLEM - puppet last run on virt1010 is CRITICAL puppet fail [20:32:51] andrewbogott: here's what I don't understand: there's a real DNS server that serves wmflabs.org to the world, and then there's dnsmasq for these instances. my hacks only affected dnsmasq, but not the real DNS server... [20:33:11] having no idea what Designate is, does Designate do both jobs, or just the dnsmasq job? [20:33:16] the puppet fails on virt hosts are legit this time [20:33:21] PROBLEM - puppet last run on virt1007 is CRITICAL puppet fail [20:33:27] Could not find dependency Package[nova-compute] for Service[nova-compute] [20:33:29] (03CR) 10Andrew Bogott: [C: 031] Fix test for Trusty version for nova-compute [puppet] - 10https://gerrit.wikimedia.org/r/206445 (owner: 10coren) [20:33:31] sounds related [20:34:11] bblack: Previously there was pdns/ldap for public dns, and dnsmasq for private dns [20:34:15] mutante: It's legit. The fix is about to be merged. 
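bblack's description above (dnsmasq rewriting a public IP to the private IP for in-cloud clients, and BIND-style split horizon serving a different view per client network) reduces to a simple decision rule. A toy sketch, not dnsmasq, pdns or designate; the subnet, name and addresses are made-up placeholders:

    # Toy decision rule for the split-horizon idea above: the same name gets
    # its private address for clients inside the cloud network and its public
    # one for everyone else. Not dnsmasq, pdns or designate; the subnet and
    # addresses are placeholders (RFC 5737 / RFC 1918 examples).
    import ipaddress

    INTERNAL_NETS = [ipaddress.ip_network("10.68.16.0/21")]   # placeholder

    RECORDS = {
        # name: (public_ip, private_ip) -- placeholder data
        "instance.wmflabs.org.": ("203.0.113.10", "10.68.16.4"),
    }

    def resolve(name, client_ip):
        public_ip, private_ip = RECORDS[name]
        client = ipaddress.ip_address(client_ip)
        if any(client in net for net in INTERNAL_NETS):
            return private_ip   # internal view
        return public_ip        # public view

    if __name__ == "__main__":
        print(resolve("instance.wmflabs.org.", "10.68.17.42"))    # private
        print(resolve("instance.wmflabs.org.", "198.51.100.7"))   # public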
[20:34:27] I am moving private dns from dnsmasq to pdns/mysql (an unrelated pdns install) [20:34:34] (03CR) 10coren: [C: 032] Fix test for Trusty version for nova-compute [puppet] - 10https://gerrit.wikimedia.org/r/206445 (owner: 10coren) [20:34:42] Some day soon (tm) that server will also handle public dns, and the pdns/ldap server will die [20:34:45] (03PS2) 10Ottomata: Use cgcreate to create a CPU cgroup for Impalad [puppet/cdh] - 10https://gerrit.wikimedia.org/r/206446 (https://phabricator.wikimedia.org/T96329) [20:34:49] !log Deployed I1fa012ca1: HHVM: Limit wall execution time of FCGI reqs to 290s [20:34:54] Logged the message, Master [20:34:55] dnsmasq will continue to live on as a dhcp server but instances won’t rely on it for dns after initial build. [20:35:01] \o/ [20:35:04] bblack: Make sense? [20:35:09] sorta, yeah [20:35:15] (03CR) 10Ottomata: [C: 032] Use cgcreate to create a CPU cgroup for Impalad [puppet/cdh] - 10https://gerrit.wikimedia.org/r/206446 (https://phabricator.wikimedia.org/T96329) (owner: 10Ottomata) [20:35:25] so, you probably need to implement some real split-horizon stuff in your designate server, yes. [20:35:27] So, we’re in transition — in the long run there will be just one pdns server doing public and private both. [20:35:41] /etc/hosts hacks works fine too, but is pretty ugly, I don't think anyone would want to stick with that, right? [20:35:45] ‘designate’ is the tool that populates the pdns/mysql data. [20:36:04] (03PS1) 10Ottomata: Update cdh module with cgroup change for Impala [puppet] - 10https://gerrit.wikimedia.org/r/206448 [20:36:15] bblack: I don’t know. It increases the chances of a volunteer knowing what the heck is happening, since they can look at their own /etc/hosts, are less likely to look at settings in the dns server. [20:36:28] (03CR) 10Ottomata: [C: 032 V: 032] Update cdh module with cgroup change for Impala [puppet] - 10https://gerrit.wikimedia.org/r/206448 (owner: 10Ottomata) [20:36:40] I don't know how it looks in pdns, but in BIND split-horizon is a matter of having two copies of a zonefile and saying in the daemon config "clients matching IP network 1.2.3.0/24 get this alternate version", in basic terms [20:36:44] Coren: if you want to just take on https://phabricator.wikimedia.org/T95288 that would be the best case scenario :) [20:37:06] andrewbogott: I'll do it. pdns isn't friendly to split horizons but it's doable. [20:37:11] PROBLEM - puppet last run on virt1011 is CRITICAL puppet fail [20:37:27] bblack: yeah, I’m sure it’s possible… my only worry is that since everything in that server is managed by designate, designate may dislike [20:37:35] right [20:37:37] it might [20:37:41] (03PS1) 10Aaron Schulz: [WIP] Set $wgJobSerialCommitThreshold [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206451 [20:37:41] PROBLEM - puppet last run on virt1008 is CRITICAL puppet fail [20:38:00] not to mention you need to dynamically generate two sets of data, one for each side of the horizon. designate may not know how to do that at all. [20:38:11] but maybe something could be scripted to generate one from the other, I donno [20:38:31] RECOVERY - puppet last run on virt1009 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [20:39:06] Coren: think we can still hit labstore* switch by 30th of april with this on your plate too?
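The "Use cgcreate to create a CPU cgroup for Impalad" changes being merged above do their work through the libcgroup tools from Puppet. As a rough sketch of what that amounts to at the cgroup v1 sysfs level (group name, cpu.shares value and mount point are assumptions, and this needs root):

    # Rough sketch of what creating a CPU cgroup for impalad amounts to at
    # the cgroup-v1 sysfs level. The real change drives the libcgroup tools
    # (cgcreate and friends) from the cdh Puppet module; this is not that
    # code. Group name, cpu.shares value and mount point are assumptions.
    import os

    CPU_CGROUP_ROOT = "/sys/fs/cgroup/cpu"

    def create_cpu_cgroup(name="impala", shares=1024):
        path = os.path.join(CPU_CGROUP_ROOT, name)
        if not os.path.isdir(path):
            os.makedirs(path)               # roughly: cgcreate -g cpu:/impala
        with open(os.path.join(path, "cpu.shares"), "w") as f:
            f.write(str(shares))            # roughly: cgset -r cpu.shares=1024 impala
        return path

    def move_pid_into_cgroup(pid, name="impala"):
        # Writing a PID into the group's tasks file moves that process into
        # the group, roughly what starting impalad under cgexec achieves.
        with open(os.path.join(CPU_CGROUP_ROOT, name, "tasks"), "w") as f:
            f.write(str(pid))

    if __name__ == "__main__":
        group = create_cpu_cgroup()
        move_pid_into_cgroup(os.getpid())
        print("current process now in", group)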
[20:39:33] YuviPanda: Not an issue; the switching is mostly do-x-wait-for-x-repeat [20:39:39] \o/ cool [20:39:56] YuviPanda: It's long and lots of work but requires no brainpower to speak of. :-) [20:39:59] bblack: see, now you are proposing a hack much uglier than just inserting things into /etc/hosts :) [20:40:23] Coren: :) https://gerrit.wikimedia.org/r/#/c/199267/ needs work too. [20:40:25] so there is a suggestion to add their own DNS server directly into designate (vs using a backend) [20:40:35] andrewbogott: /etc/hosts is evil and ugly and nearly impossible to maintain properly. When I deployed that, it was meant as a temporary workaround on a single project. :-) [20:40:47] they call it MiniDNS. and in that context they mention split horizon as a new feature [20:41:05] " Consider the possibility of having multiple views of a DNS zone (The most common example of this is for split horizon)." [20:41:19] mutante: yikes, I’d rather not add yet another dns implementation to the mix [20:41:21] RECOVERY - puppet last run on virt1002 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [20:41:34] mutante: but that suggests that the api will have to support it [20:42:05] andrewbogott: it argues that including it in designate itself vs using a backend DNS is because: [20:42:09] "Designate's current "Backends" implementation is flawed, leaving many opportunities for for Designate and a Backend to become out of sync," [20:42:16] https://wiki.openstack.org/wiki/Designate/Blueprints/MiniDNS [20:42:18] YuviPanda: btw, https://gerrit.wikimedia.org/r/#/c/205897/ is going to block me in not all that long. [20:42:31] mutante: hah! It’s a little late for them to be realizing that :) [20:43:07] andrewbogott: ah. how long? Can I do it Monday or should I do it today? [20:43:16] YuviPanda: Monday is fine [20:44:09] bblack: an alternative is… magic fix for https://phabricator.wikimedia.org/T96924 [20:44:30] bblack: andrewbogott I really think https://phabricator.wikimedia.org/T96924 is the ‘real’ fix [20:44:43] andrewbogott: have you tried asking the OpenStack folks? [20:44:43] I [20:44:48] I can’t believe it’s not a common enough use case [20:45:10] YuviPanda: no, because no matter what I ask they will tell me to upgrade and install new components and I’ll have another 12 months worth of dependency work. [20:46:00] RECOVERY - puppet last run on virt1006 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [20:46:48] andrewbogott: haha :) [20:46:56] andrewbogott: we should still ask, just in case maaaybeeeeee. [20:47:01] RECOVERY - puppet last run on virt1004 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [20:47:07] YuviPanda: many many openstack users are running nova-network but the openstack devs really hate that. [20:47:15] So there’s a switch-to-neutron knee-jerk reaction. [20:47:21] RECOVERY - puppet last run on virt1003 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [20:47:44] “Just switch to neutron — it’s easy! All you have to do is " [20:48:21] RECOVERY - puppet last run on virt1001 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:48:50] Also, designate doing its down "miniDNS" is a terrible, terrible idea. Doing a proper DNS server is *hard* and making it scalabler even harder. [20:49:42] I wake up every morning thinking I can do DNS better than any existing implementations. Just the fact that I think that leads me to conclude that it must be very very hard. 
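For the /etc/hosts workaround being weighed above, the mechanics are trivial; the pain is distribution and churn. A small sketch, with placeholder names and addresses, that renders such an override fragment, which would then have to be regenerated and pushed to every instance whenever a public name or floating IP changes:

    # Render the /etc/hosts override fragment being debated above: public
    # service names pointed at their private addresses. Names and addresses
    # are placeholders. Every change to a floating IP or hostname means
    # regenerating this and pushing it to every instance, which is the
    # maintenance burden objected to above.
    OVERRIDES = {
        "example-proxy.wmflabs.org": "10.68.16.4",   # placeholder
        "example-db.wmflabs.org": "10.68.17.9",      # placeholder
    }

    MARKER = "# managed labs overrides -- do not edit by hand"

    def render_hosts_fragment(overrides):
        lines = [MARKER]
        for name in sorted(overrides):
            lines.append("%s\t%s" % (overrides[name], name))
        return "\n".join(lines) + "\n"

    if __name__ == "__main__":
        print(render_hosts_fragment(OVERRIDES), end="")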
[20:50:01] RECOVERY - puppet last run on virt1007 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [20:50:05] maybe designate wants to support gdnsd as a backend? :) [20:50:29] mutante: in the perfect future I’ll add support for that myself. Not tonight though. [20:51:00] RECOVERY - puppet last run on virt1010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:52:32] andrewbogott et al: I don't know, these are all complexy designy questions, and it's Friday, and there are other things going on, and meh [20:52:50] bblack: I have to go in five minutes anyway :) [20:53:09] It would be nice if the routing just worked, though :) [20:54:13] andrewbogott: it is very very hard, mostly because the protocol was misdesigned decades ago and some cabal has been protecting it from being replaced/updated for sanity all the time since, and even actively making it worse. [20:54:30] RECOVERY - puppet last run on virt1008 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:55:40] RECOVERY - puppet last run on virt1011 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:55:44] for a random tiny example of crazy: DNS compresses response data to keep packet size down. but not by, you know, using an actual compression algorithm on the data we could just use a library for, which would actually perform well [20:56:06] they invented their own scheme of "magic pointers from one hostname in the packet to duplicate content in another hostname in the packet" [20:56:21] ... [20:56:34] and that in turn has like 400 little caveats about it [20:57:00] in the net, this is the function I had to write to compress my outbound data: https://github.com/gdnsd/gdnsd/blob/master/src/dnspacket.c#L528 [20:57:19] (03PS2) 10Aaron Schulz: Lowered innodb_lock_wait_timeout from defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206442 [20:57:42] yikes [21:01:13] (03Abandoned) 10Aaron Schulz: Changed production innodb_lock_wait_timeout from 50 => 30 [puppet] - 10https://gerrit.wikimedia.org/r/206145 (owner: 10Aaron Schulz) [21:05:51] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail [21:08:19] (03PS1) 10BBlack: temporarily block an abusive search query... [puppet] - 10https://gerrit.wikimedia.org/r/206459 [21:09:04] (03CR) 10BBlack: [C: 032 V: 032] temporarily block an abusive search query... 
[puppet] - 10https://gerrit.wikimedia.org/r/206459 (owner: 10BBlack) [21:10:48] (03CR) 10Ori.livneh: [C: 04-1] graphite: stop system carbon-c-relay (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/206127 (owner: 10Filippo Giunchedi) [21:11:07] !log started hdfs balancer run [21:11:11] Logged the message, Master [21:11:26] (03PS1) 10Dzahn: integration: Apache turn DirectorySlash Off [puppet] - 10https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) [21:14:43] (03CR) 10Dzahn: "also see http://stackoverflow.com/questions/17439330/convince-apache-of-the-original-client-protocol" [puppet] - 10https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: 10Dzahn) [21:15:08] 6operations, 10Wikimedia-SVG-rendering: Install PT (paratype) font on image scalars - https://phabricator.wikimedia.org/T97181#1235146 (10Bawolff) [21:21:06] 6operations, 10Wikimedia-SVG-rendering: Install PT (paratype) font on image scalars - https://phabricator.wikimedia.org/T97181#1235159 (10Bawolff) Google suggests that the font has not been packaged for ubuntu as of yet: https://bugs.launchpad.net/ubuntu/+bug/572061 [21:22:29] (03CR) 10Bmansurov: [C: 031] Enable Browse experiment on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206375 (https://phabricator.wikimedia.org/T94739) (owner: 10Phuedx) [21:22:41] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:22:57] (03PS1) 10Ottomata: Run hdfs balancer weekly [puppet] - 10https://gerrit.wikimedia.org/r/206461 [21:23:36] (03CR) 10Ottomata: "Will merge this next week, as I currently have a manual balancer task running." [puppet] - 10https://gerrit.wikimedia.org/r/206461 (owner: 10Ottomata) [21:23:38] (03CR) 10jenkins-bot: [V: 04-1] Run hdfs balancer weekly [puppet] - 10https://gerrit.wikimedia.org/r/206461 (owner: 10Ottomata) [21:28:20] 6operations, 10Wikimedia-Apache-configuration, 5Patch-For-Review: Apache slash expansion should not redirect from HTTPS to HTTP - https://phabricator.wikimedia.org/T95164#1235184 (10Dzahn) so either we just turn of DirectorySlash or if we care to keep it, we additionally add "manual" rewrite rules to replace... [21:29:51] 6operations, 10Wikimedia-Apache-configuration, 5Patch-For-Review: Apache slash expansion should not redirect from HTTPS to HTTP - https://phabricator.wikimedia.org/T95164#1235192 (10Dzahn) a:3Dzahn [21:41:26] 6operations, 15User-Bd808-Test, 10Wikimedia-Mailing-lists: Close mwapi-team@lists.wikimedia.org list - https://phabricator.wikimedia.org/T97148#1235235 (10bd808) [21:45:17] 6operations, 10Sentry, 10hardware-requests, 3Multimedia-Sprint-2015-03-25: Procure hardware for Sentry - https://phabricator.wikimedia.org/T93138#1235250 (10matmarex) [22:13:09] !log krinkle Synchronized php-1.26wmf3/includes/resourceloader/ResourceLoaderModule.php: Ibedc31659ed (duration: 00m 17s) [22:13:16] Logged the message, Master [22:14:07] !log krinkle Synchronized php-1.26wmf2/includes/resourceloader/ResourceLoaderModule.php: Ibedc31659ed (duration: 00m 14s) [22:14:10] Logged the message, Master [22:40:01] PROBLEM - puppet last run on lvs4001 is CRITICAL Puppet last ran 4 hours ago [22:54:32] 6operations, 10Wikimedia-SVG-rendering: Install PT (paratype) font on image scalars - https://phabricator.wikimedia.org/T97181#1235463 (10Dzahn) I'm wondering if for things like fonts, using alien [1] is acceptable. I tried and i do get a .deb from one of those rpms alien -d paratype-pt-mono-fonts-20141121-1... 
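bblack's example of DNS "compression" above (RFC 1035 name compression pointers rather than a general-purpose compressor) is easy to show in miniature. A simplified sketch that only reuses whole names it has already written, skipping the many caveats the real gdnsd code linked above has to handle:

    # Simplified illustration of RFC 1035 name compression as described
    # above: a repeated name is replaced by a two-byte pointer whose top two
    # bits are set (0xC0..) and whose remaining 14 bits give the offset of
    # the earlier copy in the packet. This toy only reuses whole names it has
    # already written; real implementations also reuse suffixes and handle
    # many more caveats.
    def encode_name(name, packet, offsets):
        """Append `name` to bytearray `packet`, reusing an earlier copy if any."""
        key = name.lower().rstrip(".")
        if key in offsets:
            ptr = 0xC000 | offsets[key]        # 0b11 prefix + 14-bit offset
            packet += ptr.to_bytes(2, "big")
            return
        offsets[key] = len(packet)             # remember where this name starts
        for label in key.split("."):
            packet.append(len(label))
            packet += label.encode("ascii")
        packet.append(0)                        # root label ends the name

    if __name__ == "__main__":
        packet = bytearray()
        seen = {}
        encode_name("www.wikipedia.org", packet, seen)   # 19 bytes of labels
        encode_name("www.wikipedia.org", packet, seen)   # 2-byte pointer
        print(packet.hex())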
[22:56:41] jouncebot: next [22:56:41] In 64 hour(s) and 3 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150427T1500) [22:57:08] ohi jouncebot. Now let's see if we can get you running on the grid [22:59:12] 6operations, 10Wikimedia-SVG-rendering: Install PT (paratype) font on image scalars - https://phabricator.wikimedia.org/T97181#1235486 (10Dzahn) http://people.wikimedia.org/~dzahn/fonts/ [23:01:51] jouncebot: next [23:01:51] In 63 hour(s) and 58 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150427T1500) [23:02:01] YuviPanda: ^ it's alive! [23:02:14] bd808: \o/ [23:03:08] bd808: I wonder what was wrong [23:03:11] just the virtualenv? [23:03:20] Not sure honestly. [23:03:26] no the venv problem was new [23:03:28] Krinkle: should I just announce? or do you still want to take a look? [23:03:56] YuviPanda: link [23:04:40] Krinkle: https://etherpad.wikimedia.org/p/toollabs-cdnjs [23:05:33] YuviPanda: Maybe mention "static" toollabs project [23:05:54] https://tools.wmflabs.org/static/ [23:06:04] (03PS1) 10Dzahn: add codfw wtp parsoid servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/206478 (https://phabricator.wikimedia.org/T90271) [23:06:09] Seems nice to mention them in this announcement [23:07:09] Krinkle: ah, sure! [23:07:09] K [23:07:17] Krinkle: not sure if people will still use static after this, however? [23:07:22] Krinkle: what should I write? [23:13:21] (03PS1) 10Dzahn: parsoid: add role::parsoid::prod to codfw nodes [puppet] - 10https://gerrit.wikimedia.org/r/206479 (https://phabricator.wikimedia.org/T90271) [23:13:35] (03PS1) 10Nemo bis: Enable RandomRootPage everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206480 (https://phabricator.wikimedia.org/T18655) [23:14:10] YuviPanda: I put a draft blurp in the etherpad you can use/amend [23:14:18] Krinkle: aaaah, cool :) [23:14:53] YuviPanda: I'll work with Ireas to obsolete "static" [23:15:06] Krinkle: \o/ I guess we can setup redirects to cdnjs [23:15:14] YuviPanda: Be sure to end urls in a slash, otherwise they redirect to http [23:15:19] bah true [23:15:22] I should fix that [23:15:27] You should :P [23:30:52] Krinkle: I wonder if lighttpd fixed the trailing slash behavior in newer versions. [23:31:06] ah [23:31:08] Krinkle: nope it hasn't [23:41:53] 6operations: install/setup/deploy db2043-db2070 - https://phabricator.wikimedia.org/T96383#1235573 (10Dzahn) @springle do you know already which of these (db2043-db2070) will serve which shard? I assume they will all be role::mariadb::core and we just need the shard numbers for each node, then we could add them... [23:58:56] 6operations, 5Patch-For-Review: adjust CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163#1235601 (10Dzahn) SOFT states do not send any notifications though. So it would seem ok to me to call it resolved. If we want to avoid even seeing them in logs/history, there is " log_service_retries=1"...
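The trailing-slash problems mentioned above (Apache's DirectorySlash redirect dropping HTTPS in T95164, and lighttpd behind the tools proxy redirecting slash-less URLs to http) share one fix idea: behind a TLS-terminating proxy, build the redirect Location from X-Forwarded-Proto instead of assuming http. A toy WSGI sketch of that idea, not the Apache or lighttpd configuration actually in use:

    # Toy WSGI app showing the shared fix idea, not the Apache or lighttpd
    # configuration actually deployed: when issuing a trailing-slash redirect
    # behind a TLS-terminating proxy, take the scheme from X-Forwarded-Proto
    # instead of assuming http.
    def app(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        if looks_like_directory(path) and not path.endswith("/"):
            scheme = environ.get("HTTP_X_FORWARDED_PROTO",
                                 environ.get("wsgi.url_scheme", "http"))
            host = environ.get("HTTP_HOST", "localhost")
            start_response("301 Moved Permanently",
                           [("Location", "%s://%s%s/" % (scheme, host, path))])
            return [b""]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"hello\n"]

    def looks_like_directory(path):
        # Placeholder heuristic so the demo below hits the redirect branch.
        return "." not in path.rsplit("/", 1)[-1]

    if __name__ == "__main__":
        def start_response(status, headers):
            print(status, dict(headers).get("Location"))
        environ = {"PATH_INFO": "/cdnjs", "HTTP_HOST": "tools.wmflabs.org",
                   "HTTP_X_FORWARDED_PROTO": "https"}
        app(environ, start_response)
        # Expected output: 301 Moved Permanently https://tools.wmflabs.org/cdnjs/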