[00:03:50] operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1235613 (Dzahn) So the admin groups requested are not needed?
[00:04:07] Ops-Access-Requests, operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1235615 (Dzahn)
[00:07:16] operations, WMF-Legal, Wikimedia-General-or-Unknown: dbtree loads third party resources - https://phabricator.wikimedia.org/T96499#1235617 (Dzahn) Do we already have jquery sitting on a wikimedia URL?
[00:08:12] Ops-Access-Requests, operations: Give Google webmaster tools access to jon katz (Read only is fine) - https://phabricator.wikimedia.org/T90980#1235618 (Dzahn)
[00:08:19] operations, Parsoid, Services: Offer io.js on Jessie - https://phabricator.wikimedia.org/T91855#1235619 (cscott) It should also be noted that io.js is a "friendly fork" of node.js, with the expectation that they will resync in the future. And they are both chasing v8, which is chasing the ES6 language...
[00:11:05] (PS1) Dzahn: site.pp: add labcontrol1001 [puppet] - https://gerrit.wikimedia.org/r/206486 (https://phabricator.wikimedia.org/T96048)
[00:17:04] (PS1) Yuvipanda: tools: Faux enable https for lighttpd by default [puppet] - https://gerrit.wikimedia.org/r/206488 (https://phabricator.wikimedia.org/T66627)
[00:17:25] (PS2) Yuvipanda: tools: Faux enable https for lighttpd by default [puppet] - https://gerrit.wikimedia.org/r/206488 (https://phabricator.wikimedia.org/T66627)
[00:22:08] (CR) Yuvipanda: [C: 2 V: 2] tools: Faux enable https for lighttpd by default [puppet] - https://gerrit.wikimedia.org/r/206488 (https://phabricator.wikimedia.org/T66627) (owner: Yuvipanda)
[00:32:54] (PS1) Yuvipanda: Revert "tools: Faux enable https for lighttpd by default" [puppet] - https://gerrit.wikimedia.org/r/206491
[00:33:09] (CR) Yuvipanda: [C: 2 V: 2] Revert "tools: Faux enable https for lighttpd by default" [puppet] - https://gerrit.wikimedia.org/r/206491 (owner: Yuvipanda)
[00:37:27] (CR) Krinkle: [C: -1] "This feature is desired and normalises URLs, it just shouldn't redirect to HTTP. It probably does that because HTTPS terminator is separat" [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: Dzahn)
[00:39:57] (CR) Krinkle: "How do we do this for app servers? The PHP environment for MediaWiki is aware of HTTPS being used. Maybe we can re-use that here?" [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: Dzahn)
[00:48:10] (CR) BBlack: "My reading of the docs seems to indicate DirectorySlash doesn't do the /-adding redirect in the general case, only when the path actually " [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: Dzahn)
[00:51:17] (CR) BBlack: "Well and now that I think about it: in the general case as a frontend proxy without knowledge of the app code or filesystem layout as appr" [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: Dzahn)
[00:55:25] (CR) Krinkle: "Yeah, nevermind. We can't do it from the proxy servers." [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: Dzahn)
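[Note: the change under review above (https://gerrit.wikimedia.org/r/206460) is not reproduced in this log. As a rough, untested sketch of the approach the reviewers discuss — disabling Apache's built-in trailing-slash redirect and re-issuing it with the scheme taken from the TLS terminator's X-Forwarded-Proto header — something like the following mod_rewrite configuration (server/vhost context) would do it. The directives are standard Apache; the values are illustrative, not the actual patch.]

    # Default the client-facing protocol to http, then trust the
    # X-Forwarded-Proto header set by the HTTPS terminator.
    RewriteEngine On
    RewriteRule ^ - [E=PROTO:http]
    RewriteCond %{HTTP:X-Forwarded-Proto} =https
    RewriteRule ^ - [E=PROTO:https]

    # Take over the trailing-slash redirect from Apache so it can
    # preserve the protocol the client actually used.
    DirectorySlash Off
    RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} -d
    RewriteCond %{REQUEST_URI} !/$
    RewriteRule ^(.*)$ %{ENV:PROTO}://%{HTTP_HOST}$1/ [R=301,L]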
[01:06:11] PROBLEM - puppet last run on dbstore2001 is CRITICAL puppet fail
[01:18:47] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235677 (GWicke)
[01:19:12] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235677 (GWicke)
[01:19:42] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235677 (GWicke)
[01:24:31] RECOVERY - puppet last run on dbstore2001 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures
[01:28:54] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235702 (GWicke)
[01:29:27] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235677 (GWicke)
[01:29:46] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235677 (GWicke)
[01:30:34] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235677 (GWicke)
[01:34:02] operations: install/setup/deploy db2043-db2070 - https://phabricator.wikimedia.org/T96383#1235706 (Springle) If we add role::mariadb::core and shard numbers before cloning data and starting mysqld + replication, we'll have a few hundred icinga alerts to silence or ack for a month. I vote for the normal route...
[01:38:58] (PS2) Dzahn: integration: Apache turn DirectorySlash Off [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164)
[01:40:38] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235707 (GWicke)
[01:46:41] (CR) Krinkle: [C: 1] integration: Apache turn DirectorySlash Off [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: Dzahn)
[01:52:26] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1235710 (GWicke)
[01:52:43] operations, Architecture, MediaWiki-RfCs, RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1235677 (GWicke)
[02:06:14] operations, Architecture, MediaWiki-RfCs, RESTBase, and 4 others: RFC: Don't retry 503 unless allowed by Retry-After in Varnish - https://phabricator.wikimedia.org/T97206#1235713 (GWicke) NEW
[02:12:21] operations, Architecture, MediaWiki-RfCs, RESTBase, and 4 others: RFC: Don't retry 503 unless allowed by Retry-After in Varnish - https://phabricator.wikimedia.org/T97206#1235720 (GWicke)
[02:20:09] !log l10nupdate Synchronized php-1.26wmf2/cache/l10n: (no message) (duration: 07m 48s)
[02:20:27] Logged the message, Master
[02:24:36] !log LocalisationUpdate completed (1.26wmf2) at 2015-04-25 02:23:33+00:00
[02:24:46] Logged the message, Master
[02:28:31] (PS1) GWicke: Use /api/rest_v1/ entry point for VE, take two. [mediawiki-config] - https://gerrit.wikimedia.org/r/206502
[02:29:07] (PS2) GWicke: Use /api/rest_v1/ entry point for VE, take two. [mediawiki-config] - https://gerrit.wikimedia.org/r/206502
[02:29:28] (PS3) GWicke: Use /api/rest_v1/ entry point for VE, take two. [mediawiki-config] - https://gerrit.wikimedia.org/r/206502
[02:39:15] !log l10nupdate Synchronized php-1.26wmf3/cache/l10n: (no message) (duration: 05m 56s)
[02:39:20] Logged the message, Master
[02:42:57] !log LocalisationUpdate completed (1.26wmf3) at 2015-04-25 02:41:54+00:00
[02:43:01] Logged the message, Master
[03:00:42] gwicke, ori, now that both incidents are under control, should I cherry-pick Tim's patch and deploy that?
[03:02:17] subbu: I don't think it's necessary to do it now. Better limits / throttling are in order but I think it's OK to take a few days to think it over thoroughly.
[03:03:32] The HHVM max execution time limit we pushed out earlier should keep a problem like this (were it to recur) from blowing up to the magnitude that it did.
[03:04:00] That said, if you'd be more at ease with stricter limits over the weekend, go for it
[03:10:41] Krinkle: btw, decided to not send it now. friday evening, etc.
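[Note: the "HHVM max execution time limit" referenced at [03:03:32] caps how long any single request may run, so one runaway parse cannot tie up worker threads across the cluster. The exact production key and value are not shown in this log; as a hypothetical illustration only, HHVM exposes such a cap through its ini configuration:]

    ; Illustrative value, not the one actually deployed: abort any
    ; request that runs longer than 60 seconds.
    hhvm.server.request_timeout_seconds = 60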
[03:18:37] (PS1) Yuvipanda: tools: Redirect tools.wmflabs.org/toolname appropriately [puppet] - https://gerrit.wikimedia.org/r/206504 (https://phabricator.wikimedia.org/T66627)
[03:28:37] (PS2) Yuvipanda: tools: Redirect tools.wmflabs.org/toolname appropriately [puppet] - https://gerrit.wikimedia.org/r/206504 (https://phabricator.wikimedia.org/T66627)
[03:34:34] (PS3) Yuvipanda: tools: Redirect tools.wmflabs.org/toolname appropriately [puppet] - https://gerrit.wikimedia.org/r/206504 (https://phabricator.wikimedia.org/T66627)
[03:46:54] (CR) Yuvipanda: [C: 2] tools: Redirect tools.wmflabs.org/toolname appropriately [puppet] - https://gerrit.wikimedia.org/r/206504 (https://phabricator.wikimedia.org/T66627) (owner: Yuvipanda)
[03:50:24] (PS1) Yuvipanda: Revert "tools: Redirect tools.wmflabs.org/toolname appropriately" [puppet] - https://gerrit.wikimedia.org/r/206506
[03:50:32] (CR) Yuvipanda: [C: 2 V: 2] Revert "tools: Redirect tools.wmflabs.org/toolname appropriately" [puppet] - https://gerrit.wikimedia.org/r/206506 (owner: Yuvipanda)
[03:53:07] Krinkle: i accept defeat on this one for today https://phabricator.wikimedia.org/T66627#1235824
[03:54:49] O/
[03:54:55] YuviPanda: You've tried.
[03:55:04] Krinkle: I almost got it to work.
[03:55:08] Krinkle: and I think I have the right approach.
[03:55:20] Krinkle: however, I’m wondering… if a better solution is to just offer nginx as an option :D
[03:55:33] Krinkle: it doesn’t affect anything other than lighttpd
[03:55:38] uwsgi handles it just fine and so does nodejs
[03:55:51] YuviPanda: Right, the redirect is coming from the individual tools' lighttpd instances in most cases
[03:55:59] in all the cases
[03:56:05] nginx itself redirects protocol-relatively
[03:56:07] and so is good
[03:56:08] YuviPanda: What about the top level
[03:56:15] Krinkle: that’s what I tried to fix in that patch
[03:56:25] I mean, that one isn't running on lighttpd right?
[03:56:27] thing is you’ve to make sure you are only matching tools.wmflabs.org/toolname
[03:56:39] and not tools.wmflabs.org/toolname/ or tools.wmflabs.org/?status
[03:56:49] Krinkle: what do you mean by ‘toplevel’?
[03:56:53] Krinkle: tools.wmflabs.org?
[03:56:53] YuviPanda: I guess Lighttpd, like Apache, has DirectorySlash behaviour based on full url and it's never told that the user may be on HTTPS by the outside proxy
[03:57:09] maybe we can make it use x-forwarded-proto like we do in prod redirect.conf.
[03:57:12] Krinkle: well, it’s told it’s on https by outside proxy (X-Forwarded-Proto)
[03:57:18] Krinkle: it just doesn’t give a shit
[03:57:19] {ENV:PROTO}//$1/$2
[03:57:37] and the only way to customize that seems to be to either use embedded lua or to write your own module
[03:57:42] Yeah, but X-Forwarded-Proto isn't common enough to expect a program to handle
[03:57:47] Afaik Nginx and apache don't handle it either by default
[03:57:56] we made it do that in our conf
[03:58:01] sure, but it should allow me to handle it in config without having to embed lua or write config for it
[03:58:03] and even then, only for prod redirects.
[03:58:15] prod regular domains and misc web lb (as for integration.wikimedia.org) has the same bug
[03:58:22] from spending about 2h on it I can’t find any way to tell lighttpd to pick up the proto from x-forwarded-proto
[03:58:28] Ah, we can't program it on lighttpd?
[03:58:30] $HTTP[‘scheme’] can not be set
[03:58:34] yeah that’s the crux of the problem
[03:58:45] you can’t without using lua or writing a custom C module
[03:59:04] Krinkle: longterm, I’d like to replace lighttpd with nginx
[03:59:12] Krinkle: and language specific servers
[03:59:17] (like uwsgi / rack / nodejs / HHVM)
[03:59:59] YuviPanda: maybe scheme can't be overwritten but the redirect target may be overwritable
[04:00:21] e.g. disable directoryslash and enable a manual rewrite rule for if -d, redirect to proto://domain/path/
[04:00:22] Krinkle: yeah, if we had an explicit redirect. but we aren’t doing an explicit redirect so I’ve no idea how lighttpd is deciding to do the redirect
[04:00:37] there’s no way to disable directoryslash as far as I can tell
[04:00:41] in fact that behavior isn’t even documented
[04:02:21] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[04:02:53] Krinkle: emailed about cdnjs :)
[04:03:01] wtf icinga-wm
[04:03:04] YuviPanda: Looks like newer versions of lighttpd may have changed this
[04:03:06] > No changes to merge.
[04:03:20] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[04:04:59] Krinkle: oh, hmm. how newer?
[04:05:00] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[04:05:44] YuviPanda: "Lighttpd doesn't have this trailing slash problem in the latest release." -- random person on the internet
[04:05:46] .. in 2009
[04:05:55] nvm
[04:06:02] Krinkle: :)
[04:07:21] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[04:07:33] fine
[04:07:50] Krinkle: lighttpd would’ve not been my first choice, but there exist 228 .lighttpd.conf files now
[04:07:58] Krinkle: I wonder if / when wikimedia goes ssl only toollabs can too
[04:08:28] YuviPanda: that might not solve the problem though
[04:08:39] in fact it might make it worse :D
[04:08:39] unless we use ssl internally as well
[04:08:41] redirect loooop!
[04:08:43] yeah
[04:08:44] yup
[04:08:53] it definitely will make it worse :)
[04:09:04] we can maybe rewrite redirects coming *out* of the proxy to be http but that’s… a hack
[04:09:06] nah, why would it loop?
[04:09:15] https foo -> http foo/ -> https foo/
[04:09:19] just one extra hop
[04:09:24] oh hmm
[04:09:26] but still
[04:09:28] extra hop
[04:09:52] YuviPanda: we have the same problem in prod already with every non-mediawiki domain we have serving out of apache
[04:10:04] silently stripping https?
[04:10:08] Yup
[04:10:29] I’m going to for toollabs advocate someone writing and packaging a small lighttpd mod
[04:10:44] that handles XFS header
[04:10:49] in C
[04:10:58] it shouldn’t be too hard - there’s already one for XFF
[04:11:24] YuviPanda: I assume the XFF one is disabled for tool labs currently?
[04:11:28] and only for web proxy?
[04:11:41] Krinkle: yeah, and webproxy doesn’t set xff anyway
[04:11:47] so it doesn’t matter
[04:11:49] oh
[04:11:51] hm..
[04:11:58] so why is there one for XFF :D ?
[04:12:18] Krinkle: it’s included by default :D we don’t enable it
[04:13:47] right
[04:18:41] PROBLEM - puppet last run on mw2171 is CRITICAL puppet fail
[04:19:04] Krinkle: announcement for the cdnjs mirror sent :)
[04:19:19] Krinkle: I’m looking for other small things that’ll make web devs’ on toollabs life easier :) let me know if you got any ideas
[04:22:48] (CR) Mattflaschen: [C: 2] Enable VectorBeta form refresh on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/205474 (owner: Jdlrobson)
[04:22:50] (CR) jenkins-bot: [V: -1] Enable VectorBeta form refresh on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/205474 (owner: Jdlrobson)
[04:26:16] (PS3) Mattflaschen: Enable VectorBeta form refresh on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/205474 (owner: Jdlrobson)
[04:26:33] (CR) Mattflaschen: [C: 2] Enable VectorBeta form refresh on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/205474 (owner: Jdlrobson)
[04:26:38] (Merged) jenkins-bot: Enable VectorBeta form refresh on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/205474 (owner: Jdlrobson)
[04:30:26] !log mattflaschen Synchronized wmf-config/InitialiseSettings-labs.php: Sync Beta Cluster-only change (for MW UI beta feature) (duration: 00m 16s)
[04:30:32] Logged the message, Master
[04:30:56] !log mattflaschen Synchronized wmf-config/CommonSettings-labs.php: Sync Beta Cluster-only change (for MW UI beta feature) (duration: 00m 16s)
[04:30:59] Logged the message, Master
[04:32:27] (CR) Mattflaschen: "Deployed (automatically), synced (so it doesn't trigger the puppet error), and tested (works)." [mediawiki-config] - https://gerrit.wikimedia.org/r/205474 (owner: Jdlrobson)
[04:36:51] RECOVERY - puppet last run on mw2171 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[04:46:02] Ops-Access-Requests, operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1235870 (GWicke) Just re-read this task, and realized that apertium is third-party machine translation software. This means that sending logs to logstash or fluorine might not be tri...
[04:54:49] Ops-Access-Requests, operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1235872 (KartikMistry) @gwicke Both is good idea, but right now logs are sufficient. We've not 'seen' log since deployment, so it is bit weird :/
[04:57:50] Ops-Access-Requests, operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1235884 (GWicke) Also adding @bd808, as he might have an idea for shipping plain log files to logstash.
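[Note: the thread above concludes that lighttpd's trailing-slash redirect cannot be reconfigured directly — $HTTP[‘scheme’] is fixed by the local connection, and the only escape hatches are embedded Lua (mod_magnet) or a custom C module. As a hypothetical, untested sketch of the Lua route, re-implementing the redirect while trusting X-Forwarded-Proto (file path and details invented for illustration; query strings ignored for brevity):]

    # lighttpd.conf
    server.modules += ( "mod_magnet" )
    magnet.attract-physical-path-to = ( "/etc/lighttpd/trailing-slash.lua" )

    -- /etc/lighttpd/trailing-slash.lua
    -- If the request maps to a directory but lacks a trailing slash,
    -- issue the redirect ourselves with the proxied client's scheme.
    local path = lighty.env["uri.path"]
    local st = lighty.stat(lighty.env["physical.path"])
    if st and st["is_dir"] and not string.match(path, "/$") then
        local proto = lighty.request["X-Forwarded-Proto"] or "http"
        lighty.header["Location"] = proto .. "://" .. lighty.request["Host"] .. path .. "/"
        return 301
    end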
[05:18:44] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Apr 25 05:17:41 UTC 2015 (duration 17m 40s)
[05:18:52] Logged the message, Master
[05:21:31] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 75.00% of data above the critical threshold [24.0]
[05:26:30] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[06:04:32] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 66.67% of data above the critical threshold [24.0]
[06:09:01] PROBLEM - puppet last run on mw2184 is CRITICAL puppet fail
[06:16:01] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[06:29:20] PROBLEM - puppet last run on cp3037 is CRITICAL Puppet has 1 failures
[06:30:00] PROBLEM - puppet last run on labvirt1003 is CRITICAL Puppet has 1 failures
[06:30:10] PROBLEM - puppet last run on mc2011 is CRITICAL puppet fail
[06:30:20] PROBLEM - puppet last run on elastic1027 is CRITICAL Puppet has 1 failures
[06:30:41] PROBLEM - puppet last run on mw1099 is CRITICAL Puppet has 1 failures
[06:31:00] PROBLEM - puppet last run on cp1056 is CRITICAL Puppet has 1 failures
[06:31:11] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 1 failures
[06:31:20] PROBLEM - puppet last run on mw1100 is CRITICAL Puppet has 1 failures
[06:33:50] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:33:51] PROBLEM - puppet last run on mw2104 is CRITICAL Puppet has 1 failures
[06:34:51] PROBLEM - puppet last run on mw2083 is CRITICAL Puppet has 1 failures
[06:35:40] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures
[06:45:20] RECOVERY - puppet last run on elastic1027 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:45:40] RECOVERY - puppet last run on mw1099 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:45:51] RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:46:00] RECOVERY - puppet last run on cp1056 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:46:11] RECOVERY - puppet last run on mw1100 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:31] RECOVERY - puppet last run on mw2083 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:32] RECOVERY - puppet last run on labvirt1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:41] RECOVERY - puppet last run on mc2011 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:47:11] RECOVERY - puppet last run on mw2104 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:20] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:47:50] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:24:51] PROBLEM - HHVM busy threads on mw1197 is CRITICAL 40.00% of data above the critical threshold [115.2]
[08:26:31] PROBLEM - HHVM busy threads on mw1120 is CRITICAL 100.00% of data above the critical threshold [86.4]
[08:27:11] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[08:27:20] PROBLEM - HHVM queue size on mw1130 is CRITICAL 33.33% of data above the critical threshold [80.0]
[08:29:01] PROBLEM - HHVM busy threads on mw1136 is CRITICAL 60.00% of data above the critical threshold [86.4]
[08:29:01] RECOVERY - HHVM queue size on mw1130 is OK Less than 30.00% above the threshold [10.0]
[08:30:55] (CR) Glaisher: "doc.wikimedia.org too." [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: Dzahn)
[08:31:51] PROBLEM - HHVM busy threads on mw1190 is CRITICAL 80.00% of data above the critical threshold [115.2]
[08:38:30] RECOVERY - HHVM busy threads on mw1190 is OK Less than 30.00% above the threshold [76.8]
[08:38:51] RECOVERY - HHVM busy threads on mw1136 is OK Less than 30.00% above the threshold [57.6]
[08:40:31] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[08:40:58] (PS1) Glaisher: Modify AbuseFilter block configuration on eswikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/206510 (https://phabricator.wikimedia.org/T96669)
[08:41:40] RECOVERY - HHVM busy threads on mw1120 is OK Less than 30.00% above the threshold [57.6]
[08:49:00] PROBLEM - HHVM busy threads on mw1145 is CRITICAL 33.33% of data above the critical threshold [86.4]
[08:51:21] PROBLEM - HHVM busy threads on mw1117 is CRITICAL 60.00% of data above the critical threshold [86.4]
[08:52:22] PROBLEM - HHVM queue size on mw1198 is CRITICAL 33.33% of data above the critical threshold [80.0]
[08:58:00] PROBLEM - HHVM busy threads on mw1139 is CRITICAL 60.00% of data above the critical threshold [86.4]
[08:58:01] PROBLEM - puppet last run on lvs3003 is CRITICAL puppet fail
[09:05:25] <_joe_> the API cluster is super-loaded again
[09:06:22] RECOVERY - HHVM busy threads on mw1139 is OK Less than 30.00% above the threshold [57.6]
[09:07:20] RECOVERY - HHVM busy threads on mw1145 is OK Less than 30.00% above the threshold [57.6]
[09:07:22] RECOVERY - HHVM queue size on mw1198 is OK Less than 30.00% above the threshold [10.0]
[09:16:20] RECOVERY - puppet last run on lvs3003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:18:30] PROBLEM - HHVM busy threads on mw1190 is CRITICAL 33.33% of data above the critical threshold [115.2]
[09:18:51] PROBLEM - HHVM queue size on mw1147 is CRITICAL 60.00% of data above the critical threshold [80.0]
[09:21:01] PROBLEM - HHVM busy threads on mw1114 is CRITICAL 60.00% of data above the critical threshold [86.4]
[09:24:01] PROBLEM - HHVM queue size on mw1207 is CRITICAL 33.33% of data above the critical threshold [80.0]
[09:25:20] PROBLEM - HHVM busy threads on mw1190 is CRITICAL 83.33% of data above the critical threshold [115.2]
[09:27:30] PROBLEM - HHVM busy threads on mw1191 is CRITICAL 40.00% of data above the critical threshold [115.2]
[09:27:41] PROBLEM - HHVM busy threads on mw1114 is CRITICAL 60.00% of data above the critical threshold [86.4]
[09:28:01] PROBLEM - HHVM busy threads on mw1203 is CRITICAL 80.00% of data above the critical threshold [115.2]
[09:29:01] RECOVERY - HHVM queue size on mw1207 is OK Less than 30.00% above the threshold [10.0]
[09:29:41] RECOVERY - HHVM busy threads on mw1117 is OK Less than 30.00% above the threshold [57.6]
[09:30:11] PROBLEM - HHVM busy threads on mw1201 is CRITICAL 33.33% of data above the critical threshold [115.2]
[09:31:50] RECOVERY - HHVM busy threads on mw1201 is OK Less than 30.00% above the threshold [76.8]
[09:31:51] RECOVERY - HHVM busy threads on mw1190 is OK Less than 30.00% above the threshold [76.8]
[09:32:41] RECOVERY - HHVM busy threads on mw1114 is OK Less than 30.00% above the threshold [57.6]
[09:33:31] (CR) Giuseppe Lavagetto: "Please remember that this limit applies to the jobrunners as well, as I stated clearly in the reason for -1 on my own change introducing t" [puppet] - https://gerrit.wikimedia.org/r/206440 (owner: Ori.livneh)
[09:44:11] PROBLEM - HHVM busy threads on mw1143 is CRITICAL 40.00% of data above the critical threshold [86.4]
[09:47:50] PROBLEM - HHVM busy threads on mw1119 is CRITICAL 60.00% of data above the critical threshold [86.4]
[09:50:21] PROBLEM - HHVM busy threads on mw1201 is CRITICAL 40.00% of data above the critical threshold [115.2]
[09:52:20] PROBLEM - HHVM busy threads on mw1132 is CRITICAL 100.00% of data above the critical threshold [86.4]
[09:55:31] PROBLEM - HHVM queue size on mw1148 is CRITICAL 40.00% of data above the critical threshold [80.0]
[09:57:08] <_joe_> !log nuked User:Niteshift/MVneu/2015_April_21-30 on commonswiki
[09:57:14] Logged the message, Master
[09:57:20] RECOVERY - HHVM queue size on mw1148 is OK Less than 30.00% above the threshold [10.0]
[09:57:20] PROBLEM - HHVM busy threads on mw1129 is CRITICAL 33.33% of data above the critical threshold [86.4]
[09:59:00] RECOVERY - HHVM busy threads on mw1129 is OK Less than 30.00% above the threshold [57.6]
[09:59:31] RECOVERY - HHVM busy threads on mw1119 is OK Less than 30.00% above the threshold [57.6]
[10:00:51] RECOVERY - HHVM busy threads on mw1132 is OK Less than 30.00% above the threshold [57.6]
[10:01:10] RECOVERY - HHVM queue size on mw1147 is OK Less than 30.00% above the threshold [10.0]
[10:01:20] RECOVERY - HHVM busy threads on mw1191 is OK Less than 30.00% above the threshold [76.8]
[10:01:51] RECOVERY - HHVM busy threads on mw1203 is OK Less than 30.00% above the threshold [76.8]
[10:02:10] RECOVERY - HHVM busy threads on mw1197 is OK Less than 30.00% above the threshold [76.8]
[10:04:01] RECOVERY - HHVM busy threads on mw1201 is OK Less than 30.00% above the threshold [76.8]
[10:04:30] RECOVERY - HHVM busy threads on mw1143 is OK Less than 30.00% above the threshold [57.6]
[11:20:41] PROBLEM - configured eth on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:22:10] RECOVERY - configured eth on tin is OK - interfaces up
[11:26:11] PROBLEM - RAID on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:29:21] RECOVERY - RAID on tin is OK optimal, 1 logical, 2 physical
[11:41:51] PROBLEM - puppet last run on cp3033 is CRITICAL puppet fail
[11:42:41] operations: internal_api_error_Exception: [22e05a83] Exception Caught: wfDiff(): popen() failed errors on English Wikipedia - https://phabricator.wikimedia.org/T97145#1235990 (Anomie)
[11:47:05] operations: internal_api_error_Exception: [22e05a83] Exception Caught: wfDiff(): popen() failed errors on English Wikipedia - https://phabricator.wikimedia.org/T97145#1235992 (Anomie) Reviewing the logs again, this particular instance of the problem seems like it may have accidentally been fixed at around 20:...
[12:00:01] RECOVERY - puppet last run on cp3033 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:06:11] PROBLEM - RAID on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:07:41] RECOVERY - RAID on tin is OK optimal, 1 logical, 2 physical
[12:08:51] PROBLEM - puppet last run on mc2001 is CRITICAL puppet fail
[12:10:50] PROBLEM - puppet last run on multatuli is CRITICAL puppet fail
[12:25:21] RECOVERY - puppet last run on mc2001 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures
[12:27:21] RECOVERY - puppet last run on multatuli is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[12:39:21] (PS1) Aklapper: Phab monthly stats email: Clarify what day values for priority mean [puppet] - https://gerrit.wikimedia.org/r/206515
[13:02:32] (PS1) Aklapper: Phab monthly stats email: Show how many projects saw workboard moves [puppet] - https://gerrit.wikimedia.org/r/206518
[13:04:10] PROBLEM - puppet last run on mw2135 is CRITICAL Puppet has 1 failures
[13:04:14] (CR) Aklapper: [C: 1] Phab monthly stats email: Clarify what day values for priority mean [puppet] - https://gerrit.wikimedia.org/r/206515 (owner: Aklapper)
[13:15:40] PROBLEM - puppet last run on mw2205 is CRITICAL puppet fail
[13:20:41] RECOVERY - puppet last run on mw2135 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:33:51] RECOVERY - puppet last run on mw2205 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[14:09:11] PROBLEM - puppet last run on mw1100 is CRITICAL Puppet has 1 failures
[14:25:51] RECOVERY - puppet last run on mw1100 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[15:10:25] _joe_, hi.
[15:10:57] <_joe_> subbu: hey
[15:11:01] saw the ops email.
[15:11:11] let me cherry pick tim's patch.
[15:11:25] <_joe_> subbu: it was released AFAIK
[15:11:29] <_joe_> it wasn't?
[15:12:01] unless someone else did it after i reverted the deploy y'day .. see my mail on #ops.
[15:12:18] because of https://phabricator.wikimedia.org/T97155
[15:12:40] <_joe_> oh ok
[15:13:26] <_joe_> so the deploy contained a few other things
[15:13:51] <_joe_> ok, I guess this morning we had to thank the time limit on HHVM
[15:13:58] yes.
[15:19:39] <_joe_> subbu: so tell me if I can help
[15:20:10] i am about to sync and will restart after .. i had to resolve a conflict after cherry-pick.
[15:26:45] !log deployed parsoid version fca17070 (cherry-pick of d2135c6b on parsoid master)
[15:26:51] Logged the message, Master
[15:27:00] _joe_, there ^
[15:27:17] <_joe_> should I rolling-restart parsoid?
[15:27:23] i've restarted too.
[15:27:25] <_joe_> subbu: thanks!
[15:27:31] so, how bad was it this mornin?
[15:27:35] *morning
[15:27:45] morning CST/PST time :)
[15:42:29] _joe_, https://www.mediawiki.org/wiki/Parsoid/Deployments has all deployment info for parsoid.
[15:42:40] <_joe_> subbu: ok thanks\
[16:19:11] (PS3) Shanmugamp7: Enable Extension:Shorturl on sa wiki projects [mediawiki-config] - https://gerrit.wikimedia.org/r/201216 (https://phabricator.wikimedia.org/T94660)
[16:21:42] (PS1) Tim Landscheidt: Tools: Fix redirects from https to http [puppet] - https://gerrit.wikimedia.org/r/206519 (https://phabricator.wikimedia.org/T66627)
[16:22:08] (CR) Tim Landscheidt: "Tested on Toolsbeta:" [puppet] - https://gerrit.wikimedia.org/r/206519 (https://phabricator.wikimedia.org/T66627) (owner: Tim Landscheidt)
[16:48:12] (PS4) Shanmugamp7: Enable Extension:Shorturl on sa wiki projects [mediawiki-config] - https://gerrit.wikimedia.org/r/201216 (https://phabricator.wikimedia.org/T94660)
[17:17:56] operations, Architecture, MediaWiki-RfCs, RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1236117 (ssastry)
[18:24:19] (CR) Yuvipanda: [C: 2] Tools: Fix redirects from https to http [puppet] - https://gerrit.wikimedia.org/r/206519 (https://phabricator.wikimedia.org/T66627) (owner: Tim Landscheidt)
[18:30:32] Krinkle|detached: ^ omg the redirects are fixed :D
[18:48:36] operations, Wikimedia-Labs-General, Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1236169 (Aklapper)
[18:49:44] (CR) Hoo man: Update dispatchChanges cronjob to use new script location (2 comments) [puppet] - https://gerrit.wikimedia.org/r/205644 (owner: Aude)
[18:58:30] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[19:08:31] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[19:32:43] operations, MediaWiki-DjVu, MediaWiki-General-or-Unknown, Multimedia, Availability: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1236196 (GWicke) This seems to be caused by this call chain: - [DjVu::doTransform calls LocalFile::getM...
[19:48:10] YuviPanda: Woo, nice!
[19:48:25] Krinkle: :D scfc is awesome, etc
[20:10:40] operations, MediaWiki-DjVu, MediaWiki-General-or-Unknown, Multimedia, Availability: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1236228 (GWicke) Patch to remove the XML loading at https://gerrit.wikimedia.org/r/#/c/206526/.
[20:17:17] operations, Wikimedia-log-errors: internal_api_error_Exception: [22e05a83] Exception Caught: wfDiff(): popen() failed errors on English Wikipedia - https://phabricator.wikimedia.org/T97145#1236233 (Krenair)
[20:31:47] operations, MediaWiki-DjVu, MediaWiki-General-or-Unknown, Multimedia, and 2 others: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1236241 (GWicke)
[20:43:11] operations, WMF-Legal, Wikimedia-General-or-Unknown: dbtree loads third party resources - https://phabricator.wikimedia.org/T96499#1236252 (Krenair) https://bits.wikimedia.org/meta.wikimedia.org/load.php?modules=jquery&only=scripts ?
[21:00:12] PROBLEM - puppet last run on mc1017 is CRITICAL Puppet has 1 failures
[21:01:55] operations, Wikimedia-Labs-wikitech-interface, Regression: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1236287 (Aklapper) @Andrew: Any news? :-/
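[Note: Tim Landscheidt's change merged above at [18:24:19] (https://gerrit.wikimedia.org/r/206519, "Tools: Fix redirects from https to http") is what resolved the downgraded redirects; its contents are not reproduced in this log. As a hypothetical sketch of one way a fronting proxy can repair such redirects, nginx's proxy_redirect directive can rewrite the Location header a backend emits. The upstream name below is invented for illustration:]

    location / {
        proxy_pass http://tool-backend;  # assumed upstream name
        # Rewrite absolute http:// redirects from the backend (e.g.
        # lighttpd's trailing-slash redirect) back to https, so HTTPS
        # clients are not silently downgraded.
        proxy_redirect http://tools.wmflabs.org/ https://tools.wmflabs.org/;
    }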
[21:12:56] operations, Architecture, MediaWiki-RfCs, RESTBase, and 4 others: RFC: Don't retry 503 unless allowed by Retry-After in Varnish - https://phabricator.wikimedia.org/T97206#1236301 (GWicke)
[21:16:40] RECOVERY - puppet last run on mc1017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:56:21] PROBLEM - puppet last run on ganeti2003 is CRITICAL puppet fail
[22:13:01] RECOVERY - puppet last run on ganeti2003 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[22:32:49] (PS1) GWicke: Bump to 6ac383c [dumps/html/deploy] - https://gerrit.wikimedia.org/r/206612
[22:47:31] PROBLEM - puppet last run on mw2080 is CRITICAL puppet fail
[23:02:11] PROBLEM - puppet last run on cp3033 is CRITICAL puppet fail
[23:05:50] RECOVERY - puppet last run on mw2080 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures
[23:18:50] RECOVERY - puppet last run on cp3033 is OK Puppet is currently enabled, last run 1 second ago with 0 failures