[00:03:50] operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1235613 (Dzahn) So the admin groups requested are not needed?
[00:04:07] Ops-Access-Requests, operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1235615 (Dzahn)
[00:07:16] operations, WMF-Legal, Wikimedia-General-or-Unknown: dbtree loads third party resources - https://phabricator.wikimedia.org/T96499#1235617 (Dzahn) Do we already have jquery sitting on a wikimedia URL?
[00:08:12] Ops-Access-Requests, operations: Give Google webmaster tools access to jon katz (Read only is fine) - https://phabricator.wikimedia.org/T90980#1235618 (Dzahn)
[00:08:19] operations, Parsoid, Services: Offer io.js on Jessie - https://phabricator.wikimedia.org/T91855#1235619 (cscott) It should also be noted that io.js is a "friendly fork" of node.js, with the expectation that they will resync in the future. And they are both chasing v8, which is chasing the ES6 language...
[00:11:05] (PS1) Dzahn: site.pp: add labcontrol1001 [puppet] - https://gerrit.wikimedia.org/r/206486 (https://phabricator.wikimedia.org/T96048)
[00:17:04] (PS1) Yuvipanda: tools: Faux enable https for lighttpd by default [puppet] - https://gerrit.wikimedia.org/r/206488 (https://phabricator.wikimedia.org/T66627)
[00:17:25] (PS2) Yuvipanda: tools: Faux enable https for lighttpd by default [puppet] - https://gerrit.wikimedia.org/r/206488 (https://phabricator.wikimedia.org/T66627)
[00:22:08] (CR) Yuvipanda: [C: 2 V: 2] tools: Faux enable https for lighttpd by default [puppet] - https://gerrit.wikimedia.org/r/206488 (https://phabricator.wikimedia.org/T66627) (owner: Yuvipanda)
[00:32:54] (PS1) Yuvipanda: Revert "tools: Faux enable https for lighttpd by default" [puppet] - https://gerrit.wikimedia.org/r/206491
[00:33:09] (CR) Yuvipanda: [C: 2 V: 2] Revert "tools: Faux enable https for lighttpd by default" [puppet] - https://gerrit.wikimedia.org/r/206491 (owner: Yuvipanda)
[00:37:27] (CR) Krinkle: [C: -1] "This feature is desired and normalises URLs, it just shouldn't redirect to HTTP. It probably does that because HTTPS terminator is separat" [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: Dzahn)
[00:39:57] (CR) Krinkle: "How do we do this for app servers? The PHP environment for MediaWiki is aware of HTTPS being used. Maybe we can re-use that here?" [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: Dzahn)
[00:48:10] (CR) BBlack: "My reading of the docs seems to indicate DirectorySlash doesn't do the /-adding redirect in the general case, only when the path actually " [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: Dzahn)
[00:51:17] (CR) BBlack: "Well and now that I think about it: in the general case as a frontend proxy without knowledge of the app code or filesystem layout as appr" [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: Dzahn)
[00:55:25] (CR) Krinkle: "Yeah, nevermind. We can't do it from the proxy servers." [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: Dzahn)
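[Note: the change under review above (https://gerrit.wikimedia.org/r/206460) is not reproduced in this log. As a rough, untested sketch of the approach the reviewers discuss — disabling Apache's built-in trailing-slash redirect and re-issuing it with the scheme taken from the TLS terminator's X-Forwarded-Proto header — something like the following mod_rewrite configuration (server/vhost context) would do it. The directives are standard Apache; the values are illustrative, not the actual patch.]

    # Default the client-facing protocol to http, then trust the
    # X-Forwarded-Proto header set by the HTTPS terminator.
    RewriteEngine On
    RewriteRule ^ - [E=PROTO:http]
    RewriteCond %{HTTP:X-Forwarded-Proto} =https
    RewriteRule ^ - [E=PROTO:https]

    # Take over the trailing-slash redirect from Apache so it can
    # preserve the protocol the client actually used.
    DirectorySlash Off
    RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} -d
    RewriteCond %{REQUEST_URI} !/$
    RewriteRule ^(.*)$ %{ENV:PROTO}://%{HTTP_HOST}$1/ [R=301,L]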
[01:06:11] PROBLEM - puppet last run on dbstore2001 is CRITICAL puppet fail
[01:18:47] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235677 (GWicke)
[01:19:12] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235677 (GWicke)
[01:19:42] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235677 (GWicke)
[01:24:31] RECOVERY - puppet last run on dbstore2001 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures
[01:28:54] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235702 (GWicke)
[01:29:27] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235677 (GWicke)
[01:29:46] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235677 (GWicke)
[01:30:34] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235677 (GWicke)
[01:34:02] operations: install/setup/deploy db2043-db2070 - https://phabricator.wikimedia.org/T96383#1235706 (Springle) If we add role::mariadb::core and shard numbers before cloning data and starting mysqld + replication, we'll have a few hundred icinga alerts to silence or ack for a month. I vote for the normal route...
[01:38:58] (PS2) Dzahn: integration: Apache turn DirectorySlash Off [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164)
[01:40:38] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: Make sure timeouts are staggered, and there are no retries on timeout from a lower level - https://phabricator.wikimedia.org/T97204#1235707 (GWicke)
[01:46:41] (CR) Krinkle: [C: 1] integration: Apache turn DirectorySlash Off [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: Dzahn)
[01:52:26] operations, Architecture, RESTBase, incident-20150423-Commons, and 3 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1235710 (GWicke)
[01:52:43] operations, Architecture, MediaWiki-RfCs, RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1235677 (GWicke)
[02:06:14] operations, Architecture, MediaWiki-RfCs, RESTBase, and 4 others: RFC: Don't retry 503 unless allowed by Retry-After in Varnish - https://phabricator.wikimedia.org/T97206#1235713 (GWicke) NEW
[02:12:21] operations, Architecture, MediaWiki-RfCs, RESTBase, and 4 others: RFC: Don't retry 503 unless allowed by Retry-After in Varnish - https://phabricator.wikimedia.org/T97206#1235720 (GWicke)
[02:20:09] !log l10nupdate Synchronized php-1.26wmf2/cache/l10n: (no message) (duration: 07m 48s)
[02:20:27] Logged the message, Master
[02:24:36] !log LocalisationUpdate completed (1.26wmf2) at 2015-04-25 02:23:33+00:00
[02:24:46] Logged the message, Master
[02:28:31] (PS1) GWicke: Use /api/rest_v1/ entry point for VE, take two. [mediawiki-config] - https://gerrit.wikimedia.org/r/206502
[02:29:07] (PS2) GWicke: Use /api/rest_v1/ entry point for VE, take two. [mediawiki-config] - https://gerrit.wikimedia.org/r/206502
[02:29:28] (PS3) GWicke: Use /api/rest_v1/ entry point for VE, take two. [mediawiki-config] - https://gerrit.wikimedia.org/r/206502
[02:39:15] !log l10nupdate Synchronized php-1.26wmf3/cache/l10n: (no message) (duration: 05m 56s)
[02:39:20] Logged the message, Master
[02:42:57] !log LocalisationUpdate completed (1.26wmf3) at 2015-04-25 02:41:54+00:00
[02:43:01] Logged the message, Master
[03:00:42] gwicke, ori, now that both incidents are under control, should I cherry-pick Tim's patch and deploy that?
[03:02:17] subbu: I don't think it's necessary to do it now. Better limits / throttling are in order but I think it's OK to take a few days to think it over thoroughly.
[03:03:32] The HHVM max execution time limit we pushed out earlier should keep a problem like this (were it to recur) from blowing up to the magnitude that it did.
[03:04:00] That said, if you'd be more at ease with stricter limits over the weekend, go for it
[03:10:41] Krinkle: btw, decided to not send it now. friday evening, etc.
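[Note: the "HHVM max execution time limit" referenced at [03:03:32] caps how long any single request may run, so one runaway parse cannot tie up worker threads across the cluster. The exact production key and value are not shown in this log; as a hypothetical illustration only, HHVM exposes such a cap through its ini configuration:]

    ; Illustrative value, not the one actually deployed: abort any
    ; request that runs longer than 60 seconds.
    hhvm.server.request_timeout_seconds = 60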
[03:18:37] (PS1) Yuvipanda: tools: Redirect tools.wmflabs.org/toolname appropriately [puppet] - https://gerrit.wikimedia.org/r/206504 (https://phabricator.wikimedia.org/T66627)
[03:28:37] (PS2) Yuvipanda: tools: Redirect tools.wmflabs.org/toolname appropriately [puppet] - https://gerrit.wikimedia.org/r/206504 (https://phabricator.wikimedia.org/T66627)
[03:34:34] (PS3) Yuvipanda: tools: Redirect tools.wmflabs.org/toolname appropriately [puppet] - https://gerrit.wikimedia.org/r/206504 (https://phabricator.wikimedia.org/T66627)
[03:46:54] (CR) Yuvipanda: [C: 2] tools: Redirect tools.wmflabs.org/toolname appropriately [puppet] - https://gerrit.wikimedia.org/r/206504 (https://phabricator.wikimedia.org/T66627) (owner: Yuvipanda)
[03:50:24] (PS1) Yuvipanda: Revert "tools: Redirect tools.wmflabs.org/toolname appropriately" [puppet] - https://gerrit.wikimedia.org/r/206506
[03:50:32] (CR) Yuvipanda: [C: 2 V: 2] Revert "tools: Redirect tools.wmflabs.org/toolname appropriately" [puppet] - https://gerrit.wikimedia.org/r/206506 (owner: Yuvipanda)
[03:53:07] Krinkle: i accept defeat on this one for today https://phabricator.wikimedia.org/T66627#1235824
[03:54:49] O/
[03:54:55] YuviPanda: You've tried.
[03:55:04] Krinkle: I almost got it to work.
[03:55:08] Krinkle: and I think I have the right approach.
[03:55:20] Krinkle: however, I’m wondering… if a better solution is to just offer nginx as an option :D
[03:55:33] Krinkle: it doesn’t affect anything other than lighttpd
[03:55:38] uwsgi handles it just fine and so does nodejs
[03:55:51] YuviPanda: Right, the redirect is coming from the individual tools' lighttpd instances in most cases
[03:55:59] in all the cases
[03:56:05] nginx itself redirects protocol-relatively
[03:56:07] and so is good
[03:56:08] YuviPanda: What about the top level
[03:56:15] Krinkle: that’s what I tried to fix in that patch
[03:56:25] I mean, that one isn't running on lighttpd right?
[03:56:27] thing is you’ve to make sure you are only matching tools.wmflabs.org/toolname
[03:56:39] and not tools.wmflabs.org/toolname/ or tools.wmflabs.org/?status
[03:56:49] Krinkle: what do you mean by ‘toplevel’?
[03:56:53] Krinkle: tools.wmflabs.org?
[03:56:53] YuviPanda: I guess Lighttpd, like Apache, has DirectorySlash behaviour based on full url and it's never told that the user may be on HTTPS by the outside proxy
[03:57:09] maybe we can make it use x-forwarded-proto like we do in prod redirect.conf.
[03:57:12] Krinkle: well, it’s told it’s on https by outside proxy (X-Forwarded-Proto)
[03:57:18] Krinkle: it just doesn’t give a shit
[03:57:19] {ENV:PROTO}//$1/$2
[03:57:37] and the only way to customize that seems to be to either use embedded lua or to write your own module
[03:57:42] Yeah, but X-Forwarded-Proto isn't common enough to expect a program to handle
[03:57:47] Afaik Nginx and apache don't handle it either by default
[03:57:56] we made it do that in our conf
[03:58:01] sure, but it should allow me to handle it in config without having to embed lua or write config for it
[03:58:03] and even then, only for prod redirects.
[03:58:15] prod regular domains and misc web lb (as for integration.wikimedia.org) has the same bug
[03:58:22] from spending about 2h on it I can’t find any way to tell lighttpd to pick up the proto from x-forwarded-proto
[03:58:28] Ah, we can't program it on lighttpd?
[03:58:30] $HTTP[‘scheme’] can not be set
[03:58:34] yeah that’s the crux of the problem
[03:58:45] you can’t without using lua or writing a custom C module
[03:59:04] Krinkle: longterm, I’d like to replace lighttpd with nginx
[03:59:12] Krinkle: and language specific servers
[03:59:17] (like uwsgi / rack / nodejs / HHVM)
[03:59:59] YuviPanda: maybe scheme can't be overwritten but the redirect target may be overwritable
[04:00:21] e.g. disable directoryslash and enable a manual rewrite rule for if -d, redirect to proto://domain/path/
[04:00:22] Krinkle: yeah, if we had an explicit redirect. but we aren’t doing an explicit redirect so I’ve no idea how lighttpd is deciding to do the redirect
[04:00:37] there’s no way to disable directoryslash as far as I can tell
[04:00:41] in fact that behavior isn’t even documented
[04:02:21] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[04:02:53] Krinkle: emailed about cdnjs :)
[04:03:01] wtf icinga-wm
[04:03:04] YuviPanda: Looks like newer versions of lighttpd may have changed this
[04:03:06] > No changes to merge.
[04:03:20] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[04:04:59] Krinkle: oh, hmm. how newer?
[04:05:00] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[04:05:44] YuviPanda: "Lighttpd doesn't have this trailing slash problem in the latest release." -- random person on the internet
[04:05:46] .. in 2009
[04:05:55] nvm
[04:06:02] Krinkle: :)
[04:07:21] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[04:07:33] fine
[04:07:50] Krinkle: lighttpd would’ve not been my first choice, but there exist 228 .lighttpd.conf files now
[04:07:58] Krinkle: I wonder if / when wikimedia goes ssl only toollabs can too
[04:08:28] YuviPanda: that might not solve the problem though
[04:08:39] in fact it might make it worse :D
[04:08:39] unless we use ssl internally as well
[04:08:41] redirect loooop!
[04:08:43] yeah
[04:08:44] yup
[04:08:53] it definitely will make it worse :)
[04:09:04] we can maybe rewrite redirects coming *out* of the proxy to be http but that’s… a hack
[04:09:06] nah, why would it loop?
[04:09:15] https foo -> http foo/ -> https foo/
[04:09:19] just one extra hop
[04:09:24] oh hmm
[04:09:26] but still
[04:09:28] extra hop
[04:09:52] YuviPanda: we have the same problem in prod already with every non-mediawiki domain we have serving out of apache
[04:10:04] silently stripping https?
[04:10:08] Yup
[04:10:29] I’m going to for toollabs advocate someone writing and packaging a small lighttpd mod
[04:10:44] that handles XFS header
[04:10:49] in C
[04:10:58] it shouldn’t be too hard - there’s already one for XFF
[04:11:24] YuviPanda: I assume the XFF one is disabled for tool labs currently?
[04:11:28] and only for web proxy?
[04:11:41] Krinkle: yeah, and webproxy doesn’t set xff anyway
[04:11:47] so it doesn’t matter
[04:11:49] oh
[04:11:51] hm..
[04:11:58] so why is there one for XFF :D ?
[04:12:18] Krinkle: it’s included by default :D we don’t enable it
[04:13:47] right
[04:18:41] PROBLEM - puppet last run on mw2171 is CRITICAL puppet fail
[04:19:04] Krinkle: announcement for the cdnjs mirror sent :)
[04:19:19] Krinkle: I’m looking for other small things that’ll make web devs’ on toollabs life easier :) let me know if you got any ideas
[04:22:48] (CR) Mattflaschen: [C: 2] Enable VectorBeta form refresh on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/205474 (owner: Jdlrobson)
[04:22:50] (CR) jenkins-bot: [V: -1] Enable VectorBeta form refresh on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/205474 (owner: Jdlrobson)
[04:26:16] (PS3) Mattflaschen: Enable VectorBeta form refresh on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/205474 (owner: Jdlrobson)
[04:26:33] (CR) Mattflaschen: [C: 2] Enable VectorBeta form refresh on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/205474 (owner: Jdlrobson)
[04:26:38] (Merged) jenkins-bot: Enable VectorBeta form refresh on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/205474 (owner: Jdlrobson)
[04:30:26] !log mattflaschen Synchronized wmf-config/InitialiseSettings-labs.php: Sync Beta Cluster-only change (for MW UI beta feature) (duration: 00m 16s)
[04:30:32] Logged the message, Master
[04:30:56] !log mattflaschen Synchronized wmf-config/CommonSettings-labs.php: Sync Beta Cluster-only change (for MW UI beta feature) (duration: 00m 16s)
[04:30:59] Logged the message, Master
[04:32:27] (CR) Mattflaschen: "Deployed (automatically), synced (so it doesn't trigger the puppet error), and tested (works)." [mediawiki-config] - https://gerrit.wikimedia.org/r/205474 (owner: Jdlrobson)
[04:36:51] RECOVERY - puppet last run on mw2171 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[04:46:02] Ops-Access-Requests, operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1235870 (GWicke) Just re-read this task, and realized that apertium is third-party machine translation software. This means that sending logs to logstash or fluorine might not be tri...
[04:54:49] Ops-Access-Requests, operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1235872 (KartikMistry) @gwicke Both is good idea, but right now logs are sufficient. We've not 'seen' log since deployment, so it is bit weird :/
[04:57:50] Ops-Access-Requests, operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1235884 (GWicke) Also adding @bd808, as he might have an idea for shipping plain log files to logstash.
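[Note: the thread above concludes that lighttpd's trailing-slash redirect cannot be reconfigured directly — $HTTP[‘scheme’] is fixed by the local connection, and the only escape hatches are embedded Lua (mod_magnet) or a custom C module. As a hypothetical, untested sketch of the Lua route, re-implementing the redirect while trusting X-Forwarded-Proto (file path and details invented for illustration; query strings ignored for brevity):]

    # lighttpd.conf
    server.modules += ( "mod_magnet" )
    magnet.attract-physical-path-to = ( "/etc/lighttpd/trailing-slash.lua" )

    -- /etc/lighttpd/trailing-slash.lua
    -- If the request maps to a directory but lacks a trailing slash,
    -- issue the redirect ourselves with the proxied client's scheme.
    local path = lighty.env["uri.path"]
    local st = lighty.stat(lighty.env["physical.path"])
    if st and st["is_dir"] and not string.match(path, "/$") then
        local proto = lighty.request["X-Forwarded-Proto"] or "http"
        lighty.header["Location"] = proto .. "://" .. lighty.request["Host"] .. path .. "/"
        return 301
    end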
[05:18:44] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Apr 25 05:17:41 UTC 2015 (duration 17m 40s)
[05:18:52] Logged the message, Master
[05:21:31] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 75.00% of data above the critical threshold [24.0]
[05:26:30] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[06:04:32] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 66.67% of data above the critical threshold [24.0]
[06:09:01] PROBLEM - puppet last run on mw2184 is CRITICAL puppet fail
[06:16:01] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[06:29:20] PROBLEM - puppet last run on cp3037 is CRITICAL Puppet has 1 failures
[06:30:00] PROBLEM - puppet last run on labvirt1003 is CRITICAL Puppet has 1 failures
[06:30:10] PROBLEM - puppet last run on mc2011 is CRITICAL puppet fail
[06:30:20] PROBLEM - puppet last run on elastic1027 is CRITICAL Puppet has 1 failures
[06:30:41] PROBLEM - puppet last run on mw1099 is CRITICAL Puppet has 1 failures
[06:31:00] PROBLEM - puppet last run on cp1056 is CRITICAL Puppet has 1 failures
[06:31:11] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 1 failures
[06:31:20] PROBLEM - puppet last run on mw1100 is CRITICAL Puppet has 1 failures
[06:33:50] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:33:51] PROBLEM - puppet last run on mw2104 is CRITICAL Puppet has 1 failures
[06:34:51] PROBLEM - puppet last run on mw2083 is CRITICAL Puppet has 1 failures
[06:35:40] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures
[06:45:20] RECOVERY - puppet last run on elastic1027 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:45:40] RECOVERY - puppet last run on mw1099 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:45:51] RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:46:00] RECOVERY - puppet last run on cp1056 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:46:11] RECOVERY - puppet last run on mw1100 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:31] RECOVERY - puppet last run on mw2083 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:32] RECOVERY - puppet last run on labvirt1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:41] RECOVERY - puppet last run on mc2011 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:47:11] RECOVERY - puppet last run on mw2104 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:20] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:47:50] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:24:51] PROBLEM - HHVM busy threads on mw1197 is CRITICAL 40.00% of data above the critical threshold [115.2]
[08:26:31] PROBLEM - HHVM busy threads on mw1120 is CRITICAL 100.00% of data above the critical threshold [86.4]
[08:27:11] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[08:27:20] PROBLEM - HHVM queue size on mw1130 is CRITICAL 33.33% of data above the critical threshold [80.0]
[08:29:01] PROBLEM - HHVM busy threads on mw1136 is CRITICAL 60.00% of data above the critical threshold [86.4]
[08:29:01] RECOVERY - HHVM queue size on mw1130 is OK Less than 30.00% above the threshold [10.0]
[08:30:55] (CR) Glaisher: "doc.wikimedia.org too." [puppet] - https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: Dzahn)
[08:31:51] PROBLEM - HHVM busy threads on mw1190 is CRITICAL 80.00% of data above the critical threshold [115.2]
[08:38:30] RECOVERY - HHVM busy threads on mw1190 is OK Less than 30.00% above the threshold [76.8]
[08:38:51] RECOVERY - HHVM busy threads on mw1136 is OK Less than 30.00% above the threshold [57.6]
[08:40:31] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[08:40:58] (PS1) Glaisher: Modify AbuseFilter block configuration on eswikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/206510 (https://phabricator.wikimedia.org/T96669)
[08:41:40] RECOVERY - HHVM busy threads on mw1120 is OK Less than 30.00% above the threshold [57.6]
[08:49:00] PROBLEM - HHVM busy threads on mw1145 is CRITICAL 33.33% of data above the critical threshold [86.4]
[08:51:21] PROBLEM - HHVM busy threads on mw1117 is CRITICAL 60.00% of data above the critical threshold [86.4]
[08:52:22] PROBLEM - HHVM queue size on mw1198 is CRITICAL 33.33% of data above the critical threshold [80.0]
[08:58:00] PROBLEM - HHVM busy threads on mw1139 is CRITICAL 60.00% of data above the critical threshold [86.4]
[08:58:01] PROBLEM - puppet last run on lvs3003 is CRITICAL puppet fail
[09:05:25] <_joe_> the API cluster is super-loaded again
[09:06:22] RECOVERY - HHVM busy threads on mw1139 is OK Less than 30.00% above the threshold [57.6]
[09:07:20] RECOVERY - HHVM busy threads on mw1145 is OK Less than 30.00% above the threshold [57.6]
[09:07:22] RECOVERY - HHVM queue size on mw1198 is OK Less than 30.00% above the threshold [10.0]
[09:16:20] RECOVERY - puppet last run on lvs3003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:18:30] PROBLEM - HHVM busy threads on mw1190 is CRITICAL 33.33% of data above the critical threshold [115.2]
[09:18:51] PROBLEM - HHVM queue size on mw1147 is CRITICAL 60.00% of data above the critical threshold [80.0]
[09:21:01] PROBLEM - HHVM busy threads on mw1114 is CRITICAL 60.00% of data above the critical threshold [86.4]
[09:24:01] PROBLEM - HHVM queue size on mw1207 is CRITICAL 33.33% of data above the critical threshold [80.0]
[09:25:20] PROBLEM - HHVM busy threads on mw1190 is CRITICAL 83.33% of data above the critical threshold [115.2]
[09:27:30] PROBLEM - HHVM busy threads on mw1191 is CRITICAL 40.00% of data above the critical threshold [115.2]
[09:27:41] PROBLEM - HHVM busy threads on mw1114 is CRITICAL 60.00% of data above the critical threshold [86.4]
[09:28:01] PROBLEM - HHVM busy threads on mw1203 is CRITICAL 80.00% of data above the critical threshold [115.2]
[09:29:01] RECOVERY - HHVM queue size on mw1207 is OK Less than 30.00% above the threshold [10.0]
[09:29:41] RECOVERY - HHVM busy threads on mw1117 is OK Less than 30.00% above the threshold [57.6]
[09:30:11] PROBLEM - HHVM busy threads on mw1201 is CRITICAL 33.33% of data above the critical threshold [115.2]
[09:31:50] RECOVERY - HHVM busy threads on mw1201 is OK Less than 30.00% above the threshold [76.8]
[09:31:51] RECOVERY - HHVM busy threads on mw1190 is OK Less than 30.00% above the threshold [76.8]
[09:32:41] RECOVERY - HHVM busy threads on mw1114 is OK Less than 30.00% above the threshold [57.6]
[09:33:31] (CR) Giuseppe Lavagetto: "Please remember that this limit applies to the jobrunners as well, as I stated clearly in the reason for -1 on my own change introducing t" [puppet] - https://gerrit.wikimedia.org/r/206440 (owner: Ori.livneh)
[09:44:11] PROBLEM - HHVM busy threads on mw1143 is CRITICAL 40.00% of data above the critical threshold [86.4]
[09:47:50] PROBLEM - HHVM busy threads on mw1119 is CRITICAL 60.00% of data above the critical threshold [86.4]
[09:50:21] PROBLEM - HHVM busy threads on mw1201 is CRITICAL 40.00% of data above the critical threshold [115.2]
[09:52:20] PROBLEM - HHVM busy threads on mw1132 is CRITICAL 100.00% of data above the critical threshold [86.4]
[09:55:31] PROBLEM - HHVM queue size on mw1148 is CRITICAL 40.00% of data above the critical threshold [80.0]
[09:57:08] <_joe_> !log nuked User:Niteshift/MVneu/2015_April_21-30 on commonswiki
[09:57:14] Logged the message, Master
[09:57:20] RECOVERY - HHVM queue size on mw1148 is OK Less than 30.00% above the threshold [10.0]
[09:57:20] PROBLEM - HHVM busy threads on mw1129 is CRITICAL 33.33% of data above the critical threshold [86.4]
[09:59:00] RECOVERY - HHVM busy threads on mw1129 is OK Less than 30.00% above the threshold [57.6]
[09:59:31] RECOVERY - HHVM busy threads on mw1119 is OK Less than 30.00% above the threshold [57.6]
[10:00:51] RECOVERY - HHVM busy threads on mw1132 is OK Less than 30.00% above the threshold [57.6]
[10:01:10] RECOVERY - HHVM queue size on mw1147 is OK Less than 30.00% above the threshold [10.0]
[10:01:20] RECOVERY - HHVM busy threads on mw1191 is OK Less than 30.00% above the threshold [76.8]
[10:01:51] RECOVERY - HHVM busy threads on mw1203 is OK Less than 30.00% above the threshold [76.8]
[10:02:10] RECOVERY - HHVM busy threads on mw1197 is OK Less than 30.00% above the threshold [76.8]
[10:04:01] RECOVERY - HHVM busy threads on mw1201 is OK Less than 30.00% above the threshold [76.8]
[10:04:30] RECOVERY - HHVM busy threads on mw1143 is OK Less than 30.00% above the threshold [57.6]
[11:20:41] PROBLEM - configured eth on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:22:10] RECOVERY - configured eth on tin is OK - interfaces up
[11:26:11] PROBLEM - RAID on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:29:21] RECOVERY - RAID on tin is OK optimal, 1 logical, 2 physical
[11:41:51] PROBLEM - puppet last run on cp3033 is CRITICAL puppet fail
[11:42:41] operations: internal_api_error_Exception: [22e05a83] Exception Caught: wfDiff(): popen() failed errors on English Wikipedia - https://phabricator.wikimedia.org/T97145#1235990 (Anomie)
[11:47:05] operations: internal_api_error_Exception: [22e05a83] Exception Caught: wfDiff(): popen() failed errors on English Wikipedia - https://phabricator.wikimedia.org/T97145#1235992 (Anomie) Reviewing the logs again, this particular instance of the problem seems like it may have accidentally been fixed at around 20:...
[12:00:01] RECOVERY - puppet last run on cp3033 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:06:11] PROBLEM - RAID on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:07:41] RECOVERY - RAID on tin is OK optimal, 1 logical, 2 physical
[12:08:51] PROBLEM - puppet last run on mc2001 is CRITICAL puppet fail
[12:10:50] PROBLEM - puppet last run on multatuli is CRITICAL puppet fail
[12:25:21] RECOVERY - puppet last run on mc2001 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures
[12:27:21] RECOVERY - puppet last run on multatuli is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[12:39:21] (PS1) Aklapper: Phab monthly stats email: Clarify what day values for priority mean [puppet] - https://gerrit.wikimedia.org/r/206515
[13:02:32] (PS1) Aklapper: Phab monthly stats email: Show how many projects saw workboard moves [puppet] - https://gerrit.wikimedia.org/r/206518
[13:04:10] PROBLEM - puppet last run on mw2135 is CRITICAL Puppet has 1 failures
[13:04:14] (CR) Aklapper: [C: 1] Phab monthly stats email: Clarify what day values for priority mean [puppet] - https://gerrit.wikimedia.org/r/206515 (owner: Aklapper)
[13:15:40] PROBLEM - puppet last run on mw2205 is CRITICAL puppet fail
[13:20:41] RECOVERY - puppet last run on mw2135 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:33:51] RECOVERY - puppet last run on mw2205 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[14:09:11] PROBLEM - puppet last run on mw1100 is CRITICAL Puppet has 1 failures
[14:25:51] RECOVERY - puppet last run on mw1100 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[15:10:25] _joe_, hi.
[15:10:57] <_joe_> subbu: hey
[15:11:01] saw the ops email.
[15:11:11] let me cherry pick tim's patch.
[15:11:25] <_joe_> subbu: it was released AFAIK
[15:11:29] <_joe_> it wasn't?
[15:12:01] unless someone else did it after i reverted the deploy y'day .. see my mail on #ops.
[15:12:18] because of https://phabricator.wikimedia.org/T97155
[15:12:40] <_joe_> oh ok
[15:13:26] <_joe_> so the deploy contained a few other things
[15:13:51] <_joe_> ok, I guess this morning we had to thank the time limit on HHVM
[15:13:58] yes.
[15:19:39] <_joe_> subbu: so tell me if I can help
[15:20:10] i am about to sync and will restart after .. i had to resolve a conflict after cherry-pick.
[15:26:45] !log deployed parsoid version fca17070 (cherry-pick of d2135c6b on parsoid master)
[15:26:51] Logged the message, Master
[15:27:00] _joe_, there ^
[15:27:17] <_joe_> should I rolling-restart parsoid?
[15:27:23] i've restarted too.
[15:27:25] <_joe_> subbu: thanks!
[15:27:31] so, how bad was it this mornin?
[15:27:35] *morning
[15:27:45] morning CST/PST time :)
[15:42:29] _joe_, https://www.mediawiki.org/wiki/Parsoid/Deployments has all deployment info for parsoid.
[15:42:40] <_joe_> subbu: ok thanks\
[16:19:11] (PS3) Shanmugamp7: Enable Extension:Shorturl on sa wiki projects [mediawiki-config] - https://gerrit.wikimedia.org/r/201216 (https://phabricator.wikimedia.org/T94660)
[16:21:42] (PS1) Tim Landscheidt: Tools: Fix redirects from https to http [puppet] - https://gerrit.wikimedia.org/r/206519 (https://phabricator.wikimedia.org/T66627)
[16:22:08] (CR) Tim Landscheidt: "Tested on Toolsbeta:" [puppet] - https://gerrit.wikimedia.org/r/206519 (https://phabricator.wikimedia.org/T66627) (owner: Tim Landscheidt)
[16:48:12] (PS4) Shanmugamp7: Enable Extension:Shorturl on sa wiki projects [mediawiki-config] - https://gerrit.wikimedia.org/r/201216 (https://phabricator.wikimedia.org/T94660)
[17:17:56] operations, Architecture, MediaWiki-RfCs, RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1236117 (ssastry)
[18:24:19] (CR) Yuvipanda: [C: 2] Tools: Fix redirects from https to http [puppet] - https://gerrit.wikimedia.org/r/206519 (https://phabricator.wikimedia.org/T66627) (owner: Tim Landscheidt)
[18:30:32] Krinkle|detached: ^ omg the redirects are fixed :D
[18:48:36] operations, Wikimedia-Labs-General, Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1236169 (Aklapper)
[18:49:44] (CR) Hoo man: Update dispatchChanges cronjob to use new script location (2 comments) [puppet] - https://gerrit.wikimedia.org/r/205644 (owner: Aude)
[18:58:30] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[19:08:31] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[19:32:43] operations, MediaWiki-DjVu, MediaWiki-General-or-Unknown, Multimedia, Availability: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1236196 (GWicke) This seems to be caused by this call chain: - [DjVu::doTransform calls LocalFile::getM...
[19:48:10] YuviPanda: Woo, nice!
[19:48:25] Krinkle: :D scfc is awesome, etc
[20:10:40] operations, MediaWiki-DjVu, MediaWiki-General-or-Unknown, Multimedia, Availability: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1236228 (GWicke) Patch to remove the XML loading at https://gerrit.wikimedia.org/r/#/c/206526/.
[20:17:17] operations, Wikimedia-log-errors: internal_api_error_Exception: [22e05a83] Exception Caught: wfDiff(): popen() failed errors on English Wikipedia - https://phabricator.wikimedia.org/T97145#1236233 (Krenair)
[20:31:47] operations, MediaWiki-DjVu, MediaWiki-General-or-Unknown, Multimedia, and 2 others: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1236241 (GWicke)
[20:43:11] operations, WMF-Legal, Wikimedia-General-or-Unknown: dbtree loads third party resources - https://phabricator.wikimedia.org/T96499#1236252 (Krenair) https://bits.wikimedia.org/meta.wikimedia.org/load.php?modules=jquery&only=scripts ?
[21:00:12] PROBLEM - puppet last run on mc1017 is CRITICAL Puppet has 1 failures
[21:01:55] operations, Wikimedia-Labs-wikitech-interface, Regression: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1236287 (Aklapper) @Andrew: Any news? :-/
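[Note: Tim Landscheidt's change merged above at [18:24:19] (https://gerrit.wikimedia.org/r/206519, "Tools: Fix redirects from https to http") is what resolved the downgraded redirects; its contents are not reproduced in this log. As a hypothetical sketch of one way a fronting proxy can repair such redirects, nginx's proxy_redirect directive can rewrite the Location header a backend emits. The upstream name below is invented for illustration:]

    location / {
        proxy_pass http://tool-backend;  # assumed upstream name
        # Rewrite absolute http:// redirects from the backend (e.g.
        # lighttpd's trailing-slash redirect) back to https, so HTTPS
        # clients are not silently downgraded.
        proxy_redirect http://tools.wmflabs.org/ https://tools.wmflabs.org/;
    }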
[21:12:56] operations, Architecture, MediaWiki-RfCs, RESTBase, and 4 others: RFC: Don't retry 503 unless allowed by Retry-After in Varnish - https://phabricator.wikimedia.org/T97206#1236301 (GWicke)
[21:16:40] RECOVERY - puppet last run on mc1017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:56:21] PROBLEM - puppet last run on ganeti2003 is CRITICAL puppet fail
[22:13:01] RECOVERY - puppet last run on ganeti2003 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[22:32:49] (PS1) GWicke: Bump to 6ac383c [dumps/html/deploy] - https://gerrit.wikimedia.org/r/206612
[22:47:31] PROBLEM - puppet last run on mw2080 is CRITICAL puppet fail
[23:02:11] PROBLEM - puppet last run on cp3033 is CRITICAL puppet fail
[23:05:50] RECOVERY - puppet last run on mw2080 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures
[23:18:50] RECOVERY - puppet last run on cp3033 is OK Puppet is currently enabled, last run 1 second ago with 0 failures