[00:00:55] sure? i bet i didn't do more than 100 thumbnails from this address before this message came
[00:01:31] Who's "I"? You or your IP address? :-)
[00:01:37] ^d: does the same message emit if the thumb is crashing during generation? Or does that have another message?
[00:01:48] Separate message for an error.
[00:01:50] It looks like.
[00:02:13] } elseif ( wfThumbIsAttemptThrottled( $img, $thumbName, 5 ) ) {
[00:02:13] wfThumbError( 500, wfMessage( 'thumbnail_image-failure-limit', 5 ) );
[00:02:16] yurik: just got off the flight :)
[00:02:16] return;
[00:02:18] }
[00:02:42] no worries, its all good greg-g, did deployment. Do you have varnish root?
[00:02:52] (stupid free airport wifi lag)
[00:02:59] no
[00:03:00] well we tested this app from both devices, every one of us trying 30 pictures maximum. since all is running over a server run by a friend i also don't see why anyone else should try to create thumbs from this IP
[00:03:23] greg-g: Glad you made it. I'm still in ORD. :/
[00:03:25] <^d> Hmmm, hmmmm
[00:03:34] i wonder if jgage can help - need to flush https://wikitech.wikimedia.org/wiki/MobileFrontend#Flushing_the_cache
[00:03:39] bd808|MOBILE: yuck. The land of boingo wifi
[00:04:03] bd808|MOBILE, ORD is probably the worst in the country
[00:04:22] <^d> "(limit 70 in 30s)" according to limiter.log
[00:04:27] <^d> That's...not consistent with the conf.
[00:04:47] I need to get off laptop/ready to flag down a bus soon, just fyi
[00:05:10] Verizon LTE from the tarmac actually. "Micro cells" holding 30+ planes already pushed off from gate.
[00:05:36] <^d> Flexman: Would you mind /msg'ing me your IP address? I'm pretty sure I've got it from the logs already, but I want to confirm I'm looking at you.
[00:05:36] * bd808|MOBILE looks for link to passenger bill of rights
[00:06:54] This is America. You have few rights.
[00:07:06] I wish I was kidding. The EU forces some regulations about this sort of stuff on the airlines but the US doesn't.
[00:07:56] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[00:08:10] yurik: is this the same flush request adam was emailing about earlier today?
[00:08:28] yurik: I don't think we want to flush everything in all the mobile varnishes, right?
[00:08:47] bblack, not exactly - we have just updated core logic behind zero.wikipedia.org, which redirects to ..., etc
[00:08:58] there are tons of redirects unfortunately, all of which are cached
[00:09:12] if only urls were canonical! :)
[00:09:17] hehe
[00:09:22] some day :)
[00:09:35] but seriously, we can't limit it any more than "flush all the cache"?
[00:09:55] <^d> Flexman, Gloria: Ok, got it now. Two things happening here that make it confusing. A) We're reusing the same ratelimit message for both renderfile and renderfile-notstandard limiting. That's confusing.
[00:10:04] it might be difficult - can you flush anything that has "zero.wikipedia.org" in it?
[00:10:08] yurik, so what you need is not a complete flush but just a selective ban?
[00:10:12] <^d> renderfile-notstandard is actually limited to 70 thumbs in 30 secs, it's the other one that's at 700/30
[00:10:21] (PS3) Tim Starling: final (I hope!)
fix for protorel redirects [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (https://bugzilla.wikimedia.org/31369) (owner: Jeremyb)
[00:10:35] MaxSem, we don't know for sure - one of our partners reported tons of very strange issues
[00:10:45] some of them seem to go away with cache flushing
[00:10:49] hence - we don't know
[00:10:53] yurik: unfortunately the hostname (especially the zero-qualified hostname) is a request header, flushing on that doesn't work as well as a response header or url attached to the actual object
[00:11:05] (CR) Tim Starling: "PS3: rebase" [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (https://bugzilla.wikimedia.org/31369) (owner: Jeremyb)
[00:11:11] remember we kill that hostname before it ever hits the backend and replace it with a header?
[00:11:33] I mean, it can be done, it's just not very efficient at all, it will linger as a check on all requests for the next ~30 days.
[00:11:38] bblack, what about X-SUBDOMAIN
[00:12:14] i think it might as well be a global flush - according to this page, its not a big deal :) https://wikitech.wikimedia.org/wiki/MobileFrontend#Flushing_the_cache
[00:12:22] yurik: the distinction is client request vs returned object in the cache. it would be better if the flush were on an attribute of the object, not the request
[00:12:29] that page is a bit obsolete;)
[00:13:01] <^d> Flexman: Ok, so here's a solution that might help. If you pick one of the "standard" thumbnail sizes (based on $wgImageLimits/$wgThumbLimits widths) you'll be at the 700 thumbs in 30 seconds limit instead of 1/10th that.
[00:13:34] bblack, does the "Location" header on the response count?
[00:13:45] this way at least we can kill all redirects
[00:13:58] ^d: well 70 thumbs in 30 seconds would also be ok, but i guess the limit was far less than that
[00:14:24] is it maybe 70 thumbs in 30 minutes??
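A minimal sketch of the two limits ^d quotes above (700 requests per 30 seconds for renderfile, 70 per 30 seconds for renderfile-notstandard), assuming a per-client sliding window. This is not MediaWiki's limiter: the class name, the `action_for_width` helper and the sliding-window approach are all hypothetical and only illustrate why a non-standard width runs into the tighter budget first.

```python
import time
from collections import defaultdict, deque

# Limits as quoted in the log: action -> (max requests, window in seconds).
LIMITS = {
    "renderfile": (700, 30),
    "renderfile-notstandard": (70, 30),
}


class SlidingWindowLimiter:
    """Hypothetical per-(client, action) limiter; not the real implementation."""

    def __init__(self, limits, clock=time.monotonic):
        self.limits = limits
        self.clock = clock
        self.hits = defaultdict(deque)  # (client, action) -> request timestamps

    def allow(self, client, action):
        max_hits, window = self.limits[action]
        now = self.clock()
        recent = self.hits[(client, action)]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= window:
            recent.popleft()
        if len(recent) >= max_hits:
            return False
        recent.append(now)
        return True


def action_for_width(width, standard_widths):
    """Pick the limit bucket; standard_widths is an assumed set of configured widths."""
    if width in standard_widths:
        return "renderfile"
    return "renderfile-notstandard"
```

Under these assumptions a client that asks only for configured widths gets ten times the budget of one that asks for arbitrary widths, which matches the advice given to Flexman.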
[00:14:25] yurik: yes, but that would be killing entries that were previously redirects, not entries that were previously real pages and are now redirects :)
[00:15:02] the problem is that it would be much easier for us to request the thumbs in a certain resolution since the "standard" size varies, doesn't it?
[00:15:04] <^d> Flexman: No, it's seconds. I'm pretty sure you were hitting the limiter and not realizing it. What sort of things does your app do? Just request thumbs?
[00:15:26] yes, it just requested the thumbs
[00:16:18] yurik: the statement about it not being a big deal is from a little over two years ago by a guy who doesn't even work here anymore :)
[00:16:29] it might be ok, but hasn't mobile traffic grown since then?
[00:16:41] probably :)
[00:16:46] definitely :)
[00:16:56] the reason is that we display pictures of heritage-protected buildings
[00:17:02] i guess it would be a good idea not to kill everything at once, but go one server at a time :) Checking how to filter :(
[00:17:37] yurik: do you have a link to a changeset that describes what changed on the backend that affected so many redirects?
[00:17:43] ^d: so i really wonder how we should have hit the limiter... can you see it in the log that we did so many requests?
[00:18:05] will the limit change if i sign in?
[00:18:15] <^d> Limit's the same for anonymous and logged in users here.
[00:19:09] bblack, how about this: several bans: all objects with X-CS in response, all objects with LOCATION headers. This should be small enough set. As for the change - no, the problem is that it seems there has been some accumulation of patches by different teams that made all varnishes out of sync - it shows up on one varnish, but not the other, etc
[00:20:05] going forward, we should probably take a look at how that came to be the case. As I was saying in an earlier email thread, varnish bans shouldn't be a common occurrence on change deployments.
[00:20:23] <^d> Flexman: I'm digging through the archived logs now, one moment...
[00:20:56] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds
[00:21:22] i agree, but at the end of the day, if the difference is a day's work of a dev vs a minor slow down, we should probably go with the slowdown. But yes, we should be more cache conscious :)
[00:22:25] bblack, and as a forewarning - there will be another large flush coming up - https://gerrit.wikimedia.org/r/#/c/132952/
[00:22:33] <^d> Flexman: I can't find you in our logs based on the IP you sent me. I'm kind of at a loss right now. I'm going to file a bug about this and CC Aaron...he worked on this rate limiting stuff very recently and might have a clue.
[00:22:51] we will stop all HTTP->HTTPS redirects for zero subdomain
[00:23:46] yurik: not all slowdowns are minor, especially sweeping flushes of content. this will only become more the case the more that mobile traffic grows
[00:24:03] ^d: ok thank you
[00:24:15] well i can try to request a thumb now. one moment
[00:24:52] <^d> tailing the log, should see you try if you hit the limit again
[00:29:16] hmm i guess we have to enable this feature first and will try again tomorrow
[00:30:10] (PS4) Tim Starling: final (I hope!) fix for protorel redirects [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (https://bugzilla.wikimedia.org/31369) (owner: Jeremyb)
[00:30:35] <^d> Flexman: If you could ping me tomorrow when you're testing that'd be great so I can have a look on our side as you're doing it.
[00:31:34] ^d: great thanks
[00:31:35] hmm
[00:32:36] <^d> Filed to track this as well.
[00:33:34] bblack, we should still figure out a way to flush at least by carrier ID - all too often our partner comes to us with a problem, and we have no clue how to debug it. Manually flushing individual pages is ridiculous.
I think a tool to flush everything with a given XCS would be of great help
[00:34:10] flushing a single carrier isn't that big a deal
[00:34:35] ^d: just checked the IP again, it is correct. strange you can't find it
[00:34:41] although I still wouldn't want there to be a standard tool for it. it would be better if you had a clue how to debug it and what was going wrong. this reeks of "reboot it to fix the problem, it's windows"
[00:34:47] (CR) Tim Starling: "PS4: using %{HTTP_HOST} in a 301 target carries a risk of cache pollution, since although %{HTTP_HOST} must match the ServerName/ServerAli" [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (https://bugzilla.wikimedia.org/31369) (owner: Jeremyb)
[00:37:35] yurik: in theory the X-CS and Location -having objects are banned, can you validate the result?
[00:38:15] ^d: thank you very much, i'll try again tomorrow and ping you
[00:38:36] <^d> You're welcome, hope we can get it sorted out tomorrow for you. Have a good evening.
[00:39:35] yurik: you can query appservers directly and compare them to caches if you suspect a cache issue anyways, it should be pretty clear at that point whether a cache flush will help you or not.
[00:48:50] (CR) Tim Landscheidt: [C: -1] Set up redirects for toolserver.org (1 comment) [operations/apache-config] - https://gerrit.wikimedia.org/r/108465 (https://bugzilla.wikimedia.org/60238) (owner: Tim Landscheidt)
[00:58:35] yurik: ?
[00:58:48] bblack, fighting with cookies :(
[00:58:52] they are evil :(
[00:59:00] stop using them then! :)
[00:59:15] on one hand, results are good - i can see the code is working. On the other - it is not doing what i need :)
[00:59:26] i wish! i'm not using them - i am actually trying to delete them :)
[01:00:06] you could have figured out the code wasn't working by hitting the appserver directly. if that's difficult with the rest of whatever you test with, someone should work on making that easier.
[01:00:16] (e.g.
due to them being on private networks)
[01:00:37] most of our users test things by navigating to wikipedia.org :)
[01:01:06] considering that we still haven't been able to fully replicate that to beta cluster, we shouldn't even try to build another env :)
[01:01:26] just saying, we don't want to iterate a cycle of pointless flushes only to find we're re-caching bad results and deploying yet a different fix and flushing again. that whole cycle can be short-circuited by validating things without involving the cache first.
[01:02:48] well, consider this example - we are testing with a carrier, who first enabled both m & zero, then added opera, then removed m, then killed zero, then reenabled zero+opera
[01:03:04] and?
[01:03:08] and that's not some random example - we have them right now :)
[01:03:29] and - how do we deal with cache?
[01:04:01] when they change default lang from en to bn, the redirect is broken and needs flushing
[01:04:34] we would have to build a substantial infrastructure just to support various flushing
[01:05:05] it sounds like we're varying on the wrong things here then, since not much of this really affects the delivered content we care about caching in the first place.
[01:05:07] if we simplify, and flush everything related to them (assuming its not a big impact), everyone wins
[01:05:25] we can then concentrate on making all our code non-carrier dependent, thus removing this issue altogether
[01:06:03] in any case, why does switching on and off the conditions under which X-CS is set affect caching in the first place?
[01:06:55] the cache isn't based on client IP after all, it's based on the X-CS header that's determined dynamically from zero config
[01:07:51] (CR) Tim Landscheidt: "We've strayed away with "echo "$@" | exec mysql $param -h $server $db" already, but if sql is supposed to behave like the eponymous Toolse" [operations/puppet] - https://gerrit.wikimedia.org/r/113755 (owner: Hoo man)
[01:07:52] some of the content is shown based on the configuration
[01:08:11] ?
[01:08:15] mostly its the content related to special:zero page
[01:08:19] based on the X-CS header you mean, right?
[01:08:34] it is based on X-CS header + configuration
[01:08:36] because we're not varying "on the configuration"
[01:08:43] so that would be broken by design
[01:08:58] correct, because we are assuming that configuration is static, and when it changes, we can flush it
[01:09:24] what is "the configuration" above anyways? the if/else clusterfuck of m/zero/opera/https/etc?
[01:09:38] and in a way, it is broken by design - legacy is always a pain to deal with :)
[01:09:55] yes, with all the javascript blob values
[01:10:01] lovely
[01:10:23] yep
[01:10:29] so, the problem here is "we are assuming that configuration is static", while it's known there is a constant process for routinely updating that configuration
[01:10:41] you know, somehow most sites manage to give different things to different users ;)
[01:11:10] mostly yes - the constant change of the configs is mostly during setup
[01:11:21] once done, no one touches it for years
[01:11:43] we are testing with a carrier, who first enabled both m & zero, then added opera, then removed m, then killed zero, then reenabled zero+opera
[01:11:54] yep
[01:12:06] because frequently they don't know themselves how much they can give us, etc
[01:12:33] its a process, i try not to get involved with the carriers too much, it could be painful :)
[01:12:33] you could stop keying on MCC-MNC and use make-believe keys that version the configuration
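A rough sketch of the "make-believe keys that version the configuration" idea, assuming the cache varies on the X-CS value. Everything here is hypothetical: the `CarrierConfigs` class, the two-digit version suffix and the config fields are invented for illustration and are not taken from the Zero extension or the Varnish setup.

```python
class CarrierConfigs:
    """Hypothetical registry: a published config is never edited, only superseded."""

    def __init__(self):
        self.configs = {}  # versioned key, e.g. "470-01-02" -> frozen config
        self.active = {}   # carrier id, e.g. "470-01" -> versioned key being served

    def publish(self, carrier, config):
        # Count earlier versions of this carrier and issue the next number.
        prefix = carrier + "-"
        version = sum(1 for key in self.configs if key.startswith(prefix)) + 1
        key = f"{carrier}-{version:02d}"
        self.configs[key] = dict(config)
        self.active[carrier] = key
        return key


def cache_key(url, xcs):
    """What the cache would vary on: the URL plus the versioned X-CS value."""
    return (url, xcs)
```

With keys like these a config change needs no ban: requests start carrying the new X-CS value, miss the cache, and the entries stored under the old value are simply never requested again and age out.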
[01:12:58] instead of changing the conditions of 470-01, you'd remove the config logic for 470-01-01 and replace it with new logic for 470-01-02
[01:13:15] how is that different from flushing? :)
[01:13:36] it's not dangerous and error-prone and as potentially performance-impacting
[01:14:09] if we vary on X-CS, the logical meaning of a given X-CS value shouldn't change - invent new ones.
[01:15:12] bblack, even though i do agree with you wrt this portion of the design, i would rather try to get away from varying on xcs altogether
[01:15:24] that would be awesome, too
[01:15:47] will be fun :)
[01:16:26] didn't we once before shoot down the idea of replacing the banner with a small pre-generated X-CS-specific PNG?
[01:16:31] I can't remember
[01:16:56] (where we'd rewrite the PNG's request URL in varnish)
[01:17:35] I know it would be tricky to scale, but can't that be solved with css and img dimensions and all that? or too hard on phones?
[01:17:35] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Tue May 13 01:17:31 UTC 2014
[01:23:45] (CR) Tim Starling: [C: -1] final (I hope!) fix for protorel redirects (2 comments) [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (https://bugzilla.wikimedia.org/31369) (owner: Jeremyb)
[01:24:24] (or any of the other related crazy ideas: e.g. javascript loading a JSON object with the banner text and dom-manipulating it into place, but that depends on javascript and not all phone browsers have it. or iframes, which may also not be universal)
[01:32:33] (CR) Tim Starling: final (I hope!)
fix for protorel redirects (1 comment) [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (https://bugzilla.wikimedia.org/31369) (owner: Jeremyb)
[01:34:06] bblack, at this point i am still debating GIF (more supported than PNG) for older phones, with some backend image rendering, plus javascript-based proper banners
[01:35:39] we could just create the gif/png with a script once when the carrier signs and store them as static assets with unique filenames and rewrite those requests, no need for any on-the-fly renderer.
[01:36:22] if gif+js covers all the target devices, it would be way easier and faster than waiting on us to have ESI (and probably more efficient, too)
[01:38:35] i wouldn't want to manually render 300 languages * 100 carriers :) (yes, exaggerating, but translations arrive each day)
[01:38:54] (PS5) Tim Starling: final (I hope!) fix for protorel redirects [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (https://bugzilla.wikimedia.org/31369) (owner: Jeremyb)
[01:40:08] fair enough
[01:41:18] I'm out for an hour or two, will check back later
[01:41:20] i am more concerned with fighting modern browsers
[01:41:39] how to prevent them from loading the stuff i don't need
[01:41:57] thx for your help, will probably head to bed soon, was early morning
[01:45:55] (CR) Tim Starling: "PS5:" [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (https://bugzilla.wikimedia.org/31369) (owner: Jeremyb)
[01:59:34] ohai TimStarling. reply on the way in a min
[02:02:56] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[02:07:22] (CR) Jeremyb: "replies inline" (2 comments) [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (https://bugzilla.wikimedia.org/31369) (owner: Jeremyb)
[02:07:39] TimStarling: ^
[02:12:20] It looks like just pipermail redirects to HTTP?
[02:12:49] I *think* so
[02:12:55] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3791 MB (3% inode=99%):
[02:13:01] That's a pretty cute bug.
[02:13:19] i've known about it for a while and always thought it was intentional
[02:13:41] Seems goofy.
[02:13:43] but in these days of HTTPSing everything i think we could reconsider
[02:20:55] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3433 MB (3% inode=99%):
[02:21:08] apparently private archives use the pattern https://lists.wikimedia.org/mailman/private/${list}/
[02:21:17] so it's still pipermail, just not pipermail in URL
[02:21:20] Gloria
[02:31:36] (CR) Tim Starling: final (I hope!) fix for protorel redirects (1 comment) [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (https://bugzilla.wikimedia.org/31369) (owner: Jeremyb)
[02:31:36] !log LocalisationUpdate completed (1.24wmf3) at 2014-05-13 02:30:33+00:00
[02:31:44] Logged the message, Master
[02:32:53] The concern would be a redirect loop, of course.
[02:34:43] (CR) Jeremyb: final (I hope!) fix for protorel redirects (1 comment) [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (https://bugzilla.wikimedia.org/31369) (owner: Jeremyb)
[02:35:49] (PS1) Tim Starling: Added a few tests for redirects.conf [operations/apache-config] - https://gerrit.wikimedia.org/r/133045
[02:35:56] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[02:37:58] (CR) Tim Starling: "Added tests in a dependent commit." [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (https://bugzilla.wikimedia.org/31369) (owner: Jeremyb)
[02:52:55] MaxSem: wow. thanks for the heads-up
[02:53:20] so, revert?
[02:53:25] yes
[03:00:55] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:01:11] !log LocalisationUpdate completed (1.24wmf4) at 2014-05-13 03:00:07+00:00
[03:01:18] Logged the message, Master
[03:01:19] springle, https://gerrit.wikimedia.org/r/#/c/133046/
[03:06:14] MaxSem: what's the difference between ^ and the revert button on gerrit? (just wondering)
[03:06:44] I produced ^ using that button:)
[03:06:50] ah :)
[03:20:39] springle, now mergeable, waiting for your review
[03:26:38] MaxSem: thanks
[03:44:55] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[03:45:40] "We have normality. I repeat, we have normality. Anything you still can't cope with is therefore your own problem."
[04:08:59] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue May 13 04:07:53 UTC 2014 (duration 7m 52s)
[04:09:04] Logged the message, Master
[04:28:56] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds
[05:08:46] (CR) Tim Starling: [C: 1] "OK for me to deploy this tomorrow?" [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (https://bugzilla.wikimedia.org/31369) (owner: Jeremyb)
[06:13:56] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[06:57:14] (PS1) Matanya: bits: add icinga check [operations/puppet] - https://gerrit.wikimedia.org/r/133051
[06:57:27] (CR) jenkins-bot: [V: -1] bits: add icinga check [operations/puppet] - https://gerrit.wikimedia.org/r/133051 (owner: Matanya)
[06:59:36] hmm, why is this?
[07:44:55] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[08:01:56] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[08:11:56] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[08:18:49] (PS6) Giuseppe Lavagetto: Get rid of redundant and confusing $cluster defs. [operations/puppet] - https://gerrit.wikimedia.org/r/132921
[08:19:56] (PS2) Giuseppe Lavagetto: Fix dynamic scoping in iptables.pp [operations/puppet] - https://gerrit.wikimedia.org/r/132945
[08:27:23] (PS3) Giuseppe Lavagetto: Fix dynamic scoping in iptables.pp [operations/puppet] - https://gerrit.wikimedia.org/r/132945
[08:31:02] springle: around?
[08:31:23] paravoid: yep
[08:31:24] (CR) Giuseppe Lavagetto: [C: 2] "merging." [operations/puppet] - https://gerrit.wikimedia.org/r/132945 (owner: Giuseppe Lavagetto)
[08:31:46] lots of mariadb warnings for dbtore1002
[08:32:05] WARNING slave_io_state Slave_IO_Running: No
[08:32:06] etc.
[08:32:08] yep
[08:32:23] playing with it. should be ok unless they go critical
[08:32:28] ah, okay :)
[08:33:40] _joe_: can you please advise why https://gerrit.wikimedia.org/r/133051 refuses to rebase?
[08:35:16] <_joe_> matanya: you should rebase manually
[08:35:39] <_joe_> btw, if I can advise you
[08:35:55] <_joe_> do not create a new command in checkcommands.cfg
[08:36:03] <_joe_> if it's so specific
[08:36:11] <_joe_> you're using check_http after all
[08:36:44] _joe_: i have, got: ! [remote rejected] HEAD -> refs/publish/production/7451 (no new changes)
[08:36:45] <_joe_> so put everything in the monitor_service define
[08:37:04] and btw, i didn't create it, it was there already, just using it
[08:41:16] <_joe_> oh really?
I must have overlooked the patch
[08:45:09] (PS2) Matanya: bits: add icinga check [operations/puppet] - https://gerrit.wikimedia.org/r/133051
[08:48:01] computer voodoo
[08:53:57] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[08:54:10] (PS1) Ottomata: Removing Leila Zia's SSH key [operations/puppet] - https://gerrit.wikimedia.org/r/133052
[08:54:27] (PS2) Ottomata: Removing Leila Zia's SSH key [operations/puppet] - https://gerrit.wikimedia.org/r/133052
[08:54:42] (CR) Ottomata: [C: 2 V: 2] Removing Leila Zia's SSH key [operations/puppet] - https://gerrit.wikimedia.org/r/133052 (owner: Ottomata)
[08:55:12] (PS12) Giuseppe Lavagetto: protoproxy: call enable_ipv6_proxy in a sane way [operations/puppet] - https://gerrit.wikimedia.org/r/118966 (owner: Matanya)
[08:55:22] <_joe_> oh man, I will need to rebase again
[08:55:29] <_joe_> you beat me to it ottomata :P
[08:55:51] (PS13) Giuseppe Lavagetto: protoproxy: call enable_ipv6_proxy in a sane way [operations/puppet] - https://gerrit.wikimedia.org/r/118966 (owner: Matanya)
[08:56:23] _joe_: this patch evolved to be a nightmare
[08:56:29] uh oh :)
[08:57:48] What happened to that person?
[08:57:51] :O
[08:58:29] Ah right..
[08:58:29] k
[08:58:37] I can't ask that..
[08:58:59] So how is it working in operations?
[09:00:42] <_joe_> Bsadowski1: sorry?
[09:10:33] (CR) Giuseppe Lavagetto: [C: 2] protoproxy: call enable_ipv6_proxy in a sane way [operations/puppet] - https://gerrit.wikimedia.org/r/118966 (owner: Matanya)
[09:13:09] <_joe_> and that was scary
[09:13:28] congratz
[09:14:54] <_joe_> matanya: we could've torn down the whole ssl cluster for ipv6 if we did something wrong.
[09:15:12] yes, i know :) glad it didn't
[09:15:44] <_joe_> now.
getting to the hard ones
[09:25:56] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[09:39:40] (PS1) Giuseppe Lavagetto: Compile under puppet3. [operations/puppet] - https://gerrit.wikimedia.org/r/133053
[10:07:32] (PS2) Giuseppe Lavagetto: Compile under puppet3. [operations/puppet] - https://gerrit.wikimedia.org/r/133053
[10:27:13] (CR) Giuseppe Lavagetto: Compile under puppet3. (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/133053 (owner: Giuseppe Lavagetto)
[10:33:25] RECOVERY - Puppet freshness on analytics1026 is OK: puppet ran at Tue May 13 10:33:23 UTC 2014
[10:35:23] (PS1) Giuseppe Lavagetto: Make class call work. [operations/puppet] - https://gerrit.wikimedia.org/r/133056
[10:36:23] (CR) jenkins-bot: [V: -1] Make class call work. [operations/puppet] - https://gerrit.wikimedia.org/r/133056 (owner: Giuseppe Lavagetto)
[10:43:04] <_joe_> of course that does not work
[10:43:35] <_joe_> I just found out in http://projects.puppetlabs.com/issues/10146 that dashes in class names have been valid between puppet 2.7.1 and 2.7.20
[10:43:51] * _joe_ facepalm
[10:58:20] getting some lunch, back in a bit
[11:10:05] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[11:41:45] (PS1) Ottomata: [VERY-WIP] Refactor for easier abstraction and use in environments other than WMF production [operations/puppet/varnish] (refactor) - https://gerrit.wikimedia.org/r/133062
[11:43:12] (PS2) Ottomata: [VERY-WIP] Refactor for easier abstraction and use in environments other than WMF production [operations/puppet/varnish] (refactor) - https://gerrit.wikimedia.org/r/133062
[12:00:04] (PS3) Ottomata: [VERY-WIP] Refactor for easier abstraction and use in environments other than WMF production [operations/puppet/varnish] (refactor) - https://gerrit.wikimedia.org/r/133062
[12:01:33]
(CR) Ottomata: "See usage documentation in instance.pp." [operations/puppet/varnish] (refactor) - https://gerrit.wikimedia.org/r/133062 (owner: Ottomata)
[12:06:46] (PS2) Giuseppe Lavagetto: Make class call work. [operations/puppet] - https://gerrit.wikimedia.org/r/133056
[12:06:48] (CR) jenkins-bot: [V: -1] Make class call work. [operations/puppet] - https://gerrit.wikimedia.org/r/133056 (owner: Giuseppe Lavagetto)
[12:28:12] (PS10) Ottomata: Running update-server-info for submodules [operations/puppet] - https://gerrit.wikimedia.org/r/126846
[12:29:12] (CR) Ottomata: [C: 2 V: 2] "I suspect that git fetch, or even git merge master might need to be run on each submodule, but I am not sure." [operations/puppet] - https://gerrit.wikimedia.org/r/126846 (owner: Ottomata)
[12:35:08] (PS3) Giuseppe Lavagetto: Make class call work. [operations/puppet] - https://gerrit.wikimedia.org/r/133056
[12:39:12] (CR) Giuseppe Lavagetto: [C: -2] "do not merge for now." [operations/puppet] - https://gerrit.wikimedia.org/r/133056 (owner: Giuseppe Lavagetto)
[12:44:04] (CR) JanZerebecki: "See Change-Id: Iacc18a3dbcc81054ee5f420517ff07ce880001fa for the labs private dhparam file." [operations/puppet] - https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: JanZerebecki)
[12:58:27] <_joe_> matanya, akosiaris I updated http://etherpad.wikimedia.org/p/Puppet3 with some fresh compilation errors; if you ever want to shop some and fix it... just put a note on the etherpad.
[13:01:24] _joe_: i'll try to push one or two patches later
[13:04:36] <_joe_> matanya: one is enough, just re-read what you do twice :)
[13:05:21] _joe_: more like three times :)
[13:21:49] (CR) MZMcBride: "I still have a nagging feeling that trying to force pipermail to use HTTPS might result in a redirect loop, but that's easily remedied if " [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (https://bugzilla.wikimedia.org/31369) (owner: Jeremyb)
[13:25:57] (PS1) Matanya: lvs: physicalcorecount is a fact, fully qualify [operations/puppet] - https://gerrit.wikimedia.org/r/133070
[13:26:26] _joe_: ^
[13:29:10] <_joe_> matanya: on it
[13:29:22] i need some advice on the other one
[13:30:03] <_joe_> matanya: this one seems ok
[13:30:39] <_joe_> matanya: give me 10-15 minutes
[13:30:46] the other one i took, seems to be an issue with the name of cr2-knams
[13:30:51] sure, as long as you need
[13:31:00] <_joe_> I have a change to checkin and do it coordinately
[13:35:29] (PS4) Giuseppe Lavagetto: Make class call work. [operations/puppet] - https://gerrit.wikimedia.org/r/133056
[13:40:15] (PS5) Giuseppe Lavagetto: Make class call work. [operations/puppet] - https://gerrit.wikimedia.org/r/133056
[13:42:17] (CR) Giuseppe Lavagetto: [C: 2] Make class call work. [operations/puppet] - https://gerrit.wikimedia.org/r/133056 (owner: Giuseppe Lavagetto)
[13:47:16] <_joe_> ook. matanya 5 more minutes and I'm here for you.
[13:54:56] (PS2) Giuseppe Lavagetto: lvs: physicalcorecount is a fact, fully qualify [operations/puppet] - https://gerrit.wikimedia.org/r/133070 (owner: Matanya)
[13:55:15] (CR) Giuseppe Lavagetto: [C: 2] "LGTM." [operations/puppet] - https://gerrit.wikimedia.org/r/133070 (owner: Matanya)
[13:58:47] <_joe_> matanya: tell me everything.
[13:59:37] _joe_: Error: Must pass name to Pmacct::Configs[cr2-knams] on node rhenium.wikimedia.org
[13:59:59] <_joe_> ok what can I clarify for you?
[14:00:06] this points me to manifests/role/pmacct.pp on line 78
[14:00:25] i wonder why cr1-esams works
[14:00:35] and cr2-knams doesn't
[14:01:06] they should both be evaluated the same
[14:01:52] <_joe_> are you sure the other one gets evaluated correctly?
[14:02:12] <_joe_> the puppet parser stops at the first error.
[14:02:12] no, but it doesn't seem to complain from the etherpad
[14:02:30] i wonder if it starts from the end
[14:02:47] and moreover, if it is related to the dash
[14:03:07] <_joe_> it can start wherever, the $puppet_agents hash is passed as a parameter to the class pmacct
[14:03:13] <_joe_> you should look there
[14:03:36] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Tue May 13 11:03:04 2014
[14:03:53] ok. i must leave now, will look at it later, see you
[14:07:39] <_joe_> ok, so maybe I will take a look :)
[14:19:59] (CR) Giuseppe Lavagetto: [C: -1] "The idea is good. But the road to PFS support is not easy." (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: JanZerebecki)
[14:27:31] (PS3) Giuseppe Lavagetto: Compile under puppet3. [operations/puppet] - https://gerrit.wikimedia.org/r/133053
[14:27:40] (CR) Giuseppe Lavagetto: [C: 2] Compile under puppet3. [operations/puppet] - https://gerrit.wikimedia.org/r/133053 (owner: Giuseppe Lavagetto)
[14:37:15] PROBLEM - Host dysprosium is DOWN: PING CRITICAL - Packet loss = 100%
[14:37:54] <_joe_> is someone working on that? I don't see any role assigned to that machine
[14:38:49] cmjohnson1 said he would
[14:41:34] anomie, around?
[14:42:05] RECOVERY - Host dysprosium is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms
[14:42:23] manybubbles, anomie, i need to do a SWAT for https://gerrit.wikimedia.org/r/#/c/133043/ could someone review pls? thx
[14:43:20] yurikR: I sunk two hours into swat yesterday, maybe anomie ?
[14:43:28] ottomata: can we debug analytics1004?
[14:43:39] manybubbles, i will do the depl, just need a +2 ) [14:43:49] manybubbles: ja [14:43:53] got a few more minutes here [14:44:12] ottomata: just tried and failed to ssh to it [14:44:13] ottomata, have you been to the new place yet? i'm tihnking of going there today [14:44:19] manybubbles: do it one more time, watching now [14:44:28] yurikR: haven't been! but I am in the czech republic right now! [14:44:30] ottomata: done [14:44:32] back in NYC on the 21st [14:45:01] ottomata, enjoy )) let me know when u r back. [14:45:05] k [14:45:18] _joe_ that server is not being used for anything....testing DAC cables on it...i had to move it higher in the rack cuz the cables rob bought were 2M and did not reach [14:45:32] (03CR) 10Alex Monk: [C: 04-1] "See bug." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132947 (https://bugzilla.wikimedia.org/65221) (owner: 10Jackmcbarn) [14:45:33] weird, manybubbles, and you can log into elastic1001 right now? [14:45:55] ottomata: just did [14:45:58] hm, weird [14:46:05] the keys are the same [14:46:05] hm [14:46:14] perms look fine [14:46:17] am I sending a different key or something? [14:46:30] ottomata: do you see my failed attempt in the logs? [14:46:47] yes [14:46:56] it says failed publickey [14:47:00] doesn't say what the key was though [14:47:01] do ssh -v [14:47:05] make sure it is offering the right one [14:47:08] yurikR: Code looks mostly sane, but I'm not familiar enough with all the cookie issues to +2 it. Sorry. [14:47:22] yurikR: Also, if you want to join the morning SWAT team talk to greg-g. 
;) [14:47:56] manybubbles: on elastic1001, it fails the first publickey you offer [14:48:01] i see two offered on analytics1004 [14:48:03] both fail there [14:48:47] <_joe_> cmjohnson1: yeah sorry man, I'm just lagged as hell right now [14:49:24] ottomata: hmmm - let me try configuring it to just offer the right onw [14:49:29] check now [14:49:47] same deal [14:49:51] failing two keys [14:50:43] anomie, thx, will do [14:50:53] i'm going to remove your authorized_keys file and run puppet, dunno if that will actually do anything... [14:51:16] anomie, pls +1, will have to find someone who knows cookies [14:51:26] hmmm [14:51:26] * yurikR is developing a strong allergy to cookies [14:51:29] it didn't re-add your key [14:51:29] hmmm [14:52:25] ottomata: remember that that machine lost me a couple months ago! [14:52:29] ah! its my user id [14:52:31] ! [14:52:36] we changed it [14:52:45] hey when I try to clone https://git.wikimedia.org/git/operations/puppet.git I get the error: warning: remote HEAD refers to nonexistent ref, unable to checkout. [14:52:48] and that machine doesn't have me in puppet [14:53:14] oo, ok [14:53:15] yeah [14:53:18] your userid changed? [14:53:51] you are still accounts::manybubbles, ja? [14:53:57] anyone know what's going on? [14:54:01] ottomata: yeah, userid changed though [14:54:05] pancakes9: out of the top of my head, try git clone https://git.wikimedia.org/git/operations/puppet.git production [14:54:10] k [14:54:23] (03PS1) 10Ottomata: Adding manybubbles on analytics1004 for some elasticsearch testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/133076 [14:54:48] whois pancakes9 [14:54:48] pancakes9: once you've failed the clone $(cd puppet && git checkout production) should work too [14:54:49] oops [14:54:50] hi! [14:54:56] was asking who you were to IRC! [14:55:10] not much info there anyway, hi! 
[14:55:17] * pancakes9 is pancakes9 [14:55:22] or pancakes9: git clone https://gerrit.wikimedia.org/r/operations/puppet [14:55:49] ottomata: Andrew Otto [14:55:53] :) [14:55:58] I've outdone you in your own game [14:56:01] haha [14:56:21] (03PS4) 10Faidon Liambotis: librenms: make custom function compatible with puppet 3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/133053 (owner: 10Giuseppe Lavagetto) [14:56:28] (03CR) 10Faidon Liambotis: [C: 032 V: 032] librenms: make custom function compatible with puppet 3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/133053 (owner: 10Giuseppe Lavagetto) [14:56:35] (03PS2) 10Ottomata: Adding manybubbles on analytics1004 for some elasticsearch testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/133076 [14:56:40] (03CR) 10Ottomata: [C: 032 V: 032] Adding manybubbles on analytics1004 for some elasticsearch testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/133076 (owner: 10Ottomata) [14:58:33] (03PS1) 10Ottomata: Including standard and admins::roots on analytics1004 [operations/puppet] - 10https://gerrit.wikimedia.org/r/133077 [14:58:39] (03PS2) 10Ottomata: Including standard and admins::roots on analytics1004 [operations/puppet] - 10https://gerrit.wikimedia.org/r/133077 [14:58:46] (03CR) 10Ottomata: [C: 032 V: 032] Including standard and admins::roots on analytics1004 [operations/puppet] - 10https://gerrit.wikimedia.org/r/133077 (owner: 10Ottomata) [14:59:48] greg-g : around? [15:00:13] ok manybubbles, try now [15:00:25] ottomata: so in [15:00:28] awesoome [15:00:28] thanks! [15:00:30] yup! [15:04:29] laters! [15:05:58] ottomata: i have a question about contributing [15:06:20] i've been helping out with puppet at mozilla, i'm still a newb, but i'd like to help out here [15:06:45] i'm reviewing the repo, how do i start getting assigned tasks? just pick up a bugzilla ticket? [15:11:45] why'd it go silent? [15:12:26] (03CR) 10JanZerebecki: "Easy? 
No, but all the road blocks are IMHO already out of the way for at least enabling it." (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: 10JanZerebecki) [15:12:35] just fix stuff that seems broken [15:13:02] (03PS2) 10Rush: changeup diamond collection to 60s [operations/puppet] - 10https://gerrit.wikimedia.org/r/133035 [15:13:29] (03CR) 10Rush: [C: 032 V: 032] "not waiting for jenkins again on rebase" [operations/puppet] - 10https://gerrit.wikimedia.org/r/133035 (owner: 10Rush) [15:13:45] you can start by helping the puppet 3 migration [15:15:27] <_joe_> matanya2: there are two of you? [15:15:47] i mutiple [15:15:52] <_joe_> matanya2: puppet 3 migration is not fancy and does require quite some puppet knowledge [15:16:06] <_joe_> so, it strongly depends on the pancakes9's skillset [15:16:26] and mobile keyboard sucks [15:18:31] (03PS2) 10Rush: rollout diamond in standard for precise only [operations/puppet] - 10https://gerrit.wikimedia.org/r/133036 [15:18:34] _joe_: this is a point raised by some volunteer [15:18:41] matanya2: I am now, what's up? [15:19:01] where can i find tasks [15:19:07] good question [15:19:16] it's 8:18am or 17:18 depending on where my body thinks it is [15:19:18] im not sure there is a single clearing house of projects. [15:19:22] <_joe_> matanya2: my point is, it's better to let him do something that fits his abilities/interests [15:19:24] (other than RT which isnt public) [15:19:26] =[ [15:19:27] greg-g: my i pm [15:19:55] matanya2: tasks re the "is this code on production" thing? [15:21:10] no, reply to pancakes9 [15:22:06] greg-g: pm a diff issue. not suitable for the channel [15:22:28] pancakes9: yeah, finding things you want to work on and doing it, asking in or other channels when you need help is good. Curious, what is your skill set/what do you want to work on? [15:22:55] matanya2: k, I don't see it (the pm)? 
[15:24:00] greg-g: puppet, monitoring, anything infrastructure related [15:24:06] no php please [15:25:04] :) [15:25:36] I need our puppet configuration files for our facilities (power distribution units and the like) converted from a flat file of every single item to a puppet template [15:25:55] pancakes9: you said the words infrastructure, puppet, and monitoring [15:26:01] that fits all three =] [15:26:18] it's one of those 'someone needs to get to this and then test that it generates properly in labs' [15:26:21] and no one has [15:26:40] RobH: and that's all in the puppet repo right? (https://git.wikimedia.org/git/operations/puppet.git) [15:27:04] indeed it is, you'll also want to get yourself set up with labs access [15:27:17] so you can create a labs instance to load the puppet configuration onto [15:27:32] lemme find the specific file as a starting point [15:28:18] (PS1) Giuseppe Lavagetto: Fix another call to the swift password class. [operations/puppet] - https://gerrit.wikimedia.org/r/133082 [15:30:02] (CR) Giuseppe Lavagetto: [C: 2] Fix another call to the swift password class. [operations/puppet] - https://gerrit.wikimedia.org/r/133082 (owner: Giuseppe Lavagetto) [15:31:36] facilities.pp has some [15:31:51] but i recall another file that is just simply huge, perhaps it's in icinga/nagios sections of repo [15:33:05] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Tue May 13 15:32:55 UTC 2014 [15:33:26] pancakes9: i'll keep looking for the file, it's huge with every single pdu listed and then a ton of options after each one [15:33:38] also the fuses in our PDUs are not configured to be monitored in icinga [15:33:54] (though it does the phase balance and circuit load) [15:34:01] unfortunately, this stuff is currently in a closed ticketing system.
it's something we're working to resolve [15:34:27] so the initial hurdle into being able to pull and generate your own work is pretty high for ops, as matanya2 can attest [15:34:45] (not excusing it, it sucks it's this difficult, don't give up!) [15:35:00] RobH: oh, i've never worked with templates [15:35:18] Ahh, want to? (We have them in use in other parts of the configuration.) [15:35:25] the fuses check may be easier to figure out for icinga [15:35:29] RobH: let me explore [15:36:04] cool. if you have any questions just ask away [15:36:14] pancakes9: if you want rt access you will need to sign nda and run after people to get things moving [15:36:41] but the first merge feelibg [15:36:51] feeling id worth it [15:37:03] mobile ... :/ [15:37:39] what is rt access? [15:37:51] the ticketing system [15:37:54] i think i already have a labs instance [15:38:03] oh, rt is different from bugzilla... [15:38:54] yes , ops don't use bz much [15:39:15] matanya2: how do i get started on getting access and the first few people to ping? [15:40:30] https://wikitech.wikimedia.org/wiki/Help:Getting_Started [15:40:36] That's getting started for labs access and such [15:40:58] which ops contributions are a subset of labs access needs [15:41:05] pancakes9: no official docs. so don't know. try poking mark [15:41:53] mark: any info on getting rt access? [15:42:01] I can field that one [15:42:09] you have to have a signed NDA with the foundation [15:42:23] usually, it's easier to start contributing and using BZ a bit before requesting that [15:42:31] We have a last minute SWAT deploy! Going to do it in just a minute [15:42:36] as we have to know you and trust you to let you into RT core ops, since it can have security concerns [15:42:39] James_F: ping, just for the record [15:42:48] anomie: Pong.
[15:42:59] RobH: got it [15:43:05] we're working on possibly integrating into a different system so it's more finely controlled [15:43:10] and we can make things public by default again [15:43:25] cuz this current setup for workflow for new volunteers doesn't work well at all ;] [15:43:26] anomie: And now added to the SWAT list for now. [15:43:31] James_F: Awesome [15:46:34] (CR) Hoo man: [C: 1] "Looks good, but I like to let Chris have another look before I deploy this/ schedule it in a SWAT window." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130274 (https://bugzilla.wikimedia.org/64255) (owner: Gerrit Patch Uploader) [15:51:10] !log anomie synchronized php-1.24wmf4/resources/lib/jquery/jquery.migrate.js 'SWAT: Deploy jQuery Migrate to 1.24wmf4' [15:51:18] Logged the message, Master [15:52:04] (CR) Rush: [C: 2] "Notes:" [operations/puppet] - https://gerrit.wikimedia.org/r/133036 (owner: Rush) [15:52:27] * anomie looks forward to the day when scap is fast enough that sync-file isn't necessary [15:52:30] !log anomie synchronized php-1.24wmf4/resources/Resources.php 'SWAT: Deploy jQuery Migrate to 1.24wmf4' [15:52:34] James_F: ^ Ok, test please [15:52:36] Logged the message, Master [15:56:18] RobH: is it usually hectic onboarding new contributors? it seems like everyone here is moving really fast and there's not a lot of good onboarding processes or documentation [15:57:13] Operations hasn't historically had a large quantity of volunteer contributors; the past has been a very small core of volunteer contributors. [15:57:24] labs is changing that and allowing us to have more folks come in [15:57:37] but it's still new for us =P [15:57:44] (in my personal viewpoint) [15:57:44] anomie: Yup, working fine. [15:57:57] Krinkle|detached: ^^^ FYI, jQuery.Migrate now deployed to wmf4. [15:58:11] * anomie is done with SWAT [15:59:04] anomie: Thank you!
[15:59:08] RobH: i can compile something from my exp, any other volunteerd i can use to help me with that? [15:59:37] hrmm, im not sure anyone has been both contributing in gerrit and in RT as much as you [15:59:39] James_F: No problem. [15:59:47] thehelpfulone has RT access [16:03:05] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [16:03:37] never better time than the present :) [16:03:38] http://docs.puppetlabs.com/guides/templating.html [16:05:15] anomie: there are a lot of exceptions in prod [16:05:24] http://ur1.ca/edq1f [16:06:11] ori: I happened to be looking at that. Someone was trying to upload two revisions of a file in the same second for bug 65251, and apparently ran into exceptions. I'll file a bug about the exceptions with details momentarily. [16:07:05] cool [16:15:49] ori: https://bugzilla.wikimedia.org/show_bug.cgi?id=65263 [16:17:04] anomie: good sleuthing! [16:18:03] ori: I'm just glad I caught that the backtrace contained the database password to redact it. [16:18:59] we should make the password be "[REDACTED]" [16:19:08] (03PS2) 10Nuria: Upstart follows fork when starting celery. [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/130588 (https://bugzilla.wikimedia.org/63819) [16:19:42] ori: I get a plugin to redact them, see: ************ [16:20:21] (03CR) 10Ori.livneh: Upstart follows fork when starting celery. (031 comment) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/130588 (https://bugzilla.wikimedia.org/63819) (owner: 10Nuria) [16:20:38] hashar: ;) [16:20:56] ori: Ha! Also, BTW, I see a run of "Too many connections" errors in dberrors.log that might have something to do with the exception spike. [16:21:02] (03CR) 10Nuria: ">Why is queue the only wikimetrics daemon that needs this?" 
[operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/130588 (https://bugzilla.wikimedia.org/63819) (owner: 10Nuria) [16:24:23] (03PS2) 10Alexandros Kosiaris: blog: moving firewall to node level [operations/puppet] - 10https://gerrit.wikimedia.org/r/133018 (owner: 10Matanya) [16:26:07] (03CR) 10Alexandros Kosiaris: [C: 032] blog: moving firewall to node level [operations/puppet] - 10https://gerrit.wikimedia.org/r/133018 (owner: 10Matanya) [16:26:08] ori: Looks like someone is spamming Special:Export, which is probably contributing to the database load leading to the "too many connections" [16:34:03] (03PS1) 10QChris: Redirect https traffic from old metrics sites to wikimetrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/133089 [16:36:05] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds [16:57:06] (03PS2) 10Ori.livneh: Rewrite search latency metric to track new search [operations/puppet] - 10https://gerrit.wikimedia.org/r/133030 (owner: 10Chad) [16:58:42] (03CR) 10Ori.livneh: [C: 032] Rewrite search latency metric to track new search [operations/puppet] - 10https://gerrit.wikimedia.org/r/133030 (owner: 10Chad) [17:00:44] (03PS3) 10Ragesoss: Update the redirect target for education.wikimedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/122866 [17:14:15] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 5 below the confidence bounds [17:26:05] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 16 data above and 5 below the confidence bounds [17:27:31] Can we have a link to https://wikitech.wikimedia.org/wiki/Deployments in the topic pls? 
Highly useful link imho [17:45:42] (03PS1) 10Chad: search latency dashboard: also update description [operations/puppet] - 10https://gerrit.wikimedia.org/r/133094 [17:46:15] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 21 data above and 4 below the confidence bounds [17:51:15] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 22 data above and 3 below the confidence bounds [17:51:41] akosiaris: is amanda dead yet ? [17:52:28] greg-g, dr0ptp4kt, i would like to depl this fix today (adam, pls +2) - https://gerrit.wikimedia.org/r/#/c/133043/ [17:52:28] as part of SWAT if ok [17:52:43] btw, greg-g, could you add me to the morning SWAT pls [17:53:15] ah, as a swatter? [17:53:41] (03PS2) 10Dzahn: blog: remove nrpe [operations/puppet] - 10https://gerrit.wikimedia.org/r/133019 (owner: 10Matanya) [17:56:11] yurikR, i will go review [17:56:51] greg-g, yes, as a squatter [17:59:42] yurikR, i +2'd [17:59:54] (03CR) 10Dzahn: [C: 032] "yes, holmium is the node and it has standard, standard has base, base has nrpe" [operations/puppet] - 10https://gerrit.wikimedia.org/r/133019 (owner: 10Matanya) [18:00:09] dr0ptp4kt, thx [18:03:00] (03CR) 10Ori.livneh: [C: 032] search latency dashboard: also update description [operations/puppet] - 10https://gerrit.wikimedia.org/r/133094 (owner: 10Chad) [18:03:31] fabriceflorin: hello [18:03:52] Hello matanya, how are you? [18:04:16] i'm fine thanks, hope you are ok too [18:04:30] have a question for you [18:04:32] when is https://bugzilla.wikimedia.org/show_bug.cgi?id=65225 going to be fixed? do you have any eta ? [18:05:31] Hello Reedy, hope you had a great time in Zurich :) I just wanted to check if you are handling today’s deployment. 
If so, wanted to remind you that we’d like to deploy Media Viewer as part of today’s MediaWiki train, on Telugu and Kannada Wikipedia, using this Gerrit change 132967: https://gerrit.wikimedia.org/r/132967 [18:05:39] fabriceflorin: i'm asking since it turnout that IE9 is more common than i wanted to hear [18:07:58] Hi matanya : Yes, we are working on that IE9 issue today, gi11es is the developer. Our hope is that we can address it very soon. We are tracking it on Mingle as #597: https://wikimedia.mingle.thoughtworks.com/projects/multimedia/cards/597 [18:08:20] great, thanks for the update [18:11:52] yurikR: btw, it seems you want chris s's review of that change, too? [18:11:57] yurikR: sorry, was on a call [18:12:53] greg-g, i do want chris' review on the overall ignore "forcehttps" cookie [18:13:21] Hey greg-g : who’s deploying today’s MediaWiki train? We want to make sure they include Media Viewer on Telugu and Kannada Wikipedia, using this Gerrit change 132967: https://gerrit.wikimedia.org/r/132967 [18:13:21] btw, is he around? [18:13:41] greg-g, https://gerrit.wikimedia.org/r/#/c/132952/ [18:14:15] fabriceflorin: reedy [18:14:26] yurikR: chris might be jetlagged [18:14:47] greg-g: Cool, thanks. Already pinged him. Hope you had a wonderful time in Zürich :) [18:14:51] yurikR: anyone else able to do it? [18:15:00] fabriceflorin: yeah, 'twas mostly fun and mostly useful :) [18:15:20] fabriceflorin: just expensive! [18:15:22] greg-g, good q... tim, brion, ...? [18:16:08] yurikR: how quickly do you need? [18:16:16] greg-g: Wonderful. It’s the best time of year to go to Switzerland, which was my homeland as a teenager. 
Sadly, that comes at a cost … [18:17:47] (PS1) Reedy: Non wikipedias to 1.24wmf4 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/133097 [18:18:15] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 31 data above and 1 below the confidence bounds [18:18:42] Hey Reedy|web : Hope you had a great time in Zurich. I just wanted to remind you that we’d like to deploy Media Viewer on Telugu and Kannada Wikipedia as part of today’s MediaWiki train, using this Gerrit change 132967: https://gerrit.wikimedia.org/r/132967 [18:19:11] greg-g, it's broken now, and a new partner cannot launch because they simply can't test it (plus we had tons of complaints about it since forever). In reality, the new hook patch i mentioned above is what we really need [18:19:11] Reedy|web: wait, you're not still on a boat are you? [18:19:12] (CR) Reedy: [C: 2] Non wikipedias to 1.24wmf4 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/133097 (owner: Reedy) [18:19:20] (Merged) jenkins-bot: Non wikipedias to 1.24wmf4 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/133097 (owner: Reedy) [18:19:22] greg-g: fresh off the boat!
[18:19:23] * Reedy|web grins [18:19:26] lol [18:19:40] I'm literally at the first service station north west ish [18:20:13] yurikR: k, so, you might be able to get brion to do it, he's on your supra-team [18:20:31] greg-g, i think he is jetlagged too ) [18:20:37] yeah :/ [18:20:48] wish I could, I did fairly well jetlag wise [18:20:50] (03PS2) 10Reedy: FUTURE: Sixth batch of pilot sites for MediaViewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132967 (owner: 10Gergő Tisza) [18:20:56] (03CR) 10Reedy: [C: 032] FUTURE: Sixth batch of pilot sites for MediaViewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132967 (owner: 10Gergő Tisza) [18:21:04] (03Merged) 10jenkins-bot: FUTURE: Sixth batch of pilot sites for MediaViewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132967 (owner: 10Gergő Tisza) [18:21:09] moving europe->us is usually easy - you just get up earlier [18:22:43] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.24wmf4 [18:22:50] Logged the message, Master [18:25:51] !log reedy synchronized wmf-config/InitialiseSettings.php 'touch for I1681addaed690b652822c0296b7a3e9b84de93b6' [18:25:54] Logged the message, Master [18:26:02] fabriceflorin: ^^ [18:27:15] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 32 data above and 3 below the confidence bounds [18:27:32] Reedy|web: Thanks so much :) Please let us know when it’s live … I hope that the station you’re working in is comfortable … [18:28:04] fabriceflorin: It should be live [18:29:38] Hmmm [18:29:38] There's a tonne of [18:29:39] May 13 18:28:43 10.64.48.34 apache2[29125]: [error] [client 10.64.32.104] request failed: URI too long (longer than 8190) May 13 18:28:43 10.64.48.35 apache2[19120]: [error] [client 10.64.0.102] request failed: URI too long (longer than 8190) [18:30:07] Reedy|web: Cool! It’s working well for me on both sites. Thank you kindly! 
Will alert the CLs now. [18:30:19] greg-g, ok, so can i depl https://gerrit.wikimedia.org/r/#/c/133043/ ? [18:30:20] Great, thanks [18:31:03] unless i can cherrypick https://gerrit.wikimedia.org/r/#/c/132952/ [18:31:36] (i haven't done that on prod servers, but shouldn't be that different than adding zero submodule to wmf branches [18:31:55] I suppose so re https://gerrit.wikimedia.org/r/#/c/133043/, but put it on your backlog to get the security review? [18:31:56] cherry pick into deployment branch of Zero [18:31:58] please [18:32:03] merge and submit [18:32:15] update extension in core deployment branch [18:33:09] greg-g, sure, it will go as part of the review for the other one (almost identical issue). Deploying now. [18:33:20] kk [18:33:57] alright, lunch time [18:33:59] All looks good to me [18:34:15] think I'll use the bathroom and see about getting home before midnight [18:36:03] I've got my mobile. And should have signal most of the way home... [18:41:22] (03PS1) 10Matanya: openstack: remove firewall access for amanda, replaced by bacula [operations/puppet] - 10https://gerrit.wikimedia.org/r/133104 [18:41:58] (03CR) 10Matanya: [C: 04-1] "don't merge yet." [operations/puppet] - 10https://gerrit.wikimedia.org/r/133104 (owner: 10Matanya) [18:47:48] (03CR) 10QChris: [C: 04-1 V: 031] "Works for me." (033 comments) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/130588 (https://bugzilla.wikimedia.org/63819) (owner: 10Nuria) [18:48:05] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 32 data above and 2 below the confidence bounds [18:49:04] !log replacing failed disk dataset1001 [18:49:10] Logged the message, Master [18:56:26] (03PS2) 10QChris: Redirect https traffic from old metrics sites to wikimetrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/133089 (https://bugzilla.wikimedia.org/64276) [18:56:48] greg-g: help :-( [18:56:55] which? 
[18:57:05] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 33 data above and 2 below the confidence bounds [18:57:40] greg-g: https://bugzilla.wikimedia.org/show_bug.cgi?id=64127#c10 [18:57:57] twkozlowski: Good news for you, https://bugzilla.wikimedia.org/show_bug.cgi?id=64127#c8 is the Language team clearing your config change (I didn't know he was part of that team until just now). [18:58:37] greg-g: That patch has nothing to do with the User Experience team or the Draft namespace [18:58:42] Why are people blocking it :-( [18:58:53] you'd have to ask Jared that. [18:59:12] twkozlowski: As for comment 10, I think he misunderstood; I replied at comment 11, but feel free to follow up if I got something wrong there. [19:01:12] greg-g: I talked with anomie and manybubbles on Friday I think. [19:01:25] greg-g: There is something very, very broken with the way this patch is being handled. [19:01:25] thursday? [19:01:33] twkozlowski: agreed. [19:01:38] Whether this is apparent with other patches, I don't know. [19:01:49] Maybe that's a bigger issue than this single patch [19:02:08] greg-g, almost done deploying that patch, sorry, had broken local repo [19:02:22] twkozlowski: its not the patch per se [19:02:29] !log yurik synchronized php-1.24wmf4/extensions/ZeroRatedMobileAccess/ [19:02:30] bleh, fatalmonitor is full of crap [19:02:34] Logged the message, Master [19:03:16] [13:57:44] manybubbles: Are you referring to the thing yesterday with that zhwikisource namespace config change? The language team *should* have actually -1ed or -2ed the patch and stated their issue. greg-g is handling why that didn't get done. [19:03:26] [13:58:50] I suggest we please, please write down a rule that no secret discussion can block a patch from being deployed [19:03:37] (among other things - that was Thursday May 8) [19:04:07] agreed. [19:04:44] anyway, thanks anomie for pushing this forward. 
Glad there's an okay from the LangEng team [19:07:12] !log yurik synchronized php-1.24wmf3/extensions/ZeroRatedMobileAccess/ [19:07:19] Logged the message, Master [19:07:35] greg-g, is anyone working on the fatalmonitor mess? that export bug is really crowding it :( [19:07:48] bug #? [19:08:00] and to more explicitly answer your question: no/not that I know of. [19:09:53] greg-g, don't know if there is a bug - i'm looking at the fatalmonitor - tons of: Warning: Cannot modify header information - headers already sent by (output started at /usr/local/apache/common-local/php-1.24w [19:09:53] mf3/includes/Export.php:944) in /usr/local/apache/common-local/php-1.24wmf3/includes/exception/MWException.php on line 261 [19:09:53] <^d> yurikR, greg-g: I started poking it last night. [19:10:19] yei ^d \o/ [19:10:25] <^d> It's not hard. [19:11:00] greg-g: I notice I appear to be moaning all the time on this channel, so on a more positive note [19:11:15] I was reading some MediaWiki history page about how stuff used to be deployed in the past [19:11:32] Our current one week schedule is damn amazing compared to the past! [19:11:39] * twkozlowski hat off [19:16:32] ^d: hi, are you here? i could do some testing now [19:16:50] <^d> yurikR, greg-g: https://gerrit.wikimedia.org/r/#/c/133115/, needs review + swatting. [19:17:01] twkozlowski: :) thanks and I don't mind the moaning [19:17:30] ^d: doit [19:17:59] <^d> It's only a warning, it doesn't need us to drop everything. [19:18:11] <^d> Let's get it reviewed and one of the swatters can get it this afternoon for us :) [19:18:34] <^d> Flexman: Yo, I'm here, just wrapped up something else. Give me a second to get logged in [19:19:27] ^d, you might want to make it a static func [19:19:34] there is no state there [19:19:50] ^d: ok just tell me when i should start [19:21:00] <^d> yurikR: more typing :p [19:21:14] ^d: yeah, just saying, no need to wait on anything [19:21:37] <^d> Flexman: Ok, I'm tailing. Give it a shot. 
[19:22:09] <^d> gotcha! [19:22:19] <^d> that'd explain why I couldn't find you. you're ipv6, not ipv4 :) [19:22:20] ok. i just opened https://commons.wikimedia.org/wiki/File:Drassmarkt_Mariahilfkapelle.jpg 4 minutes ago (not the accurate link) [19:23:03] I am ip v6?? how can that happen? i didn't even know ipv6 is being used already... [19:23:15] <^d> Wait, what? [19:23:18] <^d> This is wrong. [19:23:22] <^d> It's not you. [19:23:35] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Last successful Puppet run was Tue May 13 16:23:00 2014 [19:23:46] ah ok... :-D great. i hope i have some other 50 years until i'll need IPv6 [19:23:56] <^d> paravoid: I think we've got a problem with the frontends hitting the limiter similar to what we hit with the pool counter. [19:27:43] <^d> AaronSchulz: Thar you be. We've got a thumb.php issue. [19:28:18] <^d> I think one of our own frontends is hitting the ping limiter. [19:28:39] ^d: https://www.mediawiki.org/wiki/Git/New_repositories/Requests all the way at the bottom :) [19:29:14] does that make sense? used wiki page to not mess up the "parents" etc [19:29:22] ^d: did you find me? [19:29:29] ^d: btw, maybe someone can merge https://gerrit.wikimedia.org/r/#/c/131770/ ? [19:30:05] <^d> Done. [19:30:20] AaronSchulz: are you back in SF? [19:30:27] <^d> Flexman: I found evidence of you :) [19:30:46] ori: yes [19:30:53] ^d: great. and it sounds like you got an idea what caused the problem? [19:31:02] <^d> I've got a hunch [19:31:05] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 30 data above and 2 below the confidence bounds [19:32:42] ^d, yes, one extra word & self::, but faster (in theory) execution [19:32:57] ))) [19:34:42] <^d> I think we've got bigger fish to fry than performance of between 1 and 4 function calls in the exception handler :) [19:36:59] <^d> AaronSchulz: Mind hopping on fluorine and confirming my suspicion?
[19:37:19] <^d> `grep 2620 /a/mw-log/limiter.log` and then traceroute6 on the ip address you get back. [19:40:01] * AaronSchulz is still looking around [19:43:26] ^d: that's cp3004 [19:43:56] I don't see an ipv6 address for cp3004 or any cp in that cluster in squid.php [19:44:00] bblack: ^ [19:44:31] <^d> Ok that explains things. [19:44:38] we had prefer_ipv6=on set in varnish for that cluster [19:44:41] as a network workaround [19:44:55] this has been broken for a while :) [19:45:01] like, many months [19:45:28] yeah [19:45:38] what broke it months ago? [19:45:45] prefer_ipv6 [19:46:00] not that any of this would matter if we had our CIDR squid.php set :P [19:46:05] yup [19:46:24] it's good that someone has been fixing all that :P [19:46:54] ^d: did i miss something since i left 10 minutes ago? [19:47:02] <^d> It's definitely us, not you. [19:47:37] ^d: you mean the limit problem is on your side? [19:47:45] (03CR) 10Dzahn: "wouldn't this lead to a check on _each_ of the applicationservers? i think we just want a single one" [operations/puppet] - 10https://gerrit.wikimedia.org/r/133051 (owner: 10Matanya) [19:47:49] <^d> Flexman: Yes. [19:48:10] ah ok fine [19:48:22] any chance to solve this? [19:49:03] <^d> paravoid: So what's the fix here? Removing prefer_ipv6? Finishing the CIDR stuff? [19:49:13] paravoid: https://gerrit.wikimedia.org/r/#/c/94168/ [19:49:29] so, because of the ip6_mapped-vs-SLAAC thing we killed prefer_ipv6 back in nov [19:49:52] it's all so interrelated :) [19:49:55] it's still on for upload [19:50:05] yeah just not text [19:50:13] the reason it's on for upload is [19:50:20] we had the crosslink network issues [19:50:23] when it was capped at 1gbps [19:50:48] hey jgage. around? I see you're on RT duty, can you knock out https://rt.wikimedia.org/Ticket/Display.html?id=7481? Should be trivial. 
[19:50:58] so mark's genius idea to offload it was by turning on prefer-ipv6 for upload and turn off ospf3 across the atlantic [19:50:58] jgage: would be nice to get done soon so he can have +2 rights [19:51:46] (genius wasn't an irony) [19:51:54] ^d: we have two ways forward really: (1) we can push the IPSet thing and then follow that up with the wmf-config/squid.php change to CIDR nets or (2) we can go add a boatload of pairs of IPv6 addresses to squid.php now to cover all cases regardless of interface choice [19:52:18] (3) turn off prefer_ipv6 [19:52:40] I think we should do (3) + eventually (1), but (2) + eventually (1) would also work [19:52:43] I thought there was a reason we wanted it? [19:52:54] not anymore [19:52:58] ah ok [19:53:01] YuviPanda: lemme take that [19:54:03] mobile has prefer_ipv6 as well [19:55:38] mutante: coool! :) [19:56:05] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 28 data above and 2 below the confidence bounds [19:56:15] YuviPanda: done [19:56:24] (03PS1) 10BBlack: disable prefer_ipv6 on esams mobile/upload caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/133124 [19:56:32] paravoid: ^ ? [19:56:46] graphite is broken btw [19:57:21] (03CR) 10Faidon Liambotis: [C: 031] disable prefer_ipv6 on esams mobile/upload caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/133124 (owner: 10BBlack) [19:57:53] chasemp: graphite is sick [19:57:59] losing data points [19:58:11] I saw you enabled diamond earlier today? [19:58:13] (03PS2) 10BBlack: disable prefer_ipv6 on esams mobile/upload caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/133124 [19:58:21] maybe related? [19:58:23] <^d> Ah that might explain why my graphs weren't showing up. 
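A minimal sketch of the CIDR idea from option (1) above, assuming a trusted-proxy list that holds both IPv4 and IPv6 networks. The ranges are documentation prefixes and the function names are hypothetical; this is not the contents of squid.php or the IPSet change. It only illustrates why a cache that connects over IPv6 gets treated as an ordinary client, and so gets rate-limited itself, when only its IPv4 address is listed.

```python
import ipaddress
from typing import Optional

# Hypothetical trusted cache ranges (documentation prefixes, not real ones).
TRUSTED_PROXY_NETS = [
    ipaddress.ip_network("192.0.2.0/24"),       # IPv4 range of the caches
    ipaddress.ip_network("2001:db8:100::/48"),  # IPv6 range of the same caches
]


def is_trusted_proxy(addr: str) -> bool:
    """Return True if addr falls inside any trusted network.

    An address of one IP version is never "in" a network of the other
    version, so IPv4-only lists silently miss IPv6 connections.
    """
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in TRUSTED_PROXY_NETS)


def client_ip_for_limiter(peer: str, forwarded_for: Optional[str]) -> str:
    """Pick the address that a rate limiter should count against.

    Only believe the forwarded client address when the direct peer is a
    trusted proxy; otherwise count the peer itself.
    """
    if forwarded_for and is_trusted_proxy(peer):
        return forwarded_for
    return peer
```

With the IPv6 network removed from the list, `client_ip_for_limiter("2001:db8:100::5", "203.0.113.7")` would return the cache's own address, which matches the symptom described in the channel.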
[19:58:26] (03CR) 10BBlack: [C: 032 V: 032] disable prefer_ipv6 on esams mobile/upload caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/133124 (owner: 10BBlack) [20:04:20] getting that actually enabled will require varnish process restarts, which has to go slow [20:04:46] it's only like 12 machines though [20:04:59] oh it requires a restart? [20:05:01] damn [20:05:11] (03CR) 10Jeremyb: "> I still have a nagging feeling that trying to force pipermail to use HTTPS might result in a redirect loop" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/106109 (https://bugzilla.wikimedia.org/31369) (owner: 10Jeremyb) [20:05:32] well, I haven't looked to see if it can be set at runtime, actually [20:05:34] (separately from the puppet startup config) [20:06:09] oh it is runtime! [20:06:15] :) [20:08:44] fixed! [20:08:59] salt is pretty awesome sometimes [20:09:03] root@palladium:~# salt -C 'G@site:esams and ( G@cluster:cache_mobile or G@cluster:cache_upload )' cmd.run 'varnishadm -S /etc/varnish/secret -T 127.0.0.1:6083 param.set prefer_ipv6 off' [20:12:05] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 28 data above and 0 below the confidence bounds [20:12:32] ^ ? [20:12:39] graphite is broken [20:12:45] see above, my ping to chasemp [20:13:02] I think it's https://gerrit.wikimedia.org/r/#/c/133036/ [20:13:05] <_joe_> paravoid: uh, let me take a look at it [20:13:21] (03CR) 10Dzahn: [C: 04-1] "wouldn't this lead to a check on _each_ of the applicationservers? i think we just want a single one" [operations/puppet] - 10https://gerrit.wikimedia.org/r/133051 (owner: 10Matanya) [20:14:04] paravoid: tungsten had diamond before that [20:14:15] <_joe_> and yeah, this can be the reason [20:14:25] <^d> Flexman: It should be fixed now. How about giving it a try on your end again? 
[20:14:39] ^d: ok i try [20:14:45] bblack: the rate of incoming metrics changed considerably with this though [20:14:52] true [20:14:57] it's not broken as in down, it just loses data points [20:15:27] <_joe_> paravoid: the carbon cache workers are at full steam, either we add more or we start sharding graphite [20:15:47] <_joe_> which, given the amount of data we're pouring there, makes sense [20:15:50] if only tungsten could be used on carbon and graphite could make diamonds [20:16:11] haha [20:16:14] <_joe_> lol [20:17:14] <_joe_> chasemp should know how to add carbon-cache workers, we have room for ~ 2 more I'd say [20:20:06] (03CR) 10Matanya: "I only wonder now why didn't the check raised an alert at outage time." [operations/puppet] - 10https://gerrit.wikimedia.org/r/133051 (owner: 10Matanya) [20:24:11] (03CR) 10Dzahn: "that's a good question, note how it uses the "skin1.5" in the past.. hmm" [operations/puppet] - 10https://gerrit.wikimedia.org/r/133051 (owner: 10Matanya) [20:24:44] matanya: maybe /skins-1.5/common/ always worked [20:24:50] but /skins/common did not [20:28:05] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 24 data above and 0 below the confidence bounds [20:38:14] (03CR) 10Dzahn: "i think https://bits.wikimedia.org/skins-1.5/common/images/poweredby_mediawiki_88x31.png never broke" [operations/puppet] - 10https://gerrit.wikimedia.org/r/133051 (owner: 10Matanya) [20:38:17] _joe_: we are currently adding a lot, a lot of whisper files [20:38:25] if we give it a little time it will level out again I think [20:38:29] ^d: thanks, it works now! what was the problem? 
[20:38:47] <_joe_> chasemp: oh ok, I trust your feelings :) [20:39:09] ^d: took me a while to test because the internet provider seems to have cached the images that weren't working [20:39:17] really I was just waiting for it to stabilize, all new clients checking in, all new whispers created [20:39:18] <^d> Flexman: Our caching proxies in Europe were preferring IPv6 addresses over IPv4 addresses and our whitelist of addresses didn't include the IPv6 ranges. [20:39:25] but you are right, lots and lots going on [20:39:34] <_joe_> chasemp: if tomorrow morning (my time, in ~ 10 hours) the situation hasn't recovered I can add a few workers [20:39:37] (03CR) 10MarkTraceur: [C: 031] "Core dependency is merged now; merge and ship whenever the next branch goes out." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132112 (owner: 10Gergő Tisza) [20:39:42] sounds good [20:39:49] ^d: did i really have IPv6?? [20:39:53] btw 166,845 whisper files added [20:39:55] _today_ [20:40:06] <_joe_> btw, now that we have server stats in graphite I will try to do something fancy [20:40:10] <^d> Flexman: No, our caching proxies did, so we were hitting our own rate limiter :p [20:40:11] <_joe_> chasemp: wow. [20:40:11] ah you mean i came via the european proxy [20:40:15] loool [20:40:16] <^d> Yep, exactly. [20:40:17] i see [20:40:40] loool*2 [20:40:51] <^d> Flexman: Thanks for reporting this though, it'd actually been broken for a bit and nobody had noticed or complained yet. [20:41:02] based on projections we have done probably 70% I think? so far [20:41:22] Thanks for solving this so quickly. Great I could help. 
:-) [20:41:25] some hosts are missing and I don't know why yet, and things of that nature [20:41:35] but so far 606 precise machines == lots of metrics [20:42:58] <_joe_> chasemp: and I can start creating some fancy alarms ;) [20:43:30] <_joe_> great, our athens project coming to life :) [20:45:45] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:46:24] maybe it won't self heal :) [21:02:30] greg-g: \o/ [21:03:03] greg-g: can we schedule https://gerrit.wikimedia.org/r/#/c/127584/ for deployment after https://bugzilla.wikimedia.org/show_bug.cgi?id=64127#c13 ? [21:06:05] PROBLEM - check configured eth on maerlant is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:06:15] PROBLEM - RAID on maerlant is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:06:15] PROBLEM - check if dhclient is running on maerlant is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:06:15] PROBLEM - Disk space on maerlant is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:06:25] PROBLEM - puppet disabled on maerlant is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:06:25] PROBLEM - DPKG on maerlant is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:15:35] chasemp, _joe_: what's going on with tungsten? i missed some of the discussion above [21:15:54] why did it reboot? [21:16:13] it appears to be hosed up atm, long and short of it is it is getting a lot of metrics it seems not to be handling [21:16:34] twkozlowski: please do. [21:16:44] chasemp: um, do you have any more details? [21:16:58] why did you reboot, and why is it getting lots of new metrics? 
[21:17:25] I do, but I'm looking at it as we speak, rebooted as it seeemed like runaway carbon, metrics as diamond has gone out to a bunch more hosts [21:18:49] (03PS1) 10Yuvipanda: dynamicproxy: Remove some unused code [operations/puppet] - 10https://gerrit.wikimedia.org/r/133171 [21:18:51] (03PS1) 10Yuvipanda: dynamicproxy: Use redis connection pooling [operations/puppet] - 10https://gerrit.wikimedia.org/r/133172 (https://bugzilla.wikimedia.org/65179) [21:19:08] Coren: ^ should take care of the 503s. [21:19:11] scfc_de: ^ [21:19:23] scfc_de: my laptop is gonna die, so I should go find an adapter soon [21:20:05] runaway how? [21:20:51] and why would a runaway process require a reboot? [21:23:32] happy to discuss it once it's back up but it seems to have returned in a not good state [21:23:32] uwsgi: unable to connect to uWSGI server: [21:24:18] chasemp: this is an internal service; it's not affecting users. it's good to take an extra moment or three to !log what you're doing and discuss if you're unsure [21:24:48] YuviPanda: You tested those live, so they just need to be merged by ops? [21:25:10] sure I'm not trying to be flippant with, so much as ask for your help [21:25:47] greg-g: Can I schedule it for later today, or do I absolutely have to be on-line? [21:25:57] ok, looking [21:26:05] 23:00 UTC is 01:00 my time, so too late for me [21:27:20] chasemp: /var/log/upstart/uwsgi_app-gdash.log has: bind(): No such file or directory [socket.c line 107] [21:27:41] if bind() is complaining about no such file, it must be attempting to bind to a unix domain socket [21:27:56] (03CR) 10Tim Landscheidt: [C: 031] "Provided that ngx.var.uri is indeed always "/" + optionally something, I also think that captures should never be null." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/133171 (owner: 10Yuvipanda) [21:28:16] if you look at /etc/uwsgi/apps-enabled/gdash.ini you'll see it's trying to bind to /var/run/gdash/gdash.sock [21:28:19] but /var/run/gdash doesn't exist [21:28:25] probably because it was wiped by a reboot [21:28:37] so that's a bug [21:29:26] !log I caused elasticsearch1015 to drop out of the Elasticsearch cluster by tring to take a heap dump on it. don't do that. It stops the application for many seconds. [21:29:33] modules/uwsgi/files/init/init.conf creates /run/uwsgi [21:29:34] Logged the message, Master [21:29:40] in the pre-start script section [21:29:54] manybubbles has a funny way of logging stuff, I notice [21:30:10] Very enjoyable when one peruses the SAL at one's leisure, though. [21:30:18] I believe the graphite and gdash interface have similar problems [21:30:18] twkozlowski: not normal [21:30:40] so what's the simple fix that needs to be translated to puppet? [21:30:40] manybubbles: you way of loggin things, or perusing SAL at one's leisure? :-D [21:31:32] twkozlowski: I guess both. I kind of like looking over my old comments in SAL - I can figure out what was going on reasonably well [21:31:36] twkozlowski: I'll stand in for you, add my name next to it [21:31:44] chasemp: the dirs are defined in puppet so they'll be created on puppet run [21:31:55] chasemp: but that leaves the time from system boot to first puppet run [21:31:57] and it someone wonders why the elasticsearch cluster went to warning state, now that can see it on SAL [21:32:10] I ran puppet manually and restarted apache to no avail [21:32:28] I saw a gdash creation but nothing else [21:33:06] that's ok, don't lose your head. as i said, it's an internal service, so it's ok if it's down for a bit while we figure out a proper fix. 
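The bind() failure described above can be reproduced in a few lines. This is a sketch under assumptions, not the uwsgi code: `bind_unix_socket` is a hypothetical helper, and the temporary directory stands in for a runtime directory that is emptied at boot. Binding a unix domain socket to a path whose parent directory is missing fails with "No such file or directory"; creating the directory first, as a pre-start step would, makes the same bind succeed.

```python
import os
import socket
import tempfile


def bind_unix_socket(path: str, create_parent: bool) -> socket.socket:
    """Bind a unix domain socket at path.

    If create_parent is True, create the parent directory first, which is
    the job a service's pre-start step would do for a directory under /run.
    """
    if create_parent:
        os.makedirs(os.path.dirname(path), exist_ok=True)
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError:
        sock.close()
        raise
    return sock


if __name__ == "__main__":
    # Stand-in for a runtime directory that was wiped by a reboot.
    base = tempfile.mkdtemp()
    sock_path = os.path.join(base, "gdash", "gdash.sock")

    try:
        bind_unix_socket(sock_path, create_parent=False)
    except FileNotFoundError as err:
        print("bind failed:", err)

    sock = bind_unix_socket(sock_path, create_parent=True)
    print("bound:", os.path.exists(sock_path))
    sock.close()
```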
[21:33:20] !log gdash and graphite currently down; chase & ori debugging [21:33:25] Logged the message, Master [21:34:40] ok, so i think we should have gdash bind to sockets in /run/uwsgi, as it is a uwsgi app, and the uwsgi main init script guarantees its creation [21:34:51] graphite ditto [21:36:33] give a try and if it works we can translate to puppet, but I'm good with it [21:37:08] (CR) Tim Landscheidt: [C: 1] "Looking at https://github.com/openresty/lua-resty-redis#set_keepalive, this feels alright to me." [operations/puppet] - https://gerrit.wikimedia.org/r/133172 (https://bugzilla.wikimedia.org/65179) (owner: Yuvipanda) [21:37:31] Thanks greg-g; appreciate it greatly. Just added that patch to the SWAT window later today. [21:38:55] PROBLEM - SSH on maerlant is CRITICAL: Server answer: [21:40:38] (PS1) Ori.livneh: uWSGI apps should place sockets in /run/uwsgi [operations/puppet] - https://gerrit.wikimedia.org/r/133175 [21:40:44] chasemp: & [21:40:51] er, ^ [21:41:17] k, but with that, why wouldn't it run after boot, after a puppet run, post apache restart? [21:41:22] why would it still be hosed up?
[21:42:00] now I mean, what that resolves it seems like would not affect it at this moment, only post boot pre-puppet [21:42:36] well, the sockets are not there [21:42:40] so apache must have restarted before the puppet run [21:42:53] and before the directories existed [21:43:14] incidentally, i need to amend that patch, since i forgot we're using apache there and not the nginx that is bundled with the modules [21:44:43] (03PS2) 10Ori.livneh: uWSGI apps should place sockets in /run/uwsgi [operations/puppet] - 10https://gerrit.wikimedia.org/r/133175 [21:44:43] ok, so that isn't the case here like [21:44:43] root@tungsten:/var/log/upstart# grep sock /etc/uwsgi/apps-enabled/graphite-web.ini [21:44:43] socket = /var/run/graphite-web/graphite-web.sock [21:44:43] stats = /var/run/graphite-web/graphite-web-stats.sock [21:44:44] root@tungsten:/var/log/upstart# ls /var/run/graphite-web/ [21:44:46] root@tungsten:/var/log/upstart# /etc/init.d/apache2 restart [21:44:48] * Restarting web server apache2 ... waiting . [ OK ] [21:44:50] root@tungsten:/var/log/upstart# [21:44:52] sock location, dir exists, apache restart [21:44:55] still nadda [21:45:03] so post boot, post puppet, post apache restart [21:45:23] still broken is what I'm asking, that changeset is good but doesn't explain why it's still broken now? [21:45:32] [Tue May 13 21:45:16 2014] [error] [client 208.80.154.14] uwsgi: unable to connect to uWSGI server: No such file or directory [21:45:41] in apache2 error log [21:45:48] so if you do: [21:46:03] uwsgictl restart [21:46:08] followed by apache2ctl restart [21:46:15] then see [21:46:25] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.936 second response time [21:46:29] there it is [21:46:55] the only thing that is not resolved by my patch above is the fact that uwsgi must start before apache [21:47:12] or maybe not [21:47:18] right, but isn't that significant? 
[21:47:26] well, let's figure out definitively [21:47:32] let's stop uwsgi and make sure the socket files are gone [21:47:57] restart apache, it'll complain that it can't reverse-proxy those sockets [21:48:15] then start uwsgi, and see if graphite/gdash start to work, or if it requires a manual restart of apache [21:48:53] could you try that? [21:49:04] if you 'uwsgictl stop' [21:49:56] graphite breaks, sock file are there, I don't mean it removes them on stop, only that logic to ensure they exist should be in the manifest [21:50:58] i'm not sure what you mean [21:51:14] let me try for a moment [21:51:29] so, you stop uwsgi it breaks, sock files still exist, you start uwsgi it is fine [21:51:34] no apache restart necessary [21:51:58] yeah, so it's fine [21:52:03] so the patch should fix it comprehensively [21:52:09] but the sock files being missing breaks all else, yet no logic exists in puppet to fix it? [21:52:24] are you sure that the sock files being missing breaks all else? [21:52:38] the problem was that uwsgi couldn't start the apps ebcause the parent directory didn't exist [21:52:58] sure but when they did exist running puppet didn't then create them you know? [21:53:02] but moving them to /run/uwsgi fixes that, because the pre-start-script section of the uwsgi init script ensures that directory is there [21:54:13] sounds good [21:54:23] let's merge it and restart to confirm [21:54:25] chasemp: i also just confirmed that apache still picks up gdash/graphite if the sock files aren't there [21:54:29] what i did was: [21:54:34] stopped apache and uwsgi [21:54:36] deleted sock files [21:54:43] started apache, gateway error as expected [21:54:50] start uwsgi, confirmed sock files were created [21:54:53] apache now worked [21:55:07] yup, but is the puppet manifest managing uwsgi correctly? 
[21:55:24] yes, it's proper for the /run directory to be managed by the service [21:55:29] there should have been a puppet error when uwsgi couldn't start because of a bad path for its sock files? [21:55:31] it's not proper for it to be managed by puppet [21:55:39] the first time around I mean [21:55:59] no, puppet shouldn't manage ephemeral directories [21:56:11] it was a bug in the modules to put those in /var/run anyway [21:56:47] agreed, and that's fine, we can move ahead, my thought is basically that a service puppet manages was broken. puppet did not report it at all. [21:56:54] it's correct now, both lowercase-c correct (it'll work) and uppercase-c Correct (it's proper) [21:57:37] chasemp: it's a consequence of puppet not being able to reason about task jobs [21:57:42] but there are alerts set up for that [21:58:08] it's a general problem; puppet will ensure => running on an upstart service but not wait to see that it stays up [21:58:36] but that's also fine imo; puppet is not a monitoring tool [21:58:45] in this case it was more like, puppet said it was running, but puppet also doesn't actually know what that means [22:00:56] chasemp: will you merge or should i? [22:01:34] just looking, was going to +1 and let you do your thing? [22:01:50] (CR) Rush: [C: 1] "cool deal thanks" [operations/puppet] - https://gerrit.wikimedia.org/r/133175 (owner: Ori.livneh) [22:03:00] chasemp: i see what you're saying [22:03:12] puppet.log doesn't show puppet as starting the service when uwsgi was stopped [22:03:19] that must mean uwsgictl status is not returning the right exit code or something [22:03:20] yes [22:03:29] yes I don't know what the deal was but [22:03:35] it did not manage that service correctly [22:04:09] the sock file paths are specified explicitly in the manifest, yet if they don't exist we do nothing in puppet?
[22:04:17] not saying we should manage those files directly [22:04:26] but it should trigger restart of uwsgi or otherwise creation [22:04:51] root@tungsten:/var/log# uwsgictl status ; echo $? [22:04:51] uwsgi/app stop/waiting [22:04:51] 0 [22:04:57] yeah, so that's a bug [22:05:13] heh [22:05:17] initctl list | grep -P '^uwsgi/(?!init)' | sort [22:05:31] sort returns 0, status returns 0, puppet is happy [22:06:10] ^d: what's the --owner=MyGroup for operations in gerrit? i don't see just "operations" as a group, just 2 more specific ones [22:06:22] <^d> "ldap/operations" [22:06:34] aha, thx [22:07:09] <^d> yw [22:09:25] PROBLEM - swift-account-replicator on ms-be1009 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:09:25] PROBLEM - swift-object-server on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:09:25] PROBLEM - check configured eth on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:09:35] PROBLEM - swift-container-server on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:09:36] PROBLEM - check if dhclient is running on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:09:45] PROBLEM - swift-container-updater on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:09:45] PROBLEM - Disk space on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:05] PROBLEM - puppet disabled on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:10:15] PROBLEM - swift-object-replicator on ms-be1009 is CRITICAL: Timeout while attempting connection [22:10:15] PROBLEM - swift-object-auditor on ms-be1009 is CRITICAL: Timeout while attempting connection [22:10:15] PROBLEM - RAID on ms-be1009 is CRITICAL: Timeout while attempting connection [22:10:15] PROBLEM - swift-account-reaper on ms-be1009 is CRITICAL: Timeout while attempting connection [22:10:15] PROBLEM - swift-object-updater on ms-be1009 is CRITICAL: Timeout while attempting connection [22:10:16] PROBLEM - DPKG on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:16] PROBLEM - swift-container-auditor on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:17] PROBLEM - swift-account-server on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:17] PROBLEM - swift-container-replicator on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:18] PROBLEM - swift-account-auditor on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:31] uhhh wtf [22:11:08] work on swift? [22:12:13] that box is being hammered it seems [22:12:16] upload issue again? 
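The status bug discussed above comes down to a pipeline's exit code being that of its last command (here `sort`), so the check reports success even when the job is stopped. Below is a hedged sketch of a status check that derives its exit code from the job states instead. It is not the content of the actual fix; the function name is hypothetical, and the `initctl list` line format is assumed from the single `uwsgi/app stop/waiting` line pasted in the channel.

```python
import re

# Matches uwsgi app jobs but skips the uwsgi/init helper job, mirroring the
# grep pattern '^uwsgi/(?!init)' quoted in the channel.
JOB_PREFIX = re.compile(r"^uwsgi/(?!init)")


def status_exit_code(initctl_output: str) -> int:
    """Return 0 only if at least one uwsgi app job is listed and every
    listed app job is in the start/running state; return 1 otherwise."""
    jobs = [
        line for line in initctl_output.splitlines()
        if JOB_PREFIX.match(line)
    ]
    if not jobs:
        return 1
    return 0 if all("start/running" in line for line in jobs) else 1


if __name__ == "__main__":
    stopped = "uwsgi/app stop/waiting\n"
    print(status_exit_code(stopped))
```

A service manager that trusts this exit code would see the stopped job as stopped and start it, which is the behaviour the broken pipeline was hiding.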
[22:16:15] (PS1) Ori.livneh: Fix 'status' checks in *ctl scripts [operations/puppet] - https://gerrit.wikimedia.org/r/133180 [22:16:22] chasemp: thanks for pressing the point about the service check, it was a major bug and one that i reproduced in several places [22:16:46] (CR) Ori.livneh: [C: 2] uWSGI apps should place sockets in /run/uwsgi [operations/puppet] - https://gerrit.wikimedia.org/r/133175 (owner: Ori.livneh) [22:17:05] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [22:17:54] (CR) Ori.livneh: [C: 2] Fix 'status' checks in *ctl scripts [operations/puppet] - https://gerrit.wikimedia.org/r/133180 (owner: Ori.livneh) [22:18:03] (CR) Greg Grossmeier: [C: 1] "I might be afk, but you probably don't need me to be online at the time of deploy. It's fairly trivial." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127584 (https://bugzilla.wikimedia.org/64127) (owner: Odder) [22:22:02] <^d> Graphite/Gdash's still busted, right?
[22:22:25] i'm verifying the fixes so yes [22:22:28] but should be back in a moment [22:22:37] chasemp: notice: /Stage[main]/Uwsgi/Service[uwsgi]/ensure: ensure changed 'stopped' to 'running' [22:22:40] so that's fine now [22:22:51] nice awesome [22:22:58] !log restarting tungsten to verify fix for gdash/graphite initialization [22:23:05] Logged the message, Master [22:23:17] i'm still not sure why you restarted it in the first place but now i'm very glad that you did :P [22:24:35] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Last successful Puppet run was Tue May 13 16:23:00 2014 [22:25:18] ori: will explain here in a minute, have a weird swift box [22:25:28] yeah [22:25:31] ^d: chasemp: # ssh -p 29418 gerrit.wikimedia.org gerrit create-project --require-change-id --owner=ldap/operations --parent=operations --description='"open software engineering platform and fun adventure game"' phabricator/phabricator [22:25:35] it's not booting up now, though. hrm. [22:25:36] ? [22:25:55] <^d> There is no "operations" parent. [22:26:08] <^d> Just omit the parent, I'll adjust acls as needed. [22:26:15] i said phab/phab because [22:26:17] ssh -p 29418 gerrit.wikimedia.org gerrit create-project --require-change-id --owner=ldap/operations --parent=operations --description='"command line interface for phabricator"' phabricator/arcanist [22:26:21] mutante: idk I usually do it in the ui :) [22:26:23] there would be 3 of them [22:26:28] chasemp: wikitech says not to [22:26:34] not to use the GUI [22:26:36] did not know that [22:26:40] has worked fine for me? [22:26:47] it says "some shortcomings" [22:26:55] <^d> It's better than it was. [22:27:11] chasemp: ok, rebooted the machine and confirmed it started up correctly [22:27:50] ^d: should be fine now [22:28:07] <^d> Hmm, search stats still missing. Will have to keep poking there. [22:28:09] chasemp: ^d if the other 2 are phabricator/arcanist and phabricator/libphutil, should the main one be phabricator/phabricator? 
[22:28:15] PROBLEM - SSH on ms-be1009 is CRITICAL: Connection refused [22:28:44] <^d> mutante: Yes. And to make things simpler, make phab/arcanist and phab/libphutil inherit from phab/phab [22:28:53] <^d> Then I can just configure the phab/phab acl and call it a day [22:29:11] ^d: sounds good, cool [22:30:07] fatal: Group "ldap/operations" does not exist [22:30:29] also skip owner== ? [22:30:39] ^d: malkovich malkovich? [22:30:58] (03CR) 10MarkTraceur: [C: 04-2] "We shouldn't use this patch, it should be scheduled as a separate deployment window and we should have four patches queued up for it." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/129828 (owner: 10MarkTraceur) [22:31:06] <^d> mutante: Yeah, easily fixed later. [22:31:35] PROBLEM - Host ms-be1009 is DOWN: PING CRITICAL - Packet loss = 100% [22:32:15] RECOVERY - SSH on ms-be1009 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [22:32:15] RECOVERY - swift-account-replicator on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [22:32:15] RECOVERY - swift-object-server on ms-be1009 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [22:32:15] RECOVERY - check configured eth on ms-be1009 is OK: NRPE: Unable to read output [22:32:25] RECOVERY - Host ms-be1009 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [22:32:25] RECOVERY - swift-container-server on ms-be1009 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [22:32:25] RECOVERY - check if dhclient is running on ms-be1009 is OK: PROCS OK: 0 processes with command name dhclient [22:32:35] RECOVERY - swift-container-updater on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [22:32:37] RECOVERY - Disk space on ms-be1009 is OK: DISK OK [22:32:45] RECOVERY - Puppet freshness on ms-be1009 is OK: puppet ran at Tue May 13 22:32:40 UTC 2014 
[22:32:56] RECOVERY - puppet disabled on ms-be1009 is OK: OK [22:33:05] RECOVERY - DPKG on ms-be1009 is OK: All packages OK [22:33:05] RECOVERY - swift-object-auditor on ms-be1009 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [22:33:05] RECOVERY - swift-object-updater on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [22:33:05] RECOVERY - swift-container-replicator on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [22:33:05] RECOVERY - swift-account-auditor on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [22:33:05] RECOVERY - swift-account-server on ms-be1009 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [22:33:06] RECOVERY - swift-object-replicator on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [22:33:06] RECOVERY - swift-account-reaper on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [22:33:07] RECOVERY - swift-container-auditor on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:33:07] RECOVERY - RAID on ms-be1009 is OK: OK: optimal, 14 logical, 14 physical [22:33:50] chasemp: ^d: done, see auto-complete in gerrit when typing project:pha... [22:34:11] <^d> cool, will have a look in a moment. [22:35:08] !log created new gerrit projects for phabricator,arcanist and libphutil [22:35:14] Logged the message, Master [22:38:53] <^d> yurikR: Made it static like you asked, Aaron merged for me. Got it on the calendar for this afternoon's swat. [22:39:07] ^d thx :) [22:39:28] <^d> yw. 
We had actually just talked about this in our team meeting yesterday so I was halfway on it :) [22:43:53] !log ms-be1009 rebooted as it had locked up, swift seems to have recoverd [22:43:59] Logged the message, Master [22:45:59] mutante: atm no reports [22:46:57] Steinsplitter: great [22:47:35] ori: essentially, I created (as of now) 167314 new metrics today, it was slamming the box, I attempted to put in a max_creations rate to help [22:47:41] (03PS1) 10Chad: Search latency metric fix, use actual metric name [operations/puppet] - 10https://gerrit.wikimedia.org/r/133187 [22:47:43] carbon wasn't respecting it (seemingly) [22:47:51] the box was already offline for serving at that point [22:47:54] I opted to reboot [22:48:31] chasemp: k, makes sense [22:48:47] the rest you know :) [22:49:56] how is diamond working out so far? [22:50:11] (03CR) 10Ori.livneh: [C: 032] Search latency metric fix, use actual metric name [operations/puppet] - 10https://gerrit.wikimedia.org/r/133187 (owner: 10Chad) [22:50:20] the only thing I think may need some love at this moment is hte swift servers have so many disks [22:50:28] the disk specific checks are creating more load than I would like [22:50:46] normal agent cpu/mem usage is very light [22:51:27] we are soon to be in need of a dashboard that isn't gdash I think to make use of all of this [22:51:39] yeah, gdash sucks [22:52:24] ryan lane and paravoi.d recommended http://grafana.org/ , i'm partial to https://github.com/vimeo/graph-explorer [22:52:31] but the important thing is to pick something [22:53:04] those actually were kind of the two had I thought [22:53:22] maybe dieterbe has softened at the edges and graphite-explorer could be awesome [22:53:30] grafana is prettier tho [22:53:58] let's do grafana for now then [22:54:05] better that than wallow in indecision [22:54:54] previously we hacked up http://code.shutterstock.com/rickshaw/ [22:55:07] so that you could paste a graphite query and it would make an interactive graph 
[22:55:21] that for embedding was priceless (in phabricator or chat) [22:55:33] we meaning previously ppl I worked w/ [22:55:57] http://vimeo.github.io/graph-explorer/ does that, no? [22:56:01] it's v2.0 now [22:56:14] and may have changed a lot since you last looked [22:56:21] yeah, definitely has [22:56:27] in that case i don't know [22:56:58] it used to be he was using some graphite.js thing [22:57:02] that was like an abstracted D3 [22:57:08] and it was just ok [22:57:43] well, graph-explorer has a hard dependency on elastic, whereas grafana's dependency on elastic is soft -- it's just used for saving dashboards [22:57:58] and paravoi.d expressed a preference for it, so he won't be dismayed to find out we provisioned it while he was sleeping [22:58:02] I figure we'll want it anyways, so not too big a thing [22:58:19] ha, not my intention [22:58:25] i'm contemplating it :) [22:59:17] this http://grafana.org/docs/features/scripted_dashboards/ [22:59:21] would be really useful [22:59:42] i'll see if i can puppetize it quickly [23:04:06] (03PS5) 10Dzahn: initial commit for a phabricator module (WIP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 [23:06:34] (03CR) 10Dzahn: initial commit for a phabricator module (WIP) (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 (owner: 10Dzahn) [23:10:24] * MaxSem looks around [23:10:33] I'll do the SWAT [23:11:37] * MaxSem pokes today's contacts, greg-g and ^d - are you there guys? [23:11:46] <^d> Yessir. 
[23:14:02] (CR) MaxSem: [C: 2] Configure two extra namespaces for zhwikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127584 (https://bugzilla.wikimedia.org/64127) (owner: Odder) [23:14:14] (Merged) jenkins-bot: Configure two extra namespaces for zhwikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127584 (https://bugzilla.wikimedia.org/64127) (owner: Odder) [23:18:33] !log maxsem synchronized wmf-config/InitialiseSettings.php 'https://gerrit.wikimedia.org/r/127584' [23:18:39] Logged the message, Master [23:21:13] !log Ran namespaceDupes after adding new namespaces to zhwikisource - no problems found [23:21:20] Logged the message, Master [23:25:11] !log maxsem synchronized php-1.24wmf3/includes/exception/MWException.php 'https://gerrit.wikimedia.org/r/#/c/133183/' [23:25:18] Logged the message, Master [23:27:13] !log maxsem synchronized php-1.24wmf4/includes/exception/MWException.php 'https://gerrit.wikimedia.org/r/#/c/133184/' [23:27:16] <^d> MaxSem: Much thanks [23:27:17] ^d, done ^^^ [23:27:19] Logged the message, Master [23:27:22] :) [23:27:24] (CR) Dzahn: "let's start with a proper labs project btw instead of putting phab instances into gerrit, it's confusing" [operations/puppet] - https://gerrit.wikimedia.org/r/132505 (owner: Dzahn) [23:28:41] (CR) Chad: "Gerrit wasn't being used for anything else, hence the use there :p" [operations/puppet] - https://gerrit.wikimedia.org/r/132505 (owner: Dzahn) [23:30:04] ^d: i'd like to login to mysql to see which db's it created.. it's on localhost on fab2 but the root password is set [23:30:26] or is it using other backend [23:30:48] <^d> it's localhost mysql. [23:31:36] got the root pass? [23:32:31] <^d> /phab/phabricator/conf/local/local.json [23:32:50] ^d: haha, nice one [23:32:53] got it, thx [23:33:04] <^d> yw [23:33:33] dislikes a bit how it wants sooo many separate databases [23:33:50] so when i ask our dba to create them ...
[23:33:56] i need to list all of that.. [23:34:04] (CR) CSteipp: [C: -1] Improve nginx TLS/SSL settings. (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: JanZerebecki) [23:34:24] <^d> mutante: Possibly. We don't use all of the applications so I wonder if all those databases are strictly necessary. [23:34:36] <^d> Like do you need `phabricator_legalpad` if you don't use the legalpad thing? [23:35:40] ^d: yea, it's "one per application", we need to figure out which apps we want, preferably before we put it on prod [23:35:51] having a ticket for it [23:58:56] (CR) Dzahn: Improve nginx TLS/SSL settings. (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: JanZerebecki)