[00:01:15] (03PS2) 10Dzahn: remove pmtpa access switches [operations/dns] - 10https://gerrit.wikimedia.org/r/143202 [00:02:02] (03CR) 10Ori.livneh: "> I'm not sure how the HHVM configuration interacts with the non-HHVM configuration. Will the variant aliases still work?" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/142983 (owner: 10Ori.livneh) [00:03:11] (03CR) 10Dzahn: [C: 04-1] "need to check what/if any of this is used for 10th floor" [operations/dns] - 10https://gerrit.wikimedia.org/r/143202 (owner: 10Dzahn) [00:11:29] (03PS1) 10Dzahn: retab wikipedia zone and fix aligning [operations/dns] - 10https://gerrit.wikimedia.org/r/143208 [00:14:48] (03PS1) 10Dzahn: delete anything 'toolserver' [operations/dns] - 10https://gerrit.wikimedia.org/r/143209 [00:15:34] (03CR) 10Dzahn: [C: 04-1] "after "July 1st 1:00 am UTC, the Toolserver accounts will be expired and the" [operations/dns] - 10https://gerrit.wikimedia.org/r/143209 (owner: 10Dzahn) [00:16:32] (03PS3) 10Ori.livneh: Apache config for Wikivoyage using mod_proxy_fcgi [operations/apache-config] - 10https://gerrit.wikimedia.org/r/142983 [00:21:03] (03PS1) 10Dzahn: wikimediafoundation - align and tabs [operations/dns] - 10https://gerrit.wikimedia.org/r/143212 [00:24:37] (03CR) 10Dzahn: [C: 031] Initialize some settings for wikimania 2015 wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139279 (https://bugzilla.wikimedia.org/66370) (owner: 10Withoutaname) [00:28:54] (03PS7) 10Reedy: Initialize some settings for wikimania 2015 wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139279 (https://bugzilla.wikimedia.org/66370) (owner: 10Withoutaname) [00:31:43] springle: http://chr13.com/2014/03/10/using-google-to-ddos-any-website/ [00:31:56] reminded me of the tendril report [00:35:38] YuviPanda|zz: I know you're not around anymore but...pong. 
I'll try to hit you up to talk about that stuff tomorrow [00:38:34] (03CR) 10Dzahn: [C: 031] "lgtm, but can we not introduce the literal tabs?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141671 (owner: 10Filippo Giunchedi) [00:44:17] (03PS2) 10Dzahn: update index page for wikimedia downloads [operations/puppet] - 10https://gerrit.wikimedia.org/r/141671 (owner: 10Filippo Giunchedi) [00:44:53] AaronSchulz: wow [00:47:08] chasemp: still around? I popped back in for a bit (can't sleep) [00:47:10] if not I'll just poke you tomorrow [00:47:15] yo [00:47:19] chasemp: yo! [00:47:56] chasemp: around and want to talk about it now? [00:48:00] so to give some context to your concern, there has been talk of primary data submission from diamond, as in even binary state data for services and such [00:48:06] chasemp: ah! [00:48:07] chasemp: right. [00:48:13] but it was always maintained ops side that the submission would be to icinga [00:48:30] or equivalent and that only the anomaly type stuff was alerted through graphite [00:48:31] as in? diamond -> icinga? or diamond -> graphite <- icinga [00:48:49] as in, we have a mechanism (which we make?) for diamond => icinga [00:49:07] ideally....in my personal world...we use passive checks much more heavily [00:49:07] ah, hmm. [00:49:11] and have hosts themselves be responsible for checking in [00:49:18] so I told you that just so we are on the same page [00:49:18] but [00:49:39] as far as actual alerting how you see fit in labs, I think it's in your hands [00:49:45] as the prod stuff needs lots of TLC atm [00:49:58] (03PS3) 10Dzahn: update index page for wikimedia downloads [operations/puppet] - 10https://gerrit.wikimedia.org/r/141671 (owner: 10Filippo Giunchedi) [00:50:06] not suggesting you should implement all that, just that it's in the mix somewhere [00:50:18] personally I think trying out some alerting dashboards would be really cool [00:50:26] chasemp: I agree too.
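The passive-check model chasemp sketches above (hosts submit their own results, rather than icinga polling them) comes down to writing result lines into icinga's external command file. A minimal sketch: the `PROCESS_SERVICE_CHECK_RESULT` line format is standard Nagios/Icinga, but the command-file path and the host/service names here are illustrative assumptions, not the actual WMF setup:

```python
import time

def passive_check_line(host, service, status, output, ts=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT line for Icinga's external
    command file (status: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN)."""
    ts = int(ts if ts is not None else time.time())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        ts, host, service, status, output)

def submit(line, cmd_file="/var/lib/icinga/rw/icinga.cmd"):
    # Icinga consumes lines appended to this FIFO; path is an assumption.
    with open(cmd_file, "a") as f:
        f.write(line + "\n")
```

A host's cron would then call something like `submit(passive_check_line("tools-login", "puppet", 0, "puppet ran"))`, which is the "hosts check in themselves" flow rather than central polling.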
[00:50:27] who knows maybe you love it [00:50:30] maybe it's java [00:50:33] (03CR) 10Dzahn: "also fixed some validation errors in PS3, (see diff between PS2 and PS3), now: This document was successfully checked as XHTML 1.0 Transit" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141671 (owner: 10Filippo Giunchedi) [00:50:35] heheh [00:50:53] chasemp: so one thing we can do is perhaps setup cabotapp (or equivalent) only for *toollabs*, and see how the experience is. [00:51:03] I think that's reasonable [00:51:08] it looks neat [00:51:24] chasemp: only thing I have against it is that checks aren't in version control, but need to be in a db. [00:51:29] that makes me feel weiirddd [00:51:39] not sure I understand? [00:51:41] so ideally I'd find something that lets me put the checks in a git repo [00:51:48] you mean the check logic itself? [00:51:52] chasemp: yeah [00:52:16] the graphite expressions will be stored in a db that cabotapp manages [00:52:18] yeah I'm sure we can do it right if we like it, I've done lots of things like that with git and post commit hooks [00:52:32] :D yeah, so that's a possibility too. [00:53:01] assume you've seen http://graphite.readthedocs.org/en/latest/tools.html ? [00:53:09] lots of effort in this space atm [00:53:10] chasemp: yeah, that's where I picked this up from [00:53:20] I've used most of those actually [00:53:41] oh [00:53:46] https://github.com/livingsocial/rearview/ looks nice too, but rub. [00:53:49] *ruby [00:53:51] but not cabot as it's new to me [00:54:01] and requires jvm [00:54:40] chasemp: have you used any of those tools for alerts? [00:54:55] no as is the case for everyone we rolled our own at my last place [00:55:08] hehe [00:55:14] I'd like to avoid that [00:55:26] chasemp: https://github.com/datacratic/check_graphite or similar is also a possibility [00:55:42] _joe_ actually rewrote most of that quite nicely [00:55:50] but it hasn't seen too much implementation yet [00:55:53] oh, check_graphite? 
[00:55:56] if you were curious he would be the guy to talk to [00:55:57] yes [00:56:21] nice. I wouldn't mind using check_graphite + icinga, but that seems a bit wasteful (+ brings all of icinga's complexity with it for little benefit) [00:56:27] (03PS3) 10Dzahn: modules/coredb_mysql/ sans systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137994 (owner: 10Rush) [00:57:05] well the hardest parts for an alerting system isn't alerting actually it's the finagle bits of assigning alerts, accepting them, silencing them [00:57:06] etc [00:57:13] chasemp: the previous icinga on labs effort stalled because prod's icinga setup was too prod specific, and hence petan just went his own way (and it was unpuppetized). I'd say my first priority is making sure that doesn't happen, and that any solution has buy-in from ops. [00:57:15] you get a lot more than you realize with that underbelly [00:57:18] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Tue Jul 1 00:57:10 UTC 2014 [00:57:21] chasemp: that's true [00:57:44] idk how serious akosiaris is about shinken [00:57:58] but if he has more or less made up his mind maybe just deploy it in toollabs [00:58:04] rather than unraveling icinga [00:58:12] easier to build new than remodel [00:58:23] we could do that too, but then how do you define checks? collecting resources has the same problem with shinken, I'd suppose [00:58:38] ah I see [00:58:52] I have a few thoughts on it but let me noodle on it for a bit [00:59:05] chasemp: ok! [00:59:40] (03PS4) 10Dzahn: gerrit - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138008 (owner: 10Rush) [00:59:42] but say you get the histogram / trending portion stood up [00:59:47] that seems like the place to start [00:59:52] maybe you end up happy without icinga idk [01:00:24] I'd be happy to start with something as trivial as 'host is down!' and 'out of disk space!'
(two things people on mailing lists had to alert us today, and happens almost every other week) :) [01:01:42] (03PS4) 10Dzahn: ganglia - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138010 (owner: 10Rush) [01:02:00] seems like you could do that second with most of the graphite dashboard tools [01:02:18] chasemp: yeah, with an alerting component [01:02:21] and we could write a really, really trivial cron to check for a host that hasn't reported a metric in x number of minutes [01:02:39] right [01:02:45] or even add that into cabotapp [01:02:52] yeah probably ideally that [01:03:01] it right now doesn't support reporting to IRC, for example (but reports to HipChat >_>) [01:03:25] yeah I kind of don't love our whole irc bots thing honestly [01:03:37] I think we talked previously about using a redis pubsub or some pubsub queue [01:03:43] hah, a new bot every few months as well [01:03:46] that can be used as an irc events pipeline [01:03:47] chasemp: oh, but where would that feed to? [01:04:23] chasemp: ah, got it [01:04:23] what I did before was a single bot(-ish) that could read from the queue (as could any consumer) and then you publish events [01:04:29] like an operational events pipeline [01:04:31] yeah [01:04:42] it's not complicated really but far more flexible than a lot of event islands [01:04:49] and we used that to post events to graphite too [01:04:54] faidon was interested in http://www.fedmsg.com/en/latest/ a while back [01:05:04] so syncs, etc all could be overlayed on graphs [01:05:34] in my (highly biased) opinion EventLogging is more robust, but it doesn't target operations stuff currently [01:05:38] ori: that's...pretty similar to how i've used redis in the past [01:05:50] I admit I know little of the current eventlogging stuff [01:06:08] I saw a chart from nuria it seemed not simple**tm [01:06:12] ori: do you have any thoughts/opinions on cabotapp and friends?
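The "really, really trivial cron" mentioned above (flag any host whose newest graphite datapoint is older than x minutes) could be sketched like this. The graphite endpoint and metric pattern are assumptions; the JSON shape is graphite's standard render-API format of `{"target": ..., "datapoints": [[value, timestamp], ...]}`:

```python
import json
import urllib.request

GRAPHITE = "http://graphite.wmflabs.org"  # hypothetical endpoint

def stale_hosts(last_seen, now, max_age_minutes=10):
    """Return hosts whose newest datapoint is older than max_age_minutes.
    last_seen maps host -> unix timestamp of its latest metric."""
    cutoff = now - max_age_minutes * 60
    return sorted(h for h, ts in last_seen.items() if ts < cutoff)

def fetch_last_seen(metric="*.cpu.total.idle"):
    """Query graphite's render API for the newest non-null point per series."""
    url = "%s/render?target=%s&from=-1h&format=json" % (GRAPHITE, metric)
    series = json.load(urllib.request.urlopen(url))
    out = {}
    for s in series:
        points = [(v, t) for v, t in s["datapoints"] if v is not None]
        if points:
            out[s["target"]] = points[-1][1]  # timestamp of newest point
    return out
```

A cron would combine the two (`stale_hosts(fetch_last_seen(), time.time())`) and page or post somewhere for any host returned.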
[01:06:25] i've never seen cabotapp [01:06:36] (http://cabotapp.com/) [01:06:52] metrics based alerts. [01:07:30] seems okay, i'm just suspicious of new solutions [01:07:39] yeah, me too. [01:08:01] let's not marry it, but maybe drinks a few dates [01:08:17] then we say we are moving out of the country if it sucks [01:08:26] not marrying *anything* sounds great to me! :) [01:08:39] everybody wants to bring their ex from last work :) [01:08:45] * YuviPanda is actually moving out of the country soon too, so will work well [01:08:55] mutante: heh, that's so true [01:10:27] we have graphite, gdash, logstash, kibana, ganglia, diamond, txstatsd, python-statsd, icinga, tendril, udp2log, logster [01:10:48] we have python-statsd running somewhere as well? [01:10:51] YuviPanda: anyways, hope something there was helpful. tl;dr...I think giving it a try is cool. also, your use case may be simple enough it's all you need. [01:10:57] python-statsd is a client lib I think [01:11:03] ah [01:11:11] and I thought ganglia was going to die... [01:11:29] not dead yet tho [01:11:46] don't know tendril, ori? [01:11:48] watchmouse should make sure the other monitoring tools are actually up [01:11:51] YuviPanda: you know how it's more fun to roll out something new than sunset an existing solution? other people feel like that too :P [01:11:57] ori: :P [01:12:19] idk I love ripping things out :) [01:12:27] I'm more impressed with removing lines of code than adding them :) [01:12:29] mutante: PROBLEM - Number of monitoring apps is CRITICAL: CRITICAL: 20.00% of data exceeded the critical threshold [500.0] [01:12:38] ori: :)) [01:12:44] * YuviPanda likes ripping things out too [01:12:51] go kill toolserver :) [01:12:52] tendril is https://tendril.wikimedia.org/ [01:12:53] might get rid of all ganglia things that are labs specific. 
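The "operational events pipeline" chasemp describes above (one queue, many consumers: IRC bot, graphite annotations, logs) mostly needs an agreed-on event format plus a pub/sub channel. A sketch assuming redis and the redis-py client; the channel name and event fields are made up for illustration:

```python
import json
import time

def make_event(source, kind, message, ts=None):
    """Serialize an operational event; any consumer subscribed to the
    same channel (IRC bot, graphite annotator, logger) can decode it."""
    return json.dumps({
        "source": source,
        "kind": kind,
        "message": message,
        "ts": int(ts if ts is not None else time.time()),
    }, sort_keys=True)

# Publishing side (needs a running redis and the redis-py client):
#   import redis
#   redis.StrictRedis().publish("ops-events", make_event("cabot", "alert", "disk full"))
#
# Consumer side (e.g. the single IRC bot reading the queue):
#   pubsub = redis.StrictRedis().pubsub()
#   pubsub.subscribe("ops-events")
#   for msg in pubsub.listen():
#       if msg["type"] == "message":
#           handle(json.loads(msg["data"]))
```

The point of the design is that syncs, deploys, and alerts all become plain events on one channel instead of "a lot of event islands", each with its own bot.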
[01:12:58] mutante: ALREADY DEAD :) [01:13:03] s pringle's own [01:13:05] kinda cool actually [01:13:10] YuviPanda: https://gerrit.wikimedia.org/r/#/c/143209/ [01:13:22] ori: didn't we have ismaheal or something earlier? [01:13:28] oh, i forgot, there's ishmael too [01:13:29] "mha" ? [01:13:32] heh [01:14:04] to be fair several of those are at different layers in the monitoring stack not direct competitors :) [01:14:48] oh i'm not pointing fingers, i am directly responsible for some of the duplication [01:15:07] but i'm just leery of single-purpose monitoring solutions for exactly that reason [01:15:15] yeah agreed [01:15:18] +1 [01:15:31] bottom line is we need to figure out what we are going to do w/ icinga [01:15:41] so tools, etc know which way to go as well I guess [01:15:42] <^demon|lunch> Maybe we can move monitoring to phabricator since it's going to be our do-everything tool. [01:15:51] ^demon|lunch: heheheh [01:15:59] yeah, if icinga is going to be kept at least it should be moved to be a module [01:16:01] you laugh, but... https://secure.phabricator.com/book/phabricator/article/herald/ [01:16:20] herald is quite different from this :) [01:16:35] we could feed all monitoring data into it if we hate ourselves [01:16:42] herald sounds like gerrit-reviewer-bot [01:17:29] so, did you try the code review part of phab? [01:17:47] * YuviPanda hasn't [01:17:54] I gave up trying to set that up after a while [01:18:07] well...I mean I have used it a lot since I used it for work before :) [01:18:21] <^demon|lunch> I've done it. [01:18:27] <^demon|lunch> Not bad.
[01:19:27] btw, here's another option, this is what we used at former work(tm) [01:19:30] http://docs.pnp4nagios.org/pnp-0.4/start [01:19:40] that is a nagios/icinga plugin to feed performance data into rrd [01:19:48] without the additional ganglia part [01:19:49] I used pnp before too, but it's definitely showing it's age [01:20:44] HEY GUYS LET'S WRITE OUR OWN IN GO AND SCALA [01:20:46] fair, it isn't bleeding edge for sure [01:21:49] the end of every awesome tool-selection process is to write a better one than all the ones you evaluated, so that the selection process is easier for the next guy :) [01:22:07] obligatory http://xkcd.com/927/ [01:22:32] bblack: :) after that you have the new IRC bot to report it [01:22:35] at least the character encoding one seems... somewhat fixed, for some definition of 'fixed' [01:23:17] ori: we could use Limn for dashboards. [01:23:18] unless you're trying to interoperate between languages that prefer internal UCS-32 vs UTF-16 vs UTF-8 [01:23:32] oh i forgot limn too [01:23:43] yeah, or using languages that think UCS-2 is UTF-16 [01:23:54] s/UCS-32/UCS-4/ heh [01:24:00] I can't even keep all the acronyms straight [01:24:05] bblack: UCS-32 sounds awesome! :D [01:24:09] or using languages that think they're languages that htink UCS-2 is UTF-16 (coco) [01:24:24] Limn, heh. [01:26:17] what we need is a new universal encoding that fixes all the legacy problems of unicode and subsumes them into a single character set. It will use 7 bits per character at minimum to save 12% file size on most documents and mirror the first 128 code values of UTF-8, then have this awesome encoding scheme that re-encodes the rest of UTF-8 using RLE compression. [01:26:22] we'll call it UTF-WIN [01:27:26] (your string libraries will need to bit-shift the 7-bit codes around to display them, but we've got plenty of processor power for that these days!) 
[01:27:41] someone start an IETF group, quick [01:27:56] YuviPanda: !monitoring is http://downforeveryoneorjustme.com/${instance} :p [01:27:59] gotta run, laters [01:28:03] mutante: :D [01:28:28] bblack: should just use bytes that when rendered in binary look like the character. [01:28:50] hah [01:28:58] our clear text will be crypto and our crypto will be clear text [01:29:25] come to think of it, with UCS-4 you could just do an 4x8 bitmap as a low-resolution font in the same bitspace. [01:30:13] every possible glyph ever, at a certain crappy resolution. no need for character sets [01:30:57] heh [01:31:01] it shall be UTF-BLOB [01:32:21] alright, I shall sleep [01:32:53] chasemp: do noodle on it more, but once I complete sending more metrics into graphite/diamond on toollabs (nginx error counts, grid engine stats) I'll do something about it :) [01:33:05] sounds good [01:33:34] chasemp: disk just got full on *7* of our hosts, and we got notified on IRC again :( [01:33:42] * YuviPanda mumbles about 2G /vars [01:33:55] ah [01:33:57] only 2 hosts [01:34:04] 03-10 doesn't mean 03 through 10 :) [01:46:59] (03PS1) 10Yuvipanda: diamond: Keep only 2 days of local logs around [operations/puppet] - 10https://gerrit.wikimedia.org/r/143233 [01:47:43] * YuviPanda adds a bunch of people to review ^. [01:48:07] YuviPanda: could we instead make it a param and keep prod as is? [01:48:19] chasemp: sure. [01:48:29] chasemp: if that's the case I can probably even trim labs down to 1. [01:48:35] sounds good [01:48:42] chasemp: let me amend [01:53:01] (03PS2) 10Yuvipanda: diamond: Keep only 1 day of logs for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143233 [01:53:01] chasemp: updated. [01:54:55] I hate to nitpick you but the integers...are really strings even tho puppet is loose with the logic [01:55:28] if you look at say port numbers we define, strings representing integers especially for templates [01:55:29] chasemp: heh, alright. good point, though. 
[01:55:49] these are python ints but puppet strings [01:56:39] (03PS3) 10Yuvipanda: diamond: Keep only 1 day of logs for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143233 [01:56:44] chasemp: updated again. [01:58:13] (03CR) 10Rush: [C: 032 V: 032] diamond: Keep only 1 day of logs for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143233 (owner: 10Yuvipanda) [01:58:18] chasemp: woot, ty [01:59:23] I have had it get weird in a template where an integer in ruby you mean to be a string gets some template logic that works out with dynamic typing [01:59:27] but is insane to debug [01:59:38] chasemp: :D [01:59:52] chasemp: yeah, esp it is dsl -> another dsl -> python [02:00:03] (03PS4) 10Yuvipanda: labs: Enable diamond PuppetAgent collector on all nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/143193 [02:00:19] chasemp: ^ too if you're up for it. 'sok if it's too late :) [02:01:00] so for me to review collectors if you wouldn't mind I like to throw an example from the logs of what actual metrics they collect [02:01:34] I usually literally just paste from the log, but makes it clearer down the road, I haven't had time to look at what that outputs specifically and the docs in the header of the collector don't say [02:01:42] chasemp: sure. https://github.com/BrightcoveOS/Diamond/wiki/collectors-PuppetAgentCollector documents.
[02:02:07] chasemp: I can add a link to the commit message [02:02:09] ha that's fair, used to be in the __doc__ for collectors I guess they stopped [02:02:26] (03PS5) 10Yuvipanda: labs: Enable diamond PuppetAgent collector on all nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/143193 [02:02:36] chasemp: it's in __doc__, they just have a generator [02:03:01] I looked but possible I'm also blind, I was doing 3 things at once :) [02:03:08] : [02:03:55] actually lots of good stuff there, some of it kinda awkward to get into graphite like last_run [02:04:07] but tracking total run times across hosts over time [02:04:09] yeah I like that [02:04:37] chasemp: yeah, and I want to update that to actually put in $CURTIME - last_run [02:04:53] chasemp: so that'll be puppet freshness graph, and if we monitor based on that I can get puppet freshness alerts that way [02:04:57] ah, there is a built in derivative function in the collector super class :) [02:05:08] method even [02:05:08] yeah, or that :) [02:05:38] have you done the math on releasing this for disk space on graphite? [02:05:38] chasemp: if you like this enough I can make this happen in prod too. [02:05:53] that's a lot of new metrics [02:05:55] chasemp: heh, no. I'm just keeping a close eye on it, seems ok so far. [02:06:12] can you tell me how many hosts this hits? [02:06:16] or how many hosts in graphite now? [02:06:19] in labs I mean [02:07:01] chasemp: unsure how to count...
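What the PuppetAgent collector roughly does is flatten puppet's `last_run_summary.yaml` into dotted graphite metric names; the `$CURTIME - last_run` idea above then replaces the raw timestamp with a freshness gauge you can graph and alert on. A sketch of both steps, with assumed metric names (the real collector's naming may differ):

```python
import time

def flatten(prefix, data, out=None):
    """Flatten nested dicts into dotted graphite metric names,
    e.g. {'time': {'file': 1.2}} -> {'puppetagent.time.file': 1.2}."""
    out = {} if out is None else out
    for key, value in data.items():
        name = "%s.%s" % (prefix, key)
        if isinstance(value, dict):
            flatten(name, value, out)
        else:
            out[name] = value
    return out

def with_freshness(metrics, now=None):
    """Replace the raw last_run timestamp with seconds-since-last-run,
    which is the number worth graphing and alerting on."""
    now = now if now is not None else time.time()
    out = dict(metrics)
    if "puppetagent.time.last_run" in out:
        out["puppetagent.freshness"] = now - out.pop("puppetagent.time.last_run")
    return out
```

This also shows why the metric count worried chasemp below: `time` in the summary has one entry per resource type, so every defined type on a host becomes another series.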
if something like this [02:07:03] servers.hostname.puppetagent.time.file [02:07:13] is going to be for all defined types (unsure if just for builtins) [02:07:16] you could be looking at a lot [02:07:25] servers.hostname.puppetagent.time.package [02:07:33] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin [02:07:59] go to servers dir for whisper files [02:08:03] chasemp: am looking at the yaml file it gets its data from, moment [02:08:18] ah, yeah, running find on that should tell me [02:08:38] I use ls | wc -l [02:08:41] heh :) [02:08:56] chasemp: not on labs, since it is not all under servers [02:09:02] ah [02:09:02] each is namespaced by projectname [02:09:04] graphite.wmflabs.org [02:09:37] chasemp: 245 [02:10:00] how big are your whisper files? [02:10:39] idk I haven't done the math yet, and I don't know if the yaml file has stats on all defined types, but this could eat a lot of disk [02:10:54] chasemp: looking at the yaml file now [02:11:09] chasemp: it does report time for each resource type [02:11:28] I don't know how many we have, but it's more than the default [02:11:45] so take the original 20-ish metrics per host and add 10 say, maybe 30-40 per host [02:11:48] x whisper file size [02:11:50] x hosts [02:12:04] rough guess is mucho disk space [02:12:26] chasemp: hmm, it's currently 80G [02:12:53] which means I've to also cut out some other things I'm logging that I don't have a direct use for (NFS usage, user login usage) [02:13:08] I know the feeling of 'hey I can watch all this!'
:) [02:13:31] chasemp: :) [02:13:34] I think when I looked in prod we have sizeable whispers [02:13:49] could trim them down w/ less granularity over the long haul for labs [02:13:53] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Tue 01 Jul 2014 00:13:05 UTC [02:14:05] chasemp: yeah, I think that's the way to go [02:14:05] (03PS1) 10Yuvipanda: toollabs: Don't collect NFS stats [operations/puppet] - 10https://gerrit.wikimedia.org/r/143236 [02:14:09] my feeling is 1 minute has not much context in labs more than 72 hours out [02:14:20] chasemp: yeah, I'd agree again. [02:14:25] and then even at 30 minute blocks for labs, depending on the host you are probably going to see what you want [02:14:35] or who knows what blocks [02:14:42] but yeah, probably have to tweak that before you push this out [02:14:45] chasemp: I'd say much less granularity in general, and toollabs / betalabs can override for slightly more [02:15:14] !log LocalisationUpdate completed (1.24wmf10) at 2014-07-01 02:14:11+00:00 [02:15:24] Logged the message, Master [02:16:03] chasemp: for labs, 1minute for 7 days, 30m for 30 days, 1h for 1 year, 1d for 5y? [02:16:19] let me ask you [02:16:28] how many labs hosts down the 1 hour a year ago is going to be useful [02:16:34] maybe a lot idk [02:16:55] chasemp: well, for toollabs it would definitely be, at least to spot trends in usage. [02:16:55] I would say do that and see how big the file is :) [02:16:58] chasemp: yeah :) [02:17:05] chasemp: if I change it, will graphite automatically compact? [02:17:08] I have a few guidelines hang on [02:17:10] no [02:17:12] you can do it [02:17:18] manually there is a whisper tool [02:17:18] chasemp: ok!
[02:17:21] but otherwise remove and recreate [02:17:25] (easier) [02:17:44] I haven't publicly announced graphite yet, so theoretically all this data can be thrown away [02:17:57] I made these notes before [02:17:58] but toolserver died today, so would be nice to have some data from before that to compare to how toollabs performs after. [02:17:59] #1.4MB per stat [02:17:59] #retentions = 10s:1d,1m:30d,15m:1y,1h:3y [02:17:59] #512KB per stat [02:18:01] #retentions = 5m:30d,15m:90d,1h:3y [02:18:35] is that 512KB per stat per host? [02:18:39] yes [02:18:40] guh, stupid question. [02:19:00] note that's aggressive for here, as we don't do 10s intervals [02:19:05] right, so at about 250 hosts, 250 * 1.4MB is under 400MB [02:19:46] anyways just a few numbers, you'll have to play [02:19:52] 400 MB per stat, so a 140GB disk would give me... [02:19:59] about 400 stats per host [02:20:06] err [02:20:06] 350 [02:20:09] (03CR) 10Rush: [C: 04-1] "let's get some numbers on how this affects disk space. especially since it seems to log a metric per defined type, per host :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143193 (owner: 10Yuvipanda) [02:20:19] I -1'd...but not because it's a bad idea [02:20:27] just because I don't know what happens disk space wise [02:20:29] chasemp: yeah, seems sane to figure out a retention strategy *before* [02:20:46] my suggestion is [02:20:50] chasemp: also labs has restrictions on disk sizes, I think.
can't really go 'but we can just get a bigger disk' [02:20:51] go broader for disk space, etc [02:21:06] this one's already got the biggest disk labs has [02:21:12] because you don't care about 1 minute disk usage usually, 5 minutes gets you there [02:21:23] I meant like poll disk growth at 5 and so you can store it more cheaply [02:21:30] chasemp: right, so setup a storage schema that's more granular [02:21:45] rather than .* [02:21:53] the ones I pasted at the top were cpu say [02:21:57] and the bottom was disk usage [02:22:27] I never said to myself, in what minute in this five minute period did this disk grow :) [02:22:32] but I suppose it could happen [02:22:37] nah, makes sense for disk [02:23:08] I'm also learning as I go, (not a sysadmin! :D), so haven't really considered these questions before [02:23:47] it's all good, I've done a few rollouts like this so glad to at least provide some useful thoughts [02:24:12] :) [02:24:28] another thing is there are some things that are by nature ephemeral, like you need 1 minute for you don't care if it failed a year ago [02:24:34] and then also last thought, in labs at least [02:24:40] make your default catchall very small [02:24:50] so when someone tries to write a collector and sends a million random strings [02:24:52] !log LocalisationUpdate completed (1.24wmf11) at 2014-07-01 02:23:49+00:00 [02:24:53] right. CPU is probably ephemeral [02:24:55] you have a fighting chance [02:24:57] Logged the message, Master [02:25:14] idk cpu I like :) [02:25:24] but for labs maybe do percentage and not per item for cpu [02:25:25] ? [02:25:29] right. [02:25:43] I wonder if I can configure diamond to not send *all* the stats [02:25:50] like, I don't care about inode stats in disk usage, for example [02:26:00] so that's 4 extra stats per host I don't want/need [02:26:17] yeah I had to basically shadow the collector in puppet to overwrite with the more limited stats version before [02:26:24] right.
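chasemp's retention notes above can be sanity-checked directly: a whisper file stores roughly 12 bytes per datapoint (a 4-byte timestamp plus an 8-byte double), so file size follows from the retention string alone. A sketch that ignores the small per-file header:

```python
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 31536000}

def seconds(spec):
    """Parse a duration like '15m' -> 900 or '1y' -> 31536000."""
    return int(spec[:-1]) * UNITS[spec[-1]]

def whisper_bytes(retentions):
    """Approximate whisper file size for a retention string such as
    '10s:1d,1m:30d,15m:1y,1h:3y': 12 bytes per datapoint per archive."""
    total_points = 0
    for archive in retentions.split(","):
        precision, retention = archive.split(":")
        total_points += seconds(retention) // seconds(precision)
    return total_points * 12
```

Run against the two schemas quoted in the log, this lands close to the figures given: `whisper_bytes("10s:1d,1m:30d,15m:1y,1h:3y")` is about 1.36 MB and `whisper_bytes("5m:30d,15m:90d,1h:3y")` about 0.52 MB, matching the "1.4MB per stat" and "512KB per stat" notes. Multiply by stats per host and host count to get the disk math discussed above.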
[02:26:26] some of them do tho [02:27:08] chasemp: right. I wonder if upstream will accept a generic setting that lets us discard unused stats at the diamond level [02:27:17] like a [02:27:19] don't log this [02:27:21] regex [02:27:27] actually the disk one has that now if I recall [02:27:31] yeah [02:27:45] kormoc is the main guy [02:27:48] #python-diamond [02:27:52] he's actually really nice [02:27:53] it'll also need to account for variants between prod and labs [02:28:00] since I guess prod doesn't care as much about disk space [02:28:18] honestly we just have a crap load of disk space [02:28:24] hehe [02:28:25] so I got away with being permissive so far [02:28:28] labs images, though. [02:28:51] it's up to andrewbogott_afk or Coren but in this case maybe you could get more disk? [02:28:52] idk [02:28:57] chasemp: possibly, yeah [02:28:58] it's kind of a labs public utility if used right [02:29:02] (03PS2) 10Yuvipanda: toollabs: Don't collect NFS stats [operations/puppet] - 10https://gerrit.wikimedia.org/r/143236 [02:29:15] chasemp: yeah, I guess I'll need to make a new image. [02:30:05] I don't actually know if NFS metric is useful, since it won't tell us which user is hitting NFS hard [02:30:06] hmm [02:30:18] chasemp: I'll think more about this and talk to Coren and andrewbogott_afk [02:30:29] sounds good, let me know how it's going [02:30:37] chasemp: if it is possible to get a bigger disk, I see no harm in getting, say, 500G of /srv and then not worry about this [02:30:44] chasemp: would graphite slow down with age if it has too much data? [02:31:00] well yes and no, carbon will need tweaking to keep up with large incoming [02:31:08] but the UI doesn't care about how many metrics you have [02:31:11] right [02:31:15] just how many it has to touch when you do a GET [02:31:41] yeah, I'll talk to those two about it [02:32:10] chasemp: I'll go sleep now, but thanks a lot! very useful / informative :) [02:32:23] good deal, night!
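The "don't log this" regex idea discussed above (dropping unwanted series, like the inode stats, before they ever reach graphite and cost whisper space) is only a few lines wherever metrics pass through. The pattern and metric names here are illustrative, not diamond's actual config keys:

```python
import re

def drop_unwanted(metrics, exclude_patterns):
    """Filter a dict of metric-name -> value, discarding any name that
    matches one of the exclude regexes."""
    compiled = [re.compile(p) for p in exclude_patterns]
    return {name: value for name, value in metrics.items()
            if not any(rx.search(name) for rx in compiled)}
```

For example, `drop_unwanted(collected, [r"\.inodes_"])` would strip the four-ish inode series per host mentioned above while leaving byte-usage metrics untouched, without having to shadow the whole collector in puppet.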
at the back of my mind something is telling me that I'm doing all this work so I can be woken up at night when toollabs goes down... :) [02:33:19] maybe a good thing. [02:33:33] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Tue Jul 1 02:33:27 UTC 2014 [02:33:37] it's good if the alternative is finding out 3 days later you lost a week's worth of work :) [02:33:43] chasemp: :D indeed. [02:33:56] chasemp: nobody's lost their work yet, though. but angry users happen often :) [02:34:04] for some definition of often, of course [02:34:09] but still too often for my liking [02:34:39] ok, sleep [02:34:42] it's 8AM :| [02:34:44] bye [02:51:11] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jul 1 02:50:05 UTC 2014 (duration 50m 4s) [02:51:16] Logged the message, Master [04:05:50] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 992.807696236 [04:32:03] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:41:11] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time [06:13:00] !log springle Synchronized wmf-config/db-eqiad.php: depool db1060 during schema changes (duration: 00m 07s) [06:13:05] Logged the message, Master [06:29:11] !log ran fixInvalidStudent.php --wiki=enwiki --courseId=359 for bug 66624 [06:29:16] Logged the message, Master [06:31:24] !log springle Synchronized wmf-config/db-eqiad.php: repool db1060 (duration: 00m 06s) [06:31:27] Logged the message, Master [06:33:59] !log springle Synchronized wmf-config/db-eqiad.php: depool db1063 during schema changes (duration: 00m 06s) [06:34:04] Logged the message, Master [06:36:27] * matanya looks for _joe_ [06:38:15] springle: is it ok if I run a script to delete some corrupt entries on the centralauth db?
just want to make sure you're not doing anything related to it. (https://bugzilla.wikimedia.org/show_bug.cgi?id=66535) [06:38:47] legoktm: go ahead. i'm not touching s7 atm [06:39:24] ok, thanks [06:40:21] !log starting to run checkLocalNames.php and checkLocalUser.php for some wikivoyages to clean up bug 66535 [06:40:26] Logged the message, Master [06:55:35] nice [06:56:53] (03PS1) 10Springle: Reduce m[23] binlog expiry to 7 days. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143248 [06:58:02] !log springle Synchronized wmf-config/db-eqiad.php: repool db1063 (duration: 00m 06s) [06:58:05] Logged the message, Master [06:58:28] (03CR) 10Springle: [C: 032] Reduce m[23] binlog expiry to 7 days. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143248 (owner: 10Springle) [06:59:24] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin [07:01:24] (03PS2) 10Krinkle: dynamicproxy: Let 403 and 404 responses pass through [operations/puppet] - 10https://gerrit.wikimedia.org/r/143081 [07:06:05] !log springle Synchronized wmf-config/db-eqiad.php: depool db1067 during schema changes (duration: 00m 06s) [07:06:09] Logged the message, Master [07:16:45] !log springle Synchronized wmf-config/db-eqiad.php: repool db1067 (duration: 00m 12s) [07:16:50] Logged the message, Master [07:22:14] !log finished running checkLocalNames.php and checkLocalUser.php for some wikivoyages to clean up bug 66535 [07:22:19] Logged the message, Master [08:13:12] (03PS1) 10Matanya: mailrelay: convert 'true' into a real boolean [operations/puppet] - 10https://gerrit.wikimedia.org/r/143251 [08:22:11] (03PS1) 10Giuseppe Lavagetto: rcstream: add ipv6 addresses to backends [operations/puppet] - 10https://gerrit.wikimedia.org/r/143252 [08:27:10] (03CR) 10Giuseppe Lavagetto: [C: 032] rcstream: add ipv6 addresses to backends [operations/puppet] - 10https://gerrit.wikimedia.org/r/143252 (owner: 10Giuseppe Lavagetto) [08:47:23] !log ms-be3003 sdk1 disk to 0 weight 
[08:47:28] Logged the message, Master [08:49:32] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:49:46] (03PS3) 10Nemo bis: dynamicproxy: Let 403 and 404 responses pass through [operations/puppet] - 10https://gerrit.wikimedia.org/r/143081 (owner: 10Krinkle) [08:50:22] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.005 second response time [09:38:29] from linode http://pastie.org/9342856 [09:44:00] well trace from eqiad to your next hope 217.0.117.212 seems to work :-( [09:44:01] no clue [09:44:19] (03PS11) 10Giuseppe Lavagetto: Improve nginx TLS cipher list & session timeout [operations/puppet] - 10https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: 10JanZerebecki) [09:44:37] <_joe_> hey ho, let's go [09:45:21] <_joe_> in ~ 15 minutes, PFS would be enabled on wikipedias. [09:45:31] \o/ \o/ [09:45:33] \O/ [09:45:43] (no clue what PFS stands for, but that sounds exciting) [09:45:50] Pure Fast Speed [09:46:10] <_joe_> Perfect Forward Secrecy [09:46:20] <_joe_> http://en.wikipedia.org/wiki/Forward_secrecy if you trust wikipedia [09:46:23] <_joe_> :P [09:47:54] =DDDDDDDD [09:47:57] woot \o/ [09:53:02] I think we should just tell people hashar's version :P [09:57:59] <_joe_> legoktm: it would actually be slightly slower [09:58:15] <_joe_> in my evaluations, non-observably slower, but still [09:58:17] shshsh, don't tell them [09:58:36] <_joe_> matanya: I've been very transparent in ops@ [09:58:45] <_joe_> the benefit for users are really there, so... [10:00:04] _joe_: The time is nigh to deploy Ops (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140701T1000) [10:00:17] <_joe_> jouncebot: thanks [10:00:29] _joe_: that is your cue I suppose :-) [10:00:36] <_joe_> (how do I notify this bot I've released?) 
[10:00:40] <_joe_> akosiaris: :) [10:01:21] you don't need to tell the bot anything, it just gives reminders [10:01:48] <_joe_> Oh I won't forget [10:01:57] <_joe_> it's a potentially fatal change :P [10:02:21] haha [10:02:45] we will see in a few minutes [10:02:51] https://github.com/mattofak/jouncebot [10:03:03] one way communication ? bah... [10:03:38] the turing test was (almost) beaten, we can have more :P [10:03:44] it just there to make sure you don't miss your deployment window and inform others something is going on [10:04:00] yeah I know, just making fun :-) [10:05:36] <_joe_> akosiaris: that damn dpkg-query for hadoop and the rest is making me nuts [10:06:34] i hinted yesterday about what i think is wrong [10:06:49] <_joe_> it is released but I'll have to restart nginxes by hand it seems [10:06:54] <_joe_> s/restart/reload/ [10:08:37] <_joe_> mh no actually this will need a restart [10:08:58] <_joe_> !log restarting nginx on ssl100* servers in sequence, to activate PFS [10:09:04] Logged the message, Master [10:09:17] _joe_: hmm ? let me read backlog [10:11:27] <_joe_> akosiaris: changing nginx.conf does not restart nginx on ssl100*/ssl300* [10:11:44] <_joe_> so that ops can do that gradually when they feel like it [10:11:55] cool [10:11:56] <_joe_> we can also depool those hosts from pybal before restarting [10:12:03] <_joe_> but I don't see a point honestly [10:12:28] I think you have a point there [10:12:28] <_joe_> it a less-than-a-second hiccup [10:12:52] <_joe_> not sure if nginx restarting is better or worse than depooling a server [10:12:59] well, assuming the config does not break the restart and nginx is capable of restarting [10:13:14] well reloading more like it [10:13:17] <_joe_> i already tested that with a server I depooled :) [10:13:23] <_joe_> reloading will not be enough [10:13:29] <_joe_> not in this case [10:14:01] needs to reinitialize the SSL socket ? 
[10:14:07] <_joe_> yes [10:14:34] <_joe_> needs to reload the correct chiphers as well and it's not something they can do by reloading [10:14:49] <_joe_> it's strange given how good is nginx at reloading configs [10:15:00] <_joe_> you know what? I'll depool the servers while doing the restart, it costs me nothing [10:15:32] not following on the dpkg-query/hadoop thingy btw [10:15:43] backlog was not really enlightening [10:15:45] <_joe_> see otto's email @ops [10:16:52] ah ok. Yeah I look into it [10:17:27] <_joe_> akosiaris: on trustys you see the full message which shows that is a dpkg-query [10:25:02] akosiaris: if you are interested in my bit on the hadoop thingy, poke [10:25:51] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This is almost LGTM. Two minor issues and I think we are good to go" (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [10:26:25] matanya: As soon as I clean up the queue of other things, will do. Thanks [10:30:32] <_joe_> !log all eqiad SSL terminators are now PFS enabled. Moving to rolling restarting esams [10:30:38] Logged the message, Master [10:32:39] <_joe_> https://www.ssllabs.com/ssltest/analyze.html?d=en.wikipedia.org :) [10:32:40] akosiaris: https://gerrit.wikimedia.org/r/#/c/139095/32/modules/cxserver/templates/logrotate.erb is syntax error without quotes. [10:32:46] <_joe_> (this points to eqiad) [10:35:20] kart_: responded privately [10:35:40] :) [10:38:00] <_joe_> !log esams restart finished, moving to ulsfo [10:38:06] Logged the message, Master [10:39:04] (03PS33) 10KartikMistry: cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [10:41:26] kart_: LGTM. Got any specific requirements about merging it ? Or can I do it ? [10:41:59] feel free :) [10:42:11] hurray [10:42:46] _joe_: did that include nginx on esams text caches, too? [10:42:57] btw. YAY! 
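An editor's aside on the reload-vs-restart exchange above: enabling PFS means putting ECDHE (forward-secret) suites at the front of the cipher list. A minimal nginx sketch of that kind of change follows — the suite names and timeout here are illustrative only; the list actually deployed is in gerrit change 132393 and is not reproduced here:

```nginx
# Prefer the server's ordering so the ECDHE (forward-secret) suites win.
ssl_prefer_server_ciphers on;
# Illustrative ECDHE-first list; NOT the exact production cipher string.
ssl_ciphers 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA:AES128-SHA';
ssl_session_timeout 5m;
```

As the log notes, a config reload was not enough to pick up the new ciphers on these terminators; a full nginx restart (optionally depooling the host from pybal first) was used. A handshake can be spot-checked with `openssl s_client -connect en.wikipedia.org:443`, which reports the negotiated cipher suite.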
[10:42:58] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [10:43:02] <_joe_> jzerebec1i: ulsfo you mean [10:43:11] <_joe_> ulsfo is going live now [10:43:32] no i meant if you already did the text caches in esams? [10:43:50] <_joe_> jzerebec1i: ssl terminates on ssl300* in esams AFAIK [10:44:35] akosiaris: Thank you :) [10:44:40] (03PS1) 10QChris: Make backup handle ensuring 'absent' [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/143266 [10:45:00] I suppose we should keep an eye on https://gdash.wikimedia.org/dashboards/frontend/ [10:45:21] (03PS1) 10QChris: Append hostname to wikimetrics backup target [operations/puppet] - 10https://gerrit.wikimedia.org/r/143267 [10:45:23] (03PS1) 10QChris: Have wikimetrics cleanup its backup upon absenting it [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 [10:45:36] <_joe_> Nemo_bis: yes, I'd be surprised if something happens [10:45:46] AES_128_CBC, TLS v1.1, ECDHE-RSA [10:45:56] kind of weird... [10:46:04] what's wrong with my chrome ? [10:46:36] (03CR) 10jenkins-bot: [V: 04-1] Append hostname to wikimetrics backup target [operations/puppet] - 10https://gerrit.wikimedia.org/r/143267 (owner: 10QChris) [10:47:00] _joe_: it's mostly IE having problems right? 
maybe at 99th percentile something moves [10:47:03] (03CR) 10jenkins-bot: [V: 04-1] Have wikimetrics cleanup its backup upon absenting it [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [10:47:08] (but not so far) [10:47:20] <_joe_> Nemo_bis: it's slightly slower [10:47:32] <_joe_> I think connection slowness will be much more influential [10:48:36] _joe_: once you are done, please let me know if i can purge is_puppet_master in site.pp [10:49:18] <_joe_> matanya: I don't think so, that has to do with labs puppetmasters I guess [10:49:51] <_joe_> !log nginx restarted on all ulsfo hosts as well, we should be PFS-enabled now [10:49:55] Logged the message, Master [10:50:06] (03CR) 10Milimetric: [C: 04-1] Have wikimetrics cleanup its backup upon absenting it (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [10:50:39] ok _joe_just relaying a question by andrewbogott [10:52:05] <_joe_> matanya: I think I can take a look with andrew this evening [10:52:16] thank you [10:54:35] (03CR) 10QChris: Have wikimetrics cleanup its backup upon absenting it (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [10:55:59] (03PS2) 10QChris: Have wikimetrics cleanup its backup upon absenting it [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 [10:56:01] (03PS2) 10QChris: Append hostname to wikimetrics backup target [operations/puppet] - 10https://gerrit.wikimedia.org/r/143267 [10:57:12] (03CR) 10jenkins-bot: [V: 04-1] Have wikimetrics cleanup its backup upon absenting it [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [11:15:40] _joe_: all prod is now puppet 3? including precise and below boxes ? [11:15:45] yes [11:15:57] well I guess not lucid [11:16:13] * matanya goes to purge some code ... :) [11:16:20] what code? [11:16:32] modules/apt/manifests/puppet.pp [11:16:42] all the 2.7 if part [11:17:03] <_joe_> lucid is not. 
[11:17:09] too bad [11:17:32] <_joe_> matanya: well add me as a reviewer, not sure it obvious what you can remove and what not [11:17:36] still can fix it do be lucid instead of precise [11:17:54] <_joe_> matanya: no need to do anything for lucids [11:18:09] not even asking about hardy [11:18:28] <_joe_> are you sure you understand correctly how and why we did that kind of pinning? [11:18:29] so i'll think what would be the best way to remove this part of code [11:18:36] i think i have [11:18:48] <_joe_> ok, so go on :) [11:19:01] <_joe_> sorry, lunch [11:19:37] to verify, we prefer wikimedia over ubuntu everywhere, but in case of precise with 2.7 we tell the bx to prefer ubuntu [11:19:45] *box [11:23:30] (03PS1) 10Matanya: apt: minor lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/143271 [11:26:45] (03PS1) 10QChris: Document wikimetrics' use of the $::wikimetrics_backup global [operations/puppet] - 10https://gerrit.wikimedia.org/r/143273 [11:30:26] (03CR) 10jenkins-bot: [V: 04-1] Document wikimetrics' use of the $::wikimetrics_backup global [operations/puppet] - 10https://gerrit.wikimedia.org/r/143273 (owner: 10QChris) [11:33:54] (03PS1) 10QChris: Lint: Quote global [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/143275 [11:37:24] (03PS2) 10QChris: Lint: Scope global [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/143275 [11:40:55] (03PS1) 10Raimond Spekking: Enable RSS extension for dewiki on Labs for testing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 [11:41:07] (03CR) 10jenkins-bot: [V: 04-1] Enable RSS extension for dewiki on Labs for testing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 (owner: 10Raimond Spekking) [11:47:46] !log reedy Synchronized database lists: (no message) (duration: 00m 16s) [11:47:51] Logged the message, Master [11:48:16] !log reedy Synchronized wmf-config/InitialiseSettings.php: touch (duration: 00m 14s) [11:48:20] Logged the 
message, Master [11:48:42] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: (no message) [11:48:47] Logged the message, Master [11:50:18] PROBLEM - Disk space on ms-be3003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdb3 74954 MB (4% inode=97%): /srv/swift-storage/sda3 88432 MB (4% inode=98%): /srv/swift-storage/sde1 106033 MB (5% inode=98%): /srv/swift-storage/sdh1 84818 MB (4% inode=98%): /srv/swift-storage/sdj1 93190 MB (4% inode=98%): /srv/swift-storage/sdl1 89435 MB (4% inode=98%): /srv/swift-storage/sdi1 81889 MB (4% inode=98%): /srv/swift-storage/ [11:51:32] (03CR) 10Reedy: [C: 032] Initialize some settings for wikimania 2015 wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139279 (https://bugzilla.wikimedia.org/66370) (owner: 10Withoutaname) [11:51:32] (03Merged) 10jenkins-bot: Initialize some settings for wikimania 2015 wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139279 (https://bugzilla.wikimedia.org/66370) (owner: 10Withoutaname) [11:51:47] (03PS2) 10Raimond Spekking: Enable RSS extension for dewiki on Labs for testing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 [11:53:00] !log Manually created wikimania2015wiki database on 10.64.16.18 [11:53:05] Logged the message, Master [11:55:22] !log reedy Synchronized database lists: (no message) (duration: 00m 28s) [11:55:27] Logged the message, Master [11:55:47] !log reedy Synchronized wmf-config/InitialiseSettings.php: touch (duration: 00m 13s) [11:55:52] Logged the message, Master [12:00:48] !log Manually created Echo tables on extension1 [12:00:54] Logged the message, Master [12:01:28] (03CR) 10Hoo man: [C: 04-1] "Only had a quick look" (033 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 (owner: 10Raimond Spekking) [12:01:28] (03PS1) 10Reedy: Disable Echo on wikimania2015wiki till tables created [operations/mediawiki-config] - 
10https://gerrit.wikimedia.org/r/143278 [12:01:28] (03PS3) 10Raimond Spekking: Enable RSS extension for dewiki on Labs for testing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 [12:01:28] (03CR) 10Raimond Spekking: "Thanks Hoo man for the quick review. It's my first commit to labs :-)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 (owner: 10Raimond Spekking) [12:01:28] (03Abandoned) 10Reedy: Disable Echo on wikimania2015wiki till tables created [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143278 (owner: 10Reedy) [12:02:33] (03CR) 10QChris: "> I've added qchris as a reviewer so he can check the geowiki" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142483 (owner: 10Scottlee) [12:03:37] !log reedy Synchronized wmf-config/interwiki.cdb: (no message) (duration: 00m 13s) [12:03:42] Logged the message, Master [12:05:35] (03CR) 10Hoo man: [C: 04-1] "Sorry if my comment was confusing" (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 (owner: 10Raimond Spekking) [12:05:45] (03PS1) 10Reedy: sync-common-file is no more, use sync-file [operations/puppet] - 10https://gerrit.wikimedia.org/r/143279 [12:06:02] (03PS4) 10Raimond Spekking: Enable RSS extension for dewiki on Labs for testing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 [12:06:02] (03CR) 10jenkins-bot: [V: 04-1] Enable RSS extension for dewiki on Labs for testing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 (owner: 10Raimond Spekking) [12:08:38] (03PS5) 10Raimond Spekking: Enable RSS extension for dewiki on Labs for testing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 [12:13:24] (03CR) 10Hoo man: [C: 032] "Should be fine now :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 (owner: 10Raimond Spekking) [12:13:31] (03Merged) 10jenkins-bot: Enable RSS extension for dewiki on Labs for testing 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 (owner: 10Raimond Spekking) [12:14:18] Reedy: --^ as you're currently messing with stuff... can you sync that out while you're on it [12:15:04] (03PS1) 10Reedy: Update IW cache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143280 [12:15:23] (03PS2) 10Reedy: Update IW cache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143280 [12:15:29] (03CR) 10Reedy: [C: 032] Update IW cache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143280 (owner: 10Reedy) [12:15:34] (03Merged) 10jenkins-bot: Update IW cache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143280 (owner: 10Reedy) [12:16:07] !log reedy Synchronized wmf-config/InitialiseSettings-labs.php: (no message) (duration: 00m 20s) [12:16:12] Logged the message, Master [12:16:15] thx [12:21:03] (03CR) 10Whym: "@TTO: I asked MaxSem last week or so, and he seemed to be waiting just in case there are other opinions on the choice of the feed name, if" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136316 (https://bugzilla.wikimedia.org/66015) (owner: 10Whym) [12:21:21] (03PS1) 10Reedy: Fix wgServer/wgCanonicalServer for noboards [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143281 [12:21:37] (03PS2) 10Reedy: Fix wgServer/wgCanonicalServer for noboards [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143281 [12:21:42] (03CR) 10Reedy: [C: 032] Fix wgServer/wgCanonicalServer for noboards [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143281 (owner: 10Reedy) [12:21:48] (03Merged) 10jenkins-bot: Fix wgServer/wgCanonicalServer for noboards [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143281 (owner: 10Reedy) [12:22:18] !log reedy Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 18s) [12:22:22] Logged the message, Master [12:22:45] (03PS7) 10Reedy: Remove remnants of . 
replaced with _ in "lang" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134948 [12:23:05] (03CR) 10Reedy: [C: 032] Remove remnants of . replaced with _ in "lang" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134948 (owner: 10Reedy) [12:23:13] (03Merged) 10jenkins-bot: Remove remnants of . replaced with _ in "lang" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134948 (owner: 10Reedy) [12:23:47] !log reedy Synchronized multiversion/: (no message) (duration: 00m 23s) [12:23:53] Logged the message, Master [12:28:34] _joe_: have moment for the apt::puppet change ? [12:29:22] <_joe_> matanya: in 10 minutes :) [12:29:27] thanks [12:41:06] _joe_: The time is nigh to hear what matanya has to say [12:41:48] <_joe_> matanya: :P [12:41:59] <_joe_> I had just opened your change [12:43:32] <_joe_> matanya: AmA [12:44:23] I didn't push any yet _joe_ [12:45:05] <_joe_> matanya: yes I got confused by the apt lint one [12:45:14] looking at the code, in modules/apt/manifests/puppet.pp I see if ($version == '2.7') [12:45:41] that part seems like it can be ditched, since no 2.7 should be preferred from ubuntu over the one from wikimedia [12:46:27] <_joe_> yes, just be sure no file would be kept around [12:46:38] <_joe_> it sould not [12:47:17] as for the else: trusty should follow ubuntu or wikimedia ? 
[12:47:21] (03PS1) 10Filippo Giunchedi: swift: add icehouse pinning to ms-be1* [operations/puppet] - 10https://gerrit.wikimedia.org/r/143282 [12:48:53] _joe_: my only doubt is the elsif which seems like it should stay [12:49:14] <_joe_> matanya: so remove the whole conditional [12:49:26] <_joe_> matanya: submit your change and I'll review it :) [12:49:33] sure [12:49:35] thanks [12:54:30] (03PS1) 10Matanya: apt::puppet: remove puppet 2.7 conditional [operations/puppet] - 10https://gerrit.wikimedia.org/r/143283 [12:57:16] (03CR) 10Matanya: "I would remove the entire file and just use the files from our apt repo, do we still need pinning for puppet?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143283 (owner: 10Matanya) [13:04:31] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "We should:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143283 (owner: 10Matanya) [13:08:10] !log dist-upgrade and reboot boron [13:08:14] Logged the message, Master [13:15:14] (03CR) 10Giuseppe Lavagetto: [C: 031] Add apache::conf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 (owner: 10Ori.livneh) [13:16:10] !log dist-upgrade and reboot tellurium [13:16:15] Logged the message, Master [13:18:32] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I do agree with the principle, but please reimplement." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 (owner: 10Ori.livneh) [13:31:38] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I will echo Giuseppe on that front" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 (owner: 10Ori.livneh) [13:31:49] (03PS2) 10Giuseppe Lavagetto: add apache::param resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 (owner: 10Ori.livneh) [13:32:44] ::param? 
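The conditional being removed in change 143283 implemented the pinning described earlier in the log: apt prefers the Wikimedia repo everywhere, except that precise hosts kept on puppet 2.7 were told to prefer Ubuntu's packages. A minimal apt preferences sketch of that exception — the path, package list, origin label and priority are all assumptions for illustration, not the real file:

```
# /etc/apt/preferences.d/puppet -- illustrative sketch only
Package: puppet puppet-common
Pin: release o=Ubuntu
Pin-Priority: 1001
```

With puppet 3 everywhere in prod (lucid aside), the exception — and, per the review comment above, possibly the whole fragment — can go.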
:) [13:33:23] <_joe_> paravoid: well, name sucks [13:33:38] <_joe_> ::envvar would make more sense probably [13:33:48] yes [13:33:54] <_joe_> doing that [13:33:59] <_joe_> ori will hate me :P [13:36:04] (03CR) 10Ottomata: "Great, ok, just wanted to double check. Probably just an old typo. Thanks!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142483 (owner: 10Scottlee) [13:38:18] (03CR) 10Milimetric: [C: 031] "I made a mistake in the previous review, `ensure: absent` has been implemented properly." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [13:38:33] (03CR) 10Ottomata: [C: 032 V: 032] Make backup handle ensuring 'absent' [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/143266 (owner: 10QChris) [13:38:44] (03PS3) 10Ottomata: Append hostname to wikimetrics backup target [operations/puppet] - 10https://gerrit.wikimedia.org/r/143267 (owner: 10QChris) [13:39:31] (03CR) 10Ottomata: [C: 032 V: 032] Append hostname to wikimetrics backup target [operations/puppet] - 10https://gerrit.wikimedia.org/r/143267 (owner: 10QChris) [13:39:48] (03PS3) 10Ottomata: Have wikimetrics cleanup its backup upon absenting it [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [13:40:01] (03CR) 10Ottomata: [C: 032 V: 032] Lint: Scope global [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/143275 (owner: 10QChris) [13:40:02] 3 [13:40:24] Who is stealing window focus :-) [13:40:59] (03CR) 10Ottomata: Have wikimetrics cleanup its backup upon absenting it (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [13:43:45] (03PS1) 10Alexandros Kosiaris: Remove puppet freshness check and all dependencies [operations/puppet] - 10https://gerrit.wikimedia.org/r/143304 [13:43:47] (03PS1) 10Alexandros Kosiaris: Remove the snmptt user [operations/puppet] - 10https://gerrit.wikimedia.org/r/143305 [13:43:49] (03PS1) 10Alexandros Kosiaris: Remove the last 
resources of snmp on hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/143306 [13:44:37] (03PS3) 10Giuseppe Lavagetto: add apache::envvar resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 (owner: 10Ori.livneh) [13:45:29] (03PS4) 10QChris: Have wikimetrics cleanup its backup upon absenting it [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 [13:46:20] (03CR) 10QChris: Have wikimetrics cleanup its backup upon absenting it (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [13:47:28] (03PS2) 10QChris: Document wikimetrics' use of the $::wikimetrics_backup global [operations/puppet] - 10https://gerrit.wikimedia.org/r/143273 [13:47:55] (03CR) 10Giuseppe Lavagetto: [C: 031] role::mediawiki::webserver: set maxclients dynamically [operations/puppet] - 10https://gerrit.wikimedia.org/r/137947 (owner: 10Ori.livneh) [13:48:05] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin [13:49:25] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: Fetching origin [13:50:33] ottomata: ^ [13:50:36] (03PS1) 10Reedy: Non Wikipedias to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143309 [13:51:03] <_joe_> note to all opsens with merge rights: do NOT sudo puppet-merge [13:51:10] oh? [13:51:13] <_joe_> sudo -i; puppet-merge [13:51:22] oh that's the issue with strontium? [13:51:25] ack [13:51:26] <_joe_> paravoid: doing sudo makes the git-merge hook fail [13:51:26] sorry paravoid [13:51:28] <_joe_> it seems [13:51:29] <_joe_> :) [13:51:35] got it [13:51:39] I'm guilty of sudo puppet-merge [13:51:41] we should fix that though [13:51:43] <_joe_> paravoid: we may want to fix that, but i don't have time [13:51:53] * paravoid points to ottomata [13:51:58] to fix? 
[13:51:58] as the author of puppet-merge [13:51:59] ;) [13:51:59] :D [13:52:02] :D [13:52:05] yeah i'd be happy to do that [13:52:06] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin [13:52:23] just on strontium? or everywhere? [13:52:25] RECOVERY - Unmerged changes on repository puppet on palladium is OK: Fetching origin [13:52:37] sudo puppet-merge on palladium fails? that's the issue? [13:54:27] <_joe_> ottomata: it does not fail, it's the post-merge hook that makes the other masters merge as well that fails in that case [13:55:02] hm [13:55:30] <_joe_> ottomata: and you just entered the login-shell-sudo env variables hell [13:55:32] <_joe_> good luck [13:55:36] oh because the post merge is an ssh thing [13:55:37] ? [13:56:18] <_joe_> check by yourself, I just got the error on strontium (git appears unconfigured) and the cause [13:56:23] yeah looking [13:56:51] <_joe_> I'm pretty sure there is an implicit ssh user somewhere or you using $USER or some other envvar [13:57:35] yeah, i think ther eprobably need to be some special sudo rules for this gituser or something [13:57:42] this does look quite annoying......... [13:57:58] <_joe_> told ya :) [13:58:07] sudo -s ; puppet-merge also works fine [13:58:21] a stopgap could be to just exit if it is known it'll fail [13:58:39] not a solution, but better than icinga-wm [13:58:45] ottomata: I figured the dpkg-query puppet thing, replying to your email [13:58:49] that would be a godo first step [13:58:53] oh thanks akosiaris! 
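The `sudo puppet-merge` breakage discussed above comes down to environment: plain `sudo` keeps the caller's `$USER`, so the post-merge hook's gitpuppet check fails, while `sudo -i` / `sudo -s` give a login shell where it passes. A defensive sketch of the stopgap floated in the log (detect the bad invocation and exit early rather than half-merge) — the function, messages and the `someadmin` placeholder are illustrative assumptions, not the actual hook:

```shell
#!/bin/sh
# Guard a hook step that must only run as gitpuppet. Under plain
# `sudo`, $USER can still name the invoking admin, so check explicitly
# and refuse to proceed instead of failing halfway through.
check_hook_user() {
    # $1: the user name the hook sees (e.g. "$USER")
    if [ "$1" = "gitpuppet" ]; then
        echo "ok"
    else
        echo "refusing: hook user is $1, expected gitpuppet"
        return 1
    fi
}

check_hook_user gitpuppet           # prints: ok
check_hook_user someadmin || true   # prints: refusing: hook user is someadmin, expected gitpuppet
```

Checking `id -un` instead of `$USER` would sidestep the sudo environment entirely; either way the point is to fail loudly up front rather than leave strontium unmerged and let icinga-wm complain.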
[14:02:31] (03CR) 10Ottomata: [C: 032 V: 032] Have wikimetrics cleanup its backup upon absenting it [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [14:02:54] (03CR) 10Ottomata: [C: 032 V: 032] Document wikimetrics' use of the $::wikimetrics_backup global [operations/puppet] - 10https://gerrit.wikimedia.org/r/143273 (owner: 10QChris) [14:06:00] (03CR) 10Andrew Bogott: [C: 032] dynamicproxy: Let 403 and 404 responses pass through [operations/puppet] - 10https://gerrit.wikimedia.org/r/143081 (owner: 10Krinkle) [14:09:12] (03CR) 10Alexandros Kosiaris: Add apache::conf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 (owner: 10Ori.livneh) [14:11:19] akosiaris: I'm still puzzled by that puppet icinga test. I thought I understood it briefly (that it has two modes and we're only using the freshness mode) so wrote https://gerrit.wikimedia.org/r/#/c/143161/ [14:11:37] …but now icinga is reporting 0 failures for a box that can't compile, so I think that patch is insufficient [14:11:51] which box ? [14:12:02] virt1008 [14:12:40] I believe the current test should be showing green (since puppet is running) but also saying that there are 99 errors. [14:12:43] OK: Puppet is currently enabled, last run 223 seconds ago with 99 failures [14:13:03] well, dammit, a second ago it said 0 failures [14:13:35] I guess I will wait until it does that again and then ping you :/ [14:13:42] meanwhile, does the above patch seem correct to you? [14:13:47] so, I think we should do away with freshness intervals and all that jazz and move to "if failures > 0" croak [14:13:50] (03PS1) 10Manybubbles: Configure cache warmers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 [14:14:03] in theory yes but I think we should not have 2 [14:14:16] Isn't freshness meaningful though? If puppet is not running at all [14:14:21] then we won't get any failure reporting [14:14:37] not running at all ? why ? [14:14:43] disabled ? 
we got a check for that [14:14:56] I don't know why! But I don't want to miss it if it happens [14:15:17] the post-merge check looks at $USER is probably why [14:15:19] akosiaris: locks for example [14:15:32] matanya: that is disabled [14:15:37] Or if puppet itself has a failure (rather than the puppet manifests) [14:15:42] it only updates strontium from palladium if $USER is gitpuppet [14:16:15] akosiaris: you mean Notice: Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists) [14:16:15] is disabled ? [14:16:45] matanya: not that is not disabled, but that does not matter it is a race, it will run normally the next interval [14:16:51] bblack@palladium:~$ sudo echo $USER [14:16:52] bblack [14:16:57] oops that's wrong [14:17:04] bblack: puppet-merge does su IIRC [14:17:08] akosiaris: not if the pid gets stuck [14:17:30] happened to me, that is why i bring it up [14:17:58] matanya: like the entire puppet agent run going CPU crazy and never ending and never doing anything ? [14:18:14] akosiaris: we could also add a second set of args to your test so that it can test for freshness and also for failures in the same run [14:18:17] so the process is there and in S state ? [14:18:25] But it seems just as well to have two separate line-items in icinga [14:18:31] that, or the pid isn't cleared after death/failed run/crash [14:18:51] maybe sudo -i puppet-merge ? [14:18:59] that is already checked IIRC matanya [14:19:06] then cool [14:19:29] andrewbogott: adding another 1000 checks to neon for this does not make me exactly happy [14:19:43] we should try to merge those checks into one [14:19:46] akosiaris: ok -- would you like me to modify the tool so it checks both? [14:20:01] how about all three ? [14:20:10] I started to do that yesterday and then I though, no, akosiaris clearly meant this to be two separate modes :) [14:20:12] disabled, failures, time ? 
[14:20:17] sure [14:20:36] it'll make for a long commandline but should be easy [14:20:39] akosiaris: any word on kafka 0.8.1.1? [14:21:04] ottomata: kafka 0.8.1 or gradle ? [14:21:18] uh, both? [14:21:20] (03PS1) 10Hoo man: Fix extension WikiHiero typo [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143318 [14:21:25] I can probably have a 0.8.1 this week. gradle.. no [14:21:29] 0.8.1.1 package feesibility [14:21:34] oh gradle schmadle, i don't care [14:21:41] i just want a pacakge :) [14:21:50] I looked at the changes, it seems it will be easy to upgrade to 0.8.1 [14:21:55] ok awesome [14:22:00] 0.8.1.1 btw! [14:22:28] ok :-) [14:22:38] (03PS2) 10Hoo man: Fix extension WikiHiero typo [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143318 [14:25:04] (03PS1) 10Manybubbles: Juggle shard counts for Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143319 [14:27:18] !log Stopping Jenkins it has some corrupted threads [14:27:23] Logged the message, Master [14:41:49] (03PS2) 10Krinkle: Configure cache warmers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 (owner: 10Manybubbles) [14:43:41] akosiaris: ok, look at virt1008 now. 
When I run puppet agent -tv from the commandline it reports 99 errors but when it runs as part of a normal refresh (as it just now did) it reports 0 [14:43:53] …I think [14:44:07] (03PS3) 10Krinkle: Configure cache warmers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 (owner: 10Manybubbles) [14:44:11] (03PS4) 10Manybubbles: Configure cache warmers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 [14:44:51] (03PS5) 10Krinkle: Configure cache warmers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 (owner: 10Manybubbles) [14:45:02] sorry for commonet>commit [14:45:06] undone [14:45:09] nice word [14:46:17] (03CR) 10Manybubbles: Configure cache warmers (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 (owner: 10Manybubbles) [14:50:30] manybubbles: I'll let you SWAT your changes today [14:50:37] anomie: thanks! [14:56:39] andrewbogott: wat ? [14:56:49] that is weird... [14:56:55] yep! [14:58:50] andrewbogott: maybe a race ? [14:59:00] while it is running but it has not yet concluded ? 
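The check redesign discussed above — one plugin covering "disabled, failures, time", with any failure count above zero treated as critical — plus the suspected empty-summary race on virt1008 can be sketched as a single decision function. Everything here is illustrative: the real check reads puppet's state files (e.g. last_run_summary.yaml); this models only the logic, with plain arguments and an arbitrary 3600s freshness threshold:

```shell
#!/bin/sh
# Combined puppet check sketch: disabled > empty summary > freshness >
# failures, mirroring the "disabled, failures, time" proposal above.
# Arguments: disabled flag (1/0), seconds since last run, failure count
# ("" models the empty status file suspected in the virt1008 race).
check_puppet() {
    if [ "$1" = "1" ]; then
        echo "CRITICAL: puppet is disabled"; return 2
    fi
    if [ -z "$3" ]; then
        echo "UNKNOWN: empty run summary (run in progress?)"; return 3
    fi
    if [ "$2" -gt 3600 ]; then
        echo "CRITICAL: last run ${2}s ago"; return 2
    fi
    if [ "$3" -gt 0 ]; then
        echo "CRITICAL: ${3} failures"; return 2
    fi
    echo "OK: last run ${2}s ago with ${3} failures"
}

check_puppet 0 223 0           # prints: OK: last run 223s ago with 0 failures
check_puppet 0 223 99 || true  # prints: CRITICAL: 99 failures
```

Returning UNKNOWN rather than OK for an empty summary avoids the virt1008 symptom, where reading the status file mid-run reported 0 failures; folding all three conditions into one plugin also avoids adding another thousand per-host checks to neon.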
[15:00:04] manybubbles, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140701T1500) [15:00:45] akosiaris: yeah, probably it's checking the status file but the file is empty [15:01:13] (03CR) 10Manybubbles: [C: 032] Configure cache warmers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 (owner: 10Manybubbles) [15:01:16] (03CR) 10Manybubbles: [C: 032] Juggle shard counts for Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143319 (owner: 10Manybubbles) [15:02:59] (03Merged) 10jenkins-bot: Configure cache warmers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 (owner: 10Manybubbles) [15:03:03] (03Merged) 10jenkins-bot: Juggle shard counts for Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143319 (owner: 10Manybubbles) [15:03:59] !log temporarily disabling puppet on hafnium to test an eventlogging alert [15:04:03] Logged the message, Master [15:04:31] !log manybubbles Synchronized wmf-config: SWAT - cirrus settings - cache warmers and shard counts (duration: 00m 06s) [15:04:35] Logged the message, Master [15:06:24] !log manybubbles Synchronized php-1.24wmf11/extensions/CirrusSearch/: SWAT code to set up cache warmers (duration: 00m 05s) [15:06:27] Logged the message, Master [15:07:45] (03PS1) 10Giuseppe Lavagetto: mediawiki: manage the apache config via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 [15:09:01] !log done with SWAT deploy [15:09:08] Logged the message, Master [15:13:47] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: manage the apache config via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 (owner: 10Giuseppe Lavagetto) [15:14:21] (03CR) 10Ottomata: [C: 032 V: 032] "Nuria and I tested this a bunch, and there are parts of upstart that are quite strange to us. However, we think this will work." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/143258 (owner: 10Nuria) [15:14:26] nuria, s'ok? [15:14:27] ^ [15:14:42] nuria looking [15:15:43] ah, left WIP on there [15:15:46] oops [15:16:16] akosiaris: should we bother having a configuration threshold for # of failures? Or just regard any failures as critical? [15:16:33] the latter [15:20:36] (03PS2) 10Giuseppe Lavagetto: mediawiki: manage the apache config via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 [15:23:30] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: Fetching origin [15:24:10] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin [15:24:29] ottomata: ^ is that you? [15:24:35] yes ah [15:24:42] sorry too many chats and merges at the same time [15:25:03] andrewbogott: what's the check limit at right now? [15:25:04] 5 mins? [15:25:10] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin [15:25:14] ottomata: no idea [15:25:15] (03PS2) 10PiRSquared17: Grant 'centralauth-rename' right to stewards [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139655 (owner: 10Gerrit Patch Uploader) [15:25:30] RECOVERY - Unmerged changes on repository puppet on palladium is OK: Fetching origin [15:27:34] <_joe_> ottomata: 10 mins [15:27:42] <_joe_> it's just to remember you [15:27:45] hah [15:27:51] <_joe_> I know it's annoying :) [15:28:00] ahhh, 10 is good ok [15:28:07] naw, 10 is good, was going to say 5 was too short [15:28:18] i usually remember better than i have been this morning, just have a lot of chats open! [15:28:33] ottomata i had some packaging questions -- should I just private message you? 
[15:28:51] naw, ask here dogeydogey :) [15:29:00] others might be able to answer better than me, and then they can see the context [15:29:56] so i tried to package as practice [15:30:10] whoops, I tried to package https://github.com/etsy/logster as practice [15:30:15] (03CR) 10Manybubbles: [C: 031] "Ran performance tests for this - didn't even cause a blip." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142606 (owner: 10Chad) [15:30:37] manybubbles: show off :) [15:30:55] Reedy: after we caused an outage last week, I'm running performance tests [15:30:58] every fucking time [15:31:12] <_joe_> manybubbles: good guy :) [15:31:26] because we don't (yet) have cache warmers, adding a bunch at a time can cause thrash [15:31:32] _joe_: I'm not good - just scared [15:31:41] <^d> Those cache warmers will make all the difference. [15:31:49] <_joe_> manybubbles: oh so you're molding slowly into an ops [15:31:49] I hope so [15:31:50] they should [15:32:01] _joe_: slowly slowly slowly [15:32:11] I did that at a 20 person startup after our ops left [15:32:15] but I was bad at it [15:32:15] PROBLEM - Check status of defined EventLogging jobs on graphite consumer on hafnium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/graphite [15:32:32] <_joe_> manybubbles: next step is realizing all software is crap and it only works by chance [15:32:47] _joe_: I'm so far beyond that [15:32:57] <_joe_> then you'll have the positive attitude all ops always have :) [15:32:59] I don't even believe I write good code [15:33:19] <_joe_> manybubbles: who does?
[15:33:21] scratch that, I especially believe I make stupid mistakes [15:33:30] ottomata I think I did everything correctly but when I try to build the package (debuild -us -uc) I get an error: http://pastie.org/9343892 [15:33:31] _joe_: code review helps so much [15:33:38] <_joe_> it does [15:33:54] <_joe_> it uniforms us to the same horrible mistakes we reach a consensus on :P [15:34:56] btw that eventlogging jobs thing is me... [15:35:01] (dogeydogey will get back to you in just a bit) [15:35:36] <^d> manybubbles: Speaking of code you wrote that's good...random question. I had someone asking me about insource: yesterday. Couldn't get it to find wiki tag thingies (like <ref> or <score>), just "ref" and "score" in the results (which is way over inclusive!). [15:36:09] <^d> How can we search for this? [15:36:11] <^d> I made sure we're not stripping the tags in the page text builder. [15:36:13] ^d: I imagine it is because insource:foo is preparsed. insource:// would do it [15:36:26] insource is preparsed [15:36:27] <^d> I tried that I thought... [15:36:34] and used the english parser [15:36:47] rather, uses the language parser for the wiki [15:37:15] RECOVERY - Check status of defined EventLogging jobs on graphite consumer on hafnium is OK: OK: All defined EventLogging jobs are running. [15:37:48] hey, can I work on these bugs: https://bugzilla.wikimedia.org/show_bug.cgi?id=51434, https://bugzilla.wikimedia.org/show_bug.cgi?id=51497, https://bugzilla.wikimedia.org/show_bug.cgi?id=54065 [15:37:49] * ^d twiddles thumbs while search does its thing [15:38:10] <^d> "An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later." [15:38:14] <^d> Aww :( [15:38:39] dogeydogey: [15:38:40] (03PS1) 10Andrew Bogott: Simplify check_puppetrun.
[operations/puppet] - 10https://gerrit.wikimedia.org/r/143332 [15:38:46] akosiaris: this doesn't fix the race condition, but it consolidates the tests ^ [15:38:48] q1: did you actually clone from etsy or from our repo? [15:39:07] https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/logster [15:39:10] (03Abandoned) 10Andrew Bogott: Check for puppet failures as well as for puppet staleness. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143161 (owner: 10Andrew Bogott) [15:39:30] ottomata I cloned from etsy's github repo [15:39:55] ok, well that will be your first problem! there's no debian/ directory there :p [15:40:02] go with this one [15:40:03] https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/logster [15:40:04] (03CR) 10Dzahn: [C: 031] apt: minor lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/143271 (owner: 10Matanya) [15:40:08] it has a debian branch [15:40:11] also, when you build [15:40:14] use git-buildpackage [15:40:16] not debuild [15:40:33] ^d: damn it [15:40:51] andrewbogott: if failcount == 99. Is that documented somewhere ? [15:41:04] ottomata aren't I supposed to build my own debian directory? [15:41:10] the 99 that is [15:41:11] that's what i've been doing, building my own [15:41:45] akosiaris: the 99 is set earlier in the script [15:41:54] I could just have it error out there rather than use 99 as a flag... [15:43:23] ah yes. I completely forgot about that. Hmmm maybe it is better using a symbol [15:43:56] I am gonna try something [15:44:04] 'k [15:46:14] dogeydogey: not with git-buildpackage [15:46:23] i guess, if you are practicing and want to try it yourself, that's fine [15:46:31] sorry, thought you were just practicing building [15:46:41] but if you want to make it up on your own for practice, go right ahead [15:46:47] ottomata nah, so I guess I'm trying to take any code from github and package it [15:47:00] and that's what I did with logster, is that wrong? [15:47:03] but! 
i am actually only versed in git-buildpackage! since i have only built .debs for WMF and that is what we use [15:47:08] naw, its cool, if you are just practicing that's fine [15:47:23] i have two links for you though... [15:48:10] http://honk.sigxcpu.org/projects/git-buildpackage/manual-html/gbp.import.html#GBP.IMPORT.UPSTREAM-GIT [15:48:17] although, we usually do things a little differently here [15:48:26] debian branch is our --git-debian-branch (usually) [15:48:35] (03PS4) 10Filippo Giunchedi: update index page for wikimedia downloads [operations/puppet] - 10https://gerrit.wikimedia.org/r/141671 [15:48:36] and for packages without tags, you have to specify the proper cli args [15:48:40] you can see what we do in gbp.conf [15:48:51] https://github.com/wikimedia/operations-debs-logster/blob/debian/debian/gbp.conf [15:48:54] aaannnnd [15:49:17] for python packages, I wrote this (even though its not fai don's favorite): https://wikitech.wikimedia.org/wiki/Git-buildpackage#How_to_build_a_Python_deb_package_using_git-buildpackage [15:50:16] ottomata so maybe we can take a step back, the whole point of building a package is to install something that isn't available in the current apt-get repo right? whether it's the software as a whole or a specific version of it?
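For anyone following along, the gbp.conf ottomata links to encodes the branch layout he describes (packaging on a `debian` branch, upstream on `master`, no release tags). The snippet below is an illustrative sketch using standard git-buildpackage option names, not a copy of the logster file:

```ini
# Sketch of a WMF-style gbp.conf (illustrative values, not copied
# from the logster repo linked above).
[DEFAULT]
# packaging work lives on the 'debian' branch, not 'master'
debian-branch = debian
# upstream code is tracked on 'master'
upstream-branch = master
# no pristine-tar branch in this layout
pristine-tar = False
```

With a config like this in `debian/gbp.conf`, running `git-buildpackage` from the `debian` branch picks the right branches without extra CLI arguments.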
[15:50:53] so if i build a python app that just prints "hello universe" to the terminal, i can package that up and allow anyone to install it [15:53:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] update index page for wikimedia downloads [operations/puppet] - 10https://gerrit.wikimedia.org/r/141671 (owner: 10Filippo Giunchedi) [15:53:43] sure [15:54:30] dogeydogey: ja that's cool, but also the point of you building a package for wmf is learning how we prefer to do it, in spite of our terrible documentation and fragmented ways :p [15:54:41] there are so many different ways to do it, and I am constantly confused by it [15:54:52] oh i see [15:55:06] i've learned a very specific way to do it by now (via git buildpackage), and usually can get those packages reviewed and merged [15:55:10] let me read the links and i'll get back to you [15:55:18] so, if you want to learn to do them for WMF, then i'd go about learning the same way [15:55:26] check out other packages in operations/debs/ [15:55:29] there are a lot in there [15:55:32] ottomata also am I cool to work on these bugs: https://bugzilla.wikimedia.org/show_bug.cgi?id=51434, https://bugzilla.wikimedia.org/show_bug.cgi?id=51497, https://bugzilla.wikimedia.org/show_bug.cgi?id=54065 [15:55:40] probably best to stick with ones that have more recent commit dates [15:56:23] dogeydogey: probably? def sync up with the requesters of those bugs [15:56:28] especially YuviPanda [15:56:46] I DIDN'T DO IT [15:56:48] oh [15:56:52] :) [15:57:04] well, yuvi for 51434 and maybe 54065 [15:57:05] also, i wish the tickets had more info to get started [15:57:08] yeah [15:57:15] talk to the requesters and ask for that [15:58:10] paravoid: i see that mail aliases are handled by puppet but I can't find the file to update anywhere...where is it?
[16:00:14] (03PS3) 10Yuvipanda: toollabs: Don't collect NFS stats [operations/puppet] - 10https://gerrit.wikimedia.org/r/143236 [16:01:19] dogeydogey: have you seen that your latest patch broke some stuff? that is a good learning opportunity [16:04:51] (03PS1) 10Yuvipanda: toollabs: Reduce number of things being logged [operations/puppet] - 10https://gerrit.wikimedia.org/r/143335 [16:04:54] Coren: ^ [16:05:26] matanya which one? [16:05:31] Coren: should remove some unwanted logging that I over enthusiastically added :) [16:06:05] (03CR) 10coren: [C: 032] toollabs: Reduce number of things being logged [operations/puppet] - 10https://gerrit.wikimedia.org/r/143335 (owner: 10Yuvipanda) [16:06:11] Coren: ty [16:09:30] (03CR) 10Calak: [C: 031] Kill $wgEnableNewpagesUserFilter [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141100 (https://bugzilla.wikimedia.org/58932) (owner: 10TTO) [16:12:23] apergos: lvilla@ sent us a new index page for dumps.wm.org and got merged in https://gerrit.wikimedia.org/r/#/c/141671/ however it seems that the files on dataset1001 are not updated via puppet? the ones in /data/xmldatadumps/public [16:13:20] dogeydogey: https://gerrit.wikimedia.org/r/#/c/142479/ [16:19:04] (03PS2) 10Alexandros Kosiaris: Simplify check_puppetrun. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143332 (owner: 10Andrew Bogott) [16:19:19] andrewbogott: wanna take a look ? ^ [16:19:47] I think it also incorporates the check_puppet_disabled check so we could ditch this one afterwards [16:19:56] s/I think// [16:23:23] (03CR) 10Andrew Bogott: [C: 031] Simplify check_puppetrun. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143332 (owner: 10Andrew Bogott) [16:24:03] akosiaris: i guess i'm going to remove the versions.rb facts and just hardcode the versions in the role...
[16:24:03] hm [16:26:03] (03CR) 10Anomie: [C: 031] "I note that the default for this config variable is true in DefaultSettings.php, so this would change things only for frwiki, nlwiki, and " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141100 (https://bugzilla.wikimedia.org/58932) (owner: 10TTO) [16:32:00] ottomata: hardcoding the versions doesn't sound right [16:32:35] maybe we can avoid this somehow? What is the name/dir of this file ? [16:32:48] * akosiaris expects >64 chars for a filename [16:33:39] (03CR) 10Alexandros Kosiaris: [C: 032] Simplify check_puppetrun. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143332 (owner: 10Andrew Bogott) [16:35:22] (03CR) 10Ori.livneh: [C: 031] "Awesome; this is exactly what I had in mind." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 (owner: 10Giuseppe Lavagetto) [16:35:57] akosiaris: [16:35:58] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/analytics/hive.pp#L75 [16:36:03] and i was planning on doing [16:36:07] auxpath => "file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core-${::hive_version}-${::cdh_version}.jar", [16:36:28] (the path has changed slightly in cdh5) [16:37:07] (03CR) 10Dzahn: "is this about apache-config or about mediawiki-config? the commit message kind of says both" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 (owner: 10Giuseppe Lavagetto) [16:38:38] ottomata: ldconfig seems so nice right now :-) [16:39:45] ah, well, except that its gonna be a little tricky to do this not as a fact i think [16:39:51] i was thikning inline_template could do it [16:39:59] but that will render on the puppet master, not on the agent, right? [16:40:24] i don't really want to set this auxpath in the module itself, as it is definitely a wmf specific usage [16:40:24] PROBLEM - puppet last run on fenari is CRITICAL: CRITICAL: Puppet has {failcount} failures [16:40:39] {failcount} failed :p [16:40:43] meh... 
[16:41:03] akosiaris: if I made it a function? would that run on the agent, but not unless it is called? [16:41:24] PROBLEM - puppet last run on erbium is CRITICAL: CRITICAL: Puppet has {failcount} failures [16:41:25] PROBLEM - puppet last run on dataset2 is CRITICAL: CRITICAL: Puppet has {failcount} failures [16:41:43] oh no, not {failcount} failures! [16:41:48] ha [16:41:55] lol [16:41:58] we should fix it by applying {fix} [16:42:02] haha [16:42:15] #{fix} [16:42:48] ha, akosiaris, we could do a symlink with an exec :p [16:43:08] ottomata: still how would we determine the target ? [16:43:12] ln -s /path/to/hcatalog*.jar /path/to/hcatalog.jar [16:43:14] :p [16:43:48] btw auxpath => 'file:///usr/lib/hcatalog/share/hcatalog/*.jar' [16:43:55] wouldn't it work too ? [16:44:02] doubt it, this gets rendered in an xml config file [16:44:03] but you don't want all jars, right ? [16:44:23] https://github.com/wikimedia/puppet-cdh/blob/master/templates/hive/hive-site.xml.erb#L279 [16:44:34] no, and i doubt that shell wildcard would work here [16:44:40] (03PS1) 10Dzahn: fix failed failcount with fix [operations/puppet] - 10https://gerrit.wikimedia.org/r/143345 [16:44:44] sigh [16:45:24] custom puppet function might work though, right? [16:45:28] facter is cleaner [16:45:29] mutante, that ^ is good, but the wikimetrics update module ? [16:45:40] ha, akosiaris, i could just redirect the dpkg-query output to dev null :p [16:45:40] ottomata: well yes but it sounds like a bad idea [16:45:56] what are you trying to do? [16:45:56] the custom function that is [16:45:58] i hate submodules [16:46:15] jgage: FWIW, sdk1 on ms-be3003 is almost emptied (set to 0 weight, 50G left) [16:47:30] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has {failcount} failures [16:49:08] ori, what am I trying to do? [16:49:13] you are asking me?
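The literal "{failcount}" in those alerts looks like a Ruby interpolation slip: inside double quotes Ruby only substitutes `#{...}`, so a missing `#` leaves the braces as plain text. A minimal sketch of the bug (illustrative only, not the actual check_puppetrun code):

```ruby
failcount = 3

# Missing '#': the braces are just literal characters, which is how an
# alert ends up reading "Puppet has {failcount} failures".
broken = "Puppet has {failcount} failures"

# With '#{...}' inside double quotes, Ruby interpolates the value.
fixed = "Puppet has #{failcount} failures"

puts broken
puts fixed   # prints "Puppet has 3 failures"
```

The same mistake in reverse (writing `#{...}` inside single quotes) also produces a verbatim placeholder, since single-quoted Ruby strings never interpolate.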
[16:49:37] yeah just curious, sounds like a fun problem [16:50:01] sigh, apache foundation uses confluence and jira [16:50:16] I was hoping they would see the error of their ways by now [16:50:36] they can't, because they use jira [16:50:52] hive.aux.jars.path The location of the plugin jars that contain implementations of user defined functions and serdes. [16:51:01] so... what do you understand from that ? [16:51:09] ori, trying to make the versions in this file path automated rather than hardcoded [16:51:09] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/analytics/hive.pp#L75 [16:51:18] current solution is custom facts: [16:51:19] https://gerrit.wikimedia.org/r/#/c/142715/1/lib/facter/versions.rb [16:51:19] a dir ? a comma separated list ? a single file ? [16:51:35] but, facts are run on every node, even if the containing module isn't included [16:51:54] so we are seeing stupid (harmless) dpkg-query warnings on every puppet run: [16:52:08] No packages found matching hadoop. [16:52:09] etc. [16:52:40] akosiaris: would it be so bad if I redirected the dpkg-query output? [16:52:43] btw these are being stored in the database [16:52:51] hm [16:52:54] the facts? [16:52:57] yes [16:52:58] like for each node? [16:52:59] hgm [16:53:00] hm [16:53:05] ok not ideal then [16:53:10] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet has {failcount} failures [16:53:21] mutante: ? [16:53:23] aside from being a slight misuse, what's so bad about the function idea? akosiaris? [16:53:35] i think mutante is wrestling with a submodule :/ [16:53:42] mutante, lemme know if I can help [16:53:59] ottomata: http://docs.puppetlabs.com/guides/custom_facts.html#confining-facts ? 
[16:54:26] (03PS1) 10Dzahn: fix {failcount} fail in new puppetrun check [operations/puppet] - 10https://gerrit.wikimedia.org/r/143347 [16:54:37] HMMMMM [16:54:51] (03CR) 10Alexandros Kosiaris: [C: 032] fix {failcount} fail in new puppetrun check [operations/puppet] - 10https://gerrit.wikimedia.org/r/143347 (owner: 10Dzahn) [16:54:54] could make a cdh_installed fact [16:54:57] (03CR) 10Dzahn: [C: 032] fix {failcount} fail in new puppetrun check [operations/puppet] - 10https://gerrit.wikimedia.org/r/143347 (owner: 10Dzahn) [16:55:28] or no, we'd have to check each package [16:55:35] ottomata: http://docs.puppetlabs.com/guides/custom_facts.html#fact-precedence looks like it could be exploited here too [16:55:37] hadoop nodes that don't have hive would fail [16:55:46] the functions are not run in the agent scope ottomata, they are run on the puppetmaster [16:55:53] when compiling the catalog [16:55:55] (03CR) 10Dzahn: [V: 032] fix {failcount} fail in new puppetrun check [operations/puppet] - 10https://gerrit.wikimedia.org/r/143347 (owner: 10Dzahn) [16:56:33] (03Abandoned) 10Dzahn: fix failed failcount with fix [operations/puppet] - 10https://gerrit.wikimedia.org/r/143345 (owner: 10Dzahn) [16:56:55] ah foo ok akosiaris [16:57:14] akosiaris: i could just add a conditional in there to check for the package in dpkg or whatever before running dpkg-query [16:57:18] ottomata: per http://docs.puppetlabs.com/facter/2.0/fact_overview.html#main-components-of-simple-resolutions , confine statements "can either match against the value of another fact or evaluate an arbitrary Ruby expression/block." [16:57:20] and return 'uninstalled' or something [16:57:36] or use this confine thing to do that [16:57:44] so, the facts would still be executed and in the puppet db [16:57:51] the confine sounds like the best solution [16:57:58] but, on nodes where packages don't exist [16:58:00] cdh_version would == 'uninstalled' [16:58:08] ? [16:58:22] or, maybe I can return null or something?
[16:58:22] use the confine so the facts will never get populated on hosts not having those packages [16:58:29] right [16:58:41] oh because I can do a custom ruby block [16:58:42] ...yeahhhhhh [16:58:43] ok ok [16:58:43] got it [16:58:45] i like [16:59:10] hm, i guess if confine returns false? [16:59:13] reading... [16:59:13] mutante: thanks [17:00:02] matanya I fixed the issue on https://gerrit.wikimedia.org/r/#/c/142479/ but it won't let me push it [17:00:04] manybubbles, ^d: The time is nigh to deploy Search (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140701T1700) [17:00:04] thanks ori, will try that [17:00:08] gotta grab some lunch [17:00:17] ottomata: yes, if it returns false. e.g.: https://github.com/puppetlabs/facter/blob/978d2ef9390bf920f60af5355c9fe3b36154ad10/lib/facter/gce.rb#L7 [17:02:43] (03CR) 10Chad: [C: 032] Move remaining pool 4 lsearchd wikis (except commons) to Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142606 (owner: 10Chad) [17:02:49] (03PS1) 10Andrew Bogott: Revert "Intentionally break puppet compile for virt1008" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143348 [17:03:00] (03Merged) 10jenkins-bot: Move remaining pool 4 lsearchd wikis (except commons) to Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142606 (owner: 10Chad) [17:03:24] godog, the deployment process is in the middle of being changed, and I need to update the docs to reflect this, I'll have a look at that tomorrow though [17:03:25] PROBLEM - puppet last run on virt1008 is CRITICAL: CRITICAL: Complete puppet failure [17:03:35] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [17:03:47] !log demon Synchronized cirrus.dblist: Move remaining pool 4 lsearchd wikis (except commons) to Cirrus (duration: 00m 07s) [17:03:50] Logged the message, Master [17:04:33] btw ottomata, seems like hive.aux.jar.path can 
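The guard-before-query idea the channel settles on can be sketched in plain Ruby. `package_version` is a hypothetical helper; in a real custom fact the installed-check would live in the `confine do ... end` block (with the query in `setcode`), so the fact never resolves, and never reaches puppetdb, on hosts without the package:

```ruby
# Hypothetical sketch: only shell out to dpkg-query when the package is
# actually installed, so hosts without it emit no "No packages found"
# warnings and report no bogus version value.
def package_version(pkg)
  # `dpkg -s` exits non-zero when the package is not installed
  # (and `system` returns nil if dpkg itself is missing).
  return nil unless system("dpkg -s #{pkg} > /dev/null 2>&1")
  `dpkg-query -W -f='${Version}' #{pkg}`.strip
end
```

A real fact would also want to escape `pkg` before interpolating it into a shell command; the sketch skips that for brevity.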
be a directory [17:04:38] (03CR) 10Andrew Bogott: [C: 032] Revert "Intentionally break puppet compile for virt1008" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143348 (owner: 10Andrew Bogott) [17:05:14] apergos: ok! would it be a problem if I just copy the file over given that's in flux? [17:05:35] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:05:46] which could mean that nothing of all this is necessary. Maybe try that out before fighting with the facts? [17:06:06] 5 seconds ago ... [17:06:23] godog: I'd rather look at what happened and fix it correctly if you don't mind [17:06:48] (03PS1) 10Andrew Bogott: Intentionally break puppet compile for virt1008 [operations/puppet] - 10https://gerrit.wikimedia.org/r/143350 [17:06:49] apergos: sure! thanks :)) [17:07:25] RECOVERY - puppet last run on virt1008 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:07:51] thanks for picking that up, I remember seeing the ticket and then it dropped off my radar :-/ [17:08:05] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data exceeded the critical threshold [500.0] [17:08:13] ^d: do any nodes actually have "gerrit::replicationdest" role? [17:08:25] !log restarted mysqld on db1046 m2 slave [17:08:31] Logged the message, Master [17:08:36] <^d> mutante: antimony, gallium, lanthanum possibly. [17:08:43] ^d: thanks [17:08:57] ^d: that's why i ask https://gerrit.wikimedia.org/r/#/c/138008/4/manifests/gerrit.pp [17:09:40] <^d> Hmm. 
[17:09:42] <^d> Should be ok [17:10:38] it should be noop, yea [17:11:23] (03CR) 10Andrew Bogott: [C: 032] Intentionally break puppet compile for virt1008 [operations/puppet] - 10https://gerrit.wikimedia.org/r/143350 (owner: 10Andrew Bogott) [17:11:38] (03CR) 10Dzahn: [C: 032] gerrit - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138008 (owner: 10Rush) [17:13:10] Warning: /Stage[main]/Gitblit/File[/etc/apache2/sites-enabled/git.wikimedia.org]: Ensure set to :present but file type is link so no content will be synced [17:13:23] ori: ::rolleyes:: asset-check :) [17:13:55] ^d: no change on antimony, but noticed the unrelated issue above [17:14:17] <^d> :\ [17:15:25] PROBLEM - puppet last run on virt1008 is CRITICAL: CRITICAL: Complete puppet failure [17:15:34] akosiaris: success! ^ [17:15:45] COMPLETE PUPPET FAILURE [17:15:54] (03CR) 10Alexandros Kosiaris: Enable ContentTranslation extension on beta labs (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [17:15:55] so [17:15:58] epic puppet fail [17:16:02] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.500,%22500%20resp/min%22%29%29,%22red%22%29&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [17:16:14] paravoid: we don't worry about that [17:16:14] I considered 'epic puppet fail' but that seemed too dramatic :) [17:16:18] only misc services [17:16:20] root@antimony:/etc/apache2/sites-enabled# file svn [17:16:20] svn: ASCII English text [17:16:25] RECOVERY - puppet last run on virt1008 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:16:56] akosiaris: nevermind, still broken :( ^ [17:17:22] 
https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-1hours&from=-1hours&until=now&width=1024&height=500&target=reqstats.5xx [17:17:28] so was that also broken by the lint change yesterday or is it a new one? [17:17:35] I'm going to leave it broken for a bit and see if the error message flaps [17:17:42] mutante: that's an intentional breakage to test monitoring [17:17:44] anyone want to look at that? [17:17:44] maybe we should stop deleting apache configs [17:17:56] ^d: Are you done? [17:18:00] andrewbogott: it's a different issue i keep trying to point out [17:18:13] <^d> hoo: Yep, all done. Just watching things now. [17:18:19] mutante: oh, ok, sorry [17:18:23] Ok, will push a typo fix, then [17:18:31] (03PS3) 10Hoo man: Fix extension WikiHiero typo [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143318 [17:18:38] mutante: I suspect that ori can explain about the apache config thing [17:19:14] It's not an "epic puppet fail" if it doesn't bring the site down [17:19:16] yesterday it was caused by that lint change.. [17:19:27] PROBLEM - puppet last run on virt1008 is CRITICAL: CRITICAL: Complete puppet failure [17:19:35] (03CR) 10Hoo man: [C: 032] "typo typo typo..." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143318 (owner: 10Hoo man) [17:19:40] andrewbogott: so definitely a race [17:19:54] akosiaris: yeah [17:19:59] andrewbogott: hm? [17:20:12] (03Merged) 10jenkins-bot: Fix extension WikiHiero typo [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143318 (owner: 10Hoo man) [17:20:22] hoo, are you going to deploy ^^^ right now? [17:20:28] ori: um… mutante keeps finding broken apaches. I don't know any more than that. [17:20:36] MaxSem: Yeah, no-op, though [17:20:46] Anything blocking? [17:21:04] !log fixing svn.wikimedia.org apache site manually [17:21:09] Logged the message, Master [17:21:22] !log restarting apache on antimony [17:21:27] Logged the message, Master [17:21:32] MaxSem: ? 
[17:21:39] running it manually though I can not reproduce it [17:23:10] hmmm I think I have an idea about that but I will explore it further tomorrow andrewbogott [17:23:55] akosiaris: 'k [17:24:28] !log hoo Synchronized wmf-config/: Typos typos typso (duration: 00m 08s) [17:24:33] Logged the message, Master [17:24:36] !log antimony: git.wikimedia.org]: Ensure set to :present but file type is link so no content will be synced [17:24:43] Logged the message, Master [17:27:07] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [17:28:02] (03CR) 10Faidon Liambotis: [C: 04-1] "+1 on what Daniel said. mediawiki-config is the name we use for something entirely different and this mixup can be very confusing." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 (owner: 10Giuseppe Lavagetto) [17:30:40] (03CR) 10Giuseppe Lavagetto: "I'll change that, I'm just terrible at picking names." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 (owner: 10Giuseppe Lavagetto) [17:30:51] Are such peaks normal? https://ganglia.wikimedia.org/latest/graph.php?r=week&z=large&h=ms-be3001.esams.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Swift+esams [17:31:25] (03CR) 10Faidon Liambotis: [C: 031] "While this was my idea and I kinda like it, I can't help but wonder: are these domains actually being used?" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/143095 (owner: 10Reedy) [17:31:34] Nemo_bis: yes [17:31:44] Nemo_bis: swift esams is not exactly in production [17:31:51] this is just some internal replication traffic [17:32:19] swift shuffling stuff? :) [17:32:28] yeah [17:32:34] there's a disk being emptied [17:32:39] earlier weeks looked calmer [17:32:40] ah [17:33:12] (03CR) 10Faidon Liambotis: [C: 032] swift: add icehouse pinning to ms-be1* [operations/puppet] - 10https://gerrit.wikimedia.org/r/143282 (owner: 10Filippo Giunchedi) [17:33:42] manybubbles: piiing [17:33:44] checkpoint? 
[17:34:22] (03CR) 10BryanDavis: "Any idea how we might be able to use this in beta (deployment-prep labs project) where we currently use the betacluster branch of operatio" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 (owner: 10Giuseppe Lavagetto) [17:35:17] (03PS1) 10Dzahn: fix apache site setup on git.wm.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 [17:35:27] RECOVERY - puppet last run on virt1008 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:37:18] (03PS4) 10Ori.livneh: add apache::def resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 [17:38:12] ottomata I think I have packaging figured out but not exactly the wikimedia way [17:39:12] (03CR) 10Dzahn: [C: 032] "just fixing the current issue" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 (owner: 10Dzahn) [17:39:20] can you or someone here give me an overview of the wikimedia-ops deployment process? [17:39:25] or is that online somewhere? [17:39:29] (03PS2) 10Dzahn: fix apache site setup on git.wm.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 [17:39:45] (03CR) 10Dzahn: "i figure apache_site is gone then?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 (owner: 10Dzahn) [17:40:20] PROBLEM - puppet last run on mw1085 is CRITICAL: CRITICAL: Puppet has 1 failures [17:40:22] (03CR) 10Faidon Liambotis: [C: 04-1] "Inline comments; plus, the commit should only contain debian/. The upstream sources don't really belong here.
If you're planning to use gi" (036 comments) [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [17:41:57] (03PS3) 10Ori.livneh: gitblit: use apache::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 (owner: 10Dzahn) [17:42:32] mutante: updated [17:42:37] http://smokeping.wikimedia.org/?target=ESAMS.Core.cr1-esams [17:42:38] fun fun fun [17:42:51] cause of the reqerrors [17:43:00] ongoing issues [17:43:05] not that I see anyone caring, but still :) [17:43:09] (dogeydogey in some meetings, will get back to you soon) [17:44:39] (03PS4) 10Dzahn: gitblit: use apache::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 [17:45:05] (03PS6) 10Nikerabbit: Enable ContentTranslation extension on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 [17:45:25] (03CR) 10Dzahn: [C: 032] gitblit: use apache::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 (owner: 10Dzahn) [17:45:30] akosiaris: don't go anywhere, just about to update apache::conf :P [17:45:31] !log rebuilding cirrus index for commons to put it into fewer shards - it should be faster this way [17:45:36] Logged the message, Master [17:46:10] (03CR) 10jenkins-bot: [V: 04-1] Enable ContentTranslation extension on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [17:46:32] (03PS7) 10Nikerabbit: Enable ContentTranslation extension on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 [17:47:01] (03CR) 10Dzahn: "recheck" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 (owner: 10Dzahn) [17:49:12] (03PS1) 10Dr0ptp4kt: Restore OM tagging for 470-07. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/143359 [17:50:20] PROBLEM - puppetmaster backend https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:34] (03PS1) 10Jforrester: Enable TemplateData GUI for English, French and Italian Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143360 (https://bugzilla.wikimedia.org/67376) [17:50:40] ori: Notice: /Stage[main]/Sysctl/File[/etc/sysctl.d/70-high-http-performance.conf]/ensure: removed [17:50:50] (03PS1) 10Nikerabbit: Fix string interpolation in cxserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/143361 [17:51:02] bblack: when you have a moment, would you please review and +2 https://gerrit.wikimedia.org/r/#/c/143359 ? the operator confirmed it's a go [17:51:08] ori: it fixed the issue though.. etc/apache2/sites-enabled/50-git-wikimedia-org.conf]: Scheduling refresh of Service[apache2 [17:51:20] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 7.896 second response time [17:51:21] (03CR) 10Nikerabbit: "This prevents cxserver from starting." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143361 (owner: 10Nikerabbit) [17:51:48] mutante: gitblit's high performance is endangered! :P [17:51:50] (03CR) 10Dzahn: "yep, that fixed it. etc/apache2/sites-enabled/50-git-wikimedia-org.conf]: Scheduling refresh of Service[apache2.. not sure about this one " [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 (owner: 10Dzahn) [17:52:07] mutante: i guess webserver::apache included the http performance sysctl [17:52:18] iirc paravoid was suspicious of it so not sure if it's worth reintroducing [17:52:31] ori: alright! [17:52:44] thanks for fixing [17:52:46] (03PS2) 10BBlack: Restore OM tagging for 470-07. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143359 (owner: 10Dr0ptp4kt) [17:52:56] (03CR) 10BBlack: [C: 032 V: 032] Restore OM tagging for 470-07. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/143359 (owner: 10Dr0ptp4kt) [17:53:01] bblack: thx man [17:53:26] ori: of what? [17:53:46] akosiaris: I updated the patches [17:53:56] paravoid: ori: Notice: /Stage[main]/Sysctl/File[/etc/sysctl.d/70-high-http-performance.conf]/ensure: removed [17:54:06] np [17:54:14] sudo -i puppet-merge seems to have worked, btw [17:54:24] i'm not sure how descriptive "high http performance" is of gitblit too [17:54:30] !log demon Synchronized php-1.24wmf10/extensions/Elastica: Updating to master, fixes fatal error (duration: 00m 07s) [17:54:34] Logged the message, Master [17:54:36] outage again [17:55:46] paravoid: esams network issues? [17:55:51] yes [17:56:20] I think :) [17:57:05] 64 bytes from bast1001.wikimedia.org (208.80.154.149): icmp_req=76 ttl=61 time=653 ms [17:57:08] 64 bytes from bast1001.wikimedia.org (208.80.154.149): icmp_req=77 ttl=61 time=659 ms [17:57:11] 64 bytes from bast1001.wikimedia.org (208.80.154.149): icmp_req=80 ttl=61 time=88.5 ms [17:57:14] 64 bytes from bast1001.wikimedia.org (208.80.154.149): icmp_req=78 ttl=61 time=662 ms [17:57:17] definitely [17:57:31] consistently 650+ ms [17:57:41] what the hell [17:58:30] RECOVERY - puppet last run on mw1085 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:58:38] calling gtt [18:00:04] Reedy, greg-g: The time is nigh to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140701T1800) [18:00:12] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data exceeded the critical threshold [500.0] [18:05:52] (03CR) 10BryanDavis: [C: 031] Enable ContentTranslation extension on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [18:06:19] (03PS2) 10Reedy: Non Wikipedias to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143309 [18:07:16] (03CR) 10Reedy: [C: 032] Non Wikipedias to 1.24wmf10 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143309 (owner: 10Reedy) [18:07:23] (03Merged) 10jenkins-bot: Non Wikipedias to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143309 (owner: 10Reedy) [18:08:24] 23 # FIXME: remove after the ganglia module migration [18:08:28] that did not happen yet, or? [18:08:35] confused by ganglia_new [18:09:28] if $::realm == 'labs' or ($::hostname in ['netmon1001'] or $::site == 'esams' or ($::site == 'pmtpa' and $cluster in ['cache_bits'])) { [18:12:45] (03CR) 10Dzahn: [C: 032] ganglia - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138010 (owner: 10Rush) [18:14:44] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.24wmf11 [18:14:49] Logged the message, Master [18:15:36] !log reedy Synchronized docroot and w: (no message) (duration: 00m 18s) [18:15:40] Logged the message, Master [18:19:52] (03CR) 1020after4: Packaging of php-mailparse from the pecl (032 comments) [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [18:20:29] (03CR) 1020after4: "New patch incoming..." [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [18:22:39] (03PS3) 1020after4: Packaging of php-mailparse from the pecl [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 [18:23:05] (03PS4) 10Dzahn: dataset-replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138000 (owner: 10Rush) [18:23:56] ori: brain bounce q about versions idea [18:23:58] yt? 
[18:24:00] yep [18:24:12] (03PS3) 10Ori.livneh: Add apache::conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 [18:24:17] ^ akosiaris [18:24:18] so, this will work, i can run dpkg-query --show and it exits 1 if the package isn't matched [18:24:26] Reedy: greg-g i'll add the new table for wikidata and populate it now, unless some reason not to do that yet [18:24:29] but, i'm not sure how to get the exit code from a facter exec [18:24:37] i can get and parse the output [18:24:37] but that's not as clean [18:24:49] i doubt that I can just do a plain ruby exec of some kind, can I? [18:24:49] this is for the confine, yes? [18:24:50] yes [18:24:59] you can just do a plain ruby exec [18:25:05] that won't be run on the puppetmaster? [18:25:23] i don't think so, but let me reread the docs [18:25:38] aude: All good from me I think [18:25:45] legoktm: What was that backport you wanted? [18:26:14] Reedy: ok [18:26:46] hm, ori, i can test in labs right now i think, i have a puppet client i can use [18:27:12] ottomata: which package are you searching for specifically? [18:27:27] hadoop is a fine example [18:27:40] does it have a top-level executable in $PATH? [18:27:54] yessssssssssss [18:28:04] what is it? [18:28:08] but not all do, and there are actually probably several executables [18:28:09] hm [18:28:10] hdfs [18:28:11] is one [18:28:25] just looking at https://github.com/puppetlabs/facter/blob/deff6048809da75409eb92f42a066093eb820669/lib/facter/dhcp_servers.rb#L22 [18:28:35] it calls Facter::Core::Execution.which('nmcli') [18:28:37] ye, hm [18:28:41] i see where you are doing... [18:28:42] going [18:28:45] seems a little less elegant [18:28:48] i agree [18:28:52] so still looking [18:28:54] but it'd work [18:29:07] which returns true/false based on .. well, you figured it out.
here's the source: https://github.com/puppetlabs/facter/blob/deff6048809da75409eb92f42a066093eb820669/lib/facter/core/execution/posix.rb [18:29:39] yeah [18:30:15] ottomata: wait, why can't you use Facter::Core::Execution.execute? [18:30:17] (03PS2) 10Aude: Set internalEntitySerializerClass Wikibase setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142998 [18:30:51] !log aaron Synchronized wmf-config/PrivateSettings.php: removed obsolete swift tampa config (duration: 00m 07s) [18:30:54] Logged the message, Master [18:31:14] could, but it returns the string ori [18:31:16] and i'd parse it [18:31:19] rather than checking exitval [18:31:28] ruby's system() does what I want (returns false) [18:31:29] (03PS3) 10Aude: Set internalEntitySerializerClass Wikibase setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142998 [18:31:34] checking if I can use that in labs now... [18:33:15] (03PS1) 10Aude: Enable property/entity suggester on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143365 [18:33:17] ready to enable it [18:33:24] shall be done quick as jenkins allows [18:33:35] ottomata: i don't get it; why not simple do: [18:33:35] Facter::Core::Execution.execute('dpkg-query --show hadoop') ~= 'No packages found' [18:33:58] (03CR) 10Aude: [C: 032] Set internalEntitySerializerClass Wikibase setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142998 (owner: 10Aude) [18:34:04] (03Merged) 10jenkins-bot: Set internalEntitySerializerClass Wikibase setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142998 (owner: 10Aude) [18:34:12] because its less elegant than just checking exit val! 
:p [18:34:20] i MAY do that ori :p [18:34:23] Facter::Core::Execution.execute('dpkg-query --show hadoop').start_with('No packages') [18:34:32] err, start_with?('No packages') even [18:35:00] (03CR) 10Aude: [C: 032] Enable property/entity suggester on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143365 (owner: 10Aude) [18:35:10] (03Merged) 10jenkins-bot: Enable property/entity suggester on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143365 (owner: 10Aude) [18:38:11] (03PS1) 10Andrew Bogott: Make virt1008 a live compute node; make virt1009 the puppet canary. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143368 [18:38:23] !log aude Synchronized wmf-config/Wikibase.php: (no message) (duration: 00m 15s) [18:38:27] Logged the message, Master [18:38:53] !log aude Synchronized wmf-config/InitialiseSettings.php: Enable property suggester on Wikidata (duration: 00m 10s) [18:38:56] Logged the message, Master [18:40:06] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [18:41:26] looks good :) [18:41:26] PROBLEM - puppet last run on virt1008 is CRITICAL: CRITICAL: Complete puppet failure [18:41:58] !log switching puppet canary from virt1008 to virt1009 [18:42:03] Logged the message, Master [18:42:05] !log adding virt1008 to labs compute pool [18:42:09] Logged the message, Master [18:42:15] (03CR) 10Andrew Bogott: [C: 032] Make virt1008 a live compute node; make virt1009 the puppet canary. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143368 (owner: 10Andrew Bogott) [18:44:47] "Enabled CirrusSearch as the default search backend on 30 more wikis - take five" [18:44:49] I'm slightly confused about the 'do not add new configuration here + readme' on the top of admins.pp Are we still supposed to change ssh keys there (make the old one absent and add the new one) or is that supposed to be in data.yaml somehow? [18:44:50] hmmm. [18:45:15] take five? 
That sounds like it was fun.... [18:45:33] jamesofur: data.yaml I believe. But chasemp is the authority on such things [18:45:48] twkozlowski: ^ [18:45:58] IRC #REDIRECT [18:46:17] ? [18:46:19] jamesofur: no, all in admins.yaml [18:46:52] oh, I was wrong, [18:46:54] jamesofur: modules/admin/data.yaml should already have the key you are looking for [18:47:10] jamesofur: if not, let us know which one [18:47:38] it has my old one yes, but I need to update it, do we still need to do an absent and present version there? Or just replace? [18:48:13] I 'have' the one that is there still but my laptop was stolen. Was encrypted (and the key is passworded) but better safe than sorry [18:48:20] jamesofur: just replace, it should purge anything that's not there. [18:48:22] ori, ottomata, sorry for lurking, but are you wanting to get the exit status of shell execution in ruby? [18:48:24] cool [18:48:26] thanks andrewbogott [18:48:29] * andrewbogott is pretty sure but still defers to chasemp [18:48:33] just replace unless you want to deactivate the entire user [18:48:46] * jamesofur will wait for clarification before submitting but makes edit now [18:48:51] mutante: heh ;) I do not wish to deactivate [18:49:02] marxarelli: nah, we know how to do that; facter expects you to execute commands using its wrappers, and they don't provide access to exit status [18:49:09] Just remember having to do the absent procedure the last time this came around for admins.pp [18:49:29] jamesofur: yes, just replace and i can check it removes it [18:49:46] facter uses %x{} i believe, which means you should have the exit status in $?
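The exit-status question ottomata and marxarelli are circling here can be sketched in plain Ruby (a minimal illustration; `true`, `false`, and `echo` stand in for the real `dpkg-query --show <pkg>` call a fact would run):

```ruby
# Sketch of getting a command's exit status from plain Ruby, per the
# discussion above. `true`/`false`/`echo` stand in for the real
# `dpkg-query --show <pkg>` call a fact would run.

# system() collapses the exit code to a boolean:
ok  = system('true')    # exit 0 -> true
bad = system('false')   # exit 1 -> false

# %x{} (backticks) captures stdout and stores a Process::Status in the
# global $? -- which is what makes it awkward.
out = %x{echo hadoop-2.0.0}
puts "output=#{out.strip} exit=#{$?.exitstatus} ok=#{ok} bad=#{bad}"
# -> output=hadoop-2.0.0 exit=0 ok=true bad=false
```

Facter's own `Facter::Core::Execution.execute` returns only the output string, which is why parsing the output (or using `which`) keeps coming up as the alternative.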
[18:50:35] ori: ^, it's a global however :/ [18:51:32] jamesofur: nothing new goes in admins.pp :) [18:51:36] all goes in data.yaml [18:51:40] * jamesofur nods [18:53:53] (03CR) 1020after4: Packaging of php-mailparse from the pecl (031 comment) [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [18:54:08] (03PS1) 10Jalexander: updating ssh-key for jamesur [operations/puppet] - 10https://gerrit.wikimedia.org/r/143374 [18:55:04] fwiw yes the keys are absolute now, so anything not there won't be there on the end host [18:55:11] no more absent then waiting and then purging dance [18:55:13] perfect [18:56:15] (03CR) 10Dzahn: [C: 032] "it's the one from https://office.wikimedia.org/w/index.php?title=User:Jalexander&oldid=114435" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143374 (owner: 10Jalexander) [18:56:49] thanks mutante [19:00:35] jamesofur: it was replaced on bast1001 [19:00:56] perfect, and able to log in and confirm [19:01:02] :) [19:01:37] yeah ori, facter config with blocks not working for me... hmph [19:01:47] it works with other facts just fine [19:01:52] (03PS1) 10RobH: adding Dan Garry (deskana) to bastion and statistics-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/143376 [19:01:58] even this [19:02:08] https://gist.github.com/ottomata/d51356d1a502282dd38a [19:02:10] doesn't work [19:03:55] hmm, maybe i'm doing it wrong [19:04:27] (03PS1) 10Ori.livneh: phabricator: use apache::site, not apache::vhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/143378 [19:04:46] chasemp: ^ [19:04:51] ottomata: what doesn't work? [19:04:56] i.e., what are you seeing? [19:05:43] ottomata: and re: not working in vagrant, perhaps it's a puppet2/3 thing? 
[19:05:57] greg-g: https://bugzilla.wikimedia.org/show_bug.cgi?id=67243#c10 will need wmf backports asap [19:06:01] ori: ah, the regsubst I moved to the template as I need it to allow https as well, I will integrate this into what I've got for trusty so far [19:06:04] ja mabye...i think its not working in labs either... [19:06:25] greg-g: i'm not filing in a swat because i can't guarantee that i'll be there at the time, and someone is just going to ignore the patches then [19:06:41] MatmaRex: i can be on the hook [19:06:41] i'm just going to throw this here and hope someone picks up… [19:07:09] MatmaRex: i don't have context tho, what do you mean by 'asap'? is it fixing something that is currently broken? [19:07:28] chasemp: i can update the patch for that if you like (or just hand it off to you to do with as you please, whatever you prefer) [19:07:43] ori: yes, jquery.ui dialogs disappear because of a Blink rendering bug http://imgur.com/F9xFj1x,eoXHwJK [19:07:54] ori: I think I will just steal what you've got and include it w/ my 'trusty friendly phabricator' patch I've already got some stuff [19:08:05] chasemp: cool, totally fine by me [19:08:06] ori: the patch works around that bug, and generally uses a better way of accomplishing the thing [19:08:07] thanks for doing that, it was on my list today so all good [19:08:08] (03CR) 10Dzahn: [C: 031] "key matches prod key from https://office.wikimedia.org/w/index.php?title=User:DGarry_%28WMF%29&oldid=114197 , UID matches ldap, groups see" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143376 (owner: 10RobH) [19:08:31] hmm, yeah maybe so ori, I also get 'Could not retrieve cdh_version: uninitialized constant Facter::Core' on vagrant if I try to use that instead of just setcode "blabla" [19:08:32] hm [19:08:45] (03CR) 10Ori.livneh: [C: 04-2] "Chase will integrate the bits he likes from this patch to another patch he already has in progress." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/143378 (owner: 10Ori.livneh) [19:09:06] ori here is a question, some stuff for perms (syntax) changed from apache 2.2 to 2.4 [19:09:15] yeah, require all granted [19:09:20] I have template logic that checks for trusty vs precise [19:09:27] but that's inferring the new apache and kind of crap [19:09:29] better way? [19:09:36] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data exceeded the critical threshold [500.0] [19:10:04] well, one of the decisions we made with the new apache module is to load mod_authz_compat (or whatever it's called) which allows the old-style allow from .. deny from .. directives to be used [19:10:23] wasn't doing it for me w/ the existing vhost stuff [19:10:29] maybe I needed apache::site? [19:10:43] i.e. perms failed w/ old style for me before I did the template version dance [19:10:51] when was this? [19:11:00] greg-g: unless you object i'll just sync MatmaRex's patch now [19:11:09] ori: I guess last week? middle of [19:11:30] ori: I don't think I object, if you reviewed the changes [19:11:40] greg-g: just doing that now [19:11:48] * greg-g nods [19:12:12] (03CR) 10RobH: [C: 032] "I've matched the same user rights as the other project managers doing similar roles and needing similar research rights.
This should be g" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143376 (owner: 10RobH) [19:12:20] ori: I'll circle back with you since you're busy :) it works now but could be better so can wait [19:12:20] yep [19:12:36] chasemp: thanks, this should be quick, sorry [19:14:19] !log ori Synchronized php-1.24wmf11/resources/src/jquery.ui-themes/vector/jquery.ui.core.css: Ib09928248: vector/jquery.ui.core.css: Update rule for .ui-helper-hidden-accessible (bug 67243) (duration: 00m 06s) [19:14:25] Logged the message, Master [19:14:26] RECOVERY - puppet last run on virt1008 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:14:34] !log ori Synchronized php-1.24wmf10/resources/src/jquery.ui-themes/vector/jquery.ui.core.css: Ib09928248: vector/jquery.ui.core.css: Update rule for .ui-helper-hidden-accessible (bug 67243) (duration: 00m 05s) [19:14:38] Logged the message, Master [19:14:46] ^ MatmaRex [19:15:27] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2623.46028763 [19:16:32] Do minor upgrades of packages (e.g. swift or nova) generally get scheduled on the deployment calendar? Or is that strictly for mediawiki? [19:16:36] greg-g ^ ? [19:16:36] ori: whee. thanks, verified fixed :) [19:17:09] chasemp: the only other thing i can think of is to use guards, but I think doing the conditionals in puppet/erb is nicer [19:17:21] andrewbogott: false dichotomy, but... [19:17:43] greg-g: but... [19:17:49] well, [19:18:04] andrewbogott: swift/nova aren't just "packages" they're services, no? will an upgrade require downtime? Will it require rolling-updates that imply lessened redundancy? [19:18:07] whut? :) [19:18:14] mutante: i'm just playing along [19:18:21] deployment calendar isn't "strictly" mediawiki in any sense of either word :) [19:18:57] greg: It should not result in downtime, but I'm going to schedule and warn labs-l just in case. 
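The trusty-vs-precise "template version dance" chasemp describes above would, in an ERB vhost template, look roughly like the following. This is a hypothetical fragment, not the actual patch: `@lsbdistcodename` is the standard Puppet fact, and Ruby's stdlib ERB renders it for a quick check:

```ruby
require 'erb'

# Hypothetical vhost fragment illustrating the trusty/precise branch:
# Apache 2.4 (trusty) wants "Require all granted", while Apache 2.2
# (precise) uses the old Order/Allow form.
TEMPLATE = <<~'ERB'
  <Directory /srv/app>
  <% if @lsbdistcodename == 'trusty' -%>
    Require all granted
  <% else -%>
    Order allow,deny
    Allow from all
  <% end -%>
  </Directory>
ERB

# Minimal stand-in for the scope a Puppet ERB template is evaluated in.
class FactScope
  def initialize(codename)
    @lsbdistcodename = codename
  end

  def render
    ERB.new(TEMPLATE, trim_mode: '-').result(binding)
  end
end

puts FactScope.new('trusty').render   # emits the 2.4-style directive
puts FactScope.new('precise').render  # emits the 2.2-style directives
```

As ori notes, loading Apache's compat module so the old `Order`/`Allow` directives keep working on 2.4 would avoid the branch entirely.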
[19:19:29] andrewbogott: see also: upgrades to the ElasticSearch cluster that manybubbles/^d do, those are scheduled because it's a rolling-upgrade that implies lessened availability/hard to rollback/etc. [19:19:47] andrewbogott: if you need to warn labs-l, then you need to schedule, is a good rule of thumb ;) [19:21:22] greg-g: hm, ok. [19:21:36] * andrewbogott tries to remember how to add something to the deployment calendar [19:21:42] andrewbogott: which also implies >= a week's notice [19:21:56] andrewbogott: lazy way: email me by Thursday the week before :) [19:22:33] ja ori i'm checking in labs with facter version 1.7.5 [19:22:42] the code for blocks passed to confine is not there [19:23:47] ori: how bout... I will get what i have together, and we can improve it as we go :) [19:23:47] ottomata: doesn't puppet 3 require facter 2+? [19:24:00] chasemp: +1 [19:24:10] greg-g: so i can't mail you today for an upgrade on Monday the 7th? Does that fall into the < a week rule, or the before thursday rule? [19:24:20] root@hadoop-d-master0:/usr/lib/ruby/vendor_ruby/facter/util# puppet --version [19:24:20] 3.4.3 [19:24:20] root@hadoop-d-master0:/usr/lib/ruby/vendor_ruby/facter/util# facter --version [19:24:20] 1.7.5 [19:24:34] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [19:24:34] http://apt.wikimedia.org/wikimedia/pool/main/f/facter/ [19:24:48] guess not? [19:25:04] andrewbogott: that's fine, "week before" meaning mostly "by thursday the week before the week of", if that makes sense [19:25:14] 'k [19:25:19] _joe_: any insight on that? [19:25:50] andrewbogott: ">= week before" implied a level of precision I didn't mean :) [19:25:57] ottomata done with meetings? [19:26:49] yes! [19:26:51] hi dogeydogey [19:26:57] so, jaj, what's your q? [19:27:48] ottomata is there an overview of the wiki-ops deployment process somewhere? [19:27:56] for debs? no, probably not [19:27:56] or someone willing to explain it?
[19:28:04] but, i can give you brief overview [19:28:16] let's go with something like logster, that didn't have a deb for it before i built one [19:28:28] i chatted with folks, said I wanted it for stuff [19:28:34] eventually got sortof some agreement [19:28:48] so, I went and created a debian/ dir using git buildpackage (and those instructions I pasted before) [19:28:52] and submitted that for review to gerrit [19:29:02] then that went through some rounds of review and changes [19:29:09] then eventually it was approved [19:29:11] and merged [19:29:18] once approved and merged, I built an official deb [19:29:27] and copied it to the apt server [19:29:33] and used reprepro commands to add it to our apt repo [19:29:47] then, I wrote puppet manifests to install and configure the deb [19:30:41] +1 to ottomata's process for a deb, same here :) [19:30:59] dogeydogey: https://gerrit.wikimedia.org/r/#/c/95556 [19:31:03] ottomata so everything is deployed using puppet? [19:31:06] yes [19:31:19] well, not *everything* but pretty much all ops infrastructure is [19:31:31] for wmf apps (like mediawiki, etc.) those have their own deploy process [19:31:34] using other tools [19:32:18] ottomata what about pushing to a staging infrastructure then pushing to prod? [19:32:31] lol, staging infrastructure [19:32:33] or deploying to staging first with puppet [19:32:35] :) [19:32:45] oh i see what you did there [19:32:49] dogeydogey: i think you mean labs [19:32:57] i develop the deb using a local vagrant VM [19:33:00] and also develop puppet there [19:33:06] once I get something working, I often test in labs [19:33:17] there once was a labs project for building [19:33:39] so when you have the approved deb copied to the apt server and the puppet manifests written up, you apply it to the labs, and if it works send it off to prod? [19:34:35] hm, naw, i would do the labs stuff before its approved and in apt [19:34:41] ottomata also labs is only like a small miniature environment no?
what about potential impacts of the change on other things not setup in the labs enviro? [19:34:42] either I would just dpkg -i the deb in labs [19:34:50] OR, there is a way to put a .deb in a special dir in labs [19:34:55] do we have a doc on using vagrant to test puppet changes for a given type of host? [19:34:56] i think labs instances have a local apt repo configured or something... [19:35:11] dogeydogey: labs is ad hoc VMs [19:35:17] you create and destroy instances as you see fit [19:35:23] 100% isolated from production [19:35:23] okay thanks, i think i got a better idea now :) [19:35:27] ottomata https://gerrit.wikimedia.org/r/#/c/142479/ -- I made changes to this but it won't push now, any idea what's going on? [19:35:31] bblack jgage wrote up something on the testing process I believe.....wiki'ed somewhere [19:35:53] bblack, no i doubt it, and def not for a 'given type of host' (not really sure what that means) [19:36:06] but, mediawiki-vagrant is pretty modular [19:36:12] you can write a role class to use modules [19:36:13] and then do [19:36:19] because I still more-or-less just try to get it right and hope jenkins catches stupid errors (unless I think the fallout could really kill something) [19:36:19] vagrant enable-role [19:36:21] vagrant provision [19:36:22] blabla [19:36:36] yeah, just use vagrant, you'll find it pretty easy [19:36:42] 'given type of host' I meant like a text cache or an lvs node [19:36:50] well, vagrant doesn't have the ops/puppet repo [19:36:53] we gave up on beta? [19:36:55] (hence my submodule pushes :p) [19:37:07] and also hence my patches to try and make this easier [19:37:11] why do we need submodules to use vagrant? [19:37:41] (03CR) 10Ottomata: "Talked with Mark about this last week. I need to write up docs with my intentions and a plan forward."
[operations/puppet/varnish] - 10https://gerrit.wikimedia.org/r/133695 (owner: 10Ottomata) [19:37:57] bblack, you don't, but if you want to share them between ops/puppet and vagrant, you do [19:38:06] if you want to just hack somethign locally that only you can use [19:38:15] you can manually symlink or copy directories around locally [19:38:30] but submodules let you just develop a generic module and share between multiple environments/repos [19:38:55] I guess I should look into vagrant more, because that doesn't sound like what I expect it to sound like :) [19:39:08] whatcha mean? [19:39:26] ori's mediawiki-vagrant has its own puppet manifests [19:39:30] you can use puppetmaster::self in actual labs too [19:39:33] not the operations/puppet repository [19:39:34] "share them between ops/puppet and vagrant"? I thought this could just provision stuff from an unmerged ops/puppet and test it? [19:39:46] naw, not in vagrant bblack [19:39:50] you can do that in labs [19:40:06] ottomata any idea on why I can't git review changes to this anymore? https://gerrit.wikimedia.org/r/#/c/142479 [19:40:09] kinda, if there's an applicably-roled host and the labs differences themselves don't matter [19:40:42] dogeydogey: it's already merged, you can still add comments though [19:40:54] I think since I haven't tried vagrant, I've been pinning hopes on it that it's the tool I wish it was :) [19:41:08] _joe_: updated apache::conf, apache::def [19:42:44] what I want is "puppet-tester cp4011.ulsfo.wmnet ", which spawns a little vm that thinks it's cp4011.ulsfo.wmnet and puppets itself using the pending unmerged changes, or something. [19:42:53] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:42:53] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:42:55] <_joe_> ori: if I'm needed to merge, I'll do that tomorrow morning [19:43:13] _joe_: just +1 please [19:43:28] bblack, hm, _joe_ has something kinda like that [19:43:33] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [19:43:34] the puppet catalog differ [19:43:38] _joe_: or not, if you're not up for it, 'sokay [19:43:39] that works in a vagrant instance [19:43:43] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 834 seconds ago with 0 failures [19:43:43] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning. [19:44:03] yeah I haven't tried that catalog differ yet. it sounds like it might be very useful though :) [19:44:07] <_joe_> ori: in a few mins [19:44:19] <_joe_> bblack: in a few more mins, I'll show you [19:44:27] poor _joe_ :) [19:44:39] <_joe_> although I guess it needs some maintenance post-varnish-submodule [19:46:53] _joe_, you are going to get bblack into vagrant and then he is going to see the submodule light and is going to move the varnish module back himself :p :p [19:47:17] somehow I doubt that [19:47:25] haha, actually, that's not true, catalog differ is hard to use with submodules :/ [19:47:26] <_joe_> I doubt that too [19:47:32] unless vagrant prints $100 bills every time I execute it [19:47:35] them maybe [19:47:36] haha [19:48:07] (03CR) 10Scottlee: "Whew, looks like someone else fixed it." [operations/puppet] - 10https://gerrit.wikimedia.org/r/142479 (owner: 10Scottlee) [19:48:19] ottomata: to be totally honest, i am kinda exasperated with submodules too. 
i like code-sharing, but a README in the module dir that explains where the module is from might be better [19:48:38] ottomata: we could even have a script that updates shared modules in mediawiki-vagrant [19:48:51] by fetching the latest module dir from operations/puppet [19:49:08] it's annoying, but that way it's only annoying to one or two people rather than everyone always [19:49:23] the latest module dir? [19:50:06] (03CR) 10Dzahn: "Scottlee, yes, in https://gerrit.wikimedia.org/r/#/c/143172/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142479 (owner: 10Scottlee) [19:50:22] yes, so, there'd be a small rake task for mwv that fetches operations/puppet and compares mediawiki/vagrant.git:puppet/modules/$modulename to operations/puppet.git:modules/$modulename [19:50:30] (03PS1) 10Aude: adjust $wgPropertySuggesterMinProbability setting for wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143390 [19:50:33] if the latter is different, it copies the changes and commits the result [19:51:07] greg-g: is anyone deplouing now?
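The module-sync task ori sketches above could look roughly like this — a hypothetical Ruby sketch, not an actual mediawiki-vagrant task; the checkout paths and the final `git commit` step are assumptions:

```ruby
require 'fileutils'
require 'digest'

# Sketch of ori's idea: compare a shared module between an
# operations/puppet checkout and a mediawiki-vagrant checkout,
# copying it over when the trees differ.

# Fingerprint a directory tree by hashing each file's relative path
# and contents, in sorted order for determinism.
def tree_digest(dir)
  digest = Digest::SHA1.new
  Dir.glob(File.join(dir, '**', '*')).sort.each do |path|
    next unless File.file?(path)
    digest << path.sub(dir, '') << File.read(path)
  end
  digest.hexdigest
end

def sync_module(name, ops_puppet:, vagrant:)
  src = File.join(ops_puppet, 'modules', name)
  dst = File.join(vagrant, 'puppet', 'modules', name)
  return :unchanged if File.directory?(dst) && tree_digest(src) == tree_digest(dst)

  FileUtils.rm_rf(dst)
  FileUtils.mkdir_p(File.dirname(dst))
  FileUtils.cp_r(src, dst)
  # a real task would now `git add`/`git commit` in the vagrant repo
  :updated
end
```

This only covers the one-way copy; the submodule approach it would replace also carries history, which a plain copy does not.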
[19:51:21] deploy* [19:51:27] ottomata: the way i see it, there are two use-cases for submodules: (1) to facilitate third-party reuse and contributions, and (2) to facilitate code-sharing across mediawiki-vagrant and operations/puppet [19:51:28] not that I know of [19:51:30] aude: ^ [19:51:33] i don't have a good solution for (1) [19:51:33] ok [19:51:42] based on feedback we want to adjust a setting [19:51:52] but for (2), i think the update thing could work [19:51:54] shall be quick :) [19:52:10] (03CR) 10Aude: [C: 032] adjust $wgPropertySuggesterMinProbability setting for wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143390 (owner: 10Aude) [19:52:16] (03Merged) 10jenkins-bot: adjust $wgPropertySuggesterMinProbability setting for wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143390 (owner: 10Aude) [19:53:05] ahh, ok [19:53:25] !log aude Synchronized wmf-config/Wikibase.php: adjust property suggester setting for wikidata (duration: 00m 11s) [19:53:27] done [19:53:30] Logged the message, Master [19:53:42] * aude rather not wait for swat this time [19:59:25] (03PS1) 10Ottomata: Remove versions.rb facts - this caused every node to print out dpkg-query warnings [operations/puppet/cdh] - 10https://gerrit.wikimedia.org/r/143392 [19:59:46] (03CR) 10Ottomata: [C: 032 V: 032] Remove versions.rb facts - this caused every node to print out dpkg-query warnings [operations/puppet/cdh] - 10https://gerrit.wikimedia.org/r/143392 (owner: 10Ottomata) [20:01:04] "Aufgrund von Serverproblemen ist das Fehler-Wiki zur Zeit nicht verfügbar. 
" [20:01:08] that's so meta:) [20:01:19] "because of server problems, error-wiki isn't available" [20:01:37] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [20:05:20] Reedy: I didn't need a backport, just wanted a merge, but now that you asked, it would be good if we could backport https://gerrit.wikimedia.org/r/#/c/143243/ so I can start running the script [20:06:56] legoktm: That change is horrible [20:07:00] I didn't backport anything yet... [20:07:08] if ( PHP_SAPI !== 'cli' ) [20:07:10] really [20:07:18] Just 11? or 10 too? [20:07:45] hoo: otherwise I'd have to pass the it as an argument through like 4 functions [20:07:59] Reedy: just 11 should be fine [20:08:12] Yeah... [20:08:16] I don't agree with that change either [20:09:32] ok [20:09:35] We have code like that in core [20:09:48] quite a lot [20:09:53] WP:OTHERSTUFF [20:09:54] I mean, is there ever a circumstance where we run a maint script and it *should* go to RC? [20:09:57] that doesn't really justify doing it again [20:10:20] legoktm: That's the wrong question [20:10:50] WP:OSE [20:11:16] ok, /me fixes [20:11:27] aude: :D didn't know that one [20:11:51] heh [20:14:08] <_joe_> ori: on your changes [20:14:24] on my changes [20:15:13] blah [20:15:18] this script isn't using the Maint class [20:15:48] (03PS1) 10Ottomata: Update cdh module with removal of custom versions facts [operations/puppet] - 10https://gerrit.wikimedia.org/r/143398 [20:16:09] (03CR) 10Ottomata: [C: 032 V: 032] Update cdh module with removal of custom versions facts [operations/puppet] - 10https://gerrit.wikimedia.org/r/143398 (owner: 10Ottomata) [20:18:09] what's going on with 5xx? 
[20:18:53] I see traffic spikes on pdf servers and text caches too, donno if related [20:19:07] (03PS1) 10QChris: Make dbstore1002 handle s5 analytics queries [operations/dns] - 10https://gerrit.wikimedia.org/r/143399 (https://bugzilla.wikimedia.org/66068) [20:19:22] (03PS5) 10Giuseppe Lavagetto: add apache::def resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 (owner: 10Ori.livneh) [20:19:25] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-4hours&from=-4hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [20:19:33] (03CR) 10Giuseppe Lavagetto: [C: 031] add apache::def resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 (owner: 10Ori.livneh) [20:21:05] _joe_: thank you [20:21:25] (03CR) 10Ori.livneh: [C: 032] add apache::def resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 (owner: 10Ori.livneh) [20:24:07] where did the 5xx tsv log end up? I've lost track of log hosts :) [20:25:29] if you mean 5xx that are the result of mediawiki fatals or exceptions, it's fluorine:/a/mw-log/{fatal,exception}.log [20:25:31] bblack: ^ [20:25:39] does flourine still exist? [20:25:44] does it ever [20:25:47] no fenari, fluorine [20:25:54] they're also plotted here: http://ur1.ca/edq1f and they don't seem high [20:26:10] I mean the machine "flourine" isn't even in our dns [20:26:19] fluorine, sorry [20:26:20] uo, not ou [20:26:29] can we just alias that so it works? :P [20:26:41] actually not sorry, your typo not mine! 
:P [20:26:52] * ori retracts apology [20:26:53] yeah [20:27:08] that's why I couldn't find the host after I looked on wikitech, I kept getting it backwards on the commandline :) [20:27:51] !log Adding cache warmers to all Cirrus indexes for group1 wikis with more then one shard except commons (commons is busy, it'll have to wait:) [20:27:55] Logged the message, Master [20:28:37] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Just found a minor issue; if resolved, count me as +1" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 (owner: 10Ori.livneh) [20:29:06] <_joe_> ori: sorry, did not catch this in the first iteration. [20:29:15] seems like a lot of [2014-07-01 20:24:09] Fatal error: Maximum execution time of 180 seconds exceeded for lots of different things in the fatal log [20:29:19] but that's normal [20:30:07] What replaced apache-graceful? [20:30:36] <_joe_> bblack: 5xx from varnishes is on oxygen [20:30:50] <_joe_> errors from backends on fluorine [20:30:57] Can someone restart apache on m1217 as it's apparently spamming apc errors [20:31:50] !log restarting apache on mw1217 [20:31:54] Logged the message, Master [20:32:19] Jul 1 20:31:41 mw1217 kernel: [40380516.085263] apache2[23758]: segfault at 7f401aac1d40 ip 00007f401aac1d40 sp 00007f4007e6ee08 error 14 in mod_filter.so[7f401f821000+3000] [20:32:33] Nice [20:32:34] apache2 segfault? [20:33:18] I can't even reach oxygen (even ping) from iron [20:33:42] i can [20:33:49] mutante: maybe someone's found a new zero-day to try to exploit [20:33:49] _joe_: thanks for the reviews! [20:34:07] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: Puppet has 1 failures [20:34:10] mutante: you can ping oxygen.eqiad.wmnet from iron? 
[20:34:31] bblack: i can ssh to oxygen from iron [20:34:46] oh it's the public you're hitting [20:34:47] bblack: oxygen.wikimedia.org [20:34:59] bblack: yea, i just did "oxygen" [20:35:22] oxygen.eqiad.wmnet is in DNS with a 10.64 addr that totally doesn't exist on the host [20:36:15] !log reedy Synchronized php-1.24wmf11/extensions/WikimediaMessages: bug 67387 (duration: 00m 15s) [20:36:19] Logged the message, Master [20:37:30] bblack: confirmed that [20:38:01] <_joe_> mmmh how's that even possible? [20:38:04] there is an eth1 though [20:38:12] that is DOWN [20:38:16] cable unplugged? [20:38:25] hmm [20:38:28] ip link show deve eth1 [20:38:41] yay Reedy! [20:38:53] mutante: it's not configured in /etc/network/interfaces, though [20:40:01] <_joe_> bblack: oxygen.wikimedia.org [20:40:38] i wonder if it ever had that IP [20:41:05] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [20:41:13] do we think it should have the second interface configured or not? [20:41:24] I think not [20:41:29] ok [20:41:42] then just remnant or error in DNS [20:41:51] I'm still digging, but git blame isn't very useful. why are there so many commits with the author root@wikimedia.org in the dns repo, that don't show up in gerrit? :P [20:42:07] :p svn history [20:42:13] github! [20:42:13] Reedy: so...backport https://gerrit.wikimedia.org/r/#/c/143419/ ? 
:) [20:42:14] heh [20:42:31] erg, it's not a straight cherry-pick [20:42:34] * legoktm does [20:44:05] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data exceeded the critical threshold [500.0] [20:44:21] (03PS1) 10Dzahn: remove oxygen.eqiad.wmnet, doesn't exist [operations/dns] - 10https://gerrit.wikimedia.org/r/143472 [20:45:21] Reedy: https://gerrit.wikimedia.org/r/#/c/143473/ [20:47:44] (03PS1) 10Dzahn: re-activate addWiki function in maintenance [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/143475 [20:48:31] (03CR) 10jenkins-bot: [V: 04-1] re-activate addWiki function in maintenance [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/143475 (owner: 10Dzahn) [20:49:06] (03PS2) 10Dzahn: re-activate addWiki function in maintenance [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/143475 [20:50:20] (03PS3) 10Dzahn: re-activate addWiki function in maintenance [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/143475 [20:52:06] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:53:39] (03PS1) 10Ottomata: Use CDH5 for labs instances [operations/puppet] - 10https://gerrit.wikimedia.org/r/143476 [20:56:57] (03PS2) 10Ottomata: Use CDH5 Hive for labs instances [operations/puppet] - 10https://gerrit.wikimedia.org/r/143476 [20:59:08] Reedy: er, are you going to deploy it now or should I add it to the SWAT deploy? [21:00:04] bsitu, spagewmf: The time is nigh to deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140701T2100) [21:00:30] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [21:04:21] _joe_: ok, so, your choice: [21:05:23] 1) add ignore => '[a-z]*' to the *-available file resources. 
apache::site / apache::conf -provisioned files all have numeric prefixes, so this would exempt package-provided conf files while retaining automatic cleanup of unprovisioned confs [21:06:31] 2) add ignore => '{security,other-vhosts-access-log}*', which would retain the only two that have (a) actual configs (most of the rest are just comments), (b) which we would plausibly want [21:07:12] 3) (my preferred option, and consistent with faidon's preference for sysctl.d / rsync.d) have the apache module provision a default apache::conf file that contains the following directives: [21:07:36] 4) replace apache with nginx [21:07:43] ServerTokens Prod, TraceEnable Off, CustomLog ${APACHE_LOG_DIR}/other_vhosts_access.log vhost_combined [21:08:13] TraceEnable Off comes from security.conf, CustomLog ... comes from other-vhosts-access-log.conf [21:08:28] ServerTokens is actually set to 'OS' by security.conf, but we don't want that [21:09:06] <_joe_> ori: ok, my only concern is, using 3) we will have to track the package for changes [21:09:28] <_joe_> but that is reasonable [21:09:51] the comments in conf-enabled/* suggest to me that the content of those files is "here are some things to think about" rather than "here are sensible defaults" (those go in apache2.conf) [21:10:03] that's my interpretation of the fact that, for example, charset.conf is entirely commented out [21:10:09] <_joe_> you know that, if not given tight constraints, I'll go with 4) :P [21:10:34] i'm with you on that [21:10:46] <_joe_> ori: yes it usually is. In I think hardy or lucid they added one file there to mitigate a CVE, but we track those anyway [21:10:50] i found out the other day that the version of apache in trusty doesn't support unix domain sockets :( it's one minor revision behind [21:11:08] mutante: if you got a unitedlayer access list audit email, you can ignore it as I am handling it. 
[21:11:18] you are listed as contact type notification on it, so i assume you may get it [21:11:21] _joe_: cool, so it's settled [21:11:46] _joe_: it seems good to set a generic default ServerAdmin too while we're at it which is why i was asking about {noc,root,webmaster}@wikimedia.org [21:12:14] <_joe_> ori: then webmaster@ I'd say [21:16:05] (03PS4) 10Jforrester: Create a dblist for non-Beta Features wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/120171 [21:16:58] sounds good [21:19:14] (03CR) 10BBlack: [C: 031] remove oxygen.eqiad.wmnet, doesn't exist [operations/dns] - 10https://gerrit.wikimedia.org/r/143472 (owner: 10Dzahn) [21:22:11] eqiad-esams issue is fixed fwiw [21:28:50] (03PS4) 10Ori.livneh: Add apache::conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 [21:29:36] (03PS5) 10Ori.livneh: Add apache::conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 [21:29:38] How long should it take for an extension change to get pushed to the deployment-prep apaches? [21:31:16] hashar, do you know? [21:32:41] Krenair: roughly 15 minutes max iirc [21:33:02] Krenair: https://integration.wikimedia.org/dashboard/ list the jenkins jobs that updates beta [21:33:34] Krenair: once merged, Gerrit update the extension in mediawiki/extensions.git . That repo is pulled every 10 minutes or so [21:33:41] Krenair: then we run scap to deploy [21:33:48] it is being run right now [21:33:55] https://integration.wikimedia.org/ci/job/beta-scap-eqiad/11741/console [21:34:24] Krenair: scap job triggered by an upstream job that pulled the extensions https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/13962/consoleFull [21:35:03] Krenair: see also https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated ! 
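[editor's note] The default conf file ori proposes as option 3, with the ServerAdmin that _joe_ agreed on, might look like the following. The file name and numeric prefix are hypothetical; the directives are the ones quoted in the discussion above.

```apache
# 00-defaults.conf -- hypothetical name; would be provisioned via apache::conf
# with a numeric prefix so it survives the purge of package-provided confs.
# Replaces the useful parts of the packaged security.conf and
# other-vhosts-access-log.conf snippets.
ServerTokens Prod
TraceEnable Off
ServerAdmin webmaster@wikimedia.org
CustomLog ${APACHE_LOG_DIR}/other_vhosts_access.log vhost_combined
```

One caveat, visible later in this same log: ${APACHE_LOG_DIR} is only defined (via envvars) by newer Apache packages, so on lucid-era hosts the literal path /var/log/apache2 has to be used instead.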
[21:35:04] I am off [21:35:07] Okay [21:35:09] Thank you hashar [21:35:36] (03CR) 10Ori.livneh: [C: 032] Add apache::conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 (owner: 10Ori.livneh) [21:36:20] Krenair: feel free to talk about your finding on one of the lists hehe [21:36:45] Krenair: the more people knows about how beta work, the more they will use it to track bugs / tests [21:37:56] sleeps [21:39:13] PROBLEM - puppet last run on tmh1001 is CRITICAL: CRITICAL: Puppet has 5 failures [21:39:27] uh oh [21:39:33] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 2 failures [21:39:34] !log Set email for re-renamed dewiki account "Kolimak". Email and password got lost during a screwed rename. [21:39:39] Logged the message, Master [21:39:52] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: Puppet has 6 failures [21:39:52] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 4 failures [21:39:53] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: Puppet has 7 failures [21:40:02] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 2 failures [21:40:12] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures [21:40:12] PROBLEM - puppet last run on analytics1012 is CRITICAL: CRITICAL: Puppet has 7 failures [21:40:12] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 4 failures [21:40:23] aww man [21:40:28] icinga yer such a downer. [21:40:32] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: Puppet has 4 failures [21:40:32] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Puppet has 1 failures [21:40:51] probably me [21:40:52] checking [21:40:52] PROBLEM - HTTP on nickel is CRITICAL: Connection refused [21:41:04] i was just looking through recent merge emails, heh [21:41:12] is the "puppet has X failures" new? 
[21:41:24] no, not me [21:41:29] Error: /Stage[main]/Admin/Admin::Hashuser[springle]/Admin::User[springle]/File[/home/springle]: Failed to generate additional resources using 'eval_generate': Connection reset by peer - SSL_connect [21:41:34] it was discussed that it was coming, so im guessing someone merged it sometime recently [21:41:41] robh: cool [21:41:43] its good that its happening though, means we check those systems [21:41:54] but now i feel obligated to check them! [21:41:56] boooooooo [21:41:57] hissssss [21:42:08] hm, no, wait [21:43:09] my change did trigger a refresh of the apache service on palladium, so that's a potential vector [21:43:42] but it didn't purge any files or enable new configs [21:43:48] so don't rest easy thinking this is figured out [21:43:52] PROBLEM - HTTP on aluminium is CRITICAL: Connection refused [21:45:13] ah yes, apache failed to start on aluminium [21:46:42] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:47:52] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [21:47:52] RECOVERY - HTTP on aluminium is OK: HTTP OK: HTTP/1.1 302 Found - 557 bytes in 0.001 second response time [21:47:54] ok, should all recover shortly [21:48:25] aluminum is on lucid and its version of apache doesn't set ${APACHE_LOG_DIR} [21:48:28] well, i ran puppet on cp3019 [21:48:34] cuz it said 4 failures, but ran without incident [21:48:40] so thats odd [21:49:04] i'm going to leave the next failure to see if it auto clears on systems next automated run [21:49:06] im curious [21:49:34] (the cp3X range ones are what i'll look at) [21:50:02] someone investigating why nickel/ganglia is down? 
[21:51:00] i will, since it's probably related to my change [21:51:31] it cant start service apache [21:51:36] same reason i think [21:51:41] (03PS1) 10Ori.livneh: Fix-up for I5bf6186d7: don't reference ${APACHE_LOG_DIR}; unavailable on lucid [operations/puppet] - 10https://gerrit.wikimedia.org/r/143490 [21:51:51] fixed by ^^ [21:51:52] yes nickel is lucid [21:52:44] (03CR) 10RobH: [C: 031] Fix-up for I5bf6186d7: don't reference ${APACHE_LOG_DIR}; unavailable on lucid [operations/puppet] - 10https://gerrit.wikimedia.org/r/143490 (owner: 10Ori.livneh) [21:52:52] RECOVERY - HTTP on nickel is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.002 second response time [21:52:52] thanks, apache is back up on that host [21:52:57] sorry bout that [21:53:07] (03CR) 10Ori.livneh: [C: 032] Fix-up for I5bf6186d7: don't reference ${APACHE_LOG_DIR}; unavailable on lucid [operations/puppet] - 10https://gerrit.wikimedia.org/r/143490 (owner: 10Ori.livneh) [21:53:31] paravoid: i'm responding to the guys at orange. is tech-emergency@wikimedia.org the correct email address for outage reporting (non-Wikipedia Zero) or is there a better one? i'm going to tell them to email wikipediazero@wikimedia.org for W0 stuff, but wanted to convey the correct contact info for other stuff. 
[21:53:50] dr0ptp4kt: it reached us, so I guess it's fine [21:54:05] paravoid: cool [21:54:10] english, though :) [21:54:12] PROBLEM - puppet last run on nickel is CRITICAL: CRITICAL: Puppet has 1 failures [21:54:22] or next time I'm replying in Greek :P [21:57:12] RECOVERY - puppet last run on analytics1012 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [21:57:12] RECOVERY - puppet last run on tmh1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:57:22] PROBLEM - puppet last run on search1004 is CRITICAL: CRITICAL: Puppet has 1 failures [21:57:32] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [21:57:32] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures [21:57:52] RECOVERY - puppet last run on mw1103 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [21:57:52] PROBLEM - puppet last run on elastic1003 is CRITICAL: CRITICAL: Puppet has 1 failures [21:58:18] paravoid: ha! 
[21:58:32] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: Puppet has 1 failures [21:58:32] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [21:58:33] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: Puppet has 1 failures [21:59:42] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [21:59:55] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [22:00:05] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [22:00:15] RECOVERY - puppet last run on nickel is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [22:00:15] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [22:01:46] all of these were related to ${APACHE_LOG_DIR} causing aluminum's apache to fail to start [22:02:05] i'm forcing puppet runs on the remaining ones so the alerts don't linger [22:02:36] (03PS1) 10Yurik: Added a comment to remove dup config setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143497 [22:03:17] (03CR) 10Dzahn: [C: 032] remove oxygen.eqiad.wmnet, doesn't exist [operations/dns] - 10https://gerrit.wikimedia.org/r/143472 (owner: 10Dzahn) [22:04:35] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [22:04:48] (03PS2) 10Dzahn: remove oxygen.eqiad.wmnet, doesn't exist [operations/dns] - 10https://gerrit.wikimedia.org/r/143472 [22:05:15] RECOVERY - puppet last run on search1004 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [22:05:35] RECOVERY - puppet last run on amssq33 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [22:06:18] (03CR) 10Dzahn: [C: 032] 
re-activate addWiki function in maintenance [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/143475 (owner: 10Dzahn) [22:06:55] RECOVERY - puppet last run on elastic1003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:06:56] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Complete puppet failure [22:08:36] andrewbogott: ^ success [22:08:56] that's the intended puppet breakage one [22:09:00] mutante: sort of… it has a race so reporting is inconsistent. [22:09:08] ooh.. i see [22:10:22] andrewbogott: it just made me think of another way.. there are generic log checks for icinga.. so we can already check for any string in any log... [22:10:30] we could just check "Error 400 on SERVER" [22:10:39] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [22:12:01] mutante: how would it know when things were fixed? Like, does it grep the log or just notice changes? [22:15:35] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:16:18] andrewbogott: one can define a "recovery pattern" (say, "Finished catalog run") [22:16:25] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.074 second response time [22:16:30] it normally only checks all lines that have been added since last run [22:16:40] talking about http://labs.consol.de/nagios/check_logfiles/ [22:16:54] which is a bit more advanced than just grep [22:16:55] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 1 failures [22:17:25] PROBLEM - puppet last run on mw1027 is CRITICAL: CRITICAL: Puppet has 1 failures [22:18:09] mutante: that might have potential [22:18:22] mws still getting errors? 
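[editor's note] The check_logfiles approach mutante describes — a critical pattern plus a recovery pattern, scanning only lines appended since the previous run — is configured with a small Perl-syntax config file. A sketch along the lines of the puppet-failure example discussed above (tag, log path, and seekfile directory are hypothetical):

```perl
# /etc/check_logfiles/puppet-agent.cfg -- hypothetical path and tag.
# check_logfiles keeps a per-log seek file here, so each invocation
# only examines lines appended since the last run.
$seekfilesdir = '/var/tmp/check_logfiles';

@searches = (
    {
      tag              => 'puppet_agent',
      logfile          => '/var/log/syslog',
      criticalpatterns => [ 'Error 400 on SERVER' ],
      okpatterns       => [ 'Finished catalog run' ],
    },
);
```

A line matching okpatterns clears a previously raised criticalpatterns state, which is how the check answers andrewbogott's question of "how would it know when things were fixed".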
[22:18:24] I'm not sure why akosiaris approached it the way he did [22:18:50] andrewbogott: It's all greek to me [22:19:00] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [22:19:00] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: Puppet has 1 failures [22:19:30] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Puppet has 7 failures [22:19:36] mutante: as you see, virt1009 is now reporting recovery but nothing has changed [22:19:55] hmm.. yea. i see [22:20:00] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: Puppet has 17 failures [22:20:10] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 12 failures [22:20:10] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 1 failures [22:20:20] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Puppet has 13 failures [22:20:20] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Puppet has 8 failures [22:20:30] PROBLEM - puppet last run on analytics1009 is CRITICAL: CRITICAL: Puppet has 24 failures [22:20:30] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: Puppet has 7 failures [22:20:30] PROBLEM - puppet last run on mc1015 is CRITICAL: CRITICAL: Puppet has 5 failures [22:20:30] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: Puppet has 29 failures [22:20:30] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 14 failures [22:20:40] who has most failures wins [22:20:40] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: Puppet has 28 failures [22:20:42] 29 failures, new record [22:20:50] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Puppet has 8 failures [22:20:53] :) [22:21:00] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: Puppet has 3 failures [22:21:00] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 4 failures [22:21:00] PROBLEM - puppet last run on db1005 is 
CRITICAL: CRITICAL: Puppet has 13 failures [22:21:00] PROBLEM - puppet last run on zinc is CRITICAL: CRITICAL: Puppet has 6 failures [22:21:00] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Puppet has 1 failures [22:21:10] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 4 failures [22:21:10] PROBLEM - puppet last run on db1027 is CRITICAL: CRITICAL: Puppet has 3 failures [22:21:10] PROBLEM - puppet last run on virt1005 is CRITICAL: CRITICAL: Puppet has 6 failures [22:21:10] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 2 failures [22:21:20] PROBLEM - puppet last run on tmh1001 is CRITICAL: CRITICAL: Puppet has 15 failures [22:21:20] PROBLEM - puppet last run on mw1078 is CRITICAL: CRITICAL: Puppet has 7 failures [22:21:20] PROBLEM - puppet last run on elastic1010 is CRITICAL: CRITICAL: Puppet has 9 failures [22:21:21] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: Puppet has 26 failures [22:21:21] RECOVERY - puppet last run on mw1027 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [22:21:31] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Puppet has 3 failures [22:21:40] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 2 failures [22:21:50] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Puppet has 2 failures [22:22:00] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 2 failures [22:22:10] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Puppet has 2 failures [22:22:10] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 4 failures [22:22:20] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 3 failures [22:22:20] PROBLEM - puppet last run on amssq50 is CRITICAL: CRITICAL: Puppet has 1 failures [22:23:00] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [22:23:00] PROBLEM - puppet last 
run on virt1009 is CRITICAL: CRITICAL: Complete puppet failure [22:25:24] um… ok, what changed? anything? [22:30:51] (03PS1) 10RobH: sam smith to deployers group [operations/puppet] - 10https://gerrit.wikimedia.org/r/143500 [22:31:12] * greg-g looks up [22:31:40] ah right [22:32:43] someone +1 my change so im not being all self reviewy [22:32:46] =] [22:32:57] * robh is only going to wait a few minutes before he does it anyhow [22:33:10] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [22:33:24] (03CR) 10Andrew Bogott: [C: 031] sam smith to deployers group [operations/puppet] - 10https://gerrit.wikimedia.org/r/143500 (owner: 10RobH) [22:33:36] andrewbogott: thx =] [22:34:00] (03CR) 10RobH: [C: 032] sam smith to deployers group [operations/puppet] - 10https://gerrit.wikimedia.org/r/143500 (owner: 10RobH) [22:35:00] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [22:35:18] ..... [22:35:23] crap i did that and now i regret it [22:35:27] sorry, should've done that for ya robh, I was just curious [22:35:31] why did no one make him acknowledge server access respoinsiblities? [22:35:36] =P [22:35:37] hah! 
[22:35:48] * greg-g retracks apology [22:35:51] you were quick checking that ssh key [22:35:53] and no one made sure that key wasn't labs key [22:36:01] RECOVERY - puppet last run on ssl3001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [22:36:03] i forgot that part, blaaaaah [22:36:06] now i gott acheck [22:36:30] RECOVERY - puppet last run on mc1015 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [22:36:43] ok, it doesnt match key in ldap [22:36:52] so yay [22:36:58] (labs key) [22:37:10] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [22:37:10] RECOVERY - puppet last run on db1027 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [22:37:10] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:37:20] so whew. I can simply direct him to the responsibilities page in my resolution email [22:37:20] RECOVERY - puppet last run on tmh1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [22:37:20] RECOVERY - puppet last run on mw1047 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:37:20] RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [22:37:30] RECOVERY - puppet last run on analytics1009 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [22:37:30] RECOVERY - puppet last run on hydrogen is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [22:37:30] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [22:37:30] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [22:37:30] RECOVERY - puppet last run on cp1066 is OK: OK: 
Puppet is currently enabled, last run 44 seconds ago with 0 failures [22:37:40] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [22:37:40] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:37:50] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [22:38:00] RECOVERY - puppet last run on mw1103 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [22:38:00] RECOVERY - puppet last run on db1005 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [22:38:00] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [22:38:00] RECOVERY - puppet last run on zinc is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [22:38:00] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [22:38:01] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [22:38:14] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [22:38:14] RECOVERY - puppet last run on virt1005 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [22:38:14] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [22:38:14] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [22:38:20] RECOVERY - puppet last run on amssq50 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [22:38:20] RECOVERY - puppet last run on mw1078 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [22:38:20] 
RECOVERY - puppet last run on elastic1010 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [22:38:20] RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [22:38:30] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [22:38:50] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [22:39:01] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [22:40:16] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [22:40:34] yay recoveries [22:45:19] hey, now that the icinga spam is over, ops should make one of these for us: https://monitor.archive.org/weathermap/weathermap.html [22:48:57] greg-g: guy who originally wrote that tool was an old colleague of us/alex [22:49:02] well, alex's boss at one point too :) [22:49:05] awesome :) [22:49:18] the perl version [22:49:26] it got rewritten in PHP by some person at some point [22:51:34] (03CR) 10Matanya: [C: 04-1] Modify nova role to better support labs uses. (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/141836 (owner: 10Andrew Bogott) [22:54:54] paravoid: Any idea where the source is? [22:57:30] cacti [22:58:33] Reedy, have you already deployed https://gerrit.wikimedia.org/r/#/c/143473/ ? [22:58:44] Nope [22:58:53] Didn't make an update commit for core [23:00:04] RoanKattouw, mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140701T2300) [23:00:21] * MaxSem volunteers [23:02:49] to do swat, or the update? 
;) [23:03:01] the swat [23:04:01] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 80 data above and 9 below the confidence bounds [23:04:25] !log maxsem Synchronized php-1.24wmf10/resources/: https://gerrit.wikimedia.org/r/#/c/142975/ (duration: 00m 19s) [23:04:29] Logged the message, Master [23:05:26] !log maxsem Synchronized php-1.24wmf11/resources/: https://gerrit.wikimedia.org/r/#/c/142975/ (duration: 00m 05s) [23:05:31] Logged the message, Master [23:06:08] legoktm, yt? about to deploy your stuff [23:06:13] hey [23:06:21] I'm just doing the core commit... [23:06:53] Reedy, it's already in deployment branch [23:07:11] it needs a submodule bump [23:07:23] That's what I mean [23:07:28] https://gerrit.wikimedia.org/r/143505 [23:07:29] eh [23:08:05] I should probably sleep sometimes:) [23:09:30] !log maxsem Synchronized php-1.24wmf11/extensions/CentralAuth/: https://gerrit.wikimedia.org/r/#/c/143473/ (duration: 00m 05s) [23:09:34] Logged the message, Master [23:09:57] thanks! [23:10:06] * MaxSem scratches head [23:10:12] nothing else to deploy? 
[23:13:44] (03CR) 10Dzahn: [C: 04-2] dataset-replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138000 (owner: 10Rush) [23:21:20] (03PS6) 10Ori.livneh: role::mediawiki::webserver: set maxclients dynamically [operations/puppet] - 10https://gerrit.wikimedia.org/r/137947 [23:26:18] (03PS1) 10Rush: trusty friendly phabricator [operations/puppet] - 10https://gerrit.wikimedia.org/r/143510 [23:31:54] (03CR) 10MaxSem: [C: 032] FeaturedFeeds for Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136316 (https://bugzilla.wikimedia.org/66015) (owner: 10Whym) [23:32:02] (03Merged) 10jenkins-bot: FeaturedFeeds for Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136316 (https://bugzilla.wikimedia.org/66015) (owner: 10Whym) [23:34:18] !log maxsem Synchronized wmf-config/FeaturedFeedsWMF.php: https://gerrit.wikimedia.org/r/#/c/136316/ (duration: 00m 04s) [23:34:22] Logged the message, Master [23:36:13] !log maxsem Synchronized wmf-config/FeaturedFeedsWMF.php: https://gerrit.wikimedia.org/r/#/c/136316/ now for realz (duration: 00m 04s) [23:36:18] Logged the message, Master [23:40:35] (03PS1) 10RobH: setting up the node definitions for francium as blog server [operations/puppet] - 10https://gerrit.wikimedia.org/r/143517 [23:42:22] (03CR) 10RobH: [C: 032] "self reviewing since its just applying already tested class to a new server (and no service changes yet)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143517 (owner: 10RobH) [23:43:16] !log any francium errors can be ignored, as the software doesn't fully deploy from puppet and its not in service [23:43:22] Logged the message, Master [23:43:56] (03CR) 10Dzahn: [C: 031] trusty friendly phabricator [operations/puppet] - 10https://gerrit.wikimedia.org/r/143510 (owner: 10Rush) [23:59:32] (03PS2) 10Rush: trusty friendly phabricator [operations/puppet] - 10https://gerrit.wikimedia.org/r/143510