[01:01:53] New patchset: Lcarr; "Adding in ganglia apache file temporarily using nickel.wikimedia.org in file for testing purposes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1787 [01:02:13] can i get a check plz ? [01:09:17] New patchset: Lcarr; "Adding in ganglia apache file temporarily using nickel.wikimedia.org in file for testing purposes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1787 [01:09:32] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1787 [01:09:33] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1787 [01:27:03] New patchset: Lcarr; "adding in rrdtool to ganglia::web" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1789 [01:27:24] Ryan_Lane: have a sec? [01:29:48] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1789 [01:29:48] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1789 [01:37:07] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1143s [01:43:17] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1492s [01:43:17] AaronSchulz: sup? [01:44:30] I was a bit worried about how active https://github.com/rackspace/php-cloudfiles/commits/master is [01:45:07] this is the PHP Cloud Files API for swift [01:48:32] * AaronSchulz wishes issues/pull requests were responded to more [02:00:12] quite a lot of the other projects look rather active [02:00:44] ? [02:01:26] https://github.com/rackspace/ [02:01:38] https://github.com/rackspace/php-cloudfiles/commits/master [02:01:43] bah [02:01:44] Last updated 28 minutes ago [02:01:49] Last updated about 2 hours ago [02:01:50] Etc [02:03:16] AaronSchulz: whoops. sorry [02:05:16] PROBLEM - ps1-a5-sdtpa-infeed-load-tower-A-phase-Y on ps1-a5-sdtpa is CRITICAL: ps1-a5-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2650* [02:05:26] AaronSchulz: yeah. no clue about that [02:08:27] New patchset: Ryan Lane; "We aren't using nova-volume right now, and it throws errors since it's not configured. Remove it." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1791 [02:08:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1791 [02:08:45] New patchset: Ryan Lane; "Adding requires to all services." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/1792 [02:09:02] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1791 [02:09:02] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1791 [02:09:15] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1792 [02:09:16] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1792 [02:20:46] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:26:06] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:54:36] RECOVERY - ps1-a5-sdtpa-infeed-load-tower-A-phase-Y on ps1-a5-sdtpa is OK: ps1-a5-sdtpa-infeed-load-tower-A-phase-Y OK - 2388 [03:01:42] New patchset: Ryan Lane; "Make a ganglia cluster for the virt cluster." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1793 [03:02:03] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1793 [03:02:04] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1793 [04:19:22] RECOVERY - Disk space on es1004 is OK: DISK OK [04:20:02] RECOVERY - MySQL disk space on es1004 is OK: DISK OK [04:28:32] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [04:37:02] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [06:01:57] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [09:54:01] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 430286 MB (3% inode=99%): [09:57:41] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 416314 MB (3% inode=99%): [10:02:11] RECOVERY - MySQL slave status on es1004 is OK: OK: [12:24:15] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [14:31:54] hexmode: the machine was rebooted for some reason [14:32:34] anyway I talked with Ariel and we decided to use another way of import so now I started it again, speed is 1 page in a second [14:32:57] so I suppose it should be finished in few days maybe 3 [14:35:15] you should just be generating sql right? [14:35:21] or it finished that already? [14:35:23] I used the mwimport [14:35:41] so you are stuffing in the results via mysql now? [14:35:43] !log restarting memcached on srv290 [14:35:45] Logged the message, Master [14:35:49] yes [14:35:51] ok [14:35:57] that seems very slow but whatever [14:36:13] I started bzip | mwimport | mysql... [14:36:35] that's only way how I can avoid extracting it [14:36:56] ohhhh [14:37:00] well um [14:37:03] I would have done: [14:37:10] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [14:37:13] bzip | mwimport | bzip > import-this.sql.bz2 [14:37:26] would it be faster? [14:37:30] and then a second mysql command to feed the results [14:37:58] you get through the first part faster, I feel like. 
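For reference, the two import approaches being weighed above look roughly like this as shell pipelines (a sketch only: the dump filename, database name, and mysql options are assumptions, and mwimport is taken to read dump XML on stdin and emit SQL on stdout):

    # One-pass (what is running now): decompress, convert and load in a single pipeline,
    # so the whole chain moves at the speed of the slowest stage -- mysqld in this case.
    bzcat simplewiki-latest-pages-articles.xml.bz2 | mwimport | mysql -u root -p simplewiki

    # Two-pass (the suggestion): finish the cheap decompress+convert step first and keep
    # the result compressed on disk, then load it into MySQL as a separate step.
    bzcat simplewiki-latest-pages-articles.xml.bz2 | mwimport | bzip2 > import-this.sql.bz2
    bzcat import-this.sql.bz2 | mysql -u root -p simplewiki
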
but as long as you started it that way [14:38:02] might as well let it run for now [14:38:22] yes I would probably get through first part faster but not through second [14:38:30] because load of server is 1 atm [14:38:43] so I guess it's running as fast as possible (slowest is mysql) [14:39:00] RECOVERY - Memcached on srv290 is OK: TCP OK - 0.004 second response time on port 11000 [14:39:07] it's actually waiting for mysqld all the time [14:39:08] !log srv290 - before restart memcached was running with -m 64 and -l 127.0.0.1 for some reason, causing Nagios CRIT, now it looks like others and recovered [14:39:09] Logged the message, Master [14:39:47] out of curiosityyyy [14:40:04] I wonder what real db host and what real tables are being used by this test project [14:40:15] I think we ought to find that out very soon [14:40:27] you mean physical server? [14:41:15] both apache and sql are vm instances they run on virt servers [14:41:43] exact id of server is in console if you wanted to know it [14:42:02] hostnames are deployment-sql and deployment-web [14:42:29] I'll look into that further at some point [14:42:52] !log same on srv193 [14:42:53] Logged the message, Master [14:43:44] btw is there a way to import log actions, page properties etc? I imported page props but all pages have still no properties, protections etc [14:44:07] there's some other tables, look at the dumps page [14:44:08] I don't think it does matter but I thought hexmode wanted exact copy [14:44:14] you'll shove all those in too, ore most of them [14:44:23] *or [14:44:23] ok [14:44:43] what is simplewiki-latest-page.sql.gz [14:44:45] is that page table? [14:45:00] if so I guess I don't need it because it's being filled up by import [14:45:48] so . . . when reading about search earlier I noticed that searchidx1's data partition is at 100%, has anyone followed the wiki directions on fixing that and observed a good outcome? [14:48:27] you should have a look, I don't remember what they all are [14:48:35] 100%? seriously? [14:48:38] ya [14:48:41] crap [14:48:47] well I wrote those instructions back in the day [14:48:54] oh that's good! [14:48:55] things might have moved around a little but they [14:49:02] should be safe, let me look at em once more [14:49:10] RECOVERY - Memcached on srv193 is OK: TCP OK - 0.002 second response time on port 11000 [14:50:36] the ones that say "citation needed" may not help you much, I would skip those [14:50:55] everything else is ok and you do need to do a restart [14:51:06] ok [14:51:08] check to make sure things are still in the same locations [14:51:18] man how can we be out of space again already, that sucks [14:51:50] i see that searchidx1 does not show up in nagios, perhaps that's why it didn't page [14:52:51] here goes. /me crosses fingers about causing a search outage :-( [14:53:27] things look to be in the locations you documented [14:53:28] * apergos crosses them too [14:53:49] what happens when /a/search is maxed? do we stop indexing new content? [14:54:32] guess so [14:54:38] is it at 100% 0 bytes left? [14:55:04] i think it was, but i already started deleting [14:56:52] ah, from backscroll: /dev/sda6 564176672 559158400 5018272 100% /a [14:57:07] ok well saved by the bell or something [14:57:35] I'd rather not have it hit 0 bytes left, better we wonder what the consequences might have been.... [14:57:45] next up . . . why is searchidx1 not monitored? 
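A hand-written sketch of what a disk-space check on that partition could look like -- purely illustrative, since the production Nagios config is generated from puppet (monitor_service and friends) rather than written by hand, so the command names, paths and thresholds below are assumptions:

    # NRPE command on searchidx1, e.g. in an /etc/nagios/nrpe.d/ drop-in (hypothetical file)
    command[check_disk_a]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p /a

    # Matching service definition on the Nagios server
    define service {
        host_name            searchidx1
        service_description  Disk space /a
        check_command        check_nrpe!check_disk_a
    }
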
[14:57:55] what a great q [14:58:52] !log searchidx1 /a reached 100%, did the "space issues" maintenance procedure from wikitech search documentation [14:58:53] Logged the message, Master [14:59:09] now: /dev/sda6 564176672 558949252 5227420 100% /a [14:59:12] sigh. [15:00:08] oh lookit that, according to RT, seachidx1 is out of commission [15:00:17] RT #1286 [15:00:41] !log magnesium - memcached runs on default port 11211, but we run all the others on 11000, this causes Nagios CRIT. Is it supposed to run here? (was also on -l 127.0.0.1 only, but init script starts it on all) [15:00:42] Logged the message, Master [15:00:45] !rt 1286 [15:00:45] http://rt.wikimedia.org/Ticket/Display.html?id=1286 [15:00:48] :o [15:01:19] sweet. I'm just going to make a little note on the wikitech search page that it is now inaccurate [15:01:40] RECOVERY - Memcached on magnesium is OK: TCP OK - 0.031 second response time on port 11211 [15:01:50] I knew we had set up new searchi ndexers but I didn't know they had retired "1"(the name) [15:02:10] so it's out of space because it was out of space then [15:02:11] fine [15:09:53] hexmode: around [15:09:58] !puppet-kick is http://docs.puppetlabs.com/man/kick.html [15:09:58] Key was added! [15:16:50] RECOVERY - Puppet freshness on lvs1004 is OK: puppet ran at Thu Jan 5 15:16:37 UTC 2012 [15:18:01] !log lvs1004 - puppet didnt run since 12 hours, looked stuck, "already in progress" on every run. rm /var/lib/puppet/state/puppetdlock, restart puppet agent, finished fine in a few seconds. maybe puppet bug 2888,5246 or related [15:18:03] Logged the message, Master [15:37:56] !log dataset1 - date was off by ~ 27 hours. known issues RT 216 & 1345 with hardware clock, additionally though Nagios NTP check is still buggy (possibly due to leap seconds ;P) -> http://tech.akom.net/archives/27-Nagios-check_ntp-quits-working-in-2009-with-Offset-unknown.html) [15:37:56] Logged the message, Master [15:42:32] !log Nagios check_ntp does stuff like: overall average offset: 0 -> NTP OK: Offset unknown| -> NTP CRITICAL: Offset unknown (even though this bug was supposed to be fixed in a version before the one we use)..sigh [15:42:33] Logged the message, Master [15:48:40] !log ms1002 - kswapd 100% CPU - but no swap used and free memory left - this looks like https://bugs.launchpad.net/ubuntu/+bug/721896 again [15:48:41] Logged the message, Master [15:49:51] well leap seconds are not going to cause a 27 hour drift [15:50:05] yeah I sent mail mentioning it on the last email on the rt ticket... [15:50:27] they are talking about abolishing leap seconds [15:50:29] in theory new motherboard = clock issues fixed, and it seemed ok... [15:50:37] what will they do without them? [15:50:52] apergos: leap seconds broke the Nagios check, the broken hardware clock cause the 27 hours on the box itself [15:51:19] well it shouldn't (in theory) have a broken hardware clock any more [15:51:24] how I hate dataset1 [15:52:30] !log quotes on kswapd problem (that also appeared on other servers): "has nothing to do with swap space or memory".."the kernel process which swaps tasks".."means the kernel is spending more time context switching tasks than it is actually executing the tasks".."you're chasing a ghost if you're trying to tune your swap/memory environment" [15:52:31] Logged the message, Master [15:53:28] apergos: i set the time via: ntp stop, ntpdate dobson.wikimedia.org, ntp start .. 
and that told me the offset was over 100k seconds [15:53:52] apergos: but the "offset unknown" in Nagios check is an unrelated problem.. [15:53:56] ok [15:54:09] the problem on dataset appears to be an xfs problem [15:54:19] only once have I seen it crap out in kswapd [15:55:25] it seems like ms1002 needs to be rebooted and there is no other fix [15:55:42] ohhh [15:55:52] like "06:29 Tim: depooled db28, locked up in kswapd, needs reboot " [15:56:06] yeah once it goes out to lunch that's usually all she wrote [15:56:06] "13:21 mark: Stopped all rsyncs to investigate ms5's sudden kswapd system cpu load " [15:56:09] etc..etc [16:10:57] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [16:13:25] baah. "Although this bug is reported on LTS 10.04, I've to report that the bug still exists in kernel 3.1" [16:18:38] !log people claim it was "completely resolved with "2.6.38-10 backport from PPA." (add-apt-repository ppa:kernel-ppa/ppa ...). wanna try that? (or just reboot ms1002 pls) [16:18:39] Logged the message, Master [16:19:28] I think that's the backport I am running on ds1 :-P [16:20:40] know anything about es1002? [16:20:57] down since 23 days [16:21:30] mutante: that's our box that we use to make sure boxes can still be down. [16:21:43] it's like a control box in the experiement of bringing boxes up [16:22:03] (or, in non-sarcastic english... no. I don't know) [16:22:14] hehee, hi peter [16:22:21] that was a nice one [16:23:11] apergos: i need to take dataset1 down to open it up and get pic for SM [16:23:25] please do [16:23:31] and all the stuff with the cables and everything else [16:23:39] nothing's running over there (gee, wonder why not :-P) [16:24:06] :( [16:24:31] it thought it wasnt time to run yet:) [16:25:02] well so far we still (apparently) have the clock issues, and we still (apparently) have the xfs kernel whammo issues [16:25:05] sooooo [16:26:16] we don't use dhcp for anything other than pxe boot, correct? [16:26:38] sounds right to me [16:34:10] ACKNOWLEDGEMENT - MySQL master status on es1001 is CRITICAL: CRITICAL: Read only: expected OFF, got ON daniel_zahn RFC in RT #2216 [16:35:40] ACKNOWLEDGEMENT - Disk space on es1002 is CRITICAL: Connection refused by host daniel_zahn please update RT 2216 [16:49:10] ACKNOWLEDGEMENT - Host db19 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn request for comment in !rt 2217 [16:51:14] !log payments4 - 25 running nginx procs cause a warning - but normal and just raise limit? [16:51:15] Logged the message, Master [16:55:40] PROBLEM - Host dataset1 is DOWN: CRITICAL - Host Unreachable (208.80.152.166) [16:58:00] mutante: I don't understand your log re. payments4? [16:58:41] Jeff_Green: the number of nginx processes on it cause a warning [16:59:18] Jeff_Green: and trying to get rid of warnings that are not actually critical.. so if its normal like that, i would raise the warning limit [16:59:39] iirc it should warn <49 procs [17:00:40] ok,let me check if its puppetized [17:00:43] I believe normal operation is 50 procs and less or more is fail [17:00:53] it's puppetized within the payments rig puppet instance [17:01:09] remember these use nsca passive checks [17:01:30] ah,true! 
thx [17:01:32] check_procs -w 49:50 -c 10:75 -C nginx [17:01:47] it's in payments*:/usr/local/bin/nagios_nsca [17:01:49] runs on cron [17:02:05] * cmjohnson1 hates dataset1 [17:02:10] sorry, docs are still not done for payments, this is what I'm trying to work on this month [17:02:31] Jeff_Green: that makes it: warn if less than 49 OR more than 50, critical if less than 10 OR more than 75 [17:02:47] Jeff_Green: hence it is a warning because it is less than 49, and 25 right now [17:02:47] yeah [17:02:59] exactly, which is a little weird [17:03:01] looking [17:03:12] i supposed its about the maximum, but its the minimum instead.ok [17:03:42] the minimum suggests it's failing to spawn new child procs when they die [17:04:11] so its good that it tells us and we dont need to change it? ok, fine [17:04:40] i didn't see notify in IRC? was it on the nagios front page? [17:05:02] eh, yeah, i just went through Nagios web [17:05:14] k. this is the first time I've heard of any of them even warning. have you seen it before? [17:05:24] trying to lower the number of warnings that have been sitting there [17:06:20] well, just because i went through them on purpose ..its not reported here because its just a warning [17:06:24] ok [17:06:55] curious--the nginx conf is different here for some reason [17:07:22] the fr folks have been using it as a staging box, maybe they tewaked it [17:09:28] maybe I should make the monitoring script read nginx.conf! [17:10:25] heh, or use a puppet variable and use it in nginx.conf.erb and the monitor_service definition? [17:10:34] yeah [17:17:47] apergos: sent email to SM...cc'd you on it....a new problem has come up...only one power supply will work at a time and one fan is not functioning [17:18:03] oh nnnnooooooooo [17:18:08] oh this bad movie will never end [17:18:10] :-( [17:18:36] i think the fan connector is the culprit...it is not meant to be disconnected this many times [17:18:59] power....no idea...there is only one way to connect the power back to the motherboard [17:19:15] thanks for doing it [17:19:36] * cmjohnson1 wonders how long this saga will go on... [17:20:39] apergos: with the bad fan...I think i should shut it down to prevent anymore problems [17:21:20] I see your point [17:21:23] sure [17:21:49] I don't want to kill the processors w/heat [17:23:54] I almost do :-P [17:24:01] but no, go ahead, power it off for now [17:24:16] it's not doing anything but still best to be on the safe side [17:24:35] okay...once SM gets back to us...i can power on for more "testing" [17:25:04] exactly [17:25:17] presumably they would send out a new part first [17:25:20] this is ds1? who's producing the movie? [17:25:29] yeah dataset1 [17:25:36] brought to you by silicon mechanics [17:25:41] *eyeroll* [17:53:06] <^demon> !log removed chuck norris plugin from jenkins, restarted [17:53:07] Logged the message, Master [17:57:08] what does chuck do? [17:57:17] Nothing now [17:57:23] well sure [17:58:05] ;D [17:58:15] <^demon> He added chuck norris jokes to pages. [17:58:34] <^demon> But useless plugins are useless and an extra point of failure. [18:02:38] and the cycles we saved! [18:03:43] !log tarin - added "#includedir /etc/sudoers.d" to sudo config, needs to read /etc/sudoers.d/nrpe for Nagios RAID check [18:03:44] Logged the message, Master [18:04:45] mutante: we already had that in labs fwiw [18:06:53] jeremyb: the sudo option? 
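The moving parts in that sudo change on tarin look roughly like this (the #includedir line is quoted from the log entry above; the contents of the drop-in file are an assumption based on the user and command discussed below):

    # /etc/sudoers -- the directive added on tarin; despite the leading '#' this is a
    # directive, not a comment, and it pulls in every file under /etc/sudoers.d/
    #includedir /etc/sudoers.d

    # /etc/sudoers.d/nrpe -- assumed contents: let the nagios user run the RAID check as root
    nagios ALL = NOPASSWD: /usr/bin/check-raid.py
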
yeah, it's a default now but tarin has been changed somehow [18:07:24] mutante: i can't tell if tarin is a person (and you're saying she's the one that did it?) or a box? [18:07:33] a server:) [18:07:42] ok :) [18:08:02] its RAID check was: before: NRPE: Unable to read output after: OK: no RAID installed [18:08:14] this is caused by the nagios user not having sudo [18:08:16] okey :) [18:08:46] for the specific command: /usr/bin/check-raid.py [18:09:12] yeah [18:09:13] which it gets in /etc/sudoers.d/nrpe . which wasnt read on this system.. arr..took a while:) [18:54:36] binasher: do you have some time to chat about the glam filter? [19:11:51] !log restarting dhcpd on brewster [19:11:52] Logged the message, and now dispaching a T1000 to your position to terminate you. [19:34:46] New patchset: Catrope; "WIP for breaking out puppet-specific hooks to puppet.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1794 [19:35:14] Change abandoned: Catrope; "WIP, shouldn't actually be merged" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1794 [20:44:44] Ryan_Lane: will you look at the chagnes I made to zone files, plx? [20:44:53] if you're not at lunch< i mean [20:49:44] notpeter: which zone? [20:49:49] oh. it's in svn :) [20:50:24] seems fine to me [20:52:09] while you're talking zones, DaBPunkt had a question in #-tech [20:53:33] kk, I checked it in. [20:53:53] Ryan_Lane: anything else I need to do to push to ns0-2? or is post commit hook? (it's been a while...) [20:57:36] instructions are on wikitech [20:57:47] should just be authdns-update on dobson [20:57:50] with your key forwarded [20:57:58] check the docs, though [20:58:01] yep, just read doc [20:58:03] thanks :) [20:58:08] yw [21:05:05] maplebed: yeah, so it's .shard-xx [21:05:56] so the string '.shard' is part of the container name? [21:06:06] why not just -xx? [21:38:51] LeslieCarr: you there? [21:38:57] here [21:38:59] what's up [21:39:26] soooo, remember the other day when I was trying to get those boxes to image? [21:39:42] can you poke around at the networking side of things a little more? [21:39:52] I had asher look over confs,and he agrees that they look correct [21:40:19] also, the logs look quite weird: [21:40:19] Jan 5 21:13:21 brewster dhcpd: DHCPDISCOVER from 84:2b:2b:77:50:b2 via 10.64.0.3: network 10.64.0/22: no free leases [21:40:25] LeslieCarr: successful requests from eqiad look like - Jan 5 13:10:45 brewster dhcpd: DHCPREQUEST for 10.64.0.29 from 00:40:8c:a2:3b:6e via eth0 [21:40:30] that usually happens when it doesn't get dns or rdns [21:40:38] note from eth0 vs from 10.64.0.3 [21:40:43] dns digs properly [21:40:48] which machine is 84:2b:2b:77:50:b2 ? [21:40:54] search1001 [21:40:57] cool [21:41:05] i'll use that as my test victim [21:41:12] ok, I'll get out of console [21:41:19] thank you for taking a look [21:42:35] !log rebooting virt1 [21:42:36] Logged the message, Master [21:43:09] lemme point gerrit to the others right now [21:43:41] Ryan_Lane: i can't ldap auth sudo in my labs instance now, to be expected? [21:44:26] unfortunately, yes [21:44:32] I had to reboot virt1 [21:44:42] and the instances can't connect to nfs1/2's LDAP [21:44:50] I really need to bring up that node in eqiad.... [21:45:59] * Ryan_Lane groans [21:46:04] fucking internal CA :( [21:47:14] I guess I'm pointing it back at virt1 for now [21:50:07] notpeter: i'm not getting any console output on search1001 [21:50:10] have you been getting some ? [21:50:22] none at all? 
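For context on the "no free leases" messages, a static, PXE-only ISC dhcpd setup for that subnet has roughly this shape -- a sketch, not brewster's actual (puppet-managed) config; the subnet and MAC come from the log, while the hostname form, fixed address and boot filename are assumptions:

    # Illustrative dhcpd.conf fragment
    subnet 10.64.0.0 netmask 255.255.252.0 {
        # No dynamic "range" is declared: installs rely on per-host static entries, so a
        # DISCOVER from a MAC without a matching (and already-loaded) host stanza is
        # answered with "no free leases".
    }

    host search1001 {
        hardware ethernet 84:2b:2b:77:50:b2;
        fixed-address search1001.eqiad.wmnet;   # assumed name; the log never gives the IP
        filename "pxelinux.0";                  # assumed boot file
    }

dhcpd only picks up config changes on restart, which fits the restart of dhcp3-server being what finally let the install proceed.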
[21:50:25] none at all [21:50:32] yes, I was able to see it fail to boot [21:51:42] so i see the reboot on the switch side... [21:53:15] can I grab the console for a sec? [21:53:26] oh there now i see it [21:53:28] sloooow bootup [21:53:40] yeah [21:57:36] it's installing now (i restarted dhcp3-server , if that was what did it [21:57:45] i'm off console if you want to see [21:57:46] lajhusuflakjhg [21:57:48] I mean, awesome [21:57:49] thanks! [21:57:51] haha [21:57:59] i know how it is, having someone else look at it magically fixes it [21:58:03] I don't know how many times I've restarted dhcp3 [21:58:06] yup! [21:58:18] well, thank you for being the person with the magic [21:59:07] np [22:20:26] New patchset: Ryan Lane; "Putting LDAP before files is insane, when most sudo is being handled by files." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1795 [22:21:39] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1795 [22:21:39] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1795 [23:52:18] New patchset: Lcarr; "Puppetizing ganglia and gangliaweb Puppetizing automatic saving and restoration of rrd's from tmpfs to disk Modifying gmetad startup to import rrd's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1797 [23:53:00] can i get a review on this (especially maplebed ) ? [23:53:16] sure. [23:54:17] how long does it take to run save-gmetad-rrds? [23:57:40] PROBLEM - mobile traffic loggers on cp1042 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [23:59:16] i can check [23:59:54] about 10-15 seconds
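The general shape of the tmpfs save/restore in change 1797 is something like the sketch below -- illustrative only, with both paths assumed; the real save-gmetad-rrds script (and the matching restore run from the gmetad init script) is whatever the patchset puppetizes:

    #!/bin/sh
    # save-gmetad-rrds-style job: copy the RRDs gmetad writes into tmpfs out to persistent
    # storage (e.g. from cron and at shutdown), so the metric history survives a reboot.
    set -e
    TMPFS_RRDS=/mnt/ganglia_tmp/rrds      # assumed tmpfs mount
    DISK_RRDS=/var/lib/ganglia/rrds.save  # assumed on-disk copy
    mkdir -p "$DISK_RRDS"
    rsync -a --delete "$TMPFS_RRDS/" "$DISK_RRDS/"

Restoring presumably runs the same copy in the opposite direction before gmetad starts writing, which would be the "import rrd's" part of the commit message.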