[00:03:28] New patchset: Ryan Lane; "Decommissioning mobile2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1895
[00:04:06] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1895
[00:04:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1895
[00:23:58] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 191 MB (2% inode=60%): /var/lib/ureadahead/debugfs 191 MB (2% inode=60%):
[00:26:55] New patchset: Ryan Lane; "Making virt0 the new controller. Moving all nova config to point to it." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1896
[00:27:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1896
[00:29:15] !log srv219 is out of diskspace
[00:29:16] Logged the message, Master
[00:32:00] !log srv223 is also out of diskspace
[00:32:01] Logged the message, Master
[00:32:28] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=60%): /var/lib/ureadahead/debugfs 0 MB (0% inode=60%):
[00:32:42] * Reedy kicks nagios-wm
[00:33:32] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1896
[00:33:33] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1896
[00:34:45] If somebody has 5 minutes and wants to fix srv223 and srv219 it'd be appreciated
[00:44:20] New patchset: Ryan Lane; "This dependency is needed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1897
[00:44:34] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1897
[00:44:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1897
[00:44:42] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1897
[00:44:43] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1897
[00:47:36] http://ganglia.wikimedia.org/test/ w00t!!! this is two different groups on a single ganglia aggregator! yay!
[00:48:43] ah. cool
[00:56:19] RECOVERY - Disk space on srv219 is OK: DISK OK
[00:57:49] RECOVERY - Disk space on srv223 is OK: DISK OK
[02:21:16] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[02:21:36] PROBLEM - Auth DNS on labsconsole.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[03:01:16] RECOVERY - Auth DNS on labsconsole.wikimedia.org is OK: DNS OK: 0.134 seconds response time. www.wikipedia.wmflabs.org returns 208.80.153.197
[04:17:00] RECOVERY - Disk space on es1004 is OK: DISK OK
[04:20:13] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[04:42:00] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[05:29:04] PROBLEM - Puppet freshness on db22 is CRITICAL: Puppet has not run in the last 10 hours
[07:01:08] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[08:01:12] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 193 MB (2% inode=60%): /var/lib/ureadahead/debugfs 193 MB (2% inode=60%):
[08:14:04] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=60%): /var/lib/ureadahead/debugfs 0 MB (0% inode=60%):
[08:32:12] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=60%): /var/lib/ureadahead/debugfs 0 MB (0% inode=60%):
[08:38:42] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=60%): /var/lib/ureadahead/debugfs 0 MB (0% inode=60%):
[08:42:02] RECOVERY - Disk space on srv223 is OK: DISK OK
[08:47:42] RECOVERY - Disk space on srv221 is OK: DISK OK
[08:49:42] RECOVERY - Disk space on srv222 is OK: DISK OK
[08:58:22] RECOVERY - Disk space on srv220 is OK: DISK OK
[09:15:42] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.008 second response time on port 11000
[09:55:43] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 448656 MB (3% inode=99%):
[10:01:33] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 417067 MB (3% inode=99%):
[10:24:33] RECOVERY - MySQL slave status on es1004 is OK: OK:
[12:14:47] PROBLEM - Puppet freshness on db1001 is CRITICAL: Puppet has not run in the last 10 hours
[12:57:09] RECOVERY - HTTPS on sodium is OK: OK - Certificate will expire on 08/22/2015 22:23.
[12:59:22] !log PXE booting srv191, installing OS
[12:59:23] Logged the message, Master
[13:03:49] RECOVERY - Host srv191 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[13:07:38] New patchset: Mark Bergsma; "Add exim::roled class documentation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1900
[13:07:54] New patchset: Mark Bergsma; "Rename relay_domains file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1901
[13:08:08] New patchset: Mark Bergsma; "Add IPv6 service IP for lists.wikimedia.org on sodium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1902
[13:08:22] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1900
[13:08:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1900
[13:08:22] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1902
[13:08:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1901
[13:08:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1902
[13:08:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1901
[13:12:15] New patchset: Mark Bergsma; "Enable v6 for outbound as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1903
[13:12:57] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1903
[13:12:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1903
[13:15:25] New patchset: Mark Bergsma; "Notify service exim4 on config changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1904
[13:15:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1904
[13:15:45] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1904
[13:15:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1904
[13:24:29] PROBLEM - DPKG on srv191 is CRITICAL: Connection refused by host
[13:24:39] PROBLEM - Memcached on srv191 is CRITICAL: Connection refused
[13:28:09] PROBLEM - Disk space on srv191 is CRITICAL: Connection refused by host
[13:32:19] PROBLEM - RAID on srv191 is CRITICAL: Connection refused by host
[13:32:39] PROBLEM - Apache HTTP on srv191 is CRITICAL: Connection refused
[13:59:53] PROBLEM - Host sodium is DOWN: PING CRITICAL - Packet loss = 100%
[14:06:43] RECOVERY - Puppet freshness on srv191 is OK: puppet ran at Fri Jan 13 14:06:32 UTC 2012
[14:09:43] RECOVERY - Apache HTTP on srv191 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.012 seconds
[14:14:04] PROBLEM - Auth DNS on ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[14:14:33] RECOVERY - Disk space on srv191 is OK: DISK OK
[14:19:34] RECOVERY - RAID on srv191 is OK: OK: no RAID installed
[14:22:33] RECOVERY - DPKG on srv191 is OK: All packages OK
[14:24:03] RECOVERY - Auth DNS on ns1.wikimedia.org is OK: DNS OK: 0.026 seconds response time. www.wikipedia.org returns 208.80.152.201
[14:28:03] RECOVERY - Host sodium is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms
[14:32:05] PROBLEM - mailman on sodium is CRITICAL: Connection refused by host
[14:32:05] RECOVERY - Memcached on srv191 is OK: TCP OK - 2.994 second response time on port 11000
[14:33:15] PROBLEM - HTTPS on sodium is CRITICAL: Connection refused
[14:34:55] PROBLEM - DPKG on sodium is CRITICAL: Connection refused by host
[14:34:55] PROBLEM - RAID on sodium is CRITICAL: Connection refused by host
[14:34:59] !log srv191 - has now fresh OS, re-issued puppet certs, ran puppet, restart memcached, etc. - all back in monitoring
[14:35:00] Logged the message, Master
[14:36:45] PROBLEM - SSH on sodium is CRITICAL: Connection refused
[14:41:25] PROBLEM - spamassassin on sodium is CRITICAL: Connection refused by host
[14:41:35] PROBLEM - HTTP on sodium is CRITICAL: Connection refused
[15:38:42] PROBLEM - Puppet freshness on db22 is CRITICAL: Puppet has not run in the last 10 hours
[15:56:36] RECOVERY - SSH on sodium is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[16:32:00] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 84 MB (1% inode=60%): /var/lib/ureadahead/debugfs 84 MB (1% inode=60%):
[16:37:10] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=60%): /var/lib/ureadahead/debugfs 0 MB (0% inode=60%):
[16:46:50] RECOVERY - Disk space on srv221 is OK: DISK OK
[16:51:30] RECOVERY - Disk space on srv219 is OK: DISK OK
[17:01:40] PROBLEM - NTP on ms1002 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:11:00] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[18:35:19] so i think ms1002 is messed up, does anyone know what it's doing ?
[18:36:21] it has a load of 149...
[18:40:42] woah
[18:40:44] no idea
[18:40:56] I have had my head in qemu/vde/junos all day
[18:41:06] :)
[18:41:26] well i could go ahead and reboot it but that might have some very negative consequences
[18:41:37] what is its role? es?
[18:42:52] misc::images::rsyncd, misc::images::rsync
[18:43:00] are its included items
[18:43:26] that's it?
[18:43:29] hmmm
[18:44:14] got a few stack traces from its raid program in /var/log/messages as well
[18:45:55] LeslieCarr: it's the kswapd 100% CPU kernel bug
[18:46:02] we got that on several hosts
[18:46:07] reboot is the only solution i know so far
[18:46:15] okay
[18:46:58] ah, 2.6.40 looks like it fixes it according to google, but we're running 2.6.32
[18:47:01] oh, it's not running a recent enough kernel.. right
[18:47:04] 15:49 mutante: quotes on kswapd problem (that also appeared on other servers): "has nothing to do with swap space or memory".."the kernel process which swaps tasks".."means the kernel is spending more time context switching tasks than it is actually executing the tasks".."you're chasing a ghost if you're trying to tune your swap/memory environment"
[18:47:09] I thought 2.6.38
[18:47:09] 15:45 mutante: ms1002 - kswapd 100% CPU - but no swap used and free memory left - this looks like https://bugs.launchpad.net/ubuntu/+bug/721896 again
[18:47:10] whatever
[18:47:34] should i reboot it ?
[18:47:41] or will that kill the site
[18:47:58] it's on the receiving end of things
[18:48:03] it's not doing image service, no worries
[18:48:08] okay
[18:48:12] it should fix the issue, it's just that i also dont know what consequences a reboot has in general on this specific host
[18:48:28] ms5/6/7/8 are images/thumbs
[18:48:33] this is just getting a copy
[18:48:35] !log rebooting ms1002 due to kswapd 100% cpu bug https://bugs.launchpad.net/ubuntu/+bug/721896
[18:48:36] no worries
[18:48:38] Logged the message, Mistress of the network gear.
[18:48:44] well time to see if it's my first time destroying the site ;)
[18:49:37] holy crap 9 pm already
[18:50:01] as much as this stuff is very interesting I am going to put the keyboard down :-P
[18:50:11] g'night
[18:50:14] no more vde tonight
[18:50:14] have a good weekend
[18:50:25] thanks, you too
[18:51:28] same here. have a good weekend.. starting some music :) cu
[18:51:39] enjoy!
[18:51:50] hrm, ms1002 is stuck on killing puppet
[18:52:03] ..still watching if it comes back though:)
[18:52:15] if it doesn't shut down after a while there's always mgmt :-P
[18:52:20] hasn't even gone down yet..
[18:52:20] hehehe
[18:52:27] and kill -9 didn't work
[18:52:30] on the puppet process
[18:52:34] LeslieCarr: eh.yeah.sorry, and i was still on a shell :p
[18:52:53] still on a shell shouldn't have killed it though…
[18:53:00] should just kick you out
[18:53:02] nope
[18:53:13] no, and i already logged out
[18:53:35] holy crap they are really going to downgrade france?
[18:53:46] :o
[18:53:55] I thought that was just ...
[18:54:25] you know, people talking to try to impact the markets
[18:54:29] LeslieCarr: still stuck?
[18:55:01] yep
[18:55:07] looks like it's hard reboot time
[18:55:18] give it the boot!
[18:55:23] eh, yeah, because now it refuses SSH
[18:55:31] want me to mgmt powercycle?
[18:55:56] !log hard powercycling ms1002
[18:55:58] Logged the message, Mistress of the network gear.
[18:56:02] just kicked it
[18:56:08] kk
[18:57:43] apergos: it could be people talking to try to impact the markets .. and being successful at it
[18:58:51] don't think so
[18:59:12] it's fsck'ing
[19:00:21] reuters could have got it wrong of course but they were very specific: saying an announcement would be made shortly after ny stock markets close today
[19:00:33] it's fscking what? ( :-P )
[19:00:37] hehehe
[19:01:19] looks like it's back up
[19:01:25] though disappointed that nagios didn't catch it
[19:01:33] meh
[19:01:42] nagios has been full of disappointments as of late
[19:01:42] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Fri Jan 13 19:01:30 UTC 2012
[19:01:55] haha, thanks nagios for chiming in
[19:02:20] LeslieCarr: out of curiosity, because of another issue, did it start nagios-nrpe-server by itself
[19:02:22] well, looks happy and okay to me, and now hopefully the ganglia graphs will look more proper (no longer having to scale to 175 for load graphs)
[19:02:40] nagios 1218 0.0 0.0 24784 1112 ? Ss 18:59 0:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
[19:02:50] but no nagios-nrpe-server
[19:03:14] can you try init.d/nagios-nrpe-server start
[19:03:46] ah, well, but that should be it, nevermind
[19:04:09] it wouldnt have been able to do the disk check without it and looks fine now:)
[19:04:44] mutante: nope, didn't start
[19:05:39] LeslieCarr: hmm,ok, interesting, i already have a ticket for that and shall look into that
[19:05:44] thanks
[19:05:53] RECOVERY - NTP on ms1002 is OK: NTP OK: Offset 0.08668124676 secs
[19:06:44] RECOVERY - RAID on ms1002 is OK: OK: State is Optimal, checked 2 logical device(s)
[19:12:51] !log restarting gmond on cp1043
[19:12:53] Logged the message, Mistress of the network gear.
[19:27:01] New patchset: Lcarr; "Fixed some formatting and ensure gmond.conf present" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1905
[19:28:52] New patchset: Lcarr; "Fixed some formatting and ensure gmond.conf present" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1905
[19:29:33] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1905
[19:29:34] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1905
[19:36:19] PROBLEM - Host virt1 is DOWN: PING CRITICAL - Packet loss = 100%
[19:38:18] robh ^
[19:38:41] huh
[19:38:46] takin a look
[19:39:17] hi banisher: would you have some time today to give me access to Locke,emery & bays?
[19:39:44] ?
[19:41:18] New patchset: Lcarr; "separating out the cp machines to make them try and realize they are collectors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1906
[19:41:39] binasher: it's diederik
[19:41:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1906
[19:41:55] drdee_: looks like your access ticket was taken care of yesterday?
[19:42:15] binasher: mmmm…didn't get notification, let me check
[19:42:33] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1906
[19:42:34] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1906
[19:43:42] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1893
[19:43:42] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1894
[19:43:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1893
[19:45:57] RECOVERY - Host virt1 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[19:47:24] binasher: can't find the ticket anymore….do you remember the ticket id?
[19:48:07] New patchset: Lcarr; "cp1044 is an aggregator" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1907
[19:48:51] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1907
[19:48:51] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1907
[19:51:39] New patchset: Bhartshorne; "shouldn't change what the user passed in - range() does what I want instead." [operations/software] (master) - https://gerrit.wikimedia.org/r/1908
[19:53:28] how do you clean out the config of a single machine from the puppet db again ?
[19:54:10] New patchset: Bhartshorne; "shouldn't change what the user passed in - range() does what I want instead." [operations/software] (master) - https://gerrit.wikimedia.org/r/1908
[20:00:19] New patchset: Asher; "named virthosts on 443. shine on, little star cert." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1909
[20:00:46] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1909
[20:00:47] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1909
[20:04:56] drdee_: I gave you access yesterday
[20:05:07] cool
[20:05:15] drdee_: did you see my messages asking if you were all set with the access you needed?
[20:05:21] no
[20:05:43] ah, I messaged you a couple of times asking
[20:05:44] anyway
[20:05:47] are you all set?
[20:07:40] not sure, on Skype let me check in 30 minutes
[20:09:42] kk
[20:19:13] New patchset: Asher; "further cluster def cleanup, write a marker file on dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1910
[20:20:44] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1910
[20:20:44] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1910
[20:20:59] New patchset: Ryan Lane; "Point recursor to virt0 for wmflabs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1911
[20:21:34] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1911
[20:21:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1911
[20:27:59] New patchset: Lcarr; "changing match condition for ganglia_aggregator" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1913
[20:28:41] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1913
[20:28:41] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1913
[20:30:12] New patchset: Lcarr; "Fixing cp1044" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1914
[20:31:08] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1914
[20:31:08] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1914
[20:56:46] New patchset: Ryan Lane; "Fix mchenry's ldap client config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1915
[20:57:02] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1915
[20:57:08] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1915
[22:18:15] Ryan_Lane: bsitu and rmoen would need to request Labs/Gerrit access before following https://labsconsole.wikimedia.org/wiki/Access , right?
[22:24:05] PROBLEM - Puppet freshness on db1001 is CRITICAL: Puppet has not run in the last 10 hours
[22:25:49] yes
[22:27:40] !restarting ganglia1001
[22:30:07] notpeter: super thx! everything works
[22:31:51] !log restarting ganglia1001
[22:31:54] Logged the message, Mistress of the network gear.
[22:48:59] notpeter: is this peter youngmeister?
[22:57:21] New patchset: Lcarr; "adding in startup script so that gmond can start up multiple instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1916
[22:57:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1916
[22:59:44] New patchset: Lcarr; "adding in startup script so that gmond can start up multiple instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1916
[22:59:51] maplebed: can i get a checkout of this patchset please ?
[22:59:58] you bet.
[23:01:22] LeslieCarr: I think you could add that start script to all hosts, not just aggregators.
[23:01:32] hi apergos, are you around?
[23:01:41] yes but not really :-D
[23:01:44] (it's 1 am)
[23:01:46] most hosts will only have one matching /etc/ganglia/gmond*.conf (gmond.conf itself)
[23:02:24] adding only to aggregators is probably safer (no unexpected behavior if someone creates a gmond-testing.conf on some random host)
[23:02:46] okay, just 1 question: any update on getting glam filter output to bays?
[23:02:47] yeah, i would rather go with the safe option ? I've managed to not break the site for this long ;)
[23:02:49] I would add one comment to the aggregators class "overriding default gmond start script to start multiple gmonds on the aggregator" or something like that.
[23:02:59] cool
[23:03:04] you were going to sanitize it right?
[23:03:21] the aggregator class is going to have more stuff in the future, just getting this step for now :)
[23:03:50] if you already have an rt ticket in with the location where the sanitized logs are produced, I haven't checked it today
[23:04:04] apergos: ohh.yeah that's right :D
[23:04:05] ok
[23:04:59] New patchset: Lcarr; "adding in startup script so that gmond can start up multiple instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1916
[23:05:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1916
[23:05:40] * maplebed looks
[23:05:50] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/1916
[23:07:10] ok, I see I'm going to have to do a bunch of testing on both emulated nics to see which things work and which not. not going to happen tonight...
[23:07:14] have a good weekend, folks
[23:10:19] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1916
[23:10:20] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1916
[23:15:49] RECOVERY - Puppet freshness on db1001 is OK: puppet ran at Fri Jan 13 23:15:40 UTC 2012
[23:26:28] New patchset: Lcarr; "adding in aggregator class to ganglia1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1917
[23:26:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1917
[23:27:02] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1917
[23:27:02] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1917
[23:29:47] New patchset: Lcarr; "fixing ganglia-monitor permissions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1918
[23:29:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1918
[23:59:45] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours
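
Editor's note on the 23:01-23:03 gmond discussion: the start script from change 1916 is not shown in this log, so the following is only a rough sketch of the idea described there, namely starting one gmond per file matching /etc/ganglia/gmond*.conf. The function name `gmond_commands` and the binary path `/usr/sbin/gmond` are assumptions made for illustration; `-c` is gmond's option for selecting a config file. The sketch only builds and prints the command lines and does not start anything.

```python
import glob
import os


def gmond_commands(conf_dir="/etc/ganglia", binary="/usr/sbin/gmond"):
    """Build one gmond command line per config file matching gmond*.conf.

    The default binary path is an assumption; adjust it for the host.
    """
    pattern = os.path.join(conf_dir, "gmond*.conf")
    # Sort so the order of instances is stable between runs.
    return [[binary, "-c", path] for path in sorted(glob.glob(pattern))]


if __name__ == "__main__":
    # Print the commands instead of executing them; a real init script
    # would also need pid files and stop/restart handling.
    for cmd in gmond_commands():
        print(" ".join(cmd))
```

On a host with only gmond.conf this yields a single command, which matches the point made at 23:01:46. The concern raised at 23:02:24 also shows up here: a stray gmond-testing.conf would match the pattern and get its own instance.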