[00:12:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[00:42:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[01:12:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[01:42:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[02:12:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[02:42:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[03:12:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[03:42:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[04:12:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[04:42:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[05:12:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[05:42:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[06:12:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[06:30:01] RECOVERY Disk Space is now: OK on nova-production1 nova-production1 output: DISK OK
[06:42:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[07:12:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[07:42:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[08:12:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[08:42:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[09:12:41] PROBLEM host: twreview is DOWN address: twreview check_ping: Invalid hostname/address - twreview
[09:19:21] PROBLEM Disk Space is now: CRITICAL on twpreprod twpreprod output: DISK CRITICAL - free space: / 0 MB (0% inode=20%):
[09:22:21] PROBLEM dpkg-check is now: CRITICAL on twpreprod twpreprod output: DPKG CRITICAL dpkg reports broken packages
[10:22:21] RECOVERY dpkg-check is now: OK on twpreprod twpreprod output: All packages OK
[10:31:31] PROBLEM host: twpreprod is DOWN address: twpreprod CRITICAL - Host Unreachable (twpreprod)
[11:02:21] PROBLEM host: twpreprod is DOWN address: twpreprod CRITICAL - Host Unreachable (twpreprod)
[11:32:21] PROBLEM host: twpreprod is DOWN address: twpreprod check_ping: Invalid hostname/address - twpreprod
[12:01:40] New review: Dzahn; "can we use a header that is always the same (the one in htcp-purger looks nicer than the other)" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/1572
[12:02:21] PROBLEM host: twpreprod is DOWN address: twpreprod check_ping: Invalid hostname/address - twpreprod
[12:15:06] !log bots Packages update which includes a new kernel
[12:15:07] Logged the message, Master
[12:32:21] PROBLEM host: twpreprod is DOWN address: twpreprod check_ping: Invalid hostname/address - twpreprod
[13:02:21] PROBLEM host: twpreprod is DOWN address: twpreprod check_ping: Invalid hostname/address - twpreprod
[13:32:21] PROBLEM host: twpreprod is DOWN address: twpreprod check_ping: Invalid hostname/address - twpreprod
[13:32:22] New patchset: Hashar; "logrotate: add "managed by puppet" headers" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1572
[13:41:11] New patchset: Dzahn; "logrotate: add "managed by puppet" headers" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1572
[13:44:25] New review: Dzahn; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1572
[13:44:26] Change merged: Dzahn; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1572
[14:02:21] PROBLEM host: twpreprod is DOWN address: twpreprod check_ping: Invalid hostname/address - twpreprod
[14:05:30] !log testswarm twpreprod VM was deleted. You can disregard Nagios notifications about it being down.
[14:05:31] Logged the message, Master
[14:32:21] PROBLEM host: twpreprod is DOWN address: twpreprod check_ping: Invalid hostname/address - twpreprod
[15:02:21] PROBLEM host: twpreprod is DOWN address: twpreprod check_ping: Invalid hostname/address - twpreprod
[15:05:31] PROBLEM Total Processes is now: CRITICAL on bots-cb bots-cb output: PROCS CRITICAL: 216 processes
[15:10:31] RECOVERY Total Processes is now: OK on bots-cb bots-cb output: PROCS OK: 129 processes
[15:32:21] PROBLEM host: twpreprod is DOWN address: twpreprod check_ping: Invalid hostname/address - twpreprod
[16:40:16] New review: Pyoungmeister; "(no comment)" [operations/puppet] (testlabs/searchoverhaul); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1578
[16:40:16] Change merged: Pyoungmeister; [operations/puppet] (testlabs/searchoverhaul) - https://gerrit.wikimedia.org/r/1578
[16:40:27] New review: Pyoungmeister; "(no comment)" [operations/puppet] (testlabs/searchoverhaul); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1590
[16:40:28] Change merged: Pyoungmeister; [operations/puppet] (testlabs/searchoverhaul) - https://gerrit.wikimedia.org/r/1590
[18:01:52] I hope people won't bitch at me because I started to write nagios parser of instance list in c++
[18:02:17] it seems that every language is a bad choice
[18:02:28] hehehe
[18:03:23] hehe
[18:03:29] Meh, ignore the language wars
[18:03:37] Every language other than PHP will probably provoke one
[18:04:01] oh god, php ? who uses that anymore ?
[18:04:04] * LeslieCarr goes into troll mode
[18:04:24] asp rules << troll mode
[18:04:33] :D
[18:04:43] or whatever that microsoft language is called
[18:05:14] yeah, i think we're going to dump apache and go with IIS
[18:05:30] I can't imagine cluster of windows machines
[18:05:55] all of them would reboot in same time because of update of ie
[18:06:06] haha, patch tuesday would be rough on the site
[18:06:09] I remember I had to reboot my office pc when new ie was released
[18:06:39] that's crazy you have to reboot pc because of some application which isn't even integrated to system core
[18:09:21] No, no
[18:09:28] IE *is* integrated into the core, that's the *problem*
[18:11:23] :D
[18:11:30] OS powered by browser
[18:11:57] Ryan_Lane: maria db installed :)
[18:12:03] works cool
[18:12:07] sweet
[18:12:16] I just need to integrate it with ldap now
[18:12:20] * Ryan_Lane nods
[18:12:32] integrates with pam. I believe
[18:12:40] I am working on nagios parser
[18:12:50] can you send me api query to get a list of instances?
[18:13:01] it's written in c++ I will commit it to svn
[18:13:23] I think I will wget api page and parse it
[18:13:33] parser generate nagios config and reload service
[18:20:32] petan: sure. gimme a min, though. getting coffee and breakfast :)
[18:20:42] right
[18:21:27] Ryan_Lane: at this hour?
[18:23:08] hyperon: I hope you don't need bots-1
[18:23:19] because I am running there bot which eats 95% of ram
[18:23:21] :D
[18:23:49] petan: no, i don't....i was just copying something.
[18:23:55] ok
[18:24:03] petan: You write something in java?
[18:24:03] :P
[18:24:08] hehe
[18:24:14] Damianz_: is cluebot ok?
[18:24:25] garbage collector suck
[18:24:34] <3 c++
[18:24:55] I think so
[18:25:00] Tbh I could do with bringing CBIII over
[18:25:01] ok
[18:25:10] I've seen dude upgraded system
[18:25:13] was it needed for it?
[18:25:39] not sure, I've been away with a friend for like the past week so not been paying attention
[18:25:49] I hope he is not going to upgrade bots-1 because I hate rebooting servers :D
[18:26:03] I think methecooldude was going to sort cbII but if there isn't another db added for cb then it hasn't been done.
[18:26:15] rebooting more than twice a year is evil
[18:26:26] lol
[18:26:35] I have some servers with ~7years uptime =/
[18:26:43] I have some with 500 days of uptime
[18:26:50] not so much :P
[18:27:07] one ubuntu with 160 days
[18:27:19] My personal servers usually get updated this time of year so lose 250+ uptime xD
[18:27:35] I don't reboot my personal server unless it's needed
[18:27:39] never :D
[18:28:19] last time I rebooted ubuntu box was because of hardware failure
[18:28:44] other machines are production from my work, don't run linux
[18:29:04] some 6 years old ux :D
[18:29:31] :D
[18:29:32] best shell there is ksh
[18:29:35] no bash, no mc
[18:29:44] tar can't open gzip :D
[18:29:54] I'm just installing some new web servers for work, going to take down the old ones with like 800days uptime before christmas :(
[18:30:23] it's sometimes problem to get it back up after so long time
[18:30:40] ./tmp is cleaning 50 minutes :D
[18:30:44] lol
[18:31:21] raid's don't want to assembly... etc
[18:31:25] I had a ldap server once that wouldn't start up cleanly (and was massively due a re-install due to a botch upgrade) that was fun to reboot after 2years and try to remember how to get it to boot xD
[18:31:43] exactly
[18:31:56] I rebooted one virtual host server I had troubles get vm's back up
[18:32:37] I've been reading bash history to get how to start it up
[18:54:44] I'd have to imagine we don't have systems with much more than 180 days of uptime
[18:57:09] wikipedia is having a big cluster of machines which can be rebooted because other machines take their job, so it's no problem there
[18:57:36] we have a bunch of application servers which are independent and if rebooted it cause huge problems
[18:58:32] well, it's about taking care of security patches
[18:58:45] these machines are not accessible from outside
[18:58:59] that's not a good excuse :D
[18:59:18] it contains private data of 4 millions of german people :D, but yes security patches is something we install :)
[18:59:25] heh
[19:07:37] Most our production servers have ksplice installed now to solve kernel patches but most the kernel exploits are local and aren't a massive issue when they are locked down internal servers imo.
[19:56:09] 12/19/2011 - 19:56:09 - Creating a home directory for maxsem at /export/home/bastion/maxsem
[19:56:53] Ryan_Lane: some news related to gluster?
[19:57:09] 12/19/2011 - 19:57:08 - Updating keys for maxsem
[19:57:41] !instance
[19:57:41] https://labsconsole.wikimedia.org/wiki/Instances
[19:57:44] !instancelist
[19:57:44] https://labsconsole.wikimedia.org/w/index.php?title=Special:Ask&offset=0&limit=100&q=[[Resource+Type%3A%3Ainstance]]&p=format%3Dbroadtable&po=%3FInstance+Name%0A%3FInstance+Type%0A%3FProject%0A%3FImage+Id%0A%3FFQDN%0A%3FLaunch+Time%0A%3FPuppet+Class%0A%3FModification+date%0A%3FInstance+Host%0A%3FNumber+of+CPUs%0A%3FRAM+Size%0A%3FAmount+of+Storage%0A
[19:58:01] heh. cool.
[19:58:10] Ryan_Lane: this would be cool if I could get it using api :D
[19:58:20] then nagios could parse it easily
[19:58:31] and update its config every time an instance was added or removed
[19:58:51] petan: hmm. api query to get a list of instances...
[19:59:15] I believe you sent me it once
[19:59:19] ah
[19:59:23] gotta check logs
[19:59:25] that's an SMW query
[19:59:28] aha
[19:59:40] can't it be retrieved using api?
[19:59:43] I believe so
[19:59:47] I dunno
[20:00:02] it's a list of pages, so probably
[20:00:11] but, I also have projects in there too
[20:00:20] ok
[20:00:24] the list of pages isn't likely as helpful as the SMW query, though
[20:00:25] I will just use html
[20:00:40] I hope smw won't change its syntax for output
[20:00:46] you can also just query openstack directly, but it'll only give you info from the projects you are in
[20:00:55] hm...
[20:01:03] this would be better I guess
[20:01:14] it also include info about it
[20:01:23] storage etc
[20:01:27] https://labsconsole.wikimedia.org/wiki/Special:Ask/-5B-5BResource-20Type::instance-5D-5D/-3FInstance-20Name/-3FInstance-20Type/-3FProject/-3FImage-20Id/-3FFQDN/-3FLaunch-20Time/-3FPuppet-20Class/-3FModification-20date/-3FInstance-20Host/-3FNumber-20of-20CPUs/-3FRAM-20Size/-3FAmount-20of-20Storage/limit%3D100/format%3Djson
[20:01:35] !instance-json is https://labsconsole.wikimedia.org/wiki/Special:Ask/-5B-5BResource-20Type::instance-5D-5D/-3FInstance-20Name/-3FInstance-20Type/-3FProject/-3FImage-20Id/-3FFQDN/-3FLaunch-20Time/-3FPuppet-20Class/-3FModification-20date/-3FInstance-20Host/-3FNumber-20of-20CPUs/-3FRAM-20Size/-3FAmount-20of-20Storage/limit%3D100/format%3Djson
[20:01:35] Key was added!
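
The idea discussed above (fetch the instance list as JSON, then generate Nagios host definitions from it) could look roughly like the sketch below. It is only an illustration, not the C++ parser petan mentions: the shape of the SMW JSON output is not shown in this log, so the code assumes a hypothetical, already-flattened list of records keyed by the query's property names ("Instance Name", "FQDN"), and the `generic-host` template name is likewise an assumption.

```python
import json


def render_nagios_hosts(instances, template="generic-host"):
    """Render one Nagios 'define host' block per instance record.

    `instances` is assumed (hypothetically) to be a list of dicts with an
    "Instance Name" key and an optional "FQDN" key.
    """
    blocks = []
    # Sort by name so the generated file is stable between runs.
    for inst in sorted(instances, key=lambda i: i["Instance Name"]):
        name = inst["Instance Name"]
        # Fall back to the instance name when no FQDN is recorded.
        address = inst.get("FQDN") or name
        blocks.append(
            "define host {\n"
            f"    use        {template}\n"
            f"    host_name  {name}\n"
            f"    address    {address}\n"
            "}\n"
        )
    return "\n".join(blocks)


# Hypothetical sample data; the real query output would need to be
# converted into this shape first.
SAMPLE = json.loads(
    '[{"Instance Name": "bots-1", "FQDN": "bots-1.example"},'
    ' {"Instance Name": "bastion1", "FQDN": ""}]'
)
print(render_nagios_hosts(SAMPLE))
```

Writing the result to a file and reloading Nagios only when the output differs from the previous run would avoid needless reloads.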
[20:01:47] !instance-json
[20:01:48] https://labsconsole.wikimedia.org/wiki/Special:Ask/-5B-5BResource-20Type::instance-5D-5D/-3FInstance-20Name/-3FInstance-20Type/-3FProject/-3FImage-20Id/-3FFQDN/-3FLaunch-20Time/-3FPuppet-20Class/-3FModification-20date/-3FInstance-20Host/-3FNumber-20of-20CPUs/-3FRAM-20Size/-3FAmount-20of-20Storage/limit%3D100/format%3Djson
[20:01:50] heh
[20:01:51] cool
[20:02:06] this is limited to 500 instances
[20:02:19] but, that should be good for the foreseeable future
[20:02:31] that's cool
[20:02:34] I can use it
[20:02:39] oh. wait. that's 100
[20:02:49] I see
[20:02:54] https://labsconsole.wikimedia.org/wiki/Special:Ask?title=Special%3AAsk&q=[[Resource+Type%3A%3Ainstance]]&po=%3FInstance+Name%0D%0A%3FInstance+Type%0D%0A%3FProject%0D%0A%3FImage+Id%0D%0A%3FFQDN%0D%0A%3FLaunch+Time%0D%0A%3FPuppet+Class%0D%0A%3FModification+date%0D%0A%3FInstance+Host%0D%0A%3FNumber+of+CPUs%0D%0A%3FRAM+Size%0D%0A%3FAmount+of+Storage%0D%0A&sort_num=&order_num=ASC&eq=yes&p[format]=json&p[limit]=500&p[headers]=&p[mainlabel]=&p[se
[20:02:54] np I can just update it
[20:02:58] !instance-json is https://labsconsole.wikimedia.org/wiki/Special:Ask?title=Special%3AAsk&q=[[Resource+Type%3A%3Ainstance]]&po=%3FInstance+Name%0D%0A%3FInstance+Type%0D%0A%3FProject%0D%0A%3FImage+Id%0D%0A%3FFQDN%0D%0A%3FLaunch+Time%0D%0A%3FPuppet+Class%0D%0A%3FModification+date%0D%0A%3FInstance+Host%0D%0A%3FNumber+of+CPUs%0D%0A%3FRAM+Size%0D%0A%3FAmount+of+Storage%0D%0A&sort_num=&order_num=ASC&eq=yes&p[format]=json&p[limit]=500&p[headers]=&
[20:02:59] Key exist!
[20:03:04] pretty long :D
[20:03:08] how do I delete it?
[20:03:15] !instance-json del
[20:03:16] Successfully removed instance-json
[20:03:20] well, that query has other fields in the json too
[20:03:23] !instance-json is https://labsconsole.wikimedia.org/wiki/Special:Ask?title=Special%3AAsk&q=[[Resource+Type%3A%3Ainstance]]&po=%3FInstance+Name%0D%0A%3FInstance+Type%0D%0A%3FProject%0D%0A%3FImage+Id%0D%0A%3FFQDN%0D%0A%3FLaunch+Time%0D%0A%3FPuppet+Class%0D%0A%3FModification+date%0D%0A%3FInstance+Host%0D%0A%3FNumber+of+CPUs%0D%0A%3FRAM+Size%0D%0A%3FAmount+of+Storage%0D%0A&sort_num=&order_num=ASC&eq=yes&p[format]=json&p[limit]=500&p[headers]=&
[20:03:23] Key was added!
[20:03:50] Seriously
[20:04:05] and it has a bunch of empty arguments
[20:04:35] actually main thing is instance name and classes
[20:04:43] that's what I need to know for nagios
[20:04:47] host name
[20:05:00] you can edit the query in the interface
[20:05:08] I know
[20:05:13] I didn't know how to get it :)
[20:05:17] using this
[20:05:42] whoops
[20:05:51] !instance-json del
[20:05:51] Successfully removed instance-json
[20:06:02] !instance-json is https://labsconsole.wikimedia.org/wiki/Special:Ask/-5B-5BResource-20Type::instance-5D-5D/-3FInstance-20Name/-3FInstance-20Type/-3FProject/-3FImage-20Id/-3FFQDN/-3FLaunch-20Time/-3FPuppet-20Class/-3FModification-20date/-3FInstance-20Host/-3FNumber-20of-20CPUs/-3FRAM-20Size/-3FAmount-20of-20Storage/limit%3D500/format%3Djson
[20:06:02] Key was added!
[20:09:31] Damianz: CB 3 is done :P
[20:10:41] methecooldude: Cooli
[20:10:47] Did you add it into supervisord?
[20:12:20] hmm I see
[20:12:38] methecooldude: I'd rather we run it under the cb user in the mnt rather than your home dir tbh
[20:20:32] !log bots Move cluebot3 to the /mnt/share/cluebot/cluebot3 dir + added a new process group to supervisor for it on bots-cb.
[20:20:33] Logged the message, Master
[20:21:36] I don't think you need to log changes to cluebot :), but I don't care
[20:21:44] * Damianz shrugs
[20:21:45] changes to instance are important though
[20:22:08] I will get around to documenting cb stuff sometime....
[20:22:19] ah, added process group is interesting, right
[20:22:23] it's good to log actions in the project so others know what you are up to
[20:22:45] then anyone else that wants to help out knows how things are changing :)
[20:22:53] That was my thinking
[20:23:05] I should re-write my graphing scripts for cb that got lost at some point actually.
[20:23:12] And re-implement the bot monitoring.
[20:25:53] So you know when it becomes Sentient?
[20:26:20] cluebot, reverting humans since 2012
[20:26:27] 2012?
[20:26:32] It was doing it 5 years ago
[20:26:42] no. I mean reverting *humans*
[20:26:54] ohh
[20:26:56] :p
[20:27:04] I can think of some that need reverting
[20:27:13] heh
[20:28:17] great. the bug I have in production I'm not seeing in labs
[20:28:29] oh. I had forgotten to svn up in production
[20:28:39] * Ryan_Lane can be safely ignored
[20:34:25] From now on?
[20:34:31] probably
[20:35:12] know what would be awesome?
[20:35:45] if there was documentation on how to make your extension's database schema automatically get applied when update.php is run
[20:36:19] aha http://www.mediawiki.org/wiki/Manual:Hooks/LoadExtensionSchemaUpdates
[20:38:15] haha, yes
[20:38:48] and now it's even easier
[20:38:51] it is?
[20:38:58] those docs suck, btw
[20:38:59] Not having to do shit with globals
[20:39:00] hm, I think I could just insert all the Manual: links to this bot so that you could just type @regsearch [Ee]xtension.*[Ss]chema :D
[20:39:06] DatabaseUpdater has nice methods
[20:39:15] any docs on this?
[20:39:30] I wish we had a policy that code would be rejected if docs aren't submitted with it
[20:39:37] heh :D
[20:39:43] I hate that almost all of our damn code is undocumented
[20:39:45] it would suck if we had no code
[20:39:50] Is for openstackmanager?
[20:40:09] openstackmanager isn't totally undocumented ;)
[20:40:19] I was meaning adding the schema updates :p
[20:40:24] oh
[20:40:25] yeah
[20:40:32] I need to make a schema first :)
[20:40:39] but I wanted to know how I'd add it before I made it
[20:40:46] heh
[20:40:55] Look at CodeReview or something
[20:41:00] this is for available classes and variables
[20:41:04] I looked at LQT
[20:41:12] $updater->addExtensionTable( 'name', 'path/to/file.sql' );
[20:41:18] ah
[20:41:24] Safely? You mean we weren't ignoring Ryan_Lane previously :P
[20:41:39] well, dangerously previously :D
[20:41:41] PROBLEM Total Processes is now: CRITICAL on bots-cb bots-cb output: PROCS CRITICAL: 227 processes
[20:41:51] there's also functions for adding new indexes, columns etc etc
[20:41:53] I should up that limit again actually
[20:43:32] * Damianz wonders if Ryan_Lane|food tastes nice
[20:45:15] Damianz: everytime Ryan_Lane becomes Ryan_Lane|food someone tastes him.
[20:45:47] Clearly there is no standard defined for how he tastes,
[20:46:06] Damianz: hmm, true
[20:46:40] Damianz: i suppose you could use http://en.wikipedia.org/wiki/Scoville_scale
[20:46:41] RECOVERY Total Processes is now: OK on bots-cb bots-cb output: PROCS OK: 112 processes
[20:46:51] it's a literal 'hotness' scale :)
[20:48:15] * Damianz gives labs-nagios-wm a cookie
[20:51:27] Damianz: do you want me to make it ignore that?
[20:51:55] petan: Tbh it's not usually process that are an issue, it's ram usage so possible that might be an idae.
[20:51:58] idea*
[20:52:31] right, gotta fix it then
[20:52:47] ram usage, 80% warning, 90% crit?
[20:52:49] or 95%
[20:52:55] 90% should be fine
[20:52:58] ok
[20:53:35] The bot is pretty ok resource wise, but because it forks for each edit that needs checking if there is a large bunch of edits done it can quickly nom resources.
[20:53:45] LeslieCarr: can you please tell me what is that command to reset git? :P
[20:53:54] your favorite...
[20:53:55] ^^
[20:53:55] git reset?
[20:53:59] git reset --hard origin/test
[20:54:17] !leslie's-reset is git reset --hard origin/test
[20:54:17] Key was added!
[20:54:22] :)
[20:54:28] ?leslie's-reset
[20:54:35] !damianz's-reset is just git reset
[20:54:35] Key was added!
[20:54:37] what's the key again ?
[20:54:39] I will try :D
[20:54:47] !leslie's-reset ?
[20:54:47] git reset --hard origin/test
[20:54:55] which key
[20:55:08] the "what does this mean" key
[20:55:17] oh yay you just did it
[20:55:18] :)
[21:05:52] This is going to be fun... I need to remember how to setup acls in openldap and how to get freeradius playing nicely with it =/
[21:12:27] !git
[21:12:27] for more information about git on labs see https://labsconsole.wikimedia.org/wiki/Git
[21:14:09] [remote rejected] HEAD -> refs/for/test (missing Change-Id in commit message) :(
[21:14:38] Did you not install the Change-Id hook? Did you run git-setup?
[21:14:42] yes
[21:14:47] but I committed before
[21:16:05] Maybe you committed before pulling, and git created a merge commit?
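
The RAM thresholds agreed above (warn at 80% used, critical at 90%) can be sketched as a small check that follows the usual Nagios plugin convention of exit code 0 for OK, 1 for WARNING and 2 for CRITICAL. This is an illustration only, not the `check_ram.sh` script mentioned later in the log; the function names are made up here, and it counts everything except `MemFree` as used, which is an assumption about what "ram usage" means.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def check_ram(meminfo, warn=80.0, crit=90.0):
    """Return (nagios_status_code, message) for the given memory figures."""
    total = meminfo["MemTotal"]
    used_pct = 100.0 * (total - meminfo["MemFree"]) / total
    if used_pct >= crit:
        return 2, f"RAM CRITICAL: {used_pct:.0f}% used"
    if used_pct >= warn:
        return 1, f"RAM WARNING: {used_pct:.0f}% used"
    return 0, f"RAM OK: {used_pct:.0f}% used"


# Made-up sample figures; on a real host the text would come from /proc/meminfo.
SAMPLE = "MemTotal:       1000 kB\nMemFree:         150 kB\n"
status, message = check_ram(parse_meminfo(SAMPLE))
print(status, message)
```

A real plugin would print the message and exit with the status code so that NRPE can report it.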
[21:16:53] New patchset: Petrb; "Inserted check of free ram to nagios" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1630
[21:16:55] :)
[21:16:58] here we go
[21:17:37] HUH
[21:17:42] where is my change to nrpe
[21:17:51] WTF
[21:18:31] lol
[21:19:03] New patchset: Petrb; "Inserted check of free ram to nagios" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1631
[21:19:22] WTF
[21:19:27] why it created a new file
[21:19:43] Gerrit is soooo ugly
[21:19:52] there should be some "how to use gerrit for idiots" guide
[21:20:01] Damianz, I agree
[21:20:14] Ryan_Lane|food: when you come back please don't kill me
[21:20:23] It's actually really confusing layout wise =/
[21:21:20] could someone explain to me why it was committed this wrong...
[21:21:38] petan: Well, for gerrit generally you should be following My Branching Guide (TM)
[21:21:40] :D
[21:21:53] https://labsconsole.wikimedia.org/wiki/User:Catrope/Branching_guide
[21:22:02] @regsearch production
[21:22:02] Results: $realm,
[21:22:06] !$realm
[21:22:06] either labs or production
[21:22:20] And, don't commit to the local tracking branch
[21:22:26] where is $realm it disappeared from the file
[21:22:30] So the idea is you create working branches, and only commit to the working branches
[21:22:39] You don't commit to your local 'production' or 'test' branch
[21:22:51] so how do I commit, step by step
[21:22:52] Those branches only exist to track the remotes
[21:23:00] $ git checkout -b fixfileperms test # Create a new branch called 'fixfileperms' based off 'test', and switch to it
[21:23:01] !git is what I followed :( | RoanKattouw
[21:23:01] Key exist!
[21:23:02] work work work
[21:23:03] $ git commit -a -m "Fix file permissions"
[21:23:05] $ git push-for-review-test
[21:23:05] ah
[21:23:14] Yeah the docs need improvement in that area
[21:23:39] Change abandoned: Petrb; "will fix" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1630
[21:23:47] Change abandoned: Petrb; "will fix :)" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1631
[21:24:01] !leslie's-reset
[21:24:02] git reset --hard origin/test
[21:24:15] You don't have to abandon changes to fix them, either, although in this case it's probably best
[21:24:25] so
[21:24:30] now I change the files ok?
[21:24:34] Yeah
[21:24:36] Well, no
[21:24:39] First, you create a working branch
[21:24:48] done
[21:24:49] ah
[21:24:50] yay
[21:24:51] git checkout -b whatyourworkingon test
[21:24:58] You can do that after changing the files too
[21:25:00] so what is first
[21:25:02] As long as you don't commit
[21:25:04] Doesn't matter
[21:25:09] I reset now
[21:25:13] You can create the branch before or after you edit the files
[21:25:14] so now?
[21:25:19] As long as you create the branch before you commit
[21:25:21] So, yeah
[21:25:25] Create a branch and switch to it
[21:25:34] Call it something like nrpefix or whatever makes sense to you
[21:25:43] (the name is local only and isn't published to the world)
[21:25:44] git checkout -b "Inserted check of free ram to monitoring" test
[21:25:45] ?
[21:25:52] No, you can't have spaces I think
[21:25:55] ah
[21:26:10] git checkout -b "insertedcheckoffreeramtomonitoring" test
[21:26:13] ?
[21:26:20] Yeah sure
[21:26:24] Whatever name makes sense to you
[21:26:31] It's for personal use only
[21:26:34] fatal: git checkout: updating paths is incompatible with switching branches.
[21:26:34] Did you intend to checkout 'test' which can not be resolved as commit?
[21:27:15] petan: kill you for what? heh
[21:27:23] I messed gerrit
[21:27:40] wtf
[21:27:41] where is $realm in nrpe
[21:27:46] Ryan_Lane: ^
[21:27:49] I don't see it
[21:27:57] I see production ip's there
[21:28:01] git checkout -b blahblah test should Just Work
[21:28:02] :|
[21:28:16] same
[21:28:17] What was your exact command?
[21:28:17] nrpe::packages
[21:28:22] $nrpe_allowed_hosts = $realm ? {
[21:28:27] petrbena@Desktop-ws:~/Bordel/production/running/puppet$ git checkout -b blahblah test
[21:28:38] git checkout -b blahblah origin/test
[21:28:49] :)
[21:28:55] otherwise it checks out your local test branch
[21:28:56] Just 'test' should work, but yeah origin/test is better
[21:29:02] ok
[21:29:04] now what?
[21:29:07] change files?
[21:29:15] * RoanKattouw updates his branching guide
[21:29:19] Yes
[21:29:26] done
[21:29:32] now?
[21:29:41] Commit and push
[21:29:47] git commit?
[21:29:50] Yes
[21:30:11] # Untracked files:
[21:30:13] # (use "git add ..." to include in what will be committed)
[21:30:15] #
[21:30:17] # check_ram.sh
[21:30:19] # nrpe_local.cfg
[21:30:25] before that I did git add
[21:30:33] and it overwrite the original file instead of making diff
[21:30:50] should I do git add or something else?
[21:30:52] Huh ahm
[21:30:58] You're adding new files, then?
[21:31:02] only one
[21:31:03] If so then yes, do git add
[21:31:04] OK
[21:31:08] Then git add the new file
[21:31:10] Then git commit
[21:31:19] that is what caused only file was in gerrit
[21:31:23] * one
[21:31:33] other one was in second commit
[21:31:48] I don't want to commit every file :|
[21:31:54] <^demon|away> You can list files explicitly.
[21:32:06] how do I just add one file and update the other
[21:32:19] <^demon|away> `git add foo`
[21:32:19] I want check_ram to be added, git add, and nrpe updated
[21:32:23] <^demon|away> `git commit foo bar`
[21:32:30] I know how git add works
[21:32:37] but how do I update nrpe_local
[21:32:55] <^demon|away> Like I said, you can name the two files
[21:33:04] ok
[21:33:17] <^demon|away> `git commit file1 file2 file3 etc etc etc`
[21:33:37] error: pathspec 'nrpe_local.cfg' did not match any file(s) known to git.
[21:33:38] You can also use git add for files you want to update
[21:33:47] Or, to update everything, use git commit -a
[21:34:04] You said it was an existing file? It seems git disagrees then
[21:34:18] <^demon|away> petan: The relative path to the file?
[21:34:26] that doesn't work
[21:34:35] RoanKattouw: when I do commit -a I see only one file
[21:34:42] second is "not tracked"
[21:34:58] Yeah, so then you need to git add it too
[21:35:00] <^demon|away> If it's not tracked it sounds like you haven't added it yet.
[21:35:01] ^demon|away: .
[21:35:03] it's in pwd
[21:35:06] * wd
[21:35:17] <^demon|away> mmk.
[21:35:53] petan: So yeah, you'll have to git add that file as well then
[21:36:07] New patchset: Petrb; "Inserted ram check to nagios" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1632
[21:36:13] so?
[21:36:22] Ryan_Lane: ^
[21:37:08] did we not already have a memory check?
[21:37:15] I didn't find it
[21:37:21] not in nrpe
[21:38:15] I like this check :) (not just because I just improved it) but it allows you to define if you want to check memory or memmory - buffers
[21:38:23] * memory
[21:39:21] Ryan_Lane: do we
[21:39:58] I got a feeling that this month I just keep reinventing wheels
[21:40:08] I dunno. I'm looking
[21:40:20] it seems none of our nrpe checks take arguments
[21:40:29] mh
[21:40:30] I think there's a reason for that
[21:40:37] I don't get it
[21:40:41] which arguments you mean
[21:40:43] petan: Alright, looks like that worked. So the key thing is that you create a new branch based off origin/test for everything you do. Each branch would normally contain only one commit, unless it's like a follow-up or something
[21:41:02] ok
[21:41:09] !git update it | RoanKattouw
[21:41:09] RoanKattouw: for more information about git on labs see https://labsconsole.wikimedia.org/wiki/Git
[21:41:13] :P
[21:41:30] I know, I need to get around to that some time
[21:41:32] I would update it but I got a feeling I would just break it
[21:41:37] because I don't know much
[21:41:41] New review: Demon; "(no comment)" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/1632
[21:42:13] D:
[21:42:40] ^demon|away: it was in dos format when I downloaded it
[21:42:55] <^demon|away> Ew :(
[21:42:56] how do I follow up
[21:43:12] <^demon|away> `git commit --amend`
[21:43:13] <^demon|away> Then push
[21:43:35] That is actually documented
[21:43:50] !amend is https://labsconsole.wikimedia.org/wiki/Git#Amending_a_change
[21:43:50] Key was added!
[21:44:01] !amend is If you want to amend your change in git do git commit --amend
[21:44:01] Key exist!
[21:44:05] !amend
[21:44:05] https://labsconsole.wikimedia.org/wiki/Git#Amending_a_change
[21:44:07] lol
[21:44:09] ah
[21:44:11] ok
[21:44:21] ah. ignore me :)
[21:44:26] that's ok :D
[21:44:31] I wasn't looking heh
[21:44:45] petan: so, this looks ok, but you aren't actually including the script anywhere
[21:44:56] meaning it'll be in puppet, but nothing will install it
[21:45:00] ah...
[21:45:18] Ryan_Lane: I shudder at https://labsconsole.wikimedia.org/w/index.php?title=Git&diff=849&oldid=558 . Why not use rebase?
[21:45:25] New review: Ryan Lane; "Please add something to actually install the script." [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/1632
[21:45:28] Ryan_Lane: nagios.pp?
[21:45:41] RoanKattouw: every time I try to rebase, it fails
[21:45:48] That's strange
[21:45:51] It's always worked for me
[21:45:59] git fetch blah blah blah && git checkout FETCH_HEAD
[21:46:00] I obviously have no clue how to properly rebase
[21:46:01] git rebase origin/test
[21:46:04] git push-for-review-test
[21:46:06] done
[21:46:09] really? that's it
[21:46:10] ?
[21:46:22] change that section then :)
[21:46:28] The thing with rebase is that after fixing conflicts you have to do git rebase --continue I think
[21:46:33] Alright :)
[21:47:45] <^demon|away> Wow, that's so much easier than the long convoluted reset --hard and cherry picking your own change back in :p
[21:47:53] +1
[21:48:08] Ryan_Lane ^
[21:48:08] I just added what worked for me :)
[21:48:11] eh
[21:48:13] wait
[21:48:26] !g 1632
[21:48:27] https://gerrit.wikimedia.org/r/1632
[21:48:28] ^
[21:48:34] ah nvm
[21:48:50] ! [remote rejected] HEAD -> refs/for/test (no changes made)
[21:48:52] error: failed to push some refs to 'ssh://petrb@gerrit.wikimedia.org:29418/operations/puppet'
[21:49:01] You also don't have to add a change ID etc
[21:49:11] how do I fix it
[21:49:18] ^demon|away: And amending so the Change-ID is added :)
[21:49:20] petan: What did you do
[21:49:29] I changed a file and did commit
[21:49:31] then push
[21:49:36] Did you do commit --amend or just commit?
[21:49:40] --amend
[21:49:43] OK
[21:49:48] Oh
[21:49:50] You have to do --amend -a
[21:49:53] ah
[21:49:56] how do I fix now?
[21:50:06] git commit --amend -a
[21:50:09] Then git push-for-review-test
[21:50:18] New patchset: Petrb; "Inserted ram check to nagios" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1632
[21:50:21] :)
[21:50:30] (The -a flag means "yeah I know that in theory I have to git add every changed file, but I'm lazy so stfu and just commit all changed files")
[21:51:19] <^demon|away> Easier than adding all your files by hand...and less likely to make a mistake than -a...is listing the files in your git commit.
[21:52:05] ok
[21:52:15] Ryan_Lane: check please :)
[21:53:18] New review: Ryan Lane; "You should be changing templates/nagios/nrpe_local.cfg.erb, not adding a file." [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/1632
[21:53:38] we already have a template being included for that file :)
[21:53:45] huh
[21:54:31] you are adding files/nagios/nrpe_local.cfg
[21:54:37] we have a template for it.
[21:54:50] done
[21:54:59] how do I revert the change to file
[21:55:12] just amend your commit
[21:55:17] did the file already exist?
[21:55:30] yes
[21:55:30] I think it doesn't
[21:55:37] it was in folder before
[21:55:48] it doesn't exist in my repo
[21:55:51] meh
[21:55:56] New patchset: Petrb; "Inserted ram check to nagios" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1632
[21:55:58] but it existed in mine
[21:56:08] ok, delete it if you don't think so
[21:56:17] I just saw it in the folder so I changed it
[21:56:33] damn it
[21:56:42] command-q is way too close to command-tab
[21:56:53] :P
[21:57:43] <^demon|away> I hate command+Q
[21:57:49] <^demon|away> It's also too close to command+W
[21:58:02] in which editor
[21:58:15] <^demon|away> In my browser.
[21:58:15] it causes problem
[21:58:17] ah
[21:58:19] right
[21:59:37] just changed iterm to warn me before it quits :)
[21:59:50] hah
[21:59:57] !g 1632
[21:59:57] https://gerrit.wikimedia.org/r/1632
[22:00:18] the file is still there...
[22:00:28] I know [22:00:34] I don't know how to remove it [22:00:39] git rm file [22:00:44] hm... ok [22:00:44] git rm [22:00:56] I am sure it was there o.o [22:00:56] notice that it has A next to it in the interface [22:01:03] that shows that it's a new file [22:01:03] because I didn't create it [22:01:45] New patchset: Petrb; "Inserted ram check to nagios" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1632 [22:02:25] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1632 [22:02:25] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1632 [22:02:52] !log nagios updating configuration to check ram [22:02:53] Logged the message, Master [22:08:55] yay [22:09:18] NRPE: Command 'check_ram' not defined :| [22:09:28] I must have forgot to change something eh [22:09:31] New patchset: Pyoungmeister; "some more. getting the global conf in, plus some monitoring" [operations/puppet] (testlabs/searchoverhaul) - https://gerrit.wikimedia.org/r/1633 [22:10:09] I got it [22:10:34] New review: Pyoungmeister; "(no comment)" [operations/puppet] (testlabs/searchoverhaul); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1633 [22:10:35] Change merged: Pyoungmeister; [operations/puppet] (testlabs/searchoverhaul) - https://gerrit.wikimedia.org/r/1633 [22:11:55] PROBLEM Free ram is now: CRITICAL on bastion1 bastion1 output: NRPE: Command check_ram not defined [22:11:55] PROBLEM Free ram is now: CRITICAL on bots-sql1 bots-sql1 output: NRPE: Command check_ram not defined [22:11:55] PROBLEM Free ram is now: CRITICAL on hugglewiki hugglewiki output: NRPE: Command check_ram not defined [22:11:55] PROBLEM Free ram is now: UNKNOWN on labs-lvs1 labs-lvs1 output: NRPE: Unable to read output [22:11:55] PROBLEM Free ram is now: CRITICAL on labs-ocg1 labs-ocg1 output: NRPE: Command check_ram not defined [22:11:55] PROBLEM Free ram is now: UNKNOWN on membase4 membase4 output: NRPE: Unable to read 
output [22:11:56] PROBLEM Free ram is now: UNKNOWN on pageviews pageviews output: NRPE: Unable to read output [22:11:56] PROBLEM Free ram is now: CRITICAL on vumi-gw1 vumi-gw1 output: NRPE: Command check_ram not defined [22:12:08] o.o [22:12:43] :D [22:12:45] PROBLEM Free ram is now: UNKNOWN on bots-1 bots-1 output: NRPE: Unable to read output [22:12:45] PROBLEM Free ram is now: UNKNOWN on bots-sql2 bots-sql2 output: NRPE: Unable to read output [22:12:45] PROBLEM Free ram is now: CRITICAL on labs-build1 labs-build1 output: NRPE: Command check_ram not defined [22:12:55] PROBLEM Free ram is now: CRITICAL on labs-mc1 labs-mc1 output: NRPE: Command check_ram not defined [22:12:55] PROBLEM Free ram is now: CRITICAL on master master output: NRPE: Command check_ram not defined [22:12:55] PROBLEM Free ram is now: UNKNOWN on puppet-lucid puppet-lucid output: NRPE: Unable to read output [22:12:58] right puppet didn't run [22:13:01] it needs to exist on all hosts ;) [22:13:03] that's one problem [22:13:05] PROBLEM Free ram is now: CRITICAL on wikistats-01 wikistats-01 output: NRPE: Command check_ram not defined [22:13:14] it runs every 30 mins or so [22:13:17] hm... 
[22:13:35] PROBLEM Free ram is now: CRITICAL on bots_2 bots-2 output: NRPE: Command check_ram not defined [22:13:35] PROBLEM Free ram is now: CRITICAL on bots-sql3 bots-sql3 output: NRPE: Command check_ram not defined [22:13:35] PROBLEM Free ram is now: CRITICAL on labs-cp1 labs-cp1 output: NRPE: Command check_ram not defined [22:13:35] PROBLEM Free ram is now: CRITICAL on labs-mc2 labs-mc2 output: NRPE: Command check_ram not defined [22:13:45] PROBLEM Free ram is now: CRITICAL on mediahandler-test mediahandler-test output: NRPE: Command check_ram not defined [22:13:55] PROBLEM Free ram is now: CRITICAL on reportcard1 reportcard1 output: NRPE: Command check_ram not defined [22:13:55] PROBLEM Free ram is now: CRITICAL on nova-production1 nova-production1 output: NRPE: Command check_ram not defined [22:13:57] I will fix [22:14:10] well, it's fine [22:14:20] the checks will all fail now [22:14:25] PROBLEM Free ram is now: UNKNOWN on bots-apache1 bots-apache1 output: NRPE: Unable to read output [22:14:25] PROBLEM Free ram is now: UNKNOWN on canonical-bridge canonical-bridge output: NRPE: Unable to read output [22:14:25] PROBLEM Free ram is now: CRITICAL on labs-cp2 labs-cp2 output: NRPE: Command check_ram not defined [22:14:33] and will start working when puppet runs [22:14:35] PROBLEM Free ram is now: CRITICAL on pad1 pad1 output: NRPE: Command check_ram not defined [22:14:35] PROBLEM Free ram is now: CRITICAL on test1 test1 output: NRPE: Command check_ram not defined [22:14:35] PROBLEM Free ram is now: CRITICAL on labs-mw1 labs-mw1 output: NRPE: Command check_ram not defined [22:14:35] PROBLEM Free ram is now: UNKNOWN on membase1 membase1 output: NRPE: Unable to read output [22:14:42] NRPE: Unable to read output [22:14:43] wtf [22:14:53] heh [22:15:01] it works on my comp. 
[22:15:15] PROBLEM Free ram is now: UNKNOWN on bots-cb bots-cb output: NRPE: Unable to read output [22:15:15] PROBLEM Free ram is now: CRITICAL on ganglia-collector ganglia-collector output: NRPE: Command check_ram not defined [22:15:15] PROBLEM Free ram is now: UNKNOWN on labs-db1 labs-db1 output: NRPE: Unable to read output [22:15:15] PROBLEM Free ram is now: CRITICAL on labs-mw2 labs-mw2 output: NRPE: Command check_ram not defined [22:15:15] PROBLEM Free ram is now: UNKNOWN on membase2 membase2 output: NRPE: Unable to read output [22:15:25] PROBLEM Free ram is now: CRITICAL on pad2 pad2 output: NRPE: Command check_ram not defined [22:15:35] PROBLEM Free ram is now: UNKNOWN on test3 test3 output: NRPE: Unable to read output [22:16:05] PROBLEM Free ram is now: CRITICAL on bots-nfs bots-nfs output: NRPE: Command check_ram not defined [22:16:05] PROBLEM Free ram is now: CRITICAL on ganglia-master ganglia-master output: NRPE: Command check_ram not defined [22:16:05] PROBLEM Free ram is now: CRITICAL on labs-db2 labs-db2 output: NRPE: Command check_ram not defined [22:16:05] PROBLEM Free ram is now: CRITICAL on labs-nfs1 labs-nfs1 output: NRPE: Command check_ram not defined [22:16:05] PROBLEM Free ram is now: UNKNOWN on membase3 membase3 output: NRPE: Unable to read output [22:16:06] PROBLEM Free ram is now: CRITICAL on testpuppet testpuppet output: NRPE: Command check_ram not defined [22:16:55] PROBLEM Free ram is now: UNKNOWN on bots-sql1 bots-sql1 output: NRPE: Unable to read output [22:16:55] PROBLEM Free ram is now: UNKNOWN on vumi-gw1 vumi-gw1 output: NRPE: Unable to read output [22:17:15] it doesn't work [22:18:06] there is no check_ram.sh on servers [22:18:10] Ryan_Lane: ^ [22:18:22] as mentioned, puppet needs to run on them [22:18:31] but even if I do puppetd -tv [22:18:35] PROBLEM Free ram is now: UNKNOWN on bots-sql3 bots-sql3 output: NRPE: Unable to read output [22:18:40] it needs to run on all of them [22:18:45] PROBLEM Free ram is now: UNKNOWN on
mediahandler-test mediahandler-test output: NRPE: Unable to read output [22:18:45] ah [22:18:58] but when I do puppetd -tv it doesn't work even on that machine where I do that [22:19:01] try it out on bots-1 and bots-2 [22:19:04] I did [22:19:06] ah [22:19:07] it doesn't work [22:19:11] it doesn't upload .sh [22:19:14] only nrpe [22:19:16] restart nrpe ont hem [22:19:18] *on them [22:19:24] but file is not there [22:19:27] so it can't work [22:19:28] the file isn't installed? [22:19:30] yes [22:19:31] it isn't [22:19:35] PROBLEM Free ram is now: UNKNOWN on test1 test1 output: NRPE: Unable to read output [22:19:54] oh. I wonder if it should go in nrpe.pp [22:19:59] oh [22:20:00] yes [22:20:22] yep. wrong place for it :) [22:20:36] actually should I keep it or remove from nagios.pp [22:20:41] it could be used even in local [22:20:50] keep it [22:21:05] PROBLEM Free ram is now: UNKNOWN on labs-nfs1 labs-nfs1 output: NRPE: Unable to read output [22:21:55] PROBLEM Free ram is now: UNKNOWN on bastion1 bastion1 output: NRPE: Unable to read output [22:22:55] New patchset: Petrb; "Included check_ram.sh to nrpe" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1634 [22:23:05] PROBLEM Free ram is now: UNKNOWN on wikistats-01 wikistats-01 output: NRPE: Unable to read output [22:23:25] PROBLEM Current Load is now: CRITICAL on asher1 asher1 output: Connection refused by host [22:23:25] PROBLEM Total Processes is now: CRITICAL on asher1 asher1 output: Connection refused by host [22:23:25] PROBLEM dpkg-check is now: CRITICAL on asher1 asher1 output: Connection refused by host [22:23:31] Ryan_Lane: added [22:23:35] PROBLEM Free ram is now: UNKNOWN on labs-mc2 labs-mc2 output: NRPE: Unable to read output [22:23:59] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1634 [22:23:59] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1634 [22:24:35] PROBLEM Free ram is now: UNKNOWN
on pad1 pad1 output: NRPE: Unable to read output [22:25:04] asher is down? [22:25:14] possibly [22:25:38] that's weird [22:25:44] it was online few minutes ago [22:26:06] PROBLEM Free ram is now: UNKNOWN on bots-nfs bots-nfs output: NRPE: Unable to read output [22:26:06] PROBLEM Free ram is now: UNKNOWN on ganglia-master ganglia-master output: NRPE: Unable to read output [22:26:06] PROBLEM Free ram is now: UNKNOWN on labs-db2 labs-db2 output: NRPE: Unable to read output [22:26:16] PROBLEM Disk Space is now: CRITICAL on asher1 asher1 output: Connection refused by host [22:26:55] lol [22:27:01] there is some issue with script [22:27:07] 517% free memory [22:27:13] :D [22:27:13] wtf [22:27:36] RECOVERY Free ram is now: OK on bots-1 bots-1 output: OK: 508% free memory [22:27:43] :o [22:27:46] RECOVERY Free ram is now: OK on labs-build1 labs-build1 output: OK: 1018% free memory [22:27:46] RECOVERY Free ram is now: OK on labs-mc1 labs-mc1 output: OK: 1010% free memory [22:27:52] somehow I like it [22:27:56] RECOVERY Free ram is now: OK on master master output: OK: 337% free memory [22:28:03] yay [22:28:13] heh [22:28:36] RECOVERY Free ram is now: OK on labs-cp1 labs-cp1 output: OK: 1089% free memory [22:28:46] PROBLEM Free ram is now: UNKNOWN on p-b p-b output: NRPE: Unable to read output [22:28:56] RECOVERY Free ram is now: OK on reportcard1 reportcard1 output: OK: 323% free memory [22:29:26] RECOVERY Free ram is now: OK on labs-cp2 labs-cp2 output: OK: 1036% free memory [22:29:36] RECOVERY Free ram is now: OK on membase1 membase1 output: OK: 51% free memory [22:30:16] RECOVERY Free ram is now: OK on ganglia-collector ganglia-collector output: OK: 615% free memory [22:30:16] RECOVERY Free ram is now: OK on labs-mw2 labs-mw2 output: OK: 1016% free memory [22:30:26] RECOVERY Free ram is now: OK on pad2 pad2 output: OK: 775% free memory [22:30:33] New patchset: Petrb; "Disabled check of free ram without cache, it's broken" [operations/puppet] (test) - 
https://gerrit.wikimedia.org/r/1635 [22:30:40] Ryan_Lane: merge it please [22:30:41] :D [22:30:44] quick fix [22:31:07] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1635 [22:31:07] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1635 [22:31:10] heh [22:31:11] RECOVERY Free ram is now: OK on testpuppet testpuppet output: OK: 296% free memory [22:31:32] unfortunately this will alarm even if there is actually free ram but eaten by cache [22:31:46] PROBLEM dpkg-check is now: CRITICAL on nova-production1 nova-production1 output: Connection refused by host [22:31:56] RECOVERY Free ram is now: OK on hugglewiki hugglewiki output: OK: 272% free memory [22:32:24] Ryan_Lane: what's up with firewall in testlabs [22:32:30] all instances refuse nrpe [22:32:39] eh? shouldn't [22:32:43] !monitor nova-production1 [22:32:43] http://nagios.wmflabs.org/cgi-bin/nagios3/status.cgi?host=nova-production1 [22:32:56] PROBLEM Disk Space is now: CRITICAL on nova-production1 nova-production1 output: Connection refused by host [22:33:01] !monitor asher1 [22:33:01] http://nagios.wmflabs.org/cgi-bin/nagios3/status.cgi?host=asher1 [22:33:06] PROBLEM Total Processes is now: CRITICAL on nova-production1 nova-production1 output: Connection refused by host [22:33:26] PROBLEM Current Load is now: CRITICAL on nova-production1 nova-production1 output: Connection refused by host [22:33:36] RECOVERY Free ram is now: OK on bots_2 bots-2 output: OK: 830% free memory [22:34:36] RECOVERY Free ram is now: OK on labs-mw1 labs-mw1 output: OK: 699% free memory [22:34:47] it isn't firewalls [22:34:58] weird [22:35:01] what is it...
[22:35:02] well, not security groups anyway [22:35:10] I can't ssh there to check [22:35:16] RECOVERY Free ram is now: OK on membase2 membase2 output: OK: 22% free memory [22:35:24] :) [22:35:28] that's better [22:35:44] nrpe isn't running on them it seems [22:35:52] it stopped just now [22:36:00] maybe puppet issue [22:36:07] seems puppet didn't run on them [22:36:16] RECOVERY Disk Space is now: OK on asher1 asher1 output: DISK OK [22:36:24] !monitor bots-cb [22:36:24] http://nagios.wmflabs.org/cgi-bin/nagios3/status.cgi?host=bots-cb [22:36:46] RECOVERY Free ram is now: OK on bastion1 bastion1 output: OK: 424% free memory [22:36:56] RECOVERY Free ram is now: OK on labs-ocg1 labs-ocg1 output: OK: 75% free memory [22:36:56] RECOVERY Free ram is now: OK on membase4 membase4 output: OK: 32% free memory [22:37:00] What about bots-cb? [22:37:14] nrpe is running on bots-cb [22:37:34] Free ram UNKNOWN 2011-12-19 22:35:11 0d 0h 24m 20s 4/4 NRPE: Unable to read output [22:38:21] methecooldude: wait a moment [22:38:26] RECOVERY Current Load is now: OK on asher1 asher1 output: OK - load average: 0.03, 0.04, 0.01 [22:38:26] RECOVERY Total Processes is now: OK on asher1 asher1 output: PROCS OK: 92 processes [22:38:26] RECOVERY dpkg-check is now: OK on asher1 asher1 output: All packages OK [22:40:06] PROBLEM Disk Space is now: CRITICAL on membase1 membase1 output: Connection refused by host [22:40:16] RECOVERY Free ram is now: OK on bots-cb bots-cb output: OK: 21% free memory [22:40:26] RECOVERY Free ram is now: OK on test3 test3 output: OK: 58% free memory [22:40:46] PROBLEM Total Processes is now: CRITICAL on membase1 membase1 output: Connection refused by host [22:41:06] RECOVERY Free ram is now: OK on membase3 membase3 output: OK: 27% free memory [22:41:46] PROBLEM dpkg-check is now: CRITICAL on membase1 membase1 output: Connection refused by host [22:41:56] RECOVERY Free ram is now: OK on labs-lvs1 labs-lvs1 output: OK: 78% free memory [22:41:56] PROBLEM Free ram is now: 
WARNING on pageviews pageviews output: Warning: 6% free memory [22:42:36] PROBLEM Free ram is now: CRITICAL on membase1 membase1 output: Connection refused by host [22:42:46] RECOVERY Free ram is now: OK on bots-sql2 bots-sql2 output: OK: 28% free memory [22:42:56] PROBLEM Free ram is now: WARNING on puppet-lucid puppet-lucid output: Warning: 12% free memory [22:42:56] ^_^ [22:43:36] PROBLEM Current Load is now: CRITICAL on membase1 membase1 output: Connection refused by host [22:43:46] RECOVERY Free ram is now: OK on p-b p-b output: OK: 33% free memory [22:44:26] RECOVERY Free ram is now: OK on bots-apache1 bots-apache1 output: OK: 54% free memory [22:45:16] PROBLEM Free ram is now: CRITICAL on labs-db1 labs-db1 output: Connection refused by host [22:46:56] RECOVERY Free ram is now: OK on bots-sql1 bots-sql1 output: OK: 70% free memory [22:46:56] RECOVERY Free ram is now: OK on vumi-gw1 vumi-gw1 output: OK: 57% free memory [22:48:36] RECOVERY Free ram is now: OK on bots-sql3 bots-sql3 output: OK: 60% free memory [22:48:56] PROBLEM Free ram is now: WARNING on mediahandler-test mediahandler-test output: Warning: 12% free memory [22:49:26] PROBLEM Total Processes is now: CRITICAL on labs-db1 labs-db1 output: Connection refused by host [22:49:36] PROBLEM dpkg-check is now: CRITICAL on labs-db1 labs-db1 output: Connection refused by host [22:49:36] PROBLEM Disk Space is now: CRITICAL on labs-db1 labs-db1 output: Connection refused by host [22:49:36] PROBLEM Free ram is now: WARNING on test1 test1 output: Warning: 12% free memory [22:51:01] !ping [22:51:04] !bang [22:51:05] Bang!! 
[22:51:06] PROBLEM Free ram is now: CRITICAL on labs-nfs1 labs-nfs1 output: Critical: 3% free memory [22:51:16] PROBLEM Current Load is now: CRITICAL on labs-db1 labs-db1 output: Connection refused by host [22:52:46] Ryan_Lane: 3% :P [22:52:48] that's bad [22:53:36] RECOVERY Free ram is now: OK on labs-mc2 labs-mc2 output: OK: 74% free memory [22:53:40] it has lots [22:53:44] +- buffers [22:53:47] I know [22:53:52] that's option b for [22:53:58] but it works only on my pc :D [22:54:06] nfs servers are supposed to use as much ram as possible :) [22:54:14] probably [22:54:21] I will fix it [22:54:36] RECOVERY Free ram is now: OK on pad1 pad1 output: OK: 60% free memory [22:56:06] RECOVERY Free ram is now: OK on bots-nfs bots-nfs output: OK: 70% free memory [22:56:06] RECOVERY Free ram is now: OK on ganglia-master ganglia-master output: OK: 48% free memory [22:56:06] PROBLEM Free ram is now: CRITICAL on labs-db2 labs-db2 output: Connection refused by host [22:58:16] PROBLEM Total Processes is now: CRITICAL on labs-db2 labs-db2 output: Connection refused by host [22:58:26] RECOVERY Current Load is now: OK on nova-production1 nova-production1 output: OK - load average: 1.73, 1.14, 0.92 [22:59:26] PROBLEM Disk Space is now: CRITICAL on labs-db2 labs-db2 output: Connection refused by host [22:59:51] PROBLEM Current Load is now: CRITICAL on labs-db2 labs-db2 output: Connection refused by host [22:59:51] PROBLEM dpkg-check is now: CRITICAL on labs-db2 labs-db2 output: Connection refused by host [22:59:56] PROBLEM Current Load is now: CRITICAL on master master output: Connection refused by host [22:59:56] PROBLEM dpkg-check is now: CRITICAL on master master output: Connection refused by host [23:00:06] RECOVERY Disk Space is now: OK on membase1 membase1 output: DISK OK [23:00:51] RECOVERY Total Processes is now: OK on membase1 membase1 output: PROCS OK: 77 processes [23:01:06] PROBLEM Free ram is now: CRITICAL on master master output: Connection refused by host [23:01:26] 
PROBLEM Disk Space is now: CRITICAL on master master output: Connection refused by host [23:01:46] PROBLEM Total Processes is now: CRITICAL on master master output: Connection refused by host [23:01:51] RECOVERY dpkg-check is now: OK on nova-production1 nova-production1 output: All packages OK [23:01:51] RECOVERY dpkg-check is now: OK on membase1 membase1 output: All packages OK [23:02:05] New patchset: Petrb; "Fixed nrpe and check of ram" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1638 [23:02:06] PROBLEM Free ram is now: WARNING on reportcard1 reportcard1 output: Warning: 16% free memory [23:02:12] Ryan_Lane: can you review it? :P [23:02:44] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1638 [23:02:45] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1638 [23:02:48] thanks [23:02:51] yw [23:02:56] RECOVERY Disk Space is now: OK on nova-production1 nova-production1 output: DISK OK [23:03:06] RECOVERY Total Processes is now: OK on nova-production1 nova-production1 output: PROCS OK: 150 processes [23:03:19] I'm making an interface for adding/removing puppet classes and variables [23:03:20] :) [23:03:32] oh, damn I made a mistake there [23:03:34] and letting them be grouped [23:03:36] RECOVERY Current Load is now: OK on membase1 membase1 output: OK - load average: 0.57, 0.16, 0.05 [23:03:38] heh [23:03:41] oh? 
[23:03:51] I left a debug message there [23:04:16] submit a new change [23:04:26] PROBLEM Free ram is now: CRITICAL on testpuppet testpuppet output: Critical: 3% free memory [23:06:23] New patchset: Petrb; "fixed debug message" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1639 [23:06:23] Ryan_Lane: done [23:08:41] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1639 [23:08:41] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1639 [23:09:46] PROBLEM dpkg-check is now: CRITICAL on membase1 membase1 output: Connection refused by host [23:11:36] PROBLEM Current Load is now: CRITICAL on membase1 membase1 output: Connection refused by host [23:11:56] RECOVERY Free ram is now: OK on pageviews pageviews output: 295900 [23:12:56] RECOVERY Free ram is now: OK on puppet-lucid puppet-lucid output: 300952 [23:13:06] PROBLEM Disk Space is now: CRITICAL on membase1 membase1 output: Connection refused by host [23:13:46] PROBLEM Total Processes is now: CRITICAL on membase1 membase1 output: Connection refused by host [23:14:26] RECOVERY Free ram is now: OK on canonical-bridge canonical-bridge output: OK: 71% free memory [23:14:35] :) [23:15:16] RECOVERY Free ram is now: OK on labs-db1 labs-db1 output: OK: 94% free memory [23:16:16] RECOVERY Current Load is now: OK on labs-db1 labs-db1 output: OK - load average: 0.08, 0.05, 0.01 [23:19:26] RECOVERY Total Processes is now: OK on labs-db1 labs-db1 output: PROCS OK: 87 processes [23:19:36] RECOVERY Disk Space is now: OK on labs-db1 labs-db1 output: DISK OK [23:19:36] RECOVERY dpkg-check is now: OK on labs-db1 labs-db1 output: All packages OK [23:23:06] RECOVERY Free ram is now: OK on wikistats-01 wikistats-01 output: OK: 90% free memory [23:24:02] 12/19/2011 - 23:24:02 - Updating keys for robh [23:24:02] 12/19/2011 - 23:24:02 - Updating keys for robh [23:24:04] 12/19/2011 - 23:24:04 - Updating keys for robh [23:24:09] 
12/19/2011 - 23:24:09 - Updating keys for robh [23:34:46] PROBLEM dpkg-check is now: CRITICAL on nova-production1 nova-production1 output: Connection refused by host [23:35:56] PROBLEM Disk Space is now: CRITICAL on nova-production1 nova-production1 output: Connection refused by host [23:36:26] PROBLEM Current Load is now: CRITICAL on nova-production1 nova-production1 output: Connection refused by host [23:37:06] PROBLEM Total Processes is now: CRITICAL on nova-production1 nova-production1 output: Connection refused by host [23:53:16] RECOVERY Total Processes is now: OK on labs-db2 labs-db2 output: PROCS OK: 87 processes [23:54:26] RECOVERY Disk Space is now: OK on labs-db2 labs-db2 output: DISK OK [23:54:46] RECOVERY Current Load is now: OK on labs-db2 labs-db2 output: OK - load average: 0.03, 0.03, 0.00 [23:54:46] RECOVERY dpkg-check is now: OK on labs-db2 labs-db2 output: All packages OK [23:56:06] RECOVERY Free ram is now: OK on master master output: OK: 77% free memory [23:56:06] RECOVERY Free ram is now: OK on labs-db2 labs-db2 output: OK: 94% free memory [23:56:26] RECOVERY Disk Space is now: OK on master master output: DISK OK [23:56:46] RECOVERY Total Processes is now: OK on master master output: PROCS OK: 113 processes [23:59:56] RECOVERY Current Load is now: OK on master master output: OK - load average: 0.07, 0.13, 0.09 [23:59:56] RECOVERY dpkg-check is now: OK on master master output: All packages OK
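
Note on the free-ram check discussed above: the log shows two problems with the check, percentages above 100 ("OK: 508% free memory") and alarms on hosts whose RAM is mostly buffers and cache (labs-nfs1 at 3%). The actual check_ram.sh is not shown in this log, so the following is only a rough Python sketch of how such a check could be written, not the script that was merged. The function names, the sample /proc/meminfo text, and the thresholds (warn=20, crit=5) are assumptions made for illustration; the thresholds were merely picked to agree with the messages seen in the log. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Field: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def free_percent(info, count_cache_as_free=True):
    """Return free memory as an integer percentage of MemTotal.

    With count_cache_as_free=True, buffers and page cache are treated as
    reclaimable, which is what the "+- buffers" remark in the log is about.
    """
    free = info["MemFree"]
    if count_cache_as_free:
        free += info.get("Buffers", 0) + info.get("Cached", 0)
    return 100 * free // info["MemTotal"]


def classify(pct, warn=20, crit=5):
    """Map a percentage to (exit code, message). Thresholds are hypothetical."""
    if not 0 <= pct <= 100:
        # A value such as 508% means the arithmetic is wrong; report UNKNOWN
        # instead of a misleading OK.
        return 3, f"Unknown: implausible value {pct}%"
    if pct <= crit:
        return 2, f"Critical: {pct}% free memory"
    if pct <= warn:
        return 1, f"Warning: {pct}% free memory"
    return 0, f"OK: {pct}% free memory"


# Hypothetical sample resembling a file server whose RAM is mostly cache.
SAMPLE = """MemTotal:        1000000 kB
MemFree:           30000 kB
Buffers:          100000 kB
Cached:           500000 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print(classify(free_percent(info, count_cache_as_free=False))[1])
    print(classify(free_percent(info, count_cache_as_free=True))[1])
```

On a real host the text would come from reading /proc/meminfo rather than from SAMPLE. Treating cache as free is a design choice: it avoids paging on hosts like NFS servers that are expected to fill RAM with cache, at the cost of reacting later to genuine memory pressure.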