[03:00:45] RECOVERY Puppet freshness is now: OK on otrs-jgreen i-0000015a output: puppet ran at Mon Apr 30 03:00:32 UTC 2012
[03:37:00] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 16% free memory
[03:51:11] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 14% free memory
[03:57:01] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 4% free memory
[03:58:31] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 15% free memory
[03:58:41] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 15% free memory
[04:06:02] PROBLEM Free ram is now: WARNING on test3 i-00000093 output: Warning: 7% free memory
[04:07:02] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 96% free memory
[04:11:02] RECOVERY Free ram is now: OK on test3 i-00000093 output: OK: 96% free memory
[04:16:12] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 3% free memory
[04:18:32] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 4% free memory
[04:18:42] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 3% free memory
[04:21:12] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 93% free memory
[04:23:32] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 96% free memory
[04:23:42] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 96% free memory
[06:54:22] PROBLEM Disk Space is now: CRITICAL on nagios 127.0.0.1 output: DISK CRITICAL - free space: /home/petrb 0 MB (0% inode=81%): /home/laner 0 MB (0% inode=81%): /home/dzahn 0 MB (0% inode=81%):
[06:59:22] RECOVERY Disk Space is now: OK on nagios 127.0.0.1 output: DISK OK
[07:29:02] Ryan_Lane: hey
[07:29:30] I was thinking of moving etherpad from prod to labs, there are many reasons to do that
[07:30:14] etherpad isn't working well and there are almost no people who care of it, which makes sense
[07:31:01] I don't see a reason why wmf ops should manage it, this is something what could be easily managed by volunteer ops as well
[07:33:04] or mutante?
[07:33:14] some opinions on that
[07:33:53] someone awake?
[07:34:48] Damianz: btw regarding the 404 not found, it was OK, why it's NOT OK now
[07:35:01] someone did a change and didn't log it
[08:46:18] Change on mediawiki a page OAuth/User stories was modified, changed by 90.183.23.27 link https://www.mediawiki.org/w/index.php?diff=530678 edit summary: + 2
[08:48:13] Change on mediawiki a page OAuth/User stories was modified, changed by 90.183.23.27 link https://www.mediawiki.org/w/index.php?diff=530679 edit summary: /* User story: huggle 3x */
[09:02:17] petan|wk: ok, what's up?
[09:02:36] [09:29:30] I was thinking of moving etherpad from prod to labs, there are many reasons to do that
[09:02:36] [09:30:14] etherpad isn't working well and there are almost no people who care of it, which makes sense
[09:02:36] [09:31:01] I don't see a reason why wmf ops should manage it, this is something what could be easily managed by volunteer ops as wel
[09:03:14] what you think?
[09:03:15] i think its a good idea to test enhancements to etherpad in labs
[09:03:29] i am not sure about moving it away from production
[09:03:32] I was rather thinking of operating it on labs
[09:03:49] Ryan said that labs are going to be "operation platform" for volunteers not only testing area
[09:03:53] hmm, not sure about that, we rely on it
[09:04:00] that's problem
[09:04:11] people rely on it and production version doesn't work well
[09:04:22] because ops are busy maintaining wikimedia projects
[09:04:37] what do you mean exactly by "doesnt work well"
[09:04:47] so no one has time to maintain etherpad and volunteers who could help with that are not going to get access
[09:04:48] issues in the software or with the server
[09:04:54] with installation
[09:05:02] during weekend we had a session with Sumana
[09:05:21] she told us to use ep on other server because wikimedia has technical troubles
[09:05:30] in fact I saw a lot of problems with it in past
[09:05:48] apergos said there is only 1 person in ops team who knows about etherpad and that's guy who installed it
[09:06:03] and it looks to me that since it was installed no one really cared to fix it or update it
[09:06:12] wait
[09:06:22] we have existing etherpad tickets
[09:06:28] and when i started to work on some of it
[09:06:43] i was told to wait, because there would already be volunteer work on it, going on
[09:06:59] ok my point is why this tool needs to be strictly maintained only by wmf
[09:07:05] and then somebody asked where that was.. and somebody else linked something
[09:07:18] I understand that wikimedia projects collect private data and that's reason why volunteers can't have shell
[09:08:30] puppetizing etherpad-lite: great! but check what others have done already
[09:08:44] finding out about getting the "private pad" feature / right license: great
[09:08:59] running non-puppetized etherpad on labs, with volunteers on shell: i dont like so much
[09:09:34] i would like to find out about the private pad feature
[09:09:52] and then, if there would be private pads, that would be an issue with volunteer access again.. maybe.. though..hmm
[09:10:02] running non-puppetized etherpad on labs, with volunteers on shell: i dont like so much - why?
[09:10:13] ok, first thing is: we need to know what the other volunteers have done already
[09:10:21] ok private pads, is something what should be on prod
[09:10:27] but public pads why not?
[09:10:41] I don't understand why we have to use 3rd server for pad for own wikimedia purposes
[09:10:46] because the whole point is trying to puppetize all services, so nobody (volunteer or ops) needs to work on shell at all (in an ideal world)
[09:11:03] I don't want to blame anyone from operation for not being able to maintain it, but this is something what could be perfectly maintained by volunteers as well
[09:11:17] you always need to work in shell
[09:11:20] when stuff breaks
[09:11:26] puppet won't fix broken db
[09:12:40] i just don't like the part of having it un-puppetized, we have a part of it already
[09:12:44] misc::etherpad
[09:12:49] etherpad has low priority for people from operation, they are supposed to manage the main projects, so why not leave these minor things to volunteers, you said "we rely on that" why you think that volunteers aren't able to keep it working?
[09:12:58] and as mentioned above, somebody already worked on it, we need to check where and how much
[09:13:16] + if it was on labs, it would be same as now, people from operation has access on labs as well
[09:13:47] mutante: I never said it would not be puppetized
[09:14:31] it could be, but there could be access for some other people to patch it or fix the broken database
[09:14:31] well an instance on labs running etherpad to improve thing is of course fine
[09:14:45] or broken configuration of current EP
[09:14:51] which is broken
[09:15:00] this is hard to do without shell
[09:15:48] anyway if you don't like to move etherpad from production maybe having a second etherpad on labs would be a compromise then, I think people would appreciate to have etherpad which isn't so broken
[09:16:05] these are 2 different things, or 3
[09:16:24] first is: what have other volunteers done already and how much of it works
[09:16:32] nothing
[09:16:33] second: with which exact version, etherpad vs. etherpad-lite
[09:16:45] third: what about the license (hack)
[09:17:33] first something new should run in labs, sure, the last decision after something new is here would be to move it to prod or just keeping it in labs.. to my understanding
[09:17:58] and it would be enhancing class misc::etherpad
[09:18:23] well, or misc::etherpad_lite
[09:18:26] or a merge of the 2
[09:19:31] and the existing issues with it that sumana listed should be compared against existing ticket
[09:19:51] yea, maybe volunteers need access to the tickets (copy from RT to Bugzilla additionally)
[09:20:45] petan|wk: looking up the right mail thread i mentioned earlier...
[09:21:33] https://bugzilla.wikimedia.org/show_bug.cgi?id=34953
[09:21:51] it's T.Gries, the guy who developed the MW extension
[09:21:56] i wonder if he is on labs
[09:22:16] Etherpad lite is an entire rewrite using node.js. It aims at being
[09:22:16] embeddable and Thomas Gries wrote an extension to do just that on MediaWiki.
[09:22:32] hashar listed what we need:
[09:22:36] an up-to-date node.js debian package
[09:22:42] package EtherpadLite if it is not already done by someone
[09:22:48] review and deploy the MediaWiki extension
[09:23:02] these steps should, yeah, sure, be done in labs
[09:23:56] like get the packages, install them on an instance, with that MW extension, test embedding in a wiki page..
[09:24:12] best if we can get T.Gries into the project as a user as well
[09:28:52] existing tickets we have include: "install search", "https link is broken/redirect", "update version" ,"reports all logged in users as IP 127.0.0.1", "connection failures". ack
[09:33:17] all things that should be fixed and tested on labs.
[09:35:04] 04/30/2012 - 09:35:04 - Creating a home directory for dzahn at /export/home/etherpad/dzahn
[09:36:04] 04/30/2012 - 09:36:04 - Updating keys for dzahn
[09:40:50] mutante: sorry
[09:40:52] was on meetin
[09:44:27] mutante: ok so right now question is if we are going to have etherpad like now, or embedded
[09:44:52] embedded can't be maintained by volunteers who don't have access to production, but the current version could be if it was living on labs
[09:44:59] btw have to go for 1 hour
[09:45:40] why my home is full
[10:06:22] petan|wk: that's right. indeed the next question is "do we want the mw embedding?", it _looks_ like that would mean we have to use etherpad-lite instead of classic, which is a complete rewrite and not necessarily everybody agrees is the better one, BUT it is the one made for embedding. But maybe the extension could work with classic as well if we ask T.Gries, maybe not. And i'm pretty sure if we ask others if they want the embedded pads, they wil
[10:20:02] editing on http://en.wikipedia.beta.wmflabs.org/ is blocked? Your current IP address is 10.4.0.17. Your IP address has been blocked on all wikis.
[10:27:25] lol thats a private ip
[10:29:45] yes, looks like there is some issue with the squid setup
[10:30:54] 10.4.0.17 is the squid ip
[10:52:15] New patchset: J; "move tmh ppa into its own class" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6109
[10:52:29] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/6109
[11:03:51] PROBLEM Current Users is now: CRITICAL on timedmedia-webtest i-00000232 output: Connection refused by host
[11:04:45] PROBLEM Disk Space is now: CRITICAL on timedmedia-webtest i-00000232 output: Connection refused by host
[11:05:17] PROBLEM Free ram is now: CRITICAL on timedmedia-webtest i-00000232 output: Connection refused by host
[11:06:32] PROBLEM Total Processes is now: CRITICAL on timedmedia-webtest i-00000232 output: Connection refused by host
[11:06:57] PROBLEM dpkg-check is now: CRITICAL on timedmedia-webtest i-00000232 output: Connection refused by host
[11:08:27] PROBLEM Current Load is now: CRITICAL on timedmedia-webtest i-00000232 output: Connection refused by host
[11:15:27] petan|wk: have you figured out why all requests to *.wikipedia.beta.wmflabs.org are coming from 10.4.0.17?
[11:17:04] petan|wk: also, /etc/apache2 or more /data/project/ is not working on deployment-web(ls fails with ls: cannot open directory but its listed in mount)
[11:17:33] PROBLEM host: timedmedia-webtest is DOWN address: i-00000232 check_ping: Invalid hostname/address - i-00000232
[11:20:20] j^: no where is hashar
[11:20:26] he caused this
[11:24:41] PROBLEM host: deployment-transcoding is DOWN address: i-00000105 CRITICAL - Host Unreachable (i-00000105)
[11:25:54] j^: did you restart transcoding now?
[11:25:59] j^: can you please log stuff
[11:26:29] petan|wk: yes, i was rebooting transcoding just now, but now it does not start
[11:28:16] !log deployment-prep reboot deployment-transcoding(i-00000105)
[11:28:18] Logged the message, Master
[11:29:46] sadly no ping now and get console output hangs on labsconsole
[11:40:04] RECOVERY host: deployment-transcoding is UP address: i-00000105 PING OK - Packet loss = 0%, RTA = 0.97 ms
[11:40:25] mount.nfs: DNS resolution failed for deployment-nfs-memc: Temporary failure in name resolution
[11:45:04] PROBLEM dpkg-check is now: CRITICAL on deployment-transcoding i-00000105 output: Connection refused by host
[11:45:35] PROBLEM Free ram is now: CRITICAL on deployment-transcoding i-00000105 output: Connection refused by host
[11:46:35] PROBLEM Current Load is now: CRITICAL on deployment-transcoding i-00000105 output: Connection refused by host
[11:47:15] PROBLEM Total Processes is now: CRITICAL on deployment-transcoding i-00000105 output: Connection refused by host
[11:47:15] PROBLEM Current Users is now: CRITICAL on deployment-transcoding i-00000105 output: Connection refused by host
[11:47:35] PROBLEM SSH is now: CRITICAL on deployment-transcoding i-00000105 output: Connection refused
[11:53:44] j^: checking
[11:56:35] RECOVERY Current Load is now: OK on deployment-transcoding i-00000105 output: OK - load average: 0.35, 0.08, 0.03
[11:56:55] j^: solved
[11:57:15] RECOVERY Total Processes is now: OK on deployment-transcoding i-00000105 output: PROCS OK: 102 processes
[11:57:20] RECOVERY Current Users is now: OK on deployment-transcoding i-00000105 output: USERS OK - 0 users currently logged in
[11:57:35] RECOVERY SSH is now: OK on deployment-transcoding i-00000105 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[11:57:42] !log deployment-prep petrb: fixed transcoding
[11:57:43] Logged the message, Master
[11:57:52] mutante: back
[11:58:31] petan|wk: thanks
[11:58:32] mutante: I don't know what you wanted to say, it was too long for irc server, anyway if you don't like idea of moving etherpad, what about creating another one on labs, which would actually work
[11:59:47] mutante: maybe if you saw that volunteers are able to operate stable and running version you would change your opinion on that...
[12:00:05] RECOVERY dpkg-check is now: OK on deployment-transcoding i-00000105 output: All packages OK
[12:00:35] RECOVERY Free ram is now: OK on deployment-transcoding i-00000105 output: OK: 78% free memory
[12:58:44] PROBLEM Disk Space is now: CRITICAL on bz-dev i-000001db output: DISK CRITICAL - free space: / 34 MB (2% inode=43%):
[12:59:24] RECOVERY Total Processes is now: OK on incubator-bots2 i-00000119 output: PROCS OK: 123 processes
[12:59:34] PROBLEM Current Load is now: CRITICAL on incubator-bots2 i-00000119 output: CRITICAL - load average: 102.48, 1562.26, 1093.07
[13:03:44] PROBLEM Disk Space is now: WARNING on bz-dev i-000001db output: DISK WARNING - free space: / 52 MB (3% inode=43%):
[13:24:36] !log deployment-prep petrb: rebooting web3 broken /data/project/
[13:24:38] Logged the message, Master
[13:26:45] !log deployment-prep petrb: same for web
[13:26:46] Logged the message, Master
[13:27:12] !log deployment-prep petrb: web4 reboot for same reason
[13:27:13] Logged the message, Master
[13:28:33] boom
[13:31:21] !log deployment-prep petrb: rebooting web5
[13:31:22] Logged the message, Master
[13:38:19] werdna: did you make a backup before you installed het?
[13:38:29] because it seems that you replaced some files
[13:38:39] actually replaced some folders with other folders
[13:39:29] I like how someone make a crucial change and mysteriously disappear
[13:39:45] :O
[13:58:13] RECOVERY Disk Space is now: OK on labs-nfs1 i-0000005d output: DISK OK
[13:59:43] PROBLEM HTTP is now: WARNING on deployment-web i-00000217 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.003 second response time
[14:00:29] PROBLEM HTTP is now: CRITICAL on deployment-web3 i-00000219 output: Connection refused
[14:01:03] PROBLEM HTTP is now: WARNING on deployment-web4 i-00000214 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.011 second response time
[14:01:33] PROBLEM HTTP is now: WARNING on deployment-web5 i-00000213 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.005 second response time
[14:04:43] PROBLEM Current Load is now: WARNING on incubator-bots2 i-00000119 output: WARNING - load average: 0.06, 0.08, 16.39
[14:09:12] How does one get bottie to do something?
[14:09:21] ircecho
[14:10:19] Hydriz: Ok... so how does it work?
[14:10:36] get petan to help
[14:10:40] he knows how
[14:11:04] methecooldude: where
[14:12:14] I'm just wondering how you make it interact with morebots?
[14:12:21] type log in shell
[14:12:26] Do you like, edit a page which sets it off?
[14:12:30] Ohh, ok
[14:12:52] This works on any instance?
[14:12:52] but it needs to be enabled on project
[14:13:00] no only where it is enabled
[14:13:06] mutante: can we make a puppet class for this
[14:13:12] Ah, can you turn it on for bots then?
[14:13:17] yes
[14:14:39] petan|wk: for what exactly? a default IRC bot? sure! we love to see bots being puppetized and documented
[14:15:04] mutante: no bottie
[14:15:08] there is a command log
[14:15:14] which is enabled on some instances
[14:15:24] I would rather make a puppet class for it
[14:15:30] so that I don't have to copy it to /bin by hand
[14:15:52] oh, it's logging shell commands? where does it log to?
[14:15:56] SAL
[14:16:05] i see, that was new to me
[14:16:08] no it logs what you want to log
[14:16:12] look
[14:16:44] yea, you type "log" , saw that
[14:16:52] !log bots petrb: this is a message I just logged
[14:16:53] Logged the message, Master
[14:17:03] Thanks petan
[14:17:08] petrb@bots-4:~$ log this is a message I just logged
[14:17:08] Message logged
[14:17:13] petan|wk: so "log" is a bash alias and a bash script?
[14:17:22] it's a bash script living in /bin
[14:17:32] sure, can puppetize
[14:17:35] the bin
[14:17:38] but it has 1 variable
[14:17:43] that one is per project
[14:17:45] i get a 500 error, is there a way to see the error that gets thrown? it does not show up in the logs right now
[14:18:19] j^: it should be in logs unless someone broke it
[14:18:30] j^: I suspect hashar did a lot of changes and disappeared
[14:18:56] petan|wk: not sure exceptions are logged
[14:18:59] j^: can you link me to page that produces 500 error
[14:19:00] petan|wk: suggesting this: find a good place for it in the puppet/files directory and just commit a change putting the bash script itself there, then add a review request for me in gerrit that i should puppetize it in a class
[14:19:20] petan|wk: http://en.wikipedia.beta.wmflabs.org/w/index.php?title=User:J&action=purge
[14:19:20] j^: hashar reinstalled all apaches since then I don't know how does it work
[14:19:29] you should ask him
[14:19:38] i know where in the code it fails but not why
[14:19:54] mutante: I have no idea how to do that
[14:20:08] someone told me it's easy, but that someone never had a time to tell me how
[14:20:26] I know how to put files there, but I don't know how to replace that variable
[14:21:13] mutante: http://pastebin.mozilla.org/1607281
[14:21:19] just put the bash file in gerrit, replace password or variable with XXX
[14:21:20] second line
[14:21:25] and i will care about puppet part
[14:21:34] mutante: I just sent you a source code, there is no password in that
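For reference, the `log` helper described above (14:15 to 14:17) can be sketched in a few lines of bash. This is only a guess at its shape: the real script is linked on the pastebin and is not reproduced in this log. The function name `format_log_line` and the hard-coded value `bots` are assumptions made for illustration; the output format is taken from the `!log bots petrb: ...` line shown in the channel.

```shell
#!/bin/bash
# Hypothetical reconstruction; the real script is only linked in the log.
project="bots"   # the per-project value on the second line, set by hand here

# Build the line that the relay announces in the channel:
#   !log <project> <user>: <message>
format_log_line() {
    local user="$1"
    shift
    printf '!log %s %s: %s\n' "$project" "$user" "$*"
}

# Example call (how the line reaches IRC is outside this sketch):
#   format_log_line "$USER" this is a message I just logged
```

The only thing that differs between projects is the `project=` assignment, which is why the rest of the discussion is about how to fill it in automatically.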
[14:21:45] putting one static file in there is also a good way to try the gerrit pushing
[14:22:27] I know how to push things, I don't know the variable thing :)
[14:22:36] check that second line
[14:22:42] then push it and ask for review
[14:22:53] but I know it won't work because of that second line
[14:23:02] I need to know how to fix that
[14:23:06] doesn't matter, can amend to same change, you or others
[14:23:22] you can also use it for questions that way
[14:23:24] ok can you just tell me how to fix it so that I don't have to commit something what is broken
[14:23:51] that is = review :)
[14:23:56] ok
[14:24:29] i haven't looked at a line of the Bash yet, but commits dont have to be perfect , especially in labs
[14:24:37] we can amend 100 times before merging it
[14:24:43] RECOVERY Current Load is now: OK on incubator-bots2 i-00000119 output: OK - load average: 0.00, 0.00, 4.50
[14:24:48] and to me it is more "wiki way" anyways
[14:25:31] mutante: where should I save it
[14:28:33] petan|wk: /puppet/files/misc/ is like a default if nothing else in /puppet/files/ seems to match
[14:29:35] if you know there will be more scripts related to this, make a new directory in files or files/misc/ or files/misc/scripts ...hmm, yeah this one is probably not the easiest to put into a group
[14:30:23] PROBLEM HTTP is now: WARNING on deployment-web3 i-00000219 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.037 second response time
[14:30:28] make it puppet/files/misc/scripts/ for now
[14:30:55] !leslie's-reset
[14:30:55] git reset --hard origin/test
[14:31:15] heh, this is named after leslie now?:) gg
[14:32:02] New patchset: Petrb; "Added a script for mutante to fix" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6118
[14:32:15] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/6118
[14:35:24] petan|wk: fyi, add me as reviewer in the "Add Reviewer" thing in gerrit rather than having to put nickname in commit. this way you can be sure i always see it on my ToDo list in gerrit as well
[14:35:46] won't grep logs for my nick most likely. just saying in general
[14:35:49] I didn't know what to put as commit message
[14:35:57] ok
[14:36:17] I just committed a file with no other purpose then letting you see it
[14:36:21] * than
[14:36:58] the purpose is you want it as a nice class to apply to other instances :)
[14:37:25] so you said the problem is you can't get the project name?
[14:37:37] New review: Petrb; "(no comment)" [operations/puppet] (test) C: 0; - https://gerrit.wikimedia.org/r/6118
[14:38:03] mutante: yes, line 2
[14:38:09] project="something here"
[14:38:24] I don't know that something
[14:38:27] it's project name
[14:38:32] how do I get it from puppet
[14:39:08] New review: Dzahn; "replied inline" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/6118
[14:39:29] ah, you want that to be automatic
[14:39:42] thought for a moment you want "log project bla" like on IRC
[14:39:59] no it's automatic
[14:40:03] yea, someone mentioned the variable name.. ehm..looking
[14:43:42] petan|wk: i just see $INSTANCENAME currently
[14:43:54] we need a project name
[14:44:13] it is defined in /etc/bash.bashrc, that was what Platonides pointed out when i wanted instance name
[14:44:23] i was hoping for project name to be in the same place
[14:44:37] I don't think there is a project name defined anywhere on instance
[14:44:40] if you dont have another way to ask for project name by instance name...
[14:44:59] I would rather have puppet do that
[14:45:00] by asking an external server
[14:45:17] so that when it install the file to /bin it already contains the line project=blah
[14:45:18] yeah, it would certainly be nice to have $PROJECTNAME just like $INSTANCENAME
[14:45:21] in the same file
[14:45:28] that is Bugzilla-worthy
[14:45:37] i can create it
[14:45:49] I don't mean $PROJECTNAME in bash.rc I mean in puppet
[14:46:02] puppet can have a template for files
[14:46:08] well, you want it for a bash script
[14:46:17] puppet will just put both files in place, the bashrc and your script
[14:46:21] yes, bash script could be a template in puppet
[14:46:30] it would be generated by puppet
[14:46:40] in bots project it would create line project=bots
[14:46:41] yes, like INSTANCENAME is
[14:46:42] in that script
[14:46:53] like it is now
[14:46:57] puppet puts it there when creating the bash.bashrc on new instances
[14:47:10] so then you can use it in bash scripts per default
[14:47:10] ok, I need the same
[14:47:29] yes I know what is bash.rc I don't know how puppet change it
[14:47:52] puppet writes it on instance creation or at least adds that part exporting INSTANCENAME
[14:48:01] i'll create a bug
[14:48:12] what I need is that when puppet install it to instance it replace second line of script with project=xx
[14:48:30] where xx is project name
[14:48:36] I don't need any change in bash.rc
[14:48:44] nah, the other way around
[14:48:55] I need puppet generate this file just as it does with bash rc
[14:48:58] you just want project=$PROJECTNAME in your script and its done
[14:49:11] you just want to be able to script with it in any script on any instance
[14:49:17] without having to generate every script
[14:49:20] yes I know that's a way to do that but I would like to know to do this using puppet as well
[14:49:32] how does it generate scripts?
[14:49:36] like bash.rc
[14:49:39] it uses templates
[14:49:42] in "erb"
[14:49:54] !erb
[14:49:58] so if this file was a template we could just do project=
[14:50:24] I don't know if we need to have project name system wide
[14:50:30] http://docs.puppetlabs.com/guides/templating.html
[14:50:35] yes, that would technically work
[14:50:43] ok, if you know how to do that
[14:50:44] but it means you have the template every script
[14:50:53] there is only 1 script so far
[14:50:58] where the other way is a lot easier and has already been started with INSTANCENAME
[14:51:13] stress on "so far"
[14:51:18] ok, in that case why you don't replace the bash.rc in puppet
[14:51:25] so that template contains PROJECTNAME
[14:51:36] thats what i say i am opening a bug for
[14:51:41] why a bug?
[14:51:42] ack
[14:51:47] you can change it?
[14:52:00] that's what we do, we document stuff in bugs and schedule them
[14:52:09] hm, right
[14:52:33] i don't get anything else done if i jump on those little changes right away, even if each one sounds like quick
[14:53:00] also very hard to look anything up later if we dont write it _somewhere_. and before i write mail i could as well use bz right away
[14:54:59] it does not mean i may not be the one who resolves it later, but others have the chance to as well
[14:55:13] need to go for short break
[14:55:17] ok
[15:05:00] New review: Dzahn; "(no comment)" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/6118
[15:20:37] Can anyone test this Windows 7 Theme that I wrote to subscribe to Wikimedia Commons' Featured Wallpapers as a desktop wallpaper slideshow? I am unable to test it myself since I am on a network that restricts bandwidth. http://bit.ly/IkgOvn
[15:43:44] PROBLEM Current Load is now: CRITICAL on pediapress-ocg1 i-00000233 output: CHECK_NRPE: Error - Could not complete SSL handshake.
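The two approaches debated between 14:44 and 14:51 can be compared side by side in a small bash sketch. Both halves rest on assumptions: `PROJECTNAME` is the variable proposed in the discussion (nothing in the log shows that it exists yet), and the `@PROJECT@` placeholder and both function names are invented here. The `sed` substitution only stands in for what a Puppet template would do when it writes the file; it is not Puppet or ERB syntax.

```shell
#!/bin/bash
# Sketch only: PROJECTNAME is the proposed variable, "@PROJECT@" is made up.

# Option 1: the script reads the project from the environment at run time,
# the same way INSTANCENAME is said to be exported from /etc/bash.bashrc.
project_from_env() {
    printf '%s\n' "${PROJECTNAME:-unknown}"
}

# Option 2: the script is generated once per project from a template, so the
# installed copy already contains the finished project= line.
# Note: a project name containing "/" would break this simple sed call.
render_log_script() {
    local project_name="$1"
    sed "s/@PROJECT@/${project_name}/" <<'EOF'
#!/bin/bash
project="@PROJECT@"
echo "!log $project $USER: $*"
EOF
}
```

Option 1 needs a single change in one shared file and then works for every script, which is the point made at 14:50:58. Option 2 keeps the value out of the environment, but every script that needs it has to become a template.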
[15:44:24] PROBLEM Current Users is now: CRITICAL on pediapress-ocg1 i-00000233 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:45:04] PROBLEM Disk Space is now: CRITICAL on pediapress-ocg1 i-00000233 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:45:44] PROBLEM Free ram is now: CRITICAL on pediapress-ocg1 i-00000233 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:46:54] PROBLEM Total Processes is now: CRITICAL on pediapress-ocg1 i-00000233 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:47:34] PROBLEM dpkg-check is now: CRITICAL on pediapress-ocg1 i-00000233 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:01:56] Change on mediawiki a page OAuth/User stories was modified, changed by Denny Vrandečić (WMDE) link https://www.mediawiki.org/w/index.php?diff=530746 edit summary:
[16:49:15] Change on mediawiki a page OAuth/User stories was modified, changed by DarTar link https://www.mediawiki.org/w/index.php?diff=530768 edit summary: minor clean up, changed section titles, changing subheading to bold to avoid cluttering up TOC
[17:03:35] Change on mediawiki a page OAuth/User stories was modified, changed by DarTar link https://www.mediawiki.org/w/index.php?diff=530774 edit summary: /* Securely uploading media to Commons from 3rd party website */ typo
[17:13:43] PROBLEM Current Users is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:14:23] PROBLEM Disk Space is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:15:03] PROBLEM Free ram is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:16:13] PROBLEM Total Processes is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:16:53] PROBLEM dpkg-check is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:18:13] PROBLEM Current Load is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:23:43] RECOVERY Current Load is now: OK on pediapress-ocg1 i-00000233 output: OK - load average: 0.45, 0.14, 0.08
[17:24:23] RECOVERY Current Users is now: OK on pediapress-ocg1 i-00000233 output: USERS OK - 0 users currently logged in
[17:24:33] RECOVERY Disk Space is now: OK on pediapress-ocg1 i-00000233 output: DISK OK
[17:25:43] RECOVERY Free ram is now: OK on pediapress-ocg1 i-00000233 output: OK: 92% free memory
[17:26:53] RECOVERY Total Processes is now: OK on pediapress-ocg1 i-00000233 output: PROCS OK: 77 processes
[17:27:33] RECOVERY dpkg-check is now: OK on pediapress-ocg1 i-00000233 output: All packages OK
[17:45:52] New patchset: Ryan Lane; "Only set the member attribute map before precise" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6133
[17:46:06] New patchset: Ryan Lane; "Change the automount timeout to 2 hours" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/5928
[17:46:19] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/6133
[17:46:19] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/5928
[17:50:44] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5928
[17:50:47] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/5928
[17:51:04] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6133
[17:51:07] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6133
[18:52:39] New patchset: Ryan Lane; "Try another way to call the function" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6139
[18:52:53] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/6139
[18:53:01] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6139
[18:53:04] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6139
[19:14:01] New patchset: Ryan Lane; "nslcd will fail to start with improper permissions" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6145
[19:14:14] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/6145
[19:14:31] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6145
[19:14:34] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6145
[19:30:54] RECOVERY Puppet freshness is now: OK on nova-production1 i-0000007b output: puppet ran at Mon Apr 30 19:30:39 UTC 2012
[19:49:00] paravoid: I'm going to replace the gluster virt1 brick with virt5
[19:49:10] good time as any to see if this is going to break horribly
[20:12:15] 04/30/2012 - 20:12:14 - Creating a home directory for orion at /export/home/bastion/orion
[20:13:14] 04/30/2012 - 20:13:14 - Updating keys for orion
[21:18:08] PROBLEM SSH is now: CRITICAL on opengrok-web i-000001e1 output: CRITICAL - Socket timeout after 10 seconds
[21:18:08] PROBLEM Disk Space is now: CRITICAL on utils-abogott i-00000131 output: CHECK_NRPE: Socket timeout after 10 seconds.
[21:18:08] PROBLEM Free ram is now: CRITICAL on opengrok-web i-000001e1 output: CHECK_NRPE: Socket timeout after 10 seconds.
[21:18:08] PROBLEM Total Processes is now: CRITICAL on opengrok-web i-000001e1 output: CHECK_NRPE: Socket timeout after 10 seconds.
[21:18:08] PROBLEM dpkg-check is now: CRITICAL on opengrok-web i-000001e1 output: CHECK_NRPE: Socket timeout after 10 seconds.
[21:18:08] I'm replacing virt1's brick with virt5. things may get slow
[21:18:31] * Damianz does a moon walk slowly past Ryan
[21:19:11] PROBLEM Current Load is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds.
[21:19:11] :(
[21:19:11] PROBLEM dpkg-check is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds.
[21:19:11] PROBLEM dpkg-check is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds.
[21:19:11] PROBLEM Current Users is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:19:11] I hope that's just nagios being retarded [21:19:14] PROBLEM HTTP is now: CRITICAL on bots-apache1 i-000000b0 output: CRITICAL - Socket timeout after 10 seconds [21:20:08] Bots still seem up at least. [21:20:08] PROBLEM Current Load is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [21:20:08] PROBLEM Current Users is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [21:20:08] PROBLEM Disk Space is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [21:20:08] PROBLEM Total Processes is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [21:20:08] PROBLEM Free ram is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [21:21:04] <^demon> Man, gerrit *really* doesn't like this query. "Loading..." forever. [21:21:04] I'm thinking this is because of the load spiking [21:22:14] PROBLEM Current Load is now: CRITICAL on opengrok-web i-000001e1 output: CRITICAL - load average: 40.77, 28.64, 13.95 [21:22:19] PROBLEM SSH is now: CRITICAL on bots-cb i-0000009e output: CRITICAL - Socket timeout after 10 seconds [21:22:20] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: CHECK_NRPE: Socket timeout after 10 seconds. [21:23:08] PROBLEM Total Processes is now: CRITICAL on utils-abogott i-00000131 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:23:56] NRPE is a bit 'sensitive' in my experience [21:23:56] PROBLEM Total Processes is now: CRITICAL on swift-be2 i-000001c8 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:23:56] well, I'm definitely having issues on some of these instances [21:23:56] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 7.89, 11.82, 8.54 [21:23:56] RECOVERY dpkg-check is now: OK on bots-cb i-0000009e output: All packages OK [21:23:59] PROBLEM Current Load is now: WARNING on bots-apache1 i-000000b0 output: WARNING - load average: 9.81, 9.59, 6.07 [21:25:24] PROBLEM Disk Space is now: WARNING on deployment-feed i-00000118 output: DISK WARNING - free space: / 65 MB (4% inode=40%): [21:26:57] ok, paused the brick replace [21:27:18] let's see if things get better [21:27:18] PROBLEM Current Load is now: WARNING on opengrok-web i-000001e1 output: WARNING - load average: 13.44, 23.62, 16.47 [21:27:18] RECOVERY SSH is now: OK on bots-cb i-0000009e output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [21:27:24] PROBLEM Total Processes is now: CRITICAL on test2 i-0000013c output: CHECK_NRPE: Socket timeout after 10 seconds. [21:27:39] PROBLEM Current Load is now: WARNING on swift-be2 i-000001c8 output: WARNING - load average: 17.98, 15.15, 8.34 [21:28:08] PROBLEM Disk Space is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:28:08] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:14] RECOVERY dpkg-check is now: OK on test3 i-00000093 output: All packages OK [21:29:46] RECOVERY HTTP is now: OK on bots-apache1 i-000000b0 output: HTTP OK: HTTP/1.1 200 OK - 1480 bytes in 0.003 second response time [21:29:46] PROBLEM Total Processes is now: CRITICAL on dumps-2 i-00000174 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:46] PROBLEM dpkg-check is now: CRITICAL on labs-lvs1 i-00000057 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:29:46] RECOVERY Current Load is now: OK on reportcard2 i-000001ea output: OK - load average: 6.88, 7.03, 4.19 [21:29:46] RECOVERY Current Users is now: OK on reportcard2 i-000001ea output: USERS OK - 0 users currently logged in [21:29:46] RECOVERY Disk Space is now: OK on reportcard2 i-000001ea output: DISK OK [21:29:46] RECOVERY Total Processes is now: OK on reportcard2 i-000001ea output: PROCS OK: 89 processes [21:29:46] RECOVERY Free ram is now: OK on reportcard2 i-000001ea output: OK: 76% free memory [21:30:09] PROBLEM SSH is now: CRITICAL on utils-abogott i-00000131 output: CRITICAL - Socket timeout after 10 seconds [21:30:20] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:20] PROBLEM Current Load is now: CRITICAL on utils-abogott i-00000131 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:20] PROBLEM Current Users is now: CRITICAL on utils-abogott i-00000131 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:44] PROBLEM SSH is now: CRITICAL on test2 i-0000013c output: CRITICAL - Socket timeout after 10 seconds [21:32:17] PROBLEM Current Load is now: WARNING on test2 i-0000013c output: WARNING - load average: 21.53, 18.49, 10.27 [21:32:44] PROBLEM Current Load is now: CRITICAL on mobile-enwp i-000000ce output: CRITICAL - load average: 30.55, 18.24, 9.50 [21:33:28] PROBLEM dpkg-check is now: CRITICAL on localpuppet1 i-0000020b output: CHECK_NRPE: Socket timeout after 10 seconds. [21:34:14] RECOVERY dpkg-check is now: OK on labs-lvs1 i-00000057 output: All packages OK [21:34:24] PROBLEM Disk Space is now: CRITICAL on bz-dev i-000001db output: CHECK_NRPE: Socket timeout after 10 seconds. [21:34:35] PROBLEM dpkg-check is now: CRITICAL on resourceloader2-apache i-000001d7 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:34:35] * Ryan_Lane1 sighs [21:34:54] PROBLEM dpkg-check is now: CRITICAL on incubator-apache i-00000211 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:35:30] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:35:30] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:35:30] PROBLEM dpkg-check is now: CRITICAL on grail i-0000021e output: CHECK_NRPE: Socket timeout after 10 seconds. [21:35:30] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:35:49] well, this is going to be a long night, I can tell [21:36:24] PROBLEM Free ram is now: WARNING on mobile-enwp i-000000ce output: Warning: 11% free memory [21:36:39] PROBLEM Disk Space is now: CRITICAL on grail i-0000021e output: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:35] PROBLEM Free ram is now: CRITICAL on bz-dev i-000001db output: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:35] PROBLEM Free ram is now: CRITICAL on incubator-apache i-00000211 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:35] PROBLEM Current Load is now: CRITICAL on resourceloader2-apache i-000001d7 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:35] PROBLEM Total Processes is now: CRITICAL on localpuppet1 i-0000020b output: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:35] PROBLEM Current Load is now: CRITICAL on localpuppet1 i-0000020b output: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:35] PROBLEM Total Processes is now: CRITICAL on incubator-apache i-00000211 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:35] PROBLEM Free ram is now: CRITICAL on grail i-0000021e output: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:35] PROBLEM Total Processes is now: CRITICAL on grail i-0000021e output: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:37:35] * Krinkle mutes ping by /resourceloader/ [21:37:35] PROBLEM Free ram is now: CRITICAL on localpuppet1 i-0000020b output: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:35] PROBLEM Free ram is now: CRITICAL on resourceloader2-apache i-000001d7 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:35] PROBLEM Total Processes is now: CRITICAL on resourceloader2-apache i-000001d7 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:36] PROBLEM Current Load is now: WARNING on dumps-2 i-00000174 output: WARNING - load average: 18.86, 16.44, 10.30 [21:37:40] RECOVERY Total Processes is now: OK on opengrok-web i-000001e1 output: PROCS OK: 108 processes [21:37:45] RECOVERY Free ram is now: OK on opengrok-web i-000001e1 output: OK: 68% free memory [21:37:50] PROBLEM Current Load is now: CRITICAL on bz-dev i-000001db output: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:50] PROBLEM Current Users is now: CRITICAL on bz-dev i-000001db output: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:50] PROBLEM Disk Space is now: CRITICAL on localpuppet1 i-0000020b output: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:50] PROBLEM Current Load is now: CRITICAL on incubator-apache i-00000211 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:50] PROBLEM HTTP is now: CRITICAL on bots-apache1 i-000000b0 output: CRITICAL - Socket timeout after 10 seconds [21:37:51] PROBLEM Current Load is now: CRITICAL on labs-nfs1 i-0000005d output: CRITICAL - load average: 30.36, 20.18, 10.77 [21:38:00] PROBLEM Current Users is now: CRITICAL on localpuppet1 i-0000020b output: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:00] PROBLEM Current Load is now: CRITICAL on grail i-0000021e output: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:00] PROBLEM Current Users is now: CRITICAL on grail i-0000021e output: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:39:00] PROBLEM SSH is now: CRITICAL on labs-nfs1 i-0000005d output: CRITICAL - Socket timeout after 10 seconds [21:39:00] PROBLEM Disk Space is now: CRITICAL on incubator-apache i-00000211 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:00] PROBLEM Total Processes is now: CRITICAL on labs-nfs1 i-0000005d output: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:00] PROBLEM Current Load is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:00] PROBLEM Current Users is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:00] PROBLEM Disk Space is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:00] PROBLEM Total Processes is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:00] PROBLEM Free ram is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:00] RECOVERY dpkg-check is now: OK on opengrok-web i-000001e1 output: All packages OK [21:39:00] RECOVERY Total Processes is now: OK on swift-be2 i-000001c8 output: PROCS OK: 107 processes [21:39:00] PROBLEM Current Load is now: WARNING on memcache-puppet i-00000153 output: WARNING - load average: 8.24, 8.56, 6.50 [21:39:00] PROBLEM Current Users is now: CRITICAL on resourceloader2-apache i-000001d7 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:00] PROBLEM HTTP is now: CRITICAL on grail i-0000021e output: CRITICAL - Socket timeout after 10 seconds [21:40:01] PROBLEM Total Processes is now: CRITICAL on bz-dev i-000001db output: CHECK_NRPE: Socket timeout after 10 seconds. [21:40:01] PROBLEM dpkg-check is now: CRITICAL on vumi i-000001e5 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:40:01] PROBLEM Disk Space is now: CRITICAL on resourceloader2-apache i-000001d7 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:40:01] PROBLEM Total Processes is now: CRITICAL on orgcharts-dev i-0000018f output: CHECK_NRPE: Socket timeout after 10 seconds. [21:40:50] PROBLEM dpkg-check is now: CRITICAL on bz-dev i-000001db output: CHECK_NRPE: Socket timeout after 10 seconds. [21:41:42] PROBLEM Current Users is now: CRITICAL on labs-lvs1 i-00000057 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:41:42] PROBLEM SSH is now: CRITICAL on bots-cb i-0000009e output: CRITICAL - Socket timeout after 10 seconds [21:41:42] Bleh totally can't login to bastion atm and the vm just crashed [21:41:42] yeah [21:41:42] we're having issues [21:41:42] RECOVERY Disk Space is now: OK on grail i-0000021e output: DISK OK [21:41:42] PROBLEM Current Load is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:41:42] PROBLEM Free ram is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:41:42] PROBLEM Disk Space is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:41:42] PROBLEM Current Users is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:41:42] PROBLEM Total Processes is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:41:42] * Damianz blames paravoid [21:41:42] RECOVERY SSH is now: OK on test2 i-0000013c output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [21:41:42] RECOVERY Free ram is now: OK on grail i-0000021e output: OK: 86% free memory [21:41:42] RECOVERY Total Processes is now: OK on grail i-0000021e output: PROCS OK: 96 processes [21:42:46] Wow that's weird [21:42:46] bastion is failing my key on auth [21:42:46] PROBLEM dpkg-check is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:42:46] * Damianz wonders if home dirs are screwed [21:42:46] RECOVERY Total Processes is now: OK on test2 i-0000013c output: PROCS OK: 104 processes [21:42:46] RECOVERY SSH is now: OK on opengrok-web i-000001e1 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [21:42:46] PROBLEM Current Load is now: WARNING on jenkins2 i-00000102 output: WARNING - load average: 7.86, 7.23, 5.60 [21:42:46] PROBLEM Current Load is now: WARNING on grail i-0000021e output: WARNING - load average: 11.17, 9.93, 6.13 [21:42:46] RECOVERY Current Users is now: OK on grail i-0000021e output: USERS OK - 0 users currently logged in [21:43:10] PROBLEM Total Processes is now: CRITICAL on swift-be4 i-000001ca output: CHECK_NRPE: Socket timeout after 10 seconds. [21:44:45] PROBLEM Current Load is now: WARNING on upload-wizard i-0000021c output: WARNING - load average: 7.44, 6.85, 5.48 [21:46:03] PROBLEM HTTP is now: CRITICAL on ee-prototype i-0000013d output: CRITICAL - Socket timeout after 10 seconds