[00:06:40] RECOVERY Disk Space is now: OK on wikistream-1 wikistream-1 output: DISK OK [00:19:40] PROBLEM Free ram is now: CRITICAL on test3 test3 output: Critical: 1% free memory [00:24:40] PROBLEM Disk Space is now: CRITICAL on aggregator1 aggregator1 output: DISK CRITICAL - free space: / 0 MB (0% inode=94%): [00:24:40] RECOVERY Free ram is now: OK on test3 test3 output: OK: 96% free memory [01:50:46] no Ryan :-( [03:41:32] PROBLEM Free ram is now: WARNING on test-oneiric test-oneiric output: Warning: 15% free memory [03:44:42] PROBLEM Free ram is now: WARNING on utils-abogott utils-abogott output: Warning: 16% free memory [03:45:02] PROBLEM Free ram is now: WARNING on orgcharts-dev orgcharts-dev output: Warning: 16% free memory [03:56:32] PROBLEM Free ram is now: CRITICAL on test-oneiric test-oneiric output: Critical: 5% free memory [04:00:02] PROBLEM Free ram is now: WARNING on nova-daas-1 nova-daas-1 output: Warning: 16% free memory [04:01:32] RECOVERY Free ram is now: OK on test-oneiric test-oneiric output: OK: 97% free memory [04:04:45] PROBLEM Free ram is now: CRITICAL on utils-abogott utils-abogott output: Critical: 3% free memory [04:05:02] PROBLEM Free ram is now: CRITICAL on orgcharts-dev orgcharts-dev output: Critical: 4% free memory [04:09:42] RECOVERY Free ram is now: OK on utils-abogott utils-abogott output: OK: 97% free memory [04:10:02] RECOVERY Free ram is now: OK on orgcharts-dev orgcharts-dev output: OK: 96% free memory [04:19:55] PROBLEM Free ram is now: CRITICAL on nova-daas-1 nova-daas-1 output: Critical: 4% free memory [04:24:55] RECOVERY Free ram is now: OK on nova-daas-1 nova-daas-1 output: OK: 93% free memory [05:02:45] PROBLEM Free ram is now: WARNING on test3 test3 output: Warning: 6% free memory [05:07:45] RECOVERY Free ram is now: OK on test3 test3 output: OK: 96% free memory [05:10:55] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [05:12:05] PROBLEM host: dumpster01 is DOWN address: 
dumpster01 CRITICAL - Host Unreachable (dumpster01) [05:12:05] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [05:41:56] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [05:42:06] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [05:42:06] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [06:11:56] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [06:12:16] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [06:12:16] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [06:40:46] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 5.32, 9.35, 5.70 [06:42:21] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [06:42:21] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [06:42:21] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [06:50:52] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.25, 2.21, 3.71 [06:55:52] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.01, 0.88, 2.72 [07:12:32] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [07:13:52] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [07:13:52] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [07:43:32] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [07:43:52] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [07:43:52] PROBLEM host: 
dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [07:54:34] petan: ping [08:03:17] mutante: hi [08:03:47] creating a service group is simple, but can you review the gerrit change I did [08:03:55] Ryan_Lane: hi [08:04:00] howdy [08:04:40] they rebooted my server damn... what do I pay for :D [08:04:49] my irssi was down [08:05:51] petan|wk: where are the scripts located on the nagios box? [08:05:58] Ryan_Lane: I disabled checkuser on beta I will remove cu tables as soon as I have time [08:06:04] cool. thanks [08:06:14] but right now we are facing a huge problem with sql [08:06:14] heh [08:06:21] we had to reboot virt3 a couple times [08:06:22] because most of db's are corrupt, dunno why [08:06:31] that's weird [08:06:43] yes, it looks like only the frm files are wrong, data should be ok [08:06:55] but recovery is likely going to take weeks... [08:07:14] I can't even update mediawiki db structure because of that [08:07:25] :/ [08:07:34] mutante: in my home :_ [08:07:35] petan|wk: oh, so you did use gerrit.. [08:07:35] :) [08:07:37] we need a database service [08:08:12] Ryan_Lane: is it going to be possible in future for us to recover db from that service? [08:08:17] since we are not going to have access to it [08:08:31] not sure i understood why using C++ means SVN > git [08:08:33] well, if the box crashes, I'd imagine ops would take care of it [08:08:49] mutante: agreed. 
I don't see why svn would be better either [08:10:24] like now I need to do a lot of dba stuff, recover files from backups, fix tables etc. I need to have root for this, if it happened on a shared service, I don't believe ops would spend weeks recovering corrupt files for a project which isn't even important for production [08:10:47] I couldn't imagine it taking weeks [08:11:05] we've never had such a problem on our production database servers [08:11:24] hm, question is how to recover it, I need to recover table schemas from backups for more than 400 db's [08:11:38] I don't understand that [08:11:48] but you have skilled dba in operation :-) [08:12:00] no. I don't understand how all the databases could be fucked up [08:12:09] I don't understand it either [08:12:20] I just know we get a lot of sql errors in logs [08:12:35] I also don't understand how they get fucked up after every restart of the instance [08:12:54] yes that's weird... [08:13:22] it doesn't seem that db files are actually corrupted only the schema files [08:13:25] for tables [08:13:26] data should be ok [08:13:41] is this some weird configuration problem? [08:13:50] maybe something is being stored in tmpfs and is getting wiped out? [08:13:52] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [08:13:52] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [08:13:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [08:14:12] I didn't change anything in configuration when it happened [08:14:50] well, this started after the database files were moved onto the project storage, right? 
[08:15:15] mutante: storing the c++ in git is no problem for me, I thought you want to puppetize the binary code [08:16:20] Ryan_Lane: actually not, the files are corrupt in backup as well [08:16:25] so it happened before that [08:16:34] or that's how it looks [08:16:35] we are talking about innodb right? [08:16:39] yes [08:16:53] when I move frm files from old storage to new, it still happens [08:17:11] old and new storage = different mysql versions? [08:17:23] no it's same server just different path [08:17:40] I moved files from /mnt to /data [08:23:50] hmm. host.frm has "incorrect information" [08:24:18] and i see other people saying they copied host.frm from another machine [08:24:32] when that happened to them [08:25:35] 120321 8:05:02 [ERROR] Fatal error: Can't open and lock privilege tables: Incorrect information in file: './mysql/host.frm' [08:25:38] 120321 8:27:16 [ERROR] /usr/sbin/mysqld: Incorrect information in file: './wtf/host.frm' [08:25:50] ./wtf/ ?:) [08:26:58] I have no idea what that is [08:27:00] people do not log [08:27:15] someone created the new db "wtf" and I don't know who it was [08:28:55] i see suggested fixes like "remove mysql db, run mysql_install_db" to similar problems with Incorrect information in those .frm's [08:29:31] it almost seems like the .frm are for a different mysql version than the server is [08:30:11] nope [08:30:33] I didn't change mysql server, unless someone else upgraded it [08:30:37] general_log.frm: MySQL table definition file Version 10 [08:30:48] servers.frm: MySQL table definition file Version 9 [08:31:25] are people accessing the same files from two different mysql servers? 
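The two "MySQL table definition file Version N" lines quoted above can be reproduced by reading the first bytes of each .frm file. The following is a minimal sketch, assuming the header layout described on the MySQL internals file-format page linked later in the log: two magic bytes 0xFE 0x01 followed by a one-byte format version. The function names are hypothetical and nothing here was run in the channel.

```python
# Sketch: report the format version stored in a .frm header.
# Assumption: a table-definition .frm starts with bytes 0xFE 0x01 and the
# third byte holds the format version (e.g. 9 or 10).

FRM_MAGIC = b"\xfe\x01"


def frm_version(header: bytes):
    """Return the version byte of a .frm header, or None if the magic is absent."""
    if len(header) < 3 or header[:2] != FRM_MAGIC:
        return None
    return header[2]


def frm_version_of_file(path: str):
    """Read only the first three bytes of the file at `path` and decode them."""
    with open(path, "rb") as fh:
        return frm_version(fh.read(3))
```

Running `frm_version_of_file` over every .frm in a data directory would show whether the files were written with mixed format versions, which is the suspicion raised at [08:29:31].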
[08:31:36] no [08:31:43] there is only 1 sql [08:31:47] for deployment [08:31:57] + another one, but that's write only [08:32:00] for backups [08:32:27] I mean we never use it to read data [08:34:26] !gerrit 3376 [08:34:26] https://gerrit.wikimedia.org/ [08:34:29] !g 3376 [08:34:30] https://gerrit.wikimedia.org/r/3376 [08:34:34] !g 3376 | mutante [08:34:34] mutante: https://gerrit.wikimedia.org/r/3376 [08:34:39] can you check it [08:35:48] hmmm.. http://forge.mysql.com/wiki/MySQL_Internals_File_Formats vs. hexdump -v -C host.frm [08:36:34] Ryan_Lane: is there a reason why puppet is readable by root only [08:36:43] what do you mean? [08:36:50] I made a check for puppet freshness but it has trouble reading its log file [08:36:53] yaml [08:37:05] petan|wk: you saw the inline comments by maplebed already? [08:37:10] probably for security reasons [08:37:18] because it needs to run as root, for some reason even when I 4755 it, it still doesn't work [08:37:23] mutante: just saw them :) [08:37:32] I will fix that [08:38:10] petan|wk: i saw you said snmp trap, like in production, doesn't work due to firewall. we could fix that? [08:38:14] brb [08:39:17] mutante: I have no idea [08:39:40] in fact I don't know anything about how it works on prod, apart from the fact that the check which is in puppet now doesn't work in labs [08:40:32] just modify the security group [08:40:37] then it'll work [08:41:49] New patchset: Petrb; "inseted a new check for puppet freshness" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3376 [08:42:01] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3376 [08:42:15] hm... 
I will check it [08:43:46] but still I have no idea how does it work, it would be better if we could use this nrpe version at least until someone fix the other one [08:43:52] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [08:43:52] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [08:43:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [08:44:06] petan|wk: you need to allow port 162 udp on the nagios host [08:44:13] because I don't even know how this snmp works [08:44:36] ok, I can allow that port [08:48:39] !log nagios petrb: inserted snmp to firewall for range 10.4.0.0/24 [08:48:59] ok it's done [08:49:02] what's next :) [08:49:23] my bottie works better than your log bot [08:49:23] :D [08:49:32] at least it doesn't crash so much [08:50:13] /puppet/files/snmp$ vi snmptt.conf [08:50:20] ok [08:50:31] EXEC /usr/local/nagios/libexec/eventhandlers/submit_check_result $r "Puppet freshness" 0 "puppet ran at `date`" [08:50:40] snmp trap that's sent whenever puppet runs on a wikimedia host. [08:51:02] the difference is that each host will send an OK to the nagios host when puppet ran [08:51:15] and Nagios just turns it CRIT if it did not hear from a host [08:51:31] instead of NRPE actively asking each host all the time (active vs. passive checks) [08:51:47] I don't get how it is run when puppet is started [08:52:53] I mean how is this EXEC /usr/local/nagios/libexec/eventhandlers/submit_ch started by puppet run? [08:54:17] also where is defined ip of nagios [08:55:06] petan|wk: puppet will just run the command during the run [08:55:10] submit_check_result is still on the nagios host, it "fakes" the result [08:55:20] ah [08:55:31] how do we tell which IP nagios has to it? [08:55:56] do we need to insert a services to nagios conf? 
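The passive-check flow described above (each host reports OK after a puppet run, and submit_check_result "fakes" the result on the nagios host) comes down to appending one line to the Nagios external command file. Here is a minimal sketch, assuming the standard PROCESS_SERVICE_CHECK_RESULT external command format; the function names and the command-file path passed by the caller are hypothetical and not taken from the labs setup.

```python
import time


def passive_result_line(host, service, return_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line.

    return_code follows plugin conventions: 0 OK, 1 WARNING, 2 CRITICAL.
    """
    ts = int(time.time() if now is None else now)
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{output}\n"


def submit(command_file, line):
    """Append the line to the external command file (path supplied by caller)."""
    with open(command_file, "a") as fh:
        fh.write(line)
```

The service description in the line has to match the one in the Nagios service definition ("Puppet freshness" in this conversation), otherwise Nagios cannot attach the result to a service.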
[08:56:40] that's snmptrapd [08:57:06] so what's the next step to configure it [09:01:21] you need snmptrapd and snmptt stuff on the nagios host [09:01:29] ok [09:02:07] it's in nagios.pp though [09:02:19] class snmp [09:02:43] /etc/snmp should look similar to this [09:02:47] it's already there [09:02:48] snmpd.conf snmptrapd.conf snmptt.conf snmptt.conf.bak snmptt.ini [09:02:54] ok [09:03:07] petrb@nagios:~$ ls /etc/snmp/ [09:03:07] snmpd.conf snmptrapd.conf snmptt.conf snmptt.ini [09:03:47] do you also see processes running? snmpd, snmptrapd, etc [09:03:55] /usr/bin/perl /usr/sbin/snmptt --daemon [09:04:12] petrb@nagios:~$ ps -ef | grep snm [09:04:12] snmp 937 1 0 Mar21 ? 00:00:10 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snmp -I -smux -p /var/run/snmpd.pid 127.0.0.1 [09:04:15] root 955 1 0 Mar21 ? 00:00:01 /usr/bin/perl /usr/sbin/snmptt --daemon [09:04:42] yes [09:04:45] ok, so the event is this: [09:04:47] EVENT enterpriseSpecific .1.3.6.1.4.1.33298.0.1004 "Status Events" Normal [09:04:57] that is an snmp OID [09:05:25] ok [09:05:40] how do I implement it in snmp [09:05:48] I mean that puppet check [09:06:01] I suppose that I need to copy that line [09:06:07] EXEC /usr/local/nagios/libexec/eventhandlers/submit_check_result $r "Puppet f [09:06:22] that is from /etc/snmp/snmptt.conf [09:06:27] ok [09:06:30] I will move it there [09:06:31] so i suppose you have it already [09:06:47] let puppet do it [09:06:57] if you already have that class applied it should be there [09:07:29] it's already there [09:07:37] yes I think [09:08:18] ok so I need to insert the check [09:08:21] how do I do that [09:08:32] I don't know the syntax for snmp in nagios conf [09:08:38] base.pp: monitor_service { "puppet freshness": description => "Puppet freshness", check_command => "puppet-FAIL", passive => "true", freshness => 36000, retries => 1 ; } [09:08:44] ok [09:08:55] check_command => "puppet-FAIL" [09:08:59] I need to define it [09:09:01] this command [09:09:25] hm... 
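The `freshness => 36000` value in the base.pp snippet above is a threshold in seconds, i.e. 10 hours. The sketch below is a simplified model of what that threshold means and not Nagios's actual scheduler logic: once the last passive result is older than the threshold, the service is treated as stale and the fallback command (puppet-FAIL in this conversation) reports CRITICAL. The function name is hypothetical.

```python
# Plugin-style return codes.
OK = 0
CRITICAL = 2


def freshness_state(last_result_ts, now_ts, threshold_s=36000):
    """Return OK while the last passive result is within threshold_s seconds,
    CRITICAL once it is older (simplified model of a freshness check)."""
    age = now_ts - last_result_ts
    return CRITICAL if age > threshold_s else OK
```

With a 36000 second threshold, a host that misses puppet runs for more than 10 hours would flip to CRITICAL without any active polling from the nagios side.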
[09:09:26] define command{ command_name puppet-FAIL command_line echo "Puppet has not run in the last 10 hours" && exit 2 [09:09:30] } [09:09:43] grep FAIL ./puppet/templates/nagios/* [09:10:11] it's all puppetized already [09:10:35] but you probably do not use the stuff from base.pp ? [09:10:43] I don't know [09:10:48] so there is just no monitor_service, but you got the rest [09:10:52] I guess I can't use the puppet version of nagios from prod [09:10:57] it would collide in many ways [09:11:15] ok I will try to define it [09:11:53] i wonder in which ways it would collide [09:12:05] probably that is what we should fix [09:12:35] if labs and production are different how are we ever able to merge labs into prod [09:13:18] ..oh.. and note that we have new classes for "icinga" now. currently being worked on [09:13:52] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [09:13:52] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [09:13:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [09:14:53] petan|wk: <-- btw, if you ever want to turn off those notifications for hosts for a while: here's how i did it http://wikitech.wikimedia.org/view/Nagios#Scheduling_downtimes_with_a_shell_command [09:15:44] because we can't do that via web ui currently. 
i tried to mess with config and permissions for it, but reverted it due to security [09:16:16] we can do it [09:16:24] nagios.wmflabs.org/nlogin [09:16:43] there is a web ui for that [09:16:59] I can create an account for you, in future it will be ldap pw [09:17:08] once I find out how to do that [09:17:16] it's not just the login, it's also permissions on the "external command file" [09:17:26] that I fixed [09:17:28] i logged stuff in nagios project recently [09:17:29] we should really use openid or something like that at some point [09:17:33] really, this thing works [09:17:37] so that a password isn't necessary [09:17:44] Ryan_Lane: ok, if you can do that [09:17:44] maybe I'll add simplesaml at some point [09:17:54] until then I can just create a pw for anyone who wants one [09:18:07] * Ryan_Lane nods [09:18:14] mutante: did you try it? :) [09:18:23] I use it to acknowledge problems [09:18:24] look [09:18:36] ACKNOWLEDGEMENT host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [09:18:45] using gui [09:18:51] petan|wk: tried to schedule a host downtime yet? [09:18:57] that works too [09:19:18] you just need to have an account on nagios [09:19:46] run /etc/nagios3/adduser mutante [09:19:52] it will insert a new user and generate pw [09:20:03] so that you can log in there [09:20:18] that's cool [09:20:40] wait, that script is not there [09:20:45] but i think there's the same security issue then, with the webserver writing to the command file [09:20:59] i guess you did what i did temp.? (http://nagios.manubulon.com/traduction/docs14en/commandfile.html) [09:20:59] what's the security issue there? [09:21:07] creating group "nagiocmd" [09:21:09] you need to have an account in order to write in it [09:21:13] putting webserver into the group.. 
and so on [09:21:26] no I just changed the rights of the file [09:21:36] so that it's writable by server [09:21:51] and everybody can write to it now [09:21:52] however, anonymous users can't do that [09:22:02] no only users who have an account on nagios [09:22:09] you need to open /nlogin [09:22:15] which requires you to authenticate [09:22:19] but everybody can get on the host [09:22:23] no [09:22:29] only people who are in the project can [09:22:35] or, what do you mean [09:22:41] get on host? [09:22:49] I don't see any security issue [09:23:57] it's sudo nagiosadd mutante [09:24:01] that command [09:25:11] mutante: can you send me the part of services.conf in production nagios so that I see how it is defined? [09:25:15] just paste it to pastebin [09:25:41] there is no single services.conf in prod, it would be HUGE [09:26:00] we have: [09:26:14] ok, just the part where you define check [09:26:15] etc/nagios/puppet_checks.d/ [09:26:16] this one [09:26:28] and one file for each host [09:26:36] ok, I need an example, it's 5 lines [09:26:53] for puppet freshness? [09:27:04] define service { hostgroup_name ssh-servers service_description Puppet freshness check_command puppet-FAIL use generic-service notification_interval 0 [09:27:18] that's what I have now [09:28:21] http://pastebin.com/4Tr4LgyG [09:28:27] right [09:28:31] updating [09:28:37] the main difference is that it is a passive check [09:29:02] passive_checks_enabled 1 [09:29:08] active_checks_enabled 0 [09:30:46] the freshness_threshold is also important [09:31:04] need to go afk for a little [09:32:32] it's there [09:32:38] check nagios, it's pending for all [09:34:12] now we need to tell puppet to send it to nagios [09:36:17] Ryan_Lane: you know how to do that? [09:36:30] how puppet knows where to send it [09:40:43] Ryan_Lane: is there any restriction for project names? 
[09:40:53] like special characters and such [09:41:17] because if someone create a project this is#$#@%^&%$&#evil name, nagios parser likely crash [09:44:10] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [09:44:10] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [09:45:04] project names are restricted to [a-zA-Z0-9_-] [09:45:15] PROBLEM Disk Space is now: CRITICAL on deployment-web5 deployment-web5 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:45:28] project names are restricted to [a-z]([a-zA-Z0-9_-])+ [09:45:35] PROBLEM dpkg-check is now: CRITICAL on deployment-web5 deployment-web5 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:45:40] or something along those lines [09:45:47] can check in the OSM code :) [09:47:05] PROBLEM Free ram is now: CRITICAL on deployment-web5 deployment-web5 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:49:25] PROBLEM Current Load is now: CRITICAL on deployment-web5 deployment-web5 output: CRITICAL - load average: 78.12, 50.26, 22.95 [09:50:15] RECOVERY Disk Space is now: OK on deployment-web5 deployment-web5 output: DISK OK [09:55:25] RECOVERY dpkg-check is now: OK on deployment-web5 deployment-web5 output: All packages OK [09:56:55] RECOVERY Free ram is now: OK on deployment-web5 deployment-web5 output: OK: 70% free memory [10:04:15] PROBLEM Current Load is now: WARNING on deployment-web5 deployment-web5 output: WARNING - load average: 0.07, 7.66, 18.05 [10:15:15] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [10:16:35] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [10:24:15] RECOVERY Current Load is now: OK on deployment-web5 deployment-web5 output: OK - load average: 0.10, 0.21, 4.98 [10:45:16] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [10:47:36] PROBLEM host: dumpster01 is 
DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [11:16:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [11:18:02] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [11:36:24] :-/ [11:46:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [11:48:02] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [12:12:41] New patchset: Demon; "Initial import of https://github.com/ralberts/Gerrit-Submodule-Hooks" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3521 [12:12:54] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3521 [12:16:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [12:18:02] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [12:46:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [12:47:29] mutante: ? [12:47:50] check nagios, it's pending for all [12:48:00] we need to configure puppet as well [12:48:02] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [13:16:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [13:18:02] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [13:21:06] petan or petan|wk: are you here [13:21:18] ? [13:25:22] PROBLEM Free ram is now: CRITICAL on deployment-web2 deployment-web2 output: Critical: 4% free memory [13:26:32] PROBLEM Free ram is now: CRITICAL on deployment-web4 deployment-web4 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:27:22] PROBLEM Current Load is now: CRITICAL on deployment-web5 deployment-web5 output: CHECK_NRPE: Socket timeout after 10 seconds. 
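The project-name restriction quoted at [09:45:28], `[a-z]([a-zA-Z0-9_-])+`, was given as "something along those lines", so the sketch below treats that pattern as an assumption and not as the actual OSM code. It shows how a config generator could reject names like the "evil name" example before they reach the nagios parser. The function name is hypothetical.

```python
import re

# Assumption: pattern as quoted in the channel, anchored to the whole string.
PROJECT_NAME = re.compile(r"[a-z][a-zA-Z0-9_-]+")


def is_valid_project_name(name: str) -> bool:
    """True if the whole name matches the quoted project-name pattern."""
    return PROJECT_NAME.fullmatch(name) is not None
```

Using `fullmatch` matters here: a plain `match` would accept any string that merely starts with a valid prefix, including the example with spaces and punctuation.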
[13:29:52] PROBLEM Disk Space is now: CRITICAL on deployment-web5 deployment-web5 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:29:52] PROBLEM dpkg-check is now: CRITICAL on deployment-web5 deployment-web5 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:07] PROBLEM Free ram is now: CRITICAL on deployment-web5 deployment-web5 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:07] PROBLEM Total Processes is now: CRITICAL on deployment-web5 deployment-web5 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:12] PROBLEM SSH is now: CRITICAL on deployment-web5 deployment-web5 output: CRITICAL - Socket timeout after 10 seconds [13:30:12] PROBLEM Current Users is now: CRITICAL on deployment-web5 deployment-web5 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:22] PROBLEM Current Load is now: WARNING on deployment-web deployment-web output: WARNING - load average: 4.34, 13.77, 7.51 [13:30:22] RECOVERY Free ram is now: OK on deployment-web2 deployment-web2 output: OK: 33% free memory [13:30:32] PROBLEM Current Load is now: CRITICAL on deployment-web4 deployment-web4 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:32] PROBLEM Current Users is now: CRITICAL on deployment-web4 deployment-web4 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:32] PROBLEM SSH is now: CRITICAL on deployment-web4 deployment-web4 output: CRITICAL - Socket timeout after 10 seconds [13:30:32] PROBLEM dpkg-check is now: CRITICAL on deployment-web4 deployment-web4 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:32:02] PROBLEM Total Processes is now: CRITICAL on deployment-web4 deployment-web4 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:32:07] PROBLEM Disk Space is now: CRITICAL on deployment-web4 deployment-web4 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:40:22] RECOVERY Current Load is now: OK on deployment-web deployment-web output: OK - load average: 0.00, 1.85, 3.92 [13:46:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [13:48:02] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [14:16:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [14:18:02] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [14:43:58] IWorld: yes [14:46:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [14:48:02] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [14:51:42] PROBLEM Disk Space is now: WARNING on wikistream-1 wikistream-1 output: DISK WARNING - free space: / 78 MB (5% inode=47%): [15:06:42] RECOVERY Disk Space is now: OK on wikistream-1 wikistream-1 output: DISK OK [15:09:32] petan|wk: can you add me to bots? [15:16:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [15:18:02] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [15:18:23] IWorld: you want to run a bot there? [15:18:31] yes [15:18:37] which one [15:18:54] supybot [15:19:12] is it for a wikimedia project? [15:19:18] wikidata [15:19:21] ? 
[15:19:24] what is it [15:19:42] PROBLEM Disk Space is now: WARNING on wikistream-1 wikistream-1 output: DISK WARNING - free space: / 78 MB (5% inode=47%): [15:20:00] --> http://meta.wikimedia.org/wiki/Wikidata [15:20:17] I was wondering how to explain that to petan :P [15:21:13] Platonides: huh [15:21:30] IWorld: I don't understand why you need a bot for a project which doesn't even exist [15:21:52] supybot = ircbot :O [15:22:01] And the project exists [15:22:41] please understand that wikimedia labs are in beta, and all projects are fairly unstable, including bots, we don't usually let the people move their bots there unless there is a reason (see irc logs), so can you explain to me what is that bot going to do? [15:23:43] ah [15:23:58] no problem [15:24:08] hm? [15:24:38] When Labs is stable? Never? [15:24:51] I hope soon [15:28:49] http://bots.wmflabs.org/~petrb/logs/%23wikimedia-labs/20111215.txt [15:29:51] time? [15:30:01] beginning of log [15:30:14] [19:16:22] I'd really like to have a proper infrastructure up before we allow bots [15:31:08] that means if there is a bot which can't be hosted anywhere else and it's needed to run it, like cluebot we can make an exception, but in general it's not a good idea to move any bots there [15:31:11] We really need some scheduling and puppetization [15:31:12] lot of outages too [15:32:21] ok [15:46:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [15:48:02] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [16:16:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [16:18:02] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [16:39:01] Is there a ridiculous wait for new git repositories? 
[16:46:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [16:48:02] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [17:16:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [17:18:02] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [17:23:33] New patchset: Sara; "Fourth iteration of adding ganglia for labs." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3561 [17:23:45] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3561 [17:30:08] IWorld hello [17:32:07] New review: Sara; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3561 [17:32:33] Change merged: Sara; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3561 [17:32:38] IWorld you still around ? [17:33:13] petan|wk: how are you doing [17:34:59] Hi OrenOf [17:46:48] what is your bot supposed to do [17:46:52] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [17:46:57] is it one of those scraping things [17:48:02] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [17:48:24] I better put dumpster out of its misery [17:50:01] IWorld are you a wikidata developer ? [17:50:14] not yet [17:50:42] But I'm interested. [17:50:50] so what is your relation to thta project ? [17:51:22] What's that? [17:51:36] are you affiliated with them ? [17:52:57] do you mean "that project"? [17:53:06] wikidata? [17:53:45] sorry if we are getting off on the wrong foot [17:53:53] PROBLEM Current Load is now: CRITICAL on aggregator-test aggregator-test output: CHECK_NRPE: Error - Could not complete SSL handshake. 
[17:53:58] I am trying to learn about you and about wikidata too [17:54:05] and perhaps help you out [17:54:33] PROBLEM Current Users is now: CRITICAL on aggregator-test aggregator-test output: CHECK_NRPE: Error - Could not complete SSL handshake. [17:54:55] at the moment I have only created a logo candidate [17:55:13] PROBLEM Disk Space is now: CRITICAL on aggregator-test aggregator-test output: CHECK_NRPE: Error - Could not complete SSL handshake. [17:56:03] PROBLEM Free ram is now: CRITICAL on aggregator-test aggregator-test output: CHECK_NRPE: Error - Could not complete SSL handshake. [17:57:23] PROBLEM Total Processes is now: CRITICAL on aggregator-test aggregator-test output: CHECK_NRPE: Error - Could not complete SSL handshake. [17:58:06] is it going to be related to dbpedia ? [17:58:13] PROBLEM dpkg-check is now: CRITICAL on aggregator-test aggregator-test output: CHECK_NRPE: Error - Could not complete SSL handshake. [17:59:02] dbpedia? [18:16:53] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [18:46:53] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [18:57:43] PROBLEM host: deployment-web5 is DOWN address: deployment-web5 CRITICAL - Host Unreachable (deployment-web5) [19:01:23] PROBLEM Free ram is now: CRITICAL on deployment-web deployment-web output: Critical: 3% free memory [19:03:25] PROBLEM Current Load is now: WARNING on deployment-web deployment-web output: WARNING - load average: 0.60, 10.83, 8.35 [19:07:45] PROBLEM host: deployment-web4 is DOWN address: deployment-web4 CRITICAL - Host Unreachable (deployment-web4) [19:13:22] RECOVERY Current Load is now: OK on deployment-web deployment-web output: OK - load average: 0.03, 1.55, 4.42 [19:18:42] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [19:28:09] PROBLEM host: deployment-web5 is DOWN address: deployment-web5 CRITICAL - Host Unreachable (deployment-web5) [19:33:53] 
PROBLEM Current Load is now: CRITICAL on aggregator-test2 aggregator-test2 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:38:33] PROBLEM host: deployment-web4 is DOWN address: deployment-web4 CRITICAL - Host Unreachable (deployment-web4) [19:38:53] PROBLEM host: aggregator-test2 is DOWN address: aggregator-test2 CRITICAL - Host Unreachable (aggregator-test2) [19:47:44] Cool topic! [19:49:03] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [19:57:38] New patchset: Sara; "Fifth iteration of adding ganglia for labs." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3661 [19:57:49] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3661 [19:59:33] PROBLEM host: deployment-web5 is DOWN address: deployment-web5 CRITICAL - Host Unreachable (deployment-web5) [20:00:56] New review: Sara; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3661 [20:00:56] Change merged: Sara; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3661 [20:05:39] New review: Hashar; "I think we should keep that kind of project on Github. We would benefit from more exposure and it wi..." [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/3521 [20:09:33] PROBLEM host: deployment-web4 is DOWN address: deployment-web4 CRITICAL - Host Unreachable (deployment-web4) [20:14:09] New patchset: Sara; "Sixth iteration of adding ganglia for labs." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3664 [20:14:21] New review: gerrit2; "Lint check passed." 
[operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3664 [20:15:11] New review: Sara; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3664 [20:15:14] Change merged: Sara; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3664 [20:19:54] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 2.73, 13.13, 7.79 [20:20:04] PROBLEM Current Load is now: WARNING on deployment-web2 deployment-web2 output: WARNING - load average: 4.48, 10.64, 5.87 [20:20:34] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [20:25:04] RECOVERY Current Load is now: OK on deployment-web2 deployment-web2 output: OK - load average: 0.05, 3.99, 4.28 [20:30:34] PROBLEM host: deployment-web5 is DOWN address: deployment-web5 CRITICAL - Host Unreachable (deployment-web5) [20:34:54] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.25, 0.88, 3.12 [20:39:54] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.04, 0.46, 2.33 [20:40:34] PROBLEM host: deployment-web4 is DOWN address: deployment-web4 CRITICAL - Host Unreachable (deployment-web4) [20:44:22] New patchset: Sara; "Seventh iteration of adding ganglia for labs." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3670 [20:44:34] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3670 [20:45:20] New review: Sara; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3670 [20:45:23] Change merged: Sara; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3670 [20:50:34] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [20:51:14] PROBLEM dpkg-check is now: CRITICAL on deployment-web3 deployment-web3 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:52:04] PROBLEM Free ram is now: CRITICAL on deployment-web2 deployment-web2 output: Critical: 4% free memory [20:53:04] PROBLEM Current Load is now: WARNING on deployment-web2 deployment-web2 output: WARNING - load average: 24.42, 17.69, 9.15 [20:53:44] PROBLEM Current Load is now: CRITICAL on deployment-web3 deployment-web3 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:44] PROBLEM Total Processes is now: CRITICAL on deployment-web3 deployment-web3 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:49] PROBLEM Free ram is now: CRITICAL on deployment-web3 deployment-web3 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:09] mutante: hey [20:55:14] we need to finish nagios [20:55:41] is someone in states working on friday? :P [20:56:04] PROBLEM SSH is now: CRITICAL on deployment-web3 deployment-web3 output: No route to host [20:56:04] PROBLEM Disk Space is now: CRITICAL on deployment-web3 deployment-web3 output: Connection refused or timed out [20:56:04] PROBLEM Current Users is now: CRITICAL on deployment-web3 deployment-web3 output: Connection refused or timed out [20:58:14] RECOVERY host: deployment-web4 is UP address: deployment-web4 PING OK - Packet loss = 0%, RTA = 1.04 ms [21:00:34] PROBLEM host: deployment-web5 is DOWN address: deployment-web5 CRITICAL - Host Unreachable (deployment-web5) [21:00:34] RECOVERY SSH is now: OK on deployment-web4 deployment-web4 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [21:00:35] RECOVERY Current Load is now: OK on deployment-web4 deployment-web4 output: OK - load average: 0.92, 0.64, 0.26 [21:00:35] RECOVERY Current Users is now: OK on deployment-web4 deployment-web4 output: USERS OK - 0 users currently logged in [21:00:35] RECOVERY dpkg-check is now: OK on deployment-web4 deployment-web4 output: All packages OK [21:00:54] RECOVERY SSH is now: OK on deployment-web3 deployment-web3 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [21:00:54] RECOVERY Disk Space 
is now: OK on deployment-web3 deployment-web3 output: DISK OK [21:00:54] RECOVERY Current Users is now: OK on deployment-web3 deployment-web3 output: USERS OK - 0 users currently logged in [21:01:04] RECOVERY dpkg-check is now: OK on deployment-web3 deployment-web3 output: All packages OK [21:01:34] PROBLEM host: deployment-web2 is DOWN address: deployment-web2 CRITICAL - Host Unreachable (deployment-web2) [21:02:54] RECOVERY Free ram is now: OK on deployment-web4 deployment-web4 output: OK: 45% free memory [21:03:04] RECOVERY Disk Space is now: OK on deployment-web4 deployment-web4 output: DISK OK [21:03:04] RECOVERY Total Processes is now: OK on deployment-web4 deployment-web4 output: PROCS OK: 97 processes [21:03:09] RECOVERY host: deployment-web2 is UP address: deployment-web2 PING OK - Packet loss = 0%, RTA = 1.11 ms [21:03:34] RECOVERY Current Load is now: OK on deployment-web3 deployment-web3 output: OK - load average: 0.87, 0.75, 0.32 [21:03:34] RECOVERY Total Processes is now: OK on deployment-web3 deployment-web3 output: PROCS OK: 97 processes [21:03:39] RECOVERY Free ram is now: OK on deployment-web3 deployment-web3 output: OK: 72% free memory [21:07:04] RECOVERY Free ram is now: OK on deployment-web2 deployment-web2 output: OK: 50% free memory [21:08:54] RECOVERY host: deployment-web5 is UP address: deployment-web5 PING OK - Packet loss = 0%, RTA = 6.59 ms [21:10:34] RECOVERY Total Processes is now: OK on deployment-web5 deployment-web5 output: PROCS OK: 107 processes [21:10:39] RECOVERY Disk Space is now: OK on deployment-web5 deployment-web5 output: DISK OK [21:10:39] RECOVERY dpkg-check is now: OK on deployment-web5 deployment-web5 output: All packages OK [21:12:54] RECOVERY Free ram is now: OK on deployment-web5 deployment-web5 output: OK: 88% free memory [21:12:54] RECOVERY Current Users is now: OK on deployment-web5 deployment-web5 output: USERS OK - 0 users currently logged in [21:13:04] RECOVERY SSH is now: OK on deployment-web5 deployment-web5 
output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [21:13:04] RECOVERY Current Load is now: OK on deployment-web5 deployment-web5 output: OK - load average: 0.05, 0.11, 0.05 [21:18:14] PROBLEM host: aggregator-test is DOWN address: aggregator-test CRITICAL - Host Unreachable (aggregator-test) [21:20:43] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [21:50:43] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [22:20:43] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [22:50:43] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [23:20:43] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [23:50:43] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1)