[00:00:03] <gerrit-wm>	 New patchset: Bhartshorne; "first draft of the swift cleaner stuff.  I know this doesn't work but I want to check it in for reviews." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3134
[00:00:14] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.404 seconds
[00:00:15] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3134
[00:01:07] <Reedy>	 mutante: that was quick :)
[00:03:23] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.462 seconds
[00:06:05] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:06:30] <Reedy>	 Anyone know if "hume: sudo: no tty present and no askpass program specified" when doing sync-file etc has been reported?
[00:09:04] <mutante>	 Reedy: dont think so
[00:12:14] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.067 seconds
[00:27:30] <gerrit-wm>	 New patchset: Lcarr; "Splitting off the icinga packages and specific config files into their own manifest file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3135
[00:27:39] <gerrit-wm>	 New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3135
[00:28:51] <gerrit-wm>	 New patchset: Lcarr; "Splitting off the icinga packages and specific config files into their own manifest file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3135
[00:29:03] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3135
[00:29:20] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3135
[00:36:19] <gerrit-wm>	 New patchset: Lcarr; "Moving icinga specific files to own class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3136
[00:36:32] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3136
[00:37:58] <gerrit-wm>	 Change abandoned: Lcarr; "i hate you gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3135
[00:38:58] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3136
[00:39:00] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3136
[00:46:43] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:47:19] <gerrit-wm>	 New patchset: Lcarr; "explicitly calling out nagios_mysql_check_pass" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3137
[00:47:31] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3137
[00:48:00] * AaronSchulz  wonders why copper is so unreliable
[00:48:23] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3137
[00:48:26] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3137
[00:49:03] <mutante>	 AaronSchulz: in which way? since just 48 hours or all the time?
[00:49:25] <AaronSchulz>	 probably the latter ;)
[00:49:52] <AaronSchulz>	 maybe it just sucks when connecting from here
[00:51:25] <LeslieCarr>	 the machine or internet via copper cables ?
[00:51:57] <AaronSchulz>	 hehe
[00:52:01] <AaronSchulz>	 the boxen
[00:52:10] * AaronSchulz  loves metal copper
[00:52:50] <mutante>	 well, i installed some upgrades on it like 2 says ago, thats why i said "since 48 hours"
[00:52:52] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.545 seconds
[00:52:54] <mutante>	 days
[00:54:25] <gerrit-wm>	 New patchset: Lcarr; "fixing out of scope variable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3138
[00:54:37] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3138
[00:55:20] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3138
[00:55:23] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3138
[00:58:43] <gerrit-wm>	 New patchset: Lcarr; "fixing other out of scope variables" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3139
[00:58:55] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3139
[00:59:44] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3139
[00:59:47] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3139
[01:06:13] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:08:55] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:13:07] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.859 seconds
[01:14:23] <AaronSchulz>	 mutante: it's hard for me to even connect
[01:15:31] <nagios-wm>	 PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours
[01:15:31] <nagios-wm>	 PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours
[01:15:31] <nagios-wm>	 PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[01:15:31] <nagios-wm>	 PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours
[01:15:31] <nagios-wm>	 PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours
[01:16:16] <mutante>	 AaronSchulz: works for me (via fenari that is, right)
[01:18:29] <gerrit-wm>	 New patchset: Lcarr; "removing unneeded /etc/icinga check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3140
[01:18:41] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3140
[01:18:48] <mutante>	 AaronSchulz: 9 packets transmitted, 9 received, 0% packet loss   rtt min/avg/max/mdev = 26.350/26.429/26.466/0.114 ms
[01:19:05] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3140
[01:19:08] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3140
[01:19:25] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:24:17] <gerrit-wm>	 New patchset: Ryan Lane; "Upping svn rev for ldap tools to update manage-volumes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3141
[01:24:29] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3141
[01:24:33] <gerrit-wm>	 New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3141
[01:24:35] <gerrit-wm>	 Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3141
[01:28:25] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:31:52] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.194 seconds
[01:34:34] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.667 seconds
[01:43:05] <nagios-wm>	 PROBLEM - Misc_Db_Lag on db10 is CRITICAL: (Return code of 255 is out of bounds)
[01:43:05] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:43:14] <nagios-wm>	 PROBLEM - MySQL slave status on es2 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES)
[01:43:32] <nagios-wm>	 PROBLEM - Misc_Db_Slave on db10 is CRITICAL: CRITICAL: Access denied for user nagios@spence.wikimedia.org (using password: YES)
[01:43:32] <nagios-wm>	 PROBLEM - MySQL replication status on es1003 is CRITICAL: (Return code of 255 is out of bounds)
[01:43:32] <nagios-wm>	 PROBLEM - MySQL master status on es3 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES)
[01:43:32] <nagios-wm>	 PROBLEM - Misc_Db_Master on db9 is CRITICAL: CRITICAL: Access denied for user nagios@spence.wikimedia.org (using password: YES)
[01:43:50] <nagios-wm>	 PROBLEM - MySQL replication status on es1 is CRITICAL: (Return code of 255 is out of bounds)
[01:43:50] <nagios-wm>	 PROBLEM - MySQL replication status on es4 is CRITICAL: (Return code of 255 is out of bounds)
[01:44:08] <nagios-wm>	 PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: (Return code of 255 is out of bounds)
[01:44:08] <nagios-wm>	 PROBLEM - MySQL slave status on es1003 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES)
[01:44:08] <nagios-wm>	 PROBLEM - MySQL slave status on es1 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES)
[01:44:08] <nagios-wm>	 PROBLEM - MySQL replication status on es1004 is CRITICAL: (Return code of 255 is out of bounds)
[01:44:26] <nagios-wm>	 PROBLEM - MySQL slave status on es4 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES)
[01:44:26] <nagios-wm>	 PROBLEM - MySQL replication status on db1025 is CRITICAL: (Return code of 255 is out of bounds)
[01:44:35] <nagios-wm>	 PROBLEM - MySQL master status on db1008 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES)
[01:44:44] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.300 seconds
[01:44:44] <nagios-wm>	 PROBLEM - MySQL slave status on db1025 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES)
[01:44:44] <nagios-wm>	 PROBLEM - MySQL replication status on storage3 is CRITICAL: (Return code of 255 is out of bounds)
[01:44:53] <nagios-wm>	 PROBLEM - MySQL replication status on es2 is CRITICAL: (Return code of 255 is out of bounds)
[01:45:02] <nagios-wm>	 PROBLEM - MySQL slave status on storage3 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES)
[01:51:02] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:55:14] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.601 seconds
[01:59:53] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.266 seconds
[02:01:32] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:06:11] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:10:32] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:15:11] <gerrit-wm>	 New patchset: Lcarr; "Revert "fixing out of scope variable"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3142
[02:15:24] <gerrit-wm>	 New patchset: Lcarr; "Revert "fixing other out of scope variables"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3143
[02:15:36] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3142
[02:15:37] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3143
[02:16:06] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3142
[02:16:09] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3142
[02:16:13] <gerrit-wm>	 New patchset: Dzahn; "swift process monitoring (RT-2593)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3144
[02:16:25] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3143
[02:16:25] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3144
[02:16:25] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3143
[02:16:41] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.038 seconds
[02:16:41] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.313 seconds
[02:17:35] <nagios-wm>	 PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[02:18:20] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.581 seconds
[02:18:47] <gerrit-wm>	 New review: Dzahn; "expect NRPE to break when merging any change to nrpe_local.cfg - be prepared to restart nagios-nrpe-..." [operations/puppet] (production); V: 1 C: 1;  - https://gerrit.wikimedia.org/r/3144
[02:22:59] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:24:26] <gerrit-wm>	 New patchset: Lcarr; "fixing the nagios checkcommands template again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3145
[02:24:38] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3145
[02:24:38] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:25:06] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3145
[02:25:08] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3145
[02:40:59] <nagios-wm>	 PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours
[02:44:53] <nagios-wm>	 PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours
[02:47:42] <nagios-wm>	 RECOVERY - Misc_Db_Lag on db10 is OK: CHECK MySQL REPLICATION - lag - OK -  Seconds_Behind_Master : 0s
[02:47:42] <nagios-wm>	 RECOVERY - MySQL slave status on es2 is OK: OK:
[02:48:00] <nagios-wm>	 RECOVERY - Misc_Db_Slave on db10 is OK: OK:
[02:48:00] <nagios-wm>	 RECOVERY - MySQL master status on es3 is OK: OK:
[02:48:00] <nagios-wm>	 RECOVERY - Misc_Db_Master on db9 is OK: OK:
[02:48:00] <nagios-wm>	 RECOVERY - MySQL replication status on es1003 is OK: CHECK MySQL REPLICATION - lag - OK -  Seconds_Behind_Master : 0s
[02:48:27] <nagios-wm>	 RECOVERY - MySQL replication status on es1 is OK: CHECK MySQL REPLICATION - lag - OK -  Seconds_Behind_Master : 0s
[02:48:27] <nagios-wm>	 RECOVERY - MySQL replication status on es4 is OK: CHECK MySQL REPLICATION - lag - OK -  Seconds_Behind_Master : 0s
[02:48:27] <nagios-wm>	 RECOVERY - MySQL slave status on es1003 is OK: OK:
[02:48:45] <nagios-wm>	 RECOVERY - MySQL slave status on es4 is OK: OK:
[02:48:45] <nagios-wm>	 RECOVERY - MySQL replication status on es1004 is OK: CHECK MySQL REPLICATION - lag - OK -  Seconds_Behind_Master : s
[02:48:45] <nagios-wm>	 RECOVERY - MySQL slave status on es1 is OK: OK:
[02:48:53] <RobH>	 !log realized i forgot to log hours ago that cp1029-cp1036 are installed with puppet run, ready for varnish deployment tomorrow
[02:48:54] <nagios-wm>	 RECOVERY - MySQL replication status on db1025 is OK: CHECK MySQL REPLICATION - lag - OK -  Seconds_Behind_Master : 0s
[02:48:56] <morebots>	 Logged the message, RobH
[02:49:03] <nagios-wm>	 RECOVERY - MySQL master status on db1008 is OK: OK:
[02:49:03] <nagios-wm>	 RECOVERY - MySQL slave status on db1025 is OK: OK:
[02:49:21] <nagios-wm>	 RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK -  Seconds_Behind_Master : 11s
[02:49:39] <nagios-wm>	 RECOVERY - MySQL slave status on storage3 is OK: OK:
[02:49:48] <nagios-wm>	 RECOVERY - MySQL replication status on es2 is OK: CHECK MySQL REPLICATION - lag - OK -  Seconds_Behind_Master : 0s
[02:49:51] <RobH>	 !log revoked, cp1032 is for some reason in grub error, and it's too late at night for me to work on it, will troubleshoot tomorrow
[02:49:54] <morebots>	 Logged the message, RobH
[02:50:42] <nagios-wm>	 RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK -  Seconds_Behind_Master : 50s
[02:52:09] <RobH>	 !log cp1032-cp1035 reinstall issue wiped mbr causing issues, will reinstall in my AM
[02:52:12] <morebots>	 Logged the message, RobH
[02:53:51] <nagios-wm>	 RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time
[03:01:39] <nagios-wm>	 PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours
[03:10:12] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.862 seconds
[03:16:39] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.242 seconds
[03:31:12] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:31] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.136 seconds
[03:34:43] <nagios-wm>	 RECOVERY - Puppet freshness on mw53 is OK: puppet ran at Wed Mar 14 03:34:21 UTC 2012
[03:38:36] <mutante>	 !log free some disk space on spence - deleted user.log.1 on spence, compressing messages.1, apt-get clean,...
[03:38:39] <morebots>	 Logged the message, Master
[03:40:07] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:42:04] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:44:10] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.363 seconds
[03:44:16] <mutante>	 !log ekrem - user agent "AppleDictionaryService" requests cause temp. WAP outage ..it seems
[03:44:19] <morebots>	 Logged the message, Master
[03:46:20] <LeslieCarr>	 mutante: do you know much about appledictionaryservice ?
[03:46:34] <gerrit-wm>	 New patchset: Lcarr; "fixing paths" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3146
[03:46:36] <LeslieCarr>	 i don't myself :(
[03:46:46] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3146
[03:46:50] <mutante>	 LeslieCarr: no, i just see "ekrem is a Wikimedia Apple Dictionary to API OpenSearch bridge (misc::apple-dictionary-bridge)."
[03:47:02] <LeslieCarr>	 ah yeah :)  and that's all that's in the apache logs
[03:47:14] <LeslieCarr>	 my totally uncertain guess is it's overloaded and we should set up varnish or something
[03:47:15] <mutante>	 LeslieCarr: and when i looked a bit at tail -f access.log on it, i see a lot of requests ""AppleDictionaryService/158.2""
[03:47:24] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3146
[03:47:27] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3146
[03:47:50] <mutante>	 LeslieCarr: yeah, it does look overloaded, but only while that DictionaryService comes to index all pages or something...
[03:48:10] <mutante>	 LeslieCarr: when that happens we get the flapping WAP service on ekrem, and after it's done it's good again until next time.. afaik
[03:49:11] <LeslieCarr>	 ah so it's very sporadic
[03:49:27] <mutante>	 i would check but when hitting history.cgi i probably kill spence :)
[03:49:37] <mutante>	 but yeah
[03:50:10] <LeslieCarr>	 :)
[03:50:19] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.630 seconds
[03:51:06] <mutante>	 "Dictionary running under Mac OS X Leopard, showing Wikipedia's page on Wikipedia."
[03:51:26] <mutante>	 http://en.wikipedia.org/wiki/Dictionary_%28software%29
[03:52:06] <mutante>	 http://en.wikipedia.org/wiki/Dictionary_%28software%29#Wikipedia
[03:55:10] <mutante>	 LeslieCarr: when fixing stuff on spence and adding new checks, i did not want to mess with your changes on the icinga files, but also i dont want us having to do stuff twice, i'd rather help with Icinga right away
[03:55:42] <LeslieCarr>	 cool - so most files that are in puppet should transfer over
[03:55:47] <LeslieCarr>	 however, many checks aren't in puppet ;)
[03:56:01] <mutante>	 and i have new process checks for swift sitting in gerrit.. but they will break NRPE again.. because every config change on nrpe_local.cfg does
[03:56:46] <mutante>	 so i will kill the bot, stop paging, wait for it to break..
[03:56:46] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:57:10] <mutante>	 then restart the service on ALL, using scripts...    OR ..find out the root cause first.. hrmm
[03:58:19] <LeslieCarr>	 ah yeah, nrpe local changes, so frustrating
[03:58:31] <LeslieCarr>	 and not having dsh
[03:58:35] <LeslieCarr>	 i mean dsh groups
[03:58:43] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.447 seconds
[03:58:49] <LeslieCarr>	 anyways, i'm gonna go afk for a bit to see if the latest change fixed neon
[03:59:34] <mutante>	 grep host_name /etc/nagios/puppet_hosts.cfg | cut -d " " -f23
[03:59:38] <mutante>	 kk, ttyl
[04:06:58] <LeslieCarr>	 meh, gotta fix users and stuff - that sounds like a tomorrow issue :)
[04:07:00] <LeslieCarr>	 bye
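The `grep host_name ... | cut` one-liner pasted at 03:59:34 builds a host list for restarting NRPE everywhere. A minimal Python sketch of the same idea follows; it assumes the file uses ordinary Nagios `host_name <name>` directives, and the function name `extract_host_names` plus the sample hosts and address are made up for illustration. Unlike `cut -f23`, it does not depend on how many spaces pad the directive.

```python
import re

# Matches a Nagios object directive such as "    host_name    ekrem".
HOST_NAME_RE = re.compile(r"^\s*host_name\s+(\S+)")


def extract_host_names(config_text):
    """Return unique host names from Nagios-style config text, in first-seen order."""
    seen = []
    for line in config_text.splitlines():
        match = HOST_NAME_RE.match(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


if __name__ == "__main__":
    # Hypothetical sample; the real puppet_hosts.cfg layout is not shown in the log.
    sample = """
define host {
    host_name                      ekrem
    address                        192.0.2.10
}
define service {
    host_name                      ekrem
    service_description            HTTP
}
define host {
    host_name                      stafford
}
"""
    for name in extract_host_names(sample):
        print(name)
```

The resulting list could then be fed to whatever loop or dsh group does the actual service restart.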
[04:11:28] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:13:34] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.855 seconds
[04:45:30] <nagios-wm>	 PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:49:33] <nagios-wm>	 RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[05:41:39] <nagios-wm>	 PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
[06:26:57] <gerrit-wm>	 New patchset: Dzahn; "also allow public esams net (91.198.174.0./25), not just private, snmp access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3147
[06:27:09] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3147
[06:28:56] <gerrit-wm>	 New patchset: Dzahn; "also allow public esams net (91.198.174.0./25), not just private, snmp access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3147
[06:29:09] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3147
[06:30:29] <gerrit-wm>	 New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3147
[06:30:32] <gerrit-wm>	 Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3147
[07:17:17] <nagios-wm>	 RECOVERY - Puppet freshness on amslvs1 is OK: puppet ran at Wed Mar 14 07:16:44 UTC 2012
[07:17:44] <nagios-wm>	 RECOVERY - Puppet freshness on ssl3001 is OK: puppet ran at Wed Mar 14 07:17:28 UTC 2012
[07:17:44] <nagios-wm>	 RECOVERY - Puppet freshness on knsq20 is OK: puppet ran at Wed Mar 14 07:17:33 UTC 2012
[07:18:11] <nagios-wm>	 RECOVERY - Puppet freshness on ssl3002 is OK: puppet ran at Wed Mar 14 07:17:54 UTC 2012
[07:18:47] <nagios-wm>	 RECOVERY - Puppet freshness on amssq45 is OK: puppet ran at Wed Mar 14 07:18:13 UTC 2012
[07:19:41] <nagios-wm>	 RECOVERY - Puppet freshness on amssq35 is OK: puppet ran at Wed Mar 14 07:19:27 UTC 2012
[07:22:50] <nagios-wm>	 PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[07:24:56] <nagios-wm>	 PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[07:25:14] <nagios-wm>	 RECOVERY - Puppet freshness on amssq55 is OK: puppet ran at Wed Mar 14 07:24:59 UTC 2012
[07:25:14] <nagios-wm>	 RECOVERY - Puppet freshness on amssq51 is OK: puppet ran at Wed Mar 14 07:25:08 UTC 2012
[07:25:14] <nagios-wm>	 RECOVERY - Puppet freshness on amssq61 is OK: puppet ran at Wed Mar 14 07:25:09 UTC 2012
[07:25:41] <nagios-wm>	 RECOVERY - Puppet freshness on amssq34 is OK: puppet ran at Wed Mar 14 07:25:16 UTC 2012
[07:25:41] <nagios-wm>	 RECOVERY - Puppet freshness on amssq58 is OK: puppet ran at Wed Mar 14 07:25:22 UTC 2012
[07:27:47] <nagios-wm>	 RECOVERY - Puppet freshness on ssl3003 is OK: puppet ran at Wed Mar 14 07:27:27 UTC 2012
[07:27:47] <nagios-wm>	 RECOVERY - Puppet freshness on ssl3004 is OK: puppet ran at Wed Mar 14 07:27:32 UTC 2012
[07:27:56] <nagios-wm>	 PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours
[07:27:56] <nagios-wm>	 PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours
[07:28:05] <nagios-wm>	 RECOVERY - Puppet freshness on amssq57 is OK: puppet ran at Wed Mar 14 07:27:55 UTC 2012
[07:28:41] <nagios-wm>	 RECOVERY - Puppet freshness on amssq54 is OK: puppet ran at Wed Mar 14 07:28:15 UTC 2012
[07:28:41] <nagios-wm>	 RECOVERY - Puppet freshness on amslvs3 is OK: puppet ran at Wed Mar 14 07:28:19 UTC 2012
[07:29:17] <nagios-wm>	 RECOVERY - Puppet freshness on knsq16 is OK: puppet ran at Wed Mar 14 07:28:46 UTC 2012
[07:29:17] <nagios-wm>	 RECOVERY - Puppet freshness on cp3002 is OK: puppet ran at Wed Mar 14 07:28:57 UTC 2012
[07:29:17] <nagios-wm>	 RECOVERY - Puppet freshness on amssq33 is OK: puppet ran at Wed Mar 14 07:28:59 UTC 2012
[07:30:11] <nagios-wm>	 RECOVERY - Puppet freshness on maerlant is OK: puppet ran at Wed Mar 14 07:29:41 UTC 2012
[07:30:47] <nagios-wm>	 RECOVERY - Puppet freshness on amssq37 is OK: puppet ran at Wed Mar 14 07:30:14 UTC 2012
[07:30:47] <nagios-wm>	 RECOVERY - Puppet freshness on knsq17 is OK: puppet ran at Wed Mar 14 07:30:32 UTC 2012
[07:31:41] <nagios-wm>	 RECOVERY - Puppet freshness on amssq52 is OK: puppet ran at Wed Mar 14 07:31:31 UTC 2012
[07:31:50] <nagios-wm>	 PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[07:31:50] <nagios-wm>	 PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[07:32:17] <nagios-wm>	 RECOVERY - Puppet freshness on amssq59 is OK: puppet ran at Wed Mar 14 07:32:07 UTC 2012
[07:34:41] <nagios-wm>	 RECOVERY - Puppet freshness on amssq53 is OK: puppet ran at Wed Mar 14 07:34:14 UTC 2012
[07:34:41] <nagios-wm>	 RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Wed Mar 14 07:34:27 UTC 2012
[07:35:44] <nagios-wm>	 RECOVERY - Puppet freshness on knsq25 is OK: puppet ran at Wed Mar 14 07:35:25 UTC 2012
[07:35:44] <nagios-wm>	 RECOVERY - Puppet freshness on knsq22 is OK: puppet ran at Wed Mar 14 07:35:40 UTC 2012
[07:37:14] <nagios-wm>	 RECOVERY - Puppet freshness on nescio is OK: puppet ran at Wed Mar 14 07:36:45 UTC 2012
[07:40:14] <nagios-wm>	 RECOVERY - Puppet freshness on knsq18 is OK: puppet ran at Wed Mar 14 07:40:02 UTC 2012
[07:40:41] <nagios-wm>	 RECOVERY - Puppet freshness on ms6 is OK: puppet ran at Wed Mar 14 07:40:18 UTC 2012
[07:40:41] <nagios-wm>	 RECOVERY - Puppet freshness on amssq47 is OK: puppet ran at Wed Mar 14 07:40:40 UTC 2012
[07:40:50] <nagios-wm>	 PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours
[07:41:44] <nagios-wm>	 RECOVERY - Puppet freshness on knsq24 is OK: puppet ran at Wed Mar 14 07:41:30 UTC 2012
[07:41:44] <nagios-wm>	 RECOVERY - Puppet freshness on amssq36 is OK: puppet ran at Wed Mar 14 07:41:31 UTC 2012
[07:42:47] <nagios-wm>	 RECOVERY - Puppet freshness on amssq39 is OK: puppet ran at Wed Mar 14 07:42:26 UTC 2012
[07:43:14] <nagios-wm>	 RECOVERY - Puppet freshness on amssq32 is OK: puppet ran at Wed Mar 14 07:43:04 UTC 2012
[07:43:41] <nagios-wm>	 RECOVERY - Puppet freshness on knsq29 is OK: puppet ran at Wed Mar 14 07:43:16 UTC 2012
[07:43:41] <nagios-wm>	 RECOVERY - Puppet freshness on hooft is OK: puppet ran at Wed Mar 14 07:43:26 UTC 2012
[07:45:11] <nagios-wm>	 RECOVERY - Puppet freshness on amssq43 is OK: puppet ran at Wed Mar 14 07:44:43 UTC 2012
[07:46:41] <nagios-wm>	 RECOVERY - Puppet freshness on amssq31 is OK: puppet ran at Wed Mar 14 07:46:16 UTC 2012
[07:49:50] <nagios-wm>	 PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[07:51:05] <mutante>	 !log fixing owa[1-3] Swift HTTP commands manually
[07:51:08] <morebots>	 Logged the message, Master
[07:51:11] <mutante>	 !log restarting memcached on marmontel
[07:51:14] <morebots>	 Logged the message, Master
[07:58:50] <nagios-wm>	 RECOVERY - Memcached on srv254 is OK: TCP OK - 0.002 second response time on port 11000
[07:59:08] <nagios-wm>	 RECOVERY - Memcached on srv255 is OK: TCP OK - 0.001 second response time on port 11000
[08:02:26] <nagios-wm>	 RECOVERY - Memcached on srv257 is OK: TCP OK - 0.003 second response time on port 11000
[08:02:49] <mutante>	 !log stop/start memcached on srv254,srv255,srv257
[08:02:52] <morebots>	 Logged the message, Master
[08:19:10] <mutante>	 apergos: does snapshot3 need mysql-client at all? currently mysql-client and mysql-common are on it in a "iU" (unpacked) state, but depends on mysql-client-core which it does not have, causing broken dpkg, can just remove it all?
[08:19:17] <apergos>	 hello
[08:19:34] <apergos>	 that's bad
[08:19:45] <apergos>	 as of when?
[08:19:46] <mutante>	 hi:) i didnt expect this message to reach you right this second:)
[08:20:00] <mutante>	 but didnt check time either
[08:20:21] <apergos>	 as of today?
[08:20:25] <apergos>	 or as of a week ago?
[08:21:35] <mutante>	 as of 1d 16 h
[08:21:41] <apergos>	 all the snapshots need mysql client
[08:21:52] <apergos>	 if it gets removed (and how did it get removed?) everything breaks
[08:22:04] <apergos>	 in this case it will be the "adds changes" dumps that broke
[08:22:12] <mutante>	 lemme try reinstalling it
[08:22:33] <mutante>	 i have no idea how it got removed, just Nagios told me "dpkg broken packages" and then i checked why
[08:24:16] <mutante>	 !log running "apt-get -f install" on snapshot3 to fix dpkg, which installed mysql-client- and client-core-5.1
[08:24:19] <morebots>	 Logged the message, Master
[08:24:36] <apergos>	 that's pretty frustrating. I would love to know how they just disappeared
[08:24:37] <mutante>	 looks like Nagios should soon report that fixed
[08:24:41] <nagios-wm>	 RECOVERY - DPKG on snapshot3 is OK: All packages OK
[08:27:06] <apergos>	 nothing whatsoever in the sysadmin log
[08:27:57] <mutante>	 Start-Date: 2012-03-12  15:43:31
[08:27:58] <apergos>	 yeah my incrementals haven't been running for two days.
[08:27:59] <mutante>	  Upgrade: libmysqlclient16 (5.1.53-fb3753-wm1, 5.1.61-0ubuntu0.10.04.1), mysql-common (5.1.53-fb3753-wm1, 5.1.61-0ubuntu0.10.04.1), mysql-client-5.1 (5.1.53-fb3753-wm1, 5.1.61-0ubuntu0.10.04.1)
[08:28:03] <mutante>	 Error: Sub-process /usr/bin/dpkg returned an error code (1)
[08:28:08] <mutante>	  /var/log/apt/history.log
[08:28:15] <apergos>	 but who ran it?
[08:28:53] <apergos>	 and why not follow up to be sure it worked?
[08:31:40] <apergos>	 I can get today's to run I think but it's going to include yesterday's data
[08:36:32] <mutante>	 apergos: hmm, see how the Start-Date above was at 15:_43_ ?
[08:36:51] <apergos>	 uh huh
[08:36:53] <mutante>	 apergos: and there is also a cronjob at "43"
[08:37:03] <apergos>	 this gets done by cron?
[08:37:11] <mutante>	 which [ -f /var/lib/puppet/state/puppetdlock ] && find /var/lib/puppet/state/puppetdlock -ctime +1 -delete
[08:37:32] <mutante>	 and then you can see in auth.log how root , started by cron ...
[08:38:09] <mutante>	 but i dont know yet how that would do the package upgrade.. but i dont see a manual user login around that time.. and the timestamps fit too well
[08:38:40] <apergos>	 I wonder if the other snaps got upgraded or are liable to break for no reason shortly
[08:39:12] <apergos>	 I have rerun the one incremental step and tweaked the configured delay so that today's job should run
[08:40:03] <mutante>	 mm, how could deleting a puppet lock file cause an apt command
[08:40:10] <mutante>	 lets run puppet manually again now
[08:40:48] <apergos>	 ok
[08:41:04] <apergos>	 you don't see apt in the log file as part of puppet?
[08:41:50] <mutante>	 Mar 12 15:43:34 snapshot3 puppet-agent[14592]: (/Stage[main]/Snapshots::Packages/Package[mysql-client-5.1]/ensure) change from 5.1.53-fb3753-wm1 to 5.1.61-0ubuntu0.10.04.1 failed: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install mysql-client-5.1' returned 100:
[08:41:58] <apergos>	 uh huh
[08:42:00] <mutante>	 puppet did it.. somehow
[08:42:26] <mutante>	 from wm1 to ubuntu .. its that again. our vs. distro
[08:42:51] <mutante>	 but you do have apt preferences saying it should prefer ours
[08:44:03] <mutante>	  (/Stage[main]/Snapshots::Sync/Exec[snapshot-trigger-mw-sync]) Dependency Package[mysql-client-5.1] has failures: true
[08:44:24] <mutante>	 so Exec[snapshot..] has a dependency on that package
[08:44:27] <apergos>	 on snapshot2 I see this:
[08:44:28] <apergos>	 Mar 12 15:24:26 snapshot2 puppet-agent[28349]: (/Stage[main]/Snapshots::Packages/Package[mysql-client-5.1]/ensure) ensure changed '5.1.41-3ubuntu12.10' to '5.1.61-0ubuntu0.10.04.1
[08:45:15] <apergos>	 which therefore "just worked"
[08:46:32] <apergos>	 I don't care which version it uses as long as it upgrades cleanly and nothing breaks
[08:50:10] <mutante>	 apt sources.list and apt preferences are identical
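The failed upgrade above was found by reading /var/log/apt/history.log by hand. A minimal Python sketch of the same check, assuming the log is a series of blank-line-separated entries made of `Key: value` lines as in the pasted excerpt (the function name `failed_transactions` is made up for illustration, and only the three fields quoted above are included in the sample):

```python
# Sketch only: assumes apt history entries are blank-line-separated blocks
# of "Key: value" lines, as in the excerpt pasted in the channel.

SAMPLE = """\
Start-Date: 2012-03-12  15:43:31
Upgrade: libmysqlclient16 (5.1.53-fb3753-wm1, 5.1.61-0ubuntu0.10.04.1), mysql-common (5.1.53-fb3753-wm1, 5.1.61-0ubuntu0.10.04.1), mysql-client-5.1 (5.1.53-fb3753-wm1, 5.1.61-0ubuntu0.10.04.1)
Error: Sub-process /usr/bin/dpkg returned an error code (1)
"""


def failed_transactions(text):
    """Return the history entries (as dicts) that contain an Error field."""
    entries = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current entry.
            if current:
                entries.append(current)
                current = {}
            continue
        # Split on the first colon only; values may contain colons (times).
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        entries.append(current)
    return [entry for entry in entries if "Error" in entry]


if __name__ == "__main__":
    for entry in failed_transactions(SAMPLE):
        print(entry.get("Start-Date"), "->", entry["Error"])
```

Running something like this over the log on each snapshot host would show whether the same upgrade failed elsewhere, which is the question raised about the other snapshot hosts.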
[11:17:28] <nagios-wm>	 PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[11:55:53] <gerrit-wm>	 New patchset: Hashar; "gerrit played ping pong between http / https URL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3148
[11:56:06] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3148
[12:04:06] <nagios-wm>	 PROBLEM - Disk space on search1017 is CRITICAL: DISK CRITICAL - free space: /a 5002 MB (3% inode=99%):
[12:08:27] <nagios-wm>	 PROBLEM - Disk space on search1017 is CRITICAL: DISK CRITICAL - free space: /a 5000 MB (3% inode=99%):
[12:43:14] <nagios-wm>	 PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours
[12:46:32] <nagios-wm>	 PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours
[12:55:43] <hashar>	 apergos: ping ? :-]
[12:56:49] <apergos>	 hashar: pong
[12:57:09] <hashar>	 I would need some root magic to list files on manganese, that is the server hosting gerrit
[12:57:23] <apergos>	 what exactly do you need?
[12:57:27] <hashar>	 I need to hack a script that would list all git repositories hosted there that begins with 'mediawiki'
[12:57:44] <hashar>	 it seems they are held in /var/lib/gerrit2/git   , I would like to have a confirmation :)
[12:57:48] <apergos>	 ah
[12:57:48] <hashar>	 ssh manganese.wikimedia.org /bin/ls -1 /var/lib/gerrit2/git
[12:57:50] <apergos>	 lemme see
[12:58:24] <hashar>	 then I will write a php script that output to the public:    /bin/ls -1   <somepath>  | grep mediawiki
[12:58:24] <apergos>	 nope
[12:58:40] <apergos>	 root@manganese:/var/lib/gerrit2/review_site# ls
[12:58:40] <apergos>	 bin  cache  etc  git  hooks  lib  logs  static  tmp
[12:58:55] <apergos>	 there is no  /var/lib/gerrit2/git
[12:59:07] <hashar>	 maybe that 'git' subdirectory then
[12:59:13] <hashar>	  /var/lib/gerrit2/review_site/git
[12:59:15] <apergos>	  ls git
[12:59:15] <apergos>	 All-Projects.git  analytics  analytics.git  integration  labs  mediawiki  mediawiki.git  operations  test
[12:59:24] <hashar>	 wonderful!
[12:59:29] <apergos>	 ok
[12:59:34] <hashar>	 can you ls  mediawiki/extensions  ?
[13:00:04] <hashar>	 should have something like  /var/lib/gerrit2/review_site/git/mediawiki/extensions/*.git
[13:02:41] <apergos>	 root@manganese:/var/lib/gerrit2/review_site/git/mediawiki/extensions# ls -a
[13:02:42] <apergos>	 .                           ContributionReporting.git       Narayam.git                 SubPageList3.git
[13:02:42] <apergos>	 etc.
[13:02:51] <hashar>	 you rock!
[13:03:29] <nagios-wm>	 PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours
[13:03:51] <apergos>	 thanks
[13:03:51] <RobH>	 !log cp1029-cp1035 all installed and ready for varnish deployment, puppet has been run
[13:03:53] <apergos>	 anything else?
[13:03:55] <morebots>	 Logged the message, RobH
[13:04:14] <hashar>	 apergos: should be good for now
[13:04:20] <hashar>	 I am going to push a gerrit change for review
[13:06:11] <gerrit-wm>	 New patchset: Hashar; "publicly list mediawiki extensions git repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3149
[13:06:23] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3149
[13:10:32] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:10:59] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:28:59] <nagios-wm>	 PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:38:58] <gerrit-wm>	 New patchset: Hashar; "publicly list mediawiki extensions git repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3149
[13:39:10] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3149
[13:39:20] <nagios-wm>	 RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[13:39:37] <gerrit-wm>	 New review: Hashar; "That second patch set makes the change no more dependent on another pending one." [operations/puppet] (production) C: 0;  - https://gerrit.wikimedia.org/r/3149
[13:39:56] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.174 seconds
[13:40:32] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.182 seconds
[13:44:00] <hashar>	 apergos: would you be willing to review some PHP hack I have submitted  https://gerrit.wikimedia.org/r/#change,3149  :-]
[13:44:33] <apergos>	 I could but it would carry very little weight I'm afraid
[13:46:32] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:46:50] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:49:31] <hashar>	 apergos: I will ask RobH  :-]]
[13:50:03] <hashar>	 more seriously, going to hold till some SF friend wakes up
[13:50:04] <RobH>	 asking me to look at php code is like asking a blind person to describe the mona lisa.
[13:50:22] <RobH>	 i know it exists, but i have no idea what i am lookin at.
[13:50:35] <hashar>	 haha
[13:50:54] <hashar>	 which languages are you fluent with?  Beside english and dmesg output ? :)
[13:51:45] <RobH>	 i am ashamed to say
[13:52:05] <RobH>	 <quietly>vb</quiet>
[13:52:30] <RobH>	 before I got the job just before wiki i was employed in a microsoft shop.
[13:52:41] <RobH>	 did .net hacking and sql mining
[13:52:53] <RobH>	 my hands will never be clean.
[13:52:54] <hashar>	 I know someone who has made that his expert field
[13:53:13] <hashar>	 and he does vb hacking as a second job :-]   Pay off very well since real experts are rare
[13:53:14] <RobH>	 I was not a talented programmer.
[13:53:19] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.178 seconds
[13:54:06] <RobH>	 most of my programming was bare minimum to plug in sql queries into applications our customers used with our organization
[13:54:14] <RobH>	 it was SQL heavy and VB light
[13:54:26] <RobH>	 and I don't recall any of my sql knowledge anymore.
[13:54:30] <hashar>	 ahah
[13:54:36] <RobH>	 it was well..... 7 years ago.
[13:54:44] <hashar>	 you should attempt to learn perl. that saves a lot of time when doing sysadmin
[13:54:53] <hashar>	 and I am sure you eventually come back to SQL very easily
[13:54:59] <RobH>	 heh, what about python!
[13:55:06] <hashar>	 or python even better
[13:55:17] <RobH>	 that would make mark_ happier, he is a python fan
[13:58:01] <hashar>	 and I just wrote a script using perl ...
[13:59:37] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:59:46] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.504 seconds
[14:00:49] <hashar>	 RobH: do you have any apache skill ? :-)
[14:01:01] <hashar>	 got a funny mod_rewrite issue for you if so : https://gerrit.wikimedia.org/r/#change,3148
[14:01:54] <RobH>	 yer just forcing https ?
[14:05:03] <hashar>	 it is already forced
[14:05:15] <hashar>	 but accessing   HTTPS  /   we are redirected to HTTP  /
[14:05:21] <RobH>	 yea, thats crappy.
[14:05:30] <RobH>	 and you are fixing that with this patch is what it looks like to me.
[14:05:50] <hashar>	 hmm I am wrong above sorry
[14:06:03] <hashar>	 well the commit message has it right.
[14:06:11] <RobH>	 yea, i see what you mean
[14:06:13] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:06:26] <RobH>	 it does a loop to and back from http url for no reason due to bad rule
[14:06:29] <RobH>	 is how i am reading it
[14:06:39] <RobH>	 though its not really noticeable to most users
[14:06:49] <RobH>	 yes?
[14:06:53] <hashar>	 yup. I suspect the :443 virtual host was copy pasted from the :80 virtualhost
[14:07:07] <RobH>	 indeed, thats the usual culprit, i will review and approve =]
[14:07:08] <hashar>	 probabl
[14:07:27] <hashar>	 probably nobody noticed, the URL bar is just blinking a bit while the redirects are being done
[14:07:33] <gerrit-wm>	 New review: RobH; "Someone fixing apache host files, huzzah!" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3148
[14:07:36] <gerrit-wm>	 Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3148
[14:07:45] <hashar>	 ideally we should use some apache configuration snippets
[14:07:50] <RobH>	 yea i never noticed until now and paid reallllly close attention
[14:08:05] <RobH>	 well, puppet has a way of generating vhost files, mark_ is presently toying with it
[14:08:08] <RobH>	 its not fully working yet
[14:08:21] <RobH>	 but once its done it should standardize all our vhost configurations on the cluster
[14:08:31] <hashar>	 puppet labs has a forge to host puppet "recipes"
[14:09:01] <hashar>	 \o/
[14:09:05] <RobH>	 heh, i guess i need to push your change live eh?
[14:09:29] <hashar>	 if you want, though it is not urgent
[14:09:38] <hashar>	 will be live next time apache restart :-]]]
[14:09:40] <RobH>	 yea but when someone else goes to push later they will see your change
[14:09:44] <RobH>	 in puppet
[14:09:51] <RobH>	 and its confusing to folks cuz they dunno where it came from
[14:10:13] <RobH>	 so i will merge it into production on puppet now
[14:11:49] <RobH>	 and going to force puppet update on gerrit to ensure we didnt break it, i dont think we did but rather break it now than it do it later on auto puppet run
[14:12:51] <RobH>	 hrmm, i wonder if puppet makes apache restart to take the change...
[14:13:11] <RobH>	 hashar: check gerrit for the redirect if ya dont mind, it should be live now (if puppet restarts apache)
[14:13:34] <RobH>	 i imagine you have the curl test running from the comment to test, i just load it in a browser which isnt a proper test
[14:13:44] <hashar>	 checking
[14:14:05] <RobH>	 if its not live, it means puppet doesnt force apache to reload, and i need to do it manually, im just curious if it did
[14:14:14] <hashar>	 I am not sure apache is restarted by puppet
[14:14:18] <hashar>	 we have to subscribe it or something
[14:14:30] <RobH>	 so its not live?
[14:14:33] <hashar>	 nop
[14:14:37] <RobH>	 ok, manual restart apache
[14:14:41] <hashar>	 still the same output for:   curl -sIL http://gerrit.wikimedia.org/ | grep Location
[14:14:44] <RobH>	 ok, try now =]
[14:14:54] <hashar>	 $ curl -sIL http://gerrit.wikimedia.org/ | grep Location
[14:14:54] <hashar>	 Location: https://gerrit.wikimedia.org/
[14:14:54] <hashar>	 Location: https://gerrit.wikimedia.org/r/
[14:14:56] <hashar>	 well done!!!
[14:15:00] <hashar>	 one less redirect!
[14:15:03] <RobH>	 thx for fixing it =]
[14:15:18] <gerrit-wm>	 New review: Hashar; "Yeah one less redirect!!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3148
[14:15:47] <hashar>	 a puppet template would be to lint the apache configuration and automatically reload apache
[14:16:02] <hashar>	 thanks for applying the patch !!
[14:16:09] <RobH>	 glad to help
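The check used above was `curl -sIL http://gerrit.wikimedia.org/ | grep Location`, read by eye. A small Python sketch of the same idea, assuming the header dump is available as text: it pulls out the Location values in order and flags a chain that drops back from https to http. The function names are made up for illustration, and the sample is the post-fix output pasted in the channel.

```python
# Sketch only: works on text captured from `curl -sIL <url>`; it does not
# make any network request itself.

AFTER_FIX = """\
Location: https://gerrit.wikimedia.org/
Location: https://gerrit.wikimedia.org/r/
"""


def redirect_chain(header_dump):
    """Return the Location header values, in the order they appear."""
    chain = []
    for line in header_dump.splitlines():
        # Split on the first colon only, so the URL's own colons survive.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "location":
            chain.append(value.strip())
    return chain


def downgrades_to_http(chain):
    """True if some hop goes to http:// after an earlier https:// hop."""
    seen_https = False
    for url in chain:
        if url.startswith("https://"):
            seen_https = True
        elif url.startswith("http://") and seen_https:
            return True
    return False


if __name__ == "__main__":
    chain = redirect_chain(AFTER_FIX)
    print(len(chain), "redirects; bounces back to http:", downgrades_to_http(chain))
```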
[14:18:46] <RobH>	 shit its already 1018
[14:18:50] <RobH>	 where is the day going =P
[14:19:08] <RobH>	 i even started early today
[14:22:41] <hashar>	 been awake at 6am, and it is 3pm already :-/
[14:25:00] <RobH>	 i feel your pain
[14:25:10] <RobH>	 i may need to take a nap for lunch
[14:25:31] <RobH>	 though the weather is absolutely stunning today, i may go for a walk instead.
[14:25:42] <RobH>	 which is the opposite of napping, but the weather is just that awesome today.
[14:27:53] <RobH>	 i just noticed from chris's irc quit he has fios
[14:27:56] <RobH>	 i am jealous.
[14:31:18] <RobH>	 ahh man, i was wondering why i couldnt hit my labs instance, i have to apply the webserver port rules...
[14:32:16] <RobH>	 can we only apply a security group to an instance at creation?
[14:33:04] <RobH>	 grr, yes, how annoying.
[14:33:12] <RobH>	 there goes all my work in my lab instance, oh well.
[14:33:31] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.087 seconds
[14:43:00] <RobH>	 hrmm
[14:43:11] <RobH>	 i fubar'd up my labs, robh1 is resolving to an IP that the instance isnt running
[14:43:13] <RobH>	 what the hell.
[14:46:07] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:54:13] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.621 seconds
[14:54:22] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.960 seconds
[15:00:58] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:01:07] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:01:34] <nagios-wm>	 PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.45025310924 (gt 8.0)
[15:04:46] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.883 seconds
[15:04:50] <hashar>	 RobH: it is sunny outside too
[15:04:58] <RobH>	 i missed the sun
[15:05:01] <hashar>	 RobH: I am out to read something
[15:05:13] <RobH>	 have a nice time =]
[15:08:35] <hashar>	 danke!
[15:11:04] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:19] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.683 seconds
[15:34:10] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.766 seconds
[15:34:37] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:38:04] <nagios-wm>	 PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 14.3817131667 (gt 8.0)
[15:43:28] <nagios-wm>	 PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
[15:46:46] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:04:53] <maplebed>	 do we have a wiki page on how to fix broken ssh host keys after reinstallation?  (broken from the client's perspective - i.e. on bast1001 it doesn't like ms-be5's new key)
[16:06:06] <maplebed>	 apergos or Jeff_Green or RobH maybe you know?
[16:06:07] <maplebed>	 :)
[16:06:22] <apergos>	 oh
[16:06:33] <apergos>	 you sorted out the cert and stuff right?
[16:06:36] <RobH>	 doesnt puppet copy over all the keys?
[16:06:47] <maplebed>	 the host itself is happy, but everyone else still has the old key
[16:06:55] <apergos>	 it's complaining because of /etc/something or /root/.ssh/known_hosts or whatever?
[16:07:06] <Jeff_Green>	 i'm not aware of such a page
[16:07:10] <apergos>	 well maybe it's wrong but I edit the file by hand
[16:07:28] <maplebed>	 yeah, I do that too, but I figured there was probably a "right" way to do it.
[16:07:41] <apergos>	 I have never heard of one
[16:08:08] <maplebed>	 ah well.
[16:08:12] <Jeff_Green>	 i've always edited the offending known_hosts by hand, but I believe there is a ssh-* command to remove a specific key
[16:08:25] <Jeff_Green>	 i don't believe it matters which way you do it
[16:08:30] <maplebed>	 oh yeah, I'd forgotten about that.
[16:08:34] <apergos>	 sure, but I guess you are looking to ddsh across all hosts or something
[16:08:50] <apergos>	 I doubt there is any such script
[16:08:55] <maplebed>	 it's easy enough to do it by hand with 'vi known_hosts +1623' (or whatever line it's at)
[16:08:57] <apergos>	 maybe there should be
[16:11:03] <Jeff_Green>	 re. puppet--from what I've seen our config will add keys but not remove them
[16:12:26] <maplebed>	 huh.  I hadn't seen ssh-copy-id before.
[16:12:34] <maplebed>	 (as I'm scanning ssh man pages)
[16:12:45] <apergos>	 :-)
[16:19:57] <RobH>	 !log updating dns for new domain wikimediacommons.pt (nameservers not yet pointed at us)
[16:20:00] <morebots>	 Logged the message, RobH
[16:21:23] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.181 seconds
[16:27:56] <maplebed>	 function clean_ssh_key() { linenumber=$1; echo -e "${linenumber}d\nw" | ed ~/.ssh/known_hosts 2> /dev/null; echo -e "${linenumber}d\nw" | ed /etc/ssh/ssh_known_hosts 2> /dev/null; }
[16:27:59] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:28:10] <maplebed>	 not exactly elegant, but hey.
[16:28:50] <apergos>	 that relies on knowing the line number and it being the same on all hosts
[16:29:10] <apergos>	 I will bet dollars to donuts (I think I've never used that expression before!) that the line number varies
[16:29:23] <maplebed>	 it's only for localhost, and the ssh error spits out the line number.
[16:29:26] <apergos>	 ok
[16:29:34] <apergos>	 oh I see, a little scriptlet
[16:29:46] <apergos>	 I was thinking of a toll one would run to clean up the whole cluster
[16:29:46] <maplebed>	 i.e. you try and ssh somewhere, it says SCREWU! and you run that to clean it out.
[16:29:49] <apergos>	 *boom*
[16:30:00] <apergos>	 *tool
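The concern raised above is that line numbers differ between hosts, so a cluster-wide cleanup would have to match on hostname instead. A minimal Python sketch of that approach, assuming plain (unhashed) known_hosts entries of the form `host1,host2 keytype key`; the function name and the sample keys are made up for illustration. OpenSSH's `ssh-keygen -R hostname` does the same job for a single file and is probably the ssh-* command mentioned earlier.

```python
# Sketch only: handles plain hostnames in the first field. Hashed entries
# ("|1|..."), marker lines ("@cert-authority", "@revoked") and "[host]:port"
# forms are left untouched rather than guessed at.


def drop_host(known_hosts_text, hostname):
    """Return known_hosts_text without the entries that list hostname."""
    kept = []
    for line in known_hosts_text.splitlines(keepends=True):
        fields = line.split()
        if fields and not fields[0].startswith(("#", "@", "|")):
            # The first field is a comma-separated list of names/addresses.
            if hostname in fields[0].split(","):
                continue
        kept.append(line)
    return "".join(kept)


if __name__ == "__main__":
    # Made-up entries; real keys are much longer.
    sample = (
        "ms-be5,10.0.0.5 ssh-rsa AAAAexamplekey1\n"
        "bast1001 ssh-rsa AAAAexamplekey2\n"
    )
    print(drop_host(sample, "ms-be5"), end="")
```

Matching by name rather than line number means the same call could be run against every host's file without first triggering the ssh warning to learn the line.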
[16:30:23] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.722 seconds
[16:31:23] <maplebed>	 hrmph.  neither the new_install key nor my key works on ms-be5.
[16:32:00] <Jeff_Green>	 well that sounds permissions-y no?
[16:32:08] <maplebed>	 yeah, could be.
[16:32:30] <Jeff_Green>	 I'm gonna presumptively blame puppet
[16:32:36] <Jeff_Green>	 just because.
[16:33:34] <maplebed>	 hmm.  no login prompt on the console either.
[16:33:58] <maplebed>	 hey cmjohnson1 - in what state did you leave ms-be5?
[16:35:23] <cmjohnson1>	 maplebed: i don't remember...let me plug in the cart
[16:35:32] <maplebed>	 thanks.
[16:40:44] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:40:45] <cmjohnson1>	 maplebed: i left the testing software running...rebooting now
[16:40:53] <maplebed>	 cool.  tnx.
[16:41:29] <nagios-wm>	 PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100%
[16:47:38] <nagios-wm>	 RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 1.28 ms
[16:47:43] <cmjohnson1>	 maplebed:  good to go
[16:47:56] <nagios-wm>	 RECOVERY - DPKG on ms-be5 is OK: All packages OK
[16:48:14] <nagios-wm>	 RECOVERY - Disk space on ms-be5 is OK: DISK OK
[16:48:50] <nagios-wm>	 RECOVERY - RAID on ms-be5 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[16:50:56] <nagios-wm>	 RECOVERY - Puppet freshness on ms-be5 is OK: puppet ran at Wed Mar 14 16:50:46 UTC 2012
[17:08:18] * cmjohnson1  is moving to pmtpa
[17:11:29] <nagios-wm>	 RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.3287510084
[17:18:41] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.751 seconds
[17:24:05] <nagios-wm>	 PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[17:25:08] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:25:44] <nagios-wm>	 PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.79999766667 (gt 8.0)
[17:26:02] <nagios-wm>	 PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[17:31:53] <nagios-wm>	 RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.57810478992
[17:33:05] <nagios-wm>	 PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[17:33:05] <nagios-wm>	 PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[17:34:01] <nagios-wm>	 PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100%
[17:34:46] <gerrit-wm>	 New patchset: Lcarr; "Allowing icinga in sudoers as with nagios + group gammu" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3153
[17:34:58] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3153
[17:35:40] <nagios-wm>	 RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[17:35:45] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3153
[17:35:48] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3153
[17:40:46] <nagios-wm>	 PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.6636874167 (gt 8.0)
[17:42:52] <nagios-wm>	 RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.85144512605
[17:49:55] <gerrit-wm>	 New patchset: Bhartshorne; "bumping up the number of replicator processes running on swift storage bricks to improve time to balance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3154
[17:50:07] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3154
[17:50:55] <gerrit-wm>	 New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3154
[17:50:58] <gerrit-wm>	 Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3154
[17:51:43] <nagios-wm>	 PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[18:16:25] <gerrit-wm>	 New patchset: Lcarr; "pushing http to http /icinga and https to https /icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3155
[18:16:38] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3155
[18:17:11] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3155
[18:17:14] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3155
[18:18:07] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:21:12] <gerrit-wm>	 New patchset: Bhartshorne; "dropping down to 2 from 4.  put latency increased too much." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3156
[18:21:25] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3156
[18:21:33] <gerrit-wm>	 New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3156
[18:21:36] <gerrit-wm>	 Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3156
[18:23:13] <nagios-wm>	 PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:23:49] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:24:16] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:24:43] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.623 seconds
[18:26:13] <nagios-wm>	 PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.165386 (gt 8.0)
[18:31:01] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:32:31] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:35:49] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.601 seconds
[18:36:34] <nagios-wm>	 PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 20.0373309167 (gt 8.0)
[18:41:24] <gerrit-wm>	 New patchset: Asher; "db18,19 decom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3157
[18:41:36] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3157
[18:42:16] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:42:52] <nagios-wm>	 RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.51411680672
[18:49:37] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.532 seconds
[18:55:15] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:56:54] <gerrit-wm>	 New patchset: Lcarr; "making sure conf.d exists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3158
[18:57:07] <gerrit-wm>	 New patchset: Lcarr; "more icinga tweaks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3159
[18:57:12] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.112 seconds
[18:57:19] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3158
[18:57:19] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3158
[18:57:19] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3159
[18:57:20] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3158
[18:57:35] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3159
[18:57:37] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3159
[18:58:36] <maplebed>	 !log ms-be5 is back in rotation
[18:58:39] <morebots>	 Logged the message, Master
[18:59:26] <maplebed>	 RobH: ms-be1 is reporting a different number of CPUs than the rest of the ms-be hosts.  Do you know if that's hyperthreading or an actual difference?  http://ganglia.wikimedia.org/latest/?r=20min&cs=&ce=&m=load_report&s=by+name&c=Swift+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[18:59:51] <RobH>	 hyperthreading
[18:59:59] <RobH>	 they are identical hosts and cpus
[19:00:04] <maplebed>	 k.
[19:00:51] <RobH>	 so unless swift can make use of HT we should reboot and turn it off on them.  (Or schedule to do so when convenient)
[19:01:02] <RobH>	 since it tends to give a false impression, imho
[19:01:29] <RobH>	 but i dunno, can swift do anything with hyperthreading that is beneficial?
[19:01:35] <RobH>	 or does it really not matter
[19:01:42] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.146 seconds
[19:03:04] <maplebed>	 well, all I know is that ms-be1 is hurting much worse than the others.
[19:03:20] <maplebed>	 I don't think it has anything to do with hyperthreading, but it is screwing up my config
[19:03:30] <maplebed>	 (there's a part of the swift config that says numworkers == numcpus)
[19:03:43] <maplebed>	 so ms-be1 only has half the workers that the rest do.
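The arithmetic behind "half the workers" can be sketched in Python. This is an illustration of the reasoning only, not Swift's actual configuration code: the function names and the core count are made up, and it assumes two hardware threads per core when hyperthreading is on.

```python
# Sketch only: models "numworkers == numcpus" where numcpus is the number
# of logical CPUs the OS reports. Core counts below are hypothetical.


def logical_cpus(physical_cores, hyperthreading):
    """Logical CPUs seen by the OS, assuming 2 threads per core with HT."""
    return physical_cores * 2 if hyperthreading else physical_cores


def workers_for(physical_cores, hyperthreading):
    """Worker count under a 'one worker per reported CPU' rule."""
    return logical_cpus(physical_cores, hyperthreading)


if __name__ == "__main__":
    cores = 8  # hypothetical; the log does not give the real core count
    print("HT off:", workers_for(cores, False), "workers")
    print("HT on: ", workers_for(cores, True), "workers")
```

With identical hardware, the host with hyperthreading disabled reports half as many CPUs and so gets half as many workers, which matches what is described for ms-be1.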
[19:03:57] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:06:30] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 335 seconds
[19:06:39] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 344 seconds
[19:15:48] <nagios-wm>	 PROBLEM - DPKG on db58 is CRITICAL: Connection refused by host
[19:16:15] <nagios-wm>	 PROBLEM - Disk space on db58 is CRITICAL: Connection refused by host
[19:16:33] <nagios-wm>	 PROBLEM - MySQL disk space on db58 is CRITICAL: Connection refused by host
[19:17:27] <nagios-wm>	 PROBLEM - RAID on db58 is CRITICAL: Connection refused by host
[19:17:36] <nagios-wm>	 PROBLEM - SSH on db58 is CRITICAL: Connection refused
[19:18:21] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:21:39] <nagios-wm>	 RECOVERY - RAID on db58 is OK: OK: State is Optimal, checked 12 logical device(s)
[19:21:48] <nagios-wm>	 RECOVERY - SSH on db58 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[19:22:15] <nagios-wm>	 RECOVERY - DPKG on db58 is OK: All packages OK
[19:22:33] <nagios-wm>	 RECOVERY - Disk space on db58 is OK: DISK OK
[19:22:51] <nagios-wm>	 RECOVERY - MySQL disk space on db58 is OK: DISK OK
[19:28:15] <nagios-wm>	 PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:29:54] <maplebed>	 !log rebooting ms-be1 to enable hyperthreading (and make it the same as all the other ms-be hosts)
[19:29:58] <morebots>	 Logged the message, Master
[19:30:57] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.885 seconds
[19:30:57] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.729 seconds
[19:32:27] <nagios-wm>	 PROBLEM - Host ms-be1 is DOWN: PING CRITICAL - Packet loss = 100%
[19:37:15] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:37:15] <nagios-wm>	 RECOVERY - Host ms-be1 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[19:37:24] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:45:20] <gerrit-wm>	 New patchset: Lcarr; "more icinga fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3160
[19:45:32] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3160
[19:46:26] <nosy>	 hello
[19:46:31] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3160
[19:46:34] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3160
[19:47:00] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:47:27] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:47:36] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.531 seconds
[19:47:36] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.137 seconds
[19:53:54] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:55:46] <gerrit-wm>	 New patchset: Lcarr; "another apache updated" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3161
[19:55:58] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3161
[19:56:00] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3161
[19:56:02] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3161
[20:00:03] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:06:10] <gerrit-wm>	 New patchset: Lcarr; "making check all spelled out for purging nagios resources" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3162
[20:06:18] <gerrit-wm>	 New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3162
[20:07:06] <gerrit-wm>	 New patchset: Lcarr; "making check all spelled out for purging nagios resources" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3162
[20:07:18] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3162
[20:07:26] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3162
[20:07:29] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3162
[20:07:37] <hexmode>	 woosters: think this is pretty trivial https://rt.wikimedia.org/Ticket/Display.html?id=2631
[20:07:48] <hexmode>	 but would help wikisource :)
[20:08:04] <woosters>	 let me take a look and will get back to u
[20:08:09] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:09:04] <hexmode>	 woosters: tyvm
[20:12:39] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.953 seconds
[20:17:38] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:21:32] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.447 seconds
[20:21:50] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.018 seconds
[20:24:42] <gerrit-wm>	 New patchset: Lcarr; "Remove default icinga conf file from apache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3163
[20:24:54] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3163
[20:25:20] <gerrit-wm>	 New patchset: Lcarr; "Remove default icinga conf file from apache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3163
[20:25:32] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3163
[20:25:42] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3163
[20:25:44] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3163
[20:29:57] <Jamesofur>	 looks like we had a site bump (bunch of people complained and then back up)
[20:30:32] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 186 seconds
[20:31:08] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db1017 is CRITICAL: CRIT replication delay 221 seconds
[20:31:08] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 221 seconds
[20:31:08] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 221 seconds
[20:31:08] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 221 seconds
[20:31:26] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 240 seconds
[20:31:53] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 265 seconds
[20:31:53] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db52 is CRITICAL: CRIT replication delay 265 seconds
[20:32:11] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 285 seconds
[20:33:56] <binasher>	 ^^^ is ok now
[20:33:59] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db38 is CRITICAL: CRIT replication delay 391 seconds
[20:34:17] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:34:44] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 0 seconds
[20:35:11] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db1017 is OK: OK replication delay 0 seconds
[20:35:11] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 0 seconds
[20:35:11] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay 0 seconds
[20:35:11] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds
[20:35:38] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds
[20:35:56] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db52 is OK: OK replication delay 0 seconds
[20:35:56] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[20:35:56] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db38 is OK: OK replication delay 0 seconds
[20:36:14] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.614 seconds
[20:36:23] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[20:36:23] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:39:16] <gerrit-wm>	 New patchset: Asher; "disabling log_queries_not_using_indexes for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3164
[20:39:29] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3164
[20:41:00] <binasher>	 !log disabled log_queries_not_using_indexes on all core dbs
[20:41:03] <morebots>	 Logged the message, Master
[20:41:58] <gerrit-wm>	 New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3130
[20:42:01] <gerrit-wm>	 Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3130
[20:42:18] <gerrit-wm>	 New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3157
[20:42:21] <gerrit-wm>	 Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3157
[20:42:32] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:44:29] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.117 seconds
[20:45:29] <gerrit-wm>	 New patchset: Asher; "disabling log_queries_not_using_indexes for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3164
[20:45:41] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3164
[20:46:01] <gerrit-wm>	 New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3164
[20:46:03] <gerrit-wm>	 Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3164
[20:50:56] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:51:42] <gerrit-wm>	 New review: Ryan Lane; "Let's make this a cron that generates a static file, so that we don't need to include php on the ger..." [operations/puppet] (production); V: 0 C: 0;  - https://gerrit.wikimedia.org/r/3149
[20:57:14] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.924 seconds
[21:03:23] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 2.372 seconds
[21:09:32] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:09:41] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:13:06] <gerrit-wm>	 New patchset: Hashar; "publicly list mediawiki extensions git repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3149
[21:13:18] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3149
[21:18:50] <gerrit-wm>	 New patchset: Hashar; "publicly list mediawiki extensions git repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3149
[21:18:59] <nagios-wm>	 PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[21:19:02] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3149
[21:20:51] <gerrit-wm>	 New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3149
[21:20:54] <gerrit-wm>	 Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3149
[21:24:05] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.406 seconds
[21:24:05] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.122 seconds
[21:25:24] <Reedy>	 Anyone familiar with xinetd?
[21:29:40] <hashar>	 Reedy: haven't used it for years and years sorry :-(
[21:30:10] <Reedy>	 Just trying to finish replicating how we have the extdist remote client setup
[21:30:31] <Reedy>	 xinetd says the service is running, but there's nothing listening on the right port, and hence, can't telnet into it
[21:32:14] <AaronSchulz>	 TimStarling: hello
[21:32:20] <TimStarling>	 hi
[21:36:32] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:36:32] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:38:46] <gerrit-wm>	 New patchset: Ryan Lane; "Upping tools again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3165
[21:38:58] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3165
[21:40:06] <gerrit-wm>	 New patchset: Lcarr; "fixing ordering for icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3166
[21:40:18] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3166
[21:40:49] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3166
[21:40:52] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3166
[21:40:53] <gerrit-wm>	 New patchset: Ryan Lane; "Adding docroot for 443 on gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3167
[21:41:06] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3167
[21:41:10] <gerrit-wm>	 New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3167
[21:41:25] <gerrit-wm>	 New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3165
[21:41:27] <gerrit-wm>	 Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3167
[21:41:28] <gerrit-wm>	 Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3165
[21:46:22] <gerrit-wm>	 New patchset: Ryan Lane; "Changing ls one-liner to only output extension names, rather than directory contents" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3168
[21:46:32] <AaronSchulz>	 Error undeleting file: Could not connect to storage backend "swift-local-backend-copper".
[21:46:34] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3168
[21:46:35] * AaronSchulz  grumbles
[21:47:27] <gerrit-wm>	 New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3168
[21:47:29] <gerrit-wm>	 Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3168
[21:49:27] <gerrit-wm>	 New patchset: Bhartshorne; "trying to puppetize a cronjob to run the swift cleaner changed swiftcleanermanager to only allow one instance at a time (using a pidfile)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3134
[21:49:39] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3134
[21:52:10] <gerrit-wm>	 New patchset: Lcarr; "removing icinga default conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3169
[21:52:22] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3169
[21:52:40] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3169
[21:52:43] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3169
[21:53:37] <gerrit-wm>	 New patchset: Bhartshorne; "trying to puppetize a cronjob to run the swift cleaner changed swiftcleanermanager to only allow one instance at a time (using a pidfile)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3134
[21:53:49] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3134
[21:54:28] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.226 seconds
[21:58:47] <gerrit-wm>	 New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3134
[21:58:49] <gerrit-wm>	 Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3134
[22:00:55] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:02:24] <gerrit-wm>	 New patchset: Bhartshorne; "installing the swift cleaner on iron" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3170
[22:02:36] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3170
[22:02:40] <gerrit-wm>	 New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3170
[22:02:43] <gerrit-wm>	 Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3170
[22:09:10] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.817 seconds
[22:09:19] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.094 seconds
[22:10:37] <gerrit-wm>	 New patchset: Bhartshorne; "correcting variable scope in template file, correcting path to conf file in cron invocation." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3171
[22:10:49] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3171
[22:11:07] <gerrit-wm>	 New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3171
[22:11:10] <gerrit-wm>	 Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3171
[22:21:29] <gerrit-wm>	 New patchset: Bhartshorne; "trying with local scope" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3172
[22:21:41] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3172
[22:22:19] <gerrit-wm>	 New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3172
[22:22:22] <gerrit-wm>	 Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3172
[22:38:25] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:38:34] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:42:37] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.622 seconds
[22:44:25] <nagios-wm>	 PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours
[22:48:28] <nagios-wm>	 PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours
[22:48:46] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:50:43] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.565 seconds
[22:51:01] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.418 seconds
[22:55:39] <gerrit-wm>	 New patchset: Bhartshorne; "added option to ignore previous state when running swiftcleaner, fixed bug in pidfile detection" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3173
[22:55:52] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3173
[22:56:19] <gerrit-wm>	 New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3173
[22:56:22] <gerrit-wm>	 Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3173
[22:57:10] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:57:10] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:05:25] <nagios-wm>	 PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours
[23:11:17] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:11:35] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:13:23] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.334 seconds
[23:13:32] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.164 seconds
[23:14:17] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds
[23:14:26] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds
[23:16:52] <maplebed>	 !log installed the swiftcleaner to run daily from iron.  see root's crontab for more info.
[23:16:55] <morebots>	 Logged the message, Master
[23:22:14] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:22:14] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:31:50] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:32:08] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:34:05] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.726 seconds
[23:37:50] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds
[23:42:38] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:45:20] <gerrit-wm>	 New patchset: Bhartshorne; "correcting scrubstate logic" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3177
[23:45:32] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3177
[23:46:30] <gerrit-wm>	 New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3177
[23:46:33] <gerrit-wm>	 Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3177
[23:48:29] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:48:29] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:50:26] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.631 seconds
[23:50:26] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.636 seconds