[00:06:03] <gerrit-wm>	 New patchset: Pyoungmeister; "assigning more stuff to fake hosts to make the catch-all term the same as in pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3178
[00:06:15] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3178
[00:08:09] <gerrit-wm>	 New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3178
[00:08:11] <gerrit-wm>	 Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3178
[00:17:08] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:31:41] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.922 seconds
[00:35:35] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:38:19] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.128 seconds
[00:40:07] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:44:01] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.861 seconds
[00:52:25] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:56:28] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.098 seconds
[00:59:28] <nagios-wm>	 PROBLEM - Lucene on search1015 is CRITICAL: Connection refused
[01:02:46] <nagios-wm>	 RECOVERY - RAID on srv197 is OK: OK: no RAID installed
[01:03:19] <mutante>	 !log fixing nrpe "unable to read output" raid check on srv197,207,243,244,253.. (nrpe running as wrong user)
[01:03:22] <morebots>	 Logged the message, Master
[01:05:19] <nagios-wm>	 RECOVERY - RAID on srv243 is OK: OK: no RAID installed
[01:06:52] <gerrit-wm>	 New patchset: Ryan Lane; "Fixing apache config for gerrit to work with labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3179
[01:07:04] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3179
[01:07:42] <gerrit-wm>	 New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3179
[01:07:44] <gerrit-wm>	 Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3179
[01:10:33] <gerrit-wm>	 New patchset: Ryan Lane; "Revert "Fixing apache config for gerrit to work with labs"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3180
[01:10:45] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3180
[01:10:49] <gerrit-wm>	 New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3180
[01:10:52] <gerrit-wm>	 Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3180
[01:17:28] <nagios-wm>	 PROBLEM - HTTP on singer is CRITICAL: Connection refused
[01:18:38] <mutante>	 ops, singer,, whats up, checkin
[01:19:43] <Ryan_Lane>	 rc  libapache2-mod-php5
[01:19:49] <RoanKattouw>	 Same thing
[01:19:49] <mutante>	 !log planet down - apache on singer, syntax error in site config "Invalid command 'php_admin_flag'"
[01:19:51] <Ryan_Lane>	 seems puppet forced an upgrade of something
[01:19:53] <morebots>	 Logged the message, Master
[01:20:12] <Ryan_Lane>	 libapache2-mod-php5: Conflicts: libapache2-mod-php5filter but 5.3.2-2wm1 is to be installed
[01:21:05] <mutante>	 The following extra packages will be installed: apache2-mpm-prefork
[01:21:09] <mutante>	 The following packages will be REMOVED: apache2-mpm-worker
[01:21:11] <mutante>	 hrmm
[01:21:18] <mutante>	 guess we dont care for planet
[01:22:39] <mutante>	 !log planet back up (installed libapache2-mod-php5 which installed apache2-mpm-prefork and removed apache2-mpm-worker)
[01:22:42] <morebots>	 Logged the message, Master
[01:23:30] <mutante>	 yesterday on snapshot3 something similar happened with mysql-client packages, puppet ran and then conflicts between wmf and ubuntu packages
[01:23:37] <nagios-wm>	 RECOVERY - HTTP on singer is OK: HTTP OK - HTTP/1.1 302 Found - 0.004 second response time
[01:25:16] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:25:34] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:26:12] <mutante>	 <-- and this ekrem thing is caused by AppleDictionaryService
[01:26:25] <Ryan_Lane>	 !log labsconsole was missing libapache2-mod-php5. puppet must have tried to upgrade a package unsuccessfully
[01:26:28] <morebots>	 Logged the message, Master
[01:27:13] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.382 seconds
[01:27:58] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:29:18] <mutante>	  /Webserver::Php5/Package[apache2]/ensure) ensure changed '2.2.14-5ubuntu8.8' to '2.2.14-5ubuntu8.9'
[01:29:56] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1843 bytes in 8.417 seconds
[01:36:04] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:36:22] <Ryan_Lane>	 that would do it
[01:36:33] <Ryan_Lane>	 why it would fail to update the dependencies, I have no clue
[01:39:40] <nagios-wm>	 PROBLEM - MySQL Slave Running on db1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:39:49] <mutante>	 it also conflicted with "libapache2-mod-php5filter" before, so the new part would have to be "but 5.3.2-2wm1 is to be installed" .. but also we did not change any of the apt preferences afaik.. so hmmmm
[01:42:13] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.982 seconds
[01:45:04] <nagios-wm>	 PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
[01:49:52] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:51:49] <nagios-wm>	 PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:00:04] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.529 seconds
[02:06:58] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:10:25] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:22:40] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:22:58] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:43:04] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.093 seconds
[02:49:13] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:52:49] <nagios-wm>	 PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:53:34] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:03:55] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.103 seconds
[03:09:20] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:15:02] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.747 seconds
[03:15:29] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.030 seconds
[03:21:29] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:21:56] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:25:59] <nagios-wm>	 PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[03:25:59] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.751 seconds
[03:27:56] <nagios-wm>	 PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[03:28:05] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:28:05] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:31:50] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.460 seconds
[03:34:59] <nagios-wm>	 PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[03:34:59] <nagios-wm>	 PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[03:45:02] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:49:05] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.821 seconds
[03:52:59] <nagios-wm>	 PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[04:01:31] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:03:46] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 5.822 seconds
[04:10:13] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:10:13] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:12:10] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 0.021 seconds
[04:12:10] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.008 seconds
[04:26:52] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:26:52] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:33:01] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.448 seconds
[04:33:01] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.455 seconds
[04:43:04] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:43:13] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:43:13] <nagios-wm>	 PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:55:58] <gerrit-wm>	 New patchset: Dzahn; "allow virt[1-5] subnet to access spence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3186
[04:56:10] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3186
[04:57:00] <gerrit-wm>	 New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3186
[04:57:03] <gerrit-wm>	 Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3186
[04:59:37] <gerrit-wm>	 New patchset: Dzahn; "comment out Swift HTTP monitoring on non-production hosts again, this used to work for a day now they socket timeout and they are not in production anyways" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3187
[04:59:49] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3187
[05:01:46] <gerrit-wm>	 New patchset: Dzahn; "comment out Swift HTTP monitoring on non-production hosts again, this used to work for a day now they socket timeout, so i expect they have been stopped deliberately" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3187
[05:01:58] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3187
[05:02:25] <gerrit-wm>	 New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3187
[05:02:28] <gerrit-wm>	 Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3187
[05:07:02] <jeremyb>	 mutante: http://nagios.wikimedia.org/nagios/cgi-bin/history.cgi?host=copper&service=Swift+HTTP is a little more info than you get from the notifications fwiw
[05:08:13] <mutante>	 jeremyb: waah, you are using history.cgi :O :)
[05:08:22] <mutante>	 jeremyb: the evil script.. heeh
[05:08:40] <jeremyb>	 i had no idea it had such a reputation...
[05:09:20] <jeremyb>	 mutante: so... what's the deal with viewvc?
[05:10:04] <mutante>	 jeremyb: well, in the past sometimes spence got all overloaded and then we could resolve that by killing instances of history.cgi
[05:10:21] <jeremyb>	 oh. i wasn't a bot...
[05:10:35] <jeremyb>	 that does ring a bell
[05:11:19] <mutante>	 jeremyb: thanks, i know in this case though (about the Swift HTTP) .. those are not the production hosts
[05:11:40] <mutante>	 so testing stuff should not be in nagios anyways
[05:11:56] <mutante>	 about viewvc, i dont have news
[05:12:40] <jeremyb>	 yeah, i figured. just thought i'd point it out because the commit msg mentioned the timeout
[05:13:11] <mutante>	 yep, no worries
[05:13:48] <mutante>	 viewvc is just not being included on the host
[05:14:07] <mutante>	 but someone needs to look at the structure of the svn.pp and the subclasses closely
[05:15:03] <mutante>	 jeremyb: btw and unrelated, do you have a labs account?
[05:15:16] * jeremyb  is too tired to think straight about viewvc
[05:15:18] <jeremyb>	 yes
[05:15:33] <mutante>	 could you try to log on to the labs bastion host ?
[05:15:54] <mutante>	 cause i have issues with it currently
[05:16:03] <jeremyb>	 $ ssh bastion1.pmtpa.wmflabs echo foo
[05:16:03] <jeremyb>	 If you are having access problems, please see: https://labsconsole.wikimedia.org/wiki/Access#Accessing_public_and_private_instances
[05:16:06] <jeremyb>	 foo
[05:16:15] <jeremyb>	 mutante: tell me more
[05:16:41] <mutante>	 If you are having access problems, please see: https://labsconsole.wikimedia.org/wiki/Access#Accessing_public_and_private_instances
[05:16:45] <mutante>	 Connection closed by 208.80.153.194
[05:16:53] <jeremyb>	 when you do what?
[05:17:10] <mutante>	 ssh to bastion.wmflabs.org
[05:17:25] <mutante>	 like i did dozens of times before
[05:17:45] <mutante>	 lets talk in labs :)
[05:20:30] <gerrit-wm>	 New patchset: Asher; "vcl_config, not vcl_options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3188
[05:20:43] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3188
[05:22:37] <gerrit-wm>	 New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3188
[05:22:40] <gerrit-wm>	 Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3188
[05:26:59] <nagios-wm>	 RECOVERY - Puppet freshness on professor is OK: puppet ran at Thu Mar 15 05:26:47 UTC 2012
[05:28:05] <mutante>	 :)
[05:30:00] <gerrit-wm>	 New patchset: Asher; "removing misc::udpprofile::collector from spence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3189
[05:30:13] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3189
[05:30:22] <gerrit-wm>	 New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3189
[05:30:24] <gerrit-wm>	 Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3189
[05:52:20] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:01:17] <nagios-wm>	 RECOVERY - Puppet freshness on virt2 is OK: puppet ran at Thu Mar 15 06:01:10 UTC 2012
[06:04:44] <nagios-wm>	 RECOVERY - Puppet freshness on virt3 is OK: puppet ran at Thu Mar 15 06:04:20 UTC 2012
[06:10:17] <nagios-wm>	 RECOVERY - Puppet freshness on virt4 is OK: puppet ran at Thu Mar 15 06:10:11 UTC 2012
[06:11:24] <mutante>	 :)²
[06:23:56] <nagios-wm>	 PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:01:48] <mutante>	 !log upgrading apache and apt on hume
[07:01:54] <morebots>	 Logged the message, Master
[07:18:05] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:20:47] <nagios-wm>	 PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[07:47:38] <nagios-wm>	 RECOVERY - DPKG on hume is OK: All packages OK
[07:51:51] <mutante>	 !log messed with /var/lib/dpkg/status on hume to fix broken packages/remove "marked for purging" on libmysql-php5 without removing a ton of other packages, rather hackish but seems fine anyways, like not broken anymore on simulated dist-upgrade etc
[07:51:55] <morebots>	 Logged the message, Master
[07:57:23] <nagios-wm>	 PROBLEM - Lucene on search9 is CRITICAL: Connection timed out
[07:59:20] <nagios-wm>	 PROBLEM - Lucene on search3 is CRITICAL: Connection timed out
[08:02:19] <nagios-wm>	 PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[08:06:13] <nagios-wm>	 RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123
[08:12:40] <nagios-wm>	 PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[08:12:52] <mutante>	 !log installing apache,apt,cron,mysql-client upgrades on spence
[08:12:55] <morebots>	 Logged the message, Master
[08:15:28] <Snowolf>	 mutante: any reason why the topic is http://? https:// seems to be working on wikitech.wikimedia for me...
[08:16:33] <mutante>	 Snowolf: oh, i guess because it has a self-signed cert still.. we should really fix that
[08:16:57] <mutante>	 Snowolf: you probably have an exception in your browser
[08:17:08] <Snowolf>	 Yeah, but should still be better than nothing, imo
[08:17:22] <Snowolf>	 I guess might confuse people tho, you're right
[08:17:51] <mutante>	 hmm, i dont know what is worse..
[08:18:38] <mutante>	 let me just put it on higher priority to fix that cert.. shouldnt be that big of a deal
[08:18:40] <nagios-wm>	 RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123
[08:23:12] <Snowolf>	 mutante: the certificate on status.wikimedia.org is also not working
[08:23:30] <Snowolf>	 (or more accurately, it's not signed for that domain)
[08:24:14] <mutante>	 Snowolf: i know, that one is not that easy though, because we would have to proxy it, it is an alias for status.watchmouse.com
[08:24:41] <mutante>	 which isnt use
[08:24:43] <mutante>	 us
[08:24:50] <Snowolf>	 Yeah, I figured, but thought I'd point it out anyway given I just came across it :)
[08:25:16] <mutante>	 Snowolf: thanks, we got a ticket for it
[08:29:19] <nagios-wm>	 PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[08:33:13] <nagios-wm>	 RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.003 second response time on port 8123
[08:33:43] <mutante>	 as root: OK: State is Optimal, checked 6 logical device(s)   as nagios: Parse error processing MegaCli64 output   hrmmpf
[08:47:55] <nagios-wm>	 PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[08:52:55] <mutante>	 morning hashasr
[08:52:58] <mutante>	 hashar
[08:53:55] <nagios-wm>	 RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123
[08:55:06] <mutante>	 i was wondering if you already got that script to get all svn users and parse their info
[08:59:38] <hashar>	 I am working on it at the moment
[08:59:54] <hashar>	 avar wrote a perl module to do that : http://search.cpan.org/~avar/MediaWiki-USERINFO-0.04/lib/MediaWiki/USERINFO.pm :-]
[09:00:07] <mutante>	 cool
[09:00:22] <nagios-wm>	 PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[09:02:16] <hashar>	 we have a lot of users that never sent any commits :-)
[09:02:45] <hashar>	 and I thought we were going to use some generic mails such as   foobar@users.mediawiki.org
[09:04:16] <nagios-wm>	 RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.006 second response time on port 8123
[09:06:33] <nagios-wm>	 RECOVERY - Puppet freshness on mw1020 is OK: puppet ran at Thu Mar 15 09:06:26 UTC 2012
[09:08:15] <mutante>	 hashar: is it really worth creating git users for those that never even committed once before? i thought just all from http://svn.wikimedia.org/users.php
[09:08:22] <mutante>	 !log ran puppet on mw102
[09:08:25] <morebots>	 Logged the message, Master
[09:08:34] <hashar>	 mutante: I have no idea
[09:08:43] <hashar>	 mutante: someone just told me that that was suddenly required
[09:09:08] <mutante>	 < sumanah> mutante: I wrote it out a bit in https://bugzilla.wikimedia.org/show_bug.cgi?id=35209#c0
[09:10:07] <hashar>	 oh there is a bug report good
[09:10:11] <mutante>	 hashar: see earlier in -dev
[09:10:18] <mutante>	 if you can
[09:10:36] <hashar>	 I am just going to provide the requested file then write a rant about it
[09:11:28] <mutante>	 hmm, if we dont have their email address how are we going to tell them they got a new account
[09:11:40] <mutante>	 edit their user talk pages with a bot?;)
[09:12:38] <mutante>	 it would be kind of cool if i could email "foobar@users.mediawiki.org" and that would make a bot put the content on the User_talk of foobar :)
[09:12:42] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:13:55] <hashar>	 mutante: or just set an alias for *@users.mediawiki.org emails :-]
[09:14:04] <mutante>	 but of course way too easy for spammers
[09:14:12] <mutante>	 heh
[09:17:39] <nagios-wm>	 RECOVERY - Lucene on search3 is OK: TCP OK - 0.012 second response time on port 8123
[09:18:33] <hashar>	 pff
[09:18:47] <hashar>	 avar's module is so complicated that I'd have a better time writing my own script
[09:22:09] <nagios-wm>	 PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[09:23:57] <nagios-wm>	 RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123
[09:30:24] <nagios-wm>	 PROBLEM - Lucene on search3 is CRITICAL: Connection timed out
[09:30:24] <nagios-wm>	 PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[09:32:12] <nagios-wm>	 RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.002 second response time on port 8123
[09:38:39] <nagios-wm>	 PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[09:40:36] <nagios-wm>	 RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.002 second response time on port 8123
[09:47:41] <nagios-wm>	 PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[09:57:53] <nagios-wm>	 PROBLEM - Lucene on mw1020 is CRITICAL: Connection refused
[09:59:59] <nagios-wm>	 RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.019 second response time on port 8123
[10:22:56] <nagios-wm>	 PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[10:28:56] <nagios-wm>	 RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.002 second response time on port 8123
[10:39:35] <nagios-wm>	 RECOVERY - Lucene on search3 is OK: TCP OK - 0.011 second response time on port 8123
[10:39:44] <nagios-wm>	 PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[10:42:13] <nagios-wm>	 RECOVERY - Lucene on search9 is OK: TCP OK - 0.005 second response time on port 8123
[10:42:31] <nagios-wm>	 RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.002 second response time on port 8123
[10:46:34] <nagios-wm>	 PROBLEM - carbon-cache.py on spence is CRITICAL: PROCS CRITICAL: 0 processes with command name carbon-cache.py
[10:50:48] <hashar>	 mutante: I have finished the CSV :-]
[11:17:37] <nagios-wm>	 PROBLEM - Host dataset1001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:18:05] <apergos>	 no it isn't but I am responsible, please ignore
[11:18:31] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:18:40] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:43] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:52] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:25:34] <nagios-wm>	 RECOVERY - Host dataset1001 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms
[11:53:49] <gerrit-wm>	 New patchset: ArielGlenn; "bonded interfaces for dataset1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3191
[11:54:02] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3191
[11:56:19] <gerrit-wm>	 New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3191
[11:56:28] <gerrit-wm>	 New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3191
[11:56:31] <gerrit-wm>	 Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3191
[11:58:51] <nagios-wm>	 PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:23:36] <nagios-wm>	 PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:31:51] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:32:14] <gerrit-wm>	 New patchset: ArielGlenn; "mount gluster publicdata volume on dataset1001 (dumps)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3192
[12:32:26] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3192
[12:33:39] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:36:12] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.970 seconds
[12:37:42] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.002 seconds
[12:43:03] <gerrit-wm>	 New review: ArielGlenn; "latency between dcs could be problematic for this but let's give it a try, it's only for copy/delete..." [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3192
[12:43:06] <gerrit-wm>	 Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3192
[13:14:39] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:14:57] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:21:06] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.350 seconds
[13:26:57] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.490 seconds
[13:27:42] <nagios-wm>	 PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[13:29:39] <nagios-wm>	 PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[13:33:24] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:34:00] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:21] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.888 seconds
[13:36:42] <nagios-wm>	 PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[13:36:42] <nagios-wm>	 PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[13:40:18] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.564 seconds
[13:40:40] <gerrit-wm>	 New patchset: Mark Bergsma; "Don't sign builds by default" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3193
[13:40:53] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3193
[13:41:07] <gerrit-wm>	 New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3193
[13:41:10] <gerrit-wm>	 Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3193
[13:41:48] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:43:47] <gerrit-wm>	 New patchset: Mark Bergsma; "Move misc::package-builder into a separate file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3194
[13:44:00] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3194
[13:44:54] <gerrit-wm>	 New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3194
[13:44:57] <gerrit-wm>	 Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3194
[13:46:00] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.486 seconds
[13:46:36] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:52:09] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:39] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:06] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.796 seconds
[14:07:42] <gerrit-wm>	 New patchset: Mark Bergsma; "Puppetize pbuilder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3195
[14:07:54] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3195
[14:08:19] <gerrit-wm>	 New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3195
[14:08:21] <gerrit-wm>	 Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3195
[14:09:16] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:19:55] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.770 seconds
[14:24:23] <gerrit-wm>	 New patchset: Mark Bergsma; "Fix dependency cycle" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3199
[14:24:34] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3199
[14:24:58] <gerrit-wm>	 Change abandoned: Mark Bergsma; "this would merge test in again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3199
[14:28:19] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:47:04] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.421 seconds
[14:52:10] <nagios-wm>	 PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.3677817857 (gt 8.0)
[14:53:40] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:56:13] <nagios-wm>	 RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.40850763158
[15:04:12] <gerrit-wm>	 New patchset: Mark Bergsma; "Fix dependency cycle" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3202
[15:04:25] <gerrit-wm>	 New patchset: Mark Bergsma; "Fix othermirrors, setup default dist link" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3203
[15:04:38] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3202
[15:04:38] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3203
[15:04:52] <gerrit-wm>	 New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3202
[15:04:55] <gerrit-wm>	 Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3202
[15:05:18] <gerrit-wm>	 New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3203
[15:05:20] <gerrit-wm>	 Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3203
[15:06:07] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.721 seconds
[15:12:25] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:15:32] <mark>	 !log Created git repo operations/debs/varnish in gerrit
[15:15:35] <morebots>	 Logged the message, Master
[15:21:16] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.718 seconds
[15:28:49] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:29:16] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.466 seconds
[15:35:34] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:50:07] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.864 seconds
[15:56:25] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.901 seconds
[15:56:25] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:00:37] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.865 seconds
[16:02:43] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:06:46] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:12:37] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:13:04] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.340 seconds
[16:15:10] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.342 seconds
[16:19:07] <gerrit-wm>	 New patchset: Lcarr; "Fixing icinga apache file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3204
[16:19:19] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3204
[16:19:39] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3204
[16:19:41] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3204
[16:21:28] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:21:37] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:35:34] <gerrit-wm>	 New patchset: Lcarr; "Making sure all config files are readable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3206
[16:35:46] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3206
[16:35:55] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3206
[16:35:57] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3206
[16:38:33] <nagios-wm>	 PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (10283)
[16:43:12] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:43:21] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:44:51] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.589 seconds
[16:53:15] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:53:33] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:53:42] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:54:36] <nagios-wm>	 RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[17:01:58] <gerrit-wm>	 New patchset: Lcarr; "trying to make exported files world readable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3207
[17:02:10] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3207
[17:02:28] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3207
[17:02:31] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3207
[17:06:23] <gerrit-wm>	 New patchset: Lcarr; "fix perms and purge decommissioned AFTER collecting resources" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3208
[17:06:36] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3208
[17:07:03] <nagios-wm>	 PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.09612973913 (gt 8.0)
[17:07:41] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3208
[17:07:44] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3208
[17:13:12] <nagios-wm>	 RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.538383529412
[17:22:21] <nagios-wm>	 PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[17:22:30] <nagios-wm>	 PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:23:08] <gerrit-wm>	 New patchset: Lcarr; "fixing collection" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3210
[17:23:21] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3210
[17:24:19] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3210
[17:24:22] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3210
[17:28:07] <gerrit-wm>	 New patchset: Mark Bergsma; "Build with source by default" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3211
[17:28:19] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3211
[17:29:32] <gerrit-wm>	 Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3211
[17:29:50] <gerrit-wm>	 Change restored: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3211
[17:29:58] <gerrit-wm>	 New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3211
[17:30:00] <gerrit-wm>	 Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3211
[17:32:24] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.425 seconds
[17:38:33] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:40:18] <gerrit-wm>	 New patchset: Mark Bergsma; "Fix creates file name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3214
[17:40:31] <gerrit-wm>	 New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3214
[17:40:31] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3214
[17:40:42] <gerrit-wm>	 New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3214
[17:40:44] <gerrit-wm>	 Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3214
[17:54:37] <gerrit-wm>	 New patchset: Mark Bergsma; "Temporarily disable varnish package installation during package name migration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3215
[17:54:49] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3215
[17:55:04] <gerrit-wm>	 New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3215
[17:55:08] <gerrit-wm>	 Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3215
[18:15:06] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.680 seconds
[18:15:06] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.569 seconds
[18:18:15] <nagios-wm>	 RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123
[18:19:43] <RobH>	 !log db1022 coming down for reinstall and resetup of raid per rt 2537
[18:19:46] <morebots>	 Logged the message, RobH
[18:23:04] <cmjohnson1>	 robh: mark: d2-pmtpa is peaking on power usage across all 3 phases.  the servers in there are only single power supply...we have the other circuit we could utilize for half of the servers.  it would require bringing several down (mw28-mw58)
[18:23:18] <RobH>	 there should be a ticket to do just that already
[18:23:30] <RobH>	 actually, i take that back
[18:23:37] <RobH>	 we need to remove enough to work on a single circuit.
[18:23:49] <RobH>	 so determine how many should come out and put in a new ticket
[18:24:12] <RobH>	 we cannot split across feeds like that, its not legit
[18:24:29] <RobH>	 combined both feeds need to be under 50% on each or a total of 100% on one
[18:24:36] <RobH>	 its how redundant circuits work, we arent allowed to overload them
[18:24:42] <nagios-wm>	 PROBLEM - Varnish HTTP bits on sq67 is CRITICAL: Connection refused
[18:24:44] <cmjohnson1>	 it is not...but there is only 1 power supply on the servers...we have 60A that is not bein gused
[18:24:49] <cmjohnson1>	 being used
[18:24:51] <nagios-wm>	 PROBLEM - Full LVS Snapshot on db1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:24:57] <RobH>	 ok, i realize there is one psu
[18:25:00] <nagios-wm>	 PROBLEM - MySQL Idle Transactions on db1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:25:11] <RobH>	 but i am saying we are NOT allowed to fill both circuits past 50%
[18:25:15] <RobH>	 or a single one past 100%
[18:25:18] <nagios-wm>	 PROBLEM - SSH on db1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:25:23] <RobH>	 so we need to remove the extra servers and relocate them.
[18:25:27] <nagios-wm>	 PROBLEM - MySQL Recent Restart on db1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:25:34] <RobH>	 if we are over the 80% on a single one.
[18:25:36] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:25:45] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:25:52] <RobH>	 since they are only one psu, ignore the second circuit
[18:25:55] <RobH>	 pretend it doesnt exist.
[18:25:57] <cmjohnson1>	 okay...i need to determine how many
[18:26:05] <cmjohnson1>	 got it
[18:26:12] <nagios-wm>	 PROBLEM - MySQL Slave Running on db1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:26:15] <RobH>	 ya have to do some math yea
[18:26:20] <cmjohnson1>	 what 2nd circuit ? =]
[18:26:21] <nagios-wm>	 PROBLEM - Disk space on db1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:26:28] <RobH>	 determine how many on a single phase and divide it up to determine average draw per server
[18:26:32] <RobH>	 to figure out how many to pull
[18:26:52] <RobH>	 i tried to get new psu's for those
[18:27:01] <RobH>	 but the r410 is NOT user swappable in that manner
[18:27:02] <RobH>	 which sucks
[18:27:39] <RobH>	 but we arent going to drop the redundant feed, it would be a pain and then it wouldnt match other racks
[18:27:42] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:27:42] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:27:44] <RobH>	 so we will simply not use it for now
[18:28:07] <cmjohnson1>	 ok
[18:28:09] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:28:22] <RobH>	 slightly distracted, sorry if that didnt make total sense =]
[18:28:31] <cmjohnson1>	 i got it!
[18:28:45] <nagios-wm>	 PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:29:53] <RobH>	 im on a RT triage sprint, heh
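A minimal sketch of the arithmetic RobH describes above: take the draw on a single phase, divide by the number of servers on it to get an average draw per server, then work out how many servers to pull to get back under the limit. The function name, its parameters, and the example figures are assumptions for illustration only; the 0.8 default reflects the "80% on a single one" remark, and none of the amperages come from the log.

```python
import math


def servers_to_pull(phase_amps, servers_on_phase, breaker_amps, max_fraction=0.8):
    """Estimate how many servers to remove from one phase.

    phase_amps       -- measured draw on the phase (hypothetical input)
    servers_on_phase -- number of single-PSU servers fed by that phase
    breaker_amps     -- rated capacity of the circuit (hypothetical input)
    max_fraction     -- allowed load fraction on a single circuit (assumed 0.8)
    """
    if servers_on_phase <= 0:
        raise ValueError("need at least one server on the phase")

    # Average draw per server, as suggested in the discussion.
    avg_draw = phase_amps / servers_on_phase

    # Load the circuit is allowed to carry, and how far over it we are.
    limit = breaker_amps * max_fraction
    excess = phase_amps - limit
    if excess <= 0:
        return 0  # already within the limit

    # Round up: pulling a fraction of a server is not possible.
    return min(servers_on_phase, math.ceil(excess / avg_draw))


# Example with made-up numbers: 10 servers drawing 18 A on a 20 A circuit.
print(servers_to_pull(phase_amps=18.0, servers_on_phase=10, breaker_amps=20.0))
```

Since the servers have a single power supply, the second feed is left out of the calculation entirely, matching the "pretend it doesnt exist" guidance; real per-server draw varies, so an average is only a starting estimate.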
[18:34:10] <gerrit-wm>	 New patchset: Ryan Lane; "Adding glusterfs cluster to gmetad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3218
[18:34:22] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3218
[18:35:12] <nagios-wm>	 RECOVERY - Varnish HTTP bits on sq67 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.012 seconds
[18:38:03] <nagios-wm>	 PROBLEM - Host sq67 is DOWN: PING CRITICAL - Packet loss = 100%
[18:38:45] <gerrit-wm>	 New patchset: RobH; "added sq39 to decom due to pci training error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3219
[18:38:57] <gerrit-wm>	 New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3218
[18:38:58] <gerrit-wm>	 Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3218
[18:38:58] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3219
[18:39:11] <gerrit-wm>	 New review: RobH; "simple decom addition" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3219
[18:39:14] <gerrit-wm>	 Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3219
[18:41:12] <nagios-wm>	 RECOVERY - Host sq67 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms
[18:42:42] <nagios-wm>	 PROBLEM - Lucene on search1016 is CRITICAL: Connection refused
[18:48:51] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:49:45] <nagios-wm>	 PROBLEM - NTP on db1022 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:50:03] <nagios-wm>	 RECOVERY - SSH on db1022 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[18:51:22] <gerrit-wm>	 New patchset: Mark Bergsma; "Make sq67-sq70 use the new automatic partitioning for varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3221
[18:51:35] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3221
[18:51:44] <gerrit-wm>	 New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3221
[18:51:46] <gerrit-wm>	 Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3221
[18:57:24] <nagios-wm>	 PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 255 MB (3% inode=61%): /var/lib/ureadahead/debugfs 255 MB (3% inode=61%):
[18:59:04] <nagios-wm>	 RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123
[19:00:51] <nagios-wm>	 RECOVERY - Disk space on db1022 is OK: DISK OK
[19:00:53] <RobH>	 !log db1022 resetup and redeployed per rt 2537 and assigned back to asher
[19:00:56] <morebots>	 Logged the message, RobH
[19:02:03] <nagios-wm>	 RECOVERY - MySQL Recent Restart on db1022 is OK: OK seconds since restart
[19:02:12] <nagios-wm>	 RECOVERY - MySQL Idle Transactions on db1022 is OK: OK longest blocking idle transaction sleeps for seconds
[19:02:30] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db1022 is OK: OK replication delay seconds
[19:02:57] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay seconds
[19:03:15] <nagios-wm>	 RECOVERY - MySQL Slave Running on db1022 is OK: OK replication
[19:03:33] <nagios-wm>	 RECOVERY - Full LVS Snapshot on db1022 is OK: OK no full LVM snapshot volumes
[19:05:30] <nagios-wm>	 RECOVERY - Disk space on srv223 is OK: DISK OK
[19:08:12] <nagios-wm>	 PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
[19:08:48] <nagios-wm>	 PROBLEM - Host sq68 is DOWN: PING CRITICAL - Packet loss = 100%
[19:08:48] <nagios-wm>	 PROBLEM - Host sq70 is DOWN: PING CRITICAL - Packet loss = 100%
[19:08:48] <nagios-wm>	 PROBLEM - Host sq69 is DOWN: PING CRITICAL - Packet loss = 100%
[19:08:57] <nagios-wm>	 PROBLEM - Host sq67 is DOWN: PING CRITICAL - Packet loss = 100%
[19:10:09] <nagios-wm>	 RECOVERY - Host sq69 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[19:10:45] <nagios-wm>	 RECOVERY - Host sq70 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms
[19:11:00] <RobH>	 !log working on sq67-sq70 reinstalls, disregard alerts
[19:11:03] <morebots>	 Logged the message, RobH
[19:11:39] <nagios-wm>	 PROBLEM - LVS HTTP on bits.pmtpa.wikimedia.org is CRITICAL: Connection refused
[19:11:47] <mark>	 there we go
[19:12:24] <nagios-wm>	 RECOVERY - NTP on db1022 is OK: NTP OK: Offset -0.08507752419 secs
[19:12:33] <nagios-wm>	 RECOVERY - Host sq68 is UP: PING OK - Packet loss = 0%, RTA = 1.69 ms
[19:13:00] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.676 seconds
[19:13:00] <nagios-wm>	 PROBLEM - LVS HTTPS on bits.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[19:13:00] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.682 seconds
[19:14:57] <nagios-wm>	 PROBLEM - Varnish HTTP bits on sq70 is CRITICAL: Connection refused
[19:15:06] <nagios-wm>	 RECOVERY - Host sq67 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[19:15:24] <nagios-wm>	 PROBLEM - Disk space on search1015 is CRITICAL: DISK CRITICAL - free space: /a 3398 MB (2% inode=99%):
[19:15:51] <nagios-wm>	 PROBLEM - Varnish HTTP bits on sq69 is CRITICAL: Connection refused
[19:17:28] <nagios-wm>	 PROBLEM - Varnish HTTP bits on sq68 is CRITICAL: Connection refused
[19:17:46] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:18:31] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:18:49] <nagios-wm>	 PROBLEM - NTP on sq67 is CRITICAL: NTP CRITICAL: No response from NTP server
[19:19:52] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.967 seconds
[19:21:31] <nagios-wm>	 PROBLEM - Varnish HTTP bits on sq67 is CRITICAL: Connection refused
[19:23:01] <nagios-wm>	 RECOVERY - LVS HTTP on bits.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3911 bytes in 0.002 seconds
[19:23:19] <nagios-wm>	 PROBLEM - SSH on sq67 is CRITICAL: Connection refused
[19:23:28] <nagios-wm>	 RECOVERY - LVS HTTPS on bits.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3928 bytes in 0.007 seconds
[19:26:10] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:27:22] <nagios-wm>	 RECOVERY - SSH on sq67 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[19:29:19] <nagios-wm>	 PROBLEM - LVS HTTP on bits.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:29:46] <nagios-wm>	 PROBLEM - LVS HTTPS on bits.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:30:40] <nagios-wm>	 PROBLEM - SSH on sq68 is CRITICAL: Connection refused
[19:33:40] <nagios-wm>	 PROBLEM - SSH on sq69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:33:58] <nagios-wm>	 PROBLEM - Varnish HTTP bits on sq69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:34:52] <nagios-wm>	 RECOVERY - SSH on sq68 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[19:34:52] <nagios-wm>	 PROBLEM - SSH on sq70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:35:28] <nagios-wm>	 RECOVERY - SSH on sq69 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[19:36:49] <nagios-wm>	 RECOVERY - SSH on sq70 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[19:37:07] <nagios-wm>	 RECOVERY - Disk space on search1015 is OK: DISK OK
[19:37:52] <nagios-wm>	 PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:39:22] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.488 seconds
[19:40:34] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.301 seconds
[19:47:46] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:50:55] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:51:49] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.376 seconds
[19:54:12] <RobH>	 !log sq67-sq70 have been reinstalled, but not signed in puppet, not sure if they are ready for that or if there are other items mark needs to change first
[19:54:15] <morebots>	 Logged the message, RobH
[19:55:43] <nagios-wm>	 PROBLEM - NTP on sq68 is CRITICAL: NTP CRITICAL: No response from NTP server
[19:57:58] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:01:07] <gerrit-wm>	 New patchset: Lcarr; "adding in all old nagios groups to purge" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3223
[20:01:19] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3223
[20:01:25] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.335 seconds
[20:03:59] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3223
[20:04:01] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3223
[20:06:13] <nagios-wm>	 PROBLEM - NTP on sq69 is CRITICAL: NTP CRITICAL: No response from NTP server
[20:07:34] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:07:34] <nagios-wm>	 PROBLEM - NTP on sq70 is CRITICAL: NTP CRITICAL: No response from NTP server
[20:16:43] <nagios-wm>	 RECOVERY - Varnish HTTP bits on sq67 is OK: HTTP OK HTTP/1.1 200 OK - 630 bytes in 0.003 seconds
[20:18:04] <nagios-wm>	 PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:23:19] <nagios-wm>	 RECOVERY - LVS HTTPS on bits.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3916 bytes in 9.013 seconds
[20:23:28] <nagios-wm>	 RECOVERY - Varnish HTTP bits on sq69 is OK: HTTP OK HTTP/1.1 200 OK - 630 bytes in 0.009 seconds
[20:24:49] <nagios-wm>	 PROBLEM - Varnish HTTP bits on sq67 is CRITICAL: Connection refused
[20:26:46] <nagios-wm>	 RECOVERY - NTP on sq69 is OK: NTP OK: Offset -0.04886293411 secs
[20:26:55] <nagios-wm>	 RECOVERY - Varnish HTTP bits on sq67 is OK: HTTP OK HTTP/1.1 200 OK - 630 bytes in 0.007 seconds
[20:27:22] <nagios-wm>	 RECOVERY - LVS HTTP on bits.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3910 bytes in 0.009 seconds
[20:28:43] <nagios-wm>	 RECOVERY - Varnish HTTP bits on sq70 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.002 seconds
[20:32:10] <nagios-wm>	 RECOVERY - NTP on sq70 is OK: NTP OK: Offset -0.021941185 secs
[20:36:22] <nagios-wm>	 RECOVERY - Varnish HTTP bits on sq68 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.007 seconds
[20:43:25] <gerrit-wm>	 New patchset: Lcarr; "correcting service group config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3224
[20:43:32] <Reedy>	 hume: sudo: no tty present and no askpass program specified
[20:43:38] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3224
[20:43:45] <gerrit-wm>	 New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3224
[20:43:47] <gerrit-wm>	 Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3224
[20:44:39] <RobH>	 !log dns update for silver and zhen servers
[20:44:42] <morebots>	 Logged the message, RobH
[20:46:03] <mark>	 !log bits.pmtpa cluster back online
[20:46:06] <morebots>	 Logged the message, Master
[20:47:22] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.756 seconds
[20:47:40] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.439 seconds
[20:50:13] <nagios-wm>	 PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[20:53:40] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:54:07] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:12:16] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.500 seconds
[21:14:58] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.811 seconds
[21:19:27] <Ryan_Lane>	 !log rebalancing instances gluster volume
[21:19:30] <morebots>	 Logged the message, Master
[21:24:34] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:28:37] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.208 seconds
[21:45:43] <nagios-wm>	 PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:47:49] <nagios-wm>	 RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.461 seconds
[21:49:08] <gerrit-wm>	 New patchset: Bhartshorne; "increasing speed of the swiftcleaner so it has a chance to finish its scan in a reasonable amount of time" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3226
[21:49:21] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3226
[21:49:24] <gerrit-wm>	 New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3226
[21:49:26] <gerrit-wm>	 Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3226
[21:53:37] <gerrit-wm>	 New patchset: Bhartshorne; "attempt to get a timestamp into the swiftcleaner log name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3227
[21:53:50] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3227
[22:05:49] <nagios-wm>	 PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:06:42] <mutante>	 binasher: hi, i'm here already
[22:06:55] <binasher>	 hey
[22:07:08] <binasher>	 lets aim to go over the db stuff in around an hour
[22:07:14] <mutante>	 great
[22:09:43] <nagios-wm>	 RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.260 seconds
[22:21:10] <mutante>	 !log getting rid of Swift HTTP checks on non production machines manually (come on spence _purge_ ;P)
[22:21:14] <morebots>	 Logged the message, Master
[22:22:00] <jeremyb>	 mutante: maybe that requires managed resources? i don't remember
[22:34:48] <gerrit-wm>	 New patchset: Pyoungmeister; "no lucene monitoring for indexers, as it does not work properly..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3228
[22:35:01] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3228
[22:36:21] <gerrit-wm>	 New review: Dzahn; "yep, thanks!" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3228
[22:36:24] <gerrit-wm>	 Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3228
[22:42:43] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 205 seconds
[22:44:37] <gerrit-wm>	 New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3227
[22:44:40] <gerrit-wm>	 Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3227
[22:46:28] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 351 seconds
[22:48:32] <mutante>	 !log purging Lucene monitoring on indexer from db9, remove duplicate service definitions manually anyways (still tons left), run purge script, reload Nagios..
[22:48:35] <morebots>	 Logged the message, Master
[23:05:07] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds
[23:05:34] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds
[23:12:28] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:13:49] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 199 seconds
[23:14:25] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.684 seconds
[23:14:43] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 234 seconds
[23:21:38] <gerrit-wm>	 New patchset: Ryan Lane; "Removing gerrit bot from wikimedia-tech" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3230
[23:21:50] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3230
[23:21:58] <gerrit-wm>	 New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/3230
[23:22:01] <gerrit-wm>	 Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3230
[23:28:49] <nagios-wm>	 PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[23:30:46] <nagios-wm>	 PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[23:36:19] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 4 seconds
[23:37:13] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 1 seconds
[23:37:49] <nagios-wm>	 PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[23:37:49] <nagios-wm>	 PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[23:37:57] <gerrit-wm>	 New patchset: Bhartshorne; "needed to escape the %s for cron to play nice." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3231
[23:38:09] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3231
[23:38:20] <gerrit-wm>	 New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3231
[23:38:23] <gerrit-wm>	 Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3231
[23:48:21] <binasher>	 mutante: are you going to be online tomorrow?
[23:49:13] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:49:34] <mutante>	 binasher: unfortunately, not really, as tomorrow is Saturday for me already
[23:49:41] <mutante>	 and booked some tour
[23:50:14] <binasher>	 damn you, international date line!
[23:50:50] <binasher>	 looks like i'm going to go over building a new slave with notpeter tomorrow, but can go through the same with you next week
[23:52:08] <mutante>	 i can be in, in like 12 hours.. but that would still be the night for you i think
[23:52:31] <mutante>	 binasher: alright, that would be great, and/or maybe i can just have the chat log or something
[23:53:16] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.034 seconds
[23:54:11] <mutante>	 there is a slight chance that the tour is cancelled due to weather, in that case i'll join in anyways
[23:54:27] <gerrit-wm>	 New patchset: Bhartshorne; "putting the location of the swiftcleaner script into the config file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3232
[23:54:39] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3232
[23:55:38] <gerrit-wm>	 New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/3232
[23:55:40] <gerrit-wm>	 Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3232