[00:06:03] New patchset: Pyoungmeister; "assigning more stuff to fake hosts to make the catch-all term the same as in pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3178 [00:06:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3178 [00:08:09] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3178 [00:08:11] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3178 [00:17:08] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:31:41] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.922 seconds [00:35:35] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:38:19] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.128 seconds [00:40:07] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:44:01] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.861 seconds [00:52:25] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:56:28] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.098 seconds [00:59:28] PROBLEM - Lucene on search1015 is CRITICAL: Connection refused [01:02:46] RECOVERY - RAID on srv197 is OK: OK: no RAID installed [01:03:19] !log fixing nrpe "unable to read output" raid check on srv197,207,243,,244,253.. (nrpe running as wrong user) [01:03:22] Logged the message, Master [01:05:19] RECOVERY - RAID on srv243 is OK: OK: no RAID installed [01:06:52] New patchset: Ryan Lane; "Fixing apache config for gerrit to work with labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3179 [01:07:04] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3179 [01:07:42] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3179 [01:07:44] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3179 [01:10:33] New patchset: Ryan Lane; "Revert "Fixing apache config for gerrit to work with labs"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3180 [01:10:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3180 [01:10:49] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3180 [01:10:52] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3180 [01:17:28] PROBLEM - HTTP on singer is CRITICAL: Connection refused [01:18:38] ops, singer, whats up, checkin [01:19:43] rc libapache2-mod-php5 [01:19:49] Same thing [01:19:49] !log planet down - apache on singer, syntax error in site config "Invalid command 'php_admin_flag'" [01:19:51] seems puppet forced an upgrade of something [01:19:53] Logged the message, Master [01:20:12] libapache2-mod-php5: Conflicts: libapache2-mod-php5filter but 5.3.2-2wm1 is to be installed [01:21:05] The following extra packages will be installed: apache2-mpm-prefork [01:21:09] The following packages will be REMOVED: apache2-mpm-worker [01:21:11] hrmm [01:21:18] guess we dont care for planet [01:22:39] !log planet back up (installed libapache2-mod-php5 which installed apache2-mpm-prefork and removed apache2-mpm-worker) [01:22:42] Logged the message, Master [01:23:30] yesterday on snapshot3 something similar happened with mysql-client packages, puppet ran and then conflicts between wmf and ubuntu packages [01:23:37] RECOVERY - HTTP on singer is OK: HTTP OK - HTTP/1.1 302 Found - 0.004 second response time [01:25:16] PROBLEM - HTTP on ekrem is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:25:34] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:26:12] <-- and this ekrem thing is caused by AppleDictionaryService [01:26:25] !log labsconsole was missing libapache2-mod-php5. puppet must have tried to upgrade a package unsuccessfully [01:26:28] Logged the message, Master [01:27:13] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.382 seconds [01:27:58] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:29:18] /Webserver::Php5/Package[apache2]/ensure) ensure changed '2.2.14-5ubuntu8.8' to '2.2.14-5ubuntu8.9' [01:29:56] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1843 bytes in 8.417 seconds [01:36:04] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:36:22] that would do it [01:36:33] why it would fail to update the dependencies, I have no clue [01:39:40] PROBLEM - MySQL Slave Running on db1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:39:49] it also conflicted with "libapache2-mod-php5filter" before, so the new part would have to be "but 5.3.2-2wm1 is to be installed" .. but also we did not change any of the apt preferences afaik.. 
so hmmmm [01:42:13] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.982 seconds [01:45:04] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [01:49:52] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:51:49] PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:00:04] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.529 seconds [02:06:58] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:10:25] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:22:40] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:22:58] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:43:04] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.093 seconds [02:49:13] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:52:49] PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:34] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:03:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.103 seconds [03:09:20] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:15:02] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.747 seconds [03:15:29] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.030 seconds [03:21:29] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:21:56] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:25:59] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [03:25:59] 
RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.751 seconds [03:27:56] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [03:28:05] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:28:05] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:31:50] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.460 seconds [03:34:59] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [03:34:59] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [03:45:02] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:49:05] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.821 seconds [03:52:59] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [04:01:31] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:03:46] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 5.822 seconds [04:10:13] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:10:13] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:12:10] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 0.021 seconds [04:12:10] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.008 seconds [04:26:52] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:26:52] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:33:01] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.448 seconds [04:33:01] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.455 
seconds [04:43:04] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:43:13] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:43:13] PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:58] New patchset: Dzahn; "allow virt[1-5] subnet to access spence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3186 [04:56:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3186 [04:57:00] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3186 [04:57:03] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3186 [04:59:37] New patchset: Dzahn; "comment out Swift HTTP monitoring on non-production hosts again, this used to work for a day now they socket timeout and they are not in production anyways" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3187 [04:59:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3187 [05:01:46] New patchset: Dzahn; "comment out Swift HTTP monitoring on non-production hosts again, this used to work for a day now they socket timeout, so i expect they have been stopped deliberately" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3187 [05:01:58] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3187 [05:02:25] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3187 [05:02:28] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3187 [05:07:02] mutante: http://nagios.wikimedia.org/nagios/cgi-bin/history.cgi?host=copper&service=Swift+HTTP is a little more info than you get from the notifications fwiw [05:08:13] jeremyb: waah, you are using history.cgi :O :) [05:08:22] jeremyb: the evil script.. heeh [05:08:40] i had no idea it had such a reputation... [05:09:20] mutante: so... what's the deal with viewvc? [05:10:04] jeremyb: well, in the past sometimes spence got all overloaded and then we could resolve that by killing instances of history.cgi [05:10:21] oh. i wasn't a bot... [05:10:35] that does ring a bell [05:11:19] jeremyb: thanks, i know in this case though (about the Swift HTTP) .. those are not the production hosts [05:11:40] so testing stuff should not be in nagios anyways [05:11:56] about viewvc, i dont have news [05:12:40] yeah, i figured. just thought i'd point it out because the commit msg mentioned the timeout [05:13:11] yep, no worries [05:13:48] viewvc is just not being included on the host [05:14:07] but someone needs to look at the structure of the svn.pp and the subclasses closely [05:15:03] jeremyb: btw and unrelated, do you have a labs account? [05:15:16] * jeremyb is too tired to think straight about viewvc [05:15:18] yes [05:15:33] could you try to log on to the labs bastion host ? 
[05:15:54] cause i have issues with it currently [05:16:03] $ ssh bastion1.pmtpa.wmflabs echo foo [05:16:03] If you are having access problems, please see: https://labsconsole.wikimedia.org/wiki/Access#Accessing_public_and_private_instances [05:16:06] foo [05:16:15] mutante: tell me more [05:16:41] If you are having access problems, please see: https://labsconsole.wikimedia.org/wiki/Access#Accessing_public_and_private_instances [05:16:45] Connection closed by 208.80.153.194 [05:16:53] when you do what? [05:17:10] ssh to bastion.wmflabs.org [05:17:25] like i did dozens of times before [05:17:45] lets talk in labs :) [05:20:30] New patchset: Asher; "vcl_config, not vcl_options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3188 [05:20:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3188 [05:22:37] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3188 [05:22:40] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3188 [05:26:59] RECOVERY - Puppet freshness on professor is OK: puppet ran at Thu Mar 15 05:26:47 UTC 2012 [05:28:05] :) [05:30:00] New patchset: Asher; "removing misc::udpprofile::collector from spence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3189 [05:30:13] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3189 [05:30:22] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3189 [05:30:24] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3189 [05:52:20] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:01:17] RECOVERY - Puppet freshness on virt2 is OK: puppet ran at Thu Mar 15 06:01:10 UTC 2012 [06:04:44] RECOVERY - Puppet freshness on virt3 is OK: puppet ran at Thu Mar 15 06:04:20 UTC 2012 [06:10:17] RECOVERY - Puppet freshness on virt4 is OK: puppet ran at Thu Mar 15 06:10:11 UTC 2012 [06:11:24] :)² [06:23:56] PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:01:48] !log upgrading apache and apt on hume [07:01:54] Logged the message, Master [07:18:05] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:20:47] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [07:47:38] RECOVERY - DPKG on hume is OK: All packages OK [07:51:51] !log messed with /var/lib/dpkg/status on hume to fix broken packages/remove "marked for purging" on libmysql-php5 without removing a ton of other packages, rather hackish but seems fine anyways, like not broken anymore on simulated dist-upgrade etc [07:51:55] Logged the message, Master [07:57:23] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [07:59:20] PROBLEM - Lucene on search3 is CRITICAL: Connection timed out [08:02:19] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out [08:06:13] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123 [08:12:40] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out [08:12:52] !log installing apache,apt,cron,mysql-client upgrades on spence 
[08:12:55] Logged the message, Master [08:15:28] mutante: any reason why the topic is http://? https:// seems to be working on wikitech.wikimedia for me... [08:16:33] Snowolf: oh, i guess because it has a self-signed cert still.. we should really fix that [08:16:57] Snowolf: you probably have an exception in your browser [08:17:08] Yeah, but should still be better than nothing, imo [08:17:22] I guess might confuse people tho, you're right [08:17:51] hmm, i dont know what is worse.. [08:18:38] let me just put it on higher priority to fix that cert.. shouldnt be that big of a deal [08:18:40] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123 [08:23:12] mutante: the certificate on status.wikimedia.org is also not working [08:23:30] (or more accurately, it's not signed for that domain) [08:24:14] Snowolf: i know, that one is not that easy though, because we would have to proxy it, it is an alias for status.watchmouse.com [08:24:41] which isnt use [08:24:43] us [08:24:50] Yeah, I figured, but thought I'd point it out anyway given I just came across it :) [08:25:16] Snowolf: thanks, we got a ticket for it [08:29:19] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out [08:33:13] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.003 second response time on port 8123 [08:33:43] as root: OK: State is Optimal, checked 6 logical device(s) as nagios: Parse error processing MegaCli64 output hrmmpf [08:47:55] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out [08:52:55] morning hashasr [08:52:58] hashar [08:53:55] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123 [08:55:06] i was wondering if you already got that script to get all svn users and parse their info [08:59:38] I am working on it at the moment [08:59:54] avar wrote a perl module to do that : 
http://search.cpan.org/~avar/MediaWiki-USERINFO-0.04/lib/MediaWiki/USERINFO.pm :-] [09:00:07] cool [09:00:22] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out [09:02:16] we have lot of users that never sent any commits :-) [09:02:45] and I thought we were going to use some generic mails such as foobar@users.mediawiki.org [09:04:16] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.006 second response time on port 8123 [09:06:33] RECOVERY - Puppet freshness on mw1020 is OK: puppet ran at Thu Mar 15 09:06:26 UTC 2012 [09:08:15] hashar: is it really worth creating git users for those that never even committed once before? i thought just all from http://svn.wikimedia.org/users.php [09:08:22] !log ran puppet on mw102 [09:08:25] Logged the message, Master [09:08:34] mutante: I have no idea [09:08:43] mutante: someone just told me that that was suddenly required [09:09:08] < sumanah> mutante: I wrote it out a bit in https://bugzilla.wikimedia.org/show_bug.cgi?id=35209#c0 [09:10:07] oh there is a bug report good [09:10:11] hashar: see earlier in -dev [09:10:18] if you can [09:10:36] I am just going to provide the requested file then write a rant about it [09:11:28] hmm, if we dont have their email address how are we going to tell them they got a new account [09:11:40] edit their user talk pages with a bot?;) [09:12:38] it would be kind of cool if i could email "foobar@users.mediawiki.org" and that would make a bot put the content on the User_talk of foobar :) [09:12:42] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:13:55] mutante: or just set an alias for *@users.mediawiki.org emails :-] [09:14:04] but of course way too easy for spammers [09:14:12] heh [09:17:39] RECOVERY - Lucene on search3 is OK: TCP OK - 0.012 second response time on port 8123 [09:18:33] pff [09:18:47] Avar module is so complicated that I have better time writing my own script [09:22:09] PROBLEM - LVS 
Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out [09:23:57] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123 [09:30:24] PROBLEM - Lucene on search3 is CRITICAL: Connection timed out [09:30:24] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out [09:32:12] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.002 second response time on port 8123 [09:38:39] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out [09:40:36] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.002 second response time on port 8123 [09:47:41] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out [09:57:53] PROBLEM - Lucene on mw1020 is CRITICAL: Connection refused [09:59:59] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.019 second response time on port 8123 [10:22:56] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out [10:28:56] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.002 second response time on port 8123 [10:39:35] RECOVERY - Lucene on search3 is OK: TCP OK - 0.011 second response time on port 8123 [10:39:44] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out [10:42:13] RECOVERY - Lucene on search9 is OK: TCP OK - 0.005 second response time on port 8123 [10:42:31] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.002 second response time on port 8123 [10:46:34] PROBLEM - carbon-cache.py on spence is CRITICAL: PROCS CRITICAL: 0 processes with command name carbon-cache.py [10:50:48] mutante: I have finished the CSV :-] [11:17:37] PROBLEM - Host dataset1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:18:05] no it isn't but I am responsible, please ignore [11:18:31] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [11:18:40] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:22:43] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:22:52] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:25:34] RECOVERY - Host dataset1001 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [11:53:49] New patchset: ArielGlenn; "bonded interfaces for dataset1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3191 [11:54:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3191 [11:56:19] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3191 [11:56:28] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3191 [11:56:31] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3191 [11:58:51] PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:23:36] PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:31:51] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:32:14] New patchset: ArielGlenn; "mount gluster publicdata volume on dataset1001 (dumps)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3192 [12:32:26] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3192 [12:33:39] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:36:12] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.970 seconds [12:37:42] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.002 seconds [12:43:03] New review: ArielGlenn; "latency between dcs could be problematic for this but let's give it a try, it's only for copy/delete..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3192 [12:43:06] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3192 [13:14:39] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:57] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:06] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.350 seconds [13:26:57] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.490 seconds [13:27:42] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [13:29:39] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [13:33:24] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:00] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:21] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.888 seconds [13:36:42] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [13:36:42] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [13:40:18] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.564 seconds [13:40:40] New patchset: Mark Bergsma; "Don't sign builds by default" [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/3193 [13:40:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3193 [13:41:07] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3193 [13:41:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3193 [13:41:48] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:47] New patchset: Mark Bergsma; "Move misc::package-builder into a separate file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3194 [13:44:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3194 [13:44:54] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3194 [13:44:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3194 [13:46:00] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.486 seconds [13:46:36] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:09] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:02:39] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:03:06] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.796 seconds [14:07:42] New patchset: Mark Bergsma; "Puppetize pbuilder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3195 [14:07:54] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3195 [14:08:19] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3195 [14:08:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3195 [14:09:16] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:19:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.770 seconds [14:24:23] New patchset: Mark Bergsma; "Fix dependency cycle" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3199 [14:24:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3199 [14:24:58] Change abandoned: Mark Bergsma; "this would merge test in again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3199 [14:28:19] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:04] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.421 seconds [14:52:10] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.3677817857 (gt 8.0) [14:53:40] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:56:13] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.40850763158 [15:04:12] New patchset: Mark Bergsma; "Fix dependency cycle" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3202 [15:04:25] New patchset: Mark Bergsma; "Fix othermirrors, setup default dist link" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3203 [15:04:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3202 [15:04:38] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3203 [15:04:52] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3202 [15:04:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3202 [15:05:18] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3203 [15:05:20] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3203 [15:06:07] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.721 seconds [15:12:25] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:32] !log Created git repo operations/debs/varnish in gerrit [15:15:35] Logged the message, Master [15:21:16] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.718 seconds [15:28:49] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:16] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.466 seconds [15:35:34] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:50:07] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.864 seconds [15:56:25] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.901 seconds [15:56:25] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:00:37] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.865 seconds [16:02:43] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:06:46] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:37] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:04] 
RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.340 seconds [16:15:10] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.342 seconds [16:19:07] New patchset: Lcarr; "Fixing icinga apache file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3204 [16:19:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3204 [16:19:39] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3204 [16:19:41] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3204 [16:21:28] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:21:37] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:34] New patchset: Lcarr; "Making sure all config files are readable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3206 [16:35:46] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3206 [16:35:55] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3206 [16:35:57] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3206 [16:38:33] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (10283) [16:43:12] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:43:21] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:44:51] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.589 seconds [16:53:15] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:53:33] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:53:42] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:54:36] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [17:01:58] New patchset: Lcarr; "trying to make exported files world readable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3207 [17:02:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3207 [17:02:28] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3207 [17:02:31] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3207 [17:06:23] New patchset: Lcarr; "fix perms and purge decommissioned AFTER collecting resources" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3208 [17:06:36] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3208 [17:07:03] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.09612973913 (gt 8.0) [17:07:41] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3208 [17:07:44] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3208 [17:13:12] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.538383529412 [17:22:21] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [17:22:30] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:08] New patchset: Lcarr; "fixing collection" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3210 [17:23:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3210 [17:24:19] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3210 [17:24:22] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3210 [17:28:07] New patchset: Mark Bergsma; "Build with source by default" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3211 [17:28:19] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3211 [17:29:32] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3211 [17:29:50] Change restored: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3211 [17:29:58] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3211 [17:30:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3211 [17:32:24] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.425 seconds [17:38:33] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:18] New patchset: Mark Bergsma; "Fix creates file name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3214 [17:40:31] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3214 [17:40:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3214 [17:40:42] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3214 [17:40:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3214 [17:54:37] New patchset: Mark Bergsma; "Temporarily disable varnish package installation during package name migration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3215 [17:54:49] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3215 [17:55:04] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3215 [17:55:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3215 [18:15:06] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.680 seconds [18:15:06] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.569 seconds [18:18:15] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123 [18:19:43] !log db1022 coming down for reinstall and resetup of raid per rt 2537 [18:19:46] Logged the message, RobH [18:23:04] robh: mark: d2-pmtpa is peaking on power usage across all 3 phase. the servers in there are only single power supply...we have the other circuit we could utilize for half of the servers. it would require bringing several down (mw28-mw58) [18:23:18] there should be a ticket to do just that already [18:23:30] actually, i take that back [18:23:37] we need to remove enough to work on a single circuit. [18:23:49] so determine how many should come out and put in a new ticket [18:24:12] we cannot split across feeds like that, its not legit [18:24:29] combined both feeds need to be under 50% on each or a total of 100% on one [18:24:36] its how redundant circuits work, we arent allowed to overload them [18:24:42] PROBLEM - Varnish HTTP bits on sq67 is CRITICAL: Connection refused [18:24:44] it is not...but there is only 1 power supply on the servers...we have 60A that is not bein gused [18:24:49] being used [18:24:51] PROBLEM - Full LVS Snapshot on db1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:24:57] ok, i realize there is one psu [18:25:00] PROBLEM - MySQL Idle Transactions on db1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:25:11] but i am saying we are NOT allowed to fill both circuits past 50% [18:25:15] or a single one past 100% [18:25:18] PROBLEM - SSH on db1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:23] so we need to remove the extra servers and relocate them. [18:25:27] PROBLEM - MySQL Recent Restart on db1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:25:34] if we are over the 80% on a single one. [18:25:36] PROBLEM - MySQL Slave Delay on db1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:25:45] PROBLEM - MySQL Replication Heartbeat on db1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:25:52] since they are only one psu, ignore the second circuit [18:25:55] pretend it doesnt exist. [18:25:57] okay...i need to determine how many [18:26:05] got it [18:26:12] PROBLEM - MySQL Slave Running on db1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:26:15] ya have to do some math yea [18:26:20] what 2nd circuit ? =] [18:26:21] PROBLEM - Disk space on db1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:26:28] determine how many on a single phase and divide it up to determine average draw per server [18:26:32] to figure out how many to pull [18:26:52] i tried to get new psu's for those [18:27:01] but the r410 is NOT user swappable in that manner [18:27:02] which sucks [18:27:39] but we arent going to drop the redundant feed, it would be a pain and then it wouldnt match other racks [18:27:42] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:42] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:44] so we will simply not use it for now [18:28:07] ok [18:28:09] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:22] slightly distracted, sorry if that didnt make total sense =] [18:28:31] i got it! 
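The arithmetic described above (take the load on a single phase, divide by the number of servers on it to get an average draw per server, then work out how many to pull) can be sketched roughly as follows. This is only an illustration: the function name `servers_to_pull`, its parameters, and all of the numbers are hypothetical and do not come from the channel; the 80% figure mentioned in the discussion would be applied by the caller when choosing `limit_amps`.

```python
import math


def servers_to_pull(phase_amps, servers_on_phase, limit_amps):
    """Estimate how many servers to remove so one phase drops to limit_amps or below.

    Assumes every server draws roughly the same current, which is the
    averaging simplification suggested in the discussion (hypothetical helper).
    """
    if servers_on_phase <= 0:
        raise ValueError("servers_on_phase must be positive")

    # Average draw per server on this phase.
    avg_draw = phase_amps / servers_on_phase

    # How far the phase is over the allowed load.
    excess = phase_amps - limit_amps
    if excess <= 0:
        return 0  # already within the limit, nothing to pull

    # Round up: pulling a fraction of a server is not possible.
    return min(servers_on_phase, math.ceil(excess / avg_draw))


if __name__ == "__main__":
    # Hypothetical example: 10 servers drawing 18 A on a phase limited to 16 A.
    print(servers_to_pull(18.0, 10, 16.0))
```

With these made-up numbers the average draw is 1.8 A per server and the phase is 2 A over, so the estimate is two servers. Real per-server draw varies with load, so a measured figure per host would be a better input than the average where it is available.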
[18:28:45] PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:53] im on a RT triage sprint, heh [18:34:10] New patchset: Ryan Lane; "Adding glusterfs cluster to gmetad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3218 [18:34:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3218 [18:35:12] RECOVERY - Varnish HTTP bits on sq67 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.012 seconds [18:38:03] PROBLEM - Host sq67 is DOWN: PING CRITICAL - Packet loss = 100% [18:38:45] New patchset: RobH; "added sq39 to decom due to pci training error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3219 [18:38:57] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3218 [18:38:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3218 [18:38:58] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3219 [18:39:11] New review: RobH; "simple decom addition" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3219 [18:39:14] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3219 [18:41:12] RECOVERY - Host sq67 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [18:42:42] PROBLEM - Lucene on search1016 is CRITICAL: Connection refused [18:48:51] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:49:45] PROBLEM - NTP on db1022 is CRITICAL: NTP CRITICAL: No response from NTP server [18:50:03] RECOVERY - SSH on db1022 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [18:51:22] New patchset: Mark Bergsma; "Make sq67-sq70 use the new automatic partitioning for varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3221 [18:51:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3221 [18:51:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3221 [18:51:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3221 [18:57:24] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 255 MB (3% inode=61%): /var/lib/ureadahead/debugfs 255 MB (3% inode=61%): [18:59:04] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123 [19:00:51] RECOVERY - Disk space on db1022 is OK: DISK OK [19:00:53] !log db1022 resetup and redeployed per rt 2537 and assigned back to asher [19:00:56] Logged the message, RobH [19:02:03] RECOVERY - MySQL Recent Restart on db1022 is OK: OK seconds since restart [19:02:12] RECOVERY - MySQL Idle Transactions on db1022 is OK: OK longest blocking idle transaction sleeps for seconds [19:02:30] RECOVERY - MySQL Slave Delay on db1022 is 
OK: OK replication delay seconds [19:02:57] RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay seconds [19:03:15] RECOVERY - MySQL Slave Running on db1022 is OK: OK replication [19:03:33] RECOVERY - Full LVS Snapshot on db1022 is OK: OK no full LVM snapshot volumes [19:05:30] RECOVERY - Disk space on srv223 is OK: DISK OK [19:08:12] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [19:08:48] PROBLEM - Host sq68 is DOWN: PING CRITICAL - Packet loss = 100% [19:08:48] PROBLEM - Host sq70 is DOWN: PING CRITICAL - Packet loss = 100% [19:08:48] PROBLEM - Host sq69 is DOWN: PING CRITICAL - Packet loss = 100% [19:08:57] PROBLEM - Host sq67 is DOWN: PING CRITICAL - Packet loss = 100% [19:10:09] RECOVERY - Host sq69 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [19:10:45] RECOVERY - Host sq70 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [19:11:00] !log working on sq67-sq70 reinstalls, disregard alerts [19:11:03] Logged the message, RobH [19:11:39] PROBLEM - LVS HTTP on bits.pmtpa.wikimedia.org is CRITICAL: Connection refused [19:11:47] there we go [19:12:24] RECOVERY - NTP on db1022 is OK: NTP OK: Offset -0.08507752419 secs [19:12:33] RECOVERY - Host sq68 is UP: PING OK - Packet loss = 0%, RTA = 1.69 ms [19:13:00] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.676 seconds [19:13:00] PROBLEM - LVS HTTPS on bits.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [19:13:00] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.682 seconds [19:14:57] PROBLEM - Varnish HTTP bits on sq70 is CRITICAL: Connection refused [19:15:06] RECOVERY - Host sq67 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [19:15:24] PROBLEM - Disk space on search1015 is CRITICAL: DISK CRITICAL - free space: /a 3398 MB (2% inode=99%): [19:15:51] PROBLEM - Varnish HTTP bits on sq69 is CRITICAL: Connection refused [19:17:28] PROBLEM - Varnish HTTP bits on 
sq68 is CRITICAL: Connection refused [19:17:46] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:49] PROBLEM - NTP on sq67 is CRITICAL: NTP CRITICAL: No response from NTP server [19:19:52] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.967 seconds [19:21:31] PROBLEM - Varnish HTTP bits on sq67 is CRITICAL: Connection refused [19:23:01] RECOVERY - LVS HTTP on bits.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3911 bytes in 0.002 seconds [19:23:19] PROBLEM - SSH on sq67 is CRITICAL: Connection refused [19:23:28] RECOVERY - LVS HTTPS on bits.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3928 bytes in 0.007 seconds [19:26:10] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:27:22] RECOVERY - SSH on sq67 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:29:19] PROBLEM - LVS HTTP on bits.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:46] PROBLEM - LVS HTTPS on bits.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:40] PROBLEM - SSH on sq68 is CRITICAL: Connection refused [19:33:40] PROBLEM - SSH on sq69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:58] PROBLEM - Varnish HTTP bits on sq69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:52] RECOVERY - SSH on sq68 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:34:52] PROBLEM - SSH on sq70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:35:28] RECOVERY - SSH on sq69 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:36:49] RECOVERY - SSH on sq70 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:37:07] RECOVERY - Disk space on search1015 is OK: DISK OK [19:37:52] PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [19:39:22] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.488 seconds [19:40:34] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.301 seconds [19:47:46] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:50:55] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:49] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.376 seconds [19:54:12] !log sq67-sq70 have been reinstalled, but not signed in puppet, not sure if they are ready for that or if there are other items mark needs to change first [19:54:15] Logged the message, RobH [19:55:43] PROBLEM - NTP on sq68 is CRITICAL: NTP CRITICAL: No response from NTP server [19:57:58] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:07] New patchset: Lcarr; "adding in all old nagios groups to purge" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3223 [20:01:19] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3223 [20:01:25] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.335 seconds [20:03:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3223 [20:04:01] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3223 [20:06:13] PROBLEM - NTP on sq69 is CRITICAL: NTP CRITICAL: No response from NTP server [20:07:34] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:34] PROBLEM - NTP on sq70 is CRITICAL: NTP CRITICAL: No response from NTP server [20:16:43] RECOVERY - Varnish HTTP bits on sq67 is OK: HTTP OK HTTP/1.1 200 OK - 630 bytes in 0.003 seconds [20:18:04] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:19] RECOVERY - LVS HTTPS on bits.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3916 bytes in 9.013 seconds [20:23:28] RECOVERY - Varnish HTTP bits on sq69 is OK: HTTP OK HTTP/1.1 200 OK - 630 bytes in 0.009 seconds [20:24:49] PROBLEM - Varnish HTTP bits on sq67 is CRITICAL: Connection refused [20:26:46] RECOVERY - NTP on sq69 is OK: NTP OK: Offset -0.04886293411 secs [20:26:55] RECOVERY - Varnish HTTP bits on sq67 is OK: HTTP OK HTTP/1.1 200 OK - 630 bytes in 0.007 seconds [20:27:22] RECOVERY - LVS HTTP on bits.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3910 bytes in 0.009 seconds [20:28:43] RECOVERY - Varnish HTTP bits on sq70 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.002 seconds [20:32:10] RECOVERY - NTP on sq70 is OK: NTP OK: Offset -0.021941185 secs [20:36:22] RECOVERY - Varnish HTTP bits on sq68 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.007 seconds [20:43:25] New patchset: Lcarr; "correcting service group config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3224 [20:43:32] hume: sudo: no tty present and no askpass program 
specified [20:43:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3224 [20:43:45] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3224 [20:43:47] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3224 [20:44:39] !log dns update for silver and zhen servers [20:44:42] Logged the message, RobH [20:46:03] !log bits.pmtpa cluster back online [20:46:06] Logged the message, Master [20:47:22] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.756 seconds [20:47:40] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.439 seconds [20:50:13] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [20:53:40] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:54:07] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:12:16] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.500 seconds [21:14:58] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.811 seconds [21:19:27] !log rebalancing instances gluster volume [21:19:30] Logged the message, Master [21:24:34] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:37] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.208 seconds [21:45:43] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:47:49] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.461 seconds [21:49:08] New patchset: Bhartshorne; "increasing speed of the swiftcleaner so it has a chance to finish its scan in a reasonable amount of time" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3226 [21:49:21] New review: 
gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3226 [21:49:24] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3226 [21:49:26] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3226 [21:53:37] New patchset: Bhartshorne; "attempt to get a timestamp into the swiftcleaner log name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3227 [21:53:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3227 [22:05:49] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:06:42] binasher: hi, i'm here already [22:06:55] hey [22:07:08] lets aim to go over the db stuff in around an hour [22:07:14] great [22:09:43] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.260 seconds [22:21:10] !log getting rid of Swift HTTP checks on non production machines manually (come on spence _purge_ ;P) [22:21:14] Logged the message, Master [22:22:00] mutante: maybe that requires managed resources? i don't remember [22:34:48] New patchset: Pyoungmeister; "no lucene monitoring for indexers, as it does not work properly..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3228 [22:35:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3228 [22:36:21] New review: Dzahn; "yep, thanks!" 
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3228 [22:36:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3228 [22:42:43] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 205 seconds [22:44:37] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3227 [22:44:40] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3227 [22:46:28] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 351 seconds [22:48:32] !log purging Lucene monitoring on indexer from db9, remove duplicate service definitions manually anyways (still tons left), run purge script, reload Nagios.. [22:48:35] Logged the message, Master [23:05:07] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [23:05:34] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [23:12:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:13:49] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 199 seconds [23:14:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.684 seconds [23:14:43] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 234 seconds [23:21:38] New patchset: Ryan Lane; "Removing gerrit bot from wikimedia-tech" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3230 [23:21:50] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3230 [23:21:58] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3230 [23:22:01] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3230 [23:28:49] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [23:30:46] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [23:36:19] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 4 seconds [23:37:13] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 1 seconds [23:37:49] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [23:37:49] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [23:37:57] New patchset: Bhartshorne; "needed to escape the %s for cron to play nice." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3231 [23:38:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3231 [23:38:20] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3231 [23:38:23] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3231 [23:48:21] mutante: are you going to be online tomorrow? [23:49:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:49:34] binasher: unfortunately, not really, as tomorrow is Saturday for me already [23:49:41] and booked some tour [23:50:14] damn you, international date line! [23:50:50] looks like i'm going to go over building a new slave with notpeter tomorrow, but can go through the same with you next week [23:52:08] i can be .in like 12 hours.. 
but that would still be the night for you i think [23:52:31] binasher: alright, that would be great, and/or maybe i can just have the chat log or something [23:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.034 seconds [23:54:11] there is a slight chance that tour is cancelled due to weather, in that case i'll join in anyways [23:54:27] New patchset: Bhartshorne; "putting the location of the swiftcleaner script into the config file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3232 [23:54:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3232 [23:55:38] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3232 [23:55:40] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3232