[00:08:58] New review: Ryan Lane; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50268 [00:09:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50268 [00:12:27] New review: Ryan Lane; "Patch Set 2: Code-Review-1" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/50011 [00:14:24] New review: Ryan Lane; "Patch Set 2:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50015 [00:14:29] TimStarling: any chance of an inline documentation sprint for lua? [00:15:06] luadoc ported to PHP? [00:15:23] no I mean comments for the php code [00:15:28] ;) [00:15:37] I thought it had comments already [00:15:49] it certainly has some [00:16:17] * AaronSchulz wonders how long it will be before he can really review more stuff [00:18:51] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [00:19:51] TimStarling: I keep looking over the code, though I see stuff that seems to lack context or clarity about what it's trying to do [00:21:52] * AaronSchulz looks at doCachedExpansion [00:22:41] New patchset: Pyoungmeister; "more removal of old mediawiki class junk" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50321 [00:23:32] that should probably be protected [00:25:09] binasher: https://ishmael.wikimedia.org/?host=db1017&hours=24&sort=time ? [00:25:38] that seems frowny [00:25:44] what the... [00:25:49] and mathy [00:25:57] * binasher breaks ishmael again  [00:26:25] are those numbers not accurate? [00:26:34] they should be accurate [00:26:45] I know math was changed lately [00:26:56] I also saw lock wait timeouts in the logs [00:26:59] i just meant that as in "lets make pretend we didn't see it!" [00:27:21] for some reason, i thought MathRenderer::writeDBEntry was only supposed to be called very infrequently [00:28:48] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50321 [00:28:57] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50321 [00:37:05] New patchset: Pyoungmeister; "moving udpprofile out of mediawiki.pp into misc/udpprofile.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50322 [00:37:47] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50322 [00:37:57] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50322 [00:42:34] New patchset: Aaron Schulz; "Made rewrite.py aware of the transcoded zone." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50323 [00:45:04] New review: Aaron Schulz; "Patch Set 1:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50323 [00:46:19] binasher: is the math load temporary perhaps? Maybe it is migrating things on the fly? 
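For context on that last question: MathRenderer::writeDBEntry is meant to fire only on cache misses, so a miss-only write path is the expected shape. A minimal runnable sketch of that pattern in Python (illustrative only; the extension itself is PHP, and the dict here just stands in for the math table):

```python
import hashlib

_math_table = {}  # stand-in for the real DB table

def render(tex):
    return "<mathml for %r>" % tex  # placeholder for the real renderer

def get_html(tex):
    key = hashlib.md5(tex.encode("utf-8")).hexdigest()
    cached = _math_table.get(key)
    if cached is not None:
        return cached            # hit: no DB write at all
    html = render(tex)           # miss: render once...
    _math_table[key] = html      # ...and write once; this should be
    return html                  # the only path that hits writeDBEntry
```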
[00:46:31] I actually know fuck all about what changed though [00:47:05] it's definitely still happening [00:47:16] I mean for the next few days [00:47:25] ori might know [00:51:12] New patchset: Pyoungmeister; "moving mediawiki-math package into pdf misc class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50324 [00:52:32] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50324 [00:52:40] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50324 [00:55:44] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49916 [00:57:06] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 0 seconds [01:06:16] New patchset: Aaron Schulz; "Made the transcoded container sharded like the others." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50327 [01:07:00] binasher: I'm checking [01:07:33] ori-l: thanks! [01:08:01] New review: Aaron Schulz; "Patch Set 1: Code-Review+2" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/50327 [01:08:18] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50327 [01:08:55] !log aaron synchronized wmf-config/filebackend.php 'deployed f699c5a9ad2f780dc18863592dcae4a51ba6d841' [01:08:56] Logged the message, Master [01:11:23] binasher: I think I see the problem. Should have a fix in a few. [01:15:51] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [01:16:00] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [01:16:54] Gerrit seems intermittently unhappy [01:17:01] ssh: connect to host gerrit.wikimedia.org port 29418: Connection timed out [01:17:01] fatal: The remote end hung up unexpectedly [01:17:01] New patchset: Aaron Schulz; "Updated the sharded containers list." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50329 [01:17:45] New patchset: Aaron Schulz; "Updated the sharded containers list (added transcoded zone)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50329 [01:18:08] paravoid: enjoy :) [01:25:36] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , eswiki (31767), Total (33404) [01:27:51] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , eswiki (21559), Total (24516) [01:31:08] New patchset: Andrew Bogott; "I'm thinking that upstart_job regards true and "true" as not equal." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50330 [01:31:48] New review: Andrew Bogott; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50330 [01:32:57] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50330 [01:35:20] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [01:36:31] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [01:40:11] binasher: https://gerrit.wikimedia.org/r/#/c/50332/ [01:45:13] paravoid: Swift looks pretty unhappy from the MW side [01:45:22] Load of timeouts [01:45:25] ori-l: thanks!
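Aside: the "sharded like the others" change (50327) splits one logical Swift container into many via a hash suffix on the object name. A rough Python sketch of the general scheme; the two-hex-digit suffix and container naming here are assumptions for illustration, the real settings live in wmf-config/filebackend.php:

```python
import hashlib

def sharded_container(base, object_name, hex_digits=2):
    # e.g. base.00 .. base.ff instead of one giant container
    digest = hashlib.md5(object_name.encode("utf-8")).hexdigest()
    return "%s.%s" % (base, digest[:hex_digits])

print(sharded_container("wikipedia-commons-local-transcoded", "Example.webm"))
```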
[01:45:49] RECOVERY - MySQL disk space on neon is OK: DISK OK [01:46:16] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [01:56:28] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours [01:56:56] binasher: how soon does it need to go out? (wondering if i should cherry-pick this on top of the deployed tip and prepare a submodule bump) [01:57:11] !log migrating wikivoyage and wikidata external blob stores to innodb [01:57:14] Logged the message, Master [01:58:20] ori-l: the sooner the better, but not urgent enough that it shouldn't wait til tomorrow at least to deploy, when more people are around [01:58:41] k, i'll add it to the calendar for tomorrow [01:59:01] many thanks for tracking that down with lightning speed [01:59:32] if you discount the fact that aaron flagged it yesterday, sure :) [01:59:41] i promised to look but forgot. oh well. [02:04:25] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , ptwiki (84879), Total (86046) [02:05:01] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 190 seconds [02:05:46] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 205 seconds [02:20:28] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [02:29:09] !log LocalisationUpdate completed (1.21wmf10) at Fri Feb 22 02:29:09 UTC 2013 [02:29:11] Logged the message, Master [02:32:28] PROBLEM - Puppet freshness on sq71 is CRITICAL: Puppet has not run in the last 10 hours [02:40:52] PROBLEM - SSH on labstore4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:41:28] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [02:44:19] RECOVERY - SSH on labstore4 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:48:31] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , arwiki (14217), ptwiki (36874), Total (51752) [02:48:40] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , arwiki (13744), ptwiki (36102), Total (50527) [02:49:53] PROBLEM - SSH on labstore4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:06] !log LocalisationUpdate completed (1.21wmf9) at Fri Feb 22 02:53:05 UTC 2013 [02:53:08] Logged the message, Master [02:56:55] RECOVERY - SSH on labstore4 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:02:28] PROBLEM - SSH on labstore4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:04:07] RECOVERY - SSH on labstore4 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:04:16] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [03:04:34] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [03:16:43] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 14 seconds [03:17:37] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 1 seconds [03:29:12] New patchset: Andrew Bogott; "Tune up manage-volumes-daemon a bit." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/50335 [04:02:19] PROBLEM - Puppet freshness on stafford is CRITICAL: Puppet has not run in the last 10 hours [04:34:25] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (203837), plwiktionary (82462), Total (287540) [04:35:19] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [04:35:37] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (200271), plwiktionary (85227), Total (286895) [05:17:17] New patchset: Tim Starling; "Move all favicons to bits" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49802 [05:17:23] New review: Tim Starling; "Patch Set 3: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49802 [05:17:24] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49802 [05:18:42] !log tstarling synchronized docroot/bits [05:18:45] Logged the message, Master [05:20:02] !log tstarling synchronized wmf-config/InitialiseSettings.php [05:20:04] Logged the message, Master [05:31:58] New patchset: Tim Starling; "Fix some favicons I broke" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50340 [05:32:09] nobody really cares about favicons, right? [05:32:32] New review: Tim Starling; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50340 [05:32:33] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50340 [05:33:10] !log tstarling synchronized docroot/bits/favicon/wikipedia.ico [05:33:12] Logged the message, Master [05:33:37] !log tstarling synchronized wmf-config/InitialiseSettings.php [05:33:37] Logged the message, Master [05:50:25] New patchset: Tim Starling; "Remove virtual host entries for deleted wikis" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50343 [05:50:48] New review: Tim Starling; "Patch Set 1: Verified+2 Code-Review+2" [operations/apache-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50343 [05:50:49] Change merged: Tim Starling; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50343 [05:56:39] New patchset: Tim Starling; "Remove document roots for deleted wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50344 [06:11:57] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 189 seconds [06:12:42] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:12:51] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 209 seconds [06:14:21] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.024 second response time on port 8123 [06:20:57] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:21:40] New patchset: Tim Starling; "Use favicon.php for /favicon.ico of all wikis" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50346 [06:21:52] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [06:22:45] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [06:22:55] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [06:33:35] New review: Tim Starling; 
"Patch Set 1: Verified+2 Code-Review+2" [operations/apache-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50346 [06:33:35] Change merged: Tim Starling; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50346 [06:48:45] New patchset: Tim Starling; "Adding CentralNotice user right to meta and testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50196 [06:48:50] New review: Tim Starling; "Patch Set 3: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50196 [06:48:51] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50196 [06:49:18] RECOVERY - Lucene on search1016 is OK: TCP OK - 3.024 second response time on port 8123 [06:49:43] !log tstarling synchronized wmf-config/InitialiseSettings.php [06:49:44] Logged the message, Master [06:58:54] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [06:59:12] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [07:15:58] New review: Faidon; "Patch Set 1: Code-Review-1" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/50149 [07:17:52] paravoid: Tim-away: is that mostly for a particular filetype? or it effects everything that goes through imagemagick? [07:18:13] (re the last gerrit msg above) [07:20:04] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [07:31:18] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [07:31:27] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123 [07:31:54] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [07:33:07] !log restarted lucene search on search1016 [07:33:08] RECOVERY - MySQL disk space on neon is OK: DISK OK [07:33:09] Logged the message, Master [07:37:21] New review: Ryan Lane; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50335 [07:41:35] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50335 [08:08:42] PROBLEM - Puppet freshness on db40 is CRITICAL: Puppet has not run in the last 10 hours [08:10:30] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [08:10:57] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [08:27:45] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [08:38:16] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [08:39:28] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [08:42:13] MaxSem: check_solr for vanadium [08:42:13] Traceback (most recent call last): [08:42:13] File "/usr/local/nagios/libexec/check_solr", line 133, in [08:42:16] (e, ) = err.args [08:42:19] ValueError: need more than 0 values to unpack [08:42:27] ugh [08:42:35] my Python sucks [08:43:12] will investigate [08:43:21] you can do multiple except Foo blocks [08:43:33] so you can put the except URLError above the generic one [08:43:38] unrelated to this [08:43:42] just saying, better style [08:44:09] yeah, but I need specific handling for only one case of URLError [08:44:55] because whatever error message urrlib2 uses for connection timeouts makes no sense whatsoever [08:46:46] paravoid, can we check if this monitoring works? [08:47:03] which one? 
[08:47:36] MaxSem: ? [08:48:47] we can temporarily disable jetty on one of the Tampa servers, it's not used by the API [08:49:15] we had the usual pages tonight, saw nothing from your checks [08:50:13] so I wonder if it will work, or keep OK like the previous monitoring did:) [08:52:49] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [08:54:28] RECOVERY - MySQL disk space on neon is OK: DISK OK [09:06:19] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [10:19:45] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [10:42:33] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , dewiki (11481), Total (29336) [10:44:03] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , dewiki (11519), Total (23451) [10:53:35] New patchset: Platonides; "Handling of packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50015 [10:53:35] New patchset: Platonides; "Configuration for webtools-apache VMs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50011 [10:53:51] New review: Platonides; "Patch Set 3:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50011 [10:55:07] New review: Platonides; "Patch Set 3:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50015 [11:02:01] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [11:03:13] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [11:23:55] !log DNS update - add indiawikipedia.com [11:23:59] Logged the message, Master [11:57:46] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours [12:20:04] !log Halted cp3010 for controller replacement [12:20:06] Logged the message, Master [12:21:18] New patchset: Dzahn; "(bug 44893) Set up redirect from tartupeedia.ee to a page on etwiki" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/48868 [12:23:16] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100% [12:27:59] New review: ArielGlenn; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/47103 [12:28:11] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47103 [12:33:46] PROBLEM - Puppet freshness on sq71 is CRITICAL: Puppet has not run in the last 10 hours [12:35:34] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 118.40 ms [12:39:19] PROBLEM - Varnish HTTP upload-frontend on cp3010 is CRITICAL: Connection refused [12:42:46] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [12:42:55] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100% [12:43:04] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [12:46:27] New patchset: Dzahn; "(bug 44893) Set up redirect from tartupeedia.ee to a page on etwiki" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/48868 [12:49:40] PROBLEM - Host cp3004 is DOWN: PING CRITICAL - Packet loss = 100% [12:52:31] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [12:52:58] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [12:53:52] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 118.36 ms [12:55:04] RECOVERY - Host cp3003 is 
UP: PING OK - Packet loss = 0%, RTA = 118.32 ms [12:55:22] RECOVERY - Varnish HTTP upload-frontend on cp3010 is OK: HTTP OK HTTP/1.1 200 OK - 675 bytes in 0.237 seconds [12:55:40] RECOVERY - Host cp3004 is UP: PING WARNING - Packet loss = 66%, RTA = 118.54 ms [12:57:55] PROBLEM - Varnish traffic logger on cp3010 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:59:16] PROBLEM - Varnish HTTP upload-backend on cp3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:59:43] PROBLEM - Varnish HTTP upload-backend on cp3004 is CRITICAL: Connection refused [12:59:52] PROBLEM - Varnish traffic logger on cp3003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:00:01] PROBLEM - Varnish HTTP upload-frontend on cp3004 is CRITICAL: Connection refused [13:00:19] PROBLEM - Varnish traffic logger on cp3004 is CRITICAL: Connection refused by host [13:00:19] PROBLEM - Varnish HTCP daemon on cp3003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:00:19] PROBLEM - Varnish HTTP upload-frontend on cp3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:46] PROBLEM - SSH on cp3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:55] PROBLEM - Varnish HTCP daemon on cp3004 is CRITICAL: Connection refused by host [13:01:13] PROBLEM - SSH on cp3004 is CRITICAL: Connection refused [13:03:19] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:11] !log Halted cp3009 for controller replacement [13:07:12] Logged the message, Master [13:09:20] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.28 ms [13:10:13] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100% [13:11:06] !log Halted ms-be3004 for controller replacement [13:11:07] Logged the message, Master [13:11:34] RECOVERY - Host maerlant is UP: PING OK - Packet loss = 0%, RTA = 118.26 ms [13:13:50] PROBLEM - Host ms-be3004 is DOWN: PING CRITICAL - Packet loss = 100% [13:14:08] PROBLEM - Host ms-be3003 is DOWN: PING CRITICAL - Packet loss = 100% [13:14:21] !log Halted ms-be3001-3003 for controller replacement [13:14:23] Logged the message, Master [13:14:52] RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 118.56 ms [13:15:37] PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:13] PROBLEM - Varnish HTTP upload-frontend on cp3009 is CRITICAL: Connection refused [13:20:34] RECOVERY - Varnish traffic logger on cp3010 is OK: PROCS OK: 3 processes with command name varnishncsa [13:21:01] PROBLEM - Varnish traffic logger on cp3009 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [13:21:02] New patchset: Dzahn; "(bug 44893) Set up redirect from tartupeedia.ee to a page on etwiki" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/48868 [13:21:05] hello [13:21:37] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100% [13:23:07] PROBLEM - NTP on cp3004 is CRITICAL: NTP CRITICAL: No response from NTP server [13:30:37] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:43] RECOVERY - Varnish traffic logger on cp3009 is OK: PROCS OK: 3 processes with command name varnishncsa [13:37:13] RECOVERY - Varnish HTTP upload-frontend on cp3009 is OK: HTTP OK HTTP/1.1 200 OK - 675 bytes in 0.235 seconds [13:38:43] RECOVERY - Host ms-be3002 is UP: PING OK - Packet loss = 0%, RTA = 118.33 ms [13:40:51] New review: Hashar; "Patch Set 1:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50306 [13:42:29] RECOVERY - Host 
cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.30 ms [13:43:14] PROBLEM - NTP on cp3003 is CRITICAL: NTP CRITICAL: No response from NTP server [13:44:53] PROBLEM - SSH on ms-be3002 is CRITICAL: Connection refused [13:49:59] RECOVERY - Host ms-be3001 is UP: PING OK - Packet loss = 0%, RTA = 118.51 ms [13:54:11] RECOVERY - Host ms-be3003 is UP: PING OK - Packet loss = 0%, RTA = 118.44 ms [13:55:24] PROBLEM - SSH on ms-be3001 is CRITICAL: Connection refused [13:57:47] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [13:57:56] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [13:58:14] PROBLEM - SSH on ms-be3003 is CRITICAL: Connection refused [13:59:17] RECOVERY - MySQL disk space on neon is OK: DISK OK [14:00:02] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.022 second response time on port 8123 [14:03:29] PROBLEM - Puppet freshness on stafford is CRITICAL: Puppet has not run in the last 10 hours [14:04:59] PROBLEM - NTP on ms-be3002 is CRITICAL: NTP CRITICAL: No response from NTP server [14:11:22] New patchset: Reedy; "(bug 45165) Create rollbacker group for wikidatawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49847 [14:11:29] New review: Reedy; "Patch Set 2: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49847 [14:11:32] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49847 [14:13:51] New patchset: Reedy; "(bug 45124) Allow wikidatawiki sysops to add/remove confirmed status" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49682 [14:14:40] New patchset: Reedy; "(bug 45124) Allow wikidatawiki sysops to add/remove confirmed status" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49682 [14:16:38] New patchset: Reedy; "(bug 45124) Allow wikidatawiki sysops to add/remove confirmed status" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49682 [14:16:50] PROBLEM - NTP on ms-be3001 is CRITICAL: NTP CRITICAL: No response from NTP server [14:17:27] New review: Reedy; "Patch Set 4: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49682 [14:17:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49682 [14:18:07] !log reedy synchronized wmf-config/InitialiseSettings.php [14:18:09] Logged the message, Master [14:22:05] PROBLEM - NTP on ms-be3003 is CRITICAL: NTP CRITICAL: No response from NTP server [14:22:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:24:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.777 seconds [14:36:29] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [14:43:38] New patchset: Platonides; "Handling of packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50015 [14:48:29] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [14:52:23] <^demon> !log restarted gerrit on manganese, jetty flapping again [14:52:25] Logged the message, Master [14:52:26] bah Gerrit died [14:52:27] oh [14:52:28] :-] [14:52:41] <^demon> It should be coming back up in a second. [14:53:33] back :-) [14:54:47] heh, Gerrit also uses jetty:) [14:56:41] <^demon> MaxSem: Also? 
[14:57:32] like Solr [14:57:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:58:19] <^demon> Oh, heh. [14:58:24] <^demon> Well, lots of java things use jetty :) [14:59:21] 2013-02-22 14:55:54,316 ERROR zuul.GerritEventConnector: Received unrecongized event type 'reviewer-added' from Gerrit. Can not get account information. [14:59:22] bhahhh [14:59:41] ^demon, does Gerrit simply log to a local file? [15:00:09] <^demon> yup. [15:00:31] guess not much of a difference for a single server [15:05:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.746 seconds [15:09:47] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 202 seconds [15:10:05] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 209 seconds [15:15:56] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [15:20:17] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.019 second response time on port 8123 [15:27:47] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 11 seconds [15:28:05] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [15:34:46] New patchset: Tpt; "(bug 40759) Let Proofread Page setup namespaces for fi.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48877 [15:34:55] New review: ArielGlenn; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50041 [15:35:04] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50041 [15:44:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:08] !log gallium removed live hack in /srv/org/wikimedia/doc/puppetsource/manifests/role/labsmediawiki.pp [15:47:10] Logged the message, Master [15:47:11] apergos: ^^^ [15:47:19] yay [15:47:29] I wonder what it was for [15:47:46] I'd rerun puppet but the first run is stil going [15:47:48] *still [15:47:53] that removed a newline between the comment block and the class [15:48:01] I guess andrew tested it locally and forgot to reset [15:48:04] ah [15:48:06] fine [15:48:07] it is in the repo already [15:48:31] gah it really runs slow over there, it's killing m [15:48:31] e [15:48:40] slow ? [15:48:42] what puppet? [15:48:45] yep [15:49:03] cause it includes all the stupid manifests :-]  And gallium is very active [15:49:10] + its disk is crippled [15:49:12] I/O takes ages [15:49:37] New patchset: Hashar; ".pep8 configuration file" [operations/dumps] (master) - https://gerrit.wikimedia.org/r/50369 [15:49:37] New patchset: Hashar; "pep8: E302 expected 2 blank lines, found 1" [operations/dumps] (master) - https://gerrit.wikimedia.org/r/50370 [15:49:37] New patchset: Hashar; "pep8 whitespaces fixing" [operations/dumps] (master) - https://gerrit.wikimedia.org/r/50371 [15:49:47] apergos: here are the pep8 checks for operations/dumps :-] [15:49:59] <^demon> "notice: Run of Puppet configuration client already in progress; skipping" <- I'm pretty sure puppet is making things up. [15:50:04] they still fail though https://integration.mediawiki.org/ci/job/operations-dumps-pep8/3/consoleFull :-( [15:50:16] ah let me get those in [15:50:38] and there are four puppet agents running on gallium :-] [15:50:57] four??
[15:50:57] oh oh [15:51:06] if it is in that state, need to delete the lock file [15:51:13] lemme look at that first [15:51:14] and then rerun puppet again [15:51:17] usually that is cause of [apt-get] [15:51:20] yea [15:51:23] well [15:51:34] in labs I simply restart puppet [15:51:35] gotta wait for current run to complete, it is progressing [15:51:38] <^demon> mutante: Where's said lock file? [15:51:47] after that I can do cleanup [15:52:59] off topic: anyone ever listened to Pearl Jam album "No Code"? :-D [15:53:13] of course I can't review it (merge it) [15:53:17] cause jenkins failed it [15:53:20] I found out the album this morning, must have been hidden for a bit more than 10 years [15:53:33] <^demon> mutante: Ah, seems to have just been slow, not actually stuck. [15:53:46] New review: ArielGlenn; "Patch Set 1: Code-Review+2" [operations/dumps] (master) C: 2; - https://gerrit.wikimedia.org/r/50369 [15:53:48] apergos: the dumps fix ? [15:53:56] apergos: ah yeah I need to make it V+2 [15:53:58] New review: ArielGlenn; "Patch Set 1: Verified+2" [operations/dumps] (master); V: 2 - https://gerrit.wikimedia.org/r/50369 [15:53:58] Change merged: ArielGlenn; [operations/dumps] (master) - https://gerrit.wikimedia.org/r/50369 [15:54:08] ^demon: /var/lib/puppet/state/puppetdlock i think [15:54:11] yeah [15:54:27] but puppet agent --disable , and then --enable again could also work [15:54:45] --enable [15:54:47] Enable working on the local system. This removes any lock file, causing 'puppet agent' to start managing the local system again [15:55:09] ok, gotcha, just slow [15:55:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.045 seconds [15:55:39] apergos: I did the change to let Jenkins V+2 on operations/dumps :-] [15:55:42] ok [15:55:48] it happened some time in the past .. like http://projects.puppetlabs.com/issues/2888 [15:56:10] gee, blank content [15:56:11] <^demon> apergos: Did you merge 50041 on sockpuppet? [15:56:12] bad gerrit [15:56:15] no [15:56:20] wait which one was that [15:56:26] <^demon> The hook for gerrit [15:56:46] yes [15:57:06] <^demon> Hmm, puppet finished but the file didn't update :\ [15:57:19] <^demon> Oh, there it is. [15:57:40] <^demon> Nevermind :) [15:57:55] whew [15:58:03] ok one puppet agent on gallium now [15:58:07] let's try a new puppet run [15:58:18] running [15:58:48] Change merged: ArielGlenn; [operations/dumps] (master) - https://gerrit.wikimedia.org/r/50370 [15:59:11] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/48868 [15:59:15] uh oh [15:59:29] New review: Dzahn; "dzahn@fenari:~$ apache-fast-test tartupeedia.ee.url mw1044" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/48868 [15:59:43] hashar: and that Apache change now merged without manual Verify, sweet [16:00:13] mutante: beware it just lints the config [16:00:20] aka make sure that is something that Apache can load [16:00:30] um [16:00:32] the integration tests, I need to find out a way to do it properly [16:00:51] so it merged but where is it on sockpuppet? [16:00:58] hashar: of course, i test those on a single server first each time, and paste the output of apache-fast-test [16:00:59] possibly loading an apache instance that listens on 127.0.0.2:80 and run Jeff's script against it [16:01:08] mutante: ....
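The single-server test mutante describes (and the local-Apache idea hashar floats) reduces to: ask one specific backend for a URL while presenting the production Host header, and inspect the response before touching the fleet. A hedged Python 2 sketch of such a probe; this is not apache-fast-test itself:

```python
import httplib  # Python 2; http.client in Python 3

def probe(backend, hostname, path="/"):
    # Connect straight to one Apache, but send the site's Host header
    # so the matching virtual host answers.
    conn = httplib.HTTPConnection(backend, 80, timeout=5)
    conn.request("GET", path, headers={"Host": hostname})
    resp = conn.getresponse()
    return resp.status, resp.getheader("Location")

# e.g. probe("mw1044", "tartupeedia.ee") should show the 301 target
# before the redirect is gracefulled everywhere.
```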
[16:01:19] oh different repo [16:01:22] nm [16:01:26] * apergos <- dumb [16:03:28] mw110: ssh: connect to host mw110 port 22: Connection refused [16:03:38] Change merged: ArielGlenn; [operations/dumps] (master) - https://gerrit.wikimedia.org/r/50371 [16:03:40] looking at that one [16:04:39] oh, sits in installer at partitioning screen [16:05:16] New patchset: MaxSem; "Fix exception handling, lint" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50376 [16:05:18] that would get it [16:05:50] RobH: mw110, sits at installer.. fyi.. looks like you were on it last per SAL [16:06:16] paravoid, ^^^ [16:07:23] !log mw110 - sits at partitioning screen in installer [16:07:23] Logged the message, Master [16:08:00] ah the check_solr issue [16:08:17] sigh, yes [16:08:25] hm [16:08:33] debugging by proxy is difficult:( [16:08:37] I am wondering whether it would make sense to mount the disk on gallium with noatime [16:09:15] why do you have atime on? [16:09:31] no idea :-] [16:09:34] well s/you/folks/ [16:09:40] /dev/md0 on / type ext3 (rw,errors=remount-ro) [16:10:13] and the repo is local on those disks yeah? [16:10:28] yeah everything is there [16:10:34] it makes sense to me but maybe double check with someone else [16:10:41] the git clones, jobs, mediawiki files etc [16:10:46] yep [16:10:51] will mail ops list [16:10:51] :-) [16:11:02] puppet ran happily btw [16:11:15] dzahn is doing a graceful restart of all apaches [16:11:51] !log dzahn gracefulled all apaches [16:11:52] Logged the message, Master [16:12:59] !log gracefulling eqiad Apaches via dsh to push tartupeedia.ee redirect (bug 44893) [16:13:00] Logged the message, Master [16:14:18] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:39] ..on fenari ..eh [16:15:57] it's running [16:15:57] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4915 bytes in 0.005 seconds [16:16:25] I restarted it [16:16:27] :-P [16:16:42] /etc/init.d/apache2: 55: [: nice: unexpected operator [16:16:46] that's a nice feature [16:17:02] ah, that happens on tons of them for a while [16:17:07] but it doesnt break them [16:17:29] ah i noticed that on beta too [16:17:30] was there any relation to doing that graceful and fenari ? i dont think it is in any apache groups [16:17:32] missing " " [16:18:40] mw1067: /etc/init.d/apache2: 55: [: nice: unexpected operator [16:18:40] mw1067: * Reloading web server config apache2 [16:18:40] mw1068: /etc/init.d/apache2: 55: [: nice: unexpected operator [16:18:40] mw1067: ...done. [16:18:45] where is that conf file though [16:19:20] needs fix, but it doesnt keep them from restarting and it's not brand new [16:19:30] well the fix is 2 seconds if we have the file [16:19:31] if [ ! -x $APACHE_HTTPD ] ; then [16:19:35] missing double quotes I guess [16:19:59] but I don't know where that is [16:20:56] envvars.appserver:NICE=$((-`nice`)) [16:21:01] that? [16:21:16] ./puppet/files/apache/envvars.appserver [16:21:32] no [16:21:48] literally as we said earlier, needs [16:21:52] if [ ! -x $APACHE_HTTPD ] ; then [16:21:52] to [16:21:58] if [ !
-x "$APACHE_HTTPD" ] ; then [16:22:05] but where is that startup script [16:22:05] dpkg -S /etc/init.d/apache2 [16:22:05] apache2.2-common: /etc/init.d/apache2 [16:22:11] baaaahhhhhh [16:22:15] which belong to ubuntu :( Installed: 2.2.22-1ubuntu1.2 [16:22:15] arr [16:22:40] well it was nice looking at it :-P [16:23:00] "nice" :) [16:23:07] heh heh [16:24:56] !log tartupeedia.ee now redirects to et.wp portal page (bug 44893) [16:24:57] Logged the message, Master [16:28:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:43] opened a boring little bug abuot the script [16:36:46] :-/ [16:37:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.094 seconds [16:40:45] apergos: mutante can you please land in role::db::beta dummy role class https://gerrit.wikimedia.org/r/#/c/49703/3 ? :-D [16:40:57] would let me have mysql::packages on labs :-] [16:41:02] and the nice motd [16:44:06] you just need the packages, no config ot anything whatsoever? [16:44:14] *or [16:44:36] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , cawiki (18480), enwiki (680984), fawiki (117519), Total (818762) [16:45:00] thanks nagios [16:45:04] *sigh* [16:45:33] hashar: ? [16:45:40] hashar: what about mariadb again? i would have to lookup an email..hm [16:46:00] apergos: mutante: na just the packages, to make sure I am in sync with production [16:46:19] I am not sure what is the mariadb state :-] [16:46:23] we're not moved over yet so I guess it's ok for now but [16:46:30] soonish I guess [16:46:36] maybe I should use mariadb hyeah [16:47:00] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , cawiki (12691), enwiki (675019), fawiki (114063), Total (805694) [16:47:03] will save me from having to migrate a second time [16:47:06] poor job queue :( [16:47:41] ahh [16:47:46] coredb_mysql::packages and set $mariadb [16:50:23] yeah [16:50:28] will poke asher about it next week [16:50:52] Change abandoned: Hashar; "per review with apergos and mutante, lets just use mariadb :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49703 [16:51:38] k [16:53:33] hashar: i already had working mariadb classes in puppet long time ago. .but they were lost >:p [16:53:48] i wonder if i should still resubmit or not [16:54:03] yes [17:16:24] RECOVERY - Puppet freshness on es1004 is OK: puppet ran at Fri Feb 22 17:15:49 UTC 2013 [17:16:25] RECOVERY - Puppet freshness on db40 is OK: puppet ran at Fri Feb 22 17:15:54 UTC 2013 [17:16:51] ^demon: class passwords::gerrit { [17:17:01] ^demon: done [17:17:26] <^demon> Thanks! [17:18:18] yw [17:30:34] New patchset: Demon; "Configure hooks-bugzilla plugin for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50379 [17:33:35] New patchset: Demon; "Configure hooks-bugzilla plugin for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50379 [17:41:53] New patchset: Demon; "Configure hooks-bugzilla plugin for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50379 [17:52:26] food time [17:53:00] <^demon> nom nom. [17:53:16] !log taking down sq41 to replace controller card.... 
https://rt.wikimedia.org/Ticket/Display.html?id=4550 [17:53:18] Logged the message, Master [17:59:51] damn, nagios is over a page long again [17:59:56] * RobH shakes fist at air [18:05:41] RECOVERY - Host sq41 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:06:35] PROBLEM - Puppet freshness on srv245 is CRITICAL: Puppet has not run in the last 10 hours [18:06:53] RobH: replaced controller card on sq41...should be back up now [18:07:45] sbernardin1: rt 4550? [18:08:21] thx dude [18:10:02] PROBLEM - Frontend Squid HTTP on sq41 is CRITICAL: Connection refused [18:11:49] ah good [18:14:25] sbernardin: what's up with rt 4497? [18:15:46] Jeremyb_: we don't have any similar memory to replace it with [18:16:02] RECOVERY - Host mw1045 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [18:16:09] sbernardin: we need to update the ticket [18:16:16] sbernardin: we need to purchase? there's been no downtime yet? [18:16:19] i assumed you were after our discussion, i'll add an update now. [18:16:44] Jeremyb_: need to take server offline to get info for purchase [18:17:07] ok. i see dab's online [18:17:11] i'll be back in a bit [18:21:08] PROBLEM - Apache HTTP on mw1045 is CRITICAL: Connection refused [18:24:16] robh: rt4545...finished adding back into groups [18:24:27] need to start service [18:24:43] ok, back in node groups, but not pybal right? [18:25:00] correct [18:25:10] add back to pybal? [18:25:15] puppet runs successfully? [18:25:26] did not run it....you said to ping you [18:25:31] so ping [18:25:32] ahh, ok, so you want to run puppet [18:25:39] dont touch pybal stuff until after its good to go [18:25:52] and then when we add back to pybal, i suggest setting it to false so pybal will test it [18:25:55] but not actually pool it [18:26:03] k [18:26:24] hrmm, all of my hosts that were reported throwing errors for sync-common are not [18:26:30] perhaps puppet runs in past 12 hours have fixed them [18:26:32] \o/ [18:27:11] RobH: updated ticket for sq41...let me know when everything is good with it so I can resolve... [18:27:40] sq41 will need to be reinstalled (robh) [18:28:00] was just a controller swap i thought? [18:28:23] yeah..last time we swapped one...the OS had to be reinstalled [18:28:29] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [18:28:30] check it [18:30:14] ok, will do [18:30:27] sbernardin: i stole ticket, cuz i wanna link it to reinstall ticket [18:32:04] RobH: OK...cool [18:33:07] hrmm, lets see which servers error for sync common all. [18:36:16] RECOVERY - Apache HTTP on mw1045 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.110 second response time [18:37:43] !log robh Started syncing Wikimedia installation... : [18:37:44] Logged the message, Master [18:38:06] robh: set mw1045 to false in pybal [18:38:23] also...when the server goes offline should it be commented out? [18:38:30] nah [18:38:37] pybal will depool it automatically [18:38:46] so unless its going to be gone forever, its fine to leave normally [18:38:53] unless it's bits (iirc) [18:38:57] i just have had some odd behaviors on apaches recently [18:39:10] or memcached... [18:39:13] so im paranoid and wanna see it pool with false [18:39:18] actually, memcached is odd duck.
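For reference, the "set it to false" step above is a single field in pybal's server list, whose entries are Python dict literals, one host per line; with enabled set to False, pybal still health-checks the host but never pools it. Field names as used in the configs of the era; treat the line as illustrative:

```python
{'host': 'mw1045.eqiad.wmnet', 'weight': 10, 'enabled': False}
```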
[18:39:22] yeah..that's cool [18:39:25] i updated the wikitech page on that recently [18:39:39] since its not slotted, if its going to be offline less than 5 minutes while its fixed [18:39:40] * AaronSchulz wishes the consistent hash was based on arbitrary names [18:39:51] then you could do swaps (same name, different ip) [18:39:51] you may as well leave it in, cuz shuffling it out and back in that quickly may lead to issues [18:40:02] if its going to be offline for hours or so [18:40:15] then take it out, but leave it out for awhile, we have some stuff that will stay in memcached for much longer than a few days [18:40:19] PROBLEM - NTP on mw1045 is CRITICAL: NTP CRITICAL: Offset unknown [18:40:21] but a day or so is better than nothing. [18:40:38] the day or so thing is an arbitrary rob number. [18:40:43] not to be confused with actual factual data. [18:41:02] heh...never confused [18:41:04] AaronSchulz: sounds like you wanna restructure our memcached, thanks for volunteering ;] [18:41:31] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 181 seconds [18:41:47] 118 Warning: require(/usr/local/apache/common-local/php-1.21wmf10/../wmf-config/ExtensionMessages-1.21wmf10.php) [function.require]: [18:41:48] failed to open stream: No such file or directory in /usr/local/apache/common-local/wmf-config/CommonSettings.php on line 2757 [18:41:48] 117 Fatal error: require() [function.require]: Failed opening required '/usr/local/apache/common-local/php-1.21wmf10/../wmf-config/E [18:41:48] xtensionMessages-1.21wmf10.php' (include_path='/usr/local/apache/common-local/php-1.21wmf10/extensions/TimedMediaHandler/handlers/OggHandler/PEAR/File_Ogg:/usr/local/ap [18:41:48] ache/common-local/php-1.21wmf10:/usr/local/lib/php:/usr/share/php') in /usr/local/apache/common-local/wmf-config/CommonSettings.php on line 2757 [18:41:51] oh ffs [18:41:58] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 188 seconds [18:42:56] RobH: pecl memcached doesn't really support that idea [18:43:09] so you have to bring up machines with the same IPs as old ones ;) [18:43:14] eww. [18:43:38] Reedy: So I ran sync-common-all, and none of my new stuff complained [18:43:39] well you could use hostnames too [18:43:45] it also runs sync-common locally without issues [18:43:58] Reedy: on the hosts you said didnt, i think multiple puppet runs overnight may have fixed file issues? [18:44:11] and bring up replacements with the same hostname, though dns would need to be updated obviously [18:44:20] RobH: is that better? :) [18:44:32] no way [18:44:40] though I kind of like not using dns for memcached at all [18:44:45] and doing the mappings in config [18:44:45] it sounds like its messy in an alltogether different manner! [18:44:54] right [18:44:59] heh [18:45:03] damned if you do... [18:45:04] Looks like those errors stopped 8 minutes or so ago [18:45:16] Reedy: .....on all of them? thats odd. [18:45:24] i'll take it, but its odd. [18:45:39] They were all from one server [18:45:39] 10.64.0.75 [18:45:52] PROBLEM - NTP on sq41 is CRITICAL: NTP CRITICAL: No response from NTP server [18:46:25] But the files look right on the new pmtpa apaches [18:46:40] cool, i'll push them back into pybal shortly. 
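What AaronSchulz is wishing for above is a consistent hash keyed on stable node names instead of addresses, so a replacement box inherits its predecessor's slot just by reusing the name. pecl memcached hashes the configured server list itself, hence the same-IP workaround; a minimal Python sketch of the name-keyed idea (illustrative, not anything deployed):

```python
import bisect
import hashlib

def _h(s):
    return int(hashlib.md5(s.encode("utf-8")).hexdigest()[:8], 16)

class NamedHashRing(object):
    def __init__(self, names, points_per_node=64):
        # Each *name* lands on the ring many times; IPs never enter
        # the hash, so "same name, different ip" swaps stay stable.
        self._ring = sorted((_h("%s:%d" % (name, i)), name)
                            for name in names
                            for i in range(points_per_node))
        self._hashes = [h for h, _ in self._ring]

    def node_for(self, key):
        i = bisect.bisect(self._hashes, _h(key)) % len(self._ring)
        return self._ring[i][1]

ring = NamedHashRing(["mc1", "mc2", "mc3"])
print(ring.node_for("user:12345"))  # stable for as long as names persist
```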
[18:46:46] New patchset: Andrew Bogott; "Fix the rval logic in start_volume" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50382 [18:46:47] thx for spotting the issue last night =] [18:48:23] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50382 [18:49:34] robh: the new wtp1001 is replacing the old wtp1001 from b4...correct? [18:49:43] RobH: maybe after removing/adding a server, one could do a rolling flush of the servers over time (in case a second server has to get removed/added) [18:49:49] rt4559 [18:49:59] !log robh Finished syncing Wikimedia installation... : [18:50:00] Logged the message, Master [18:50:06] cmjohnson1: yes, but since its there, the racktables name cannot update [18:50:14] cmjohnson1: so just leave the new one as wmfwhatever in racktables for now [18:50:20] though the arbitrary name thing sure would be easy if the extension just supported it... [18:50:25] k [18:50:27] once we move wtp1001 work to new 1002+ then we rename old one [18:50:45] AaronSchulz: how long would the rolling flush take? [18:51:00] ie: can we make it a script that fires off or we manually fire off when we depool or repool a server? [18:51:02] RobH: whatever speed doesn't break the site :) [18:51:13] yea... thats the tricky part aint it =P [18:51:26] though it sounds like the only way to really do our current setup properly. [18:56:42] !log mw86-99 & mw111 back into pybal pool [18:56:42] Logged the message, RobH [18:57:33] !log mw112-125 back into pybal api pool [18:57:34] Logged the message, RobH [18:57:54] RobH: What are you guys doing to wtp1001 now? [18:58:07] RoanKattouw: So we have the old wtp1001, and a new one [18:58:13] Ah I see [18:58:15] once we bring online wtp1002-1004 [18:58:17] we can kill old 1001 [18:58:21] So we have two machines called wtp1001, fun [18:58:21] and use name on new 1001 [18:58:22] Yeah [18:58:29] sound good? [18:58:33] Yup [18:58:38] (notice i allocated you 4 not 5) [18:58:43] we can allocate more easily, i have them [18:58:47] Right [18:58:54] I only have 3 in eqiad as it stands anyway [18:59:11] so my plan is to bring those online (its been on my list all week) [18:59:13] i have them half done [18:59:19] Oh good [18:59:26] then hand them to you for deployment, and you can then tell me when we can kill the old 1001 [18:59:32] Yeah, sounds good [18:59:42] What will the intermediate name of wtp1001 be? [18:59:59] RoanKattouw: none, not installed [19:00:06] OK [19:00:10] it goes by the asset tag until then, so we'll just ignore it [19:00:14] Oh I see [19:00:20] We bring up 1002-1004, kill 1001, bring up new 1001 [19:00:24] yep [19:00:33] Right, now I understand [19:00:43] well, between kill old and install new is clear all cache and dns records, kill puppetmaster db entries, etc... [19:00:51] but yea [19:01:13] man, i used to recall everything i worked on, now i have to read my notes [19:01:22] i have 1002-1004 installed and ready for initial puppet runs i think, checking now [19:02:11] New patchset: Pyoungmeister; "make squid reload if it sees a new conf, instead of restart" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47030 [19:02:36] oh yea, they did some odd hostname shit, they need reinstall i think.
[19:02:41] damn it [19:03:32] mark: paravoid would love your input on ^^ [19:04:16] won't work [19:04:21] :((( [19:04:22] I'm surprised how it passed lint [19:04:29] notify => Exec['foo'] [19:04:33] ah [19:04:34] notify => foo is invalid syntax [19:04:36] thanks :) [19:05:48] would probably just notify true [19:05:53] sounds good to me! [19:06:04] New patchset: Pyoungmeister; "make squid reload if it sees a new conf, instead of restart" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47030 [19:07:07] VIP="10\\.2\\.\(1\-2\)\\.\(1\|21\)" [19:07:13] notpeter: you forgot one [19:07:19] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [19:07:24] *sigh* [19:07:26] yep [19:07:38] paravoid: is that 1\-2 or just 1-2 [19:08:02] [1-2] [19:08:18] Hey paravoid, so, for new puppet modules: [19:08:20] 2 spaces? [19:08:22] instead of tabs? [19:08:24] duh, of course square brackets, thx [19:08:27] New patchset: Pyoungmeister; "make squid reload if it sees a new conf, instead of restart" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47030 [19:08:39] try "echo foo | egrep '...'" to test :) [19:09:03] it does grep -q [19:09:04] (i've got tabs currently, I/we can change to tabs later if it needs discussion) [19:09:07] ok [19:09:30] !log sq41 reinstalling [19:09:31] Logged the message, RobH [19:10:54] RobH: I can think of some clever ways to deal with this in BagOStuff [19:11:10] though callers would need to pass an extra $flags option [19:11:58] im gonna pretend to follow that ;] [19:12:00] * RobH nods [19:12:49] only a portion of things that cache care about sequential consistency that much [19:12:50] New patchset: Ottomata; "Initial commit of Kafka Module." [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [19:13:10] if that was flagged with the cached item, it could be checked on read against some epoch [19:13:32] one could bump the epoch when membership changes (or could even be based on a hash of the list automatically) [19:13:50] ahh [19:14:18] that sounds feasible, the last part in particular, since it means automatic regen, rather than fire off a script and dont forget to do it [19:14:21] the trick would be not wasting too much space storing this info in the cache [19:15:25] you could also do cool things like call delete( key, mt_rand( 0, x ) to expire the key (so they don't all go at once) [19:15:43] * delete( key, mt_rand( 0, x ) ) [19:15:49] just a crazy thought [19:16:50] New patchset: Dzahn; "apache-sanity-check needs to use new regex to also work on eqiad Apaches see details in RT-4449" [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/49231 [19:18:07] RECOVERY - NTP on mw1045 is OK: NTP OK: Offset 0.002457976341 secs [19:18:22] !log kicked ntpd manually on mw1045 [19:18:24] Logged the message, RobH [19:23:09] New patchset: awjrichards; "Set template to append to mobile photo upload desc" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50387 [19:28:01] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [19:28:10] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [19:34:36] !log start mysqldump of otrs database on db49 [19:34:38] Logged the message, Master [19:36:59] notpeter: Regarding moving mediawiki::singlenode into a module… would roles move into the module as well, or are roles staying in the global manifests dir?
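Two of the threads above are worth pinning down. First, the VIP pattern fix: the quick sanity check is the suggested echo | egrep, and the same test in Python (pattern transcribed from the grep syntax; the trailing anchor is added here only to make the standalone test strict):

```python
import re

vip = re.compile(r'10\.2\.[12]\.(1|21)$')
for ip in ["10.2.1.1", "10.2.2.21", "10.2.3.1", "10.64.0.1"]:
    print("%s -> %s" % (ip, bool(vip.match(ip))))
# expected: True, True, False, False
```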
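Second, the cache ideas AaronSchulz floats, an epoch derived from a hash of the server list plus jittered deletes, could look roughly like this (a concept sketch in Python, not BagOStuff):

```python
import hashlib
import random

def pool_epoch(servers):
    # Changes automatically whenever pool membership changes.
    return hashlib.md5(",".join(sorted(servers)).encode("utf-8")).hexdigest()

def cache_set(cache, key, value, servers):
    cache[key] = {"epoch": pool_epoch(servers), "value": value}

def cache_get(cache, key, servers):
    entry = cache.get(key)
    if entry is None or entry["epoch"] != pool_epoch(servers):
        return None  # written under an old pool: treat as a miss
    return entry["value"]

def staggered_delete_delays(keys, window):
    # The delete(key, mt_rand(0, x)) idea: give each key a random
    # delay within the window so the cache doesn't empty at once.
    return dict((k, random.uniform(0, window)) for k in keys)
```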
[19:37:20] PROBLEM - Puppet freshness on sq41 is CRITICAL: Puppet has not run in the last 10 hours [19:50:28] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 22 seconds [19:51:23] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [19:56:17] RoanKattouw: Ok, wtp1002-1004 are now online and have done successful puppet runs [19:56:20] they are all yours =] [19:56:22] Yay [19:56:27] !log wtp1002-1004 os installed [19:56:28] Logged the message, RobH [19:56:33] * RoanKattouw grabs them [19:57:01] once you have them serving things, take down the existing 1001 as priority, as soon as its gone i'll get the new 1001 online for you [19:57:25] Will do [19:57:29] thx dude [19:58:16] !log sq41 boot order was fubar, fixed, reinstalling (as reboot into installer wiped mbr) [19:58:17] Logged the message, RobH [19:59:29] New patchset: Catrope; "Add wtp1002-4 to Parsoid deployment list" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50392 [20:00:05] RECOVERY - MySQL disk space on neon is OK: DISK OK [20:00:31] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [20:00:32] apergos: I am going to allocate harmon and copper for you. I'll spin them up with the basic raid mirror misc server partitioning? [20:00:33] RobH: Can you merge and deploy that please ---^^ [20:00:57] will do [20:01:04] great thanks [20:01:20] apergos: when done i'll just toss a ticket in core-ops assigned to you for them [20:01:26] these need internal ip yes? [20:01:43] yeah internal is fine [20:01:57] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50392 [20:01:58] cool, thx [20:02:26] thank you! [20:02:29] RoanKattouw: merged on sockpuppet [20:02:59] Yay thanks [20:03:20] I have the same servers for you in Tampa (eventually) [20:03:30] i just have not gotten to them yet.
[20:03:35] OK [20:09:56] !log authdns-update [20:09:57] Logged the message, RobH [20:12:40] RECOVERY - SSH on sq41 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [20:13:25] !log olivneh synchronized php-1.21wmf10/extensions/Math 'Updating Math extension to fix db caching behavior (Change-Id: I9b690e55001859c97fd40330272791d49ec6de75)' [20:13:26] Logged the message, Master [20:14:20] !log olivneh synchronized php-1.21wmf9/extensions/Math 'Updating Math extension to fix db caching behavior (Change-Id: I9b690e55001859c97fd40330272791d49ec6de75)' [20:14:21] Logged the message, Master [20:15:04] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [20:16:36] !log olivneh synchronized php-1.21wmf9/extensions/E3Experiments/experiments/openTask.js 'Resume split test (1/3)' [20:16:38] Logged the message, Master [20:16:52] !log olivneh synchronized php-1.21wmf9/extensions/GettingStarted/resources/ext.gettingstarted.js 'Resume split test (2/3)' [20:16:54] Logged the message, Master [20:17:09] !log olivneh synchronized php-1.21wmf9/extensions/GuidedTour/modules/tours/gettingstarted.js 'Resume split test (3/3)' [20:17:10] Logged the message, Master [20:18:04] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 187 seconds [20:19:16] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 195 seconds [20:21:13] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [20:22:52] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 11 seconds [20:23:28] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [20:24:20] * AzaToth requests some more review on https://gerrit.wikimedia.org/r/#/c/50044/ [20:36:52] Ryan_Lane: What's the magic incantation to initialize salt on a machine again? [20:41:56] RECOVERY - Puppet freshness on sq41 is OK: puppet ran at Fri Feb 22 20:41:48 UTC 2013 [20:42:01] New review: Ori.livneh; "(4 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50044 [20:43:52] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [20:49:59] andrewbogott: roles should stay in roles [20:50:09] notepeter: Cool, that's what I thought. [20:50:11] eventually we will want a role module [20:50:21] but due to namespacing, we can't do that piecemeal [20:55:25] RECOVERY - NTP on sq41 is OK: NTP OK: Offset 0.03117215633 secs [20:55:52] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.34 ms [20:57:13] DAMN YOU ASW-B-EQIAD DO WHAT I TELL YOU [20:57:14] sigh. [20:58:19] Stupid question: Why do the RT and Bugzilla tracking links for Ongoing: "All week Swift cluster hardware swap out" link to the Wikivoyage migration items? [20:58:26] Sorry, on https://wikitech.wikimedia.org/view/Deployments :-) [21:00:39] Ryan_Lane: WTF?! [ERROR ] Failed to authenticate with the master, verify this minion's public key has been accepted on the salt master (on wtp1002) [21:01:14] RobH: I'm delayed in setting up wtp1002-4 because salt is giving me crap, see pings to Ryan above [21:01:33] RoanKattouw: No worries [21:03:29] bwahahahaha made it work [21:03:34] weeee [21:03:42] \o/ asw-b-eqiad is my bitch. [21:05:40] !log installing harmon and copper into dump generation misc use [21:05:42] Logged the message, RobH [21:06:04] PROBLEM - Host copper is DOWN: PING CRITICAL - Packet loss = 100% [21:08:04] New patchset: Ottomata; "Initial commit of Kafka Module."
[operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [21:09:50] New review: Ottomata; "Patchset 2: puppet linted." [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [21:12:25] New patchset: RobH; "harmon and copper as internal hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50431 [21:16:37] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50431 [21:18:24] New patchset: RobH; "copper migration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50433 [21:19:32] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50433 [21:28:38] RECOVERY - Puppet freshness on sq71 is OK: puppet ran at Fri Feb 22 21:28:33 UTC 2013 [21:28:56] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused [21:29:23] PROBLEM - Parsoid on wtp1002 is CRITICAL: Connection refused [21:30:17] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [21:30:55] RoanKattouw: ^ [21:31:01] starting to throw errors, just fyi [21:32:42] ah poop, paravoid, so $sha1 in git submodule foreach refers to what the HEAD index says, not what the FETCH_HEAD says [21:32:45] so [21:33:09] git submodule foreach 'diff HEAD..${sha1}' is basically diff HEAD..HEAD [21:33:59] could parse the diff of superproject diff HEAD..FETCH_HEAD for relevant shas [21:34:00] hm [21:34:38] RECOVERY - Puppet freshness on cp1035 is OK: puppet ran at Fri Feb 22 21:34:28 UTC 2013 [21:35:50] !log maxsem synchronized php-1.21wmf10/extensions/GeoData/solrupdate.php 'https://gerrit.wikimedia.org/r/#/c/50388/' [21:35:52] Logged the message, Master [21:36:20] are srv226-234 decommissioned? [21:38:12] !log maxsem synchronized php-1.21wmf9/extensions/GeoData/solrupdate.php 'https://gerrit.wikimedia.org/r/#/c/50388/' [21:38:13] Logged the message, Master [21:39:47] !log Nuking GeoData Solr index and reindexing [21:39:49] Logged the message, Master [21:39:50] !log harmon and copper online with OS installed for datadumps misc host support [21:39:52] Logged the message, RobH [21:41:13] RobH: I know. I need Ryan to fix salt [21:41:16] brb lunch [21:41:59] RoanKattouw: ahh, you said so, my bad [21:45:38] Ryan_Lane: Do you want an interesting illustration about what I said regarding perl? [21:46:11] Ryan_Lane: Here is a legitimate, and useful regex used in CSBot that someone asked me what it does. See if you can guess. [21:46:13] s/\{\|[^*]*\|+([^\{*][^*]*\|+)*\{|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^\{"'\\]*)/defined $2 ? $2 : ""/gse; [21:50:26] Heh. [21:50:36] He'd asked me about it as well. [21:51:50] RoanKattouw_away: something is broken with salt? [21:52:10] RoanKattouw_away: oh. I see [21:53:42] RoanKattouw_away: I need to talk to the salt folks about that error [21:53:54] oh wait, no [21:54:04] those are new systems, they just weren't accepted on the master yet [21:58:46] ori-l: regarding your review, it's a copy of patchset-created [21:58:55] just so you know [21:59:08] and regarding pep-8, let it burn in hell [21:59:45] but pep8 rejects the afterlife and the notion of a punitive deity [21:59:49] hehe [22:00:35] you asked for a look-through on a public channel (can't remember which); i had a few stylistic remarks but i didn't feel strongly enough about them to attach a score. disregard if you like. [22:00:44] no problem [22:02:26] I meant more in line with whether the new hook is wanted or not [22:03:20] atm drafts will not post to IRC until they are merged [22:03:59] i.e.
when publishing a draft, patchset-created will not be fired [22:07:37] !log reindexing complete [22:07:38] Logged the message, Master [22:17:32] PROBLEM - Puppet freshness on cp3003 is CRITICAL: Puppet has not run in the last 10 hours [22:23:25] New review: Ryan Lane; "(3 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50044 [22:23:32] PROBLEM - Puppet freshness on cp3004 is CRITICAL: Puppet has not run in the last 10 hours [22:24:20] AzaToth: I'd like us to switch to pep-8 for everything ;) [22:24:28] none of my old code is pep8, though [22:24:35] and this is based on my code [22:29:37] Ryan_Lane: OK it [22:29:41] 's working now [22:29:46] Ryan_Lane: In the future, what can I do to make that work? [22:37:59] RoanKattouw: nothing. those were new systems [22:38:02] I just accepted their keys [22:39:33] Ahm.... [22:39:35] error: The requested URL returned error: 403 while accessing http://tin.eqiad.wmnet/parsoid/config/.git/info/refs [22:39:55] -_- [22:40:06] let me see if the ranges need to be adjusted [22:40:26] this is one of the wtp systems? [22:40:54] wtp1004 [22:41:35] 10.64.0.0/22 [22:41:39] !log sq41 reinstalled, throws endless puppet errors during redeployment [22:41:40] Logged the message, RobH [22:41:41] !log mw110 throwing odd puppet errors, will come back to it later [22:41:42] Logged the message, RobH [22:41:48] wtp1004.eqiad.wmnet has address 10.64.32.75 [22:42:03] Why /22? [22:42:16] that's what it currently is [22:42:20] OK... [22:42:21] which is obviously wrong ;) [22:42:28] But wtp1 isn't even in that range [22:42:48] 10.0.0.0/16 10.64.0.0/22 10.64.16.0/24 208.80.152.0/22 [22:42:54] that's actually the current range [22:43:18] I should probably have 10.64.0.0/16 [22:43:20] I see [22:43:23] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours [22:43:24] PROBLEM - Puppet freshness on ms-be3003 is CRITICAL: Puppet has not run in the last 10 hours [22:43:26] Yeah that seems saner [22:43:42] Even if that range isn't all eqiad in the future, it seems like a good range to whitelist regardless [22:43:52] db27 is goddamn hosed [22:44:03] it won't even serial output, wtf. [22:44:44] RobH: Glad that Ops gave Peter db29 over db27, then. :-( [22:45:06] what you mean? [22:45:10] both are general db use [22:45:17] (isn't that what peter used it for?) [22:45:25] he has both of them. [22:45:39] RoanKattouw: fixing [22:45:44] OK [22:46:00] New patchset: Ryan Lane; "Change new deploy http accept range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50441 [22:46:01] ^^ [22:46:36] YAy [22:47:03] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50441 [22:47:46] RoanKattouw: ok. try now [22:49:57] OK that worked [22:50:04] But (I think this might be a known issue) it doesn't run the hooks [22:50:16] the ones to add the links?
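
The "Failed to authenticate with the master" error RoanKattouw hit on wtp1002 above, and Ryan's fix ("I just accepted their keys"), are Salt's standard key-acceptance handshake: a fresh minion submits its public key to the master and can do nothing until that key is accepted. A minimal sketch of the accept flow on the master, assuming the stock salt-key/salt CLI (the exact invocations here are illustrative, only the hostnames come from the log):

```bash
# On the salt master: list keys grouped by state, then accept the new minions.
salt-key -L                        # shows Accepted / Unaccepted / Rejected keys
salt-key -a wtp1002.eqiad.wmnet    # accept a single pending minion key
salt-key -a wtp1003.eqiad.wmnet
salt-key -a wtp1004.eqiad.wmnet

# Confirm the minions can now authenticate and respond.
salt 'wtp100[2-4].eqiad.wmnet' test.ping
```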
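
The 403 from tin.eqiad.wmnet follows directly from the ranges quoted above: the deploy ACL allowed 10.64.0.0/22, which spans only 10.64.0.0 through 10.64.3.255, so wtp1004 at 10.64.32.75 fell outside it, while Ryan's proposed 10.64.0.0/16 does contain it. A small bash sketch of that containment check (the helper names are illustrative, not taken from any deployed script):

```bash
# Convert a dotted quad to a 32-bit integer.
ip_to_int() {
  local IFS=.
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Report whether an IP address falls inside a CIDR range.
in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( (ip & mask) == (net & mask) )) && echo "in range" || echo "out of range"
}

in_cidr 10.64.32.75 10.64.0.0/22   # out of range -> the 403 above
in_cidr 10.64.32.75 10.64.0.0/16   # in range     -> Ryan's proposed fix
```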
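
On ori's `git submodule foreach` observation from earlier (21:32-21:33): `$sha1` inside `foreach` is the gitlink recorded in the superproject's HEAD, so diffing against it after a fetch just compares HEAD with itself. The workaround he floats, reading the incoming submodule pointers out of the superproject's own HEAD..FETCH_HEAD diff, could look something like this sketch (the awk filter is illustrative):

```bash
# Fetch the branch you want to compare against.
git fetch origin master

# Human-readable: --submodule=log renders gitlink changes as per-submodule
# commit ranges instead of bare SHA pairs.
git diff --submodule=log HEAD..FETCH_HEAD

# Script-friendly: raw diff entries whose old mode is 160000 are changed
# submodules; fields 3 and 4 are the old and new gitlink SHAs, the last
# field is the submodule path.
git diff --raw HEAD..FETCH_HEAD | awk '$1 == ":160000" { print $3, $4, $NF }'
```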
[22:50:19] Yes [22:50:22] -_- [22:50:38] I can initialize the hosts and then do a dummy deploy, that'll put them in [22:50:45] ok [22:50:49] But it should just do it upon init, ideally [22:50:52] indeed [22:50:56] I'll take a look at that soon [22:52:26] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50329 [22:52:43] Also, ideally the salt initialization would be puppetized somehow [22:52:46] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50323 [22:55:10] !log changing dns entry for wikitech-static [22:55:11] Logged the message, Master [22:57:20] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.064 seconds [22:57:38] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.058 seconds [22:57:52] RobH: Look at that, they're coming up [22:57:55] Now lemme pool them [22:58:23] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.058 seconds [22:58:32] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.013 seconds response time. www.wikipedia.org returns 208.80.154.225 [22:58:38] RobH: I thought Peter just had db29 (for getting the numbers for SUL finalisation); it'll have a bunch of extra dbs on it so he can join across all of them for ultra-queries-of-death. [23:00:20] New review: CSteipp; "I did test this since I was curious, and ssl will work without the 3 lines specifying SSL being on, ..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/49200 [23:04:25] James_F: i have no idea what peter does with the databases [23:04:32] i just know if a server is db class and is assigned to him. [23:04:35] ;] [23:04:50] he asked for both db27 and db29 to get fixed, only one of the two was repaired right away [23:05:36] RobH: wtp1002-4 are up and running, wtp1001 is depooled, yank when desired [23:06:20] PROBLEM - Puppet freshness on ms-be3001 is CRITICAL: Puppet has not run in the last 10 hours [23:06:45] RoanKattouw: Awesome, thanks dude [23:07:04] i'll get it pulled and renamed, so your new wtp1001 should be set later today/monday [23:07:13] if not monday, i dunno, we have meetings all next week [23:07:21] Right, that's fine [23:09:31] RobH: Sure. [23:12:04] New patchset: Faidon; "Support /transcoded/ for upload Varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50443 [23:12:55] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50443 [23:26:03] !log deploying squid [23:26:04] Logged the message, Master [23:29:01] !lo geploying squid config for upload's new transcoded container [23:29:08] !log deploying squid config for upload's new transcoded container [23:29:09] Logged the message, Master [23:29:16] third time's the charm [23:30:44] New patchset: Faidon; "Fix exception handling, lint" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50376 [23:32:07] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50376 [23:33:05] New patchset: Andrew Bogott; "Refactor mediawiki::singlenode and wikidata::singlenode into modules." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50451 [23:33:12] New review: Faidon; "Consider this a +2 if you test it before/after you deploy it :)" [operations/debs/wikimedia-task-appserver] (master) C: 1; - https://gerrit.wikimedia.org/r/49231 [23:34:13] New patchset: Ottomata; "Adding puppet-merge for sockpuppet puppet merges."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/50452 [23:35:07] New review: Andrew Bogott; "I've verified that the role::mediawiki-install::labs class is still working properly. I have /not/ ..." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/50451 [23:38:56] New review: Andrew Bogott; "I've verified that the role::mediawiki-install::labs class is still working properly. I have /not/ ..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/50451 [23:40:17] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [23:42:01] :O [23:42:05] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:49:24] New patchset: Ori.livneh; "Add python-pygments as dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50454 [23:54:39] New patchset: Pyoungmeister; "more cleanup of old mediawiki class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50455 [23:55:19] !log argon is going offline, killing our limesurvey install [23:55:21] Logged the message, RobH [23:56:02] New patchset: RobH; "killing argon, as we dont need limesurvey anymore" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50456 [23:57:24] New review: Demon; "Limesurvey is dying? Happy days are here again!" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/50456 [23:58:21] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50455 [23:58:23] ^demon: heh [23:58:29] was like 'why did chad +1?' [23:58:32] i am amused. [23:58:45] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50456 [23:58:58] RobH: merging for you [23:59:04] already did [23:59:06] oh, on sockpuppet? [23:59:09] thx dude [23:59:10] yeah [23:59:14] <^demon> I remember when limesurvey was installed and it spammed the apache logs. [23:59:16] <^demon> Piece of crap. [23:59:27] !log argon is shut down now [23:59:28] Logged the message, Master [23:59:47] notpeter: So FYI: I did not touch the limesurvey database.
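
For readers outside ops: the "merged on sockpuppet" ritual that recurs throughout this log (and that Ottomata's puppet-merge change 50452 sets out to wrap) is a fetch-review-confirm step on the puppetmaster, separate from the Gerrit merge itself. A hypothetical sketch of that shape, not the contents of change 50452; the repository path and prompt flow are assumptions:

```bash
#!/bin/bash
# Hypothetical puppet-merge-style helper (path and flow are assumptions):
# fetch what Gerrit has merged, show the incoming commits, and only
# fast-forward the puppetmaster's checkout after a human confirms.
set -e
cd /var/lib/git/operations/puppet    # assumed repo location on the master

git fetch origin production
git log --stat HEAD..FETCH_HEAD      # review what would be merged

read -r -p "Merge these changes? [y/N] " answer
if [[ $answer == [yY] ]]; then
    git merge --ff-only FETCH_HEAD
fi
```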