[00:08:58] New review: Ryan Lane; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50268 [00:09:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50268 [00:12:27] New review: Ryan Lane; "Patch Set 2: Code-Review-1" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/50011 [00:14:24] New review: Ryan Lane; "Patch Set 2:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50015 [00:14:29] TimStarling: any chance of an inline documentation sprint for lua? [00:15:06] luadoc ported to PHP? [00:15:23] no I mean comments for the php code [00:15:28] ;) [00:15:37] I thought it had comments already [00:15:49] it certainly has some [00:16:17] * AaronSchulz wonders how long it will be before he can really review more stuff [00:18:51] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [00:19:51] TimStarling: I keep looking over the code, though I see stuff that seems to lack context or clarity about what it's trying to do [00:21:52] * AaronSchulz looks at doCachedExpansion [00:22:41] New patchset: Pyoungmeister; "more removal of old mediawiki class junk" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50321 [00:23:32] that should probably be protected [00:25:09] binasher: https://ishmael.wikimedia.org/?host=db1017&hours=24&sort=time ? [00:25:38] that seems frowny [00:25:44] what the... [00:25:49] and mathy [00:25:57] * binasher breaks ishmael again  [00:26:25] are those numbers not accurate? [00:26:34] they should be accurate [00:26:45] I know math was changed lately [00:26:56] I also saw lock wait timeouts in the logs [00:26:59] i just meant that as in "lets make pretend we didn't see it!" [00:27:21] for some reason, i thought MathRenderer::writeDBEntry was only supposed to be called very infrequently [00:28:48] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50321 [00:28:57] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50321 [00:37:05] New patchset: Pyoungmeister; "moving udpprofile out of mediawiki.pp into misc/udpprofile.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50322 [00:37:47] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50322 [00:37:57] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50322 [00:42:34] New patchset: Aaron Schulz; "Made rewrite.py aware of the transcoded zone." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50323 [00:45:04] New review: Aaron Schulz; "Patch Set 1:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50323 [00:46:19] binasher: is the math load temporary perhaps? Maybe it is migrating things on the fly? 
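For context on that last question: MathRenderer::writeDBEntry is meant to fire only on cache misses, so a miss-only write path is the expected shape. A minimal runnable sketch of that pattern in Python (illustrative only; the extension itself is PHP, and the dict here just stands in for the math table):

```python
import hashlib

_math_table = {}  # stand-in for the real DB table

def render(tex):
    return "<mathml for %r>" % tex  # placeholder for the real renderer

def get_html(tex):
    key = hashlib.md5(tex.encode("utf-8")).hexdigest()
    cached = _math_table.get(key)
    if cached is not None:
        return cached            # hit: no DB write at all
    html = render(tex)           # miss: render once...
    _math_table[key] = html      # ...and write once; this should be
    return html                  # the only path that hits writeDBEntry
```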
[00:46:31] I actually know fuck all about what changed though [00:47:05] it's definitely still happening [00:47:16] I mean for the next few days [00:47:25] ori might know [00:51:12] New patchset: Pyoungmeister; "moving mediawiki-math package into pdf misc class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50324 [00:52:32] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50324 [00:52:40] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50324 [00:55:44] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49916 [00:57:06] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 0 seconds [01:06:16] New patchset: Aaron Schulz; "Made the transcoded container sharded like the others." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50327 [01:07:00] binasher: I'm checking [01:07:33] ori-l: thanks! [01:08:01] New review: Aaron Schulz; "Patch Set 1: Code-Review+2" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/50327 [01:08:18] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50327 [01:08:55] !log aaron synchronized wmf-config/filebackend.php 'deployed f699c5a9ad2f780dc18863592dcae4a51ba6d841' [01:08:56] Logged the message, Master [01:11:23] binasher: I think I see the problem. Should have a fix in a few. [01:15:51] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [01:16:00] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [01:16:54] Gerrit seems intermittently unhappy [01:17:01] ssh: connect to host gerrit.wikimedia.org port 29418: Connection timed out [01:17:01] fatal: The remote end hung up unexpectedly [01:17:01] New patchset: Aaron Schulz; "Updated the sharded containers list." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50329 [01:17:45] New patchset: Aaron Schulz; "Updated the sharded containers list (added transcoded zone)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50329 [01:18:08] paravoid: enjoy :) [01:25:36] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , eswiki (31767), Total (33404) [01:27:51] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , eswiki (21559), Total (24516) [01:31:08] New patchset: Andrew Bogott; "I'm thinking that upstart_job regards true and "true" as not equal." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50330 [01:31:48] New review: Andrew Bogott; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50330 [01:32:57] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50330 [01:35:20] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [01:36:31] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [01:40:11] binasher: https://gerrit.wikimedia.org/r/#/c/50332/ [01:45:13] paravoid: Swift looks pretty unhappy from the MW side [01:45:22] Load of timeouts [01:45:25] ori-l: thanks!
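Aside: the "sharded like the others" change (50327) splits one logical Swift container into many via a hash suffix on the object name. A rough Python sketch of the general scheme; the two-hex-digit suffix and container naming here are assumptions for illustration, the real settings live in wmf-config/filebackend.php:

```python
import hashlib

def sharded_container(base, object_name, hex_digits=2):
    # e.g. base.00 .. base.ff instead of one giant container
    digest = hashlib.md5(object_name.encode("utf-8")).hexdigest()
    return "%s.%s" % (base, digest[:hex_digits])

print(sharded_container("wikipedia-commons-local-transcoded", "Example.webm"))
```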
[01:45:49] RECOVERY - MySQL disk space on neon is OK: DISK OK [01:46:16] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [01:56:28] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours [01:56:56] binasher: how soon does it need to go out? (wondering if i should cherry-pick this on top of the deployed tip and prepare a submodule bump) [01:57:11] !log migrating wikivoyage and wikidata external blob stores to innodb [01:57:14] Logged the message, Master [01:58:20] ori-l: the sooner the better, but not urgent enough that it shouldn't wait til tomorrow at least to deploy, when more people are around [01:58:41] k, i'll add it to the calendar for tomorrow [01:59:01] many thanks for tracking that down with lightning speed [01:59:32] if you discount the fact that aaron flagged it yesterday, sure :) [01:59:41] i promised to look but forgot. oh well. [02:04:25] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , ptwiki (84879), Total (86046) [02:05:01] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 190 seconds [02:05:46] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 205 seconds [02:20:28] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [02:29:09] !log LocalisationUpdate completed (1.21wmf10) at Fri Feb 22 02:29:09 UTC 2013 [02:29:11] Logged the message, Master [02:32:28] PROBLEM - Puppet freshness on sq71 is CRITICAL: Puppet has not run in the last 10 hours [02:40:52] PROBLEM - SSH on labstore4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:41:28] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [02:44:19] RECOVERY - SSH on labstore4 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:48:31] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , arwiki (14217), ptwiki (36874), Total (51752) [02:48:40] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , arwiki (13744), ptwiki (36102), Total (50527) [02:49:53] PROBLEM - SSH on labstore4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:06] !log LocalisationUpdate completed (1.21wmf9) at Fri Feb 22 02:53:05 UTC 2013 [02:53:08] Logged the message, Master [02:56:55] RECOVERY - SSH on labstore4 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:02:28] PROBLEM - SSH on labstore4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:04:07] RECOVERY - SSH on labstore4 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:04:16] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [03:04:34] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [03:16:43] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 14 seconds [03:17:37] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 1 seconds [03:29:12] New patchset: Andrew Bogott; "Tune up manage-volumes-daemon a bit." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/50335 [04:02:19] PROBLEM - Puppet freshness on stafford is CRITICAL: Puppet has not run in the last 10 hours [04:34:25] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (203837), plwiktionary (82462), Total (287540) [04:35:19] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [04:35:37] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (200271), plwiktionary (85227), Total (286895) [05:17:17] New patchset: Tim Starling; "Move all favicons to bits" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49802 [05:17:23] New review: Tim Starling; "Patch Set 3: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49802 [05:17:24] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49802 [05:18:42] !log tstarling synchronized docroot/bits [05:18:45] Logged the message, Master [05:20:02] !log tstarling synchronized wmf-config/InitialiseSettings.php [05:20:04] Logged the message, Master [05:31:58] New patchset: Tim Starling; "Fix some favicons I broke" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50340 [05:32:09] nobody really cares about favicons, right? [05:32:32] New review: Tim Starling; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50340 [05:32:33] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50340 [05:33:10] !log tstarling synchronized docroot/bits/favicon/wikipedia.ico [05:33:12] Logged the message, Master [05:33:37] !log tstarling synchronized wmf-config/InitialiseSettings.php [05:33:37] Logged the message, Master [05:50:25] New patchset: Tim Starling; "Remove virtual host entries for deleted wikis" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50343 [05:50:48] New review: Tim Starling; "Patch Set 1: Verified+2 Code-Review+2" [operations/apache-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50343 [05:50:49] Change merged: Tim Starling; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50343 [05:56:39] New patchset: Tim Starling; "Remove document roots for deleted wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50344 [06:11:57] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 189 seconds [06:12:42] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:12:51] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 209 seconds [06:14:21] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.024 second response time on port 8123 [06:20:57] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:21:40] New patchset: Tim Starling; "Use favicon.php for /favicon.ico of all wikis" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50346 [06:21:52] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [06:22:45] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [06:22:55] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [06:33:35] New review: Tim Starling; 
"Patch Set 1: Verified+2 Code-Review+2" [operations/apache-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50346 [06:33:35] Change merged: Tim Starling; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50346 [06:48:45] New patchset: Tim Starling; "Adding CentralNotice user right to meta and testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50196 [06:48:50] New review: Tim Starling; "Patch Set 3: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50196 [06:48:51] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50196 [06:49:18] RECOVERY - Lucene on search1016 is OK: TCP OK - 3.024 second response time on port 8123 [06:49:43] !log tstarling synchronized wmf-config/InitialiseSettings.php [06:49:44] Logged the message, Master [06:58:54] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [06:59:12] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [07:15:58] New review: Faidon; "Patch Set 1: Code-Review-1" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/50149 [07:17:52] paravoid: Tim-away: is that mostly for a particular filetype? or it effects everything that goes through imagemagick? [07:18:13] (re the last gerrit msg above) [07:20:04] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [07:31:18] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [07:31:27] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123 [07:31:54] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [07:33:07] !log restarted lucene search on search1016 [07:33:08] RECOVERY - MySQL disk space on neon is OK: DISK OK [07:33:09] Logged the message, Master [07:37:21] New review: Ryan Lane; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50335 [07:41:35] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50335 [08:08:42] PROBLEM - Puppet freshness on db40 is CRITICAL: Puppet has not run in the last 10 hours [08:10:30] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [08:10:57] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [08:27:45] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [08:38:16] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [08:39:28] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [08:42:13] MaxSem: check_solr for vanadium [08:42:13] Traceback (most recent call last): [08:42:13] File "/usr/local/nagios/libexec/check_solr", line 133, in [08:42:16] (e, ) = err.args [08:42:19] ValueError: need more than 0 values to unpack [08:42:27] ugh [08:42:35] my Python sucks [08:43:12] will investigate [08:43:21] you can do multiple except Foo blocks [08:43:33] so you can put the except URLError above the generic one [08:43:38] unrelated to this [08:43:42] just saying, better style [08:44:09] yeah, but I need specific handling for only one case of URLError [08:44:55] because whatever error message urrlib2 uses for connection timeouts makes no sense whatsoever [08:46:46] paravoid, can we check if this monitoring works? [08:47:03] which one? 
[08:47:36] MaxSem: ? [08:48:47] we can temporarily disable jetty on one of the Tampa servers, it's not used by the API [08:49:15] we had the usual pages tonight, saw nothing from your checks [08:50:13] so I wonder if it will work, or keep OK like the previous monitoring did:) [08:52:49] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [08:54:28] RECOVERY - MySQL disk space on neon is OK: DISK OK [09:06:19] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [10:19:45] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [10:42:33] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , dewiki (11481), Total (29336) [10:44:03] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , dewiki (11519), Total (23451) [10:53:35] New patchset: Platonides; "Handling of packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50015 [10:53:35] New patchset: Platonides; "Configuration for webtools-apache VMs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50011 [10:53:51] New review: Platonides; "Patch Set 3:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50011 [10:55:07] New review: Platonides; "Patch Set 3:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50015 [11:02:01] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [11:03:13] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [11:23:55] !log DNS update - add indiawikipedia.com [11:23:59] Logged the message, Master [11:57:46] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours [12:20:04] !log Halted cp3010 for controller replacement [12:20:06] Logged the message, Master [12:21:18] New patchset: Dzahn; "(bug 44893) Set up redirect from tartupeedia.ee to a page on etwiki" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/48868 [12:23:16] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100% [12:27:59] New review: ArielGlenn; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/47103 [12:28:11] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47103 [12:33:46] PROBLEM - Puppet freshness on sq71 is CRITICAL: Puppet has not run in the last 10 hours [12:35:34] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 118.40 ms [12:39:19] PROBLEM - Varnish HTTP upload-frontend on cp3010 is CRITICAL: Connection refused [12:42:46] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [12:42:55] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100% [12:43:04] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [12:46:27] New patchset: Dzahn; "(bug 44893) Set up redirect from tartupeedia.ee to a page on etwiki" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/48868 [12:49:40] PROBLEM - Host cp3004 is DOWN: PING CRITICAL - Packet loss = 100% [12:52:31] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [12:52:58] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [12:53:52] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 118.36 ms [12:55:04] RECOVERY - Host cp3003 is 
UP: PING OK - Packet loss = 0%, RTA = 118.32 ms [12:55:22] RECOVERY - Varnish HTTP upload-frontend on cp3010 is OK: HTTP OK HTTP/1.1 200 OK - 675 bytes in 0.237 seconds [12:55:40] RECOVERY - Host cp3004 is UP: PING WARNING - Packet loss = 66%, RTA = 118.54 ms [12:57:55] PROBLEM - Varnish traffic logger on cp3010 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:59:16] PROBLEM - Varnish HTTP upload-backend on cp3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:59:43] PROBLEM - Varnish HTTP upload-backend on cp3004 is CRITICAL: Connection refused [12:59:52] PROBLEM - Varnish traffic logger on cp3003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:00:01] PROBLEM - Varnish HTTP upload-frontend on cp3004 is CRITICAL: Connection refused [13:00:19] PROBLEM - Varnish traffic logger on cp3004 is CRITICAL: Connection refused by host [13:00:19] PROBLEM - Varnish HTCP daemon on cp3003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:00:19] PROBLEM - Varnish HTTP upload-frontend on cp3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:46] PROBLEM - SSH on cp3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:55] PROBLEM - Varnish HTCP daemon on cp3004 is CRITICAL: Connection refused by host [13:01:13] PROBLEM - SSH on cp3004 is CRITICAL: Connection refused [13:03:19] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:11] !log Halted cp3009 for controller replacement [13:07:12] Logged the message, Master [13:09:20] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.28 ms [13:10:13] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100% [13:11:06] !log Halted ms-be3004 for controller replacement [13:11:07] Logged the message, Master [13:11:34] RECOVERY - Host maerlant is UP: PING OK - Packet loss = 0%, RTA = 118.26 ms [13:13:50] PROBLEM - Host ms-be3004 is DOWN: PING CRITICAL - Packet loss = 100% [13:14:08] PROBLEM - Host ms-be3003 is DOWN: PING CRITICAL - Packet loss = 100% [13:14:21] !log Halted ms-be3001-3003 for controller replacement [13:14:23] Logged the message, Master [13:14:52] RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 118.56 ms [13:15:37] PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:13] PROBLEM - Varnish HTTP upload-frontend on cp3009 is CRITICAL: Connection refused [13:20:34] RECOVERY - Varnish traffic logger on cp3010 is OK: PROCS OK: 3 processes with command name varnishncsa [13:21:01] PROBLEM - Varnish traffic logger on cp3009 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [13:21:02] New patchset: Dzahn; "(bug 44893) Set up redirect from tartupeedia.ee to a page on etwiki" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/48868 [13:21:05] hello [13:21:37] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100% [13:23:07] PROBLEM - NTP on cp3004 is CRITICAL: NTP CRITICAL: No response from NTP server [13:30:37] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:43] RECOVERY - Varnish traffic logger on cp3009 is OK: PROCS OK: 3 processes with command name varnishncsa [13:37:13] RECOVERY - Varnish HTTP upload-frontend on cp3009 is OK: HTTP OK HTTP/1.1 200 OK - 675 bytes in 0.235 seconds [13:38:43] RECOVERY - Host ms-be3002 is UP: PING OK - Packet loss = 0%, RTA = 118.33 ms [13:40:51] New review: Hashar; "Patch Set 1:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50306 [13:42:29] RECOVERY - Host 
cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.30 ms [13:43:14] PROBLEM - NTP on cp3003 is CRITICAL: NTP CRITICAL: No response from NTP server [13:44:53] PROBLEM - SSH on ms-be3002 is CRITICAL: Connection refused [13:49:59] RECOVERY - Host ms-be3001 is UP: PING OK - Packet loss = 0%, RTA = 118.51 ms [13:54:11] RECOVERY - Host ms-be3003 is UP: PING OK - Packet loss = 0%, RTA = 118.44 ms [13:55:24] PROBLEM - SSH on ms-be3001 is CRITICAL: Connection refused [13:57:47] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [13:57:56] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [13:58:14] PROBLEM - SSH on ms-be3003 is CRITICAL: Connection refused [13:59:17] RECOVERY - MySQL disk space on neon is OK: DISK OK [14:00:02] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.022 second response time on port 8123 [14:03:29] PROBLEM - Puppet freshness on stafford is CRITICAL: Puppet has not run in the last 10 hours [14:04:59] PROBLEM - NTP on ms-be3002 is CRITICAL: NTP CRITICAL: No response from NTP server [14:11:22] New patchset: Reedy; "(bug 45165) Create rollbacker group for wikidatawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49847 [14:11:29] New review: Reedy; "Patch Set 2: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49847 [14:11:32] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49847 [14:13:51] New patchset: Reedy; "(bug 45124) Allow wikidatawiki sysops to add/remove confirmed status" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49682 [14:14:40] New patchset: Reedy; "(bug 45124) Allow wikidatawiki sysops to add/remove confirmed status" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49682 [14:16:38] New patchset: Reedy; "(bug 45124) Allow wikidatawiki sysops to add/remove confirmed status" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49682 [14:16:50] PROBLEM - NTP on ms-be3001 is CRITICAL: NTP CRITICAL: No response from NTP server [14:17:27] New review: Reedy; "Patch Set 4: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49682 [14:17:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49682 [14:18:07] !log reedy synchronized wmf-config/InitialiseSettings.php [14:18:09] Logged the message, Master [14:22:05] PROBLEM - NTP on ms-be3003 is CRITICAL: NTP CRITICAL: No response from NTP server [14:22:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:24:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.777 seconds [14:36:29] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [14:43:38] New patchset: Platonides; "Handling of packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50015 [14:48:29] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [14:52:23] <^demon> !log restarted gerrit on manganese, jetty flapping again [14:52:25] Logged the message, Master [14:52:26] bah Gerrit died [14:52:27] oh [14:52:28] :-] [14:52:41] <^demon> It should be coming back up in a second. [14:53:33] back :-) [14:54:47] heh, Gerrit also uses jetty:) [14:56:41] <^demon> MaxSem: Also? 
[14:57:32] like Solr [14:57:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:58:19] <^demon> Oh, heh. [14:58:24] <^demon> Well, lots of java things use jetty :) [14:59:21] 2013-02-22 14:55:54,316 ERROR zuul.GerritEventConnector: Received unrecongized event type 'reviewer-added' from Gerrit. Can not get account information. [14:59:22] bhahhh [14:59:41] ^demon, does Gerrit simply log to a local file? [15:00:09] <^demon> yup. [15:00:31] guess not much of a difference for a single server [15:05:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.746 seconds [15:09:47] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 202 seconds [15:10:05] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 209 seconds [15:15:56] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [15:20:17] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.019 second response time on port 8123 [15:27:47] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 11 seconds [15:28:05] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [15:34:46] New patchset: Tpt; "(bug 40759) Let Proofread Page setup namespaces for fi.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48877 [15:34:55] New review: ArielGlenn; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50041 [15:35:04] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50041 [15:44:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:08] !log gallium removed live hack in /srv/org/wikimedia/doc/puppetsource/manifests/role/labsmediawiki.pp [15:47:10] Logged the message, Master [15:47:11] apergos: ^^^ [15:47:19] yay [15:47:29] I wonder what it was for [15:47:46] I'd rerun puppet but the first run is stil going [15:47:48] *still [15:47:53] that removed a newline between the comment block and the class [15:48:01] I guess andrew tested it locally and forgot to reset [15:48:04] ah [15:48:06] fine [15:48:07] it is in the repo already [15:48:31] gah it really runs slow over there, it's killing m [15:48:31] e [15:48:40] slow ? [15:48:42] what puppet? [15:48:45] yep [15:49:03] cause it includes all the stupid manifests :-]  And gallium is very active [15:49:10] + its disk is crippled [15:49:12] I/O takes ages [15:49:37] New patchset: Hashar; ".pep8 configuration file" [operations/dumps] (master) - https://gerrit.wikimedia.org/r/50369 [15:49:37] New patchset: Hashar; "pep8: E302 expected 2 blank lines, found 1" [operations/dumps] (master) - https://gerrit.wikimedia.org/r/50370 [15:49:37] New patchset: Hashar; "pep8 whitespaces fixing" [operations/dumps] (master) - https://gerrit.wikimedia.org/r/50371 [15:49:47] apergos: here are the pep8 checks for operations/dumps :-] [15:49:59] <^demon> "notice: Run of Puppet configuration client already in progress; skipping" <- I'm pretty sure puppet is making things up. [15:50:04] they still fail though https://integration.mediawiki.org/ci/job/operations-dumps-pep8/3/consoleFull :-( [15:50:16] ah let me get those in [15:50:38] and there are four puppet agents running on gallium :-] [15:50:57] four??
[15:50:57] oh oh [15:51:06] if it is in that state, need to delete the lock file [15:51:13] lemme look at that first [15:51:14] and then rerun puppet again [15:51:17] usually that is cause of [apt-get] [15:51:20] yea [15:51:23] well [15:51:34] in labs I simply restart puppet [15:51:35] gotta wait for current run to complete, it is progressing [15:51:38] <^demon> mutante: Where's said lock file? [15:51:47] after that I can do cleanup [15:52:59] off topic: anyone ever listened to Pearl Jam album "No Code"? :-D [15:53:13] of course I can't review it (merge it) [15:53:17] cause jenkins failed it [15:53:20] I found out the album this morning, must have been hidden for a bit more than 10 years [15:53:33] <^demon> mutante: Ah, seems to have just been slow, not actually stuck. [15:53:46] New review: ArielGlenn; "Patch Set 1: Code-Review+2" [operations/dumps] (master) C: 2; - https://gerrit.wikimedia.org/r/50369 [15:53:48] apergos: the dumps fix ? [15:53:56] apergos: ah yeah I need to make it V+2 [15:53:58] New review: ArielGlenn; "Patch Set 1: Verified+2" [operations/dumps] (master); V: 2 - https://gerrit.wikimedia.org/r/50369 [15:53:58] Change merged: ArielGlenn; [operations/dumps] (master) - https://gerrit.wikimedia.org/r/50369 [15:54:08] ^demon: /var/lib/puppet/state/puppetdlock i think [15:54:11] yeah [15:54:27] but puppet agent --disable , and then --enable again could also work [15:54:45] --enable [15:54:47] Enable working on the local system. This removes any lock file, causing 'puppet agent' to start managing the local system again [15:55:09] ok, gotcha, just slow [15:55:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.045 seconds [15:55:39] apergos: I did the change to let Jenkins V+2 on operations/dumps :-] [15:55:42] ok [15:55:48] it happened some time in the past .. like http://projects.puppetlabs.com/issues/2888 [15:56:10] gee, blank content [15:56:11] <^demon> apergos: Did you merge 50041 on sockpuppet? [15:56:12] bad gerrit [15:56:15] no [15:56:20] wait which one was that [15:56:26] <^demon> The hook for gerrit [15:56:46] yes [15:57:06] <^demon> Hmm, puppet finished but the file didn't update :\ [15:57:19] <^demon> Oh, there it is. [15:57:40] <^demon> Nevermind :) [15:57:55] whew [15:58:03] ok one puppet agent on gallium now [15:58:07] let's try a new puppet run [15:58:18] running [15:58:48] Change merged: ArielGlenn; [operations/dumps] (master) - https://gerrit.wikimedia.org/r/50370 [15:59:11] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/48868 [15:59:15] uh oh [15:59:29] New review: Dzahn; "dzahn@fenari:~$ apache-fast-test tartupeedia.ee.url mw1044" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/48868 [15:59:43] hashar: and that Apache change now merged without manual Verify, sweet [16:00:13] mutante: beware it just lints the config [16:00:20] aka make sure that is something that Apache can load [16:00:30] um [16:00:32] the integration tests, I need to find out a way to do it properly [16:00:51] so it merged but where is it on sockpuppet? [16:00:58] hashar: of course, i test those on a single server first each time, and paste the output of apache-fast-test [16:00:59] possibly loading an apache instance that listens on 127.0.0.2:80 and run Jeff's script against it [16:01:08] mutante: ....
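The single-server test mutante describes (and the local-Apache idea hashar floats) reduces to: ask one specific backend for a URL while presenting the production Host header, and inspect the response before touching the fleet. A hedged Python 2 sketch of such a probe; this is not apache-fast-test itself:

```python
import httplib  # Python 2; http.client in Python 3

def probe(backend, hostname, path="/"):
    # Connect straight to one Apache, but send the site's Host header
    # so the matching virtual host answers.
    conn = httplib.HTTPConnection(backend, 80, timeout=5)
    conn.request("GET", path, headers={"Host": hostname})
    resp = conn.getresponse()
    return resp.status, resp.getheader("Location")

# e.g. probe("mw1044", "tartupeedia.ee") should show the 301 target
# before the redirect is gracefulled everywhere.
```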
[16:01:19] oh different repo [16:01:22] nm [16:01:26] * apergos <- dumb [16:03:28] mw110: ssh: connect to host mw110 port 22: Connection refused [16:03:38] Change merged: ArielGlenn; [operations/dumps] (master) - https://gerrit.wikimedia.org/r/50371 [16:03:40] looking at that one [16:04:39] oh, sits in installer at partitioning screen [16:05:16] New patchset: MaxSem; "Fix exception handling, lint" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50376 [16:05:18] that would get it [16:05:50] RobH: mw110, sits at installer.. fyi.. looks like you were on it last per SAL [16:06:16] paravoid, ^^^ [16:07:23] !log mw110 - sits at partitioning screen in installer [16:07:23] Logged the message, Master [16:08:00] ah the check_solr issue [16:08:17] sigh, yes [16:08:25] hm [16:08:33] debugging by proxy is difficult:( [16:08:37] I am wondering whether it would make sense to mount the disk on gallium with noatime [16:09:15] why do you have atime on? [16:09:31] no idea :-] [16:09:34] well s/you/folks/ [16:09:40] /dev/md0 on / type ext3 (rw,errors=remount-ro) [16:10:13] and the repo is local on those disks yeah? [16:10:28] yeah everything is there [16:10:34] it makes sense to me but maybe double check with someone else [16:10:41] the git clones, jobs, mediawiki files etc [16:10:46] yep [16:10:51] will mail ops list [16:10:51] :-) [16:11:02] puppet ran happily btw [16:11:15] dzahn is doing a graceful restart of all apaches [16:11:51] !log dzahn gracefulled all apaches [16:11:52] Logged the message, Master [16:12:59] !log gracefulling eqiad Apaches via dsh to push tartupeedia.ee redirect (bug 44893) [16:13:00] Logged the message, Master [16:14:18] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:39] ..on fenari ..eh [16:15:57] it's running [16:15:57] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4915 bytes in 0.005 seconds [16:16:25] I restarted it [16:16:27] :-P [16:16:42] /etc/init.d/apache2: 55: [: nice: unexpected operator [16:16:46] that's a nice feature [16:17:02] ah, that happens on tons of them for a while [16:17:07] but it doesnt break them [16:17:29] ah i noticed that on beta too [16:17:30] was there any relation to doing that graceful and fenari ? i dont think it is in any apache groups [16:17:32] missing " " [16:18:40] mw1067: /etc/init.d/apache2: 55: [: nice: unexpected operator [16:18:40] mw1067: * Reloading web server config apache2 [16:18:40] mw1068: /etc/init.d/apache2: 55: [: nice: unexpected operator [16:18:40] mw1067: ...done. [16:18:45] where is that conf file though [16:19:20] needs fix, but it doesnt keep them from restarting and it's not brand new [16:19:30] well the fix is 2 seconds if we have the file [16:19:31] if [ ! -x $APACHE_HTTPD ] ; then [16:19:35] missing double quotes I guess [16:19:59] but I don't know where that is [16:20:56] envvars.appserver:NICE=$((-`nice`)) [16:21:01] that? [16:21:16] ./puppet/files/apache/envvars.appserver [16:21:32] no [16:21:48] literally as we said earlier, needs [16:21:52] if [ ! -x $APACHE_HTTPD ] ; then [16:21:52] to [16:21:58] if [ !
-x "$APACHE_HTTPD" ] ; then [16:22:05] but where is that startup script [16:22:05] dpkg -S /etc/init.d/apache2 [16:22:05] apache2.2-common: /etc/init.d/apache2 [16:22:11] baaaahhhhhh [16:22:15] which belong to ubuntu :( Installed: 2.2.22-1ubuntu1.2 [16:22:15] arr [16:22:40] well it was nice looking at it :-P [16:23:00] "nice" :) [16:23:07] heh heh [16:24:56] !log tartupeedia.ee now redirects to et.wp portal page (bug 44893) [16:24:57] Logged the message, Master [16:28:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:43] opened a boring little bug abuot the script [16:36:46] :-/ [16:37:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.094 seconds [16:40:45] apergos: mutante can you please land in role::db::beta dummy role class https://gerrit.wikimedia.org/r/#/c/49703/3 ? :-D [16:40:57] would let me have mysql::packages on labs :-] [16:41:02] and the nice motd [16:44:06] you just need the packages, no config ot anything whatsoever? [16:44:14] *or [16:44:36] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , cawiki (18480), enwiki (680984), fawiki (117519), Total (818762) [16:45:00] thanks nagios [16:45:04] *sigh* [16:45:33] hashar: ? [16:45:40] hashar: what about mariadb again? i would have to lookup an email..hm [16:46:00] apergos: mutante: na just the packages, to make sure I am in sync with production [16:46:19] I am not sure what is the mariadb state :-] [16:46:23] we're not moved over yet so I guess it's ok for now but [16:46:30] soonish I guess [16:46:36] maybe I should use mariadb hyeah [16:47:00] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , cawiki (12691), enwiki (675019), fawiki (114063), Total (805694) [16:47:03] will save me from having to migrate a second time [16:47:06] poor job queue :( [16:47:41] ahh [16:47:46] coredb_mysql::packages and set $mariadb [16:50:23] yeah [16:50:28] will poke asher about it next week [16:50:52] Change abandoned: Hashar; "per review with apergos and mutante, lets just use mariadb :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49703 [16:51:38] k [16:53:33] hashar: i already had working mariadb classes in puppet long time ago. .but they were lost >:p [16:53:48] i wonder if i should still resubmit or not [16:54:03] yes [17:16:24] RECOVERY - Puppet freshness on es1004 is OK: puppet ran at Fri Feb 22 17:15:49 UTC 2013 [17:16:25] RECOVERY - Puppet freshness on db40 is OK: puppet ran at Fri Feb 22 17:15:54 UTC 2013 [17:16:51] ^demon: class passwords::gerrit { [17:17:01] ^demon: done [17:17:26] <^demon> Thanks! [17:18:18] yw [17:30:34] New patchset: Demon; "Configure hooks-bugzilla plugin for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50379 [17:33:35] New patchset: Demon; "Configure hooks-bugzilla plugin for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50379 [17:41:53] New patchset: Demon; "Configure hooks-bugzilla plugin for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50379 [17:52:26] food time [17:53:00] <^demon> nom nom. [17:53:16] !log taking down sq41 to replace controller card.... 
https://rt.wikimedia.org/Ticket/Display.html?id=4550 [17:53:18] Logged the message, Master [17:59:51] damn, nagios is over a page long again [17:59:56] * RobH shakes fist at air [18:05:41] RECOVERY - Host sq41 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:06:35] PROBLEM - Puppet freshness on srv245 is CRITICAL: Puppet has not run in the last 10 hours [18:06:53] RobH: replaced controller card on sq41...should be back up now [18:07:45] sbernardin1: rt 4550? [18:08:21] thx dude [18:10:02] PROBLEM - Frontend Squid HTTP on sq41 is CRITICAL: Connection refused [18:11:49] ah good [18:14:25] sbernardin: what's up with rt 4497? [18:15:46] Jeremyb_: we don't have any similar memory to replace it with [18:16:02] RECOVERY - Host mw1045 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [18:16:09] sbernardin: we need to update the ticket [18:16:16] sbernardin: we need to purchase? there's been no downtime yet? [18:16:19] i assumed you were after our discussion, i'll add an update now. [18:16:44] Jeremyb_: need to take server offline to get info for purchase [18:17:07] ok. i see dab's online [18:17:11] i'll be back in a bit [18:21:08] PROBLEM - Apache HTTP on mw1045 is CRITICAL: Connection refused [18:24:16] robh: rt4545...finished adding back into groups [18:24:27] need to start service [18:24:43] ok, back in node groups, but not pybal right? [18:25:00] correct [18:25:10] add back to pybal? [18:25:15] puppet runs successfully? [18:25:26] did not run it....you said to ping you [18:25:31] so ping [18:25:32] ahh, ok, so you want to run puppet [18:25:39] dont touch pybal stuff until after its good to go [18:25:52] and then when we add back to pybal, i suggest setting it to false so pybal will test it [18:25:55] but not actually pool it [18:26:03] k [18:26:24] hrmm, all of my hosts that were reported throwing errors for sync-common are not [18:26:30] perhaps puppet runs in past 12 hours have fixed them [18:26:32] \o/ [18:27:11] RobH: updated ticket for sq41...let me know when everything is good with it so I can resolve... [18:27:40] sq41 will need to be reinstalled (robh) [18:28:00] was just a controller swap i thought? [18:28:23] yeah..last time we swapped one...the OS had to be reinstalled [18:28:29] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [18:28:30] check it [18:30:14] ok, will do [18:30:27] sbernardin: i stole ticket, cuz i wanna link it to reinstall ticket [18:32:04] RobH: OK...cool [18:33:07] hrmm, lets see which servers error for sync common all. [18:36:16] RECOVERY - Apache HTTP on mw1045 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.110 second response time [18:37:43] !log robh Started syncing Wikimedia installation... : [18:37:44] Logged the message, Master [18:38:06] robh: set mw1045 to false in pybal [18:38:23] also...when the server goes offline should it be commented out? [18:38:30] nah [18:38:37] pybal will depool it automatically [18:38:46] so unless its going to be gone forever, its fine to leave normally [18:38:53] unless it's bits (iirc) [18:38:57] i just have had some odd behaviors on apaches recently [18:39:10] or memcached... [18:39:13] so im paranoid and wanna see it pool with false [18:39:18] actually, memcached is odd duck.
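For reference, the "set it to false" step above is a single field in pybal's server list, whose entries are Python dict literals, one host per line; with enabled set to False, pybal still health-checks the host but never pools it. Field names as used in the configs of the era; treat the line as illustrative:

```python
{'host': 'mw1045.eqiad.wmnet', 'weight': 10, 'enabled': False}
```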
[18:39:22] yeah..that's cool [18:39:25] i updated the wikitech page on that recently [18:39:39] since its not slotted, if its going to be offline less than 5 minutes while its fixed [18:39:40] * AaronSchulz wishes the consistent hash was based on arbitrary names [18:39:51] then you could do swaps (same name, different ip) [18:39:51] you may as well leave it in, cuz shuffling it out and back in that quickly may lead to issues [18:40:02] if its going to be offline for hours or so [18:40:15] then take it out, but leave it out for awhile, we have some stuff that will stay in memcached for much longer than a few days [18:40:19] PROBLEM - NTP on mw1045 is CRITICAL: NTP CRITICAL: Offset unknown [18:40:21] but a day or so is better than nothing. [18:40:38] the day or so thing is an arbitrary rob number. [18:40:43] not to be confused with actual factual data. [18:41:02] heh...never confused [18:41:04] AaronSchulz: sounds like you wanna restructure our memcached, thanks for volunteering ;] [18:41:31] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 181 seconds [18:41:47] 118 Warning: require(/usr/local/apache/common-local/php-1.21wmf10/../wmf-config/ExtensionMessages-1.21wmf10.php) [function.require]: [18:41:48] failed to open stream: No such file or directory in /usr/local/apache/common-local/wmf-config/CommonSettings.php on line 2757 [18:41:48] 117 Fatal error: require() [function.require]: Failed opening required '/usr/local/apache/common-local/php-1.21wmf10/../wmf-config/E [18:41:48] xtensionMessages-1.21wmf10.php' (include_path='/usr/local/apache/common-local/php-1.21wmf10/extensions/TimedMediaHandler/handlers/OggHandler/PEAR/File_Ogg:/usr/local/ap [18:41:48] ache/common-local/php-1.21wmf10:/usr/local/lib/php:/usr/share/php') in /usr/local/apache/common-local/wmf-config/CommonSettings.php on line 2757 [18:41:51] oh ffs [18:41:58] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 188 seconds [18:42:56] RobH: pecl memcached doesn't really support that idea [18:43:09] so you have to bring up machines with the same IPs as old ones ;) [18:43:14] eww. [18:43:38] Reedy: So I ran sync-common-all, and none of my new stuff complained [18:43:39] well you could use hostnames too [18:43:45] it also runs sync-common locally without issues [18:43:58] Reedy: on the hosts you said didnt, i think multiple puppet runs overnight may have fixed file issues? [18:44:11] and bring up replacements with the same hostname, though dns would need to be updated obviously [18:44:20] RobH: is that better? :) [18:44:32] no way [18:44:40] though I kind of like not using dns for memcached at all [18:44:45] and doing the mappings in config [18:44:45] it sounds like its messy in an alltogether different manner! [18:44:54] right [18:44:59] heh [18:45:03] damned if you do... [18:45:04] Looks like those errors stopped 8 minutes or so ago [18:45:16] Reedy: .....on all of them? thats odd. [18:45:24] i'll take it, but its odd. [18:45:39] They were all from one server [18:45:39] 10.64.0.75 [18:45:52] PROBLEM - NTP on sq41 is CRITICAL: NTP CRITICAL: No response from NTP server [18:46:25] But the files look right on the new pmtpa apaches [18:46:40] cool, i'll push them back into pybal shortly. 
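What AaronSchulz is wishing for above is a consistent hash keyed on stable node names instead of addresses, so a replacement box inherits its predecessor's slot just by reusing the name. pecl memcached hashes the configured server list itself, hence the same-IP workaround; a minimal Python sketch of the name-keyed idea (illustrative, not anything deployed):

```python
import bisect
import hashlib

def _h(s):
    return int(hashlib.md5(s.encode("utf-8")).hexdigest()[:8], 16)

class NamedHashRing(object):
    def __init__(self, names, points_per_node=64):
        # Each *name* lands on the ring many times; IPs never enter
        # the hash, so "same name, different ip" swaps stay stable.
        self._ring = sorted((_h("%s:%d" % (name, i)), name)
                            for name in names
                            for i in range(points_per_node))
        self._hashes = [h for h, _ in self._ring]

    def node_for(self, key):
        i = bisect.bisect(self._hashes, _h(key)) % len(self._ring)
        return self._ring[i][1]

ring = NamedHashRing(["mc1", "mc2", "mc3"])
print(ring.node_for("user:12345"))  # stable for as long as names persist
```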
[18:46:46] New patchset: Andrew Bogott; "Fix the rval logic in start_volume" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50382 [18:46:47] thx for spotting the issue last night =] [18:48:23] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50382 [18:49:34] robh: the new wtp1001 is replacing the old wtp1001 from b4...correct? [18:49:43] RobH: maybe after removing/adding a server, one could do a rolling flush of the servers over time (in case a second server has to get removed/added) [18:49:49] rt4559 [18:49:59] !log robh Finished syncing Wikimedia installation... : [18:50:00] Logged the message, Master [18:50:06] cmjohnson1: yes, but since its there, the racktables name cannot update [18:50:14] cmjohnson1: so just leave the new one as wmfwhatever in racktables for now [18:50:20] though the arbitrary name thing sure would be easy if the extension just supported it... [18:50:25] k [18:50:27] once we move wtp1001 work to new 1002+ then we rename old one [18:50:45] AaronSchulz: how long would the rolling flush take? [18:51:00] ie: can we make it a script that fires off or we manually fire off when we depool or repool a server? [18:51:02] RobH: whatever speed doesn't break the site :) [18:51:13] yea... thats the tricky part aint it =P [18:51:26] though it sounds like the only way to really do our current setup properly. [18:56:42] !log mw86-99 & mw111 back into pybal pool [18:56:42] Logged the message, RobH [18:57:33] !log mw112-125 back into pybal api pool [18:57:34] Logged the message, RobH [18:57:54] RobH: What are you guys doing to wtp1001 now? [18:58:07] RoanKattouw: So we have the old wtp1001, and a new one [18:58:13] Ah I see [18:58:15] once we bring online wtp1002-1004 [18:58:17] we can kill old 1001 [18:58:21] So we have two machines called wtp1001, fun [18:58:21] and use name on new 1001 [18:58:22] Yeah [18:58:29] sound good? [18:58:33] Yup [18:58:38] (notice i allocated you 4 not 5) [18:58:43] we can allocate more easily, i have them [18:58:47] Right [18:58:54] I only have 3 in eqiad as it stands anyway [18:59:11] so my plan is to bring those online (its been on my list all week) [18:59:13] i have them half done [18:59:19] Oh good [18:59:26] then hand them to you for deployment, and you can then tell me when we can kill the old 1001 [18:59:32] Yeah, sounds good [18:59:42] What will the intermediate name of wtp1001 be? [18:59:59] RoanKattouw: none, not installed [19:00:06] OK [19:00:10] it goes by the asset tag until then, so we'll just ignore it [19:00:14] Oh I see [19:00:20] We bring up 1002-1004, kill 1001, bring up new 1001 [19:00:24] yep [19:00:33] Right, now I understand [19:00:43] well, between kill old and install new is clear all cache and dns records, kill puppetmaster db entries, etc... [19:00:51] but yea [19:01:13] man, i used to recall everything i worked on, now i have to read my notes [19:01:22] i have 1002-1004 installed and ready for initial puppet runs i think, checking now [19:02:11] New patchset: Pyoungmeister; "make squid reload if it sees a new conf, instead of restart" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47030 [19:02:36] oh yea, they did some odd hostname shit, they need reinstall i think.
[19:02:41] damn it [19:03:32] mark: paravoid would love your input on ^^ [19:04:16] won't work [19:04:21] :((( [19:04:22] I'm surprised how it passed lint [19:04:29] notify => Exec['foo'] [19:04:33] ah [19:04:34] notify => foo is invalid syntax [19:04:36] thanks :) [19:05:48] would probably just notify true [19:05:53] sounds good to me! [19:06:04] New patchset: Pyoungmeister; "make squid reload if it sees a new conf, instead of restart" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47030 [19:07:07] VIP="10\\.2\\.\(1\-2\)\\.\(1\|21\)" [19:07:13] notpeter: you forgot one [19:07:19] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [19:07:24] *sigh* [19:07:26] yep [19:07:38] paravoid: is that 1\-2 or just 1-2 [19:08:02] [1-2] [19:08:18] Hey paravoid, so, for new puppet modules: [19:08:20] 2 spaces? [19:08:22] instead of tabs? [19:08:24] duh, of course square brackets, thx [19:08:27] New patchset: Pyoungmeister; "make squid reload if it sees a new conf, instead of restart" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47030 [19:08:39] try "echo foo | egrep '...'" to test :) [19:09:03] it does grep -q [19:09:04] (i've got tabs currently, I/we can change to tabs later if it needs discussion) [19:09:07] ok [19:09:30] !log sq41 reinstalling [19:09:31] Logged the message, RobH [19:10:54] RobH: I can think of some clever ways to deal with this in BagOStuff [19:11:10] though callers would need to pass an extra $flags option [19:11:58] im gonna pretend to follow that ;] [19:12:00] * RobH nods [19:12:49] only a portion of things that cache care about sequential consistency that much [19:12:50] New patchset: Ottomata; "Initial commit of Kafka Module." [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [19:13:10] if that was flagged with the cached item, it could be checked on read against some epoch [19:13:32] one could bump the epoch when membership changes (or could even be based on a hash of the list automatically) [19:13:50] ahh [19:14:18] that sounds feasible, the last part in particular, since it means automatic regen, rather than fire off a script and dont forget to do it [19:14:21] the trick would be not wasting too much space storing this info in the cache [19:15:25] you could also do cool things like call delete( key, mt_rand( 0, x ) to expire the key (so they don't all go at once) [19:15:43] * delete( key, mt_rand( 0, x ) ) [19:15:49] just a crazy thought [19:16:50] New patchset: Dzahn; "apache-sanity-check needs to use new regex to also work on eqiad Apaches see details in RT-4449" [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/49231 [19:18:07] RECOVERY - NTP on mw1045 is OK: NTP OK: Offset 0.002457976341 secs [19:18:22] !log kicked ntpd manually on mw1045 [19:18:24] Logged the message, RobH [19:23:09] New patchset: awjrichards; "Set template to append to mobile photo upload desc" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50387 [19:28:01] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [19:28:10] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [19:34:36] !log start mysqldump of otrs database on db49 [19:34:38] Logged the message, Master [19:36:59] notpeter: Regarding moving mediawiki::singlenode into a module… would roles move into the module as well, or are roles staying in the global manifests dir?
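Two of the threads above are worth pinning down. First, the VIP pattern fix: the quick sanity check is the suggested echo | egrep, and the same test in Python (pattern transcribed from the grep syntax; the trailing anchor is added here only to make the standalone test strict):

```python
import re

vip = re.compile(r'10\.2\.[12]\.(1|21)$')
for ip in ["10.2.1.1", "10.2.2.21", "10.2.3.1", "10.64.0.1"]:
    print("%s -> %s" % (ip, bool(vip.match(ip))))
# expected: True, True, False, False
```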
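Second, the cache ideas AaronSchulz floats, an epoch derived from a hash of the server list plus jittered deletes, could look roughly like this (a concept sketch in Python, not BagOStuff):

```python
import hashlib
import random

def pool_epoch(servers):
    # Changes automatically whenever pool membership changes.
    return hashlib.md5(",".join(sorted(servers)).encode("utf-8")).hexdigest()

def cache_set(cache, key, value, servers):
    cache[key] = {"epoch": pool_epoch(servers), "value": value}

def cache_get(cache, key, servers):
    entry = cache.get(key)
    if entry is None or entry["epoch"] != pool_epoch(servers):
        return None  # written under an old pool: treat as a miss
    return entry["value"]

def staggered_delete_delays(keys, window):
    # The delete(key, mt_rand(0, x)) idea: give each key a random
    # delay within the window so the cache doesn't empty at once.
    return dict((k, random.uniform(0, window)) for k in keys)
```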
[19:37:20] PROBLEM - Puppet freshness on sq41 is CRITICAL: Puppet has not run in the last 10 hours [19:50:28] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 22 seconds [19:51:23] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [19:56:17] RoanKattouw: Ok, wtp1002-1004 are now online and have done successful puppet runs [19:56:20] they are all yours =] [19:56:22] Yay [19:56:27] !log wtp1002-1004 os installed [19:56:28] Logged the message, RobH [19:56:33] * RoanKattouw grabs them [19:57:01] once you have them serving things, take down the existing 1001 as priority, as soon as its gone i'll get the new 1001 online for you [19:57:25] Will do [19:57:29] thx dude [19:58:16] !log sq41 boot order was fubar, fixed, reinstalling (as reboot into installer wiped mbr) [19:58:17] Logged the message, RobH [19:59:29] New patchset: Catrope; "Add wtp1002-4 to Parsoid deployment list" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50392 [20:00:05] RECOVERY - MySQL disk space on neon is OK: DISK OK [20:00:31] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [20:00:32] apergos: I am going to allocate harmon and copper for you. I'll spin them up with the basic raid mirror misc server partitioning? [20:00:33] RobH: Can you merge and deploy that please ---^^ [20:00:57] will do [20:01:04] great thanks [20:01:20] apergos: when done i'll just toss a ticket in core-ops assigned to you for them [20:01:26] these need internal ip yes? [20:01:43] yeah internal is fine [20:01:57] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50392 [20:01:58] cool, thx [20:02:26] thank you! [20:02:29] RoanKattouw: merged on sockpuppet [20:02:59] Yay thanks [20:03:20] I have the same servers for you in Tampa (eventually) [20:03:30] i just have not gotten to them yet.
[20:03:35] OK [20:09:56] !log authdns-update [20:09:57] Logged the message, RobH [20:12:40] RECOVERY - SSH on sq41 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [20:13:25] !log olivneh synchronized php-1.21wmf10/extensions/Math 'Updating Math extension to fix db caching behavior (Change-Id: I9b690e55001859c97fd40330272791d49ec6de75)' [20:13:26] Logged the message, Master [20:14:20] !log olivneh synchronized php-1.21wmf9/extensions/Math 'Updating Math extension to fix db caching behavior (Change-Id: I9b690e55001859c97fd40330272791d49ec6de75)' [20:14:21] Logged the message, Master [20:15:04] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [20:16:36] !log olivneh synchronized php-1.21wmf9/extensions/E3Experiments/experiments/openTask.js 'Resume split test (1/3)' [20:16:38] Logged the message, Master [20:16:52] !log olivneh synchronized php-1.21wmf9/extensions/GettingStarted/resources/ext.gettingstarted.js 'Resume split test (2/3)' [20:16:54] Logged the message, Master [20:17:09] !log olivneh synchronized php-1.21wmf9/extensions/GuidedTour/modules/tours/gettingstarted.js 'Resume split test (3/3)' [20:17:10] Logged the message, Master [20:18:04] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 187 seconds [20:19:16] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 195 seconds [20:21:13] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [20:22:52] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 11 seconds [20:23:28] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [20:24:20] * AzaToth requests some more review on https://gerrit.wikimedia.org/r/#/c/50044/ [20:36:52] Ryan_Lane: What's the magic incantation to initialize salt on a machine again? [20:41:56] RECOVERY - Puppet freshness on sq41 is OK: puppet ran at Fri Feb 22 20:41:48 UTC 2013 [20:42:01] New review: Ori.livneh; "(4 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50044 [20:43:52] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [20:49:59] andrewbogott: roles should stay in roles [20:50:09] notepeter: Cool, that's what I thought. [20:50:11] eventually we will want a role module [20:50:21] but due to namespacing, we can't do that piecemeal [20:55:25] RECOVERY - NTP on sq41 is OK: NTP OK: Offset 0.03117215633 secs [20:55:52] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.34 ms [20:57:13] DAMN YOU ASW-B-EQIAD DO WHAT I TELL YOU [20:57:14] sigh. [20:58:19] Stupid question: Why do the RT and Bugzilla tracking links for Ongoing: "All week Swift cluster hardware swap out" link to the Wikivoyage migration items? [20:58:26] Sorry, on https://wikitech.wikimedia.org/view/Deployments :-) [21:00:39] Ryan_Lane: WTF?! [ERROR ] Failed to authenticate with the master, verify this minion's public key has been accepted on the salt master (on wtp1002) [21:01:14] RobH: I'm delayed in setting up wtp1002-4 because salt is giving me crap, see pings to Ryan above [21:01:33] RoanKattouw: No worries [21:03:29] bwahahahaha made it work [21:03:34] weeee [21:03:42] \o/ asw-b-eqiad is my bitch. [21:05:40] !log installing harmon and copper into dump generation misc use [21:05:42] Logged the message, RobH [21:06:04] PROBLEM - Host copper is DOWN: PING CRITICAL - Packet loss = 100% [21:08:04] New patchset: Ottomata; "Initial commit of Kafka Module."
[operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [21:09:50] New review: Ottomata; "Patchset 2: puppet linted." [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [21:12:25] New patchset: RobH; "harmon and copper as internal hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50431 [21:16:37] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50431 [21:18:24] New patchset: RobH; "copper migration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50433 [21:19:32] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50433 [21:28:38] RECOVERY - Puppet freshness on sq71 is OK: puppet ran at Fri Feb 22 21:28:33 UTC 2013 [21:28:56] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused [21:29:23] PROBLEM - Parsoid on wtp1002 is CRITICAL: Connection refused [21:30:17] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [21:30:55] RoanKattouw: ^ [21:31:01] starting to throw errors, just fyi [21:32:42] ah poop, paravoid, so $sha1 in git submodule foreach refers to what the HEAD index says, not what the FETCH_HEAD says [21:32:45] so [21:33:09] git submodule foreach 'diff HEAD..${sha1}' is basically diff HEAD..HEAD [21:33:59] could parse the diff of superproject diff HEAD..FETCH_HEAD for relevant shas [21:34:00] hm [21:34:38] RECOVERY - Puppet freshness on cp1035 is OK: puppet ran at Fri Feb 22 21:34:28 UTC 2013 [21:35:50] !log maxsem synchronized php-1.21wmf10/extensions/GeoData/solrupdate.php 'https://gerrit.wikimedia.org/r/#/c/50388/' [21:35:52] Logged the message, Master [21:36:20] are srv226-234 decommissioned? [21:38:12] !log maxsem synchronized php-1.21wmf9/extensions/GeoData/solrupdate.php 'https://gerrit.wikimedia.org/r/#/c/50388/' [21:38:13] Logged the message, Master [21:39:47] !log Nuking GeoData Solr index and reindexing [21:39:49] Logged the message, Master [21:39:50] !log harmon and copper online with OS installed for datadumps misc host support [21:39:52] Logged the message, RobH [21:41:13] RobH: I know. I need Ryan to fix salt [21:41:16] brb lunch [21:41:59] RoanKattouw: ahh, you said so, my bad [21:45:38] Ryan_Lane: Do you want an interesting illustration about what I said regarding perl? [21:46:11] Ryan_Lane: Here is a legitimate, and useful regex used in CSBot that someone asked me what it does. See if you can guess. [21:46:13] s/\{\|[^*]*\|+([^\{*][^*]*\|+)*\{|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^\{"'\\]*)/defined $2 ? $2 : ""/gse; [21:50:26] Heh. [21:50:36] He'd asked me about it as well. [21:51:50] RoanKattouw_away: something is broken with salt? [21:52:10] RoanKattouw_away: oh. I see [21:53:42] RoanKattouw_away: I need to talk to the salt folks about that error [21:53:54] oh wait, no [21:54:04] those are new systems, they just weren't accepted on the master yet [21:58:46] ori-l: regarding your review, it's a copy of patchset-created [21:58:55] just so you know [21:59:08] and regarding pep-8, let it burn in hell [21:59:45] but pep8 rejects the afterlife and the notion of a punitive deity [21:59:49] hehe [22:00:35] you asked for a look-through on a public channel (can't remember which); i had a few stylistic remarks but i didn't feel strongly enough about them to attach a score. disregard if you like. [22:00:44] no problem [22:02:26] I meant more in line with whether the new hook is wanted or not [22:03:20] atm drafts will not post to IRC until they are merged [22:03:59] i.e.
when publishing a draft, patchset-created will not be fired [22:07:37] !log reindexing complete [22:07:38] Logged the message, Master [22:17:32] PROBLEM - Puppet freshness on cp3003 is CRITICAL: Puppet has not run in the last 10 hours [22:23:25] New review: Ryan Lane; "(3 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50044 [22:23:32] PROBLEM - Puppet freshness on cp3004 is CRITICAL: Puppet has not run in the last 10 hours [22:24:20] AzaToth: I'd like us to switch to pep-8 for everything ;) [22:24:28] none of my old code is pep8, though [22:24:35] and this is based on my code [22:29:37] Ryan_Lane: OK it [22:29:41] 's working now [22:29:46] Ryan_Lane: In the future, what can I do to make that work? [22:37:59] RoanKattouw: nothing. those were new systems [22:38:02] I just accepted their keys [22:39:33] Ahm.... [22:39:35] error: The requested URL returned error: 403 while accessing http://tin.eqiad.wmnet/parsoid/config/.git/info/refs [22:39:55] -_- [22:40:06] let me see if the ranges need to be adjusted [22:40:26] this is one of the wtp systems? [22:40:54] wtp1004 [22:41:35] 10.64.0.0/22 [22:41:39] !log sq41 reinstalled, throws endless puppet errors during redeployment [22:41:40] Logged the message, RobH [22:41:41] !log mw110 throwing odd puppet errors, will come back to it later [22:41:42] Logged the message, RobH [22:41:48] wtp1004.eqiad.wmnet has address 10.64.32.75 [22:42:03] Why /22? [22:42:16] that's what it currently is [22:42:20] OK... [22:42:21] which is obviously wrong ;) [22:42:28] But wtp1 isn't even in that range [22:42:48] 10.0.0.0/16 10.64.0.0/22 10.64.16.0/24 208.80.152.0/22 [22:42:54] that's actually the current range [22:43:18] I should probably have 10.64.0.0/16 [22:43:20] I see [22:43:23] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours [22:43:24] PROBLEM - Puppet freshness on ms-be3003 is CRITICAL: Puppet has not run in the last 10 hours [22:43:26] Yeah that seems saner [22:43:42] Even if that range isn't all eqiad in the future, it seems like a good range to whitelist regardless [22:43:52] db27 is goddamn hosed [22:44:03] it won't even serial output, wtf. [22:44:44] RobH: Glad that Ops gave Peter db29 over db27, then. :-( [22:45:06] what you mean? [22:45:10] both are general db use [22:45:17] (isn't that what peter used it for?) [22:45:25] he has both of them. [22:45:39] RoanKattouw: fixing [22:45:44] OK [22:46:00] New patchset: Ryan Lane; "Change new deploy http accept range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50441 [22:46:01] ^^ [22:46:36] YAy [22:47:03] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50441 [22:47:46] RoanKattouw: ok. try now [22:49:57] OK that worked [22:50:04] But (I think this might be a known issue) it doesn't run the hooks [22:50:16] the ones to add the links?
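
The "Failed to authenticate with the master" error RoanKattouw hit on wtp1002 above, and Ryan's fix ("I just accepted their keys"), are Salt's standard key-acceptance handshake: a fresh minion submits its public key to the master and can do nothing until that key is accepted. A minimal sketch of the accept flow on the master, assuming the stock salt-key/salt CLI (the exact invocations here are illustrative, only the hostnames come from the log):

```bash
# On the salt master: list keys grouped by state, then accept the new minions.
salt-key -L                        # shows Accepted / Unaccepted / Rejected keys
salt-key -a wtp1002.eqiad.wmnet    # accept a single pending minion key
salt-key -a wtp1003.eqiad.wmnet
salt-key -a wtp1004.eqiad.wmnet

# Confirm the minions can now authenticate and respond.
salt 'wtp100[2-4].eqiad.wmnet' test.ping
```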
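
The 403 from tin.eqiad.wmnet follows directly from the ranges quoted above: the deploy ACL allowed 10.64.0.0/22, which spans only 10.64.0.0 through 10.64.3.255, so wtp1004 at 10.64.32.75 fell outside it, while Ryan's proposed 10.64.0.0/16 does contain it. A small bash sketch of that containment check (the helper names are illustrative, not taken from any deployed script):

```bash
# Convert a dotted quad to a 32-bit integer.
ip_to_int() {
  local IFS=.
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Report whether an IP address falls inside a CIDR range.
in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( (ip & mask) == (net & mask) )) && echo "in range" || echo "out of range"
}

in_cidr 10.64.32.75 10.64.0.0/22   # out of range -> the 403 above
in_cidr 10.64.32.75 10.64.0.0/16   # in range     -> Ryan's proposed fix
```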
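
On ori's `git submodule foreach` observation from earlier (21:32-21:33): `$sha1` inside `foreach` is the gitlink recorded in the superproject's HEAD, so diffing against it after a fetch just compares HEAD with itself. The workaround he floats, reading the incoming submodule pointers out of the superproject's own HEAD..FETCH_HEAD diff, could look something like this sketch (the awk filter is illustrative):

```bash
# Fetch the branch you want to compare against.
git fetch origin master

# Human-readable: --submodule=log renders gitlink changes as per-submodule
# commit ranges instead of bare SHA pairs.
git diff --submodule=log HEAD..FETCH_HEAD

# Script-friendly: raw diff entries whose old mode is 160000 are changed
# submodules; fields 3 and 4 are the old and new gitlink SHAs, the last
# field is the submodule path.
git diff --raw HEAD..FETCH_HEAD | awk '$1 == ":160000" { print $3, $4, $NF }'
```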
[22:50:19] Yes [22:50:22] -_- [22:50:38] I can initialize the hosts and then do a dummy deploy, that'll put them in [22:50:45] ok [22:50:49] But it should just do it upon init, ideally [22:50:52] indeed [22:50:56] I'll take a look at that soon [22:52:26] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50329 [22:52:43] Also, ideally the salt initialization would be puppetized somehow [22:52:46] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50323 [22:55:10] !log changing dns entry for wikitech-static [22:55:11] Logged the message, Master [22:57:20] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.064 seconds [22:57:38] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.058 seconds [22:57:52] RobH: Look at that, they're coming up [22:57:55] Now lemme pool them [22:58:23] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.058 seconds [22:58:32] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.013 seconds response time. www.wikipedia.org returns 208.80.154.225 [22:58:38] RobH: I thought Peter just had db29 (for getting the numbers for SUL finalisation); it'll have a bunch of extra dbs on it so he can join across all of them for ultra-queries-of-death. [23:00:20] New review: CSteipp; "I did test this since I was curious, and ssl will work without the 3 lines specifying SSL being on, ..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/49200 [23:04:25] James_F: i have no idea what peter does with the databases [23:04:32] i just know if a server is db class and is assigned to him. [23:04:35] ;] [23:04:50] he asked for both db27 and db29 to get fixed, only one of the two was repaired right away [23:05:36] RobH: wtp1002-4 are up and running, wtp1001 is depooled, yank when desired [23:06:20] PROBLEM - Puppet freshness on ms-be3001 is CRITICAL: Puppet has not run in the last 10 hours [23:06:45] RoanKattouw: Awesome, thanks dude [23:07:04] i'll get it pulled and renamed, so your new wtp1001 should be set later today/monday [23:07:13] if not monday, i dunno, we have meetings all next week [23:07:21] Right, that's fine [23:09:31] RobH: Sure. [23:12:04] New patchset: Faidon; "Support /transcoded/ for upload Varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50443 [23:12:55] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50443 [23:26:03] !log deploying squid [23:26:04] Logged the message, Master [23:29:01] !lo geploying squid config for upload's new transcoded container [23:29:08] !log deploying squid config for upload's new transcoded container [23:29:09] Logged the message, Master [23:29:16] third time's the charm [23:30:44] New patchset: Faidon; "Fix exception handling, lint" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50376 [23:32:07] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50376 [23:33:05] New patchset: Andrew Bogott; "Refactor mediawiki::singlenode and wikidata::singlenode into modules." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50451 [23:33:12] New review: Faidon; "Consider this a +2 if you test it before/after you deploy it :)" [operations/debs/wikimedia-task-appserver] (master) C: 1; - https://gerrit.wikimedia.org/r/49231 [23:34:13] New patchset: Ottomata; "Adding puppet-merge for sockpuppet puppet merges."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/50452 [23:35:07] New review: Andrew Bogott; "I've verified that the role::mediawiki-install::labs class is still working properly. I have /not/ ..." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/50451 [23:38:56] New review: Andrew Bogott; "I've verified that the role::mediawiki-install::labs class is still working properly. I have /not/ ..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/50451 [23:40:17] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [23:42:01] :O [23:42:05] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:49:24] New patchset: Ori.livneh; "Add python-pygments as dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50454 [23:54:39] New patchset: Pyoungmeister; "more cleanup of old mediawiki class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50455 [23:55:19] !log argon is going offline, killing our limesurvey install [23:55:21] Logged the message, RobH [23:56:02] New patchset: RobH; "killing argon, as we dont need limesurvey anymore" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50456 [23:57:24] New review: Demon; "Limesurvey is dying? Happy days are here again!" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/50456 [23:58:21] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50455 [23:58:23] ^demon: heh [23:58:29] was like 'why did chad +1?' [23:58:32] i am amused. [23:58:45] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50456 [23:58:58] RobH: merging for you [23:59:04] already did [23:59:06] oh, on sockpuppet? [23:59:09] thx dude [23:59:10] yeah [23:59:14] <^demon> I remember when limesurvey was installed and it spammed the apache logs. [23:59:16] <^demon> Piece of crap. [23:59:27] !log argon is shut down now [23:59:28] Logged the message, Master [23:59:47] notpeter: So FYI: I did not touch the limesurvey database.
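
For readers outside ops: the "merged on sockpuppet" ritual that recurs throughout this log (and that Ottomata's puppet-merge change 50452 sets out to wrap) is a fetch-review-confirm step on the puppetmaster, separate from the Gerrit merge itself. A hypothetical sketch of that shape, not the contents of change 50452; the repository path and prompt flow are assumptions:

```bash
#!/bin/bash
# Hypothetical puppet-merge-style helper (path and flow are assumptions):
# fetch what Gerrit has merged, show the incoming commits, and only
# fast-forward the puppetmaster's checkout after a human confirms.
set -e
cd /var/lib/git/operations/puppet    # assumed repo location on the master

git fetch origin production
git log --stat HEAD..FETCH_HEAD      # review what would be merged

read -r -p "Merge these changes? [y/N] " answer
if [[ $answer == [yY] ]]; then
    git merge --ff-only FETCH_HEAD
fi
```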