[00:01:42] wikidata status update for anyone interested: the migration to the S5 shard is done on 6 out 8 effected db servers. i expect to be fully done in < 45min [00:03:54] binasher: thanks for the updates :) [00:04:01] ping robla : see above wikidata status [00:04:27] Ryan_Lane: wanna review https://gerrit.wikimedia.org/r/#/c/32924/ ? [00:04:40] you're more authoritative on that area :) [00:04:56] lemme see [00:05:23] I'm pretty sure that file should just get deleted [00:05:26] not renamed [00:05:41] Reedy: ^^ [00:05:47] heh [00:06:09] I'll do that then [00:07:00] New patchset: Reedy; "Delete *.wikimedia.org.crt" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32924 [00:14:09] I'm getting really annoyed by having copies of configs between nagios and icinga [00:14:24] New patchset: Asher; "moving wikidatawiki to s5, disabling wgReadOnly" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50127 [00:16:00] New patchset: Faidon; "Fix check_solr for Nagios, not just Icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50129 [00:16:23] New review: Asher; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50127 [00:16:24] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50127 [00:17:38] !log asher synchronized s3.dblist 'removing wikidatawiki' [00:17:40] Logged the message, Master [00:17:48] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50119 [00:17:57] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50119 [00:18:10] !log asher synchronized s5.dblist 'adding wikidatawiki' [00:18:11] Logged the message, Master [00:18:12] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50129 [00:18:18] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50129 [00:18:27] binasher: sync-dblist ;) [00:19:36] New review: Ryan Lane; "Patch Set 7: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/32924 [00:19:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32924 [00:20:10] woosters: thanks for the ping [00:20:58] !log asher synchronized wmf-config 'enabling wikidatawiki on shard s5' [00:20:59] Logged the message, Master [00:21:26] Wheee [00:21:28] binasher: is that it? [00:21:39] New patchset: Reedy; "Add enwiki to wikidata dblist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50130 [00:22:13] * robla prepares "w00t" and other celebratory comments [00:22:33] New review: Reedy; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50130 [00:22:34] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50130 [00:22:39] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 207 seconds [00:22:49] robla: should be! wikidata db writes are flowing to the s5 master and so far, everything looks good [00:22:54] Reedy: so, those jobqueue alerts... [00:22:55] ignore them? [00:22:57] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 214 seconds [00:23:09] Err [00:23:15] excellent, thanks binasher! w00t! 
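The shard move above comes down to editing two plain-text dblist files (one wiki name per line) and syncing them out — wikidatawiki is dropped from s3.dblist and appended to s5.dblist. A rough sketch of that edit, assuming only the one-name-per-line format; the actual deployment used sync-file / sync-dblist rather than an ad-hoc script:

```python
def move_wiki(wiki, src_dblist, dst_dblist):
    # e.g. move_wiki("wikidatawiki", "s3.dblist", "s5.dblist")
    with open(src_dblist) as f:
        keep = [w for w in f.read().split() if w and w != wiki]
    with open(src_dblist, "w") as f:
        f.write("\n".join(keep) + "\n")
    with open(dst_dblist, "a") as f:
        f.write(wiki + "\n")
```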
\o/ [00:23:25] https://www.wikidata.org/wiki/Special:RecentChanges [00:23:27] :D [00:23:34] going to watch a bit longer but yup, still looks good [00:23:34] paravoid: Which alerts? [00:23:50] the nagios ones, we were looking at them the other day [00:23:54] you verified they're true [00:24:04] "w00t" :) [00:24:13] JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , lmowiki (11414), svwiki (63550), Total (90760) [00:24:17] Oh [00:24:21] JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (661642), lmowiki (11414), svwiki (63552), Total (750686) [00:24:26] apparently different, no idea why [00:24:30] the fist writes to immiediately start flowing are related to the job runner / update client (i.e. UPDATE /* Wikibase\DispatchChanges::trySelectClient */ `wb_changes_dispatch` SET chd_site = 'enwiki',chd_db = 'enwiki',chd_seen = '6148311',chd_touched = '20130221002303',chd_lock = 'enwiki.WikiBase.dispatchChanges',chd_disabled = '0' WHERE chd_site = 'enwiki') [00:24:52] hmmm [00:25:00] binasher: That seems to be just over 3GB smaller now.. [00:25:03] maybe we should have stopped the cronjobs [00:25:08] lolol [00:25:21] let's see if they pickup again [00:25:44] 20.986434936523 vs 24.061431884766 earlier [00:25:55] woo [00:25:57] nice [00:25:59] or there goes some data ;) hahaha [00:27:29] having explicitly defined small primary keys is a good thing, or innodb makes ones up that are unusable and generally take up a lot more more space per row than an auto-inc int would [00:27:33] did the changes table get pruned? [00:27:44] there is a cron job for that [00:27:46] dump/load always uses space more efficiently too [00:28:00] aude: nothing was explicitly pruned [00:28:15] binasher: that's fine but should automatically happen like once a day [00:28:18] hmm, job runners should probably be restarted though [00:28:19] makes stuff smaller [00:28:23] yeah [00:28:25] paravoid: But yes, those numbers are pretty useless now we keep old jobs around and prune them at some later point [00:28:37] uhm [00:28:47] so? remove the checks? [00:28:50] old/failed [00:28:54] it'd be nice to have /some/ check [00:29:01] Indeed, and before it worked fine (mostly) [00:29:28] paravoid: We need to do something with job_attempts [00:29:58] just curious, did we do rebuild the term search key? or add that column [00:30:05] (assume they can be done later) [00:30:18] the column with the OSC thing [00:30:27] AaronSchulz: ^ Should we count where job_attempts = 0 for the job queue counts? (I know, it's not indexed) [00:30:28] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 184 seconds [00:30:34] and rebuild does not require read only, i think [00:30:41] Reedy: what counts? [00:30:47] depends what you want to count [00:30:53] JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (661642), lmowiki (11414), svwiki (63552), Total (750686) [00:31:07] | 109017665 | 20130213224350 | refreshLinks | 3 | [00:31:07] | 109018333 | 20130213224918 | refreshLinks | 3 | [00:31:18] Which somewhat articifically inflate the count that we care about.. 
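The refreshLinks rows quoted just above carry job_attempts = 3, i.e. they are finished or failed jobs kept around for later pruning, which is what inflates the raw COUNT(*). A minimal sketch of the narrower count being discussed here, assuming direct access to a slave and that unclaimed rows are the ones whose job_token has not been set (the column Reedy and Aaron bring up); as noted later in the conversation, that column is unindexed, so the query still scans the table much like the plain count did (~1.7 s on enwiki):

```python
import MySQLdb  # host/user/password below are placeholders

def unclaimed_jobs(wikidb):
    conn = MySQLdb.connect(host="db-slave.example", user="ro", passwd="...", db=wikidb)
    cur = conn.cursor()
    # Count only rows no runner has claimed yet; claimed/finished rows keep a
    # job_token, so (per the discussion above) they are excluded here.
    cur.execute("SELECT COUNT(*) FROM job WHERE job_token = ''")
    return cur.fetchone()[0]
```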
[00:31:39] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 193 seconds [00:31:39] having tonnes of ones with 0 count would suggest new jobs possibly not being processed [00:32:43] the check calls extensions/WikimediaMaintenance/getJobQueueLengths.php [00:32:47] fwiw [00:33:04] Yeah [00:33:19] heh, I bet that won't work with redis :) [00:33:21] and fires off if at least one is > 10k [00:33:31] one wiki [00:34:46] !log moved wikidatawiki tables on s3 to wikidatawiki_old to keep around for a time just in case; then "drop database wikidatawiki" [00:34:47] Logged the message, Master [00:35:00] :) [00:35:10] just in case..... [00:35:36] That means they'll stay for everer [00:35:56] would everything be up to date on the toolserver once replication catches up? [00:36:01] or do they have to make changes as well? [00:36:18] They'll need to make changes [00:36:37] AaronSchulz: what's the lightest effective check we can do to make sure the job queue isn't spiralling out of control? [00:36:38] duh: looks like changes are needed there [00:36:52] ok [00:36:52] they need to change where they're replicating from [00:37:13] Reedy: is it doing COUNT() now? [00:37:22] Yeah [00:37:29] * aude no idea how toolserver replication works [00:37:51] i asked dab in -toolserver [00:37:56] Reedy: so it can do the same thing JobQueueDB::doGetSize does [00:37:56] ah, okay [00:38:47] Which is? [00:39:03] exclude rows with job_token set? [00:39:06] We aren't filtering by type.. [00:39:20] unindexed [00:39:29] isn't this scanning everything anyway? [00:40:06] 1 row in set (1.73 sec) on enwiki [00:40:07] meh, WFM [00:40:15] it's not like this is myisam and COUNT(*) was fast [00:45:29] just for curiousity can someone do show create table for wb_terms in wikidata? [00:45:45] * aude just wonders what, if anything remains to do [00:45:56] since i can't see on toolserver [00:47:30] aude: http://p.defau.lt/new.html [00:47:34] http://p.defau.lt/?wuvc0ouLdZ3NmR_e6YowkQ [00:47:49] awesome [00:48:10] can you tell if the term search key got populated or not? [00:48:18] not a big deal either way at this point [00:48:22] I'll have a look in a minute [00:48:30] thanks [00:48:39] it looks good though [00:48:41] !log reedy synchronized php-1.21wmf10/extensions/WikimediaMaintenance/ [00:48:42] Logged the message, Master [00:48:46] RobH: mw85-mw125 yours? [00:49:26] what they do? [00:49:28] !log reedy synchronized php-1.21wmf9/extensions/WikimediaMaintenance/ [00:49:29] they should be working. [00:49:29] Logged the message, Master [00:49:34] mw125: rsync: mkdir "/apache/common-local/php-1.21wmf10/extensions/WikimediaMaintenance" failed: No such file or directory (2) [00:49:34] mw125: rsync error: error in file IO (code 11) at main.c(605) [Receiver=3.0.9] [00:49:44] Fine for php-1.21wmf9 though... [00:49:44] is that on all of them ? [00:49:51] paravoid: The check should be of some use now [00:49:52] or just the one? [00:50:01] i did a bunch of syncs, but not a scap, bleh. [00:50:08] sync-common locally [00:50:20] down to mw82 [00:50:21] mw110: ssh: connect to host mw110 port 22: Connection refused [00:50:21] Reedy: wow, thanks :) [00:50:23] ^ that's dead [00:50:24] Reedy: so the entire range is throwing errors? 
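For reference, the check_job_queue alert discussed a few minutes earlier wraps the maintenance script named above and goes critical as soon as any single wiki crosses the 10,000-job line. A sketch of that threshold logic, assuming the script prints one "wiki count" pair per line — the real output format may differ:

```python
import subprocess, sys

THRESHOLD = 10000

def check_job_queue():
    out = subprocess.check_output(
        ["php", "extensions/WikimediaMaintenance/getJobQueueLengths.php"])
    offenders = []
    for line in out.decode().splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit() and int(parts[1]) > THRESHOLD:
            offenders.append("%s (%s)" % tuple(parts))
    if offenders:
        print("JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: "
              + ", ".join(offenders))
        return 2  # Nagios CRITICAL
    print("JOBQUEUE OK - all job queues below 10,000")
    return 0      # Nagios OK

if __name__ == "__main__":
    sys.exit(check_job_queue())
```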
[00:50:32] i know mw110 is dead [00:50:33] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [00:50:35] there is a note on it [00:50:45] looks not populated, as far as i can tell from editing properties on wikidata [00:50:46] Yup, without checking every number, yes [00:50:51] Reedy: but now you have me paranoid, so they are all bad [00:50:56] ok, let me pull them out of pybal. [00:51:01] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [00:51:10] Like I say, seem ok for php-1.21wmf9, but not for php-1.21wmf10 [00:51:21] So that means? [00:51:29] I would imagine it means they shouldnt be serving the site. [00:51:47] Yeah [00:51:54] They can serve wikipedias fine, the rest not [00:51:55] aude: '' [00:52:24] aude: We can run the script for that tomorrow (well, today, post sleep) if you want [00:52:57] Reedy: ok [00:53:11] And anything else that needs tidying up [00:53:19] !log issues with new mw servers 86+, took back out of pybal until i can troubleshoot [00:53:20] Logged the message, RobH [00:53:23] Reedy: Thanks for the spot =] [00:53:35] I'll look into it and get them back to working [00:53:37] well i'm travelling tomorrow and friday, but if daniel k is around (in case of any problems, unlikely) [00:53:39] then that's fine [00:54:01] RobH: Everything else looks fine, just looks like they're out of date [00:54:21] i just copied them to live today [00:54:24] :/ [00:54:28] and they ran the normal syncs, but bleh [00:54:33] so you ran sync-common and no go [00:54:41] i wonder whats up with them [00:54:49] reedy@mw113:/usr/local/apache/common$ sync-common [00:54:50] Copying to mw113 from 10.0.5.8... [00:54:50] * Reedy waits [01:01:12] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 186 seconds [01:21:32] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [01:23:03] RECOVERY - MySQL disk space on neon is OK: DISK OK [01:23:29] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 30 seconds [01:24:59] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [01:41:56] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Puppet has not run in the last 10 hours [01:58:35] RECOVERY - mysqld processes on db1030 is OK: PROCS OK: 1 process with command name mysqld [01:59:29] RECOVERY - mysqld processes on db1031 is OK: PROCS OK: 1 process with command name mysqld [02:02:38] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 190 seconds [02:02:56] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 194 seconds [02:03:59] PROBLEM - mysqld processes on db1030 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [02:04:53] PROBLEM - mysqld processes on db1031 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [02:04:54] PROBLEM - Puppet freshness on db40 is CRITICAL: Puppet has not run in the last 10 hours [02:12:03] RECOVERY - mysqld processes on db1031 is OK: PROCS OK: 1 process with command name mysqld [02:12:57] RECOVERY - mysqld processes on db1030 is OK: PROCS OK: 1 process with command name mysqld [02:23:54] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [02:25:33] New patchset: Andrew Bogott; "Turn manage-volumes into a daemon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49916 [02:26:07] New review: Andrew Bogott; "Patch Set 4:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49916 
[02:28:05] New patchset: Asher; "new extension1 shard (as an externalload config) - initially for AFTv5" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50139 [02:28:21] !log LocalisationUpdate completed (1.21wmf10) at Thu Feb 21 02:28:20 UTC 2013 [02:28:25] Logged the message, Master [02:28:44] New review: Asher; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50139 [02:28:46] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50139 [02:30:13] !log asher synchronized wmf-config/db-eqiad.php 'new extension1 shard (as an externalload config) - initially for AFTv5' [02:30:15] Logged the message, Master [02:30:58] !log asher synchronized wmf-config/db-pmtpa.php 'new extension1 shard (as an externalload config) - initially for AFTv5' [02:31:00] Logged the message, Master [02:35:18] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [02:35:18] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [02:54:17] !log LocalisationUpdate completed (1.21wmf9) at Thu Feb 21 02:54:17 UTC 2013 [02:54:19] Logged the message, Master [03:02:54] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [03:06:03] RECOVERY - MySQL disk space on neon is OK: DISK OK [03:06:39] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [03:16:54] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 181 seconds [03:18:24] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 199 seconds [03:50:48] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [03:51:15] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [03:56:30] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [03:57:06] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [04:05:39] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Puppet has not run in the last 10 hours [04:15:42] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [04:17:39] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [04:19:45] PROBLEM - Puppet freshness on labstore3 is CRITICAL: Puppet has not run in the last 10 hours [04:47:04] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [04:48:34] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [05:20:37] RECOVERY - MySQL disk space on neon is OK: DISK OK [05:21:04] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [05:25:32] New patchset: Tim Starling; "Move all favicons to bits" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49802 [05:29:17] New review: Tim Starling; "Patch Set 2:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49802 [05:49:17] New patchset: Tim Starling; "Use a cgroup for command execution" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49000 [05:49:26] New review: Tim Starling; "Patch Set 2: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49000 [05:49:27] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49000 [05:50:42] !log tstarling 
synchronized wmf-config/CommonSettings.php 'shell cgroup' [05:50:45] Logged the message, Master [05:53:01] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours [05:56:55] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [05:57:22] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [06:08:28] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [06:10:07] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [06:11:33] New patchset: Tim Starling; "Increase $wgMaxImageArea to 75 Mpx" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50149 [06:11:46] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:13:26] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.020 second response time on port 8123 [06:18:59] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [06:20:28] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123 [06:28:09] RECOVERY - MySQL disk space on neon is OK: DISK OK [06:28:21] TimStarling: hadn't noticed the cgroup changes, they look great [06:28:36] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [06:28:38] from a glance [06:28:54] PROBLEM - Puppet freshness on sq71 is CRITICAL: Puppet has not run in the last 10 hours [06:29:26] yeah, it should give us a few benefits beyond just avoiding imagemagick deadlocks [06:29:45] limiting on vmsize was getting tedious [06:30:07] since adding a library or deploying a new binary that uses lots of libraries would throw out the estimate [06:30:36] that was one of the problems with lilypond deployment, come to think of it -- massive vsize usage [06:31:25] yeah, even for avconv we were inflating the limits because of vsize [06:31:41] https://bugzilla.wikimedia.org/show_bug.cgi?id=43188 I guess this can be closed? [06:32:56] yes [06:33:41] done [06:34:35] oh was about to too :) [06:35:05] we should merge https://gerrit.wikimedia.org/r/#/c/38307/ too... 
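The "shell cgroup" change synced above replaces per-process vsize limits with a cgroup memory limit, which is what makes the limit independent of how many shared libraries a binary happens to map. A minimal sketch of the underlying mechanism using the cgroup-v1 memory controller; the cgroup path and limit below are made up, and MediaWiki's real implementation lives in its shell-exec wrapper rather than a standalone script like this:

```python
import os, subprocess

CGROUP = "/sys/fs/cgroup/memory/mediawiki/shellexec"  # hypothetical cgroup path

def run_limited(cmd, mem_bytes):
    os.makedirs(CGROUP, exist_ok=True)
    with open(os.path.join(CGROUP, "memory.limit_in_bytes"), "w") as f:
        f.write(str(mem_bytes))  # limits actual memory use, not address space
    def enter_cgroup():
        # Runs in the child between fork and exec: join the cgroup first.
        with open(os.path.join(CGROUP, "tasks"), "w") as f:
            f.write(str(os.getpid()))
    return subprocess.call(cmd, preexec_fn=enter_cgroup)

# run_limited(["convert", "in.tif", "out.jpg"], 400 * 1024 * 1024)
```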
[06:35:15] oh you're not a reviewer there [06:35:26] it's about apparmor, you might be interested :) [06:37:54] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [07:12:24] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [07:17:30] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.033 second response time on port 8123 [07:26:57] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [07:28:16] New review: Faidon; "Patch Set 4:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42791 [07:28:36] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.020 second response time on port 8123 [07:40:39] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [07:40:39] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [07:40:48] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [07:45:54] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [07:58:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:01:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.846 seconds [08:11:15] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [08:11:42] RECOVERY - MySQL disk space on neon is OK: DISK OK [08:29:06] PROBLEM - MySQL Slave Delay on db46 is CRITICAL: CRIT replication delay 191 seconds [08:29:15] PROBLEM - MySQL Replication Heartbeat on db46 is CRITICAL: CRIT replication delay 195 seconds [08:29:51] PROBLEM - MySQL Replication Heartbeat on db1022 is CRITICAL: CRIT replication delay 206 seconds [08:30:18] PROBLEM - MySQL Slave Delay on db1022 is CRITICAL: CRIT replication delay 219 seconds [08:47:25] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , frwiki (1892892), Total (1896226) [08:47:51] RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay 0 seconds [08:48:18] RECOVERY - MySQL Slave Delay on db1022 is OK: OK replication delay 0 seconds [08:48:54] RECOVERY - MySQL Slave Delay on db46 is OK: OK replication delay 28 seconds [08:49:03] RECOVERY - MySQL Replication Heartbeat on db46 is OK: OK replication delay 21 seconds [08:49:57] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , frwiki (1928065), Total (1930900) [09:15:30] mark: are you around by any chance to get some changes reviewed ? :-D [09:15:33] or maybe this afternoon [09:32:23] !log Jenkins: removing git branch specifier from all mediawiki-core jobs. [09:32:24] Logged the message, Master [09:42:15] New patchset: Hashar; "adapt role::cache::upload for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50064 [10:06:16] !log Jenkins: reseting git branch specifier :-/ [10:06:18] Logged the message, Master [10:51:31] !log huge CPU spike on manganese (gerrit host) [10:51:33] Logged the message, Master [10:53:31] apergos: mark: mutante: manganese has some huge CPU spike, would you be able to get us some log informations please ? :-} [10:53:34] qchris: ^^^^ [10:53:40] qchris: don't you have access on the cluster? [10:53:47] Thanks hashar [10:53:51] hashar: [10:54:03] hashar: Don't think so. [10:54:22] which log? 
[10:54:26] I"m on the host [10:54:55] top says java and python, that must be you. where do I look? [10:54:59] they are in /var/lib/gerrit2/review_site/logs [10:55:05] apergos: Last time ^demon noticed lots and lots of "Dispatched Failed!" lines ... I'll see if he mentioned which logs that were... [10:55:11] (then a pile of little gits stacked up oo) [10:55:13] *too [10:55:14] java is the Gerrit process, python is probably some hook. [10:55:51] [2013-02-21 10:55:29,395] WARN org.eclipse.jetty.io.nio : Dispatched Failed! SCEP@3c5da61{l(/127.0.0.1:27509)<->r(/127.0.0.1:8080),d=false,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=1r}-{AsyncHttpConnection@32841836,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0} to org.eclipse.jetty.server.nio.SelectChannelConnector$ConnectorSelectorManager@4189cd3c [10:55:51] so either: /var/lib/gerrit2/review_site/logs/error_log or /var/lib/gerrit2/review_site/logs/sshd_log [10:55:53] lots [10:55:56] really lots and lots [10:55:56] good! [10:56:16] so it seems to be the same issue it had a few days ago [10:56:25] 2.8gb worth I guess [10:56:25] Restarting gerrit should make the problem disapear [10:56:29] apergos: what is the python command line ? [10:56:45] I mean there is some python process having some huge CPU, do you get the full command line? [10:56:51] I am wondering if that could be some hook getting wild [10:56:52] thats the entire log entry [10:56:54] as I posted it [10:56:59] oh. just a sec [10:57:12] ircecho [10:57:19] irrelevant [10:57:20] ah ok :-] [10:57:26] there's a salt-minion too [10:57:39] no idea what it is for [10:57:47] anything else before I restart (I assume i/etc/init.d/gerrit restart suffices)? [10:57:56] so I guess simply reboot gerrit. Hopefully restart on the init script will be fine [10:58:03] doing [10:58:08] :-] [10:58:19] qchris: I really hate that error message. That is not really helpful. [10:58:32] :-) [10:58:42] Looks like we'll have to switch away from jetty then. [10:58:48] should have tossed the log file rats [10:58:48] ^demon will not like this [11:00:18] Looks like gerrit is not there yet, is it still starting up? [11:00:26] probably [11:00:36] + there is an apache in front of it acting as a reverse proxy [11:00:45] Apache is up [11:00:51] yeah but it has some timeout [11:00:52] err [11:00:59] going to do so, I will stop it again, sec [11:01:14] when gerrit is unreachable, apache send the "Service Temporarily Unavailable" error [11:01:16] and start a timer [11:01:24] it will keep serving that page until the timer is expired [11:01:26] startup had failed no idea why, I ketp the recent additions to the error log [11:01:32] one way is to restart gerrit then restart apache to clear the timer [11:01:40] doh [11:01:51] waiting [11:02:26] it spurts the errors in /var/lib/gerrit2/review_site/logs/error_log with a huge long stacktrace (seems like java people love long traces) [11:03:19] so unhappy because it claims it didn't start [11:03:31] but I see a GerritCodeReview running with a new timestamp [11:03:41] ah [11:03:45] working again [11:03:46] http://gerrit.wikimedia.org/r/ [11:03:47] :-] [11:03:49] It's up again [11:03:50] yay [11:03:51] so it seems you saved it [11:03:53] \O/ [11:03:53] \o/ [11:03:54] and log fle is tiny now [11:04:02] !log Ariel saved Gerrit by restarting it! [11:04:03] Logged the message, Master [11:04:07] hahaha [11:04:10] :-))) [11:04:12] thank you Ariel! [11:04:14] yw [11:04:39] would be nice to know what's broken [11:04:50] Jetty is broken as it seems. 
[11:05:00] We hit this problem the other day [11:05:18] that's unfortunate [11:05:34] Fortunately, tomcat does not seem to have this problem. [11:05:43] ah [11:05:47] so that's next up is it? [11:05:59] We'll have to duscuss that with ^demon [11:06:03] good luck [11:06:08] Thanks :-) [11:33:58] ok mark is going to hate me [11:34:22] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [11:34:40] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [11:37:58] New patchset: Hashar; "adapt role::cache::upload for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50064 [11:43:05] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Puppet has not run in the last 10 hours [12:00:01] PROBLEM - MySQL Replication Heartbeat on db44 is CRITICAL: CRIT replication delay 194 seconds [12:00:19] PROBLEM - MySQL Slave Delay on db44 is CRITICAL: CRIT replication delay 196 seconds [12:05:16] RECOVERY - MySQL disk space on neon is OK: DISK OK [12:06:10] PROBLEM - Puppet freshness on db40 is CRITICAL: Puppet has not run in the last 10 hours [12:06:47] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [12:22:41] RECOVERY - MySQL Slave Delay on db44 is OK: OK replication delay 0 seconds [12:23:52] RECOVERY - MySQL Replication Heartbeat on db44 is OK: OK replication delay 0 seconds [12:24:46] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [13:03:46] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [13:11:43] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 190 seconds [13:12:19] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 199 seconds [13:14:32]
Error 403 Access denied
[13:14:33] I love it [13:19:31] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [13:20:43] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [13:33:15] apergos: hey buddy, is there a way of checking whether there is anything that is causing abnormally long cache update times for banners in central notice? [13:36:47] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [13:37:14] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [13:45:07] Seddon: I really have no idea, let me hunt around some [13:45:23] apergos: its ok, its a banner code problem [13:45:32] ah hm [13:45:46] New patchset: Hashar; "adapt role::cache::upload for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50064 [13:45:58] that's an area I don't really know [13:46:40] I've hunted down the issue, cant fix the cause but I should be able to get a work around for the time being [13:47:01] ill get our guys to look at it later [14:07:05] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Puppet has not run in the last 10 hours [14:07:32] RECOVERY - MySQL disk space on neon is OK: DISK OK [14:07:50] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [14:08:08] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [14:12:11] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 188 seconds [14:12:47] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 195 seconds [14:17:17] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [14:19:14] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [14:21:11] PROBLEM - Puppet freshness on labstore3 is CRITICAL: Puppet has not run in the last 10 hours [14:26:48] can't manage to get varnish to hit my backends :( [14:26:50] bohhh [14:32:25] New review: Milimetric; "Patch Set 3: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/49710 [14:34:59] Thanks gerrit bot :) What a nice bot. 
[14:57:05] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 22 seconds [14:58:44] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 1 seconds [15:19:17] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [15:20:11] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [15:27:48] New review: ArielGlenn; "Patch Set 1: Code-Review+2" [operations/dumps] (ariel) C: 2; - https://gerrit.wikimedia.org/r/47427 [15:28:00] New review: ArielGlenn; "Patch Set 1: Verified+2" [operations/dumps] (ariel); V: 2 - https://gerrit.wikimedia.org/r/47427 [15:28:00] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/47427 [15:31:06] New patchset: ArielGlenn; "support multiple output files in one pass, bug fixes, documentation - write multiple sql output files for different mw versions at the same time - fix make install (referenced nonexistent file) - fix some issues with mw version comparison macros and funct" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/50166 [15:31:34] New review: ArielGlenn; "Patch Set 1: Verified+2 Code-Review+2" [operations/dumps] (ariel); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50166 [15:31:34] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/50166 [15:31:39] gah [15:31:46] ariel's too fast :( [15:32:28] and now i'm trying to tab-complete ariel. must be sleepy [15:32:56] apergos: commit message line 2 always must be a blank line [15:33:02] ugh [15:33:09] I can git ammend that [15:33:12] grr [15:33:17] s/mm/m [15:33:41] or not? [15:33:51] uhhh [15:34:22] becaue it's merged I mean [15:35:05] you'd have to bypass review i guess [15:35:08] :( [15:35:10] no [15:35:16] no good. [15:35:27] * apergos will just do better about future commits [15:37:07] k :) [15:39:15] New patchset: ArielGlenn; "sql2txt: convert sql dumps to input format suitable for LOAD DATA INFILE" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/50167 [15:39:24] 65 chars grrr [15:39:26] whatever [15:41:02] New review: ArielGlenn; "Patch Set 1: Verified+2 Code-Review+2" [operations/dumps] (ariel); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50167 [15:41:02] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/50167 [15:43:05] <^demon> apergos: I think that's going to be split to a plugin, which would make the length configurable/disableable. [15:43:24] 65 just seems a wee bit short [15:43:30] in this day and age [15:43:31] <^demon> Yeah, 65's a bit short. [15:43:57] seems like gerrit is slowly growing up :-) [15:43:57] <^demon> Something like 100 keeps people from making really really long lines, but wouldn't yell at you for doing 66 :p [15:44:06] yeah 100 is a fine length [15:44:07] i disagree [15:44:29] 65 is so that you can have leading stuff prepended and still fit within 75-80 chars [15:44:41] and that's the thing. 76-80. :-P [15:45:02] so that you can make something like release notes with one line per commit and also include part of the commit id [15:45:40] should be a summary that makes sense alone. starting with line 3 you can elaborate [15:46:09] (and then still don't get too long per line) [15:46:20] <^demon> Well yeah, but yelling at someone for making it like 66 or 70 characters is kind of silly. [15:46:48] is this the feedback on push? or something else? 
[15:46:53] <^demon> Yeah [15:46:57] (i mean automated feedback) [15:47:19] well i guess they could have multiple levels of scolding [15:47:24] :P [15:47:37] depending on how long the input was [15:48:11] <^demon> The BZ plugin (will be enabled soon) has configurable levels of warning. You can require bugs, suggest bugs if ones not mentioned, or just accept them either way. [15:48:30] <^demon> I was thinking suggest, but not everything has a bug. Probably a superfluous warning. [15:48:52] right [15:49:04] does it pull out bug summaries? [15:49:10] e.g. to display on hover [15:49:19] or notify bugzilla? [15:50:48] Some aspects of this jobs are going to be "fun". [15:51:07] RECOVERY - MySQL disk space on neon is OK: DISK OK [15:51:34] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [15:51:43] New patchset: ArielGlenn; "sql2txt bug fixes" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/50172 [15:52:20] New review: ArielGlenn; "Patch Set 1: Verified+2 Code-Review+2" [operations/dumps] (ariel); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50172 [15:52:20] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/50172 [15:54:43] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours [16:00:06] New patchset: ArielGlenn; "build static binaries, makefile fixes, sha1 field fix" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/50175 [16:00:31] New review: ArielGlenn; "Patch Set 1: Verified+2 Code-Review+2" [operations/dumps] (ariel); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50175 [16:00:32] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/50175 [16:25:17] New patchset: ArielGlenn; "bugfixes, more documentation, good enough for v0.0.1 now." [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/50180 [16:25:39] New review: ArielGlenn; "Patch Set 1: Verified+2 Code-Review+2" [operations/dumps] (ariel); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50180 [16:25:39] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/50180 [16:27:38] New patchset: Dereckson; "(bug 45233) Groups permissions on pt.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50181 [16:30:43] PROBLEM - Puppet freshness on sq71 is CRITICAL: Puppet has not run in the last 10 hours [16:39:43] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [16:46:55] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [16:47:31] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [16:57:57] could somebody please tell me which *exact* version of OTRS we're using for ticket.wikimedia.org? Basically the x in "2.4.x" [17:06:25] andre__: there's not an exact versions [17:06:28] version* [17:06:55] * jeremyb_ looks up what he read before [17:10:10] andre__: see https://rt.wikimedia.org/Ticket/Display.html?id=452#txn-76819 and the attachment on https://rt.wikimedia.org/Ticket/Display.html?id=452#txn-50143 [17:17:32] RECOVERY - MySQL disk space on neon is OK: DISK OK [17:18:16] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [17:42:24] Oh, yeah, some aspects of this jobs are going to be /really/ "fun". Why do I keep getting myself into these things? :-) [17:43:02] LeslieCarr: feeling better? 
[17:44:15] matanya: better than yesterday not good still [17:44:35] glad to hear better, sad to hear not 100% [17:50:36] Coren: I've indeed asked myself that question when I heard the news ;) [17:50:58] http://www.bbc.scotlandshire.co.uk/index.php/city-news/219-clever-people-not-needed-says-idiot.html [17:51:05] ooops [17:51:07] wrong channel :p [17:51:19] that is not seriuz stuffz [17:52:32] Seddon: Is that a Scottish Onion I see? [17:54:43] indeed [18:01:01] PROBLEM - Puppet freshness on stafford is CRITICAL: Puppet has not run in the last 10 hours [18:34:27] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [18:39:52] !log enabled Extension:OpenID as a provider on labsconsole [18:39:55] Logged the message, Master [18:40:59] interesting [18:41:01] Ryan_Lane, o_0 thumbs up! [18:41:19] yeah, all the needed fixes went in [18:41:34] there's one more fix I'd like before we can use this on wikimedia projects, too [18:41:48] awwsum [18:41:49] right now it's necessary for the user's page to have content [18:41:56] I'd like for that to not be necessary [18:42:16] that or for all users pages to automatically have content :) [18:43:16] also doable [18:46:06] so where can we use it? i guess for some tools @ labs. maybe even toolserver. but most would want to use SUL openid i guess [18:49:27] PROBLEM - MySQL Slave Delay on db44 is CRITICAL: CRIT replication delay 184 seconds [18:50:30] PROBLEM - MySQL Replication Heartbeat on db44 is CRITICAL: CRIT replication delay 191 seconds [18:51:18] Ryan_Lane: I had a strangely related thought, incidentally. [18:51:24] New patchset: Andrew Bogott; "Turn manage-volumes into a daemon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49916 [18:51:45] New review: Andrew Bogott; "Patch Set 5: -Code-Review" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49916 [18:52:33] Ryan_Lane: Many of the tools will want to have user databases, and we're worried about password reuse enough to have to warn the users about not sharing credentials. What do you think of offering a central auth for tool users and passing auth tokens to the tools instead of them having to have their own? [18:52:58] we need OAuth [18:53:19] Ryan_Lane: That'd seem like the natural fit. :-) [18:53:29] I've been asking for it for ages ;) [18:53:37] I count it as a blocker for tools, honestly [18:53:47] So would I, given the option. [18:53:56] Should I dumpt this on my TODO then? [18:55:01] Ryan_Lane: I'm pretty sure that this will work now: https://gerrit.wikimedia.org/r/#/c/49916/ [19:04:45] RECOVERY - MySQL Slave Delay on db44 is OK: OK replication delay 0 seconds [19:04:45] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [19:04:54] RECOVERY - MySQL Replication Heartbeat on db44 is OK: OK replication delay 0 seconds [19:05:57] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [19:07:56] RoanKattouw: you here? [19:09:51] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 74754 seconds [19:10:35] Ryan_Lane: Hey, btw, when you say "We need OAuth", do you mean "We'd like to have an OAuth server at the labs for end-user credentials" or "We'd like SUL to be an OAuth server"? [19:11:08] Ryan_Lane: Because the latter would be cool, but the former seems closer to "Yeah, that can be done in this lifetime". 
:-) [19:13:16] New patchset: Jalexander; "Adding CentralNotice user right to meta and testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50196 [19:17:35] Coren: we need wikimedia SUL to provide OAth [19:17:37] OAuth [19:17:59] then for tools/bots to use it to authenticate [19:18:07] Ryan_Lane: Expecting this to be feasible (politically) this lifetime? [19:18:23] it's been on the roadmap for about 1.5 years [19:18:31] Ah, so the will is there. [19:18:56] yes, and someone is technically assigned to it [19:18:59] that someone being csteipp [19:19:21] "technically" as in it's on his list but he never managed to reach it yet? [19:19:54] Coren: The good news is that we're planning a sprint to finish it out, as soon as lua is deployed [19:19:54] New patchset: Jalexander; "Adding CentralNotice user right to meta and testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50196 [19:20:11] csteipp: Yeay! So I can actually plan according to its presence? [19:20:12] Ryan_Lane: Yeah I'm here [19:20:15] (at the Wikia office) [19:20:17] Coren: you need to talk to csteipp about OAuth [19:20:20] RoanKattouw: ah. ok [19:20:28] csteipp: is SUL not a blocker for OAuth? [19:20:46] RoanKattouw: we're having a meeting about caching and resources that 404 [19:20:53] pgehres: You mean unified accounts? [19:20:56] Not really [19:21:01] RoanKattouw: what resources 404 on mediawiki changes? [19:21:03] Ryan_Lane: When, now, or later today? [19:21:03] It would help, but not a blocker. [19:21:21] csteipp: ah, k. well i have the db and am multitasking playing with pt-table-sync now [19:21:23] Coren: I think you can plan for OAuth around May [19:21:24] csteipp: What protected resources do you plan on supporting? [19:21:37] RoanKattouw: having one right now, we don't need you here or anything, but wanted to ask some questions [19:21:38] Coren: also, be very, very careful when talking to ryan. OAuth != OATHAuth. :) [19:21:44] Ryan_Lane: So most assets are loaded via load.php and will never 404 [19:21:57] Coren: API only. Authz based on the api module. That is probably all we can get done for now. [19:22:07] OAuth 1 [19:22:20] Some things (and if you're in debug mode: all things) are loaded via bits.wm.o/static-1.21wmfN/ and can 404 once N is sufficiently far in the past [19:22:56] RobLa and I talked about this at some point and figured out that to prevent 404s from Squid-cached pages, we'd need to keep around these static-1.21wmfN dirs down to N = latest - 3 or something [19:23:15] I don't think that was ever actually done, and these dirs are often removed earlier than they should be [19:23:44] (static-1.21wmfN is really a bunch of symlinks into php-1.21wmfN, and those dirs are removed fairly aggressively) [19:24:01] csteipp: What would be extra juicy nice for tool /end users/ is to have "username" as protected resources so that bots-as-clients can have end-users prove they are project users without giving credentials to give API access. [19:24:03] RoanKattouw: why are those based on mw versions? [19:24:34] If we do actually do a good job keeping those dirs around until at least 30 days after they've been obsoleted, we shouldn't have any 404s from MW itself. 
That just leaves random people that copy URLs and use them, not noticing they're versioned, but we can't do much about that [19:24:41] Well resources can change between versions [19:25:00] csteipp: "FooBot would like to access your user information" [19:25:03] In particular, these paths are used to load JS in debug mode, and JS obviously changes between versions [19:25:11] We can also have multiple versions live at any given time, and often do [19:25:23] right [19:25:37] we're discussing this with brad [19:25:41] This would be a bit less inconvenient with slots, git-deploy style [19:25:47] indeed [19:26:04] csteipp: I can think of about 20 major tools offhand that would greatly benefit from this. [19:26:10] Coren: So if FooBot wants to access a wiki, the user will grant it access to whatever api modues it needs, and the bot will hold the tokens to access the api on behalf of the user [19:26:13] Because slot0 can still be expected to point to something, although it might be more recent that what the page is expecting but that theoretically shouldn't matter [19:26:33] csteipp: Ah, you'd have module granularity? [19:26:34] But are you saying you want the user to authenticate to the bot also? [19:26:43] csteipp: Webtools, mostly. [19:26:44] Coren: Yes [19:27:07] Coren: For that, OpenID is probably a better service [19:27:09] csteipp: So that a user can login in a tool with his SUL without disclosing their credentials to the tool [19:27:10] Then we'd just have 404s from the case where MW removes/renames a file, and a page generated by version N-2 ends up requesting that resource from version N (because the slots have cycled through), which 404s [19:27:38] csteipp: Possibly, but OAuth would permit it if it's already going to be in place. [19:29:19] Coren: Yes, the oauth psudo-auth will probably be possible. If you can give a web *server* a token to act on your behalf, the web server can be very certain you are that user. [19:29:35] This will NOT work for javascript apps [19:29:48] csteipp: Yeah, I was only thinking webservers here. [19:29:50] (need OAuth 2 for that...) [19:29:52] Cool [19:30:32] csteipp: Admitedly, Having SUL be an OpenID service would be even simpler for that use case. But is that even on the roadmap? [19:30:52] Coren: No, not officially [19:31:06] Coren: well see what Ryan just deployed [19:31:34] The next things that would be awesome to implement would be OpenID and SAML [19:31:44] We only need one thing to deploy openid to production [19:32:09] Ryan_Lane: What is that? I've been meaning to look at his new work, but haven't yet.. [19:32:12] Ryan_Lane: Tell me what it is and I'll do it. That would solve all my problems in one fell swoop! [19:32:19] I'd *really* like to have a centralized domain for openid identity urls [19:32:29] but that's not the issue [19:32:46] the openid extension needs user pages to have content to work [19:32:52] because otherwise the pages are 404s [19:32:59] so, either, we need to ensure user pages have content [19:33:08] or we need to not deliver 404s for empty user pages [19:33:22] but only if the user actually exists [19:33:26] yep [19:34:13] Until those things are done, that would just mean that "If you want to login you need to have created your user page on the project"? [19:34:21] yes, but that sucks [19:34:31] and I'd prefer not to have that in production [19:34:35] I'm willing to deal with it in labs [19:35:42] Why not just return a 204 on a user page that doesn't exist but where the user does? 
[19:36:03] what does the OpenID spec say about that? [19:36:23] RECOVERY - MySQL disk space on neon is OK: DISK OK [19:36:29] jeremyb_: It's a 200. Success. [19:36:39] i mean a 204 [19:36:41] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [19:36:58] jeremyb_: Well, HTTP sez it should count as success. Lemme go check the spec. [19:37:50] Coren: anyway, that means web browsers (humans) would get white screen of death? [19:38:28] jeremyb_: No, you're allowed to return content on a 204, and the UA must display it because it's 2XX [19:38:35] * Coren tries this. [19:40:07] The 204 response MUST NOT include a message-body, and thus is always [19:40:11] terminated by the first empty line after the header fields. [19:40:24] Ah. I was confused then. [19:41:36] jeremyb_: Darn, because both Chrome and Firefox behave like I just said. [19:41:38] http://www.uberbox.org/test.php [19:42:29] Ah. No they don't. Apache is "helpfully" changing my 204 into a 200. [19:42:37] hah [19:45:06] ^demon: Any news from the gerrit front? [19:46:07] There he goes :-) [19:47:03] Hi ^demon [19:47:17] <^demon> Hi. [19:47:42] After reading irc logs and checking ganglia, it seems gerrit behaved nicely ever since? [19:48:23] <^demon> Yeah, hasn't freaked out since this morning. [19:48:49] Have you thought about which of the possible options we go for? [19:49:58] Coren: I think 200 is appropriate for users that have accounts [19:50:02] assuming we are using openid [19:50:08] otherwise 404s make sense [19:50:26] after all, we are serving content when using openid [19:50:32] Ryan_Lane: Yeah, that'd work. :-) [19:50:39] <^demon> qchris: I'd like to play around with tomcat a bit in labs, and I setup an instance to do that. [19:51:11] ^demon: So nothing I can assist you with for now? [19:52:08] <^demon> Not this minute, I'm kinda in the middle of a few things. [19:52:19] ok. Thanks. [19:54:23] hey RoanKattouw, what do you mean when you say "For that request I see: Content-Type: application/json; charset=utf-8"? do you have a separate log file where you checked that? [19:55:18] <^demon> qchris: Actually, could you help ori-l? He's got a user who's rights don't seem to be inheriting properly :\ [19:55:54] I'll try. But I am just a normal user on gerrit.wm [19:56:31] That might not work, then :/. The user is 'Mattflaschen'. [19:57:20] Oh :-( [19:59:24] ori-l What is the problem exactly? [20:01:04] drdee: I just copied the URL into my browser and used Firebug to look at the headers [20:02:13] !log some wikis have myisam page_props tables. converting all to innodb via osc [20:02:15] Logged the message, Master [20:03:08] yay, I didn't know there were any of those left [20:07:04] binasher: I can't wait till the default is innodb ;) [20:07:36] qchris: as a WMF employee and a deployer, Mattflaschen ought to have (at minimum, I think) +2 in extensions, but he does not. [20:08:49] oril-l: I'll see if I can find anything. [20:10:02] thanks. [20:12:05] $extdb->query( "SET table_type=InnoDB" ); [20:12:09] apparently that doesn't work then [20:14:16] ori-l: Shouldn't Mattflaschen be on https://gerrit.wikimedia.org/r/#/admin/groups/uuid-4cdcb3a1ef2e19d73bc9a97f1d0f109d2e0209cd,members [20:15:31] qchris: yes! [20:15:42] but I don't have the perms to add him. [20:15:53] Neither do I :-( [20:16:09] <^demon> No, if he's in the included group he shouldn't have to manually be there. 
[20:16:49] He is not included as far as I can tell [20:17:18] right, but I'd bet that that fixes things, and I don't think it's fair to block him on Gerrit/LDAP issues. [20:17:36] AaronSchulz: i'm not going to explicitly change default_storage_engine, but it is innodb in the version of mariadb we're slowly migrating to [20:17:58] <^demon> ori-l: Well that won't help him in any other groups he should be having included membership in. [20:18:09] <^demon> Plus, this was the whole reason we held off upgrading, was for this behavior. [20:18:32] binasher: actually I see, the sql file says to use MyISAM [20:18:36] that should probably just be changed [20:19:04] ^demon: OK, so what should I do? Bugzilla? RT? just wait? [20:19:17] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds [20:19:22] AaronSchulz: heh, definitely. i wasn't sure which was used for es table creation [20:19:26] <^demon> Bah. [20:19:31] <^demon> Why is he not in the wmf group. [20:19:34] <^demon> He totally should be. [20:19:48] * ori-l shrugs. [20:19:51] Shruggery. [20:20:20] <^demon> For some reason I swore he was there. [20:20:47] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds [20:21:43] how would I get him added? [20:26:51] RoanKattouw: okay, then it's an error in somewhere in the logging [20:29:49] <^demon> ori-l: I already did. He should be set now. [20:29:53] <^demon> May need to log out/in. [20:31:09] ^demon: very much obliged! [20:31:17] <^demon> yw. [20:36:44] who is going to plus 2 that is the question (xt store) [20:59:32] New patchset: Ori.livneh; "EventLogging: fix test2wiki configs; +CodeEditor" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49991 [21:00:04] New review: Ori.livneh; "Patch Set 2: Code-Review+2" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/49991 [21:00:51] New patchset: Hashar; "gerrit: qa/* IRC notification to #wikimedia-dev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50268 [21:03:04] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49991 [21:04:54] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 185 seconds [21:05:12] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 188 seconds [21:05:24] !log olivneh synchronized wmf-config/CommonSettings.php [21:05:26] Logged the message, Master [21:05:38] New review: Cmcmahon; "Patch Set 1: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/50268 [21:10:48] New review: Ryan Lane; "Patch Set 5: Code-Review-1" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/49916 [21:10:56] andrewbogott: ^^ [21:17:12] RECOVERY - MySQL Slave Running on db58 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [21:17:53] New review: Zfilipin; "Patch Set 1: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/50268 [21:20:03] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 157786 seconds [21:21:25] RECOVERY - Puppet freshness on labstore2 is OK: puppet ran at Thu Feb 21 21:21:20 UTC 2013 [21:21:46] New review: Catrope; "Patch Set 2:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47103 [21:23:50] !log olivneh synchronized php-1.21wmf9/extensions/SyntaxHighlight_GeSHi 'Disable highlighting of very large files.' 
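On the page_props cleanup logged above: "SET table_type=InnoDB" no longer takes effect, likely because newer MySQL releases dropped that old alias in favour of storage_engine / default_storage_engine, so stray MyISAM tables are easiest to find via information_schema and converted with a plain ALTER. A sketch of that, with the caveat that the straight ALTER locks the table for the duration; the production conversion mentioned in the log went through an online-schema-change tool ("osc") instead:

```python
import MySQLdb  # connection handling omitted; pass in an open connection

def myisam_tables(conn, schema):
    cur = conn.cursor()
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = %s AND engine = 'MyISAM'", (schema,))
    return [row[0] for row in cur.fetchall()]

def convert_to_innodb(conn, table):
    # Table name comes straight from information_schema above; ALTER cannot
    # take a bound parameter, hence the simple interpolation.
    conn.cursor().execute("ALTER TABLE `%s` ENGINE=InnoDB" % table)
```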
[21:23:51] Logged the message, Master [21:29:52] easy merge to update some Gerrit IRC notifications : https://gerrit.wikimedia.org/r/50268 ;-] [21:31:23] !log olivneh synchronized php-1.21wmf10/extensions/SyntaxHighlight_GeSHi 'Disable highlighting of very large files.' [21:31:23] Logged the message, Master [21:32:37] New patchset: Pyoungmeister; "mariadb test boxes for pmtpa s2-5" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50276 [21:34:18] New patchset: Pyoungmeister; "pulling dbs 52, 39, 51, and 35 for upgrade mariadb testing upgrade" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50278 [21:35:10] New patchset: Andrew Bogott; "Turn manage-volumes into a daemon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49916 [21:37:52] !log restarting mysql on db1047 to pick up conf changes [21:37:53] Logged the message, notpeter [21:47:34] !log olivneh synchronized php-1.21wmf9/extensions/CodeEditor [21:47:35] Logged the message, Master [21:48:00] New review: Pyoungmeister; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50278 [21:48:01] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50278 [21:48:10] !log olivneh synchronized php-1.21wmf9/extensions/EventLogging [21:48:12] Logged the message, Master [21:48:29] Ryan_Lane: paravoid: $role::ldap::config::production::ldapconfig::servernames still points to nfs1 and nfs2, but paravoid took down ldap there. what should it be? [21:48:41] oh [21:48:42] crap [21:48:45] ryan actually did [21:48:55] I didn't see any references left to it [21:48:58] !log py synchronized wmf-config/db-pmtpa.php 'removing 1 slave for s2-5 from db-secondary for upgrades and maria' [21:48:59] Logged the message, Master [21:49:00] I obviously missed that [21:49:06] I just merely asked "why do we have ldap on nfs1/2?" [21:49:09] binasher: virt0.wikimedia.org virt1000.wikimedia.org [21:49:23] !log olivneh synchronized php-1.21wmf10/extensions/EventLogging [21:49:24] Logged the message, Master [21:50:52] ok, thanks. i just need to update the graphite/ishmael auth, that's why they're down [21:51:01] oh [21:51:02] heh [21:51:22] I see other references too [21:51:24] but now that ldap registration is open, is there a wmf ldap group or something similar that can be required instead of just a valid user? [21:51:34] there was a has-signed-nda group [21:51:36] binasher: there's a wmf group [21:51:43] that was created specifically for analytics [21:51:44] paravoid: we don't have one of those yet [21:51:52] oh it wasn't created yet? [21:51:58] let's see [21:52:01] does this look right? Require ldap-group cn=wmf,ou=groups,dc=wikimedia,dc=org [21:52:18] I think so. let me look at the docs [21:52:23] thanks [21:52:46] this is mod_authz_ldap or mod_auth_ldap? [21:53:20] hm. must be mod_auth_ldap [21:53:37] uhm? [21:53:48] mod_authnz_ldap [21:53:54] there's no mod_auth_ldap anymore [21:53:59] it's authnz now [21:54:02] heh. way too many of these damn things [21:54:03] now being apache 2.2 iirc [21:54:23] binasher: yes, that's correxct [21:54:25] *correct [21:54:51] assuming group support is setup [21:55:06] PROBLEM - mysqld processes on db51 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:55:08] the default for AuthLDAPGroupAttribute is correct [21:55:16] same with AuthLDAPGroupAttributeIsDN [21:56:02] yep. 
that by itself should work [21:56:09] PROBLEM - mysqld processes on db52 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:56:20] !log running do-release-upgrade on dbs 52, 39, 51, and 35 [21:56:22] Logged the message, notpeter [21:57:03] PROBLEM - mysqld processes on db35 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:57:12] "If the attribute specified by AuthzLDAPMemberKey only holds the login names of group members, rather than the full DN, change the AuthzLDAPSetGoupAuth directive to…" what does the wmf group actually contain? [21:57:12] PROBLEM - mysqld processes on db39 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:57:35] dns [21:57:55] * Ryan_Lane is a fan of referential integrity [21:57:55] ok, hopefully this just works [22:00:56] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50276 [22:00:57] PROBLEM - Host db52 is DOWN: PING CRITICAL - Packet loss = 100% [22:01:05] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50276 [22:01:51] RECOVERY - Host db52 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [22:02:18] binasher: notpeter : is there a firm plan to switch to mariadb in production? I am wondering if I should setup mariadb instances in labs (for beta) [22:03:10] hashar: it's more of a soft plan [22:03:30] but that does sound like a good idea [22:03:33] I tried to set one up this week, got stuck with the custom apt repository :-] [22:04:02] I guess I will either wait a bit or get you guys to install mariadb on an instance :-] [22:04:55] hashar: how did that get you stuck? [22:05:20] I can't remember the details, but mostly that you have to use some puppet class which is only meant for production [22:05:36] PROBLEM - Full LVS Snapshot on db51 is CRITICAL: Connection refused by host [22:05:53] hashar: well, those apt repos are all reachable from the whole internet [22:05:54] PROBLEM - MySQL Slave Delay on db51 is CRITICAL: Connection refused by host [22:06:00] so I can take a look if you point me at the right instances [22:06:12] PROBLEM - MySQL Slave Running on db51 is CRITICAL: Connection refused by host [22:06:15] ah the coredb_mysql puppet module :-] It is full of production only settings hehe [22:06:21] PROBLEM - MySQL Recent Restart on db51 is CRITICAL: Connection refused by host [22:06:30] PROBLEM - MySQL Idle Transactions on db35 is CRITICAL: Connection refused by host [22:06:31] PROBLEM - MySQL disk space on db35 is CRITICAL: Connection refused by host [22:06:31] PROBLEM - MySQL Replication Heartbeat on db51 is CRITICAL: Connection refused by host [22:06:37] notpeter: I might poke you next week about it :-] [22:06:39] PROBLEM - MySQL Recent Restart on db39 is CRITICAL: Connection refused by host [22:06:39] PROBLEM - MySQL disk space on db51 is CRITICAL: Connection refused by host [22:06:42] ok, sounds good [22:06:48] PROBLEM - MySQL Recent Restart on db35 is CRITICAL: Connection refused by host [22:06:48] PROBLEM - MySQL Idle Transactions on db51 is CRITICAL: Connection refused by host [22:06:57] PROBLEM - MySQL Replication Heartbeat on db35 is CRITICAL: Connection refused by host [22:06:58] PROBLEM - MySQL Replication Heartbeat on db39 is CRITICAL: Connection refused by host [22:07:03] and will probably poke binasher about graphite :-] [22:07:06] PROBLEM - Puppet freshness on db40 is CRITICAL: Puppet has not run in the last 10 hours [22:07:06] PROBLEM - MySQL disk space on db39 
is CRITICAL: Connection refused by host [22:07:24] PROBLEM - MySQL Slave Delay on db35 is CRITICAL: Connection refused by host [22:07:24] PROBLEM - MySQL Slave Delay on db39 is CRITICAL: Connection refused by host [22:07:33] PROBLEM - Full LVS Snapshot on db39 is CRITICAL: Connection refused by host [22:07:42] PROBLEM - MySQL Slave Running on db39 is CRITICAL: Connection refused by host [22:07:42] PROBLEM - Full LVS Snapshot on db35 is CRITICAL: Connection refused by host [22:07:51] PROBLEM - MySQL Idle Transactions on db39 is CRITICAL: Connection refused by host [22:07:52] PROBLEM - MySQL Slave Running on db35 is CRITICAL: Connection refused by host [22:07:56] New review: awjrichards; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50109 [22:07:57] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50109 [22:08:18] RECOVERY - MySQL Idle Transactions on db35 is OK: OK longest blocking idle transaction sleeps for seconds [22:08:18] RECOVERY - MySQL disk space on db35 is OK: DISK OK [22:08:18] RECOVERY - MySQL Replication Heartbeat on db51 is OK: OK replication delay seconds [22:08:27] RECOVERY - MySQL disk space on db51 is OK: DISK OK [22:08:36] RECOVERY - MySQL Recent Restart on db35 is OK: OK seconds since restart [22:08:36] RECOVERY - MySQL Idle Transactions on db51 is OK: OK longest blocking idle transaction sleeps for seconds [22:08:45] RECOVERY - MySQL Replication Heartbeat on db35 is OK: OK replication delay seconds [22:08:45] RECOVERY - MySQL Replication Heartbeat on db39 is OK: OK replication delay seconds [22:08:45] RECOVERY - MySQL Recent Restart on db39 is OK: OK seconds since restart [22:08:54] RECOVERY - MySQL disk space on db39 is OK: DISK OK [22:09:12] RECOVERY - MySQL Slave Delay on db35 is OK: OK replication delay seconds [22:09:12] RECOVERY - Full LVS Snapshot on db51 is OK: OK no full LVM snapshot volumes [22:09:12] RECOVERY - MySQL Slave Delay on db39 is OK: OK replication delay seconds [22:09:30] RECOVERY - MySQL Slave Running on db39 is OK: OK replication [22:09:30] RECOVERY - Full LVS Snapshot on db35 is OK: OK no full LVM snapshot volumes [22:09:30] Jeff_Green, poke [22:09:30] RECOVERY - MySQL Slave Delay on db51 is OK: OK replication delay seconds [22:09:35] off for now *wave* [22:09:39] RECOVERY - MySQL Idle Transactions on db39 is OK: OK longest blocking idle transaction sleeps for seconds [22:09:39] RECOVERY - MySQL Slave Running on db35 is OK: OK replication [22:09:52] RECOVERY - MySQL Slave Running on db51 is OK: OK replication [22:09:57] RECOVERY - MySQL Recent Restart on db51 is OK: OK seconds since restart [22:10:51] RECOVERY - Full LVS Snapshot on db39 is OK: OK no full LVM snapshot volumes [22:11:40] New patchset: Asher; "fix ldap auth for ishmael/graphite, require wmf group membership" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50297 [22:12:27] !log awjrichards synchronized wmf-config/mobile.php 'Add photo upload schema for event logging' [22:12:29] Logged the message, Master [22:13:05] New review: Asher; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50297 [22:13:13] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50297 [22:13:24] PROBLEM - Host db35 is DOWN: PING CRITICAL - Packet loss = 100% [22:13:42] PROBLEM - Host db51 is DOWN: PING CRITICAL - Packet loss = 100% [22:14:08] binasher: merged your 
stuff [22:14:15] thanks [22:14:45] RECOVERY - Host db51 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [22:15:30] RECOVERY - Host db35 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [22:16:06] PROBLEM - Host db39 is DOWN: PING CRITICAL - Packet loss = 100% [22:18:03] RECOVERY - Host db39 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [22:25:06] RECOVERY - mysqld processes on db51 is OK: PROCS OK: 1 process with command name mysqld [22:26:00] RECOVERY - mysqld processes on db35 is OK: PROCS OK: 1 process with command name mysqld [22:26:00] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [22:26:00] PROBLEM - Host db52 is DOWN: PING CRITICAL - Packet loss = 100% [22:26:22] RECOVERY - mysqld processes on db39 is OK: PROCS OK: 1 process with command name mysqld [22:27:12] RECOVERY - Host db52 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [22:29:09] PROBLEM - MySQL Slave Delay on db35 is CRITICAL: CRIT replication delay 1513 seconds [22:29:10] PROBLEM - MySQL Slave Delay on db39 is CRITICAL: CRIT replication delay 2308 seconds [22:29:27] PROBLEM - MySQL Slave Delay on db51 is CRITICAL: CRIT replication delay 1591 seconds [22:31:51] New patchset: Mattflaschen; "Enable PostEdit on Commons and GuidedTour on that + ko, nl, vi" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50301 [22:33:02] New review: Ori.livneh; "Patch Set 1: Code-Review+2" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/50301 [22:33:25] !log awjrichards synchronized php-1.21wmf10/extensions/MobileFrontend [22:33:26] Logged the message, Master [22:33:45] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50301 [22:33:56] i just got a bunch of rsync errors from that sync-dir: [22:33:57] mw125: rsync: mkdir "/apache/common-local/php-1.21wmf10/extensions/MobileFrontend" failed: No such file or directory (2) [22:33:57] mw125: rsync error: error in file IO (code 11) at main.c(605) [Receiver=3.0.9] [22:35:39] !log awjrichards synchronized php-1.21wmf9/extensions/MobileFrontend/ [22:35:41] Logged the message, Master [22:36:21] RECOVERY - MySQL Slave Delay on db35 is OK: OK replication delay 0 seconds [22:36:39] RECOVERY - MySQL Slave Delay on db51 is OK: OK replication delay 0 seconds [22:38:18] RECOVERY - mysqld processes on db52 is OK: PROCS OK: 1 process with command name mysqld [22:41:54] PROBLEM - MySQL Slave Delay on db52 is CRITICAL: CRIT replication delay 2479 seconds [22:43:41] * RoanKattouw looks around for the on-duty person [22:43:43] apergos! [22:43:53] it is so after midnight [22:43:58] apergos: Could you look at and possibly merge https://gerrit.wikimedia.org/r/#/c/47103/ please? It's got approval from CT now [22:44:03] Oh crap sorry [22:44:04] now? 
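[Editor's note] For context on the ldap-group requirement discussed above (requiring cn=wmf for ishmael/graphite via mod_authnz_ldap, with the group holding member DNs): a rough sketch of the same "is this user in cn=wmf?" check done outside Apache, e.g. for debugging. It assumes anonymous reads are allowed, that the member attribute is literally named "member", and that python-ldap is available -- all assumptions to verify; the user DN below is purely hypothetical.

    import ldap

    GROUP_DN = "cn=wmf,ou=groups,dc=wikimedia,dc=org"
    # Server name taken from the channel discussion; port/TLS details assumed.
    SERVER = "ldap://virt0.wikimedia.org"

    def is_wmf_member(user_dn: str) -> bool:
        conn = ldap.initialize(SERVER)
        conn.simple_bind_s()  # anonymous bind; a real check may need credentials
        results = conn.search_s(GROUP_DN, ldap.SCOPE_BASE,
                                "(objectClass=*)", ["member"])
        if not results:
            return False
        _dn, attrs = results[0]
        members = [m.decode("utf-8") for m in attrs.get("member", [])]
        return user_dn in members

    # Hypothetical user DN, for illustration only.
    print(is_wmf_member("uid=example,ou=people,dc=wikimedia,dc=org"))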
[22:44:08] Nah tomorrow is fine [22:44:14] ok, I will open the tab [22:44:19] I sometimes forget you're ten whole hours ahead of us [22:44:25] heh [22:44:32] * RoanKattouw installs FoxClocks on his new machine [22:44:39] :-) [22:44:55] I know I shouldn't even be answering emails at this hour, it sets a bad precedent [22:45:05] just rying to get a few sent and then off to bed [22:46:03] Started the E3 scap [22:47:58] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 1 seconds [22:48:34] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [22:50:58] RECOVERY - MySQL Slave Delay on db52 is OK: OK replication delay 0 seconds [22:54:21] New patchset: Ori.livneh; "Add ack-grep to standard packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50306 [22:56:31] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [22:57:25] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [23:00:53] RECOVERY - MySQL Slave Delay on db39 is OK: OK replication delay 0 seconds [23:02:33] !log mflaschen Started syncing Wikimedia installation... : Deploy E3Experiments, GettingStarted, and GuidedTour [23:02:34] Logged the message, Master [23:05:13] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [23:07:43] New review: Mattflaschen; "Patch Set 1: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/50306 [23:11:22] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Feb 21 23:10:56 UTC 2013 [23:11:49] RECOVERY - Puppet freshness on labstore1 is OK: puppet ran at Thu Feb 21 23:11:22 UTC 2013 [23:12:57] New patchset: Pyoungmeister; "increasing granularity of users in mediawiki_new module and making statistics.pp use module user def" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50310 [23:18:53] !log mflaschen Finished syncing Wikimedia installation... : Deploy E3Experiments, GettingStarted, and GuidedTour [23:18:55] Logged the message, Master [23:21:41] !log olivneh synchronized php-1.21wmf10/extensions/CodeEditor 'Syncing patch that disables background linting' [23:21:42] Logged the message, Master [23:23:03] New patchset: Pyoungmeister; "increasing granularity of users in mediawiki_new module and making statistics.pp use module user def" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50310 [23:24:18] New review: Pyoungmeister; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50310 [23:24:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50310 [23:24:33] scap is done, but there were problems at the end with wikidiff2.so and texvc: [23:24:35] http://pastebin.com/gX4tP1Qa [23:25:00] ori-l, I'm not sure if that's a known issue. [23:26:08] superm401, spence has given me problems with texvc before [23:26:54] spagewmf, good to know. What about wikidiff2? [23:26:55] New patchset: Faidon; "Another fix for check_solr (duh!)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50312 [23:27:07] hrm, no idea. Reedy? [23:27:46] superm401 the question is what is spence doing. If it's serving wiki pages then working extensions matter, but I think it's for some other purpose [23:28:37] there's an open bz bug, IIRC, for packaging texvc and putting it in our debian repo [23:29:30] Anyone know what ishmael is? 
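[Editor's note] The sync/scap problems mentioned above surface as per-host prefixed lines (e.g. "mw125: rsync: ... failed"), so one quick way to triage a noisy run is to bucket the output by host. A small sketch of that, assuming output lines in the "host: message" form shown earlier; the script and file names are hypothetical and not part of scap itself.

    import re
    import sys
    from collections import defaultdict

    HOST_LINE = re.compile(
        r"^(?P<host>[a-z0-9-]+):\s+(?P<msg>.*(error|failed).*)$",
        re.IGNORECASE)

    def summarize(lines):
        """Group error/failure lines from a sync run by the reporting host."""
        per_host = defaultdict(list)
        for line in lines:
            m = HOST_LINE.match(line.strip())
            if m:
                per_host[m.group("host")].append(m.group("msg"))
        return per_host

    if __name__ == "__main__":
        # e.g. python summarize_sync_errors.py scap-output.log
        with open(sys.argv[1]) as f:
            for host, msgs in sorted(summarize(f).items()):
                print(f"{host}: {len(msgs)} error line(s)")
                for msg in msgs[:3]:
                    print(f"  {msg}")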
[23:29:32] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50312 [23:29:41] http://wikitech.wikimedia.org/view/Ishmael [23:29:42] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50312 [23:29:55] When I ssh in, it just keeps saying "Call me ishmael". [23:29:58] heh [23:30:02] thats from [23:30:04] Moby Dick [23:30:11] duh [23:30:18] ;) [23:30:28] > "Moby-Dick" begins with the line "Call me Ishmael." According to the American Book Review's rating in 2011, this is one of the most recognizable opening lines in Western literature.[4] [23:31:43] RECOVERY - Puppet freshness on labstore3 is OK: puppet ran at Thu Feb 21 23:31:33 UTC 2013 [23:39:40] RECOVERY - Puppet freshness on search13 is OK: puppet ran at Thu Feb 21 23:39:15 UTC 2013 [23:42:40] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [23:43:43] New review: Ryan Lane; "Patch Set 6: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49916 [23:44:23] TimStarling: http://en.wikipedia.org/w/index.php?title=Module:Convert&action=edit [23:44:29] superm401: Ignore the spence errors. As for ishmael, it's a DB stats server https://ishmael.wikimedia.org/ [23:44:30] I like the comment in shallow_copy(t) [23:44:46] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [23:44:53] Reedy, thanks. [23:44:59] Worth filing a bug or RT to fix? [23:46:57] No [23:54:44] AaronSchulz: yeah, it's nice [23:55:28] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [23:55:37] that Convertdata module will be slow though, until we merge mw.loadData() [23:56:48] RECOVERY - MySQL disk space on neon is OK: DISK OK [23:57:37] New patchset: Pyoungmeister; "another swithc away from old mediawiki::user class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50314 [23:58:32] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50314 [23:58:42] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50314
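[Editor's note] On the mw.loadData() remark above: the general idea is to build a large, read-only data table once and reuse it across invocations, instead of re-evaluating the data module every time it is used. A rough Python analogy of that caching pattern follows; it is not the Scribunto implementation, and the data file name and structure are made up for illustration.

    import json
    from functools import lru_cache
    from types import MappingProxyType

    @lru_cache(maxsize=None)
    def load_data(path: str):
        """Parse the data file once per process and return a read-only view."""
        with open(path, encoding="utf-8") as f:
            data = json.load(f)  # assumed to be a JSON object (a dict)
        return MappingProxyType(data)

    def convert(value, unit, path="convert_data.json"):
        # Every call shares the same cached table; only the first call pays
        # the parsing cost, which is the kind of saving mw.loadData() aims at.
        table = load_data(path)
        factor = table["units"][unit]["factor"]
        return value * factor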