[00:01:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37980
[00:04:27] !log pgehres synchronized php-1.21wmf6/extensions/ContributionTracking/ 'Updating ContributionTracking to master'
[00:04:34] Logged the message, Master
[00:05:21] !log pgehres synchronized php-1.21wmf6/extensions/FundraiserLandingPage/ 'Updating FundraiserLandingPage to master'
[00:05:28] Logged the message, Master
[00:06:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:08:52] New patchset: Andrew Bogott; "Puppetize adminbot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37981
[00:21:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds
[00:36:57] New patchset: Pgehres; "Putting donation pipeline in maintenance mode for db1008 maintenance." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37984
[00:37:59] fyi - in 5 minutes we are putting the fundraising donation pipeline into maintenance mode for db1008 hdd swap at 01:00 UTC
[00:38:22] okdok
[00:41:34] Change merged: Katie Horn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37984
[00:44:29] !log pgehres synchronized wmf-config/CommonSettings.php 'Putting donation pipeline in maintenance mode'
[00:44:37] Logged the message, Master
[00:47:02] New patchset: Pgehres; "Adding maintenance mode to S:FLP as well." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37988
[00:47:25] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37988
[00:47:42] notpeter: did you forget your https://bugzilla.wikimedia.org/show_bug.cgi?id=42152 comment? :)
[00:47:54] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours
[00:48:09] !log pgehres synchronized wmf-config/CommonSettings.php 'Putting donation pipeline in maintenance mode, part 2'
[00:48:17] Logged the message, Master
[00:49:15] AaronSchulz: I wouldn't say I forgot it...
[00:49:41] are you waiting till the next run
[00:49:42] ?
[00:53:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:01:28] cmjohnson1: ok we're clear to do the disk swap when you're ready
[01:01:35] Payments is Go! :)
[01:01:39] okay
[01:01:58] Donate is go!
[01:02:03] wmfwiki is Go!
[01:02:21] CAPCOM, we are go!
[01:02:25] !log replacing disk on slot 8 db1008
[01:02:32] Logged the message, Master
[01:03:22] Asburn, SanFran, you are go for disk swap ;-)
[01:04:10] having just watched "Prep & Landing" it seems like we should be talking in xmas idioms
[01:05:02] Roger that... Santa?
[01:05:05] * K4-7131 is awkward
[01:05:23] disk has been swapped
[01:05:25] Firmware state: Rebuild
[01:05:27] as long as it's no "Oh Frostbite" or "Figgy Pudding"
[01:05:32] Yay!
[01:05:40] rudolph has a red nose
[01:05:47] i reaeat, rudolph has a red nose
[01:06:00] you ate his nose?
[01:06:34] More than once, apparently.
[01:06:49] cmjohnson1: I'm afraid to say for fear of jinxing the Dell gods, but i guess we're good to go?
[01:07:27] yes...the disk is in rebuild so it will take awhile to be fully online but that is all
[01:07:34] yeah
[01:07:44] i guess we're okay then?
[01:07:50] slight but probably not noticeable performance for an hour or so
[01:07:53] k
[01:08:01] I'm lighting up payments again, then.
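The "Firmware state: Rebuild" line above is typical MegaRAID controller output for a freshly swapped drive. A minimal sketch of how that rebuild might be checked from the shell, assuming an LSI MegaRAID controller and the MegaCli utility; the adapter and enclosure:slot numbers are illustrative, not taken from the log:

```sh
# List physical drives and their state; a just-replaced drive should show
# "Firmware state: Rebuild" until the array has resynced.
# (The binary may be installed as MegaCli, MegaCli64, or megacli.)
megacli -PDList -aALL | grep -E 'Slot Number|Firmware state'

# Show rebuild progress for a specific drive (enclosure:slot is hypothetical).
megacli -PDRbld -ShowProg -PhysDrv '[32:8]' -a0
```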
[01:08:13] lmk and I will unfuck the cluster
[01:08:26] pgehres: I'm pretty sure that's not possible
[01:08:40] my eyes.
[01:08:44] * K4-7131 gets popcorn
[01:08:49] ...go for it. :)
[01:09:28] cmjohnson1: thanks!
[01:09:35] anytime!
[01:09:38] indeed
[01:09:41] New patchset: Pgehres; "Turning donations back on" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37991
[01:09:44] thanks for staying late
[01:09:50] k..i am packing up and going home
[01:09:53] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37991
[01:09:58] and now that i have made all the tickets go crazy, i'm outta here!
[01:10:02] have a good night.
[01:10:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds
[01:10:17] LeslieCarr: i'm afraid to look at my queue
[01:10:53] !log pgehres synchronized wmf-config/CommonSettings.php 'Putting donation pipeline back to normal'
[01:12:13] Logged the message, Master
[01:16:43] mutante: I've reopened https://rt.wikimedia.org/Ticket/Display.html?id=3942
[01:17:14] I don't know what it says in puppet, or which file controls this, but I've checked from time to time on the actual server and I'm not in the group
[01:37:43] New patchset: Ryan Lane; "Increase worker threads for the salt master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37994
[01:38:35] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37994
[01:39:39] New patchset: Pyoungmeister; "coredb cnf to our standard" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37995
[01:41:49] ^demon: ping
[01:41:59] <^demon> I got your text, I'm looking.
[01:42:01] ^demon: This looks a little messed up https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=commit;h=039415ae2aa5c73513697cb0341c3c5e7d9840cf
[01:42:17] ^demon: Okay cool thanks
[01:42:38] New patchset: Pyoungmeister; "coredb cnf to our standard" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37995
[01:42:41] <^demon> What the fuck did I do?
[01:42:48] :D
[01:42:53] things you don't want to hear
[01:42:56] <^demon> Stupid make-wmf-branch.
[01:42:57] we were wondering that :-D
[01:43:03] <^demon> What the fuck did *it* do?
[01:43:21] <^demon> I'm going to rewrite history.
[01:43:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:43:45] ^demon: hrm....might that compound matters?
[01:45:02] <^demon> wmf/1.21wmf6 is correct. I have no clue why that also got committed to master.
[01:46:45] <^demon> Ugh, rewriting history will confound things at this point.
[01:46:48] <^demon> You're right.
[01:48:03] <^demon> Well, anything that still needs cleaning can be undone.
[01:48:31] ^demon: well it's .gitmodules that is the big issue right?
[01:48:49] <^demon> What's the big deal about it? Just delete it.
[01:49:27] ^demon: well the big deal is that it's in master and creating all the extensions
[01:50:49] <^demon> Easily fixed: https://gerrit.wikimedia.org/r/#/c/38000/
[01:51:54] ^demon: that looks good to me
[01:52:16] ^demon: Can I merge that?
[01:52:23] <^demon> Jenkins will when unit tests are done.
[01:52:46] ^demon: what command did you use? I was getting rebase errors with Fundraiser/ContribsTracking
[01:53:08] <^demon> `git revert 039415ae2aa5c73513697cb0341c3c5e7d9840cf`
[01:53:13] <^demon> `git commit -a`
[01:53:18] did you pull first?
[01:53:22] ^demon: Jenkins is going to merge that commit?
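For context, the two commands ^demon quotes just above are the heart of the cleanup being discussed: reverting the commit that accidentally landed .gitmodules (and the extension submodules) on master. A rough sketch of that sequence, assuming a local clone of mediawiki/core; only the commit hash comes from the log, the rest is illustrative rather than the exact commands that were run:

```sh
# Make sure the local master matches the remote before reverting.
git checkout master
git pull origin master

# Revert the stray commit. git revert normally creates the revert commit
# itself; if it stops on conflicts, resolve them and finish with `git commit -a`.
git revert 039415ae2aa5c73513697cb0341c3c5e7d9840cf

# If submodule entries linger, drop them from the index and confirm that
# .git/config no longer registers them (as suggested a little later in the log).
git rm --cached extensions/SomeExtension   # path is hypothetical
grep -A2 '^\[submodule' .git/config || echo "no submodules registered"
```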
[01:53:22] I mean submodule upate
[01:53:34] s/upate/update
[01:54:34] <^demon> AaronSchulz: Not sure if I had or not. I've been switching between branches and submodules all day.
[01:54:48] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[01:55:02] <^demon> Entirely possible. In any case, if deleting .gitmodules isn't enough, check .git/config to make sure they're not still registered as submodules.
[01:55:35] ^demon: I tried reverting first...then just deleting the .gitmodules file and reverting the stuff in the index
[01:55:50] yeah .git/config was empty
[01:56:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.005 seconds
[01:58:11] <^demon> https://gerrit.wikimedia.org/r/38001 and https://gerrit.wikimedia.org/r/38002 need review.
[01:59:04] ^demon: What about Wikibase?
[01:59:43] <^demon> Oh whoops, I missed that due to conflict.
[01:59:45] <^demon> Will submit.
[01:59:53] ^demon: Okay cool thanks
[02:00:32] <^demon> https://gerrit.wikimedia.org/r/38003
[02:00:34] ^demon: ahh, the .gitreview fucked up my revert
[02:00:35] !log LocalisationUpdate failed: git pull of core failed
[02:00:44] Logged the message, Master
[02:01:59] ^demon: Depends On I43248135 Revert "Applied patches to new WMF 1.21wmf6 branch"
[02:02:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37995
[02:02:13] <^demon> Yes I know. All 3 of those changes should be reviewed.
[02:02:20] <^demon> 38001, 38002, 38003.
[02:02:54] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[02:03:01] ^demon: Okay thanks
[02:03:24] ^demon: I greatly appreciate you hoping back on IRC and taking care of this issue tonight
[02:04:03] <^demon> Yeah well, I broke master :p
[02:04:14] s/hoping/hopping
[02:04:20] * AaronSchulz snickers
[02:04:32] AaronSchulz: ha ha both work really
[02:05:05] ^demon: s/hoping/hopping
[02:05:26] * preilly — figures AaronSchulz should get a freebie
[02:05:44] * AaronSchulz was surprised when he said "potata" earlier instead of "potato"
[02:06:28] <^demon> Ok, master's fixed now.
[02:06:56] * AaronSchulz watches chad delete 58000 lines of code
[02:09:06] * preilly ha ha ha
[02:15:04] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 184 seconds
[02:15:58] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 196 seconds
[02:16:19] <^demon> I'm going back to my evening. If you need me, e-mail.
[02:17:37] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[02:18:13] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[02:32:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:38:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds
[02:40:16] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Tue Dec 11 02:40:04 UTC 2012
[03:02:26] !log Rebooting mexia, seems to have been hit by some kernel bug
[03:02:33] Logged the message, Mr. Obvious
[03:05:46] RECOVERY - Puppet freshness on ms-be3002 is OK: puppet ran at Tue Dec 11 03:05:30 UTC 2012
[03:08:10] PROBLEM - Host mexia is DOWN: PING CRITICAL - Packet loss = 100%
[03:08:47] !log aaron synchronized php-1.21wmf6/includes/job/jobs/RefreshLinksJob.php
[03:08:55] Logged the message, Master
[03:12:48] !log aaron synchronized php-1.21wmf6/includes/job/JobQueueDB.php
[03:12:55] Logged the message, Master
[03:29:00] !log mexia still hasn't come back up. That combined with the weird behavior I saw before rebooting it makes me suspect a hardware issue
[03:29:08] Logged the message, Mr. Obvious
[03:29:17] !log aaron synchronized php-1.21wmf5/includes/job/jobs/RefreshLinksJob.php
[03:29:25] Logged the message, Master
[03:29:35] !log aaron synchronized php-1.21wmf5/includes/job/JobQueueDB.php
[03:29:44] Logged the message, Master
[03:29:51] Ryan_Lane: git-deploy works great now :)
[03:29:58] Only problem is mexia isn't coming back up after reboot
[03:38:26] Also, RT isn't letting me log in, sent an e-mail about that and about the mexia breakage to ops-l
[03:53:00] !log powercycling mexia
[03:53:08] Logged the message, Master
[03:53:11] RoanKattouw: reply via mail .i reset RT for you
[03:53:18] find info in fenari home "rt"
[03:53:47] watching mexia boot up
[03:53:54] Initializing firmware interfaces...
[03:54:00] mutante: That password doesn't work either
[03:54:03] Lifecycle Controller: Collecting System Inventory...
[03:54:32] RoanKattouw: wth?! i have no issues with that at all and i reset mine multiple times
[03:54:49] RoanKattouw: mexia login:
[03:54:49] RECOVERY - Host mexia is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[03:54:51] I reset mine multiple times too
[03:55:02] I tried every value the password has ever had, plus the one you gave me. They all fail
[03:55:18] D'OH
[03:55:23] user catrope?
[03:55:25] username != password
[03:55:28] heh
[03:56:42] Ahm, sorry
[03:56:46] username != e-mail address
[03:56:48] RoanKattouw: np:)
[03:57:04] And mexia is back up
[03:57:05] Thanks mam
[03:57:07] *man
[03:57:08] when i talk to you its always about hosts being down that need powercycling an i never heard of before:))
[03:57:16] mystery hosts,heh
[03:57:33] what is mexia doing
[03:57:43] mexia = temporary Parsoid box
[03:57:48] ah,ok:)
[04:02:37] PROBLEM - Parsoid on mexia is CRITICAL: Connection refused
[04:04:58] !log Updating Parsoid to master
[04:05:05] Logged the message, Mr. Obvious
[04:10:43] RECOVERY - Parsoid on mexia is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.002 seconds
[04:13:41] :) sounds good. away again then
[04:15:48] RoanKattouw: hooray about git-deploy!
[04:15:58] I still need to add the change to ignore submodules
[04:16:02] I'll make that tomorrow morning
[04:16:17] OK, coo
[04:16:27] I've just hacked it to remove the submodules for now
[04:16:31] Also, .deploy is really fucking annoying
[04:16:42] why? just put it into the gitignore
[04:16:46] I've hacked the Parsoid clone to ignore it, but you'd have to do that to every repo
[04:16:52] I suppose I can land that in master in the long term
[04:16:58] yeah
[04:17:02] it's needed
[04:17:13] the client reads it to know which tag to use
[04:17:18] s/client/minion/
[04:18:38] I'd prefer it was in json, but when we rewrite git-deploy we can do that
[04:26:10] PROBLEM - MySQL Replication Heartbeat on db64 is CRITICAL: CRIT replication delay 186 seconds
[04:26:28] PROBLEM - MySQL Replication Heartbeat on db39 is CRITICAL: CRIT replication delay 187 seconds
[04:26:37] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 181 seconds
[04:26:37] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: CRIT replication delay 182 seconds
[04:26:37] PROBLEM - MySQL Slave Delay on db64 is CRITICAL: CRIT replication delay 201 seconds
[04:27:04] PROBLEM - MySQL Replication Heartbeat on db11 is CRITICAL: CRIT replication delay 185 seconds
[04:27:22] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 222 seconds
[04:27:22] PROBLEM - MySQL Slave Delay on db39 is CRITICAL: CRIT replication delay 212 seconds
[04:27:22] PROBLEM - MySQL Slave Delay on db11 is CRITICAL: CRIT replication delay 188 seconds
[04:27:31] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 227 seconds
[04:29:37] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: CRIT replication delay 182 seconds
[04:29:46] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 187 seconds
[04:29:46] :/
[04:29:55] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: CRIT replication delay 194 seconds
[04:30:40] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 223 seconds
[04:30:58] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: CRIT replication delay 234 seconds
[04:31:07] PROBLEM - MySQL Slave Delay on db1019 is CRITICAL: CRIT replication delay 240 seconds
[04:31:16] that's s3.. the s3 masting just has a *ton* of UPDATE /* JobQueueDB::claimRandom */ queries running right now
[04:31:21] ah
[04:31:26] was just about to log onto that host to check
[04:32:06] there was a spike earlier on the job runners as well
[04:33:08] <3 dbtree
[04:33:20] db66 also has a long running WikiExporter query thread for dewikisource going.. it looks like it's from the search indexer, so maybe dewikisource doesn't have incremental indexing :/
[04:34:35] and db39 has a ton of very long running
[04:34:36] SELECT /* FlaggedRevsStats::getEditReviewTimes */ /*! STRAIGHT_JOIN */ MIN(rev_timestamp) AS rt,MIN(n.fr_timestamp) AS nft,MAX(p.fr_rev_id) FROM `revision` INNER JOIN `flaggedrevs` `p` ON ((p.fr_page_id = rev_page) AND (p.fr_rev_id < rev_id) AND (p.fr_timestamp < rev_timestamp)) INNER JOIN `flaggedrevs` `n` ON ((n.fr_page_id = rev_page) AND (n.fr_rev_id >= rev_id) AND (n.fr_timestamp >= rev_timestamp)) WHERE (rev_user
[04:34:37] AND (rev_timestamp BETWEEN '20121109124953' AND '20121116124953') AND ((rev_id % 1) = 0) GROUP BY rev_timestamp,rev_id
[04:34:37] queries
[04:34:44] all coming from hume
[04:35:05] all copying to tmp tables.. its kinda fucked, i'm going to kill all of that
[04:35:06] ah. that's the new wikidata cron
[04:35:52] that is up? thought it was still listed as ToDo to switch something to cron .. and get puppet help for it
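binasher does kill these threads a few minutes further down ("killed 22 long running FlaggedRevsStats::getEditReviewTimes on db39"). A minimal sketch of how such threads can be located and killed on a MySQL 5.x slave, assuming the standard information_schema.PROCESSLIST table; the time threshold and thread id are illustrative, and only the client-host prefix (hume) comes from the log:

```sql
-- Find long-running queries coming from a particular client host.
SELECT id, user, host, time, LEFT(info, 80) AS query
FROM information_schema.PROCESSLIST
WHERE command = 'Query'
  AND time > 600                -- running for more than 10 minutes
  AND host LIKE 'hume%'         -- client host, as seen in the log
ORDER BY time DESC;

-- Then, for each offending connection:
KILL 12345;                     -- thread id is illustrative
```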
[04:35:58] sorry, reading the wrong thin
[04:35:58] looks like all for enwikinews
[04:35:59] *thing
[04:36:15] and looks like whatever launched them on hume is long dead.. there's nothing on the other side
[04:36:30] in fact, I'm logging off of fenari, not really in the right state to be on systems
[04:37:47] :)
[04:38:31] !log killed 22 long running FlaggedRevsStats::getEditReviewTimes on db39, run from hume against enwikinews
[04:38:38] Logged the message, Master
[04:41:09] !log same with db66
[04:41:16] Logged the message, Master
[04:42:04] RECOVERY - MySQL Slave Delay on db1003 is OK: OK replication delay 0 seconds
[04:42:41] :)
[04:44:17] notpeter says he fixed some crons on hume.. well. please consider unfixing them :)
[04:44:28] RECOVERY - MySQL Slave Delay on db64 is OK: OK replication delay 0 seconds
[04:46:00] working as intended. by whoever wrote it....
[04:46:39] ah, was looking at cron.daily but will stop
[04:46:52] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 889 seconds
[05:01:35] Ryan_Lane: is there a place in labs to play with git-deploy? (just to learn how it works)
[05:01:45] well, there was...
[05:01:47] I kind of broke it
[05:02:08] it's easier to implement in production than in labs
[05:02:08] implosion? ;)
[05:02:40] I deleted a couple of the instances
[05:03:01] until we make salt per-project, it's not easy to do
[05:04:33] PROBLEM - MySQL Slave Delay on db1010 is CRITICAL: CRIT replication delay 1725 seconds
[05:06:01] i was thinking there could at least be a master and a few slaves all within a single project
[05:06:12] RECOVERY - MySQL Slave Delay on db1010 is OK: OK replication delay 0 seconds
[05:06:12] RECOVERY - MySQL Slave Delay on db1003 is OK: OK replication delay 0 seconds
[05:06:33] wasn't considering cross/multi project implications
[05:06:55] yeah, that's how I did it originally
[05:07:08] now the salt master is on virt0, though
[05:07:30] and the instances are configured to use it
[05:11:06] !log catrope synchronized php-1.21wmf5/extensions/VisualEditor 'Updating VisualEditor to master'
[05:11:09] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 2056 seconds
[05:11:14] Logged the message, Master
[05:11:27] !log catrope synchronized php-1.21wmf6/extensions/VisualEditor 'Updating VisualEditor to master'
[05:11:34] Logged the message, Master
[05:12:57] RECOVERY - MySQL Slave Delay on db1003 is OK: OK replication delay 0 seconds
[05:16:11] Ryan_Lane: lots of memcached connection errors
[05:16:23] * Aaron|home wonders if binasher is around
[05:16:31] uugggghh. I really shouldn't do things on the cluster....
[05:17:17] Aaron|home: huh?
[05:17:29] memcached-serious
[05:17:54] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 2407 seconds
[05:19:41] git-deploy is sllloooowwwww
[05:19:42] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[05:20:16] RoanKattouw: which repo?
[05:20:21] parsoid/Parsoid
[05:20:29] for fetch and checkout or both?
[05:20:38] Oh now it came through
[05:20:49] # INFO : Step 'sync' finished. Started at 2012-12-11 05:18:13; took 137 seconds to complete
[05:21:03] I ... think it was checkout?
[05:21:04] binasher: spammy :)
[05:21:31] RoanKattouw: did all of the minions return success?
[05:21:34] Yes
[05:21:46] hm. I wonder if I'm still not running enough threads
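The thread count Ryan is second-guessing here is the Salt master's worker pool, the subject of the "Increase worker threads for the salt master" change merged at 01:38 above. A minimal sketch of the relevant setting, assuming a stock /etc/salt/master (YAML) configuration; only the value 50 is taken from the log:

```yaml
# /etc/salt/master
# The Salt master default is 5 worker processes; a large minion fleet (or slow
# returns, like the 137-second sync above) may need considerably more.
worker_threads: 50
```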
[05:21:54] 50 still may be low
[05:22:31] !log killed all wikiadmin threads on db39
[05:22:34] fetch is returning almost immediately for me
[05:22:38] Logged the message, Master
[05:22:49] checkout did too
[05:22:57] Aaron|home: it's most all from 4 app servers:
[05:22:58] 7741115 mw61
[05:22:59] 7716291 srv248
[05:22:59] 7711051 mw60
[05:23:00] 7707027 srv249
[05:23:10] thats number of log lines
[05:23:13] next closest is 9393 srv193
[05:23:25] memcache isn't running on srv261
[05:23:34] which is a random one they are complaining about
[05:23:36] i wonder if they're having network probs.. hmm, i think one of those was 10mbps
[05:23:42] heh
[05:23:53] Ryan_Lane: they run mediawiki
[05:24:18] are they misconfigured, then?
[05:24:20] and they likely have networking issues
[05:24:39] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 2759 seconds
[05:24:40] one of the servers is complaining that it can't connect to 11000 on srv261
[05:24:57] oh?
[05:24:59] yes
[05:25:13] oh lol
[05:25:15] maybe they need to have scap run on them?
[05:25:22] Aaron|home: wtf?
[05:25:49] that would mean they're running srsly old configs
[05:26:06] 51 => '10.0.8.11:11000',
[05:26:08] it's in mc.php
[05:26:35] sounds like something that salt's supposed to make never happen again ;P
[05:26:38] binasher: those are bit boxes
[05:26:43] jeremyb: eh? no
[05:26:46] oh
[05:26:50] Ryan_Lane: those lines aren't used
[05:26:52] you mean unsync'd boxes?
[05:26:55] those 4 boxes you listed
[05:26:57] ah
[05:27:10] lemme see if mw is really out of date on them
[05:27:28] jeremyb: yeah, hopefully git-deploy will make it better, but it won't solve it
[05:27:40] Ryan_Lane: i mean (in part) that a box doens't have to be up and operational when you do a sync because it could stay in the salt queue for when it comes back up?
[05:28:01] Ryan_Lane: not that i really even know if that's possible with salt
[05:28:03] salt uses 0mq. if the box doesn't get the call, it doesn't get it
[05:28:30] hrmmm
[05:28:50] but, with git-deploy, the system should be able to check often to see if it's out of date
[05:28:58] yeah, i guess
[05:29:04] which is better than no
[05:29:05] *now
[05:29:10] noop redeploys are cheap
[05:29:22] with unchanged git commit ids
[05:29:29] binasher: ...
[05:29:39] binasher: it's in the active list in mc.php
[05:29:58] Ryan_Lane: they don't even have memcache installed anymore
[05:30:01] though none of those should be used
[05:30:06] oh
[05:30:08] that file does litereally 0 things now
[05:30:17] we use a different method of managing mc now?
[05:30:24] didn't know that
[05:30:30] did that happen while I was on vacation? :)
[05:30:35] Ryan_Lane: http://24.media.tumblr.com/tumblr_mamw0rbkSE1qbsw6yo1_400.gif
[05:30:44] that's how we manage
[05:30:45] where's it configured now? php config?
[05:30:51] hahaha
[05:31:05] uh.... it's mc1-16
[05:31:10] not sure how conf'd, tbh
[05:31:13] then wtf
[05:31:15] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[05:31:22] those are weird errors :D
[05:31:24] binasher could tell you
[05:32:01] y'all are scaring me
[05:32:21] whiskey's my excuse. what's yours?
[05:32:26] well, it's obvious I haven't read about this change ;)
[05:32:28] jesus, why am I on work irc
[05:32:29] notpeter: mine too
[05:32:36] lololol
[05:32:49] earlier I logged off of fenari because I thought it was likely a bad idea
[05:32:52] notpeter: mc.php is still used, but only the pecl section. guess its time to delete the shit outta the rest
[05:32:59] ahhhhhh, ok
[05:33:03] ah, gotcha
[05:33:12] anyone send an email about this?
[05:33:14] good to know! :)
[05:33:23] I probably select-all/mark as read'd it
[05:33:25] jobs runners are stuck on enotify
[05:33:32] did anyone scap those boxes
[05:33:36] nothing happening over there
[05:33:40] no. I can, though
[05:33:40] are there hot spares in the new system? what configures who's a spare or not?
[05:33:48] Ryan_Lane: ok, but take a shot first
[05:33:52] hahaha
[05:34:00] not a great idea ;)
[05:34:53] (if not mc.php)
[05:34:59] * jeremyb is too tired for shots
[05:35:40] of course scap takes forever and a fucking year
[05:36:12] which is kind of sad when it's a single box doing it
[05:36:12] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 3362 seconds
[05:37:03] * Aaron|home wonders wtf code the runners are stopped on
[05:37:50] ok, I have no idea why I'm online
[05:37:54] peace out, yall
[05:38:31] heh
[05:41:29] binasher: scap finished on srv248, still reporting memcached errors
[05:44:42] Aaron|home: resourceloader using the php mc client?
[05:45:20] I don't see how
[05:46:51] does bits use the eventlogging extension at all?
[05:48:48] RECOVERY - MySQL Slave Delay on db1003 is OK: OK replication delay 0 seconds
[05:48:58] poor db39 - Trx id counter C746E6056 Purge done for trx's n:o < BE2171C26 undo n:o < 23
[05:49:00] * Aaron|home knows little of that ext
[05:50:15] it's enabled on enwiki
[05:50:19] and meta
[05:50:22] and test and test2
[05:50:36] $wgEventLoggingBaseUri = '//bits.wikimedia.org/event.gif';
[05:51:47] ./extensions/EventLogging/EventLogging.module.php: $this->cache = $cache ?: wfGetCache( CACHE_MEMCACHED );
[05:51:57] !log aaron synchronized php-1.21wmf5/includes/objectcache/MemcachedPhpBagOStuff.php
[05:52:03] i think it's the only thing hard coded to specifically use the old client
[05:52:05] Logged the message, Master
[05:52:30] that can't be working very well :D
[05:52:34] :D
[05:52:55] !log aaron synchronized php-1.21wmf5/includes/objectcache/MemcachedPhpBagOStuff.php
[05:53:02] Logged the message, Master
[05:53:14] yeah that's it
[05:53:45] why is that hardcoded?
[05:53:45] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 4270 seconds
[05:53:58] binasher: yes we need to delete all that old shit
[05:54:07] that way no one can use it
[05:54:10] Aaron|home: i think ori-l wrote that ext?
[05:54:33] hrm? wait, catching up
[05:55:31] what's the right thing to use? i had $wgMemc, but was foiled by the fact that you get a null cacher if you don't configure memcache.
[05:56:05] i just deployed that change on friday and it indeed didn't work very well :|
[05:56:27] is a cache needed for correctness?
[05:56:37] yes
[05:57:44] then use CACHE_ANYTHING
[05:58:06] that will still use the proper mc on our setup
[05:58:24] for 3rd parties it will fall back to the db cache in the worse case
[06:00:06] Aaron|home: hrm, JobQueueDB::claimRandom is contributing to replication lag on s3 in eqiad… we need to kill JobQueueDB
[06:00:28] Aaron|home / binasher: ok, committing a fix now. Thanks for the heads up.
[06:00:37] binasher: I was looking at that lately
[06:00:45] ori-l: thanks for the quick fix!
[06:00:47] I'm starting to think we really *need* redis
[06:00:55] * ori-l salivates.
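The fix ori-l says he is committing is the one Aaron suggests above: stop hard-coding CACHE_MEMCACHED in the line quoted from EventLogging.module.php. A minimal sketch of that change; the class and parameter names are hypothetical, and only the wfGetCache() call reflects the discussion in the log:

```php
<?php
// Illustrative constructor sketch (not the actual EventLogging code).
class SchemaCacheExample {
	/** @var BagOStuff */
	protected $cache;

	public function __construct( BagOStuff $cache = null ) {
		// Before (as quoted above): wfGetCache( CACHE_MEMCACHED ), which
		// breaks when memcached isn't configured or the old server list is stale.
		// After (Aaron's suggestion): CACHE_ANYTHING uses the configured
		// memcached cluster on Wikimedia's setup and falls back to the DB
		// cache on third-party installs.
		$this->cache = $cache ?: wfGetCache( CACHE_ANYTHING );
	}
}
```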
[06:00:57] Aaron|home: we are totally on the same page
[06:01:01] like not just "wouldn't it be cool" but "lets do this shit now"
[06:01:22] yes
[06:01:24] yesyesyesyes
[06:01:41] notpeter: ^^ DO IT..
[06:02:54] PROBLEM - MySQL Slave Delay on db1010 is CRITICAL: CRIT replication delay 4423 seconds
[06:03:30] btw binasher, varnishlog appears to truncate query strings at 245ish chars. i haven't had a chance to investigate the problem or report it adequately, but maybe you know off the top of your head what is imposing the constraint?
[06:05:04] ori-l: nothing off the top of my head
[06:05:11] will have to check out the code
[06:05:16] that's unfortunate though
[06:06:13] binasher: we should talk about some details for using redis
[06:07:00] maybe some meeting
[06:09:06] Aaron|home: that sounds good. and meetingful.
[06:11:15] jenkins is fubar at the moment, right?
[06:11:20] * Aaron|home detects jenkins fail
[06:12:03] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[06:12:03] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[06:12:03] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[06:14:19] i self-merged after jenkins failure
[06:14:31] i think the unwritten rule is that i now do an unscheduled sync to prod
[06:14:41] and promptly disconnect from the internet
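For context on the JobQueueDB::claimRandom lag discussed around 06:00 above, a rough sketch of why a Redis-backed queue appeals here: claiming a job becomes a single atomic list operation instead of an UPDATE racing over the job table on the s3 master. This is purely illustrative (hypothetical key names, phpredis client) and is not any actual MediaWiki job queue implementation:

```php
<?php
// Hypothetical sketch using the phpredis client; host and key names are made up.
$redis = new Redis();
$redis->connect( 'rdb1.example.org', 6379 );

// Producer: push a serialized job onto the queue.
$redis->lPush( 'jobqueue:enwiki:refreshLinks', json_encode( array( 'page' => 12345 ) ) );

// Consumer: atomically move one job onto a "claimed" list. Unlike
// JobQueueDB::claimRandom, this issues no UPDATEs against a replicated
// master, so it cannot contribute to slave lag.
$job = $redis->rPoplPush( 'jobqueue:enwiki:refreshLinks', 'jobqueue:enwiki:claimed' );
if ( $job !== false ) {
	// ... run the job, then acknowledge it by removing it from the claimed list.
	$redis->lRem( 'jobqueue:enwiki:claimed', $job, 1 );
}
```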