[00:08:02] RECOVERY - HTTP on neon is OK: HTTP OK HTTP/1.1 200 OK - 455 bytes in 0.056 seconds [00:08:15] hey maplebed: shall we give it another shot? i fixed the seg fault, and i ran all configurations on my local machine and they work [00:08:30] drdee: no, not today. [00:08:42] it's past 5pm and I don't think I have the fortitude to try again at the moment. [00:08:59] do you think we could try tomorrow morning? [00:09:10] if you really want to do it now I can rally... [00:09:10] okay, sounds good to me! [00:09:16] no no [00:09:21] tomorrow is good [00:09:25] ok. thanks. [00:10:03] anytime midmorning pacific will be fine. [00:10:14] perfect, see you then [00:20:47] RECOVERY - NTP on neon is OK: NTP OK: Offset -0.04752540588 secs [01:12:02] New review: Tim Starling; "This is not a normal CARP-like consistent hashing algorithm. Instead of hashing the URL together wit..." [operations/debs/varnish] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/4162 [04:33:48] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [04:43:51] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [04:43:51] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [04:58:51] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [04:58:51] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [05:31:54] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:53:12] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:07:00] PROBLEM - Lucene on search11 is CRITICAL: Connection timed out [06:12:37] RECOVERY - Lucene on search11 is OK: TCP OK - 0.006 second response time on port 8123 [06:15:10] PROBLEM - Puppet freshness on search1022 is CRITICAL: Puppet has not run in the last 10 hours [06:32:07] PROBLEM - Puppet freshness on search1021 is CRITICAL: Puppet has not run in the last 10 hours [07:43:59] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [08:39:32] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [08:41:29] PROBLEM - Puppet freshness on db9 is CRITICAL: Puppet has not run in the last 10 hours [09:29:12] New patchset: ArielGlenn; "threading + Popen = fail; switch to multiprocessing instead" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4255 [09:31:46] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4255 [09:31:48] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4255 [09:42:30] New review: Hashar; "Various minor notes." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/3885 [10:06:53] New patchset: ArielGlenn; "task_done() removed for good" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4256 [10:07:40] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4256 [10:07:43] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4256 [10:52:47] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:54:53] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [11:08:47] New patchset: ArielGlenn; "allow for the inclusion of mandatory files for wmf mirrors" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4258 [11:09:50] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4258 [11:09:52] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4258 [11:16:59] !log enabled Renameuser extension on wikitech, renamed tchay per RT request, disabled extension again (it was installed but disabled) [11:17:01] Logged the message, Master [11:47:28] New patchset: ArielGlenn; "allow for rsync of arbitrary files at root of rsync tree" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4260 [11:48:22] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4260 [11:48:24] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4260 [11:58:26] PROBLEM - Host db1007 is DOWN: PING CRITICAL - Packet loss = 100% [12:17:15] New patchset: Hashar; "document fenari symbolic links" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4261 [12:17:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4261 [12:20:03] before I look at db1007, anyone messing with it? [12:21:34] <^demon> G'morning apergos [12:21:39] yo [12:23:59] <^demon> How goes? [12:24:12] slow [12:24:14] slugging along [12:24:23] (workwise) [12:24:40] amazingly depressing (general life-wise) [12:24:42] you? [12:26:08] <^demon> Life's well. Busy, but making progress. [12:26:30] are you buried alive in git? [12:26:52] <^demon> Why do you think I ran off to help Rob in the DC last Wednesday? ;-) [12:27:00] hahaha [12:27:15] for your phyiscal health, of course! and to see the pretty blinky lights... [12:27:28] <^demon> They were pretty. [12:27:35] aren't they though? [12:28:15] ok folks (RobH since I see yr here) db1007 no response on console, no ping, no nada so going to powercycle [12:30:39] i think just like db1047 on March 30 (SAL) [12:31:22] !log rebooted bd1007, it was dead in the water (also no helpful messages on console, bah) [12:31:24] Logged the message, Master [12:31:33] watching it boot up [12:32:32] and does not show obvious errors at all.. then it is like the other [12:32:59] so wat was wrong with db1047? [12:33:52] i dont know. it froze, just like what you said about db1007, powercycled it, came back up, no obvious messages about hardware, syslog just ended in the middle of something unsuspicous [12:33:58] and m ark was like "Rob H will like to hear that" ;p [12:34:35] RECOVERY - Host db1007 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [12:36:14] bah [12:36:19] i added mysql to system startup though [12:36:37] maybe but it'snot running [12:36:43] i need to document how to upgrade firmware so when this happens folks can do it ;] [12:36:55] apergos: on db1047 that is [12:36:59] * apergos passes the buck to robh :-P [12:37:08] wha? [12:37:11] you rebooted it right? [12:37:18] dont pass it to me im eating breakfast! 
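The recovery being walked through for db1007 here (mirroring what was done for db1047 on March 30) boils down to roughly the following; this is a sketch of the steps discussed in the channel, not a canonical runbook, and the power-cycle command is only an example of the kind of mgmt action used:

# no response on console or ping: power cycle via the mgmt interface
# (e.g. 'racadm serveraction powercycle' on a DRAC, or the equivalent for the platform)
# once the box is back up, mysqld still has to be started and enabled by hand:
/etc/init.d/mysql start          # bring mysqld back
update-rc.d mysql defaults       # make it start automatically after the next reboot
mysql -e 'SHOW SLAVE STATUS\G' | egrep 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
# the replication delay alerts clear on their own once Seconds_Behind_Master drains back to 0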
[12:37:19] since you haven't written it up yet [12:37:22] PROBLEM - NTP on db1007 is CRITICAL: NTP CRITICAL: Offset unknown [12:37:28] nah, just reboot it and put it back into service [12:37:31] I rebooted it [12:37:36] its not worth waiting on firmware right now ;] [12:38:25] PROBLEM - mysqld processes on db1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [12:38:43] just start mysql, thats why i added that autostart on the other one [12:39:28] RECOVERY - NTP on db1007 is OK: NTP OK: Offset -0.003657221794 secs [12:40:18] update-rc.d mysql defaults [12:40:29] unless there is a reason for it not to [12:40:41] I don't kow where it is on these boxes [12:41:03] /etc/init.d/mysql start did it on the other [12:41:05] I was expecting it to be in /usr/locall/mysql-something [12:41:45] I hope that's not a messed up version [12:42:28] RECOVERY - mysqld processes on db1007 is OK: PROCS OK: 1 process with command name mysqld [12:42:40] !log started mysqld on db1007 via /etc/init.d/mysql (this doesn't seem to point to a special fb build, and can't seem to find one on this host, what's up with that?) [12:42:42] Logged the message, Master [12:43:02] well, it looked good in the way that first it reported a little replag on nagios, which then disappeared [12:43:29] I mean it will work, it just might not be as awesoem as we like [12:46:40] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 2227 seconds [12:47:13] there it is..and it shouldnt take that long now [12:48:28] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 1766 seconds [12:51:22] !log db1007 - add mysql startup via 'update-rc.d mysql defaults' [12:51:24] Logged the message, Master [12:54:55] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [12:55:13] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [12:56:10] do we have [12:56:11] wrong channel [13:44:36] New patchset: RobH; "updating admins, revoking outdated access, commenting file for easier reference" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4267 [13:44:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4267 [13:45:10] Anyone wanna review that for me or shall I self review? ;] [13:46:19] * RobH hears crickets chirping [13:47:23] New review: RobH; "this isn't self review, one of my many personalities did the work, and another one did the review." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4267 [13:47:26] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4267 [14:14:24] New patchset: Jgreen; "added timers to search api_sweep_test, increased http timeout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4269 [14:14:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4269 [14:15:46] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4269 [14:15:49] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4269 [14:35:38] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [14:45:41] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [14:45:41] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [15:00:41] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [15:00:41] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [15:01:46] New review: Mark Bergsma; "Yes, that's right. I didn't call it CARP because it's actually a bit different. :)" [operations/debs/varnish] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/4162 [15:17:24] just fyi, search testing is being slightly delayed because we're testing before it goes into prod ;) [15:17:48] notpeter: You mean the "testing in production" phase is delayed because we decided to insert a "testing in non-production" phase before it? [15:18:00] novel, I know [15:21:34] cmjohnson1: did the cable rings arrive? [15:21:40] im just doing my procurement review [15:22:12] yes...they arrived [15:22:18] cool, resolving it then [15:22:25] came in late yesterday [15:22:51] messing with the access point w/ Lcarr [15:28:06] !log pointing eswiki search at eqiad [15:28:08] Logged the message, notpeter [15:30:51] Jeff_Green: ok, should be live now [15:31:02] fire in the hole! [15:31:15] I'm totally getting search results on es [15:31:21] now. [15:31:37] how do we verify that they are in fact coming from eqiad? =P [15:32:01] stop lsearchd on the pmtpa host? :-P [15:33:03] or look at ganglia ;) [15:33:54] http://ganglia.wikimedia.org/latest/?r=20min&cs=&ce=&m=&c=Search&h=search14.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [15:33:58] search rate is at 0 [15:35:34] iunno, dude [15:35:48] looks like eswiki search is going to eqiad [15:37:31] nice [15:42:12] Jeff_Green: watching these logs/graphs is boring. let's play hungry hungry hippos [15:42:40] you know we actually own that game. except I like to call it Choke Hazard Game [15:42:53] hahahahahaha [15:43:07] well, then you're going to have to be on "playing the choking hazard game" detail [15:43:39] * Jeff_Green dies [15:44:14] noooooooooooooo [16:02:06] Change abandoned: Hashar; "This is not needed. The Gerrit plugin handle all the job :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2495 [16:05:04] New patchset: Hashar; "move jenkins/gerrit fetcher to integration/jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4283 [16:05:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4283 [16:05:47] Jeff_Green: notpeter: could you possibly review https://gerrit.wikimedia.org/r/4283 and merges it on puppet repo ? [16:06:01] yep, one sec [16:06:02] I am going to manage that file in another git repository (integration/jenkins) [16:08:32] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4283 [16:08:35] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4283 [16:11:37] hashar: ok done [16:11:45] Jeff_Green: great. 
many thanks :-] [16:11:52] np [16:15:41] New review: Catrope; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/3885 [16:16:33] PROBLEM - Puppet freshness on search1022 is CRITICAL: Puppet has not run in the last 10 hours [16:19:15] hi Jeff_Green , maplebed. did you create new gerrit projects before? [16:19:30] i haven't [16:20:19] gerrit projecs? not branches but new projects? no. you might ask reedy or roan? [16:20:24] someone made all the mediawiki projects. [16:20:31] I've made labs projects though... [16:21:06] I have created Gerrit projects before yes [16:21:24] i want some stuff in labs, and it's in operations/puppet , branch test. i am being asked though if this shouldnt be in its own project, no in operations/puppet [16:21:26] You have to do it from the CLI. I don't think Ryan documented this yet [16:22:31] I think Ryan may have made the projects, yea [16:22:38] he assisted on that migration [16:23:20] <^demon> maplebed: I'm the one who made several dozen repos in gerrit :) [16:23:51] good to know. I'll probably forget again by this afternoon. [16:23:54] ;) [16:24:22] ^demon: see, there you go again, volunteering information [16:24:25] you just dont learn ;] [16:24:56] <^demon> Oh see, when we upgraded to 2.3, you can make repos via the web. [16:25:00] <^demon> So it won't be all me ;-) [16:25:41] <^demon> s/upgraded/upgrade/ [16:25:44] <^demon> Hasn't happened yet [16:26:04] mutante: what is it that you're putting in git that shouldn't be in puppet? I'd dubious that creating a new repository within gerrit is the right way forward... [16:26:06] i just like that now all of the devs live in the painful gerrit world that ops lives in [16:26:11] misery loves company. [16:26:16] i just want to import a couple existing .php files from a statstics web app, fix them then, then put them on a labs instance via puppet. not that many files at all. either i just keep it in operations/puppet or ... [16:27:00] A comparable example is our ganglia installation - the php comes from a package and the config comes from puppet [16:27:04] but i wanted the process of fixing them in gerrit and public [16:27:37] i did not add them as puppet file resources right away [16:28:08] if it doesn't want to live in puppet, I'd suggest the operations/software repo (though at the moment that's all our own stuff, nothing backported, I think) [16:29:16] but there's plenty of stuff we just use puppet to deploy, so if it's small I wouldn't sweat it. Of course I'm sure others have different opinions... [16:29:23] i was going to add the files as puppet resources once the worst problems are fixed [16:29:42] instead of building a .deb for example [16:30:00] because the number of files really isnt that large [16:30:43] in that case I would add them to puppet in their original form to deploy them to the host, turn off puppet on the host, fix them, then put the fixed files back in puppet. [16:30:53] that gives you the complete set of changes to the files as one changeset [16:31:17] that's what I did for search qa stuff. felt dirty but it worked [16:32:14] so keep it in operations/puppet test [16:32:35] apergos, mutante: when Leslie shows up we can go over the slaving stuff - there are three of you and three hosts to reslave! How convenient. 
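For reference, creating a new Gerrit project from the CLI as discussed above goes through Gerrit's SSH admin interface; the project name below is a made-up placeholder and the exact flag syntax depends on the Gerrit version in use:

# create the project (hypothetical name)
ssh -p 29418 gerrit.wikimedia.org gerrit create-project --name operations/software/statsapp
# if the code later needs to ride along with puppet, it can be pulled in as a submodule
# (^demon's suggestion) rather than copied into operations/puppet:
git submodule add https://gerrit.wikimedia.org/r/p/operations/software/statsapp.git files/statsapp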
[16:32:48] cool [16:33:20] mutante: oh I didn't bother with that since it felt like I was duplicating revision control efforts [16:33:40] PROBLEM - Puppet freshness on search1021 is CRITICAL: Puppet has not run in the last 10 hours [16:34:20] i just threw a new dir in production/files/blah, added a couple classes to the most relevant manifest (search.pp) and felt like I'd done something wrong :-P [16:34:42] heh [16:34:44] ;) [16:35:42] I like the idea of a production/operations project area, separate from the puppet git repo [16:36:10] but i think we should figure out a way to make that something puppet can find and install from [16:36:37] or we create a little framework to package off of it without much effort [16:38:30] that'd be something fun for the hackathon I suppose? [16:40:05] <^demon> Put them in other repos then pull stuff in as submodules to puppet. [16:40:24] <^demon> :) [16:40:26] sure [16:40:49] <^demon> It's what we're planning on doing for extensions on the wmf branch for deployment. [16:40:56] i, for one, am not sufficiently educated the possiblities of on git+puppet+(gerrit?) integration [16:42:15] at my previous job we predefined an ops git repo all ops folks worked within [16:42:31] i.e. operations/projects/{your project here} [16:43:14] and wrote scripts such that they'd find and package new stuff under ./projects with little effort [16:43:48] maybe it would make sense to do something similar so we don't have to burn in the whole gerrit layer config for new trivial scripts batches etc? [16:44:36] leaving off the packaging part I mean, just have it such that it's easy to point puppet manifests at your new ./project/whatever [16:47:12] Jeff_Green: we have operations/software [16:47:34] though puppet can't pull from it, unless we do something fancier than currently exists. [16:47:40] so we're just missing the gerrit+puppet foo [16:48:16] maplebed: exactly, I opted not to use operations/software because it was a second place to check stuff in, detached [16:48:48] if I could have checked in there and pointed puppet at it directly, I would have chosen it instead [16:49:40] how difficult would it be to pull it in as a submodule complete with gerrit integration? [17:03:56] yea, so if i just merge it to operations/puppet test branch now, will i be killed?;) like "now it's hard to move it again" [17:05:20] apergos LeslieCarr mutante: you're all three here now; whenever you're ready let's talk mysql slaving and replica setup! [17:06:08] ready [17:06:46] can you give me 5 more minutes ? [17:15:07] maplebed: mutante ready now :) [17:16:17] yep [17:17:49] and apergos? [17:17:57] just replied to more git/gerrit access requests coming in via mail now, German "wikidata-dev" team is migrating [17:18:02] ah, maybe too much to try and coordinate all four of us... [17:18:39] well maybe it's better this way; apergos can read backscroll and then there'll be some time shift to the actions. [17:19:10] so, first thing - docs. http://wikitech.wikimedia.org/view/External_storage [17:19:25] take a quick skim through that doc; ping here when done. [17:21:55] ok, skimmed through [17:22:06] i approve of your choice of funny examples :) [17:22:25] best word ever. [17:22:57] ok, at the "Making a new slave .." header [17:24:26] New patchset: Bhartshorne; "current backup scheme doesn't work; removing from es1004" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4285 [17:24:41] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4285 [17:24:43] so which Host A and Host B are we going to use [17:24:51] At the moment, two of the ES slaves in eqiad are broken and the third is slaving from the wrong place. One of them's doing fine. [17:25:07] go ahead and look at all four eqiad hosts and see which is which. [17:25:14] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4285 [17:25:15] eeep [17:25:17] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4285 [17:25:20] es1001-1004 [17:25:21] ? [17:25:25] yup. [17:25:44] you're interested in the slaving health so you want to 'show slave status\G' [17:26:56] New patchset: Jgreen; "disabling ganglia packet loss reporting cron job on oxygen for now, it's not properly configured yet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4287 [17:27:00] es1002 - cant connect to local mysql server .. [17:27:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4287 [17:27:20] es1001 - waiting for master to send event [17:27:56] so 1004 is the only working one ? [17:28:08] The three lines to look for when checking a host for its slaving health (in the output of 'show slave status'): [17:28:09] Slave_IO_Running: Yes [17:28:09] Slave_SQL_Running: Yes [17:28:16] and [17:28:17] Seconds_Behind_Master: 0 [17:28:20] it has "Slave_IO_State: " nothing [17:29:06] es1004 has No for both slave_IO and slave_SQL running and NULL for seconds_behind_master [17:29:19] that translates to 'slaving is not curretnly running.' [17:29:42] i just updated http://wikitech.wikimedia.org/view/External_storage#Nagios.2C_Monitoring.2C_and_Health [17:29:43] es1002 can't connect to mysql and ps confirms that mysql's not even running. [17:29:59] ossm. [17:30:24] so Slave_IO_State: Waiting for master to send event is normal ? [17:30:32] yup. [17:30:42] and no IO_State at all? [17:30:43] that means it's just chillin waiting for new data to come in from the master. [17:31:09] ok [17:31:48] so to summarize - es1001 and 3 are curretnly slaving successfully, 2 and 4 are broken. [17:31:49] so which one is slaving from the wrong place ? [17:31:49] !log dns update for wikipedia.org/com.il being resolved [17:31:51] Logged the message, RobH [17:32:19] !log update done, all nameservers still online [17:32:20] the 'Master_Host' line tells us who the machine is slaving from; [17:32:21] Logged the message, RobH [17:32:27] in this case they're both slaving from es3. [17:32:56] the pretty picture in the wiki page says all hosts within one colo are supposed to slave from within that colo and only one host is supposed to slave cross-colo [17:33:08] so either 1 or 3 is wrong, and it doesn't really matter which. [17:33:21] cool [17:33:50] * mutante nods [17:34:03] the picture says es1001 is the one that's supposed to be slaving cross-colo, so that's the one I'd choose, but the important part is just that there's only one cross-colo link. [17:34:20] I will say it's important not because mysql doesn't replicate well cross-colo, but just because it's differetn from how the docs say it should be. 
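A quick way to run the health check maplebed just described across all four hosts; the loop itself is only a convenience sketch, the field names are the ones from the conversation:

for h in es1001 es1002 es1003 es1004; do
    echo "== $h =="
    # healthy: Slave_IO_Running: Yes, Slave_SQL_Running: Yes, Seconds_Behind_Master: 0
    # Master_Host shows who this node is slaving from (only one cross-colo link expected)
    ssh "$h" "mysql -e 'SHOW SLAVE STATUS\G'" \
        | egrep 'Master_Host|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
done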
[17:34:46] okay [17:35:03] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4287 [17:35:06] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4287 [17:35:10] because they're already dead, let's start with 2 and 4 intsead of worrying about 3. We'll leave that one for apergos. [17:35:12] so do/should we have a special important nagios alarm for if that host goes down ? [17:35:27] (also i have heard the term intermediate master, would that be applicable for 1 ?) [17:35:34] yup. [17:36:04] I think we'll see the mysql heartbeat alarm go off for all nodes in the colo if the intermediate master goes down, but I don't recall. [17:36:09] Anyone want to review my redirects.conf apache change before i push it? [17:36:14] * RobH would appreciate it [17:36:31] RobH: sure [17:36:31] that entire directory is in svn, so easy to review [17:36:42] its /home/wikipedia/conf/httpd/ [17:36:58] ok, let's read through the taking a snapshot and making a new slave sectinos more carefully. [17:37:00] added in wikipedia.org.il and wikipedia.com.il to redirect to he.wikipedia.org (hebrew lang code project) [17:37:13] opps, i see a mistake, durnit! [17:37:23] whee oxygen cronspam slain [17:37:31] ok, mistake fixed [17:37:35] looking [17:38:40] turns out we legally have to do soemthing with resolving the squatter and other domain names we get [17:38:48] simply just taking them over isnt quite good enough ;] [17:38:50] oh yeah [17:38:59] point them at a sour-faced page [17:39:21] heh, well, these are more country specific, so i feel even worse if we dont do something [17:39:35] i guess the typo domains should just rewrite to the proper ones though ;] [17:40:03] i never know what to think about that, part of me thinks people should be alerted when they mistype [17:40:20] well, that would require more work =P [17:40:24] true [17:40:32] but it would be nice if they had some banner along the top saying they typed it in wrong [17:40:33] redirect them to zombo.com [17:40:36] then pass it to the correct page [17:40:50] hm. These instructions tell us how to create a new slave from an existing slave and have them both read from the same master. This time's a little different; we want to create a new slave that reads from the host from which we're creating the slave (so that we get the nice tree replication). [17:40:51] so maplebed, let's pretend es1001 is properly the intermediate master [17:40:52] give them the info without adversing affecting their user experience [17:41:01] ah ok [17:41:13] maplebed: so if i want to fix es1004, i would use es1004 as Host B, and es1001 as Host A [17:41:25] i was going to ask if we should stop es1001 or if we should stop es1003 since it's not the intermediate master and then it would have less of an effect ? [17:41:27] mutante: yes. [17:41:32] !log started enwiki.revision sha1 migration on db53 [17:41:34] Logged the message, Master [17:41:46] robh I'm still trying to figure out the appropriate svn syntax to compare :-( [17:41:53] svn diff [17:42:02] oh duh [17:42:04] thx [17:42:27] ok, looks fine to me [17:42:31] heh, git or nothing! [17:42:36] LeslieCarr: that would be what we should do if es1003 was reading from es1001. Since it's reading from es3, we can't. It's really hard to take a snapshot from one host and hook it up to slave off a different host (the slave logs are not the same across different nodes in a cluster). 
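Concretely, once a broken slave has been rebuilt from an es1001 snapshot, it gets pointed back at es1001 with the standard CHANGE MASTER TO statement, using the coordinates recorded from es1001's 'show master status'; the log file, position and replication account below are placeholders, not real values:

# on the rebuilt slave (es1002/es1004), once the copied data is in place and mysqld is running
mysql -e "CHANGE MASTER TO
            MASTER_HOST='es1001',
            MASTER_USER='repl',                      -- placeholder replication account
            MASTER_PASSWORD='********',
            MASTER_LOG_FILE='es1001-bin.000123',     -- 'File' from show master status on es1001
            MASTER_LOG_POS=456789;                   -- 'Position' from the same output
          START SLAVE;"
mysql -e 'SHOW SLAVE STATUS\G'                       # both threads should report Yes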
[17:42:44] RobH: or cvs [17:42:50] okay [17:42:50] Jeff_Green: excelelnt, so if it crashes the site you can join me in blame ;] [17:42:52] hehe [17:42:52] somehow I completely missed svn in my history [17:43:11] RobH: that would be my proper initiation I suppose [17:43:16] so we looked at 'show slave status' which showed us information about the current host and its master. [17:43:27] Jeff_Green: this file is what caused the single largest outage by me in my history here [17:43:32] i had a . where i needed a . [17:43:36] the command that shows us information that children of the current node will need is 'show master status\G'. [17:43:38] hehe [17:43:42] hrm, trying to figure out a good way to put that in the document [17:43:43] it redirected all pages to en.wikipedia.org, including en.wikipedia.org [17:43:51] and corrupted the entire caching layer [17:43:53] that's pretty spiffy [17:44:05] before it even synced out to more than 1/3rd the apcahes, i had caught it, but was too late =[ [17:44:14] that was a hell of an outage. [17:44:18] oh you know, now that we're talking about it I wrote an http-fetching tool for testing redirect changes [17:44:21] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 303 seconds [17:44:24] LeslieCarr: it'll be tough without getting too many branches and 'if this then that if that then ...' sprinkled throughout. [17:44:30] Jeff_Green: oh? shall we use it before i push this? [17:44:32] ok [17:44:36] it was on fenari, I wonder if it still works . . . looking [17:44:59] RobH: we might be able to use it to test against the staging host [17:45:06] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [17:45:12] I've completely forgotten how it works. one sec [17:45:14] ok, so the change to this procedure is that when we take the snapshot we should record the output of both 'show slave status' and 'show master status' for use later on. [17:45:17] i used to push this stuff to test.w.o host and test via telnet. [17:45:28] but this seems to be an easy enough change that im not too worried [17:45:33] yeah agreed [17:45:35] just worried enough to get a second set of eyes on it [17:45:45] i think we are good, we will know in a moment [17:45:48] i had a more complicated one and was worried [17:45:58] mutante: LeslieCarr who wants to go first? We can't both take lvm snapshots simultaneously. [17:46:02] !log pushing out redirects change to apaches for wikipedia.org/com.il redirect to he.wikipedia.org [17:46:04] Logged the message, RobH [17:46:09] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 411 seconds [17:46:17] can i go first ? [17:46:33] maplebed: i was going to ask , so just one of us has to do the stuff on Host A [17:46:33] would you start up a screen session so that we can shoulder surf? [17:46:40] mutante: yes. [17:46:41] !log gracefully restarting apaches [17:46:43] Logged the message, RobH [17:47:00] yes, go ahead Leslie.. 
and screen [17:47:21] ACKNOWLEDGEMENT - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 428 seconds asher migration [17:47:34] ok, new screen session made on bast1001 as root [17:47:39] bleh, hard to test my change when the dns propagation isnt done for the new nameservers [17:47:43] ctrl+j is my hotkey [17:47:47] but it seems the rest of the site isnt down, so i assume its all good [17:47:51] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 411 seconds asher migration [17:47:56] !log i didnt crash the site, weeee [17:47:57] Logged the message, RobH [17:48:05] and 176x50 right now [17:48:08] RobH: on a positive note, DNS is another great opportunity to kill the site! muwhwhwhhahahah [17:48:18] yes, one i have avoided until now [17:48:40] mark and ryan, not so much ;] [17:48:59] maplebed: how long will it likely take to copy a full es snapshot to a new host these days? [17:49:13] binasher: it's been a while but I think about a day. [17:49:39] ok, thats not too bad [17:49:40] mutante: can you bump your window size up to 176x50? [17:50:57] at some point i'd like to move ES inserts to a new shard on a different set of hardware [17:51:16] new hardware? [17:51:22] maplebed: better? [17:51:38] mutante: you're at 168x43; [17:51:45] we can shrink to that size if it's as large as you can get. [17:52:25] binasher: the current hardware's at 58% storage capacity [17:52:32] 192 x 46 [17:53:05] ok, went down to 176 x 46 [17:53:07] 192 x 50 [17:53:14] hrmp:) hey sry about that [17:53:39] oomph. this hsouldn't be so hard. [17:53:59] ok, back to 176x50 [17:54:37] almost there! [17:55:35] mutante: you seem to be having a really hard time resizing. do you watn us to just match you instead? [17:56:05] i get to 176 x47 , but not x50 [17:56:11] ok [17:56:13] i'll go to x47 [17:56:15] :) [17:56:20] ok [17:56:26] 176x47, all on the same page ? [17:56:31] yes [17:56:36] yay [17:57:11] ok. rock on. [17:57:29] ok, so now we start the taking a snapshot steps ? [17:57:56] yup. [17:58:42] 0 rows affected is correct ? [17:58:50] a bit of mysql background; there are two replication threads (io and sql). the io thread is responsible for copying content from the master, the sql thread is responsible for executing the queries. [17:58:55] (correct) [17:59:18] by stopping the io thread, we're telling the host to stop fetching new queries and then letting the sql thread work through any backlog of queries that might exist. [17:59:24] cool [17:59:32] so that was easy because it looks like there was no backlog ? [17:59:42] both log positions were the same to begin with and haven't changed ? [17:59:47] this gives us a nice stable unchanging starting point from which to take the snapshot. [17:59:57] if there was any backlog it was probably sub-second. [18:00:05] cool [18:00:15] what does flush tables do ? [18:00:30] flushes any buffers that have changed data to disk. [18:00:38] cool [18:00:42] also, now i am thinking of https://xkcd.com/327/ [18:00:43] :) [18:00:53] since we're taking a filesystem snapshot, any data that's in mysql's memory but not written to disk wouldn't get captured. [18:00:58] (is that little bobby tables?) [18:01:00] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 322 seconds [18:01:01] yep :) [18:01:18] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 336 seconds [18:01:54] I guess it's probably ok to remove the snapshots from 2011-11-07. 
[18:01:55] :P [18:02:09] ok [18:02:40] ok and we sync because we want to be super sure that everything is on disk ? [18:02:47] yup. [18:02:56] the flush tables is mysql -> filesystem, the sync is filesystem -> disk. [18:03:29] run the full lvdisplay just to see what it is we're pulling from. [18:03:46] cool [18:03:57] so where does lvcreate put this snapshot ? [18:04:07] in special lvm space. [18:04:14] ok [18:04:17] it's not part of the filesystem; there won't be a path to it. [18:04:22] gotcha [18:04:28] well, except for the /dev/mapper thingy that allows you to mount it. [18:04:30] ok, now we start the slave back up ? [18:04:34] so magic land [18:04:44] one sec. [18:04:53] maplebed: i don't think we need to wait for the current cluster to fill before moving writes or adding new shards. [18:04:53] we need to catch the master position as well as the slave postiton. [18:05:13] btw, there is a debian package mylvmbackup that does this sequence [18:05:15] so capture the output of mysql -e 'show master stauts\G' [18:05:18] ah [18:05:21] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 1 seconds [18:05:30] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [18:05:40] binasher: quite true. [18:06:10] ok, now we can restart the slave. [18:07:05] mutante: I was looking at that yesterday. There's also 'snaprotate.pl' which is something we use for backups on the rest of our databases; it does something similar. [18:07:10] can i just do mysql -e 'start slave;' ? [18:07:17] yup. [18:07:59] ah /mnt/snap exists… from the last snapshot i guess ? [18:08:10] I didn't put any of the verification steps in the doc, but if you run show slave status now it should show you that slaving is working again. [18:08:30] huh. yeah, /mnt/snap is probably from the lats time. See if anything's mounted there. [18:08:40] cool, the slave_io and slave_sql are running and 0 seconds behind master [18:09:03] looks like nothing mounted there [18:09:10] ok, carry on. [18:09:56] ok, gonna start the big rsync :) [18:10:10] is the directory on the target empty? [18:10:16] and is there enough space on the disk? [18:10:47] there is some stuff in /a and there's a lot of space [18:11:12] you're going to want to fill sqldata; what's int eh copy on 1002? [18:11:48] looks like nothing [18:12:11] sweet. [18:12:18] should i exit out and actually do this in a screen session on es1002 itself ? [18:12:20] the command ? [18:12:28] up to you. [18:12:43] ok i'm going to do that [18:12:58] did not see you settin $fs= [18:13:01] rsync -avP es1001:/mnt/snap/ /a/ [18:13:29] did it on es1001 :) [18:13:45] ok, screen running on es1002 with rsync [18:13:59] so now we just wait a while ? [18:14:26] also, so let's say while you're gone es1002 suddenly gets really far behind, what do i do ? [18:14:33] maplebed: [18:14:49] how long does the whole procedure take? [18:14:51] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&tab=ch&vn=&hreg[]=es100 [18:15:10] apergos: it took us about 30m on the front end, probably 30m on hte finishing end, and 1-2 days wait in between. [18:15:20] uugghh [18:15:28] not sure if I'm down for the hour right now [18:15:44] apergos: no problem. save backsrcoll and do it while I'm asleep. ;) [18:15:47] am I liable to break anything if I try it "followingthe docs" tomorrow? 
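Pulled together, the sequence just run on es1001, plus the copy onto es1002, looks roughly like this; the rsync line is verbatim from the session, while the volume group, snapshot size and mount step are assumptions:

# on es1001 (host A)
mysql -e 'STOP SLAVE IO_THREAD;'     # stop fetching from the master; the SQL thread drains any backlog
mysql -e 'SHOW SLAVE STATUS\G'       # wait until Read_Master_Log_Pos and Exec_Master_Log_Pos match
mysql -e 'FLUSH TABLES;' && sync     # mysqld buffers -> filesystem, then filesystem -> disk
lvdisplay                            # see what we are pulling from
lvcreate --snapshot --size 50G --name snap /dev/es1001/sqldata   # VG/LV names are placeholders
mysql -e 'SHOW MASTER STATUS\G' > /root/es1001-master-status     # coordinates the new slave will need
mysql -e 'SHOW SLAVE STATUS\G'  > /root/es1001-slave-status
mysql -e 'START SLAVE;'              # replication resumes while we copy the frozen snapshot
mount /dev/es1001/snap /mnt/snap     # via the /dev/mapper device in practice

# on es1002 (host B), with mysqld not running and /a empty
rsync -avP es1001:/mnt/snap/ /a/

# afterwards, back on es1001: the snapshot only exists for the duration of the copy
umount /mnt/snap && lvremove /dev/es1001/snap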
[18:16:00] yeah I was going to copy the log for sure [18:16:31] there is one deviation we took from teh docs (recording the output of 'show master status') [18:17:08] ok, but you chatted about that in here right? [18:17:10] and apergos since the snap currently exists for this copy, yeah, you could break stuff by following them tomorrow (unless you name your snap something different from 'snap'. [18:17:15] apergos: yup. [18:17:18] just highlighing it. [18:17:21] what name would you like? [18:17:46] any you want. it only exists while the copy is taking place; you delete it at the end. [18:17:51] ok. [18:17:52] but since you offer... [18:17:55] "apersnap" [18:18:04] it sounds catchy. [18:18:12] *eyeroll* [18:18:13] ok then [18:18:16] lol [18:18:40] saved. I'd prefer to do it when I have more brain cells awake [18:20:09] hrm, that makes me think of aprihop [18:20:14] which is delicious delicious beer [18:20:19] and regarding es1004, should it run at the same time or wait anyways [18:20:21] apergos: you should poke at es1003. [18:20:24] http://www.dogfish.com/brews-spirits/the-brews/seasonal-brews/aprihop.htm [18:20:28] mutante: I think they can run in parallel. [18:20:57] ok. duly noted [18:21:36] maplebed: so then i will just start the rsync part , on host B = 1004 [18:21:43] +1 [18:21:55] mutante: do the disk checks there too; I'm gonna bet it's close to full. [18:23:19] maplebed: yea, / has 20K avail :p [18:23:23] /a [18:23:36] nuke everything there. [18:23:36] it's all way old. [18:25:26] bleh, apache2 ... [18:25:42] whats the directive to set it to load a particular cgi script as its docroot index... [18:26:03] RECOVERY - Disk space on es1004 is OK: DISK OK [18:26:48] RECOVERY - MySQL disk space on es1004 is OK: DISK OK [18:26:52] :) [18:27:11] \o/ [18:27:26] maplebed: MySQL master status error on es1001 [18:27:32] wazzup wit dat ? [18:27:42] !log nuked /a contents on es1004, started rsync from es1001 [18:27:44] Logged the message, Master [18:28:21] yeah that. [18:28:54] mysql slaves are supposed to be read-only, masters read-write. [18:29:01] RobH: ping [18:29:12] ? [18:30:00] pretty, looks like es1001 is pretty much maxed out bw [18:30:56] es1001 is a slave, so should be read-only, but is also the middle-master, so should alert more forcefully when it dies. [18:31:08] I expect I just did it wrong and es1001 really shouldn't be albeled as a master. [18:31:27] PROBLEM - MySQL replication status on es1004 is CRITICAL: (Return code of 255 is out of bounds) [18:34:03] I love seeing network graphs peg at 100MBps [18:34:14] oh, shouldnt mysql processes be stopped on es1004 during the rsync [18:34:25] mutante: yes. [18:34:48] oh should i stop mysql on es1002 ? [18:35:00] it's one of the facebook ones though, not init.d/mysql [18:35:05] LeslieCarr: it wasn't running when we started. [18:35:09] oh yeah [18:35:10] :) [18:35:18] mutante: I'd just kill -9 the thing then restart the copy. [18:35:28] (in case mysq tries to munge any of the files in its death throws. [18:35:29] and note it on the doc ? [18:35:33] yes please. [18:35:57] mutante: would you be available tomorrow to mess with gallium ? :D [18:36:10] preilly: http://wikitech.wikimedia.org/view/Reprepro [18:36:14] mutante: as I understand it it is like the middle of the night for you [18:38:24] maplebed: LeslieCarr , done. killed, restarted, added to docs [18:38:31] rockin. [18:38:54] huzzah [18:38:58] hashar: yes, we can do gallium tomorrow. 
yes, kind of late, but its like a live tutorial :) [18:40:37] RobH: you may need DirectoryIndex , Options +ExecCGI and/or ScriptAlias [18:40:58] thx, i found someone who made a tutorial for just what i want a fwe minutes ago =] [18:41:03] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [18:41:08] but since you foolishly spoke up, if i hit a wall im pinging you bwahahaha [18:41:17] wah:) [18:41:17] mutante: ^_^ [18:41:20] heh [18:43:00] PROBLEM - Puppet freshness on db9 is CRITICAL: Puppet has not run in the last 10 hours [18:43:03] maplebed: so i will just look at this rsync from time to time and if it would fail just restart until i eventually got all the data or somebody tells me es1001 is too maxed out ?:p [18:43:14] robh: nagios reported 2 warnings less than 10 hours ago on ps1-d2-pmtpa and ps1-d3-pmtpa...the warning and the input A on the phase doesn't make sense [18:43:24] yeah, sounds about right. [18:43:29] alright [18:43:31] I'm pretty sure es1001 will do just fine. [18:44:04] mutante: enjoy the tutorial and ping me tomorrow whenever you are around :) [18:44:38] cmjohnson1: hrmm, gimme a moment, middle of another task, i can take a look in a few minutes [18:44:42] will ping you when i have a moment [18:44:55] k [18:46:21] ok, maplebed, i guess continue in ~ days at this point then? thanks for the session [18:46:26] ~ 2 days [18:46:50] yeah, we should check up on it periodically and make sure rsync didn't die, [18:47:01] but we can do the slave starting process friday morning. [18:47:10] (pacific time) [18:47:37] LeslieCarr: Do you have a moment for a couple of vlan assignments? https://rt.wikimedia.org/Ticket/Display.html?id=2768 [18:47:57] yep, i'll check rsyncs [18:48:44] hrm, 1 kitten per vlan assignment [18:48:51] LeslieCarr: its for preilly [18:48:56] so take the kitten out of him. [18:49:03] hehe [18:49:50] LeslieCarr: please don't [18:50:03] oh sweet, i already did the ip assignments for these when i allocated them [18:50:07] past me is really nice to future me. [18:50:19] LeslieCarr: I mean please do the change just don't take my kitten out [18:50:32] RobH: sounds like a paradox to me [18:50:35] preilly: everyone is born with three kittens, if she does you still have one left. [18:50:52] :) [18:51:10] !log moving ru, nl, pl, pt, zh, and sv search to eqiad [18:51:11] past rob is usually pretty nice to future rob, except for the entire diabetes, being a fat kid kinda thing [18:51:11] Logged the message, notpeter [18:51:17] * preilly — just had a very weird chill go up my spine  [18:51:41] past rob is kind of a jerk where health is involved. [18:51:57] professional, but still a jerk. [18:52:41] RobH: those two are already completed [18:52:50] yay past LeslieCarr ! [18:52:54] hehe [18:53:06] preilly: past preilly owes past LeslieCarr a couple kittens [18:53:16] thats enough of that confusion, to the present! [18:53:16] \ [18:53:29] preilly: so i am allocating the boot info and starting their installs now [18:53:33] Jeff_Green: ok, traffic on search7, the big pool3 host is dropping [18:53:38] RobH: thanks! [18:53:51] notpeter: yay! [18:54:17] notpeter: nice logline there! [18:54:20] RobH: I'm looking at http://rt.wikimedia.org/Search/Simple.html?q=osm and see that two of the tickets are waiting for approal [18:54:22] can anyone who speaks the following languages make some searches? ru, nl, pl, pt, zh, and sv [18:54:34] RobH: what can i do to help those along ? [18:54:41] woosters: --^ [18:54:47] you mean the purchases? 
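A minimal Apache stanza along the lines mutante suggested above (DirectoryIndex plus ExecCGI), written out as a shell heredoc; the site name, paths and script name are invented for illustration:

cat > /etc/apache2/sites-available/cgi-index <<'EOF'
# serve a CGI script as the document-root index
<VirtualHost *:80>
    ServerName stats.example.org
    DocumentRoot /srv/cgi-app
    <Directory /srv/cgi-app>
        Options +ExecCGI
        AddHandler cgi-script .cgi
        DirectoryIndex index.cgi
    </Directory>
</VirtualHost>
EOF
a2ensite cgi-index && apache2ctl graceful

ScriptAlias is the alternative when the scripts live outside the document root.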
[18:54:53] ct has all the info to discuss with erik today. [18:55:12] there is nothing anyone else can do at the moment. [18:55:30] once they are approved, the quotes are good for us to place the orders for both ashburn and tampa [18:55:40] ya tfinc [18:55:51] ok [18:55:53] there are no pending quotes yet for esams, as there is still the unasnwered question pertaining to the legality of hosting more than caching in esams [18:55:57] there are 6 rt tickets on osm h/w purchase [18:55:57] i'll check with ct later today then [18:56:59] our dhcp lease files are a goddamn mess [18:57:00] prtugese is... returning some kinda results [18:57:09] i really don't feel like fixing them now, but even editting them hurts my soul. [18:57:49] same with it [18:58:32] hrmm, silver is returning bogus info for mac in drac....wtf [18:58:47] will have to pull the hard way [18:58:59] tfinc: would you be willing to make a couple of searches in polish wikipedia to let me know if it's returning reasonable results? [18:59:25] notpeter: searched for Scheveningen on .nl . works [18:59:34] notpeter: hold the osm order hostage, you have my permission ;] [18:59:37] notpeter: sure, how can i test ? [18:59:53] tfinc: just use the search box on pl.wikipedia.org [19:00:00] and if it returns something reasonable, win! [19:00:05] mutante: sweet! [19:01:01] huh, that wasnt bogus mac [19:02:48] notpeter: "Smörgåsbord" on sv. looks ok as well, but thats about all i know , heh;) [19:02:58] k, laters [19:02:58] heh [19:03:02] thank you! [19:03:49] zhwiki is returning... results [19:03:55] I have no idea if they're correct or not [19:04:40] !log dns update for zhen mgmt [19:04:42] Logged the message, RobH [19:05:00] tewwy: hi, since i see you are tychay, your wikitech account has been renamed from tchay, as requested. [19:05:06] * mutante out [19:07:35] woosters: I was going to leave pool3 in eqiad for a while, then also add pool too, and then do the remaining pools tomorrow. does that sound good? [19:08:20] which ones did you switch over again? [19:08:25] * apergos does the backread [19:08:48] ok, none with a language I know [19:09:25] tfinc: fyi https://rt.wikimedia.org/Ticket/Display.html?id=2676 [19:09:29] heh [19:09:30] that is the master ticket for all the orders [19:09:41] then each order should have its last status on the title of the ticket [19:09:47] and noted in said tickets [19:11:33] thanks [19:11:43] welcome [19:12:25] what the ticket doesnt note is there is no movement on esams orders [19:12:35] someone needs to clear with legal the issues of hosting data off us soil [19:12:51] if its ok, then mark would handle the procurement of those hosts, as he does for all esams hardware [19:13:02] if not, then caching only, and same person to handle it [19:14:52] preilly: So silver has 1TB disks and zhen 250GB [19:15:21] but i am setting them up same, in that they have a single / raid1 partition, etc... [19:15:40] if you think you need more than 250GB then we can look at purchasing larger disks. [19:16:17] RobH: okay, I think it'll be fine for now [19:18:24] awjr: we may want to clear the cache as well for this change right? [19:20:32] binasher: would you mind flushing the varnish cache? [19:23:12] Ryan_Lane: Are you using both labstore1 and labstore2 at this time? [19:23:40] I ask because the power in that rack is wildly unbalanced and we will need to cause some downtime to correct the issue. [19:23:45] the joys of nonredundant power. [19:24:10] hey maplebed: shall we give the filters another shot? 
[19:40:10] LeslieCarr: are you our resident nagios expert these days? [19:40:28] nagios is reporting invalid information for ps1-d2-sdtpa, and its just not polling for new info. [19:40:33] not sure why =/ [19:40:43] i know its annoying me to see it do it though [19:41:28] preilly: i forgot to ping you, those installs are done [19:45:12] New patchset: Hashar; "jenkins: minimal git CLI config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4295 [19:45:21] that change is already in production [19:45:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4295 [19:47:52] ARGH [19:48:02] RobH: ummm [19:48:02] my firmware upgrade reset the temp threshholds for a bunch of power strips [19:48:06] RobH: yes [19:48:06] blaaaaaaaaah [19:48:15] Ryan_Lane: well, we have to schedule some downtime then [19:48:17] RobH: you should only take one down at a time [19:48:27] we shouldnt have to take both down anyhow [19:48:32] hopefully [19:48:39] so can one just gracefully be shutdown? [19:48:40] and I'll need time inbetween to ensure the validity of the filesystem after you do it [19:48:51] ok [19:48:57] New patchset: Jgreen; "adding mhernandez account to grosley/aluminium, sudoers cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4296 [19:48:59] but yes, it can just be shutdown [19:49:04] cmjohnson1: On the ticket for rebalancing c3-sdtpa, please add ryan as a person on ticket [19:49:10] oh [19:49:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4296 [19:49:13] as he will need to be around after its brought online [19:49:16] I need to make that rrdns entry first [19:49:16] ok [19:49:27] and will also be the person to shut it down it sounds like ;] [19:49:39] cmjohnson1: you may have to do that rack in a few steps, ryan for the labs storage downtime [19:49:44] RobH: okay, coolio [19:49:44] ben for the swift, etc... [19:49:48] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4296 [19:49:51] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4296 [19:49:54] though swift should be fault tolerant [19:50:00] Ryan_Lane: make yer shit like swift man [19:50:06] maplebed: ^ [19:50:13] well, swift has 3 copies [19:50:18] gluster has two [19:50:20] I can make it three [19:50:25] then why didnt you ask for three ;] [19:50:30] but it'll eat up *way* more space [19:51:15] how the hell does servertech firmware leave all my settings in place, as it should, but reset my temp threshholds [19:51:16] damn them [19:56:19] hrmm, it didnt, observium is reporting invalid threshhold data. [19:56:22] bleh. [20:07:20] Jeff_Green: ok, I was going to throw in pool2. sound reasonable to you? [20:07:33] want to bang on it for a couple of minutes? [20:07:40] yeah we should IMO [20:07:41] robh: I can move the power on c3 w/out down time...the servers are redundant...they will just alarm...I will not move anything putting it out there first [20:07:55] ja [20:08:03] cmjohnson1: hrmm, lemme look at it real quick [20:08:33] cmjohnson1: when i look at the power mgmt page, it doesnt denote that its dual feed [20:08:40] but you can confirm now its a dual feed power strip? 
[20:09:20] no the power strip is a single feed...the servers have dual power supplies...i can move one side at a time so I don't have any down time on the server [20:09:42] arent the servers using y power cables? [20:10:11] so you would pull one, put it on its own power cable, move the y, then move the power cable off and the y back on the second psu? [20:10:19] yes....you got it [20:10:26] that would work, but you ahve to do it pretty slowly [20:10:35] i can handle that [20:10:37] cuz the psu will need a few moments to go fully online again [20:10:55] better than taking anything down [20:11:01] just time consuming [20:11:08] agreed on both counts [20:11:21] hrmm, well, you can do that, but we still need to ensure the folks are about who can fix it if it doesnt work [20:11:28] so when you want to move labs, bring ryan into it [20:11:47] the ms-be server can do it without alerting ben specifically, just ensure someone in ops is about [20:12:04] the es servers are able to be handled by maplebed LeslieCarr or mutante [20:12:19] so any of them works for that, atleast i assume so since ben taught them how to fix replication today ;] [20:12:31] so yea, thats a better solution [20:12:38] Ryan_Lane: are you around just in case [20:12:48] I am [20:12:56] please only bring one labstore box down at a time [20:13:09] and wait for me to check the filesystem before starting on the next [20:13:13] he is going to try to move labstore2 and its array to another power feed [20:13:14] without downtime [20:13:19] ah [20:13:19] ok [20:13:24] okay..not planning on bringing anything down [20:13:26] its redundant power using a single y cable, so going to do some cable juggling [20:13:32] sounds good [20:13:33] just ensuring yer here incase it fails [20:13:35] ;] [20:13:36] o.O [20:13:38] the labs are actually single power cables [20:13:44] it's dual power using a single power cable? [20:13:51] cmjohnson1: thats not great =/ [20:13:52] for *storage* nodes? [20:14:03] if its a non reduntant power strip, should use y cables, do you have nay? [20:14:04] any? [20:14:14] if not, continue with those and drop a ticket in procurement for me to orde ryou more [20:14:18] we didn't have space in tampa anymore, remember? [20:14:20] we can move em now [20:14:23] * Ryan_Lane groans [20:14:38] it's labs anyway, it's experimental [20:14:40] it can break ;-p [20:14:41] lol [20:14:45] -_- [20:14:49] mark: whats the company that makes the third hand, minkels? [20:14:53] I don't want to spend hours fixing a filesystem [20:14:59] RobH: sells, anyway [20:14:59] Ryan_Lane: it shouldnt break. [20:15:07] mark: sells? [20:15:10] mark: any chance i can bribe you to get me some more before berlin? [20:15:15] RobH: if you bring down more than one node, it'll break [20:15:16] RobH: perhaps one [20:15:20] I only have 3 [20:15:32] i have the one, so a pair would make it a lot easier to rack stuff [20:15:39] right [20:15:43] i'll get you one more [20:15:46] I only need 2 now [20:15:51] i wish we could find a seller, now that i posted on facebook all my friends are wanting to know where to get them [20:15:51] hehe [20:16:00] minkels sells them [20:16:20] mark: where do you want to move the labs? [20:16:21] but they generally do big orders only, so need to find a reseller [20:16:23] hrmm, but they are to vars right? 
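For context on the two-versus-three copies point above: the copy count is the Gluster volume's replica setting, fixed when the volume is laid out. A three-way replicated volume looks roughly like the following; the volume name, hosts and brick paths are placeholders:

# one brick per storage host; every file is stored on all three
gluster volume create projects replica 3 \
    labstore1:/a/projects labstore2:/a/projects labstore3:/a/projects
gluster volume start projects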
[20:16:26] oooo essex is being released tomorrow [20:16:27] yea [20:16:34] they donated the ones we have [20:16:42] because I contacted them once during a fundraiser [20:16:46] but they won't repeat that I think ;) [20:16:52] Ryan_Lane: I want v6 for labs! :P [20:16:57] baaaah [20:16:58] http://www.minkels.com/index.php?id=373&pageid2=33202&entity_id2=object.catalog.block [20:17:03] it exists in diablo [20:17:20] mark: I'll talk to the openstack people to see how I can add the ipv6 network to an already existing network [20:17:27] ok [20:17:28] that's the reason we don't have it right now [20:17:37] I don't want to muck around in the database :) [20:18:45] RobH: I used it to mount the MX80 in esams recently [20:18:55] but then I had it squeezed in between the mx80 and an ex4200 below it [20:18:58] with no extra space [20:19:01] http://www.compriconshop.nl/artikeloverzicht.asp?g=15_BEHUIZINGEN+ACCES&f=&p=5 sells them [20:19:03] was impossible to get it out :( [20:19:06] but that page doesnt work right [20:19:08] RobH: oh [20:19:09] eww =[ [20:19:11] I'll order some there then [20:19:18] yea would be nice to get chris some too =] [20:19:30] no problem [20:19:33] yay! [20:19:41] order me a set of 3 more if you can [20:19:47] so i have 4 total for the full rack stuff when possible [20:19:53] same for tampa if you would ^_^ [20:19:58] hm [20:20:02] can rrdns use cnames? [20:20:06] but anyhow, just a single one more is awesome, cuz they are fantastic tools [20:20:13] Ryan_Lane: no [20:20:15] :( [20:20:17] someone in the US needs to sell these suckers [20:20:51] so I need to have A records for labstore, and all the labstoreX systems? [20:21:06] cmjohnson1: i dont think mark was saying they needed to move [20:21:11] Ryan_Lane: yes [20:21:13] ok [20:21:21] and no reverse for labstore [20:21:25] mark: so you had to unrack the mx80 to remove them? [20:21:26] ok [20:21:30] RobH: yes [20:21:33] well, loosen it [20:21:33] that sucks [20:21:37] and balance it on my hand ;-) [20:21:46] I should have put the handle in the front [20:21:50] then it probably would have worked [20:21:53] but I didn't think of that in advance ;) [20:21:59] mark: http://pastebin.com/EViiQ944 [20:22:12] the handle on mine was not where i wanted, i removed the screw on it and flipped it about [20:22:25] I wonder if I should call it projectstorage, instead [20:22:26] i have no idea why other companies dont make this thing [20:22:29] its awesome. 
[20:22:37] then call the compute node rrdns instancestorage [20:22:44] Ryan_Lane: that's correct, although normally you just repeat the A record without repeating 'labstore' [20:22:51] ah [20:23:04] * cmjohnson1 moving downstairs to sdtpa [20:23:08] but this is fine [20:23:23] cmjohnson1: k [20:23:37] mark: http://pastebin.com/vDyCZ69Q [20:23:53] Ryan_Lane: yep [20:23:56] cool [20:24:05] I'll do the same for the compute nodes and call it instancestorage [20:24:12] that'll make it more clear what each one ise [20:24:13] *is [20:24:35] binasher: I think tomorrow I'll put a bit of load on the chash director in eqiad [20:24:55] i no longer recall everything i tried to make the ciscos work [20:25:03] i guess i get to repeat it all now and document it =P [20:25:27] Ryan_Lane: this is somehow your fault, i know it [20:25:28] ;p [20:25:36] hahaha [20:25:59] hey, I documented some stuff ;) [20:28:30] mark: that new apache configuration is pretty good [20:28:35] I'm changing a few things around [20:29:04] I'm adding support for redirection, and adding in a simple alias hash, since that's a very common and straightforward thing to do [20:29:12] I'm changing aliases to serveraliases [20:29:47] I started on a mediawiki manifest and realized it's going to make me want to strangle myself [20:31:37] oh I wasn't done yet [20:31:51] was just testing it as a concept [20:31:55] * Ryan_Lane nods [20:39:24] robh: u around [20:39:29] yep [20:39:32] New patchset: Lcarr; "deactivating several config files in icinga for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4302 [20:39:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4302 [20:39:56] ok...i'm going to start w/ ms-fe4 [20:39:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4302 [20:40:01] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4302 [20:40:31] also..hey the lab arrays are dual power supply but the y cables will not work on them...power supplies are on each side [20:41:00] scratch the last ...let's move ryan's labs first [20:47:14] cmjohnson1: ahh, arrays, right [20:47:27] well, only moving one of the two lab servers, and keeping the array with it on whereever circuit it goes [20:47:27] right [20:47:51] Ryan_Lane: ^ [20:48:08] * Ryan_Lane nods [20:48:10] you should be ok to go ahead, just ensure the psu comes online with green lights, then give it another 30 seconds or so [20:48:12] before you move the other [20:48:13] wait [20:48:17] which one are you doing [20:48:26] lemme make a quick change first [20:48:31] labstore 2 [20:48:44] cmjohnson1: also admin log when you pull and replace each plug ;] [20:48:52] gah!!!! now for some horrible reason puppet is writing every host into puppet_hosts.cfg commented out [20:48:53] despite what some say, you cannot admin log too much. [20:49:10] A : wtf ?!?!?! and B : how can i fix this wtf ?!?! [20:52:14] LeslieCarr: isn't there a post-puppet hook that runs on spence and comments stuff out? [20:52:20] cmjohnson1: actually, we're good to go with that server [20:52:21] I vaguely recall something crazy like that. [20:52:31] no, not post-puppet. it's int the init script. 
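Going back to the labstore round-robin entries above: round-robin DNS can't be built from CNAMEs, so the service name simply repeats an A record per backing host (and, per mark's note, the owner name isn't repeated on the second record). Written as a zone-file fragment via a heredoc, with the file name and addresses invented for illustration:

cat >> labs-zone-fragment <<'EOF'
; service name: one A record per host, no PTR for this name
labstore        IN A    10.0.0.101
                IN A    10.0.0.102
; the per-host names keep their own A (and reverse) records
labstore1       IN A    10.0.0.101
labstore2       IN A    10.0.0.102
EOF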
[20:52:37] oh that [20:52:51] yes, we have that in the init script [20:52:51] okay...this will take a few minutes i will ping you after i am finished [20:53:15] good idea, lemme check what exactly that does [20:54:12] !log removing power from top power supply on labstore2 [20:54:14] Logged the message, Master [20:55:53] LeslieCarr: there's a python script for that yes [20:56:14] it's supposed to run after puppet has finished updating all nagios configs, and comments out any resources for nonexistent hosts [20:56:17] there are some small differences, fixing those - maybe that is causing the commenting :) good idea maplebed [20:56:20] (as otherwise nagios won't run) [20:57:29] !log removing power from bottom power supply labstore 2 [20:57:31] Logged the message, Master [20:57:56] New patchset: Lcarr; "Removing differences in how purge-nagios-resources runs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4304 [20:58:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4304 [20:58:36] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4304 [20:58:39] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4304 [21:00:19] !log replacing power cable on labstore1 array psu1 (left side) [21:00:20] Logged the message, Master [21:02:07] New patchset: Ryan Lane; "Use the rrdns entries for the glusterfs clusters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4306 [21:02:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4306 [21:02:37] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4306 [21:02:40] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4306 [21:04:09] !log replacing power cable on labstore2 array psu2 (right side) [21:04:11] Logged the message, Master [21:05:16] ryan_lane: finished with labstore 2...okay to fix labstore1 [21:07:36] yep [21:10:09] cmjohnson1: hold up [21:10:19] k [21:10:23] you moved labstore 1 and its array? [21:10:40] you want to try to keep the array on the same circuits if you can, just for ease of organization [21:10:46] as the server [21:11:02] i moved labstore 2 [21:11:17] and it's array to the same circuit [21:11:24] ok, thats what i get for not paying close attention ;] [21:11:32] its getting a LOT closer [21:11:32] i am going to move labstore1 to a different circuit (and it's array) [21:11:38] hrmm [21:11:47] here is the thing
[21:12:02] Input Feed ID   Input Feed Name   Input Status   Input Voltage (V)   Input Load (A)   Input Power (W)
[21:12:04] AA              TowerA_X          On             208.0               6.75 *           811
[21:12:04] AB              TowerA_Y          On             208.0               3.88 *           466
[21:12:05] AC              TowerA_Z          On             208.0               6.63 *           796
[21:12:09] well, that pasted poorly.
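The puppet_hosts.cfg problem discussed above comes down to the clean-up pass described in the same stretch: a Python script runs after puppet has rebuilt the Nagios/Icinga configs and comments out resources that refer to hosts which no longer exist, so the daemon can still start. The real purge-nagios-resources script is not reproduced here; this is only a simplified sketch of that idea, with the sample config and host list made up for illustration.

    # Simplified sketch of a "comment out resources for nonexistent hosts" pass,
    # in the spirit of the purge script discussed above. The sample config and
    # the set of known hosts are hypothetical; the real script differs.
    import re

    def purge_unknown_hosts(config_text, known_hosts):
        """Comment out any 'define host { ... }' block whose host_name is unknown."""
        out = []
        block_re = re.compile(r"define\s+host\s*\{.*?\}", re.S)
        pos = 0
        for m in block_re.finditer(config_text):
            out.append(config_text[pos:m.start()])
            block = m.group(0)
            name_m = re.search(r"host_name\s+(\S+)", block)
            if name_m and name_m.group(1) not in known_hosts:
                block = "\n".join("# " + line for line in block.splitlines())
            out.append(block)
            pos = m.end()
        out.append(config_text[pos:])
        return "".join(out)

    if __name__ == "__main__":
        sample = "define host {\n  host_name db9999\n  address 10.0.0.99\n}\n"
        print(purge_unknown_hosts(sample, known_hosts={"db1007", "ms-be2"}))

Commenting out rather than deleting keeps the generated file intact for inspection while still letting Nagios load it, which matches the behaviour described in the conversation.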
[21:12:16] the y circuit is where it's going [21:12:18] but, you can see the feeds are close, but Y isnt [21:12:24] but none are JUST one letter [21:12:32] they are all a combination of two of the phases [21:12:38] sorry, not phases [21:12:38] right it will be zy [21:12:39] circuits [21:12:47] !log moving de, fr, and ja search to eqiad [21:12:49] Logged the message, notpeter [21:12:53] i wouldnt move labstores anymnore [21:13:02] as they are huge, they will cause large incremental changes [21:13:20] you moved the one, which popped up Y pretty well [21:13:29] but i would move a non array attached system at this point [21:13:31] well i only have the ms-fe4 and the es boxes [21:13:42] and ms-be something [21:13:50] or you moved that already? [21:13:55] that's already on that circuit [21:14:24] i rather not mess with the es boxes [21:14:44] those actually may be the safest, lemme see if they are replication masters or slaves [21:14:57] if they are slaves, then its no outage on anything to move them. [21:14:59] es1 and 2 [21:16:07] es3 is the master. [21:16:21] es1 looks like its not on a thing =/ [21:16:32] maplebed: is that right? [21:16:33] RobH: what's a thing? [21:16:44] i mean in db.php its not listed for anything that isnt commented out [21:17:03] you mean you want me to make sense with my statements?!? [21:17:06] read my mind! [21:17:08] ;] [21:17:30] huh. [21:17:51] cmjohnson1: i mean, you can move the labstore if you wanna, but with those three phases being so close together, moving it from circuit to circuit i think its going to result in an imbalance, but you are paying closer attention than me [21:18:05] but it seems that moving the labstore and server jumps whatever phase it hits by 3 amps [21:18:22] I expect that's a misconfiguration, but you're right that for now it makes cmj's job easier. [21:18:43] cmjohnson1: not disagreeing with you, just presenting the question =] [21:18:55] and you can move es1 via the power cable swap method all you want [21:19:00] its not even pooled =] [21:19:08] ie: its the perfect host to move [21:19:12] it would, however, get angry if it was not shut down cleanly. [21:19:25] well, it shouldnt shut down the way he is moving them [21:19:25] i am going to move es1 [21:19:32] but since its power related, there is always the chance of mistake [21:19:42] it would be the easiest and safest [21:19:49] hence we hope for best, prepare for worst =] [21:20:16] standby to standby [21:20:20] if you guys didnt know the worse would be some crazy butterfly affect where chris trips, and the entire building burns down. [21:20:51] which is less of a fear for chris, and more for his predecessor. [21:20:54] ;] [21:22:20] !log replacing power cable to psu1 (top) es1 [21:22:22] Logged the message, Master [21:23:30] binasher: you renamed a server professor and didnt tell the inventory mgmt folks or on site folks! [21:23:40] doh [21:23:43] as such LeslieCarr will be kicking you in the shin when you least expect it. [21:24:07] live in fear! [21:24:23] * binasher nods [21:24:26] cuz yea, we have to relabel it, update racktables, and relabel the port on the switch [21:24:36] and kill old dns entries, and update mgmt and normal dns [21:24:39] !log replacing power cable to psu1 (bottom) es1 [21:24:40] Logged the message, Master [21:24:47] RobH: I bet you're a fan of slapsgiving, aren't you? [21:25:01] only in theory when remote. 
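Deciding whether es1 and es2 were safe to move rests on a standard MySQL check mentioned above: a replica answers SHOW SLAVE STATUS with a row, while a master (or an unpooled standalone box) returns nothing, and db.php supplies the pooling state. The sketch below only shows the shape of that check, assuming the PyMySQL driver and placeholder hostnames and credentials; the actual check was likely run straight from the mysql client.

    # Sketch of the "is this box a replica or a master?" check mentioned above.
    # Hostnames and credentials are placeholders; the real check may simply
    # have been run from the mysql command line.
    import pymysql   # assumed available: pip install pymysql

    def is_replica(host, user="check", password="secret"):
        """Return True if SHOW SLAVE STATUS reports a replication thread."""
        conn = pymysql.connect(host=host, user=user, password=password,
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                return cur.fetchone() is not None   # empty result => not a replica
        finally:
            conn.close()

    if __name__ == "__main__":
        for host in ("es1.example", "es2.example", "es3.example"):  # hypothetical
            print(host, "replica" if is_replica(host) else "master or standalone")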
[21:25:23] hearing it happen in other families is hilarious [21:25:47] notice i said LeslieCarr would be kicking him, cuz slapping is somehow the most demeaning form of physical assault. [21:26:01] i mean, how dismissive is it of binasher if we slapped him? [21:26:01] unless its with a glove [21:26:14] well, that would be like throwing down to duel though [21:26:28] i dont want you dead, you just renamed a server wrong ;] [21:26:37] the fear of the shin kicking is the true punishment. [21:26:39] i'll slap you back for every db server i'm handed with the raid misconfigured, mind you [21:27:04] sounds like you wont be getting me to do your os installs anymore ;] [21:27:30] robh: we're better balanced and not receiving any warnings [21:27:32] which, btw, preilly is totally gonna corner you to help him with his smssomethingorother servers [21:28:18] cmjohnson1: cool, the alarm triggers when they are 20% or more out of sync [21:28:30] so just ensure the largest difference is like 18% or less [21:28:36] so when things hit peak they wont flux too wildly [21:28:42] though i guess we are in peak now... [21:28:47] so \o/ [21:29:10] okay...that was my error for installing them like that...assumed that rack was going to fill up faster [21:29:13] binasher: just dont mention to him he isnt root, he is sensitive about it ;] [21:29:25] cmjohnson1: me too =] [21:29:31] good work [21:29:51] sms servers. uh huh. [21:30:25] dont go giving anyone sudo on crazy crap, no matter what kinda bedroom eyes he give syou ;] [21:30:44] plus if you spend too much time wiht him ryan will think you are trying to steal him. [21:31:07] with who? [21:31:20] why am i not still on an island i'd never heard of two weeks ago [21:31:32] Ryan_Lane: i was telling asher that patrick is yers. [21:31:35] hahahaha [21:31:35] binasher: was it awesome ? [21:31:40] and to keep his hands off yer man [21:32:08] Ryan_Lane: looks like you get to setup some sort of sms servers. those will be great for everyone accessing wikipedia from their phone in 1995 [21:32:11] heh, asher is going through culture shock. welcome back to the trenches man [21:32:38] I'm doing what now? :) [21:34:22] New patchset: Ryan Lane; "Adding another wiki in for our gsoc student" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4308 [21:34:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4308 [21:34:39] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4308 [21:34:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4308 [21:35:32] LeslieCarr: it was a mix of awesome and bizarre. awesome: seeing wild monkeys, having a room with a giant stone hottub, hammocks on the beach. lost in translation: having two people shadow me at all times during the day, even to the point of following me into the bathroom, and not letting me do things like grab a bottle of water by myself [21:36:08] binasher: why were people shadowing you? [21:36:11] really into the bathroom ? [21:37:16] at first i found it creepy and weird but then i realized i was being treated like a boss [21:37:53] LIKE A BAWS [21:39:43] then when we were in the night market they were trying to keep me with them instead of taking pictures, in case i got kidnapped [21:39:50] (presumably by the monkeys) [21:40:05] (which you be awesome) [21:40:14] *would [21:43:10] ransom of one billion bananas ? 
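The rule of thumb quoted above (the alarm fires when the feeds are 20% or more out of sync, so keep the worst difference to roughly 18%) can be sanity-checked against the readings pasted earlier: 6.75 A, 3.88 A, and 6.63 A on the three feeds. One plausible reading of "out of sync" is the spread between the most and least loaded feed relative to the most loaded one; the exact formula the PDU uses is an assumption here, not something stated in the log.

    # Back-of-the-envelope check of feed balance using the readings pasted above.
    # The imbalance formula (spread relative to the highest-loaded feed) is an
    # assumption about how the PDU judges "out of sync", not a documented fact.

    def imbalance_pct(loads_amps):
        """Return (max - min) / max as a percentage."""
        hi, lo = max(loads_amps), min(loads_amps)
        return 100.0 * (hi - lo) / hi

    if __name__ == "__main__":
        readings = {"TowerA_X": 6.75, "TowerA_Y": 3.88, "TowerA_Z": 6.63}
        pct = imbalance_pct(readings.values())
        print("imbalance: %.1f%% (alarm threshold ~20%%)" % pct)
        # ~42.5% for these readings, which is why moving load onto the
        # lightly loaded Y feed helps.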
[21:43:17] New patchset: Ryan Lane; "Change the controller name for labs instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4310 [21:43:30] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/4310 [21:43:44] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4310 [21:44:43] New patchset: Ryan Lane; "Change the controller name for labs instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4310 [21:44:56] gerrit-wm: don't be a dick [21:44:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4310 [21:45:00] \o/ [21:45:18] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4310 [21:45:21] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4310 [21:45:35] Ryan_Lane: it doesnt know how to be anything else [21:45:36] they've branched out from bananas. i heard stories about them breaking into hotel rooms if windows were left open and raiding the minibar - they've learned to open beer cans [21:45:58] puppet sucks [21:46:11] it's no ok to use ${var} in some places [21:46:14] which is fucking stupid [21:48:19] binasher: so they dont like eye contact, and like to drink beer [21:48:42] so many correlations to various folks iknow [21:49:40] isn't intelligent design amazing? [21:49:57] * RobH twitches [21:54:42] mark: still around ? if so, want to chat about bgp stuff for a minute ? [22:13:15] !log flipping all search back to pmtpa (until tomorrow...) [22:13:16] Logged the message, notpeter [22:13:33] New patchset: Bhartshorne; "Revert "Revert "Update udp-filter config to 0.2 Deploy Wikipedia Zero and Teahouse filters Comment out Teahouse filter Change-Id: I4b8f35a7bc71eb740cba01286be46ad4f06a0ff6""" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4313 [22:13:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4313 [22:14:13] New patchset: Bhartshorne; "Revert "Revert "correcting copy/paste typo""" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4314 [22:14:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4314 [22:14:53] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4313 [22:14:56] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4313 [22:15:13] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4314 [22:15:15] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4314 [22:16:31] !log deployed (3rd time's the charm!) udp-filter changes to emery for diederik [22:16:33] Logged the message, Master [22:22:07] PROBLEM - udp2log log age on emery is CRITICAL: CRITICAL: log files /var/log/squid/orange-ivory-coast.log, have not been written to in 6 hours [22:36:12] New patchset: Bhartshorne; "adding teahouse filter for diederik" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4315 [22:36:26] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4315 [22:37:55] http://bits.blogs.nytimes.com/2012/04/04/google-begins-testing-its-augmented-reality-glasses/ [22:38:01] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4315 [22:38:03] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4315 [23:50:38] RECOVERY - BGP status on cr1-eqiad is OK: OK: host 208.80.154.196, sessions up: 9, down: 0, shutdown: 0
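The udp2log alert earlier in this stretch (log files on emery not written to in six hours) is a plain freshness check on file modification times. Below is a minimal sketch of that kind of check; only the six-hour figure and the example path come from the alert text, and the actual Nagios plugin on emery is not reproduced here.

    # Minimal log-freshness check in the spirit of the "udp2log log age" alert
    # above: flag files whose mtime is older than a threshold. The path is the
    # one named in the alert; everything else is illustrative.
    import os
    import time

    def stale_logs(paths, max_age_hours=6.0):
        """Return the subset of paths not written to within max_age_hours."""
        cutoff = time.time() - max_age_hours * 3600
        return [p for p in paths if os.path.getmtime(p) < cutoff]

    if __name__ == "__main__":
        candidates = ["/var/log/squid/orange-ivory-coast.log"]  # from the alert
        stale = stale_logs(candidates)
        if stale:
            print("CRITICAL: log files %s have not been written to in 6 hours"
                  % ", ".join(stale))
        else:
            print("OK: all log files fresh")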