[00:08:02] RECOVERY - HTTP on neon is OK: HTTP OK HTTP/1.1 200 OK - 455 bytes in 0.056 seconds [00:08:15] hey maplebed: shall we give it another shot? i fixed the seg fault, and i ran all configurations on my local machine and they work [00:08:30] drdee: no, not today. [00:08:42] it's past 5pm and I don't think I have the fortitude to try again at the moment. [00:08:59] do you think we could try tomorrow morning? [00:09:10] if you really want to do it now I can rally... [00:09:10] okay, sounds good to me! [00:09:16] no no [00:09:21] tomorrow is good [00:09:25] ok. thanks. [00:10:03] anytime midmorning pacific will be fine. [00:10:14] perfect, see you then [00:20:47] RECOVERY - NTP on neon is OK: NTP OK: Offset -0.04752540588 secs [01:12:02] New review: Tim Starling; "This is not a normal CARP-like consistent hashing algorithm. Instead of hashing the URL together wit..." [operations/debs/varnish] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/4162 [04:33:48] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [04:43:51] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [04:43:51] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [04:58:51] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [04:58:51] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [05:31:54] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:53:12] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:07:00] PROBLEM - Lucene on search11 is CRITICAL: Connection timed out [06:12:37] RECOVERY - Lucene on search11 is OK: TCP OK - 0.006 second response time on port 8123 [06:15:10] PROBLEM - Puppet freshness on search1022 is CRITICAL: Puppet has not run in the last 10 hours [06:32:07] PROBLEM - Puppet freshness on search1021 is CRITICAL: Puppet has not run in the last 10 hours [07:43:59] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [08:39:32] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [08:41:29] PROBLEM - Puppet freshness on db9 is CRITICAL: Puppet has not run in the last 10 hours [09:29:12] New patchset: ArielGlenn; "threading + Popen = fail; switch to multiprocessing instead" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4255 [09:31:46] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4255 [09:31:48] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4255 [09:42:30] New review: Hashar; "Various minor notes." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/3885 [10:06:53] New patchset: ArielGlenn; "task_done() removed for good" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4256 [10:07:40] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4256 [10:07:43] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4256 [10:52:47] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:54:53] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [11:08:47] New patchset: ArielGlenn; "allow for the inclusion of mandatory files for wmf mirrors" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4258 [11:09:50] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4258 [11:09:52] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4258 [11:16:59] !log enabled Renameuser extension on wikitech, renamed tchay per RT request, disabled extension again (it was installed but disabled) [11:17:01] Logged the message, Master [11:47:28] New patchset: ArielGlenn; "allow for rsync of arbitrary files at root of rsync tree" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4260 [11:48:22] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4260 [11:48:24] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4260 [11:58:26] PROBLEM - Host db1007 is DOWN: PING CRITICAL - Packet loss = 100% [12:17:15] New patchset: Hashar; "document fenari symbolic links" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4261 [12:17:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4261 [12:20:03] before I look at db1007, anyone messing with it? [12:21:34] <^demon> G'morning apergos [12:21:39] yo [12:23:59] <^demon> How goes? [12:24:12] slow [12:24:14] slugging along [12:24:23] (workwise) [12:24:40] amazingly depressing (general life-wise) [12:24:42] you? [12:26:08] <^demon> Life's well. Busy, but making progress. [12:26:30] are you buried alive in git? [12:26:52] <^demon> Why do you think I ran off to help Rob in the DC last Wednesday? ;-) [12:27:00] hahaha [12:27:15] for your phyiscal health, of course! and to see the pretty blinky lights... [12:27:28] <^demon> They were pretty. [12:27:35] aren't they though? [12:28:15] ok folks (RobH since I see yr here) db1007 no response on console, no ping, no nada so going to powercycle [12:30:39] i think just like db1047 on March 30 (SAL) [12:31:22] !log rebooted bd1007, it was dead in the water (also no helpful messages on console, bah) [12:31:24] Logged the message, Master [12:31:33] watching it boot up [12:32:32] and does not show obvious errors at all.. then it is like the other [12:32:59] so wat was wrong with db1047? [12:33:52] i dont know. it froze, just like what you said about db1007, powercycled it, came back up, no obvious messages about hardware, syslog just ended in the middle of something unsuspicous [12:33:58] and m ark was like "Rob H will like to hear that" ;p [12:34:35] RECOVERY - Host db1007 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [12:36:14] bah [12:36:19] i added mysql to system startup though [12:36:37] maybe but it'snot running [12:36:43] i need to document how to upgrade firmware so when this happens folks can do it ;] [12:36:55] apergos: on db1047 that is [12:36:59] * apergos passes the buck to robh :-P [12:37:08] wha? [12:37:11] you rebooted it right? [12:37:18] dont pass it to me im eating breakfast! 
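The recovery being walked through for db1007 here (mirroring what was done for db1047 on March 30) boils down to roughly the following; this is a sketch of the steps discussed in the channel, not a canonical runbook, and the power-cycle command is only an example of the kind of mgmt action used:

# no response on console or ping: power cycle via the mgmt interface
# (e.g. 'racadm serveraction powercycle' on a DRAC, or the equivalent for the platform)
# once the box is back up, mysqld still has to be started and enabled by hand:
/etc/init.d/mysql start          # bring mysqld back
update-rc.d mysql defaults       # make it start automatically after the next reboot
mysql -e 'SHOW SLAVE STATUS\G' | egrep 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
# the replication delay alerts clear on their own once Seconds_Behind_Master drains back to 0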
[12:37:19] since you haven't written it up yet [12:37:22] PROBLEM - NTP on db1007 is CRITICAL: NTP CRITICAL: Offset unknown [12:37:28] nah, just reboot it and put it back into service [12:37:31] I rebooted it [12:37:36] its not worth waiting on firmware right now ;] [12:38:25] PROBLEM - mysqld processes on db1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [12:38:43] just start mysql, thats why i added that autostart on the other one [12:39:28] RECOVERY - NTP on db1007 is OK: NTP OK: Offset -0.003657221794 secs [12:40:18] update-rc.d mysql defaults [12:40:29] unless there is a reason for it not to [12:40:41] I don't kow where it is on these boxes [12:41:03] /etc/init.d/mysql start did it on the other [12:41:05] I was expecting it to be in /usr/locall/mysql-something [12:41:45] I hope that's not a messed up version [12:42:28] RECOVERY - mysqld processes on db1007 is OK: PROCS OK: 1 process with command name mysqld [12:42:40] !log started mysqld on db1007 via /etc/init.d/mysql (this doesn't seem to point to a special fb build, and can't seem to find one on this host, what's up with that?) [12:42:42] Logged the message, Master [12:43:02] well, it looked good in the way that first it reported a little replag on nagios, which then disappeared [12:43:29] I mean it will work, it just might not be as awesoem as we like [12:46:40] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 2227 seconds [12:47:13] there it is..and it shouldnt take that long now [12:48:28] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 1766 seconds [12:51:22] !log db1007 - add mysql startup via 'update-rc.d mysql defaults' [12:51:24] Logged the message, Master [12:54:55] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [12:55:13] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [12:56:10] do we have [12:56:11] wrong channel [13:44:36] New patchset: RobH; "updating admins, revoking outdated access, commenting file for easier reference" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4267 [13:44:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4267 [13:45:10] Anyone wanna review that for me or shall I self review? ;] [13:46:19] * RobH hears crickets chirping [13:47:23] New review: RobH; "this isn't self review, one of my many personalities did the work, and another one did the review." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4267 [13:47:26] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4267 [14:14:24] New patchset: Jgreen; "added timers to search api_sweep_test, increased http timeout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4269 [14:14:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4269 [14:15:46] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4269 [14:15:49] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4269 [14:35:38] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [14:45:41] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [14:45:41] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [15:00:41] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [15:00:41] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [15:01:46] New review: Mark Bergsma; "Yes, that's right. I didn't call it CARP because it's actually a bit different. :)" [operations/debs/varnish] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/4162 [15:17:24] just fyi, search testing is being slightly delayed because we're testing before it goes into prod ;) [15:17:48] notpeter: You mean the "testing in production" phase is delayed because we decided to insert a "testing in non-production" phase before it? [15:18:00] novel, I know [15:21:34] cmjohnson1: did the cable rings arrive? [15:21:40] im just doing my procurement review [15:22:12] yes...they arrived [15:22:18] cool, resolving it then [15:22:25] came in late yesterday [15:22:51] messing with the access point w/ Lcarr [15:28:06] !log pointing eswiki search at eqiad [15:28:08] Logged the message, notpeter [15:30:51] Jeff_Green: ok, should be live now [15:31:02] fire in the hole! [15:31:15] I'm totally getting search results on es [15:31:21] now. [15:31:37] how do we verify that they are in fact coming from eqiad? =P [15:32:01] stop lsearchd on the pmtpa host? :-P [15:33:03] or look at ganglia ;) [15:33:54] http://ganglia.wikimedia.org/latest/?r=20min&cs=&ce=&m=&c=Search&h=search14.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [15:33:58] search rate is at 0 [15:35:34] iunno, dude [15:35:48] looks like eswiki search is going to eqiad [15:37:31] nice [15:42:12] Jeff_Green: watching these logs/graphs is boring. let's play hungry hungry hippos [15:42:40] you know we actually own that game. except I like to call it Choke Hazard Game [15:42:53] hahahahahaha [15:43:07] well, then you're going to have to be on "playing the choking hazard game" detail [15:43:39] * Jeff_Green dies [15:44:14] noooooooooooooo [16:02:06] Change abandoned: Hashar; "This is not needed. The Gerrit plugin handle all the job :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2495 [16:05:04] New patchset: Hashar; "move jenkins/gerrit fetcher to integration/jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4283 [16:05:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4283 [16:05:47] Jeff_Green: notpeter: could you possibly review https://gerrit.wikimedia.org/r/4283 and merges it on puppet repo ? [16:06:01] yep, one sec [16:06:02] I am going to manage that file in another git repository (integration/jenkins) [16:08:32] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4283 [16:08:35] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4283 [16:11:37] hashar: ok done [16:11:45] Jeff_Green: great. 
many thanks :-] [16:11:52] np [16:15:41] New review: Catrope; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/3885 [16:16:33] PROBLEM - Puppet freshness on search1022 is CRITICAL: Puppet has not run in the last 10 hours [16:19:15] hi Jeff_Green , maplebed. did you create new gerrit projects before? [16:19:30] i haven't [16:20:19] gerrit projecs? not branches but new projects? no. you might ask reedy or roan? [16:20:24] someone made all the mediawiki projects. [16:20:31] I've made labs projects though... [16:21:06] I have created Gerrit projects before yes [16:21:24] i want some stuff in labs, and it's in operations/puppet , branch test. i am being asked though if this shouldnt be in its own project, no in operations/puppet [16:21:26] You have to do it from the CLI. I don't think Ryan documented this yet [16:22:31] I think Ryan may have made the projects, yea [16:22:38] he assisted on that migration [16:23:20] <^demon> maplebed: I'm the one who made several dozen repos in gerrit :) [16:23:51] good to know. I'll probably forget again by this afternoon. [16:23:54] ;) [16:24:22] ^demon: see, there you go again, volunteering information [16:24:25] you just dont learn ;] [16:24:56] <^demon> Oh see, when we upgraded to 2.3, you can make repos via the web. [16:25:00] <^demon> So it won't be all me ;-) [16:25:41] <^demon> s/upgraded/upgrade/ [16:25:44] <^demon> Hasn't happened yet [16:26:04] mutante: what is it that you're putting in git that shouldn't be in puppet? I'd dubious that creating a new repository within gerrit is the right way forward... [16:26:06] i just like that now all of the devs live in the painful gerrit world that ops lives in [16:26:11] misery loves company. [16:26:16] i just want to import a couple existing .php files from a statstics web app, fix them then, then put them on a labs instance via puppet. not that many files at all. either i just keep it in operations/puppet or ... [16:27:00] A comparable example is our ganglia installation - the php comes from a package and the config comes from puppet [16:27:04] but i wanted the process of fixing them in gerrit and public [16:27:37] i did not add them as puppet file resources right away [16:28:08] if it doesn't want to live in puppet, I'd suggest the operations/software repo (though at the moment that's all our own stuff, nothing backported, I think) [16:29:16] but there's plenty of stuff we just use puppet to deploy, so if it's small I wouldn't sweat it. Of course I'm sure others have different opinions... [16:29:23] i was going to add the files as puppet resources once the worst problems are fixed [16:29:42] instead of building a .deb for example [16:30:00] because the number of files really isnt that large [16:30:43] in that case I would add them to puppet in their original form to deploy them to the host, turn off puppet on the host, fix them, then put the fixed files back in puppet. [16:30:53] that gives you the complete set of changes to the files as one changeset [16:31:17] that's what I did for search qa stuff. felt dirty but it worked [16:32:14] so keep it in operations/puppet test [16:32:35] apergos, mutante: when Leslie shows up we can go over the slaving stuff - there are three of you and three hosts to reslave! How convenient. 
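For reference, creating a new Gerrit project from the CLI as discussed above goes through Gerrit's SSH admin interface; the project name below is a made-up placeholder and the exact flag syntax depends on the Gerrit version in use:

# create the project (hypothetical name)
ssh -p 29418 gerrit.wikimedia.org gerrit create-project --name operations/software/statsapp
# if the code later needs to ride along with puppet, it can be pulled in as a submodule
# (^demon's suggestion) rather than copied into operations/puppet:
git submodule add https://gerrit.wikimedia.org/r/p/operations/software/statsapp.git files/statsapp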
[16:32:48] cool [16:33:20] mutante: oh I didn't bother with that since it felt like I was duplicating revision control efforts [16:33:40] PROBLEM - Puppet freshness on search1021 is CRITICAL: Puppet has not run in the last 10 hours [16:34:20] i just threw a new dir in production/files/blah, added a couple classes to the most relevant manifest (search.pp) and felt like I'd done something wrong :-P [16:34:42] heh [16:34:44] ;) [16:35:42] I like the idea of a production/operations project area, separate from the puppet git repo [16:36:10] but i think we should figure out a way to make that something puppet can find and install from [16:36:37] or we create a little framework to package off of it without much effort [16:38:30] that'd be something fun for the hackathon I suppose? [16:40:05] <^demon> Put them in other repos then pull stuff in as submodules to puppet. [16:40:24] <^demon> :) [16:40:26] sure [16:40:49] <^demon> It's what we're planning on doing for extensions on the wmf branch for deployment. [16:40:56] i, for one, am not sufficiently educated the possiblities of on git+puppet+(gerrit?) integration [16:42:15] at my previous job we predefined an ops git repo all ops folks worked within [16:42:31] i.e. operations/projects/{your project here} [16:43:14] and wrote scripts such that they'd find and package new stuff under ./projects with little effort [16:43:48] maybe it would make sense to do something similar so we don't have to burn in the whole gerrit layer config for new trivial scripts batches etc? [16:44:36] leaving off the packaging part I mean, just have it such that it's easy to point puppet manifests at your new ./project/whatever [16:47:12] Jeff_Green: we have operations/software [16:47:34] though puppet can't pull from it, unless we do something fancier than currently exists. [16:47:40] so we're just missing the gerrit+puppet foo [16:48:16] maplebed: exactly, I opted not to use operations/software because it was a second place to check stuff in, detached [16:48:48] if I could have checked in there and pointed puppet at it directly, I would have chosen it instead [16:49:40] how difficult would it be to pull it in as a submodule complete with gerrit integration? [17:03:56] yea, so if i just merge it to operations/puppet test branch now, will i be killed?;) like "now it's hard to move it again" [17:05:20] apergos LeslieCarr mutante: you're all three here now; whenever you're ready let's talk mysql slaving and replica setup! [17:06:08] ready [17:06:46] can you give me 5 more minutes ? [17:15:07] maplebed: mutante ready now :) [17:16:17] yep [17:17:49] and apergos? [17:17:57] just replied to more git/gerrit access requests coming in via mail now, German "wikidata-dev" team is migrating [17:18:02] ah, maybe too much to try and coordinate all four of us... [17:18:39] well maybe it's better this way; apergos can read backscroll and then there'll be some time shift to the actions. [17:19:10] so, first thing - docs. http://wikitech.wikimedia.org/view/External_storage [17:19:25] take a quick skim through that doc; ping here when done. [17:21:55] ok, skimmed through [17:22:06] i approve of your choice of funny examples :) [17:22:25] best word ever. [17:22:57] ok, at the "Making a new slave .." header [17:24:26] New patchset: Bhartshorne; "current backup scheme doesn't work; removing from es1004" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4285 [17:24:41] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4285 [17:24:43] so which Host A and Host B are we going to use [17:24:51] At the moment, two of the ES slaves in eqiad are broken and the third is slaving from the wrong place. One of them's doing fine. [17:25:07] go ahead and look at all four eqiad hosts and see which is which. [17:25:14] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4285 [17:25:15] eeep [17:25:17] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4285 [17:25:20] es1001-1004 [17:25:21] ? [17:25:25] yup. [17:25:44] you're interested in the slaving health so you want to 'show slave status\G' [17:26:56] New patchset: Jgreen; "disabling ganglia packet loss reporting cron job on oxygen for now, it's not properly configured yet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4287 [17:27:00] es1002 - cant connect to local mysql server .. [17:27:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4287 [17:27:20] es1001 - waiting for master to send event [17:27:56] so 1004 is the only working one ? [17:28:08] The three lines to look for when checking a host for its slaving health (in the output of 'show slave status'): [17:28:09] Slave_IO_Running: Yes [17:28:09] Slave_SQL_Running: Yes [17:28:16] and [17:28:17] Seconds_Behind_Master: 0 [17:28:20] it has "Slave_IO_State: " nothing [17:29:06] es1004 has No for both slave_IO and slave_SQL running and NULL for seconds_behind_master [17:29:19] that translates to 'slaving is not curretnly running.' [17:29:42] i just updated http://wikitech.wikimedia.org/view/External_storage#Nagios.2C_Monitoring.2C_and_Health [17:29:43] es1002 can't connect to mysql and ps confirms that mysql's not even running. [17:29:59] ossm. [17:30:24] so Slave_IO_State: Waiting for master to send event is normal ? [17:30:32] yup. [17:30:42] and no IO_State at all? [17:30:43] that means it's just chillin waiting for new data to come in from the master. [17:31:09] ok [17:31:48] so to summarize - es1001 and 3 are curretnly slaving successfully, 2 and 4 are broken. [17:31:49] so which one is slaving from the wrong place ? [17:31:49] !log dns update for wikipedia.org/com.il being resolved [17:31:51] Logged the message, RobH [17:32:19] !log update done, all nameservers still online [17:32:20] the 'Master_Host' line tells us who the machine is slaving from; [17:32:21] Logged the message, RobH [17:32:27] in this case they're both slaving from es3. [17:32:56] the pretty picture in the wiki page says all hosts within one colo are supposed to slave from within that colo and only one host is supposed to slave cross-colo [17:33:08] so either 1 or 3 is wrong, and it doesn't really matter which. [17:33:21] cool [17:33:50] * mutante nods [17:34:03] the picture says es1001 is the one that's supposed to be slaving cross-colo, so that's the one I'd choose, but the important part is just that there's only one cross-colo link. [17:34:20] I will say it's important not because mysql doesn't replicate well cross-colo, but just because it's differetn from how the docs say it should be. 
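A quick way to run the health check maplebed just described across all four hosts; the loop itself is only a convenience sketch, the field names are the ones from the conversation:

for h in es1001 es1002 es1003 es1004; do
    echo "== $h =="
    # healthy: Slave_IO_Running: Yes, Slave_SQL_Running: Yes, Seconds_Behind_Master: 0
    # Master_Host shows who this node is slaving from (only one cross-colo link expected)
    ssh "$h" "mysql -e 'SHOW SLAVE STATUS\G'" \
        | egrep 'Master_Host|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
done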
[17:34:46] okay [17:35:03] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4287 [17:35:06] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4287 [17:35:10] because they're already dead, let's start with 2 and 4 intsead of worrying about 3. We'll leave that one for apergos. [17:35:12] so do/should we have a special important nagios alarm for if that host goes down ? [17:35:27] (also i have heard the term intermediate master, would that be applicable for 1 ?) [17:35:34] yup. [17:36:04] I think we'll see the mysql heartbeat alarm go off for all nodes in the colo if the intermediate master goes down, but I don't recall. [17:36:09] Anyone want to review my redirects.conf apache change before i push it? [17:36:14] * RobH would appreciate it [17:36:31] RobH: sure [17:36:31] that entire directory is in svn, so easy to review [17:36:42] its /home/wikipedia/conf/httpd/ [17:36:58] ok, let's read through the taking a snapshot and making a new slave sectinos more carefully. [17:37:00] added in wikipedia.org.il and wikipedia.com.il to redirect to he.wikipedia.org (hebrew lang code project) [17:37:13] opps, i see a mistake, durnit! [17:37:23] whee oxygen cronspam slain [17:37:31] ok, mistake fixed [17:37:35] looking [17:38:40] turns out we legally have to do soemthing with resolving the squatter and other domain names we get [17:38:48] simply just taking them over isnt quite good enough ;] [17:38:50] oh yeah [17:38:59] point them at a sour-faced page [17:39:21] heh, well, these are more country specific, so i feel even worse if we dont do something [17:39:35] i guess the typo domains should just rewrite to the proper ones though ;] [17:40:03] i never know what to think about that, part of me thinks people should be alerted when they mistype [17:40:20] well, that would require more work =P [17:40:24] true [17:40:32] but it would be nice if they had some banner along the top saying they typed it in wrong [17:40:33] redirect them to zombo.com [17:40:36] then pass it to the correct page [17:40:50] hm. These instructions tell us how to create a new slave from an existing slave and have them both read from the same master. This time's a little different; we want to create a new slave that reads from the host from which we're creating the slave (so that we get the nice tree replication). [17:40:51] so maplebed, let's pretend es1001 is properly the intermediate master [17:40:52] give them the info without adversing affecting their user experience [17:41:01] ah ok [17:41:13] maplebed: so if i want to fix es1004, i would use es1004 as Host B, and es1001 as Host A [17:41:25] i was going to ask if we should stop es1001 or if we should stop es1003 since it's not the intermediate master and then it would have less of an effect ? [17:41:27] mutante: yes. [17:41:32] !log started enwiki.revision sha1 migration on db53 [17:41:34] Logged the message, Master [17:41:46] robh I'm still trying to figure out the appropriate svn syntax to compare :-( [17:41:53] svn diff [17:42:02] oh duh [17:42:04] thx [17:42:27] ok, looks fine to me [17:42:31] heh, git or nothing! [17:42:36] LeslieCarr: that would be what we should do if es1003 was reading from es1001. Since it's reading from es3, we can't. It's really hard to take a snapshot from one host and hook it up to slave off a different host (the slave logs are not the same across different nodes in a cluster). 
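Concretely, once a broken slave has been rebuilt from an es1001 snapshot, it gets pointed back at es1001 with the standard CHANGE MASTER TO statement, using the coordinates recorded from es1001's 'show master status'; the log file, position and replication account below are placeholders, not real values:

# on the rebuilt slave (es1002/es1004), once the copied data is in place and mysqld is running
mysql -e "CHANGE MASTER TO
            MASTER_HOST='es1001',
            MASTER_USER='repl',                      -- placeholder replication account
            MASTER_PASSWORD='********',
            MASTER_LOG_FILE='es1001-bin.000123',     -- 'File' from show master status on es1001
            MASTER_LOG_POS=456789;                   -- 'Position' from the same output
          START SLAVE;"
mysql -e 'SHOW SLAVE STATUS\G'                       # both threads should report Yes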
[17:42:44] RobH: or cvs [17:42:50] okay [17:42:50] Jeff_Green: excelelnt, so if it crashes the site you can join me in blame ;] [17:42:52] hehe [17:42:52] somehow I completely missed svn in my history [17:43:11] RobH: that would be my proper initiation I suppose [17:43:16] so we looked at 'show slave status' which showed us information about the current host and its master. [17:43:27] Jeff_Green: this file is what caused the single largest outage by me in my history here [17:43:32] i had a . where i needed a . [17:43:36] the command that shows us information that children of the current node will need is 'show master status\G'. [17:43:38] hehe [17:43:42] hrm, trying to figure out a good way to put that in the document [17:43:43] it redirected all pages to en.wikipedia.org, including en.wikipedia.org [17:43:51] and corrupted the entire caching layer [17:43:53] that's pretty spiffy [17:44:05] before it even synced out to more than 1/3rd the apcahes, i had caught it, but was too late =[ [17:44:14] that was a hell of an outage. [17:44:18] oh you know, now that we're talking about it I wrote an http-fetching tool for testing redirect changes [17:44:21] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 303 seconds [17:44:24] LeslieCarr: it'll be tough without getting too many branches and 'if this then that if that then ...' sprinkled throughout. [17:44:30] Jeff_Green: oh? shall we use it before i push this? [17:44:32] ok [17:44:36] it was on fenari, I wonder if it still works . . . looking [17:44:59] RobH: we might be able to use it to test against the staging host [17:45:06] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [17:45:12] I've completely forgotten how it works. one sec [17:45:14] ok, so the change to this procedure is that when we take the snapshot we should record the output of both 'show slave status' and 'show master status' for use later on. [17:45:17] i used to push this stuff to test.w.o host and test via telnet. [17:45:28] but this seems to be an easy enough change that im not too worried [17:45:33] yeah agreed [17:45:35] just worried enough to get a second set of eyes on it [17:45:45] i think we are good, we will know in a moment [17:45:48] i had a more complicated one and was worried [17:45:58] mutante: LeslieCarr who wants to go first? We can't both take lvm snapshots simultaneously. [17:46:02] !log pushing out redirects change to apaches for wikipedia.org/com.il redirect to he.wikipedia.org [17:46:04] Logged the message, RobH [17:46:09] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 411 seconds [17:46:17] can i go first ? [17:46:33] maplebed: i was going to ask , so just one of us has to do the stuff on Host A [17:46:33] would you start up a screen session so that we can shoulder surf? [17:46:40] mutante: yes. [17:46:41] !log gracefully restarting apaches [17:46:43] Logged the message, RobH [17:47:00] yes, go ahead Leslie.. 
and screen [17:47:21] ACKNOWLEDGEMENT - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 428 seconds asher migration [17:47:34] ok, new screen session made on bast1001 as root [17:47:39] bleh, hard to test my change when the dns propagation isnt done for the new nameservers [17:47:43] ctrl+j is my hotkey [17:47:47] but it seems the rest of the site isnt down, so i assume its all good [17:47:51] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 411 seconds asher migration [17:47:56] !log i didnt crash the site, weeee [17:47:57] Logged the message, RobH [17:48:05] and 176x50 right now [17:48:08] RobH: on a positive note, DNS is another great opportunity to kill the site! muwhwhwhhahahah [17:48:18] yes, one i have avoided until now [17:48:40] mark and ryan, not so much ;] [17:48:59] maplebed: how long will it likely take to copy a full es snapshot to a new host these days? [17:49:13] binasher: it's been a while but I think about a day. [17:49:39] ok, thats not too bad [17:49:40] mutante: can you bump your window size up to 176x50? [17:50:57] at some point i'd like to move ES inserts to a new shard on a different set of hardware [17:51:16] new hardware? [17:51:22] maplebed: better? [17:51:38] mutante: you're at 168x43; [17:51:45] we can shrink to that size if it's as large as you can get. [17:52:25] binasher: the current hardware's at 58% storage capacity [17:52:32] 192 x 46 [17:53:05] ok, went down to 176 x 46 [17:53:07] 192 x 50 [17:53:14] hrmp:) hey sry about that [17:53:39] oomph. this hsouldn't be so hard. [17:53:59] ok, back to 176x50 [17:54:37] almost there! [17:55:35] mutante: you seem to be having a really hard time resizing. do you watn us to just match you instead? [17:56:05] i get to 176 x47 , but not x50 [17:56:11] ok [17:56:13] i'll go to x47 [17:56:15] :) [17:56:20] ok [17:56:26] 176x47, all on the same page ? [17:56:31] yes [17:56:36] yay [17:57:11] ok. rock on. [17:57:29] ok, so now we start the taking a snapshot steps ? [17:57:56] yup. [17:58:42] 0 rows affected is correct ? [17:58:50] a bit of mysql background; there are two replication threads (io and sql). the io thread is responsible for copying content from the master, the sql thread is responsible for executing the queries. [17:58:55] (correct) [17:59:18] by stopping the io thread, we're telling the host to stop fetching new queries and then letting the sql thread work through any backlog of queries that might exist. [17:59:24] cool [17:59:32] so that was easy because it looks like there was no backlog ? [17:59:42] both log positions were the same to begin with and haven't changed ? [17:59:47] this gives us a nice stable unchanging starting point from which to take the snapshot. [17:59:57] if there was any backlog it was probably sub-second. [18:00:05] cool [18:00:15] what does flush tables do ? [18:00:30] flushes any buffers that have changed data to disk. [18:00:38] cool [18:00:42] also, now i am thinking of https://xkcd.com/327/ [18:00:43] :) [18:00:53] since we're taking a filesystem snapshot, any data that's in mysql's memory but not written to disk wouldn't get captured. [18:00:58] (is that little bobby tables?) [18:01:00] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 322 seconds [18:01:01] yep :) [18:01:18] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 336 seconds [18:01:54] I guess it's probably ok to remove the snapshots from 2011-11-07. 
[18:01:55] :P [18:02:09] ok [18:02:40] ok and we sync because we want to be super sure that everything is on disk ? [18:02:47] yup. [18:02:56] the flush tables is mysql -> filesystem, the sync is filesystem -> disk. [18:03:29] run the full lvdisplay just to see what it is we're pulling from. [18:03:46] cool [18:03:57] so where does lvcreate put this snapshot ? [18:04:07] in special lvm space. [18:04:14] ok [18:04:17] it's not part of the filesystem; there won't be a path to it. [18:04:22] gotcha [18:04:28] well, except for the /dev/mapper thingy that allows you to mount it. [18:04:30] ok, now we start the slave back up ? [18:04:34] so magic land [18:04:44] one sec. [18:04:53] maplebed: i don't think we need to wait for the current cluster to fill before moving writes or adding new shards. [18:04:53] we need to catch the master position as well as the slave postiton. [18:05:13] btw, there is a debian package mylvmbackup that does this sequence [18:05:15] so capture the output of mysql -e 'show master stauts\G' [18:05:18] ah [18:05:21] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 1 seconds [18:05:30] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [18:05:40] binasher: quite true. [18:06:10] ok, now we can restart the slave. [18:07:05] mutante: I was looking at that yesterday. There's also 'snaprotate.pl' which is something we use for backups on the rest of our databases; it does something similar. [18:07:10] can i just do mysql -e 'start slave;' ? [18:07:17] yup. [18:07:59] ah /mnt/snap exists… from the last snapshot i guess ? [18:08:10] I didn't put any of the verification steps in the doc, but if you run show slave status now it should show you that slaving is working again. [18:08:30] huh. yeah, /mnt/snap is probably from the lats time. See if anything's mounted there. [18:08:40] cool, the slave_io and slave_sql are running and 0 seconds behind master [18:09:03] looks like nothing mounted there [18:09:10] ok, carry on. [18:09:56] ok, gonna start the big rsync :) [18:10:10] is the directory on the target empty? [18:10:16] and is there enough space on the disk? [18:10:47] there is some stuff in /a and there's a lot of space [18:11:12] you're going to want to fill sqldata; what's int eh copy on 1002? [18:11:48] looks like nothing [18:12:11] sweet. [18:12:18] should i exit out and actually do this in a screen session on es1002 itself ? [18:12:20] the command ? [18:12:28] up to you. [18:12:43] ok i'm going to do that [18:12:58] did not see you settin $fs= [18:13:01] rsync -avP es1001:/mnt/snap/ /a/ [18:13:29] did it on es1001 :) [18:13:45] ok, screen running on es1002 with rsync [18:13:59] so now we just wait a while ? [18:14:26] also, so let's say while you're gone es1002 suddenly gets really far behind, what do i do ? [18:14:33] maplebed: [18:14:49] how long does the whole procedure take? [18:14:51] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&tab=ch&vn=&hreg[]=es100 [18:15:10] apergos: it took us about 30m on the front end, probably 30m on hte finishing end, and 1-2 days wait in between. [18:15:20] uugghh [18:15:28] not sure if I'm down for the hour right now [18:15:44] apergos: no problem. save backsrcoll and do it while I'm asleep. ;) [18:15:47] am I liable to break anything if I try it "followingthe docs" tomorrow? 
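Pulled together, the sequence just run on es1001, plus the copy onto es1002, looks roughly like this; the rsync line is verbatim from the session, while the volume group, snapshot size and mount step are assumptions:

# on es1001 (host A)
mysql -e 'STOP SLAVE IO_THREAD;'     # stop fetching from the master; the SQL thread drains any backlog
mysql -e 'SHOW SLAVE STATUS\G'       # wait until Read_Master_Log_Pos and Exec_Master_Log_Pos match
mysql -e 'FLUSH TABLES;' && sync     # mysqld buffers -> filesystem, then filesystem -> disk
lvdisplay                            # see what we are pulling from
lvcreate --snapshot --size 50G --name snap /dev/es1001/sqldata   # VG/LV names are placeholders
mysql -e 'SHOW MASTER STATUS\G' > /root/es1001-master-status     # coordinates the new slave will need
mysql -e 'SHOW SLAVE STATUS\G'  > /root/es1001-slave-status
mysql -e 'START SLAVE;'              # replication resumes while we copy the frozen snapshot
mount /dev/es1001/snap /mnt/snap     # via the /dev/mapper device in practice

# on es1002 (host B), with mysqld not running and /a empty
rsync -avP es1001:/mnt/snap/ /a/

# afterwards, back on es1001: the snapshot only exists for the duration of the copy
umount /mnt/snap && lvremove /dev/es1001/snap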
[18:16:00] yeah I was going to copy the log for sure [18:16:31] there is one deviation we took from teh docs (recording the output of 'show master status') [18:17:08] ok, but you chatted about that in here right? [18:17:10] and apergos since the snap currently exists for this copy, yeah, you could break stuff by following them tomorrow (unless you name your snap something different from 'snap'. [18:17:15] apergos: yup. [18:17:18] just highlighing it. [18:17:21] what name would you like? [18:17:46] any you want. it only exists while the copy is taking place; you delete it at the end. [18:17:51] ok. [18:17:52] but since you offer... [18:17:55] "apersnap" [18:18:04] it sounds catchy. [18:18:12] *eyeroll* [18:18:13] ok then [18:18:16] lol [18:18:40] saved. I'd prefer to do it when I have more brain cells awake [18:20:09] hrm, that makes me think of aprihop [18:20:14] which is delicious delicious beer [18:20:19] and regarding es1004, should it run at the same time or wait anyways [18:20:21] apergos: you should poke at es1003. [18:20:24] http://www.dogfish.com/brews-spirits/the-brews/seasonal-brews/aprihop.htm [18:20:28] mutante: I think they can run in parallel. [18:20:57] ok. duly noted [18:21:36] maplebed: so then i will just start the rsync part , on host B = 1004 [18:21:43] +1 [18:21:55] mutante: do the disk checks there too; I'm gonna bet it's close to full. [18:23:19] maplebed: yea, / has 20K avail :p [18:23:23] /a [18:23:36] nuke everything there. [18:23:36] it's all way old. [18:25:26] bleh, apache2 ... [18:25:42] whats the directive to set it to load a particular cgi script as its docroot index... [18:26:03] RECOVERY - Disk space on es1004 is OK: DISK OK [18:26:48] RECOVERY - MySQL disk space on es1004 is OK: DISK OK [18:26:52] :) [18:27:11] \o/ [18:27:26] maplebed: MySQL master status error on es1001 [18:27:32] wazzup wit dat ? [18:27:42] !log nuked /a contents on es1004, started rsync from es1001 [18:27:44] Logged the message, Master [18:28:21] yeah that. [18:28:54] mysql slaves are supposed to be read-only, masters read-write. [18:29:01] RobH: ping [18:29:12] ? [18:30:00] pretty, looks like es1001 is pretty much maxed out bw [18:30:56] es1001 is a slave, so should be read-only, but is also the middle-master, so should alert more forcefully when it dies. [18:31:08] I expect I just did it wrong and es1001 really shouldn't be albeled as a master. [18:31:27] PROBLEM - MySQL replication status on es1004 is CRITICAL: (Return code of 255 is out of bounds) [18:34:03] I love seeing network graphs peg at 100MBps [18:34:14] oh, shouldnt mysql processes be stopped on es1004 during the rsync [18:34:25] mutante: yes. [18:34:48] oh should i stop mysql on es1002 ? [18:35:00] it's one of the facebook ones though, not init.d/mysql [18:35:05] LeslieCarr: it wasn't running when we started. [18:35:09] oh yeah [18:35:10] :) [18:35:18] mutante: I'd just kill -9 the thing then restart the copy. [18:35:28] (in case mysq tries to munge any of the files in its death throws. [18:35:29] and note it on the doc ? [18:35:33] yes please. [18:35:57] mutante: would you be available tomorrow to mess with gallium ? :D [18:36:10] preilly: http://wikitech.wikimedia.org/view/Reprepro [18:36:14] mutante: as I understand it it is like the middle of the night for you [18:38:24] maplebed: LeslieCarr , done. killed, restarted, added to docs [18:38:31] rockin. [18:38:54] huzzah [18:38:58] hashar: yes, we can do gallium tomorrow. 
yes, kind of late, but its like a live tutorial :) [18:40:37] RobH: you may need DirectoryIndex , Options +ExecCGI and/or ScriptAlias [18:40:58] thx, i found someone who made a tutorial for just what i want a fwe minutes ago =] [18:41:03] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [18:41:08] but since you foolishly spoke up, if i hit a wall im pinging you bwahahaha [18:41:17] wah:) [18:41:17] mutante: ^_^ [18:41:20] heh [18:43:00] PROBLEM - Puppet freshness on db9 is CRITICAL: Puppet has not run in the last 10 hours [18:43:03] maplebed: so i will just look at this rsync from time to time and if it would fail just restart until i eventually got all the data or somebody tells me es1001 is too maxed out ?:p [18:43:14] robh: nagios reported 2 warnings less than 10 hours ago on ps1-d2-pmtpa and ps1-d3-pmtpa...the warning and the input A on the phase doesn't make sense [18:43:24] yeah, sounds about right. [18:43:29] alright [18:43:31] I'm pretty sure es1001 will do just fine. [18:44:04] mutante: enjoy the tutorial and ping me tomorrow whenever you are around :) [18:44:38] cmjohnson1: hrmm, gimme a moment, middle of another task, i can take a look in a few minutes [18:44:42] will ping you when i have a moment [18:44:55] k [18:46:21] ok, maplebed, i guess continue in ~ days at this point then? thanks for the session [18:46:26] ~ 2 days [18:46:50] yeah, we should check up on it periodically and make sure rsync didn't die, [18:47:01] but we can do the slave starting process friday morning. [18:47:10] (pacific time) [18:47:37] LeslieCarr: Do you have a moment for a couple of vlan assignments? https://rt.wikimedia.org/Ticket/Display.html?id=2768 [18:47:57] yep, i'll check rsyncs [18:48:44] hrm, 1 kitten per vlan assignment [18:48:51] LeslieCarr: its for preilly [18:48:56] so take the kitten out of him. [18:49:03] hehe [18:49:50] LeslieCarr: please don't [18:50:03] oh sweet, i already did the ip assignments for these when i allocated them [18:50:07] past me is really nice to future me. [18:50:19] LeslieCarr: I mean please do the change just don't take my kitten out [18:50:32] RobH: sounds like a paradox to me [18:50:35] preilly: everyone is born with three kittens, if she does you still have one left. [18:50:52] :) [18:51:10] !log moving ru, nl, pl, pt, zh, and sv search to eqiad [18:51:11] past rob is usually pretty nice to future rob, except for the entire diabetes, being a fat kid kinda thing [18:51:11] Logged the message, notpeter [18:51:17] * preilly — just had a very weird chill go up my spine  [18:51:41] past rob is kind of a jerk where health is involved. [18:51:57] professional, but still a jerk. [18:52:41] RobH: those two are already completed [18:52:50] yay past LeslieCarr ! [18:52:54] hehe [18:53:06] preilly: past preilly owes past LeslieCarr a couple kittens [18:53:16] thats enough of that confusion, to the present! [18:53:16] \ [18:53:29] preilly: so i am allocating the boot info and starting their installs now [18:53:33] Jeff_Green: ok, traffic on search7, the big pool3 host is dropping [18:53:38] RobH: thanks! [18:53:51] notpeter: yay! [18:54:17] notpeter: nice logline there! [18:54:20] RobH: I'm looking at http://rt.wikimedia.org/Search/Simple.html?q=osm and see that two of the tickets are waiting for approal [18:54:22] can anyone who speaks the following languages make some searches? ru, nl, pl, pt, zh, and sv [18:54:34] RobH: what can i do to help those along ? [18:54:41] woosters: --^ [18:54:47] you mean the purchases? 
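A minimal Apache stanza along the lines mutante suggested above (DirectoryIndex plus ExecCGI), written out as a shell heredoc; the site name, paths and script name are invented for illustration:

cat > /etc/apache2/sites-available/cgi-index <<'EOF'
# serve a CGI script as the document-root index
<VirtualHost *:80>
    ServerName stats.example.org
    DocumentRoot /srv/cgi-app
    <Directory /srv/cgi-app>
        Options +ExecCGI
        AddHandler cgi-script .cgi
        DirectoryIndex index.cgi
    </Directory>
</VirtualHost>
EOF
a2ensite cgi-index && apache2ctl graceful

ScriptAlias is the alternative when the scripts live outside the document root.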
[18:54:53] ct has all the info to discuss with erik today. [18:55:12] there is nothing anyone else can do at the moment. [18:55:30] once they are approved, the quotes are good for us to place the orders for both ashburn and tampa [18:55:40] ya tfinc [18:55:51] ok [18:55:53] there are no pending quotes yet for esams, as there is still the unasnwered question pertaining to the legality of hosting more than caching in esams [18:55:57] there are 6 rt tickets on osm h/w purchase [18:55:57] i'll check with ct later today then [18:56:59] our dhcp lease files are a goddamn mess [18:57:00] prtugese is... returning some kinda results [18:57:09] i really don't feel like fixing them now, but even editting them hurts my soul. [18:57:49] same with it [18:58:32] hrmm, silver is returning bogus info for mac in drac....wtf [18:58:47] will have to pull the hard way [18:58:59] tfinc: would you be willing to make a couple of searches in polish wikipedia to let me know if it's returning reasonable results? [18:59:25] notpeter: searched for Scheveningen on .nl . works [18:59:34] notpeter: hold the osm order hostage, you have my permission ;] [18:59:37] notpeter: sure, how can i test ? [18:59:53] tfinc: just use the search box on pl.wikipedia.org [19:00:00] and if it returns something reasonable, win! [19:00:05] mutante: sweet! [19:01:01] huh, that wasnt bogus mac [19:02:48] notpeter: "Smörgåsbord" on sv. looks ok as well, but thats about all i know , heh;) [19:02:58] k, laters [19:02:58] heh [19:03:02] thank you! [19:03:49] zhwiki is returning... results [19:03:55] I have no idea if they're correct or not [19:04:40] !log dns update for zhen mgmt [19:04:42] Logged the message, RobH [19:05:00] tewwy: hi, since i see you are tychay, your wikitech account has been renamed from tchay, as requested. [19:05:06] * mutante out [19:07:35] woosters: I was going to leave pool3 in eqiad for a while, then also add pool too, and then do the remaining pools tomorrow. does that sound good? [19:08:20] which ones did you switch over again? [19:08:25] * apergos does the backread [19:08:48] ok, none with a language I know [19:09:25] tfinc: fyi https://rt.wikimedia.org/Ticket/Display.html?id=2676 [19:09:29] heh [19:09:30] that is the master ticket for all the orders [19:09:41] then each order should have its last status on the title of the ticket [19:09:47] and noted in said tickets [19:11:33] thanks [19:11:43] welcome [19:12:25] what the ticket doesnt note is there is no movement on esams orders [19:12:35] someone needs to clear with legal the issues of hosting data off us soil [19:12:51] if its ok, then mark would handle the procurement of those hosts, as he does for all esams hardware [19:13:02] if not, then caching only, and same person to handle it [19:14:52] preilly: So silver has 1TB disks and zhen 250GB [19:15:21] but i am setting them up same, in that they have a single / raid1 partition, etc... [19:15:40] if you think you need more than 250GB then we can look at purchasing larger disks. [19:16:17] RobH: okay, I think it'll be fine for now [19:18:24] awjr: we may want to clear the cache as well for this change right? [19:20:32] binasher: would you mind flushing the varnish cache? [19:23:12] Ryan_Lane: Are you using both labstore1 and labstore2 at this time? [19:23:40] I ask because the power in that rack is wildly unbalanced and we will need to cause some downtime to correct the issue. [19:23:45] the joys of nonredundant power. [19:24:10] hey maplebed: shall we give the filters another shot? 
[19:40:10] LeslieCarr: are you our resident nagios expert these days? [19:40:28] nagios is reporting invalid information for ps1-d2-sdtpa, and its just not polling for new info. [19:40:33] not sure why =/ [19:40:43] i know its annoying me to see it do it though [19:41:28] preilly: i forgot to ping you, those installs are done [19:45:12] New patchset: Hashar; "jenkins: minimal git CLI config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4295 [19:45:21] that change is already in production [19:45:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4295 [19:47:52] ARGH [19:48:02] RobH: ummm [19:48:02] my firmware upgrade reset the temp threshholds for a bunch of power strips [19:48:06] RobH: yes [19:48:06] blaaaaaaaaah [19:48:15] Ryan_Lane: well, we have to schedule some downtime then [19:48:17] RobH: you should only take one down at a time [19:48:27] we shouldnt have to take both down anyhow [19:48:32] hopefully [19:48:39] so can one just gracefully be shutdown? [19:48:40] and I'll need time inbetween to ensure the validity of the filesystem after you do it [19:48:51] ok [19:48:57] New patchset: Jgreen; "adding mhernandez account to grosley/aluminium, sudoers cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4296 [19:48:59] but yes, it can just be shutdown [19:49:04] cmjohnson1: On the ticket for rebalancing c3-sdtpa, please add ryan as a person on ticket [19:49:10] oh [19:49:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4296 [19:49:13] as he will need to be around after its brought online [19:49:16] I need to make that rrdns entry first [19:49:16] ok [19:49:27] and will also be the person to shut it down it sounds like ;] [19:49:39] cmjohnson1: you may have to do that rack in a few steps, ryan for the labs storage downtime [19:49:44] RobH: okay, coolio [19:49:44] ben for the swift, etc... [19:49:48] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4296 [19:49:51] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4296 [19:49:54] though swift should be fault tolerant [19:50:00] Ryan_Lane: make yer shit like swift man [19:50:06] maplebed: ^ [19:50:13] well, swift has 3 copies [19:50:18] gluster has two [19:50:20] I can make it three [19:50:25] then why didnt you ask for three ;] [19:50:30] but it'll eat up *way* more space [19:51:15] how the hell does servertech firmware leave all my settings in place, as it should, but reset my temp threshholds [19:51:16] damn them [19:56:19] hrmm, it didnt, observium is reporting invalid threshhold data. [19:56:22] bleh. [20:07:20] Jeff_Green: ok, I was going to throw in pool2. sound reasonable to you? [20:07:33] want to bang on it for a couple of minutes? [20:07:40] yeah we should IMO [20:07:41] robh: I can move the power on c3 w/out down time...the servers are redundant...they will just alarm...I will not move anything putting it out there first [20:07:55] ja [20:08:03] cmjohnson1: hrmm, lemme look at it real quick [20:08:33] cmjohnson1: when i look at the power mgmt page, it doesnt denote that its dual feed [20:08:40] but you can confirm now its a dual feed power strip? 
[20:09:20] no the power strip is a single feed...the servers have dual power supplies...i can move one side at a time so I don't have any down time on the server [20:09:42] arent the servers using y power cables? [20:10:11] so you would pull one, put it on its own power cable, move the y, then move the power cable off and the y back on the second psu? [20:10:19] yes....you got it [20:10:26] that would work, but you ahve to do it pretty slowly [20:10:35] i can handle that [20:10:37] cuz the psu will need a few moments to go fully online again [20:10:55] better than taking anything down [20:11:01] just time consuming [20:11:08] agreed on both counts [20:11:21] hrmm, well, you can do that, but we still need to ensure the folks are about who can fix it if it doesnt work [20:11:28] so when you want to move labs, bring ryan into it [20:11:47] the ms-be server can do it without alerting ben specifically, just ensure someone in ops is about [20:12:04] the es servers are able to be handled by maplebed LeslieCarr or mutante [20:12:19] so any of them works for that, atleast i assume so since ben taught them how to fix replication today ;] [20:12:31] so yea, thats a better solution [20:12:38] Ryan_Lane: are you around just in case [20:12:48] I am [20:12:56] please only bring one labstore box down at a time [20:13:09] and wait for me to check the filesystem before starting on the next [20:13:13] he is going to try to move labstore2 and its array to another power feed [20:13:14] without downtime [20:13:19] ah [20:13:19] ok [20:13:24] okay..not planning on bringing anything down [20:13:26] its redundant power using a single y cable, so going to do some cable juggling [20:13:32] sounds good [20:13:33] just ensuring yer here incase it fails [20:13:35] ;] [20:13:36] o.O [20:13:38] the labs are actually single power cables [20:13:44] it's dual power using a single power cable? [20:13:51] cmjohnson1: thats not great =/ [20:13:52] for *storage* nodes? [20:14:03] if its a non reduntant power strip, should use y cables, do you have nay? [20:14:04] any? [20:14:14] if not, continue with those and drop a ticket in procurement for me to orde ryou more [20:14:18] we didn't have space in tampa anymore, remember? [20:14:20] we can move em now [20:14:23] * Ryan_Lane groans [20:14:38] it's labs anyway, it's experimental [20:14:40] it can break ;-p [20:14:41] lol [20:14:45] -_- [20:14:49] mark: whats the company that makes the third hand, minkels? [20:14:53] I don't want to spend hours fixing a filesystem [20:14:59] RobH: sells, anyway [20:14:59] Ryan_Lane: it shouldnt break. [20:15:07] mark: sells? [20:15:10] mark: any chance i can bribe you to get me some more before berlin? [20:15:15] RobH: if you bring down more than one node, it'll break [20:15:16] RobH: perhaps one [20:15:20] I only have 3 [20:15:32] i have the one, so a pair would make it a lot easier to rack stuff [20:15:39] right [20:15:43] i'll get you one more [20:15:46] I only need 2 now [20:15:51] i wish we could find a seller, now that i posted on facebook all my friends are wanting to know where to get them [20:15:51] hehe [20:16:00] minkels sells them [20:16:20] mark: where do you want to move the labs? [20:16:21] but they generally do big orders only, so need to find a reseller [20:16:23] hrmm, but they are to vars right? 
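For context on the two-versus-three copies point above: the copy count is the Gluster volume's replica setting, fixed when the volume is laid out. A three-way replicated volume looks roughly like the following; the volume name, hosts and brick paths are placeholders:

# one brick per storage host; every file is stored on all three
gluster volume create projects replica 3 \
    labstore1:/a/projects labstore2:/a/projects labstore3:/a/projects
gluster volume start projects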
[20:16:26] oooo essex is being released tomorrow [20:16:27] yea [20:16:34] they donated the ones we have [20:16:42] because I contacted them once during a fundraiser [20:16:46] but they won't repeat that I think ;) [20:16:52] Ryan_Lane: I want v6 for labs! :P [20:16:57] baaaah [20:16:58] http://www.minkels.com/index.php?id=373&pageid2=33202&entity_id2=object.catalog.block [20:17:03] it exists in diablo [20:17:20] mark: I'll talk to the openstack people to see how I can add the ipv6 network to an already existing network [20:17:27] ok [20:17:28] that's the reason we don't have it right now [20:17:37] I don't want to muck around in the database :) [20:18:45] RobH: I used it to mount the MX80 in esams recently [20:18:55] but then I had it squeezed in between the mx80 and an ex4200 below it [20:18:58] with no extra space [20:19:01] http://www.compriconshop.nl/artikeloverzicht.asp?g=15_BEHUIZINGEN+ACCES&f=&p=5 sells them [20:19:03] was impossible to get it out :( [20:19:06] but that page doesnt work right [20:19:08] RobH: oh [20:19:09] eww =[ [20:19:11] I'll order some there then [20:19:18] yea would be nice to get chris some too =] [20:19:30] no problem [20:19:33] yay! [20:19:41] order me a set of 3 more if you can [20:19:47] so i have 4 total for the full rack stuff when possible [20:19:53] same for tampa if you would ^_^ [20:19:58] hm [20:20:02] can rrdns use cnames? [20:20:06] but anyhow, just a single one more is awesome, cuz they are fantastic tools [20:20:13] Ryan_Lane: no [20:20:15] :( [20:20:17] someone in the US needs to sell these suckers [20:20:51] so I need to have A records for labstore, and all the labstoreX systems? [20:21:06] cmjohnson1: i dont think mark was saying they needed to move [20:21:11] Ryan_Lane: yes [20:21:13] ok [20:21:21] and no reverse for labstore [20:21:25] mark: so you had to unrack the mx80 to remove them? [20:21:26] ok [20:21:30] RobH: yes [20:21:33] well, loosen it [20:21:33] that sucks [20:21:37] and balance it on my hand ;-) [20:21:46] I should have put the handle in the front [20:21:50] then it probably would have worked [20:21:53] but I didn't think of that in advance ;) [20:21:59] mark: http://pastebin.com/EViiQ944 [20:22:12] the handle on mine was not where i wanted, i removed the screw on it and flipped it about [20:22:25] I wonder if I should call it projectstorage, instead [20:22:26] i have no idea why other companies dont make this thing [20:22:29] its awesome. 
[20:22:37] then call the compute node rrdns instancestorage [20:22:44] Ryan_Lane: that's correct, although normally you just repeat the A record without repeating 'labstore' [20:22:51] ah [20:23:04] * cmjohnson1 moving downstairs to sdtpa [20:23:08] but this is fine [20:23:23] cmjohnson1: k [20:23:37] mark: http://pastebin.com/vDyCZ69Q [20:23:53] Ryan_Lane: yep [20:23:56] cool [20:24:05] I'll do the same for the compute nodes and call it instancestorage [20:24:12] that'll make it more clear what each one ise [20:24:13] *is [20:24:35] binasher: I think tomorrow I'll put a bit of load on the chash director in eqiad [20:24:55] i no longer recall everything i tried to make the ciscos work [20:25:03] i guess i get to repeat it all now and document it =P [20:25:27] Ryan_Lane: this is somehow your fault, i know it [20:25:28] ;p [20:25:36] hahaha [20:25:59] hey, I documented some stuff ;) [20:28:30] mark: that new apache configuration is pretty good [20:28:35] I'm changing a few things around [20:29:04] I'm adding support for redirection, and adding in a simple alias hash, since that's a very common and straightforward thing to do [20:29:12] I'm changing aliases to serveraliases [20:29:47] I started on a mediawiki manifest and realized it's going to make me want to strangle myself [20:31:37] oh I wasn't done yet [20:31:51] was just testing it as a concept [20:31:55] * Ryan_Lane nods [20:39:24] robh: u around [20:39:29] yep [20:39:32] New patchset: Lcarr; "deactivating several config files in icinga for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4302 [20:39:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4302 [20:39:56] ok...i'm going to start w/ ms-fe4 [20:39:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4302 [20:40:01] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4302 [20:40:31] also..hey the lab arrays are dual power supply but the y cables will not work on them...power supplies are on each side [20:41:00] scratch the last ...let's move ryan's labs first [20:47:14] cmjohnson1: ahh, arrays, right [20:47:27] well, only moving one of the two lab servers, and keeping the array with it on whereever circuit it goes [20:47:27] right [20:47:51] Ryan_Lane: ^ [20:48:08] * Ryan_Lane nods [20:48:10] you should be ok to go ahead, just ensure the psu comes online with green lights, then give it another 30 seconds or so [20:48:12] before you move the other [20:48:13] wait [20:48:17] which one are you doing [20:48:26] lemme make a quick change first [20:48:31] labstore 2 [20:48:44] cmjohnson1: also admin log when you pull and replace each plug ;] [20:48:52] gah!!!! now for some horrible reason puppet is writing every host into puppet_hosts.cfg commented out [20:48:53] despite what some say, you cannot admin log too much. [20:49:10] A : wtf ?!?!?! and B : how can i fix this wtf ?!?! [20:52:14] LeslieCarr: isn't there a post-puppet hook that runs on spence and comments stuff out? [20:52:20] cmjohnson1: actually, we're good to go with that server [20:52:21] I vaguely recall something crazy like that. [20:52:31] no, not post-puppet. it's int the init script. 
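Going back to the labstore round-robin entries above: round-robin DNS can't be built from CNAMEs, so the service name simply repeats an A record per backing host (and, per mark's note, the owner name isn't repeated on the second record). Written as a zone-file fragment via a heredoc, with the file name and addresses invented for illustration:

cat >> labs-zone-fragment <<'EOF'
; service name: one A record per host, no PTR for this name
labstore        IN A    10.0.0.101
                IN A    10.0.0.102
; the per-host names keep their own A (and reverse) records
labstore1       IN A    10.0.0.101
labstore2       IN A    10.0.0.102
EOF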
[20:52:37] oh that [20:52:51] yes, we have that in the init script [20:52:51] okay...this will take a few minutes i will ping you after i am finished [20:53:15] good idea, lemme check what exactly that does [20:54:12] !log removing power from top power supply on labstore2 [20:54:14] Logged the message, Master [20:55:53] LeslieCarr: there's a python script for that yes [20:56:14] it's supposed to run after puppet has finished updating all nagios configs, and comments out any resources for nonexistent hosts [20:56:17] there are some small differences, fixing those - maybe that is causing the commenting :) good idea maplebed [20:56:20] (as otherwise nagios won't run) [20:57:29] !log removing power from bottom power supply labstore 2 [20:57:31] Logged the message, Master [20:57:56] New patchset: Lcarr; "Removing differences in how purge-nagios-resources runs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4304 [20:58:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4304 [20:58:36] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4304 [20:58:39] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4304 [21:00:19] !log replacing power cable on labstore1 array psu1 (left side) [21:00:20] Logged the message, Master [21:02:07] New patchset: Ryan Lane; "Use the rrdns entries for the glusterfs clusters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4306 [21:02:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4306 [21:02:37] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4306 [21:02:40] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4306 [21:04:09] !log replacing power cable on labstore2 array psu2 (right side) [21:04:11] Logged the message, Master [21:05:16] ryan_lane: finished with labstore 2...okay to fix labstore1 [21:07:36] yep [21:10:09] cmjohnson1: hold up [21:10:19] k [21:10:23] you moved labstore 1 and its array? [21:10:40] you want to try to keep the array on the same circuits if you can, just for ease of organization [21:10:46] as the server [21:11:02] i moved labstore 2 [21:11:17] and it's array to the same circuit [21:11:24] ok, thats what i get for not paying close attention ;] [21:11:32] its getting a LOT closer [21:11:32] i am going to move labstore1 to a different circuit (and it's array) [21:11:38] hrmm [21:11:47] here is the thing
[21:12:02] Input Feed ID   Input Feed Name   Input Status   Input Voltage (V)   Input Load (A)   Input Power (W)
[21:12:04] AA              TowerA_X          On             208.0               6.75 *           811
[21:12:04] AB              TowerA_Y          On             208.0               3.88 *           466
[21:12:05] AC              TowerA_Z          On             208.0               6.63 *           796
[21:12:09] well, that pasted poorly.
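The puppet_hosts.cfg problem discussed above comes down to the clean-up pass described in the same stretch: a Python script runs after puppet has rebuilt the Nagios/Icinga configs and comments out resources that refer to hosts which no longer exist, so the daemon can still start. The real purge-nagios-resources script is not reproduced here; this is only a simplified sketch of that idea, with the sample config and host list made up for illustration.

    # Simplified sketch of a "comment out resources for nonexistent hosts" pass,
    # in the spirit of the purge script discussed above. The sample config and
    # the set of known hosts are hypothetical; the real script differs.
    import re

    def purge_unknown_hosts(config_text, known_hosts):
        """Comment out any 'define host { ... }' block whose host_name is unknown."""
        out = []
        block_re = re.compile(r"define\s+host\s*\{.*?\}", re.S)
        pos = 0
        for m in block_re.finditer(config_text):
            out.append(config_text[pos:m.start()])
            block = m.group(0)
            name_m = re.search(r"host_name\s+(\S+)", block)
            if name_m and name_m.group(1) not in known_hosts:
                block = "\n".join("# " + line for line in block.splitlines())
            out.append(block)
            pos = m.end()
        out.append(config_text[pos:])
        return "".join(out)

    if __name__ == "__main__":
        sample = "define host {\n  host_name db9999\n  address 10.0.0.99\n}\n"
        print(purge_unknown_hosts(sample, known_hosts={"db1007", "ms-be2"}))

Commenting out rather than deleting keeps the generated file intact for inspection while still letting Nagios load it, which matches the behaviour described in the conversation.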
[21:12:16] the y circuit is where it's going [21:12:18] but, you can see the feeds are close, but Y isnt [21:12:24] but none are JUST one letter [21:12:32] they are all a combination of two of the phases [21:12:38] sorry, not phases [21:12:38] right it will be zy [21:12:39] circuits [21:12:47] !log moving de, fr, and ja search to eqiad [21:12:49] Logged the message, notpeter [21:12:53] i wouldnt move labstores anymnore [21:13:02] as they are huge, they will cause large incremental changes [21:13:20] you moved the one, which popped up Y pretty well [21:13:29] but i would move a non array attached system at this point [21:13:31] well i only have the ms-fe4 and the es boxes [21:13:42] and ms-be something [21:13:50] or you moved that already? [21:13:55] that's already on that circuit [21:14:24] i rather not mess with the es boxes [21:14:44] those actually may be the safest, lemme see if they are replication masters or slaves [21:14:57] if they are slaves, then its no outage on anything to move them. [21:14:59] es1 and 2 [21:16:07] es3 is the master. [21:16:21] es1 looks like its not on a thing =/ [21:16:32] maplebed: is that right? [21:16:33] RobH: what's a thing? [21:16:44] i mean in db.php its not listed for anything that isnt commented out [21:17:03] you mean you want me to make sense with my statements?!? [21:17:06] read my mind! [21:17:08] ;] [21:17:30] huh. [21:17:51] cmjohnson1: i mean, you can move the labstore if you wanna, but with those three phases being so close together, moving it from circuit to circuit i think its going to result in an imbalance, but you are paying closer attention than me [21:18:05] but it seems that moving the labstore and server jumps whatever phase it hits by 3 amps [21:18:22] I expect that's a misconfiguration, but you're right that for now it makes cmj's job easier. [21:18:43] cmjohnson1: not disagreeing with you, just presenting the question =] [21:18:55] and you can move es1 via the power cable swap method all you want [21:19:00] its not even pooled =] [21:19:08] ie: its the perfect host to move [21:19:12] it would, however, get angry if it was not shut down cleanly. [21:19:25] well, it shouldnt shut down the way he is moving them [21:19:25] i am going to move es1 [21:19:32] but since its power related, there is always the chance of mistake [21:19:42] it would be the easiest and safest [21:19:49] hence we hope for best, prepare for worst =] [21:20:16] standby to standby [21:20:20] if you guys didnt know the worse would be some crazy butterfly affect where chris trips, and the entire building burns down. [21:20:51] which is less of a fear for chris, and more for his predecessor. [21:20:54] ;] [21:22:20] !log replacing power cable to psu1 (top) es1 [21:22:22] Logged the message, Master [21:23:30] binasher: you renamed a server professor and didnt tell the inventory mgmt folks or on site folks! [21:23:40] doh [21:23:43] as such LeslieCarr will be kicking you in the shin when you least expect it. [21:24:07] live in fear! [21:24:23] * binasher nods [21:24:26] cuz yea, we have to relabel it, update racktables, and relabel the port on the switch [21:24:36] and kill old dns entries, and update mgmt and normal dns [21:24:39] !log replacing power cable to psu1 (bottom) es1 [21:24:40] Logged the message, Master [21:24:47] RobH: I bet you're a fan of slapsgiving, aren't you? [21:25:01] only in theory when remote. 
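Deciding whether es1 and es2 were safe to move rests on a standard MySQL check mentioned above: a replica answers SHOW SLAVE STATUS with a row, while a master (or an unpooled standalone box) returns nothing, and db.php supplies the pooling state. The sketch below only shows the shape of that check, assuming the PyMySQL driver and placeholder hostnames and credentials; the actual check was likely run straight from the mysql client.

    # Sketch of the "is this box a replica or a master?" check mentioned above.
    # Hostnames and credentials are placeholders; the real check may simply
    # have been run from the mysql command line.
    import pymysql   # assumed available: pip install pymysql

    def is_replica(host, user="check", password="secret"):
        """Return True if SHOW SLAVE STATUS reports a replication thread."""
        conn = pymysql.connect(host=host, user=user, password=password,
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                return cur.fetchone() is not None   # empty result => not a replica
        finally:
            conn.close()

    if __name__ == "__main__":
        for host in ("es1.example", "es2.example", "es3.example"):  # hypothetical
            print(host, "replica" if is_replica(host) else "master or standalone")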
[21:25:23] hearing it happen in other families is hilarious [21:25:47] notice i said LeslieCarr would be kicking him, cuz slapping is somehow the most demeaning form of physical assault. [21:26:01] i mean, how dismissive is it of binasher if we slapped him? [21:26:01] unless its with a glove [21:26:14] well, that would be like throwing down to duel though [21:26:28] i dont want you dead, you just renamed a server wrong ;] [21:26:37] the fear of the shin kicking is the true punishment. [21:26:39] i'll slap you back for every db server i'm handed with the raid misconfigured, mind you [21:27:04] sounds like you wont be getting me to do your os installs anymore ;] [21:27:30] robh: we're better balanced and not receiving any warnings [21:27:32] which, btw, preilly is totally gonna corner you to help him with his smssomethingorother servers [21:28:18] cmjohnson1: cool, the alarm triggers when they are 20% or more out of sync [21:28:30] so just ensure the largest difference is like 18% or less [21:28:36] so when things hit peak they wont flux too wildly [21:28:42] though i guess we are in peak now... [21:28:47] so \o/ [21:29:10] okay...that was my error for installing them like that...assumed that rack was going to fill up faster [21:29:13] binasher: just dont mention to him he isnt root, he is sensitive about it ;] [21:29:25] cmjohnson1: me too =] [21:29:31] good work [21:29:51] sms servers. uh huh. [21:30:25] dont go giving anyone sudo on crazy crap, no matter what kinda bedroom eyes he give syou ;] [21:30:44] plus if you spend too much time wiht him ryan will think you are trying to steal him. [21:31:07] with who? [21:31:20] why am i not still on an island i'd never heard of two weeks ago [21:31:32] Ryan_Lane: i was telling asher that patrick is yers. [21:31:35] hahahaha [21:31:35] binasher: was it awesome ? [21:31:40] and to keep his hands off yer man [21:32:08] Ryan_Lane: looks like you get to setup some sort of sms servers. those will be great for everyone accessing wikipedia from their phone in 1995 [21:32:11] heh, asher is going through culture shock. welcome back to the trenches man [21:32:38] I'm doing what now? :) [21:34:22] New patchset: Ryan Lane; "Adding another wiki in for our gsoc student" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4308 [21:34:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4308 [21:34:39] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4308 [21:34:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4308 [21:35:32] LeslieCarr: it was a mix of awesome and bizarre. awesome: seeing wild monkeys, having a room with a giant stone hottub, hammocks on the beach. lost in translation: having two people shadow me at all times during the day, even to the point of following me into the bathroom, and not letting me do things like grab a bottle of water by myself [21:36:08] binasher: why were people shadowing you? [21:36:11] really into the bathroom ? [21:37:16] at first i found it creepy and weird but then i realized i was being treated like a boss [21:37:53] LIKE A BAWS [21:39:43] then when we were in the night market they were trying to keep me with them instead of taking pictures, in case i got kidnapped [21:39:50] (presumably by the monkeys) [21:40:05] (which you be awesome) [21:40:14] *would [21:43:10] ransom of one billion bananas ? 
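The rule of thumb quoted above (the alarm fires when the feeds are 20% or more out of sync, so keep the worst difference to roughly 18%) can be sanity-checked against the readings pasted earlier: 6.75 A, 3.88 A, and 6.63 A on the three feeds. One plausible reading of "out of sync" is the spread between the most and least loaded feed relative to the most loaded one; the exact formula the PDU uses is an assumption here, not something stated in the log.

    # Back-of-the-envelope check of feed balance using the readings pasted above.
    # The imbalance formula (spread relative to the highest-loaded feed) is an
    # assumption about how the PDU judges "out of sync", not a documented fact.

    def imbalance_pct(loads_amps):
        """Return (max - min) / max as a percentage."""
        hi, lo = max(loads_amps), min(loads_amps)
        return 100.0 * (hi - lo) / hi

    if __name__ == "__main__":
        readings = {"TowerA_X": 6.75, "TowerA_Y": 3.88, "TowerA_Z": 6.63}
        pct = imbalance_pct(readings.values())
        print("imbalance: %.1f%% (alarm threshold ~20%%)" % pct)
        # ~42.5% for these readings, which is why moving load onto the
        # lightly loaded Y feed helps.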
[21:43:17] New patchset: Ryan Lane; "Change the controller name for labs instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4310 [21:43:30] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/4310 [21:43:44] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4310 [21:44:43] New patchset: Ryan Lane; "Change the controller name for labs instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4310 [21:44:56] gerrit-wm: don't be a dick [21:44:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4310 [21:45:00] \o/ [21:45:18] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4310 [21:45:21] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4310 [21:45:35] Ryan_Lane: it doesnt know how to be anything else [21:45:36] they've branched out from bananas. i heard stories about them breaking into hotel rooms if windows were left open and raiding the minibar - they've learned to open beer cans [21:45:58] puppet sucks [21:46:11] it's no ok to use ${var} in some places [21:46:14] which is fucking stupid [21:48:19] binasher: so they dont like eye contact, and like to drink beer [21:48:42] so many correlations to various folks iknow [21:49:40] isn't intelligent design amazing? [21:49:57] * RobH twitches [21:54:42] mark: still around ? if so, want to chat about bgp stuff for a minute ? [22:13:15] !log flipping all search back to pmtpa (until tomorrow...) [22:13:16] Logged the message, notpeter [22:13:33] New patchset: Bhartshorne; "Revert "Revert "Update udp-filter config to 0.2 Deploy Wikipedia Zero and Teahouse filters Comment out Teahouse filter Change-Id: I4b8f35a7bc71eb740cba01286be46ad4f06a0ff6""" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4313 [22:13:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4313 [22:14:13] New patchset: Bhartshorne; "Revert "Revert "correcting copy/paste typo""" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4314 [22:14:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4314 [22:14:53] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4313 [22:14:56] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4313 [22:15:13] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4314 [22:15:15] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4314 [22:16:31] !log deployed (3rd time's the charm!) udp-filter changes to emery for diederik [22:16:33] Logged the message, Master [22:22:07] PROBLEM - udp2log log age on emery is CRITICAL: CRITICAL: log files /var/log/squid/orange-ivory-coast.log, have not been written to in 6 hours [22:36:12] New patchset: Bhartshorne; "adding teahouse filter for diederik" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4315 [22:36:26] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4315 [22:37:55] http://bits.blogs.nytimes.com/2012/04/04/google-begins-testing-its-augmented-reality-glasses/ [22:38:01] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4315 [22:38:03] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4315 [23:50:38] RECOVERY - BGP status on cr1-eqiad is OK: OK: host 208.80.154.196, sessions up: 9, down: 0, shutdown: 0
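The udp2log alert earlier in this stretch (log files on emery not written to in six hours) is a plain freshness check on file modification times. Below is a minimal sketch of that kind of check; only the six-hour figure and the example path come from the alert text, and the actual Nagios plugin on emery is not reproduced here.

    # Minimal log-freshness check in the spirit of the "udp2log log age" alert
    # above: flag files whose mtime is older than a threshold. The path is the
    # one named in the alert; everything else is illustrative.
    import os
    import time

    def stale_logs(paths, max_age_hours=6.0):
        """Return the subset of paths not written to within max_age_hours."""
        cutoff = time.time() - max_age_hours * 3600
        return [p for p in paths if os.path.getmtime(p) < cutoff]

    if __name__ == "__main__":
        candidates = ["/var/log/squid/orange-ivory-coast.log"]  # from the alert
        stale = stale_logs(candidates)
        if stale:
            print("CRITICAL: log files %s have not been written to in 6 hours"
                  % ", ".join(stale))
        else:
            print("OK: all log files fresh")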