[10:15:20] morning
[10:21:43] no it isn't :-P
[10:53:47] New patchset: Nikerabbit; "Symlink should be fixed after jetty installation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21182
[10:54:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21182
[10:54:54] apergos: so...
[10:55:06] let's depool ms-fe4?
[10:55:11] or?
[10:56:03] I thought we wanted to do two of them?
[10:56:44] well yes, but let's start gradually?
[10:56:55] remove one, see the effect, remove a second one
[10:56:59] ok
[11:05:32] okay, removed ms-fe4
[11:06:23] ok, looking at the graphs
[11:07:08] hm, unrelated to my change, ms-fe1 reports a 10s flatline response time
[11:07:28] just for DELETEs
[11:07:38] yeah I was looking at that, pretty weird
[11:07:39] reporting error I say
[11:09:39] sounds likely
[11:10:15] (I'm looking at the changes in https://gerrit.wikimedia.org/r/#/c/18264/ to make sure I know what they do and why... also for sanity check)
[11:11:18] great
[11:17:33] I've verified the proxy logging change, the db preallocation changes, and the wsgi change. now looking at the change to the list of packages (that's the last thing too)
[11:19:51] ok, that checks out too
[11:20:22] shall I review +2 and merge?
[11:21:22] no
[11:21:25] that would break everything :)
[11:21:38] heh
[11:21:55] the changes are not compatible with 1.4
[11:22:12] you merge them => puppet applies them on production => everything fails
[11:22:26] ah we don't have the new packages in the repo yet
[11:22:45] plus also we want to apply them only to the two chosen proxies
[11:22:53] well this will be irritating
[11:23:40] * apergos waits
[11:24:03] the two paths we can go are: 1) disable puppet in production and do the upgrade via puppet, 2) disable puppet in the to-be-upgraded box and do the upgrade by hand
[11:24:06] I vote (2)
[11:24:08] for now
[11:26:01] I would vote to disable puppet in production and do the upgrade on one host by puppet, check the results for sanity
[11:26:19] now?
[11:26:35] we'd have to disable it on *every* production host and keep it that way until we're done with the upgrade
[11:26:44] by every I mean every swift host
[11:26:45] yep
[11:26:48] obviously :)
[11:27:45] ah, forgot
[11:27:52] !log depooling ms-fe4 to stage 1.5 upgrade
[11:28:03] Logged the message, Master
[11:28:03] no bot?
[11:28:05] ah.
[11:28:10] hasty!
[11:28:22] yep, we would be without puppet on those boxes for a couple days
[11:28:49] I don't think that's a big deal... is it?
[11:29:10] it is if it's more than a couple of days
[11:29:33] and I'm worried we'll get stuck mid-way for some reason
[11:30:02] on a related note, do we have any easy way to back out if it turns out that 1.5 is killing us for whatever reason?
[11:30:34] if we don't merge via puppet I think it's trivial
[11:30:48] how would we do it?
[11:30:52] (if we don't merge)
[11:31:00] apt-get install swift=1.4; puppetd -vt?
[11:31:14] I don't think it keeps state anywhere
[11:31:33] and update the config file(s) and rewrite.py
[11:31:45] puppet does that
[11:32:18] so when and how do you want to test the 1.5 puppet changes?
[11:33:28] I think we should just do it manually for now
[11:33:33] for 1-2 proxy servers
[11:34:31] what do you think?
[11:34:36] I got that. I'm asking, if we go that route, how we plan for testing the 1.5 puppet changes, to make sure they work; that should be part of the overall plan
[11:34:44] oh!
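[Editor's note: a minimal sketch of the back-out path discussed above (11:26-11:31), assuming the 1.5 puppet changes are never merged, so puppet still describes the 1.4 setup. The exact 1.4 version string and package list are assumptions; they would need checking with "apt-cache policy swift" on the box.]

    # on the depooled proxy, pin swift back to the 1.4 series
    # (version string below is a placeholder)
    apt-get install swift=1.4.3-0ubuntu1 python-swift=1.4.3-0ubuntu1
    # preview what puppet would change, then let it restore the
    # 1.4 config file(s) and rewrite.py
    puppetd -vt --noop
    puppetd -vt
    swift-init all restart

Repooling the proxy via pybal afterwards would complete the rollback.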
[11:36:09] so, the puppet changes are tiny puppet-wise
[11:36:19] it's mostly content that changes, which we'll already test manually
[11:36:37] however, there's still a window there
[11:38:29] ok well I have a proposal then
[11:38:41] we do the first two proxy servers manually
[11:38:54] we do all but one of the backend servers manually
[11:39:09] we do the last backend server with puppet
[11:39:15] and the last two proxy servers with puppet
[11:39:23] okay
[11:39:25] obviously this is not today, I just want to have the plan
[11:39:32] it doesn't even have to be one, it can be more
[11:39:35] sure
[11:39:40] we can run puppet --noop
[11:39:46] yep
[11:39:46] see if it's sane, then push to all
[11:39:59] okay, I like
[11:40:03] great
[11:40:10] so first tow proxy servers, manually
[11:40:12] *two
[11:40:17] lemme look at the graphs again...
[11:40:20] great
[11:40:26] no traffic in ms-fe4 at all, I checked lvs4 too
[11:40:53] seems pretty bored, wanna pull the second one?
[11:41:23] done
[11:41:28] ms-fe3
[11:42:08] * apergos watches and waits
[11:42:14] and eats a peach
[11:43:26] yum
[11:46:11] seems good
[11:46:18] !log depooling ms-fe3 to stage 1.5 upgrade
[11:46:27] Logged the message, Master
[11:46:42] these look pretty good to me
[11:47:33] still has lingering traffic according to netstat
[11:47:37] ok
[11:47:43] let's move on with ms-fe4 though
[11:49:36] okay, disabled puppet on both
[11:49:39] so I guess these packages are tucked away in a lab swift instance
[11:49:48] "swift-init all stop" in ms-fe4
[11:49:51] All these packages are currently in /root/ on swift-be1, he says in an email
[11:50:11] I read somewhere they're in bast1001:~ben and I took them from there
[11:50:23] oh
[11:50:40] http://wikitech.wikimedia.org/view/User:Bhartshorne/swift_upgrade_notes_2012-08
[11:50:44] there, second line
[11:50:50] ok
[11:51:29] (btw, swift's at 1.6 now...)
[11:51:41] yeah I noticed
[11:53:00] if we add these to the repo(s) on brewster, the old versions become unavailable?
[11:53:42] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: Connection refused
[11:54:25] Be aware that reprepro will remove older versions of packages without asking. They are no longer available in the pool
[11:54:33] yeah. so I think we don't want to add these to the repo
[11:54:39] nope
[11:54:45] bah humbug
[11:54:53] hm, is it just me, or does the requests/s graph show an upwards trend?
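[Editor's note: to illustrate the reprepro warning at 11:54 above (repo path and distribution name are assumptions, since the layout on brewster isn't shown in the log): importing the 1.5 debs would silently drop the 1.4 ones from the pool, which is why the packages stay in /root/ instead.]

    # what is being avoided here: includedeb replaces the old version,
    # and reprepro deletes the 1.4 debs from the pool without asking
    reprepro -b /srv/wikimedia includedeb lucid-wikimedia swift_1.5.0-*_all.deb
    # afterwards only 1.5.0 would be listed; no way back short of
    # re-importing a saved copy of the 1.4 deb
    reprepro -b /srv/wikimedia list lucid-wikimedia swift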
[11:54:57] looking
[11:55:24] nah I think it's minor
[11:55:34] just fluctuating
[11:55:50] little bit yes
[11:55:56] let's wait a little though
[11:56:44] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=^ms-fe[1-9].pmtpa&mreg[]=swift_[A-Z]%2B_hits%24&z=large&gtype=stack&title=Swift+queries+per+second&aggregate=1&r=day
[11:57:17] give it about 15 min
[11:57:29] I installed 1.5 on ms-fe4 already :)
[11:57:41] mrghmph
[11:58:01] well it's not in the pool so it doesn't matter :-P
[11:58:09] keep ms-fe3 as is for now though
[11:59:06] :)
[12:02:14] okay, packages & config prepared on ms-fe4
[12:03:54] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:54] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:54] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:54] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:54] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:55] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:55] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:56] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:56] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:57] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[12:03:57] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:58] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:58] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[12:05:52] okay, starting swift on ms-fe4
[12:06:00] what about rewrite.py?
[12:06:06] oh, right.
[12:06:24] very true
[12:06:26] thanks
[12:06:44] yw
[12:07:03] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours
[12:11:39] yay for broken diff
[12:11:50] rewrite.py has changed in the meantime
[12:11:55] orilly?
[12:12:56] yeah, probably the originals/thumb change
[12:12:57] I'll fix it.
[12:17:53] seems like the sort of thing that patch should have worked arond
[12:17:55] around
[12:17:56] whatever
[12:19:12] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.011 seconds
[12:19:27] heh, works
[12:19:27] cool
[12:25:42] ok the various lists commands seem to work (no big surprise, just figure I might as well walk through these)
[12:30:24] things look pretty good
[12:34:14] New patchset: Faidon; "swift changes for the upgrade to 1.5.0" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18264
[12:34:36] that's the rebase
[12:34:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18264
[12:35:00] right
[12:35:34] so, what did you test exactly?
[12:36:32] lists, stats
[12:36:40] did not try actual retrieval
[12:36:47] nor nonexistent containers/objects
[12:37:53] I am doing nonexistent containers/objects (lists, stats) right now
[12:37:55] they all check out
[12:42:12] with what? swift -A? or curl?
[12:42:22] swift -A from the host
[12:46:56] download worked. with somewhat surprising results (it created the hash dirs)
[12:47:20] hm?
[12:47:40] I requested 7/7b/somethingorother
[12:47:44] using swift
[12:47:58] it created 7/7b/somethingorother in the current dir
[12:48:07] anyways it worked fine
[12:48:31] I haven't tried any thumbs, only originals
[12:48:57] retrieval of nonexistent object fails properly
[12:54:49] New patchset: Matthias Mullie; "Add new AFT permission levels" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21141
[13:00:40] apergos: break for lunch?
[13:00:51] sounds great
[13:00:51] and other business that's piling up
[13:01:01] need to cook real food
[13:10:49] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[13:15:05] https://gerrit.wikimedia.org/r/#/c/21141/2/wmf-config/CommonSettings.php
[13:15:06] $wgGroupPermissions['afttest-hide'] = $wgGroupPermissions['oversight'];
[13:15:17] I'm.. not sure this is a good idea
[13:52:24] hey guys, other than mark, who can review my puppet changes to java.pp?
[13:52:34] mutante or paravoid maybe?
[13:52:51] https://gerrit.wikimedia.org/r/#/c/20741/
[14:01:53] ottomata: I can confirm that the issue you are having isn't you. I am totally getting the same error.
[14:02:05] I see 1023 hit brewster for its PXE boot DHCP assignment
[14:02:14] but I never see the secondary hit from within the ubuntu installer
[14:02:21] aye cool, yeah that's what I saw too
[14:02:48] of course, cannot push the installer logs to the web to look at them either
[14:02:53] as it fails dhcp
[14:03:06] I am going to check a few things in the bios to confirm they are right, checking now
[14:03:09] ottomata: I'm sorry, this is swift week
[14:03:19] me and apergos are picking up swift stuff from Ben
[14:03:39] * apergos peeks in
[14:03:44] aye cool
[14:03:44] and it's important to squeeze as much as possible before he leaves, so everything else kinda is on the backburner
[14:03:52] yeah everyone has been soooooper busy it seems
[14:03:53] s'ok
[14:04:09] mark: varnish 3.0.3 got released btw
[14:04:14] ottomata: So there are a few things that can cause it to detect the wrong interfaces. I am going to confirm that DRAC is set to dedicated, no virtual media is connected, and that all the bios settings basically match what the other R310s we run have
[14:04:32] ok
[14:06:18] hrmm, drac communication failure...
[14:06:25] thats a new error.
[14:07:01] ottomata: in about two minutes im ditching you for 30, my lunch is going to get cold (well, breakfast + lunch, reheated pizza!)
[14:07:08] ok cool
[14:07:08] but now im intrigued damn it
[14:07:11] yeah!
[14:08:17] ok, it should work, im going to stay attached to it and poke at it while im eating if you dont mind
[14:08:23] if you want to try 1024 go for it
[14:10:22] apergos: ready when you are
[14:11:18] ok i'll try 1024
[14:12:57] ok
[14:13:06] half-ready (I'll be going back and forth from the kitchen)
[14:15:00] RobH, yargh, same deal on 1024
[14:15:18] i have no idea if that makes me happy or sad.
[14:15:30] moar data though =]
[14:16:06] so all the settings in bios & drac are right
[14:17:43] other topic: paravoid, I know you and apergos are busy, but who can review puppet stuff?
[14:17:49] there's got to be more than mark and you, right?
[14:19:00] so just out of curiosity
[14:19:16] is the oracle jdk open source?
[14:19:55] i dunno if it has any bearing on using it, all your changes are very specific to analytics and appear to be legit to me, but im not that well versed in java
[14:20:28] oh, nm, i see, you are pulling openjdk, nm
[14:21:09] (i dunno enough to approve this other than 'assume good faith' type approval that you arent doing anything crazy in java setup)
[14:22:11] aye
[14:22:14] yeah we aren't
[14:22:24] all i'm doing is abstracting out package naming details
[14:22:30] because the names of the packages are not consistent
[14:22:36] yea, im reading it now and seems thats it
[14:22:40] between lucid and precise and openjdk vs. sun/oracle
[14:22:46] i should be able to review and approve this in a few minutes
[14:22:57] * apergos looks at the graphs again
[14:23:04] and the only sun/oracle stuff I'm using are ones that are already available, either through ubuntu or through our own apt
[14:23:32] ottomata: I'm not going to make you cherry pick and redo for a single space before a tab on line 238 ;]
[14:23:53] (im not evil)
[14:24:38] New review: RobH; "looks legit, and is localized to just analytic machines for now" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/20741
[14:24:39] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20741
[14:24:51] still ok
[14:25:32] what do you think about putting it back in the pool? (ms-fe4)
[14:25:40] ottomata: Your change is merged and live on puppetmaster
[14:25:47] and i let my food get cold
[14:25:51] * RobH goes to reheat it
[14:26:02] apergos: you eating today? ;]
[14:26:14] yes
[14:26:26] if you did the backread you'd see I am cooking right now even
[14:26:31] (but I ate earlier too)
[14:26:51] thank youuuuuuuu
[14:26:52] half the food in my fridge is bad now
[14:27:10] * RobH is reheating pizza now
[14:27:24] and will prolly make a hash of taters and onion... cuz thats all i have thats still good?
[14:29:15] taters and onions ain't bad
[14:29:17] * paravoid points at Rob
[14:29:25] but make sure you get more food for the fridge later
[14:29:26] in Europe there aren't many of us
[14:29:49] with Daniel's relocation even less so
[14:29:52] paravoid: apergos and i have a longstanding arrangement to remind one another to actually remember to eat.
[14:29:57] do you need to be on the list?
[14:30:00] haha
[14:30:06] nah, thanks
[14:30:14] worst case I won't eat and lose a pound or two
[14:30:15] he's on a different sleep schedule anyhow
[14:30:25] on pacific time?
[14:30:30] actually, my sleep schedule was great this week
[14:31:01] and even with the late days, I keep waking up early-ish, with the exception of today
[14:31:08] oh the irony
[14:33:00] ok, cooking time, back in 15
[14:33:09] so what did you think about putting ms-fe4 back in the pool? do you want more testing?
[14:33:14] yes
[14:33:16] well, first of all
[14:33:19] let's do ms-fe3 first
[14:33:30] yes which, testing or pool?
[14:33:42] then, I'd like to enlist Aaron sometime later to actually try pointing a MW to ms-fe43/
[14:33:45] 4/3
[14:33:48] and see if it works
[14:33:54] maybe a test wiki, I don't know
[14:34:14] what do you think?
[14:34:25] let's see when we get there
[14:34:49] fair enough
[14:34:52] so, move on to ms-fe3?
[14:34:55] sure
[14:39:05] aaaahh internet, back
[14:39:55] PROBLEM - Swift HTTP on ms-fe3 is CRITICAL: Connection refused
[14:51:49] the two remaining proxies sure look unaffected
[14:54:30] ottomata_m: this is confusing as hell.
[14:57:52] ottomata: i hate this server now. its making me feel stupid.
[14:58:25] ummmm but are you still intrigued?
[14:58:38] still working on it
[15:01:25] ottomata: The netboot prolly is wrong for this
[15:01:31] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.30:11000 (Connection timed out)
[15:01:34] analytics101[1-9]|analytics102[0-9]) echo partman/lvm.cfg ;; \
[15:01:46] so 1023-1029 are not going to want to use the lvm.cfg
[15:02:01] you will want to specify something else for them, or leave it blank and manually partition
[15:02:09] hm, why not?
[15:02:29] hmmm
[15:02:31] they have dual 500gb disks is all.
[15:02:44] if thats ok with lvm then thats fine
[15:02:58] oh hmmmm, yeah i guess we do want mirrored raid for / on those
[15:02:59] right
[15:03:04] but seems to be an odd range of servers (included both c2100 and r310)
[15:03:10] whaaa
[15:03:11] really?
[15:03:12] which are which?
[15:03:19] well, the R310s are 1023+
[15:03:23] ah right
[15:03:24] yeah
[15:03:34] so you want that lvm line to be just up to 1022
[15:03:37] 1011-1022 are c2100s
[15:03:38] yeah
[15:03:42] (i could fix, but I assume you want to ;)
[15:03:52] this is unrelated to the other issue we have
[15:04:04] (but if we set it to manual partition for now it will make troubleshooting easier)
[15:04:14] as we can manually set an IP when it fails, then web mount the debug logs.
[15:04:28] plus i am just reviewing everything for these servers now to try to track wtf is up
[15:04:30] so i noticed this
[15:04:31] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[15:04:48] yeah i can fix it
[15:04:50] (for now i would just update to remove 1023+ from any entries)
[15:04:58] to manually debug?
[15:04:59] then it just asks for partitioning, which is fine for debug
[15:05:00] that's fine
[15:05:10] but yea, later will want to add a new line for them to include in some mirror partman script
[15:05:11] we want to automate once you figure it out though
[15:05:17] ok cool
[15:05:19] we are on same page =]
[15:05:21] yeah i can do that
[15:06:06] i thiiiiink i can use the analytics-cisco.cfg and just swap the sd* for sda and sdb
[15:06:19] but i'll wait until you are done figuring it out before I try
[15:06:29] cool
[15:07:46] Ok, all the dhcp files check out as ok, the bios confirms as ok, the netboot files have the subnet needed (and have worked to install other machines in that subnet)
[15:07:52] the drac settings are fine
[15:08:01] dns is working for all the entries
[15:08:08] are there still those drac errors you mentioned?
[15:08:11] the netboot partman entry has no bearing on this issue
[15:08:21] nah, it was some odd one time timeout
[15:08:26] i didnt see it happen again
[15:08:37] if it only happens once, its a non-issue ;]
[15:08:48] and i had just disabled system services, so it may have been stuck reloading
[15:08:53] aye
[15:09:06] (system services is bad, since if you load it on accident remotely during POST you cannot unload it easily)
[15:09:13] but it has nothing to do with the issue we have now
[15:09:42] ottomata: So the next step is once the netboot.cfg is updated and live, we can rerun the install
[15:09:49] it will fail on the dhcp in the installer (but not in post)
[15:10:07] then we can manually set an ip, and web mount the debug log to see wtf is happening during the automated network discovery
[15:10:32] I can see no reason it should be getting the DHCP lease during the PXE but not during the installer.
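[Editor's note: the netboot.cfg change being agreed above, sketched against the fragment quoted at 15:01. Narrowing the glob to analytics1011-1022 leaves the dual-disk R310s unmatched, so the installer drops to manual partitioning, as RobH wants for debugging; the raid1 recipe filename in the comment is hypothetical.]

    # C2100s (analytics1011-1022) keep the lvm recipe:
    analytics101[1-9]|analytics102[0-2]) echo partman/lvm.cfg ;; \
    # later, the R310s (1023-1029) get a mirrored-raid recipe of
    # their own, e.g. (hypothetical filename):
    # analytics102[3-9]) echo partman/raid1-lvm.cfg ;; \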
[15:10:48] (i have seen that when the installer in netboot.cfg doesnt have the subnet defined)
[15:10:51] but we have it defined.
[15:11:07] 10.64.36.255) echo subnets/analytics1-c-eqiad.cfg ;; \
[15:12:00] and that actual file is fine, and has worked for the other analytics subnet servers (the c2100s)
[15:13:37] right
[15:13:43] apergos, paravoid: read backscroll. looks good!
[15:13:50] I'm about to head into the office.
[15:14:03] ok, "see" you in a while
[15:14:09] wait so, ok, RobH, i'm confused, what are we doing to update netboot.cfg? you want me to change the partman stuff now?
[15:14:21] RobH: Do you know if the last bit in #wikimedia-tech is at all valid?
[15:14:27] ottomata: yea go ahead and correct the lvm line to exclude 1023+
[15:14:40] ottomata: that way when it gets to the partitioning menu we can manually interject and mount the debug logs
[15:14:50] ah ok
[15:15:21] paravoid: there was one part to the gerrit diff that I realized I didn't do that's required for the puppet upgrade alone to work - change packages from ensure => present to ensure => latest. (I realize you're not using puppet yet, but I didn't want to forget again.)
[15:16:09] we better make sure that goes on the etherpad I guess
[15:16:24] or just straight into gerrit.
[15:16:27] :P
[15:16:29] sure
[15:17:12] RD: I dunno, but I am going to pull the data and make an RT ticket for it
[15:17:31] Alright
[15:19:24] New patchset: Ottomata; "netboot.cfg - removing anlytics1023-1029 from netboot for now." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21191
[15:19:45] RobH ^
[15:20:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21191
[15:27:55] New review: RobH; "changed per my request" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/21191
[15:27:56] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21191
[15:29:51] ottomata: Ok, netboot change is live, pulling to install servers and restarting 1023
[15:30:07] cool
[15:30:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21182
[15:30:40] ottomata: hopefully being able to see the logs now will result in me getting a goddamn clue whats going on
[15:30:45] =P
[15:33:55] RECOVERY - Swift HTTP on ms-fe3 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.509 seconds
[15:33:57] apergos: ms-fe3 is done btw
[15:33:58] ^^^
[15:34:57] I see that it is
[15:35:20] so, question
[15:35:26] I've recorded all the steps we made
[15:35:39] any ideas on where to permanently store them?
[15:35:44] wikitech perhaps?
[15:35:53] yes, wikitech
[15:36:00] stick em under Swift in a subpage
[15:36:43] ottomata: so its rebooting, still confirmed borked
[15:36:46] i see dhcp hit for pxe
[15:36:56] then nothing in the installer, which makes me think its borking the ordering of the NICs
[15:37:12] but we will see shortly.
[15:37:27] (we didnt order these with extra cards that i recall)
[15:37:47] * apergos is finally eating their food
[15:37:47] New patchset: Pyoungmeister; "add missing coma to solr init.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21192
[15:38:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21192
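[Editor's note: a sketch of how to watch both halves of the failing exchange described above; interface name and console keys are assumptions. The PXE DHCP request is visible from brewster, the in-installer one never arrives, so the installer's side has to be read from the debian-installer console on the R310 itself.]

    # on brewster: DHCP traffic arrives on the bootps/bootpc ports
    tcpdump -ni eth0 port 67 or port 68
    # on analytics1023, in the installer shell (Alt-F2 on the console):
    grep -i dhcp /var/log/syslog   # what the installer tried
    ip link                        # which NIC it picked, and link state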
[15:38:32] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21192
[15:39:43] ottomata: ok, if i specify the ip it hangs on detecting the actual link
[15:39:58] so i think the ubuntu installer is somehow swapping the nic1 and nic2 from what bios has
[15:40:11] since we are only using one nic, i am disabling the second in bios to see if the installer then progresses
[15:40:35] (since it wont go past the IP allocation, failing for link on the nic, it leads me to think this, too bad i cannot mount the damned logs to read them ;)
[15:40:56] ahhmm ok
[15:40:58] neat, you cannot disable the secondary nic.
[15:41:00] what the hell
[15:41:05] hah
[15:41:07] can you remove it?
[15:41:15] annoying i guess
[15:41:17] nope, its mainboard nic
[15:41:20] which is whats odd
[15:41:25] i wonder if these have more nics installed
[15:41:28] lemme pull the order
[15:41:39] i wonder if these also have the damned extra nics
[15:42:31] you could see the others in bios on the c2100s
[15:43:32] Broadcom 5709 Dual Port 1GbE NIC w/TOE PCIe-4 (430-3251)
[15:43:39] thats whats on the quote for the r310 order
[15:45:24] oook
[15:51:35] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[15:52:11] sigh, fucking srv278
[15:52:38] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms
[15:54:50] just decommission that box
[15:54:54] it's not worth the trouble
[15:55:47] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[15:55:59] I distinctly remember having a problem with that in the past but I can't find the ticket
[15:56:12] so I may be wrong
[15:56:49] mark: did you see that varnish 3.0.3 was released?
[15:57:41] mark: also VUG has a registration page and limited seats it seems.
[15:58:56] i did
[16:06:44] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours
[16:06:44] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[16:06:44] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[16:06:44] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[16:09:31] hi paravoid, apergos
[16:09:39] hello
[16:09:47] good morning maplebed
[16:10:02] * ^demon waves to everyone
[16:10:28] * everyone waves to ^demon
[16:10:56] so I don't have backscroll for the hour I was commuting; anything interesting with your tests?
[16:11:41] I've tested it with unauthenticated requests, apergos tested it with the swift CLI
[16:11:54] and I'm trying to test with authenticated requests now
[16:11:57] all looks good though
[16:12:24] authenticating with curl or something different?
[16:12:44] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[16:13:01] basically yes
[16:13:54] hey paravoid, a quick debian packaging question: where / how does dpkg-buildpackage determine who changed the source code?
[16:14:14] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time
[16:14:22] drdee: debian/changelog's first entry
[16:14:54] (well, first as you read the file, last chronologically)
[16:15:19] ty
[16:17:00] works fine
[16:17:11] maplebed: so, should we bug Aaron about testing it with MW?
[16:17:28] I wouldn't do that normally, but I remember something you said about getting reports it's broken
[16:17:31] +1
[16:17:44] and a comment on the wiki that says something about auth being broken on >= 1.4.4
[16:17:47] it shouldn't be too hard to move test.wikimedia.org to use ms-fe3
[16:17:53] right
[16:18:06] other than that I think we're good
[16:18:19] btw, any reason that they're still running lucid? legacy?
[16:19:39] only that they were built before we had precise.
[16:19:43] the eqiad cluster is on precise.
[16:19:48] so, legacy
[16:19:49] ok
[16:20:27] I'm also waiting for aaron's ms5 change to get signed off so it can be pushed out
[16:20:34] is there anything else we can do in the meantime?
[16:20:42] did it get a review from Tim?
[16:21:30] gotta find it again
[16:24:36] no it didn't
[16:24:48] ah too bad.
[16:24:56] I looked at it, it seems fine if a bit of code duplication that could now be cleaned up
[16:25:36] since tim didn't get to it we'll have to see if someone else has time and the appropriate expertise
[16:25:40] *grumble*
[16:26:01] hey RoanKattouw...
[16:26:06] ;)
[16:26:20] :-D
[16:26:39] Hey
[16:26:45] What's up?
[16:27:13] any chance you'd be interested in reviewing https://gerrit.wikimedia.org/r/#/c/21153/ ?
[16:27:52] Sure
[16:28:20] * apergos just added the comment they could have made yesterday but didn't notice they were on the requested review list :-/
[16:29:35] hey AaronSchulz
[16:29:40] maplebed: Approved
[16:30:10] apergos: Comment noted, there's a fair bit that can be factored out there, but it looks like that was kind of a problem already
[16:30:23] yeah, it's not a blocker, just a "look at later"
[16:30:50] so now... how do we get this deployed? :-D
[16:31:01] AaronSchulz: two things going on this morning
[16:31:16] * roan just reviewed and approved the multiwrite change
[16:31:41] * we've got two of the production swift proxies on 1.5 and out of rotation and are interested in trying to do something like point test.mw.org at them.
[16:31:58] thanks RoanKattouw !
[16:32:06] I'll deploy it too if you like
[16:32:07] yes, thanks much
[16:32:23] ok let's think about this folks: what do we risk with this going out now?
[16:32:41] * RoanKattouw prepares cherry-pick but won't merge yet
[16:33:43] Meh AaronSchulz beat me to it by a few seconds :)
[16:33:51] beat you to what? :)
[16:34:04] Submitting a cherry-pick of that commit
[16:34:19] AaronSchulz: the new 1.5 proxies are out of the pool, they're ms-fe3 & ms-fe4
[16:34:27] See https://gerrit.wikimedia.org/r/#/c/21193 , we uploaded identical commits seconds from each other, the only difference is the name of the committer
[16:34:38] And apparently mine won because it was submitted a few seconds after Aaron's :S
[16:34:42] :-D
[16:34:42] AaronSchulz: also, 3) is the auth caching still on? I remember reading that it caused troubles yesterday and you disabled it
[16:38:13] so this is deployed to...
[16:38:24] 1.20wmf-something?
[16:38:27] Not yet
[16:38:37] It's been submitted for review to 1.20wmf9 and 1.20wmf10
[16:38:44] which would be all projects
[16:38:45] ok
[16:38:55] how does that process work?
[16:39:09] From there either Aaron or myself can approve&merge those, then manually git pull && sync-file on fenari
[16:39:23] ok
[16:39:35] gerrit-wm Change merged: Aaron Schulz; [mediawiki/core] (wmf/1.20wmf9) - https://gerrit.wikimedia.org/r/21194
[16:39:37] gerrit-wm Change merged: Aaron Schulz; [mediawiki/core] (wmf/1.20wmf10) - https://gerrit.wikimedia.org/r/21193
[16:39:51] then I suppose the actual config change needs to get made and synced
[16:40:16] ?
[16:44:27] AaronSchulz: so having synced the php, now you enable it via config?
[16:45:37] yes
[16:46:49] so my plan is, after the sync and the config change, I'm going to start my deleter on ms5 (not that it matters but still) and we'll see what dies :-P
[16:47:16] fair enough
[16:47:45] +1
[16:48:15] New patchset: Aaron Schulz; "Only write thumbnails to the master backend." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21195
[16:49:12] now there is no other container for thumbs, no temp or whatever, right?
[16:49:20] just local-thumb?
[16:53:39] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21195
[16:54:37] apergos: yes; all thumbs are in the -thumb container. archiving happens within the container (eg thumb/archived/a/a2/...)
[16:54:46] * apergos answers their own question by listing the containers
[16:55:16] maplebed: btw, I read that swift added versioning. what's your opinion about that?
[16:55:52] paravoid: definitely a useful feature in general, but not necessary here since mediawiki does the versioning for us.
[16:57:12] so in theory I can start deleting... speak now or share the blame when the site falls over
[16:58:06] actually I'll wait a bit, still see a lot of open connections
[16:58:08] apergos: deletionist!
[16:58:30] only stuff that can be regenerated from existing material!
[16:58:36] was it pushed already?
[16:58:37] :)
[16:58:50] I see it merged but not yet synced.
[16:58:50] http://wikitech.wikimedia.org/view/Server_admin_log
[16:59:03] guess I just missed it.
[16:59:08] * AaronSchulz hates comcast promotion calls
[16:59:18] the log messages don't go to the channel any more
[16:59:19] it's a bug
[16:59:27] (from sync file or whatever it is)
[16:59:37] robh: mutante and i were discussing this last night can you review https://gerrit.wikimedia.org/r/#/c/21145/
[17:00:29] (mutante) ^ autocorrect
[17:00:38] cmjohnson1: reviewing the related rt's now
[17:00:59] I don't see traffic to ms5 falling off yet either.
[17:01:07] I'm on there watching
[17:01:17] AaronSchulz: should we see effects immediately? or should it take a bit?
[17:01:27] number of connections seems to fluctuate but not fall off, it's true
[17:01:54] I don't see any change
[17:02:00] must be a bunch of readers
[17:02:32] tcp 0 0 ms5.pmtpa.wmnet:nfs srv258.pmtpa.wmnet:swat ESTABLISHED
[17:02:35] typical entry
[17:03:52] New review: RobH; "srv266 is an R610 with over a year of warranty left, thus it should not be decommissioned." [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/21145
[17:03:59] cmjohnson1: ^
[17:05:14] cmjohnson1: so rejected, you can cherry pick and remove srv266 from the decom list
[17:05:27] but the other two are ok, can re-review when thats fixed
[17:05:37] oops thought I checked that already and it expired in april... will take a look... ok..cool..thx
[17:05:50] racktables had no warranty info
[17:05:56] so i just pulled up dell's warranty status check
[17:06:09] i dont bother to bookmark it, its google's first reply on dell warranty check
[17:06:16] yep... i have the site bookmarked
[17:06:29] Now, if you want a supreme amount of busy work that has to someday get done
[17:06:38] sorry... that was my fault..should've never made it to the ticket
[17:06:41] I can give you the SQL query to run to dump out all of racktables into a CSV
[17:06:50] which you can then parse to find out which servers are missing info, and track down said info.
[17:06:52] i may need that for an audit
[17:06:57] oh, you will
[17:07:09] but you may want it well before so you can fix things before they see it
[17:07:18] I see that directory ops stopped, but not stores
[17:07:19] * RobH had two old ES servers labeled incorrectly from their initial install
[17:07:35] Jeff_Green is entirely responsible for the magic that is the sql query.
[17:07:49] when audit time comes, you can send him donations of .... i dunno
[17:07:55] is he the sql wizard?
[17:07:55] srv224 still doing nfs
[17:07:59] Jeff_Green: you want scotch, cookies, what?
[17:08:07] so...
[17:08:07] from/to ms5. bah humbug
[17:08:10] what are we waiting for?
[17:08:16] cmjohnson1: the sad part is i used to work writing sql queries for a living
[17:08:17] (it's a scaler)
[17:08:30] i have forgotten every single bit of it.
[17:08:45] well now I would say we are waiting to find out why the config change didn't have the impact we expected (at least that's what I'm on)
[17:08:56] robh: shame
[17:08:59] ie: I could explain nested queries and how i wanted the damned report to work, but i couldn't recall the syntax to save my life
[17:09:01] lol
[17:09:19] AaronSchulz: any thoughts?
[17:10:22] robh: send me the sql query when u get a chance... that may be something for me to do during the hurricane as long as pmtpa doesn't go down.
[17:11:20] cmjohnson1: it is already living on db9, i can walk you through using it now
[17:11:43] apergos: http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms5.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1345741873&g=network_report&z=large&c=Miscellaneous%20pmtpa
[17:11:44] (I see the changes in filebackend.php on srv224 so that made it around at least)
[17:12:09] cmjohnson1: so login as root@db9. in root's home is a sql query called racktables_inventory
[17:12:26] you run it with mysql < racktables_inventory > outputfilename
[17:12:40] then you can scp that result file over to fenari (your home directory) and do whatever
[17:12:49] it took that long?
[17:12:55] RobH: whiskeycookies!
[17:12:57] what changed to trigger it to drop all of a sudden like that?
[17:13:05] now tcpdump shows no packets
[17:13:13] well arp... :-P
[17:13:15] cmjohnson1: I dump the result file into gdocs spreadsheet, sort by rack row location for audit
[17:13:27] and then pull out all non tampa related items (easy since its sorted by location)
[17:13:32] apergos: there was extra whitespace in a setting name
[17:13:44] oh geeee
[17:13:44] damnable whitespace.
[17:13:50] Jeff_Green: duly noted, cmjohnson1 you owe jeff some whiskeycookies.
[17:13:57] that's why only the directory calls went away at first
[17:14:01] ok, waiting for the rest of the open connections to go away
[17:15:05] ah it's finally starting to drop off yay
[17:16:06] so AaronSchulz what do you think of the idea of sending test.wm.org traffic to the upgraded ms-fe production hosts?
[17:16:11] yes!
[17:17:49] maplebed: do they have their own rrdns or should I just pick one or mt_rand() between them or something?
[17:17:56] pick one.
[17:18:04] ms-fe3 or ms-fe4
[17:19:48] New patchset: Dzahn; "decom srv206,srv217 - RT-1422, RT-241" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21145
[17:20:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21145
[17:23:13] they are still not actually dropping off. from the scalers there is nothing but there are regular app servers that are doing things
[17:23:17] not good
[17:23:25] and that number is not declining, it's just bouncing around
[17:25:43] mutante: is your change the fix robh was talking about?
[17:26:11] maplebed: testwiki seems ok
[17:26:21] cmjohnson1: it is
[17:26:44] apergos: which number isn't declining?
[17:26:52] AaronSchulz: what tests did you run?
[17:26:53] something in the bowels of mw is still using ms5 nfs
[17:26:56] New review: RobH; "looks good" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/21145
[17:27:10] maybe it's just checking that the mount exists, who knows, but it's doing something
[17:27:13] merging on sockpuppet now
[17:27:29] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[17:27:36] well, havent yet
[17:27:45] mutante: were you already doing that? (I dont wanna do it if you are)
[17:27:50] maplebed: moving, deleting...someone should try uploading a kitty
[17:27:59] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21145
[17:28:02] apergos: how do you know that?
[17:28:03] * AaronSchulz already did restoring too
[17:28:04] I see no traffic
[17:28:13] and I've been writing it wrong; it's test.wikipedia.org, not wikimedia.org, right?
[17:28:17] I was watching tcpdump
[17:28:33] and there were nfs packets
[17:28:42] cmjohnson1: mutante's change (for srv206/217) is live
[17:28:46] to a specified app server (I chose one of the ones with an open connection at random)
[17:28:48] so you will wanna resolve those tickets
[17:28:49] maplebed: yes
[17:28:54] aka srv193
[17:29:00] (241, 1422)
[17:29:02] and wipe disks
[17:29:02] cool thx robh
[17:29:06] yep
[17:29:08] apergos: and it's not ganglia? ;)
[17:29:15] RobH: the decom change? no, i wasn't, but amended it. yup
[17:29:32] apergos: oh, and yes auth caching was renabled
[17:29:36] * AaronSchulz forgot to answer
[17:30:11] mutante: i merged, thx for amending
[17:31:25] RobH: yw. cmjohnson1: looks like it was.
[17:31:43] ganglia? on app servers?
[17:31:53] polling ms5 nfs? :-P
[17:32:05] ah, i am so used to having white text on a black background, now that it's black on white for once, i can hardly read IRC properly..its weird
[17:33:17] apergos: I don't think ganglia should be doing anything to ms5 on the app servers.
[17:33:36] I am sure it isn't, I was wisecracking back at aaron
[17:33:46] AaronSchulz: I followed your request. MOAR KITTENS. http://test.wikipedia.org/wiki/File:Feral_Cat.jpg
[17:33:53] apergos: from the apaches? odd
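[Editor's note: the capture apergos pastes in the next two lines came from watching the lingering NFS traffic on ms5; the exact invocation isn't in the log, but it was presumably something like the sketch below. Port 2049 is NFS; srv238 is the app server named in the paste.]

    # count the lingering NFS connections from the app servers
    netstat -tn | grep -c ':2049.*ESTABLISHED'
    # watch one connected apache to see whether it's real I/O or
    # just idle acks (-v gives the tos/ttl detail seen in the paste)
    tcpdump -v host srv238.pmtpa.wmnet and port nfs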
[17:33:55] 17:32:55.781805 IP (tos 0x0, ttl 64, id 24551, offset 0, flags [DF], proto TCP (6), length 52)
[17:33:56] srv238.pmtpa.wmnet.890 > ms5.pmtpa.wmnet.nfs: Flags [.], cksum 0xab6e (correct), ack 1, win 12, options [nop,nop,TS val 6742666 ecr 130598947], length 0
[17:34:02] for example
[17:34:07] scalers or regular ones?
[17:34:26] well that is 238, so not a scaler
[17:35:17] maybe it's something outside of mw verifying the mount?
[17:35:27] I'm just guessing wildly at this point
[17:36:00] apergos: well, it's still mounted, so...
[17:36:14] a bit of traffic is not really surprising
[17:36:18] anyways the point is there are still around 35-40 open connections at any time
[17:36:20] as long as it's not much
[17:36:24] nfs connections, from the app servers
[17:36:29] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[17:36:29] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[17:36:39] I would just like to be sure my delete run won't hose them.
[17:37:12] apergos: it seemed that ms5's issues yesterday only increased load on the scalers, not the rest of the apaches, right?
[17:37:19] so even if they're still connected, I'd bet you're ok.
[17:37:23] well .. heh, we didn't get that far
[17:37:37] anyways, straw poll: ok to go ahead with delete?
[17:38:24] * maplebed draws the short straw
[17:38:46] is that a yes or a no though? :-D
[17:39:17] well, since I drew the short straw, it means you're protected, right? I say go ahead.
[17:39:40] paravoid, AaronSchulz ?
[17:40:10] yes, go ahead
[17:41:09] * apergos 's impatience wins
[17:41:14] heh
[17:41:22] sorry too slow AaronSchulz :-D
[17:41:34] paravoid: ping
[17:41:56] "feral"...ohh
[17:42:03] preilly: pong
[17:42:06] for the first little while it won't do much, it's processing things that were deleted yesterday
[17:42:16] paravoid: can I pm
[17:42:19] sure
[17:42:21] AaronSchulz: :)
[17:42:23] yeah, I'm not seeing any attempts to write to ms5 from mw
[17:42:46] AaronSchulz: so, green light for ms-fe3/4?
[17:43:37] it looks ok
[17:43:54] great.
[17:44:10] AaronSchulz: about auth caching, did you disable it yesterday after all?
[17:44:25] [10:29] AaronSchulz apergos: oh, and yes auth caching was renabled
[17:44:55] oh, sorry, didn't see that
[17:44:58] oh so what was the bug in the end?
[17:44:59] tis live :)
[17:45:00] great
[17:45:03] er the cause I mean
[17:45:15] apergos: something stupid in CloudFiles
[17:45:32] since we have our own fork, stuff is easy to fix though
[17:45:37] yay for that
[17:45:40] are we contributing that back?
[17:45:45] (just curious)
[17:45:57] apergos: it wasn't letting cached credentials load with a cdn auth url
[17:46:01] *without
[17:46:10] of course, we don't use that stuff
[17:46:30] paravoid: I don't bother anymore, it's poorly maintained
[17:46:41] just look at the last 2 commits for example
[17:46:50] sigh
[17:47:13] when I started merging it into our fork as a commit, I realized it was totally broken and wip
[17:47:39] New patchset: Pyoungmeister; "mobile.pp: upping version number on vumi and vumi-wikipedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21198
[17:48:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21198
[17:48:36] hmm, I see a few more commits there lately
[17:48:46] none of which fix the big broken one
[17:49:04] which one is the big broken one?
[17:49:56] https://github.com/rackspace/php-cloudfiles/commit/930eb8df511160da6eae203e3baf986a24085cea
[17:54:16] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21198
[17:54:29] PROBLEM - Puppet freshness on snapshot4 is CRITICAL: Puppet has not run in the last 10 hours
[17:54:52] apergos: did you rm?
[17:55:03] I am doing so
[17:55:18] using the same boring script as
[17:55:22] was that only yesterday? :-D
[17:55:31] so whats going on with the upgrade?
[17:55:47] nothing yet
[17:56:13] maplebed: how would you feel about having mixed 1.4/1.5 proxies in LVS?
[17:56:18] I don't see any problems, but better to check
[17:56:35] I think it's ok.
[17:56:38] i.e. reenable ms3/4 in pybal
[17:56:47] did we confirm that traffic from test.wp.org is actually going to ms-fe3 or 4?
[17:56:48] then disable ms1/2, upgrade those two
[17:57:12] I didn't, but if Aaron says so.
[17:58:48] okay, so plan: reenable ms3/4, see how it goes until tomorrow our morning, then upgrade the other two
[17:58:54] fair enough?
[17:59:00] -1 upgrade on a friday.
[17:59:15] well, it's upgrade on thursday technically, but I see your point.
[17:59:38] the other option is doing all of them today or waiting until Monday.
[18:00:00] we want em to have a few hours on em I think
[18:00:10] AaronSchulz: which front end did you use? fe3 or fe4?
[18:00:11] right, I'd prefer that route too
[18:00:11] minimum
[18:00:14] (for test.wp.org)
[18:00:15] fe4
[18:00:28] but I'm not good for upgrades at midnight my time
[18:00:45] otoh really, even if we said yes we're going to deploy on a friday it's cool... I really want my friday
[18:00:50] I've lost every evening for a week
[18:01:01] apergos: note how I said our *morning*
[18:01:05] not evening :)
[18:01:09] yes, I noted that :-D
[18:01:39] so how about back ends? can we do any of those while we wait around?
[18:01:49] good q
[18:01:54] maplebed: you had some thoughts about this
[18:02:21] I don't see the log entry for my Feral Cat upload on ms-fe4.
[18:02:58] notpeter: ping
[18:03:39] maplebed: oh crap
[18:03:50] ?
[18:03:58] I think the auth caching may have ruined the test
[18:04:16] is it not just auth caching but also host caching?
[18:04:25] bummer :-(
[18:04:57] wow diablingn writes could not have come any later, 81gb free
[18:05:02] *disabling
[18:05:27] heh
[18:05:42] apergos: ossm.
[18:06:24] huh?
[18:06:58] AaronSchulz: could you explain a bit about that?
[18:07:06] the auth caching on testwiki etc.
[18:07:37] I just disabled it for testwiki
[18:07:39] * AaronSchulz is testing again
[18:07:45] apergos: the 81gb thing.
[18:08:06] AaronSchulz: well I got that part :) why?
[18:08:10] Ryan_Lane: morning
[18:08:17] morning
[18:08:22] how's the swift stuff going?
[18:08:23] Ryan_Lane: apparently we need your expertise
[18:08:41] on hurricane-hit datacenters
[18:08:44] I have none, but ok :)
[18:08:45] :)
[18:08:46] hahaha
[18:09:04] well, let me give you the short answer. we're fucked if it's a good direct hit
[18:09:19] yeah I'm asking what "ossm" means
[18:09:20] * Damianz gives ryan the bolt cutters and lets him loose at the power lines
[18:09:21] and a decent sized hurricane
[18:09:38] we'll have a couple weeks of up and down service
[18:09:41] minimum
[18:09:50] and that's if the datacenter *really* gives a fuck
[18:09:52] how long do their diesel generators work if electricity goes down? i am expecting like 10 minutes?
[18:10:03] as long as they keep refueling them
[18:10:09] usually 6 hours or so
[18:10:16] you mean if they didn't use the fuel for the lawnmowers?
[18:10:20] heh
[18:10:24] hehe
[18:10:30] push lawnmowers
[18:10:59] eh, nevermind, i should have said i am expecting like 10 minutes for backup batteries that give them the time to have the diesel generators up and running
[18:11:06] what's the category of the storm?
[18:11:13] and what's it expected to be when it hits?
[18:11:23] New review: Jerith; "Looks good." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21073
[18:11:40] I think it's too soon to know and too soon to know
[18:11:43] maplebed: what is the ip of fe-4
[18:11:51] AaronSchulz: I see auth stuff hitting ms-fe4 but no object stuff yet.
[18:11:59] http://www.wunderground.com/tropical/tracking/at201209_5day.html
[18:12:03] AaronSchulz: 10.0.6.215
[18:12:04] US forecasters said Isaac will likely turn into a Category 1 hurricane by Friday
[18:12:07] ,eh
[18:12:08] err
[18:12:09] meh
[18:12:13] cat I at best
[18:12:17] nothing to worry about
[18:12:34] and realistically when it lands it'll drop to a TS
[18:12:38] they don't know it will hit florida even (I think)
[18:12:42] Ryan_Lane: it's Tampa, not usmil
[18:12:47] but tis the season...
[18:12:55] paravoid: dude, a cat I is like a strong breeze
[18:13:15] I never even evacuated unless a storm was a cat III or better
[18:13:21] I have absolutely no idea about hurricanes
[18:13:27] that's like when i got into a monsoon in Australia and CT told me that it's like a breeze if you grew up with it :)
[18:13:33] I can tell you about earthquakes if you want
[18:13:44] we have local expertise in those
[18:13:45] a cat I is 75-95 MPH. a cat III is 111-130
[18:13:51] the wind isn't really the problem, though
[18:13:53] it's the water
[18:13:57] AaronSchulz: should I expect stuff to hit ms-fe4 yet or are you still changing something?
[18:13:59] and especially the tidal surge
[18:13:59] apergos: we haven't had any major earthquakes since you've been here, have we?
[18:14:00] it depends
[18:14:03] it's about time!
[18:14:06] maplebed: should be getting hits
[18:14:10] it's not.
[18:14:20] well when we have major quakes...
[18:14:31] I'm also looking at logs, I concur that it's not
[18:14:31] oh wait.
[18:14:44] and you also have active volcanoes
[18:14:46] I distrust the logs
[18:14:47] *maybe* the power will go out. if it does, I'd expect the datacenter to failover to generators without interruption. if their failover doesn't work, that's one more giant reason we shouldn't be using them
[18:14:52] because of the proxy-logging change.
[18:15:01] http://en.wikipedia.org/wiki/1989_Loma_Prieta_earthquake ( for paravoid)
[18:15:29] oh, except that it is getting log entries for the pybaltestfile.
[18:15:32] Jamesofur: ping!
[18:15:44] nevermind
[18:15:48] maplebed: pybal monitoring was logged
[18:15:49] right
[18:16:19] Ryan_Lane: it's also the objects that can be tossed against buildings (I recall a nice giant tree being tossed on the roof of the library I was in during one storm. made a huuuge V shape from the dent)
[18:16:33] of course for a dc you expect it to be impervious
[18:16:50] AaronSchulz: do you have ideas on why it wouldn't be hitting ms-fe4?
[18:17:04] Ryan_Lane: You know the failover failed before, right? 4th of July outage in 2011 (? or was it 2010?)
[18:17:10] no
[18:17:39] the swift stuff gets logged to /var/log/messages? really? ughhhh
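[Editor's note: the curl test mentioned in the next message presumably looked something like this standard Swift v1.0 auth handshake; the account, user, key and object path below are placeholders, not values from the log.]

    # authenticate directly against the upgraded proxy
    curl -si -H 'X-Auth-User: account:user' -H 'X-Auth-Key: SECRETKEY' \
        http://ms-fe4.pmtpa.wmnet/auth/v1.0
    # reuse the returned X-Auth-Token / X-Storage-Url for an object HEAD
    curl -sI -H "X-Auth-Token: $TOKEN" \
        "http://ms-fe4.pmtpa.wmnet/v1/AUTH_.../some-container/some/object"
    # or let the CLI do both steps, as apergos did at 12:42
    swift -A http://ms-fe4.pmtpa.wmnet/auth/v1.0 -U account:user -K SECRETKEY stat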
[18:18:14] a test to ms-fe4 using curl successfully logged, so logging's not the problem.
[18:18:35] RoanKattouw: yes
[18:18:56] apergos: I wouldn't worry too much about the wind
[18:19:03] apergos: a CAT I has fairly weak winds
[18:19:15] yes, for cat 1 I just wouldn't worry at all
[18:19:16] and by the time it hits tampa (if it does, even directly), they'll be much weaker
[18:19:25] A cat I is like having free burritos available for all the ops team
[18:19:31] :-D
[18:19:50] New patchset: Platonides; "Fix typo which renamed group from 'patroller' to 'atroller' on eswikt" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21203
[18:20:00] ^ can someone merge?
[18:20:29] I wouldn't have a picnic during a cat I, but I'd surely have a hurricane party :)
[18:20:31] so paravoid, apergos: you were asking earlier about putting them in; I'd like to get test using them first. So it sounds like we schedule a window on monday.
[18:20:32] I've been in a Bft 11 (just below Cat I) before, it was a bit impressive but not scary or damaging. At least not in the coastal areas where we were used to storms, some places inland where Bft 9-10 is a BFD had damage
[18:20:39] oohhhh, maybe we should have a hurricane party in the office
[18:20:42] heh
[18:20:43] hehe
[18:20:50] Hurricane party for a hurricane on the other side of the country
[18:20:51] only if it's the kind of hurricane you can drink.
[18:21:23] New review: Dzahn; "sure, obvious typo" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/21203
[18:21:24] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21203
[18:21:28] maplebed: ok.. what about backends, can we do any of those today?
[18:21:51] thanks mutante
[18:21:57] a sync-file now?
[18:23:06] Platonides: I think that atroller might be more accurate than patroller ;)
[18:23:06] apergos: up to you, but I'd lean towards getting the proxy tested before doing any of the backends. just a preference though, there's nothing stopping us from doing them first, with the one qualification that if something does go wrong (with the proxy testing) there's one more variable in the mix.
[18:23:44] notpeter, please explain that to the people unable to patrol the wiktionary :)
[18:24:00] but they can still troll it!
[18:24:05] so that's pretty good :)
[18:24:36] New patchset: Demon; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484
[18:25:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484
[18:25:29] RECOVERY - Puppet freshness on silver is OK: puppet ran at Thu Aug 23 18:24:59 UTC 2012
[18:26:51] Change merged: Pyoungmeister; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/21060
[18:28:19] maplebed: to put back ms-fe3/4?
[18:28:46] paravoid: we haven't gotten mediawiki to test against ms-fe4 yet.
[18:28:48] that sounds overly cautious but I'm okay with that
[18:29:02] heh, I changed it to fe3 and now I get hits for the IP for fe-1
[18:29:18] ok. so today, is there anything else we can move forward with?
[18:29:19] Platonides: what is the step in between merge in gerrit and having it in /h/w/common/wmf-config?
[18:29:39] i see a git status there with a modified filebackend.php and untracked files... hrmm
[18:29:49] so we have ms-fe3/4 online, then ms-fe1/2, then ms-be one by one
[18:29:55] and on top of that we have origs
[18:29:57] fun
[18:30:07] yeah it sure is
[18:30:10] (n't)
[18:30:10] Platonides: and sync-file docs have changed and say it syncs /home/wikipedia/common/wmf-deployment/ , not wmf-config
[18:30:50] apergos: waiting to see if anything's broken is more time than actually doing it though
[18:30:53] it's not much work
[18:31:06] yes, it's just
[18:31:08] proposed timeline?
[18:31:18] that we don't get ben til mid september
[18:31:27] no we don't
[18:31:28] so I'm feeling that crunch
[18:31:31] get him at all
[18:31:36] err... the 28th is my last day.
[18:31:44] full stop.
[18:31:53] yeah we know. rub it in :-P
[18:32:02] that's why I wanted to start with 1.5 yesterday
[18:32:02] apergos: no, the "mid september" part has changed
[18:32:22] yep
[18:32:27] huh?
[18:32:32] sync_scripts wikitech page now redirects to "Wikimedia binaries", but they aren't really binaries
[18:32:36] we know we don't get him til mid september, we only get him til the 28th
[18:32:45] no part time days no nothing
[18:32:59] okay, correct
[18:33:09] so, timeline?
[18:33:28] is gonna suck but that's how it is
[18:34:13] bah I am brain dead
[18:34:31] I'm about to leave too, to save what's left of my social life
[18:34:44] but I'd like to conclude on a timeline first
[18:34:48] !log deleting about 200k thumb dirs and their contents from ms5, unused on any project, covering june and july uploads
[18:34:56] I should have logged that ages ago
[18:34:58] Logged the message, Master
[18:34:58] anyways...
[18:35:20] well I hope that aaron and ben can get testing happening on the proxies
[18:35:28] so that monday we can move forward on that
[18:35:47] when do we try again with originals, now that ms5 is out of the loop?
[18:36:26] I guess it would be "put em in the pool" monday morning our time, then later that day try doing the other two?
[18:36:28] our queue has
[18:36:42] and then tues would be backends
[18:36:44] *sigh*
[18:36:54] 1) ms-fe3/4 testing & subsequent pooling, 2) ms-fe1/2 upgrade, 3) ms-be* upgrades, 4) origs
[18:37:10] ahh, I see
[18:37:24] for (1), if Aaron gives the green light soon, I guess you can do it maplebed?
[18:37:25] maplebed: when I authenticate I get the rr dns url
[18:37:40] it's a matter of setting two False to True and then making sure nothing melts
[18:37:50] AaronSchulz: rrdns or lvs?
[18:37:51] no wonder I wasn't getting 3/4
[18:37:57] rrdns is supposed to be long gone.
[18:38:00] AaronSchulz: local hack to override it? we need it for about 10 minutes...
[18:38:12] just don't do it at the end of your day
[18:38:13] http://ms-fe.pmtpa.wmnet:80/v1/AUTH_...
[18:38:15] maplebed: could you do that today perhaps? to accelerate our timeline?
[18:38:20] AaronSchulz: that's lvs. good.
[18:38:27] isn't that supposed to be ms-fe.svc?
[18:38:33] paravoid: yeah, should be able to.
[18:38:39] /etc/hosts file on test.wp.org would do it.
[18:39:05] okay
[18:39:09] so then it would be ms-fe1/2 upgrade monday,
[18:39:14] so, (1) today, unless something else goes horribly bad
[18:39:20] maybe backends monday evening our time? ugh
[18:39:24] (2) monday european monday
[18:39:37] monday we also have the ops meeting btw
[18:39:37] but backends might bleed into the next day
[18:39:53] anyone recently pushed out wmf-config stuff?
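[Editor's note: Platonides's question at 18:29 (the step between a gerrit merge and /h/w/common) never quite gets answered in-channel; per the flow RoanKattouw describes at 16:39, it is roughly the following sketch, with the file name and log message here being illustrative rather than from the log.]

    # on fenari, after the change is merged in gerrit
    cd /home/wikipedia/common/wmf-config
    git pull
    # push the one file to the apaches and write a server admin log entry
    sync-file wmf-config/InitialiseSettings.php 'Fix eswikt patroller group typo'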
[18:39:53] ok they will definitely bleed into next day, actually let's not schedule anything for mon evening
[18:40:21] back ends tuesday our time
[18:40:30] see how slow and painful that is
[18:41:20] I think I need to reboot ms-be3. the object server processes are all stuck in D state. ::sigh::
[18:41:47] do either of you want to poke before I reboot?
[18:42:00] okay, we've figured out what to do today and tomorrow it's readonlyfriday, so talk again tomorrow to sync up?
[18:42:22] no poking from me
[18:42:37] I suppose given the time constraints, we might suggest violating read-only friday.
[18:42:48] grrrr
[18:43:06] this weekend I will not be around (I hope) to help if things break
[18:43:17] and friday evening I want to be off work
[18:43:21] I see no problem doing something *early* tomorrow, i.e. late U.S. Thursday
[18:43:45] I'm ok with it too, and I will be around this weekend.
[18:44:01] okay, let's see how this day goes
[18:44:05] yep
[18:44:06] and talk tomorrow morning with apergos.
[18:44:13] !log rebooting ms-be3
[18:44:16] it's going to depend on getting the proxies pooled and etc
[18:44:22] Logged the message, Master
[18:45:38] New patchset: Pyoungmeister; "change character encoding for vumi" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21206
[18:46:05] maplebed: okay, in the meantime could you send us a bit more info about some of the items in the swift_tasks_2012-08-13 list?
[18:46:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21206
[18:46:37] like the redo zones part, you've told me a few things
[18:46:51] or chat with you tomorrow
[18:46:54] got to leave now
[18:47:01] thanks :)
[18:47:06] * apergos clocks out as well
[18:47:10] oh rats
[18:47:17] gaaahh one more task for me...
[18:47:32] PROBLEM - swift-account-server on ms-be3 is CRITICAL: Connection refused by host
[18:47:41] PROBLEM - swift-container-server on ms-be3 is CRITICAL: Connection refused by host
[18:47:57] New patchset: Pyoungmeister; "change character encoding for vumi" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21206
[18:48:04] paravoid: which ones?
[18:48:25] the ones we're unlikely to touch until Tuesday
[18:48:26] PROBLEM - swift-container-replicator on ms-be3 is CRITICAL: Connection refused by host
[18:48:26] PROBLEM - swift-account-replicator on ms-be3 is CRITICAL: Connection refused by host
[18:48:33] like zones in tampa
[18:48:35] PROBLEM - swift-object-replicator on ms-be3 is CRITICAL: Connection refused by host
[18:48:39] k.
[18:48:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21206
[18:48:55] statsd maybe, dunno
[18:49:02] PROBLEM - swift-object-server on ms-be3 is CRITICAL: Connection refused by host
[18:49:02] PROBLEM - swift-container-updater on ms-be3 is CRITICAL: Connection refused by host
[18:49:05] heyyyyy maplebed
[18:49:07] do you know what this is? http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=dataloss&mreg[]=%5Edataloss%24&hreg[]=emery&aggregate=1&hl=emery.wikimedia.org|Miscellaneous%20pmtpa
[18:49:07] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=dataloss&mreg[]=%5Edataloss%24&hreg[]=emery&aggregate=1&hl=emery.wikimedia.org|Miscellaneous%20pmtpa [18:49:11] PROBLEM - swift-account-auditor on ms-be3 is CRITICAL: Connection refused by host [18:49:11] PROBLEM - swift-object-updater on ms-be3 is CRITICAL: Connection refused by host [18:49:13] i can't find 'dataloss' anywhere [18:49:20] PROBLEM - SSH on ms-be3 is CRITICAL: Connection refused [18:49:22] maplebed: just trying to squeeze every last bit of information out of you, sorry :-) [18:49:23] maplebed: yeah, so trying to hack fe-4 into the storage url seems to work well [18:49:27] maplebed: and thanks so much for everything [18:49:28] *doesn't seem [18:49:29] PROBLEM - swift-object-auditor on ms-be3 is CRITICAL: Connection refused by host [18:49:38] gtg now, bye [18:49:47] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: Connection refused by host [18:49:53] cya [18:49:56] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: Connection refused by host [18:50:01] ottomata: looking [18:50:27] New review: Jerith; "\o/" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/21206 [18:50:34] ottomata: the dataloss metric is useless. pay attention to the packet_loss_* metrics instead. [18:52:31] where does dataloss come from? [18:52:33] robla is wondering [18:52:36] can we get rid of it? [18:52:40] robla sees it [18:52:42] it's embedded in udp2log [18:52:42] and then gets scared [18:52:43] and then asks me [18:52:47] and I say 'iunnnooooo' [18:52:50] and it doesn't do the ignore ssl traffic stuff [18:52:52] :) [18:52:53] udp2log sends directly to ganglia? [18:52:53] so gets 99% all the time. [18:52:56] yes. [18:52:58] ah...that problem [18:53:06] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21206 [18:53:10] can I push a change to stop it from doing that? [18:53:12] ok. it's been a while since I looked at that [18:53:15] that will one day get built and deployed? [18:53:26] it's a good nag for us to fix the underlying problem [18:53:29] ok that will take 3 hours, so I am definitely clocked out. tah [18:56:25] maplebed: yeah, so I'm not sure how to test this then [18:57:06] AaronSchulz: I thought you just said hacking the storage url worked. [18:57:10] well, at least I know authing works with 1.5 [18:57:23] no I was saying it didn't ;) [18:57:39] oh. you corrected yourself later; I missed that part. [18:57:42] maplebed, I don't see anything about ganglia or dataloss in the udplog repo [18:57:50] there is packet-loss.cpp [18:57:56] but that is the custom filter that all the udp2log instances run [18:58:03] AaronSchulz: putting ms-fe.pmtpa.wmnet in /etc/hosts with ms-fe4's ip address will do it for that host. [18:58:04] and we use that for packet loss alerts, etc. [18:58:06] I was getting errors on everything I did on testwiki with that hack [18:58:38] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [18:59:05] PROBLEM - Host ms-be3 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:08] the requests did go to the right proxy though ;) [18:59:44] damn. I missed watching ms-be3 boot. [19:00:25] why was it rebooting? [19:00:40] because I rebooted it to look for a memory test error.
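For context on the packet_loss_* metrics mentioned above: packet-loss.cpp is the custom udp2log filter, and the general idea, assuming loss is estimated from gaps in the per-host sequence counters carried in the relayed log stream, looks roughly like this toy Python restatement (the real filter's field layout and windowing differ):

    # Toy re-statement of the packet-loss idea: each udp2log source tags
    # lines with an increasing sequence number, so missing numbers suggest
    # dropped packets. Field positions here are assumed, not the real format.
    from collections import defaultdict

    def loss_by_host(lines):
        seen = defaultdict(int)      # packets that actually arrived, per host
        expected = defaultdict(int)  # packets implied by the sequence numbers
        last = {}
        for line in lines:
            fields = line.split()
            host, seq = fields[0], int(fields[1])  # assumed layout
            seen[host] += 1
            if host in last:
                expected[host] += max(seq - last[host], 1)  # tolerate resets
            else:
                expected[host] += 1
            last[host] = seq
        return {h: 100.0 * (1 - seen[h] / expected[h]) for h in seen}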
[19:01:11] RECOVERY - swift-account-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [19:01:20] RECOVERY - swift-container-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [19:01:20] RECOVERY - swift-object-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [19:01:20] RECOVERY - swift-account-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [19:01:20] RECOVERY - swift-object-updater on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [19:01:20] RECOVERY - Host ms-be3 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [19:01:29] RECOVERY - SSH on ms-be3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:01:29] RECOVERY - swift-object-auditor on ms-be3 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [19:01:56] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:02:05] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [19:02:15] RECOVERY - swift-container-replicator on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [19:02:15] RECOVERY - swift-object-replicator on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [19:02:23] RECOVERY - swift-account-replicator on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [19:02:41] RECOVERY - swift-container-updater on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [19:04:56] maplebed: did you see the error? [19:05:04] still on the phone about 6 [19:05:24] on ms-be3? no, I was trying to do too much at once and missed watching during the time it would have shown up. [19:05:40] the host does seem happier at the moment though, and I'm looking at the ipmi log. [19:06:01] looks clean. [19:12:12] maplebed: I guess you can try the /hosts thing on srv193 [19:12:27] not sure if it would work any better [19:12:41] wait, that isn't what you said you just tried and it failed? [19:13:03] ::sigh:: [19:13:06] I was rewriting the storage url to use ms-fe4 [19:13:14] totally misinterpreted your message due to its ordering in the IRC log. [19:13:14] anyway, hosts is rooted [19:13:15] maplebed: so the hard drives that are going bad on all the ms-be's are going into predictive failure... once the error rate hits a certain threshold the drive is turned off. no fix other than to replace. [19:13:43] any comment about why they're all failing? [19:13:56] and by 'they all' I mean the ~10% failure rate we're seeing. [19:13:57] they could have bad sectors [19:14:25] AaronSchulz: I'll make that change. srv193 you say? [19:14:27] which is most likely the case since they're out of the box like this... dell is buying cheap hard drives from western digital (my opinion) [19:14:36] yeah, that's testwiki [19:15:05] cmjohnson1: there isn't any way for you to ID the failed drives, is there? [19:15:12] it's something you need to work with me on? [19:16:37] what do you mean? if it isn't mounting is one way. or do you mean identifying which drive is the bad one by slot#?
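The "trying to mount the bad drive" approach described above can be mechanized: compare the mountpoints fstab expects against what /proc/mounts actually shows. A sketch, with the /srv/swift-storage/ one-mountpoint-per-drive layout assumed rather than taken from the real ring:

    #!/usr/bin/env python
    """List swift storage mounts that are expected but absent.

    Sketch only: the mountpoint prefix is an assumption about the layout.
    """

    def mounted_points(mounts_path="/proc/mounts"):
        with open(mounts_path) as f:
            return {line.split()[1] for line in f}

    def expected_points(fstab_path="/etc/fstab", prefix="/srv/swift-storage/"):
        points = set()
        with open(fstab_path) as f:
            for line in f:
                if line.strip() and not line.startswith("#"):
                    mnt = line.split()[1]
                    if mnt.startswith(prefix):
                        points.add(mnt)
        return points

    if __name__ == "__main__":
        for mnt in sorted(expected_points() - mounted_points()):
            print("NOT MOUNTED (drive likely failed): %s" % mnt)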
[19:17:52] just whether you need help to ID which ones have failed [19:18:41] AaronSchulz: I think it worked. [19:19:53] err.. I think \o/ [19:20:09] I assume this is you: /Test_for_testing_fancy_test_uploads_09.jpg [19:20:46] Error deleting file: Could not move file "mwstore://local-swift/local-public/8/88/Test_for_testing_fancy_test_uploads_01.jpg" to "mwstore://local-swift/local-deleted/2/7/f/27f9mwpmp4t54jc8ecq19irwelo8v6y.jpg". [19:21:01] I see the results hitting ms-fe4, which means it's lunch time for me. [19:21:04] back in a bit. [19:21:08] yeah, no moving/deleting works (like when I had the hack before) [19:33:17] maplebed: will get new DIMM and hard drives tomorrow. as far as figuring out which is the right drive. our system of trying to mount the bad drive is the best [19:33:48] unless we want to load MegaCli [19:55:47] sorry, mutante, I was away [19:56:05] the step between merging and having it there... [19:56:07] git pull? [19:57:52] Got the following error when trying to delete a file on test.wikipedia.org: Error deleting file: Could not move file "mwstore://local-swift/local-public/f/fb/Jar.jpg" to "mwstore://local-swift/local-deleted/6/l/c/6lcuv5u7egt9yze8e0xjx7tw8kdg8ut.jpg". [19:57:57] Tried again and got... [19:58:08] Errors were encountered while deleting the file: [19:58:09] The file "mwstore://local-multiwrite/local-public/f/fb/Jar.jpg" is in an inconsistent state within the internal storage backends [19:58:09] The file "mwstore://local-multiwrite/local-deleted/6/l/c/6lcuv5u7egt9yze8e0xjx7tw8kdg8ut.jpg" is in an inconsistent state within the internal storage backends [19:58:40] maplebed, AaronSchulz: ^ [19:59:15] we saw some of them before [20:02:05] Oh well, guess I'll have to scap without testing on test. [20:35:36] New patchset: Dzahn; "add /var/cache/planet and /usr/share/planet-venus/theme/common to be owned by planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21264 [20:36:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21264 [20:36:28] nm [20:36:32] oop, wrong chat [20:53:40] hmm [20:53:55] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21264 [20:58:09] adding the Topic-branch name to gerrit-wm output would be nice [21:01:03] <^demon> I don't think gerrit gives us that. [21:01:11] <^demon> We don't have the whole ref available, iirc. [21:05:12] AaronSchulz: sorry, I had some paperwork to attend to. [21:05:51] np :) [21:06:35] maplebed: I think I see the problem though [21:06:43] while you were doing your paperwork ;) [21:06:53] best outcome of paperwork EVAR! [21:07:09] CF has double encoding and wrong / handling in the copy functions [21:07:30] the encoding calls there don't match anything else, and don't follow the api docs wrt / [21:07:51] ossm. [21:08:04] * AaronSchulz runs the tests again to confirm a few times [21:08:42] it also explains the "garbage encoded file" entry that made the listing tests fail [21:09:57] AaronSchulz: I want to grab a tcpdump. would you tell me immediately before beginning the test and as soon as it finishes? [21:10:01] I don't have rights to rename files. [21:10:12] I'm running tests against copper [21:10:17] not touching testwiki atm [21:10:23] ah. [21:10:45] takes 6min :) [21:10:58] would you mind granting me rights on testwiki or doing a file move / deletion for me? [21:11:22] username? [21:11:27] bhartshorne. [21:12:01] done [21:12:25] confirmed. thanks.
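On the capture itself: a sketch of the kind of tcpdump invocation used here, wrapped in Python; the interface, host, and duration are placeholders, not what was actually run. -s 0 keeps whole packets so the HTTP verbs and headers examined below stay readable:

    import signal
    import subprocess

    def capture(pcap="swift-test.pcap", host="copper", port=80, seconds=360):
        """Run tcpdump for the duration of a test, writing packets to a file.

        Sketch only: -s 0 captures full packets (COPY lines, Destination:
        headers, etc.); the host/port filter keeps the dump small.
        """
        proc = subprocess.Popen(
            ["tcpdump", "-i", "any", "-s", "0", "-w", pcap,
             "host", host, "and", "port", str(port)])
        try:
            proc.wait(timeout=seconds)      # or block on the test finishing
        except subprocess.TimeoutExpired:
            proc.send_signal(signal.SIGINT)  # let tcpdump flush and exit
            proc.wait()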
[21:13:34] "you do not have permission. reason: could not copy file." that's the expected failure, right? [21:16:04] same problem I had, yes [21:16:41] AaronSchulz: this collection of packets is fascinting. [21:16:50] *fascinating. [21:16:58] * AaronSchulz runs the copper tests a 3rd time [21:19:02] AaronSchulz: it looks like MW purges thumbs for the previous name even if the move to a new name fails. is that expected? [21:19:18] maybe, the code isn't usually optimized around failure cases [21:19:52] it also looks like after checking the original exists, it does a HEAD against the new location (for the original) 4 times. [21:20:23] are you looking at only the proxy-server entries? [21:20:28] yes. [21:20:43] I don't recall seeing that many heads [21:20:52] though I do recall more than one for the same file [21:21:07] most of the heads were internal ones (from X-Newest) [21:21:29] the other interesting thing is that before asking for the list of thumbs, it does a HEAD against the container (presumably to verify it exists?) [21:21:35] or any 404 head, really [21:21:49] yeah, but this is just against the raw container [21:21:55] maplebed: containers are cached in memcached, so that should be rarish [21:22:02] oh, the container [21:22:23] that should be even less frequent then [21:22:33] at least from mw [21:22:46] I'll repeat the test and see if it's the same. [21:25:48] Jamesofur: http://meta.wikimedia.org/wiki/Planet_Wikimedia#Requests_for_Update_or_Removal [21:25:57] AaronSchulz: this is what I was looking at: http://pastebin.com/anpNn3tZ [21:26:01] Jamesofur: i am going to update all of the "has moved" section [21:26:11] (that was when I said to move the file from feral cat 2 to feral cat 3) [21:27:15] maplebed: you will get the backend-sync error if you try on the same file again :) [21:27:21] mutante: awesome thanks! Looking through the rest now, I know that some of these shouldn't be dead [21:27:25] I tried a different file. [21:27:28] thanks for the warning though. [21:28:04] ah! I see the problematic line this time! [21:28:13] I think last capture was the second try on the same file [21:28:35] and I didn't look closely at the error - you're right; it was inconsistent backends. [21:28:47] COPY /v1/AUTH_xxxxxx/wikipedia-test-local-public/f%252Ff7%252FFeral_Cat.jpg [21:28:57] that's it right there. [21:29:02] so ... I confirm what you already found out. [21:29:03] \o/ [21:29:43] the destination header suffers from the same encoding issue. [21:29:44] do we have a gerrit revision interwiki like we did with svn code review? [21:30:24] Destination: wikipedia-test-local-public%2Fa%2Fa2%2FFeral_Cat_Stares.jpg [21:30:25] oh wait! [21:30:27] it's not quite the same [21:31:00] the URL is double-encoded (%252F) but the destination is only single-encoded (%2F) (%25 being the code for %) [21:33:57] I deployed the fix now [21:34:01] seems to be working on testwiki [21:34:18] maplebed: ok, I think the upgrade situation is starting to look good now [21:35:02] New patchset: Dzahn; "fix all "has moved" warnings on en.planet update, per:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21275 [21:35:03] fileop, error, and swift logs are fine [21:35:32] AaronSchulz: so do you know if the previous version of swift just accepted the double encoding and figured it out? [21:35:48] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21275 [21:36:12] New review: Dzahn; "http://meta.wikimedia.org/w/index.php?title=Planet_Wikimedia&oldid=4062146#has_moved_.283xx.29" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/21275 [21:36:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21275 [21:36:41] maplebed: apparently it tried to fix them somehow [21:36:47] * AaronSchulz looks at the release notes [21:37:07] just to see, can I run a tcpdump while you do a file move against copper? [21:37:38] * AaronSchulz looks at https://bugs.launchpad.net/swift/+bug/857673 [21:40:13] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [21:42:41] maplebed: ready for dumping? [21:42:53] ready [21:42:56] going [21:43:47] you should see a bunch of stores, then copies, then copies, then deletes, then deletes, then deletes [21:44:00] hmm, I'll do it again with just 1 src file [21:44:09] ok [21:44:17] it's all being piped to a file; I don't see any of it realtime. [21:44:18] done [21:44:19] you're done? [21:44:22] great [21:44:34] hmm that might be messy [21:44:41] can you run it on a new file? [21:44:52] huh? [21:45:03] the file you piped it to [21:45:18] I was doing batches with 10 or so files earlier [21:45:27] none of the copies in the test have encoded URLs. [21:45:39] this is running with the CF fix btw [21:45:55] heh. [21:46:00] there's a fault in your test case. [21:46:14] none of the test files have shards and none of them have slashes in the name. [21:46:19] they're all just container/name. [21:46:21] (this is fileOpPerfTest) [21:46:30] the encoding bug is only in the second slash and beyond. [21:46:30] it's not the full unit tests [21:46:38] those tests won't catch the bug. [21:46:44] about to run scap [21:46:57] maplebed: want me to run the actual unit tests? [21:47:04] and should I downgrade cloudfiles? [21:47:16] (on my machine) [21:47:21] I didn't realize this was the new cloudfiles. [21:47:26] so I don't need the entire unit test, [21:47:36] I just need a single file but it must have a / in the path somewhere. [21:47:43] I can just run the copy tests (though it will still have some cruft in there) [21:47:50] hmm [21:47:59] the buggy cloudfiles produces URLs like container/foo%252Fbar [21:48:03] notice how the first slash is preserved. [21:48:28] the tests you just ran were files like container/foo (no slashes beyond the one separating the file from the container) so it won't be visible. [21:48:43] (sorry to be repetitive.) [21:49:05] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [21:49:13] that all makes sense, right? [21:49:23] yeah, I'll just run the copy unit test [21:50:08] I'm ready when you are; gimme the go ahead and I'll start the tcpdump. [21:50:31] now :) [21:50:37] man, do I ever love tcpdump as a tool. [21:50:39] it's going now. [21:51:15] done [21:51:40] and again, new CF :) [21:52:01] ok. [21:52:28] Jamesofur: all the "has moved" warnings are gone. i am going to paste more stuff in the non-en sections [21:52:42] AaronSchulz: there was no traffic to copper during that time. [21:52:48] Jamesofur: i am linking to gerrit changes .. [21:52:53] I got one auth request and that was it. [21:53:27] were you running the dump right? [21:53:40] the tests worked fine for me [21:54:04] yes. but auth gets back msfe-test.wikimedia.org, so will be balanced between Cu, Mg, and Zn.
same problem we saw trying to test against ms-fe4 this morning. [21:54:06] ::sigh:: [21:54:38] mutante: perfect thanks, I'm working through the other en warnings now and will probably make a commit soon with the 401/403/404s . Yeah, I was wondering if we had an interwiki for it (code review uses [[rev:##### ]] for example). Doesn't look like it yet [21:55:39] AaronSchulz: sorry, but can you run it once more? [21:55:45] I've got dumps ready to go on all three hosts this time. [21:55:53] yeah, I was dumping the storage_url, I see what you mean about the balancing [21:55:54] Jamesofur: oh true, we do, [[gerrit:1234]] [21:56:01] dumps running. [21:56:02] ahhhh, perfect [21:56:04] run at will. [21:56:11] ok [21:56:45] done [21:57:03] looks like it hit zinc that time. [21:58:08] AaronSchulz: looks like the /s are encoded correctly (aka left alone) [21:58:46] I'd hope so [22:00:06] do you have the previous version available to run? I'd like to see the same test show the error again. or I suppose you could deploy that to testwiki and we could see the fix applied there. [22:00:40] the fix is already on testwiki [22:00:46] oh great. [22:00:48] * maplebed tests [22:01:01] manual testing worked for me, it's on all wikis too [22:01:02] New patchset: Ryan Lane; "Set puppet servername and certname based on realm and site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21280 [22:01:28] anyway, I can always run old versions for my copper tests [22:01:41] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/21280 [22:01:53] hey, know of a quick way to tell crond to just execute everything in an existing user crontab "right now" without caring about the time and having to copy/paste/edit to a script ... [22:02:41] New patchset: Ryan Lane; "Set puppet servername and certname based on realm and site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21280 [22:02:44] mutante: there is no way [22:03:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21280 [22:03:25] alright, i guess i'll write a script to grep the commandlines out of there [22:03:31] maybe use for i in `crontab -l | awk '{ print $ }'` do; $i; done [22:03:32] ? [22:03:47] Would suck if using multiple args of different counts [22:03:50] of course that awk wouldn't work if there's args [22:03:51] yeah [22:03:53] AaronSchulz: confirmed! [22:04:28] someone want to review a possibly horribly destructive change? 
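The %252F symptom tracked down above is classic double encoding: percent-quoting a path that is already percent-quoted turns each %2F into %252F, because % itself encodes to %25. A minimal reproduction with urllib.parse (illustrative only; cloudfiles' own encoding calls differ):

    from urllib.parse import quote

    name = "f/f7/Feral_Cat.jpg"  # object name with slashes beyond the container

    # Correct: quote once, leaving "/" alone (swift object names may contain it)
    once = quote(name, safe="/")          # 'f/f7/Feral_Cat.jpg'

    # The failure mode seen in the capture: quote with no safe characters,
    # then quote the result again; %2F becomes %252F ("%" -> "%25")
    broken = quote(quote(name, safe=""))  # 'f%252Ff7%252FFeral_Cat.jpg'

    print(once)
    print(broken)

The mismatch with the singly-encoded Destination header is consistent with the failed moves and deletions seen earlier: the COPY source and its destination no longer referred to the same object name.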
[22:04:28] https://gerrit.wikimedia.org/r/#/c/21280/ [22:04:38] well, or i just change the times to $NOW + 1 minute [22:05:00] thanks, something similar will work [22:05:07] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [22:05:07] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [22:05:07] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [22:05:07] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [22:05:07] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [22:05:08] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [22:05:08] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [22:05:09] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [22:05:09] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:05:10] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [22:05:10] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [22:05:11] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [22:05:11] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [22:05:18] * AaronSchulz finds the word "realm" to be magical [22:05:24] awk can print from column number -> end [22:05:49] * Jasper_Deng_sick is getting errors on wiki [22:05:58] "(Cannot contact the database server: Unknown error (10.0.6.73))" [22:06:17] dberrors.log spamming [22:06:19] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 210 seconds [22:06:31] Thu Aug 23 22:06:20 UTC 2012 srv270 enwiki Error connecting to 10.0.6.73: Lost connection to MySQL server at 'reading initial communication packet', system error: 111 [22:06:37] PROBLEM - MySQL Replication Heartbeat on db59 is CRITICAL: CRIT replication delay 230 seconds [22:06:39] spamming => flooding [22:06:46] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 238 seconds [22:06:47] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CRIT replication delay 239 seconds [22:06:49] the error is real [22:06:55] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 247 seconds [22:06:55] PROBLEM - MySQL Replication Heartbeat on db1049 is CRITICAL: CRIT replication delay 248 seconds [22:07:02] enwiki is down [22:07:05] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [22:07:06] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 256 seconds [22:07:06] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 256 seconds [22:07:13] PROBLEM - MySQL Replication Heartbeat on db1017 is CRITICAL: CRIT replication delay 267 seconds [22:07:22] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 275 seconds [22:07:27] schema changes? 
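Back to the crond question: the one-liner floated above loses arguments because awk prints a single column. A sketch that keeps each command line intact instead, assuming standard five-field crontab entries (comments, blanks, VAR=value lines, and @special entries are skipped):

    import subprocess

    def run_crontab_now():
        """Execute every command from the current user's crontab immediately.

        Rough sketch: standard 5-field entries only. Commands run via the
        shell, so multi-argument command lines and pipes survive intact,
        which is exactly what the awk one-liner couldn't guarantee.
        """
        out = subprocess.check_output(["crontab", "-l"], text=True)
        for line in out.splitlines():
            line = line.strip()
            if not line or line.startswith(("#", "@")) or "=" in line.split()[0]:
                continue  # comment, blank, @special, or VAR=value line
            fields = line.split(None, 5)
            if len(fields) < 6:
                continue  # not a standard 5-field + command entry
            subprocess.call(fields[5], shell=True)  # drop the 5 time fields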
[22:07:40] PROBLEM - MySQL Replication Heartbeat on db38 is CRITICAL: CRIT replication delay 290 seconds [22:07:40] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 292 seconds [22:07:49] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: CRIT replication delay 299 seconds [22:08:03] Ryan_Lane: on fluorine under /a/mw-log [22:08:07] RECOVERY - MySQL Replication Heartbeat on db59 is OK: OK replication delay 0 seconds [22:08:16] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 0 seconds [22:08:16] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay 0 seconds [22:08:25] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [22:08:25] RECOVERY - MySQL Replication Heartbeat on db1049 is OK: OK replication delay 0 seconds [22:08:32] not flooding anymore, hmm [22:08:34] RECOVERY - LVS HTTP IPv4 on m.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 0.080 second response time [22:08:35] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [22:08:35] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay 0 seconds [22:08:43] RECOVERY - MySQL Replication Heartbeat on db1017 is OK: OK replication delay 0 seconds [22:09:00] enwiki is up here atm [22:09:10] RECOVERY - MySQL Replication Heartbeat on db38 is OK: OK replication delay 0 seconds [22:09:10] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay 0 seconds [22:09:19] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 0 seconds [22:09:19] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 0 seconds [22:09:19] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [22:09:31] interesting... [22:10:02] (/now/ it's back up) [22:10:28] New patchset: Jalexander; "Fixing 401, 403 and 404 errors from update Some blogs removed some updated" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21283 [22:11:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21283 [22:12:23] anyone? https://gerrit.wikimedia.org/r/#/c/21280/2 ? [22:12:23] :) [22:16:13] New review: awjrichards; "Looks good; please approve if this is sane! Right now clicking 'WLMMobile_latest.apk' on the nightly..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/19240 [22:16:51] any op available for a quick review/merge of simple change? ^ [22:17:26] New review: awjrichards; "PS this should fix: https://bugzilla.wikimedia.org/show_bug.cgi?id=39275" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/19240 [22:30:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21283 [22:31:11] New patchset: Ryan Lane; "Set expiration time for keystone tokens to 7.1 days" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21287 [22:32:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21287 [22:33:54] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21280 [22:37:26] New patchset: Ryan Lane; "Revert "Set puppet servername and certname based on realm and site"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21290 [22:38:09] fucking. hate. puppet. 
[22:38:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21290 [22:38:30] certname = undefined [22:38:34] fuck you puppet. fuck you. [22:39:37] mutante: heh: default => undefined, [22:39:42] If puppet was written in python using Jinja2 it would be awesome, random dsl causes pain [22:39:42] that's wrong [22:40:51] puppet should require quotes for strings [22:40:53] !log putting ms-fe3 and 4 into rotation running swift v1.5.0-3 [22:41:03] Logged the message, Master [22:41:53] done. [22:42:08] logs are still quiet [22:43:41] traffic's a-flowin. [22:43:51] New patchset: Ryan Lane; "Set puppet servername and certname based on realm and site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21291 [22:44:40] New patchset: Ryan Lane; "Set expiration time for keystone tokens to 7.1 days" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21287 [22:45:30] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21287 [22:45:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21291 [22:45:31] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21291 [22:46:30] New patchset: Dzahn; "Explicitly specify FollowSymlinks for the Mobile nightly builds, using +SymLinksIfOwnerMatch instead. This is advisable to make Apache check if the symlink target belongs to the same user." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19240 [22:47:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19240 [22:48:03] the logs look ok. [22:49:57] AaronSchulz: I've got a potential bug for you. [22:50:12] in the same section that was incorrectly url-encoding /s, what does it do to !s? [22:50:46] wikipedia-commons-local-thumb.d3/archive/d/d3/20061230231901%2521Sulfur-hexafluoride-3D-vdW.png/120px-Sulfur-hexafluoride-3D-vdW.png <-- note the %2521 before Sulfur [22:51:03] that might be correct though, I'm still checking. [22:52:15] awjr: it does not work as expected.. i tested manually and it still denies us..looking [22:52:22] New patchset: Jalexander; "Fixing 401, 403 and 404 errors from update Some blogs removed some updated" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21293 [22:52:28] mutante :( thanks for checking [22:53:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21293 [22:54:29] AaronSchulz: nevermind. looks like that's just a logging thing. [22:54:41] ok [22:56:11] awjr: it is more the way the link itself is created than the Apache config [22:56:28] awjr: do you know how the symlink itself is being updated ? [22:56:46] mutante: i do not actually, i dont really know anything about the integration system [22:57:20] i think hashar and/or ^demon helped set it up [22:57:22] awjr: we don't need to change the apache config, it follows the symlink without it [22:57:43] mutante: ok i'll poke others about it - thanks for looking into it [22:57:47] awjr: it works now, i deleted the symlink and recreated it. difference being i used a relative path instead of the absolute path in the filesystem [22:57:50] <^demon> I didn't set it up, it was hashar. [22:57:59] <^demon> I just tried that fix cuz I couldn't figure out what's wrong.
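The symlink fix described above (a relative target instead of an absolute one) in sketch form; the file names are taken from the log, and the atomic-swap detail is an extra precaution rather than what was actually run:

    import os

    def point_latest(build="WLMMobile_19156fcc2a.apk",
                     link="WLMMobile_latest.apk",
                     directory="/srv/org/mediawiki/integration/WLMMobile/nightly"):
        """Recreate a 'latest' symlink with a relative target. Sketch only.

        A relative target resolves against the directory containing the
        link, so Apache's view of the docroot and the path the build job
        happens to run from no longer have to agree; an absolute target
        can trip Apache's symlink checks.
        """
        path = os.path.join(directory, link)
        tmp = path + ".tmp"
        os.symlink(build, tmp)   # target is relative: no directory component
        os.replace(tmp, path)    # atomic swap, no window with a missing link

This is the same effect as cd-ing into the nightly directory before running ln -s (or using ln's -t option to name the directory), which is what the conversation below converges on.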
[22:58:07] ^demon right on thanks [22:58:09] awjr: but i guess it will break again once it is rewritten [22:58:17] mutante ok cool - that was just what i was going to ask :p [22:58:35] lemme take a look if i can find that.. [23:01:13] New review: Demon; "One minor thing." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/21293 [23:01:35] ^demon: it was WLMMobile_latest.apk -> /srv/org/mediawiki/integration/WLMMobile/nightly/WLMMobile19156fcc2a.apk now it just is WLMMobile_latest.apk -> WLMMobile_19156fcc2a.apk [23:02:23] <^demon> mmk [23:05:16] New review: Dzahn; "actually it already follows symlinks without a change, so we don't need it. But the symlink is creat..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/19240 [23:05:58] Change abandoned: Demon; "Don't need this then." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19240 [23:08:53] <^demon> Gosh dang it. [23:09:08] <^demon> mutante: Reason for the full path is jenkins is running the ln -s from some random place I dunno. [23:09:09] New patchset: DamianZaremba; "Adding correct ip - change a while back due to corruption" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21297 [23:09:20] maplebed: stat caching seems to work from my testing [23:10:02] dunno what to say, man. [23:10:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21297 [23:10:24] ^demon: where did you find the "ln" command? [23:10:36] <^demon> It's in jenkins :) [23:10:49] <^demon> https://integration.mediawiki.org/ci/job/WLMMobile%20-%20Nightly%20builds/configure [23:11:11] ah, heh, i am not sure if i have a login [23:11:16] <^demon> Labs login. [23:11:21] ok [23:12:12] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [23:12:47] ^demon: can we add "cd /srv/org/mediawiki/integration/WLMMobile/nightly/" and then drop that part from the ln? [23:13:07] <^demon> Tried it. Got the forbidden again. [23:13:54] ^demon: we can try something else and use Alias in Apache config [23:14:00] like this: Alias /manual/ /usr/local/apache/manual/ [23:14:21] "If you need symbolic links consider using the Alias directive, which tells Apache to incorporate an external folder into the web server tree. It serves the same purpose but is more secure. [23:14:25] Read more at http://www.devshed.com/c/a/Apache/Setting-Permissions-in-Apache/1/#O9KA89s0rwE7uZQi.99 [23:15:52] <^demon> You know what'd be even easier...just copy twice. [23:16:04] <^demon> Copy the _latest, which would be overwritten on the next run [23:16:16] true [23:17:10] <^demon> Can you remove the symlink? [23:18:24] done [23:19:38] <^demon> Works :) [23:20:04] ^demon: nice:) well... and we could have also used -t with ln :p [23:20:29] specify the DIRECTORY in which to create the links [23:20:37] without having to cd [23:20:48] wait… did you guys just fix the issue with the latest link? [23:20:53] yea [23:20:59] dang you guys rule [23:21:02] thanks :D [23:22:10] yw [23:22:20] <^demon> And on that note, I'm gonna call it an evening. Later folks. [23:36:57] New patchset: Dzahn; "Fix remaining en planet errors including 500s and no data.
remove Chad's blog entirely per his comment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21293 [23:37:42] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21293 [23:39:31] Would anyone object if I redeploy a change to test that had to be backed out earlier? [23:39:57] New review: Jalexander; "Thanks Daniel, looks good (Weird, didn't realize you couldn't +1 after it was already merged)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21293 [23:42:21] Jamesofur: i like how we get redirect on "doubleredirection.wordpress.com" :) [23:43:54] Jamesofur: you know, for a bunch of redirections across all languages, most of them just needing a / at the end, should i even bother to list that on meta? [23:44:56] LOL, that sounds like it's on purpose [23:46:03] mutante: nah if it's that simple I'd just do it as long as we do it right away and don't need to remind someone to do it [23:46:57] the second most common is that "tag" changed to "category" apparently in some wordpress version [23:47:36] yeah, I've hit that a bunch of times [23:49:29] hi maplebed [23:49:39] hey, you're up early/late. [23:49:47] late :) [23:50:16] so, how's ms-fe3/4? [23:50:23] all good? [23:50:41] they're good. they're in rotation. [23:50:45] I saw you put them back online what? 15' ago? [23:51:08] no, 75 minutes ago? [23:51:28] oh, I was looking at http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=^ms-fe[1-9].pmtpa&mreg[]=swift_[0-9]%2B_hits__%24&mreg[]=swift_other_hits__%24&z=large&gtype=line&title=Swift+percentage+queries+by+status+code&aggregate=1&r=hour [23:51:40] yeah.... so I missed a bug in the ganglia logtailer. [23:51:58] the proxy log format changed slightly (added a parameter) and the regex parsing the log wouldn't take it. [23:52:08] oh [23:53:25] so, that's where? puppet? [23:53:51] well... I'm not sure how to update https://gerrit.wikimedia.org/r/#/c/18264/ to include it. [23:54:23] since you pushed a patch too, I can't just --amend what I currently have. [23:54:45] you can fetch that and amend [23:54:55] but no reason to have these in a single patch anyway [23:55:03] New patchset: Ryan Lane; "Change puppet branch to always use production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19786 [23:55:11] well, they should all be pushed simultaneously... [23:55:13] in fact I was thinking of pushing the proxy server stuff (proxy-server.erb + rewrite.py) tomorrow [23:55:33] actually, the logtailer change could go now; the change is backwards compatible. [23:55:40] right [23:55:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19786 [23:56:16] New patchset: Dzahn; "fix more redirections in de/en/fr/gmq/it/zh" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21303 [23:56:25] ok, I'll just push that one now. [23:56:28] and manually deploy to ms-fe3 [23:57:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21303 [23:57:31] :-) [23:57:46] so, 1.5 proxies look good, that's great news [23:57:58] I guess we'll push it to ms-fe{1,2} in a few hours then [23:58:09] New review: Dzahn; "get rid of all the warnings in planet logs for once" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/21303 [23:58:10] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21303
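On the logtailer bug above ("the regex parsing the log wouldn't take it" after a parameter was appended to the proxy log line): the backwards-compatible fix is to make the trailing field optional, so the same pattern matches lines from both versions. The field layout below is invented for illustration, not swift's actual proxy log format:

    import re

    # OLD anchors the line end right after the status, so a newly appended
    # field makes every new-format line fail to match (the reported bug).
    OLD = re.compile(r"^(?P<ts>\S+ \S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d{3})$")

    # NEW tolerates both formats by making the extra field optional.
    NEW = re.compile(r"^(?P<ts>\S+ \S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d{3})"
                     r"(?: (?P<extra>\S+))?$")

    old_line = "Aug23 23:51:00 GET /v1/AUTH_x/c/o 200"
    new_line = "Aug23 23:51:00 GET /v1/AUTH_x/c/o 200 0.0213"

    assert OLD.match(old_line) and not OLD.match(new_line)  # the bug
    assert NEW.match(old_line) and NEW.match(new_line)      # the fix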