[10:15:20] morning
[10:21:43] no it isn't :-P
[10:53:47] New patchset: Nikerabbit; "Symlink should be fixed after jetty installation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21182
[10:54:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21182
[10:54:54] apergos: so...
[10:55:06] let's depool ms-fe4?
[10:55:11] or?
[10:56:03] I thought we wanted to do two of them?
[10:56:44] well yes, but let's start gradually?
[10:56:55] remove one, see the effect, remove a second one
[10:56:59] ok
[11:05:32] okay, removed ms-fe4
[11:06:23] ok, looking at the graphs
[11:07:08] hm, unrelated to my change, ms-fe1 reports a 10s flatline response time
[11:07:28] just for DELETEs
[11:07:38] yeah I was looking at that, pretty weird
[11:07:39] reporting error I say
[11:09:39] sounds likely
[11:10:15] (I'm looking at the changes in https://gerrit.wikimedia.org/r/#/c/18264/ to make sure I know what they do and why... also for sanity check)
[11:11:18] great
[11:17:33] I've verified the proxy logging change, the db preallocation changes, and the wsgi change. now looking at the change to the list of packages (that's the last thing too)
[11:19:51] ok, that checks out too
[11:20:22] shall I review +2 and merge?
[11:21:22] no
[11:21:25] that would break everything :)
[11:21:38] heh
[11:21:55] the changes are not compatible with 1.4
[11:22:12] you merge them => puppet applies them on production => everything fails
[11:22:26] ah we don't have the new packages in the repo yet
[11:22:45] plus also we want to apply them only to the two chosen proxies
[11:22:53] well this will be irritating
[11:23:40] * apergos waits
[11:24:03] the two paths we can go are: 1) disable puppet in production and do the upgrade via puppet, 2) disable puppet in the to-be-upgraded box and do the upgrade by hand
[11:24:06] I vote (2)
[11:24:08] for now
[11:26:01] I would vote to disable puppet in production and do the upgrade on one host by puppet, check the results for sanity
[11:26:19] now?
[11:26:35] we'd have to disable it on *every* production host and keep it that way until we're done with the upgrade
[11:26:44] by every I mean every swift host
[11:26:45] yep
[11:26:48] obviously :)
[11:27:45] ah, forgot
[11:27:52] !log depooling ms-fe4 to stage 1.5 upgrade
[11:28:03] Logged the message, Master
[11:28:03] no bot?
[11:28:05] ah.
[11:28:10] hasty!
[11:28:22] yep, we would be without puppet on those boxes for a couple days
[11:28:49] I don't think that's a big deal... is it?
[11:29:10] it is if it's more than a couple of days
[11:29:33] and I'm worried we'll get stuck mid-way for some reason
[11:30:02] on a related note, do we have any easy way to back out if it turns out that 1.5 is killing us for whatever reason?
[11:30:34] if we don't merge via puppet I think it's trivial
[11:30:48] how would we do it?
[11:30:52] (if we don't merge)
[11:31:00] apt-get install swift=1.4; puppetd -vt?
[11:31:14] I don't think it keeps state anywhere
[11:31:33] and update the config file(s) and rewrite.py
[11:31:45] puppet does that
[11:32:18] so when and how do you want to test the 1.5 puppet changes?
[11:33:28] I think we should just do it manually for now
[11:33:33] for 1-2 proxy servers
[11:34:31] what do you think?
[11:34:36] I got that. I'm asking, if we go that route, how we plan for testing the 1.5 puppet changes, to make sure they work; that should be part of the overall plan
[11:34:44] oh!
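[Editor's note: a minimal sketch of the back-out path discussed above (11:26-11:31), assuming the 1.5 puppet changes are never merged, so puppet still describes the 1.4 setup. The exact 1.4 version string and package list are assumptions; they would need checking with "apt-cache policy swift" on the box.]

    # on the depooled proxy, pin swift back to the 1.4 series
    # (version string below is a placeholder)
    apt-get install swift=1.4.3-0ubuntu1 python-swift=1.4.3-0ubuntu1
    # preview what puppet would change, then let it restore the
    # 1.4 config file(s) and rewrite.py
    puppetd -vt --noop
    puppetd -vt
    swift-init all restart

Repooling the proxy via pybal afterwards would complete the rollback.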
[11:36:09] so, the puppet changes are tiny puppet-wise
[11:36:19] it's mostly content that changes, which we'll already test manually
[11:36:37] however, there's still a window there
[11:38:29] ok well I have a proposal then
[11:38:41] we do the first two proxy servers manually
[11:38:54] we do all but one of the backend servers manually
[11:39:09] we do the last backend server with puppet
[11:39:15] and the last two proxy servers with puppet
[11:39:23] okay
[11:39:25] obviously this is not today, I just want to have the plan
[11:39:32] it doesn't even have to be one, it can be more
[11:39:35] sure
[11:39:40] we can run puppet --noop
[11:39:46] yep
[11:39:46] see if it's sane, then push to all
[11:39:59] okay, I like
[11:40:03] great
[11:40:10] so first tow proxy servers, manually
[11:40:12] *two
[11:40:17] lemme look at the graphs again...
[11:40:20] great
[11:40:26] no traffic in ms-fe4 at all, I checked lvs4 too
[11:40:53] seems pretty bored, wanna pull the second one?
[11:41:23] done
[11:41:28] ms-fe3
[11:42:08] * apergos watches and waits
[11:42:14] and eats a peach
[11:43:26] yum
[11:46:11] seems good
[11:46:18] !log depooling ms-fe3 to stage 1.5 upgrade
[11:46:27] Logged the message, Master
[11:46:42] these look pretty good to me
[11:47:33] still has lingering traffic according to netstat
[11:47:37] ok
[11:47:43] let's move on with ms-fe4 though
[11:49:36] okay, disabled puppet on both
[11:49:39] so I guess these packages are tucked away in a lab swift instance
[11:49:48] "swift-init all stop" in ms-fe4
[11:49:51] All these packages are currently in /root/ on swift-be1, he says in an email
[11:50:11] I read somewhere they're in bast1001:~ben and I took them from there
[11:50:23] oh
[11:50:40] http://wikitech.wikimedia.org/view/User:Bhartshorne/swift_upgrade_notes_2012-08
[11:50:44] there, second line
[11:50:50] ok
[11:51:29] (btw, swift's at 1.6 now...)
[11:51:41] yeah I noticed
[11:53:00] if we add these to the repo(s) on brewster, the old versions become unavailable?
[11:53:42] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: Connection refused
[11:54:25] Be aware that reprepro will remove older versions of packages without asking. They are no longer available in the pool
[11:54:33] yeah. so I think we don't want to add these to the repo
[11:54:39] nope
[11:54:45] bah humbug
[11:54:53] hm, is it just me, or does the requests/s graph show an upwards trend?
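[Editor's note: to illustrate the reprepro warning at 11:54 above (repo path and distribution name are assumptions, since the layout on brewster isn't shown in the log): importing the 1.5 debs would silently drop the 1.4 ones from the pool, which is why the packages stay in /root/ instead.]

    # what is being avoided here: includedeb replaces the old version,
    # and reprepro deletes the 1.4 debs from the pool without asking
    reprepro -b /srv/wikimedia includedeb lucid-wikimedia swift_1.5.0-*_all.deb
    # afterwards only 1.5.0 would be listed; no way back short of
    # re-importing a saved copy of the 1.4 deb
    reprepro -b /srv/wikimedia list lucid-wikimedia swift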
[11:54:57] looking
[11:55:24] nah I think it's minor
[11:55:34] just fluctuating
[11:55:50] little bit yes
[11:55:56] let's wait a little though
[11:56:44] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=^ms-fe[1-9].pmtpa&mreg[]=swift_[A-Z]%2B_hits%24&z=large&gtype=stack&title=Swift+queries+per+second&aggregate=1&r=day
[11:57:17] give it about 15 min
[11:57:29] I installed 1.5 on ms-fe4 already :)
[11:57:41] mrghmph
[11:58:01] well it's not in the pool so it doesn't matter :-P
[11:58:09] keep ms-fe3 as is for now though
[11:59:06] :)
[12:02:14] okay, packages & config prepared on ms-fe4
[12:03:54] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:54] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:54] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:54] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:54] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:55] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:55] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:56] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:56] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:57] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[12:03:57] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:58] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:58] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[12:05:52] okay, starting swift on ms-fe4
[12:06:00] what about rewrite.py?
[12:06:06] oh, right.
[12:06:24] very true
[12:06:26] thanks
[12:06:44] yw
[12:07:03] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours
[12:11:39] yay for broken diff
[12:11:50] rewrite.py has changed in the meantime
[12:11:55] orilly?
[12:12:56] yeah, probably the originals/thumb change
[12:12:57] I'll fix it.
[12:17:53] seems like the sort of thing that patch should have worked arond
[12:17:55] around
[12:17:56] whatever
[12:19:12] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.011 seconds
[12:19:27] heh, works
[12:19:27] cool
[12:25:42] ok the various lists commands seem to work (no big surprise, just figure I might as well walk through these)
[12:30:24] things look pretty good
[12:34:14] New patchset: Faidon; "swift changes for the upgrade to 1.5.0" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18264
[12:34:36] that's the rebase
[12:34:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18264
[12:35:00] right
[12:35:34] so, what did you test exactly?
[12:36:32] lists, stats
[12:36:40] did not try actual retrieval
[12:36:47] nor nonexistent containers/objects
[12:37:53] I am doing nonexistent containers/objects (lists, stats) right now
[12:37:55] they all check out
[12:42:12] with what? swift -A? or curl?
[12:42:22] swift -A from the host
[12:46:56] download worked. with somewhat surprising results (it created the hash dirs)
[12:47:20] hm?
[12:47:40] I requested 7/7b/somethingorother
[12:47:44] using swift
[12:47:58] it created 7/7b/somethingorother in the current dir
[12:48:07] anyways it worked fine
[12:48:31] I haven't tried any thumbs, only originals
[12:48:57] retrieval of nonexistent object fails properly
[12:54:49] New patchset: Matthias Mullie; "Add new AFT permission levels" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21141
[13:00:40] apergos: break for lunch?
[13:00:51] sounds great
[13:00:51] and other business that's piling up
[13:01:01] need to cook real food
[13:10:49] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[13:15:05] https://gerrit.wikimedia.org/r/#/c/21141/2/wmf-config/CommonSettings.php
[13:15:06] $wgGroupPermissions['afttest-hide'] = $wgGroupPermissions['oversight'];
[13:15:17] I'm.. not sure this is a good idea
[13:52:24] hey guys, other than mark, who can review my puppet changes to java.pp?
[13:52:34] mutante or paravoid maybe?
[13:52:51] https://gerrit.wikimedia.org/r/#/c/20741/
[14:01:53] ottomata: I can confirm that the issue you are having isn't you. I am totally getting the same error.
[14:02:05] I see 1023 hit brewster for its PXE boot DHCP assignment
[14:02:14] but I never see the secondary hit from within the ubuntu installer
[14:02:21] aye cool, yeah that's what I saw too
[14:02:48] of course, cannot push the installer logs to the web to look at them either
[14:02:53] as it fails dhcp
[14:03:06] I am going to check a few things in the bios to confirm they are right, checking now
[14:03:09] ottomata: I'm sorry, this is swift week
[14:03:19] me and apergos are picking up swift stuff from Ben
[14:03:39] * apergos peeks in
[14:03:44] aye cool
[14:03:44] and it's important to squeeze as much as possible before he leaves, so everything else kinda is on the backburner
[14:03:52] yeah everyone has been soooooper busy it seems
[14:03:53] s'ok
[14:04:09] mark: varnish 3.0.3 got released btw
[14:04:14] ottomata: So there are a few things that can cause it to detect the wrong interfaces. I am going to confirm that DRAC is set to dedicated, no virtual media is connected, and that all the bios settings basically match what the other R310s we run have
[14:04:32] ok
[14:06:18] hrmm, drac communication failure...
[14:06:25] thats a new error.
[14:07:01] ottomata: in about two minutes im ditching you for 30, my lunch is going to get cold (well, breakfast + lunch, reheated pizza!)
[14:07:08] ok cool
[14:07:08] but now im intrigued damn it
[14:07:11] yeah!
[14:08:17] ok, it should work, im going to stay attached to it and poke at it while im eating if you dont mind
[14:08:23] if you want to try 1024 go for it
[14:10:22] apergos: ready when you are
[14:11:18] ok i'll try 1024
[14:12:57] ok
[14:13:06] half-ready (I'll be going back and forth from the kitchen)
[14:15:00] RobH, yargh, same deal on 1024
[14:15:18] i have no idea if that makes me happy or sad.
[14:15:30] moar data though =]
[14:16:06] so all the settings in bios & drac are right
[14:17:43] other topic: paravoid, I know you and apergos are busy, but who can review puppet stuff?
[14:17:49] there's got to be more than mark and you, right?
[14:19:00] so just out of curiosity
[14:19:16] is the oracle jdk open source?
[14:19:55] i dunno if it has any bearing on using it, all your changes are very specific to analytics and appear to be legit to me, but im not that well versed in java
[14:20:28] oh, nm, i see, you are pulling openjdk, nm
[14:21:09] (i dunno enough to approve this other than 'assume good faith' type approval that you arent doing anything crazy in java setup)
[14:22:11] aye
[14:22:14] yeah we aren't
[14:22:24] all i'm doing is abstracting out package naming details
[14:22:30] because the names of the packages are not consistent
[14:22:36] yea, im reading it now and seems thats it
[14:22:40] between lucid and precise and openjdk vs. sun/oracle
[14:22:46] i should be able to review and approve this in a few minutes
[14:22:57] * apergos looks at the graphs again
[14:23:04] and the only sun/oracle stuff I'm using are ones that are already available, either through ubuntu or through our own apt
[14:23:32] ottomata: I'm not going to make you cherry pick and redo for a single space before a tab on line 238 ;]
[14:23:53] (im not evil)
[14:24:38] New review: RobH; "looks legit, and is localized to just analytic machines for now" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/20741
[14:24:39] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20741
[14:24:51] still ok
[14:25:32] what do you think about putting it back in the pool? (ms-fe4)
[14:25:40] ottomata: Your change is merged and live on puppetmaster
[14:25:47] and i let my food get cold
[14:25:51] * RobH goes to reheat it
[14:26:02] apergos: you eating today? ;]
[14:26:14] yes
[14:26:26] if you did the backread you'd see I am cooking right now even
[14:26:31] (but I ate earlier too)
[14:26:51] thank youuuuuuuu
[14:26:52] half the food in my fridge is bad now
[14:27:10] * RobH is reheating pizza now
[14:27:24] and will prolly make a hash of taters and onion... cuz thats all i have thats still good?
[14:29:15] taters and onions ain't bad
[14:29:17] * paravoid points at Rob
[14:29:25] but make sure you get more food for the fridge later
[14:29:26] in Europe there aren't many of us
[14:29:49] with Daniel's relocation even less so
[14:29:52] paravoid: apergos and i have a longstanding arrangement to remind one another to actually remember to eat.
[14:29:57] do you need to be on the list?
[14:30:00] haha
[14:30:06] nah, thanks
[14:30:14] worst case I won't eat and lose a pound or two
[14:30:15] he's on a different sleep schedule anyhow
[14:30:25] on pacific time?
[14:30:30] actually, my sleep schedule was great this week
[14:31:01] and even with the late days, I keep waking up early-ish, with the exception of today
[14:31:08] oh the irony
[14:33:00] ok, cooking time, back in 15
[14:33:09] so what did you think about putting ms-fe4 back in the pool? do you want more testing?
[14:33:14] yes
[14:33:16] well, first of all
[14:33:19] let's do ms-fe3 first
[14:33:30] yes which, testing or pool?
[14:33:42] then, I'd like to enlist Aaron sometime later to actually try pointing a MW to ms-fe43/
[14:33:45] 4/3
[14:33:48] and see if it works
[14:33:54] maybe a test wiki, I don't know
[14:34:14] what do you think?
[14:34:25] let's see when we get there
[14:34:49] fair enough
[14:34:52] so, move on to ms-fe3?
[14:34:55] sure
[14:39:05] aaaahh internet, back
[14:39:55] PROBLEM - Swift HTTP on ms-fe3 is CRITICAL: Connection refused
[14:51:49] the two remaining proxies sure look unaffected
[14:54:30] ottomata_m: this is confusing as hell.
[14:57:52] ottomata: i hate this server now. its making me feel stupid.
[14:58:25] ummmm but are you still intrigued?
[14:58:38] still working on it
[15:01:25] ottomata: The netboot prolly is wrong for this
[15:01:31] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.30:11000 (Connection timed out)
[15:01:34] analytics101[1-9]|analytics102[0-9]) echo partman/lvm.cfg ;; \
[15:01:46] so 1023-1029 are not going to want to use the lvm.cfg
[15:02:01] you will want to specify something else for them, or leave it blank and manually partition
[15:02:09] hm, why not?
[15:02:29] hmmm
[15:02:31] they have dual 500gb disks is all.
[15:02:44] if thats ok with lvm then thats fine
[15:02:58] oh hmmmm, yeah i guess we do want mirrored raid for / on those
[15:02:59] right
[15:03:04] but seems to be an odd range of servers (included both c2100 and r310)
[15:03:10] whaaa
[15:03:11] really?
[15:03:12] which are which?
[15:03:19] well, the R310s are 1023+
[15:03:23] ah right
[15:03:24] yeah
[15:03:34] so you want that lvm line to be just up to 1022
[15:03:37] 1011-1022 are c2100s
[15:03:38] yeah
[15:03:42] (i could fix, but I assume you want to ;)
[15:03:52] this is unrelated to the other issue we have
[15:04:04] (but if we set it to manual partition for now it will make troubleshooting easier)
[15:04:14] as we can manually set an IP when it fails, then web mount the debug logs.
[15:04:28] plus i am just reviewing everything for these servers now to try to track wtf is up
[15:04:30] so i noticed this
[15:04:31] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[15:04:48] yeah i can fix it
[15:04:50] (for now i would just update to remove 1023+ from any entries)
[15:04:58] to manually debug?
[15:04:59] then it just asks for partitioning, which is fine for debug
[15:05:00] that's fine
[15:05:10] but yea, later will want to add a new line for them to include in some mirror partman script
[15:05:11] we want to automate once you figure it out though
[15:05:17] ok cool
[15:05:19] we are on same page =]
[15:05:21] yeah i can do that
[15:06:06] i thiiiiink i can use the analytics-cisco.cfg and just swap the sd* for sda and sdb
[15:06:19] but i'll wait until you are done figuring it out before I try
[15:06:29] cool
[15:07:46] Ok, all the dhcp files check out as ok, the bios confirms as ok, the netboot files have the subnet needed (and have worked to install other machines in that subnet)
[15:07:52] the drac settings are fine
[15:08:01] dns is working for all the entries
[15:08:08] are there still those drac errors you mentioned?
[15:08:11] the netboot partman entry has no bearing on this issue
[15:08:21] nah, it was some odd one time timeout
[15:08:26] i didnt see it happen again
[15:08:37] if it only happens once, its a non-issue ;]
[15:08:48] and i had just disabled system services, so it may have been stuck reloading
[15:08:53] aye
[15:09:06] (system services is bad, since if you load it on accident remotely during POST you cannot unload it easily)
[15:09:13] but it has nothing to do with the issue we have now
[15:09:42] ottomata: So the next step is once the netboot.cfg is updated and live, we can rerun the install
[15:09:49] it will fail on the dhcp in the installer (but not in post)
[15:10:07] then we can manually set an ip, and web mount the debug log to see wtf is happening during the automated network discovery
[15:10:32] I can see no reason it should be getting the DHCP lease during the PXE but not during the installer.
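[Editor's note: the netboot.cfg change being agreed above, sketched against the fragment quoted at 15:01. Narrowing the glob to analytics1011-1022 leaves the dual-disk R310s unmatched, so the installer drops to manual partitioning, as RobH wants for debugging; the raid1 recipe filename in the comment is hypothetical.]

    # C2100s (analytics1011-1022) keep the lvm recipe:
    analytics101[1-9]|analytics102[0-2]) echo partman/lvm.cfg ;; \
    # later, the R310s (1023-1029) get a mirrored-raid recipe of
    # their own, e.g. (hypothetical filename):
    # analytics102[3-9]) echo partman/raid1-lvm.cfg ;; \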
[15:10:48] (i have seen that when the installer in netboot.cfg doesnt have the subnet defined)
[15:10:51] but we have it defined.
[15:11:07] 10.64.36.255) echo subnets/analytics1-c-eqiad.cfg ;; \
[15:12:00] and that actual file is fine, and has worked for the other analytics subnet servers (the c2100s)
[15:13:37] right
[15:13:43] apergos, paravoid: read backscroll. looks good!
[15:13:50] I'm about to head into the office.
[15:14:03] ok, "see" you in a while
[15:14:09] wait so, ok, RobH, i'm confused, what are we doing to update netboot.cfg? you want me to change the partman stuff now?
[15:14:21] RobH: Do you know if the last bit in #wikimedia-tech is at all valid?
[15:14:27] ottomata: yea go ahead and correct the lvm line to exclude 1023+
[15:14:40] ottomata: that way when it gets to the partitioning menu we can manually interject and mount the debug logs
[15:14:50] ah ok
[15:15:21] paravoid: there was one part to the gerrit diff that I realized I didn't do that's required for the puppet upgrade alone to work - change packages from ensure => present to ensure => latest. (I realize you're not using puppet yet, but I didn't want to forget again.)
[15:16:09] we better make sure that goes on the etherpad I guess
[15:16:24] or just straight into gerrit.
[15:16:27] :P
[15:16:29] sure
[15:17:12] RD: I dunno, but I am going to pull the data and make an RT ticket for it
[15:17:31] Alright
[15:19:24] New patchset: Ottomata; "netboot.cfg - removing anlytics1023-1029 from netboot for now." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21191
[15:19:45] RobH ^
[15:20:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21191
[15:27:55] New review: RobH; "changed per my request" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/21191
[15:27:56] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21191
[15:29:51] ottomata: Ok, netboot change is live, pulling to install servers and restarting 1023
[15:30:07] cool
[15:30:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21182
[15:30:40] ottomata: hopefully being able to see the logs now will result in me getting a goddamn clue whats going on
[15:30:45] =P
[15:33:55] RECOVERY - Swift HTTP on ms-fe3 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.509 seconds
[15:33:57] apergos: ms-fe3 is done btw
[15:33:58] ^^^
[15:34:57] I see that it is
[15:35:20] so, question
[15:35:26] I've recorded all the steps we made
[15:35:39] any ideas on where to permanently store them?
[15:35:44] wikitech perhaps?
[15:35:53] yes, wikitech
[15:36:00] stick em under Swift in a subpage
[15:36:43] ottomata: so its rebooting, still confirmed borked
[15:36:46] i see dhcp hit for pxe
[15:36:56] then nothing in the installer, which makes me think its borking the ordering of the NICs
[15:37:12] but we will see shortly.
[15:37:27] (we didnt order these with extra cards that i recall)
[15:37:47] * apergos is finally eating their food
[15:37:47] New patchset: Pyoungmeister; "add missing coma to solr init.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21192
[15:38:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21192
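[Editor's note: a sketch of how to watch both halves of the failing exchange described above; interface name and console keys are assumptions. The PXE DHCP request is visible from brewster, the in-installer one never arrives, so the installer's side has to be read from the debian-installer console on the R310 itself.]

    # on brewster: DHCP traffic arrives on the bootps/bootpc ports
    tcpdump -ni eth0 port 67 or port 68
    # on analytics1023, in the installer shell (Alt-F2 on the console):
    grep -i dhcp /var/log/syslog   # what the installer tried
    ip link                        # which NIC it picked, and link state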
[15:38:32] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21192
[15:39:43] ottomata: ok, if i specify the ip it hangs on detecting the actual link
[15:39:58] so i think the ubuntu installer is somehow swapping the nic1 and nic2 from what bios has
[15:40:11] since we are only using one nic, i am disabling the second in bios to see if the installer then progresses
[15:40:35] (since it wont go past the IP allocation, failing for link on the nic, it leads me to think this, too bad i cannot mount the damned logs to read them ;)
[15:40:56] ahhmm ok
[15:40:58] neat, you cannot disable the secondary nic.
[15:41:00] what the hell
[15:41:05] hah
[15:41:07] can you remove it?
[15:41:15] annoying i guess
[15:41:17] nope, its mainboard nic
[15:41:20] which is whats odd
[15:41:25] i wonder if these have more nics installed
[15:41:28] lemme pull the order
[15:41:39] i wonder if these also have the damned extra nics
[15:42:31] you could see the others in bios on the c2100s
[15:43:32] Broadcom 5709 Dual Port 1GbE NIC w/TOE PCIe-4 (430-3251)
[15:43:39] thats whats on the quote for the r310 order
[15:45:24] oook
[15:51:35] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[15:52:11] sigh, fucking srv278
[15:52:38] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms
[15:54:50] just decommission that box
[15:54:54] it's not worth the trouble
[15:55:47] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[15:55:59] I distinctly remember having a problem with that in the past but I can't find the ticket
[15:56:12] so I may be wrong
[15:56:49] mark: did you see that varnish 3.0.3 was released?
[15:57:41] mark: also VUG has a registration page and limited seats it seems.
[15:58:56] i did
[16:06:44] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours
[16:06:44] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[16:06:44] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[16:06:44] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[16:09:31] hi paravoid, apergos
[16:09:39] hello
[16:09:47] good morning maplebed
[16:10:02] * ^demon waves to everyone
[16:10:28] * everyone waves to ^demon
[16:10:56] so I don't have backscroll for the hour I was commuting; anything interesting with your tests?
[16:11:41] I've tested it with unauthenticated requests, apergos tested it with the swift CLI
[16:11:54] and I'm trying to test with authenticated requests now
[16:11:57] all looks good though
[16:12:24] authenticating with curl or something different?
[16:12:44] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[16:13:01] basically yes
[16:13:54] hey paravoid, a quick debian packaging question: where / how does dpkg-buildpackage determine who changed the source code?
[16:14:14] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time
[16:14:22] drdee: debian/changelog's first entry
[16:14:54] (well, first as you read the file, last chronologically)
[16:15:19] ty
[16:17:00] works fine
[16:17:11] maplebed: so, should we bug Aaron about testing it with MW?
[16:17:28] I wouldn't do that normally, but I remember something you said about getting reports it's broken
[16:17:31] +1
[16:17:44] and a comment on the wiki that says something about auth being broken on >= 1.4.4
[16:17:47] it shouldn't be too hard to move test.wikimedia.org to use ms-fe3
[16:17:53] right
[16:18:06] other than that I think we're good
[16:18:19] btw, any reason that they're still running lucid? legacy?
[16:19:39] only that they were built before we had precise.
[16:19:43] the eqiad cluster is on precise.
[16:19:48] so, legacy
[16:19:49] ok
[16:20:27] I'm also waiting for aaron's ms5 change to get signed off so it can be pushed out
[16:20:34] is there anything else we can do in the meantime?
[16:20:42] did it get a review from Tim?
[16:21:30] gotta find it again
[16:24:36] no it didn't
[16:24:48] ah too bad.
[16:24:56] I looked at it, it seems fine if a bit of code duplication that could now be cleaned up
[16:25:36] since tim didn't get to it we'll have to see if someone else has time and the appropriate expertise
[16:25:40] *grumble*
[16:26:01] hey RoanKattouw...
[16:26:06] ;)
[16:26:20] :-D
[16:26:39] Hey
[16:26:45] What's up?
[16:27:13] any chance you'd be interested in reviewing https://gerrit.wikimedia.org/r/#/c/21153/ ?
[16:27:52] Sure
[16:28:20] * apergos just added the comment they could have made yesterday but didn't notice they were on the requested review list :-/
[16:29:35] hey AaronSchulz
[16:29:40] maplebed: Approved
[16:30:10] apergos: Comment noted, there's a fair bit that can be factored out there, but it looks like that was kind of a problem already
[16:30:23] yeah, it's not a blocker, just a "look at later"
[16:30:50] so now... how do we get this deployed? :-D
[16:31:01] AaronSchulz: two things going on this morning
[16:31:16] * roan just reviewed and approved the multiwrite change
[16:31:41] * we've got two of the production swift proxies on 1.5 and out of rotation and are interested in trying to do something like point test.mw.org at them.
[16:31:58] thanks RoanKattouw !
[16:32:06] I'll deploy it too if you like
[16:32:07] yes, thanks much
[16:32:23] ok let's think about this folks: what do we risk with this going out now?
[16:32:41] * RoanKattouw prepares cherry-pick but won't merge yet
[16:33:43] Meh AaronSchulz beat me to it by a few seconds :)
[16:33:51] beat you to what? :)
[16:34:04] Submitting a cherry-pick of that commit
[16:34:19] AaronSchulz: the new 1.5 proxies are out of the pool, they're ms-fe3 & ms-fe4
[16:34:27] See https://gerrit.wikimedia.org/r/#/c/21193 , we uploaded identical commits seconds from each other, the only difference is the name of the committer
[16:34:38] And apparently mine won because it was submitted a few seconds after Aaron's :S
[16:34:42] :-D
[16:34:42] AaronSchulz: also, 3) is the auth caching still on? I remember reading that it caused troubles yesterday and you disabled it
[16:38:13] so this is deployed to...
[16:38:24] 1.20wmf-something?
[16:38:27] Not yet
[16:38:37] It's been submitted for review to 1.20wmf9 and 1.20wmf10
[16:38:44] which would be all projects
[16:38:45] ok
[16:38:55] how does that process work?
[16:39:09] From there either Aaron or myself can approve&merge those, then manually git pull && sync-file on fenari
[16:39:23] ok
[16:39:35] gerrit-wm Change merged: Aaron Schulz; [mediawiki/core] (wmf/1.20wmf9) - https://gerrit.wikimedia.org/r/21194
[16:39:37] gerrit-wm Change merged: Aaron Schulz; [mediawiki/core] (wmf/1.20wmf10) - https://gerrit.wikimedia.org/r/21193
[16:39:51] then I suppose the actual config change needs to get made and synced
[16:40:16] ?
[16:44:27] AaronSchulz: so having synced the php, now you enable it via config?
[16:45:37] yes
[16:46:49] so my plan is, after the sync and the config change, I'm going to start my deleter on ms5 (not that it matters but still) and we'll see what dies :-P
[16:47:16] fair enough
[16:47:45] +1
[16:48:15] New patchset: Aaron Schulz; "Only write thumbnails to the master backend." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21195
[16:49:12] now there is no other container for thumbs, no temp or whatever, right?
[16:49:20] just local-thumb?
[16:53:39] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21195
[16:54:37] apergos: yes; all thumbs are in the -thumb container. archiving happens within the container (eg thumb/archived/a/a2/...)
[16:54:46] * apergos answers their own question by listing the containers
[16:55:16] maplebed: btw, I read that swift added versioning. what's your opinion about that?
[16:55:52] paravoid: definitely a useful feature in general, but not necessary here since mediawiki does the versioning for us.
[16:57:12] so in theory I can start deleting... speak now or share the blame when the site falls over
[16:58:06] actually I'll wait a bit, still see a lot of open connections
[16:58:08] apergos: deletionist!
[16:58:30] only stuff that can be regenerated from existing material!
[16:58:36] was it pushed already?
[16:58:37] :)
[16:58:50] I see it merged but not yet synced.
[16:58:50] http://wikitech.wikimedia.org/view/Server_admin_log
[16:59:03] guess I just missed it.
[16:59:08] * AaronSchulz hates comcast promotion calls
[16:59:18] the log messages don't go to the channel any more
[16:59:19] it's a bug
[16:59:27] (from sync file or whatever it is)
[16:59:37] robh: mutante and i were discussing this last night can you review https://gerrit.wikimedia.org/r/#/c/21145/
[17:00:29] (mutante) ^ autocorrect
[17:00:38] cmjohnson1: reviewing the related rt's now
[17:00:59] I don't see traffic to ms5 falling off yet either.
[17:01:07] I'm on there watching
[17:01:17] AaronSchulz: should we see effects immediately? or should it take a bit?
[17:01:27] number of connections seems to fluctuate but not fall off, it's true
[17:01:54] I don't see any change
[17:02:00] must be a bunch of readers
[17:02:32] tcp 0 0 ms5.pmtpa.wmnet:nfs srv258.pmtpa.wmnet:swat ESTABLISHED
[17:02:35] typical entry
[17:03:52] New review: RobH; "srv266 is an R610 with over a year of warranty left, thus it should not be decommissioned." [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/21145
[17:03:59] cmjohnson1: ^
[17:05:14] cmjohnson1: so rejected, you can cherry pick and remove srv266 from the decom list
[17:05:27] but the other two are ok, can re-review when thats fixed
[17:05:37] oops thought I checked that already and it expired in april... will take a look... ok..cool..thx
[17:05:50] racktables had no warranty info
[17:05:56] so i just pulled up dell's warranty status check
[17:06:09] i dont bother to bookmark it, its google's first reply on dell warranty check
[17:06:16] yep... i have the site bookmarked
[17:06:29] Now, if you want a supreme amount of busy work that has to someday get done
[17:06:38] sorry... that was my fault..should've never made it to the ticket
[17:06:41] I can give you the SQL query to run to dump out all of racktables into a CSV
[17:06:50] which you can then parse to find out which servers are missing info, and track down said info.
[17:06:52] i may need that for an audit
[17:06:57] oh, you will
[17:07:09] but you may want it well before so you can fix things before they see it
[17:07:18] I see that directory ops stopped, but not stores
[17:07:19] * RobH had two old ES servers labeled incorrectly from their initial install
[17:07:35] Jeff_Green is entirely responsible for the magic that is the sql query.
[17:07:49] when audit time comes, you can send him donations of .... i dunno
[17:07:55] is he the sql wizard?
[17:07:55] srv224 still doing nfs
[17:07:59] Jeff_Green: you want scotch, cookies, what?
[17:08:07] so...
[17:08:07] from/to ms5. bah humbug
[17:08:10] what are we waiting for?
[17:08:16] cmjohnson1: the sad part is i used to work writing sql queries for a living
[17:08:17] (it's a scaler)
[17:08:30] i have forgotten every single bit of it.
[17:08:45] well now I would say we are waiting to find out why the config change didn't have the impact we expected (at least that's what I'm on)
[17:08:56] robh: shame
[17:08:59] ie: I could explain nested queries and how i wanted the damned report to work, but i couldn't recall the syntax to save my life
[17:09:01] lol
[17:09:19] AaronSchulz: any thoughts?
[17:10:22] robh: send me the sql query when u get a chance... that may be something for me to do during the hurricane as long as pmtpa doesn't go down.
[17:11:20] cmjohnson1: it is already living on db9, i can walk you through using it now
[17:11:43] apergos: http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms5.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1345741873&g=network_report&z=large&c=Miscellaneous%20pmtpa
[17:11:44] (I see the changes in filebackend.php on srv224 so that made it around at least)
[17:12:09] cmjohnson1: so login as root@db9. in root's home is a sql query called racktables_inventory
[17:12:26] you run it with mysql < racktables_inventory > outputfilename
[17:12:40] then you can scp that result file over to fenari (your home directory) and do whatever
[17:12:49] it took that long?
[17:12:55] RobH: whiskeycookies!
[17:12:57] what changed to trigger it to drop all of a sudden like that?
[17:13:05] now tcpdump shows no packets
[17:13:13] well arp... :-P
[17:13:15] cmjohnson1: I dump the result file into gdocs spreadsheet, sort by rack row location for audit
[17:13:27] and then pull out all non tampa related items (easy since its sorted by location)
[17:13:32] apergos: there was extra whitespace in a setting name
[17:13:44] oh geeee
[17:13:44] damnable whitespace.
[17:13:50] Jeff_Green: duly noted, cmjohnson1 you owe jeff some whiskeycookies.
[17:13:57] that's why only the directory calls went away at first
[17:14:01] ok, waiting for the rest of the open connections to go away
[17:15:05] ah it's finally starting to drop off yay
[17:16:06] so AaronSchulz what do you think of the idea of sending test.wm.org traffic to the upgraded ms-fe production hosts?
[17:16:11] yes!
[17:17:49] maplebed: do they have their own rrdns or should I just pick one or mt_rand() between them or something?
[17:17:56] pick one.
[17:18:04] ms-fe3 or ms-fe4
[17:19:48] New patchset: Dzahn; "decom srv206,srv217 - RT-1422, RT-241" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21145
[17:20:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21145
[17:23:13] they are still not actually dropping off. from the scalers there is nothing but there are regular app servers that are doing things
[17:23:17] not good
[17:23:25] and that number is not declining, it's just bouncing around
[17:25:43] mutante: is your change the fix robh was talking about?
[17:26:11] maplebed: testwiki seems ok
[17:26:21] cmjohnson1: it is
[17:26:44] apergos: which number isn't declining?
[17:26:52] AaronSchulz: what tests did you run?
[17:26:53] something in the bowels of mw is still using ms5 nfs
[17:26:56] New review: RobH; "looks good" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/21145
[17:27:10] maybe it's just checking that the mount exists, who knows, but it's doing something
[17:27:13] merging on sockpuppet now
[17:27:29] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[17:27:36] well, havent yet
[17:27:45] mutante: were you already doing that? (I dont wanna do it if you are)
[17:27:50] maplebed: moving, deleting...someone should try uploading a kitty
[17:27:59] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21145
[17:28:02] apergos: how do you know that?
[17:28:03] * AaronSchulz already did restoring too
[17:28:04] I see no traffic
[17:28:13] and I've been writing it wrong; it's test.wikipedia.org, not wikimedia.org, right?
[17:28:17] I was watching tcpdump
[17:28:33] and there were nfs packets
[17:28:42] cmjohnson1: mutante's change (for srv206/217) is live
[17:28:46] to a specified app server (I chose one of the ones with an open connection at random)
[17:28:48] so you will wanna resolve those tickets
[17:28:49] maplebed: yes
[17:28:54] aka srv193
[17:29:00] (241, 1422)
[17:29:02] and wipe disks
[17:29:02] cool thx robh
[17:29:06] yep
[17:29:08] apergos: and it's not ganglia? ;)
[17:29:15] RobH: the decom change? no, i wasn't, but amended it. yup
[17:29:32] apergos: oh, and yes auth caching was renabled
[17:29:36] * AaronSchulz forgot to answer
[17:30:11] mutante: i merged, thx for amending
[17:31:25] RobH: yw. cmjohnson1: looks like it was.
[17:31:43] ganglia? on app servers?
[17:31:53] polling ms5 nfs? :-P
[17:32:05] ah, i am so used to having white text on a black background, now that it's black on white for once, i can hardly read IRC properly..its weird
[17:33:17] apergos: I don't think ganglia should be doing anything to ms5 on the app servers.
[17:33:36] I am sure it isn't, I was wisecracking back at aaron
[17:33:46] AaronSchulz: I followed your request. MOAR KITTENS. http://test.wikipedia.org/wiki/File:Feral_Cat.jpg
[17:33:53] apergos: from the apaches? odd
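[Editor's note: the capture apergos pastes in the next two lines came from watching the lingering NFS traffic on ms5; the exact invocation isn't in the log, but it was presumably something like the sketch below. Port 2049 is NFS; srv238 is the app server named in the paste.]

    # count the lingering NFS connections from the app servers
    netstat -tn | grep -c ':2049.*ESTABLISHED'
    # watch one connected apache to see whether it's real I/O or
    # just idle acks (-v gives the tos/ttl detail seen in the paste)
    tcpdump -v host srv238.pmtpa.wmnet and port nfs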
[17:33:55] 17:32:55.781805 IP (tos 0x0, ttl 64, id 24551, offset 0, flags [DF], proto TCP (6), length 52)
[17:33:56] srv238.pmtpa.wmnet.890 > ms5.pmtpa.wmnet.nfs: Flags [.], cksum 0xab6e (correct), ack 1, win 12, options [nop,nop,TS val 6742666 ecr 130598947], length 0
[17:34:02] for example
[17:34:07] scalers or regular ones?
[17:34:26] well that is 238, so not a scaler
[17:35:17] maybe it's something outside of mw verifying the mount?
[17:35:27] I'm just guessing wildly at this point
[17:36:00] apergos: well, it's still mounted, so...
[17:36:14] a bit of traffic is not really surprising
[17:36:18] anyways the point is there are still around 35-40 open connections at any time
[17:36:20] as long as it's not much
[17:36:24] nfs connections, from the app servers
[17:36:29] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[17:36:29] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[17:36:39] I would just like to be sure my delete run won't hose them.
[17:37:12] apergos: it seemed that ms5's issues yesterday only increased load on the scalers, not the rest of the apaches, right?
[17:37:19] so even if they're still connected, I'd bet you're ok.
[17:37:23] well .. heh, we didn't get that far
[17:37:37] anyways, straw poll: ok to go ahead with delete?
[17:38:24] * maplebed draws the short straw
[17:38:46] is that a yes or a no though? :-D
[17:39:17] well, since I drew the short straw, it means you're protected, right? I say go ahead.
[17:39:40] paravoid, AaronSchulz ?
[17:40:10] yes, go ahead
[17:41:09] * apergos 's impatience wins
[17:41:14] heh
[17:41:22] sorry too slow AaronSchulz :-D
[17:41:34] paravoid: ping
[17:41:56] "feral"...ohh
[17:42:03] preilly: pong
[17:42:06] for the first little while it won't do much, it's processing things that were deleted yesterday
[17:42:16] paravoid: can I pm
[17:42:19] sure
[17:42:21] AaronSchulz: :)
[17:42:23] yeah, I'm not seeing any attempts to write to ms5 from mw
[17:42:46] AaronSchulz: so, green light for ms-fe3/4?
[17:43:37] it looks ok
[17:43:54] great.
[17:44:10] AaronSchulz: about auth caching, did you disable it yesterday after all?
[17:44:25] [10:29] AaronSchulz apergos: oh, and yes auth caching was renabled
[17:44:55] oh, sorry, didn't see that
[17:44:58] oh so what was the bug in the end?
[17:44:59] tis live :)
[17:45:00] great
[17:45:03] er the cause I mean
[17:45:15] apergos: something stupid in CloudFiles
[17:45:32] since we have our own fork, stuff is easy to fix though
[17:45:37] yay for that
[17:45:40] are we contributing that back?
[17:45:45] (just curious)
[17:45:57] apergos: it wasn't letting cached credentials load with a cdn auth url
[17:46:01] *without
[17:46:10] of course, we don't use that stuff
[17:46:30] paravoid: I don't bother anymore, it's poorly maintained
[17:46:41] just look at the last 2 commits for example
[17:46:50] sigh
[17:47:13] when I started merging it into our fork as a commit, I realized it was totally broken and wip
[17:47:39] New patchset: Pyoungmeister; "mobile.pp: upping version number on vumi and vumi-wikipedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21198
[17:48:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21198
[17:48:36] hmm, I see a few more commits there lately
[17:48:46] none of which fix the big broken one
[17:49:04] which one is the big broken one?
[17:49:56] https://github.com/rackspace/php-cloudfiles/commit/930eb8df511160da6eae203e3baf986a24085cea
[17:54:16] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21198
[17:54:29] PROBLEM - Puppet freshness on snapshot4 is CRITICAL: Puppet has not run in the last 10 hours
[17:54:52] apergos: did you rm?
[17:55:03] I am doing so
[17:55:18] using the same boring script as
[17:55:22] was that only yesterday? :-D
[17:55:31] so whats going on with the upgrade?
[17:55:47] nothing yet
[17:56:13] maplebed: how would you feel about having mixed 1.4/1.5 proxies in LVS?
[17:56:18] I don't see any problems, but better to check
[17:56:35] I think it's ok.
[17:56:38] i.e. reenable ms3/4 in pybal
[17:56:47] did we confirm that traffic from test.wp.org is actually going to ms-fe3 or 4?
[17:56:48] then disable ms1/2, upgrade those two
[17:57:12] I didn't, but if Aaron says so.
[17:58:48] okay, so plan: reenable ms3/4, see how it goes until tomorrow our morning, then upgrade the other two
[17:58:54] fair enough?
[17:59:00] -1 upgrade on a friday.
[17:59:15] well, it's upgrade on thursday technically, but I see your point.
[17:59:38] the other option is doing all of them today or waiting until Monday.
[18:00:00] we want em to have a few hours on em I think
[18:00:10] AaronSchulz: which front end did you use? fe3 or fe4?
[18:00:11] right, I'd prefer that route too
[18:00:11] minimum
[18:00:14] (for test.wp.org)
[18:00:15] fe4
[18:00:28] but I'm not good for upgrades at midnight my time
[18:00:45] otoh really, even if we said yes we're going to deploy on a friday it's cool... I really want my friday
[18:00:50] I've lost every evening for a week
[18:01:01] apergos: note how I said our *morning*
[18:01:05] not evening :)
[18:01:09] yes, I noted that :-D
[18:01:39] so how about back ends? can we do any of those while we wait around?
[18:01:49] good q
[18:01:54] maplebed: you had some thoughts about this
[18:02:21] I don't see the log entry for my Feral Cat upload on ms-fe4.
[18:02:58] notpeter: ping
[18:03:39] maplebed: oh crap
[18:03:50] ?
[18:03:58] I think the auth caching may have ruined the test
[18:04:16] is it not just auth caching but also host caching?
[18:04:25] bummer :-(
[18:04:57] wow diablingn writes could not have come any later, 81gb free
[18:05:02] *disabling
[18:05:27] heh
[18:05:42] apergos: ossm.
[18:06:24] huh?
[18:06:58] AaronSchulz: could you explain a bit about that?
[18:07:06] the auth caching on testwiki etc.
[18:07:37] I just disabled it for testwiki
[18:07:39] * AaronSchulz is testing again
[18:07:45] apergos: the 81gb thing.
[18:08:06] AaronSchulz: well I got that part :) why?
[18:08:10] Ryan_Lane: morning
[18:08:17] morning
[18:08:22] how's the swift stuff going?
[18:08:23] Ryan_Lane: apparently we need your expertise
[18:08:41] on hurricane-hit datacenters
[18:08:44] I have none, but ok :)
[18:08:45] :)
[18:08:46] hahaha
[18:09:04] well, let me give you the short answer. we're fucked if it's a good direct hit
[18:09:19] yeah I'm asking what "ossm" means
[18:09:20] * Damianz gives ryan the bolt cutters and lets him loose at the power lines
[18:09:21] and a decent sized hurricane
[18:09:38] we'll have a couple weeks of up and down service
[18:09:41] minimum
[18:09:50] and that's if the datacenter *really* gives a fuck
[18:09:52] how long do their diesel generators work if electricity goes down? i am expecting like 10 minutes?
[18:10:03] as long as they keep refueling them
[18:10:09] usually 6 hours or so
[18:10:16] you mean if they didn't use the fuel for the lawnmowers?
[18:10:20] heh
[18:10:24] hehe
[18:10:30] push lawnmowers
[18:10:59] eh, nevermind, i should have said i am expecting like 10 minutes for backup batteries that give them the time to have the diesel generators up and running
[18:11:06] what's the category of the storm?
[18:11:13] and what's it expected to be when it hits?
[18:11:23] New review: Jerith; "Looks good." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21073
[18:11:40] I think it's too soon to know and too soon to know
[18:11:43] maplebed: what is the ip of fe-4
[18:11:51] AaronSchulz: I see auth stuff hitting ms-fe4 but no object stuff yet.
[18:11:59] http://www.wunderground.com/tropical/tracking/at201209_5day.html
[18:12:03] AaronSchulz: 10.0.6.215
[18:12:04] US forecasters said Isaac will likely turn into a Category 1 hurricane by Friday
[18:12:07] ,eh
[18:12:08] err
[18:12:09] meh
[18:12:13] cat I at best
[18:12:17] nothing to worry about
[18:12:34] and realistically when it lands it'll drop to a TS
[18:12:38] they don't know it will hit florida even (I think)
[18:12:42] Ryan_Lane: it's Tampa, not usmil
[18:12:47] but tis the season...
[18:12:55] paravoid: dude, a cat I is like a strong breeze
[18:13:15] I never even evacuated unless a storm was a cat III or better
[18:13:21] I have absolutely no idea about hurricanes
[18:13:27] that's like when i got into a monsoon in Australia and CT told me that it's like a breeze if you grew up with it :)
[18:13:33] I can tell you about earthquakes if you want
[18:13:44] we have local expertise in those
[18:13:45] a cat I is 75-95 MPH. a cat III is 111-130
[18:13:51] the wind isn't really the problem, though
[18:13:53] it's the water
[18:13:57] AaronSchulz: should I expect stuff to hit ms-fe4 yet or are you still changing something?
[18:13:59] and especially the tidal surge
[18:13:59] apergos: we haven't had any major earthquakes since you've been here, have we?
[18:14:00] it depends
[18:14:03] it's about time!
[18:14:06] maplebed: should be getting hits
[18:14:10] it's not.
[18:14:20] well when we have major quakes...
[18:14:31] I'm also looking at logs, I concur that it's not
[18:14:31] oh wait.
[18:14:44] and you also have active volcanoes
[18:14:46] I distrust the logs
[18:14:47] *maybe* the power will go out. if it does, I'd expect the datacenter to failover to generators without interruption. if their failover doesn't work, that's one more giant reason we shouldn't be using them
[18:14:52] because of the proxy-logging change.
[18:15:01] http://en.wikipedia.org/wiki/1989_Loma_Prieta_earthquake ( for paravoid)
[18:15:29] oh, except that it is getting log entries for the pybaltestfile.
[18:15:32] Jamesofur: ping!
[18:15:44] nevermind
[18:15:48] maplebed: pybal monitoring was logged
[18:15:49] right
[18:16:19] Ryan_Lane: it's also the objects that can be tossed against buildings (I recall a nice giant tree being tossed on the roof of the library I was in during one storm. made a huuuge V shape from the dent)
[18:16:33] of course for a dc you expect it to be impervious
[18:16:50] AaronSchulz: do you have ideas on why it wouldn't be hitting ms-fe4?
[18:17:04] Ryan_Lane: You know the failover failed before, right? 4th of July outage in 2011 (? or was it 2010?)
[18:17:10] no
[18:17:39] the swift stuff gets logged to /var/log/messages? really? ughhhh
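[Editor's note: the curl test mentioned in the next message presumably looked something like this standard Swift v1.0 auth handshake; the account, user, key and object path below are placeholders, not values from the log.]

    # authenticate directly against the upgraded proxy
    curl -si -H 'X-Auth-User: account:user' -H 'X-Auth-Key: SECRETKEY' \
        http://ms-fe4.pmtpa.wmnet/auth/v1.0
    # reuse the returned X-Auth-Token / X-Storage-Url for an object HEAD
    curl -sI -H "X-Auth-Token: $TOKEN" \
        "http://ms-fe4.pmtpa.wmnet/v1/AUTH_.../some-container/some/object"
    # or let the CLI do both steps, as apergos did at 12:42
    swift -A http://ms-fe4.pmtpa.wmnet/auth/v1.0 -U account:user -K SECRETKEY stat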
[18:18:14] a test to ms-fe4 using curl successfully logged, so logging's not the problem.
[18:18:35] RoanKattouw: yes
[18:18:56] apergos: I wouldn't worry too much about the wind
[18:19:03] apergos: a CAT I has fairly weak winds
[18:19:15] yes, for cat 1 I just wouldn't worry at all
[18:19:16] and by the time it hits tampa (if it does, even directly), they'll be much weaker
[18:19:25] A cat I is like having free burritos available for all the ops team
[18:19:31] :-D
[18:19:50] New patchset: Platonides; "Fix typo which renamed group from 'patroller' to 'atroller' on eswikt" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21203
[18:20:00] ^ can someone merge?
[18:20:29] I wouldn't have a picnic during a cat I, but I'd surely have a hurricane party :)
[18:20:31] so paravoid, apergos: you were asking earlier about putting them in; I'd like to get test using them first. So it sounds like we schedule a window on monday.
[18:20:32] I've been in a Bft 11 (just below Cat I) before, it was a bit impressive but not scary or damaging. At least not in the coastal areas where we were used to storms, some places inland where Bft 9-10 is a BFD had damage
[18:20:39] oohhhh, maybe we should have a hurricane party in the office
[18:20:42] heh
[18:20:43] hehe
[18:20:50] Hurricane party for a hurricane on the other side of the country
[18:20:51] only if it's the kind of hurricane you can drink.
[18:21:23] New review: Dzahn; "sure, obvious typo" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/21203
[18:21:24] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21203
[18:21:28] maplebed: ok.. what about backends, can we do any of those today?
[18:21:51] thanks mutante
[18:21:57] a sync-file now?
[18:23:06] Platonides: I think that atroller might be more accurate than patroller ;)
[18:23:06] apergos: up to you, but I'd lean towards getting the proxy tested before doing any of the backends. just a preference though, there's nothing stopping us from doing them first, with the one qualification that if something does go wrong (with the proxy testing) there's one more variable in the mix.
[18:23:44] notpeter, please explain that to the people unable to patrol the wiktionary :)
[18:24:00] but they can still troll it!
[18:24:05] so that's pretty good :)
[18:24:36] New patchset: Demon; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484
[18:25:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484
[18:25:29] RECOVERY - Puppet freshness on silver is OK: puppet ran at Thu Aug 23 18:24:59 UTC 2012
[18:26:51] Change merged: Pyoungmeister; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/21060
[18:28:19] maplebed: to put back ms-fe3/4?
[18:28:46] paravoid: we haven't gotten mediawiki to test against ms-fe4 yet.
[18:28:48] that sounds overly cautious but I'm okay with that
[18:29:02] heh, I changed it to fe3 and now I get hits for the IP for fe-1
[18:29:18] ok. so today, is there anything else we can move forward with?
[18:29:19] Platonides: what is the step in between merge in gerrit and having it in /h/w/common/wmf-config?
[18:29:39] i see a git status there with a modified filebackend.php and untracked files... hrmm
[18:29:49] so we have ms-fe3/4 online, then ms-fe1/2, then ms-be one by one
[18:29:55] and on top of that we have origs
[18:29:57] fun
[18:30:07] yeah it sure is
[18:30:10] (n't)
[18:30:10] Platonides: and sync-file docs have changed and say it syncs /home/wikipedia/common/wmf-deployment/ , not wmf-config
[18:30:50] apergos: waiting to see if anything's broken is more time than actually doing it though
[18:30:53] it's not much work
[18:31:06] yes, it's just
[18:31:08] proposed timeline?
[18:31:18] that we don't get ben til mid september
[18:31:27] no we don't
[18:31:28] so I'm feeling that crunch
[18:31:31] get him at all
[18:31:36] err... the 28th is my last day.
[18:31:44] full stop.
[18:31:53] yeah we know. rub it in :-P
[18:32:02] that's why I wanted to start with 1.5 yesterday
[18:32:02] apergos: no, the "mid september" part has changed
[18:32:22] yep
[18:32:27] huh?
[18:32:32] sync_scripts wikitech page now redirects to "Wikimedia binaries", but they aren't really binaries
[18:32:36] we know we don't get him til mid september, we only get him til the 28th
[18:32:45] no part time days no nothing
[18:32:59] okay, correct
[18:33:09] so, timeline?
[18:33:28] is gonna suck but that's how it is
[18:34:13] bah I am brain dead
[18:34:31] I'm about to leave too, to save what's left of my social life
[18:34:44] but I'd like to conclude on a timeline first
[18:34:48] !log deleting about 200k thumb dirs and their contents from ms5, unused on any project, covering june and july uploads
[18:34:56] I should have logged that ages ago
[18:34:58] Logged the message, Master
[18:34:58] anyways...
[18:35:20] well I hope that aaron and ben can get testing happening on the proxies
[18:35:28] so that monday we can move forward on that
[18:35:47] when do we try again with originals, now that ms5 is out of the loop?
[18:36:26] I guess it would be "put em in the pool" monday morning our time, then later that day try doing the other two?
[18:36:28] our queue has
[18:36:42] and then tues would be backends
[18:36:44] *sigh*
[18:36:54] 1) ms-fe3/4 testing & subsequent pooling, 2) ms-fe1/2 upgrade, 3) ms-be* upgrades, 4) origs
[18:37:10] ahh, I see
[18:37:24] for (1), if Aaron gives the green light soon, I guess you can do it maplebed?
[18:37:25] maplebed: when I authenticate I get the rr dns url
[18:37:40] it's a matter of setting two False to True and then making sure nothing melts
[18:37:50] AaronSchulz: rrdns or lvs?
[18:37:51] no wonder I wasn't getting 3/4
[18:37:57] rrdns is supposed to be long gone.
[18:38:00] AaronSchulz: local hack to override it? we need it for about 10 minutes...
[18:38:12] just don't do it at the end of your day
[18:38:13] http://ms-fe.pmtpa.wmnet:80/v1/AUTH_...
[18:38:15] maplebed: could you do that today perhaps? to accelerate our timeline?
[18:38:20] AaronSchulz: that's lvs. good.
[18:38:27] isn't that supposed to be ms-fe.svc?
[18:38:33] paravoid: yeah, should be able to.
[18:38:39] /etc/hosts file on test.wp.org would do it.
[18:39:05] okay
[18:39:09] so then it would be ms-fe1/2 upgrade monday,
[18:39:14] so, (1) today, unless something else goes horribly bad
[18:39:20] maybe backends monday evening our time? ugh
[18:39:24] (2) monday european monday
[18:39:37] monday we also have the ops meeting btw
[18:39:37] but backends might bleed into the next day
[18:39:53] anyone recently pushed out wmf-config stuff?
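[Editor's note: Platonides's question at 18:29 (the step between a gerrit merge and /h/w/common) never quite gets answered in-channel; per the flow RoanKattouw describes at 16:39, it is roughly the following sketch, with the file name and log message here being illustrative rather than from the log.]

    # on fenari, after the change is merged in gerrit
    cd /home/wikipedia/common/wmf-config
    git pull
    # push the one file to the apaches and write a server admin log entry
    sync-file wmf-config/InitialiseSettings.php 'Fix eswikt patroller group typo'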
[18:39:53] ok they will definitely bleed into next day, actually let's not schedule anything for mon evening
[18:40:21] back ends tuesday our time
[18:40:30] see how slow and painful that is
[18:41:20] I think I need to reboot ms-be3. the object server processes are all stuck in D state. ::sigh::
[18:41:47] do either of you want to poke before I reboot?
[18:42:00] okay, we've figured out what to do today and tomorrow it's readonlyfriday, so talk again tomorrow to sync up?
[18:42:22] no poking from me
[18:42:37] I suppose given the time constraints, we might suggest violating read-only friday.
[18:42:48] grrrr
[18:43:06] this weekend I will not be around (I hope) to help if things break
[18:43:17] and friday evening I want to be off work
[18:43:21] I see no problem doing something *early* tomorrow, i.e. late U.S. Thursday
[18:43:45] I'm ok with it too, and I will be around this weekend.
[18:44:01] okay, let's see how this day goes
[18:44:05] yep
[18:44:06] and talk tomorrow morning with apergos.
[18:44:13] !log rebooting ms-be3
[18:44:16] it's going to depend on getting the proxies pooled and etc
[18:44:22] Logged the message, Master
[18:45:38] New patchset: Pyoungmeister; "change character encoding for vumi" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21206
[18:46:05] maplebed: okay, in the meantime could you send us a bit more info about some of the items in the swift_tasks_2012-08-13 list?
[18:46:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21206
[18:46:37] like the redo zones part, you've told me a few things
[18:46:51] or chat with you tomorrow
[18:46:54] got to leave now
[18:47:01] thanks :)
[18:47:06] * apergos clocks out as well
[18:47:10] oh rats
[18:47:17] gaaahh one more task for me...
[18:47:32] PROBLEM - swift-account-server on ms-be3 is CRITICAL: Connection refused by host
[18:47:41] PROBLEM - swift-container-server on ms-be3 is CRITICAL: Connection refused by host
[18:47:57] New patchset: Pyoungmeister; "change character encoding for vumi" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21206
[18:48:04] paravoid: which ones?
[18:48:25] the ones we're unlikely to touch until Tuesday
[18:48:26] PROBLEM - swift-container-replicator on ms-be3 is CRITICAL: Connection refused by host
[18:48:26] PROBLEM - swift-account-replicator on ms-be3 is CRITICAL: Connection refused by host
[18:48:33] like zones in tampa
[18:48:35] PROBLEM - swift-object-replicator on ms-be3 is CRITICAL: Connection refused by host
[18:48:39] k.
[18:48:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21206
[18:48:55] statsd maybe, dunno
[18:49:02] PROBLEM - swift-object-server on ms-be3 is CRITICAL: Connection refused by host
[18:49:02] PROBLEM - swift-container-updater on ms-be3 is CRITICAL: Connection refused by host
[18:49:05] heyyyyy maplebed
[18:49:07] do you know what this is? http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=dataloss&mreg[]=%5Edataloss%24&hreg[]=emery&aggregate=1&hl=emery.wikimedia.org|Miscellaneous%20pmtpa
[18:49:07] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=dataloss&mreg[]=%5Edataloss%24&hreg[]=emery&aggregate=1&hl=emery.wikimedia.org|Miscellaneous%20pmtpa [18:49:11] PROBLEM - swift-account-auditor on ms-be3 is CRITICAL: Connection refused by host [18:49:11] PROBLEM - swift-object-updater on ms-be3 is CRITICAL: Connection refused by host [18:49:13] i can't find 'dataloss' anywhere [18:49:20] PROBLEM - SSH on ms-be3 is CRITICAL: Connection refused [18:49:22] maplebed: just trying to squeeze every last bit of information out of you, sorry :-) [18:49:23] maplebed: yeah, so trying to hack fe-4 into the storage url seems to work well [18:49:27] maplebed: and thanks so much for everything [18:49:28] *doesn't seem [18:49:29] PROBLEM - swift-object-auditor on ms-be3 is CRITICAL: Connection refused by host [18:49:38] gtg now, bye [18:49:47] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: Connection refused by host [18:49:53] cya [18:49:56] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: Connection refused by host [18:50:01] ottomata: looking [18:50:27] New review: Jerith; "\o/" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/21206 [18:50:34] ottomata: the dataloss metric is useless. pay attention to the packet_loss_* metrics instead. [18:52:31] where does dataloss come from? [18:52:33] robla is wondering [18:52:36] can we get rid of it? [18:52:40] robla sees it [18:52:42] it's embedded in udp2log [18:52:42] and then gets scared [18:52:43] and then asks me [18:52:47] and I say 'iunnnooooo' [18:52:50] and it doesn't do the ignore ssl traffic stuff [18:52:52] :) [18:52:53] udp2log sends directly to ganglia? [18:52:53] so gets 99% all the time. [18:52:56] yes. [18:52:58] ah...that problem [18:53:06] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21206 [18:53:10] can I push a change to stop it from doing that? [18:53:12] ok. it's been a while since I looked at that [18:53:15] that will one day get built and deployed? [18:53:26] it's a good nag for us to fix the underlying problem [18:53:29] ok that will take 3 hours, so I am definitely clocked out. tah [18:56:25] maplebed: yeah, so I'm not sure how to test this then [18:57:06] AaronSchulz: I thought you just said hacking the storage url worked. [18:57:10] well, at least I know authing works with 1.5 [18:57:23] no I was saying it didn't ;) [18:57:39] oh. you corrected yourself later; I missed that part. [18:57:42] maplebed, I don't see anything about ganglia or dataloss in the udplog repo [18:57:50] there is packet-loss.cpp [18:57:56] but that is the custom filter that all the udp2log instances run [18:58:03] AaronSchulz: putting ms-fe.pmtpa.wmnet in /etc/hosts with ms-fe4's ip address will do it for that host. [18:58:04] and we use that for packet loss alerts, etc. [18:58:06] I was getting errors on everything I did on testwiki with that hack [18:58:38] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [18:59:05] PROBLEM - Host ms-be3 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:08] the requests did go to the right proxy though ;) [18:59:44] damn. I missed watching ms-be3 boot. [19:00:25] why was it rebooting? [19:00:40] because I rebooted it to look for a memory test error.
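For context on the packet_loss_* metrics mentioned above: packet-loss.cpp is the custom udp2log filter, and the general idea, assuming loss is estimated from gaps in the per-host sequence counters carried in the relayed log stream, looks roughly like this toy Python restatement (the real filter's field layout and windowing differ):

    # Toy re-statement of the packet-loss idea: each udp2log source tags
    # lines with an increasing sequence number, so missing numbers suggest
    # dropped packets. Field positions here are assumed, not the real format.
    from collections import defaultdict

    def loss_by_host(lines):
        seen = defaultdict(int)      # packets that actually arrived, per host
        expected = defaultdict(int)  # packets implied by the sequence numbers
        last = {}
        for line in lines:
            fields = line.split()
            host, seq = fields[0], int(fields[1])  # assumed layout
            seen[host] += 1
            if host in last:
                expected[host] += max(seq - last[host], 1)  # tolerate resets
            else:
                expected[host] += 1
            last[host] = seq
        return {h: 100.0 * (1 - seen[h] / expected[h]) for h in seen}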
[19:01:11] RECOVERY - swift-account-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [19:01:20] RECOVERY - swift-container-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [19:01:20] RECOVERY - swift-object-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [19:01:20] RECOVERY - swift-account-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [19:01:20] RECOVERY - swift-object-updater on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [19:01:20] RECOVERY - Host ms-be3 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [19:01:29] RECOVERY - SSH on ms-be3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:01:29] RECOVERY - swift-object-auditor on ms-be3 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [19:01:56] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:02:05] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [19:02:15] RECOVERY - swift-container-replicator on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [19:02:15] RECOVERY - swift-object-replicator on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [19:02:23] RECOVERY - swift-account-replicator on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [19:02:41] RECOVERY - swift-container-updater on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [19:04:56] maplebed: did you see the error? [19:05:04] still on the phone about 6 [19:05:24] on ms-be3? no, I was trying to do too much at once and missed watching during the time it would have shown up. [19:05:40] the host does seem happier at the moment though, and I'm looking at the ipmi log. [19:06:01] looks clean. [19:12:12] maplebed: I guess you can try the /hosts thing on srv193 [19:12:27] not sure if it would work any better [19:12:41] wait, that isn't what you said you just tried and it failed? [19:13:03] ::sigh:: [19:13:06] I was rewriting the storage url to use ms-fe4 [19:13:14] totally misinterpreted your message due to its ordering in the IRC log. [19:13:14] anyway, hosts is rooted [19:13:15] maplebed: so the hard drives that are going bad on all the ms-be's are going into predictive failure... once the error rate hits a certain threshold the drive is turned off. no fix other than to replace. [19:13:43] any comment about why they're all failing? [19:13:56] and by 'they all' I mean the ~10% failure rate we're seeing. [19:13:57] they could have bad sectors [19:14:25] AaronSchulz: I'll make that change. srv193 you say? [19:14:27] which is most likely the case since they're out of the box like this... dell is buying cheap hard drives from western digital (my opinion) [19:14:36] yeah, that's testwiki [19:15:05] cmjohnson1: there isn't any way for you to ID the failed drives, is there? [19:15:12] it's something you need to work with me on? [19:16:37] what do you mean? if it isn't mounting is one way. or do you mean identifying which drive is the bad one by slot#?
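The "trying to mount the bad drive" approach described above can be mechanized: compare the mountpoints fstab expects against what /proc/mounts actually shows. A sketch, with the /srv/swift-storage/ one-mountpoint-per-drive layout assumed rather than taken from the real ring:

    #!/usr/bin/env python
    """List swift storage mounts that are expected but absent.

    Sketch only: the mountpoint prefix is an assumption about the layout.
    """

    def mounted_points(mounts_path="/proc/mounts"):
        with open(mounts_path) as f:
            return {line.split()[1] for line in f}

    def expected_points(fstab_path="/etc/fstab", prefix="/srv/swift-storage/"):
        points = set()
        with open(fstab_path) as f:
            for line in f:
                if line.strip() and not line.startswith("#"):
                    mnt = line.split()[1]
                    if mnt.startswith(prefix):
                        points.add(mnt)
        return points

    if __name__ == "__main__":
        for mnt in sorted(expected_points() - mounted_points()):
            print("NOT MOUNTED (drive likely failed): %s" % mnt)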
[19:17:52] just whether you need help to ID which ones have failed [19:18:41] AaronSchulz: I think it worked. [19:19:53] err.. I think \o/ [19:20:09] I assume this is you: /Test_for_testing_fancy_test_uploads_09.jpg [19:20:46] Error deleting file: Could not move file "mwstore://local-swift/local-public/8/88/Test_for_testing_fancy_test_uploads_01.jpg" to "mwstore://local-swift/local-deleted/2/7/f/27f9mwpmp4t54jc8ecq19irwelo8v6y.jpg". [19:21:01] I see the results hitting ms-fe4, which means it's lunch time for me. [19:21:04] back in a bit. [19:21:08] yeah, no moving/deleting works (like when I had the hack before) [19:33:17] maplebed: will get new DIMM and hard drives tomorrow. as far as figuring out which is the right drive. our system of trying to mount the bad drive is the best [19:33:48] unless we want to load MegaCli [19:55:47] sorry, mutante, I was away [19:56:05] the step between merging and having it there... [19:56:07] git pull? [19:57:52] Got the following error when trying to delete a file on test.wikipedia.org: Error deleting file: Could not move file "mwstore://local-swift/local-public/f/fb/Jar.jpg" to "mwstore://local-swift/local-deleted/6/l/c/6lcuv5u7egt9yze8e0xjx7tw8kdg8ut.jpg". [19:57:57] Tried again and got... [19:58:08] Errors were encountered while deleting the file: [19:58:09] The file "mwstore://local-multiwrite/local-public/f/fb/Jar.jpg" is in an inconsistent state within the internal storage backends [19:58:09] The file "mwstore://local-multiwrite/local-deleted/6/l/c/6lcuv5u7egt9yze8e0xjx7tw8kdg8ut.jpg" is in an inconsistent state within the internal storage backends [19:58:40] maplebed, AaronSchulz: ^ [19:59:15] we saw some of them before [20:02:05] Oh well, guess I'll have to scap without testing on test. [20:35:36] New patchset: Dzahn; "add /var/cache/planet and /usr/share/planet-venus/theme/common to be owned by planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21264 [20:36:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21264 [20:36:28] nm [20:36:32] oop, wrong chat [20:53:40] hmm [20:53:55] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21264 [20:58:09] adding the Topic-branch name to gerrit-wm output would be nice [21:01:03] <^demon> I don't think gerrit gives us that. [21:01:11] <^demon> We don't have the whole ref available, iirc. [21:05:12] AaronSchulz: sorry, I had some paperwork to attend to. [21:05:51] np :) [21:06:35] maplebed: I think I see the problem though [21:06:43] while you were doing your paperwork ;) [21:06:53] best outcome of paperwork EVAR! [21:07:09] CF has double encoding and wrong / handling in the copy functions [21:07:30] the encoding calls there don't match anything else, and don't follow the api docs wrt / [21:07:51] ossm. [21:08:04] * AaronSchulz runs the tests again to confirm a few times [21:08:42] it also explains the "garbage encoded file" entry that made the listing tests fail [21:09:57] AaronSchulz: I want to grab a tcpdump. would you tell me immediately before beginning the test and as soon as it finishes? [21:10:01] I don't have rights to rename files. [21:10:12] I'm running tests against copper [21:10:17] not touching testwiki atm [21:10:23] ah. [21:10:45] takes 6min :) [21:10:58] would you mind granting me rights on testwiki or doing a file move / deletion for me? [21:11:22] username? [21:11:27] bhartshorne. [21:12:01] done [21:12:25] confirmed. thanks.
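On the capture itself: a sketch of the kind of tcpdump invocation used here, wrapped in Python; the interface, host, and duration are placeholders, not what was actually run. -s 0 keeps whole packets so the HTTP verbs and headers examined below stay readable:

    import signal
    import subprocess

    def capture(pcap="swift-test.pcap", host="copper", port=80, seconds=360):
        """Run tcpdump for the duration of a test, writing packets to a file.

        Sketch only: -s 0 captures full packets (COPY lines, Destination:
        headers, etc.); the host/port filter keeps the dump small.
        """
        proc = subprocess.Popen(
            ["tcpdump", "-i", "any", "-s", "0", "-w", pcap,
             "host", host, "and", "port", str(port)])
        try:
            proc.wait(timeout=seconds)      # or block on the test finishing
        except subprocess.TimeoutExpired:
            proc.send_signal(signal.SIGINT)  # let tcpdump flush and exit
            proc.wait()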
[21:13:34] "you do not have permission. reason: could not copy file." that's the expected failure, right? [21:16:04] same problem I had, yes [21:16:41] AaronSchulz: this collection of packets is fascinting. [21:16:50] *fascinating. [21:16:58] * AaronSchulz runs the copper tests a 3rd time [21:19:02] AaronSchulz: it looks like MW purges thumbs for the previous name even if the move to a new name fails. is that expected? [21:19:18] maybe, the code isn't usually optimized around failure cases [21:19:52] it also looks like after checking the original exists, it does a HEAD against the new location (for the original) 4 times. [21:20:23] are you looking at only the proxy-server entries? [21:20:28] yes. [21:20:43] I don't recall seeing that many heads [21:20:52] though I do recall more than one for the same file [21:21:07] most of the heads were internal ones (from X-Newest) [21:21:29] the other interesting thing is that before asking for the list of thumbs, it does a HEAD against the container (presumably to verify it exists?) [21:21:35] or any 404 head, really [21:21:49] yeah, but this is just against the raw container [21:21:55] maplebed: containers are cached in memcached, so that should be rarish [21:22:02] oh, the container [21:22:23] that should be even less frequent then [21:22:33] at least from mw [21:22:46] I'll repeat the test and see if it's the same. [21:25:48] Jamesofur: http://meta.wikimedia.org/wiki/Planet_Wikimedia#Requests_for_Update_or_Removal [21:25:57] AaronSchulz: this is what I was looking at: http://pastebin.com/anpNn3tZ [21:26:01] Jamesofur: i am going to update all of the "has moved" section [21:26:11] (that was when I said to move the file from feral cat 2 to feral cat 3) [21:27:15] maplebed: you will get the backend-sync error if you try on the same file again :) [21:27:21] mutante: awesome thanks! Looking through the rest now, I know that some of these shouldn't be dead [21:27:25] I tried a different file. [21:27:28] thanks for the warning though. [21:28:04] ah! I see the problematic line this time! [21:28:13] I think last capture was the second try on the same file [21:28:35] and I didn't look closely at the error - you're right; it was inconsistent backends. [21:28:47] COPY /v1/AUTH_xxxxxx/wikipedia-test-local-public/f%252Ff7%252FFeral_Cat.jpg [21:28:57] that's it right there. [21:29:02] so ... I confirm what you already found out. [21:29:03] \o/ [21:29:43] the destination header suffers from the same encoding issue. [21:29:44] do we have a gerrit revision interwiki like we did with svn code review? [21:30:24] Destination: wikipedia-test-local-public%2Fa%2Fa2%2FFeral_Cat_Stares.jpg [21:30:25] oh wait! [21:30:27] it's not quite the same [21:31:00] the URL is double-encoded (%252F) but the destination is only single-encoded (%2F) (%25 being the code for %) [21:33:57] I deployed the fix now [21:34:01] seems to be working on testwiki [21:34:18] maplebed: ok, I think the upgrade situation is starting to look good now [21:35:02] New patchset: Dzahn; "fix all "has moved" warnings on en.planet update, per:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21275 [21:35:03] fileop, error, and swift logs are fine [21:35:32] AaronSchulz: so do you know if the previous version of swift just accepted the double encoding and figured it out? [21:35:48] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21275 [21:36:12] New review: Dzahn; "http://meta.wikimedia.org/w/index.php?title=Planet_Wikimedia&oldid=4062146#has_moved_.283xx.29" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/21275 [21:36:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21275 [21:36:41] maplebed: apparently it tried to fix them somehow [21:36:47] * AaronSchulz looks at the release notes [21:37:07] just to see, can I run a tcpdump while you do a file move against copper? [21:37:38] * AaronSchulz looks at https://bugs.launchpad.net/swift/+bug/857673 [21:40:13] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [21:42:41] maplebed: ready for dumping? [21:42:53] ready [21:42:56] going [21:43:47] you should see a bunch of stores, then copies, then copies, then deletes, then deletes, then deletes [21:44:00] hmm, I'll do it again with just 1 src file [21:44:09] ok [21:44:17] it's all being piped to a file; I don't see any of it realtime. [21:44:18] done [21:44:19] you're done? [21:44:22] great [21:44:34] hmm that might be messy [21:44:41] can you run it on a new file? [21:44:52] huh? [21:45:03] the file you piped it to [21:45:18] I was doing batches with 10 or so files earlier [21:45:27] none of the copies in the test have encoded URLs. [21:45:39] this is running with the CF fix btw [21:45:55] heh. [21:46:00] there's a fault in your test case. [21:46:14] none of the test files have shards and none of them have slashes in the name. [21:46:19] they're all just container/name. [21:46:21] (this is fileOpPerfTest) [21:46:30] the encoding bug is only in the second slash and beyond. [21:46:30] it's not the full unit tests [21:46:38] those tests won't catch the bug. [21:46:44] about to run scap [21:46:57] maplebed: want me to run the actual unit tests? [21:47:04] and should I downgrade cloudfiles? [21:47:16] (on my machine) [21:47:21] I didn't realize this was the new cloudfiles. [21:47:26] so I don't need the entire unit test, [21:47:36] I just need a single file but it must have a / in the path somewhere. [21:47:43] I can just run the copy tests (though it will still have some cruft in there) [21:47:50] hmm [21:47:59] the buggy cloudfiles produces URLs like container/foo%252Fbar [21:48:03] notice how the first slash is preserved. [21:48:28] the tests you just ran were files like container/foo (no slashes beyond the one separating the file from the container) so it won't be visible. [21:48:43] (sorry to be repetitive.) [21:49:05] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [21:49:13] that all makes sense, right? [21:49:23] yeah, I'll just run the copy unit test [21:50:08] I'm ready when you are; gimme the go ahead and I'll start the tcpdump. [21:50:31] now :) [21:50:37] man, do I ever love tcpdump as a tool. [21:50:39] it's going now. [21:51:15] done [21:51:40] and again, new CF :) [21:52:01] ok. [21:52:28] Jamesofur: all the "has moved" warnings are gone. i am going to paste more stuff in the non-en sections [21:52:42] AaronSchulz: there was no traffic to copper during that time. [21:52:48] Jamesofur: i am linking to gerrit changes .. [21:52:53] I got one auth request and that was it. [21:53:27] were you running the dump right? [21:53:40] the tests worked fine for me [21:54:04] yes. but auth gets back msfe-test.wikimedia.org, so will be balanced between Cu, Mg, and Zn.
same problem we saw trying to test against ms-fe4 this morning. [21:54:06] ::sigh:: [21:54:38] mutante: perfect thanks, I'm working through the other en warnings now and will probably make a commit soon with the 401/403/404s . Yeah, I was wondering if we had an interwiki for it (code review uses [[rev:##### ]] for example). Doesn't look like it yet [21:55:39] AaronSchulz: sorry, but can you run it once more? [21:55:45] I've got dumps ready to go on all three hosts this time. [21:55:53] yeah, I was dumping the storage_url, I see what you mean about the balancing [21:55:54] Jamesofur: oh true, we do, [[gerrit:1234]] [21:56:01] dumps running. [21:56:02] ahhhh, perfect [21:56:04] run at will. [21:56:11] ok [21:56:45] done [21:57:03] looks like it hit zinc that time. [21:58:08] AaronSchulz: looks like the /s are encoded correctly (aka left alone) [21:58:46] I'd hope so [22:00:06] do you have the previous version available to run? I'd like to see the same test show the error again. or I suppose you could deploy that to testwiki and we could see the fix applied there. [22:00:40] the fix is already on testwiki [22:00:46] oh great. [22:00:48] * maplebed tests [22:01:01] manual testing worked for me, it's on all wikis too [22:01:02] New patchset: Ryan Lane; "Set puppet servername and certname based on realm and site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21280 [22:01:28] anyway, I can always run old versions for my copper tests [22:01:41] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/21280 [22:01:53] hey, know of a quick way to tell crond to just execute everything in an existing user crontab "right now" without caring about the time and having to copy/paste/edit to a script ... [22:02:41] New patchset: Ryan Lane; "Set puppet servername and certname based on realm and site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21280 [22:02:44] mutante: there is no way [22:03:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21280 [22:03:25] alright, i guess i'll write a script to grep the commandlines out of there [22:03:31] maybe use for i in `crontab -l | awk '{ print $ }'` do; $i; done [22:03:32] ? [22:03:47] Would suck if using multiple args of different counts [22:03:50] of course that awk wouldn't work if there's args [22:03:51] yeah [22:03:53] AaronSchulz: confirmed! [22:04:28] someone want to review a possibly horribly destructive change? 
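The %252F symptom tracked down above is classic double encoding: percent-quoting a path that is already percent-quoted turns each %2F into %252F, because % itself encodes to %25. A minimal reproduction with urllib.parse (illustrative only; cloudfiles' own encoding calls differ):

    from urllib.parse import quote

    name = "f/f7/Feral_Cat.jpg"  # object name with slashes beyond the container

    # Correct: quote once, leaving "/" alone (swift object names may contain it)
    once = quote(name, safe="/")          # 'f/f7/Feral_Cat.jpg'

    # The failure mode seen in the capture: quote with no safe characters,
    # then quote the result again; %2F becomes %252F ("%" -> "%25")
    broken = quote(quote(name, safe=""))  # 'f%252Ff7%252FFeral_Cat.jpg'

    print(once)
    print(broken)

The mismatch with the singly-encoded Destination header is consistent with the failed moves and deletions seen earlier: the COPY source and its destination no longer referred to the same object name.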
[22:04:28] https://gerrit.wikimedia.org/r/#/c/21280/ [22:04:38] well, or i just change the times to $NOW + 1 minute [22:05:00] thanks, something similar will work [22:05:07] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [22:05:07] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [22:05:07] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [22:05:07] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [22:05:07] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [22:05:08] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [22:05:08] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [22:05:09] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [22:05:09] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:05:10] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [22:05:10] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [22:05:11] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [22:05:11] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [22:05:18] * AaronSchulz finds the word "realm" to be magical [22:05:24] awk can print from column number -> end [22:05:49] * Jasper_Deng_sick is getting errors on wiki [22:05:58] "(Cannot contact the database server: Unknown error (10.0.6.73))" [22:06:17] dberrors.log spamming [22:06:19] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 210 seconds [22:06:31] Thu Aug 23 22:06:20 UTC 2012 srv270 enwiki Error connecting to 10.0.6.73: Lost connection to MySQL server at 'reading initial communication packet', system error: 111 [22:06:37] PROBLEM - MySQL Replication Heartbeat on db59 is CRITICAL: CRIT replication delay 230 seconds [22:06:39] spamming => flooding [22:06:46] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 238 seconds [22:06:47] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CRIT replication delay 239 seconds [22:06:49] the error is real [22:06:55] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 247 seconds [22:06:55] PROBLEM - MySQL Replication Heartbeat on db1049 is CRITICAL: CRIT replication delay 248 seconds [22:07:02] enwiki is down [22:07:05] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [22:07:06] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 256 seconds [22:07:06] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 256 seconds [22:07:13] PROBLEM - MySQL Replication Heartbeat on db1017 is CRITICAL: CRIT replication delay 267 seconds [22:07:22] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 275 seconds [22:07:27] schema changes? 
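Back to the crond question: the one-liner floated above loses arguments because awk prints a single column. A sketch that keeps each command line intact instead, assuming standard five-field crontab entries (comments, blanks, VAR=value lines, and @special entries are skipped):

    import subprocess

    def run_crontab_now():
        """Execute every command from the current user's crontab immediately.

        Rough sketch: standard 5-field entries only. Commands run via the
        shell, so multi-argument command lines and pipes survive intact,
        which is exactly what the awk one-liner couldn't guarantee.
        """
        out = subprocess.check_output(["crontab", "-l"], text=True)
        for line in out.splitlines():
            line = line.strip()
            if not line or line.startswith(("#", "@")) or "=" in line.split()[0]:
                continue  # comment, blank, @special, or VAR=value line
            fields = line.split(None, 5)
            if len(fields) < 6:
                continue  # not a standard 5-field + command entry
            subprocess.call(fields[5], shell=True)  # drop the 5 time fields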
[22:07:40] PROBLEM - MySQL Replication Heartbeat on db38 is CRITICAL: CRIT replication delay 290 seconds [22:07:40] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 292 seconds [22:07:49] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: CRIT replication delay 299 seconds [22:08:03] Ryan_Lane: on fluorine under /a/mw-log [22:08:07] RECOVERY - MySQL Replication Heartbeat on db59 is OK: OK replication delay 0 seconds [22:08:16] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 0 seconds [22:08:16] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay 0 seconds [22:08:25] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [22:08:25] RECOVERY - MySQL Replication Heartbeat on db1049 is OK: OK replication delay 0 seconds [22:08:32] not flooding anymore, hmm [22:08:34] RECOVERY - LVS HTTP IPv4 on m.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 0.080 second response time [22:08:35] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [22:08:35] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay 0 seconds [22:08:43] RECOVERY - MySQL Replication Heartbeat on db1017 is OK: OK replication delay 0 seconds [22:09:00] enwiki is up here atm [22:09:10] RECOVERY - MySQL Replication Heartbeat on db38 is OK: OK replication delay 0 seconds [22:09:10] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay 0 seconds [22:09:19] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 0 seconds [22:09:19] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 0 seconds [22:09:19] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [22:09:31] interesting... [22:10:02] (/now/ it's back up) [22:10:28] New patchset: Jalexander; "Fixing 401, 403 and 404 errors from update Some blogs removed some updated" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21283 [22:11:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21283 [22:12:23] anyone? https://gerrit.wikimedia.org/r/#/c/21280/2 ? [22:12:23] :) [22:16:13] New review: awjrichards; "Looks good; please approve if this is sane! Right now clicking 'WLMMobile_latest.apk' on the nightly..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/19240 [22:16:51] any op available for a quick review/merge of simple change? ^ [22:17:26] New review: awjrichards; "PS this should fix: https://bugzilla.wikimedia.org/show_bug.cgi?id=39275" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/19240 [22:30:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21283 [22:31:11] New patchset: Ryan Lane; "Set expiration time for keystone tokens to 7.1 days" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21287 [22:32:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21287 [22:33:54] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21280 [22:37:26] New patchset: Ryan Lane; "Revert "Set puppet servername and certname based on realm and site"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21290 [22:38:09] fucking. hate. puppet. 
[22:38:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21290 [22:38:30] certname = undefined [22:38:34] fuck you puppet. fuck you. [22:39:37] mutante: heh: default => undefined, [22:39:42] If puppet was written in python using Jinja2 it would be awesome, random dsl causes pain [22:39:42] that's wrong [22:40:51] puppet should require quotes for strings [22:40:53] !log putting ms-fe3 and 4 into rotation running swift v1.5.0-3 [22:41:03] Logged the message, Master [22:41:53] done. [22:42:08] logs are still quiet [22:43:41] traffic's a-flowin. [22:43:51] New patchset: Ryan Lane; "Set puppet servername and certname based on realm and site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21291 [22:44:40] New patchset: Ryan Lane; "Set expiration time for keystone tokens to 7.1 days" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21287 [22:45:30] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21287 [22:45:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21291 [22:45:31] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21291 [22:46:30] New patchset: Dzahn; "Explicitly specify FollowSymlinks for the Mobile nightly builds, using +SymLinksIfOwnerMatch instead. This is advisable to make Apache check if the symlink target belongs to the same user." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19240 [22:47:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19240 [22:48:03] the logs look ok. [22:49:57] AaronSchulz: I've got a potential bug for you. [22:50:12] in the same section that was incorrectly url-encoding /s, what does it do to !s? [22:50:46] wikipedia-commons-local-thumb.d3/archive/d/d3/20061230231901%2521Sulfur-hexafluoride-3D-vdW.png/120px-Sulfur-hexafluoride-3D-vdW.png <-- note the %2521 before Sulfur [22:51:03] that might be correct though, I'm still checking. [22:52:15] awjr: it does not work as expected.. i tested manually and it still denies us..looking [22:52:22] New patchset: Jalexander; "Fixing 401, 403 and 404 errors from update Some blogs removed some updated" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21293 [22:52:28] mutante :( thanks for checking [22:53:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21293 [22:54:29] AaronSchulz: nevermind. looks like that's just a logging thing. [22:54:41] ok [22:56:11] awjr: it is more the way the link itself is created than the Apache config [22:56:28] awjr: do you know how the symlink itself is being updated ? [22:56:46] mutante: i do not actually, i dont really know anything about the integration system [22:57:20] i think hashar and/or ^demon helped set it up [22:57:22] awjr: we don't need to change the apache config, it follows the symlink without it [22:57:43] mutante: ok i'll poke others about it - thanks for looking into it [22:57:47] awjr: it works now, i deleted the symlink and recreated it. difference being i used a relative path instead of the absolute path in the filesystem [22:57:50] <^demon> I didn't set it up, it was hashar. [22:57:59] <^demon> I just tried that fix cuz I couldn't figure out what's wrong.
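The symlink fix described above (a relative target instead of an absolute one) in sketch form; the file names are taken from the log, and the atomic-swap detail is an extra precaution rather than what was actually run:

    import os

    def point_latest(build="WLMMobile_19156fcc2a.apk",
                     link="WLMMobile_latest.apk",
                     directory="/srv/org/mediawiki/integration/WLMMobile/nightly"):
        """Recreate a 'latest' symlink with a relative target. Sketch only.

        A relative target resolves against the directory containing the
        link, so Apache's view of the docroot and the path the build job
        happens to run from no longer have to agree; an absolute target
        can trip Apache's symlink checks.
        """
        path = os.path.join(directory, link)
        tmp = path + ".tmp"
        os.symlink(build, tmp)   # target is relative: no directory component
        os.replace(tmp, path)    # atomic swap, no window with a missing link

This is the same effect as cd-ing into the nightly directory before running ln -s (or using ln's -t option to name the directory), which is what the conversation below converges on.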
[22:58:07] ^demon right on thanks [22:58:09] awjr: but i guess it will break again once it is rewritten [22:58:17] mutante ok cool - that was just what i was going to ask :p [22:58:35] lemme take a look if i can find that.. [23:01:13] New review: Demon; "One minor thing." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/21293 [23:01:35] ^demon: it was WLMMobile_latest.apk -> /srv/org/mediawiki/integration/WLMMobile/nightly/WLMMobile19156fcc2a.apk now it just is WLMMobile_latest.apk -> WLMMobile_19156fcc2a.apk [23:02:23] <^demon> mmk [23:05:16] New review: Dzahn; "actually it already follows symlinks without a change, so we don't need it. But the symlink is creat..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/19240 [23:05:58] Change abandoned: Demon; "Don't need this then." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19240 [23:08:53] <^demon> Gosh dang it. [23:09:08] <^demon> mutante: Reason for the full path is jenkins is running the ln -s from some random place I dunno. [23:09:09] New patchset: DamianZaremba; "Adding correct ip - change a while back due to corruption" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21297 [23:09:20] maplebed: stat caching seems to work from my testing [23:10:02] dunno what to say, man. [23:10:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21297 [23:10:24] ^demon: where did you find the "ln" command? [23:10:36] <^demon> It's in jenkins :) [23:10:49] <^demon> https://integration.mediawiki.org/ci/job/WLMMobile%20-%20Nightly%20builds/configure [23:11:11] ah, heh, i am not sure if i have a login [23:11:16] <^demon> Labs login. [23:11:21] ok [23:12:12] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [23:12:47] ^demon: can we add "cd /srv/org/mediawiki/integration/WLMMobile/nightly/" and then drop that part from the ln? [23:13:07] <^demon> Tried it. Got the forbidden again. [23:13:54] ^demon: we can try something else and use Alias in Apache config [23:14:00] like this: Alias /manual/ /usr/local/apache/manual/ [23:14:21] "If you need symbolic links consider using the Alias directive, which tells Apache to incorporate an external folder into the web server tree. It serves the same purpose but is more secure. [23:14:25] Read more at http://www.devshed.com/c/a/Apache/Setting-Permissions-in-Apache/1/#O9KA89s0rwE7uZQi.99 [23:15:52] <^demon> You know what'd be even easier...just copy twice. [23:16:04] <^demon> Copy the _latest, which would be overwritten on the next run [23:16:16] true [23:17:10] <^demon> Can you remove the symlink? [23:18:24] done [23:19:38] <^demon> Works :) [23:20:04] ^demon: nice:) well... and we could have also used -t with ln :p [23:20:29] specify the DIRECTORY in which to create the links [23:20:37] without having to cd [23:20:48] wait… did you guys just fix the issue with the latest link? [23:20:53] yea [23:20:59] dang you guys rule [23:21:02] thanks :D [23:22:10] yw [23:22:20] <^demon> And on that note, I'm gonna call it an evening. Later folks. [23:36:57] New patchset: Dzahn; "Fix remaining en planet errors including 500s and no data.
remove Chad's blog entirely per his comment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21293 [23:37:42] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21293 [23:39:31] Would anyone object if I redeploy a change to test that had to be backed out earlier? [23:39:57] New review: Jalexander; "Thanks Daniel, looks good (Weird, didn't realize you couldn't +1 after it was already merged)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21293 [23:42:21] Jamesofur: i like how we get redirect on "doubleredirection.wordpress.com" :) [23:43:54] Jamesofur: you know, for a bunch of redirections across all languages, most of them just needing a / at the end, should i even bother to list that on meta? [23:44:56] LOL, that sounds like it's on purpose [23:46:03] mutante: nah if it's that simple I'd just do it as long as we do it right away and don't need to remind someone to do it [23:46:57] the second most common is that "tag" changed to "category" apparently in some wordpress version [23:47:36] yeah, I've hit that a bunch of times [23:49:29] hi maplebed [23:49:39] hey, you're up early/late. [23:49:47] late :) [23:50:16] so, how's ms-fe3/4? [23:50:23] all good? [23:50:41] they're good. they're in rotation. [23:50:45] I saw you put them back online what? 15' ago? [23:51:08] no, 75 minutes ago? [23:51:28] oh, I was looking at http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=^ms-fe[1-9].pmtpa&mreg[]=swift_[0-9]%2B_hits__%24&mreg[]=swift_other_hits__%24&z=large&gtype=line&title=Swift+percentage+queries+by+status+code&aggregate=1&r=hour [23:51:40] yeah.... so I missed a bug in the ganglia logtailer. [23:51:58] the proxy log format changed slightly (added a parameter) and the regex parsing the log wouldn't take it. [23:52:08] oh [23:53:25] so, that's where? puppet? [23:53:51] well... I'm not sure how to update https://gerrit.wikimedia.org/r/#/c/18264/ to include it. [23:54:23] since you pushed a patch too, I can't just --amend what I currently have. [23:54:45] you can fetch that and amend [23:54:55] but no reason to have these in a single patch anyway [23:55:03] New patchset: Ryan Lane; "Change puppet branch to always use production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19786 [23:55:11] well, they should all be pushed simultaneously... [23:55:13] in fact I was thinking of pushing the proxy server stuff (proxy-server.erb + rewrite.py) tomorrow [23:55:33] actually, the logtailer change could go now; the change is backwards compatible. [23:55:40] right [23:55:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19786 [23:56:16] New patchset: Dzahn; "fix more redirections in de/en/fr/gmq/it/zh" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21303 [23:56:25] ok, I'll just push that one now. [23:56:28] and manually deploy to ms-fe3 [23:57:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21303 [23:57:31] :-) [23:57:46] so, 1.5 proxies look good, that's great news [23:57:58] I guess we'll push it to ms-fe{1,2} in a few hours then [23:58:09] New review: Dzahn; "get rid of all the warnings in planet logs for once" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/21303 [23:58:10] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21303
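On the logtailer bug above ("the regex parsing the log wouldn't take it" after a parameter was appended to the proxy log line): the backwards-compatible fix is to make the trailing field optional, so the same pattern matches lines from both versions. The field layout below is invented for illustration, not swift's actual proxy log format:

    import re

    # OLD anchors the line end right after the status, so a newly appended
    # field makes every new-format line fail to match (the reported bug).
    OLD = re.compile(r"^(?P<ts>\S+ \S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d{3})$")

    # NEW tolerates both formats by making the extra field optional.
    NEW = re.compile(r"^(?P<ts>\S+ \S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d{3})"
                     r"(?: (?P<extra>\S+))?$")

    old_line = "Aug23 23:51:00 GET /v1/AUTH_x/c/o 200"
    new_line = "Aug23 23:51:00 GET /v1/AUTH_x/c/o 200 0.0213"

    assert OLD.match(old_line) and not OLD.match(new_line)  # the bug
    assert NEW.match(old_line) and NEW.match(new_line)      # the fix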