[00:49:25] Hey Krenair, can I talk to you more about https://gerrit.wikimedia.org/r/#/c/26902/ when you have a minute? [11:00:32] Hi, does anyone know something about \Q in mathjax not working in Firefox but working in Safari ? [11:48:39] hello [12:07:23] Nikerabbit, o_0 http://translatewiki.net:8983/solr/#/collection1 [12:07:31] can I vandalise it plz? [12:10:08] #wikimedia-staff can be pretty handy for things like this [12:10:16] MaxSem: then I know who sabotaged tomorrows presentation [12:10:45] srsly, such services should be loopback-only [12:23:53] MaxSem: what is the issue? the worst someone can do is to delete the contents and insert some spam [12:24:22] They can also make your life miserable by messing with core admin [12:25:37] MaxSem: and do what? [12:27:20] make you live w/o collection1. create a core with config dir pointing so some funny place and attempt to get some files from your machine by using the functionality that returns config files [12:30:04] I'd say that would be vulnerability in solr then... [12:31:04] well, you may also say that there's a vulnerability in MySQL if it allows you to make it world-accessible with empty root password;) [12:32:07] MaxSem: there isn't any sensitive or important data in that instance, so I fail to see the point [12:32:25] okay, np [12:32:47] of course for production service it would not be public [12:32:53] ori-l: saper: I managed to get a GDB backtrace on my own :-] *happy* [12:35:06] * MaxSem attempts to create a core with instance dir pointing to /var/www and config being LocalSettings.php [12:37:04] MaxSem: you see, what do you complain of, he's giving you an infrastructure to play with to find vulnerabilities :p [12:38:59] MaxSem: let me know if it sucedes [12:39:19] rakkaus will;) [12:39:24] mwahaha [13:04:24] hmm [13:04:39] * hashar self award the "run php under GDB and get a stacktrace" badge. [14:28:48] <^demon> hashar: Soooo, gerrit-dev is running 2.6 :) [14:28:53] niiiice [14:29:12] <^demon> Trying to debug a stacktrace though http://p.defau.lt/?lBaiks9ssHZaKRAJGciqVQ :( [14:29:19] ^demon: with bugzilla integration? :-D [14:29:27] Reedy: I gave mcc.php some love with https://gerrit.wikimedia.org/r/44253 [14:29:40] <^demon> Nemo_bis: Not yet, still working on the plugin. But we'll have plugin support with the upgrade :) [14:30:02] ^demon: I need a zuul instance to point at it [14:44:42] New patchset: Hashar; "JSHint: run from bash and with checkstyle magic." [integration/jenkins-job-builder-config] (master) - https://gerrit.wikimedia.org/r/43773 [14:45:10] New review: Hashar; "PS9 use the 'jshint' builder macro instead of 'jshint-in' since we use the $WORKSPACE now :-)" [integration/jenkins-job-builder-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/43773 [14:46:34] Which extension is responsible for this "input methods" thing I see playing peekaboo at the lower-right corner of the editing textbox on mediawiki.org? [14:47:11] <^demon> anomie: CharInsert, iirc? [14:47:33] <^demon> Oh, input methods. [14:47:41] <^demon> That's Narayam. [14:48:27] <^demon> anomie: Questions about that would go to Siebrand or Nikerabbit, probably. [14:50:12] ^demon- Probably UniversalLanguageSelector, actually. Narayam says it's discontinued. [14:50:27] <^demon> Ah, didn't know that. :) [14:50:32] <^demon> Learn something new every day. [14:52:31] yes, it's ULS :) [14:57:32] * anomie files bug 44030 against it [15:07:20] hashar: merge: LOST? [15:07:21] https://gerrit.wikimedia.org/r/#/c/33505/ [15:07:33] due to restart? 
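A minimal sketch of what hashar's self-awarded badge involves: running a PHP script under GDB and pulling a C-level backtrace once it crashes or hangs. The binary path and script name here are placeholders, not taken from the log.

    gdb --args /usr/bin/php maintenance/someScript.php
    # then, at the (gdb) prompt:
    #   run        # execute the script inside the debugger
    #   bt full    # print the full backtrace after a crash (or after Ctrl-C on a hang)
    # attaching to an already-running process also works: gdb -p <pid>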
[15:07:45] yeah [15:07:54] will have to retrigger it once jenkins is rebuild [15:08:02] I usually restart Jenkins early in the morning [15:17:39] now that jenkins was restarted, can someone +2 https://gerrit.wikimedia.org/r/#/c/33505/ please? it was alread +2'ed before by Nikerabbit :) I'd like to get that one finally merged [15:45:35] <^demon> hashar: Bah, found the bug: https://code.google.com/p/gerrit/issues/detail?id=1165 [15:50:30] ^demon: blocking 2.6 I guess ? [15:50:57] <^demon> Na, it's not earth shattering. [15:51:28] good luck on that one :-] [15:51:40] I myself debugged a bug in a Jenkins plugin .. using strace! [15:51:47] need to actually read the java code now :( [16:05:28] out, going to get my daughter back home :] [16:05:36] http://en.wikipedia.org/wiki/Toboggan time! [16:42:13] git clone -q is too quiet [16:46:04] Reedy: ask its opinion on typos in commit messages [16:47:19] oh, there it goes [16:52:36] OHAITHAR 1.21WMF8 [16:56:46] yay! Reedy, did you and ryan figure out everything you need to deploy today? [16:57:22] Nope [16:57:30] I have absolutely no idea what I'm doing [16:57:37] Dang [16:58:04] Alright, I'll make sure he's around to help out.... [16:59:47] I went to bed after 4am my time (I haven't been up long). testwiki was using git deploy sometime after 2am I think [17:04:15] I'm going to half stage it to fenari like usual [17:10:17] https://www.mediawiki.org/wiki/MediaWiki_1.21/wmf8 [17:12:23] csteipp: Presumably I can take the wmf6 slot (not sure whether it's 0 or 1 on gitdeploy), change the branch over, update the submodules etc on tin? [17:13:41] Reedy: That sounds about right, although I don't know exactly how Ryan and Tim set things up yesterday [17:20:00] Hello. [17:25:11] http://wikitech.wikimedia.org/view/Git-deploy [17:25:29] http://wikitech.wikimedia.org/view/Git-deploy#Example_of_changing_versions_of_mediawiki [17:25:34] I guess that's what I'm doing ;) [17:42:29] matthiasmullie: ping [17:44:05] anomie: About? [17:44:13] Reedy- yes [17:44:21] * preilly is hoping that matthiasmullie is online at 6:44 PM [17:44:36] What am I supposed to do to rebuild the localisation cache now? [17:45:06] It should rebuild automatically when you do the 'git deploy sync', as far as I know. [17:45:30] Ryan_Lane- ^ Is that right? [17:45:32] hmm [17:45:42] It didn't first time, but I had no wikis on that version.. [17:46:05] Yeah, it needs at least one wiki to be able to run the maintenance scripts. [17:46:46] I think you used to switch test2 first for that? That should still work. [17:46:59] Yeah, That's what I've done [17:47:14] but before it needed the code to be sync'd so I would do that first [17:47:58] checkoutMediaWiki needs mostly re-writing too [17:51:09] <^demon> Reedy: Other than changing the paths, what else will it need? [17:51:27] Most of the actual checkout phase is redundant [17:51:35] update the slot rather than a clean clone [17:51:52] hi folks...things looking good for using git-deploy today? [17:52:20] <^demon> Reedy: Don't we still have to clone at some point? [17:52:21] Other than somewhat stabbing in the dark [17:52:32] No, because the old slot gets moved forward [17:52:47] http://wikitech.wikimedia.org/view/Git-deploy#Example_of_changing_versions_of_mediawiki [17:53:16] <^demon> Ah ok. [17:54:06] <^demon> That could easily be made a bash script. Then just get rid of checkoutMediaWiki [17:55:04] * chrismcmahon follows along [17:55:57] How do you deploy changes if they're already made to a repo? 
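A rough sketch of the slot switch Reedy describes (reuse the idle wmf6 slot, change the branch over, update the submodules on tin), following the wikitech Git-deploy "changing versions" page he links. The slot path and branch name are assumptions pieced together from later messages, not a verified procedure.

    cd /srv/deployment/mediawiki/slot1   # the slot being repointed (slot number assumed)
    git deploy start                     # open the deployment transaction first
    git checkout wmf/1.21wmf8            # switch the slot to the new branch (name assumed)
    git submodule update --init          # pull the extension submodules for that branch
    git deploy sync                      # push it out; the l10n cache rebuild is supposed to happen as part of this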
:/ [17:56:09] and my ssh just died [17:56:29] am I still here? [17:56:42] Reedy- you mean you forgot the "git deploy start"? Ryan taught me yesterday: "git deploy start; git deploy --force sync" should do it [17:56:46] <^demon> Reedy: IRC? Yep. [17:56:52] everything else is die [17:57:05] dieing [17:57:08] anomie: sort of forgot, sort of there was a lot of things to do [17:57:21] * anomie wonders why he keeps typoing "git deploy" as "git deplot" [17:57:46] I think it's something we're going to forget a lot for a while [17:58:11] can't we alias git deplot for you? ;) [17:58:49] Looks like my wireless connection has died [17:58:57] adsl is still up [17:59:11] Glad I decided to force irc through adsl [18:01:26] Damn it, can't push from common repo on tin to gerrit [18:02:22] more remotes [18:02:46] <^demon> What's the remote point to? [18:03:04] gerrit ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config.git (fetch) [18:03:19] <^demon> I'm saying the other remote you raged over? [18:03:46] https://gerrit.wikimedia.org/r/p/operations/mediawiki-config [18:03:51] <^demon> lol push over https. [18:04:01] <^demon> https://gerrit.wikimedia.org/r/#/settings/http-password [18:04:06] yeah... [18:04:15] Still needs the commit hooks [18:06:40] ! [remote rejected] HEAD -> refs/for/newdeploy (change 42730 closed) [18:06:57] is it trying to push everything? [18:07:16] Is that the automatic push? Or are you pushing that yourself? [18:07:28] me [18:07:28] git push gerrit HEAD:refs/for/newdeploy [18:07:33] for the mediawiki-config repo [18:07:48] <^demon> Well, that change is closed. [18:07:49] Cool. I'll add that to Ryan's list of things to do... :) [18:07:58] Indeed, I can see that ^demon ;) [18:08:00] <^demon> Oh, different branch? [18:08:19] <^demon> Use new change-id. Stupid gerrit. [18:08:19] yeah, we're still using the newdeploy branch [18:08:38] <^demon> You could merge the branch to master? [18:09:01] New change-id doesn't work [18:09:03] just tried it [18:09:21] It's probably better not merging it all to master till we've migrated [18:09:57] https://gerrit.wikimedia.org/r/#/q/Ic32529a94962ec332c4d6ab8d3a422f2ac98fd8f,n,z [18:10:30] Die gerrit, die. [18:10:42] :( [18:10:54] everybody loves gerrit [18:11:13] And my laptop battery is about to die [18:11:16] need to move [18:14:48] Reedy: you already started the deploy? :) [18:14:49] heh [18:15:36] Back again [18:16:13] Ryan_Lane: Yeah, force of habit with waiting years for scap means i do most of the prep before the window [18:16:24] * Ryan_Lane nods [18:16:29] probably not a bad idea [18:16:49] Nothing is using the slot, so nothing should be affected [18:16:55] indeed [18:16:57] Anyone any suggestions how to get this damn change to mediawiki-config newdeploy? [18:17:08] it's already pulled? [18:17:34] meaning, did you do it before you did git deploy start? [18:17:44] That's not the problem [18:17:47] It won't submit to gerrit [18:17:54] I made it on tin (updating symlinks and such) [18:18:02] ! [remote rejected] HEAD -> refs/for/newdeploy (change 42730 closed) [18:18:20] 42730 is the first commit on that branch [18:18:25] ah [18:22:54] Your branch is ahead of 'origin/newdeploy' by 66 commits. [18:23:00] I presume that's what's up with it [18:25:05] Stupid git [18:26:25] anomie: Thanks for the code review! 
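The recovery sequence anomie quotes from Ryan, for the case where changes were already pulled into the slot before `git deploy start` was run; a sketch only, with the slot path assumed from later in the log.

    cd /srv/deployment/mediawiki/slot1
    git deploy start            # open the transaction after the fact
    git deploy --force sync     # force the sync even though the start came late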
[18:26:42] kaldari- No problem, I saw it mentioned on wikitech-l and decided to check it out [18:26:53] The extension's been sitting there for almost a month with no one reviewing it [18:27:25] I hope I can find an SQL ninja to review the special page [18:28:32] mediawiki-config might be a bit of a mess when merged back into master [18:28:44] oh well [18:28:56] ok. verified that git-deploy is reporting successful for all pooled apaches [18:30:16] # FATAL: new file: LocalSettings.php [18:30:24] Even after running git add -f LocalSettings.php [18:30:39] where are you seeing this? [18:30:46] on tin? [18:30:48] running git deploy start in slot1 [18:30:49] yeah [18:30:59] maybe I forgot to commit it? [18:31:06] that doesn't make sense [18:31:16] slot1 is now wmf8 [18:31:17] not wmf6 [18:31:18] did you remove LocalSettings.php from the .gitignore? [18:31:41] nope [18:31:44] reedy@tin:/srv/deployment/mediawiki/slot1$ grep LocalSettings .gitignore [18:31:44] LocalSettings.php [18:32:13] wmf/1.21wmf7 also has untracked files [18:32:19] wtf is going on? [18:32:36] I have a good feeling people keep doing pulls before doing git deploy start [18:33:00] I haven't done anything in wmf7 (I don't think) [18:33:12] I did a git reset --hard [18:33:22] now LocalSettings is missing, of course [18:33:25] weird [18:33:38] did you make any changes to this slot? [18:33:46] in 1.21wmf8 [18:34:24] other than just making a new link? [18:34:40] I think I added LocalSettings after my initial sync of the directory (as it was missing) [18:34:54] git deploy sync? [18:35:18] Yeah, that all went fine on the first run [18:35:21] hm [18:35:28] it's odd it disappeared [18:35:34] probably from switching branch [18:35:42] actually, definitely from switching branch [18:35:53] the other committed file was in the old branch [18:36:01] ok. so: git deploy start [18:36:03] add the file [18:36:08] git add -f LocalSettings.php [18:36:11] git commit -a [18:36:17] git deploy sync [18:36:33] we'll need to add that as a step for new branches [18:36:41] Yeah [18:36:54] The checkout script needs mostly re-writing, so it might aswell be automated [18:37:05] annoying, but not a big deal if we document it [18:37:06] yeah [18:37:25] indeed [18:37:36] With the old method, I've done it, what 20+ times now? [18:37:37] * Reedy grins [18:37:57] oh. right [18:38:05] it needed to be done with that one too, right? [18:38:30] LocalSettings sat around as a live hack [18:38:33] effectively [18:38:35] * Ryan_Lane nods [18:38:40] yeah, it'll need to here, too [18:38:44] thankfully only that one file [18:38:46] but we did clean checkouts everytime, so it'd just get added at the end [18:39:01] Yay, looks like it's doing localisation cache too this time :) [18:39:06] cool [18:39:11] What's the deployment plan? [18:39:12] that's going to take a while [18:39:23] Are we doing everything to wmf7 git deploy first? [18:39:50] It looks like wmf7 may already be there... not sure if it's up to date? [18:39:53] we'll need to move/link [18:40:13] csteipp: I don't think anyone has deployed since Ryan did last night [18:40:16] csteipp: it's there, but we need to move common-local and link to /srv/deployment [18:40:27] Nope [18:40:32] Ah, yep. 
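The new-branch step Ryan dictates above, written out as one sequence: LocalSettings.php is listed in .gitignore, so after a slot is switched to a new branch it has to be force-added and committed again before syncing. The commands are the ones quoted in the log; the commit message is my own.

    cd /srv/deployment/mediawiki/slot1
    git deploy start
    git add -f LocalSettings.php     # -f overrides the .gitignore entry
    git commit -a -m "Re-add LocalSettings.php after branch switch"
    git deploy sync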
[18:40:38] I did a common related change [18:40:48] wheee [18:40:49] Deploying change to localisation for 1.21wmf8 (on test2wiki) [18:40:59] we should make sure that the site isn't having issues before we switch [18:41:27] 362 files changed, 58 insertions(+), 57 deletions(-) [18:41:27] rewrite cache/l10n_cache-ug-arab.cdb (60%) [18:41:27] rewrite cache/l10n_cache-ug.cdb (60%) [18:41:28] we'll probably need to graceful-all the apaches after the switch [18:41:29] That's not bad [18:41:54] that's going to be 750MB of data to about 338 systems [18:42:00] it's going to take 10-15 minutes [18:42:15] this part of the system still sucks [18:42:23] I worked on bittorrent last night [18:42:37] I think I'll be able to switch us to it by next week maybe [18:42:37] Logging? ;) [18:42:54] ah, yeah. I need to do logging today [18:43:34] let me see if I can quickly get it going running as myself on fenari until I have it packaged and puppetized [18:44:00] well, it's obvious that l10n is going out [18:44:13] tin is incredibly laggy [18:44:18] :D [18:44:27] I also love the giant spike in the processor right before it :) [18:44:28] Should be the only time today [18:44:43] btw, seems you've done a really good job on git deploy [18:44:44] I set the thread count to 12 [18:44:47] thanks [18:44:53] Bit confusing to start with, but that's expected [18:44:55] it can be improved, a lot [18:45:01] the detailed report sucks [18:45:14] <^demon> We really really need to write our own submodule support. [18:45:15] thankfully it's just a matter of display, all the data needed to make it better is there [18:45:22] ^demon: *yes* [18:45:24] <^demon> `git submodule foreach` makes me cry. [18:45:24] please [18:45:29] it's so fucking slow [18:46:01] <^demon> Also, lack of a way to do something like `git submodule foreach --submodules=A,B,C` annoys me. [18:46:18] <^demon> Sometimes I have a list of (some) submodules I want I work on. [18:46:25] I want to be able to say --fanout 10 [18:46:33] or whatever you'd call it [18:46:38] Ok [18:46:40] I want a large number of submodules updating at once [18:46:44] So I think that's everything staged as is [18:46:57] <^demon> Ryan_Lane: Could probably be fast in python, but I don't know enough python. [18:47:08] ^demon: could run a thread for each one [18:47:24] so, to switch things, I'm going to run a salt command [18:48:41] salt -E '' cmd.run 'mv /usr/local/apache/common-local /usr/local/apache/common-local.scap && ln -s /srv/deployment/mediawiki/common /usr/local/apache/common-local' [18:50:00] and if we need to switch it back: salt -E '' cmd.run 'rm /usr/local/apache/common-local && mv /usr/local/apache/common-local.scap /usr/local/apache/common-local' [18:50:27] that looks right to me [18:50:34] looks like it's finishing up fetch [18:50:42] right about 10 minutes [18:50:53] so, I guess it's still a little faster than scap [18:51:42] shit [18:51:51] a lot of repos are showing fetch errors [18:52:12] err [18:52:13] minions [18:52:52] <^demon> They fetch from the deployment host, right? [18:52:56] yeah [18:53:05] btw, what's the command to dump the stats from redis, if you're not doing a sync? [18:53:14] <^demon> (Just making sure. Fetching from manganese would make me cry) [18:53:23] I'm using: deploy-info --repo=slot1 --fetch --detailed [18:54:17] Reedy: I'd retry the fetch stage [18:54:27] Hm? [18:54:37] you're doing slot1, right? [18:56:03] yeah [18:56:37] I wonder what happened… a ton of the fetches failed [19:03:13] 56 minions pending (338 reporting) [19:03:32] really? for slot1? 
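`git submodule foreach` has no --fanout or --submodules option, but the effect Ryan and ^demon are after can be approximated with xargs: fetch the submodules ten at a time, then let the normal serial update do the now-cheap local checkouts. A rough, untested sketch, not part of the actual deploy tooling.

    # list submodule paths from .gitmodules and fetch them 10-wide
    git config --file .gitmodules --get-regexp '\.path$' | awk '{ print $2 }' \
      | xargs -P 10 -I{} sh -c 'cd "{}" && git fetch --quiet'
    git submodule update --init   # checkouts are fast once the objects are already local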
[19:04:13] for the fetch stage? [19:04:39] Repo: slot1; checking tag: slot1-20130116-185515 [19:04:39] 56 minions pending (338 reporting) [19:04:53] are you on the checkout stage? [19:05:40] It's done [19:05:50] o.O [19:06:12] I'm seriously confused as to how [19:06:24] <^demon> You're just that good? [19:06:28] and you're getting different numbers than me [19:06:35] I'm seeing a mostly failed deployment [19:06:45] # INFO : Step 'finish' finished. Started at 2013-01-16 19:04:57; took 0 seconds to complete [19:06:46] # YAY : 'finish' for 'slot1' completed successfully (now at slot1-20130116-185515) [19:06:49] Repo: l10n-slot1; checking tag: l10n-slot1-20130116-184054 [19:06:54] 255 minions pending (338 reporting) [19:07:00] Repo: slot1; checking tag: slot1-20130116-185515 [19:07:03] 44 minions pending (338 reporting) [19:07:26] detailed view shows an even worse state [19:08:02] do you know offhand what errors 20 and 50 are? [19:08:03] Reedy: did you move on from the fetch stage before they all returned? [19:08:21] 20 means this failed: cmd = '/usr/bin/git reset --hard tags/%s' % (tag) [19:08:31] 50 means this failed: cmd = '/usr/bin/git submodule update --init' [19:08:37] 50 is generally bad [19:08:46] 20 means that the check occured before the fetch finished [19:09:03] I'm going to re-attempt it [19:10:04] Reedy: did all of the minions say they were done with the fetch before you said "y" to continue? [19:12:46] It looks like common also has a bunch of minions that failed the checkout [19:13:30] wtf [19:15:07] hm. weird [19:15:12] is the reporting failing? [19:17:48] error: corrupt loose object '64932930b95e310e70cae17a2bb89569fdb13605' [19:17:48] fatal: loose object 64932930b95e310e70cae17a2bb89569fdb13605 (stored in .git/objects/64/932930b95e310e70cae17a2bb89569fdb13605) is corrupt [19:17:48] error: http://tin.eqiad.wmnet/mediawiki/l10n-slot1/.git did not send all necessary objects [19:17:49] fun [19:19:16] I have a feeling we're not going with this today [19:24:50] git is stupid. if it sees it has a corrupt object, why doesn't it re-pull it from the master? [19:26:51] So... can we re pull/clone the ones that are failed, and then redeploy? [19:27:28] it looks like the network dropped during the fetch and some objects were written into the repos in a corrupted way [19:27:44] <^demon> git gc, then git remote update? [19:27:47] git is apparently stupid and doesn't re-fetch corrupted objects [19:29:01] hm, let me try git remote update [19:29:57] ^demon: seems that doesn't work [19:30:39] <^demon> bah. [19:30:53] seems you need to manually delete corrupted files [19:30:58] which is just absurd [19:31:53] fsck won't even tell you all of the corrupted objects [19:31:58] it fails on the first one it finds [19:31:58] Did you guys try git fsck --full [19:32:04] preilly: yes [19:32:21] maybe I'll just switch it all to bittorrent and say fuck it [19:32:30] ;) [19:32:44] Have you tried hash-object [19:33:26] that requires a path [19:33:36] this just mentions a corrupted object [19:33:40] and immediately fails [19:33:41] and git ls-tree [19:34:07] <^demon> Generally, corrupted objects are hard to recover. Easiest solution is delete the object and pull again. 
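For reference, the two minion-side commands behind the status codes Ryan decodes above, with the tag name taken from the log; running them by hand on a broken minion reproduces the failure.

    cd /srv/deployment/mediawiki/slot1
    /usr/bin/git reset --hard tags/slot1-20130116-185515   # a non-zero exit here is reported as code 20
    /usr/bin/git submodule update --init                   # a non-zero exit here is reported as code 50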
[19:34:47] I guess I could have a for-loop that does fsck, finds the corrupted object, deletes it, then tries again until they are all gone [19:34:55] it's annoying that it doesn't keep going and list them all [19:35:13] or even better, just give a fucking option to re-fetch them [19:35:40] At what point do we consider git-deploy a failed experiment? [19:35:43] so, based on this, I think it's a good idea to push this off [19:36:07] Ryan_Lane: ^demon: mark: preilly: if git-deploy isn't in kick ass production worthy shape by the end of today, can we put it on hold until after the eqiad migration is complete and use scap/etc thru the migration next week? [19:36:09] well, I wouldn't consider it a failed experiment. it just needs to be able to handle situations like this [19:36:22] I'd say at this point we should hold off [19:37:23] I agree... So, Reedy, normal deploy? [19:37:24] are permissions correct on .git/objects? [19:37:34] I was just doing a sync-dir of php-1.21wmf8 [19:37:40] on the deployment host, or the minion? [19:37:48] And getting a spam of The authenticity of host 'mw32 (10.0.11.32)' can't be established. [19:37:50] from enfari [19:39:44] the permissions on the deployment destination are always correct [19:39:50] the same user always writes [19:40:03] * Reedy facepalms [19:40:03] So Ryan_Lane, I'm assuming there's no way to have git-deploy only work on one datacenter at at a time, right? [19:40:16] well you could unpack and repack [19:40:39] csteipp: yes, just need to change the regex [19:40:42] and git cat-file -t to see the type [19:41:08] preilly: he're a broken host mw1133.eqiad.wmnet [19:41:14] if you'd like to try some things out [19:41:19] git fetch fails [19:41:39] :( [19:41:42] seriously, what the hell happened to known_hosts on fenari [19:41:48] Ryan_Lane: what is the path? [19:41:55] /srv/deployment/mediawiki/l10n-slot1 [19:42:59] the easiest way to handle this is to have a "reclone" option, but that's a sledgehammer approach [19:43:33] * csteipp is liking the sledgehammer [19:43:50] sledgehammers are good [19:43:55] Who's the best person to review complicated SQL queries? preilly? [19:43:55] It would be good to have a way to basically reset, if something goes wrong. [19:44:10] kaldari: asher [19:44:21] it could clone into ., do the checkout, then move the old slot and move the . into [19:44:59] aude: thanks [19:45:29] Assuming we have the diskspace for it, that sounds reasonable. [19:45:40] Ryan_Lane: take a look at mw1133:/srv/deployment/mediawiki/l10n-slot1 now [19:46:50] Ryan_Lane: root@mw1133:/srv/deployment/mediawiki/l10n-slot1# git fsck --full [19:46:51] Checking object directories: 100% (256/256), done. [19:47:12] it didn't throw an error about corrupted objects? [19:47:15] preilly: what was the fix? [19:47:57] preilly: if that's the case, try: mw1057 [19:49:11] Reedy: did you figure out what the problem was with known_hosts, or is that still a problem? [19:49:56] Ryan_Lane: root@mw1057:/srv/deployment/mediawiki/l10n-slot1# git fsck --full [19:49:56] Checking object directories: 100% (256/256), done. [19:50:45] so, what did you do to fix it? [19:51:06] Ryan_Lane: it looks like it failed a network transfer [19:51:10] it did [19:51:16] Ryan_Lane: so I just recloned it [19:51:20] heh [19:51:31] I was trying to see how to fix it without re-clone :) [19:51:37] sledgehammer :) [19:51:38] I tried the unpack repack and gc route first [19:51:39] that was the sledgehammer approach I mentioned [19:52:30] sledgehammer approach may be easiest. 
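A bash sketch of the loop Ryan describes: since git fsck stops at the first corrupt object it finds, delete that object, re-run fsck until it comes back clean, then fetch again. Untested; the repo path is the one from the error messages, and whether the subsequent fetch refills every deleted object is an assumption.

    cd /srv/deployment/mediawiki/l10n-slot1
    while :; do
      obj=$(git fsck --full 2>&1 \
            | sed -n "s/.*loose object '\{0,1\}\([0-9a-f]\{40\}\).*/\1/p" | head -n 1)
      [ -z "$obj" ] && break                      # fsck is clean, stop
      rm -f ".git/objects/${obj:0:2}/${obj:2}"    # drop the corrupt loose object
    done
    git fetch                                     # re-download what was deleted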
[19:52:37] sounds like it [19:52:42] either way, we're aborting [19:52:47] Ryan_Lane: this was my first approach http://pastebin.mozilla.org/2064323 [19:53:22] Ryan_Lane: did you ever see anything like this problem in your testing in eqiad? [19:53:40] robla: I didn't even see this with testing to all of pmtpa and eqiad [19:54:18] better for it to happen before we switch to it [19:54:18] yeah [19:54:30] l10n needs to move to bittorrent anyway [19:54:43] so, I'll have some time to work out the rest of the issues [19:55:03] * Ryan_Lane moves test back to scap [19:56:06] !log moved test.wp.o back to scap [20:00:35] crappy [20:00:55] agreed [20:02:27] <^demon> Heh, http://tech.slashdot.org/comments.pl?sid=1061985&cid=26114699 [20:02:42] <^demon> "Use bittorrent to distribute git blobs." [20:02:45] hahaha [20:02:50] that's actually doable. [20:03:02] that's really not a bad idea, in fact [20:04:30] the problem I have with bittorrent for distributing everything is that it'll overwrite files while the deploy is happening [20:04:42] which is the reason I want to use fetch/checkout [20:05:08] I guess it could write into a cache location, then we could rsync the files across, but that eats a ton of disk [20:05:20] <^demon> Well, bittorrent would move the blobs around, replacing the fetch stage. [20:05:24] it also means rolling back is more difficult [20:05:28] ^demon: indeed [20:05:41] <^demon> Rolling back would just be checkout of older tag, since objects would already be in the local repos. [20:05:46] could I just sync the entire .git directory? [20:05:59] <^demon> That'd work too [20:06:13] rsync [20:06:33] we could also do rsync, yes [20:06:48] you can also use rsync to fix up your repo ;) [20:06:53] the benefit of bittorrent is that it would also distribute the bandwidth, which would make it faster [20:07:15] I got murder working last night [20:07:45] I need to push in a patched bittornado for it, but it works [20:08:01] hm. let me try the rsync, actually [20:08:07] stupid git-fetch sucks [20:08:26] does it detect it during the fetch or later? [20:08:51] if the network drops out during the fetch, it may have written corrupt files [20:09:02] when you re-try the fetch, it'll fail [20:09:11] ^demon: https://gerrit.wikimedia.org/r/#/c/43881/ [20:09:18] robla: No way of me really finding out myself. [20:09:18] -rw------- 1 root root 2365064 Jan 16 15:01 /etc/ssh/ssh_known_hosts [20:09:21] you'll know before the checkout stage [20:10:13] bittorrent for .git directory would only transfer the files that need to be updated too :) [20:10:24] rsync would need to check all the damn files every time [20:10:26] let's link .git to a NFS share [20:10:30] * mark ducks ;-) [20:10:30] * Ryan_Lane stabs mark [20:10:46] wouldn't even work [20:10:46] heh [20:10:55] well, I have another couple week to make this work right :) [20:11:32] <^demon> AaronSchulz: Merged. I had looked at it earlier, but got distracted. [20:11:39] heh [20:12:07] yeah [20:12:09] pulling all of .git would also update all of the submoudules too [20:12:20] siebrand: don't forget to get some sleep before the meeting :) [20:12:51] could someone help Reedy with the known_hosts issue? [20:13:02] Reedy: how are things going? [20:13:03] kaldari: It's 21:12, so I've got at least 3 hours before I go to bed. Have to get up at 06:30 for the 07:00 meeting.... 
[20:13:09] i'm just pasting yes and mashing enter [20:13:13] AaronSchulz: spammingly [20:13:52] Not even started to build the localisation cache this way [20:14:45] siebrand: I won't mind if you're still in pajamas [20:16:15] kaldari: I sleep without clothes on… [20:17:17] siebrand: well, we might have to pixelate you then [20:17:32] :D [20:17:46] Public service announcement imminent on engineering@... [20:17:57] Ooh, exciting [20:18:22] I thought we had a mandatory-pants rule for remote wikimedia workers? [20:19:18] i have seen no shirt before [20:19:44] siebrand: Great advice. [20:20:21] marktraceur: yw:) [20:20:53] guillom: Ditto [20:20:58] I have to work on material in advance now... [20:21:11] Reedy: I had an issue with known_hosts not being readable by non root [20:21:13] ltoo many unnecessary engineering@ spammers. [20:21:17] Reedy: I thought I had that fixed in puppet [20:21:22] hashar: Yeah, it's still root only [20:21:28] on fenari at least [20:21:45] hashar: I've got lots of lines in my .ssh/known_hosts file [20:21:55] Reedy: get some root to manually run puppet on fenari. It should fix it [20:22:11] ssh::hostkeys::collect puppet class makes '/etc/ssh/ssh_known_hosts' 0644 [20:23:58] Reedy: need a root to fix it and you are done :-] save yourself the trouble of pressing Y ;-] [20:24:05] Too late now [20:24:30] :( [20:38:39] Reedy: test2wiki showing error page right now [20:38:45] Yup [20:38:49] Waiting for scap to run [20:38:54] localisation cache [20:39:00] k, thx [20:40:00] IE9 test happened to be running, EVERYTHING failed :) [20:41:11] ohh [20:41:16] ExtensionMessages [20:41:20] let me sync that itself.. [20:42:54] One error to another ;) [20:48:39] heh. test2wiki showing stack trace now [20:49:21] chrismcmahon: I have just noticed that test2 is down [20:50:05] zeljkof: yes, deployment issue http://www.mediawiki.org/wiki/MediaWiki_1.21/Roadmap#Schedule_for_the_deployments [20:51:13] chrismcmahon: thanks, I see it now [20:56:38] Reedy: are we still mid-scap, or is there something else going on? [20:56:56] Still scapping [20:59:13] * robla starts humming "Girl from Ipanema" [21:00:14] i'm not quite sure what it's doing.. [21:00:14] .oO(ahhhh) [21:00:15] no localisation cache? [21:00:25] http://test2.wikipedia.org/wiki/Main_Page [21:01:09] It's still pushing [21:01:25] * aude gets tea [21:03:04] * Reedy looks at ganglia [21:05:04] That's weird [21:05:11] The NFS graph is very spiky [21:23:04] https://gerrit.wikimedia.org/r/#/c/44339/1/repo/Wikibase.i18n.php,unified < Gerrit says no [21:26:08] ourlast scap took about 2 hours [21:26:19] seriously? [21:26:33] that doesn't sound far off this tbh [21:26:39] yeah, but I think there were some weird problems [21:27:16] lots of unknown hosts and such [21:30:39] kaldari: orly? [21:31:02] robla: bsitu did that one though, not me [21:31:43] I basically kept typing 'yes' and hitting enter [21:31:56] ah, ok, that was the ssh key problem [21:32:04] that *should* be fixed [21:32:30] (known_hosts was only root readable) [21:33:23] haha, yeah [21:35:09] <^demon> known_hosts can't catch a break. [21:38:04] I'm on the Jenkins whitelist for MW core [21:39:04] This means that Jenkins runs some tests twice [21:39:24] it complained about https://gerrit.wikimedia.org/r/#/c/36330/ being unmergeable twice [21:48:55] it also does the merge and lint jobs twice [21:57:00] Tim-away: ping [22:16:12] hello [22:19:18] xyzram: ^ [22:19:25] hi Tim! [22:19:59] Hi Tim [22:39:24] hi yurik [22:39:32] woosters: !!! 
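A quick check for the recurring known_hosts problem: per hashar, the ssh::hostkeys::collect puppet class should leave /etc/ssh/ssh_known_hosts world-readable (0644), so on fenari the fix is a puppet run, or a chmod, by someone with root. A sketch, assuming the standard puppet agent invocation.

    ls -l /etc/ssh/ssh_known_hosts    # should show -rw-r--r--, not -rw-------
    sudo puppet agent --test          # re-apply the class that manages the file
    # stopgap if puppet cannot be run right away:
    sudo chmod 0644 /etc/ssh/ssh_known_hosts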
[23:03:47] better read the IRC logs about git-deploy I guess [23:04:23] robla's mail just says there were problems [23:06:35] TimStarling, too much entertainment per line:P [23:19:27] So @MediaWikiMeet as listed in /topic is gone? [23:20:20] merged with @mediawiki ? [23:32:02] TimStarling: the minions had a network disconnect from the deployment host [23:32:05] TimStarling: and git fetch sucks [23:32:34] if it has a disconnect it can write corrupted objects, which will put it into a state where it'll no longer fetch [23:32:49] apparently linus has never heard of download resumption [23:33:08] the only way to repair it is to manually delete the bad objects and re-fetch [23:33:17] so, we're going to bypass git's fetch stage [23:33:35] we're going to bittorrent the .git directory rather than using git's crappy fetch [23:34:06] then do some rename tricks? [23:34:08] so the HTTP connections were reset? [23:34:54] TimStarling: yes [23:34:55] AaronSchulz: no [23:35:22] AaronSchulz: fetch .git over the top of the old one. git objects are immutable [23:35:35] ah, right that would not touch the working dir [23:35:36] but only .git [23:35:38] indeed [23:35:50] and bittorrent will resume files, unlike git fetch [23:36:14] it'll also distribute the bandwidth across nodes, which should make it faster [23:36:28] maybe the network was saturated and the apache's 5-minute timeout was reached? [23:36:37] that's also likely, but…. [23:36:44] I tested this with all of pmtpa and eqiad yesterday [23:36:55] and saturated the line for nearly 30 minutes [23:37:14] (when initializing all of the new nodes) [23:37:23] in theory, network saturation shouldn't cause a server timeout [23:37:27] right [23:37:32] all the server needs is to get ACKs within 5 minutes [23:37:39] which gives time for a lot of retries [23:37:43] that was the first time I saw an error that bad [23:46:53] the apache error log just shows 404s [23:48:50] 340,000 "file not found" errors, 18 access denied errors [23:49:03] the access denied errors appear to have been fixed [23:49:43] ah, missed this one: [23:49:46] [Wed Jan 16 05:03:33 2013] [error] server reached MaxClients setting, consider raising the MaxClients setting [23:50:21] but the deployment was at 18:50, so that's not it [23:50:24] at 18:50 error.log was silent [23:50:44] I find it hard to believe that apache would reset connections without logging it [23:52:52] TimStarling: it's also possible that there was a network issue [23:53:12] either way, the fetch disconnected abruptly somehow for over half of the nodes [23:53:20] and git apparently can't handle that [23:53:39] I'm glad it happened when we went to deploy because I tested deployment about 20 times before it and didn't have the problem [23:53:56] it would have sucked to deploy then a week later run into this issue [23:57:03] the 404s seem to have happened on every sync [23:58:02] I'm guessing each sync causes a lot of connections from each client [23:58:20] which would leave room for connection timeout errors which wouldn't be logged by apache [23:58:34] but once the connection is established, you would expect any network failure to be logged by apache [23:59:33] yeah. the 404s were normal [23:59:54] the initialization I did should have caused way more connections than thix [23:59:56] *this
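A rough way to reproduce the error-log tallies Tim quotes above; the log path and the exact Apache message strings are assumptions rather than anything confirmed in the log.

    grep -c 'File does not exist'                    /var/log/apache2/error.log   # the "file not found" count
    grep -c 'client denied by server configuration'  /var/log/apache2/error.log   # the access-denied count
    grep    'MaxClients'                             /var/log/apache2/error.log   # the MaxClients warning he spotted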