[00:00:37] I'm pretty sure it has --no-perms [00:00:49] it removes the --perms implied by -a [00:00:51] yeah, i think that's the problem ? [00:01:09] i'm guessing some root dir didn't have perms and it inherited the lack of permissions ? [00:01:13] does that sound plausible ? [00:01:26] drwxr-xr-x 14 mwdeploy mwdeploy 4096 2012-02-13 23:07 . [00:01:38] no, rsync runs as mwdeploy [00:02:11] the source directory on NFS has all sorts of crazy permissions, whatever the wikidev group member decided to set them to [00:02:47] running with --no-perms removes the group writable bits and makes the permissions correct for deployment regardless of the permissions on the source [00:03:16] ok [00:03:37] so how do you think that everything could have gotten switched to 700 ? [00:03:46] umask must have been wrong [00:05:08] how do we set it so that we can fix it plus in the future this doesn't happen again? [00:07:11] reedy: how did you run the script? [00:08:49] which script when? [00:08:57] sync-common-all? [00:09:24] !log reedy ran sync-common-all [00:09:27] * AaronSchulz wonders wtf refreshWikiversionsCDB is in files/misc [00:09:59] t'was probably meant to be a symlink [00:10:04] sync-comon-all foobar [00:10:16] I'm not sure there is any other way to run it..? [00:13:49] where do you see this 700 mode? [00:13:52] what server? [00:14:16] fenari /usr/local/apache/common/php-1.19 [00:14:34] includes, maintenance, mw-config, resources, skins... [00:14:54] Exactly the same on srv190 [00:15:47] but not on srv300 [00:15:59] it looks like rsync died halfway through [00:16:07] Multiple times? 
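[Editor's note] The "umask must have been wrong" theory is easy to check: rsync consults the process umask only when it creates a directory, and a umask of 077 at creation time yields exactly the drwx------ (700) directories described above. A minimal sketch, with a hypothetical directory name:

```python
import os
import stat
import tempfile

def mkdir_under_umask(umask):
    """Create a directory the way rsync would (requested mode 0777,
    filtered by the process umask) and return the resulting permission bits."""
    old = os.umask(umask)
    try:
        root = tempfile.mkdtemp()
        path = os.path.join(root, "php-1.19")  # hypothetical name, for illustration
        os.mkdir(path, 0o777)
        return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.umask(old)

print(oct(mkdir_under_umask(0o022)))  # 0o755 -- the expected drwxr-xr-x
print(oct(mkdir_under_umask(0o077)))  # 0o700 -- the broken drwx------
```

Existing directories keep whatever mode they were created with, which matches the later observation that rsync won't change permissions once a directory exists.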
[00:16:16] I've run scap a few times over the weekend as I've been changing things [00:16:37] I suppose if ti stopped for some error, of course it'd do the same later [00:16:48] once the directory is created, rsync won't change its permissions [00:16:55] Ah [00:17:05] it's only for creations that it looks at umask [00:17:12] Mixed permissions would explain why test2wiki loads on some requests [00:17:52] crap... [00:18:02] /usr/local/bin/sync-common-all has no fanout option [00:18:14] /usr/local/bin/scap has it [00:18:33] We really need to get all these scripts into line sometime [00:19:02] I already merged sync-common-all and scap [00:19:14] well, on the remote side [00:19:38] I figured it was pointless having both of them so I just got rid of sync-common, making it an alias to scap-1 [00:20:29] yeah [00:21:05] Why is sync-common in /usr/bin and sync-common-all in /usr/local/bin [00:21:08] I shouldn't probably ask that [00:21:29] because sync-common comes from a package and sync-common-all comes from puppet [00:21:52] the package is wikimedia-task-appserver [00:23:31] New patchset: Tim Starling; "Limit fanout like in scap, to avoid overloading the NFS server." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2570 [00:24:07] New patchset: Lcarr; "Generating initcwnd.erb with both default gateway and default interface fact" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2571 [00:24:20] hello [00:24:29] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2570 [00:24:29] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2570 [00:24:29] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2570 [00:24:29] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2570 [00:24:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2571 [00:24:49] could u help me, where disappeared export to pdf? [00:25:00] Disabled due to performance issues [00:25:15] we pushed it out of a plane [00:25:42] not connected with rights issues? [00:25:50] No [00:26:08] ok then...=) [00:26:22] will it be back once?.. mb?..) [00:26:35] neworldemancer: we're working out a plan for that now [00:26:56] * Reedy wonders if software has any rights after it's been pushed out of a plane [00:27:15] ok, 10x 4 info! [00:27:37] Reedy: it's the ultimate in free software [00:28:52] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2571 [00:28:53] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2571 [00:29:03] anyway if I were you I'd just delete those php-1.19 directories and run sync-common-all again [00:29:10] with -F30 it should hopefully work this time [00:29:21] makes sense [00:29:38] the server gets overloaded if it has too many concurrent clients [00:29:39] Aaron and I won't have the permissions to do that will we? [00:29:53] yeah, you can run any command as mwdeploy [00:30:02] oh [00:30:05] fair enough [00:30:12] sudo -u mwdeploy rm -rf ... 
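[Editor's note] The `-F30` fanout flag caps how many concurrent rsync clients hit the NFS server at once. A rough stand-in for what `dsh -F5 -g mediawiki-installation <cmd>` arranges, sketched with made-up host names and a dummy task:

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time

def run_with_fanout(hosts, task, fanout=5):
    """Run task(host) on every host with at most `fanout` running at once,
    roughly the concurrency cap dsh's -F flag provides."""
    peak = active = 0
    lock = threading.Lock()

    def tracked(host):
        nonlocal peak, active
        with lock:
            active += 1
            peak = max(peak, active)
        try:
            return task(host)
        finally:
            with lock:
                active -= 1

    with ThreadPoolExecutor(max_workers=fanout) as pool:
        results = list(pool.map(tracked, hosts))
    return results, peak

hosts = [f"srv{n}" for n in range(20)]
results, peak = run_with_fanout(hosts, lambda h: (time.sleep(0.02), h)[1])
print(peak <= 5)  # True: never more than 5 concurrent "rsyncs"
```

Without the cap, all twenty workers would start at once, which is the "too many concurrent clients" overload mentioned above.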
[00:30:49] * AaronSchulz waits for Reedy to delete all of fenari [00:30:59] * Reedy glares at AaronSchul [00:31:03] z [00:31:20] or nfs at least :) [00:32:10] * neworldemancer by [00:32:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2556 [00:33:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2556 [00:33:50] so: ddsh -cM -g mediawiki-installation -o -oSetupTimeout=10 'rm -rf /usr/local/apache/common/php-1.19' [00:34:13] with the sudo first [00:34:18] which I seem to have lost somewhere [00:35:26] well that's a no [00:35:54] oh, sudo locally [00:36:00] as in the target machine [00:37:32] !log killed /usr/local/apache/common/php-1.19 from apaches [00:37:34] Logged the message, Master [00:41:22] * AaronSchulz wonders why getUserBlockErrors takes a user param but has a $result cache [00:41:31] in GlobalBlocking [00:42:19] Blame werdna [00:42:53] $ip_pattern = substr( $hex_ip, 0, 4 ) . '%'; [00:42:57] makes less sense for v6 [00:44:12] New patchset: Lcarr; "temp commenting out config file until new facter script propogates" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2572 [00:44:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2572 [00:45:06] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2572 [00:45:07] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2572 [00:51:41] NFS seems unhappy [00:54:00] yes, i reset nfs1's drac controller to try and get console acces [00:55:10] yeah... ganglia doesn't have happy graphs for nfs1 [00:57:57] I saw many only half or 15% displaying thumbnails at commons today and yesterday. What is that? When will it be fixed? [00:58:13] probably due to swift deploy?! 
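[Editor's note] On the GlobalBlocking snippet above: addresses are stored as fixed-width hex, so a 4-hex-character LIKE prefix always pins down 16 bits. For IPv4 (8 hex chars) that is a /16; for IPv6 (32 hex chars) the same prefix leaves 112 bits free, which is why it "makes less sense for v6". The arithmetic:

```python
# Each hex character carries 4 bits, so a 4-character prefix fixes 16 bits.
PREFIX_HEX_CHARS = 4
prefix_bits = PREFIX_HEX_CHARS * 4

ipv4_matched = 2 ** (32 - prefix_bits)    # addresses an IPv4 /16 covers
ipv6_matched = 2 ** (128 - prefix_bits)   # addresses the same prefix covers in v6

print(ipv4_matched)  # 65536
print(ipv6_matched == 2 ** 112)  # True -- absurdly coarse for range scans
```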
[00:58:32] purging the file page fixed it in all cases [00:58:57] but it is confusing users https://commons.wikimedia.org/wiki/Commons:Help_desk#Brussels_Airlines_destinations.png_laad_niet_volledig and is simply creating unnecessary "purge" work [01:03:39] Saibo: the pediapress people decided to change their default image size from 1200 to 1199 [01:03:49] because they didn't like caching or something [01:03:55] Seriously? :/ [01:04:12] so the image scalers and thumb backend server were overloaded until we disabled the collection extension [01:04:15] it should be OK now [01:04:43] Reedy: did you break nfs1? [01:04:54] I think so [01:04:56] it was segfaulting like mad [01:05:00] what did you do Reedy! [01:05:10] Just trying to use it [01:05:14] You know, to share files [01:05:17] Like it's designed for [01:05:27] that is not proper usage of a nfs server! [01:05:33] store files in labs ;) [01:05:48] Gmail might be a better option [01:05:52] that's got loads of space [01:06:39] Is nfs1 back up? [01:06:55] "Stale NFS file handle" on fenari atm [01:07:30] TimStarling: and the pediapress servers did query millions of thumbs then? Not only on pdf creation? [01:08:00] just on PDF creation, it was enough though [01:08:14] and: why are the thumbs only partly available? note: it apparently differs depending on which cache you hit.. some people also get the full thumbs of the same file [01:08:35] Reedy: we may have to reboot it [01:08:41] and what is with the files already broken? do they all need a manual purge? [01:08:47] or does waiting also solve [01:08:48] ? [01:09:07] that's a separate issue [01:09:13] I haven't looked into it yet [01:09:20] hm. okay [01:09:35] btw: I can't imagine that so many pdfs are created... [01:09:38] it's probably the issue that caused pediapress to change their default thumbnail size [01:10:16] any objections to rebooting fenari? 
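[Editor's note] The pediapress change (1200 → 1199) hurt because thumbnails are cached per (file, width): a width nobody else requests misses the cache for every file and sends each render to the image scalers. A toy model of the effect (names and cache shape are illustrative, not MediaWiki's actual code):

```python
renders = 0
cache = {}

def get_thumb(name, width):
    """Return a (fake) thumbnail, paying the render cost only on a cache miss."""
    global renders
    key = (name, width)
    if key not in cache:
        renders += 1                      # the expensive image-scaler work
        cache[key] = f"{width}px-{name}"
    return cache[key]

files = [f"File_{n}.png" for n in range(1000)]

for f in files:
    get_thumb(f, 1200)   # readers already requested these: warms the cache
after_warm = renders

for f in files:
    get_thumb(f, 1199)   # one pixel off: every single request is a fresh render
print(renders - after_warm)  # 1000 -- a full re-render of the batch
```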
[01:10:19] okay, thanks for the preliminary info [01:10:36] Doesn't look to be doing very much [01:10:40] (fenari) [01:10:48] nope [01:10:52] no objection here [01:11:20] basically we would have to unmount /home and then remount it [01:11:33] but first I would have to kill every process with an open filehandle under /home [01:11:43] it's easier to just reboot, it's a lot of mucking around and it doesn't always work [01:11:53] TimStarling: if we reboot fenari, can you do the security uipgrades first ? [01:11:58] too late [01:12:00] apt-get upgrade ? [01:12:00] oh [01:12:03] :p [01:12:16] I can do it when it comes back up [01:12:18] just do it again afterwards [01:12:32] !log rebooted fenari to fix stale NFS file handle [01:12:34] Logged the message, Master [01:12:42] grumblegrumblegrumble. [01:12:46] hate on rebooting fenari [01:12:51] I had shit running there. [01:12:51] Tough! [01:13:19] you don't actually have to kill all those processes. [01:13:31] a forced remount can work. [01:14:05] hrmph. [01:14:51] !log doing apt-get upgrade on fenari [01:14:52] Logged the message, Master [01:14:57] don't restart your stuff just yet [01:15:24] The following packages have been kept back: [01:15:25] linux-image-server linux-server [01:15:41] LeslieCarr: are those the security updates you wanted? [01:15:56] yeah [01:16:33] i just want to take every rebooting opportunity to do all the updates possible [01:17:32] I'm not sure how to make the kernel packages update [01:17:55] apt-get install works [01:19:03] and it even updated menu.lst without a conflict, imagine that [01:19:13] ok, rebooting again [01:19:27] !log rebooting fenari for kernel upgrades [01:19:29] Logged the message, Master [01:25:23] Reedy: how far did you get with the delete/resync? [01:25:43] AaronSchulz: not that far based on all the errors [01:27:05] did the nfs1 crash happen while you were running sync-common-all? 
[01:27:08] Yup [01:28:15] I was just going to ask whether it was worth deleting the /usr/local/apache/common/php-1.19 folders again.. [01:29:19] probably best to delete them, to be on the safe side [01:29:34] That's what I was thinking, as no idea where it got to [01:29:40] then run the dsh command from sync-common-all, but reduce the fanout even further [01:29:40] Do I need to worry about killing nfs1 again? [01:29:46] yes [01:29:55] use dsh -F5 or something [01:33:36] I'm getting password prompts for some servers, but not for others [01:35:25] srv193, hume, searchidx2 [01:35:39] they probably need /home remounted [01:35:53] ah, yeah, that'd make sense [01:36:32] they'll likely need to be rebooted [01:37:18] hm. I guess a force remount will work [01:37:24] well, if we do any more rebooting, the normal request of do an upgrade first ;) [01:38:11] root@hume:/# fuser -m /home [01:38:11] Cannot stat /home: Stale NFS file handle [01:38:20] that's not meant to happen is it? [01:38:28] it happens, unfortunately [01:38:42] force umount doesn't work [01:38:46] I guess I can lazy unmount [01:39:03] ok. remounted [01:39:07] I hate doing lazy umounts [01:39:09] lazy unmount just means whatever process it is that still has that filehandle open will be subtly broken until it restarts [01:39:17] yes [01:39:39] I wonder which process uses it [01:39:58] doing a lazy umount lets you remount, then restart affected processes though [01:40:38] though, honestly, I have no clue what was holding it open. [01:40:40] * Ryan_Lane sighs [01:41:11] LeslieCarr: you've been doing apt-get upgrades on the hosts? [01:41:35] I'm going to reboot hume. so, I'll upgrade while I'm at it [01:41:54] on the hosts that are going to be rebooted - i haven't on those specific hosts [01:42:22] I'll do searchidx2 [01:42:23] eh? [01:42:25] Reedy: how are you holding up? 
[01:42:43] robla: waiting for computers to stop sucking [01:42:49] I think it's going to be a while [01:42:55] yeah, I think so [01:43:03] oh...wait [01:43:09] nope, nm...still suck [01:43:12] ugh. how do you make apt not skip the kernel again? [01:43:20] ah [01:43:22] dist-upgrade [01:43:31] I just used apt-get install [01:44:00] rather than installing all upgrades? that likely would have been a better approach [01:44:14] didn't fred have some method of only installing the security updates? [01:44:18] !log rebooting hume [01:44:20] Logged the message, Master [01:44:27] I ran apt-get upgrade first, then I ran apt-get install on the packages that it said were held back [01:44:31] ahh. ok [01:44:36] though I'd really love to upgrade all our servers to 2.6.33 :) but i figure we can wait until pangolin [01:44:38] dist-upgrade will do the ones held back [01:44:46] 2.6.33 allows you to change initrwnd :) [01:44:53] heh [01:45:02] yeah. likely good to wait until precise [01:45:05] it'll be out soon [01:45:07] !log on searchidx2: doing apt-get upgrade and rebooting [01:45:09] Logged the message, Master [01:45:16] will santa bring it for us ? [01:45:38] heh. well, if it's anything like the lucid upgrade, yes, but it'll take santa a couple years to get there [01:46:12] damn reindeers are so slow! [01:46:15] robla: when these 3 hosts have been rebooted, I'll try again [01:46:52] LeslieCarr: did you not see Arthur Christmas? Santa has a massive space ship thing now [01:47:05] someone already did srv193? [01:47:26] i just did srv193 [01:47:28] not rebooted tho [01:47:37] want me to reboot it? [01:47:41] sure! 
[01:47:42] reboot party [01:47:47] !log rebooting srv193 [01:47:48] Logged the message, Master [01:50:48] Thanks [01:52:20] !log running ddsh -F30 -cM -g mediawiki-installation /usr/bin/sync-common [01:52:22] Logged the message, Master [01:53:14] !log when rebooting hume I also applied security updates [01:53:16] Logged the message, Master [01:53:24] Reedy: what happened to -F5? [01:53:58] I pasted the command from sync-common-all and altered it on the command line [01:54:03] then just paste it again to log it [01:54:12] !log Make that ddsh -F5 [01:54:14] Logged the message, Master [01:58:09] searchidx2: PHP Warning: PHP Startup: Unable to load dynamic library '/usr/lib/php5/20090626/php_wikidiff2.so' - /usr/lib/php5/20090626/php_wikidiff2.so: cannot open shared object file: No such file or directory in Unknown on line 0 [01:58:16] I guess we don't care about that too much? [01:58:22] being searchidx2 [01:59:33] it's not an urgent problem [02:00:33] Reedy: did it finish? [02:00:40] it's still going [02:00:48] more happily [02:03:18] srv199: rsync: mkstemp "/usr/local/apache/common-local/php-1.18/cache/l10n/.l10nupdate-zh-hk.cache.mGIG9H" failed: Permission denied (13) [02:03:25] quite a few of errors like that on srv199 [02:04:30] and numerous other servers too [02:04:47] I thought Roan fixed all of those, seemingly not [02:10:58] ! log finished running ddsh -F5 -cM -g mediawiki-installation /usr/bin/sync-common [02:11:02] robla: ^ [02:11:07] Seems to work fine on reload [02:11:12] * robla looks [02:11:34] Might be some localisation issues [02:11:51] w00t [02:12:01] all english [02:12:15] or not [02:12:30] anyway, no localisation cache files [02:12:56] TimStarling: any suggestion on the best fix for "Notice: Undefined variable: wgUseNormalUser in /home/wikipedia/common/wmf-config/AdminSettings.php on line 7" ? 
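[Editor's note] The mkstemp errors above are about the directory, not the file: rsync stages each transfer as a dot-prefixed temp file in the destination directory and renames it into place, so the directory must be writable by the rsync user (mwdeploy) even when the existing cache file is owned by someone else. A small illustration with a stand-in path:

```python
import os
import stat
import tempfile

root = tempfile.mkdtemp()
l10n_dir = os.path.join(root, "l10n")      # stands in for .../cache/l10n
os.mkdir(l10n_dir, 0o500)                  # owner may read/traverse, not write

mode = stat.S_IMODE(os.stat(l10n_dir).st_mode)
print(oct(mode))  # 0o500

# mkstemp in a non-writable directory fails with EACCES, which surfaces as
# rsync's `mkstemp ... failed: Permission denied (13)`.
# (root bypasses the permission check, so try this as an ordinary user.)
if os.geteuid() != 0:
    try:
        tempfile.mkstemp(dir=l10n_dir, prefix=".l10nupdate-")
    except PermissionError as e:
        print(e.errno)  # 13
```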
[02:13:38] Seems to have gone from Maintenance.php since 1.18 [02:14:08] yay, hook errors [02:15:10] before we dive too deeply into those issues: [02:15:20] TimStarling: are you planning to reenable Collections? [02:15:32] Seems every N request you'll get an error 500 [02:16:22] hrm...rsync problem? [02:16:32] brb [02:16:32] Would be suprising [02:19:12] !log reedy synchronized wmf-config/ExtensionMessages-1.19.php 'Remove variablepage' [02:19:14] Logged the message, Master [02:23:05] from #wikimedia-operations : (06:10:11 PM) Philippe: Hi - CT asked me to report here… I have two independent reports of 502 errors when saving edits on wikis. I'm not sure if it's important, or normal :) [02:25:33] !log LocalisationUpdate completed (1.18) at Tue Feb 14 02:25:32 UTC 2012 [02:25:33] !log LocalisationUpdate failed (1.19) at Tue Feb 14 02:25:33 UTC 2012 [02:25:34] Logged the message, Master [02:25:36] Logged the message, Master [02:25:58] !log reedy synchronized php-1.19/extensions/FundraiserLandingPage/ [02:26:00] Logged the message, Master [02:26:49] !log reedy synchronized php-1.19/extensions/VisualEditor [02:26:50] Logged the message, Master [02:30:29] Looks to be the fatals tidied up [02:32:14] Feb 14 02:31:57 10.0.11.17 apache2[8854]: PHP Fatal error: Allowed memory size of 125829120 bytes exhausted (tried to allocate 523800 bytes) in /usr/local/apache/common-local/php-1.19/extensions/LiquidThreads/LiquidThreads.php on line 14 [02:35:27] !log reedy synchronizing Wikimedia installation... : Rebuild messages [02:35:29] Logged the message, Master [02:36:55] TimStarling: are you planning to reenable Collections? [02:37:02] yes, I'll do it now [02:37:23] ah, there you are! :) thanks! 
[02:37:32] was just having lunch [02:37:34] * robla was just mulling making the commute home [02:38:11] !log tstarling synchronized wmf-config/InitialiseSettings.php 're-enabling the collection extension' [02:38:13] Logged the message, Master [02:38:40] !log reedy synchronizing Wikimedia installation... : Rebuild messages [02:38:42] Logged the message, Master [02:38:51] #wikimedia-operations [02:39:08] Yum [02:39:11] Errors errors errors [02:39:36] hi Risker... are you one of Philippe's sources for 502 errors? :) [02:39:43] robla, yes I am [02:39:47] TimStarling: any suggestion on the best fix for "Notice: Undefined variable: wgUseNormalUser in /home/wikipedia/common/wmf-config/AdminSettings.php on line 7" ? [02:39:50] on the arbwiki [02:39:56] does anything use that? [02:40:05] just adminsettings seemingly [02:40:08] * TimStarling greps [02:40:22] Notice: Undefined variable: wgDBuser in /home/wikipedia/common/php-1.19/extensions/ContributionReporting/ContributionReporting.php on line 28 [02:40:22] PHP Notice: Undefined variable: wgDBpassword in /home/wikipedia/common/php-1.19/extensions/ContributionReporting/ContributionReporting.php on line 29 [02:40:25] robla, it occurred once and has not happened again [02:40:33] I'm presuming that that line is causing those among others [02:41:13] Risker: ok...that's encouraging. we've been doing a lot of rebooting and other things that would kick up dust, so a couple of 502s aren't too alarming if they have stopped happening [02:41:44] !log reedy synchronized php-1.19/extensions/Contest/Contest.php 'Comment out stupid die for the moment' [02:41:46] Logged the message, Master [02:43:02] it used to be used by runJobs.php [02:43:18] Which is now not the case? 
[02:43:19] ok...this car alarm right outside my window is telling me I should make the commute home [02:43:22] apparently that was broken in 1.16 [02:43:35] with the maintenance rewrite [02:46:06] $wgUseRootUser also used to work [02:46:37] that was broken in 1.17 [02:46:57] anyway I'll just take it out [02:47:10] Thanks [02:48:00] !log tstarling synchronized wmf-config/AdminSettings.php 'remove $wgUseRootUser and $wgUseNormalUser, broken since 1.17 and 1.16 respectively' [02:48:02] Logged the message, Master [02:48:29] !log reedy synchronizing Wikimedia installation... : Rebuild messages [02:48:31] Logged the message, Master [02:48:52] http://p.defau.lt/?LjpPK3308Ni2BhSWVc61wQ [02:48:53] Lovely [02:50:57] strange [02:51:42] sync done. [02:52:57] Who is supposed to own the l10n cache files? [02:53:01] maybe a missing DefaultSettings.php [02:53:06] mwdeploy [02:53:40] some servers it is l10nupdate:l10nupdate [02:53:55] that is wrong [02:54:08] Sounds like the source of the srv224: rsync: mkstemp "/usr/local/apache/common-local/php-1.18/cache/l10n/.l10nupdate-or.cache.3yJziM" failed: Permission denied (13) etc [02:54:12] where is the cron job? 
I thought it was on hume [02:54:24] I think it might have been puppetised [02:55:56] the scripts are in puppet, I don't see the cron job [02:56:40] ah, it's on fenari [02:57:17] no, commented out [02:57:20] the search continues [02:57:27] http://wikitech.wikimedia.org/view/LocalisationUpdate says it's on fenari [02:57:36] yeah, in l10nupdate's user crontab [02:57:42] puppet likes user crontabs for some reason [03:01:15] I fixed this once already, but it looks like it has been very carefully unfixed [03:02:46] * Aaron|home adds that to quips [03:07:17] I fixed it on July 12, the day after I introduced the mwdeploy user, it says so in the server admin log [03:07:41] but it's broken in the first version Roan checked in to git, in October [03:11:47] alright....now I really mean it about going out the door [03:11:59] I'll be back online later. [03:13:31] New patchset: Tim Starling; "Fixed sync-l10nupdate again." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2573 [03:13:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2573 [03:15:05] Aaron|home: LU doesn't like FR again [03:15:17] Cannot get the contents of http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/FlaggedRevs/presentation/language/ConfiguredPages.i18n.php (curl) [03:15:22] among many others [03:16:02] I wonder.. 
[03:16:41] Yup [03:17:44] !log reedy synchronized wmf-config/ExtensionMessages-1.19.php 'Fix fr message file locations' [03:17:46] Logged the message, Master [03:19:17] link frontend [03:19:30] I just updated the paths [03:19:33] they're the same as in trunk [03:19:36] meh [03:19:54] We can tidy the mess up soon ;) [03:20:08] LU is sloooow [03:20:43] SVN + NFS...like a moped hauling an RV [03:22:24] I should probably go to bed soon [03:22:56] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2573 [03:22:57] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2573 [03:22:59] Sleep? You don't need that >.> [03:27:45] are you running LU now? [03:28:03] No [03:28:11] I have to run it to test my change [03:28:22] Feel free [03:28:29] I see a fair amount of cdb reader errors in exception.log [03:28:36] but on php-1.18 [03:30:45] I think there's a bug about that [03:31:07] Aye [03:31:16] Just suprising how often some of them appear [03:36:42] aaron cleared profiling data [03:40:23] Where are we supposed to look for more information about error 500s? [03:40:28] We have too many log files [03:40:41] PHP Fatal error: Class 'LocalisationCache' not found in /usr/local/apache/common-local/php-1.19/languages/Language.php on line 274 [03:40:58] Nice [03:41:12] PHP Fatal error: Call to a member function setTimestamp() on a non-object in /usr/local/apache/common-local/php-1.18/extensions/FeaturedFeeds/FeaturedFeeds.body.php on line 257 [03:41:44] 257 is a blank line [03:42:36] Aaron|home: which log is that? 
[03:43:27] * Aaron|home looks at FF [03:43:30] probably "return $dt->getTimestamp();" [03:43:41] It says setTimestamp() ;) [03:43:43] gah, set not get [03:43:46] On a ParserOptions object [03:43:48] New patchset: Tim Starling; "Added sudoers rule for l10nupdate -> mwdeploy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2574 [03:43:51] but it's supposed to be set int he constructor [03:44:00] self::$parserOptions->setTimestamp( $date ); [03:44:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2574 [03:44:18] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2574 [03:44:39] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2574 [03:44:39] aye, and the constructor does if ( !self::$parserOptions ) { self::$parserOptions = new ParserOptions(); [03:44:52] Get Max to fix it later [03:44:58] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2574 [03:48:10] gn8 folks [03:51:19] Original exception: exception 'MWException' with message 'Detected bug in an extension! Hook FeaturedFeeds::beforePageDisplay failed to return a value; should return true to continue hook processing or false to abort.' 
in /usr/local/apache/common-local/php-1.19/includes/Hooks.php:245 [03:51:23] PHP Fatal error: Class 'FlaggedRevsSetup' not found in /usr/local/apache/common-local/php-1.19/extensions/FlaggedRevs/FlaggedRevs.php on line 45 [03:51:35] hmm, a lot of these seem stochastic [03:51:58] Indeed [03:52:26] do you want wmerrors re-enabled [03:52:33] I can re-enable it, it just segfaults a lot [03:53:21] I'm going to bed in a few [03:54:22] PHP Fatal error: Cannot override final method DatabaseBase::newFromType() in /usr/local/apache/common-local/php-1.19/includes/db/DatabaseMysql.php on line 16 [03:54:24] lol [03:54:31] wut [03:55:13] Does 1.19 contain a random error generator? [03:58:22] we had APC issues last time too, IIRC [03:59:25] * Aaron|home was suspecting apc [03:59:30] <^demon> newFromType()? I thought we renamed that to factory() like it should've been to begin with? [03:59:35] <^demon> Or did that not make it into 1.19? [03:59:39] It's still there [03:59:48] * @deprecated since 1.18 [04:00:05] 4 usages in trunk [04:00:09] 4 extensions we don't run [04:01:46] <^demon> TimStarling: I apologize for picking such a stupid function name. [04:02:04] TimStarling: http://en.wikipedia.org/wiki/User_talk:Jimmy_Wales is giving me old content again. [04:02:14] It's missing edits. [04:02:29] actual edits this time? or template edits? [04:02:33] Actual edits. [04:02:45] Someone added a section for valentine's day. [04:02:53] It's visible through https. [04:03:22] it's visibel through HTTP, for me [04:03:27] Hmm, got purged. http://en.wikipedia.org/wiki/User_talk:Jimmy_Wales is displaying the proper content now. [04:03:40] bzzt too late [04:03:42] I swear I'm not crazy. :-( [04:03:47] try again next time [04:04:07] next time grab the response headers really fast before someone purges it [04:05:03] I guess nobody has worked on http redirection to https? You'd think that would annoy more people than just me. Or everyone is using Firefox with HTTPS-Everywhere... 
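[Editor's note] The "failed to return a value" exception above comes from MediaWiki's hook contract: a handler must return true to let the chain continue or false to abort, and returning nothing (null) is flagged as an extension bug. A Python sketch of the same contract (simplified; real MediaWiki hooks also accept string error returns):

```python
class HookError(Exception):
    pass

def run_hooks(name, handlers, *args):
    """True continues, False aborts, anything else (including None) is a bug --
    the behaviour the Hooks.php exception above enforces."""
    for handler in handlers:
        result = handler(*args)
        if result is False:
            return False          # a handler legitimately aborted processing
        if result is not True:
            raise HookError(
                f"Hook {name} failed to return a value; should return true "
                "to continue hook processing or false to abort."
            )
    return True

def good(page):
    return True

def buggy(page):
    pass                          # forgets to return -- like the report above

print(run_hooks("BeforePageDisplay", [good], "Main_Page"))  # True
try:
    run_hooks("BeforePageDisplay", [good, buggy], "Main_Page")
except HookError as e:
    print("caught:", e)
```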
[04:06:18] Not everyone can have HTTPS-Everywhere =( chrome has nothing like that [04:06:32] (well it has one, but it has security issues according to the EFF making it useless) [04:07:27] we plan to do some more HTTPS polish once we hire our Software Security Engineer [04:08:15] Reedy: you know, you don't have to stay up until every last problem is fixed, right? [04:08:36] I know [04:08:40] I just get distracted from sleeping [04:08:42] And eating [04:08:48] Happens quite often [04:09:15] I'm not the only one ;) [04:09:48] true enough [04:10:17] sometimes people need to be told when to go to bed [04:10:22] Yup [04:10:25] and we're telling you [04:10:40] Living at home helps with the eating, I get shouted at around 6pm for food [04:11:18] at home? at my parents [04:11:19] bleh [04:11:20] * Reedy goes away for a few hours [04:11:29] goodnight! [04:11:36] night! [04:11:47] (mornin') [04:13:22] Does purging a file description page still purge the file's thumbs? https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/LCA_final.png/500px-LCA_final.png seems stuck. [04:13:46] who knows [04:13:53] ;) [04:13:57] details like that tend to get broken when new systems are deployed [04:14:57] There was some Swift media something deployed, right? [04:15:39] yeah, Aaron|home knows about it [04:16:33] it's cached but there's no file in /mnt/thumbs/wikipedia/commons/thumb/a/a2/LCA_final.png [04:17:09] No file at all? There seems to be a partial image somewhere. [04:17:30] yeah, cached [04:17:46] partial image cached, source file purged [04:18:01] when that happens, subsequent purges won't do anything [04:20:53] what maplebed showed me earlier is that the thumb should be correct on NFS, but is partial in Swift [04:21:02] [0420][tstarling@fenari:/home/wikipedia/conf/squid]$ host 10.2.1.27 [04:21:02] 27.1.2.10.in-addr.arpa domain name pointer ms-fe.svc.pmpta.wmnet. [04:21:02] [0420][tstarling@fenari:/home/wikipedia/conf/squid]$ ssh root@ms-fe.svc.pmpta.wmnet. 
[04:21:02] ssh: Could not resolve hostname ms-fe.svc.pmpta.wmnet.: Name or service not known [04:21:13] how hard can it be to keep forward and reverse DNS in sync? [04:23:42] oh, hrm, I get what you're saying now [04:26:39] New patchset: Demon; "Revert 682b27, was a stupid change. Just adding something like" [operations/software] (master) - https://gerrit.wikimedia.org/r/2575 [04:28:02] Content-Length: 16384 [04:28:24] suspicious [04:28:54] robla: speaking of powers of two, I read Snow Crash on the way home from SF [04:29:06] I remember you saying it was the inspiration for Second Life [04:29:12] heh...yeah, it was [04:29:55] it's a very dark book, I thought it was interesting that someone could look at such a dark vision of the future and find something they actually wanted in it ;) [04:31:07] yeah, I think Philip just thought it was cool [04:31:15] bbiab [04:31:53] now, why is it that we want to purge this file and not just switch off swift? [04:32:26] because usually if a new piece of software is so broken, we roll back its deployment [04:38:39] are there any other test cases? [04:41:01] back now....here's the theory that maplebed told me [04:42:06] TimStarling: he thinks it was his initial population script rather than the current stuff running in production that's the problem [04:42:11] * Aaron|home doesn't trust Put_object_chunked or any of that client/rewrite PUT code [04:43:34] he thinks the first version was copying the thumbs into place too soon (before the image was generated) [04:43:56] swift appears to be sending an invalid Last-Modified header for new images [04:44:46] what's it sending? [04:44:58] * robla is configuring wireshark on his machine now [04:44:59] Last-Modified: None [04:45:05] * Aaron|home wonders about squid cache validation checks [04:46:24] did someone just break my test case? 
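[Editor's note] The ssh failure above is the classic symptom: a PTR record (10.2.1.27 → ms-fe.svc.pmpta.wmnet) with no matching forward A record. Once both zones are in one place the consistency check is mechanical; a sketch, with illustrative zone data (the fenari address here is made up for the example):

```python
def ptr_without_forward(forward, reverse):
    """forward: name -> set of IPs (A records); reverse: ip -> name (PTR).
    Return the IPs whose PTR name does not resolve back to that IP."""
    return sorted(
        ip for ip, name in reverse.items()
        if ip not in forward.get(name, set())
    )

forward = {"fenari.wikimedia.org": {"208.80.152.100"}}
reverse = {
    "208.80.152.100": "fenari.wikimedia.org",   # consistent pair
    "10.2.1.27": "ms-fe.svc.pmpta.wmnet",       # PTR exists, A record missing
}
print(ptr_without_forward(forward, reverse))  # ['10.2.1.27']
```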
[04:46:43] * robla didn't [04:46:45] that LCA image started working [04:47:43] anyway, whether or not new broken images are being written, if the old images are broken then we need to fix them [04:49:15] yeah, Ben's going to work on that tomorrow [04:50:48] and you don't think it's a problem for these images to be broken until then? [04:51:06] is there something about swift which makes its deployment difficult to roll back? [04:52:35] TimStarling: how widespread is the image breakage? [04:53:10] in theory, these things have been broken since last week, and we're only just now noticing [04:54:04] most people seem to just purge them without reporting them [04:54:16] I'm looking for PDFs in particular that appear to be broken, and I'm not finding anything [04:54:40] there is a report here: http://commons.wikimedia.org/wiki/Commons:Village_pump#Half_an_image [04:54:45] thankfully not purged yet [04:55:35] ah, yup [04:57:35] and we had a report from Saibo earlier, and another one on IRC about a day ago [04:57:42] TimStarling: you saw this, right: http://wikitech.wikimedia.org/view/User:Bhartshorne/pdf_thumbnail_issue [04:58:22] I had not seen that [04:58:50] in particular, look at the 1200px version [04:59:54] the one I'm looking at is also 139264 bytes [05:00:10] orly [05:01:29] not a power of two :-D [05:02:28] but it is a multiple of 8192 [05:03:15] so....that would actually make some sense if it's a partial encode like Ben was suggesting, if the scaler is writing out in 8192 byte chunks [05:04:29] well, come to think of it, there's no good reason to leave swift running, other than maybe if there's some new reason why NFS would tip over [05:04:46] robla, the upgrade tonight, did it include mailing list servers? [05:04:56] Risker: nope [05:05:26] hmmm. We've just had an email disappear from our main mailing list. Will put in a bugzilla, I guess. [05:05:58] Risker: from the archive or never got delivered? 
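[Editor's note] The size being an exact multiple of the write chunk is a usable heuristic for spotting truncated thumbnails: a scaler or copy loop that dies mid-stream stops on a chunk boundary, while a complete image rarely lands on one. A sketch:

```python
def looks_truncated(size_bytes, chunk=8192):
    """Heuristic only: a partial write through fixed-size chunks leaves a
    length that divides evenly by the chunk size. Not proof of corruption."""
    return size_bytes > 0 and size_bytes % chunk == 0

print(looks_truncated(139264))  # True  (the partial image above: 17 * 8192)
print(looks_truncated(16384))   # True  (the suspicious Content-Length: 2 * 8192)
print(looks_truncated(139265))  # False (off a chunk boundary)
```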
[05:06:16] list admins got a notice that it needed to be moderated, but it never showed up in the moderation queue [05:07:18] I'm a little rusty on Mailman administration, but doesn't it disappear if someone nukes it? [05:07:38] robla: rewrite.py uses 4096 size chunks [05:07:39] also, if I recall correctly, the sender can withdraw it [05:07:58] robla, I don't think the sender can withdraw it [05:08:11] Aaron|home: do you think rewrite.py could send a 404 if the URL is not correct, instead of 400? [05:08:31] 400 is for when there's a problem with some other part of the request [05:08:34] Robla: if he did withdraw it, then he did so in under a minute [05:08:55] what does "URL is not correct" mean? [05:09:54] resp = webob.exc.HTTPBadRequest('Regexp failed: "%s"' % (req.path)) #11 [05:09:55] return resp(env, start_response) [05:09:57] that [05:10:54] I guess [05:12:56] where is the populate script? [05:13:22] Aaron|home: do you know? [05:13:55] no [05:14:57] woosters: are you around by any chance? [05:15:16] hi [05:15:55] hi there! so...Tim and I are dabbling with the possibility of temporarily disabling Swift until we get the partial image thing sorted [05:16:25] we coped without swift for years, I'm sure we can do it for another day [05:17:03] what is the % of images that are distorted in your estimate? [05:17:15] we don't know...that's the problem [05:17:19] I haven't found any way to identify them [05:17:34] but probably a small percentage, like 0.1% [05:18:27] * Aaron|home wishes he could view the puppet repo with something like viewvc [05:18:45] Aaron|home: you're using gitweb? 
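The 400-vs-404 point above amounts to a one-line change in the proxy: a well-formed request for a URL that doesn't match the thumb pattern names a missing resource (404), whereas 400 claims the request itself was malformed. A sketch with a hypothetical pattern — the real regexp lives in rewrite.py in operations/software and is more involved:

```python
import re

# Hypothetical stand-in for rewrite.py's thumb-URL pattern.
THUMB_RE = re.compile(r'^/(?P<site>[^/]+)/(?P<lang>[^/]+)/thumb/.+')

def status_for(path):
    """Map a request path to a status: matched paths proceed (200 here
    for illustration), unmatched paths get 404 rather than 400."""
    return 200 if THUMB_RE.match(path) else 404
```

In webob terms this is just swapping `webob.exc.HTTPBadRequest` for `webob.exc.HTTPNotFound` in the branch quoted above.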
[05:19:00] there's a tiny gitweb link on the revision pages [05:19:03] in gerrit [05:19:07] I was trying that [05:19:29] didn't seem to do what I wanted [05:20:10] it looks like the script was /home/ben/swift/geturls [05:20:32] well, maplebed is writing a script to compare the thumbnails in ms5 and in swift [05:21:57] .1% points to appr 200,000 [05:22:03] * Aaron|home plays with gitweb more [05:22:23] yeah, so I'm saying that we should reconfigure squid to send requests directly to ms5 until he finishes running that script [05:22:25] ok, this is a sort of a slow shitty version of viewvc [05:22:45] * Aaron|home just wanted to look in files/swift [05:22:53] nothing there [05:23:08] it's in /operations/software.git [05:23:39] most of the thumbnails are not used repeatedly [05:23:42] Aaron|home: for what it's worth, you can also check out a local copy, and run "git instaweb", and have a faster, shitty version of the same thing [05:23:54] so the impact is much much less [05:24:10] yeah, my checkout is on my office box [05:24:28] it actually worked today rather than fail out of files with * in the name like last time [05:24:45] * Aaron|home is too lazy to clone on this box too [05:25:03] *fail out on [05:25:44] https://gerrit.wikimedia.org/r/gitweb?p=operations/software.git;a=blob;f=geturls/geturls;h=cf8ab07ec2d95acddc35f4d4f9ad4ce686158ede;hb=c895cbf7bf53d9bf6cdb57798586e5091058f91a [05:29:10] is rewrite.py the thing that's being used at the moment? [05:29:56] yes [05:30:07] the population script is based around it AFAIK [05:30:25] yeah, it just requests the URLs and expects them to be generated and cached [05:34:16] what happens if the client aborts? [05:35:14] or what if ms5 aborts, for that matter? [05:39:53] # upload doesn't like our User-agent (Python-urllib/2.6), otherwise we could call it using urllib2.urlopen() [05:39:53] user_agent = Mozilla/5.0 [05:39:58] that's one mystery solved I guess [05:41:40] where is that?
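The hard-coded `user_agent = Mozilla/5.0` explains the mystery; forwarding the original client's User-Agent instead, as suggested in the channel, looks roughly like this in urllib terms. This is a sketch of the idea only — the real change belongs in the proxy config / rewrite.py, and the function name here is invented:

```python
from urllib.request import Request

def upstream_request(url, client_ua):
    """Build the request the proxy would send upstream, forwarding the
    original client's User-Agent rather than a hard-coded 'Mozilla/5.0',
    so the upstream logs stay useful for incident response."""
    return Request(url, headers={'User-Agent': client_ua or 'unknown'})
```

With the header passed through, attaching to a scaler process and reading the UA out of a live request (as mentioned below for ms5) works again.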
[05:43:59] ms-fe1:/etc/swift/proxy-server.conf [05:45:47] we want it to pass through the User-Agent header, like what ms5 does [05:46:04] it makes incident response much easier [05:46:41] if you have apache symbols installed on the image scalers, you can even attach to a long-running process and print out its user agent [07:00:21] Tim-away: I got it again. [07:00:24] http://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28proposals%29#Restoring_long-lost_edits_using_the_newly_released_historical_database_dumps [07:00:28] > This page was last modified on 11 February 2012 at 14:48. [07:02:16] Hmm. curl is giving February 14. [07:02:36] Maybe just Chrome cache... [07:19:40] !log symlinkd wikidiff2.so to php_wikidiff2.so on searchidx2 [07:19:42] Logged the message, Master [07:19:51] that will take care of that issue, if hackily [07:29:43] New patchset: Asher; "graphite stats retention" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2576 [07:30:06] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2576 [07:30:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2576 [07:30:07] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2576 [07:36:13] New patchset: Asher; "fix lower-precision longer term storage of stats data" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2577 [07:36:36] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2577 [07:37:42] New patchset: Asher; "fix lower-precision longer term storage of stats data" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2577 [07:38:05] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2577 [07:38:05] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2577 [07:38:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2577 [07:49:41] hi there [07:49:52] http://commons.wikimedia.org/wiki/Commons:Deletion_requests/2012/02 <- do you see what I see? (stack trace) [07:50:55] apergos: http://www.telecommander.com/pics/links/application%20software/microsoft/Microsoft_Bob_1_0/Microsoft_Bob_1_0.htm [07:51:04] * Aaron|home doesn't even remember that [07:52:16] I didn't really do windoze [07:52:44] ten "friends of bob" [07:52:45] from http://toastytech.com/guis/bob.html, "A few possible reasons that Bob flopped: ...Most people at the time who wanted ease of use would just get a Macintosh." [07:52:48] * Aaron|home lols [07:52:51] is this like "bob's your uncle"?? [07:53:11] "Bob was not useful enough to justify its initial sale price of almost $100." [09:02:29] leafnode: yes, it looks like a stack trace to me [09:29:05] !log tstarling synchronized php-1.18/extensions/cldr/LanguageNames.body.php 'r111453' [09:29:08] Logged the message, Master [09:30:18] leafnode: fixed [09:30:33] TimStarling: great :) Thanks [09:31:13] leafnode: so they say you are alive [09:33:01] saper: old ladies' gossips [11:10:01] anyone know what is wrong at http://stats.wikimedia.org/ ? 
[11:10:23] permissions are locked down, so it seems that the apache config has been altered [14:05:02] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki to 1.19 [14:05:05] Logged the message, Master [14:16:23] New patchset: Catrope; "Fix MIME type for .woff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2578 [14:16:43] Reedy: --^^ [14:16:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2578 [14:47:20] hi, I'm looking to use the API to get the number of anons who have contributed in a certain date range. The only way I can think of doing this is to loop over list=usercontribs for all contributions in the date range, and using ucuserprefix [14:47:31] that seems less than optimal though. Are there better ideas? [14:49:10] You could run an SQL query on the toolserver, if you have an account there [14:49:27] But one way or another you will have to examine all anon contribs in the date range anyway [14:50:15] ugh, well, it'll have to do I suppose [14:50:16] thanks [15:15:14] New review: Mark Bergsma; "Any reason not to deploy this in base.pp, i.e., on all servers?
:)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2556 [15:16:32] New patchset: Mark Bergsma; "Working upstart job varnishncsa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2579 [15:17:03] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2579 [15:17:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2579 [15:30:59] New patchset: Mark Bergsma; "Pass the environment as arguments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2580 [15:32:19] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2580 [15:32:20] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2580 [15:33:46] New patchset: Mark Bergsma; "Syntax error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2581 [15:34:28] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2581 [15:34:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2581 [15:36:09] New patchset: Mark Bergsma; "Apparently Puppet doesn't do string concatenation with +" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2582 [15:36:53] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2582 [15:36:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2582 [15:39:20] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 34342 - Create a new books namespace on he.wiki' [15:39:23] Logged the message, Master [15:45:19] Hello, is there a problem with stats.wikimedia.org ? 
[15:47:33] Ash_Crow: looks like it, getting Forbidden [15:48:08] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 34378 - Rename namespaces on mr.wikisource.org' [15:48:10] Logged the message, Master [15:57:17] it is working for me [15:57:20] Ash_Crow: fixed it [15:57:27] ah, you did something [15:57:28] apergos: for a couple of seconds ;) [15:57:28] thanks [15:57:55] it's on spence, and spence lost its NFS mount.. Stale NFS file handle [15:57:57] mutante, thanks :) [15:58:09] which included this document root [15:58:10] yw [15:59:20] Ash_Crow: thanks for reporting, monitoring didn't detect this one [16:08:58] hello [16:09:07] I get problems when editing pages on Commons [16:09:18] I get a message "Some parts of the edit form did not reach the server; double-check that your edits are intact and try again." [16:09:29] and a blank page [16:09:38] I have to try 4/5 times to edit a page [16:09:38] yannf, not Chrom(e|ium) over HTTPS I hope [16:09:57] yes, Chrome, but not https [16:10:52] is it serious, Doctor? [16:12:38] any other action is no problem: reading, patrolling, deleting, etc. [16:12:39] New review: Catrope; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2264 [16:14:08] * Nemo_bis is not a dev [16:14:47] was there any problem with Chrom(e|ium) over HTTPS? [16:16:02] yannf, yes, many POST requests not working [16:16:11] dunno, I can edit Commons with Chromium [16:21:17] something else, there is no description page for these files: [16:21:19] http://commons.wikimedia.org/wiki/File%3APOSTERMENDOZA.JPG [16:21:26] http://commons.wikimedia.org/wiki/File%3ABruxelles_Java_Masque_Wayang_02_10_2011_06.jpg [16:22:13] yannf, I see it [16:22:28] perhaps check which squid served it in the source?
[16:22:38] uh no [16:23:03] the 2nd one is there since at least December [16:23:27] the 1st was probably added in January: http://ru.wikipedia.org/w/index.php?title=%D0%A1%D1%82%D1%80%D0%B0%D1%88%D0%BD%D1%8B%D0%B9_%D1%81%D1%83%D0%B4_(%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2006)&diff=next&oldid=40590500 [16:23:47] still http://commons.wikimedia.org/wiki/Commons:Administrators%27_noticeboard/Archive_31#System_problems ? [16:25:17] looks it [16:25:24] yes 18 November 2011 [16:46:10] New patchset: Dzahn; "add account for aengels, add to stat1, fix last UID counter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2583 [16:46:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2583 [16:56:50] New patchset: Mark Bergsma; "Make start-stop-daemon work with multiple instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2584 [16:59:36] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2584 [16:59:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2584 [17:05:43] New patchset: Mark Bergsma; "Sigh." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/2585 [17:06:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2585 [17:10:24] New patchset: Mark Bergsma; "Add job name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2586 [17:10:47] New review: Dzahn; "approved now in RT 2436" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2583 [17:10:47] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2583 [17:10:48] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2585 [17:11:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2586 [17:11:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2586 [17:26:04] hi, I'm running into an issue on formey -- I just ran 'sudo add-ldap-user jdlrobson http://jonrobson.me.uk/ab134qe4/1i3133113rs' and I haven't gotten a prompt back, nor the usual confirmatory response saying that the home directory has been created. 'ldaplist -l passwd jdlrobson' says that it has. [17:26:47] lemme look [17:26:49] anything in particular I should do? I'm hesitating to just quit that particular command with control-C or closing the window or whatever in case there's something else that has to run in the background and that is stuck... [17:27:17] I ran ps aux and I didn't see anything in particular associated with my username and LDAP-creation-related that looked like it was still running [17:27:20] home directory was not created [17:27:50] don't control-c just yet [17:28:11] The stuck process is nscd, PID 835 [17:28:12] it's waiting on something. I wonder wat [17:28:14] *935 [17:28:18] bah [17:28:34] eww. it's really hung [17:28:44] did i break something already? 
:) [17:28:57] This is blocking my work because I am hesitant to do any more commit access queue work until I know this isn't some glitch. Ryan_Lane perhaps you could comment? [17:29:03] sumanah: did it unblock? [17:29:16] it's just nscd acting weird [17:29:30] seems your process isn't blocked anymore [17:29:37] home directory was created [17:29:47] sumanah: no need to not do any more [17:29:57] on rare occasion nscd breaks [17:30:00] New patchset: Hashar; "swp files are now ignored" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2587 [17:30:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2587 [17:30:40] the scripts probably don't help with that, since it purges nscd when run [17:32:45] Never mind. It was lag. [17:33:27] no. i killed and purged nsc [17:33:28] *nscd [17:33:49] OK. Ah, thanks. Weirdly, I also experienced a great deal of lag over the last 5 min [17:34:16] this is likely due to nfs1 dying yesterday [17:34:29] Right, because of the Swift & PediaPress thing? [17:35:00] (Ryan, the reason why I seemed to be oblivious to the things you were saying was that I didn't see/hear them, because of IRC lag.) [17:35:08] ahhhh [17:35:08] heh [17:35:22] nah. nfs1 died for some other reason, I believe [17:35:32] Oh nm then, [17:35:41] someone scapped too quickly, or something [17:35:44] and I eventually did get the response Creating a home directory for jdlrobson at /home/jdlrobson [17:35:48] * Ryan_Lane nods [17:37:00] Reedy: how goes? [17:40:50] name service cache daemon. Well, I learned something today. 
[17:49:15] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2264 [17:49:45] New patchset: Dzahn; "enhance page_all - area code API lookup one-liner :p - option to skip an area" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2264 [17:51:26] * robla starts rooting around for 500 error log on test2 [18:08:57] I'm rather newish to poking around on fenari logs. We're getting frequent 500s on test2. where should I go to see where those are logged? [18:09:40] I'm looking in /home/w/logs/exceptions.log, nothing obvious there. also /home/w/logs/test2wiki.log isn't obvious [18:17:35] Unusually high or persistent lag should be reported to #wikimedia-tech on irc.freenode.net [18:17:38] SOS! [18:18:33] http://ru.wikipedia.org/w/api.php?format=xml&action=query&meta=siteinfo&siprop=dbrepllag give me too big lag 4-5 thousands, what happened ??? [18:18:38] *gives [18:20:13] Nirvanchik: can you give me a traceroute ? [18:20:15] and your ip ? [18:20:19] where are you connecting from ? [18:20:30] my bot (with 15 s lag setting) worked fine always until today [18:20:36] from Moscow, Russia [18:20:50] LeslieCarr: That's everything but connection/ client related [18:21:01] 109.72.74.228 [18:21:15] well i have found we've had a lot of routing issues with russian isp's [18:21:30] so the ru made me suspicious that network could be the issue [18:21:38] LeslieCarr: But that has absolutely nothing to do with DBs out of sync [18:21:43] no it does not [18:21:52] https://ru.wikipedia.org/w/api.php?format=xml&action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb [18:22:04] db46 ... [18:22:08] ah [18:22:21] i was thinking you were saying that loading that page gave you a big lag [18:22:25] not the db46 issue... [18:22:37] :)))) [18:22:50] okay. can this be fixed somehow? [18:23:14] and what happened [18:23:55] not sure what happened, though i do see the replication lag, will look into it [18:23:56] temp.
depooling db46 might help :P [18:23:57] hi [18:24:11] @replag [18:24:16] sigh [18:24:26] Nemo_bis: That bot's gone for ages :/ [18:24:28] how does it come, that I can't display some characters in Etherpad? could it be a server problem? [18:24:37] hoo, but Krinkle restarted it [18:25:18] feb 12 06.19.04 * dbbot-wm has quit (Ping timeout: 252 seconds) [18:28:01] freenode :/ [18:29:02] Nirvanchik stalker [18:29:37] Nemo_bis: sorry? [18:29:52] Nirvanchik, j/k for your CTCP [18:30:17] ok. I said my location, but now I feel curiosity about where are you all from [18:30:19] Nirvanchik: this may be due to the schema migration for MW 1.19 .. still investigating [18:30:34] lol [18:32:04] fyi, I'm from SF, CA, USA [18:32:35] What's up? [18:34:28] LeslieCarr: db46 has the revision table locked [18:34:47] LeslieCarr: hm... I don't quite understand but this is interesting. when was this migration conducted? [18:35:02] It's still going on [18:35:06] It's not a quick change [18:35:24] Reedy: yeah, i'm just not quite sure what to do about it/ if this is due to the migration and needed, etc --- where's asher when you need him? ;) [18:35:25] LeslieCarr: thanks. I guessed u were from US [18:35:33] let's see [18:35:49] db46 is just an s6 slave [18:36:05] It can just be removed from the db pool [18:36:12] Reedy: Why not just depooling? [18:36:26] That's what I just said :p [18:36:32] :P [18:36:42] 'Apache failed sanity check: VIP not configured on lo' [18:36:45] so much for graceful [18:36:53] "[19:23] hoo: temp. depooling db46 might help :P" [18:37:04] We saw this for another cluster yesterday, then it decided to catch up again [18:37:31] LeslieCarr: recognize that error? [18:38:16] AaronSchulz: i'm guessing that lo0 isn't configured with its proper ip [18:38:18] which machine ?
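The 'VIP not configured on lo' check is just a grep over `ip addr` output: an LVS realserver carries the service VIP on its loopback interface so it can answer traffic the load balancer forwards to it, and the sanity check verifies the VIP is there. A Python rendering of the same test — the 10.2.1.1 default is the guess made in the log, not a confirmed value:

```python
import re

def vip_on_loopback(ip_addr_output, vip='10.2.1.1'):
    """Re-implementation of apache-sanity-check's test: does any line
    of `ip addr` output mention the VIP followed by lo?  A deploy host
    like fenari has only a normal IP, so this fails there by design."""
    return re.search(re.escape(vip) + r'.*\blo\b', ip_addr_output) is not None
```

Which also explains the confusion below: the check is meant for the apaches, not for the host driving the graceful restart.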
[18:39:07] !log reedy synchronized wmf-config/db.php 'Comment out db46 from s6 due to really high lag from db schema updates' [18:39:09] Logged the message, Master [18:40:17] Though not sure if just reducing its load might have been better [18:40:58] WOW bot started his job :) [18:41:01] no lag [18:41:31] thanks to everyone [18:41:56] * AaronSchulz looks at node groups [18:43:57] if ! /sbin/ip addr | grep -q "$VIP.*lo"; then... [18:45:12] VIP=10.2.1.1 I guess [18:45:14] it shows db host="db43" now :) [18:48:27] LeslieCarr: yeah, so that code is in apache-sanity-check, and ran on fenari [18:50:14] oh it wants a 10.X address on lo0 [18:50:24] which fenari wouldn't have, it just has a normal ip [18:50:31] why does fenari need apache-sanity-check run ? [18:52:36] it shouldn't, I think I meant to be running apache-graceful-all [18:53:15] * AaronSchulz thought the latter was the former ;) [18:53:20] LeslieCarr: heheh) wow you're a girl and u have red hair :) not what I expected to see here ))) [18:53:33] wow, girls! [18:53:38] they exist [18:53:57] yep, there's a few of us here ;) [18:54:06] yep. it's nice to have them [18:54:34] rarity in IT [18:55:24] I'll put db46 back in a few minutes, it's nearly caught up with all its queries [18:55:51] though the alter is still ongoing, now on jawiki [18:57:07] Reedy: What ALTER is this? 1.19 schema changes? [18:57:15] yaa [18:57:34] ar_sha1 [18:58:37] Ah [19:01:18] * AaronSchulz hates the archive table [19:01:29] worst table EVER [19:01:42] hahaha [19:01:44] are you [19:01:52] yeah, you heard. it does suck doesn't it? [19:01:58] so, i reverted the change related to the office redirect [19:02:05] and synced, and graceful'ed [19:02:10] sorry about that [19:02:23] but really the fact that we have revisions in there in a gazillion old icky forms is also the worst ever [19:02:29] revision text, I mean [19:04:40] it's morning in CA... I got here in good time. And now, I'm going to watch a film before sleep.
And you continue to work :) bye-bye! [19:05:41] The update queries seem to be working reasonably well [19:05:46] it's onto rev_sha1 for jawiki now [19:06:09] !log gracefulled apaches to deal with APC corruption [19:06:12] Logged the message, Master [19:06:21] Yay [19:06:46] time to find 1.19 breakage [19:07:03] * Reedy mashes F5 [19:07:26] test2 seems faster [19:08:25] And not dying every few requests ) [19:10:51] AaronSchulz: enwiki next? ;) [19:11:00] go! [19:11:03] wait a few min first [19:11:13] awwwww [19:11:13] lets not rush anything [19:11:27] robla: test is also on 1.19wmf1 [19:11:33] chrismcmahon: we're ready for a smoke test on test2 [19:11:41] *almost [19:11:47] ? [19:12:01] testwiki gave a slightly more consistent loading of pages [19:12:09] * chrismcmahon watches...  [19:12:39] Worth just giving test2 a quick look over before submitting it to anything [19:12:59] And reset the test2 profiling data [19:13:30] aaron cleared profiling data [19:13:58] woosters: is binasher coming in today? [19:14:15] http://noc.wikimedia.org/cgi-bin/report.py?db=test2 [19:14:25] "New in MediaWiki 1.19 "Stuff..."" [19:14:29] lol [19:14:34] It's accurate [19:14:42] can't be wrong [19:14:47] Reedy: we forgot "...and things" [19:15:05] robla: too late, it's in feature freeze [19:15:38] let's make sure we get that in 1.20, okay? [19:15:46] noted [19:16:03] robla - he should be [19:16:17] he was feeling 'sickish' yesterday though [19:16:36] now is also the time to start seeing the profile info [19:16:58] test2 looks to be in a reasonable state from clicking around [19:17:09] not dying every X requests or so [19:17:22] Reedy: AaronSchulz: is it time for chrismcmahon to run through his test plan? [19:17:30] sooner the better [19:17:31] (on test2 that is?) [19:17:44] robla: I've been visiting it today between 500s [19:17:50] exactly [19:18:06] chrismcmahon: have you seen any more 500s in the past 15 min? 
[19:18:09] apc changes look to have solved that nicely [19:18:45] robla: not that recently, checking now [19:19:02] Reedy: did you make config changes, or merely reboot? [19:19:15] sorry? [19:20:29] Reedy: you said "apc changes", as though something actually changed [19:21:23] Didn't realise it was only an apache restart [19:22:43] my understanding was that it was a "slow restart", whatever that means [19:22:43] AaronSchulz: ? [19:22:59] yeah [19:23:05] yup [19:23:22] no changes to cache size or anything like that then [19:23:24] let the processes stop themselves when they've finished etc, rather than kill -9 ;) [19:23:59] Looks like it's good to go for the moment [19:24:54] New patchset: Catrope; "Fix purge-varnish, wants ban.url now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2590 [19:39:50] any known issue with thumbnails? i have issues with displaying some [19:44:51] Danny_B|backup: yes, there is. [19:45:04] a small percentage are corrupted [19:45:08] purging seems to fix [19:45:53] guillom: are you still around? [19:46:02] robla, I am [19:46:26] argh, and maplebed isn't [19:46:59] the issue above is probably worth a tweet or something [19:47:42] robla, can you quickly summarize it for me? [19:48:23] here's the status on that. yesterday, we discovered that a small percentage (anecdotally) of image thumbnails are corrupt, in that they'll appear partially loaded [19:48:35] You can always !log and put a !Wikimedia and !Wikipedia somewhere [19:48:50] some ops folks don't like that :) [19:48:52] (to reach those who read groups and tags) [19:49:07] we know the cause, and maplebed is working on two things: [19:49:19] robla: purging did not work, that's why i'm asking [19:49:27] 1. damage assessment: estimating how many images we're talking about here [19:49:40] 2. this afternoon, the actual fix [19:49:46] Danny_B|backup: url?
[19:49:46] ok [19:49:56] robla: https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Vitezslav_Janovsky_Vilimek.png/220px-Vitezslav_Janovsky_Vilimek.png [19:50:22] if by this afternoon we don't have a fix, we may revert swift temporarily [19:51:06] ok [19:51:10] Danny_B|backup: you purged here? http://commons.wikimedia.org/wiki/File:Vitezslav_Janovsky_Vilimek.png [19:51:15] yup [19:51:24] woosters: ^ [19:51:30] purging doesn't seem to fix [19:52:34] * robla runs a purge himself [19:53:16] !log reedy synchronized wmf-config/db.php 'Bring db46 back in' [19:53:19] Logged the message, Master [19:54:26] I'm assuming maplebed isn't on IRC because he's in "do not disturb" mode [19:54:39] robla, he's here now [19:55:09] maplebed: I'll mail you the backlog [19:55:25] actually my connection just dropped and I didn't notice. [19:55:33] but I'm almost out of dnd mode. [19:56:12] summary: purging doesn't work for this URL: https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Vitezslav_Janovsky_Vilimek.png/220px-Vitezslav_Janovsky_Vilimek.png [19:56:18] + http://commons.wikimedia.org/wiki/File:Vitezslav_Janovsky_Vilimek.png [19:56:22] so far with a sample of 200 images we're at 4.7% that have at least one bad thumbnail [19:56:53] robla: I think that's a different problem. [19:57:19] 221px works fine [19:57:45] https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Vitezslav_Janovsky_Vilimek.png/221px-Vitezslav_Janovsky_Vilimek.png [19:58:07] 4.7% is pretty bad [19:59:53] robla: the 220px image is purged from both ms5 and swift [20:00:08] so I'm guessing there's some problem with squid caching errors and not purging them right. [20:02:13] that is high [20:12:05] PHP Fatal error: Cannot access protected property WikiPage::$mTouched in /home/wikipedia/common/php-1.19/includes/Article.php on line 1743 [20:14:31] maplebed: http://pastebin.com/EWZx2rL7 [20:15:24] AaronSchulz: looks good to me. [20:15:27] commit!
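A 4.7% bad rate in a 200-image sample carries real uncertainty worth quantifying before extrapolating to the whole thumbnail store. A quick way to put error bars on it (9/200 = 4.5% is used here as the nearest whole count to the quoted figure):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a sample proportion — behaves
    sensibly even for small counts, unlike the naive normal interval."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

low, high = wilson_interval(9, 200)  # roughly (0.024, 0.083)
```

So the sample is consistent with anything from ~2.4% to ~8.3% of originals having a bad thumbnail; either way, far above the 0.1% guessed earlier in the day.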
[20:15:39] we can test it on the eqiad cluster [20:23:18] Um, test2.wp.org doesn't have the same settings as what the actual projects will have after the upgrade, right? [20:23:49] "Male user" and "Female user" won't really be namespaces? [20:26:29] i should hope not [20:27:09] !log hashar synchronized php-1.18/maintenance/purgeList.php 'r111480 : enable purge of HTTPS URLs' [20:27:11] Logged the message, Master [20:33:11] !log hashar synchronized php-1.19/maintenance/purgeList.php 'r111480 : enable purge of HTTPS URLs' [20:33:14] Logged the message, Master [20:38:38] !log reedy synchronized wmf-config/InitialiseSettings.php 'fix some mrwikisource aliases' [20:38:40] Logged the message, Master [20:41:33] !log reedy synchronized wmf-config/InitialiseSettings.php 'fix some mrwikisource aliases' [20:41:35] Logged the message, Master [20:45:27] gmaxwell: werdna: (low prio) Five days ago we talked about "https per default" for logged in users - a current note: twitter switched that on yesterday for logged-in users. E.g. google also provides https per default for all logged-in activities (AFAIK). So Wikimedia would be in "good" company ;-) [20:45:49] ugh. twitter switched this on? [20:45:54] now I have to fucking do it [20:45:56] !log reedy synchronized wmf-config/InitialiseSettings.php 'Fix hewiki namespace talk typo' [20:45:58] Logged the message, Master [20:46:03] this is what prompted me to enable https to begin with ;) [20:46:21] just make everyone use https [20:46:23] problem solved [20:46:26] "even *twitter* is doing https and we aren't?" [20:46:44] well, that would actually solve a lot of problems [20:47:00] I don't think we're ready to go down that path yet, though ;) [20:47:16] also, latency is problematic with https, so anons would take a performance hit [20:47:34] don't you have some hardware SSL offloading in front of squids? 
[20:47:34] we'd need to run nginx on all of the squid and varnish nodes for that [20:47:41] Ryan_Lane: did you see anons being logged-in? ;) [20:47:43] maplebed: https://upload.wikimedia.org/wikisource/mr/thumb/e/ef/%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/page3-1000px-%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu.jpg [20:47:46] 401 not authorized? [20:47:48] we have an nginx cluster [20:47:54] that is doing ssl termination [20:48:09] Saibo: Reedy suggested using https for everyone [20:48:18] not just logged-in users [20:48:21] ehm.. yes, sry [20:48:30] Reedy: that can happen when the URL is invalid [20:48:33] https://mr.wikisource.org/w/index.php?title=%E0%A4%AA%E0%A4%BE%E0%A4%A8:%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/3&action=edit&redlink=1 [20:48:34] are you sure that's a valid thumb url? [20:48:37] it's actually technically easier to make everyone use https, than just logged-in users [20:48:43] It's the url from the image trying to be used there [20:49:14] is that Swift ? [20:49:16] AaronSchulz: ^^^^ [20:49:20] Ya [20:49:34] AaronSchulz: I thought all thumbs had to start ###px after the / [20:49:35] https://mr.wikisource.org/w/index.php?title=%E0%A4%AA%E0%A4%BE%E0%A4%A8:Wind_in_the_Willows_%281913%29.djvu/95&action=edit&redlink=1 [20:49:41] Other djvu seem ok [20:49:47] * hashar opens the bug "Token may have timed out" error message needs i18n support [20:49:53] maplebed: btw, awesome job on swift deploymeny [20:49:56] *deployment [20:50:04] really happy to see that going [20:50:25] maplebed: that thumb is valid [20:50:30] Ryan_Lane: heh... I'm about to revert. [20:50:34] :( [20:50:36] but thanks anyways... 
you also have seek= and mid and other params [20:50:44] in addition to page [20:50:53] !log asher synchronized wmf-config/db.php 'setting s7 to read-only for master swap, db37 to be new master' [20:50:55] Logged the message, Master [20:51:09] maplebed: too many bugs in swift? [20:51:18] rats!!! [20:51:35] just one - but it's affecting about 4% of all images (1.5% of all thumbnails) [20:51:41] damn. that sucks [20:51:43] anyway, rewrite.py doesn't really validate the thumb name anyway, that's for thumb.php [20:51:54] maplebed: I got another offer for the swift people to help us deploy [20:51:57] as FOSDEM [20:52:00] *at [20:52:15] if you'd like the devs to help [20:52:26] I think they want to set up a meeting soon [20:52:27] maybe they can maintain php-cloudfiles better [20:52:27] I would. Could you email me names / addresses? [20:52:42] It was stephano. I'll contact him about details [20:52:46] !log asher synchronized wmf-config/db.php 'setting s7 to read-only for master swap, db37 to be new master, db16 still out' [20:52:48] Logged the message, Master [20:52:51] AaronSchulz: :D [20:53:20] AaronSchulz: that's a legit request. we should ask them that when we meet. [20:54:09] Reedy: ah, I've found the problem with the mr wikisource. [20:54:19] Reedy: do you know if it's a new project? [20:54:31] AaronSchulz: the container for mr.wikisource doesn't exist. [20:54:42] maplebed: created 2 February [20:54:49] yup, that'd do it. [20:54:56] ^ [20:54:57] # [20:55:04] I need to find a list of all projects created since mid-jan. [20:55:06] maplebed: probably means there's a few more wikis [20:55:16] (I can probably create it, but if someone has done it already, that'd be nice.)
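The dbname confusion that follows (vep.wikimedia vs vep.wikipedia) comes from the suffix convention in WMF database names, where the bare `wiki` suffix means Wikipedia. A sketch of the parse — the suffix list here is illustrative, not the full production set:

```python
def split_dbname(dbname):
    """Split a WMF database name into (language code, project).
    Longer suffixes must be tried before the bare 'wiki' suffix so
    'mrwikisource' doesn't get read as language 'mrwikisour' + 'wiki';
    and 'vepwiki' resolves to vep.wikipedia, not vep.wikimedia."""
    for suffix, project in (('wikisource', 'wikisource'),
                            ('wiktionary', 'wiktionary'),
                            ('wikimedia', 'wikimedia'),
                            ('wiki', 'wikipedia')):
        if dbname.endswith(suffix):
            return dbname[:-len(suffix)], project
    raise ValueError('unrecognised dbname: %s' % dbname)
```

Going by database name, as suggested below, sidesteps the broken hostname in the newprojects mail entirely.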
[20:55:22] New patchset: Asher; "upgrading mysql on db16, db37 is new s7 master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2591 [20:55:23] I think there are 3 or 4 [20:55:28] Done all on the 2nd feb [20:55:41] or 1st [20:55:45] bewikimedia [20:55:46] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2591 [20:55:46] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2591 [20:55:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2591 [20:55:47] vepwiki [20:55:57] maplebed: http://lists.wikimedia.org/pipermail/newprojects/2012-February/thread.html [20:56:05] thanks DaBPunkt [20:56:32] umm. [20:56:44] 3 bewikimedia? just ignore the duplicates ;) [20:56:52] that list is broken. it mentions http://vep.wikimedia.org as a new wiki, but it doesn't exist. [20:57:04] I'll guess it means vep.wikipedia, but hrmph. [20:57:10] binasher: what's the start-position for s7? [20:57:42] maplebed: database name is probably safest to go against [20:57:47] DaBPunkt: oh yeah, just a sec [20:57:53] maplebed: vepwiki [20:58:06] !log new s7 repl position - MASTER_LOG_FILE='db37-bin.000285', MASTER_LOG_POS=865712092 [20:58:07] maplebed: is adding a new container something anyone with shell can do, or is it an ops request? Need to update "add a wiki" on wikitech either way [20:58:07] Reedy: I need the expanded name for the container though. [20:58:08] Logged the message, Master [20:58:20] Reedy: after 1.19 it'll happen automagically. [20:58:27] awesome [20:58:29] in the mean time, it's a maplebed-request. [20:58:38] DaBPunkt: i'm going to switch s2 and s3 shortly as well, but that's it for today [20:58:49] maplebed: May I ask what a "container" is in this context? [20:58:58] binasher: ok [20:59:22] DaBPunkt: if swift is like Amazon's S3, a container is like a bucket. [20:59:30] does that help?
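A rough sketch of the container-per-project layout behind "the container for mr.wikisource doesn't exist". The naming pattern, the set of sharded projects, and the `.<shard>` suffix below are assumptions for illustration only, not taken from the production rewrite.py:

```python
# Illustrative sketch of per-project Swift thumbnail containers.
# All names and the sharding rule here are assumed, not production values.

SHARDED_PROJECTS = {("wikipedia", "commons"), ("wikipedia", "en")}  # assumed

def thumb_container(site: str, lang: str, shard: str) -> str:
    """Return the assumed Swift container name for a project's thumbs.

    `shard` is the two-hex-char path component from the thumb URL
    (e.g. "ef" in .../e/ef/...). Most projects get one container; a
    few huge ones are split into 256 containers keyed by that shard.
    """
    base = f"{site}-{lang}-local-thumb"
    if (site, lang) in SHARDED_PROJECTS:
        return f"{base}.{shard}"
    return base
```

Under this scheme a brand-new wiki like mr.wikisource simply has no container yet, which matches the 401s seen above until one is created.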
[20:59:30] :P [20:59:35] * Reedy looks for the walrus [20:59:58] maplebed: not really. I do not trust in clouds ;) [21:00:07] well not just after 1.19, after some other stuff happens [21:00:59] DaBPunkt: ok. an alternate explanation - swift's object storage model has containers and objects - objects go in containers. They're sort of like directories / folders on a regular filesystem except that they can't be nested. We've partitioned thumbnails so there's one container per project (most of the time) [21:01:21] (there are two projects for which the containers are broken up into 256 separate containers, indexed by the shard present in the URL). [21:01:28] any more helpful? [21:02:02] So it is a kind of distributed filesystem or object-space? [21:02:05] DaBPunkt: heh. it's not "cloud" storage. it's "object" storage [21:02:17] DaBPunkt: swift is a distributed object store, yeah. [21:02:31] 'container' in this context is vocabulary specific to swift. [21:02:50] !log asher synchronized wmf-config/db.php 'returning db16 after upgrading mysql' [21:02:52] Logged the message, Master [21:04:40] zzz =_= [21:05:15] maplebed: ok, sounds interesting [21:05:33] DaBPunkt: http://wikitech.wikimedia.org/view/Swift for more reading. [21:07:34] DaBPunkt: or https://en.wikipedia.org/wiki/OpenStack#Object_Storage_.28Swift.29 ;-) (...probably) [21:11:50] binasher: did you switch s3 already? [21:12:16] Reedy: I think the mr.wikisource is working now, after bad stuff is purged. [21:12:30] DaBPunkt: no [21:12:34] Thanks [21:12:58] * maplebed -> afk [21:13:38] !log asher synchronized wmf-config/db.php 'setting s3 to read-only, switching master to db39' [21:13:40] Logged the message, Master [21:14:47] binasher: mm, strange. According to db.php db39 is the master of s3.
But the TS uses db34 as master and I can't find a master-switch of s3 in the techlog [21:14:57] DaBPunkt: and finally, see http://www.mediawiki.org/wiki/FileBackend [21:15:17] !log asher synchronized wmf-config/db.php 'returning s3 to writeable, db39 is the new master' [21:15:19] Logged the message, Master [21:15:30] !log new s3 master position - MASTER_LOG_FILE='db39-bin.000550', MASTER_LOG_POS=63238699 [21:15:32] Logged the message, Master [21:15:38] DaBPunkt: I have to edit the file before I deploy the file [21:15:48] ah ok. That clears it up :) [21:18:10] mm, our s3-copy is 6h behind. So no master-change for us now [21:18:25] New patchset: Asher; "upgrading db34, db39 new s3 master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2592 [21:18:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2592 [21:19:48] Does anybody know about a field called "rev_sha1"? [21:20:05] DaBPunkt: it's new [21:20:15] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2592 [21:20:16] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2592 [21:24:00] brion: ok, it was somehow missing in our test2wiki-database [21:24:26] binasher: please let the binlog on db34 remain for a few hours if possible [21:25:59] eek! [21:26:11] gotta update that eh [21:29:46] !log asher synchronized wmf-config/db.php 'db34 upgrading, returning to s3' [21:29:48] Logged the message, Master [21:32:12] New patchset: Lcarr; "reenabling ifup script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2593 [21:33:21] you gave somebody else the 'andrew' commit name?
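The `!log` entries above record the new master's binlog coordinates, which are exactly the values a slave needs in a MySQL `CHANGE MASTER TO` statement. A small helper that turns such a log line into the statement (the log-line parsing here is an illustration, not a WMF tool):

```python
import re

def change_master_sql(log_line: str, master_host: str) -> str:
    """Format a CHANGE MASTER TO statement from a logged repl position.

    Expects the coordinate format used in the !log lines above, e.g.
    "MASTER_LOG_FILE='db39-bin.000550', MASTER_LOG_POS=63238699".
    """
    m = re.search(r"MASTER_LOG_FILE='([^']+)',\s*MASTER_LOG_POS=(\d+)", log_line)
    if not m:
        raise ValueError("no replication coordinates found in log line")
    return (
        f"CHANGE MASTER TO MASTER_HOST='{master_host}', "
        f"MASTER_LOG_FILE='{m.group(1)}', MASTER_LOG_POS={m.group(2)};"
    )
```

For the s3 switch above this yields the statement each s3 slave would run after `STOP SLAVE`, followed by `START SLAVE`.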
[21:33:23] ugh [21:33:28] I have 'andrew' on hilight [21:33:50] and I also have the 'andrew' username on wmf servers [21:33:59] that is going to be screwed up [21:34:59] !log asher synchronized wmf-config/db.php 'setting s2 to read-only, switching master to db13' [21:35:01] Logged the message, Master [21:36:13] !log asher synchronized wmf-config/db.php 'setting s2 to writeable, db13 is new master' [21:36:15] Logged the message, Master [21:36:22] !log new s2-master pos - MASTER_LOG_FILE='db13-bin.000278', MASTER_LOG_POS=599752853 [21:36:24] Logged the message, Master [21:36:58] DaBPunkt: the db34 binlogs will be around for a while.. it's going to start running schema migrations tonight that will be long running [21:37:05] DaBPunkt: also s2 info ^^ [21:37:14] thnx [21:39:37] !log asher synchronized wmf-config/db.php 'returning db30 to s3, going to wait til after schema migrations to upgrade to lucid/new-mysql' [21:39:39] Logged the message, Master [21:42:33] DaBPunkt: i'm actually going to switch s4 in a couple minutes too.. 
that's really the last one for today [21:43:09] ok [21:47:34] !log asher synchronized wmf-config/db.php 'setting s4 to read-only, switching master to db22' [21:47:36] Logged the message, Master [21:48:51] !log asher synchronized wmf-config/db.php 'setting s4 to writeable, new master is db22, db31 still out' [21:48:53] Logged the message, Master [21:48:57] !log new s4 master pos - MASTER_LOG_FILE='db22-bin.000030', MASTER_LOG_POS=964208442 [21:48:59] Logged the message, Master [21:52:32] New patchset: Asher; "upgrading db31, new s2+s4 masters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2594 [21:53:17] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2593 [21:53:18] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2593 [21:55:10] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2594 [21:55:11] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2594 [21:56:25] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [21:56:37] New patchset: Lcarr; "Fixing default interface to default_gateway_interface" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2595 [21:56:58] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2595 [21:56:59] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2595 [21:57:01] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:58:04] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 138 MB (1% inode=57%): /var/lib/ureadahead/debugfs 138 MB (1% inode=57%): [21:59:52] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.309 seconds [22:03:37] PROBLEM - Disk space on srv223 is CRITICAL: DISK 
CRITICAL - free space: / 221 MB (3% inode=58%): /var/lib/ureadahead/debugfs 221 MB (3% inode=58%): [22:03:37] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.942 seconds [22:04:40] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=58%): /var/lib/ureadahead/debugfs 1 MB (0% inode=58%): [22:04:40] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=57%): /var/lib/ureadahead/debugfs 199 MB (2% inode=57%): [22:06:19] RECOVERY - mysqld processes on db31 is OK: PROCS OK: 1 process with command name mysqld [22:06:28] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:06:50] !log reedy synchronized wmf-config/InitialiseSettings.php 'Setting wmfUseRevSha1Columns' [22:06:52] Logged the message, Master [22:07:13] RECOVERY - MySQL Replication Heartbeat on db1038 is OK: OK replication delay seconds [22:07:13] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 75 MB (1% inode=58%): /var/lib/ureadahead/debugfs 75 MB (1% inode=58%): [22:07:13] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 123 MB (1% inode=57%): /var/lib/ureadahead/debugfs 123 MB (1% inode=57%): [22:07:32] RECOVERY - MySQL Replication Heartbeat on db51 is OK: OK replication delay seconds [22:07:32] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay seconds [22:07:32] RECOVERY - MySQL Replication Heartbeat on db22 is OK: OK replication delay 0 seconds [22:07:40] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:07:49] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay seconds [22:10:13] RECOVERY - Disk space on srv223 is OK: DISK OK [22:11:07] RECOVERY - Disk space on srv219 is OK: DISK OK [22:11:25] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=57%): /var/lib/ureadahead/debugfs 0 MB (0% 
inode=57%): [22:11:25] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 274 MB (3% inode=58%): /var/lib/ureadahead/debugfs 274 MB (3% inode=58%): [22:14:07] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 100 MB (1% inode=58%): /var/lib/ureadahead/debugfs 100 MB (1% inode=58%): [22:14:16] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 135 MB (1% inode=58%): /var/lib/ureadahead/debugfs 135 MB (1% inode=58%): [22:16:22] RECOVERY - Disk space on srv220 is OK: DISK OK [22:16:31] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 134 MB (1% inode=58%): /var/lib/ureadahead/debugfs 134 MB (1% inode=58%): [22:16:49] AaronSchulz: here's the IE8 report: http://meta.wikimedia.org/wiki/Talk:Wikimedia_maintenance_notice [22:16:58] New patchset: Hashar; "adding .gitreview (again)" [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2596 [22:17:52] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=57%): /var/lib/ureadahead/debugfs 0 MB (0% inode=57%): [22:18:23] New review: Hashar; "(no comment)" [test/mediawiki/core2] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/2596 [22:18:23] New review: Hashar; "(no comment)" [test/mediawiki/core2] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2596 [22:18:23] Change merged: Hashar; [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2596 [22:20:25] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [22:22:16] might be a js loading race condition [22:22:18] * AaronSchulz shrugs [22:22:32] seems to work on soft refresh, then fails on hard refresh again [22:22:58] * AaronSchulz awaits krinkle magic ;) [22:23:25] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 266 MB (3% inode=58%): /var/lib/ureadahead/debugfs 266 MB (3% inode=58%): [22:25:33] Reedy: you beat me to adding $wmfUseRevSha1Columns 
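For context on the new column being enabled here: MediaWiki's `rev_sha1`, as I understand it, holds the SHA-1 of the revision text, base-36 encoded and zero-padded to 31 characters. A sketch under that assumption:

```python
import hashlib

def rev_sha1(text: bytes) -> str:
    """Base-36 SHA-1 of revision text, padded to 31 chars.

    Assumed to match the MediaWiki rev_sha1 convention; shown for
    illustration only.
    """
    n = int(hashlib.sha1(text).hexdigest(), 16)
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out.rjust(31, "0")
```

This also explains the later slave error "Unknown column rev_sha1": writes referencing the column fail on any replica whose schema migration hasn't run yet.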
[22:25:39] heh [22:25:40] RECOVERY - Disk space on srv224 is OK: DISK OK [22:25:40] RECOVERY - Disk space on srv221 is OK: DISK OK [22:25:49] RECOVERY - Disk space on srv223 is OK: DISK OK [22:25:58] RECOVERY - Disk space on srv222 is OK: DISK OK [22:26:32] !log Removing php-1.17 from fenari [22:26:35] Logged the message, Master [22:33:50] !log running ddsh -F5 -cM -g mediawiki-installation 'sudo -u mwdeploy rm -rf /usr/local/apache/common-local/php-1.17' [22:33:52] Logged the message, Master [22:36:48] * AaronSchulz watches the nuke light up the sky [22:38:25] PROBLEM - Puppet freshness on gilman is CRITICAL: Puppet has not run in the last 10 hours [22:38:25] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours [22:40:33] Cool, that looks to have freed half a gig or so disk space from the apaches [22:40:58] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.946 seconds [22:44:35] PROBLEM - MySQL Slave Delay on db34 is CRITICAL: CRIT replication delay 247 seconds [22:44:54] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:46:13] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.516 seconds [22:46:25] !log reedy ran sync-common-all [22:46:28] Logged the message, Master [22:46:40] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 281 MB (3% inode=62%): /var/lib/ureadahead/debugfs 281 MB (3% inode=62%): [22:46:58] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=62%): /var/lib/ureadahead/debugfs 0 MB (0% inode=62%): [22:46:58] Yay, l10n cache spam [22:49:49] RECOVERY - MySQL Slave Delay on db34 is OK: OK replication delay 0 seconds [22:50:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:50:34] RECOVERY - Disk space on srv224 is OK: DISK OK [22:50:52] RECOVERY - Disk space on srv223 is OK: DISK OK [22:51:46] PROBLEM - Disk space 
on srv219 is CRITICAL: DISK CRITICAL - free space: / 31 MB (0% inode=62%): /var/lib/ureadahead/debugfs 31 MB (0% inode=62%): [22:51:46] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 178 MB (2% inode=62%): /var/lib/ureadahead/debugfs 178 MB (2% inode=62%): [22:52:49] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.724 seconds [22:53:13] !log reedy synchronized php-1.19/includes 'r111486' [22:53:15] Logged the message, Master [22:54:22] 7.9G isn't enough!! [22:54:37] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 209 MB (2% inode=62%): /var/lib/ureadahead/debugfs 209 MB (2% inode=62%): [22:55:31] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.770 seconds [22:55:40] RECOVERY - Disk space on srv219 is OK: DISK OK [22:55:49] RECOVERY - Disk space on srv220 is OK: DISK OK [22:55:49] RECOVERY - Disk space on srv221 is OK: DISK OK [22:56:52] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:59:34] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:59:34] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=62%): /var/lib/ureadahead/debugfs 199 MB (2% inode=62%): [23:00:46] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.270 seconds [23:00:46] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.016 seconds [23:00:55] RECOVERY - Disk space on srv219 is OK: DISK OK [23:01:06] New patchset: Pyoungmeister; "some loggin for the lsearchz" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2597 [23:02:52] PROBLEM - MySQL Slave Delay on db34 is CRITICAL: CRIT replication delay 217 seconds [23:04:13] RECOVERY - MySQL Slave Delay on db34 is OK: OK replication delay 0 seconds [23:04:49] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:04:58] PROBLEM 
- Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 85 MB (1% inode=62%): /var/lib/ureadahead/debugfs 85 MB (1% inode=62%): [23:07:13] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.593 seconds [23:07:40] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 154 MB (2% inode=62%): /var/lib/ureadahead/debugfs 154 MB (2% inode=62%): [23:08:06] New review: Pyoungmeister; "manually verifying" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2597 [23:08:07] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2597 [23:08:43] PROBLEM - Host ganglia1001 is DOWN: PING CRITICAL - Packet loss = 100% [23:10:52] RECOVERY - Disk space on srv221 is OK: DISK OK [23:12:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:13:25] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=62%): /var/lib/ureadahead/debugfs 0 MB (0% inode=62%): [23:15:31] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.027 second response time on port 8123 [23:17:19] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 84 MB (1% inode=62%): /var/lib/ureadahead/debugfs 84 MB (1% inode=62%): [23:17:19] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 29 MB (0% inode=62%): /var/lib/ureadahead/debugfs 29 MB (0% inode=62%): [23:17:37] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.043 seconds [23:18:09] !log reedy synchronized php-1.19/includes 'r111486' [23:18:11] Logged the message, Master [23:18:40] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 139 MB (1% inode=62%): /var/lib/ureadahead/debugfs 139 MB (1% inode=62%): [23:19:36] !log reedy synchronized php-1.18/extensions/ArticleFeedbackv5/modules/jquery.articleFeedbackv5/jquery.articleFeedbackv5.js 'r111506' [23:19:39] Logged the message, 
Master [23:21:22] RECOVERY - Disk space on srv221 is OK: DISK OK [23:21:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:21:40] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:24:40] PROBLEM - MySQL Slave Delay on db34 is CRITICAL: CRIT replication delay 201 seconds [23:24:40] RECOVERY - Host ganglia1001 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [23:25:16] RECOVERY - Disk space on srv223 is OK: DISK OK [23:25:25] RECOVERY - Disk space on srv219 is OK: DISK OK [23:25:40] !log reedy synchronized php-1.18/extensions/ArticleFeedbackv5/modules/jquery.articleFeedbackv5/jquery.articleFeedbackv5.js 'r111506' [23:25:42] Logged the message, Master [23:26:02] RECOVERY - MySQL Slave Delay on db34 is OK: OK replication delay NULL seconds [23:26:37] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.203 seconds [23:26:46] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 5.130 seconds [23:27:40] PROBLEM - MySQL Slave Running on db34 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown column rev_sha1 in field list on query. Default d [23:29:19] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 38 MB (0% inode=62%): /var/lib/ureadahead/debugfs 38 MB (0% inode=62%): [23:30:31] RECOVERY - Disk space on srv223 is OK: DISK OK [23:30:40] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:32:01] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.607 seconds [23:32:10] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:33:50] Saibo: :) [23:34:22] hexmode: here! 
[23:34:24] :D [23:34:34] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=62%): /var/lib/ureadahead/debugfs 199 MB (2% inode=62%): [23:34:43] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.309 seconds [23:35:01] Saibo: you could pm me, but we may discover something here [23:35:31] * Saibo likes that more - especially for posting error logs.. you never know if there is personal info in  [23:36:21] 1.4 on enwiki too [23:36:26] hexmode: can you get navpopups installed? the links feel so "dead" without it :D [23:36:32] or I import it via URL [23:36:48] but it would be good to see it with a slower computer [23:36:57] hm.. strange [23:37:03] navpopups coming up [23:37:31] hexmode: mine is slow ;) 1.6 GHz (single core), Pentium M [23:37:45] (you might have remembered already) [23:38:11] I remember Suse ;) [23:38:23] hehe, that too [23:38:26] we should try NS4! [23:38:39] what sexual act is that? [23:38:43] :D [23:39:33] gn8 folks [23:39:39] gut nacht, DaBPunkt! [23:39:40] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 58 MB (0% inode=62%): /var/lib/ureadahead/debugfs 58 MB (0% inode=62%): [23:39:49] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 155 MB (2% inode=62%): /var/lib/ureadahead/debugfs 155 MB (2% inode=62%): [23:40:24] hexmode: why can't I change https://test2.wikipedia.org/w/index.php?title=Special:Stabilization&page=Main_Page ? [23:40:51] are flagged revs not really enabled? [23:41:10] RECOVERY - Disk space on srv220 is OK: DISK OK [23:41:41] hexmode: oh, and hotcat, please - also widely used [23:42:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:43:52] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.961 seconds [23:43:53] navpop [23:43:55] done [23:44:09] Reedy: flagged revs on test2? [23:44:21] very important for dewp ;) [23:44:27] and some others... 
[23:44:29] hrm... not sure he is on [23:44:37] i'm here [23:44:57] do note, test.wikipedia.org is also on 1.19wmf1 [23:45:22] is flaggedrevs on it? [23:45:33] it's enabled on both it seems [23:45:36] it is - but only partly [23:45:37] yes [23:45:43] It's enabled [23:45:46] I cannot flag pages [23:45:48] It's just not configured [23:46:16] RECOVERY - Disk space on srv223 is OK: DISK OK [23:46:21] Saibo: how does it need to be configured [23:46:25] RECOVERY - Disk space on srv224 is OK: DISK OK [23:46:54] hexmode: hm.. well - that way that there is a flagged revs button on the bottom of each article page ;) [23:47:29] Reedy: is it just restricted to a certain category? [23:47:31] but: that may not be the best solution. I think it can be enabled also only for some pages [23:47:49] *looks at dewp* [23:48:27] hmm.. not sure [23:48:34] I only know it enabled fully or not ;) [23:50:07] It has over 40 configuration settings [23:50:09] Saibo: I don't know enough about gadgets, evidently. I've edited -definitions, but they aren't showing up [23:50:46] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:50:52] Reedy: for anything that needs to be configured off-wiki, could you just copy dewp? [23:51:58] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.093 seconds [23:52:10] !log reedy synchronized wmf-config/flaggedrevs.php 'Enable FR like dewiki on test2wiki' [23:52:12] Logged the message, Master [23:54:32] Saibo: better? [23:54:55] no [23:54:59] no change [23:55:06] maybe the admins have no rights for that?
[23:55:10] https://www.mediawiki.org/wiki/Extension:FlaggedRevs#Configuration [23:55:14] hrm [23:55:32] the dewiki config has been copied over [23:55:43] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:55:50] so it's exactly the same [23:56:01] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:56:05] *giving me the right* .. but usually that is included in the admin bit [23:56:22] now! [23:56:25] there it is [23:56:32] the group config is different apparently [23:56:53] Saibo: just gave you editor [23:57:05] the I have it twice now :D [23:57:07] can't give you reviewer [23:57:09] *then [23:58:25] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.696 seconds [23:58:43] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.619 seconds