[00:00:37] I'm pretty sure it has --no-perms [00:00:49] it removes the --perms implied by -a [00:00:51] yeah, i think that's the problem ? [00:01:09] i'm guessing some root dir didn't have perms and it inherited the lack of permissions ? [00:01:13] does that sound plausible ? [00:01:26] drwxr-xr-x 14 mwdeploy mwdeploy 4096 2012-02-13 23:07 . [00:01:38] no, rsync runs as mwdeploy [00:02:11] the source directory on NFS has all sorts of crazy permissions, whatever the wikidev group member decided to set them to [00:02:47] running with --no-perms removes the group writable bits and makes the permissions correct for deployment regardless of the permissions on the source [00:03:16] ok [00:03:37] so how do you think that everything could have gotten switched to 700 ? [00:03:46] umask must have been wrong [00:05:08] how do we set it so that we can fix it plus in the future this doesn't happen again? [00:07:11] reedy: how did you run the script? [00:08:49] which script when? [00:08:57] sync-common-all? [00:09:24] !log reedy ran sync-common-all [00:09:27] * AaronSchulz wonders wtf refreshWikiversionsCDB is in files/misc [00:09:59] t'was probably meant to be a symlink [00:10:04] sync-comon-all foobar [00:10:16] I'm not sure there is any other way to run it..? [00:13:49] where do you see this 700 mode? [00:13:52] what server? [00:14:16] fenari /usr/local/apache/common/php-1.19 [00:14:34] includes, maintenance, mw-config, resources, skins... [00:14:54] Exactly the same on srv190 [00:15:47] but not on srv300 [00:15:59] it looks like rsync died halfway through [00:16:07] Multiple times? 
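[Editor's note] The "umask must have been wrong" theory is easy to check: rsync consults the process umask only when it creates a directory, and a umask of 077 at creation time yields exactly the drwx------ (700) directories described above. A minimal sketch, with a hypothetical directory name:

```python
import os
import stat
import tempfile

def mkdir_under_umask(umask):
    """Create a directory the way rsync would (requested mode 0777,
    filtered by the process umask) and return the resulting permission bits."""
    old = os.umask(umask)
    try:
        root = tempfile.mkdtemp()
        path = os.path.join(root, "php-1.19")  # hypothetical name, for illustration
        os.mkdir(path, 0o777)
        return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.umask(old)

print(oct(mkdir_under_umask(0o022)))  # 0o755 -- the expected drwxr-xr-x
print(oct(mkdir_under_umask(0o077)))  # 0o700 -- the broken drwx------
```

Existing directories keep whatever mode they were created with, which matches the later observation that rsync won't change permissions once a directory exists.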
[00:16:16] I've run scap a few times over the weekend as I've been changing things [00:16:37] I suppose if ti stopped for some error, of course it'd do the same later [00:16:48] once the directory is created, rsync won't change its permissions [00:16:55] Ah [00:17:05] it's only for creations that it looks at umask [00:17:12] Mixed permissions would explain why test2wiki loads on some requests [00:17:52] crap... [00:18:02] /usr/local/bin/sync-common-all has no fanout option [00:18:14] /usr/local/bin/scap has it [00:18:33] We really need to get all these scripts into line sometime [00:19:02] I already merged sync-common-all and scap [00:19:14] well, on the remote side [00:19:38] I figured it was pointless having both of them so I just got rid of sync-common, making it an alias to scap-1 [00:20:29] yeah [00:21:05] Why is sync-common in /usr/bin and sync-common-all in /usr/local/bin [00:21:08] I shouldn't probably ask that [00:21:29] because sync-common comes from a package and sync-common-all comes from puppet [00:21:52] the package is wikimedia-task-appserver [00:23:31] New patchset: Tim Starling; "Limit fanout like in scap, to avoid overloading the NFS server." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2570 [00:24:07] New patchset: Lcarr; "Generating initcwnd.erb with both default gateway and default interface fact" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2571 [00:24:20] hello [00:24:29] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2570 [00:24:29] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2570 [00:24:29] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2570 [00:24:29] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2570 [00:24:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2571 [00:24:49] could u help me, where disappeared export to pdf? [00:25:00] Disabled due to performance issues [00:25:15] we pushed it out of a plane [00:25:42] not connected with rights issues? [00:25:50] No [00:26:08] ok then...=) [00:26:22] will it be back once?.. mb?..) [00:26:35] neworldemancer: we're working out a plan for that now [00:26:56] * Reedy wonders if software has any rights after it's been pushed out of a plane [00:27:15] ok, 10x 4 info! [00:27:37] Reedy: it's the ultimate in free software [00:28:52] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2571 [00:28:53] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2571 [00:29:03] anyway if I were you I'd just delete those php-1.19 directories and run sync-common-all again [00:29:10] with -F30 it should hopefully work this time [00:29:21] makes sense [00:29:38] the server gets overloaded if it has too many concurrent clients [00:29:39] Aaron and I won't have the permissions to do that will we? [00:29:53] yeah, you can run any command as mwdeploy [00:30:02] oh [00:30:05] fair enough [00:30:12] sudo -u mwdeploy rm -rf ... 
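[Editor's note] The `-F30` fanout flag caps how many concurrent rsync clients hit the NFS server at once. A rough stand-in for what `dsh -F5 -g mediawiki-installation <cmd>` arranges, sketched with made-up host names and a dummy task:

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time

def run_with_fanout(hosts, task, fanout=5):
    """Run task(host) on every host with at most `fanout` running at once,
    roughly the concurrency cap dsh's -F flag provides."""
    peak = active = 0
    lock = threading.Lock()

    def tracked(host):
        nonlocal peak, active
        with lock:
            active += 1
            peak = max(peak, active)
        try:
            return task(host)
        finally:
            with lock:
                active -= 1

    with ThreadPoolExecutor(max_workers=fanout) as pool:
        results = list(pool.map(tracked, hosts))
    return results, peak

hosts = [f"srv{n}" for n in range(20)]
results, peak = run_with_fanout(hosts, lambda h: (time.sleep(0.02), h)[1])
print(peak <= 5)  # True: never more than 5 concurrent "rsyncs"
```

Without the cap, all twenty workers would start at once, which is the "too many concurrent clients" overload mentioned above.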
[00:30:49] * AaronSchulz waits for Reedy to delete all of fenari [00:30:59] * Reedy glares at AaronSchul [00:31:03] z [00:31:20] or nfs at least :) [00:32:10] * neworldemancer by [00:32:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2556 [00:33:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2556 [00:33:50] so: ddsh -cM -g mediawiki-installation -o -oSetupTimeout=10 'rm -rf /usr/local/apache/common/php-1.19' [00:34:13] with the sudo first [00:34:18] which I seem to have lost somewhere [00:35:26] well that's a no [00:35:54] oh, sudo locally [00:36:00] as in the target machine [00:37:32] !log killed /usr/local/apache/common/php-1.19 from apaches [00:37:34] Logged the message, Master [00:41:22] * AaronSchulz wonders why getUserBlockErrors takes a user param but has a $result cache [00:41:31] in GlobalBlocking [00:42:19] Blame werdna [00:42:53] $ip_pattern = substr( $hex_ip, 0, 4 ) . '%'; [00:42:57] makes less sense for v6 [00:44:12] New patchset: Lcarr; "temp commenting out config file until new facter script propogates" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2572 [00:44:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2572 [00:45:06] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2572 [00:45:07] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2572 [00:51:41] NFS seems unhappy [00:54:00] yes, i reset nfs1's drac controller to try and get console acces [00:55:10] yeah... ganglia doesn't have happy graphs for nfs1 [00:57:57] I saw many only half or 15% displaying thumbnails at commons today and yesterday. What is that? When will it be fixed? [00:58:13] probably due to swift deploy?! 
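[Editor's note] On the GlobalBlocking snippet above: addresses are stored as fixed-width hex, so a 4-hex-character LIKE prefix always pins down 16 bits. For IPv4 (8 hex chars) that is a /16; for IPv6 (32 hex chars) the same prefix leaves 112 bits free, which is why it "makes less sense for v6". The arithmetic:

```python
# Each hex character carries 4 bits, so a 4-character prefix fixes 16 bits.
PREFIX_HEX_CHARS = 4
prefix_bits = PREFIX_HEX_CHARS * 4

ipv4_matched = 2 ** (32 - prefix_bits)    # addresses an IPv4 /16 covers
ipv6_matched = 2 ** (128 - prefix_bits)   # addresses the same prefix covers in v6

print(ipv4_matched)  # 65536
print(ipv6_matched == 2 ** 112)  # True -- absurdly coarse for range scans
```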
[00:58:32] purging the file page fixed it in all cases [00:58:57] but it is confusing users https://commons.wikimedia.org/wiki/Commons:Help_desk#Brussels_Airlines_destinations.png_laad_niet_volledig and is simply creating unnecessary "purge" work [01:03:39] Saibo: the pediapress people decided to change their default image size from 1200 to 1199 [01:03:49] because they didn't like caching or something [01:03:55] Seriously? :/ [01:04:12] so the image scalers and thumb backend server were overloaded until we disabled the collection extension [01:04:15] it should be OK now [01:04:43] Reedy: did you break nfs1? [01:04:54] I think so [01:04:56] it was segfaulting like mad [01:05:00] what did you do Reedy! [01:05:10] Just trying to use it [01:05:14] You know, to share files [01:05:17] Like it's designed for [01:05:27] that is not proper usage of a nfs server! [01:05:33] store files in labs ;) [01:05:48] Gmail might be a better option [01:05:52] that's got loads of space [01:06:39] Is nfs1 back up? [01:06:55] "Stale NFS file handle" on fenari atm [01:07:30] TimStarling: and the pediapress servers did query millions of thumbs then? Not only on pdf creation? [01:08:00] just on PDF creation, it was enough though [01:08:14] and: why are the thumbs only partly available? note: it apparently differs depending on which cache you hit.. some people also get the full thumbs of the same file [01:08:35] Reedy: we may have to reboot it [01:08:41] and what is with the files already broken? do they all need a manual purge? [01:08:47] or does waiting also solve [01:08:48] ? [01:09:07] that's a separate issue [01:09:13] I haven't looked into it yet [01:09:20] hm. okay [01:09:35] btw: I can't imagine that so many pdfs are created... [01:09:38] it's probably the issue that caused pediapress to change their default thumbnail size [01:10:16] any objections to rebooting fenari? 
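[Editor's note] The pediapress change (1200 → 1199) hurt because thumbnails are cached per (file, width): a width nobody else requests misses the cache for every file and sends each render to the image scalers. A toy model of the effect (names and cache shape are illustrative, not MediaWiki's actual code):

```python
renders = 0
cache = {}

def get_thumb(name, width):
    """Return a (fake) thumbnail, paying the render cost only on a cache miss."""
    global renders
    key = (name, width)
    if key not in cache:
        renders += 1                      # the expensive image-scaler work
        cache[key] = f"{width}px-{name}"
    return cache[key]

files = [f"File_{n}.png" for n in range(1000)]

for f in files:
    get_thumb(f, 1200)   # readers already requested these: warms the cache
after_warm = renders

for f in files:
    get_thumb(f, 1199)   # one pixel off: every single request is a fresh render
print(renders - after_warm)  # 1000 -- a full re-render of the batch
```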
[01:10:19] okay, thanks for the preliminary info [01:10:36] Doesn't look to be doing very much [01:10:40] (fenari) [01:10:48] nope [01:10:52] no objection here [01:11:20] basically we would have to unmount /home and then remount it [01:11:33] but first I would have to kill every process with an open filehandle under /home [01:11:43] it's easier to just reboot, it's a lot of mucking around and it doesn't always work [01:11:53] TimStarling: if we reboot fenari, can you do the security uipgrades first ? [01:11:58] too late [01:12:00] apt-get upgrade ? [01:12:00] oh [01:12:03] :p [01:12:16] I can do it when it comes back up [01:12:18] just do it again afterwards [01:12:32] !log rebooted fenari to fix stale NFS file handle [01:12:34] Logged the message, Master [01:12:42] grumblegrumblegrumble. [01:12:46] hate on rebooting fenari [01:12:51] I had shit running there. [01:12:51] Tough! [01:13:19] you don't actually have to kill all those processes. [01:13:31] a forced remount can work. [01:14:05] hrmph. [01:14:51] !log doing apt-get upgrade on fenari [01:14:52] Logged the message, Master [01:14:57] don't restart your stuff just yet [01:15:24] The following packages have been kept back: [01:15:25] linux-image-server linux-server [01:15:41] LeslieCarr: are those the security updates you wanted? [01:15:56] yeah [01:16:33] i just want to take every rebooting opportunity to do all the updates possible [01:17:32] I'm not sure how to make the kernel packages update [01:17:55] apt-get install works [01:19:03] and it even updated menu.lst without a conflict, imagine that [01:19:13] ok, rebooting again [01:19:27] !log rebooting fenari for kernel upgrades [01:19:29] Logged the message, Master [01:25:23] Reedy: how far did you get with the delete/resync? [01:25:43] AaronSchulz: not that far based on all the errors [01:27:05] did the nfs1 crash happen while you were running sync-common-all? 
[01:27:08] Yup [01:28:15] I was just going to ask whether it was worth deleting the /usr/local/apache/common/php-1.19 folders again.. [01:29:19] probably best to delete them, to be on the safe side [01:29:34] That's what I was thinking, as no idea where it got to [01:29:40] then run the dsh command from sync-common-all, but reduce the fanout even further [01:29:40] Do I need to worry about killing nfs1 again? [01:29:46] yes [01:29:55] use dsh -F5 or something [01:33:36] I'm getting password prompts for some servers, but not for others [01:35:25] srv193, hume, searchidx2 [01:35:39] they probably need /home remounted [01:35:53] ah, yeah, that'd make sense [01:36:32] they'll likely need to be rebooted [01:37:18] hm. I guess a force remount will work [01:37:24] well, if we do any more rebooting, the normal request of do an upgrade first ;) [01:38:11] root@hume:/# fuser -m /home [01:38:11] Cannot stat /home: Stale NFS file handle [01:38:20] that's not meant to happen is it? [01:38:28] it happens, unfortunately [01:38:42] force umount doesn't work [01:38:46] I guess I can lazy unmount [01:39:03] ok. remounted [01:39:07] I hate doing lazy umounts [01:39:09] lazy unmount just means whatever process it is that still has that filehandle open will be subtly broken until it restarts [01:39:17] yes [01:39:39] I wonder which process uses it [01:39:58] doing a lazy umount lets you remount, then restart affected processes though [01:40:38] though, honestly, I have no clue what was holding it open. [01:40:40] * Ryan_Lane sighs [01:41:11] LeslieCarr: you've been doing apt-get upgrades on the hosts? [01:41:35] I'm going to reboot hume. so, I'll upgrade while I'm at it [01:41:54] on the hosts that are going to be rebooted - i haven't on those specific hosts [01:42:22] I'll do searchidx2 [01:42:23] eh? [01:42:25] Reedy: how are you holding up? 
[01:42:43] robla: waiting for computers to stop sucking [01:42:49] I think it's going to be a while [01:42:55] yeah, I think so [01:43:03] oh...wait [01:43:09] nope, nm...still suck [01:43:12] ugh. how do you make apt not skip the kernel again? [01:43:20] ah [01:43:22] dist-upgrade [01:43:31] I just used apt-get install [01:44:00] rather than installing all upgrades? that likely would have been a better approach [01:44:14] didn't fred have some method of only installing the security updates? [01:44:18] !log rebooting hume [01:44:20] Logged the message, Master [01:44:27] I ran apt-get upgrade first, then I ran apt-get install on the packages that it said were held back [01:44:31] ahh. ok [01:44:36] though I'd really love to upgrade all our servers to 2.6.33 :) but i figure we can wait until pangolin [01:44:38] dist-upgrade will do the ones held back [01:44:46] 2.6.33 allows you to change initrwnd :) [01:44:53] heh [01:45:02] yeah. likely good to wait until precise [01:45:05] it'll be out soon [01:45:07] !log on searchidx2: doing apt-get upgrade and rebooting [01:45:09] Logged the message, Master [01:45:16] will santa bring it for us ? [01:45:38] heh. well, if it's anything like the lucid upgrade, yes, but it'll take santa a couple years to get there [01:46:12] damn reindeers are so slow! [01:46:15] robla: when these 3 hosts have been rebooted, I'll try again [01:46:52] LeslieCarr: did you not see Arthur Christmas? Santa has a massive space ship thing now [01:47:05] someone already did srv193? [01:47:26] i just did srv193 [01:47:28] not rebooted tho [01:47:37] want me to reboot it? [01:47:41] sure! 
[01:47:42] reboot party [01:47:47] !log rebooting srv193 [01:47:48] Logged the message, Master [01:50:48] Thanks [01:52:20] !log running ddsh -F30 -cM -g mediawiki-installation /usr/bin/sync-common [01:52:22] Logged the message, Master [01:53:14] !log when rebooting hume I also applied security updates [01:53:16] Logged the message, Master [01:53:24] Reedy: what happened to -F5? [01:53:58] I pasted the command from sync-common-all and altered it on the command line [01:54:03] then just paste it again to log it [01:54:12] !log Make that ddsh -F5 [01:54:14] Logged the message, Master [01:58:09] searchidx2: PHP Warning: PHP Startup: Unable to load dynamic library '/usr/lib/php5/20090626/php_wikidiff2.so' - /usr/lib/php5/20090626/php_wikidiff2.so: cannot open shared object file: No such file or directory in Unknown on line 0 [01:58:16] I guess we don't care about that too much? [01:58:22] being searchidx2 [01:59:33] it's not an urgent problem [02:00:33] Reedy: did it finish? [02:00:40] it's still going [02:00:48] more happily [02:03:18] srv199: rsync: mkstemp "/usr/local/apache/common-local/php-1.18/cache/l10n/.l10nupdate-zh-hk.cache.mGIG9H" failed: Permission denied (13) [02:03:25] quite a few of errors like that on srv199 [02:04:30] and numerous other servers too [02:04:47] I thought Roan fixed all of those, seemingly not [02:10:58] ! log finished running ddsh -F5 -cM -g mediawiki-installation /usr/bin/sync-common [02:11:02] robla: ^ [02:11:07] Seems to work fine on reload [02:11:12] * robla looks [02:11:34] Might be some localisation issues [02:11:51] w00t [02:12:01] all english [02:12:15] or not [02:12:30] anyway, no localisation cache files [02:12:56] TimStarling: any suggestion on the best fix for "Notice: Undefined variable: wgUseNormalUser in /home/wikipedia/common/wmf-config/AdminSettings.php on line 7" ? 
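[Editor's note] The mkstemp errors above are about the directory, not the file: rsync stages each transfer as a dot-prefixed temp file in the destination directory and renames it into place, so the directory must be writable by the rsync user (mwdeploy) even when the existing cache file is owned by someone else. A small illustration with a stand-in path:

```python
import os
import stat
import tempfile

root = tempfile.mkdtemp()
l10n_dir = os.path.join(root, "l10n")      # stands in for .../cache/l10n
os.mkdir(l10n_dir, 0o500)                  # owner may read/traverse, not write

mode = stat.S_IMODE(os.stat(l10n_dir).st_mode)
print(oct(mode))  # 0o500

# mkstemp in a non-writable directory fails with EACCES, which surfaces as
# rsync's `mkstemp ... failed: Permission denied (13)`.
# (root bypasses the permission check, so try this as an ordinary user.)
if os.geteuid() != 0:
    try:
        tempfile.mkstemp(dir=l10n_dir, prefix=".l10nupdate-")
    except PermissionError as e:
        print(e.errno)  # 13
```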
[02:13:38] Seems to have gone from Maintenance.php since 1.18 [02:14:08] yay, hook errors [02:15:10] before we dive too deeply into those issues: [02:15:20] TimStarling: are you planning to reenable Collections? [02:15:32] Seems every N request you'll get an error 500 [02:16:22] hrm...rsync problem? [02:16:32] brb [02:16:32] Would be suprising [02:19:12] !log reedy synchronized wmf-config/ExtensionMessages-1.19.php 'Remove variablepage' [02:19:14] Logged the message, Master [02:23:05] from #wikimedia-operations : (06:10:11 PM) Philippe: Hi - CT asked me to report here… I have two independent reports of 502 errors when saving edits on wikis. I'm not sure if it's important, or normal :) [02:25:33] !log LocalisationUpdate completed (1.18) at Tue Feb 14 02:25:32 UTC 2012 [02:25:33] !log LocalisationUpdate failed (1.19) at Tue Feb 14 02:25:33 UTC 2012 [02:25:34] Logged the message, Master [02:25:36] Logged the message, Master [02:25:58] !log reedy synchronized php-1.19/extensions/FundraiserLandingPage/ [02:26:00] Logged the message, Master [02:26:49] !log reedy synchronized php-1.19/extensions/VisualEditor [02:26:50] Logged the message, Master [02:30:29] Looks to be the fatals tidied up [02:32:14] Feb 14 02:31:57 10.0.11.17 apache2[8854]: PHP Fatal error: Allowed memory size of 125829120 bytes exhausted (tried to allocate 523800 bytes) in /usr/local/apache/common-local/php-1.19/extensions/LiquidThreads/LiquidThreads.php on line 14 [02:35:27] !log reedy synchronizing Wikimedia installation... : Rebuild messages [02:35:29] Logged the message, Master [02:36:55] TimStarling: are you planning to reenable Collections? [02:37:02] yes, I'll do it now [02:37:23] ah, there you are! :) thanks! 
[02:37:32] was just having lunch [02:37:34] * robla was just mulling making the commute home [02:38:11] !log tstarling synchronized wmf-config/InitialiseSettings.php 're-enabling the collection extension' [02:38:13] Logged the message, Master [02:38:40] !log reedy synchronizing Wikimedia installation... : Rebuild messages [02:38:42] Logged the message, Master [02:38:51] #wikimedia-operations [02:39:08] Yum [02:39:11] Errors errors errors [02:39:36] hi Risker... are you one of Philippe's sources for 502 errors? :) [02:39:43] robla, yes I am [02:39:47] TimStarling: any suggestion on the best fix for "Notice: Undefined variable: wgUseNormalUser in /home/wikipedia/common/wmf-config/AdminSettings.php on line 7" ? [02:39:50] on the arbwiki [02:39:56] does anything use that? [02:40:05] just adminsettings seemingly [02:40:08] * TimStarling greps [02:40:22] Notice: Undefined variable: wgDBuser in /home/wikipedia/common/php-1.19/extensions/ContributionReporting/ContributionReporting.php on line 28 [02:40:22] PHP Notice: Undefined variable: wgDBpassword in /home/wikipedia/common/php-1.19/extensions/ContributionReporting/ContributionReporting.php on line 29 [02:40:25] robla, it occurred once and has not happened again [02:40:33] I'm presuming that that line is causing those among others [02:41:13] Risker: ok...that's encouraging. we've been doing a lot of rebooting and other things that would kick up dust, so a couple of 502s aren't too alarming if they have stopped happening [02:41:44] !log reedy synchronized php-1.19/extensions/Contest/Contest.php 'Comment out stupid die for the moment' [02:41:46] Logged the message, Master [02:43:02] it used to be used by runJobs.php [02:43:18] Which is now not the case? 
[02:43:19] ok...this car alarm right outside my window is telling me I should make the commute home [02:43:22] apparently that was broken in 1.16 [02:43:35] with the maintenance rewrite [02:46:06] $wgUseRootUser also used to work [02:46:37] that was broken in 1.17 [02:46:57] anyway I'll just take it out [02:47:10] Thanks [02:48:00] !log tstarling synchronized wmf-config/AdminSettings.php 'remove $wgUseRootUser and $wgUseNormalUser, broken since 1.17 and 1.16 respectively' [02:48:02] Logged the message, Master [02:48:29] !log reedy synchronizing Wikimedia installation... : Rebuild messages [02:48:31] Logged the message, Master [02:48:52] http://p.defau.lt/?LjpPK3308Ni2BhSWVc61wQ [02:48:53] Lovely [02:50:57] strange [02:51:42] sync done. [02:52:57] Who is supposed to own the l10n cache files? [02:53:01] maybe a missing DefaultSettings.php [02:53:06] mwdeploy [02:53:40] some servers it is l10nupdate:l10nupdate [02:53:55] that is wrong [02:54:08] Sounds like the source of the srv224: rsync: mkstemp "/usr/local/apache/common-local/php-1.18/cache/l10n/.l10nupdate-or.cache.3yJziM" failed: Permission denied (13) etc [02:54:12] where is the cron job? 
I thought it was on hume [02:54:24] I think it might have been puppetised [02:55:56] the scripts are in puppet, I don't see the cron job [02:56:40] ah, it's on fenari [02:57:17] no, commented out [02:57:20] the search continues [02:57:27] http://wikitech.wikimedia.org/view/LocalisationUpdate says it's on fenari [02:57:36] yeah, in l10nupdate's user crontab [02:57:42] puppet likes user crontabs for some reason [03:01:15] I fixed this once already, but it looks like it has been very carefully unfixed [03:02:46] * Aaron|home adds that to quips [03:07:17] I fixed it on July 12, the day after I introduced the mwdeploy user, it says so in the server admin log [03:07:41] but it's broken in the first version Roan checked in to git, in October [03:11:47] alright....now I really mean it about going out the door [03:11:59] I'll be back online later. [03:13:31] New patchset: Tim Starling; "Fixed sync-l10nupdate again." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2573 [03:13:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2573 [03:15:05] Aaron|home: LU doesn't like FR again [03:15:17] Cannot get the contents of http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/FlaggedRevs/presentation/language/ConfiguredPages.i18n.php (curl) [03:15:22] among many others [03:16:02] I wonder.. 
[03:16:41] Yup [03:17:44] !log reedy synchronized wmf-config/ExtensionMessages-1.19.php 'Fix fr message file locations' [03:17:46] Logged the message, Master [03:19:17] link frontend [03:19:30] I just updated the paths [03:19:33] they're the same as in trunk [03:19:36] meh [03:19:54] We can tidy the mess up soon ;) [03:20:08] LU is sloooow [03:20:43] SVN + NFS...like a moped hauling an RV [03:22:24] I should probably go to bed soon [03:22:56] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2573 [03:22:57] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2573 [03:22:59] Sleep? You don't need that >.> [03:27:45] are you running LU now? [03:28:03] No [03:28:11] I have to run it to test my change [03:28:22] Feel free [03:28:29] I see a fair amount of cdb reader errors in exception.log [03:28:36] but on php-1.18 [03:30:45] I think there's a bug about that [03:31:07] Aye [03:31:16] Just suprising how often some of them appear [03:36:42] aaron cleared profiling data [03:40:23] Where are we supposed to look for more information about error 500s? [03:40:28] We have too many log files [03:40:41] PHP Fatal error: Class 'LocalisationCache' not found in /usr/local/apache/common-local/php-1.19/languages/Language.php on line 274 [03:40:58] Nice [03:41:12] PHP Fatal error: Call to a member function setTimestamp() on a non-object in /usr/local/apache/common-local/php-1.18/extensions/FeaturedFeeds/FeaturedFeeds.body.php on line 257 [03:41:44] 257 is a blank line [03:42:36] Aaron|home: which log is that? 
[03:43:27] * Aaron|home looks at FF [03:43:30] probably "return $dt->getTimestamp();" [03:43:41] It says setTimestamp() ;) [03:43:43] gah, set not get [03:43:46] On a ParserOptions object [03:43:48] New patchset: Tim Starling; "Added sudoers rule for l10nupdate -> mwdeploy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2574 [03:43:51] but it's supposed to be set int he constructor [03:44:00] self::$parserOptions->setTimestamp( $date ); [03:44:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2574 [03:44:18] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2574 [03:44:39] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2574 [03:44:39] aye, and the constructor does if ( !self::$parserOptions ) { self::$parserOptions = new ParserOptions(); [03:44:52] Get Max to fix it later [03:44:58] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2574 [03:48:10] gn8 folks [03:51:19] Original exception: exception 'MWException' with message 'Detected bug in an extension! Hook FeaturedFeeds::beforePageDisplay failed to return a value; should return true to continue hook processing or false to abort.' 
in /usr/local/apache/common-local/php-1.19/includes/Hooks.php:245 [03:51:23] PHP Fatal error: Class 'FlaggedRevsSetup' not found in /usr/local/apache/common-local/php-1.19/extensions/FlaggedRevs/FlaggedRevs.php on line 45 [03:51:35] hmm, a lot of these seem stochastic [03:51:58] Indeed [03:52:26] do you want wmerrors re-enabled [03:52:33] I can re-enable it, it just segfaults a lot [03:53:21] I'm going to bed in a few [03:54:22] PHP Fatal error: Cannot override final method DatabaseBase::newFromType() in /usr/local/apache/common-local/php-1.19/includes/db/DatabaseMysql.php on line 16 [03:54:24] lol [03:54:31] wut [03:55:13] Does 1.19 contain a random error generator? [03:58:22] we had APC issues last time too, IIRC [03:59:25] * Aaron|home was suspecting apc [03:59:30] <^demon> newFromType()? I thought we renamed that to factory() like it should've been to begin with? [03:59:35] <^demon> Or did that not make it into 1.19? [03:59:39] It's still there [03:59:48] * @deprecated since 1.18 [04:00:05] 4 usages in trunk [04:00:09] 4 extensions we don't run [04:01:46] <^demon> TimStarling: I apologize for picking such a stupid function name. [04:02:04] TimStarling: http://en.wikipedia.org/wiki/User_talk:Jimmy_Wales is giving me old content again. [04:02:14] It's missing edits. [04:02:29] actual edits this time? or template edits? [04:02:33] Actual edits. [04:02:45] Someone added a section for valentine's day. [04:02:53] It's visible through https. [04:03:22] it's visibel through HTTP, for me [04:03:27] Hmm, got purged. http://en.wikipedia.org/wiki/User_talk:Jimmy_Wales is displaying the proper content now. [04:03:40] bzzt too late [04:03:42] I swear I'm not crazy. :-( [04:03:47] try again next time [04:04:07] next time grab the response headers really fast before someone purges it [04:05:03] I guess nobody has worked on http redirection to https? You'd think that would annoy more people than just me. Or everyone is using Firefox with HTTPS-Everywhere... 
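[Editor's note] The "failed to return a value" exception above comes from MediaWiki's hook contract: a handler must return true to let the chain continue or false to abort, and returning nothing (null) is flagged as an extension bug. A Python sketch of the same contract (simplified; real MediaWiki hooks also accept string error returns):

```python
class HookError(Exception):
    pass

def run_hooks(name, handlers, *args):
    """True continues, False aborts, anything else (including None) is a bug --
    the behaviour the Hooks.php exception above enforces."""
    for handler in handlers:
        result = handler(*args)
        if result is False:
            return False          # a handler legitimately aborted processing
        if result is not True:
            raise HookError(
                f"Hook {name} failed to return a value; should return true "
                "to continue hook processing or false to abort."
            )
    return True

def good(page):
    return True

def buggy(page):
    pass                          # forgets to return -- like the report above

print(run_hooks("BeforePageDisplay", [good], "Main_Page"))  # True
try:
    run_hooks("BeforePageDisplay", [good, buggy], "Main_Page")
except HookError as e:
    print("caught:", e)
```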
[04:06:18] Not everyone can have HTTPS-Everywhere =( chrome has nothing like that [04:06:32] (well it has one, but it has security issues according to the EFF making it useless) [04:07:27] we plan to do some more HTTPS polish once we hire our Software Security Engineer [04:08:15] Reedy: you know, you don't have to stay up until every last problem is fixed, right? [04:08:36] I know [04:08:40] I just get distracted from sleeping [04:08:42] And eating [04:08:48] Happens quite often [04:09:15] I'm not the only one ;) [04:09:48] true enough [04:10:17] sometimes people need to be told when to go to bed [04:10:22] Yup [04:10:25] and we're telling you [04:10:40] Living at home helps with the eating, I get shouted at around 6pm for food [04:11:18] at home? at my parents [04:11:19] bleh [04:11:20] * Reedy goes away for a few hours [04:11:29] goodnight! [04:11:36] night! [04:11:47] (mornin') [04:13:22] Does purging a file description page still purge the file's thumbs? https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/LCA_final.png/500px-LCA_final.png seems stuck. [04:13:46] who knows [04:13:53] ;) [04:13:57] details like that tend to get broken when new systems are deployed [04:14:57] There was some Swift media something deployed, right? [04:15:39] yeah, Aaron|home knows about it [04:16:33] it's cached but there's no file in /mnt/thumbs/wikipedia/commons/thumb/a/a2/LCA_final.png [04:17:09] No file at all? There seems to be a partial image somewhere. [04:17:30] yeah, cached [04:17:46] partial image cached, source file purged [04:18:01] when that happens, subsequent purges won't do anything [04:20:53] what maplebed showed me earlier is that the thumb should be correct on NFS, but is partial in Swift [04:21:02] [0420][tstarling@fenari:/home/wikipedia/conf/squid]$ host 10.2.1.27 [04:21:02] 27.1.2.10.in-addr.arpa domain name pointer ms-fe.svc.pmpta.wmnet. [04:21:02] [0420][tstarling@fenari:/home/wikipedia/conf/squid]$ ssh root@ms-fe.svc.pmpta.wmnet. 
[04:21:02] ssh: Could not resolve hostname ms-fe.svc.pmpta.wmnet.: Name or service not known [04:21:13] how hard can it be to keep forward and reverse DNS in sync? [04:23:42] oh, hrm, I get what you're saying now [04:26:39] New patchset: Demon; "Revert 682b27, was a stupid change. Just adding something like" [operations/software] (master) - https://gerrit.wikimedia.org/r/2575 [04:28:02] Content-Length: 16384 [04:28:24] suspicious [04:28:54] robla: speaking of powers of two, I read Snow Crash on the way home from SF [04:29:06] I remember you saying it was the inspiration for Second Life [04:29:12] heh...yeah, it was [04:29:55] it's a very dark book, I thought it was interesting that someone could look at such a dark vision of the future and find something they actually wanted in it ;) [04:31:07] yeah, I think Philip just thought it was cool [04:31:15] bbiab [04:31:53] now, why is it that we want to purge this file and not just switch off swift? [04:32:26] because usually if a new piece of software is so broken, we roll back its deployment [04:38:39] are there any other test cases? [04:41:01] back now....here's the theory that maplebed told me [04:42:06] TimStarling: he thinks it was his initial population script rather than the current stuff running in production that's the problem [04:42:11] * Aaron|home doesn't trust Put_object_chunked or any of that client/rewrite PUT code [04:43:34] he thinks the first version was copying the thumbs into place too soon (before the image was generated) [04:43:56] swift appears to be sending an invalid Last-Modified header for new images [04:44:46] what's it sending? [04:44:58] * robla is configuring wireshark on his machine now [04:44:59] Last-Modified: None [04:45:05] * Aaron|home wonders about squid cache validation checks [04:46:24] did someone just break my test case? 
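[Editor's note] The ssh failure above is the classic symptom: a PTR record (10.2.1.27 → ms-fe.svc.pmpta.wmnet) with no matching forward A record. Once both zones are in one place the consistency check is mechanical; a sketch, with illustrative zone data (the fenari address here is made up for the example):

```python
def ptr_without_forward(forward, reverse):
    """forward: name -> set of IPs (A records); reverse: ip -> name (PTR).
    Return the IPs whose PTR name does not resolve back to that IP."""
    return sorted(
        ip for ip, name in reverse.items()
        if ip not in forward.get(name, set())
    )

forward = {"fenari.wikimedia.org": {"208.80.152.100"}}
reverse = {
    "208.80.152.100": "fenari.wikimedia.org",   # consistent pair
    "10.2.1.27": "ms-fe.svc.pmpta.wmnet",       # PTR exists, A record missing
}
print(ptr_without_forward(forward, reverse))  # ['10.2.1.27']
```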
[04:46:43] * robla didn't [04:46:45] that LCA image started working [04:47:43] anyway, whether or not new broken images are being written, if the old images are broken then we need to fix them [04:49:15] yeah, Ben's going to work on that tomorrow [04:50:48] and you don't think it's a problem for these images to be broken until then? [04:51:06] is there something about swift which makes its deployment difficult to roll back? [04:52:35] TimStarling: how widespread is the image breakage? [04:53:10] in theory, these things have been broken since last week, and we're only just now noticing [04:54:04] most people seem to just purge them without reporting them [04:54:16] I'm looking for PDFs in particular that appear to be broken, and I'm not finding anything [04:54:40] there is a report here: http://commons.wikimedia.org/wiki/Commons:Village_pump#Half_an_image [04:54:45] thankfully not purged yet [04:55:35] ah, yup [04:57:35] and we had a report from Saibo earlier, and another one on IRC about a day ago [04:57:42] TimStarling: you saw this, right: http://wikitech.wikimedia.org/view/User:Bhartshorne/pdf_thumbnail_issue [04:58:22] I had not seen that [04:58:50] in particular, look at the 1200px version [04:59:54] the one I'm looking at is also 139264 bytes [05:00:10] orly [05:01:29] not a power of two :-D [05:02:28] but it is a multiple of 8192 [05:03:15] so....that would actually make some sense if it's a partial encode like Ben was suggesting, if the scaler is writing out in 8192 byte chunks [05:04:29] well, come to think of it, there's no good reason to leave swift running, other than maybe if there's some new reason why NFS would tip over [05:04:46] robla, the upgrade tonight, did it include mailing list servers? [05:04:56] Risker: nope [05:05:26] hmmm. We've just had an email disappear from our main mailing list. Will put in a bugzilla, I guess. [05:05:58] Risker: from the archive or never got delivered? 
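[Editor's note] The size being an exact multiple of the write chunk is a usable heuristic for spotting truncated thumbnails: a scaler or copy loop that dies mid-stream stops on a chunk boundary, while a complete image rarely lands on one. A sketch:

```python
def looks_truncated(size_bytes, chunk=8192):
    """Heuristic only: a partial write through fixed-size chunks leaves a
    length that divides evenly by the chunk size. Not proof of corruption."""
    return size_bytes > 0 and size_bytes % chunk == 0

print(looks_truncated(139264))  # True  (the partial image above: 17 * 8192)
print(looks_truncated(16384))   # True  (the suspicious Content-Length: 2 * 8192)
print(looks_truncated(139265))  # False (off a chunk boundary)
```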
[05:06:16] list admins got a notice that it needed to be moderated, but it never showed up in the moderation queue [05:07:18] I'm a little rusty on Mailman administration, but doesn't it disappear if someone nukes it? [05:07:38] robla: rewrite.py uses 4096 size chunks [05:07:39] also, if I recall correctly, the sender can withdraw it [05:07:58] robla, I don't think the sender can withdraw it [05:08:11] Aaron|home: do you think rewrite.py could send a 404 if the URL is not correct, instead of 400? [05:08:31] 400 is for when there's a problem with some other part of the request [05:08:34] Robla: if he did withdraw it, then he did so in under a minute [05:08:55] what does "URL is not correct" mean? [05:09:54] resp = webob.exc.HTTPBadRequest('Regexp failed: "%s"' % (req.path)) #11 [05:09:55] return resp(env, start_response) [05:09:57] that [05:10:54] I guess [05:12:56] where is the populate script? [05:13:22] Aaron|home: do you know? [05:13:55] no [05:14:57] woosters: are you around by any chance? [05:15:16] hi [05:15:55] hi there! so...Tim and I are dabbling with the possibility of temporarily disabling Swift until we get the partial image thing sorted [05:16:25] we coped without swift for years, I'm sure we can do it for another day [05:17:03] what is the % of images that are distorted in your estimate? [05:17:15] we don't know...that's the problem [05:17:19] I haven't found any way to identify them [05:17:34] but probably a small percentage, like 0.1% [05:18:27] * Aaron|home wishes he could view the puppet repo with something like viewvc [05:18:45] Aaron|home: you're using gitweb? 
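The 400-vs-404 point above amounts to a one-line change in the proxy: a well-formed request for a URL that doesn't match the thumb pattern names a missing resource (404), whereas 400 claims the request itself was malformed. A sketch with a hypothetical pattern — the real regexp lives in rewrite.py in operations/software and is more involved:

```python
import re

# Hypothetical stand-in for rewrite.py's thumb-URL pattern.
THUMB_RE = re.compile(r'^/(?P<site>[^/]+)/(?P<lang>[^/]+)/thumb/.+')

def status_for(path):
    """Map a request path to a status: matched paths proceed (200 here
    for illustration), unmatched paths get 404 rather than 400."""
    return 200 if THUMB_RE.match(path) else 404
```

In webob terms this is just swapping `webob.exc.HTTPBadRequest` for `webob.exc.HTTPNotFound` in the branch quoted above.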
[05:19:00] there's a tiny gitweb link on the revision pages [05:19:03] in gerrit [05:19:07] I was trying that [05:19:29] didn't seem to do what I wanted [05:20:10] it looks like the script was /home/ben/swift/geturls [05:20:32] well, maplebed is writing a script to compare the thumbnails in ms5 and in swift [05:21:57] .1% points to appr 200,000 [05:22:03] * Aaron|home plays with gitweb more [05:22:23] yeah, so I'm saying that we should reconfigure squid to send requests directly to ms5 until he finishes running that script [05:22:25] ok, this is a sort of a slow shitty version of viewvc [05:22:45] * Aaron|home just wanted to look in files/swift [05:22:53] nothing there [05:23:08] it's in /operations/software.git [05:23:39] most of the thumbnails are not used repeatedly [05:23:42] Aaron|home: for what it's worth, you can also check out a local copy, and run "git instaweb", and have a faster, shitty version of the same thing [05:23:54] so the impact is much much less [05:24:10] yeah, my checkout is on my office box [05:24:28] it actually worked today rather than fail out of files with * in the name like last time [05:24:45] * Aaron|home is too lazy to clone on this box too [05:25:03] *fail out on [05:25:44] https://gerrit.wikimedia.org/r/gitweb?p=operations/software.git;a=blob;f=geturls/geturls;h=cf8ab07ec2d95acddc35f4d4f9ad4ce686158ede;hb=c895cbf7bf53d9bf6cdb57798586e5091058f91a [05:29:10] is rewrite.py the thing that's being used at the moment? [05:29:56] yes [05:30:07] the population script is based around it AFAIK [05:30:25] yeah, it just requests the URLs and expects them to be generated and cached [05:34:16] what happens if the client aborts? [05:35:14] or what if ms5 aborts, for that matter? [05:39:53] # upload doesn't like our User-agent (Python-urllib/2.6), otherwise we could call it using urllib2.urlopen() [05:39:53] user_agent = Mozilla/5.0 [05:39:58] that's one mystery solved I guess [05:41:40] where is that?
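The hard-coded `user_agent = Mozilla/5.0` explains the mystery; forwarding the original client's User-Agent instead, as suggested in the channel, looks roughly like this in urllib terms. This is a sketch of the idea only — the real change belongs in the proxy config / rewrite.py, and the function name here is invented:

```python
from urllib.request import Request

def upstream_request(url, client_ua):
    """Build the request the proxy would send upstream, forwarding the
    original client's User-Agent rather than a hard-coded 'Mozilla/5.0',
    so the upstream logs stay useful for incident response."""
    return Request(url, headers={'User-Agent': client_ua or 'unknown'})
```

With the header passed through, attaching to a scaler process and reading the UA out of a live request (as mentioned below for ms5) works again.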
[05:43:59] ms-fe1:/etc/swift/proxy-server.conf [05:45:47] we want it to pass through the User-Agent header, like what ms5 does [05:46:04] it makes incident response much easier [05:46:41] if you have apache symbols installed on the image scalers, you can even attach to a long-running process and print out its user agent [07:00:21] Tim-away: I got it again. [07:00:24] http://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28proposals%29#Restoring_long-lost_edits_using_the_newly_released_historical_database_dumps [07:00:28] > This page was last modified on 11 February 2012 at 14:48. [07:02:16] Hmm. curl is giving February 14. [07:02:36] Maybe just Chrome cache... [07:19:40] !log symlinkd wikidiff2.so to php_wikidiff2.so on searchidx2 [07:19:42] Logged the message, Master [07:19:51] that will take care of that issue, if hackily [07:29:43] New patchset: Asher; "graphite stats retention" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2576 [07:30:06] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2576 [07:30:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2576 [07:30:07] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2576 [07:36:13] New patchset: Asher; "fix lower-precision longer term storage of stats data" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2577 [07:36:36] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2577 [07:37:42] New patchset: Asher; "fix lower-precision longer term storage of stats data" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2577 [07:38:05] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2577 [07:38:05] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2577 [07:38:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2577 [07:49:41] hi there [07:49:52] http://commons.wikimedia.org/wiki/Commons:Deletion_requests/2012/02 <- do you see what I see? (stack trace) [07:50:55] apergos: http://www.telecommander.com/pics/links/application%20software/microsoft/Microsoft_Bob_1_0/Microsoft_Bob_1_0.htm [07:51:04] * Aaron|home doesn't even remember that [07:52:16] I didn't really do windoze [07:52:44] ten "friends of bob" [07:52:45] from http://toastytech.com/guis/bob.html, "A few possible reasons that Bob flopped: ...Most people at the time who wanted ease of use would just get a Macintosh." [07:52:48] * Aaron|home lols [07:52:51] is this like "bob's your uncle"?? [07:53:11] "Bob was not useful enough to justify its initial sale price of almost $100." [09:02:29] leafnode: yes, it looks like a stack trace to me [09:29:05] !log tstarling synchronized php-1.18/extensions/cldr/LanguageNames.body.php 'r111453' [09:29:08] Logged the message, Master [09:30:18] leafnode: fixed [09:30:33] TimStarling: great :) Thanks [09:31:13] leafnode: so they say you are alive [09:33:01] saper: old ladies' gossips [11:10:01] anyone know what is wrong at http://stats.wikimedia.org/ ? 
[11:10:23] permissions are locked down, so it seems that the apache config has been altered [14:05:02] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki to 1.19 [14:05:05] Logged the message, Master [14:16:23] New patchset: Catrope; "Fix MIME type for .woff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2578 [14:16:43] Reedy: --^^ [14:16:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2578 [14:47:20] hi, I'm looking to use the API to get the number of anons who have contributed in a certain date range. The only way I can think of doing this is to loop over list=usercontribs for all contributions in the date range, and using ucuserprefix [14:47:31] that seems less than optimal though. Are there better ideas? [14:49:10] You could run an SQL query on the toolserver, if you have an account there [14:49:27] But one way or another you will have to examine all anon contribs in the date range anyway [14:50:15] ugh, well, it'll have to do I suppose [14:50:16] thanks [15:15:14] New review: Mark Bergsma; "Any reason not to deploy this in base.pp, i.e., on all servers?
:)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2556 [15:16:32] New patchset: Mark Bergsma; "Working upstart job varnishncsa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2579 [15:17:03] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2579 [15:17:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2579 [15:30:59] New patchset: Mark Bergsma; "Pass the environment as arguments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2580 [15:32:19] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2580 [15:32:20] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2580 [15:33:46] New patchset: Mark Bergsma; "Syntax error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2581 [15:34:28] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2581 [15:34:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2581 [15:36:09] New patchset: Mark Bergsma; "Apparently Puppet doesn't do string concatenation with +" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2582 [15:36:53] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2582 [15:36:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2582 [15:39:20] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 34342 - Create a new books namespace on he.wiki' [15:39:23] Logged the message, Master [15:45:19] Hello, is there a problem with stats.wikimedia.org ? 
[15:47:33] Ash_Crow: looks like it, getting Forbidden [15:48:08] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 34378 - Rename namespaces on mr.wikisource.org' [15:48:10] Logged the message, Master [15:57:17] it is working for me [15:57:20] Ash_Crow: fixed it [15:57:27] ah, you did something [15:57:28] apergos: for a couple of seconds ;) [15:57:28] thanks [15:57:55] it's on spence, and spence lost its NFS mount.. Stale NFS file handle [15:57:57] mutante, thanks :) [15:58:09] which included this document root [15:58:10] yw [15:59:20] Ash_Crow: thanks for reporting, monitoring didn't detect this one [16:08:58] hello [16:09:07] I get problems when editing pages on Commons [16:09:18] I get a message "Some parts of the edit form did not reach the server; double-check that your edits are intact and try again." [16:09:29] and a blank page [16:09:38] I have to try 4/5 times to edit a page [16:09:38] yannf, not Chrom(e|ium) over HTTPS I hope [16:09:57] yes, Chrome, but not https [16:10:52] is it serious, Doctor? [16:12:38] any other action is no problem: reading, patrolling, deleting, etc. [16:12:39] New review: Catrope; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2264 [16:14:08] * Nemo_bis is not a dev [16:14:47] was there any problem with Chrom(e|ium) over HTTPS? [16:16:02] yannf, yes, many POST requests not working [16:16:11] dunno, I can edit Commons with Chromium [16:21:17] something else, there is no description page for these files: [16:21:19] http://commons.wikimedia.org/wiki/File%3APOSTERMENDOZA.JPG [16:21:26] http://commons.wikimedia.org/wiki/File%3ABruxelles_Java_Masque_Wayang_02_10_2011_06.jpg [16:22:13] yannf, I see it [16:22:28] perhaps check which squid served it in the source?
[16:22:38] uh no [16:23:03] the 2nd one is there since at least December [16:23:27] the 1st was probably added in January: http://ru.wikipedia.org/w/index.php?title=%D0%A1%D1%82%D1%80%D0%B0%D1%88%D0%BD%D1%8B%D0%B9_%D1%81%D1%83%D0%B4_(%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2006)&diff=next&oldid=40590500 [16:23:47] still http://commons.wikimedia.org/wiki/Commons:Administrators%27_noticeboard/Archive_31#System_problems ? [16:25:17] looks it [16:25:24] yes 18 November 2011 [16:46:10] New patchset: Dzahn; "add account for aengels, add to stat1, fix last UID counter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2583 [16:46:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2583 [16:56:50] New patchset: Mark Bergsma; "Make start-stop-daemon work with multiple instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2584 [16:59:36] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2584 [16:59:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2584 [17:05:43] New patchset: Mark Bergsma; "Sigh." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/2585 [17:06:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2585 [17:10:24] New patchset: Mark Bergsma; "Add job name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2586 [17:10:47] New review: Dzahn; "approved now in RT 2436" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2583 [17:10:47] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2583 [17:10:48] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2585 [17:11:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2586 [17:11:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2586 [17:26:04] hi, I'm running into an issue on formey -- I just ran 'sudo add-ldap-user jdlrobson http://jonrobson.me.uk/ab134qe4/1i3133113rs' and I haven't gotten a prompt back, nor the usual confirmatory response saying that the home directory has been created. 'ldaplist -l passwd jdlrobson' says that it has. [17:26:47] lemme look [17:26:49] anything in particular I should do? I'm hesitating to just quit that particular command with control-C or closing the window or whatever in case there's something else that has to run in the background and that is stuck... [17:27:17] I ran ps aux and I didn't see anything in particular associated with my username and LDAP-creation-related that looked like it was still running [17:27:20] home directory was not created [17:27:50] don't control-c just yet [17:28:11] The stuck process is nscd, PID 835 [17:28:12] it's waiting on something. I wonder wat [17:28:14] *935 [17:28:18] bah [17:28:34] eww. it's really hung [17:28:44] did i break something already? 
:) [17:28:57] This is blocking my work because I am hesitant to do any more commit access queue work until I know this isn't some glitch. Ryan_Lane perhaps you could comment? [17:29:03] sumanah: did it unblock? [17:29:16] it's just nscd acting weird [17:29:30] seems your process isn't blocked anymore [17:29:37] home directory was created [17:29:47] sumanah: no need to not do any more [17:29:57] on rare occasion nscd breaks [17:30:00] New patchset: Hashar; "swp files are now ignored" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2587 [17:30:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2587 [17:30:40] the scripts probably don't help with that, since it purges nscd when run [17:32:45] Never mind. It was lag. [17:33:27] no. i killed and purged nsc [17:33:28] *nscd [17:33:49] OK. Ah, thanks. Weirdly, I also experienced a great deal of lag over the last 5 min [17:34:16] this is likely due to nfs1 dying yesterday [17:34:29] Right, because of the Swift & PediaPress thing? [17:35:00] (Ryan, the reason why I seemed to be oblivious to the things you were saying was that I didn't see/hear them, because of IRC lag.) [17:35:08] ahhhh [17:35:08] heh [17:35:22] nah. nfs1 died for some other reason, I believe [17:35:32] Oh nm then, [17:35:41] someone scapped too quickly, or something [17:35:44] and I eventually did get the response Creating a home directory for jdlrobson at /home/jdlrobson [17:35:48] * Ryan_Lane nods [17:37:00] Reedy: how goes? [17:40:50] name service cache daemon. Well, I learned something today. 
[17:49:15] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2264 [17:49:45] New patchset: Dzahn; "enhance page_all - area code API lookup one-liner :p - option to skip an area" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2264 [17:51:26] * robla starts rooting around for 500 error log on test2 [18:08:57] I'm rather newish to poking around on fenari logs. We're getting frequent 500s on test2. where should I go to see where those are logged? [18:09:40] I'm looking in /home/w/logs/exceptions.log, nothing obvious there. also /home/w/logs/test2wiki.log isn't obvious [18:17:35] Unusually high or persistent lag should be reported to #wikimedia-tech on irc.freenode.net [18:17:38] SOS! [18:18:33] http://ru.wikipedia.org/w/api.php?format=xml&action=query&meta=siteinfo&siprop=dbrepllag give me too big lag 4-5 thousands, what happened ??? [18:18:38] *gives [18:20:13] Nirvanchik: can you give me a traceroute ? [18:20:15] and your ip ? [18:20:19] where are you connecting from ? [18:20:30] my bot (with 15 s lag setting) worked fine always until today [18:20:36] from Moscow, Russia [18:20:50] LeslieCarr: That's everything but connection/ client related [18:21:01] 109.72.74.228 [18:21:15] well i have found we've had a lot of routing issues with russian isp's [18:21:30] so the ru made me suspicious that network could be the issue [18:21:38] LeslieCarr: But that has absolutely nothing to do with DBs out of sync [18:21:43] no it does not [18:21:52] https://ru.wikipedia.org/w/api.php?format=xml&action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb [18:22:04] db46 ... [18:22:08] ah [18:22:21] i was thinking you were saying that loading that page gave you a big lag [18:22:25] not the db46 issue... [18:22:37] :)))) [18:22:50] okay. can this be fixed somehow? [18:23:14] and what happened [18:23:55] not sure what happened, though i do see the replication lag, will look into it [18:23:56] temp.
depooling db46 might help :P [18:23:57] hi [18:24:11] @replag [18:24:16] sigh [18:24:26] Nemo_bis: That bot's gone for ages :/ [18:24:28] how does it come, that I can't display some characters in Etherpad? could it be a server problem? [18:24:37] hoo, but Krinkle restarted it [18:25:18] feb 12 06.19.04 * dbbot-wm has quit (Ping timeout: 252 seconds) [18:28:01] freenode :/ [18:29:02] Nirvanchik stalker [18:29:37] Nemo_bis: sorry? [18:29:52] Nirvanchik, j/k for your CTCP [18:30:17] ok. I said my location, but now I feel curiosity about where are you all from [18:30:19] Nirvanchik: this may be due to the schema migration for MW 1.19 .. still investigating [18:30:34] lol [18:32:04] fyi, I'm from SF, CA, USA [18:32:35] What's up? [18:34:28] LeslieCarr: db46 has the revision table locked [18:34:47] LeslieCarr: hm... I don't quite understand but this is interesting. when was this migration conducted? [18:35:02] It's still going on [18:35:06] It's not a quick change [18:35:24] Reedy: yeah, i'm just not quite sure what to do about it/ if this is due to the migration and needed, etc --- where's asher when you need him? ;) [18:35:25] LeslieCarr: thanks. I guessed u were from US [18:35:33] let's see [18:35:49] db46 is just an s6 slave [18:36:05] It can just be removed from the db pool [18:36:12] Reedy: Why not just depooling? [18:36:26] That's what I just said :p [18:36:32] :P [18:36:42] 'Apache failed sanity check: VIP not configured on lo' [18:36:45] so much for graceful [18:36:53] "[19:23] hoo: temp. depooling db46 might help :P" [18:37:04] We saw this for another cluster yesterday, then it decided to catch up again [18:37:31] LeslieCarr: recognize that error? [18:38:16] AaronSchulz: i'm guessing that lo0 isn't configured with its proper ip [18:38:18] which machine ?
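The 'VIP not configured on lo' check is just a grep over `ip addr` output: an LVS realserver carries the service VIP on its loopback interface so it can answer traffic the load balancer forwards to it, and the sanity check verifies the VIP is there. A Python rendering of the same test — the 10.2.1.1 default is the guess made in the log, not a confirmed value:

```python
import re

def vip_on_loopback(ip_addr_output, vip='10.2.1.1'):
    """Re-implementation of apache-sanity-check's test: does any line
    of `ip addr` output mention the VIP followed by lo?  A deploy host
    like fenari has only a normal IP, so this fails there by design."""
    return re.search(re.escape(vip) + r'.*\blo\b', ip_addr_output) is not None
```

Which also explains the confusion below: the check is meant for the apaches, not for the host driving the graceful restart.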
[18:39:07] !log reedy synchronized wmf-config/db.php 'Comment out db46 from s6 due to really high lag from db schema updates' [18:39:09] Logged the message, Master [18:40:17] Though not sure if just reducing its load might have been better [18:40:58] WOW bot started his job :) [18:41:01] no lag [18:41:31] thanks to everyone [18:41:56] * AaronSchulz looks at node groups [18:43:57] if ! /sbin/ip addr | grep -q "$VIP.*lo"; then... [18:45:12] VIP=10.2.1.1 I guess [18:45:14] it shows db host="db43" now :) [18:48:27] LeslieCarr: yeah, so that code is in apache-sanity-check, and ran on fenari [18:50:14] oh it wants a 10.X address on lo0 [18:50:24] which fenari wouldn't have, it just has a normal ip [18:50:31] why does fenari need apache-sanity-check run ? [18:52:36] it shouldn't, I think I meant to be running apache-graceful-all [18:53:15] * AaronSchulz thought the latter was the former ;) [18:53:20] LeslieCarr: heheh) wow you're a girl and u have red hair :) not what I expected to see here ))) [18:53:33] wow, girls! [18:53:38] they exist [18:53:57] yep, there's a few of us here ;) [18:54:06] yep. it's nice to have them [18:54:34] rarity in IT [18:55:24] I'll put db46 back in a few minutes, it's nearly caught up with all its queries [18:55:51] though the alter is still ongoing, now on jawiki [18:57:07] Reedy: What ALTER is this? 1.19 schema changes? [18:57:15] yaa [18:57:34] ar_sha1 [18:58:37] Ah [19:01:18] * AaronSchulz hates the archive table [19:01:29] worst table EVER [19:01:42] hahaha [19:01:44] are you [19:01:52] yeah, you heard. it does suck doesn't it? [19:01:58] so, i reverted the change related to the office redirect [19:02:05] and synced, and graceful'ed [19:02:10] sorry about that [19:02:23] but really the fact that we have revisions in there in a gazillion old icky forms is also the worst ever [19:02:29] revision text, I mean [19:04:40] it's morning in CA... I got here in good time. And now, I'm going to watch a film before sleep.
And you continue to work :) bye-bye! [19:05:41] The update queries seem to be working reasonably well [19:05:46] it's onto rev_sha1 for jawiki now [19:06:09] !log gracefulled apaches to deal with APC corruption [19:06:12] Logged the message, Master [19:06:21] Yay [19:06:46] time to find 1.19 breakage [19:07:03] * Reedy mashes F5 [19:07:26] test2 seems faster [19:08:25] And not dying every few requests ) [19:10:51] AaronSchulz: enwiki next? ;) [19:11:00] go! [19:11:03] wait a few min first [19:11:13] awwwww [19:11:13] lets not rush anything [19:11:27] robla: test is also on 1.19wmf1 [19:11:33] chrismcmahon: we're ready for a smoke test on test2 [19:11:41] *almost [19:11:47] ? [19:12:01] testwiki gave a slightly more consistent loading of pages [19:12:09] * chrismcmahon watches...  [19:12:39] Worth just giving test2 a quick look over before submitting it to anything [19:12:59] And reset the test2 profiling data [19:13:30] aaron cleared profiling data [19:13:58] woosters: is binasher coming in today? [19:14:15] http://noc.wikimedia.org/cgi-bin/report.py?db=test2 [19:14:25] "New in MediaWiki 1.19 "Stuff..."" [19:14:29] lol [19:14:34] It's accurate [19:14:42] can't be wrong [19:14:47] Reedy: we forgot "...and things" [19:15:05] robla: too late, it's in feature freeze [19:15:38] let's make sure we get that in 1.20, okay? [19:15:46] noted [19:16:03] robla - he should be [19:16:17] he was feeling 'sickish' yesterday though [19:16:36] now is also the time to start seeing the profile info [19:16:58] test2 looks to be in a reasonable state from clicking around [19:17:09] not dying every X requests or so [19:17:22] Reedy: AaronSchulz: is it time for chrismcmahon to run through his test plan? [19:17:30] sooner the better [19:17:31] (on test2 that is?) [19:17:44] robla: I've been visiting it today between 500s [19:17:50] exactly [19:18:06] chrismcmahon: have you seen any more 500s in the past 15 min? 
[19:18:09] apc changes look to have solved that nicely [19:18:45] robla: not that recently, checking now [19:19:02] Reedy: did you make config changes, or merely reboot? [19:19:15] sorry? [19:20:29] Reedy: you said "apc changes", as though something actually changed [19:21:23] Didn't realise it was only an apache restart [19:22:43] my understanding was that it was a "slow restart", whatever that means [19:22:43] AaronSchulz: ? [19:22:59] yeah [19:23:05] yup [19:23:22] no changes to cache size or anything like that then [19:23:24] let the processes stop themselves when they've finished etc, rather than kill -9 ;) [19:23:59] Looks like it's good to go for the moment [19:24:54] New patchset: Catrope; "Fix purge-varnish, wants ban.url now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2590 [19:39:50] any known issue with thumbnails? i have issues with displaying some [19:44:51] Danny_B|backup: yes, there is. [19:45:04] a small percentage are corrupted [19:45:08] purging seems to fix [19:45:53] guillom: are you still around? [19:46:02] robla, I am [19:46:26] argh, and maplebed isn't [19:46:59] the issue above is probably worth a tweet or something [19:47:42] robla, can you quickly summarize it for me? [19:48:23] here's the status on that. yesterday, we discovered that a small percentage (anecdotally) of image thumbnails are corrupt, in that they'll appear partially loaded [19:48:35] You can always !log and put a !Wikimedia and !Wikipedia somewhere [19:48:50] some ops folks don't like that :) [19:48:52] (to reach those who read groups and tags) [19:49:07] we know the cause, and maplebed is working on two things: [19:49:19] robla: purging did not work, that's why i'm asking [19:49:27] 1. damage assessment: estimating how many images we're talking about here [19:49:40] 2. this afternoon, the actual fix [19:49:46] Danny_B|backup: url?
[19:49:46] ok [19:49:56] robla: https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Vitezslav_Janovsky_Vilimek.png/220px-Vitezslav_Janovsky_Vilimek.png [19:50:22] if by this afternoon we don't have a fix, we may revert swift temporarily [19:51:06] ok [19:51:10] Danny_B|backup: you purged here? http://commons.wikimedia.org/wiki/File:Vitezslav_Janovsky_Vilimek.png [19:51:15] yup [19:51:24] woosters: ^ [19:51:30] purging doesn't seem to fix [19:52:34] * robla runs a purge himself [19:53:16] !log reedy synchronized wmf-config/db.php 'Bring db46 back in' [19:53:19] Logged the message, Master [19:54:26] I'm assuming maplebed isn't on IRC because he's in "do not disturb" mode [19:54:39] robla, he's here now [19:55:09] maplebed: I'll mail you the backlog [19:55:25] actually my connection just dropped and I didn't notice. [19:55:33] but I'm almost out of dnd mode. [19:56:12] summary: purging doesn't work for this URL: https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Vitezslav_Janovsky_Vilimek.png/220px-Vitezslav_Janovsky_Vilimek.png [19:56:18] + http://commons.wikimedia.org/wiki/File:Vitezslav_Janovsky_Vilimek.png [19:56:22] so far with a sample of 200 images we're at 4.7% that have at least one bad thumbnail [19:56:53] robla: I think that's a different problem. [19:57:19] 221px works fine [19:57:45] https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Vitezslav_Janovsky_Vilimek.png/221px-Vitezslav_Janovsky_Vilimek.png [19:58:07] 4.7% is pretty bad [19:59:53] robla: the 220px image is purged from both ms5 and swift [20:00:08] so I'm guessing there's some problem with squid caching errors and not purging them right. [20:02:13] that is high [20:12:05] PHP Fatal error: Cannot access protected property WikiPage::$mTouched in /home/wikipedia/common/php-1.19/includes/Article.php on line 1743 [20:14:31] maplebed: http://pastebin.com/EWZx2rL7 [20:15:24] AaronSchulz: looks good to me. [20:15:27] commit!
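A 4.7% bad rate in a 200-image sample carries real uncertainty worth quantifying before extrapolating to the whole thumbnail store. A quick way to put error bars on it (9/200 = 4.5% is used here as the nearest whole count to the quoted figure):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a sample proportion — behaves
    sensibly even for small counts, unlike the naive normal interval."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

low, high = wilson_interval(9, 200)  # roughly (0.024, 0.083)
```

So the sample is consistent with anything from ~2.4% to ~8.3% of originals having a bad thumbnail; either way, far above the 0.1% guessed earlier in the day.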
[20:15:39] we can test it on the eqiad cluster [20:23:18] Um, test2.wp.org doesn't have the same settings as what the actual projects will have after the upgrade, right? [20:23:49] "Male user" and "Female user" won't really be namespaces? [20:26:29] i should hope not [20:27:09] !log hashar synchronized php-1.18/maintenance/purgeList.php 'r111480 : enable purge of HTTPS URLs' [20:27:11] Logged the message, Master [20:33:11] !log hashar synchronized php-1.19/maintenance/purgeList.php 'r111480 : enable purge of HTTPS URLs' [20:33:14] Logged the message, Master [20:38:38] !log reedy synchronized wmf-config/InitialiseSettings.php 'fix some mrwikisource aliases' [20:38:40] Logged the message, Master [20:41:33] !log reedy synchronized wmf-config/InitialiseSettings.php 'fix some mrwikisource aliases' [20:41:35] Logged the message, Master [20:45:27] gmaxwell: werdna: (low prio) Five days ago we talked about "https per default" for logged in users - a current note: twitter switched that on yesterday for logged-in users. E.g. google also provides https per default for all logged-in activities (AFAIK). So Wikimedia would be in "good" company ;-) [20:45:49] ugh. twitter switched this on? [20:45:54] now I have to fucking do it [20:45:56] !log reedy synchronized wmf-config/InitialiseSettings.php 'Fix hewiki namespace talk typo' [20:45:58] Logged the message, Master [20:46:03] this is what prompted me to enable https to begin with ;) [20:46:21] just make everyone use https [20:46:23] problem solved [20:46:26] "even *twitter* is doing https and we aren't?" [20:46:44] well, that would actually solve a lot of problems [20:47:00] I don't think we're ready to go down that path yet, though ;) [20:47:16] also, latency is problematic with https, so anons would take a performance hit [20:47:34] don't you have some hardware SSL offloading in front of squids? 
[20:47:34] we'd need to run nginx on all of the squid and varnish nodes for that [20:47:41] Ryan_Lane: did you see anons being logged-in? ;) [20:47:43] maplebed: https://upload.wikimedia.org/wikisource/mr/thumb/e/ef/%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/page3-1000px-%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu.jpg [20:47:46] 401 not authorized? [20:47:48] we have an nginx cluster [20:47:54] that is doing ssl termination [20:48:09] Saibo: Reedy suggested using https for everyone [20:48:18] not just logged-in users [20:48:21] ehm.. yes, sry [20:48:30] Reedy: that can happen when the URL is invalid [20:48:33] https://mr.wikisource.org/w/index.php?title=%E0%A4%AA%E0%A4%BE%E0%A4%A8:%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/3&action=edit&redlink=1 [20:48:34] are you sure that's a valid thumb url? [20:48:37] it's actually technically easier to make everyone use https, than just logged-in users [20:48:43] It's the url from the image trying to be used there [20:49:14] is that Swift ? [20:49:16] AaronSchulz: ^^^^ [20:49:20] Ya [20:49:34] AaronSchulz: I thought all thumbs had to start ###px after the / [20:49:35] https://mr.wikisource.org/w/index.php?title=%E0%A4%AA%E0%A4%BE%E0%A4%A8:Wind_in_the_Willows_%281913%29.djvu/95&action=edit&redlink=1 [20:49:41] Other djvu seem ok [20:49:47] * hashar opens the bug "Token may have timed out" error message needs i18n support [20:49:53] maplebed: btw, awesome job on swift deploymeny [20:49:56] *deployment [20:50:04] really happy to see that going [20:50:25] maplebed: that thumb is valid [20:50:30] Ryan_Lane: heh... I'm about to revert. [20:50:34] :( [20:50:36] but thanks anyways... 
you also have seek= and mid and other params [20:50:44] in addition to page [20:50:53] !log asher synchronized wmf-config/db.php 'setting s7 to read-only for master swap, db37 to be new master' [20:50:55] Logged the message, Master [20:51:09] maplebed: too many bugs in swift? [20:51:18] rats!!! [20:51:35] just one - but it's affecting about 4% of all images (1.5% of all thumbnails) [20:51:41] damn. that sucks [20:51:43] anyway, rewrite.py doesn't really validate the thumb name anyway, that's for thumb.php [20:51:54] maplebed: I got another offer for the swift people to help us deploy [20:51:57] as FOSDEM [20:52:00] *at [20:52:15] if you'd like the devs to help [20:52:26] I think they want to set up a meeting soon [20:52:27] maybe they can maintain php-cloudfiles better [20:52:27] I would. Could you email me names / addresses? [20:52:42] It was stephano. I'll contact him about details [20:52:46] !log asher synchronized wmf-config/db.php 'setting s7 to read-only for master swap, db37 to be new master, db16 still out' [20:52:48] Logged the message, Master [20:52:51] AaronSchulz: :D [20:53:20] AaronSchulz: that's a legit request. we should ask them that when we meet. [20:54:09] Reedy: ah, I've found the problem with the mr wikisource. [20:54:19] Reedy: do you know if it's a new project? [20:54:31] AaronSchulz: the container for mr.wikisource doesn't exist. [20:54:42] maplebed: created 2 February [20:54:49] yup, that'd do it. [20:54:56] ^ [20:54:57] # [20:55:04] I need to find a list of all projects created since mid-jan. [20:55:06] maplebed: probably means there's a few more wikis [20:55:16] (I can probably create it, but if someone has done it already, that'd be nice.)
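The dbname confusion that follows (vep.wikimedia vs vep.wikipedia) comes from the suffix convention in WMF database names, where the bare `wiki` suffix means Wikipedia. A sketch of the parse — the suffix list here is illustrative, not the full production set:

```python
def split_dbname(dbname):
    """Split a WMF database name into (language code, project).
    Longer suffixes must be tried before the bare 'wiki' suffix so
    'mrwikisource' doesn't get read as language 'mrwikisour' + 'wiki';
    and 'vepwiki' resolves to vep.wikipedia, not vep.wikimedia."""
    for suffix, project in (('wikisource', 'wikisource'),
                            ('wiktionary', 'wiktionary'),
                            ('wikimedia', 'wikimedia'),
                            ('wiki', 'wikipedia')):
        if dbname.endswith(suffix):
            return dbname[:-len(suffix)], project
    raise ValueError('unrecognised dbname: %s' % dbname)
```

Going by database name, as suggested below, sidesteps the broken hostname in the newprojects mail entirely.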
[20:55:22] New patchset: Asher; "upgrading mysql on db16, db37 is new s7 master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2591 [20:55:23] I think there are 3 or 4 [20:55:28] Done all on the 2nd feb [20:55:41] or 1st [20:55:45] bewikimedia [20:55:46] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2591 [20:55:46] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2591 [20:55:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2591 [20:55:47] vepwiki [20:55:57] maplebed: http://lists.wikimedia.org/pipermail/newprojects/2012-February/thread.html [20:56:05] thanks DaBPunkt [20:56:32] umm. [20:56:44] 3 bewikimedia? just ignore the duplicates ;) [20:56:52] that list is broken. it mentions http://vep.wikimedia.org as a new wiki, but it doesn't exist. [20:57:04] I'll guess it means vep.wikipedia, but hrmph. [20:57:10] binasher: what's the start-position for s7? [20:57:42] maplebed: database name is probably safest to go against [20:57:47] DaBPunkt: oh yeah, just a sec [20:57:53] maplebed: vepwiki [20:58:06] !log new s7 repl position - MASTER_LOG_FILE='db37-bin.000285', MASTER_LOG_POS=865712092 [20:58:07] maplebed: is adding a new container something anyone with shell can do, or is it an ops request? Need to update "add a wiki" on wikitech either way [20:58:07] Reedy: I need the expanded name for the container though. [20:58:08] Logged the message, Master [20:58:20] Reedy: after 1.19 it'll happen automagically. [20:58:27] awesome [20:58:29] in the mean time, it's a maplebed-request. [20:58:38] DaBPunkt: i'm going to switch s2 and s3 shortly as well, but that's it for today [20:58:49] maplebed: May I ask what a "container" is in this context? [20:58:58] binasher: ok [20:59:22] DaBPunkt: if swift is like Amazon's S3, a container is like a bucket. [20:59:30] does that help?
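A rough sketch of the container-per-project layout behind "the container for mr.wikisource doesn't exist". The naming pattern, the set of sharded projects, and the `.<shard>` suffix below are assumptions for illustration only, not taken from the production rewrite.py:

```python
# Illustrative sketch of per-project Swift thumbnail containers.
# All names and the sharding rule here are assumed, not production values.

SHARDED_PROJECTS = {("wikipedia", "commons"), ("wikipedia", "en")}  # assumed

def thumb_container(site: str, lang: str, shard: str) -> str:
    """Return the assumed Swift container name for a project's thumbs.

    `shard` is the two-hex-char path component from the thumb URL
    (e.g. "ef" in .../e/ef/...). Most projects get one container; a
    few huge ones are split into 256 containers keyed by that shard.
    """
    base = f"{site}-{lang}-local-thumb"
    if (site, lang) in SHARDED_PROJECTS:
        return f"{base}.{shard}"
    return base
```

Under this scheme a brand-new wiki like mr.wikisource simply has no container yet, which matches the 401s seen above until one is created.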
[20:59:30] :P [20:59:35] * Reedy looks for the walrus [20:59:58] maplebed: not really. I do not trust in clouds ;) [21:00:07] well not just after 1.19, after some other stuff happens [21:00:59] DaBPunkt: ok. an alternate explanation - swift's object storage model has containers and objects - objects go in containers. They're sort of like directories / folders on a regular filesystem except that they can't be nested. We've partitioned thumbnails so there's one container per project (most of the time) [21:01:21] (there are two projects for which the containers are broken up into 256 separate containers, indexed by the shard present in the URL). [21:01:28] any more helpful? [21:02:02] So it is a kind of distributed filesystem or object-space? [21:02:05] DaBPunkt: heh. it's not "cloud" storage. it's "object" storage [21:02:17] DaBPunkt: swift is a distributed object store, yeah. [21:02:31] 'container' in this context is vocabulary specific to swift. [21:02:50] !log asher synchronized wmf-config/db.php 'returning db16 after upgrading mysql' [21:02:52] Logged the message, Master [21:04:40] zzz =_= [21:05:15] maplebed: ok, sounds interesting [21:05:33] DaBPunkt: http://wikitech.wikimedia.org/view/Swift for more reading. [21:07:34] DaBPunkt: or https://en.wikipedia.org/wiki/OpenStack#Object_Storage_.28Swift.29 ;-) (...probably) [21:11:50] binasher: did you switch s3 already? [21:12:16] Reedy: I think the mr.wikisource is working now, after bad stuff is purged. [21:12:30] DaBPunkt: no [21:12:34] Thanks [21:12:58] * maplebed -> afk [21:13:38] !log asher synchronized wmf-config/db.php 'setting s3 to read-only, switching master to db39' [21:13:40] Logged the message, Master [21:14:47] binasher: mm, strange. According to db.php db39 is the master of s3.
But the TS uses db34 as master and I can't find a master-switch of s3 in the techlog [21:14:57] DaBPunkt: and finally, see http://www.mediawiki.org/wiki/FileBackend [21:15:17] !log asher synchronized wmf-config/db.php 'returning s3 to writeable, db39 is the new master' [21:15:19] Logged the message, Master [21:15:30] !log new s3 master position - MASTER_LOG_FILE='db39-bin.000550', MASTER_LOG_POS=63238699 [21:15:32] Logged the message, Master [21:15:38] DaBPunkt: I have to edit the file before I deploy the file [21:15:48] ah ok. That clears it up :) [21:18:10] mm, our s3-copy is 6h behind. So no master-change for us now [21:18:25] New patchset: Asher; "upgrading db34, db39 new s3 master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2592 [21:18:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2592 [21:19:48] Does anybody know about a field called "rev_sha1"? [21:20:05] DaBPunkt: it's new [21:20:15] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2592 [21:20:16] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2592 [21:24:00] brion: ok, it was somehow missing in our test2wiki-database [21:24:26] binasher: please let the binlog on db34 remain for a few hours if possible [21:25:59] eek! [21:26:11] gotta update that eh [21:29:46] !log asher synchronized wmf-config/db.php 'db34 upgrading, returning to s3' [21:29:48] Logged the message, Master [21:32:12] New patchset: Lcarr; "reenabling ifup script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2593 [21:33:21] you gave somebody else the 'andrew' commit name?
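The `!log` entries above record the new master's binlog coordinates, which are exactly the values a slave needs in a MySQL `CHANGE MASTER TO` statement. A small helper that turns such a log line into the statement (the log-line parsing here is an illustration, not a WMF tool):

```python
import re

def change_master_sql(log_line: str, master_host: str) -> str:
    """Format a CHANGE MASTER TO statement from a logged repl position.

    Expects the coordinate format used in the !log lines above, e.g.
    "MASTER_LOG_FILE='db39-bin.000550', MASTER_LOG_POS=63238699".
    """
    m = re.search(r"MASTER_LOG_FILE='([^']+)',\s*MASTER_LOG_POS=(\d+)", log_line)
    if not m:
        raise ValueError("no replication coordinates found in log line")
    return (
        f"CHANGE MASTER TO MASTER_HOST='{master_host}', "
        f"MASTER_LOG_FILE='{m.group(1)}', MASTER_LOG_POS={m.group(2)};"
    )
```

For the s3 switch above this yields the statement each s3 slave would run after `STOP SLAVE`, followed by `START SLAVE`.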
[21:33:23] ugh [21:33:28] I have 'andrew' on hilight [21:33:50] and I also have the 'andrew' username on wmf servers [21:33:59] that is going to be screwed up [21:34:59] !log asher synchronized wmf-config/db.php 'setting s2 to read-only, switching master to db13' [21:35:01] Logged the message, Master [21:36:13] !log asher synchronized wmf-config/db.php 'setting s2 to writeable, db13 is new master' [21:36:15] Logged the message, Master [21:36:22] !log new s2-master pos - MASTER_LOG_FILE='db13-bin.000278', MASTER_LOG_POS=599752853 [21:36:24] Logged the message, Master [21:36:58] DaBPunkt: the db34 binlogs will be around for a while.. it's going to start running schema migrations tonight that will be long running [21:37:05] DaBPunkt: also s2 info ^^ [21:37:14] thnx [21:39:37] !log asher synchronized wmf-config/db.php 'returning db30 to s3, going to wait til after schema migrations to upgrade to lucid/new-mysql' [21:39:39] Logged the message, Master [21:42:33] DaBPunkt: i'm actually going to switch s4 in a couple minutes too.. 
that's really the last one for today [21:43:09] ok [21:47:34] !log asher synchronized wmf-config/db.php 'setting s4 to read-only, switching master to db22' [21:47:36] Logged the message, Master [21:48:51] !log asher synchronized wmf-config/db.php 'setting s4 to writeable, new master is db22, db31 still out' [21:48:53] Logged the message, Master [21:48:57] !log new s4 master pos - MASTER_LOG_FILE='db22-bin.000030', MASTER_LOG_POS=964208442 [21:48:59] Logged the message, Master [21:52:32] New patchset: Asher; "upgrading db31, new s2+s4 masters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2594 [21:53:17] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2593 [21:53:18] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2593 [21:55:10] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2594 [21:55:11] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2594 [21:56:25] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [21:56:37] New patchset: Lcarr; "Fixing default interface to default_gateway_interface" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2595 [21:56:58] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2595 [21:56:59] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2595 [21:57:01] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:58:04] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 138 MB (1% inode=57%): /var/lib/ureadahead/debugfs 138 MB (1% inode=57%): [21:59:52] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.309 seconds [22:03:37] PROBLEM - Disk space on srv223 is CRITICAL: DISK 
CRITICAL - free space: / 221 MB (3% inode=58%): /var/lib/ureadahead/debugfs 221 MB (3% inode=58%): [22:03:37] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.942 seconds [22:04:40] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=58%): /var/lib/ureadahead/debugfs 1 MB (0% inode=58%): [22:04:40] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=57%): /var/lib/ureadahead/debugfs 199 MB (2% inode=57%): [22:06:19] RECOVERY - mysqld processes on db31 is OK: PROCS OK: 1 process with command name mysqld [22:06:28] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:06:50] !log reedy synchronized wmf-config/InitialiseSettings.php 'Setting wmfUseRevSha1Columns' [22:06:52] Logged the message, Master [22:07:13] RECOVERY - MySQL Replication Heartbeat on db1038 is OK: OK replication delay seconds [22:07:13] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 75 MB (1% inode=58%): /var/lib/ureadahead/debugfs 75 MB (1% inode=58%): [22:07:13] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 123 MB (1% inode=57%): /var/lib/ureadahead/debugfs 123 MB (1% inode=57%): [22:07:32] RECOVERY - MySQL Replication Heartbeat on db51 is OK: OK replication delay seconds [22:07:32] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay seconds [22:07:32] RECOVERY - MySQL Replication Heartbeat on db22 is OK: OK replication delay 0 seconds [22:07:40] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:07:49] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay seconds [22:10:13] RECOVERY - Disk space on srv223 is OK: DISK OK [22:11:07] RECOVERY - Disk space on srv219 is OK: DISK OK [22:11:25] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=57%): /var/lib/ureadahead/debugfs 0 MB (0% 
inode=57%): [22:11:25] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 274 MB (3% inode=58%): /var/lib/ureadahead/debugfs 274 MB (3% inode=58%): [22:14:07] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 100 MB (1% inode=58%): /var/lib/ureadahead/debugfs 100 MB (1% inode=58%): [22:14:16] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 135 MB (1% inode=58%): /var/lib/ureadahead/debugfs 135 MB (1% inode=58%): [22:16:22] RECOVERY - Disk space on srv220 is OK: DISK OK [22:16:31] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 134 MB (1% inode=58%): /var/lib/ureadahead/debugfs 134 MB (1% inode=58%): [22:16:49] AaronSchulz: here's the IE8 report: http://meta.wikimedia.org/wiki/Talk:Wikimedia_maintenance_notice [22:16:58] New patchset: Hashar; "adding .gitreview (again)" [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2596 [22:17:52] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=57%): /var/lib/ureadahead/debugfs 0 MB (0% inode=57%): [22:18:23] New review: Hashar; "(no comment)" [test/mediawiki/core2] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/2596 [22:18:23] New review: Hashar; "(no comment)" [test/mediawiki/core2] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2596 [22:18:23] Change merged: Hashar; [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2596 [22:20:25] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [22:22:16] might be a js loading race condition [22:22:18] * AaronSchulz shrugs [22:22:32] seems to work on soft refresh, then fails on hard refresh again [22:22:58] * AaronSchulz awaits krinkle magic ;) [22:23:25] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 266 MB (3% inode=58%): /var/lib/ureadahead/debugfs 266 MB (3% inode=58%): [22:25:33] Reedy: you beat me to adding $wmfUseRevSha1Columns 
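For context on the new column being enabled here: MediaWiki's `rev_sha1`, as I understand it, holds the SHA-1 of the revision text, base-36 encoded and zero-padded to 31 characters. A sketch under that assumption:

```python
import hashlib

def rev_sha1(text: bytes) -> str:
    """Base-36 SHA-1 of revision text, padded to 31 chars.

    Assumed to match the MediaWiki rev_sha1 convention; shown for
    illustration only.
    """
    n = int(hashlib.sha1(text).hexdigest(), 16)
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out.rjust(31, "0")
```

This also explains the later slave error "Unknown column rev_sha1": writes referencing the column fail on any replica whose schema migration hasn't run yet.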
[22:25:39] heh [22:25:40] RECOVERY - Disk space on srv224 is OK: DISK OK [22:25:40] RECOVERY - Disk space on srv221 is OK: DISK OK [22:25:49] RECOVERY - Disk space on srv223 is OK: DISK OK [22:25:58] RECOVERY - Disk space on srv222 is OK: DISK OK [22:26:32] !log Removing php-1.17 from fenari [22:26:35] Logged the message, Master [22:33:50] !log running ddsh -F5 -cM -g mediawiki-installation 'sudo -u mwdeploy rm -rf /usr/local/apache/common-local/php-1.17' [22:33:52] Logged the message, Master [22:36:48] * AaronSchulz watches the nuke light up the sky [22:38:25] PROBLEM - Puppet freshness on gilman is CRITICAL: Puppet has not run in the last 10 hours [22:38:25] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours [22:40:33] Cool, that looks to have freed half a gig or so disk space from the apaches [22:40:58] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.946 seconds [22:44:35] PROBLEM - MySQL Slave Delay on db34 is CRITICAL: CRIT replication delay 247 seconds [22:44:54] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:46:13] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.516 seconds [22:46:25] !log reedy ran sync-common-all [22:46:28] Logged the message, Master [22:46:40] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 281 MB (3% inode=62%): /var/lib/ureadahead/debugfs 281 MB (3% inode=62%): [22:46:58] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=62%): /var/lib/ureadahead/debugfs 0 MB (0% inode=62%): [22:46:58] Yay, l10n cache spam [22:49:49] RECOVERY - MySQL Slave Delay on db34 is OK: OK replication delay 0 seconds [22:50:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:50:34] RECOVERY - Disk space on srv224 is OK: DISK OK [22:50:52] RECOVERY - Disk space on srv223 is OK: DISK OK [22:51:46] PROBLEM - Disk space 
on srv219 is CRITICAL: DISK CRITICAL - free space: / 31 MB (0% inode=62%): /var/lib/ureadahead/debugfs 31 MB (0% inode=62%): [22:51:46] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 178 MB (2% inode=62%): /var/lib/ureadahead/debugfs 178 MB (2% inode=62%): [22:52:49] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.724 seconds [22:53:13] !log reedy synchronized php-1.19/includes 'r111486' [22:53:15] Logged the message, Master [22:54:22] 7.9G isn't enough!! [22:54:37] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 209 MB (2% inode=62%): /var/lib/ureadahead/debugfs 209 MB (2% inode=62%): [22:55:31] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.770 seconds [22:55:40] RECOVERY - Disk space on srv219 is OK: DISK OK [22:55:49] RECOVERY - Disk space on srv220 is OK: DISK OK [22:55:49] RECOVERY - Disk space on srv221 is OK: DISK OK [22:56:52] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:59:34] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:59:34] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=62%): /var/lib/ureadahead/debugfs 199 MB (2% inode=62%): [23:00:46] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.270 seconds [23:00:46] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.016 seconds [23:00:55] RECOVERY - Disk space on srv219 is OK: DISK OK [23:01:06] New patchset: Pyoungmeister; "some loggin for the lsearchz" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2597 [23:02:52] PROBLEM - MySQL Slave Delay on db34 is CRITICAL: CRIT replication delay 217 seconds [23:04:13] RECOVERY - MySQL Slave Delay on db34 is OK: OK replication delay 0 seconds [23:04:49] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:04:58] PROBLEM 
- Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 85 MB (1% inode=62%): /var/lib/ureadahead/debugfs 85 MB (1% inode=62%): [23:07:13] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.593 seconds [23:07:40] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 154 MB (2% inode=62%): /var/lib/ureadahead/debugfs 154 MB (2% inode=62%): [23:08:06] New review: Pyoungmeister; "manually verifying" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2597 [23:08:07] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2597 [23:08:43] PROBLEM - Host ganglia1001 is DOWN: PING CRITICAL - Packet loss = 100% [23:10:52] RECOVERY - Disk space on srv221 is OK: DISK OK [23:12:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:13:25] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=62%): /var/lib/ureadahead/debugfs 0 MB (0% inode=62%): [23:15:31] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.027 second response time on port 8123 [23:17:19] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 84 MB (1% inode=62%): /var/lib/ureadahead/debugfs 84 MB (1% inode=62%): [23:17:19] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 29 MB (0% inode=62%): /var/lib/ureadahead/debugfs 29 MB (0% inode=62%): [23:17:37] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.043 seconds [23:18:09] !log reedy synchronized php-1.19/includes 'r111486' [23:18:11] Logged the message, Master [23:18:40] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 139 MB (1% inode=62%): /var/lib/ureadahead/debugfs 139 MB (1% inode=62%): [23:19:36] !log reedy synchronized php-1.18/extensions/ArticleFeedbackv5/modules/jquery.articleFeedbackv5/jquery.articleFeedbackv5.js 'r111506' [23:19:39] Logged the message, 
Master [23:21:22] RECOVERY - Disk space on srv221 is OK: DISK OK [23:21:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:21:40] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:24:40] PROBLEM - MySQL Slave Delay on db34 is CRITICAL: CRIT replication delay 201 seconds [23:24:40] RECOVERY - Host ganglia1001 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [23:25:16] RECOVERY - Disk space on srv223 is OK: DISK OK [23:25:25] RECOVERY - Disk space on srv219 is OK: DISK OK [23:25:40] !log reedy synchronized php-1.18/extensions/ArticleFeedbackv5/modules/jquery.articleFeedbackv5/jquery.articleFeedbackv5.js 'r111506' [23:25:42] Logged the message, Master [23:26:02] RECOVERY - MySQL Slave Delay on db34 is OK: OK replication delay NULL seconds [23:26:37] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.203 seconds [23:26:46] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 5.130 seconds [23:27:40] PROBLEM - MySQL Slave Running on db34 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown column rev_sha1 in field list on query. Default d [23:29:19] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 38 MB (0% inode=62%): /var/lib/ureadahead/debugfs 38 MB (0% inode=62%): [23:30:31] RECOVERY - Disk space on srv223 is OK: DISK OK [23:30:40] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:32:01] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.607 seconds [23:32:10] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:33:50] Saibo: :) [23:34:22] hexmode: here! 
[23:34:24] :D [23:34:34] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=62%): /var/lib/ureadahead/debugfs 199 MB (2% inode=62%): [23:34:43] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.309 seconds [23:35:01] Saibo: you could pm me, but we may discover something here [23:35:31] * Saibo likes that more - especially for posting error logs.. you never know if there is personal info in  [23:36:21] 1.4 on enwiki too [23:36:26] hexmode: can you get navpopups installed? the links feel so "dead" without it :D [23:36:32] or I import it via URL [23:36:48] but it would be good to see it with a slower computer [23:36:57] hm.. strange [23:37:03] navpopups coming up [23:37:31] hexmode: mine is slow ;) 1.6 GHz (single core), Pentium M [23:37:45] (you might have remembered already) [23:38:11] I remember Suse ;) [23:38:23] hehe, that too [23:38:26] we should try NS4! [23:38:39] what sexual act is that? [23:38:43] :D [23:39:33] gn8 folks [23:39:39] gut nacht, DaBPunkt! [23:39:40] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 58 MB (0% inode=62%): /var/lib/ureadahead/debugfs 58 MB (0% inode=62%): [23:39:49] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 155 MB (2% inode=62%): /var/lib/ureadahead/debugfs 155 MB (2% inode=62%): [23:40:24] hexmode: why can't I change https://test2.wikipedia.org/w/index.php?title=Special:Stabilization&page=Main_Page ? [23:40:51] are flagged revs not really enabled? [23:41:10] RECOVERY - Disk space on srv220 is OK: DISK OK [23:41:41] hexmode: oh, and hotcat, please - also widely used [23:42:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:43:52] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.961 seconds [23:43:53] navpop [23:43:55] done [23:44:09] Reedy: flagged revs on test2? [23:44:21] very important for dewp ;) [23:44:27] and some others... 
[23:44:29] hrm... not sure he is on [23:44:37] i'm here [23:44:57] do note, test.wikipedia.org is also on 1.19wmf1 [23:45:22] is flaggedrevs on it? [23:45:33] it's enabled on both it seems [23:45:36] it is - but only partly [23:45:37] yes [23:45:43] It's enabled [23:45:46] I cannot flag pages [23:45:48] It's just not configured [23:46:16] RECOVERY - Disk space on srv223 is OK: DISK OK [23:46:21] Saibo: how does it need to be configured [23:46:25] RECOVERY - Disk space on srv224 is OK: DISK OK [23:46:54] hexmode: hm.. well - that way that there is a flagged revs button on the bottom of each article page ;) [23:47:29] Reedy: is it just restricted to a certain category? [23:47:31] but: that may not be the best solution. I think it can be enabled also only for some pages [23:47:49] *looks at dewp* [23:48:27] hmm.. not sure [23:48:34] I only know it enabled fully or not ;) [23:50:07] It has over 40 configuration settings [23:50:09] Saibo: I don't know enough about gadgets, evidently. I've edited -definitions, but they aren't showing up [23:50:46] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:50:52] Reedy: for anything that needs to be configured off-wiki, could you just copy dewp? [23:51:58] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.093 seconds [23:52:10] !log reedy synchronized wmf-config/flaggedrevs.php 'Enable FR like dewiki on test2wiki' [23:52:12] Logged the message, Master [23:54:32] Saibo: better? [23:54:55] no [23:54:59] no change [23:55:06] maybe the admins have no rights for that?
[23:55:10] https://www.mediawiki.org/wiki/Extension:FlaggedRevs#Configuration [23:55:14] hrm [23:55:32] the dewiki config has been copied over [23:55:43] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:55:50] so it's exactly the same [23:56:01] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:56:05] *giving me the right* .. but usually that is included in the admin bit [23:56:22] now! [23:56:25] there it is [23:56:32] the group config is different apparently [23:56:53] Saibo: just gave you editor [23:57:05] the I have it twice now :D [23:57:07] can't give you reviewer [23:57:09] *then [23:58:25] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.696 seconds [23:58:43] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.619 seconds