[00:16:20] Re doublewiki [00:16:22] " it hasnt been used much as it hasnt worked for anything slightly complicated" [03:27:42] I did a core dump after a parse of [[User:B jonas/Pages with shortcuts]] [03:27:50] the result: many many copies of the entire article text [03:29:23] it may be a reference leak of some kind [03:31:42] each copy has one extra link in it [03:32:48] every time it processes a link, it adds another copy of the whole article text [04:07:24] cool...progress! [04:08:09] (maybe not a smoking gun, but whiff of gunpowder?) [04:17:17] <^demon> TimStarling: I was thinking about DoubleWiki again. What about throwing the processed text in memcached so it'll save an external request and that mass of regexes. [04:18:26] I guess [04:18:47] it's not ideal, but it's probably better than what is there [04:19:12] what about a cache of the whole table? [04:19:34] <^demon> What table? DoubleWiki just pulls text from interwiki links. [04:19:40] i.e. the body text after processing, which is a element [04:19:53] <^demon> Oh, yeah. I meant caching the final HTML [04:21:03] ok [04:37:27] <^demon> Hmm. [04:42:48] <^demon> Ok, I've done all I'm going to do for DoubleWiki. It's quickly becoming a time sink. [05:28:23] speaking of time sinks [05:28:55] half an hour to go and I'm still working on this stupid memory leak [05:54:20] morning [05:54:27] hello [05:54:42] ready for some 1.17 fun? [05:54:53] after some coffee I may be ;) [05:55:30] hey there [05:55:58] hiya RoanKattouw [05:56:20] :) [05:57:31] well that is annoying [05:57:54] what, that you don't have any coffee brewed? :) [05:58:04] fun fact: I am not online via the wireless at our new dc (from the hotel) beccause the hotel's wireless has stopped working [05:58:22] mifi to the rescue? [05:58:23] and the person at the desk has no tech person they can call to fix it, nor any piece of equipment they can power cycle. 
[05:58:36] I should dig that out, just in case [05:58:44] *robla assumes there's not magical toobz involved [05:59:11] prolly are [05:59:39] I could go to the adjoining hotel and sit in the atrium but it's so much less friendly than being in my pjs in my bed (plus I want to cook dinner) [05:59:46] remind me, what's the point of switching a small wiki first? [06:00:03] to work out any goofy little bugs that show up [06:00:12] in other extensions etc [06:00:13] ? [06:00:27] but we're already running 1.17 on a lot of small wikis [06:00:39] you'd think that the goofy bugs would have shown up there in the last few days [06:00:54] unless there is something different about the wikis we're going to choose [06:01:21] Hmm, good point [06:01:24] If we don't deliberately choose a couple different ones, then, I agree, we're just putting off the inevitable [06:01:49] let's just hit nlwiki then [06:02:08] ...and then do a bunch of small ones if we don't feel like we can get more big ones tonight [06:02:26] guess we'll see [06:03:07] running svn up in php-1.17 [06:03:08] Sounds good [06:03:32] when was I meant to run this maintenance script? [06:03:59] before deployment? [06:04:03] Earlier the better [06:04:11] on all wikis, right? [06:04:14] Yeah [06:05:22] Just saw the script, looks good [06:06:38] what script? [06:07:00] never mind (see it in the logs now) [06:07:27] running [06:08:04] I take it that's to make sure the 4000 or so people with the old edit toolbar don't get upset with us? [06:08:22] it'll take a few minutes [06:08:30] robla: Yes [06:11:05] it's doing enwiki now [06:14:12] hello [06:14:14] robla, how are we doing? [06:14:36] hi guillom: running one last maintenance script before starting [06:14:44] ok, great [06:16:05] we can switch eowiki now, the script is done with it [06:16:58] any objections? 
[06:17:26] Fine by me [06:17:36] go ahead [06:17:53] go for it [06:18:00] srv298 # Not installed yet, Rob will install and uncomment ---- seems to be installed now --catrope: ssh: srv298 # Not installed yet, Rob will install and uncomment ---- seems to be installed now --catrope: Name or service not known [06:18:05] you can't put comments in dsh node files [06:18:54] you can add a hash to the start of the name to make the hostname invalid [06:18:56] which will make it fail [06:18:58] (I'd forgotten eowiki was earlier on the list) [06:19:01] but it's not actually a comment [06:19:26] Grr, dammit [06:19:48] Oh it's already fixed, thanks [06:21:17] All seems quiet [06:21:59] a couple of errors from SVGMetadataExtractor.php, on wikis that we switched a long time ago [06:23:19] Why am I getting this weird fragmented backtrace here? Is that because of the "don't use that high level function for streams" comment in wmerrors? [06:23:28] (Specifically: [06:23:31] #2 [internal function]: addMatchedText(Object(OutputPage), '

Retrieved from "http://simple.wikipedia.org/wiki/File:North_rhine_w_template_2.svg" [06:29:16] doublewiki still whining I see [06:29:34] it should be fixed in 1.17, ^demon fixed it [06:29:43] Browsing nlwiki, looks fine [06:30:02] Oh wit [06:30:03] yes, the complaints are 1.1 [06:30:07] The incomplete backtraces are my fault [06:30:14] Caused by grep -v wmf-deployment [06:30:18] heh [06:30:18] 6 [06:30:19] *RoanKattouw whacks himself over the head [06:30:55] Hmm this is strange [06:31:07] The performance issue we saw last time is just not showing this time [06:31:19] maybe it fixed itself [06:31:25] let's switch some more wikis [06:31:28] I see nothing, yeah [06:32:08] hurray for the performance tuning gnomes! [06:32:22] I told you guys we should wait for the next version! [06:32:28] :-D [06:33:14] small or big? [06:33:21] let's do another big one [06:33:28] *robla looks at the list [06:33:40] I'll do de and fr [06:33:46] .wikipedia.org [06:34:04] Those are, what, 7% of our traffic combined? [06:34:29] (As opposed to just under 2% for nlwiki) [06:35:13] how about we move up this list: http://meta.wikimedia.org/wiki/List_of_Wikimedia_projects_by_size , so ptwiki? [06:35:16] amazing. uptime still good [06:35:49] with de, I mean, if that's not a clincher I don't know what is [06:35:56] (granted, it's not totally accurate to go off of page count, but it's a readable table and probably not the worst proxy) [06:36:02] apergos1: er, enwiki? ;-) [06:36:10] hehe ok, I'll give you that [06:36:24] about 80 RL cache misses a second now [06:36:58] I'm fine with whatever, though I'd recommend one wiki at a time [06:37:16] (and probably not enwiki just yet) ;-) [06:37:17] for hundreds of wikis? ;) [06:37:22] no! [06:37:26] <^demon> How about jawiki? [06:37:31] I was just gonna say [06:37:35] jawiki is #2 in traffic IIRC [06:37:43] ~8% [06:37:51] Be Bold [06:37:52] <^demon> It's daytime there, so we'd get a little more traffic than by sticking to western europe. 
[06:37:53] So that sounds good [06:38:04] Yeah we already did de and fr, although it's early there [06:38:09] I can do them one at a time, for now [06:38:17] I'll do one every few minutes [06:38:23] And apergos1 is right, if we switch de and there is no visible impact on Ganglia, that's saying something [06:38:41] someone fixed The Bug without knowing it ;) [06:38:56] now we're going to have to go back through all the commits and undo them :-P [06:39:21] <^demon> At some point sooner rather than later we should switch commons. Central role like meta and all. [06:39:45] let's do commons next [06:39:51] Yeah [06:39:56] weren;t we going to do mw.org too? [06:40:01] <^demon> We did that Monday [06:40:07] Don't mind me, but I find it rather worrying that *we don't know* why it's working ;) [06:40:09] The typical mw.org way [06:40:24] Get consensus on IRC from like 3 devs :) [06:40:29] <^demon> "Sound good to you?" [06:40:30] worksforme [06:40:31] <^demon> "Sure" [06:40:37] <^demon> "Nobody said no, I'm doing it" [06:40:48] ok commons next [06:41:10] guillom: Well to be fair, we had nlwiki on so briefly before that I'm not sure it was warranted to draw too many conclusions from the CPU spike we saw at that time [06:41:30] a spike which is not happening _at all_ right now though [06:41:32] this is what a deployment should look like [06:41:35] nothing happening [06:41:43] except new features appearing [06:41:52] <^demon> And bugs going away :) [06:42:02] hehe [06:42:43] bits app servers are now at ~ 10% load [06:43:07] if 4 hosts is enough for that, I may do away with that entire cluster ;) [06:43:17] not worth the trouble of keeping it separate [06:43:45] for which? [06:43:53] need coffee? [06:44:00] I'm going to do zhwiki next, maybe it'll stop spamming fatal.log [06:44:03] no, need a better wireless connection [06:44:10] don't you have a mifi? [06:44:14] <^demon> zhwiki and then some of the sources? 
[06:44:16] apergos: bits app servers [06:44:17] yes, I haven't dug it out yet [06:44:19] <^demon> I want the doublewiki errors to go away [06:44:23] ...then do that? :) [06:44:30] On dewiki I get the warning "Failed to load resource: http://bits.wikimedia.org/skins-1.17/$wgExtensionAssetsPath/FlaggedRevs/client/flaggedrevs.css?86?301-1" [06:44:38] Ugh [06:44:45] That's my fault, fixing [06:45:57] we should probably let the dust settle on some of the deployments before plowing ahead [06:46:15] <^demon> apache log is showing nothing of interest. [06:46:15] also, what about one of the languages with variants? [06:46:22] <^demon> zhwiki just went over. [06:46:25] did zhwiki already [06:46:32] Is 1.17 changing something about redirects? http://fr.wikipedia.org/wiki/Lyc%C3%A9e_Lucas-de-Nehou [06:47:11] We should test the variant functionality a bit then [06:47:14] <^demon> Looks like redirected templates maybe? [06:47:26] ^demon, yes [06:47:41] Maybe the alias was killed in TWN? [06:47:53] same error from zh.wikipedia.org in 1.17, guess it's not going to fix that one [06:48:06] Doesn't look like it, 'redirect' => array( '0', '#REDIRECTION', '#REDIRECT' ), [06:48:32] *RoanKattouw wonders whether this might be caused by Tim's magic word fix [06:49:49] ok, let's try to fix it for a few minutes, then revert on frwiki if we don't manage [06:50:36] Same is happening for #REDIRECT [06:51:37] Doesn't seem to be i18n-related; how is this not happening on other 1.17 wikis? [06:51:40] *RoanKattouw tries on mw.org [06:51:52] works for me with #redirect [06:51:59] using Title::newFromRedirect() from eval.php [06:52:30] maybe it's because one is a substring of the other [06:52:34] Right, and not with #REDIRECTION [06:52:50] Is it possible that this is caused by #REDIRECT being a substring of #REDIRECTION ? [06:53:07] So it sees #REDIRECT ION (=garbage) [[link]] [06:53:38] this code has changed, hasn't it? 
[06:54:02] bits servers have ~ 350k objects in cache now [06:54:24] less than 100k in 1.16 [06:55:00] My suspicion is correct [06:55:13] > $mw->matchStartAndRemove($s); echo $s; [06:55:15] ION [[foo]] [06:56:29] #REDIRECT in templates WFM now on frwiki but only if the redirect isn't broken (i.e. points to an existing page) [06:56:50] reverting frwiki to 1.16 [06:58:19] Aha [06:58:22] I found it [06:58:29] Tim's magic word merging code is indeed at fault [06:58:39] It merges the en synonyms /before/ the fr synonyms [06:58:49] there are probably a lot of resourceloader:filter:minify- objects ( mark ) [06:59:05] So it becomes 'REDIRECT', 'REDIRECTION', which is bad if the former is a substring of the latter [06:59:08] apergos1: No, those are in memc [07:00:14] hmm [07:00:31] it doesn't matter which one is first [07:00:41] it should match the longest substring [07:00:49] It doesn't [07:00:56] It just builds a regex using explode [07:00:58] that's MagicWord's fault [07:01:03] REDIRECT|REDIRECTION [07:01:05] I know that [07:01:22] the uncached requests are probably due to bug 27302. ResourceLoader is generating a new url every second for every user. [07:01:25] But it'd be faster to fix what's triggering it than to fix MagicWord itself [07:01:59] pawelx: Oh, I hadn't noticed the timestamp part of that. That hurts [07:02:13] Oh, 1 is rounded down *facepalm* [07:02:42] !bz 27302 [07:02:42] --elephant-- I don't know anything about "bz". [07:03:13] morning [07:03:15] https://bugzilla.wikimedia.org/show_bug.cgi?id=27302 [07:03:36] RoanKattouw: yeah, not sure why the round() is needed there anyway [07:03:43] I'm fixing it [07:06:19] the localised versions should be first, so that $magic->getSynonym(0) returns the localised version [07:06:39] that is used in several places to get a canonical string to use for #redirect [07:07:18] Yes [07:07:26] And your code is putting the English version first, isn't it? 
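Editor's note on the #REDIRECT / #REDIRECTION problem discussed above: the following is a minimal sketch, not the actual MagicWord code. It uses Python's `re` module as a stand-in for PHP's PCRE (both try alternatives left to right and stop at the first one that matches), and the helper names `build_pattern`, `build_pattern_longest_first` and `match_start_and_remove` are hypothetical. It reproduces the `ION [[foo]]` leftover seen in eval.php and shows one possible remedy, ordering the synonyms longest first before joining them.

```python
import re


def build_pattern(synonyms):
    # Naive approach: join the synonyms with "|" in the order given.
    # Alternation is leftmost-first, so a synonym that is a prefix of a
    # later one wins whenever it appears earlier in the list.
    body = "|".join(re.escape(s) for s in synonyms)
    return re.compile("(?:" + body + ")", re.IGNORECASE)


def build_pattern_longest_first(synonyms):
    # Possible fix (an assumption, not necessarily what was committed):
    # sort by length, longest first, so the longest synonym is tried first.
    return build_pattern(sorted(synonyms, key=len, reverse=True))


def match_start_and_remove(pattern, text):
    # Hypothetical analogue of matchStartAndRemove(): strip the magic
    # word from the start of the text and return what is left.
    m = pattern.match(text)
    if m is None:
        return False, text
    return True, text[m.end():]


text = "#REDIRECTION [[foo]]"
naive = build_pattern(["#REDIRECT", "#REDIRECTION"])
fixed = build_pattern_longest_first(["#REDIRECT", "#REDIRECTION"])

naive_rest = match_start_and_remove(naive, text)[1]
fixed_rest = match_start_and_remove(fixed, text)[1]

print(repr(naive_rest))  # 'ION [[foo]]'
print(repr(fixed_rest))  # ' [[foo]]'
```

Sorting by length keeps the fix independent of the order in which the English and localised synonyms are merged, which matters here because the merge order is also wanted for other reasons (the localised synonym should come first for `getSynonym(0)`).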
[07:07:45] yes, so I'll fix both that and MagicWord::initRegex() [07:08:42] i'm gonna get some breakfast, back in a few minutes [07:10:02] *guillom grabs breakfast. [07:10:46] I'm eating dinner :-D [07:10:58] (at 2:15 am) [07:10:58] <^demon> Ok if I turn NewUserMessage back on for 1.17 wikis? The fixes have all been merged in. [07:15:02] I'm fine with it [07:17:07] apergos: ok... any good reason for that? ;) [07:19:06] am I asking the right question by asking "any Chinese language speakers here that can tell us if the zh/zh-yue variant support is working right?", or did I phrase that wrong. [07:19:32] see #wikimedia-tech for why I'm asking that [07:19:33] Like I said before, we should just switch a Serbian wiki [07:19:40] <^demon> We already switched zhwiki [07:19:46] And us English-speaking people will be able to tell for ourselves farily easily [07:20:02] <^demon> Fwiw, getting lots of OOM issues for LangConverter stuff. [07:20:11] Because we're better at distinguishing between Latin and Cyrillic than we are between zh-foo and zh-bar [07:20:29] ^demon: In 1.17? [07:20:40] <^demon> Yes [07:20:45] just look at the stroke count [07:21:06] I checked that variant page views were working right after we switched [07:21:07] looks like one of the key templates on zhwiki broke (see #wikimedia-tech). [07:26:31] the canaries are getting more insistent [07:30:33] <^demon> Googling it pointed me to a php bug, only skimmed it. [07:30:49] I've been looking too, not seeing anything useful yet [07:32:09] we logged a fatal from NoteTA on zhwiki [07:32:16] <^demon> I resolved the "catchable fatal error" with Title::equals(), had a missing MFT. [07:32:44] from an article with NoteTA in it anyway [07:32:52] probably not related [07:33:52] TimStarling: Could you run s-c-a as root? [07:34:36] done [07:34:50] well, in progress anyway [07:36:34] I wish I knew what the api requests were that were resulting in the canary error [07:38:06] The action=parse ones? [07:38:12] what canary error? 
[07:39:07] there are a lot of OOMs from zhwiki [07:39:12] maybe we should fix that next [07:39:41] ALERT - canary mismatch on efree() - heap overflow detected (attacker '208.80.152.46', file '/usr/local/apache/common/docroot/wikipedia.org/w/index.php'), referer: http://de.wikipedia.org/wiki/Reba [07:39:43] there's a sample [07:39:54] <^demon> Fixed a fatal in FR. [07:40:19] I did not see them at first switchover, they showed up after a while [07:40:48] referrers de ja nl... [07:42:15] TimStarling: Turns out I had misdiagnosed the issue fixUsabilityPrefs.php was fixing. Have now put in fixUsabilityPrefs2.php in the local copy, run it on each wiki /after/ switching it to 1.17 [07:42:55] The empty value insertion really meant '0' not 'default', which means we just screwed 4,000 people who had the toolbar disabled, and the empty values are inserted by 1.16 :( [07:45:07] the fact that no api related canaries are chirping now makes me think something got straightened out with the last sync [07:46:06] apergos: how many servers was it coming from? [07:46:47] for the index ones, it's several at least [07:46:56] let me look back for the api ones [07:47:39] I wonder if we can configure it to dump core [07:48:15] 41, 43, 45 for the api ones [07:48:23] which have stopped [07:50:01] my test wiki had $wgCapitalLinks=false, no wonder things were a bit screwy [07:50:03] about 25 servers with the index.php canaries [07:50:41] the api ones are back [07:50:56] <^demon> Whatever's causing them isn't getting caught by wmerrors. 
[07:51:01] nope [07:51:20] efree() is a function internal to PHP [07:51:33] the overview from google is that it's often the wrong version of a library or some other component deployed together [07:51:49] how we would suddenly cause that midway through a 1.17 switch I dunno [07:52:47] that's not a very good overview [07:52:53] it could be anything [07:52:53] cope [07:52:58] nope it isn't [07:53:09] PHP is full of heap overflows [07:53:16] Could it be wmerrors causing/triggering it? [07:53:24] every release has a stack of them that they fix [07:53:37] yes [07:53:45] since wmerrors is part of everything [07:53:54] it's weird that there are far fewer servers with api related whines [07:54:06] if you can reproduce it, we can get a backtrace [07:54:25] if we had the api call params we could [07:54:26] when it's just a few of them across the whole cluster, it will be more difficult [07:55:02] what does suhosin do exactly after it hits that canary error? [07:55:17] does it abort? exit? some other signal? [07:55:31] will it show up in our squid logs? [07:56:43] not in the squid logs, I'm on one know and was looking [07:56:47] *now [07:57:50] with a special awk command, or just with sampled-1000.log? [07:58:27] I was actually looking at the local syslog over there. but right now I am on srv292, one of the apaches that produces the errors [07:58:33] the syslog is full of these [07:59:05] <^demon> I'm on 292 also, but I didn't have access to the syslog :( [08:00:43] ah right, so there's plenty of errors [08:00:54] I thought you were saying there was only a few [08:01:06] why didn't you get a backtrace? [08:01:30] ? [08:01:44] <^demon> wmerrors isn't logging one [08:01:49] use gdb [08:02:28] if there's plenty of errors, you don't have to rely on logging [08:02:39] you can just attach to a random apache process and wait for it to do it [08:02:40] <^demon> gdb isn't installed on the apaches. 
[08:02:52] we install it on all the ones we need it on [08:03:13] it's on srv292 now [08:03:14] now installed on srv292 [08:03:23] dpkg: status database area is locked by another process [08:03:24] E: Sub-process /usr/bin/dpkg returned an error code (2) [08:03:27] mark too fast [08:03:34] sorry [08:03:39] *mark goes back to drinking coffee [08:03:44] i'll install it on all machines [08:05:37] got a backtrace, doesn't help much [08:05:44] you were fast [08:05:52] Can you share it anyway? [08:05:55] I couldn't get a live process by the time I got into gdb [08:06:00] defunct every time [08:06:00] http://p.defau.lt/new.html [08:06:14] er? [08:06:31] how about the actual backtrace instead of the form for a new pastebin? [08:07:44] http://p.defau.lt/?sn6sssHKCX2ROsyl6pd0IQ [08:07:48] thanks [08:07:51] it's $GLOBALS [08:08:01] that's the hashtable that it's destroying [08:08:04] should be able to get a key [08:08:36] great :-/ [08:08:46] wgAutoConfirmAge [08:09:23] to attach before the process disappears, use: gdb -p `ps -C apache2 | tail -n1 | awk '{print $1}'` [08:10:01] ps -C, that's what I wanted [08:14:43] break php_security_log [08:14:45] cont [08:14:47] frame 2 [08:15:05] print (char*)p->arKey [08:16:39] So, people, is there anything preventing us from rolling out to more wikis other than the strange PHP canary thing? [08:17:01] let's try disabling wmerrors at runtime and see if that fixes the canary thing [08:17:17] RoanKattouw: zhwiki OOM [08:17:30] For what, langconv? [08:17:43] <^demon> Yeah [08:19:22] maybe, but the backtraces don't all say langconv anymore [08:19:39] One is in replaceInternalLinks2() [08:19:47] maybe it is the same issue that I spent half the afternoon tracing [08:20:05] RIL2 was where I was working, yes [08:20:18] <^demon> Still getting canaries. [08:20:49] I'll revert it [08:21:39] <^demon> APC cache full/corruption wouldn't cause this, could it? 
[08:21:49] segfaults on srv247, I'm going to restart that one [08:21:50] if it is, we should be able to fix it by restarting [08:22:56] ok avail_mem is only 5MB on srv292 [08:23:04] nice idea ^demon [08:23:57] and the canareis are gone from srv247 now [08:23:59] btw the way to check that is to use the scripts in live-1.5 [08:24:04] BLOCKER: http://bits.wikimedia.org/zh.wikipedia.org/load.php?debug=false&lang=zh-cn&modules=site&only=styles&skin=vector&version=20110216T075000Z load nothing for zh.wikipedia [08:24:26] [mem_size] => 113750032 [08:24:27] hmm, let me find out why [08:24:40] if works only when I remove "&lang=zh-cn" param [08:24:59] It also works when I change the timestamp [08:25:09] Try making a whitespace change to MediaWiki:Common.css [08:25:14] Wfm [08:25:26] That should update the timestamp in the URL after at most 5 minutes [08:26:31] I'll restart all apaches [08:27:01] OK so we have OOMs on zhwiki, but 1) is there much we can do about it and 2) didn't these already happen all the time in 1.16? [08:27:02] RoanKattouw: it also works when I remove the version param [08:27:18] Yes, it'll remove when you use any other URL [08:27:22] s/remove/work [08:27:41] RoanKattouw: no, it didn't work when I tried to change zh-cn to zh [08:27:58] But with a different timestamp? [08:28:06] I wonder if removing the wmerrors hack will alleviate the memory leak (I suppose that's what's happening) [08:28:16] <^demon> I'm curious if canaries will start coming back. [08:28:31] RoanKattouw: with a different ts it works [08:28:39] and I did some edit in mediawiki:common.css [08:28:43] Right [08:28:44] if we leave the code alone, I will guess yes [08:29:00] Within 5 mins it should update the timestamp in the URL for everyone then [08:29:00] but we'll have to wait awhile [08:29:16] <^demon> I assume so. It would be good to go ahead and figure out a better way of debugging if/when it shows up. 
[08:29:28] <^demon> Not much time to react when the processes start going defunct so quickly. [08:29:36] nope [08:29:56] it's not a memory leak [08:30:13] we need to increase APC memory size [08:30:24] I thought it had been [08:32:02] again [08:32:08] say to 200 [08:33:35] <^demon> So the problem is really that MediaWiki has gotten too big for the current APC size? [08:33:53] And that we're running two instances of it, I guess [08:34:04] there were 920 files in it [08:34:08] As for MW getting bigger: we added includes/libs and includes/resourceloader [08:34:10] for two versions [08:34:13] And includes/installer [08:34:13] that's a lot of files [08:34:28] (Although to be fair the latter is never loaded on WMF) [08:34:42] <^demon> includes/libs is small and most of the stuff already existed elsewhere :p [08:34:49] <^demon> But the point remains, lots of code. [08:38:41] so shouldn't I be able to run php /usr/local/apache/common/live-1.5/apc-sma-info.php directly on the apache on question? [08:39:07] you mean through the CLI? no [08:39:21] the cache is in shared memory which is only accessible to apache children [08:39:27] hmph [08:40:08] another problem, http://zh.wikipedia.org/w/index.php?action=raw&title=MediaWiki:Sidebar gives a line "**mainpage|mainpage-description " and in http://zh.wikipedia.org/w/api.php?action=parse&text={{int:mainpage-description}} it's "??????". but it shows "Wikipedia:??????" in site sidebar [08:46:26] TimStarling, apergos: https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Special:Code/MediaWiki/82227 [08:46:53] that's not really urgent [08:47:03] I thought it might be related [08:47:04] the whole pool will be destroyed at request shutdown [08:47:07] Right [08:47:17] and since it's handling a fatal error, request shutdown will be very soon [08:48:37] <^demon> Apache CPU has climbed somewhat over the last 30 minutes or so. [08:48:38] So, any remaining blockers for further deployments? 
[08:48:46] ^demon: Probably due to increased request rate [08:48:56] <^demon> Probably, Europe's waking up [08:49:09] what about liangent's problem? [08:49:14] Oh wait, that current spike is not normal [08:49:49] liangent's problem WFM [08:50:02] Shows ?????? in the sidebar for me [08:50:04] note the content of MediaWiki:mainpage is Wikipedia:?????? [08:50:18] Yes, the link target is Wikipedia:?????? [08:50:24] And the link text is ?????? [08:50:24] RoanKattouw: my uselang=zh-cn [08:50:33] it shows Wikipedia:?????? for me, with uselang=zh [08:50:35] but link text is ?????? [08:50:45] Yes, it happens in zh-cn [08:51:45] Hmm [08:52:01] mainpage-description seems to be correct in zh-cn, and purging MW:Sidebar doesn't help [08:53:57] hmm yes load is somewhat higher... dunno if it's enough to worry about [08:54:32] I have the third problem now... but solving them one by one is ok [08:54:36] What worries me is the abrupt spike [08:54:58] <^demon> canaries are back. [08:55:11] could be demand-side [08:55:12] Don't see anything related in SAL though [08:55:21] liangent: What's your third problem then? [08:55:35] css defined in gadgets are not loaded [08:55:50] srv292 again [08:56:10] just one server [08:56:42] <^demon> Hm, maybe just depool it [08:57:06] liangent: Link to relevant Gadget and CSS not loaded? [08:57:10] Aha, in zh-cn: [08:57:15] **mainpage|mainpage [08:57:32] suspecting hardware problems for srv292? [08:57:38] there goes another server [08:57:45] <^demon> Also srv194 [08:57:49] yes [08:57:54] so no, not suspecting hw [08:58:22] What the... [08:58:32] OK it looks like MediaWiki message overrides aren't working for other languages any more [08:59:13] Yes, that's what's happening [08:59:47] Nikerabbit: Around? Can you help me with a MessageCache issue? 
[09:00:18] the CPU spike was when I restarted the apaches [09:00:31] robla: Hi Rob [09:00:33] Right [09:00:37] 32 minutes past the hour [09:01:11] 60% cpu is relatively high [09:01:14] ok I'm deleting MediaWiki:Sidebar/zh-cn and it looks good not [09:01:16] *now [09:01:17] we normally stay below that [09:01:25] hi nadeesha_calcey [09:01:26] we're still higher than we were [09:01:28] but it could be temporary indeed [09:01:31] before the switch [09:01:33] liangent: There was a /zh-cn page? [09:01:50] RoanKattouw: yes [09:01:52] Yeah that'll have done it [09:02:18] it was old but didn't cause any problem [09:02:43] and I have to delete other remaining subpages (zh-hk etc) [09:03:16] robla: Can we start second window testing now? [09:03:20] nadeesha_calcey: see http://www.mediawiki.org/wiki/MediaWiki_1.17/Wikimedia_deployment#Phase_2_.28underway_February_16.29 for a list of wikis we deployed to so far. some basic testing on the new ones would be good [09:03:58] (I guess frwiki is still on 1.16) [09:04:22] RoanKattouw: no but I have to write {{MediaWiki:Sidebar}} in them? [09:04:26] robla: okay, we can do that [09:04:36] liangent: I don't think so. You should be able to just delete those subpages [09:04:46] liangent: After that, can you tell me more about the broken CSS issue? [09:06:22] I guess MediaWiki:Sidebar should follow content language [09:06:40] I guess so [09:07:58] robla: Tim fixed the root cause of the frwiki problem some time ago [09:08:34] I didn't deploy it, did you deploy it? [09:08:45] I asked you to run s-c-a at some point [09:08:59] IIRC that was just after I'd svn upped for that fix [09:08:59] oh, so I did deploy it, I just didn't realise? [09:09:09] Along with a bunch of CSS-related MFTs, yes [09:10:25] <^demon> Ugh. All the info on the canaries via google is just silly workarounds telling you to disable suhosin. 
[09:10:30] *^demon continues digging [09:11:52] RoanKattouw: how are gadget styles loaded now [09:12:00] The normal way, AFAIK [09:12:05] Could you point me to the one that's broken? [09:12:10] with resourceloader, or just action=raw? [09:12:21] RoanKattouw: pong [09:12:42] RoanKattouw: on zhwikipedia, I'm having gadget-fontsize selected, but no css is loaded [09:13:36] Nikerabbit: With your changes in MessageCache.php , it seems that wfMsg('sidebar') doesn't return the contents of MediaWiki:Sidebar if ui lang != cont lang' [09:13:51] what changes? [09:14:00] The MediaWiki:Messagename/langcode change [09:14:16] From my reading of the code, MessageCache checks for that page only [09:14:25] Not for its parent page [09:15:02] I still don't get it, there shouldn't be much changes in 1.17 message cache compared to 1.16 [09:15:15] In 1.16, MediaWiki:Sidebar/zh-cn was ignored. In 1.17 it's not, so someone complained their sidebar broke. They then found and deleted the zh-cn subpage, but in eval.php wfMsg('sidebar') with $wgLang = Language::factory('nl') still returns the default nl message despite the zh override [09:15:17] restarted them again eh? [09:15:24] Despite that, the sidebar does work right [09:15:49] Or, no, it doesn't, actually :O [09:16:04] So bug #1 is sidebar is using UI lang not contlang [09:16:34] bug #2 is that if there's a [[MediaWiki:Foo]] page but no [[MediaWiki:Foo/nl]] page, your ui lang is nl, but the cont lang is someone else, the parent page should be used but isn't [09:17:01] the fact that restarting apaches helps tells you that it is an APC issue [09:17:40] <^demon> People claim it's fixed in more recent versions of suhosin. [09:17:52] meh [09:18:07] that's what suhosin does, it makes these errors [09:18:13] you can't fix it [09:18:49] if you did, what would be the point of it? [09:19:09] RoanKattouw: what fallback seq should be used in message? 
[09:19:31] No idea, ask Nikerabbit :D [09:19:48] db subpage -> php i18n -> db basepage [09:19:54] <^demon> TimStarling: Trying to learn new stuff here today :) [09:19:56] the other way around, I think [09:20:05] or db subpage -> db basepage -> php i18n for uimsg [09:20:11] yeah, that [09:20:37] mm load is creeping up a bit more [09:20:45] then contmsg? [09:20:55] yes [09:20:58] no [09:21:10] english [09:21:22] it just goes through the fallback sequence for the UI, the content language isn't used [09:21:26] It *seems* to be doing db subpage -> php i18n [09:21:49] I hate svn, how do I diff between two branches [09:22:15] <^demon> svn diff /path/to/branch /path/to/otherbranch, right? [09:22:20] I think so [09:22:26] TimStarling: s/isn't/is in your words? [09:23:07] say if your content language is zh and your user language is de [09:23:25] <^demon> TimStarling: Trying to learn here...how would I go about getting a stack trace from one of these failed apache processes? [09:23:44] without root access, you can't ;) [09:23:47] then wfMsg() should go [[MediaWiki:message/de]] -> [[MediaWiki:message]] -> de i18n -> en i18n [09:23:50] <^demon> Oh :( [09:24:04] RoanKattouw: I remember it was doing db subpage->php i18n so that I had to write a script to duplicate the content of basepage to all subpages [09:24:39] this is not what everyone wants, but it supports things like [[MediaWiki:Sidebar]] [09:25:23] That's what it should do, but that's not what it's doing [09:26:00] this was screwed up in 1.16 too [09:26:05] apparently people just like to screw it up [09:26:14] that's why I remember the order [09:26:53] <^demon> srv151 is segfaulting some. [09:27:12] i'll restart it [09:27:33] done [09:27:44] did these servers just start having problems post-deployment? or did we just start noticing it? [09:29:15] <^demon> I just remember it last week during the staged deploy on a smaller scale. 
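Editor's note on the message fallback order discussed above: a minimal sketch, assuming the order stated in the log for a UI message (on-wiki subpage for the UI language, then the on-wiki base page, then the PHP i18n files for the UI language, then English). Everything here is hypothetical illustration: the dictionaries stand in for the database and the Messages*.php files, `resolve_message` and `page_title` are invented names, and default first-letter capitalisation of page names is assumed. The `skip_base_page` flag mimics the behaviour that was suspected, where the base page step appeared to be left out.

```python
def page_title(key, lang=None):
    # Assumes default capitalisation: first letter of the page name is upper-cased.
    base = "MediaWiki:" + key[0].upper() + key[1:]
    return base if lang is None else f"{base}/{lang}"


def resolve_message(key, ui_lang, db_pages, php_i18n, skip_base_page=False):
    # 1. On-wiki override for the UI language, e.g. MediaWiki:Sidebar/nl
    text = db_pages.get(page_title(key, ui_lang))
    if text is not None:
        return text
    # 2. On-wiki base page, e.g. MediaWiki:Sidebar
    if not skip_base_page:
        text = db_pages.get(page_title(key))
        if text is not None:
            return text
    # 3. Shipped i18n for the UI language, then English as the last resort
    for lang in (ui_lang, "en"):
        text = php_i18n.get((lang, key))
        if text is not None:
            return text
    return None


# Stand-in data: a local sidebar override exists only on the base page.
db_pages = {"MediaWiki:Sidebar": "* local sidebar"}
php_i18n = {("nl", "sidebar"): "* nl default", ("en", "sidebar"): "* en default"}

expected = resolve_message("sidebar", "nl", db_pages, php_i18n)
observed = resolve_message("sidebar", "nl", db_pages, php_i18n, skip_base_page=True)

print(expected)  # * local sidebar
print(observed)  # * nl default
```

With the base page step in place the local override wins; with it skipped, the shipped nl text is returned, which matches what the eval.php test with `Language::factory('nl')` appeared to show.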
[09:29:20] well 151 just started a little bit ago, it was fine for the few hours before [09:30:49] nl reports that categories don't report the number of pages and categories within the subcats [09:31:05] same for 158 [09:31:29] seems this is resolved now though. [09:31:45] perhaps took a while before the cache caught up with updating the messages [09:31:51] RoanKattouw: I don't see any changes between REL1_16 and REL1_17 that could cause that, except one I'm not sure of [09:32:02] a little bit ago in the case of those two servers = the last 15 minutes [09:32:07] <^demon> srv151 segfaults were preceded by OOM for DoubleWiki, I think. [09:32:12] thedj: Yeah Romaine reported that to me already [09:32:13] there should probably be some minor breakage in the area of categories and subcategories [09:32:26] it would be very surprising if I got that complex revert exactly right [09:32:27] Or wait, that may have been categorytree [09:32:36] PHP Fatal error: Allowed memory size of 83886080 bytes exhausted (tried to allocate 64926 bytes) in /usr/local/apache/common-local/php-1.17/languages/LanguageConverter.php on line 400 [09:32:45] for 151 [09:32:58] I've seen a lot of OOMs on that specific line [09:33:08] <^demon> The LangConv OOM has been going since we switched over the wikis using it. [09:33:16] <^demon> Much worse than 1.16 was. [09:33:29] only difference is -$langcode = $lang->getCode(); +$langcode = $lang->getPreferredVariant(); which is out of my league [09:34:10] Yeah Tim just said this was wrong in 1.16 too [09:34:17] who's the designated canary-whacker?
srv292 [09:34:47] root@fenari:/home/wikipedia/syslog# grep 'canary mismatch' apache.log | awk '{print $1}' | sort | uniq -c | sort -rn 72533 Feb [09:34:47] 21406 Dec [09:34:47] 2 Jan [09:34:50] The order should be MW subpage for ui lang -> MW base page -> PHP i18n for ui lang -> fallback chain from there [09:35:16] But the MW base page step is cut out [09:35:36] it's the final step [09:35:46] but that hasn't change since last deployment [09:36:16] is this only happening in chinese wp? [09:36:21] What do you mean it's the final step? It's not much use if its does MW subpage -> PHP i18n with fallbacks -> MW base page, is it? [09:36:29] Nikerabbit: that's what tim said "it supports things like sidebar" [09:36:33] I've only reproduced it on zhwiki, yes, but I haven't tried anywhere else [09:36:45] Nikerabbit: Go to http://zh.wikipedia.org/?uselang=fi [09:36:56] because $langcode has changed... the check against content language code might be broken [09:37:03] You'll get a Finnish sidebar even though they have a local override in [[MediaWiki:Sidebar]] [09:37:48] what do you mean with Finnish sidebar? [09:38:16] Nikerabbit: you get a sidebar with items defined in MessagesFr.php [09:38:29] no I don't actually [09:38:56] and there is no sidebar message in MessagesFi.php [09:39:06] Oh [09:39:08] Try ?uselang=nl then [09:39:12] ok. Go to ?uselang=en [09:39:19] next time a server starts giving canary errors, I'm going to attach with gdb [09:39:27] you probably mean that some of the links are translated? [09:39:43] Also, some of the custom links won't appear, I think [09:39:53] 10.0.2.239 there is your chance [09:40:02] I see all custom links both with nl, fi and no override [09:40:23] Hmm wait, you're right [09:40:26] It works fine [09:40:39] My mistake then, I guess, sorry for the confusion [09:40:41] is it just that they are shown in English without override? 
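[editor's note] The lookup order stated above ("MW subpage for ui lang -> MW base page -> PHP i18n for ui lang -> fallback chain from there") can be sketched roughly as follows. This is only an illustration in Python, not MediaWiki's actual code: `resolve_message`, `db_pages`, `php_i18n` and the `FALLBACKS` table are hypothetical stand-ins for the MediaWiki: namespace pages and the Messages*.php files, and the fallback chains shown are assumed.

```python
# Hypothetical fallback chains; the real ones live in MediaWiki's language files.
FALLBACKS = {"de": ["en"], "nl": ["en"]}


def resolve_message(key, ui_lang, db_pages, php_i18n, fallbacks=FALLBACKS):
    """Return (source, text) for the first place the message is found.

    db_pages  -- dict standing in for pages in the MediaWiki: namespace
    php_i18n  -- dict of {language code: {message key: text}}
    """
    # 1. Local override for the UI language: MediaWiki:<key>/<lang>
    subpage = f"{key}/{ui_lang}"
    if subpage in db_pages:
        return ("db subpage", db_pages[subpage])

    # 2. Local override on the base page: MediaWiki:<key>
    #    (this is the step reported above as being skipped)
    if key in db_pages:
        return ("db basepage", db_pages[key])

    # 3. PHP i18n for the UI language, then its fallback chain
    for lang in [ui_lang] + fallbacks.get(ui_lang, ["en"]):
        text = php_i18n.get(lang, {}).get(key)
        if text is not None:
            return (f"php i18n ({lang})", text)

    # Nothing found anywhere
    return ("missing", f"<{key}>")
```

With a wiki that only customises the base page, `resolve_message("sidebar", "nl", {"sidebar": "custom"}, php_i18n)` picks the base page before any PHP default, which is the behaviour the sidebar override relies on.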
[09:40:52] The texts are translated, but that's expected [09:40:55] The actual list is unchanged [09:41:13] I thought the actual list had changed too [09:41:19] RoanKattouw, Nikerabbit: but in my attempt just now [09:41:24] when I deleted /zh-cn [09:41:30] the list IS changes [09:41:44] my user language is zh-cn [09:41:52] Maybe sidebar does use wfMsgForContent() [09:42:00] Because this still looks wrong: [09:42:10] catrope@fenari:/home/wikipedia/common/php-1.17$ php maintenance/eval.php --wiki=zhwiki [09:42:11] > $wgLang = Language::factory('nl'); echo wfMsg('sidebar'); [09:42:17] Returns the message from MessagesNl.php [09:42:31] so I added /zh-cn and wrote {{mediawiki:sidebar}} or its content in it to get the customized sidebar bacvk [09:42:33] of course it uses wfMsgForContent [09:42:34] *back [09:42:56] only in some wikis like commons it's overriden to use wfMsg [09:43:40] Nikerabbit: I didn't check the code but how to explain what happened when I deleted /zh-cn then [09:43:55] I'm pretty sure the change I pasted is the cause of the issues you see and I don't [09:44:37] *RoanKattouw wonders what getPreferredVariant() does for zh/zh-cn [09:45:23] Behaves normally, it seems [09:46:07] # Is this a custom message? Try the default language in the db... [09:46:12] Right, so the check is there indeed [09:46:17] if it returns anything other than what was originally given... then everything using wfMsgForContent will have funny effects [09:46:24] RoanKattouw: I already copied mediawiki:sidebar to all zh variant subpages... [09:46:30] it=getPreferredVariant [09:46:31] .. but only at the end [09:46:46] It returns zh-cn for zh-cn [09:47:57] re canary: I got a URL and a server, it didn't help [09:48:09] :-( [09:48:10] RoanKattouw: did you look at LanguageConverter::getPreferredVariant()... 
it depends on whole lot of things [09:48:20] it did it for random pages, not just the URL I logged [09:48:23] and by definition it is not guaranteed to return the same code always [09:48:55] Nikerabbit: I tried in eval.php and it seemed to work. Will read source [09:48:55] TimStarling: what url was it? [09:49:18] http://de.wikipedia.org/wiki/Hain [09:49:24] but you know it didn't help, right? [09:49:44] it works as well as any URL [09:50:09] a page without anything fancy [09:50:21] Hmm it depends on user too [09:50:25] liangent: Are you User:Liangent ? [09:51:27] maybe there's some extension newly used in 1.17, eg. for tiff handling? [09:51:31] Hmm, still returns zh-cn if I use his username [09:52:10] RoanKattouw: yes [09:52:39] RoanKattouw, are you going to deploy to other wikis today? [09:52:54] I don't know [09:52:59] robla: ? [09:53:16] TimStarling: What's the status regarding blockers for expanding deployment? How serious is the canary thing? [09:53:41] can we deploy all of the wikis? [09:53:46] to 1.17 [09:53:57] if it's a problem with running two copies of MW at once, that would fix it [09:54:28] as long as we're ready to back off it [09:54:29] I am worried about the current 60% cpu usage, but we can try [09:54:43] I'm concerned about en wiki usage (though it's off hours here) [09:54:45] (sorry...stepped away for a sec) [09:54:48] Let's try all minus enwiki first [09:54:57] or just enwiki [09:55:03] Could do that too [09:55:08] I'd say give enwiki a shot [09:55:08] Probably easier :) [09:55:11] Whee [09:55:47] 10.0.2.239 whackacanary [09:56:10] RoanKattouw: I tried to reproduce what happened but failed [09:56:10] by deleting some subpages [09:56:11] RoanKattouw: well just leave it there. the gadget problem? 
[09:56:12] apergos: it's restarted now, it was the one I was using for testing [09:56:13] RoanKattouw: I don't what is the intention there, but that change breaks the check mediawiki:basepagename first when using wfMsgForContent() [09:56:17] So enwiki now? [09:56:20] if we do all then we can see if it's the two version issue; [09:56:22] liangent: Yes, let's move to the gadget problem [09:56:27] if we just do en we don't gain that [09:56:40] let's just be ready to back out is all [09:56:41] I'll switch en, but if the problem with APC is related to running multiple versions of MW, it won't help [09:56:47] what tim says [09:56:56] Switching en will give us confidence to switch the rest, though [09:57:30] With regards to CPU usage and such [09:57:39] rather a lot of seg faults... [09:57:59] well the decision is made I guess [09:58:50] uh oh [09:58:55] thousands of bits cache misses [09:59:08] Of course [09:59:15] bits now unstable [09:59:19] crap [09:59:26] I expect nagios messages any second [09:59:40] do you want it reverted? [09:59:54] hmm [09:59:59] let's wait a few minutes [10:00:07] I don't see how we can do better in any other way [10:00:16] Bits caches CPU usage seems to have flatlined [10:00:30] Do I understand correctly that missing CSS/JS is caused by overload of bits server? [10:00:38] *Bits app servers CPU [10:00:56] ody> [10:00:56]

Error 503 Service Unavailable [10:00:57] Service Unavailable [10:00:57] Guru Meditation:

[10:01:44] client side is getting better. i get more resources on each page refresh [10:01:45] apaches seem ok [10:02:13] manually pooled all bits servers [10:02:40] RoanKattouw: in old version gadget css are loaded with Guru Meditation:

XID: 4354112

[10:02:51] RoanKattouw: is it still working in this way currently? [10:03:12] only my user modules are now missing [10:03:22] > 10k cache misses per second [10:03:23] is it now 09 utc? [10:03:23] not good [10:03:25] everything else is there [10:03:30] ouch [10:03:40] much more than previous deployment attemps [10:03:46] probably because this time the apache cluster is not overloaded [10:03:53] Nikerabbit: 10:08 UTC [10:03:59] *sigh* [10:04:08] yeah, our working backend DoSing your non-working frontend [10:04:25] So bits/Varnish is overloaded? [10:04:28] varnish is ok [10:04:34] the varnish backends aren't [10:04:37] anyway, gather data [10:04:43] we'll revert soon [10:04:49] so get what data you can [10:05:13] lots of segfaults on srv248 [10:05:17] *mark restarts apache [10:05:33] geoiplookup is also having it difficult. [10:05:43] <^demon> date_create(): Failed to parse time string (2008-23-11T11:43:00.00+00:00) at position 6 (3): Unexpected character in /usr/local/apache/common-local/php-1.17/includes/GlobalFunctions.php on line 2071 [10:05:58] and there went nagios [10:06:18] thedj: geoip is also on bits [10:06:38] 0 Backend_health - srv249 Still sick 4--X-R- 0 3 8 0.011589 0.001020 HTTP/1.0 403 Forbidden [10:06:38] 0 Backend_health - srv248 Still sick 4--X-R- 0 3 8 0.011826 0.000993 HTTP/1.0 403 Forbidden [10:06:41] huh [10:06:47] TimStarling: revert [10:07:14] .request = [10:07:14] "GET /w/load.php HTTP/1.1" [10:07:14] "Host: en.wikipedia.org" [10:07:14] "Connection: close"; [10:07:27] why would that give 403 [10:07:35] ^demon: the year only has 12 months here [10:07:55] doh [10:07:57] no useragent? [10:08:02] of course [10:08:04] :-D [10:08:06] which didn't happen with the temp script [10:08:06] <^demon> Platonides: Yep, something's passing a bad string. wmerrors didn't grab a stacktrace though :\ [10:08:06] that's what I was thinking [10:08:08] *mark fixes [10:08:17] or is it a US date? [10:08:43] were the bits app servers in LVS? 
or was varnish just sending traffic directly? [10:08:45] the year first format always has then month-day [10:09:01] directly [10:09:06] it maybe that someone wrote that in an article, though [10:09:07] CPU recovery on bits app servers [10:09:16] *mark deploys a UA fix [10:09:24] <^demon> Platonides: It's supposed to be that way, but some American probably wrote it backwards ;-) [10:09:51] silly americans ;) [10:11:00] are those delayed nagios whines? [10:14:00] /home/wikipedia/syslog/syslog is 109GB [10:14:01] apergos: No, esams bits just went bonkers [10:14:15] ugh [10:14:16] is it time to write a logrotate.d script yet? [10:14:44] still 150GB free, we can let it go to 350 [10:14:53] *250 [10:14:54] there is one [10:15:03] it's just not working right now because of nfs1/nfs2 move [10:15:14] root@nfs2:/home/wikipedia/syslog# head syslog [10:15:14] Dec 18 06:28:30 208.80.152.63 squid[12583]: storeUpdateCopy: Aborted at 12742 (4096) [10:15:16] is enwiki still on 1.17? [10:15:16] I believe there is an rt ticket [10:15:21] No [10:15:29] enwiki's back on 1.16 [10:15:33] ok [10:15:36] since about 10 seconds after you told me to revert [10:15:40] need to get bits.esams stable again [10:16:32] linux scheduler is melting on those boxes [10:18:58] # grep 'canary mismatch' apache.log | awk '$1 == "Feb" { print $2}' | sort | uniq -c [10:18:59] 42429 16 [10:18:59] 1 5 [10:18:59] 34668 8 [10:20:14] the 8th was when we switched all wikis, right? [10:20:29] Yes [10:20:51] so the problem can't be wmerrors because we hadn't written it then [10:21:03] And it can't be partial deployment either, can it? [10:21:08] no [10:21:10] Because I hadn't written that either [10:23:26] <^demon> That makes sense and could've saved us suspecting wmerrors, seeing as we all remember canaries from partial deploy. [10:24:14] ? [10:24:23] I don't remember canaries from anythin [10:24:34] <^demon> I remember mentioning it, but it wasn't as widespread. [10:24:39] <^demon> Because we deployed less. 
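[editor's note] The per-day breakdown produced above with grep, awk, sort and uniq can also be done in a few lines of Python. This is a minimal sketch, assuming syslog-style lines whose first two whitespace-separated fields are the month and the day; `canary_counts_by_day` is a made-up helper name, not an existing tool on the cluster.

```python
from collections import Counter


def canary_counts_by_day(lines, month="Feb"):
    """Count 'canary mismatch' lines per day of the given month.

    Mirrors: grep 'canary mismatch' | awk '$1 == month { print $2 }' | sort | uniq -c
    """
    counts = Counter()
    for line in lines:
        if "canary mismatch" not in line:
            continue
        fields = line.split()
        # fields[0] is the month name, fields[1] the day of month
        if len(fields) >= 2 and fields[0] == month:
            counts[fields[1]] += 1
    return counts
```

For a real log, pass an open file object, for example `canary_counts_by_day(open("apache.log", errors="replace"))`; iterating the file line by line avoids reading a multi-gigabyte log into memory.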
[10:25:04] <^demon> Might've just gotten lost in the backscroll and temporarily fixed by apache restarts in between. [10:26:17] nobody has reported it [10:26:26] no actual users, right? [10:26:48] the user-visible problems would be fairly subtle [10:27:43] I don't think there were any, no [10:27:49] Also, it happens on shutdown, right? [10:27:54] yes [10:27:58] Has PHP already sent the output to Apache by that time? [10:28:15] connecting to the apache directly, you see an error about unclean socket shutdown [10:28:19] but maybe squid hides that [10:29:09] there may have been other problems though [10:29:11] Best case, they get their page normally [10:29:27] Worst case, they see it as any other fatal error, so all they'd notice is an increased blank page rate [10:29:28] segfaults, corrupted output, etc. [10:29:36] it's a symptom of a serious problem [10:29:54] Corrupted output? [10:29:58] I see an increased blank page rate on load.php [10:30:08] Yeah you would [10:30:13] Because it's on bits and you're in Europe [10:30:14] wikidiff2 was readded at the same time as the first deployment [10:30:21] Hmmm [10:30:26] Good point [10:30:40] although I expect that would mean an equal failure rate in 1.16 [10:30:50] no it wasn't [10:30:59] wikidiff2 was enabled on the 11th [10:31:04] the other candidate extension i see is intl [10:31:07] # 09:38 Tim: enabling wikidiff2 on all servers by manually creating wikidiff2.ini [10:31:22] so we can strike wikidiff2 [10:32:14] it doesn't have to be an extension, it could be a userspace change [10:32:15] Stupid question time: did you restart Apache after that? 
[10:32:42] I would have checked with phpinfo() that it was enabled [10:33:00] I was working with the assumption that the bug wasn't in php core [10:33:15] just to simplify, as it seems more unlikely [10:34:03] error rates on the 8th, breakdown by minute: http://p.defau.lt/?O_zyZ7x98ICH5QUM5WpiwA [10:35:34] Correlates excellently with 1.17 [10:35:50] The second deployment, specifically [10:36:13] Wow and the first oo [10:36:29] Although the second seems to have been much worse [10:37:06] in the first one, I restarted all apaches regularly because of the high load [10:37:16] so that would have hidden it [10:37:44] Right [10:39:59] *RoanKattouw lols at http://torrus.wikimedia.org/torrus/CDN?token=T25664&view=last24h and guesses it's the 5-minute cache expiry on the RL startup module [10:44:22] I wonder if it only gives a canary error on requests to 1.17 wikis [10:44:40] leave srv254 for a while, I'll find out [10:58:51] one hour left. what should we do with it? [10:59:08] I was just thinking the same thing, a bit sleepily [10:59:09] How serious is the canary issue? [10:59:23] And did bits get fixed so it can handle deploying enwiki? [10:59:25] how serious is the varnish backend issue? [10:59:46] I don't think the canary issue is serious enough to stop us, but the bits issue is [10:59:48] ok [10:59:54] bits.esams is recovering [10:59:59] in a bit i'm ready for one more enwiki test [11:00:17] bits.esams should be rather unrelated to 1.17 [11:00:27] What was the problem? Had it depooled all servers because of 403s? [11:00:35] one hour left to? 
[11:00:37] same problem we've been seeing for a while [11:00:41] Just a thought: There's a bad regex in JavascriptDistiller that causes apache to crash on windows (stack overflow), wondering if this may be related to your memory corruption issue [11:00:49] varnish thread pileup under some nonideal conditions, and then it won't recover [11:01:13] is a bug between varnish and the kernel [11:01:23] pawelx: it would only cause segfaults [11:01:30] not heap overflows [11:02:27] Hydriz: 12:00 UTC is the end of our maintenance window [11:02:47] ok, I thought stack could leak into heap area and cause corruption or sth [11:02:47] ? [11:02:57] <^demon> pawelx: File a bug though, so it doesn't get forgotten :) [11:02:59] Which means Wikimedia would be updated then? [11:03:02] btw the heap overflows are only from the 1.17 wikis [11:03:03] is there someone not working on the performance problems [11:03:19] so if there's APC cache corruption, it's somehow magically localised to 1.17 [11:03:27] or other problems will be deferred until the performance problem is resolved? [11:03:37] 12 pm UTC time... [11:03:54] ^demon: I thought there was one already, can't find it now though [11:04:18] ok, ready for enwiki deployment attempt #4? [11:04:29] really? [11:04:37] Hydriz: It's 11:09 UTC [11:05:03] I know, but it is interesting that you guys are deploying it at noon UTC [11:05:41] Hydriz: see #wikimedia [11:06:09] TimStarling: I'm ready, I say bring it. [11:06:11] mark? [11:06:11] Others? [11:06:20] TimStarling: hrm.... [11:06:32] mark: how confident are you that the bits problem is fixed? [11:06:35] if mark says it's a go [11:06:43] even if it's just more data collection [11:06:50] i'm waiting for traffic to be moved back [11:06:53] should be a couple more minutes [11:07:21] ok [11:07:26] that was effectively a pretty large downtime event for us. 
given how long the recovery was, I'd want to be pretty darn sure we're ready [11:07:43] The long recovery was Europe-only AFAIK [11:07:58] well, if it's only Europe :-/ [11:07:59] yep, and not very related to 1.17 [11:08:32] but that's why we've scheduled a maintenance window [11:09:31] so...back to my question: mark: how confident are you that you've got a fix? [11:10:01] confident enough that the UA issue is resolved [11:10:12] the varnish problems in general are not fixed and can happen anytime [11:10:14] so let's do it [11:11:22] mark: should this be a quicker recovery? [11:11:40] we never know how long it'll take to recover varnish [11:11:52] typically, moving traffic to the US recovers it but can take some time [11:12:24] let's give it a shot I guess [11:12:31] sweet [11:12:44] there's no point in delaying it [11:12:50] Moving traffic to the US is a quick recovery from the /users/' perspective, though, right? [11:13:00] well, it takes 15 minutes [11:13:14] it just took longer since I extended the downtime to investigate it [11:13:20] better now (during maintenance window) than any other time [11:13:33] so it'll be slightly faster now probably [11:14:23] sorry, was distracted, let's do this [11:15:24] done, enwiki back on 1.17 [11:16:27] ah good, CPU is rising this time on the backend, instead of falling [11:16:31] still forbiddens [11:17:02] hmm so it is (rising) [11:17:47] restarted varnish on sq67 [11:17:54] it had two different backend probes running, it seemed [11:17:56] that seem sa bug [11:17:57] better now [11:19:32] bits backend looks remarkably calm (from ganglia) [11:19:41] it's looking good [11:20:39] *robla tries to avoid jinxing it just yet [11:21:33] cache misses are at an acceptable level [11:21:40] a few hundred a s [11:21:55] apaches are a bit more loaded but still stable-ish [11:22:43] if that's where they're going to sit form now on we prolly want to throw a few more in the pool [11:22:50] yep [11:23:03] <^demon> TimStarling: I thought you 
shut up those XML errors from svg metadata extraction? [11:24:14] it won't suppress all errors from XMLReader [11:24:23] so...75-80% of our traffic is on 1.17 now? [11:24:32] 52 plus whatever we had before [11:24:37] I just changed the settings to make it a bit less noisy [11:24:42] <^demon> Oh ok [11:25:01] We had jp and de (not fr) before, so it's somewhere in the 60s, maybe 70, I'd say [11:25:09] we should switch over the rest of the wikis before our window closes [11:25:12] let's do the rest soon yeah [11:25:18] yes [11:25:35] Let's just do that with cat all.dblist > 1.17.dblist for now [11:25:39] We can normalize the setup later [11:25:52] there's a command called "cp" that does a very similar thing [11:25:58] I was just thinking that, yeah :) [11:25:59] I think I might use that ;) [11:26:07] heh [11:26:10] except I made a backup first [11:26:13] let's get it done. might fix the apc probs [11:26:14] but just in case let's be ready to back off [11:26:29] perfect [11:26:48] Whee [11:27:09] 542 dropped requests in varnish during this push ;) [11:27:25] I am very sorry for those 542 people [11:27:29] :-D [11:27:30] but other than that it seems fine right now [11:27:47] front end load is lower [11:28:05] bits app server load is high [11:28:08] 80% [11:28:12] PHP Fatal error: Call to undefined function wfGetIP() in /usr/local/apache/common-local/php-1.17/wmf-config/CommonSettings.php on line 2340 [11:28:28] a *lot* lower [11:28:30] hmmmm [11:28:55] oh, we're back to that are we? [11:28:56] bits app server load dropped a bit again [11:28:57] he.wp looks good [11:29:02] 60% now [11:29:11] you really can't use wfGetIP() in CommonSettings.php [11:29:16] sorry [11:29:29] zh looks good with variants as well [11:29:29] Wasn't that reverted before? [11:29:39] I thought that was fixed yeah. but... [11:29:59] unless it's a revert back to an older config [11:30:08] mark: Probably a lot of initial resource generation. 
Those things are changed very infrequently and are cached for like forever [11:30:13] That was the IP limit exemption thing? [11:30:14] yep [11:30:20] guillom, yus [11:30:24] i am happy enough [11:30:27] I fixed it [11:31:00] looks like I need to find 2 more apaches for the bits app server pool, but it's fine for now [11:31:14] huh on some hosts the load went up (spike) and on some it dropped suddenly ... weird [11:31:26] mark, just make RobH get the new data centre online quicker :P [11:31:34] that won't help anything [11:31:42] it's a replica after all [11:32:17] Pfft. It was with sarcasm [11:32:18] root@srv234:~# gdb -batch -p `ps -C apache2 |grep -v defunct | tail -n4 | head -n1 | awk '{print $1}'` -x ~tstarling/url.gdb 2>&1 | grep http [11:32:18] http://mk.wikipedia.org/w/index.php?title=%D0%A1%D0%BF%D0%B5%D1%86%D0%B8%D1%98%D0%B0%D0%BB%D0%BD%D0%B0%3A%D0%91%D0%B0%D1%80%D0%B0%D1%98&search=%D1%80%D0%B0%D0%B7%D0%B2%D0%BE%D1%98+%D0%BD%D0%B0+%D0%BA%D0%BE%D0%BC%D0%BF%D1%98%D1%83%D1%82%D0%B5%D1%80%D0%B8%D1%82%D0%B5 [11:32:30] sorry for spam [11:32:42] it's not even a one-liner anymore, you need the extra script [11:32:55] bits app server load stabilized around 40-45% [11:32:57] that's excellent [11:33:09] but it does show you the URL of the next heap overflow error, which is nice [11:33:15] my gut feeling about 4 RL apaches was correct ;) [11:33:31] nadeesha_calcey: still around to help with testing? 
[11:33:33] http://eiximenis.wikimedia.org/1-17-allwikis [11:34:36] http://ganglia.wikimedia.org/?c=Apaches%208%20CPU&h=srv226.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=3&mc=3 [11:35:20] time to order some more apaches [11:35:25] maybe so [11:35:26] we haven't done that in a long time [11:35:30] it's coming down to under 10 [11:35:33] probably because we were not deploying any new features ;) [11:35:39] shhh [11:35:45] I didn't know we had only 109 Apaches in the general pool [11:35:48] byt srv260 (for example) is hovering around 6 [11:36:08] 1950 vs R610 [11:36:14] slightly less in the new dc even [11:36:17] but they will be all 12-cores [11:37:04] load does seem a bit uneven indeed [11:37:09] remember, there are jobrunners too [11:37:18] hmm that's true [11:38:40] so where are we now? canaries and...? [11:39:06] plenty of OOM errors to fix [11:39:25] that's probably easier and more useful than canaries [11:39:39] because I should be in the data center in about... [11:39:50] shouldn't you get some sleep? :) [11:39:58] < 3 hours, and the site doesn't look like it's going to melt [11:40:04] yep. I'm heading to bed [11:40:11] hmm, were vector prefs overwritten [11:40:18] thedj: Oh, that's right, the script [11:40:20] Fixing [11:40:20] see folks later. if Rob pops into the channel and I'm not at the dc I'm still in the be [11:40:21] d [11:40:23] apergos: remember, the new data center is not important enough for you to exhaust yourself on it [11:40:29] i see people saying they have the new toolbar while they used to have the old toolbar selected [11:40:32] so please get enough sleep [11:40:45] I got some sleep before coming on line tonight. I will not be setting the alarm now though [11:40:47] and you can walk over to the dc afterwards ;) [11:40:51] yep [11:41:01] that's the plan. [11:41:03] ok [11:41:06] sleep well then! 
[11:41:13] thanks [11:41:27] have an uneventful and bug-squashing night [11:42:13] nite apergos [11:42:18] TimStarling, is there a list of these OOM's or anything for us to start poking at? [11:42:23] thedj: Known issue, running a script to (partially) fix it like right now. The ones still having the issue will, unfortunately, have to switch it off again [11:42:24] Reedy: I; [11:42:28] Reedy: I'll give you one OOM [11:42:29] no [11:42:40] I think I've figured out the canaries [11:42:49] http://mediawiki.pastebin.com/aTaCdCVB [11:45:08] http://no.wikipedia.org/w/index.php?rand=16&title=Liste_over_verdens_st?rste_olje-_og_gasskraftverk [11:45:12] rand=16? [11:45:45] Reedy: probably a userscript tool [11:45:58] hm [11:46:58] ok, the maintenance banner is going to go down automatically in 10 minutes. I assume that's ok. [11:47:24] guillom, is it worth putting up noting upgrade to 1.17 has happened, report issue to irc/bz? [11:48:02] Reedy, we can't put that in the notice, but it's what the landing page on mw.o has been saying all along, so I hope it'll be enough [11:48:28] RoanKattouw, no.wiki doesn't seem to want to load at all.. [11:49:11] Yeah, whole of no.wiki is giving 504 [11:49:15] 504? [11:49:19] ok it is the time to solve user-side problems? [11:49:37] Gateway timeout [11:49:46] no.wikipedia.org seems to work fine for me [11:50:10] does now [11:50:15] robla: hi Rob, I have http://zh.wikipedia.org remaining to test, others are completed [11:52:20] It seems that for some reason adding buttons to mwCustomEditButtons in monobook is broken [11:52:45] Hmm [11:57:35] from the obscure wiki testing dept: http://mediawiki.pastebin.com/eiiaVQdn [11:58:14] robla, can you get a URL to go with that? 
[11:58:17] http://wikimaniateam.wikimedia.org/wiki/Special:RecentChanges [11:59:15] oh, fun [12:00:09] I recognize this [12:00:21] Wondering what caused it again [12:01:16] ok, heap overflows (canary errors) have stopped [12:01:34] I almost had it about 4 hours ago [12:02:39] that's the hashtable that it's destroying [12:02:40] should be able to get a key [12:02:41] wgAutoConfirmAge [12:03:13] What was it? [12:03:16] and I thought $wgAutoConfirmAge was the most boring possible global, and that it must just be complaining about a random global variable [12:03:23] You'd think so, yeah [12:03:30] but it turns out that it is used as a reference in a strange and unusual way [12:03:37] $wgAutopromote = array( [12:03:37] 'autoconfirmed' => array( '&', [12:03:37] array( APCOND_EDITCOUNT, &$wgAutoConfirmCount ), [12:03:37] array( APCOND_AGE, &$wgAutoConfirmAge ), [12:04:26] it turns out that it's *always* that global that it dies on [12:04:43] and when I changed that code a bit, it died on $wgAutoConfirmCount instead [12:04:47] A reference in an array? How is that unusual? [12:05:01] and when I removed the reference from DefaultSettings.php, it stopped happening completely [12:05:15] It has been around for about two or three years [12:05:23] yes, it's been around for a while [12:05:48] I haven't isolated it completely yet, but let's just say that there are not many global variables that have array elements overwritten by extract() [12:06:01] haha [12:06:04] and then those elements passed around from one function to another [12:06:20] So I guess we may have found a very obscure PHP bug here [12:06:36] yes, there will be a bug report eventually, assuming it's not fixed already [12:06:40] we're not using the latest PHP [12:07:02] <^demon> The only bug report I found relating to canaries claims to have been fixed around 5.2.10 or so [12:07:46] A bug can be fixed without having been reported [12:07:50] did you search for bogus bugs? 
[12:07:58] A committer could've noticed it and fixed it, or it could've been fixed inadvertently [12:08:10] Or they could've marked it as bogus. My money's on that one [12:08:43] <^demon> TimStarling: They usually show up in google searches, iirc. [12:10:00] http://bugs.php.net/search.php?search_for=heap+overflow&boolean=1&limit=30&order_by=&direction=DESC&cmd=display&status=All&bug_type=All&php_os=&phpver=&cve_id=&assign=&author_email=&bug_age=0&bug_updated=0 [12:10:48] <^demon> I didn't search well, obviously. It was ~4am ;-) [12:11:29] robla: I just navigate to this http://zh.wikipedia.org/wiki/Special:PrefStats, and noted the page content has english content, and also the site has few texts in english such as IRC, MIME. I am not sure whether this is a bug or the expected [12:13:07] nadeesha_calcey: I'm not sure either, but I'm not too worried about that one [12:13:56] IT's preferences [12:13:59] They are stored as such in the DB [12:14:03] so it's very unlikely to be localised [12:14:33] nadeesha_calcey: I think the most valuable thing right not is probably to help out with this list: http://eiximenis.wikimedia.org/1-17-allwikis [12:14:33] robla: ok. thanks Rob [12:15:27] nadeesha_calcey: there's a chance that one of the wikis on that list is completely down, and we wouldn't know it for a while because the community may be too small to have someone that knows how to reach us [12:16:28] Feb 16 12:18:30 10.0.2.155 apache2[1627]: [error] [client 208.80.152.74] ALERT - canary mismatch on efree() - heap overflow detected (attacker '208.80.152.74', file '/usr/local/apache/common-local/php-1.17/wmf-config/CommonSettings.php', line 265) [12:16:54] it's $wgStyleSheetPath, another global that's used as a reference [12:17:33] robla: I ll have a look on that list and let you know the status [12:18:43] nadeesha_calcey: thanks. 
we've got volunteers in the community also working on that list, so make sure you mark your results as you do them (see the instructions on the top) [12:19:06] I'm probably going to go to bed here in just a little bit [12:19:37] robla: ok, I ll mark the results. [12:21:28] robla, where did you get that list from? It contains wikis outside the cluser that WMF doesn't manage :) [12:21:48] guillom: I constructed it off a page on meta [12:21:58] I'm sure there's a better source somewhere :) [12:22:07] like Special:SiteMatrix ? :) [12:23:32] yeah, probably like that one :-) oh well... [12:24:06] or the api sitematrix ;) [12:25:47] I'll get it right for 1.18 :) [12:26:04] My eyes are bleeding, and I blame robla http://nv.wikipedia.org/wiki/%C3%8Diyis%C3%AD%C3%AD_Naaltsoos [12:26:26] LOL [12:26:29] Oh. My. GOD [12:27:22] <^demon> 1999 called, they want their clip art back. [12:27:41] :D [12:29:27] alright, back in a couple minutes, then off to bed [12:36:01] ok, are there some non-canary things that I can help with? [12:36:45] OpenSearchXML is generating an array_map() callback error.. But i've no idea what wiki, it works fine on mw.org and locally [12:36:49] I was gonna ask for a script to grep all MW: pages for document.write, but Bryan's already running one he wrote [12:37:08] <^demon> I can't replicate the array_map() bug yet locally or on enwiki. [12:37:46] toolserver is overloaded like hell, so it may take a while [12:38:51] "An error occurred while invoking the map callback": that's very vague [12:39:15] yeah, there's only one line it could be, but still, it's useless [12:39:45] PHP Warning: wfMkdirParents: failed to mkdir "/mnt/thumbs/wikipedia/it/thumb/6/63/Vista_dal_Castello.jpg" mode 511 in /usr/local/apache/common-local/php-1.17/includes/GlobalFunctions.php on line 2317 [12:40:15] Similar error from commons [12:41:39] Is anyone working on CategoryTree issue right now? 
[12:41:41] Feb 16 12:44:52 10.0.2.155 apache2[12347]: PHP Warning: Invalid argument supplied for foreach() in /usr/local/apache/common-local/php-1.17/extensions/Collection/Collection.templates.php on line 233 [12:41:41] Feb 16 12:45:02 10.0.2.236 apache2[19247]: PHP Warning: unpack() [function.unpack]: Type C: not enough input, need 1, have 0 in /usr/local/apache/common-local/php-1.17/includes/media/GIFMetadataExtractor.php on line 140 [12:41:58] goodnight everyone, and great work! [12:43:02] there's no array_map() call on line 85 [12:43:03] vvv: I looked at it briefly but didn't see anything. Go ahead and take it if you want [12:43:21] Yeah, it's on 77 [12:43:40] <^demon> The unpack() warning has existed since 1.16 as well [12:43:55] not a big deal then [12:44:04] <^demon> Reedy: Ignore mkdir errors. [12:44:35] TimStarling, was just 3 of those in a row.. [12:46:29] <^demon> TimStarling: There's also a persistent OOM error in LinkHolderArray, line 265. [12:50:16] If someone went onto the specific apache, could you tie up the error to a request? [12:50:50] if we knew what process it was going to occur on [12:51:01] or if we could generate thousands of them so that all processes got one [12:51:09] bleh [12:51:17] line numbers aren't meant to be approximate [13:06:59] RoanKattouw: it seems like CategoryTree is completely broken on category pages [13:07:04] *looks [13:08:59] ^demon: what are you doing in my file? [13:09:20] <^demon> TimStarling: I can leave it, I just had an idea. [13:12:34] What's approximare revision no. for 1.16wmf4? [13:12:58] http://www.mediawiki.org/wiki/Branch_points [13:12:59] vvv, http://www.mediawiki.org/wiki/Branch_points [13:13:04] 62817 [13:13:06] Thanks [13:13:54] this ApiOpenSearchXml.php error takes too long [13:14:51] in between test cases [13:15:46] mark: 404 @ http://bits.wikimedia.org/be-x-old.wikipedia.org/load.php [13:16:01] ^demon: what was your idea? 
[13:16:08] my fatal thing is not being hit [13:16:21] <^demon> Just on the off chance, try switching the callback to public. [13:16:47] either that array_map() is not returning null, or it's not being called at all [13:17:14] when I changed the file, the line number changed as well, to track the close brace [13:17:17] very mysterious [13:17:28] <^demon> The line numbers seem very approximate :\ [13:17:35] <^demon> Makes debugging fun [13:19:06] maybe it happens in CLI mode [13:19:17] no, we know the process name [13:20:17] I am looking at the source of array_map() in PHP 5.2.4 [13:20:34] and right after it displays that error, it calls RETURN_NULL() [13:21:08] so how can it display the error and not hit my fatal? [13:21:29] <^demon> This is in ext/standard/array.c, right? [13:21:37] yes [13:23:21] Okay, I found where CategoryTree was broken [13:23:27] r81671 [13:24:01] vvv, that's to go back in at some point soon hopefully [13:24:38] The current CategoryTree version in 1.17wmf1 depends on $mCategoryViewerClass [13:32:39] RoanKattouw: if you are interested, the CategoryTree bug may be fixed by reverting r71051 in 1.17wmf1 until the categorylinks are reintroduced [13:32:50] !r 71051 [13:32:50] --elephant-- http://www.mediawiki.org/wiki/Special:Code/MediaWiki/71051 [13:48:32] good night [13:50:25] <^demon> Night Tim, thanks. [13:57:01] RoanKattouw: still there? [13:57:13] and available to help with site-specific problems? [13:57:14] Yes [13:57:31] ok. first is gadgets [13:57:36] vvv: I just reverted r71051 on 1.17wmf1 and deployed it. Wanna check if it worked? 
[13:57:52] check our mw:gadgets-definition and user_properties records of mine [13:57:58] RoanKattouw: looks like it does [13:58:01] I have gadget-fontsize selected [13:58:27] RoanKattouw: http://ru.wikipedia.org/wiki/%D0%97%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0 [13:58:39] Open it in firefox [13:58:43] and in mw:gadgets-definition it's defined as * fontsize|fontsize.css [13:58:49] Yes [13:58:58] but mediawiki:gadget-fontsize.css is not loaded for me [13:58:59] Is it broken? [13:59:04] vvv: Broken how? [13:59:25] I don't see anything obviously wrong [13:59:29] RoanKattouw: a lot of junk and uglily collided divs [13:59:31] liangent: It's loading through RL , probably [13:59:40] It seems to be gone [13:59:51] Probably was a temporary isssue [13:59:54] No, looks fine [14:00:10] liangent: Which wiki? [14:00:20] RoanKattouw: zhwiki [14:01:10] ok I see "ext.gadget.fontsize" [14:01:21] Yeah that'll be [14:01:26] but I didn't see effect [14:01:29] it [14:01:33] Hmm [14:01:35] ie, font size is not changed [14:01:57] oh my fault [14:02:18] I have javascript disabled [14:02:36] but this can be a regression: for users without js, css gadgets can be used before but cannot be used not, and can cause some style splash [14:02:54] s/not/now [14:03:09] Yes, that's known [14:03:11] "style splash" I don't know how to call it [14:03:22] Flash of unstyled content [14:03:36] RoanKattouw: Aren't you going to deploy r82219 (RL/timestamp rounding fix)? Seems to me like it could save a lot of pointless requests.. [14:03:46] I think I deployed it already [14:03:50] Didn't I? [14:04:14] Don't know, but I still see many of those 200 OK per page [14:04:44] Huh, I didn't [14:04:47] Let me merge it [14:05:59] the next, RoanKattouw, look at page source of http://zh.wikipedia.org/wiki/Gmail (I chose this article randomly) [14:07:11] there exist '[11]' and '
  • ^' [14:07:41] Yeah, so? [14:07:44] note that in id= a prefix _ exists but in href no prefix [14:08:16] Ah [14:09:34] I know there's a message about prefix [14:10:07] is it also something wrong in wfMsg? [14:10:13] I don't think so [14:10:18] There's only one msg for the prefix [14:14:00] RoanKattouw: did you see it and find something wrong in code? [14:14:15] I'm still looking at it [14:20:32] I saw an RC entry saying ??new????????09:31:43?? MediaWiki:Gadgets-definition/zh-sg??? (?????? | ??????) . . (+1) . . Shizhao (?????? | ?????? | ??????) (creating page with content of "-") [14:21:09] I guess he was also finding a workaround of the bug of wfMsg [14:21:21] I think it should really be checked [14:23:36] Is that philip tzou? [14:24:58] RoanKattouw: no [14:25:14] that is just shizhao, svn name=shizhao :) [14:25:24] Oh OK [14:25:30] so did the performance pronlem fix itself? [14:25:32] Philip Tzou came up with a fix for MessageCache [14:25:40] Nikerabbit: I guess so [14:25:50] *RoanKattouw ambushes Nikerabbit into reviewing r82246 [14:26:08] i'm on a train [14:26:19] in 30 mins perhaps [14:26:25] OK [14:27:21] the next problem (maybe not a "bug") is that people are complaining about the change of {{CURRENTMONTHNAME}} [14:27:31] What changed? [14:27:32] what it SHOULD be you think? [14:27:40] Month name in content langauge? [14:28:22] yes and chinese doesn't have special month names like January, February etc. [14:28:42] we're just calling them "the first month, the second month" etc [14:28:52] OK [14:29:27] it was written (literally translated) as "1 month", "2 month", "3 month" [14:29:42] So it's a translations issue then? [14:30:01] but it's now written as "one month", "two month" (ie, number in chinese, not arabic numbers) [14:30:07] RoanKattouw: i don't like it - it special cases something that affects much more [14:31:15] Nikerabbit: What about just changing the order of the checks to check the MW base page right after the MW subpage? 
[14:31:47] RoanKattouw: what the translation of {{CURRENTMONTHNAME}} should be (by definition?) [14:31:49] it would still be broken imho [14:32:33] wfMsg('february') [14:33:03] the problem is that messagecache doesn't know if wfmsgforcontent is called [14:33:27] it should only override when ui message is wanted [14:33:29] And? [14:33:39] RoanKattouw: Thanks for merging.. Bits servers seem to appreciate it too http://ganglia.wikimedia.org/?c=Bits%20application%20servers&m=cpu_report&r=hour&s=descending&hc=3&mc=3 :) [14:33:43] The MW base page is always in the content lang, isn't it? [14:33:56] Wooooaaah [14:34:07] mark: RoanKattouw: Thanks for merging.. Bits servers seem to appreciate it too http://ganglia.wikimedia.org/?c=Bits%20application%20servers&m=cpu_report&r=hour&s=descending&hc=3&mc=3 :) [14:34:10] yes, and it's special [14:34:16] mark: We should get pawelx a bottle of wine or something :D [14:34:31] Cut bits load in half [14:34:47] which thing? [14:34:50] Nikerabbit: So how does forcontent vs. forUI affect checking of the MW base page [14:34:52] (before I head over to the dc) [14:35:02] apergos: ? [14:35:09] cut the load in half [14:35:26] A software change I made on pawelx 's request [14:35:29] wfmsgforcontent obviously needs to check the basepagename first [14:35:35] Cut CPU load on the bits Apaches from 40% to 20% [14:35:48] and only that, not any translations [14:35:50] I was asking which change :-P [14:35:58] Nikerabbit: Yeah but won't it be passingthe correct code to get() as $langcode ? 
[14:36:23] !r 82219 [14:36:23] --elephant-- http://www.mediawiki.org/wiki/Special:Code/MediaWiki/82219 [14:36:36] Like all massive-impact changes, it's a one-liner [14:36:41] oh the rounding [14:36:43] heh [14:36:44] the code is useles, we would need special value for "content language" [14:36:46] I believe it [14:36:51] ok, oughta here, back shortly [14:36:52] 1 -> rounded to 0 -> wfTimestamp(0) == now [14:37:09] yep [14:37:17] ..over and over again :-D [14:37:21] tah [14:40:50] a temp fix could be to revert the change until a better solution is implemented.. the old behaviour has been there for a while [14:41:16] Which change? [14:41:27] Philip's [14:41:29] ? [14:41:30] getpreferredvariant [14:41:35] Oh [14:41:47] Philip is over in #wikimedia-tech , I'd say go talk to him :) [14:49:43] RoanKattouw: are you working on broken bold buttons? [14:49:48] Yes [14:49:53] SVN upping the fix now [14:51:21] back home [14:51:50] Damn SVN up is slow on the cluster today [14:52:39] RoanKattouw: I think this stems from the fact that we store configuration data (or language independent data) in mw namespace [14:53:41] Yeah [14:56:14] What was the reason the Gadgets were not resourseloaderified? [14:57:00] what? the extension itself is [14:57:15] gadgets themselves need to be made and marked rl-compatible [14:58:04] Mhmm [14:58:12] And doing each and every gadget would be a bitch [14:58:47] rl-compatible? [14:59:37] http://www.mediawiki.org/wiki/ResourceLoader/Migration_guide_(users) [15:06:37] RoanKattouw: r82250 was a fix for a bold button bug? [15:07:06] Aye [15:07:14] It should be fixed now [15:07:34] $wgExtensionAssetsPath = "http://bits.wikimedia.org/w/extensions-1.17" on 1.17 wikis [15:07:50] Thanks [15:12:34] whii [15:12:43] [15:12:49] is that too supposed to come from bits? 
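The rounding bug summarised at 14:36:52 ("1 -> rounded to 0 -> wfTimestamp(0) == now") can be illustrated with a small sketch. This is not the ResourceLoader code and not the r82219 patch: `ROUND_TO`, `as_unix_time`, `round_buggy`, and `round_fixed` are hypothetical names, and the clamp in `round_fixed` is only one way such a bug could be avoided. The only behaviour taken from the log is that a zero timestamp ends up being treated as "now".

```python
import time

ROUND_TO = 10  # hypothetical rounding granularity, in seconds


def as_unix_time(ts, now=None):
    """Mimic the 'zero means now' convention mentioned in the log."""
    if now is None:
        now = int(time.time())
    return now if ts == 0 else ts


def round_buggy(ts):
    # Rounds down to a multiple of ROUND_TO; a placeholder value of 1 becomes 0.
    return ts - ts % ROUND_TO


def round_fixed(ts):
    # Assumed fix: never let a non-zero timestamp collapse to 0.
    return max(1, ts - ts % ROUND_TO) if ts else ts


placeholder = 1
# Buggy path: 1 -> 0 -> "now", so the value differs on every request.
buggy_a = as_unix_time(round_buggy(placeholder), now=1000)
buggy_b = as_unix_time(round_buggy(placeholder), now=2000)
# Fixed path: 1 stays 1, so the value is stable across requests.
fixed_a = as_unix_time(round_fixed(placeholder), now=1000)
fixed_b = as_unix_time(round_fixed(placeholder), now=2000)
print(buggy_a, buggy_b, fixed_a, fixed_b)
```

If a value like this feeds into a cache key or a last-modified check, the buggy path changes on every request, which would be consistent with the "many of those 200 OK per page" observation earlier in the log.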
[15:14:31] No [15:14:34] It has to be same-origin [15:14:48] We brought it in from bits before and that broke it [15:14:58] uncool [15:15:23] Whoa, Apache load now almost at 80% [15:15:37] And a cool load spike on bits due to syncing two JS files [15:48:45] RoanKattouw: any resolution now? [15:48:53] liangent: Of what, the Cite issue? [15:49:14] I'm looking at it. There's all sorts of message overrides in the MW namespace on zhwiki [15:50:07] RoanKattouw: I didn't really know how it should work... [15:50:26] so when I wanted to write a message [15:50:39] I write it in base page, and use a script to copy it to all variant subpages [15:50:51] so.. [15:50:52] Yeah, that sucks [15:50:59] Nikerabbit and philip_tzou are working on a fix for that [15:51:07] I'm just saying, Cite has customizations too [15:52:43] Whee, Firefox crash. That's #3 today [15:53:36] Too many tabs? [15:53:37] use a standalone irc client then [15:53:45] Heavy add-ons? [15:54:01] All of the above? ::) [15:54:11] about Cite extension: yeah but I guess the extension itself shouldn't generate broken links [15:54:44] whatever prefix sysops use [15:54:57] It seems to, though [15:55:03] liangent: Yeah I'm considering moving to XChat [15:55:34] guillom: I haven't had that many tabs open today. Once upon a time I had two windows open for like a week, one with ~60 tabs and one with ~30 [15:55:43] Actually that was only like a month ago [15:56:54] RoanKattouw, what's the message name for the prefix [15:57:11] Ah [15:57:19] I guess just deleting it and all subpages can fix it on zhwiki [15:57:20] cite_reference_link_prefix and cite_references_link_prefix [15:57:25] Could try that [15:57:51] It's working on enwiki, so it's something zhwiki-specific [15:57:53] And it's not JS [15:58:16] RoanKattouw, you can also run chatzilla standalone [15:58:28] How so? 
[15:58:49] as a xulrunner app [15:59:10] that used to be possible [16:01:18] liangent: Hmm this Cite bug is a total mystery to me [16:01:41] It's zhwiki-specific somehow but I have no friggin' clue what's going on [16:02:00] It's literally using the same code to get the prefix for and for [16:02:20] liangent: Deleting all customizations of those two prefix messages in all languages/variants is a good idea, could you try that now? [16:02:38] don't know whether I should delete those MW pages... [16:03:48] Why wouldn't you? [16:04:02] What could possibly be the reason for customizing the references anchors? It makes no sense [16:05:47] if I delete them and Cite works maybe it's impossible to debug then [16:06:49] I don't really care [16:06:57] You can always undelete if you need to [16:07:11] We've had so many problems today that I care about making stuff work, not necessarily about why it was broken [16:13:52] RoanKattouw: https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Special:Code/MediaWiki/82253 [16:15:57] Cool [16:16:02] Nikerabbit: Is that rev OK with you? [16:18:32] RoanKattouw: by deleting those pages it becomes "cite_ref-1" and works [16:18:58] It works now? [16:18:59] not sure if this will break links from external sites [16:19:47] yes but the id changed [16:19:48] Better to break external links than same-page internal links :) [16:20:17] I guess you can try changing it back after I deploy this message-related fix [16:21:23] when [16:22:21] When I get it to merge cleanly :) [16:23:08] 10 minutes or 1 hour? [16:23:13] 5 mins [16:24:34] Nikerabbit: Merging it to 1.17wmf1 results in only changing getPreferredVariant() to getCode() [16:25:43] how are things going?
[16:26:02] All quiet on the Western front, it seems [16:26:07] JS breakage, of course [16:26:33] Bryan came up with a list of hundreds of MW: pages that contain 'document.write', which is the surest sign of potential breakage [16:26:47] Like, the bad kind of breakage where the JS obliterates the entire page to replace it with one button [16:26:56] Or a featured article star, or something like that [16:28:08] RoanKattouw: yep [16:28:59] Nikerabbit: What about the changes to Language.php in r77452 (more getCode->getPreferredVariant)? [16:29:33] Aaah never mind [16:29:35] That's in r82246 [16:32:44] liangent: Synced [16:33:11] liangent: Let's try undeleting just the base pages for the two prefix messages (not the subpages), and see if that works. If not, we just delete them again. Sound good? [16:33:48] what about subpages [16:33:56] there were /zh-cn and /zh-tw ones [16:34:14] With the same content? [16:34:24] Try without restoring those first [16:34:29] yes [16:36:31] it doesn't work. the underscore prefix is missing again [16:36:31] can they be dropped in some HTML processing steps? [16:37:13] underscore can be somehow special: equals to space somewhere and get trimmed [16:38:25] I have no idea [16:39:38] what's the prefix used in enwiki? with a leading underscore? [16:39:53] No idea [16:39:56] Checking [16:40:19] It's the default [16:40:21] cite_note- [16:41:03] do you have the right to edit them to something with a leading underscore and see whether it works [16:41:26] or on a smaller or abandoned wiki [16:41:29] I do but I'd rather not [16:41:32] Yeah, on a smaller wiki [16:41:33] test2wiki! [16:45:57] liangent: You nailed it, it's the underscore [16:46:06] Something's removing it [16:46:15] See https://secure.wikimedia.org/wikipedia/test2/wiki/Cite [16:49:16] then the next step [16:49:16] let me delete those messages on zhwiki first?
[16:50:33] Oh this is interesting [16:50:43] [[#_foo|Foo]] renders to Foo [16:52:20] hundreds [16:52:23] well they'll be busy [16:52:30] back to racktables for me (bleah) [16:53:45] Hmm [16:54:33] RoanKattouw: I get the same result on testwiki [16:54:50] Yeah [16:54:51] http://test.wikipedia.org/w/api.php?action=parse&text=[[%23_foo|Foo]] [16:54:54] So that's where the real bug is [16:55:32] but .. this can be the reason that zhwiki community chose a prefix with a leading underscore (before I joined) [16:55:48] they'll not be mixed with typed anchors [16:56:58] and stop people writing [[#_ref-1|something]] because the index can change [16:57:40] Right [16:57:52] Well the fact that the underscore is removed is almost certainly a bug [17:01:00] then file it in bugzilla? [17:01:03] are you doing it [17:01:26] Not yet, will get to it [17:02:54] ok, then go to CURRENTMONTHNAME. do you think what translation in mediawiki should be [17:02:58] pawelx: Your contribution in the news :) http://identi.ca/notice/64645439 [17:03:23] if someone in zhwiki propose changing it back [17:03:31] should it be done locally, or in mediawiki [17:03:44] CURRENTMONTHNAME should just return wfMsg('february'), right? [17:03:49] yes [17:04:01] Did it do that before and does it do so now? [17:04:22] it must be someone changed the translation [17:04:30] from 2 month to two month [17:05:08] for me I prefer two month for February and 2 month for Feb [17:05:16] Then fight it out on TranslateWiki :) [17:05:25] You can also override MW:february locally if you really want to [17:05:34] RoanKattouw: lol, that was quick [17:06:00] Signpost picked up on it quickly [17:06:27] the problem is the whole system was originally designed for english... [17:06:40] Hmm, after an hour, but still... original message: http://identi.ca/notice/64641179 [17:06:42] any isn't there CURRENTDAYNAME [17:07:04] s/any/why [17:07:26] No idea [17:07:34] You should tallk to Nikerabbit about i18n issues like these [17:07:50] ok.. 
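The Cite breakage traced above (the id keeps its leading underscore, the href loses it) fits liangent's guess that underscores are treated like spaces and trimmed. A minimal sketch of that hypothesis follows. `normalize_fragment`, `make_id`, and `make_href` are hypothetical helpers, not MediaWiki functions, and the real code path was not identified in this log; the `_ref-` prefix is borrowed from the `[[#_ref-1|something]]` example only for illustration.

```python
def normalize_fragment(fragment):
    """Hypothetical link normalisation: underscores act as spaces,
    leading/trailing whitespace is trimmed, spaces become underscores again."""
    text = fragment.replace("_", " ").strip()
    return text.replace(" ", "_")


def make_id(prefix, n):
    # The element id is emitted with the configured prefix untouched.
    return f"{prefix}{n}"


def make_href(prefix, n):
    # The link target passes through link normalisation first.
    return "#" + normalize_fragment(f"{prefix}{n}")


for prefix in ("cite_note-", "_ref-"):
    anchor_id = make_id(prefix, 1)
    href = make_href(prefix, 1)
    status = "ok" if href == "#" + anchor_id else "BROKEN"
    print(f"id={anchor_id!r} href={href!r} {status}")
```

Under this assumption a prefix without a leading underscore, such as the default cite_note-, survives unchanged, while a prefix with one produces an href that no longer matches its id. That would match enwiki working and zhwiki not.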
[17:09:10] whee! power outage last night. [17:13:00] is there a bugzilla-like system on enwiki? [17:15:02] users' messy complain in village pump is difficult to organize... [17:30:00] RoanKattouw_away: cc me when you file the bugreport. thanks [18:02:48] hi jorm [18:03:05] heya. [18:05:15] what's the thing with links? [18:18:56] i'm not sure what you mean? [18:30:43] roblaAFK, you asleep? [18:31:58] if he's awake, i'd be surprised. [18:32:30] thought so, but wanted to check [18:33:00] the power went out here around 2:30 a.m. so i don't have irc scroll after that but he was active then, so. . . [18:39:29] hi guillom, I'm up now [18:40:11] robla: we ended the mtg without u ;- ) [18:40:41] no worries...more time for me to eat breakfast, then :) [18:40:50] :) [18:51:00] http://blog.xkcd.com/2010/05/03/color-survey-results/ [19:23:10] RoanKattouw, TrevorParscal: are we quite serious about removing old wikibits functions in future? [19:23:22] Trevor will have to answer that, I'm busy [19:23:30] I see [19:23:31] yes [19:23:39] replacing them anways [19:23:52] with better solutions to the same problems [19:24:05] sometimes though, it would just be cleaning and refactoring things [19:24:10] like the table sorting [19:24:20] it should get refactored into a clean jquery plugin [19:25:03] And when will it happen? [19:25:25] as soon as we have a full set of alternatives [19:25:33] and people start using them enough [19:25:41] that we can justify removing the old code [19:26:04] the added benefit is that so far the jQuery replacments for most of these things are a tiny fraction of the old code size [19:26:19] so eventually, we will be sending less JS to the client overall [19:26:27] When they are removed, will we ship some Javascript b/c packs so users of other wikis can continue use their scripts? 
[19:26:48] we could [19:26:51] that would be neat [19:27:16] it's not meant to be a war against b/c, it's just meant to be a migration towards jquery and smaller code sizes [19:27:22] better practices :) [19:27:43] Well, I'm currently talking about b/c [19:27:56] other than wikibits, most legacy code is defined and executed on load, so there's nobody using it really [19:28:15] From what I read, you are going to remove legacy wikibits code [19:28:24] And ungodly amount of user JS depends on it [19:28:39] not immediately, eventually, and we are very aware [19:29:11] gtg - be back online in a bit - Roan also knows this topic well if you have more questions - see http://www.mediawiki.org/wiki/ResourceLoader/JavaScript_Deprecations [19:29:12] cya [19:29:23] I have no time to talk right now [19:29:29] NS_CATEGORY is totally broken on mlwiki [19:29:50] RoanKattouw: maybe something with the Unicode 5.1 normalization thingy? [19:29:54] Could be [19:30:03] or was that deployed in 1.16wmf4 already? [19:30:39] Yes [19:35:02] ugh. i need a better tool for drawing process flows than photoshop. [19:39:23] Inkscape? [19:39:34] kind of the same problems. [20:38:07] maybe i'll just draw on paper and then scan/photograph it. [20:45:56] whiteboard -> photograph is what I do usually [21:00:24] robla: scrum today or no? [21:00:47] hexmode: yeah, let's do one last call [21:01:24] In 10 mins right? [21:01:28] yup [21:01:57] RoanKattouw: Hm.. my dirty new Image() hacks no longer work to patrol edits :P [21:02:21] good [21:02:42] I know that, so I'm prepared. Updated tools this morning. [21:06:03] robla: weird http://fr.wikipedia.org/wiki/Fichier:RueDeParis.jpg [21:07:39] hexmode only can't connect on that file? 
[21:08:05] Platonides: yes [21:08:47] *robla looks for where that's linked from [21:09:25] it could be a really long metadata greater than mysql max packet size [21:09:44] *Bryan pokes at TS [21:09:46] but I'd expect the error not to be "Can't connect" but "Connection interrupted" or so [21:10:26] img_metadata: 0 [21:14:08] *hexmode pasted "the funky js" at http://paste2.org/p/1250444 [21:15:25] Do you guys hear me? [21:15:35] Do you hear the buzz? :P [21:15:40] Grr [21:15:53] How about now? [21:15:57] Meh [21:15:59] Lemme grab the other headset [21:37:29] robla: http://identi.ca/notice/64641179 [21:37:34] robla: Signpost RTed it too :) [21:55:54] TrevorParscal: https://bugzilla.wikimedia.org/show_bug.cgi?id=24859 can this be done in RL? [21:56:14] TrevorParscal: I'm really asking if it is a post-deployment fix or a 1.18 fix [21:58:18] TrevorParscal: or not really a RL issue at all [23:02:33] as a matter of interest ... are all projects now on the new release ? [23:02:47] They should be [23:03:04] There were some issues where ones with hyphens in the name were delayed [23:04:22] thanks