[00:16:20] Re doublewiki
[00:16:22] " it hasnt been used much as it hasnt worked for anything slightly complicated"
[03:27:42] I did a core dump after a parse of [[User:B jonas/Pages with shortcuts]]
[03:27:50] the result: many many copies of the entire article text
[03:29:23] it may be a reference leak of some kind
[03:31:42] each copy has one extra link in it
[03:32:48] every time it processes a link, it adds another copy of the whole article text
[04:07:24] cool...progress!
[04:08:09] (maybe not a smoking gun, but whiff of gunpowder?)
[04:17:17] <^demon> TimStarling: I was thinking about DoubleWiki again. What about throwing the processed text in memcached so it'll save an external request and that mass of regexes.
[04:18:26] I guess
[04:18:47] it's not ideal, but it's probably better than what is there
[04:19:12] what about a cache of the whole table?
[04:19:34] <^demon> What table? DoubleWiki just pulls text from interwiki links.
[04:19:40] i.e. the body text after processing, which is a <table> element
[04:19:53] <^demon> Oh, yeah. I meant caching the final HTML
[04:21:03] ok
[04:37:27] <^demon> Hmm.
[04:42:48] <^demon> Ok, I've done all I'm going to do for DoubleWiki. It's quickly becoming a time sink.
[05:28:23] speaking of time sinks
[05:28:55] half an hour to go and I'm still working on this stupid memory leak
[05:54:20] morning
[05:54:27] hello
[05:54:42] ready for some 1.17 fun?
[05:54:53] after some coffee I may be ;)
[05:55:30] hey there
[05:55:58] hiya RoanKattouw
[05:56:20] :)
[05:57:31] well that is annoying
[05:57:54] what, that you don't have any coffee brewed? :)
[05:58:04] fun fact: I am not online via the wireless at our new dc (from the hotel) because the hotel's wireless has stopped working
[05:58:22] mifi to the rescue?
[05:58:23] and the person at the desk has no tech person they can call to fix it, nor any piece of equipment they can power cycle.
[05:58:36] I should dig that out, just in case
[05:58:44] *robla assumes there's not magical toobz involved
[05:59:11] prolly are
[05:59:39] I could go to the adjoining hotel and sit in the atrium but it's so much less friendly than being in my pjs in my bed (plus I want to cook dinner)
[05:59:46] remind me, what's the point of switching a small wiki first?
[06:00:03] to work out any goofy little bugs that show up
[06:00:12] in other extensions etc
[06:00:13] ?
[06:00:27] but we're already running 1.17 on a lot of small wikis
[06:00:39] you'd think that the goofy bugs would have shown up there in the last few days
[06:00:54] unless there is something different about the wikis we're going to choose
[06:01:21] Hmm, good point
[06:01:24] If we don't deliberately choose a couple different ones, then, I agree, we're just putting off the inevitable
[06:01:49] let's just hit nlwiki then
[06:02:08] ...and then do a bunch of small ones if we don't feel like we can get more big ones tonight
[06:02:26] guess we'll see
[06:03:07] running svn up in php-1.17
[06:03:08] Sounds good
[06:03:32] when was I meant to run this maintenance script?
[06:03:59] before deployment?
[06:04:03] Earlier the better
[06:04:11] on all wikis, right?
[06:04:14] Yeah
[06:05:22] Just saw the script, looks good
[06:06:38] what script?
[06:07:00] never mind (see it in the logs now)
[06:07:27] running
[06:08:04] I take it that's to make sure the 4000 or so people with the old edit toolbar don't get upset with us?
[06:08:22] it'll take a few minutes
[06:08:30] robla: Yes
[06:11:05] it's doing enwiki now
[06:14:12] hello
[06:14:14] robla, how are we doing?
[06:14:36] hi guillom: running one last maintenance script before starting
[06:14:44] ok, great
[06:16:05] we can switch eowiki now, the script is done with it
[06:16:58] any objections?
[06:17:26] Fine by me
[06:17:36] go ahead
[06:17:53] go for it
[06:18:00] srv298 # Not installed yet, Rob will install and uncomment ---- seems to be installed now --catrope: ssh: srv298 # Not installed yet, Rob will install and uncomment ---- seems to be installed now --catrope: Name or service not known
[06:18:05] you can't put comments in dsh node files
[06:18:54] you can add a hash to the start of the name to make the hostname invalid
[06:18:56] which will make it fail
[06:18:58] (I'd forgotten eowiki was earlier on the list)
[06:19:01] but it's not actually a comment
[06:19:26] Grr, dammit
[06:19:48] Oh it's already fixed, thanks
[06:21:17] All seems quiet
[06:21:59] a couple of errors from SVGMetadataExtractor.php, on wikis that we switched a long time ago
[06:23:19] Why am I getting this weird fragmented backtrace here? Is that because of the "don't use that high level function for streams" comment in wmerrors?
[06:23:28] (Specifically:
[06:23:31] #2 [internal function]: addMatchedText(Object(OutputPage), '

Retrieved from "http://simple.wikipedia.org/wiki/File:North_rhine_w_template_2.svg"
[06:29:16] doublewiki still whining I see
[06:29:34] it should be fixed in 1.17, ^demon fixed it
[06:29:43] Browsing nlwiki, looks fine
[06:30:02] Oh wait
[06:30:03] yes, the complaints are 1.1
[06:30:07] The incomplete backtraces are my fault
[06:30:14] Caused by grep -v wmf-deployment
[06:30:18] heh
[06:30:18] 6
[06:30:19] *RoanKattouw whacks himself over the head
[06:30:55] Hmm this is strange
[06:31:07] The performance issue we saw last time is just not showing this time
[06:31:19] maybe it fixed itself
[06:31:25] let's switch some more wikis
[06:31:28] I see nothing, yeah
[06:32:08] hurray for the performance tuning gnomes!
[06:32:22] I told you guys we should wait for the next version!
[06:32:28] :-D
[06:33:14] small or big?
[06:33:21] let's do another big one
[06:33:28] *robla looks at the list
[06:33:40] I'll do de and fr
[06:33:46] .wikipedia.org
[06:34:04] Those are, what, 7% of our traffic combined?
[06:34:29] (As opposed to just under 2% for nlwiki)
[06:35:13] how about we move up this list: http://meta.wikimedia.org/wiki/List_of_Wikimedia_projects_by_size , so ptwiki?
[06:35:16] amazing. uptime still good
[06:35:49] with de, I mean, if that's not a clincher I don't know what is
[06:35:56] (granted, it's not totally accurate to go off of page count, but it's a readable table and probably not the worst proxy)
[06:36:02] apergos1: er, enwiki? ;-)
[06:36:10] hehe ok, I'll give you that
[06:36:24] about 80 RL cache misses a second now
[06:36:58] I'm fine with whatever, though I'd recommend one wiki at a time
[06:37:16] (and probably not enwiki just yet) ;-)
[06:37:17] for hundreds of wikis? ;)
[06:37:22] no!
[06:37:26] <^demon> How about jawiki?
[06:37:31] I was just gonna say
[06:37:35] jawiki is #2 in traffic IIRC
[06:37:43] ~8%
[06:37:51] Be Bold
[06:37:52] <^demon> It's daytime there, so we'd get a little more traffic than by sticking to western europe.
[06:37:53] So that sounds good
[06:38:04] Yeah we already did de and fr, although it's early there
[06:38:09] I can do them one at a time, for now
[06:38:17] I'll do one every few minutes
[06:38:23] And apergos1 is right, if we switch de and there is no visible impact on Ganglia, that's saying something
[06:38:41] someone fixed The Bug without knowing it ;)
[06:38:56] now we're going to have to go back through all the commits and undo them :-P
[06:39:21] <^demon> At some point sooner rather than later we should switch commons. Central role like meta and all.
[06:39:45] let's do commons next
[06:39:51] Yeah
[06:39:56] weren't we going to do mw.org too?
[06:40:01] <^demon> We did that Monday
[06:40:07] Don't mind me, but I find it rather worrying that *we don't know* why it's working ;)
[06:40:09] The typical mw.org way
[06:40:24] Get consensus on IRC from like 3 devs :)
[06:40:29] <^demon> "Sound good to you?"
[06:40:30] worksforme
[06:40:31] <^demon> "Sure"
[06:40:37] <^demon> "Nobody said no, I'm doing it"
[06:40:48] ok commons next
[06:41:10] guillom: Well to be fair, we had nlwiki on so briefly before that I'm not sure it was warranted to draw too many conclusions from the CPU spike we saw at that time
[06:41:30] a spike which is not happening _at all_ right now though
[06:41:32] this is what a deployment should look like
[06:41:35] nothing happening
[06:41:43] except new features appearing
[06:41:52] <^demon> And bugs going away :)
[06:42:02] hehe
[06:42:43] bits app servers are now at ~ 10% load
[06:43:07] if 4 hosts is enough for that, I may do away with that entire cluster ;)
[06:43:17] not worth the trouble of keeping it separate
[06:43:45] for which?
[06:43:53] need coffee?
[06:44:00] I'm going to do zhwiki next, maybe it'll stop spamming fatal.log
[06:44:03] no, need a better wireless connection
[06:44:10] don't you have a mifi?
[06:44:14] <^demon> zhwiki and then some of the sources?
[06:44:16] apergos: bits app servers
[06:44:17] yes, I haven't dug it out yet
[06:44:19] <^demon> I want the doublewiki errors to go away
[06:44:23] ...then do that? :)
[06:44:30] On dewiki I get the warning "Failed to load resource: http://bits.wikimedia.org/skins-1.17/$wgExtensionAssetsPath/FlaggedRevs/client/flaggedrevs.css?86?301-1"
[06:44:38] Ugh
[06:44:45] That's my fault, fixing
[06:45:57] we should probably let the dust settle on some of the deployments before plowing ahead
[06:46:15] <^demon> apache log is showing nothing of interest.
[06:46:15] also, what about one of the languages with variants?
[06:46:22] <^demon> zhwiki just went over.
[06:46:25] did zhwiki already
[06:46:32] Is 1.17 changing something about redirects? http://fr.wikipedia.org/wiki/Lyc%C3%A9e_Lucas-de-Nehou
[06:47:11] We should test the variant functionality a bit then
[06:47:14] <^demon> Looks like redirected templates maybe?
[06:47:26] ^demon, yes
[06:47:41] Maybe the alias was killed in TWN?
[06:47:53] same error from zh.wikipedia.org in 1.17, guess it's not going to fix that one
[06:48:06] Doesn't look like it, 'redirect' => array( '0', '#REDIRECTION', '#REDIRECT' ),
[06:48:32] *RoanKattouw wonders whether this might be caused by Tim's magic word fix
[06:49:49] ok, let's try to fix it for a few minutes, then revert on frwiki if we don't manage
[06:50:36] Same is happening for #REDIRECT
[06:51:37] Doesn't seem to be i18n-related; how is this not happening on other 1.17 wikis?
[06:51:40] *RoanKattouw tries on mw.org
[06:51:52] works for me with #redirect
[06:51:59] using Title::newFromRedirect() from eval.php
[06:52:30] maybe it's because one is a substring of the other
[06:52:34] Right, and not with #REDIRECTION
[06:52:50] Is it possible that this is caused by #REDIRECT being a substring of #REDIRECTION ?
[06:53:07] So it sees #REDIRECT ION (=garbage) [[link]]
[06:53:38] this code has changed, hasn't it?
[06:54:02] bits servers have ~ 350k objects in cache now
[06:54:24] less than 100k in 1.16
[06:55:00] My suspicion is correct
[06:55:13] > $mw->matchStartAndRemove($s); echo $s;
[06:55:15] ION [[foo]]
[06:56:29] #REDIRECT in templates WFM now on frwiki but only if the redirect isn't broken (i.e. points to an existing page)
[06:56:50] reverting frwiki to 1.16
[06:58:19] Aha
[06:58:22] I found it
[06:58:29] Tim's magic word merging code is indeed at fault
[06:58:39] It merges the en synonyms /before/ the fr synonyms
[06:58:49] there are probably a lot of resourceloader:filter:minify- objects ( mark )
[06:59:05] So it becomes 'REDIRECT', 'REDIRECTION', which is bad if the former is a substring of the latter
[06:59:08] apergos1: No, those are in memc
[07:00:14] hmm
[07:00:31] it doesn't matter which one is first
[07:00:41] it should match the longest substring
[07:00:49] It doesn't
[07:00:56] It just builds a regex using explode
[07:00:58] that's MagicWord's fault
[07:01:03] REDIRECT|REDIRECTION
[07:01:05] I know that
[07:01:22] the uncached requests are probably due to bug 27302. ResourceLoader is generating a new url every second for every user.
[07:01:25] But it'd be faster to fix what's triggering it than to fix MagicWord itself
[07:01:59] pawelx: Oh, I hadn't noticed the timestamp part of that. That hurts
[07:02:13] Oh, 1 is rounded down *facepalm*
[07:02:42] !bz 27302
[07:02:42] --elephant-- I don't know anything about "bz".
[07:03:13] morning
[07:03:15] https://bugzilla.wikimedia.org/show_bug.cgi?id=27302
[07:03:36] RoanKattouw: yeah, not sure why the round() is needed there anyway
[07:03:43] I'm fixing it
[07:06:19] the localised versions should be first, so that $magic->getSynonym(0) returns the localised version
[07:06:39] that is used in several places to get a canonical string to use for #redirect
[07:07:18] Yes
[07:07:26] And your code is putting the English version first, isn't it?
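The REDIRECT|REDIRECTION problem discussed above can be reproduced outside MediaWiki. The following is a minimal Python sketch, not the MagicWord code itself: the helper names build_prefix_regex and match_start_and_remove are hypothetical, and the sketch only assumes what the log states, namely that the synonyms are joined into one alternation in list order. Regex alternation tries branches left to right, so a shorter synonym listed first wins over a longer one that shares its prefix. Sorting longest-first is one possible remedy shown here for illustration; the log does not say which fix was applied to MagicWord::initRegex().

```python
import re


def build_prefix_regex(synonyms, longest_first):
    """Build a case-insensitive regex matching any synonym at the start of a string.

    With longest_first=False the synonyms are joined in the order given,
    which mirrors the behaviour described in the log.
    """
    words = sorted(synonyms, key=len, reverse=True) if longest_first else list(synonyms)
    alternation = "|".join(re.escape(word) for word in words)
    return re.compile("^(?:" + alternation + ")", re.IGNORECASE)


def match_start_and_remove(pattern, text):
    """Strip the matched synonym from the start of text and return the rest."""
    return pattern.sub("", text, count=1)


# English synonym first, French synonym second, as described in the log.
synonyms = ["#REDIRECT", "#REDIRECTION"]
text = "#REDIRECTION [[foo]]"

naive = build_prefix_regex(synonyms, longest_first=False)
fixed = build_prefix_regex(synonyms, longest_first=True)

naive_rest = match_start_and_remove(naive, text)
fixed_rest = match_start_and_remove(fixed, text)

print(repr(naive_rest))  # leftover "ION" garbage in front of the link
print(repr(fixed_rest))  # only the link remains
```

The naive ordering leaves the same "ION [[foo]]" remainder that was pasted at 06:55:15.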
[07:07:45] yes, so I'll fix both that and MagicWord::initRegex()
[07:08:42] i'm gonna get some breakfast, back in a few minutes
[07:10:02] *guillom grabs breakfast.
[07:10:46] I'm eating dinner :-D
[07:10:58] (at 2:15 am)
[07:10:58] <^demon> Ok if I turn NewUserMessage back on for 1.17 wikis? The fixes have all been merged in.
[07:15:02] I'm fine with it
[07:17:07] apergos: ok... any good reason for that? ;)
[07:19:06] am I asking the right question by asking "any Chinese language speakers here that can tell us if the zh/zh-yue variant support is working right?", or did I phrase that wrong.
[07:19:32] see #wikimedia-tech for why I'm asking that
[07:19:33] Like I said before, we should just switch a Serbian wiki
[07:19:40] <^demon> We already switched zhwiki
[07:19:46] And us English-speaking people will be able to tell for ourselves fairly easily
[07:20:02] <^demon> Fwiw, getting lots of OOM issues for LangConverter stuff.
[07:20:11] Because we're better at distinguishing between Latin and Cyrillic than we are between zh-foo and zh-bar
[07:20:29] ^demon: In 1.17?
[07:20:40] <^demon> Yes
[07:20:45] just look at the stroke count
[07:21:06] I checked that variant page views were working right after we switched
[07:21:07] looks like one of the key templates on zhwiki broke (see #wikimedia-tech).
[07:26:31] the canaries are getting more insistent
[07:30:33] <^demon> Googling it pointed me to a php bug, only skimmed it.
[07:30:49] I've been looking too, not seeing anything useful yet
[07:32:09] we logged a fatal from NoteTA on zhwiki
[07:32:16] <^demon> I resolved the "catchable fatal error" with Title::equals(), had a missing MFT.
[07:32:44] from an article with NoteTA in it anyway
[07:32:52] probably not related
[07:33:52] TimStarling: Could you run s-c-a as root?
[07:34:36] done
[07:34:50] well, in progress anyway
[07:36:34] I wish I knew what the api requests were that were resulting in the canary error
[07:38:06] The action=parse ones?
[07:38:12] what canary error?
[07:39:07] there are a lot of OOMs from zhwiki
[07:39:12] maybe we should fix that next
[07:39:41] ALERT - canary mismatch on efree() - heap overflow detected (attacker '208.80.152.46', file '/usr/local/apache/common/docroot/wikipedia.org/w/index.php'), referer: http://de.wikipedia.org/wiki/Reba
[07:39:43] there's a sample
[07:39:54] <^demon> Fixed a fatal in FR.
[07:40:19] I did not see them at first switchover, they showed up after a while
[07:40:48] referrers de ja nl...
[07:42:15] TimStarling: Turns out I had misdiagnosed the issue fixUsabilityPrefs.php was fixing. Have now put in fixUsabilityPrefs2.php in the local copy, run it on each wiki /after/ switching it to 1.17
[07:42:55] The empty value insertion really meant '0' not 'default', which means we just screwed 4,000 people who had the toolbar disabled, and the empty values are inserted by 1.16 :(
[07:45:07] the fact that no api related canaries are chirping now makes me think something got straightened out with the last sync
[07:46:06] apergos: how many servers was it coming from?
[07:46:47] for the index ones, it's several at least
[07:46:56] let me look back for the api ones
[07:47:39] I wonder if we can configure it to dump core
[07:48:15] 41, 43, 45 for the api ones
[07:48:23] which have stopped
[07:50:01] my test wiki had $wgCapitalLinks=false, no wonder things were a bit screwy
[07:50:03] about 25 servers with the index.php canaries
[07:50:41] the api ones are back
[07:50:56] <^demon> Whatever's causing them isn't getting caught by wmerrors.
[07:51:01] nope
[07:51:20] efree() is a function internal to PHP
[07:51:33] the overview from google is that it's often the wrong version of a library or some other component deployed together
[07:51:49] how we would suddenly cause that midway through a 1.17 switch I dunno
[07:52:47] that's not a very good overview
[07:52:53] it could be anything
[07:52:53] cope
[07:52:58] nope it isn't
[07:53:09] PHP is full of heap overflows
[07:53:16] Could it be wmerrors causing/triggering it?
[07:53:24] every release has a stack of them that they fix
[07:53:37] yes
[07:53:45] since wmerrors is part of everything
[07:53:54] it's weird that there are far fewer servers with api related whines
[07:54:06] if you can reproduce it, we can get a backtrace
[07:54:25] if we had the api call params we could
[07:54:26] when it's just a few of them across the whole cluster, it will be more difficult
[07:55:02] what does suhosin do exactly after it hits that canary error?
[07:55:17] does it abort? exit? some other signal?
[07:55:31] will it show up in our squid logs?
[07:56:43] not in the squid logs, I'm on one know and was looking
[07:56:47] *now
[07:57:50] with a special awk command, or just with sampled-1000.log?
[07:58:27] I was actually looking at the local syslog over there. but right now I am on srv292, one of the apaches that produces the errors
[07:58:33] the syslog is full of these
[07:59:05] <^demon> I'm on 292 also, but I didn't have access to the syslog :(
[08:00:43] ah right, so there's plenty of errors
[08:00:54] I thought you were saying there was only a few
[08:01:06] why didn't you get a backtrace?
[08:01:30] ?
[08:01:44] <^demon> wmerrors isn't logging one
[08:01:49] use gdb
[08:02:28] if there's plenty of errors, you don't have to rely on logging
[08:02:39] you can just attach to a random apache process and wait for it to do it
[08:02:40] <^demon> gdb isn't installed on the apaches.
[08:02:52] we install it on all the ones we need it on
[08:03:13] it's on srv292 now
[08:03:14] now installed on srv292
[08:03:23] dpkg: status database area is locked by another process
[08:03:24] E: Sub-process /usr/bin/dpkg returned an error code (2)
[08:03:27] mark too fast
[08:03:34] sorry
[08:03:39] *mark goes back to drinking coffee
[08:03:44] i'll install it on all machines
[08:05:37] got a backtrace, doesn't help much
[08:05:44] you were fast
[08:05:52] Can you share it anyway?
[08:05:55] I couldn't get a live process by the time I got into gdb
[08:06:00] defunct every time
[08:06:00] http://p.defau.lt/new.html
[08:06:14] er?
[08:06:31] how about the actual backtrace instead of the form for a new pastebin?
[08:07:44] http://p.defau.lt/?sn6sssHKCX2ROsyl6pd0IQ
[08:07:48] thanks
[08:07:51] it's $GLOBALS
[08:08:01] that's the hashtable that it's destroying
[08:08:04] should be able to get a key
[08:08:36] great :-/
[08:08:46] wgAutoConfirmAge
[08:09:23] to attach before the process disappears, use: gdb -p `ps -C apache2 | tail -n1 | awk '{print $1}'`
[08:10:01] ps -C, that's what I wanted
[08:14:43] break php_security_log
[08:14:45] cont
[08:14:47] frame 2
[08:15:05] print (char*)p->arKey
[08:16:39] So, people, is there anything preventing us from rolling out to more wikis other than the strange PHP canary thing?
[08:17:01] let's try disabling wmerrors at runtime and see if that fixes the canary thing
[08:17:17] RoanKattouw: zhwiki OOM
[08:17:30] For what, langconv?
[08:17:43] <^demon> Yeah
[08:19:22] maybe, but the backtraces don't all say langconv anymore
[08:19:39] One is in replaceInternalLinks2()
[08:19:47] maybe it is the same issue that I spent half the afternoon tracing
[08:20:05] RIL2 was where I was working, yes
[08:20:18] <^demon> Still getting canaries.
[08:20:49] I'll revert it
[08:21:39] <^demon> APC cache full/corruption wouldn't cause this, could it?
[08:21:49] segfaults on srv247, I'm going to restart that one
[08:21:50] if it is, we should be able to fix it by restarting
[08:22:56] ok avail_mem is only 5MB on srv292
[08:23:04] nice idea ^demon
[08:23:57] and the canaries are gone from srv247 now
[08:23:59] btw the way to check that is to use the scripts in live-1.5
[08:24:04] BLOCKER: http://bits.wikimedia.org/zh.wikipedia.org/load.php?debug=false&lang=zh-cn&modules=site&only=styles&skin=vector&version=20110216T075000Z load nothing for zh.wikipedia
[08:24:26] [mem_size] => 113750032
[08:24:27] hmm, let me find out why
[08:24:40] it works only when I remove "&lang=zh-cn" param
[08:24:59] It also works when I change the timestamp
[08:25:09] Try making a whitespace change to MediaWiki:Common.css
[08:25:14] Wfm
[08:25:26] That should update the timestamp in the URL after at most 5 minutes
[08:26:31] I'll restart all apaches
[08:27:01] OK so we have OOMs on zhwiki, but 1) is there much we can do about it and 2) didn't these already happen all the time in 1.16?
[08:27:02] RoanKattouw: it also works when I remove the version param
[08:27:18] Yes, it'll remove when you use any other URL
[08:27:22] s/remove/work
[08:27:41] RoanKattouw: no, it didn't work when I tried to change zh-cn to zh
[08:27:58] But with a different timestamp?
[08:28:06] I wonder if removing the wmerrors hack will alleviate the memory leak (I suppose that's what's happening)
[08:28:16] <^demon> I'm curious if canaries will start coming back.
[08:28:31] RoanKattouw: with a different ts it works
[08:28:39] and I did some edit in mediawiki:common.css
[08:28:43] Right
[08:28:44] if we leave the code alone, I will guess yes
[08:29:00] Within 5 mins it should update the timestamp in the URL for everyone then
[08:29:00] but we'll have to wait awhile
[08:29:16] <^demon> I assume so. It would be good to go ahead and figure out a better way of debugging if/when it shows up.
[08:29:28] <^demon> Not much time to react when the processes start going defunct so quickly.
[08:29:36] nope
[08:29:56] it's not a memory leak
[08:30:13] we need to increase APC memory size
[08:30:24] I thought it had been
[08:32:02] again
[08:32:08] say to 200
[08:33:35] <^demon> So the problem is really that MediaWiki has gotten too big for the current APC size?
[08:33:53] And that we're running two instances of it, I guess
[08:34:04] there were 920 files in it
[08:34:08] As for MW getting bigger: we added includes/libs and includes/resourceloader
[08:34:10] for two versions
[08:34:13] And includes/installer
[08:34:13] that's a lot of files
[08:34:28] (Although to be fair the latter is never loaded on WMF)
[08:34:42] <^demon> includes/libs is small and most of the stuff already existed elsewhere :p
[08:34:49] <^demon> But the point remains, lots of code.
[08:38:41] so shouldn't I be able to run php /usr/local/apache/common/live-1.5/apc-sma-info.php directly on the apache in question?
[08:39:07] you mean through the CLI? no
[08:39:21] the cache is in shared memory which is only accessible to apache children
[08:39:27] hmph
[08:40:08] another problem, http://zh.wikipedia.org/w/index.php?action=raw&title=MediaWiki:Sidebar gives a line "**mainpage|mainpage-description " and in http://zh.wikipedia.org/w/api.php?action=parse&text={{int:mainpage-description}} it's "??????". but it shows "Wikipedia:??????" in site sidebar
[08:46:26] TimStarling, apergos: https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Special:Code/MediaWiki/82227
[08:46:53] that's not really urgent
[08:47:03] I thought it might be related
[08:47:04] the whole pool will be destroyed at request shutdown
[08:47:07] Right
[08:47:17] and since it's handling a fatal error, request shutdown will be very soon
[08:48:37] <^demon> Apache CPU has climbed somewhat over the last 30 minutes or so.
[08:48:38] So, any remaining blockers for further deployments?
[08:48:46] ^demon: Probably due to increased request rate
[08:48:56] <^demon> Probably, Europe's waking up
[08:49:09] what about liangent's problem?
[08:49:14] Oh wait, that current spike is not normal
[08:49:49] liangent's problem WFM
[08:50:02] Shows ?????? in the sidebar for me
[08:50:04] note the content of MediaWiki:mainpage is Wikipedia:??????
[08:50:18] Yes, the link target is Wikipedia:??????
[08:50:24] And the link text is ??????
[08:50:24] RoanKattouw: my uselang=zh-cn
[08:50:33] it shows Wikipedia:?????? for me, with uselang=zh
[08:50:35] but link text is ??????
[08:50:45] Yes, it happens in zh-cn
[08:51:45] Hmm
[08:52:01] mainpage-description seems to be correct in zh-cn, and purging MW:Sidebar doesn't help
[08:53:57] hmm yes load is somewhat higher... dunno if it's enough to worry about
[08:54:32] I have the third problem now... but solving them one by one is ok
[08:54:36] What worries me is the abrupt spike
[08:54:58] <^demon> canaries are back.
[08:55:11] could be demand-side
[08:55:12] Don't see anything related in SAL though
[08:55:21] liangent: What's your third problem then?
[08:55:35] css defined in gadgets are not loaded
[08:55:50] srv292 again
[08:56:10] just one server
[08:56:42] <^demon> Hm, maybe just depool it
[08:57:06] liangent: Link to relevant Gadget and CSS not loaded?
[08:57:10] Aha, in zh-cn:
[08:57:15] **mainpage|mainpage
[08:57:32] suspecting hardware problems for srv292?
[08:57:38] there goes another server
[08:57:45] <^demon> Also srv194
[08:57:49] yes
[08:57:54] so no, not suspecting hw
[08:58:22] What the...
[08:58:32] OK it looks like MediaWiki message overrides aren't working for other languages any more
[08:59:13] Yes, that's what's happening
[08:59:47] Nikerabbit: Around? Can you help me with a MessageCache issue?
[09:00:18] the CPU spike was when I restarted the apaches
[09:00:31] robla: Hi Rob
[09:00:33] Right
[09:00:37] 32 minutes past the hour
[09:01:11] 60% cpu is relatively high
[09:01:14] ok I'm deleting MediaWiki:Sidebar/zh-cn and it looks good not
[09:01:16] *now
[09:01:17] we normally stay below that
[09:01:25] hi nadeesha_calcey
[09:01:26] we're still higher than we were
[09:01:28] but it could be temporary indeed
[09:01:31] before the switch
[09:01:33] liangent: There was a /zh-cn page?
[09:01:50] RoanKattouw: yes
[09:01:52] Yeah that'll have done it
[09:02:18] it was old but didn't cause any problem
[09:02:43] and I have to delete other remaining subpages (zh-hk etc)
[09:03:16] robla: Can we start second window testing now?
[09:03:20] nadeesha_calcey: see http://www.mediawiki.org/wiki/MediaWiki_1.17/Wikimedia_deployment#Phase_2_.28underway_February_16.29 for a list of wikis we deployed to so far. some basic testing on the new ones would be good
[09:03:58] (I guess frwiki is still on 1.16)
[09:04:22] RoanKattouw: no but I have to write {{MediaWiki:Sidebar}} in them?
[09:04:26] robla: okay, we can do that
[09:04:36] liangent: I don't think so. You should be able to just delete those subpages
[09:04:46] liangent: After that, can you tell me more about the broken CSS issue?
[09:06:22] I guess MediaWiki:Sidebar should follow content language
[09:06:40] I guess so
[09:07:58] robla: Tim fixed the root cause of the frwiki problem some time ago
[09:08:34] I didn't deploy it, did you deploy it?
[09:08:45] I asked you to run s-c-a at some point
[09:08:59] IIRC that was just after I'd svn upped for that fix
[09:08:59] oh, so I did deploy it, I just didn't realise?
[09:09:09] Along with a bunch of CSS-related MFTs, yes
[09:10:25] <^demon> Ugh. All the info on the canaries via google is just silly workarounds telling you to disable suhosin.
[09:10:30] *^demon continues digging
[09:11:52] RoanKattouw: how are gadget styles loaded now
[09:12:00] The normal way, AFAIK
[09:12:05] Could you point me to the one that's broken?
[09:12:10] with resourceloader, or just action=raw?
[09:12:21] RoanKattouw: pong
[09:12:42] RoanKattouw: on zhwikipedia, I'm having gadget-fontsize selected, but no css is loaded
[09:13:36] Nikerabbit: With your changes in MessageCache.php , it seems that wfMsg('sidebar') doesn't return the contents of MediaWiki:Sidebar if ui lang != cont lang
[09:13:51] what changes?
[09:14:00] The MediaWiki:Messagename/langcode change
[09:14:16] From my reading of the code, MessageCache checks for that page only
[09:14:25] Not for its parent page
[09:15:02] I still don't get it, there shouldn't be much changes in 1.17 message cache compared to 1.16
[09:15:15] In 1.16, MediaWiki:Sidebar/zh-cn was ignored. In 1.17 it's not, so someone complained their sidebar broke. They then found and deleted the zh-cn subpage, but in eval.php wfMsg('sidebar') with $wgLang = Language::factory('nl') still returns the default nl message despite the zh override
[09:15:17] restarted them again eh?
[09:15:24] Despite that, the sidebar does work right
[09:15:49] Or, no, it doesn't, actually :O
[09:16:04] So bug #1 is sidebar is using UI lang not contlang
[09:16:34] bug #2 is that if there's a [[MediaWiki:Foo]] page but no [[MediaWiki:Foo/nl]] page, your ui lang is nl, but the cont lang is something else, the parent page should be used but isn't
[09:17:01] the fact that restarting apaches helps tells you that it is an APC issue
[09:17:40] <^demon> People claim it's fixed in more recent versions of suhosin.
[09:17:52] meh
[09:18:07] that's what suhosin does, it makes these errors
[09:18:13] you can't fix it
[09:18:49] if you did, what would be the point of it?
[09:19:09] RoanKattouw: what fallback seq should be used in message?
[09:19:31] No idea, ask Nikerabbit :D
[09:19:48] db subpage -> php i18n -> db basepage
[09:19:54] <^demon> TimStarling: Trying to learn new stuff here today :)
[09:19:56] the other way around, I think
[09:20:05] or db subpage -> db basepage -> php i18n for uimsg
[09:20:11] yeah, that
[09:20:37] mm load is creeping up a bit more
[09:20:45] then contmsg?
[09:20:55] yes
[09:20:58] no
[09:21:10] english
[09:21:22] it just goes through the fallback sequence for the UI, the content language isn't used
[09:21:26] It *seems* to be doing db subpage -> php i18n
[09:21:49] I hate svn, how do I diff between two branches
[09:22:15] <^demon> svn diff /path/to/branch /path/to/otherbranch, right?
[09:22:20] I think so
[09:22:26] TimStarling: s/isn't/is in your words?
[09:23:07] say if your content language is zh and your user language is de
[09:23:25] <^demon> TimStarling: Trying to learn here...how would I go about getting a stack trace from one of these failed apache processes?
[09:23:44] without root access, you can't ;)
[09:23:47] then wfMsg() should go [[MediaWiki:message/de]] -> [[MediaWiki:message]] -> de i18n -> en i18n
[09:23:50] <^demon> Oh :(
[09:24:04] RoanKattouw: I remember it was doing db subpage->php i18n so that I had to write a script to duplicate the content of basepage to all subpages
[09:24:39] this is not what everyone wants, but it supports things like [[MediaWiki:Sidebar]]
[09:25:23] That's what it should do, but that's not what it's doing
[09:26:00] this was screwed up in 1.16 too
[09:26:05] apparently people just like to screw it up
[09:26:14] that's why I remember the order
[09:26:53] <^demon> srv151 is segfaulting some.
[09:27:12] i'll restart it
[09:27:33] done
[09:27:44] did these servers just start having problems post-deployment? or did we just start noticing it?
[09:29:15] <^demon> I just remember it last week during the staged deploy on a smaller scale.
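The lookup order stated at 09:23:47 (db subpage, then db base page, then PHP i18n for the UI language, then English) can be sketched as a simple chain. This is a rough Python illustration under loud assumptions: resolve_message, db_pages and php_i18n are hypothetical stand-ins for the wiki's MediaWiki: namespace and the Messages*.php files, and nothing here reflects how MessageCache is actually implemented. It encodes only the order that the log says wfMsg() should follow, which other participants report is not what the deployed code was doing.

```python
def resolve_message(key, ui_lang, db_pages, php_i18n, final_fallback="en"):
    """Return the first available text for a message key.

    Order, per the log: [[MediaWiki:key/ui_lang]] -> [[MediaWiki:key]]
    -> PHP i18n for ui_lang -> PHP i18n for English.
    The content language is deliberately not consulted.
    """
    candidates = [
        db_pages.get("MediaWiki:%s/%s" % (key, ui_lang)),  # db subpage
        db_pages.get("MediaWiki:%s" % key),                 # db base page
        php_i18n.get(ui_lang, {}).get(key),                 # PHP i18n, UI language
        php_i18n.get(final_fallback, {}).get(key),          # PHP i18n, English
    ]
    for text in candidates:
        if text is not None:
            return text
    return None


# Hypothetical data: a wiki with a local sidebar override and no /de subpage.
db_pages = {"MediaWiki:Sidebar": "* local custom sidebar"}
php_i18n = {
    "de": {"Sidebar": "* de default sidebar"},
    "en": {"Sidebar": "* en default sidebar"},
}

with_base_page = resolve_message("Sidebar", "de", db_pages, php_i18n)
without_db = resolve_message("Sidebar", "de", {}, php_i18n)
unknown_lang = resolve_message("Sidebar", "xx", {}, php_i18n)
print(with_base_page)
print(without_db)
print(unknown_lang)
```

With this order a local [[MediaWiki:Sidebar]] override wins over the PHP default for any UI language, which is the behaviour said to support customized sidebars.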
[09:29:20] well 151 just started a little bit ago, it was fine for the few hours before
[09:30:49] nl reports that categories don't report the number of pages and categories within the subcats
[09:31:05] same for 158
[09:31:29] seems this is resolved now though.
[09:31:45] perhaps took a while before the cache caught up with updating the messages
[09:31:51] RoanKattouw: I don't see any changes between REL1_16 and REL1_17 that could cause that, except one I'm not sure of
[09:32:02] a little bit ago in the case of those two servers = the last 15 minutes
[09:32:07] <^demon> srv151 segfaults were preceded by OOM for DoubleWiki, I think.
[09:32:12] thedj: Yeah Romaine reported that to me already
[09:32:13] there should probably be some minor breakage in the area of categories and subcategories
[09:32:26] it would be very surprising if I got that complex revert exactly right
[09:32:27] Or wait, that may have been categorytree
[09:32:36] PHP Fatal error: Allowed memory size of 83886080 bytes exhausted (tried to allocate 64926 bytes) in /usr/local/apache/common-local/php-1.17/languages/LanguageConverter.php on line 400
[09:32:45] for 151
[09:32:58] I've seen a lot of OOMs on that specific line
[09:33:08] <^demon> The LangConv OOM has been going since we switched over the wikis using it.
[09:33:16] <^demon> Much worse than 1.16 was.
[09:33:29] only difference is -$langcode = $lang->getCode(); +$langcode = $lang->getPreferredVariant(); which is out of my league
[09:34:10] Yeah Tim just said this was wrong in 1.16 too
[09:34:17] who's the designated canary-whacker?
srv292 [09:34:47] root@fenari:/home/wikipedia/syslog# grep 'canary mismatch' apache.log | awk '{print $1}' | sort | uniq -c | sort -rn 72533 Feb [09:34:47] 21406 Dec [09:34:47] 2 Jan [09:34:50] The order should be MW subpage for ui lang -> MW base page -> PHP i18n for ui lang -> fallback chain from there [09:35:16] But the MW base page step is cut out [09:35:36] it's the final step [09:35:46] but that hasn't change since last deployment [09:36:16] is this only happening in chinese wp? [09:36:21] What do you mean it's the final step? It's not much use if its does MW subpage -> PHP i18n with fallbacks -> MW base page, is it? [09:36:29] Nikerabbit: that's what tim said "it supports things like sidebar" [09:36:33] I've only reproduced it on zhwiki, yes, but I haven't tried anywhere else [09:36:45] Nikerabbit: Go to http://zh.wikipedia.org/?uselang=fi [09:36:56] because $langcode has changed... the check against content language code might be broken [09:37:03] You'll get a Finnish sidebar even though they have a local override in [[MediaWiki:Sidebar]] [09:37:48] what do you mean with Finnish sidebar? [09:38:16] Nikerabbit: you get a sidebar with items defined in MessagesFr.php [09:38:29] no I don't actually [09:38:56] and there is no sidebar message in MessagesFi.php [09:39:06] Oh [09:39:08] Try ?uselang=nl then [09:39:12] ok. Go to ?uselang=en [09:39:19] next time a server starts giving canary errors, I'm going to attach with gdb [09:39:27] you probably mean that some of the links are translated? [09:39:43] Also, some of the custom links won't appear, I think [09:39:53] 10.0.2.239 there is your chance [09:40:02] I see all custom links both with nl, fi and no override [09:40:23] Hmm wait, you're right [09:40:26] It works fine [09:40:39] My mistake then, I guess, sorry for the confusion [09:40:41] is it just that they are shown in English without override? 
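The lookup order described above (DB subpage for the UI language, then the DB base page, then PHP i18n for the UI language, then the fallback chain) can be sketched roughly as follows. This is a minimal illustration, not MediaWiki's actual code: `lookup_order`, `resolve`, and the dictionary layouts are hypothetical, and page-title capitalisation is ignored.

```python
def lookup_order(key, ui_lang, fallbacks):
    """Return the sources to try, in order, as (kind, name) pairs."""
    # Hypothetical model: DB overrides first, then PHP message files.
    order = [
        ("db", f"MediaWiki:{key}/{ui_lang}"),  # subpage for the UI language
        ("db", f"MediaWiki:{key}"),            # base page (local override)
    ]
    chain = [ui_lang] + list(fallbacks.get(ui_lang, []))
    if "en" not in chain:
        chain.append("en")                     # English is the last resort
    order += [("php", lang) for lang in chain]
    return order


def resolve(key, ui_lang, db_pages, php_messages, fallbacks):
    """Return the first message text found along the lookup order."""
    for kind, name in lookup_order(key, ui_lang, fallbacks):
        if kind == "db":
            text = db_pages.get(name)
        else:
            text = php_messages.get(name, {}).get(key)
        if text is not None:
            return text
    return None


# Made-up data: a wiki with a local sidebar override and no /nl subpage.
db_pages = {"MediaWiki:sidebar": "local override"}
php_messages = {"nl": {"sidebar": "nl default"}, "en": {"sidebar": "en default"}}
print(resolve("sidebar", "nl", db_pages, php_messages, {}))
```

With the base-page step in place, the local override is returned ahead of the PHP default for the UI language; dropping that step from `lookup_order` would reproduce the behaviour reported in the discussion.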
[09:40:52] The texts are translated, but that's expected [09:40:55] The actual list is unchanged [09:41:13] I thought the actual list had changed too [09:41:19] RoanKattouw, Nikerabbit: but in my attempt just now [09:41:24] when I deleted /zh-cn [09:41:30] the list IS changes [09:41:44] my user language is zh-cn [09:41:52] Maybe sidebar does use wfMsgForContent() [09:42:00] Because this still looks wrong: [09:42:10] catrope@fenari:/home/wikipedia/common/php-1.17$ php maintenance/eval.php --wiki=zhwiki [09:42:11] > $wgLang = Language::factory('nl'); echo wfMsg('sidebar'); [09:42:17] Returns the message from MessagesNl.php [09:42:31] so I added /zh-cn and wrote {{mediawiki:sidebar}} or its content in it to get the customized sidebar bacvk [09:42:33] of course it uses wfMsgForContent [09:42:34] *back [09:42:56] only in some wikis like commons it's overriden to use wfMsg [09:43:40] Nikerabbit: I didn't check the code but how to explain what happened when I deleted /zh-cn then [09:43:55] I'm pretty sure the change I pasted is the cause of the issues you see and I don't [09:44:37] *RoanKattouw wonders what getPreferredVariant() does for zh/zh-cn [09:45:23] Behaves normally, it seems [09:46:07] # Is this a custom message? Try the default language in the db... [09:46:12] Right, so the check is there indeed [09:46:17] if it returns anything other than what was originally given... then everything using wfMsgForContent will have funny effects [09:46:24] RoanKattouw: I already copied mediawiki:sidebar to all zh variant subpages... [09:46:30] it=getPreferredVariant [09:46:31] .. but only at the end [09:46:46] It returns zh-cn for zh-cn [09:47:57] re canary: I got a URL and a server, it didn't help [09:48:09] :-( [09:48:10] RoanKattouw: did you look at LanguageConverter::getPreferredVariant()... 
it depends on whole lot of things [09:48:20] it did it for random pages, not just the URL I logged [09:48:23] and by definition it is not guaranteed to return the same code always [09:48:55] Nikerabbit: I tried in eval.php and it seemed to work. Will read source [09:48:55] TimStarling: what url was it? [09:49:18] http://de.wikipedia.org/wiki/Hain [09:49:24] but you know it didn't help, right? [09:49:44] it works as well as any URL [09:50:09] a page without anything fancy [09:50:21] Hmm it depends on user too [09:50:25] liangent: Are you User:Liangent ? [09:51:27] maybe there's some extension newly used in 1.17, eg. for tiff handling? [09:51:31] Hmm, still returns zh-cn if I use his username [09:52:10] RoanKattouw: yes [09:52:39] RoanKattouw, are you going to deploy to other wikis today? [09:52:54] I don't know [09:52:59] robla: ? [09:53:16] TimStarling: What's the status regarding blockers for expanding deployment? How serious is the canary thing? [09:53:41] can we deploy all of the wikis? [09:53:46] to 1.17 [09:53:57] if it's a problem with running two copies of MW at once, that would fix it [09:54:28] as long as we're ready to back off it [09:54:29] I am worried about the current 60% cpu usage, but we can try [09:54:43] I'm concerned about en wiki usage (though it's off hours here) [09:54:45] (sorry...stepped away for a sec) [09:54:48] Let's try all minus enwiki first [09:54:57] or just enwiki [09:55:03] Could do that too [09:55:08] I'd say give enwiki a shot [09:55:08] Probably easier :) [09:55:11] Whee [09:55:47] 10.0.2.239 whackacanary [09:56:10] RoanKattouw: I tried to reproduce what happened but failed [09:56:10] by deleting some subpages [09:56:11] RoanKattouw: well just leave it there. the gadget problem? 
[09:56:12] apergos: it's restarted now, it was the one I was using for testing [09:56:13] RoanKattouw: I don't what is the intention there, but that change breaks the check mediawiki:basepagename first when using wfMsgForContent() [09:56:17] So enwiki now? [09:56:20] if we do all then we can see if it's the two version issue; [09:56:22] liangent: Yes, let's move to the gadget problem [09:56:27] if we just do en we don't gain that [09:56:40] let's just be ready to back out is all [09:56:41] I'll switch en, but if the problem with APC is related to running multiple versions of MW, it won't help [09:56:47] what tim says [09:56:56] Switching en will give us confidence to switch the rest, though [09:57:30] With regards to CPU usage and such [09:57:39] rather a lot of seg faults... [09:57:59] well the decision is made I guess [09:58:50] uh oh [09:58:55] thousands of bits cache misses [09:59:08] Of course [09:59:15] bits now unstable [09:59:19] crap [09:59:26] I expect nagios messages any second [09:59:40] do you want it reverted? [09:59:54] hmm [09:59:59] let's wait a few minutes [10:00:07] I don't see how we can do better in any other way [10:00:16] Bits caches CPU usage seems to have flatlined [10:00:30] Do I understand correctly that missing CSS/JS is caused by overload of bits server? [10:00:38] *Bits app servers CPU [10:00:56] <body> [10:00:56] Error 503 Service Unavailable [10:00:57] Service Unavailable [10:00:57] Guru Meditation:
[10:01:44] client side is getting better. i get more resources on each page refresh [10:01:45] apaches seem ok [10:02:13] manually pooled all bits servers [10:02:40] RoanKattouw: in old version gadget css are loaded with Guru Meditation: XID: 4354112
[10:02:51] RoanKattouw: is it still working in this way currently? [10:03:12] only my user modules are now missing [10:03:22] > 10k cache misses per second [10:03:23] is it now 09 utc? [10:03:23] not good [10:03:25] everything else is there [10:03:30] ouch [10:03:40] much more than previous deployment attemps [10:03:46] probably because this time the apache cluster is not overloaded [10:03:53] Nikerabbit: 10:08 UTC [10:03:59] *sigh* [10:04:08] yeah, our working backend DoSing your non-working frontend [10:04:25] So bits/Varnish is overloaded? [10:04:28] varnish is ok [10:04:34] the varnish backends aren't [10:04:37] anyway, gather data [10:04:43] we'll revert soon [10:04:49] so get what data you can [10:05:13] lots of segfaults on srv248 [10:05:17] *mark restarts apache [10:05:33] geoiplookup is also having it difficult. [10:05:43] <^demon> date_create(): Failed to parse time string (2008-23-11T11:43:00.00+00:00) at position 6 (3): Unexpected character in /usr/local/apache/common-local/php-1.17/includes/GlobalFunctions.php on line 2071 [10:05:58] and there went nagios [10:06:18] thedj: geoip is also on bits [10:06:38] 0 Backend_health - srv249 Still sick 4--X-R- 0 3 8 0.011589 0.001020 HTTP/1.0 403 Forbidden [10:06:38] 0 Backend_health - srv248 Still sick 4--X-R- 0 3 8 0.011826 0.000993 HTTP/1.0 403 Forbidden [10:06:41] huh [10:06:47] TimStarling: revert [10:07:14] .request = [10:07:14] "GET /w/load.php HTTP/1.1" [10:07:14] "Host: en.wikipedia.org" [10:07:14] "Connection: close"; [10:07:27] why would that give 403 [10:07:35] ^demon: the year only has 12 months here [10:07:55] doh [10:07:57] no useragent? [10:08:02] of course [10:08:04] :-D [10:08:06] which didn't happen with the temp script [10:08:06] <^demon> Platonides: Yep, something's passing a bad string. wmerrors didn't grab a stacktrace though :\ [10:08:06] that's what I was thinking [10:08:08] *mark fixes [10:08:17] or is it a US date? [10:08:43] were the bits app servers in LVS? 
or was varnish just sending traffic directly? [10:08:45] the year first format always has then month-day [10:09:01] directly [10:09:06] it maybe that someone wrote that in an article, though [10:09:07] CPU recovery on bits app servers [10:09:16] *mark deploys a UA fix [10:09:24] <^demon> Platonides: It's supposed to be that way, but some American probably wrote it backwards ;-) [10:09:51] silly americans ;) [10:11:00] are those delayed nagios whines? [10:14:00] /home/wikipedia/syslog/syslog is 109GB [10:14:01] apergos: No, esams bits just went bonkers [10:14:15] ugh [10:14:16] is it time to write a logrotate.d script yet? [10:14:44] still 150GB free, we can let it go to 350 [10:14:53] *250 [10:14:54] there is one [10:15:03] it's just not working right now because of nfs1/nfs2 move [10:15:14] root@nfs2:/home/wikipedia/syslog# head syslog [10:15:14] Dec 18 06:28:30 208.80.152.63 squid[12583]: storeUpdateCopy: Aborted at 12742 (4096) [10:15:16] is enwiki still on 1.17? [10:15:16] I believe there is an rt ticket [10:15:21] No [10:15:29] enwiki's back on 1.16 [10:15:33] ok [10:15:36] since about 10 seconds after you told me to revert [10:15:40] need to get bits.esams stable again [10:16:32] linux scheduler is melting on those boxes [10:18:58] # grep 'canary mismatch' apache.log | awk '$1 == "Feb" { print $2}' | sort | uniq -c [10:18:59] 42429 16 [10:18:59] 1 5 [10:18:59] 34668 8 [10:20:14] the 8th was when we switched all wikis, right? [10:20:29] Yes [10:20:51] so the problem can't be wmerrors because we hadn't written it then [10:21:03] And it can't be partial deployment either, can it? [10:21:08] no [10:21:10] Because I hadn't written that either [10:23:26] <^demon> That makes sense and could've saved us suspecting wmerrors, seeing as we all remember canaries from partial deploy. [10:24:14] ? [10:24:23] I don't remember canaries from anythin [10:24:34] <^demon> I remember mentioning it, but it wasn't as widespread. [10:24:39] <^demon> Because we deployed less. 
[10:25:04] <^demon> Might've just gotten lost in the backscroll and temporarily fixed by apache restarts in between. [10:26:17] nobody has reported it [10:26:26] no actual users, right? [10:26:48] the user-visible problems would be fairly subtle [10:27:43] I don't think there were any, no [10:27:49] Also, it happens on shutdown, right? [10:27:54] yes [10:27:58] Has PHP already sent the output to Apache by that time? [10:28:15] connecting to the apache directly, you see an error about unclean socket shutdown [10:28:19] but maybe squid hides that [10:29:09] there may have been other problems though [10:29:11] Best case, they get their page normally [10:29:27] Worst case, they see it as any other fatal error, so all they'd notice is an increased blank page rate [10:29:28] segfaults, corrupted output, etc. [10:29:36] it's a symptom of a serious problem [10:29:54] Corrupted output? [10:29:58] I see an increased blank page rate on load.php [10:30:08] Yeah you would [10:30:13] Because it's on bits and you're in Europe [10:30:14] wikidiff2 was readded at the same time as the first deployment [10:30:21] Hmmm [10:30:26] Good point [10:30:40] although I expect that would mean an equal failure rate in 1.16 [10:30:50] no it wasn't [10:30:59] wikidiff2 was enabled on the 11th [10:31:04] the other candidate extension i see is intl [10:31:07] # 09:38 Tim: enabling wikidiff2 on all servers by manually creating wikidiff2.ini [10:31:22] so we can strike wikidiff2 [10:32:14] it doesn't have to be an extension, it could be a userspace change [10:32:15] Stupid question time: did you restart Apache after that? 
[10:32:42] I would have checked with phpinfo() that it was enabled [10:33:00] I was working with the assumption that the bug wasn't in php core [10:33:15] just to simplify, as it seems more unlikely [10:34:03] error rates on the 8th, breakdown by minute: http://p.defau.lt/?O_zyZ7x98ICH5QUM5WpiwA [10:35:34] Correlates excellently with 1.17 [10:35:50] The second deployment, specifically [10:36:13] Wow and the first oo [10:36:29] Although the second seems to have been much worse [10:37:06] in the first one, I restarted all apaches regularly because of the high load [10:37:16] so that would have hidden it [10:37:44] Right [10:39:59] *RoanKattouw lols at http://torrus.wikimedia.org/torrus/CDN?token=T25664&view=last24h and guesses it's the 5-minute cache expiry on the RL startup module [10:44:22] I wonder if it only gives a canary error on requests to 1.17 wikis [10:44:40] leave srv254 for a while, I'll find out [10:58:51] one hour left. what should we do with it? [10:59:08] I was just thinking the same thing, a bit sleepily [10:59:09] How serious is the canary issue? [10:59:23] And did bits get fixed so it can handle deploying enwiki? [10:59:25] how serious is the varnish backend issue? [10:59:46] I don't think the canary issue is serious enough to stop us, but the bits issue is [10:59:48] ok [10:59:54] bits.esams is recovering [10:59:59] in a bit i'm ready for one more enwiki test [11:00:17] bits.esams should be rather unrelated to 1.17 [11:00:27] What was the problem? Had it depooled all servers because of 403s? [11:00:35] one hour left to? 
[11:00:37] same problem we've been seeing for a while [11:00:41] Just a thought: There's a bad regex in JavascriptDistiller that causes apache to crash on windows (stack overflow), wondering if this may be related to your memory corruption issue [11:00:49] varnish thread pileup under some nonideal conditions, and then it won't recover [11:01:13] is a bug between varnish and the kernel [11:01:23] pawelx: it would only cause segfaults [11:01:30] not heap overflows [11:02:27] Hydriz: 12:00 UTC is the end of our maintenance window [11:02:47] ok, I thought stack could leak into heap area and cause corruption or sth [11:02:47] ? [11:02:57] <^demon> pawelx: File a bug though, so it doesn't get forgotten :) [11:02:59] Which means Wikimedia would be updated then? [11:03:02] btw the heap overflows are only from the 1.17 wikis [11:03:03] is there someone not working on the performance problems [11:03:19] so if there's APC cache corruption, it's somehow magically localised to 1.17 [11:03:27] or other problems will be deferred until the performance problem is resolved? [11:03:37] 12 pm UTC time... [11:03:54] ^demon: I thought there was one already, can't find it now though [11:04:18] ok, ready for enwiki deployment attempt #4? [11:04:29] really? [11:04:37] Hydriz: It's 11:09 UTC [11:05:03] I know, but it is interesting that you guys are deploying it at noon UTC [11:05:41] Hydriz: see #wikimedia [11:06:09] TimStarling: I'm ready, I say bring it. [11:06:11] mark? [11:06:11] Others? [11:06:20] TimStarling: hrm.... [11:06:32] mark: how confident are you that the bits problem is fixed? [11:06:35] if mark says it's a go [11:06:43] even if it's just more data collection [11:06:50] i'm waiting for traffic to be moved back [11:06:53] should be a couple more minutes [11:07:21] ok [11:07:26] that was effectively a pretty large downtime event for us. 
given how long the recovery was, I'd want to be pretty darn sure we're ready [11:07:43] The long recovery was Europe-only AFAIK [11:07:58] well, if it's only Europe :-/ [11:07:59] yep, and not very related to 1.17 [11:08:32] but that's why we've scheduled a maintenance window [11:09:31] so...back to my question: mark: how confident are you that you've got a fix? [11:10:01] confident enough that the UA issue is resolved [11:10:12] the varnish problems in general are not fixed and can happen anytime [11:10:14] so let's do it [11:11:22] mark: should this be a quicker recovery? [11:11:40] we never know how long it'll take to recover varnish [11:11:52] typically, moving traffic to the US recovers it but can take some time [11:12:24] let's give it a shot I guess [11:12:31] sweet [11:12:44] there's no point in delaying it [11:12:50] Moving traffic to the US is a quick recovery from the /users/' perspective, though, right? [11:13:00] well, it takes 15 minutes [11:13:14] it just took longer since I extended the downtime to investigate it [11:13:20] better now (during maintenance window) than any other time [11:13:33] so it'll be slightly faster now probably [11:14:23] sorry, was distracted, let's do this [11:15:24] done, enwiki back on 1.17 [11:16:27] ah good, CPU is rising this time on the backend, instead of falling [11:16:31] still forbiddens [11:17:02] hmm so it is (rising) [11:17:47] restarted varnish on sq67 [11:17:54] it had two different backend probes running, it seemed [11:17:56] that seem sa bug [11:17:57] better now [11:19:32] bits backend looks remarkably calm (from ganglia) [11:19:41] it's looking good [11:20:39] *robla tries to avoid jinxing it just yet [11:21:33] cache misses are at an acceptable level [11:21:40] a few hundred a s [11:21:55] apaches are a bit more loaded but still stable-ish [11:22:43] if that's where they're going to sit form now on we prolly want to throw a few more in the pool [11:22:50] yep [11:23:03] <^demon> TimStarling: I thought you 
shut up those XML errors from svg metadata extraction? [11:24:14] it won't suppress all errors from XMLReader [11:24:23] so...75-80% of our traffic is on 1.17 now? [11:24:32] 52 plus whatever we had before [11:24:37] I just changed the settings to make it a bit less noisy [11:24:42] <^demon> Oh ok [11:25:01] We had jp and de (not fr) before, so it's somewhere in the 60s, maybe 70, I'd say [11:25:09] we should switch over the rest of the wikis before our window closes [11:25:12] let's do the rest soon yeah [11:25:18] yes [11:25:35] Let's just do that with cat all.dblist > 1.17.dblist for now [11:25:39] We can normalize the setup later [11:25:52] there's a command called "cp" that does a very similar thing [11:25:58] I was just thinking that, yeah :) [11:25:59] I think I might use that ;) [11:26:07] heh [11:26:10] except I made a backup first [11:26:13] let's get it done. might fix the apc probs [11:26:14] but just in case let's be ready to back off [11:26:29] perfect [11:26:48] Whee [11:27:09] 542 dropped requests in varnish during this push ;) [11:27:25] I am very sorry for those 542 people [11:27:29] :-D [11:27:30] but other than that it seems fine right now [11:27:47] front end load is lower [11:28:05] bits app server load is high [11:28:08] 80% [11:28:12] PHP Fatal error: Call to undefined function wfGetIP() in /usr/local/apache/common-local/php-1.17/wmf-config/CommonSettings.php on line 2340 [11:28:28] a *lot* lower [11:28:30] hmmmm [11:28:55] oh, we're back to that are we? [11:28:56] bits app server load dropped a bit again [11:28:57] he.wp looks good [11:29:02] 60% now [11:29:11] you really can't use wfGetIP() in CommonSettings.php [11:29:16] sorry [11:29:29] zh looks good with variants as well [11:29:29] Wasn't that reverted before? [11:29:39] I thought that was fixed yeah. but... [11:29:59] unless it's a revert back to an older config [11:30:08] mark: Probably a lot of initial resource generation. 
Those things are changed very infrequently and are cached for like forever [11:30:13] That was the IP limit exemption thing? [11:30:14] yep [11:30:20] guillom, yus [11:30:24] i am happy enough [11:30:27] I fixed it [11:31:00] looks like I need to find 2 more apaches for the bits app server pool, but it's fine for now [11:31:14] huh on some hosts the load went up (spike) and on some it dropped suddenly ... weird [11:31:26] mark, just make RobH get the new data centre online quicker :P [11:31:34] that won't help anything [11:31:42] it's a replica after all [11:32:17] Pfft. It was with sarcasm [11:32:18] root@srv234:~# gdb -batch -p `ps -C apache2 |grep -v defunct | tail -n4 | head -n1 | awk '{print $1}'` -x ~tstarling/url.gdb 2>&1 | grep http [11:32:18] http://mk.wikipedia.org/w/index.php?title=%D0%A1%D0%BF%D0%B5%D1%86%D0%B8%D1%98%D0%B0%D0%BB%D0%BD%D0%B0%3A%D0%91%D0%B0%D1%80%D0%B0%D1%98&search=%D1%80%D0%B0%D0%B7%D0%B2%D0%BE%D1%98+%D0%BD%D0%B0+%D0%BA%D0%BE%D0%BC%D0%BF%D1%98%D1%83%D1%82%D0%B5%D1%80%D0%B8%D1%82%D0%B5 [11:32:30] sorry for spam [11:32:42] it's not even a one-liner anymore, you need the extra script [11:32:55] bits app server load stabilized around 40-45% [11:32:57] that's excellent [11:33:09] but it does show you the URL of the next heap overflow error, which is nice [11:33:15] my gut feeling about 4 RL apaches was correct ;) [11:33:31] nadeesha_calcey: still around to help with testing? 
[11:33:33] http://eiximenis.wikimedia.org/1-17-allwikis [11:34:36] http://ganglia.wikimedia.org/?c=Apaches%208%20CPU&h=srv226.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=3&mc=3 [11:35:20] time to order some more apaches [11:35:25] maybe so [11:35:26] we haven't done that in a long time [11:35:30] it's coming down to under 10 [11:35:33] probably because we were not deploying any new features ;) [11:35:39] shhh [11:35:45] I didn't know we had only 109 Apaches in the general pool [11:35:48] byt srv260 (for example) is hovering around 6 [11:36:08] 1950 vs R610 [11:36:14] slightly less in the new dc even [11:36:17] but they will be all 12-cores [11:37:04] load does seem a bit uneven indeed [11:37:09] remember, there are jobrunners too [11:37:18] hmm that's true [11:38:40] so where are we now? canaries and...? [11:39:06] plenty of OOM errors to fix [11:39:25] that's probably easier and more useful than canaries [11:39:39] because I should be in the data center in about... [11:39:50] shouldn't you get some sleep? :) [11:39:58] < 3 hours, and the site doesn't look like it's going to melt [11:40:04] yep. I'm heading to bed [11:40:11] hmm, were vector prefs overwritten [11:40:18] thedj: Oh, that's right, the script [11:40:20] Fixing [11:40:20] see folks later. if Rob pops into the channel and I'm not at the dc I'm still in the be [11:40:21] d [11:40:23] apergos: remember, the new data center is not important enough for you to exhaust yourself on it [11:40:29] i see people saying they have the new toolbar while they used to have the old toolbar selected [11:40:32] so please get enough sleep [11:40:45] I got some sleep before coming on line tonight. I will not be setting the alarm now though [11:40:47] and you can walk over to the dc afterwards ;) [11:40:51] yep [11:41:01] that's the plan. [11:41:03] ok [11:41:06] sleep well then! 
[11:41:13] thanks [11:41:27] have an uneventful and bug-squashing night [11:42:13] nite apergos [11:42:18] TimStarling, is there a list of these OOM's or anything for us to start poking at? [11:42:23] thedj: Known issue, running a script to (partially) fix it like right now. The ones still having the issue will, unfortunately, have to switch it off again [11:42:24] Reedy: I; [11:42:28] Reedy: I'll give you one OOM [11:42:29] no [11:42:40] I think I've figured out the canaries [11:42:49] http://mediawiki.pastebin.com/aTaCdCVB [11:45:08] http://no.wikipedia.org/w/index.php?rand=16&title=Liste_over_verdens_st?rste_olje-_og_gasskraftverk [11:45:12] rand=16? [11:45:45] Reedy: probably a userscript tool [11:45:58] hm [11:46:58] ok, the maintenance banner is going to go down automatically in 10 minutes. I assume that's ok. [11:47:24] guillom, is it worth putting up noting upgrade to 1.17 has happened, report issue to irc/bz? [11:48:02] Reedy, we can't put that in the notice, but it's what the landing page on mw.o has been saying all along, so I hope it'll be enough [11:48:28] RoanKattouw, no.wiki doesn't seem to want to load at all.. [11:49:11] Yeah, whole of no.wiki is giving 504 [11:49:15]