[00:03:54] what's typical ttl on these pcache entries again?
[00:07:42] wmf has them set to a week
[00:08:03] yes but what's typical ttl as opposed to the value in the config file?
[00:08:23] er.. no 14 days
[00:08:47] the default is just one day
[00:08:57] um
[00:09:00] I don't mean the default
[00:09:23] I mean the actual "memcached has only X much space and it starts to toss objects cause new ones go in, after about X time"
[00:09:27] well Y time.
[00:09:30] but anyways...
[00:09:36] looks like about a day
[00:09:48] ok
[00:09:55] how were you able to check that?
[00:11:18] I have a spreadsheet that I made a while ago
[00:11:30] nice
[00:11:32] just seeing if I made screenshots from it
[00:12:23] I don't think so
[00:12:31] ah too bad
[00:13:10] anyway the general strategy is to log into a random memcached server with telnet, and type "stats items"
[00:14:07] IIRC, "age" is the relevant statistic
[00:14:20] it shows how old the last item to be evicted was, in seconds
[00:14:46] *apergos tries it
[00:14:47] e.g. on srv271 you have
[00:15:09] STAT items:65:number 12876
[00:15:09] STAT items:65:age 42045
[00:15:31] so this slab has a TTL of 12 hours
[00:15:44] but
[00:16:16] STAT items:49:number 4628
[00:16:16] STAT items:49:age 121909
[00:16:19] this one is more than a day
[00:16:26] yeah, slabs aren't perfect
[00:17:08] "stats slabs" tells you how big each slab is, in the chunk_size field
[00:17:40] STAT 65:chunk_size 977
[00:17:47] so that one is 977 bytes
[00:19:46] ok, had a look at the command list
[00:19:47] thanks
[00:22:09] surprisingly, our memcached needs didn't go up that much :)
[00:26:23] huh
[00:27:38] STAT items:11:number 1
[00:27:38] STAT items:11:age 4527393
[00:27:38] STAT items:11:evicted 0
[00:27:52] I should never look too closely at this crap, I'm bound to find something irritating
[00:28:01] > 52 days?
[00:31:33] number=1
[00:31:45] yes, I see that
[00:32:01] so it's been a long time since anything was evicted or added
[00:32:17] (it's also tiny)
[00:32:22] probably never
[00:32:28] 52 days is probably the uptime
[00:32:54] bing
[00:34:59] STAT items:17:number 10266
[00:34:59] STAT items:17:age 2928522
[00:34:59] STAT items:17:evicted 3
[00:34:59] STAT items:17:evicted_time 1147658
[00:35:04] who can tell. bleah
[00:35:21] kinda unrelated and not urgent, installer is still broken on 1.17wmf
[00:35:37] ah, hello pdhanda
[00:35:47] what's the state of prototype atm?
[00:37:10] should be ok to use, i needed to fix some db issues on de
[00:37:24] it's running 1.which now? :-D
[00:37:40] 1.17wmf1 on release-en
[00:37:58] with or without tim's patch and platonides' patch?
[00:37:58] release-* actually
[00:38:02] without
[00:38:05] ok
[00:39:26] ok who stayed at the wikisuites and gave their phone number to a collection agency? fess up now
[00:44:07] must be Ryan_lane .. ;-)
[00:44:25] a collection agency?
[00:44:31] that would imply I have debt
[00:44:34] debt collection
[00:44:38] heehee
[00:44:52] just got the automated phone call to this apartment
[00:44:59] of course they don't say who they are looking for
[00:45:00] I think I may have $50 on a credit card right now for this month?
[00:45:13] I owe like $10 on a PG&E bill
[00:45:15] that's about it
[00:46:13] damn
[00:46:40] I see wfIncrStats is working now
[00:49:56] how about we gather some statistics about parser cache object age using wfIncrStats?
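The telnet-based eviction-age check described above can also be scripted. A minimal sketch in PHP, assuming a memcached instance reachable on the standard port 11211 (the host name is just the example from the log; this is not the spreadsheet tooling actually used):

    <?php
    // Ask a memcached server for per-slab LRU statistics. The "age" field
    // is the age in seconds of the last item evicted from that slab, which
    // approximates the slab's effective TTL under memory pressure.
    $fp = fsockopen( 'srv271', 11211, $errno, $errstr, 3 );
    if ( !$fp ) {
        die( "Connection failed: $errstr\n" );
    }
    fwrite( $fp, "stats items\r\n" );
    while ( ( $line = fgets( $fp ) ) !== false ) {
        $line = trim( $line );
        if ( $line === 'END' ) {
            break;
        }
        // Lines look like: STAT items:65:age 42045
        if ( preg_match( '/^STAT items:(\d+):age (\d+)$/', $line, $m ) ) {
            printf( "slab %d: effective TTL ~%.1f hours\n", $m[1], $m[2] / 3600 );
        }
    }
    fclose( $fp );

Cross-referencing with "stats slabs" (the chunk_size field, as below) then tells you which object sizes those lifetimes apply to.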
[00:50:16] that would give us a more accurate picture of the effect of a cache clear
[00:50:23] sure
[00:50:40] and the effect of $wgInvalidateCacheOnLocalSettingsChange
[00:51:05] I'm just going to do one more thing looking at the memcached stats for a sec
[01:00:43] clear-profile doesn't clear the stats
[01:00:50] that could be annoying
[01:00:52] mm
[01:01:13] (done looking at the memcached stats output, not getting a lot from there)
[01:01:45] http://noc.wikimedia.org/cgi-bin/report.py?db=stats/enwiki&sort=name&limit=5000
[01:01:52] the ones with a leading zero are accurate
[01:03:00] I restarted the daemon
[01:03:19] ok
[01:03:52] ok I'm wrong, clear-profile does work after all
[01:04:03] no worries
[01:04:32] anyway that shows you the age in hours of each ParserOutput object on a parser cache hit
[01:04:54] rounded to the nearest hour up to 100, then rounded to the nearest 10 hours
[01:05:17] right, I see
[01:10:59] so what gets me is that immediately upon reversion, i.e. within 5 minutes, we are good to go
[01:17:58] so within 6 hours we have half of the cache back... if it was a one-time event like the wgInvalidateCacheOnLocalSettingsChange
[01:19:04] sure seems like the second attempt should have done better, if that was all that was at fault
[01:19:44] but it does indicate that setting $wgCacheEpoch expires more of the cache than I expected
[01:19:46] do we not find objects because somehow we are concocting a bad key?
[01:19:58] how much were you thinking?
[01:20:31] the memcached stats show a shorter age, because they are based on last access
[01:20:47] right
[01:20:49] whereas these ages are based on the time the cache entry was written
[01:21:56] you could have a heavily-accessed cache entry with an age of 100 hours, and memcached could have its last used time as 5 seconds
[01:22:08] such a cache entry would be expired along with everything else
[01:25:39] right
[01:27:10] the acid test (which no one would agree to do) would be to switch again now that the impact of wgInvalidateCacheOnLocalSettingsChange would be minimal
[01:27:55] or we could set $wgCacheEpoch on 1.16wmf4 and see what happens
[01:28:03] errr :-D
[01:28:33] it's hard to predict because you need to know the distribution of requests
[01:28:46] i.e. how many of them are for popular objects, how many for less popular objects
[01:28:57] the exponent in the power law
[01:29:24] and even when you have that, simulating it would be a significant task
[01:29:30] and we don't have an easy way to track that anyways
[01:30:17] we have cache hit rate
[01:32:29] any idea why this is in bugzilla? https://bugzilla.wikimedia.org/show_bug.cgi?id=25758
[01:32:47] if you set $wgCacheEpoch to some recent value, the hit rate will instantly go down to some smaller value
[01:33:07] we can calculate that instantaneous hit rate using the cache age stats
[01:33:41] then afterwards, the hit rate will rise over time and slowly go back up to what it was before you set $wgCacheEpoch
[01:33:46] call it the refill rate
[01:33:49] sure
[01:33:56] that's what we don't know, but the current theory is that it's slow
[01:34:14] because when we deployed for 40 minutes, the cache hit rate didn't rise enough to get us out of trouble
[01:34:19] reaaaally slow if it's going to have that sort of impact
[01:34:51] so sure we could actually back off $wgCacheEpoch by not too much and watch, not enough to do harm
[01:35:05] if it's implausibly slow then we have to think of another explanation for the downtime
[01:35:34] see, it's the two deployments that bother me
[01:35:49] that the second one didn't do any better, or hardly any better
[01:36:17] but anyways I'm for a controlled test, it would tell us something
[01:40:05] we wanna wait for people to be around? try this now? check with robla?
[01:40:19] I feel like the risks are low if we don't back it off much
[01:42:10] the second one should have done a bit better
[01:43:39] 3 hours after $wgCacheEpoch, the hit rate should have been 40%
[01:44:04] of normal, i.e. 0.4 * 50%
[01:44:24] 6 hours after $wgCacheEpoch, it should have been 50% of normal
[01:44:45] exactly
[01:45:08] refilling would have been minimal because it was only refilling when $wgCacheEpoch was recent
[01:45:28] when it was on 1.16wmf4, it would have been using those old objects and not regenerating them
[01:45:33] so no refill
[01:45:40] ok, makes sense
[01:46:13] but that's still much better than the numbers we had
[01:46:38] no, it's about right
[01:46:50] hit rate was 55% on 1.16wmf4
[01:47:09] no...
[01:47:16] 22% miss rate, so 78% hit rate
[01:47:28] so you're right, much better
[01:47:55] on 1.17 we had 55% hit rate
[01:49:14] with this model, you'd expect 78% * 40% after three hours, which is 32%
[01:49:43] 78% * 52% after 6 hours, which is 40%
[01:50:06] but on 1.17 we had 55% each time
[01:50:11] that could be because of refilling
[01:50:22] probably the most popular objects were instantly refilled
[01:51:35] do we need to do an experiment on the live site to confirm this? and if so what should it be?
[01:51:43] which means that really we could stare at the numbers endlessly but the best way to know is to try it.
[01:51:51] that's exactly what I'm trying to figure out
[01:52:22] what the least risky test is that we can do that will give us a very high degree of certainty that this is or is not it
[01:52:27] we could simulate a 6-hour expiry for a few minutes
[01:52:41] by very high I mean high enough that everyone else is reasonably convinced
[01:52:42] check cache hit rate, see if it's similar to what we saw during deployment
[01:54:26] if it's bad it should fall over right away.. that will be the thing
[01:54:48] we wanna do this without taking the site down
[01:55:56] we have a bit more headroom because of the time of the day
[01:56:05] that's true
[01:56:12] which means if we do it, we had better do it soon
[01:56:35] too bad we can't target just one server
[01:56:37] well...
[01:56:40] or could we?
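The "instantaneous hit rate" referred to above can be read straight off the age histogram: the fraction of today's hits that would survive an epoch set T hours in the past is the fraction whose logged age is under T. A sketch (the function name and histogram data are illustrative; real numbers come from the pcache_hit:age=NNN counters):

    <?php
    // Fraction of the normal hit rate that survives immediately after
    // setting $wgCacheEpoch to $epochHoursAgo hours in the past: hits on
    // entries younger than that survive; everything older is expired.
    function survivingFraction( array $ageHistogram, $epochHoursAgo ) {
        $surviving = 0;
        $total = 0;
        foreach ( $ageHistogram as $ageHours => $hits ) {
            $total += $hits;
            if ( $ageHours < $epochHoursAgo ) {
                $surviving += $hits;
            }
        }
        return $surviving / $total;
    }

    // Toy data, keyed by age bucket in hours:
    $hist = array( 1 => 20000, 2 => 12000, 3 => 8000, 6 => 10000, 24 => 50000 );
    // Surviving fraction for a 3-hour epoch bump with this toy data:
    printf( "%.0f%%\n", 100 * survivingFraction( $hist, 3 ) );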
[01:57:07] no, we wouldn't get any refill if we just did it on one server
[01:57:10] eh right
[01:57:14] phooey
[01:57:40] well let's think it through
[01:57:45] give me a few minutes to set up data collection
[01:57:46] if we do the run and crap falls over
[01:57:50] then we know this is the issue
[01:57:53] I want to have a lot of data
[01:57:57] in which case we know we can schedule deployment
[01:58:03] if it doesn't fall over, no harm done etc
[01:58:12] (ok)
[02:01:13] so the plan is to update $wgCacheEpoch to simulate a (partial?) cache clear and confirm behavior?
[02:01:23] yes. or fail to confirm
[02:01:51] if we confirm stuff could fall over but at least we'll know "that was the issue, it's fixed now, deployment is safe as far as pcache goes"
[02:02:05] if we fail to confirm then we won't have had much impact on the site, presumably
[02:02:47] still feeling a bit queasy about it
[02:02:59] but... short of actually testing this stuff, I dunno how we're going to be sure
[02:08:07] got any brain waves, brion?
[02:12:00] if we do this, and things start falling over, we can put the setting back, and it should recover fairly quickly, right?
[02:12:30] it should recover instantly; no new objects will be tossed, it will just go back to using the old objects
[02:12:42] sounds reasonable to me
[02:12:51] (i've been lurking this conversation for a while ;) )
[02:12:51] any objects that had been regenerated by the new setting would have been shoved back into the cache
[02:12:54] that's the theory...
[02:12:54] heh
[02:13:08] theory and practice.
[02:13:10] yeah.
[02:13:44] this is how come I'm only half an ops
[02:13:47] and if it doesn't work in practice, we are down for a couple hours?
[02:13:55] ops people are supposed to be really conservative about stuff
[02:13:58] *cough*
[02:14:28] well, testing like this is better than trying a deploy without knowing it's going to work
[02:14:28] well if we reset the setting,
[02:14:40] it won't toss anything else
[02:14:44] it's more conservative to do a test like this
[02:14:44] that's what controls that.
[02:14:49] I agree
[02:15:10] *nod* it's not unreasonable. would be nice to do on smaller scale though
[02:15:15] it sure would
[02:15:16] but..... scale's what's being tested :P
[02:15:20] heh
[02:15:21] but we don't have a smaller scale, yup
[02:15:27] I'm working on it ;)
[02:15:32] yay
[02:15:35] I know, that wasn't a dig at you
[02:15:45] I know. I was just throwing that out there
[02:15:45] heh
[02:15:46] although boy it will sure be nice when that's live
[02:15:57] hopefully next time we'll have a way to test things a little better
[02:15:57] getting a load on that cluster will be hard though
[02:16:06] cause we have to have steady use
[02:16:23] it also won't use the live databases, which makes things difficult
[02:16:25] worry about that later
[02:16:45] though we could spin up a production version for testing, then kill it afterward
[02:17:19] I assumed we would do something like that
[02:17:39] that's one of the ideas I had hoped to do with it :)
[02:17:47] guess we should announce in the other channel before we go
[02:18:05] *TimStarling is trying to figure out xmlstarlet
[02:18:12] don't know why I use this program
[02:18:27] I have no idea what that is
[02:27:14] ok, I'm ready
[02:27:26] oh boy
[02:28:06] gonna restart profiling right after, yeah?
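The test being prepared here amounts to a single configuration line; the 14-digit value can be derived with the standard MediaWiki timestamp helper rather than computed by hand (a sketch anticipating the value quoted just below, not the exact edit that was made):

    <?php
    // Expire every parser cache entry written more than six hours ago.
    // Nothing is deleted from memcached: entries older than the epoch are
    // merely treated as stale, so reverting this line restores them instantly.
    $wgCacheEpoch = wfTimestamp( TS_MW, time() - 6 * 3600 );
    // yields e.g. '20110208203425', the value used in the test below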
[02:28:56] yes
[02:30:03] I'll set $wgCacheEpoch = '20110208203425', which is 6 hours ago
[02:30:46] I'm looking at where I think it's set
[02:30:55] it's 2006 something right now?
[02:31:01] yes
[02:31:05] (which is meaningless, I know, but just surprising)
[02:31:05] ok
[02:31:09] don't edit the file, I'm about to save it
[02:31:12] not going to
[02:32:26] it's live
[02:33:50] load going up
[02:34:02] that's expected
[02:34:21] 31% hit rate?
[02:34:30] hmm
[02:34:45] is that what you see?
[02:34:45] load coming back down
[02:36:34] 33?
[02:37:11] ok load seems higher than it was but stable
[02:38:29] 35?
[02:38:42] I may be computing these the wrong way, we can check later
[02:38:46] load is still a long long way below the red line
[02:38:57] yes a long ways
[02:39:48] it's off peak, but CPU only went up from ~40% to ~50%
[02:40:08] very unimpressive
[02:40:22] heh
[02:40:33] I prefer unimpressive but I would prefer even more to have the solution in our hands
[02:41:49] I think this isn't the problem
[02:41:57] shall I revert now?
[02:42:02] yes
[02:43:39] what are the pcache_not_possible numbers?
[02:43:41] CPU has recovered
[02:43:57] instantly of course
[02:45:23] it's incremented from OutputPage::addWikiTextTitle()
[02:45:54] I think we should ignore them
[02:46:00] so far I have been
[02:46:25] let's just count pcache_miss_absent and pcache_miss_expired
[02:46:48] 28360, 22127 for the last numbers I had
[02:46:58] 29010 pcache_hit
[02:47:42] 36.4%
[02:47:55] hmmmm
[02:48:34] how did they compute the earlier numbers?
[02:49:23] with Article::view and Parser::parse-Article::getOutputFromWikitext
[02:49:32] different measuring sticks
[02:49:47] but I have before and after stats
[02:49:54] I just have to process this XML somehow
[02:50:04] ok
[02:51:58] 66.4 before the switch?
[02:52:41] sure looks like. My last numbers for that are hit: 543031, absent: 263432, expired: 10219
[02:52:58] I'll take your word for it
[02:53:03] until my script is written anyway
[02:53:39] well you can always do the addition/division yerself. even with a calculator I'm dangerous(ly wrong)
[02:54:07] so you had us at 31 and we ended at 36% in ..
[02:54:13] bah, I did not keep track of the time. um
[02:54:32] 10.5 minutes or so
[02:58:33] last night (= this morning for some of us) we had 1.17 running the first time for about 30 minutes
[02:58:40] there was a lot of flapping and such during that time
[03:00:17] and another 30 mins or so the second time it looks like from the admin logs
[03:07:44] holler when you got something
[03:11:39] gnumeric is much nicer than OO Calc
[03:11:53] I'll bear it in mind
[03:12:05] I'll bear it in mind too
[03:12:40] more features, less stupid, and the UI is much faster
[03:12:59] OO Calc reminds me of running Excel on a 486
[03:13:02] most things are much faster than oo calc
[03:13:20] I never ran excel on a 486 (or anything). sounds bad
[03:14:19] I wonder if we could have a lot more things that are considered not cacheable in 1.17
[03:14:39] but then they won't be marked as misses I suppose
[03:17:02] it even has "save as image" in the graph context menu
[03:17:08] awww
[03:17:12] <^demon> I can't imagine this sleep-for-three-hours-then-continue-working pattern is really good for me :p
[03:17:18] I don't know why I haven't used it for years
[03:17:25] I know it's no good for me but wth
[03:17:41] I don't know that we'll do another round of anything tonight
[03:17:52] unless the numbers come up much different than I expect
[03:17:57] or one of us has a brain wave
[03:18:30] I'm going to see if I can do a local test to rule out a couple things, might take some time though, need to set up a fresh installation
[03:18:58] http://tstarling.com/stuff/wgCacheEpoch-experiment.png
[03:19:20] hmm, that line needs to be a bit thicker
[03:20:01] ok
[03:20:19] updated
[03:20:47] ok, about my numbers
[03:21:03] so now, what does this tell us about runs one and two last night?
[03:23:44] seems like it would be tough to get to 55% in the first half hour... and then there would be no reason for the second run to just sit there at the same 55%
[03:24:05] we are measuring hit rate in a different way, so the numbers aren't directly comparable
[03:24:24] yeah I suppose
[03:24:25] but I think the theory that the downtime was caused by low hit ratio is looking shaky
[03:24:56] what was the percentage for before the first deployment, using their yardsticks, do we have it?
[03:25:53] no we don't have it
[03:25:57] bah
[03:26:02] profiling was broken until halfway through the first deployment
[03:26:05] right
[03:26:08] so it was
[03:26:08] so there's no data from before then
[03:26:21] we can convert the numbers
[03:26:37] the difference is in the denominator, right?
[03:27:04] i.e. Article::view doesn't always lead to either a hit or a miss
[03:27:12] ah
[03:27:22] so I don't know anything about those numbers
[03:28:04] but we could always capture profiling data now
[03:28:26] probably similar to before deployment
[03:29:49] and then look at relative numbers for those as compared to what we got just a little bit ago
[03:30:28] the sample rate for profiling is 1 in 50
[03:32:56] currently we have Article::view() = 14656, pcache_hit + pcache_miss_* = 611838
[03:33:45] so 83% of Article::view() calls lead to a pcache hit/miss
[03:35:45] so if the old miss rate was 22%, the new miss rate will be 26%
[03:36:04] with denominator reduced by a factor of 0.83
[03:37:38] and the old 1.17 miss rate of 45% would be a new miss rate of 54%
[03:38:53] so a 1.17 new hit rate of 46%, which is higher than what we saw in the $wgCacheEpoch test
[03:39:32] so sorry but not getting the first set of numbers to work out
[03:39:39] I told you I can't add even using bc or whatever
[03:39:52] real article views is 14656*50 ?
[03:40:01] yes
[03:40:10] what this is saying to me is that the LocalSettings.php thing was sufficient to cause the drop in hit rate
[03:40:16] but not sufficient to cause the spike in CPU
[03:40:43] wait, this is wrong
[03:40:47] so you're saying 732800*83/100 = 611838
[03:40:47] this graph is wrong
[03:40:56] but I can't get that to come out
[03:40:59] most of what I said is right but this graph is wrong
[03:41:00] *apergos waits
[03:41:21] I didn't take deltas, so it's showing cumulative statistics
[03:41:29] mmmmm
[03:42:11] what did we have from earlier?
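The yardstick conversion being done in this exchange, written out (a sketch; 0.83 is the measured fraction of Article::view() calls that end in a parser cache hit or miss, and the 1-in-50 profiling sample rate explains the factor of 50):

    <?php
    // Profiled counts: 14656 sampled Article::view() calls (sample rate
    // 1/50) versus 611838 pcache hit+miss counter increments.
    $fraction = 611838 / ( 14656 * 50 );  // ~0.83

    // Old yardstick: misses / Article::view() calls.
    // New yardstick: misses / (hits + misses), i.e. a smaller denominator.
    $newMissRate116 = 0.22 / $fraction;   // ~26% for 1.16wmf4
    $newMissRate117 = 0.45 / $fraction;   // ~54% for 1.17
    printf( "1.16: %.0f%%  1.17: %.0f%%\n",
        100 * $newMissRate116, 100 * $newMissRate117 );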
[03:43:09] we had cumulative statistics from short known time intervals
[03:43:26] short intervals... let's find out how long those were I guess
[03:43:37] if possible.
[03:43:40] it doesn't matter, it's just an average
[03:43:53] the graph is an average over time when it shouldn't be
[03:44:03] so it won't swing at the end as much as it should
[03:44:03] right, I get that
[03:44:15] in the old stats, all we have is the average
[03:44:21] mm hmm
[03:44:26] that's the thing
[03:45:25] anyways I tend to agree: cpu spike not really accounted for, and that's the stopper
[03:46:20] so maybe we disregard the pcache stuff entirely and go hunting for cpu related items
[03:50:01] updated: http://tstarling.com/stuff/wgCacheEpoch-experiment.png
[03:50:10] now the recovery looks like it should
[03:51:19] ok
[03:51:19] hmm really? 30 minutes to recover fully?
[03:51:36] or is the data at the end from after the revert?
[03:51:57] it recovers instantly at 02:47 when I reverted $wgCacheEpoch
[03:52:04] ok, that makes sense then
[03:53:20] the CPU spike might have been on another database
[03:53:25] we only looked at profiling on enwiki
[03:53:39] that's true
[03:53:49] I expect it would have had to be one of the bigger ones
[03:53:59] there are, what, about 10 candidates?
[03:54:53] I guess
[03:55:09] I suppose we want them by readership
[03:55:11] ugh
[03:55:42] but this is not going to be some template; any content-side thing is still there running happily under 1.16
[04:01:01] I wouldn't assume that.
[04:01:34] It would be extremely unlikely that someone added a bit of content right when we went live, and removed it both times
[04:01:58] That's not how I read what you wrote.
[04:02:02] either js or template or whatever
[04:02:06] But sure, that'd be unlikely.
[04:02:21] I wouldn't say definitely this isn't content/template-related, though.
[04:02:26] definitively, too.
[04:02:36] domas referred earlier to es having taken down the site with some template or other
[04:02:48] Lots of small things have taken down the site (or come close).
[04:03:02] oh sure, it could be related to a different way we process something. I mean, it pretty much has to be, right? :-P
[04:03:32] Yes, I think so.
[04:04:05] Starting at 1.16wmf4 and updating at half points might help figure out what's causing the problem. Dunno.
[04:04:13] This is one of the many reasons I hate branches, though. :D
[04:04:47] well getting agreement to push all those tests live might be tough
[04:04:59] I don't think we need to waste too much time shooting down Shirley's silly ideas
[04:05:18] eh, I'm kind of in time-wasting mode for a minute here
[04:05:50] staring at the profile output from 1.16 earlier and your 1.17 saved output and wondering where the culprit is
[04:11:06] I'll write an email about where we're up to
[04:11:16] ok
[04:20:56] we sure got a lot of requests for things that don't exist
[04:20:56] as I look through the logs on one of the apaches (still get them, it seems)
[04:20:56] File does not exist: /usr/local/apache/common/docroot/wikipedia.org/chrome:
[04:20:56] and many other oddities
[04:24:20] PHP Fatal error: Allowed memory size of 83886080 bytes exhausted (tried to allocate 4064 bytes) in /usr/local/apache/common-local/wmf-deployment/includes/StringUtils.php on line 322
[04:24:22] and
[04:24:40] PHP Fatal error: Maximum execution time of 180 seconds exceeded in /usr/local/apache/common-local/wmf-deployment/includes/Title.php on line 2585
[04:26:16] of course this is at the point the machine was heavily loaded
[04:43:00] within a few minutes of deployment the second time we have max clients reached on the one apache where I'm combing the logs
[04:45:45] same for the first deployment
[04:46:52] hard to know what's a cause and what's a symptom unfortunately
[04:54:16] *robla reads through backlogs
[04:55:32] hello
[04:55:45] I just saved a bunch of profile data pages, prolly not useful now
[04:55:53] and am about to give it up for the night I guess
[04:57:15] I wrote a summary of the $wgCacheEpoch thing for private-l
[04:57:18] oh yeah and we did not of course reach max clients during this little test, nothing like it
[04:57:44] <^demon> Summary was nice, saved me from reading too much scrollback
[04:58:09] thanks for sending the email
[05:00:49] yes, thank you!
[05:01:23] <^demon> I'm so out of whack right now. I've lost track of what time && timezone I'm even in anymore
[05:01:52] you think you lost track...
[05:02:38] only thing keeping me from sleep right now is I don't want to wake up at 2 am
[05:03:49] <^demon> I napped from like 6 to 10, so I'm a bit too awake to head back off yet.
[05:04:09] ahh
[05:04:12] nice
[05:17:16] <^demon> OT: Is the 1950s Julius Caesar with Marlon Brando good?
[05:21:37] uhhhh
[05:21:39] I dunno
[05:21:44] I am sure I never saw it
[05:21:57] looking for things to watch on teh internets?
[05:22:10] <^demon> I love older movies.
[05:22:17] <^demon> And they're great to fall asleep to.
[05:22:20] oh, ok
[05:22:32] I mostly got indoctrinated to watch musicals
[05:22:37] like old fred astaire movies
[05:36:36] do you have netflix access?
[05:36:44] nope
[06:59:56] TimStarling: So, $wgCacheEpoch. Where does that 10:34 UTC number come from? Some file's mtime? In that case, have you considered we've done quite a bit of moving around and syncing and stuff?
[07:00:35] it's the modification time of LocalSettings.php on all apaches
[07:00:47] Right, thought so
[07:00:48] it's older in NFS but something touched it in the local copies
[07:01:15] Was it newly created when I first synced out the php-1.17 dir maybe?
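For context on the mechanism under discussion: with $wgInvalidateCacheOnLocalSettingsChange enabled, MediaWiki folds the mtime of LocalSettings.php into the cache epoch, roughly along these lines (paraphrased from memory of the Setup.php hack, not a verbatim copy of the deployed code):

    <?php
    // If LocalSettings.php was touched more recently than the configured
    // epoch, its mtime becomes the effective epoch, so a config change
    // (or an rsync that failed to preserve mtimes) silently invalidates
    // every cached object written before that moment.
    if ( $wgInvalidateCacheOnLocalSettingsChange ) {
        $wgCacheEpoch = max( $wgCacheEpoch,
            gmdate( 'YmdHis', @filemtime( "$IP/LocalSettings.php" ) ) );
    }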
[07:01:25] rsync is supposed to preserve times but often fails due to permissions issues
[07:01:39] you have to run it as root if you want it to work properly
[07:01:59] anyway it doesn't matter why it was changed
[07:02:21] I've fixed the configuration now so it won't matter if it's changed again
[07:02:55] the creation time is still the same
[07:02:59] it's only the mtime
[07:04:25] You're right, 10:34 UTC
[07:05:00] anyways it's back to the drawing board: why did the site fall over?
[07:05:28] so we had a bunch of requests to the apaches: why? were they slower and so they stacked up?
[07:06:07] if CPU is exhausted, thread count spikes, that's just how it works
[07:06:25] that's why we have a maximum
[07:06:52] the thread counts everywhere will be exhausted, then users will see error messages
[07:07:28] and since error messages are cheaper than serving the site, that brings CPU demand back into line with capacity
[07:07:36] sure but my question is about the root cause
[07:08:09] don't know
[07:08:10] OK so the first time I'm pretty sure something was wrong with the pcache
[07:08:22] exactly. nobody knows. we are back at square one.
[07:08:28] I believe brion had discovered a case where it didn't use an existing cache entry
[07:09:03] basically once anyone had caused a new parser cache entry to be saved for a given page, *all* legacy entries for that page would never again be loaded
[07:09:17] $wgInvalidateCacheOnLocalSettingsChange alone was enough to cause the parser cache hit rate to drop by the amount that it dropped
[07:09:27] Right
[07:10:13] So there could be and probably is another cause contributing to the CPU overload
[07:10:21] what it wasn't enough to do (it seems) was cause the cpu hit... so what did? I've been thinking about ways to tackle that without watching it live,
[07:10:23] that is the theory
[07:10:26] but coming up empty-handed
[07:10:29] we could do a staged deployment
[07:11:09] staging site by site? i'd recommend that yes, though..... it should be tested on a production-equivalent cluster first :)
[07:11:23] wiki by wiki, yes
[07:11:30] if only we had a production-equivalent cluster
[07:12:17] we could take something medium-sized like frwiki
[07:12:26] Yeah
[07:12:49] not rtl, I don't think it has flagged revs (have to check)
[07:13:17] Like I said last night, I think a quick hack for doing het deploys in this context shouldn't be hard
[07:13:21] nope it doesn't
[07:13:33] \o/
[07:14:30] what would we need to make that work?
[07:14:54] http://wikitech.wikimedia.org/view/Heterogeneous_deployment
[07:14:58] I'm already there
[07:15:12] definitely both versions sitting on all the apaches (we have that)
[07:16:04] I'm unclear about what happens to bits, etc
[07:16:57] here is a typical bits URL: http://bits.wikimedia.org/skins-1.5/common/shared.css?283-19
[07:17:10] i added two lines to the plan :) http://wikitech.wikimedia.org/view/Heterogeneous_deployment
[07:17:12] However
[07:17:30] When het deploying between a version that calls static resources on bits and a version that calls load.php on bits, this shouldn't matter
[07:17:35] if you want to deploy 1.17, you can just change the version in that URL from 1.5 to 1.17
[07:18:03] except for debug mode
[07:18:15] we had better have debug mode working
[07:18:23] Hmm, true
[07:18:46] So we need to point skins-1.17 to php-1.17/skins
[07:18:53] and images
[07:19:00] http://prototype.wikimedia.org/rc-en/skins/common/images/poweredby_mediawiki_88x31.png
[07:19:08] mmmm
[07:19:09] Right
[07:19:15] I'm sold, we need skins-1.17
[07:19:22] Should be easy to set up anyway
[07:19:32] ok
[07:19:46] maybe also extensions-1.17
[07:19:55] Yes, good one
[07:20:07] $wgCacheDirectory = '/tmp/mw-cache-1.17';
[07:20:11] i.e. $wgExtensionAssetsPath
[07:20:13] yes
[07:20:15] ah
[07:20:26] yes ok, that's the message cache
[07:20:38] Yes, it's shared
[07:20:44] you can just search the configuration for /tmp, that will show you all the shared caches
[07:21:00] Right, all those need to be 1-17ed
[07:21:25] Then live-1.5/MWVersion.php is where the actual version switching would happen
[07:21:45] And the $IP setting in php-1.17/wmf-config/CommonSettings.php would need to be changed
[07:21:45] bingo
[07:21:45] just like the olden days
[07:22:08] Hm, I have an idea
[07:22:10] i point out that that will break some content making incorrect path assumptions (js and css)
[07:22:12] We can try to het deploy test
[07:22:13] though probably not much
[07:22:31] thedj: You mean for extensions/ ?
[07:22:38] you can always make a new test wiki just for testing this
[07:22:39] for skins-1.17
[07:22:45] We already have skins-1.5
[07:22:54] Oh, you mean user JS/CSS
[07:22:59] Yeah, we can just create a new wiki
[07:23:01] yup. hardcoded paths.
[07:23:19] bleah there are probably a lot of those
[07:23:21] ah well
[07:23:53] we provide stylepath already for user JS
[07:24:11] if they have hard-coded "http://bits.wikimedia.org/skins-1.5", that is their own silly fault
[07:24:15] Hmm, this quick het deploy thing is sounding fun
[07:24:17] for enwiki it's probably not much of a problem, we are reasonably away from the hardcoding, and we have knowledgeable people. it's the other wikis that will probably run into issues.
[07:24:33] Too bad I have to hand in this school thing today, or I would've dived straight into it
[07:24:43] heh
[07:24:46] *thedj kicks RoanKattouw to school
[07:24:52] oh crap, i have to get to work :D
[07:24:57] thedj: I'm not saying it's not a problem, just that it's someone else's problem
[07:25:03] I have to sleep. I know that's novel but...
[07:25:13] you know, thought I'd give it a try for once
[07:25:23] good night
[07:25:28] night!
[07:25:34] TimStarling: i don't care much either, just making sure it is taken into account.
[07:26:59] If and when we do deploy 1.17 to one wiki in an isolated fashion, we'll want to set the LocalSettings.php mtime to equal the current $wgCacheEpoch, I guess
[07:27:43] why?
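Collecting the pieces from this exchange, the per-version splitting would look roughly like this on the 1.17 side (a sketch assembling the paths named above; the exact file and variable placement is illustrative, not the configuration as deployed):

    <?php
    // In php-1.17/wmf-config/CommonSettings.php: everything shared
    // between 1.16 and 1.17 gets a per-version copy.
    $IP = '/usr/local/apache/common-local/php-1.17';
    $wgStylePath           = 'http://bits.wikimedia.org/skins-1.17';
    $wgExtensionAssetsPath = 'http://bits.wikimedia.org/extensions-1.17';
    $wgCacheDirectory      = '/tmp/mw-cache-1.17';  // local message cache

    // live-1.5/MWVersion.php then does the actual switching: route the
    // guinea pig wikis to php-1.17 and everything else to wmf-deployment.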
[07:28:11] So we don't get the cache epoch bump and the pcache hit rate dip
[07:28:24] Reduces noise in profiling and such, allows us to focus on other things that might be wrong
[07:28:27] it's fixed already
[07:28:39] Oh, you disabled the feature?
[07:28:52] I set $wgInvalidateCacheOnLocalSettingsChange=false in CommonSettings.php in php-1.17
[07:29:16] which disables the feature, yes
[07:29:34] previously it was a hack in the default LocalSettings.php, I moved it out of there and into Setup.php
[07:35:14] Oh wow, I totally missed http://noc.wikimedia.org/cgi-bin/report.py?db=stats/enwiki
[07:35:19] I didn't know that wfIncrStats stuff still worked
[07:35:33] What's up with the pcache_hit:age=001 stuff
[07:35:35] ?
[07:47:16] *robla doesn't know
[07:48:02] what sort of local parser benchmarking do we have?
[07:49:02] *robla does some rudimentary snooping around the parser tests
[07:50:46] Does http://www.mediawiki.org/wiki/WMF_Projects/wmsync have an associated bug?
[07:50:57] Shirley: I don't think so
[07:51:07] I think the 2010q4 category can safely be killed, too.
[07:51:18] RoanKattouw: I made it log the age of each parser cache object as it is retrieved
[07:51:21] in hours
[07:51:35] Ah OK
[07:51:47] Ariel and I were modelling what would happen if we increased $wgCacheEpoch
[07:51:58] I'll remove it now if you don't want it
[07:52:02] It's fine
[07:52:11] I hadn't noticed the stats before, they're very useful
[07:52:34] they were broken for a long time, domas fixed them
[07:52:42] broken as in missing
[07:53:08] Cool
[07:53:21] OK so one thing is, we can't profile arbitrary wikis this way
[07:53:40] We'd have to manually add any guinea pigs to the profiler code/config, right?
[08:00:31] yes
[08:01:00] you just add them to startprofiler.php
[08:01:03] bbl
[08:01:07] I'm headed to bed, but I'm just wondering if anyone has already tried to find some crude basis (e.g. basic parser test) for a binary search through the revision history (a la git bisect) that might uncover problem commits.
[08:03:06] We'd need to reproduce the parser weirdness on the current version first
[08:03:18] I'm not sure that anyone has even succeeded in that yet
[08:03:52] Also, the $wgCacheEpoch thing Tim discovered can explain the miss rates on its own, with something else probably causing the increase in CPU usage
[08:05:23] I guess I'm trying to probe some ways to figure out the CPU usage increase
[08:06:14] and maybe more generally than a parser performance test, has anyone figured out how 1.16 and 1.17 compare in a non-production environment
[08:12:25] nm...I'm off to bed
[08:18:52] RoanKattouw, which parser weirdness?
[08:20:03] The high pcache miss rate
[08:20:10] Although that may have been solely due to the $wgCacheEpoch thing
[08:20:19] In fact it probably was
[08:21:06] we may want to add another counter to the second if (!$value) that we added yesterday
[08:21:50] where does the age= come from?
[08:22:58] The more informative counters, the better
[08:23:00] Tim added the age= stuff
[08:23:35] seems it wasn't put in svn yet
[08:24:06] That's right
[08:24:13] Two uncommitted changes on the cluster
[08:24:34] oh, remember to add r81765
[08:24:46] I don't want to kill the system again for that :P
[08:24:49] http://mediawiki.pastebin.com/PhhJAJGM
[08:25:49] so age is number of hours
[08:26:10] Yes
[08:30:01] I don't see the source for pcache_miss_stub either
[08:30:43] Me neither
[08:31:05] Article.php
[08:31:08] Note that the changes I pastebinned are against 1.16wmf4
[08:31:58] I supposed so
[08:34:33] !r 81765
[08:34:33] --elephant-- http://www.mediawiki.org/wiki/Special:Code/MediaWiki/81765
[08:39:19] I would also add this http://p.defau.lt/?U1f5EP3R8goHlP9H_UqZ4Q
[08:39:20] to have more numbers available
[08:39:32] (on 1.17)
[08:41:18] I think there's a typo in your if condition
[08:41:34] where?
[08:41:49] strpos( $value , ... )
[08:41:53] $value is a ParserOutput, not a string
[08:42:26] AFAICT
[08:42:35] you're completely right
[08:42:48] it should have been strpos( $parserOutputKey, '*' )
[08:44:37] http://p.defau.lt/?5ihZ3Lk3CU2CyxkkA_q_Hg
[08:47:49] OK so if the new format misses but the old format hits
[08:48:02] We should see pcache_miss_newkeys but no corresponding pcache_miss, correct?
[08:48:44] right
[08:48:51] and it would actually be a hit
[08:48:57] Right
[08:49:06] those were added between the first attempt and the second
[08:49:20] At least this'll allow us to see whether the new keys thing is a problem
[08:50:15] if we're doing a heterogeneous deploy I'd begin with a wiki without too many extensions
[08:50:35] such as without flaggedrevs
[09:01:51] Yeah
[09:26:28] when will the next deploy attempt be?
[09:38:51] Tonight, we will decide whether to deploy tomorrow
[09:39:13] (That's tonight and tomorrow UTC)
[09:57:53] RoanKattouw: I made some js changes in site js for 1.17 which are not compatible with 1.16. with cached js going away, people are beginning to complain. do you think I should revert it now?
[09:59:41] I'd make it conditional on wgVersion
[10:02:01] What Platonides said
[10:02:15] wgVersion will be either '1.16wmf4' or '1.17wmf1' depending on which version is live
[10:10:15] ok... I'm writing a single line importScript('MediaWiki:Common.js/' + wgVersion + '.js'); in Common.js
[10:11:08] do i understand correctly that there was an error in the patch for attempt 2?
[10:11:40] what should I use for importScript in 1.17 style?
[10:13:03] liangent: i don't understand the question.
[10:13:15] thedj: We don't know exactly what went wrong in attempt 2
[10:13:16] wgVersion = 1.17wmf1
[10:13:32] We do know that the high pcache miss rate can be explained by accidental bumping of $wgCacheEpoch
[10:14:36] how should I use mw.loader to achieve the same goal as importScript?
[10:15:24] importScript will continue working in 1.17
[10:15:36] liangent: mediaWiki.loader.load( url );
[10:15:43] But yes, importScript will continue to work
[10:15:56] At least in user/site JS
[10:16:12] RoanKattouw: that way it has to be a URL. what about loading by page name?
[10:16:19] liangent: not possible atm
[10:17:26] I'd continue using importScript for now
[10:18:01] yeah, that's the best idea. if that is ever truly deprecated, there will be plenty of work anyways.
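A plausible shape for the uncommitted age instrumentation being discussed, based on the description earlier in the log (ages bucketed to the nearest hour up to 100 hours, then to the nearest 10 hours). The variable names are guesses; this is not the pastebinned diff:

    <?php
    // On a parser cache hit, record the age of the retrieved ParserOutput
    // in coarse buckets, producing the pcache_hit:age=NNN counters seen
    // in the noc report.py output.
    $ageHours = ( time() - wfTimestamp( TS_UNIX, $value->getCacheTime() ) ) / 3600;
    $bucket = $ageHours <= 100
        ? round( $ageHours )            // 1-hour buckets up to 100h
        : round( $ageHours / 10 ) * 10; // 10-hour buckets beyond that
    wfIncrStats( sprintf( 'pcache_hit:age=%03d', $bucket ) );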
[10:19:22] but what about importScriptURI
[10:19:34] importScript/importScriptURI were defined as an API for mediawiki javascript
[10:19:41] I would keep them for a long time
[10:19:43] I guess there'll be some issues when people mix importScriptURI and mw.loader
[10:19:51] Not at this time at least
[10:19:56] mw.loader in URL mode is dumb
[10:20:04] i.e. loading scripts twice
[10:20:08] It doesn't keep any kind of state about loaded URLs, it just loads them
[10:20:32] So both importScriptURI() and mw.loader.load() in URL mode will happily load URLs twice
[10:20:55] At least that's the way stuff works right now
[10:21:09] there was a loadedScripts memory in importScriptURI?
[10:21:10] For instance, there is no reason importScriptURI couldn't be aliased to mw.loader.load() in the future
[10:21:18] Hello, we have a problem on bnwiki with JS. Can anybody help?
[10:21:18] I don't know, was there? Maybe
[10:21:21] importScriptURI actually checks if a key has been loaded before.
[10:21:34] Hm, I don't think load() does that for URLs
[10:21:37] Tanvir: What problems?
[10:22:47] The problem is, when we try the js code in our personal js it works, but when it's on MediaWiki:Something.js, it doesn't work..
[10:22:53] The version still shows 1.16, which I assume is correct.
[10:23:02] Tanvir: have you tried doing a null edit on MediaWiki:Something.js ?
[10:23:12] Platonides, no.
[10:23:12] there were some message problems yesterday
[10:23:17] try it
[10:23:23] Okay trying.
[10:23:28] and yes, the sites are still at 1.16
[10:23:31] They should've been fixed when I purged the individual msg cache on all wikis, but try anyway
[10:23:56] do we know what caused those problems?
[10:24:06] If it's still broken after that, see if there are JS errors and give us the error messages and line numbers and such
[10:24:12] Msg cache incompatibilities I guess
[10:24:24] Not worth diving into much if there's a known workaround
[10:24:54] Platonides, null edit means? should I remove something and then restore it back?
[10:25:13] save without changing anything
[10:25:19] Okay.
[10:25:46] Platonides: I ran this last night, that seemed to fix it: for wiki in `< all.dblist`; do echo '$wgMessageCache->clear();' | php wmf-deployment/maintenance/eval.php --wiki=$wiki; echo $wiki; done
[10:27:06] I hope the normal update.php performs something like that
[10:27:32] I think it does
[10:28:17] Platonides, it works! :D
[10:28:52] :)
[10:29:38] But did someone close the editing privileges for a few seconds? I got a message when saving, like some admin had restricted database privileges.
[10:30:05] Platonides: The updater doesn't do this, no :( but it does DELETE FROM objectcache; so it's not a problem on wikis using CACHE_DB
[10:30:27] Tanvir: just now?
[10:30:31] Yes.
[10:30:37] there was no change...
[10:31:13] Oh, but I got the message. It's not a problem though.
[10:31:39] would have been nice to have it
[10:32:46] So, 1.17 implementation won't take place in the near future?
[10:33:07] it depends what you mean by near future
[10:33:08] we will try it
[10:33:16] not today
[10:33:22] this week maybe
[10:33:31] and this month for sure
[10:33:43] Okay. :)
[10:33:52] We will make a decision tonight (UTC) as to whether to deploy again tomorrow (UTC)
[10:35:03] RoanKattouw: so was the pcache issue fixed?
[10:35:21] The pcache issue may have been due to the $wgCacheEpoch thing instead
[10:35:28] Tim and Ariel ran a test last night when I was asleep
[10:35:40] Where they set $wgCacheEpoch 6 (or 3?) hours in the past
[10:35:51] And they saw the same increase in pcache miss rates, but only a small increase in CPU usage
[10:37:34] so, that presumably means that the cpu spikes were caused by something entirely different?
[10:37:42] Which leads to the troubling conclusion that the pcache miss rates and the CPU usage were probably not correlated much, and that the cluster meltdown was caused by something else
[10:37:47] Yeah
[10:38:17] maybe we also hit some kind of Michael Jackson effect?
[10:38:19] So the plan I favor is to hack up a quick "system" (i.e. a hack) that'll allow us to deploy 1.17 selectively
[10:38:29] Cache stampeding? That's possible
[10:38:35] But it couldn't be due to the cache epoch issue
[10:38:40] I have been thinking about it, but I would expect it to have fixed itself in that case
[10:39:09] Because 1) the cache epoch issue was repeated and led to a small CPU increase and 2) Michael Jackson-like pages typically don't have pcache entries older than 6h
[10:39:22] maybe if the cache epoch did something like expiring the caches for all main pages...
[10:39:38] We repeated the cache epoch bump
[10:39:57] What we didn't repeat was a cache epoch bump in conjunction with a scap, Tim used sync-file
[10:39:59] it could have hit a different page set
[10:40:02] scap purges a few other caches
[10:40:32] Either way, what I'd like to do is hack up a system so we can deploy 1.17 selectively, and take a guinea pig wiki to see how it behaves
[10:40:36] well, good luck
[10:40:39] bbl
[10:40:40] But we'd have to hack that up first
[10:40:59] Use mediawikiwiki as a trial :P
[10:41:03] Wouldn't take much time but the only person enthusiastic about it (me) has a school project deadline to meet today
[10:41:09] Reedy: Rather not, CR is on there
[10:41:16] Boo
[10:41:27] it doesn't have enough traffic to be significant
[10:41:31] That too
[10:41:39] officewiki is even quieter and has phone numbers, so that one's out too
[10:41:49] I've been thinking we could take a closed wiki
[10:42:04] Although such a wiki won't have much traffic either
[10:42:07] Maybe incubator, or meta, or something
[10:42:28] meta may be a good option
[10:42:45] It does host CentralNotice
[10:42:54] as long as it doesn't break completely...
[10:43:07] it is also the point from which stewards act
[10:43:09] the first wiki we move to 1.17 should be a test wiki
[10:43:14] CN won't be changed
[10:43:15] Yeah, it should
[10:43:23] not a wiki we use
[10:43:32] But at some point we'll have to move a wiki that gets actual traffic
[10:43:44] it takes about a minute to make one
[10:43:58] Yes, and root access
[10:44:05] So could you make me one before you go to bed tonight?
[10:44:11] ok
[10:44:15] I'll start poking tonight
[10:44:15] Schoolwork first
[10:44:37] what's it about?
[10:44:44] Quadrics
[10:44:52] And Mathematica
[10:48:25] I'll do it when I have two hands free
[10:58:05] Take your time
[11:15:02] do we have things (e.g. category sorts) in 1.17 lazily updating on access?
[11:15:09] that could be a source of cpu spikes
[11:15:26] there's no category sort feature in 1.17wmf1
[11:16:07] I think there were some category changes
[11:16:32] maybe not sorts
[11:18:13] The category collation changes were patched out of wmf1
[11:18:32] Either way, we should be able to tell from profiling data
[11:19:20] TimStarling: How is the profiling of parse() entry points (e.g. 'Parser::parse-OutputPage::getWikiText') done currently? With profile points in the caller or in the callee?
[11:19:53] If the former, maybe we should move to the latter, so we can track down any other sources of excessive parse() calls
[11:20:17] Parser::parse-OutputPage::getWikiText is done in Parser::parse
[11:20:46] $fname = __METHOD__.'-' . wfGetCaller();
[11:20:48] I see
[11:20:51] That's very useful
[11:34:13] what should this test wiki be called?
[11:44:13] it can be test2.wikipedia.org
[11:44:32] suppose it doesn't really matter what it's called
[11:44:44] as long as it isn't mistaken for a normal wiki
[11:46:10] *vvv tried to import a dump with FAs on his local wikis
[11:46:27] Resulted in a PHP fatal error concerning the recursion limit
[11:51:51] Blame werdna?
[11:54:35] TimStarling: oneseventeen.wikipedia.org :D
[11:54:48] (You know, no leading digits)
[11:54:57] Seriously though, test2wiki will be fine
[11:55:09] RoanKattouw, it's up already ;)
[11:55:19] So it is
[11:55:27] Haha. I got user id 1
[11:55:42] Oh, SUL?
[11:56:03] Ya
[11:56:11] Well, i suppose it needs to be as much of a replica as possible
[11:56:41] That's fine
[11:56:44] It's running 1.16 for now
[11:56:52] So I can import the contents of test.wp later
[11:58:26] test2wiki? nothing came through the newprojects list
[11:58:46] liangent, it's just test2
[12:00:31] liangent, what is the newprojects list?
[12:08:07] Hmm, that's weird
[12:08:40] Maybe Tim-away didn't use addwiki.php but something else
[12:08:46] Or he didn't use it in the documented way
[12:08:55] Why?
[12:08:58] (Or addwiki.php is broken :) )
[12:10:30] Waihorace, https://lists.wikimedia.org/mailman/listinfo/newprojects
[12:18:13] test2 is running 1.16wmf4?
[12:24:25] Waihorace: yes, see http://test2.wikipedia.org/wiki/Special:Version
[12:24:46] lol
[12:24:52] was test2 deleted?
[12:24:54] But why? Isn't it used to test 1.17?
[12:24:59] no
[12:25:06] It will be moved to 1.17 later
[12:25:09] It is totally inaccessible for me
[12:25:17] probably a dns issue
[12:25:23] true
[12:25:33] but I am the first user to edit it ;)
[12:25:40] not all user creations are listed at http://test2.wikipedia.org/wiki/Special:Log
[12:25:42] i saw that
[12:25:47] ah
[12:25:49] well, after 127.0.0.1
[12:25:50] got it back
[12:26:18] I am following the channel for the feeds and found Krinkle to be the next user
[12:27:14] 11:59, 9 February 2011 Reedy (Talk | contribs) Account created automatically ???
[12:27:52] Indeed
[12:27:53] the creations of user Krinkle and Platonides aren't listed there
[12:28:04] compare with http://test2.wikipedia.org/wiki/Special:ListUsers
[12:28:53] Huh, I have user ID 13. I'm lucky
[12:29:15] Maybe
[12:29:23] you are listed 12th there
[12:29:39] Krinkle is before
[12:29:52] created at the same time as Akkakk
[12:29:59] lol
[12:30:13] I am 7
[12:30:22] the first wiki for me to be in the single digits
[12:30:37] I am 10.
[12:30:48] and it is the 800th wiki for my global account
[12:32:37] It is my 139th wiki...
[12:32:47] oh lol
[12:33:00] I thought my 800th wiki would be eo.wikisource
[12:33:24] i've got 805 wikis tied in
[12:33:41] lol
[12:33:48] 5 wikis must be private ones?
[12:34:04] or now closed
[12:34:24] I have global accounts in closed wikis
[12:34:34] I wonder where the 5 comes from?
[12:35:04] i just logged into everything on sitematrix
[12:35:09] "Please do not start editing this new site" <- Who cares
[12:35:22] it's just a generic message
[12:35:28] Should this sentence be removed?
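The per-caller profiling idiom Tim quoted earlier ($fname = __METHOD__.'-' . wfGetCaller();), shown in context. A sketch of the pattern inside Parser::parse(), not the actual source:

    <?php
    // Profile parse() under a name that embeds the caller, so CPU spent
    // parsing is split out per entry point in the profiler output, e.g.
    // "Parser::parse-OutputPage::getWikiText".
    public function parse( $text, Title $title, ParserOptions $options ) {
        $fname = __METHOD__ . '-' . wfGetCaller();
        wfProfileIn( $fname );
        // ... the actual parsing work ...
        wfProfileOut( $fname );
    }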
[12:35:28] no
[12:35:34] even if it is a new language
[12:35:38] I don't really care
[12:35:41] :P
[12:40:30] Well, we surely are not going to import test2 from beta.wv
[12:42:55] What's the current preferred way of importing/exporting dumps? mwdumper?
[12:44:22] yes
[12:44:36] <^demon> mwdumper isn't really well maintained ;-)
[12:44:56] <^demon> I would think the maintenance scripts dumpBackup and importDump are the canonical ways.
[12:45:36] stupid browser
[12:45:54] constantly giving me a "Not found" error for test2.wikipedia.org
[12:53:02] And what is used by Wikimedia?
[12:54:01] when creating new projects?
[12:54:15] When dumping and importing dumps?
[15:28:52] morning
[15:28:58] time to see what's up with test2
[15:29:31] It's still 1.16 ;)
[15:30:02] did anyone import some content?
[15:30:22] don't think so
[15:30:29] Roan was going to later I think
[15:30:41] Tonight
[15:31:12] there we go, I've got some boring userid I guess
[15:31:50] which content?
[15:31:54] as long as I'm around...
[15:32:40] *apergos wanders around the en project dumps looking for a likely suspect
[15:35:42] apergos, don't you fancy importing an el one to confuse us all? ;)
[15:35:56] meh
[15:36:05] it would be fine for me and not helpful to you
[15:36:50] so we could import simple.wiki and get all the back revisions...
[15:37:15] that would give us full-fledged articles, refs, etc
[15:37:32] and be much smaller than enwiki
[15:37:44] alternatively we could export all the articles (with templates etc) in some specific category of en wiki
[15:38:21] lemme see what typical articles look like on simple
[15:38:52] eh not bad
[15:39:07] they are chock full of cites, have some pics, all the standard stuff
[15:40:25] a lot are shorter
[15:40:51] lemme see if we can find a good category on en wiki that would do us.
[15:41:40] Featured articles?
[15:41:50] won't that be a bit small?
[15:42:42] 3710 pages
[15:44:44] I suppose if we got those and all their supporting templates....
[15:45:10] I wonder what test.wp has in it
[15:46:18] nice. I chose a random page and got the
[15:46:25] Infinite loop 0 page
[15:49:12] hmm he doesn't seem to have it set up to go to a particular server
[15:57:47] So?
[15:58:10] It's not like that wiki gets a lot of traffic, and we can shut it down trivially if we want
[15:58:25] if($wgDBname=='test2wiki') die('disabled');
[15:58:47] sure
[15:59:18] just thinking about imports or exports
[17:25:47] Alright, dinner. After dinner and some TV I'll start poking at test2.wp.o
[17:26:44] Greetings, Programs!
[18:25:02] I hadn't been running the stats thing, but I guess I should have been: http://www.mediawiki.org/wiki/MediaWiki_roadmap/1.17/Revision_report
[18:38:18] robla: wow, that's a lot of new revisions
[18:54:35] ooh
[18:54:47] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/78179 doesn't actually need backporting since it was reverted in 1.17
[20:49:46] alolita: I noticed you put the daily "Tomasz, Arthur @FOSDEM, GNUNIFY" events in the calendar. Did you know GCal can do events spanning multiple days so you don't have to do the repetition thingy?
[20:49:46] <^demon> robla: You can ignore my e-mail from earlier, I will be around at 4:15.
[20:53:33] ossm
[20:57:54] alolita: Fixed it by replacing the daily repeating events with two multi-day ones
[21:10:23] What's on test2 is missing 2 revisions...
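Returning to the import/export question above: the maintenance-script route is driven from the command line along these lines (a sketch; the --wiki flag assumes per-wiki selection is handled by the wiki-farm setup, the file names are made up, and a rebuild pass afterwards is usually wanted):

    # Export the full history of the source wiki to XML...
    php maintenance/dumpBackup.php --wiki=testwiki --full > testwiki.xml
    # ...replay it into the target wiki...
    php maintenance/importDump.php --wiki=test2wiki testwiki.xml
    # ...then rebuild derived tables such as recent changes.
    php maintenance/rebuildrecentchanges.php --wiki=test2wiki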
[21:10:53] Oh
[21:10:53] http://eiximenis.wikimedia.org/1-17
[21:10:54] I'll svn up
[21:11:09] Heh, one being an undeclared global, and one being my fix ;)
[21:11:13] nice
[21:17:48] http://test2.wikipedia.org/wiki/Special:Version --> running 1.17 through rudimentary het deploy setup
[21:21:22] Krinkle, how did you create your test2 account?
[21:21:34] Platonides: I just logged in
[21:21:35] :P
[21:21:38] autocreate
[21:21:55] you logged in from Special:UserLogin ?
[21:22:07] yes
[21:22:13] I supposed so
[21:22:36] there seems to be a bug when doing that
[21:23:36] RoanKattouw: what's het deploy? :o
[21:23:46] Platonides: Hm.. I'm not in the log.
[21:23:53] http://test2.wikipedia.org/wiki/Special:Log/newusers
[21:23:54] Nikerabbit, heterogeneous deploy
[21:23:59] Krinkle, that's the bug
[21:24:04] TimStarling: Have you called in?
[21:24:07] neither am I
[21:24:13] yes
[21:24:16] heh, I thought het was an article in Dutch
[21:24:18] OK
[21:24:20] Platonides: Oh, okay. I thought I wasn't supposed to be able to create an account (that that was the bug)
[21:24:35] no, the bug is not creating the log entry
[21:24:42] RoanKattouw: When I first read it I thought it was some kind of Dutch system "Het deploy" ;-)
[21:25:10] Krinkle: Me too :)
[21:25:17] It's something robla coined
[21:25:20] TimStarling: Better
[21:25:32] http://simple.wikipedia.org/wiki/Special:RecentChanges
[21:25:34] I didn't even log in
[21:25:44] an account was just autocreated for me
[21:25:54] you were logged in on other wikis
[21:25:56] Platonides: But there are some other "auto create" entries, so why not ours? What did we do differently, and is it a regression or an older bug?
[21:26:08] they were logged in on wikipedia
[21:26:12] and it was autocreated on visit
[21:26:20] ah, via the cookie
[21:26:26] yes
[21:26:30] Indeed, I wasn't logged into centralauth yet at the time.
[21:27:33] The logo on test2.wikipedia reminds me of https://bugzilla.wikimedia.org/show_bug.cgi?id=24278
[21:34:16] CT is saying that we should deploy a few wikis and then wait 24 hours and then announce whether we're going to continue
[21:36:12] RoanKattouw, at what time did you update to 1.17 ?
[21:36:29] Platonides: See SAL
[21:36:37] what's SAL?
[21:36:40] 21:10 UTC?
[21:37:38] (server admin log)
[21:37:40] [[Server admin log]] on wikitech
[21:38:58] http://bit.ly/wikisal
[21:39:04] we need the custom logo on test2, it is true
[21:39:33] No, we don't need the custom logo on test2, we need to update the fallback (which is the issue here: it's not set to the old logo, it's simply not set, and the default is still the old logo)
[21:40:49] RoanKattouw, but at which point did it begin using 1.17 code?
[21:41:01] was it between 20:06 and 20:22 UTC?
[21:41:02] Did what begin to use 1.17 code?
[21:41:05] test2?
[21:41:09] yes
[21:43:09] Platonides: logmsgbot> !log catrope synchronized php-1.17/LocalSettings.php 'Update path here too'
[21:43:12] Thereabouts
[21:43:55] right
[21:44:10] RoanKattouw, you haven't svn up'd :P
[21:44:13] so it didn't log all account creations in 1.16
[21:44:19] and is logging none in 1.17 :(
[21:45:07] Reedy: Yes I have
[21:45:19] I did svn up php-1.17 and synced out the files
[21:45:21] MediaWiki 1.17wmf1 (r81761)
[21:45:29] Lies
[21:45:35] That only updates on s-c-a or scap
[21:45:40] oh
[21:45:42] right :P
[21:45:43] Not usually for sync-common
[21:47:35] are the live hacks in the 1.17wmf1 branch?
[21:48:41] I think so
[21:48:55] Yes, Parser.php and Preprocessor_DOM.php
[21:49:01] I see some revisions missing there :P
[21:50:12] http://mediawiki.pastebin.com/P5DzDMcB
[21:50:13] r81778, r81765, ( r81742 for LiquidThreads_alpha )
[21:50:21] Were they merged to 1.17wmf1?
[21:50:28] If not, tag them with that
[21:50:49] Revision: 81849
[21:50:54] Is what I have here
[21:51:02] I can run s-c-a to be sure it's up-to-date
[21:51:59] tagged
[21:52:17] After this meeting
[21:55:19] I would throw an exception if $this->parser->mOptions was null there
[21:55:50] I was just trying to shut up a fatal
[21:56:23] yep, but whatever calls it seems broken, too
[21:58:18] that is gonna suck
[21:58:21] 4 am again
[21:58:32] I'm still not time-adjusted or I wouldn't care
[21:58:51] today was slightly better, woke up at 7:30 am ... but only cause I have been seriously sleep-deprived the previous nights
[21:59:18] *apergos will do their very best to wake up after 10 am tomorrow :-P
[21:59:40] robla: Also note that, after our cute little 6-hour window on Friday, I will want to catch a train some time between 1300 and 1400 UTC
[22:06:20] Platonides: Hmm, right, I need to apply r81742 to LQT_alpha as well
[22:06:23] robla: when might a post go up about this? I already have people asking on a mailing list about it in relation to something else, I'd love to have a link I can point them to
[22:06:28] (sorry)
[22:07:26] in a meeting, but hopefully by 3pm
[22:07:41] ok, I'll just tell them to check the tech blog in a few hours. thanks
[22:14:30] robla: We should talk about victim/guinea pig wikis after you get out of that meeting
[22:14:48] Platonides: Merged and deployed the revs you asked about
[22:14:55] alright...ready for blog post, then guinea pig discussion
[22:15:01] simples for sure. at some point when we are more confident, frwiki: it has not got flagged revs and it is a decent size, so it could be a "later on" choice
[22:15:08] (and that's all I have to say about the subject)
[22:15:52] is there anything else we want for test2 in the meantime? or any other prep before tomorrow?
[22:15:52] apergos: simples plural? I thought we only had simple English
[22:16:01] simple wiki
[22:16:05] then there are others:
[22:16:10] well, I guess I need the guinea pig list for the blog post now, don't I?
[22:16:46] wiktionary
[22:16:49] species (ugh)
[22:17:13] ah there used to be wikibooks but it's closed
[22:17:41] same with wikiquote. hmph
[22:17:43] "guinea pigs" will now be referred to as "first wave"
[22:17:48] heh
[22:17:53] too late, this is a public channel :-D
[22:19:14] OK :)
[22:19:21] enwikisource has an edit every minute or two, it might be a decent choice also
[22:19:44] hmm, guess I had more to say about the topic :-D
[22:19:50] What about usability+strategy?
[22:20:24] http://usability.wikimedia.org/w/index.php?title=Special:RecentChanges&days=30
[22:20:26] thanks, Roan
[22:20:29] very minimal editing over there
[22:20:33] apergos: RoanKattouw: let's actually use the blog drafting page as the place to come up with the list of wikis: http://eiximenis.wikimedia.org/ReleaseBlogPost2011-02-09
[22:20:35] so prolly not much traffic either
[22:20:40] ok
[22:20:42] *click*
[22:21:32] http://strategy.wikimedia.org/w/index.php?title=Special:RecentChanges&days=30 not much better
[22:23:42] robla?
[22:23:50] Platonides?
[22:24:07] oh, I didn't notice eiximenis had a chat, too :P
[22:24:15] was wondering what was "over there"
[22:33:35] it's the secret cabal.
[22:35:31] <^demon> We should move eiximenis to cabal [22:40:10] no but I do think we need a cabal.wikimedia.org [22:40:13] *apergos checks... [22:40:33] NXDOMAIN ok, just making sure... [22:41:25] btw robla, you need to add me to the secret channel [22:42:27] Platonides: YOU CAN'T PROVE THERE'S A SECRET CHANNEL [22:42:31] :) [22:42:35] robla doesn't add people [22:42:35] there's a secret channel? [22:42:47] we don't give the keys to the kingdom to just anyone :-P [22:43:07] Ryan_Lane: shhhhhhhhhh! [22:43:13] shhh about what? [22:43:14] robla, I didn't want to say the name... :P [22:43:17] *Ryan_Lane is confused [22:43:48] aren't you using a channel ending in "rity"? [22:43:54] heh [22:43:58] <^demon> Yeah, it's grrrrity [22:44:35] mediawiki-celebrity [22:44:40] it's a secrity [22:46:07] apergos, YES! [22:46:10] cabal wiki ftw [22:46:43] we already know what the logo will look like too ;-) [22:49:14] You mean #mediawiki_cabal_omgitsasecret ? [22:49:31] shh... there's no cabal [22:56:05] and robla, you skived off my request :P [22:56:33] TimStarling: Do we need to increase APC cache size for het deploy? [22:57:20] Platonides: you may be asking the wrong "Rob". I honestly don't control the access lists for any private channels [22:58:11] s/cabal.wikimedia/internal.wikimedia [22:58:30] <^demon> internal is boring :p [23:00:05] very. [23:00:18] so we discuss vendor-related crap in there, is the thing. who can have access to that? [23:01:26] I thought someone told me it was you [23:01:30] perhaps it was RobH [23:01:49] RobH has the keys. [23:01:58] well he has half the keys [23:01:58] any op in that room does, but i know how to flip the bits [23:02:02] but i am not sure who is allowed in [23:02:03] who has the other friggin half? [23:02:18] apergos: if we told you that, we'd have to kill you [23:02:24] we are discussing, and I am probably gonna not do a thing until I have mark or someone who is more in charge than i am decide on who is and isn't allowed [23:02:46] cuz i am not supposed to be the person deciding this stuff. [23:02:56] nor am i comfortable doing so without some clear guidelines. [23:03:16] robla: heh, hr tried to assign my vacation days to you earlier ;] [23:03:30] so thx for coming along, it takes a lot of stuff from me that i don't want assigned to me ;] [23:03:33] and you stopped them :-/ [23:03:36] RoanKattouw: I will have to check that [23:03:38] yes, i stopped them [23:03:40] =] [23:03:51] well, they asked me cuz they realized there was more than one [23:04:17] so I had the opposite thing: they kept forgetting to remove my vacation days [23:04:30] I told em I had used em and they were sure I hadn't [23:04:53] apergos is keepin it real, even if it means she loses vacation time [23:05:06] 'when keeping it real goes wrong' [23:05:09] yeah [23:05:11] *sigh* [23:06:00] robla you should read the changes I made to the blog post notes under second wave [23:06:05] (maybe you did already, I dunno) [23:06:32] apergos: yeah, I saw that, thanks [23:06:41] I think I'll merge it somehow with what I wrote above [23:07:18] great [23:07:35] I just wanted the idea to get in or get explicitly rejected [23:08:29] my new laptop battery just arrived from china [23:09:13] robla: I don't think I'm gonna be around for part of my shirt tomorrow but I guess that's not a big deal since we're not live yet?
[23:09:15] I need the sleep [23:09:16] laptop batteries from lenovo australia are incredibly expensive, much more than they are on the US website [23:09:29] so I got a cheap copy, cost like $35 [23:09:47] here's hoping it doesn't catch fire or explode [23:09:49] RoanKattouw: correct....I still need to figure out the shifts anyway [23:09:55] I've had 5 nights in a row with 7 hours or less, I don't want to make that 6 [23:10:04] (assuming RoanKattouw is talking about "shifts" rather than shirts) [23:10:14] Yes, sorry :D [23:10:19] btw RoanKattouw, did your assignment go well? [23:10:25] Yeah, I got it mostly done [23:10:33] Handed it in mostly but not completely done [23:10:36] Hopefully I'll get away with that [23:10:39] :S [23:10:46] good luck [23:13:31] robla: what I was getting at was actually the opposite of what you put: [23:13:45] it may well be that we switch over all the things listed in first wave [23:13:55] apergos: that's not what we agreed to [23:13:58] and cpu usage doesn't have a noticeable bump [23:14:15] if it doesn't then we are stuck [23:14:24] we're not switching over everything in the first wave, no matter what [23:14:32] I agree, we aren't [23:14:58] but if we need to list everything explicitly, we had better list a couple of large things as possibilities later in wave 1 [23:15:11] sure, let's do it that way [23:15:15] cool [23:15:49] eowikipedia? [23:16:31] eo projects already added [23:16:41] it's size by traffic mostly, right? [23:17:04] Yes, traffic [23:17:09] Edits matter too [23:17:18] i.e. we want moderately active, then reasonably active, and if that still doesn't get us there, then pretty darn active [23:17:23] But mostly traffic [23:17:41] For 'pretty darn active' we can just take a top10 wiki :D [23:17:43] yes, I've been using edits as a likely indicator of reads as well (could be entirely wrong but eh) [23:17:43] Dutch? [23:17:51] Yes, that'll work [23:18:01] Half the people closely involved in the deployment speak Dutch [23:18:04] heh [23:18:08] worksforme [23:18:12] Almost [23:18:24] <^demon> I speak simple. [23:19:00] we will need to double the APC cache size [23:19:00] I was saying, if we need to get serious and grab a top10 wiki, we could use nlwiki [23:19:06] simplewiki probably isn't large enough [23:19:15] Any reason not to double it now? [23:19:32] (For a value of 'now' < 'when we start deploying things') [23:19:38] it looks like they have 50MB, so say 120MB [23:19:53] they're all using more than half of their 50MB [23:20:46] Hmm, het deploy doesn't work for secure [23:20:47] let's make sure our list is enough to do the job, folks. otherwise this exercise may not get us the info we need [23:22:52] Can we figure out the % of requests each wiki accounts for? [23:23:53] hmm [23:24:04] of reads I suppose you mean [23:25:49] Yes [23:26:54] we would be concerned about backend hits [23:26:55] but deployment could change the pattern :s [23:28:04] http://stats.wikimedia.org/EN/TablesPageViewsMonthly.htm [23:28:24] this is not backend only [23:28:30] it excludes mobile [23:28:40] and it's going to be hard to make good estimates but it gives us a notion [23:29:18] the top ten are 92% of all hits [23:29:26] so the rest are going to be negligible [23:30:11] OK so we'd need a top10 wiki [23:30:16] What's nlwiki's percentage? [23:30:37] oh and these are pedia only. [23:31:23] wonder if I have anything more detailed in my email [23:32:20] nope. I don't [23:33:44] nlwiki is 10th [23:35:02] 1.17% of pageviews [23:35:54] Is that supposed to be symbolic?
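On the APC sizing discussed above: the proposed change is just a bump of APC's shared-memory segment in the PHP configuration on the apaches, roughly like this (a sketch; the file path is a typical Debian/Ubuntu location, not necessarily what the cluster uses):

    ; /etc/php5/conf.d/apc.ini (location varies by distro)
    ; Keeping two MediaWiki branches resident at once roughly doubles the
    ; bytecode cache footprint, so ~50MB becomes ~120MB with some headroom.
    apc.shm_size = 120M   ; older APC releases want a bare number of MB, i.e. "120"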
[23:36:12] eswiki 4th 5.68% [23:36:12] the first is obviously enwiki at 52.59% [23:36:15] haha [23:36:23] Wow, so even nlwiki is only a percent [23:36:48] second is ja with 8.19% [23:36:58] <^demon> Ha, I never knew wikidiff2 was in the Ubuntu repos. [23:37:21] the official ones or wmf? [23:37:33] <^demon> official ones [23:37:38] they had to rebuild it here a couple of days ago [23:40:02] So the best we can do for a small single-wiki deployment is 8% :( [23:40:12] robla: ----^^ [23:41:45] *robla ponders [23:42:53] I got het deploy to work on secure too [23:43:13] And immediately caught a mistake I'd made in configuring it to use ResourceLoader over https [23:44:53] so...regardless of percentages.... [23:45:15] we still get plenty of traffic and edits [23:45:31] ...and it should be enough profiling information to help us finish the job [23:45:39] Profiling, for sure [23:45:50] If we want to see actual impact on CPU usage I'm not sure that 1% is gonna cut it [23:46:18] If you do a few of the small wikis, you will get 15%+.. [23:46:25] see the thing is we can't figure out from existing profiling information what's going on (looking at testwiki when it was running 1.17) [23:46:40] Reedy: Well apparently the 10 largest wikis are 92% put together ... [23:46:43] (and comparing that to pre 1.17) [23:46:55] That was 1.17 with the cache epoch issue [23:46:59] RoanKattouw, and ~40% for 2-10 [23:47:10] Reedy: Right, so multiple top10 wikis [23:47:23] nlwiki is convenient but only 1.17% [23:47:39] <^demon> We could deploy the 820 smaller wikis for 8% ;-) [23:47:53] given that this is a Friday deployment, we shouldn't get too ambitious anyway [23:48:16] How many to get 25% of page views? [23:48:25] one if we choose the right one? [23:48:36] well, it's either twice over, or multiple to get 25% [23:48:41] 1% may be enough to do profiling on though [23:48:59] it gives you a reasonable picture of how caches fill [23:49:24] probably not one, actually. grrr [23:49:30] Yeah, for profiling it'll be fine [23:50:05] we can extrapolate to overall CPU usage [23:50:51] yup, we don't need absolute certainty from the Friday deploy; just more data than what we have today [23:50:59] If only 1% is running the new code, any CPU usage impact would be reduced by a factor of 100 [23:51:16] Making extrapolation hard [23:51:23] RoanKattouw: if it *is* noticeable, then that tells us something, doesn't it? [23:51:27] Yes [23:51:30] where did you get the figures for ja from? .. ah, he's gone [23:51:33] drat [23:51:41] If it's noticeable, we're probably in trouble [23:51:41] http://stats.wikimedia.org/EN/TablesPageViewsMonthly.htm [23:52:56] ????, ?? ?????? ?????????????? ?????????? ?????? ???????????????????? ???? ?????? ???????? ??????????
[23:52:59] grrrrrrr [23:53:14] someday someone will solve the keyboard layout issue in a way that doesn't suck [23:53:19] as much [23:53:29] we can extrapolate from profiling data to overall CPU usage [23:53:36] Right [23:53:40] just multiply by 100 [23:53:47] if the nlwiki deployment isn't noticeable, then we'll have to extrapolate from the profiling info, but we'll at least know we're not in for catastrophic failure on deploy [23:53:51] yes but boy the margin of error is going to get us [23:53:57] you wouldn't be able to pick up an nlwiki signal in the noise on ganglia [23:54:03] that's why you need profiling [23:54:14] Yea [23:54:17] or else split the apache cluster and send 1.17 wikis to a different set of apaches [23:55:33] That could work, although splitting the cluster 99%-1% would be a bit awkward I guess [23:56:00] my vote would be to rely on the profile data [23:56:02] But it would give us more information for sure [23:56:23] remember, we were planning earlier this week to deploy to the whole cluster without any of this data :) [23:56:39] right [23:56:57] but now what we know is, it's broke. and we want to watch it be broken so we can find and stab the problem... [23:57:19] sure....and during the second window (on Feb 15), we can still do a gradual rollout [23:57:36] Yeah [23:58:23] that's when we can say "alright....let's turn on dewiki and see what this thing can do" [23:58:58] mmm... I kind of thought on Tuesday we were supposed to be sure the site wouldn't go belly up, if we were gonna deploy [23:58:59] ...and then of course only swallow the enwiki elephant when we're good and ready [23:59:12] Yeah, dewiki is the 3rd largest wiki, and some of us speak German [23:59:17] *apergos sells tickets to people to watch robla swallow the en wiki elephant [23:59:20] Plus most Germans speak decent English anyway, it'll be fine
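To make the "just multiply by 100" extrapolation above concrete, here is the back-of-envelope arithmetic in code form (hypothetical numbers throughout, except nlwiki's ~1.17% share, which comes from the stats page cited earlier):

    <?php
    // Hypothetical extrapolation from one wiki's profiling data to the cluster.
    $nlwikiShare      = 0.0117; // nlwiki's share of page views (stats.wikimedia.org)
    $measuredCpuDelta = 0.4;    // hypothetical: extra CPU-seconds per second that
                                // the profiler attributes to nlwiki on 1.17
    // If per-request cost scales with traffic share, the cluster-wide impact
    // of deploying everywhere is roughly the measured delta divided by the share:
    $clusterDelta = $measuredCpuDelta / $nlwikiShare; // ~34 CPU-seconds per second
    echo "Estimated cluster-wide CPU delta: $clusterDelta s/s\n";

The caveat raised above still applies: the smaller the deployed share, the more the measurement noise is magnified by that division, which is why a 1% wiki is fine for profiling but poor for reading a CPU trend off ganglia.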