[00:03:54] what's typical ttl on these pcache entries again?
[00:07:42] wmf has them set to a week
[00:08:03] yes but what's typical ttl as opposed to the value in the config file?
[00:08:23] er.. no 14 days
[00:08:47] the default is just one day
[00:08:57] um
[00:09:00] I don't mean the default
[00:09:23] I mean the actual "memcached has only X much space and it starts to toss objects cause new ones go in, after about X time"
[00:09:27] well Y time.
[00:09:30] but anyways...
[00:09:36] looks like about a day
[00:09:48] ok
[00:09:55] how were you able to check that?
[00:11:18] I have a spreadsheet that I made a while ago
[00:11:30] nice
[00:11:32] just seeing if I made screenshots from it
[00:12:23] I don't think so
[00:12:31] ah too bad
[00:13:10] anyway the general strategy is to log into a random memcached server with telnet, and type "stats items"
[00:14:07] IIRC, "age" is the relevant statistic
[00:14:20] it shows how old the last item to be evicted was, in seconds
[00:14:46] *apergos tries it
[00:14:47] e.g. on srv271 you have
[00:15:09] STAT items:65:number 12876
[00:15:09] STAT items:65:age 42045
[00:15:31] so this slab has a TTL of 12 hours
[00:15:44] but
[00:16:16] STAT items:49:number 4628
[00:16:16] STAT items:49:age 121909
[00:16:19] this one is more than a day
[00:16:26] yeah, slabs aren't perfect
[00:17:08] "stats slabs" tells you how big each slab is, in the chunk_size field
[00:17:40] STAT 65:chunk_size 977
[00:17:47] so that one is 977 bytes
[00:19:46] ok, had a look at the command list
[00:19:47] thanks
[00:22:09] surprisingly, our memcached needs didn't go up that much :)
[00:26:23] huh
[00:27:38] STAT items:11:number 1
[00:27:38] STAT items:11:age 4527393
[00:27:38] STAT items:11:evicted 0
[00:27:52] I should never look too closely at this crap, I'm bound to find something irritating
[00:28:01] > 52 days?
[00:31:33] number=1
[00:31:45] yes, I see that
[00:32:01] so it's been a long time since anything was evicted or added
[00:32:17] (it's also tiny)
[00:32:22] probably never
[00:32:28] 52 days is probably the uptime
[00:32:54] bing
[00:34:59] STAT items:17:number 10266
[00:34:59] STAT items:17:age 2928522
[00:34:59] STAT items:17:evicted 3
[00:34:59] STAT items:17:evicted_time 1147658
[00:35:04] who can tell. bleah
[00:35:21] kinda unrelated and not urgent, installer is still broken on 1.17wmf
[00:35:37] ah, hello pdhanda
[00:35:47] what's the state of prototype atm?
[00:37:10] should be ok to use, i needed to fix some db issues on de
[00:37:24] it's running 1.which now? :-D
[00:37:40] 1.17wmf1 on release-en
[00:37:58] with or without tim's patch and platonides' patch?
[00:37:58] release-* actually
[00:38:02] without
[00:38:05] ok
[00:39:26] ok who stayed at the wikisuites and gave their phone number to a collection agency? fess up now
[00:44:07] must be Ryan_lane .. ;-)
[00:44:25] a collection agency?
[00:44:31] that would imply I have debt
[00:44:34] debt collection
[00:44:38] heehee
[00:44:52] just got the automated phone call to this apartment
[00:44:59] of course they don't say who they are looking for
[00:45:00] I think I may have $50 on a credit card right now for this month?
[00:45:13] I owe like $10 on a PG&E bill
[00:45:15] that's about it
[00:46:13] damn
[00:46:40] I see wfIncrStats is working now
[00:49:56] how about we gather some statistics about parser cache object age using wfIncrStats?
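The telnet-based eviction-age check described above can also be scripted. A minimal sketch in PHP, assuming a memcached instance reachable on the standard port 11211 (the host name is just the example from the log; this is not the spreadsheet tooling actually used):

    <?php
    // Ask a memcached server for per-slab LRU statistics. The "age" field
    // is the age in seconds of the last item evicted from that slab, which
    // approximates the slab's effective TTL under memory pressure.
    $fp = fsockopen( 'srv271', 11211, $errno, $errstr, 3 );
    if ( !$fp ) {
        die( "Connection failed: $errstr\n" );
    }
    fwrite( $fp, "stats items\r\n" );
    while ( ( $line = fgets( $fp ) ) !== false ) {
        $line = trim( $line );
        if ( $line === 'END' ) {
            break;
        }
        // Lines look like: STAT items:65:age 42045
        if ( preg_match( '/^STAT items:(\d+):age (\d+)$/', $line, $m ) ) {
            printf( "slab %d: effective TTL ~%.1f hours\n", $m[1], $m[2] / 3600 );
        }
    }
    fclose( $fp );

Cross-referencing with "stats slabs" (the chunk_size field, as below) then tells you which object sizes those lifetimes apply to.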
[00:50:16] that would give us a more accurate picture of the effect of a cache clear
[00:50:23] sure
[00:50:40] and the effect of $wgInvalidateCacheOnLocalSettingsChange
[00:51:05] I'm just going to do one more thing looking at the memcached stats for a sec
[01:00:43] clear-profile doesn't clear the stats
[01:00:50] that could be annoying
[01:00:52] mm
[01:01:13] (done looking at the memcached stats output, not getting a lot from there)
[01:01:45] http://noc.wikimedia.org/cgi-bin/report.py?db=stats/enwiki&sort=name&limit=5000
[01:01:52] the ones with a leading zero are accurate
[01:03:00] I restarted the daemon
[01:03:19] ok
[01:03:52] ok I'm wrong, clear-profile does work after all
[01:04:03] no worries
[01:04:32] anyway that shows you the age in hours of each ParserOutput object on a parser cache hit
[01:04:54] rounded to the nearest hour up to 100, then rounded to the nearest 10 hours
[01:05:17] right, I see
[01:10:59] so what gets me is that immediately upon reversion, i.e. within 5 minutes, we are good to go
[01:17:58] so within 6 hours we have half of the cache back... if it was a one-time event like the wgInvalidateCacheOnLocalSettingsChange
[01:19:04] sure seems like the second attempt should have done better, if that was all that was at fault
[01:19:44] but it does indicate that setting $wgCacheEpoch expires more of the cache than I expected
[01:19:46] do we not find objects because somehow we are concocting a bad key?
[01:19:58] how much were you thinking?
[01:20:31] the memcached stats show a shorter age, because they are based on last access
[01:20:47] right
[01:20:49] whereas these ages are based on the time the cache entry was written
[01:21:56] you could have a heavily-accessed cache entry with an age of 100 hours, and memcached could have its last used time as 5 seconds
[01:22:08] such a cache entry would be expired along with everything else
[01:25:39] right
[01:27:10] the acid test (which no one would agree to do) would be to switch again now that the impact of wgInvalidateCacheOnLocalSettingsChange would be minimal
[01:27:55] or we could set $wgCacheEpoch on 1.16wmf4 and see what happens
[01:28:03] errr :-D
[01:28:33] it's hard to predict because you need to know the distribution of requests
[01:28:46] i.e. how many of them are for popular objects, how many for less popular objects
[01:28:57] the exponent in the power law
[01:29:24] and even when you have that, simulating it would be a significant task
[01:29:30] and we don't have an easy way to track that anyways
[01:30:17] we have cache hit rate
[01:32:29] any idea why this is in bugzilla? https://bugzilla.wikimedia.org/show_bug.cgi?id=25758
[01:32:47] if you set $wgCacheEpoch to some recent value, the hit rate will instantly go down to some smaller value
[01:33:07] we can calculate that instantaneous hit rate using the cache age stats
[01:33:41] then afterwards, the hit rate will rise over time and slowly go back up to what it was before you set $wgCacheEpoch
[01:33:46] call it the refill rate
[01:33:49] sure
[01:33:56] that's what we don't know, but the current theory is that it's slow
[01:34:14] because when we deployed for 40 minutes, the cache hit rate didn't rise enough to get us out of trouble
[01:34:19] reaaaally slow if it's going to have that sort of impact
[01:34:51] so sure we could actually back off $wgCacheEpoch by not too much and watch, not enough to do harm
[01:35:05] if it's implausibly slow then we have to think of another explanation for the downtime
[01:35:34] see, it's the two deployments that bother me
[01:35:49] that the second one didn't do any better, or hardly any better
[01:36:17] but anyways I'm for a controlled test, it would tell us something
[01:40:05] we wanna wait for people to be around? try this now? check with robla?
[01:40:19] I feel like the risks are low if we don't back it off much
[01:42:10] the second one should have done a bit better
[01:43:39] 3 hours after $wgCacheEpoch, the hit rate should have been 40%
[01:44:04] of normal, i.e. 0.4 * 50%
[01:44:24] 6 hours after $wgCacheEpoch, it should have been 50% of normal
[01:44:45] exactly
[01:45:08] refilling would have been minimal because it was only refilling when $wgCacheEpoch was recent
[01:45:28] when it was on 1.16wmf4, it would have been using those old objects and not regenerating them
[01:45:33] so no refill
[01:45:40] ok, makes sense
[01:46:13] but that's still much better than the numbers we had
[01:46:38] no, it's about right
[01:46:50] hit rate was 55% on 1.16wmf4
[01:47:09] no...
[01:47:16] 22% miss rate, so 78% hit rate
[01:47:28] so you're right, much better
[01:47:55] on 1.17 we had 55% hit rate
[01:49:14] with this model, you'd expect 78% * 40% after three hours, which is 32%
[01:49:43] 78% * 52% after 6 hours, which is 40%
[01:50:06] but on 1.17 we had 55% each time
[01:50:11] that could be because of refilling
[01:50:22] probably the most popular objects were instantly refilled
[01:51:35] do we need to do an experiment on the live site to confirm this? and if so what should it be?
[01:51:43] which means that really we could stare at the numbers endlessly but the best way to know is to try it.
[01:51:51] that's exactly what I'm trying to figure out
[01:52:22] what the least risky test is that we can do that will give us a very high degree of certainty that this is or is not it
[01:52:27] we could simulate a 6-hour expiry for a few minutes
[01:52:41] by very high I mean high enough that everyone else is reasonably convinced
[01:52:42] check cache hit rate, see if it's similar to what we saw during deployment
[01:54:26] if it's bad it should fall over right away.. that will be the thing
[01:54:48] we wanna do this without taking the site down
[01:55:56] we have a bit more headroom because of the time of the day
[01:56:05] that's true
[01:56:12] which means if we do it, we had better do it soon
[01:56:35] too bad we can't target just one server
[01:56:37] well...
[01:56:40] or could we?
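The "instantaneous hit rate" referred to above can be read straight off the age histogram: the fraction of today's hits that would survive an epoch set T hours in the past is the fraction whose logged age is under T. A sketch (the function name and histogram data are illustrative; real numbers come from the pcache_hit:age=NNN counters):

    <?php
    // Fraction of the normal hit rate that survives immediately after
    // setting $wgCacheEpoch to $epochHoursAgo hours in the past: hits on
    // entries younger than that survive; everything older is expired.
    function survivingFraction( array $ageHistogram, $epochHoursAgo ) {
        $surviving = 0;
        $total = 0;
        foreach ( $ageHistogram as $ageHours => $hits ) {
            $total += $hits;
            if ( $ageHours < $epochHoursAgo ) {
                $surviving += $hits;
            }
        }
        return $surviving / $total;
    }

    // Toy data, keyed by age bucket in hours:
    $hist = array( 1 => 20000, 2 => 12000, 3 => 8000, 6 => 10000, 24 => 50000 );
    // Surviving fraction for a 3-hour epoch bump with this toy data:
    printf( "%.0f%%\n", 100 * survivingFraction( $hist, 3 ) );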
[01:57:07] no, we wouldn't get any refill if we just did it on one server
[01:57:10] eh right
[01:57:14] phooey
[01:57:40] well let's think it through
[01:57:45] give me a few minutes to set up data collection
[01:57:46] if we do the run and crap falls over
[01:57:50] then we know this is the issue
[01:57:53] I want to have a lot of data
[01:57:57] in which case we know we can schedule deployment
[01:58:03] if it doesn't fall over, no harm done etc
[01:58:12] (ok)
[02:01:13] so the plan is to update $wgCacheEpoch to simulate a (partial?) cache clear and confirm behavior?
[02:01:23] yes. or fail to confirm
[02:01:51] if we confirm stuff could fall over but at least we'll know "that was the issue, it's fixed now, deployment is safe as far as pcache goes"
[02:02:05] if we fail to confirm then we won't have had much impact on the site, presumably
[02:02:47] still feeling a bit queasy about it
[02:02:59] but... short of actually testing this stuff, I dunno how we're going to be sure
[02:08:07] got any brain waves, brion?
[02:12:00] if we do this, and things start falling over, we can put the setting back, and it should recover fairly quickly, right?
[02:12:30] it should recover instantly; no new objects will be tossed, it will just go back to using the old objects
[02:12:42] sounds reasonable to me
[02:12:51] (i've been lurking this conversation for a while ;) )
[02:12:51] any objects that had been regenerated by the new setting would have been shoved back into the cache
[02:12:54] that's the theory...
[02:12:54] heh
[02:13:08] theory and practice.
[02:13:10] yeah.
[02:13:44] this is how come I'm only half an ops
[02:13:47] and if it doesn't work in practice, we are down for a couple hours?
[02:13:55] ops people are supposed to be really conservative about stuff
[02:13:58] *cough*
[02:14:28] well, testing like this is better than trying a deploy without knowing it's going to work
[02:14:28] well if we reset the setting,
[02:14:40] it won't toss anything else
[02:14:44] it's more conservative to do a test like this
[02:14:44] that's what controls that.
[02:14:49] I agree
[02:15:10] *nod* it's not unreasonable. would be nice to do on smaller scale though
[02:15:15] it sure would
[02:15:16] but..... scale's what's being tested :P
[02:15:20] heh
[02:15:21] but we don't have a smaller scale, yup
[02:15:27] I'm working on it ;)
[02:15:32] yay
[02:15:35] I know, that wasn't a dig at you
[02:15:45] I know. I was just throwing that out there
[02:15:45] heh
[02:15:46] although boy it will sure be nice when that's live
[02:15:57] hopefully next time we'll have a way to test things a little better
[02:15:57] getting a load on that cluster will be hard though
[02:16:06] cause we have to have steady use
[02:16:23] it also won't use the live databases, which makes things difficult
[02:16:25] worry about that later
[02:16:45] though we could spin up a production version for testing, then kill it afterward
[02:17:19] I assumed we would do something like that
[02:17:39] that's one of the ideas I had hoped to do with it :)
[02:17:47] guess we should announce in the other channel before we go
[02:18:05] *TimStarling is trying to figure out xmlstarlet
[02:18:12] don't know why I use this program
[02:18:27] I have no idea what that is
[02:27:14] ok, I'm ready
[02:27:26] oh boy
[02:28:06] gonna restart profiling right after, yeah?
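The test being prepared here amounts to a single configuration line; the 14-digit value can be derived with the standard MediaWiki timestamp helper rather than computed by hand (a sketch anticipating the value quoted just below, not the exact edit that was made):

    <?php
    // Expire every parser cache entry written more than six hours ago.
    // Nothing is deleted from memcached: entries older than the epoch are
    // merely treated as stale, so reverting this line restores them instantly.
    $wgCacheEpoch = wfTimestamp( TS_MW, time() - 6 * 3600 );
    // yields e.g. '20110208203425', the value used in the test below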
[02:28:56] yes
[02:30:03] I'll set $wgCacheEpoch = '20110208203425', which is 6 hours ago
[02:30:46] I'm looking at where I think it's set
[02:30:55] it's 2006 something right now?
[02:31:01] yes
[02:31:05] (which is meaningless, I know, but just surprising)
[02:31:05] ok
[02:31:09] don't edit the file, I'm about to save it
[02:31:12] not going to
[02:32:26] it's live
[02:33:50] load going up
[02:34:02] that's expected
[02:34:21] 31% hit rate?
[02:34:30] hmm
[02:34:45] is that what you see?
[02:34:45] load coming back down
[02:36:34] 33?
[02:37:11] ok load seems higher than it was but stable
[02:38:29] 35?
[02:38:42] I may be computing these the wrong way, we can check later
[02:38:46] load is still a long long way below the red line
[02:38:57] yes a long ways
[02:39:48] it's off peak, but CPU only went up from ~40% to ~50%
[02:40:08] very unimpressive
[02:40:22] heh
[02:40:33] I prefer unimpressive but I would prefer even more to have the solution in our hands
[02:41:49] I think this isn't the problem
[02:41:57] shall I revert now?
[02:42:02] yes
[02:43:39] what are the pcache_not_possible numbers?
[02:43:41] CPU has recovered
[02:43:57] instantly of course
[02:45:23] it's incremented from OutputPage::addWikiTextTitle()
[02:45:54] I think we should ignore them
[02:46:00] so far I have been
[02:46:25] let's just count pcache_miss_absent and pcache_miss_expired
[02:46:48] 28360, 22127 for the last numbers I had
[02:46:58] 29010 pcache_hit
[02:47:42] 36.4%
[02:47:55] hmmmm
[02:48:34] how did they compute the earlier numbers?
[02:49:23] with Article::view and Parser::parse-Article::getOutputFromWikitext
[02:49:32] different measuring sticks
[02:49:47] but I have before and after stats
[02:49:54] I just have to process this XML somehow
[02:50:04] ok
[02:51:58] 66.4 before the switch?
[02:52:41] sure looks like. My last numbers for that are hit: 543031, absent: 263432, expired: 10219
[02:52:58] I'll take your word for it
[02:53:03] until my script is written anyway
[02:53:39] well you can always do the addition/division yerself. even with a calculator I'm dangerous(ly wrong)
[02:54:07] so you had us at 31 and we ended at 36% in ..
[02:54:13] bah, I did not keep track of the time. um
[02:54:32] 10.5 minutes or so
[02:58:33] last night (= this morning for some of us) we had 1.17 running the first time for about 30 minutes
[02:58:40] there was a lot of flapping and such during that time
[03:00:17] and another 30 mins or so the second time it looks like from the admin logs
[03:07:44] holler when you got something
[03:11:39] gnumeric is much nicer than OO Calc
[03:11:53] I'll bear it in mind
[03:12:05] I'll bear it in mind too
[03:12:40] more features, less stupid, and the UI is much faster
[03:12:59] OO Calc reminds me of running Excel on a 486
[03:13:02] most things are much faster than oo calc
[03:13:20] I never ran excel on a 486 (or anything). sounds bad
[03:14:19] I wonder if we could have a lot more things that are considered not cacheable in 1.17
[03:14:39] but then they won't be marked as misses I suppose
[03:17:02] it even has "save as image" in the graph context menu
[03:17:08] awww
[03:17:12] <^demon> I can't imagine this sleep-for-three-hours-then-continue-working pattern is really good for me :p
[03:17:18] I don't know why I haven't used it for years
[03:17:25] I know it's no good for me but wth
[03:17:41] I don't know that we'll do another round of anything tonight
[03:17:52] unless the numbers come up much different than I expect
[03:17:57] or one of us has a brain wave
[03:18:30] I'm going to see if I can do a local test to rule out a couple things, might take some time though, need to set up a fresh installation
[03:18:58] http://tstarling.com/stuff/wgCacheEpoch-experiment.png
[03:19:20] hmm, that line needs to be a bit thicker
[03:20:01] ok
[03:20:19] updated
[03:20:47] ok, about my numbers
[03:21:03] so now, what does this tell us about runs one and two last night?
[03:23:44] seems like it would be tough to get to 55% in the first half hour... and then there would be no reason for the second run to just sit there at the same 55%
[03:24:05] we are measuring hit rate in a different way, so the numbers aren't directly comparable
[03:24:24] yeah I suppose
[03:24:25] but I think the theory that the downtime was caused by low hit ratio is looking shaky
[03:24:56] what was the percentage for before the first deployment, using their yardsticks, do we have it?
[03:25:53] no we don't have it
[03:25:57] bah
[03:26:02] profiling was broken until halfway through the first deployment
[03:26:05] right
[03:26:08] so it was
[03:26:08] so there's no data from before then
[03:26:21] we can convert the numbers
[03:26:37] the difference is in the denominator, right?
[03:27:04] i.e. Article::view doesn't always lead to either a hit or a miss
[03:27:12] ah
[03:27:22] so I don't know anything about those numbers
[03:28:04] but we could always capture profiling data now
[03:28:26] probably similar to before deployment
[03:29:49] and then look at relative numbers for those as compared to what we got just a little bit ago
[03:30:28] the sample rate for profiling is 1 in 50
[03:32:56] currently we have Article::view() = 14656, pcache_hit + pcache_miss_* = 611838
[03:33:45] so 83% of Article::view() calls lead to a pcache hit/miss
[03:35:45] so if the old miss rate was 22%, the new miss rate will be 26%
[03:36:04] with denominator reduced by a factor of 0.83
[03:37:38] and the old 1.17 miss rate of 45% would be a new miss rate of 54%
[03:38:53] so a 1.17 new hit rate of 46%, which is higher than what we saw in the $wgCacheEpoch test
[03:39:32] so sorry but not getting the first set of numbers to work out
[03:39:39] I told you I can't add even using bc or whatever
[03:39:52] real article views is 14656*50 ?
[03:40:01] yes
[03:40:10] what this is saying to me is that the LocalSettings.php thing was sufficient to cause the drop in hit rate
[03:40:16] but not sufficient to cause the spike in CPU
[03:40:43] wait, this is wrong
[03:40:47] so you're saying 732800*83/100 = 611838
[03:40:47] this graph is wrong
[03:40:56] but I can't get that to come out
[03:40:59] most of what I said is right but this graph is wrong
[03:41:00] *apergos waits
[03:41:21] I didn't take deltas, so it's showing cumulative statistics
[03:41:29] mmmmm
[03:42:11] what did we have from earlier?
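The yardstick conversion being done in this exchange, written out (a sketch; 0.83 is the measured fraction of Article::view() calls that end in a parser cache hit or miss, and the 1-in-50 profiling sample rate explains the factor of 50):

    <?php
    // Profiled counts: 14656 sampled Article::view() calls (sample rate
    // 1/50) versus 611838 pcache hit+miss counter increments.
    $fraction = 611838 / ( 14656 * 50 );  // ~0.83

    // Old yardstick: misses / Article::view() calls.
    // New yardstick: misses / (hits + misses), i.e. a smaller denominator.
    $newMissRate116 = 0.22 / $fraction;   // ~26% for 1.16wmf4
    $newMissRate117 = 0.45 / $fraction;   // ~54% for 1.17
    printf( "1.16: %.0f%%  1.17: %.0f%%\n",
        100 * $newMissRate116, 100 * $newMissRate117 );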
[03:43:09] we had cumulative statistics from short known time intervals
[03:43:26] short intervals... let's find out how long those were I guess
[03:43:37] if possible.
[03:43:40] it doesn't matter, it's just an average
[03:43:53] the graph is an average over time when it shouldn't be
[03:44:03] so it won't swing at the end as much as it should
[03:44:03] right, I get that
[03:44:15] in the old stats, all we have is the average
[03:44:21] mm hmm
[03:44:26] that's the thing
[03:45:25] anyways I tend to agree: cpu spike not really accounted for, and that's the stopper
[03:46:20] so maybe we disregard the pcache stuff entirely and go hunting for cpu related items
[03:50:01] updated: http://tstarling.com/stuff/wgCacheEpoch-experiment.png
[03:50:10] now the recovery looks like it should
[03:51:19] ok
[03:51:19] hmm really? 30 minutes to recover fully?
[03:51:36] or is the data at the end from after the revert?
[03:51:57] it recovers instantly at 02:47 when I reverted $wgCacheEpoch
[03:52:04] ok, that makes sense then
[03:53:20] the CPU spike might have been on another database
[03:53:25] we only looked at profiling on enwiki
[03:53:39] that's true
[03:53:49] I expect it would have had to be one of the bigger ones
[03:53:59] there are, what, about 10 candidates?
[03:54:53] I guess
[03:55:09] I suppose we want them by readership
[03:55:11] ugh
[03:55:42] but this is not going to be some template; any content-side thing is still there running happily under 1.16
[04:01:01] I wouldn't assume that.
[04:01:34] It would be extremely unlikely that someone added a bit of content right when we went live, and removed it both times
[04:01:58] That's not how I read what you wrote.
[04:02:02] either js or template or whatever
[04:02:06] But sure, that'd be unlikely.
[04:02:21] I wouldn't say definitely this isn't content/template-related, though.
[04:02:26] definitively, too.
[04:02:36] domas referred earlier to es having taken down the site with some template or other
[04:02:48] Lots of small things have taken down the site (or come close).
[04:03:02] oh sure, it could be related to a different way we process something. I mean, it pretty much has to be, right? :-P
[04:03:32] Yes, I think so.
[04:04:05] Starting at 1.16wmf4 and updating at half points might help figure out what's causing the problem. Dunno.
[04:04:13] This is one of the many reasons I hate branches, though. :D
[04:04:47] well getting agreement to push all those tests live might be tough
[04:04:59] I don't think we need to waste too much time shooting down Shirley's silly ideas
[04:05:18] eh, I'm kind of in time-wasting mode for a minute here
[04:05:50] staring at the profile output from 1.16 earlier and your 1.17 saved output and wondering where the culprit is
[04:11:06] I'll write an email about where we're up to
[04:11:16] ok
[04:20:56] we sure got a lot of requests for things that don't exist
[04:20:56] as I look through the logs on one of the apaches (still get them, it seems)
[04:20:56] File does not exist: /usr/local/apache/common/docroot/wikipedia.org/chrome:
[04:20:56] and many other oddities
[04:24:20] PHP Fatal error: Allowed memory size of 83886080 bytes exhausted (tried to allocate 4064 bytes) in /usr/local/apache/common-local/wmf-deployment/includes/StringUtils.php on line 322
[04:24:22] and
[04:24:40] PHP Fatal error: Maximum execution time of 180 seconds exceeded in /usr/local/apache/common-local/wmf-deployment/includes/Title.php on line 2585
[04:26:16] of course this is at the point the machine was heavily loaded
[04:43:00] within a few minutes of deployment the second time we have max clients reached on the one apache where I'm combing the logs
[04:45:45] same for the first deployment
[04:46:52] hard to know what's a cause and what's a symptom unfortunately
[04:54:16] *robla reads through backlogs
[04:55:32] hello
[04:55:45] I just saved a bunch of profile data pages, prolly not useful now
[04:55:53] and am about to give it up for the night I guess
[04:57:15] I wrote a summary of the $wgCacheEpoch thing for private-l
[04:57:18] oh yeah and we did not of course reach max clients during this little test, nothing like it
[04:57:44] <^demon> Summary was nice, saved me from reading too much scrollback
[04:58:09] thanks for sending the email
[05:00:49] yes, thank you!
[05:01:23] <^demon> I'm so out of whack right now. I've lost track of what time && timezone I'm even in anymore
[05:01:52] you think you lost track...
[05:02:38] only thing keeping me from sleep right now is I don't want to wake up at 2 am
[05:03:49] <^demon> I napped from like 6 to 10, so I'm a bit too awake to head back off yet.
[05:04:09] ahh
[05:04:12] nice
[05:17:16] <^demon> OT: Is the 1950s Julius Caesar with Marlon Brando good?
[05:21:37] uhhhh
[05:21:39] I dunno
[05:21:44] I am sure I never saw it
[05:21:57] looking for things to watch on teh internets?
[05:22:10] <^demon> I love older movies.
[05:22:17] <^demon> And they're great to fall asleep to.
[05:22:20] oh, ok
[05:22:32] I mostly got indoctrinated to watch musicals
[05:22:37] like old fred astaire movies
[05:36:36] do you have netflix access?
[05:36:44] nope
[06:59:56] TimStarling: So, $wgCacheEpoch. Where does that 10:34 UTC number come from? Some file's mtime? In that case, have you considered we've done quite a bit of moving around and syncing and stuff?
[07:00:35] it's the modification time of LocalSettings.php on all apaches
[07:00:47] Right, thought so
[07:00:48] it's older in NFS but something touched it in the local copies
[07:01:15] Was it newly created when I first synced out the php-1.17 dir maybe?
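For context on the mechanism under discussion: with $wgInvalidateCacheOnLocalSettingsChange enabled, MediaWiki folds the mtime of LocalSettings.php into the cache epoch, roughly along these lines (paraphrased from memory of the Setup.php hack, not a verbatim copy of the deployed code):

    <?php
    // If LocalSettings.php was touched more recently than the configured
    // epoch, its mtime becomes the effective epoch, so a config change
    // (or an rsync that failed to preserve mtimes) silently invalidates
    // every cached object written before that moment.
    if ( $wgInvalidateCacheOnLocalSettingsChange ) {
        $wgCacheEpoch = max( $wgCacheEpoch,
            gmdate( 'YmdHis', @filemtime( "$IP/LocalSettings.php" ) ) );
    }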
[07:01:25] rsync is supposed to preserve times but often fails due to permissions issues
[07:01:39] you have to run it as root if you want it to work properly
[07:01:59] anyway it doesn't matter why it was changed
[07:02:21] I've fixed the configuration now so it won't matter if it's changed again
[07:02:55] the creation time is still the same
[07:02:59] it's only the mtime
[07:04:25] You're right, 10:34 UTC
[07:05:00] anyways it's back to the drawing board: why did the site fall over?
[07:05:28] so we had a bunch of requests to the apaches: why? were they slower and so they stacked up?
[07:06:07] if CPU is exhausted, thread count spikes, that's just how it works
[07:06:25] that's why we have a maximum
[07:06:52] the thread counts everywhere will be exhausted, then users will see error messages
[07:07:28] and since error messages are cheaper than serving the site, that brings CPU demand back into line with capacity
[07:07:36] sure but my question is about the root cause
[07:08:09] don't know
[07:08:10] OK so the first time I'm pretty sure something was wrong with the pcache
[07:08:22] exactly. nobody knows. we are back at square one.
[07:08:28] I believe brion had discovered a case where it didn't use an existing cache entry
[07:09:03] basically once anyone had caused a new parser cache entry to be saved for a given page, *all* legacy entries for that page would never again be loaded
[07:09:17] $wgInvalidateCacheOnLocalSettingsChange alone was enough to cause the parser cache hit rate to drop by the amount that it dropped
[07:09:27] Right
[07:10:13] So there could be and probably is another cause contributing to the CPU overload
[07:10:21] what it wasn't enough to do (it seems) was cause the cpu hit... so what did? I've been thinking about ways to tackle that without watching it live,
[07:10:23] that is the theory
[07:10:26] but coming up empty-handed
[07:10:29] we could do a staged deployment
[07:11:09] staging site by site? i'd recommend that yes, though..... it should be tested on a production-equivalent cluster first :)
[07:11:23] wiki by wiki, yes
[07:11:30] if only we had a production-equivalent cluster
[07:12:17] we could take something medium-sized like frwiki
[07:12:26] Yeah
[07:12:49] not rtl, I don't think it has flagged revs (have to check)
[07:13:17] Like I said last night, I think a quick hack for doing het deploys in this context shouldn't be hard
[07:13:21] nope it doesn't
[07:13:33] \o/
[07:14:30] what would we need to make that work?
[07:14:54] http://wikitech.wikimedia.org/view/Heterogeneous_deployment
[07:14:58] I'm already there
[07:15:12] definitely both versions sitting on all the apaches (we have that)
[07:16:04] I'm unclear about what happens to bits, etc
[07:16:57] here is a typical bits URL: http://bits.wikimedia.org/skins-1.5/common/shared.css?283-19
[07:17:10] i added two lines to the plan :) http://wikitech.wikimedia.org/view/Heterogeneous_deployment
[07:17:12] However
[07:17:30] When het deploying between a version that calls static resources on bits and a version that calls load.php on bits, this shouldn't matter
[07:17:35] if you want to deploy 1.17, you can just change the version in that URL from 1.5 to 1.17
[07:18:03] except for debug mode
[07:18:15] we had better have debug mode working
[07:18:23] Hmm, true
[07:18:46] So we need to point skins-1.17 to php-1.17/skins
[07:18:53] and images
[07:19:00] http://prototype.wikimedia.org/rc-en/skins/common/images/poweredby_mediawiki_88x31.png
[07:19:08] mmmm
[07:19:09] Right
[07:19:15] I'm sold, we need skins-1.17
[07:19:22] Should be easy to set up anyway
[07:19:32] ok
[07:19:46] maybe also extensions-1.17
[07:19:55] Yes, good one
[07:20:07] $wgCacheDirectory = '/tmp/mw-cache-1.17';
[07:20:11] i.e. $wgExtensionAssetsPath
[07:20:13] yes
[07:20:15] ah
[07:20:26] yes ok, that's the message cache
[07:20:38] Yes, it's shared
[07:20:44] you can just search the configuration for /tmp, that will show you all the shared caches
[07:21:00] Right, all those need to be 1-17ed
[07:21:25] Then live-1.5/MWVersion.php is where the actual version switching would happen
[07:21:45] And the $IP setting in php-1.17/wmf-config/CommonSettings.php would need to be changed
[07:21:45] bingo
[07:21:45] just like the olden days
[07:22:08] Hm, I have an idea
[07:22:10] i point out that that will break some content making incorrect path assumptions (js and css)
[07:22:12] We can try to het deploy test
[07:22:13] though probably not much
[07:22:31] thedj: You mean for extensions/ ?
[07:22:38] you can always make a new test wiki just for testing this
[07:22:39] for skins-1.17
[07:22:45] We already have skins-1.5
[07:22:54] Oh, you mean user JS/CSS
[07:22:59] Yeah, we can just create a new wiki
[07:23:01] yup. hardcoded paths.
[07:23:19] bleah there are probably a lot of those
[07:23:21] ah well
[07:23:53] we provide stylepath already for user JS
[07:24:11] if they have hard-coded "http://bits.wikimedia.org/skins-1.5", that is their own silly fault
[07:24:15] Hmm, this quick het deploy thing is sounding fun
[07:24:17] for enwiki it's probably not much of a problem, we are reasonably away from the hardcoding, and we have knowledgeable people. it's the other wikis that will probably run into issues.
[07:24:33] Too bad I have to hand in this school thing today, or I would've dived straight into it
[07:24:43] heh
[07:24:46] *thedj kicks RoanKattouw to school
[07:24:52] oh crap, i have to get to work :D
[07:24:57] thedj: I'm not saying it's not a problem, just that it's someone else's problem
[07:25:03] I have to sleep. I know that's novel but...
[07:25:13] you know, thought I'd give it a try for once
[07:25:23] good night
[07:25:28] night!
[07:25:34] TimStarling: i don't care much either, just making sure it is taken into account.
[07:26:59] If and when we do deploy 1.17 to one wiki in an isolated fashion, we'll want to set the LocalSettings.php mtime to equal the current $wgCacheEpoch, I guess
[07:27:43] why?
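Collecting the pieces from this exchange, the per-version splitting would look roughly like this on the 1.17 side (a sketch assembling the paths named above; the exact file and variable placement is illustrative, not the configuration as deployed):

    <?php
    // In php-1.17/wmf-config/CommonSettings.php: everything shared
    // between 1.16 and 1.17 gets a per-version copy.
    $IP = '/usr/local/apache/common-local/php-1.17';
    $wgStylePath           = 'http://bits.wikimedia.org/skins-1.17';
    $wgExtensionAssetsPath = 'http://bits.wikimedia.org/extensions-1.17';
    $wgCacheDirectory      = '/tmp/mw-cache-1.17';  // local message cache

    // live-1.5/MWVersion.php then does the actual switching: route the
    // guinea pig wikis to php-1.17 and everything else to wmf-deployment.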
[07:28:11] So we don't get the cache epoch bump and the pcache hit rate dip
[07:28:24] Reduces noise in profiling and such, allows us to focus on other things that might be wrong
[07:28:27] it's fixed already
[07:28:39] Oh, you disabled the feature?
[07:28:52] I set $wgInvalidateCacheOnLocalSettingsChange=false in CommonSettings.php in php-1.17
[07:29:16] which disables the feature, yes
[07:29:34] previously it was a hack in the default LocalSettings.php, I moved it out of there and into Setup.php
[07:35:14] Oh wow, I totally missed http://noc.wikimedia.org/cgi-bin/report.py?db=stats/enwiki
[07:35:19] I didn't know that wfIncrStats stuff still worked
[07:35:33] What's up with the pcache_hit:age=001 stuff
[07:35:35] ?
[07:47:16] *robla doesn't know
[07:48:02] what sort of local parser benchmarking do we have?
[07:49:02] *robla does some rudimentary snooping around the parser tests
[07:50:46] Does http://www.mediawiki.org/wiki/WMF_Projects/wmsync have an associated bug?
[07:50:57] Shirley: I don't think so
[07:51:07] I think the 2010q4 category can safely be killed, too.
[07:51:18] RoanKattouw: I made it log the age of each parser cache object as it is retrieved
[07:51:21] in hours
[07:51:35] Ah OK
[07:51:47] Ariel and I were modelling what would happen if we increased $wgCacheEpoch
[07:51:58] I'll remove it now if you don't want it
[07:52:02] It's fine
[07:52:11] I hadn't noticed the stats before, they're very useful
[07:52:34] they were broken for a long time, domas fixed them
[07:52:42] broken as in missing
[07:53:08] Cool
[07:53:21] OK so one thing is, we can't profile arbitrary wikis this way
[07:53:40] We'd have to manually add any guinea pigs to the profiler code/config, right?
[08:00:31] yes
[08:01:00] you just add them to startprofiler.php
[08:01:03] bbl
[08:01:07] I'm headed to bed, but I'm just wondering if anyone has already tried to find some crude basis (e.g. basic parser test) for a binary search through the revision history (a la git bisect) that might uncover problem commits.
[08:03:06] We'd need to reproduce the parser weirdness on the current version first
[08:03:18] I'm not sure that anyone has even succeeded in that yet
[08:03:52] Also, the $wgCacheEpoch thing Tim discovered can explain the miss rates on its own, with something else probably causing the increase in CPU usage
[08:05:23] I guess I'm trying to probe some ways to figure out the CPU usage increase
[08:06:14] and maybe more generally than a parser performance test, has anyone figured out how 1.16 and 1.17 compare in a non-production environment
[08:12:25] nm...I'm off to bed
[08:18:52] RoanKattouw, which parser weirdness?
[08:20:03] The high pcache miss rate
[08:20:10] Although that may have been solely due to the $wgCacheEpoch thing
[08:20:19] In fact it probably was
[08:21:06] we may want to add another counter to the second if (!$value) that we added yesterday
[08:21:50] where does the age= come from?
[08:22:58] The more informative counters, the better
[08:23:00] Tim added the age= stuff
[08:23:35] seems it wasn't put in svn yet
[08:24:06] That's right
[08:24:13] Two uncommitted changes on the cluster
[08:24:34] oh, remember to add r81765
[08:24:46] I don't want to kill the system again for that :P
[08:24:49] http://mediawiki.pastebin.com/PhhJAJGM
[08:25:49] so age is number of hours
[08:26:10] Yes
[08:30:01] I don't see the source for pcache_miss_stub either
[08:30:43] Me neither
[08:31:05] Article.php
[08:31:08] Note that the changes I pastebinned are against 1.16wmf4
[08:31:58] I supposed so
[08:34:33] !r 81765
[08:34:33] --elephant-- http://www.mediawiki.org/wiki/Special:Code/MediaWiki/81765
[08:39:19] I would also add this http://p.defau.lt/?U1f5EP3R8goHlP9H_UqZ4Q
[08:39:20] to have more numbers available
[08:39:32] (on 1.17)
[08:41:18] I think there's a typo in your if condition
[08:41:34] where?
[08:41:49] strpos( $value , ... )
[08:41:53] $value is a ParserOutput, not a string
[08:42:26] AFAICT
[08:42:35] you're completely right
[08:42:48] it should have been strpos( $parserOutputKey, '*' )
[08:44:37] http://p.defau.lt/?5ihZ3Lk3CU2CyxkkA_q_Hg
[08:47:49] OK so if the new format misses but the old format hits
[08:48:02] We should see pcache_miss_newkeys but no corresponding pcache_miss, correct?
[08:48:44] right
[08:48:51] and it would actually be a hit
[08:48:57] Right
[08:49:06] those were added between the first attempt and the second
[08:49:20] At least this'll allow us to see whether the new keys thing is a problem
[08:50:15] if we're doing a heterogeneous deploy I'd begin with a wiki without too many extensions
[08:50:35] such as without flaggedrevs
[09:01:51] Yeah
[09:26:28] when will the next deploy attempt be?
[09:38:51] Tonight, we will decide whether to deploy tomorrow
[09:39:13] (That's tonight and tomorrow UTC)
[09:57:53] RoanKattouw: I made some js changes in site js for 1.17 which are not compatible with 1.16. with cached js going away, people are beginning to complain. do you think I should revert it now?
[09:59:41] I'd make it conditional on wgVersion
[10:02:01] What Platonides said
[10:02:15] wgVersion will be either '1.16wmf4' or '1.17wmf1' depending on which version is live
[10:10:15] ok... I'm writing a single line importScript('MediaWiki:Common.js/' + wgVersion + '.js'); in Common.js
[10:11:08] do i understand correctly that there was an error in the patch for attempt 2?
[10:11:40] what should I use for importScript in 1.17 style?
[10:13:03] liangent: i don't understand the question.
[10:13:15] thedj: We don't know exactly what went wrong in attempt 2
[10:13:16] wgVersion = 1.17wmf1
[10:13:32] We do know that the high pcache miss rate can be explained by accidental bumping of $wgCacheEpoch
[10:14:36] how should I use mw.loader to achieve the same goal as importScript?
[10:15:24] importScript will continue working in 1.17
[10:15:36] liangent: mediaWiki.loader.load( url );
[10:15:43] But yes, importScript will continue to work
[10:15:56] At least in user/site JS
[10:16:12] RoanKattouw: that way it has to be a URL. what about loading by page name?
[10:16:19] liangent: not possible atm
[10:17:26] I'd continue using importScript for now
[10:18:01] yeah, that's the best idea. if that is ever truly deprecated, there will be plenty of work anyways.
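A plausible shape for the uncommitted age instrumentation being discussed, based on the description earlier in the log (ages bucketed to the nearest hour up to 100 hours, then to the nearest 10 hours). The variable names are guesses; this is not the pastebinned diff:

    <?php
    // On a parser cache hit, record the age of the retrieved ParserOutput
    // in coarse buckets, producing the pcache_hit:age=NNN counters seen
    // in the noc report.py output.
    $ageHours = ( time() - wfTimestamp( TS_UNIX, $value->getCacheTime() ) ) / 3600;
    $bucket = $ageHours <= 100
        ? round( $ageHours )            // 1-hour buckets up to 100h
        : round( $ageHours / 10 ) * 10; // 10-hour buckets beyond that
    wfIncrStats( sprintf( 'pcache_hit:age=%03d', $bucket ) );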
[10:19:22] but what about importScriptURI
[10:19:34] importScript/importScriptURI were defined as an API for mediawiki javascript
[10:19:41] I would keep them for a long time
[10:19:43] I guess there'll be some issues when people mix importScriptURI and mw.loader
[10:19:51] Not at this time at least
[10:19:56] mw.loader in URL mode is dumb
[10:20:04] i.e. loading scripts twice
[10:20:08] It doesn't keep any kind of state about loaded URLs, it just loads them
[10:20:32] So both importScriptURI() and mw.loader.load() in URL mode will happily load URLs twice
[10:20:55] At least that's the way stuff works right now
[10:21:09] there was a loadedScripts memory in importScriptURI?
[10:21:10] For instance, there is no reason importScriptURI couldn't be aliased to mw.loader.load() in the future
[10:21:18] Hello, we have a problem on bnwiki with JS. Can anybody help?
[10:21:18] I don't know, was there? Maybe
[10:21:21] importScriptURI actually checks if a key has been loaded before.
[10:21:34] Hm, I don't think load() does that for URLs
[10:21:37] Tanvir: What problems?
[10:22:47] The problem is, when we try the js code in our personal js it works, but when it's on MediaWiki:Something.js, it doesn't work..
[10:22:53] The version still shows 1.16, which I assume is correct.
[10:23:02] Tanvir: have you tried doing a null edit on MediaWiki:Something.js ?
[10:23:12] Platonides, no.
[10:23:12] there were some message problems yesterday
[10:23:17] try it
[10:23:23] Okay trying.
[10:23:28] and yes, the sites are still at 1.16
[10:23:31] They should've been fixed when I purged the individual msg cache on all wikis, but try anyway
[10:23:56] do we know what caused those problems?
[10:24:06] If it's still broken after that, see if there are JS errors and give us the error messages and line numbers and such
[10:24:12] Msg cache incompatibilities I guess
[10:24:24] Not worth diving into much if there's a known workaround
[10:24:54] Platonides, null edit means? should I remove something and then restore it back?
[10:25:13] save without changing anything
[10:25:19] Okay.
[10:25:46] Platonides: I ran this last night, that seemed to fix it: for wiki in `< all.dblist`; do echo '$wgMessageCache->clear();' | php wmf-deployment/maintenance/eval.php --wiki=$wiki; echo $wiki; done
[10:27:06] I hope the normal update.php performs something like that
[10:27:32] I think it does
[10:28:17] Platonides, it works! :D
[10:28:52] :)
[10:29:38] But did someone close the editing privileges for a few seconds? I got a message when saving, like some admin had restricted database privileges.
[10:30:05] Platonides: The updater doesn't do this, no :( but it does DELETE FROM objectcache; so it's not a problem on wikis using CACHE_DB
[10:30:27] Tanvir: just now?
[10:30:31] Yes.
[10:30:37] there was no change...
[10:31:13] Oh, but I got the message. It's not a problem though.
[10:31:39] would have been nice to have it
[10:32:46] So, 1.17 implementation won't take place in the near future?
[10:33:07] it depends what you mean by near future
[10:33:08] we will try it
[10:33:16] not today
[10:33:22] this week maybe
[10:33:31] and this month for sure
[10:33:43] Okay. :)
[10:33:52] We will make a decision tonight (UTC) as to whether to deploy again tomorrow (UTC)
[10:35:03] RoanKattouw: so was the pcache issue fixed?
[10:35:21] The pcache issue may have been due to the $wgCacheEpoch thing instead
[10:35:28] Tim and Ariel ran a test last night when I was asleep
[10:35:40] Where they set $wgCacheEpoch 6 (or 3?) hours in the past
[10:35:51] And they saw the same increase in pcache miss rates, but only a small increase in CPU usage
[10:37:34] so, that presumably means that the cpu spikes were caused by something entirely different?
[10:37:42] Which leads to the troubling conclusion that the pcache miss rates and the CPU usage were probably not correlated much, and that the cluster meltdown was caused by something else
[10:37:47] Yeah
[10:38:17] maybe we also hit some kind of Michael Jackson effect?
[10:38:19] So the plan I favor is to hack up a quick "system" (i.e. a hack) that'll allow us to deploy 1.17 selectively
[10:38:29] Cache stampeding? That's possible
[10:38:35] But it couldn't be due to the cache epoch issue
[10:38:40] I have been thinking about it, but I would expect it to have fixed itself in that case
[10:39:09] Because 1) the cache epoch issue was repeated and led to a small CPU increase and 2) Michael Jackson-like pages typically don't have pcache entries older than 6h
[10:39:22] maybe if the cache epoch did something like expiring the caches for all main pages...
[10:39:38] We repeated the cache epoch bump
[10:39:57] What we didn't repeat was a cache epoch bump in conjunction with a scap, Tim used sync-file
[10:39:59] it could have hit a different page set
[10:40:02] scap purges a few other caches
[10:40:32] Either way, what I'd like to do is hack up a system so we can deploy 1.17 selectively, and take a guinea pig wiki to see how it behaves
[10:40:36] well, good luck
[10:40:39] bbl
[10:40:40] But we'd have to hack that up first
[10:40:59] Use mediawikiwiki as a trial :P
[10:41:03] Wouldn't take much time but the only person enthusiastic about it (me) has a school project deadline to meet today
[10:41:09] Reedy: Rather not, CR is on there
[10:41:16] Boo
[10:41:27] it doesn't have enough traffic to be significant
[10:41:31] That too
[10:41:39] officewiki is even quieter and has phone numbers, so that one's out too
[10:41:49] I've been thinking we could take a closed wiki
[10:42:04] Although such a wiki won't have much traffic either
[10:42:07] Maybe incubator, or meta, or something
[10:42:28] meta may be a good option
[10:42:45] It does host CentralNotice
[10:42:54] as long as it doesn't break completely...
[10:43:07] it is also the point from which stewards act
[10:43:09] the first wiki we move to 1.17 should be a test wiki
[10:43:14] CN won't be changed
[10:43:15] Yeah, it should
[10:43:23] not a wiki we use
[10:43:32] But at some point we'll have to move a wiki that gets actual traffic
[10:43:44] it takes about a minute to make one
[10:43:58] Yes, and root access
[10:44:05] So could you make me one before you go to bed tonight?
[10:44:11] ok
[10:44:15] I'll start poking tonight
[10:44:15] Schoolwork first
[10:44:37] what's it about?
[10:44:44] Quadrics
[10:44:52] And Mathematica
[10:48:25] I'll do it when I have two hands free
[10:58:05] Take your time
[11:15:02] do we have things (e.g. category sorts) in 1.17 lazily updating on access?
[11:15:09] that could be a source of cpu spikes
[11:15:26] there's no category sort feature in 1.17wmf1
[11:16:07] I think there were some category changes
[11:16:32] maybe not sorts
[11:18:13] The category collation changes were patched out of wmf1
[11:18:32] Either way, we should be able to tell from profiling data
[11:19:20] TimStarling: How is the profiling of parse() entry points (e.g. 'Parser::parse-OutputPage::getWikiText') done currently? With profile points in the caller or in the callee?
[11:19:53] If the former, maybe we should move to the latter, so we can track down any other sources of excessive parse() calls
[11:20:17] Parser::parse-OutputPage::getWikiText is done in Parser::parse
[11:20:46] $fname = __METHOD__.'-' . wfGetCaller();
[11:20:48] I see
[11:20:51] That's very useful
[11:34:13] what should this test wiki be called?
[11:44:13] it can be test2.wikipedia.org
[11:44:32] suppose it doesn't really matter what it's called
[11:44:44] as long as it isn't mistaken for a normal wiki
[11:46:10] *vvv tried to import a dump with FAs on his local wikis
[11:46:27] Resulted in a PHP fatal error concerning the recursion limit
[11:51:51] Blame werdna?
[11:54:35] TimStarling: oneseventeen.wikipedia.org :D
[11:54:48] (You know, no leading digits)
[11:54:57] Seriously though, test2wiki will be fine
[11:55:09] RoanKattouw, it's up already ;)
[11:55:19] So it is
[11:55:27] Haha. I got user id 1
[11:55:42] Oh, SUL?
[11:56:03] Ya
[11:56:11] Well, i suppose it needs to be as much of a replica as possible
[11:56:41] That's fine
[11:56:44] It's running 1.16 for now
[11:56:52] So I can import the contents of test.wp later
[11:58:26] test2wiki? nothing came through the newprojects list
[11:58:46] liangent, it's just test2
[12:00:31] liangent, what is the newprojects list?
[12:08:07] Hmm, that's weird
[12:08:40] Maybe Tim-away didn't use addwiki.php but something else
[12:08:46] Or he didn't use it in the documented way
[12:08:55] Why?
[12:08:58] (Or addwiki.php is broken :) )
[12:10:30] Waihorace, https://lists.wikimedia.org/mailman/listinfo/newprojects
[12:18:13] test2 is running 1.16wmf4?
[12:24:25] Waihorace: yes, see http://test2.wikipedia.org/wiki/Special:Version
[12:24:46] lol
[12:24:52] was test2 deleted?
[12:24:54] But why? Isn't it used to test 1.17?
[12:24:59] no
[12:25:06] It will be moved to 1.17 later
[12:25:09] It is totally inaccessible for me
[12:25:17] probably a dns issue
[12:25:23] true
[12:25:33] but I am the first user to edit it ;)
[12:25:40] not all user creations are listed at http://test2.wikipedia.org/wiki/Special:Log
[12:25:42] i saw that
[12:25:47] ah
[12:25:49] well, after 127.0.0.1
[12:25:50] got it back
[12:26:18] I am following the channel for the feeds and found Krinkle to be the next user
[12:27:14] 11:59, 9 February 2011 Reedy (Talk | contribs) Account created automatically ???
[12:27:52] Indeed
[12:27:53] the creations of user Krinkle and Platonides aren't listed there
[12:28:04] compare with http://test2.wikipedia.org/wiki/Special:ListUsers
[12:28:53] Huh, I have user ID 13. I'm lucky
[12:29:15] Maybe
[12:29:23] you are listed 12th there
[12:29:39] Krinkle is before
[12:29:52] created at the same time as Akkakk
[12:29:59] lol
[12:30:13] I am 7
[12:30:22] the first wiki for me to be in the single digits
[12:30:37] I am 10.
[12:30:48] and it is the 800th wiki for my global account
[12:32:37] It is my 139th wiki...
[12:32:47] oh lol
[12:33:00] I thought my 800th wiki would be eo.wikisource
[12:33:24] i've got 805 wikis tied in
[12:33:41] lol
[12:33:48] 5 wikis must be private ones?
[12:34:04] or now closed
[12:34:24] I have global accounts in closed wikis
[12:34:34] I wonder where the 5 comes from?
[12:35:04] i just logged into everything on sitematrix
[12:35:09] "Please do not start editing this new site" <- Who cares
[12:35:22] it's just a generic message
[12:35:28] Should this sentence be removed?
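The per-caller profiling idiom Tim quoted earlier ($fname = __METHOD__.'-' . wfGetCaller();), shown in context. A sketch of the pattern inside Parser::parse(), not the actual source:

    <?php
    // Profile parse() under a name that embeds the caller, so CPU spent
    // parsing is split out per entry point in the profiler output, e.g.
    // "Parser::parse-OutputPage::getWikiText".
    public function parse( $text, Title $title, ParserOptions $options ) {
        $fname = __METHOD__ . '-' . wfGetCaller();
        wfProfileIn( $fname );
        // ... the actual parsing work ...
        wfProfileOut( $fname );
    }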
[12:35:28] no
[12:35:34] even if it is a new language
[12:35:38] I don't really care
[12:35:41] :P
[12:40:30] Well, we surely are not going to import test2 from beta.wv
[12:42:55] What's the current preferred way of importing/exporting dumps? mwdumper?
[12:44:22] yes
[12:44:36] <^demon> mwdumper isn't really well maintained ;-)
[12:44:56] <^demon> I would think the maintenance scripts dumpBackup and importDump are the canonical ways.
[12:45:36] stupid browser
[12:45:54] constantly giving me a "Not found" error for test2.wikipedia.org
[12:53:02] And what is used by Wikimedia?
[12:54:01] when creating new projects?
[12:54:15] When dumping and importing dumps?
[15:28:52] morning
[15:28:58] time to see what's up with test2
[15:29:31] It's still 1.16 ;)
[15:30:02] did anyone import some content?
[15:30:22] don't think so
[15:30:29] Roan was going to later I think
[15:30:41] Tonight
[15:31:12] there we go, I've got some boring userid I guess
[15:31:50] which content?
[15:31:54] as long as I'm around...
[15:32:40] *apergos wanders around the en project dumps looking for a likely suspect
[15:35:42] apergos, don't you fancy importing an el one to confuse us all? ;)
[15:35:56] meh
[15:36:05] it would be fine for me and not helpful to you
[15:36:50] so we could import simple.wiki and get all the back revisions...
[15:37:15] that would give us full-fledged articles, refs, etc
[15:37:32] and be much smaller than enwiki
[15:37:44] alternatively we could export all the articles (with templates etc) in some specific category of en wiki
[15:38:21] lemme see what typical articles look like on simple
[15:38:52] eh not bad
[15:39:07] they are chock full of cites, have some pics, all the standard stuff
[15:40:25] a lot are shorter
[15:40:51] lemme see if we can find a good category on en wiki that would do us.
[15:41:40] Featured articles?
[15:41:50] won't that be a bit small?
[15:42:42] 3710 pages
[15:44:44] I suppose if we got those and all their supporting templates....
[15:45:10] I wonder what test.wp has in it
[15:46:18] nice. I chose a random page and got the
[15:46:25] Infinite loop 0 page
[15:49:12] hmm he doesn't seem to have it set up to go to a particular server
[15:57:47] So?
[15:58:10] It's not like that wiki gets a lot of traffic, and we can shut it down trivially if we want
[15:58:25] if($wgDBname=='test2wiki') die('disabled');
[15:58:47] sure
[15:59:18] just thinking about imports or exports
[17:25:47] Alright, dinner. After dinner and some TV I'll start poking at test2.wp.o
[17:26:44] Greetings, Programs!
[18:25:02] I hadn't been running the stats thing, but I guess I should have been: http://www.mediawiki.org/wiki/MediaWiki_roadmap/1.17/Revision_report
[18:38:18] robla: wow, that's a lot of new revisions
[18:54:35] ooh
[18:54:47] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/78179 doesn't actually need backporting since it was reverted in 1.17
[20:49:46] alolita: I noticed you put the daily "Tomasz, Arthur @FOSDEM, GNUNIFY" events in the calendar. Did you know GCal can do events spanning multiple days so you don't have to do the repetition thingy?
[20:49:46] <^demon> robla: You can ignore my e-mail from earlier, I will be around at 4:15.
[20:53:33] ossm
[20:57:54] alolita: Fixed it by replacing the daily repeating events with two multi-day ones
[21:10:23] What's on test2 is missing 2 revisions...
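Returning to the import/export question above: the maintenance-script route is driven from the command line along these lines (a sketch; the --wiki flag assumes per-wiki selection is handled by the wiki-farm setup, the file names are made up, and a rebuild pass afterwards is usually wanted):

    # Export the full history of the source wiki to XML...
    php maintenance/dumpBackup.php --wiki=testwiki --full > testwiki.xml
    # ...replay it into the target wiki...
    php maintenance/importDump.php --wiki=test2wiki testwiki.xml
    # ...then rebuild derived tables such as recent changes.
    php maintenance/rebuildrecentchanges.php --wiki=test2wiki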
[21:10:53] Oh
[21:10:53] http://eiximenis.wikimedia.org/1-17
[21:10:54] I'll svn up
[21:11:09] Heh, one being an undeclared global, and one being my fix ;)
[21:11:13] nice
[21:17:48] http://test2.wikipedia.org/wiki/Special:Version --> running 1.17 through rudimentary het deploy setup
[21:21:22] Krinkle, how did you create your test2 account?
[21:21:34] Platonides: I just logged in
[21:21:35] :P
[21:21:38] autocreate
[21:21:55] you logged in from Special:UserLogin ?
[21:22:07] yes
[21:22:13] I supposed so
[21:22:36] there seems to be a bug when doing that
[21:23:36] RoanKattouw: what's het deploy? :o
[21:23:46] Platonides: Hm.. I'm not in the log.
[21:23:53] http://test2.wikipedia.org/wiki/Special:Log/newusers
[21:23:54] Nikerabbit, heterogeneous deploy
[21:23:59] Krinkle, that's the bug
[21:24:04] TimStarling: Have you called in?
[21:24:07] neither am I
[21:24:13] yes
[21:24:16] heh, I thought het was an article in Dutch
[21:24:18] OK
[21:24:20] Platonides: Oh, okay. I thought I wasn't supposed to be able to create an account (that that was the bug)
[21:24:35] no, the bug is not creating the log entry
[21:24:42] RoanKattouw: When I first read it I thought it was some kind of Dutch system "Het deploy" ;-)
[21:25:10] Krinkle: Me too :)
[21:25:17] It's something robla coined
[21:25:20] TimStarling: Better
[21:25:32] http://simple.wikipedia.org/wiki/Special:RecentChanges
[21:25:34] I didn't even log in
[21:25:44] an account was just autocreated for me
[21:25:54] you were logged in on other wikis
[21:25:56] Platonides: But there are some other "auto create" entries, so why not ours? What did we do differently, and is it a regression or an older bug?
[21:26:08] they were logged in on wikipedia
[21:26:12] and it was autocreated on visit
[21:26:20] ah, via the cookie
[21:26:26] yes
[21:26:30] Indeed, I wasn't logged into centralauth yet at the time.
[21:27:33] The logo on test2.wikipedia reminds me of https://bugzilla.wikimedia.org/show_bug.cgi?id=24278
[21:34:16] CT is saying that we should deploy a few wikis and then wait 24 hours and then announce whether we're going to continue
[21:36:12] RoanKattouw, at what time did you update to 1.17 ?
[21:36:29] Platonides: See SAL
[21:36:37] what's SAL?
[21:36:40] 21:10 UTC?
[21:37:38] (server admin log)
[21:37:40] [[Server admin log]] on wikitech
[21:38:58] http://bit.ly/wikisal
[21:39:04] we need the custom logo on test2, it is true
[21:39:33] No, we don't need the custom logo on test2, we need to update the fallback (which is the issue here: it's not set to the old logo, it's simply not set, and the default is still the old logo)
[21:40:49] RoanKattouw, but at which point did it begin using 1.17 code?
[21:41:01] was it between 20:06 and 20:22 UTC?
[21:41:02] Did what begin to use 1.17 code?
[21:41:05] test2?
[21:41:09] yes
[21:43:09] Platonides: logmsgbot> !log catrope synchronized php-1.17/LocalSettings.php 'Update path here too'
[21:43:12] Thereabouts
[21:43:55] right
[21:44:10] RoanKattouw, you haven't svn up'd :P
[21:44:13] so it didn't log all account creations in 1.16
[21:44:19] and is logging none in 1.17 :(
[21:45:07] Reedy: Yes I have
[21:45:19] I did svn up php-1.17 and synced out the files
[21:45:21] MediaWiki 1.17wmf1 (r81761)
[21:45:29] Lies
[21:45:35] That only updates on s-c-a or scap
[21:45:40] oh
[21:45:42] right :P
[21:45:43] Not usually for sync-common
[21:47:35] are the live hacks in the 1.17wmf1 branch?
[21:48:41] I think so
[21:48:55] Yes, Parser.php and Preprocessor_DOM.php
[21:49:01] I see some revisions missing there :P
[21:50:12] http://mediawiki.pastebin.com/P5DzDMcB
[21:50:13] r81778, r81765, ( r81742 for LiquidThreads_alpha )
[21:50:21] Were they merged to 1.17wmf1?
[21:50:28] If not, tag them with that
[21:50:49] Revision: 81849
[21:50:54] Is what I have here
[21:51:02] I can run s-c-a to be sure it's up-to-date
[21:51:59] tagged
[21:52:17] After this meeting
[21:55:19] I would throw an exception if $this->parser->mOptions was null there
[21:55:50] I was just trying to shut up a fatal
[21:56:23] yep, but whatever calls it seems broken, too
[21:58:18] that is gonna suck
[21:58:21] 4 am again
[21:58:32] I'm still not time-adjusted or I wouldn't care
[21:58:51] today was slightly better, woke up at 7:30 am ... but only cause I have been seriously sleep-deprived the previous nights
[21:59:18] *apergos will do their very best to wake up after 10 am tomorrow :-P
[21:59:40] robla: Also note that, after our cute little 6-hour window on Friday, I will want to catch a train some time between 1300 and 1400 UTC
[22:06:20] Platonides: Hmm, right, I need to apply r81742 to LQT_alpha as well
[22:06:23] robla: when might a post go up about this? I already have people asking on a mailing list about it in relation to something else, I'd love to have a link I can point them to
[22:06:28] (sorry)
[22:07:26] in a meeting, but hopefully by 3pm
[22:07:41] ok, I'll just tell them to check the tech blog in a few hours. thanks
[22:14:30] robla: We should talk about victim/guinea pig wikis after you get out of that meeting
[22:14:48] Platonides: Merged and deployed the revs you asked about
[22:14:55] alright...ready for blog post, then guinea pig discussion
[22:15:01] simples for sure. at some point when we are more confident, frwiki: it has not got flagged revs and it is a decent size, so it could be a "later on" choice
[22:15:08] (and that's all I have to say about the subject)
[22:15:52] is there anything else we want for test2 in the meantime? or any other prep before tomorrow?
[22:15:52] apergos: simples plural? I thought we only had simple English
[22:16:01] simple wiki
[22:16:05] then there are others:
[22:16:10] well, I guess I need the guinea pig list for the blog post now, don't I?
[22:16:46] wiktionary
[22:16:49] species (ugh)
[22:17:13] ah there used to be wikibooks but it's closed
[22:17:41] same with wikiquote. hmph
[22:17:43] "guinea pigs" will now be referred to as "first wave"
[22:17:48] heh
[22:17:53] too late, this is a public channel :-D
[22:19:14] OK :)
[22:19:21] enwikisource has an edit every minute or two, it might be a decent choice also
[22:19:44] hmm, guess I had more to say about the topic :-D
[22:19:50] What about usability+strategy?
[22:20:24] http://usability.wikimedia.org/w/index.php?title=Special:RecentChanges&days=30
[22:20:26] thanks, Roan
[22:20:29] very minimal editing over there
[22:20:33] apergos: RoanKattouw: let's actually use the blog drafting page as the place to come up with the list of wikis: http://eiximenis.wikimedia.org/ReleaseBlogPost2011-02-09
[22:20:35] so prolly not much traffic either
[22:20:40] ok
[22:20:42] *click*
[22:21:32] http://strategy.wikimedia.org/w/index.php?title=Special:RecentChanges&days=30 not much better
[22:23:42] robla?
[22:23:50] Platonides?
[22:24:07] oh, I didn't notice eiximenis had a chat, too :P
[22:24:15] was wondering what was "over there"
[22:33:35] it's the secret cabal.
[22:35:31] <^demon> We should move eiximenis to cabal [22:40:10] no but I do think we need a cabal.wikimedia.org [22:40:13] *apergos checks... [22:40:33] NXDOMAIN ok, just making sure... [22:41:25] btw robla, you need to add me to the secret channel [22:42:27] Platonides: YOU CAN'T PROVE THERE'S A SECRET CHANNEL [22:42:31] :) [22:42:35] robla doesn't add people [22:42:35] there's a secret channel? [22:42:47] we don't give the keys to the kingdom to just anyone :-P [22:43:07] Ryan_Lane: shhhhhhhhhh! [22:43:13] shhh about what? [22:43:14] robla, I didn't want to say the name... :P [22:43:17] *Ryan_Lane is confused [22:43:48] aren't you using a channel ending in "rity"? [22:43:54] heh [22:43:58] <^demon> Yeah, it's grrrrity [22:44:35] mediawiki-celebrity [22:44:40] it's a secrity [22:46:07] apergos, YES! [22:46:10] cabal wiki ftw [22:46:43] we already know what the logo will look like too ;-) [22:49:14] You mean #mediawiki_cabal_omgitsasecret ? [22:49:31] shh... there's no cabal [22:56:05] and robla, you skived off my request :P [22:56:33] TimStarling: Do we need to increase APC cache size for het deploy? [22:57:20] Platonides: you may be asking the wrong "Rob". I honestly don't control the access lists for any private channels [22:58:11] s/cabal.wikimedia/internal.wikimedia [22:58:30] <^demon> internal is boring :p [23:00:05] very. [23:00:18] so we discuss vendor-related crap in there, is the thing. who can have access to that? [23:01:26] I thought someone told me it was you [23:01:30] perhaps it was RobH [23:01:49] RobH has the keys. [23:01:58] well he has half the keys [23:01:58] any op in that room does, but i know how to flip the bits [23:02:02] but i am not sure who is allowed in [23:02:03] who has the other friggin half? [23:02:18] apergos: if we told you that, we'd have to kill you [23:02:24] we are discussing, and I am probably gonna not do a thing until I have mark or someone who is more in charge than i am decide on who is and isn't allowed [23:02:46] cuz i am not supposed to be the person deciding this stuff. [23:02:56] nor am i comfortable doing so without some clear guidelines. [23:03:16] robla: heh, hr tried to assign my vacation days to you earlier ;] [23:03:30] so thx for coming along, it takes a lot of stuff from me that i don't want assigned to me ;] [23:03:33] and you stopped them :-/ [23:03:36] RoanKattouw: I will have to check that [23:03:38] yes, i stopped them [23:03:40] =] [23:03:51] well, they asked me cuz they realized there was more than one [23:04:17] so I had the opposite thing: they kept forgetting to remove my vacation days [23:04:30] I told em I had used em and they were sure I hadn't [23:04:53] apergos is keepin it real, even if it means she loses vacation time [23:05:06] 'when keeping it real goes wrong' [23:05:09] yeah [23:05:11] *sigh* [23:06:00] robla you should read the changes I made to the blog post notes under second wave [23:06:05] (maybe you did already, I dunno) [23:06:32] apergos: yeah, I saw that, thanks [23:06:41] I think I'll merge it somehow with what I wrote above [23:07:18] great [23:07:35] I just wanted the idea to get in or get explicitly rejected [23:08:29] my new laptop battery just arrived from china [23:09:13] robla: I don't think I'm gonna be around for part of my shirt tomorrow but I guess that's not a big deal since we're not live yet?
[23:09:15] I need the sleep [23:09:16] laptop batteries from lenovo australia are incredibly expensive, much more than they are on the US website [23:09:29] so I got a cheap copy, cost like $35 [23:09:47] here's hoping it doesn't catch fire or explode [23:09:49] RoanKattouw: correct....I still need to figure out the shifts anyway [23:09:55] I've had 5 nights in a row with 7 hours or less, I don't want to make that 6 [23:10:04] (assuming RoanKattouw is talking about "shifts" rather than shirts) [23:10:14] Yes, sorry :D [23:10:19] btw RoanKattouw, did your assignment go well? [23:10:25] Yeah, I got it mostly done [23:10:33] Handed it in mostly but not completely done [23:10:36] Hopefully I'll get away with that [23:10:39] :S [23:10:46] good luck [23:13:31] robla: what I was getting at was actually the opposite of what you put: [23:13:45] it may well be that we switch over all the things listed in first wave [23:13:55] apergos: that's not what we agreed to [23:13:58] and cpu usage doesn't have a noticeable bump [23:14:15] if it doesn't then we are stuck [23:14:24] we're not switching over everything in the first wave, no matter what [23:14:32] I agree, we aren't [23:14:58] but if we need to list everything explicitly, we had better list a couple of large things as possibilities later in wave 1 [23:15:11] sure, let's do it that way [23:15:15] cool [23:15:49] eowikipedia? [23:16:31] eo projects already added [23:16:41] it's size by traffic mostly, right? [23:17:04] Yes, traffic [23:17:09] Edits matter too [23:17:18] i.e. we want moderately active, then reasonably active, and if that still doesn't get us there, then pretty darn active [23:17:23] But mostly traffic [23:17:41] For 'pretty darn active' we can just take a top10 wiki :D [23:17:43] yes, I've been using edits as a likely indicator of reads as well (could be entirely wrong but eh) [23:17:43] Dutch? [23:17:51] Yes, that'll work [23:18:01] Half the people closely involved in the deployment speak Dutch [23:18:04] heh [23:18:08] worksforme [23:18:12] Almost [23:18:24] <^demon> I speak simple. [23:19:00] we will need to double the APC cache size [23:19:00] I was saying, if we need to get serious and grab a top10 wiki, we could use nlwiki [23:19:06] simplewiki probably isn't large enough [23:19:15] Any reason not to double it now? [23:19:32] (For a value of 'now' < 'when we start deploying things') [23:19:38] it looks like they have 50MB, so say 120MB [23:19:53] they're all using more than half of their 50MB [23:20:46] Hmm, het deploy doesn't work for secure [23:20:47] let's make sure our list is enough to do the job, folks. otherwise this exercise may not get us the info we need [23:22:52] Can we figure out the % of requests each wiki accounts for? [23:23:53] hmm [23:24:04] of reads I suppose you mean [23:25:49] Yes [23:26:54] we would be concerned about backend hits [23:26:55] but deployment could change the pattern :s [23:28:04] http://stats.wikimedia.org/EN/TablesPageViewsMonthly.htm [23:28:24] this is not backend only [23:28:30] it excludes mobile [23:28:40] and it's going to be hard to make good estimates but it gives us a notion [23:29:18] the top ten are 92% of all hits [23:29:26] so the rest are going to be negligible [23:30:11] OK so we'd need a top10 wiki [23:30:16] What's nlwiki's percentage? [23:30:37] oh and these are pedia only. [23:31:23] wonder if I have anything more detailed in my email [23:32:20] nope. I don't [23:33:44] nlwiki is 10th [23:35:02] 1.17% of pageviews [23:35:54] Is that supposed to be symbolic?
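On the APC sizing discussed above: the proposed change is just a bump of APC's shared-memory segment in the PHP configuration on the apaches, roughly like this (a sketch; the file path is a typical Debian/Ubuntu location, not necessarily what the cluster uses):

    ; /etc/php5/conf.d/apc.ini (location varies by distro)
    ; Keeping two MediaWiki branches resident at once roughly doubles the
    ; bytecode cache footprint, so ~50MB becomes ~120MB with some headroom.
    apc.shm_size = 120M   ; older APC releases want a bare number of MB, i.e. "120"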
[23:36:12] eswiki 4th 5.68% [23:36:12] the first is obviously enwiki at 52.59% [23:36:15] haha [23:36:23] Wow, so even nlwiki is only a percent [23:36:48] second is ja with 8.19% [23:36:58] <^demon> Ha, I never knew wikidiff2 was in the Ubuntu repos. [23:37:21] the official ones or wmf? [23:37:33] <^demon> official ones [23:37:38] they had to rebuild it here a couple of days ago [23:40:02] So the best we can do for a small single-wiki deployment is 8% :( [23:40:12] robla: ----^^ [23:41:45] *robla ponders [23:42:53] I got het deploy to work on secure too [23:43:13] And immediately caught a mistake I'd made in configuring it to use ResourceLoader over https [23:44:53] so...regardless of percentages.... [23:45:15] we still get plenty of traffic and edits [23:45:31] ...and it should be enough profiling information to help us finish the job [23:45:39] Profiling, for sure [23:45:50] If we want to see actual impact on CPU usage I'm not sure that 1% is gonna cut it [23:46:18] If you do a few of the small wikis, you will get 15%+.. [23:46:25] see the thing is we can't figure out from existing profiling information what's going on (looking at testwiki when it was running 1.17) [23:46:40] Reedy: Well apparently the 10 largest wikis are 92% put together ... [23:46:43] (and comparing that to pre 1.17) [23:46:55] That was 1.17 with the cache epoch issue [23:46:59] RoanKattouw, and ~40% for 2-10 [23:47:10] Reedy: Right, so multiple top10 wikis [23:47:23] nlwiki is convenient but only 1.17% [23:47:39] <^demon> We could deploy the 820 smaller wikis for 8% ;-) [23:47:53] given that this is a Friday deployment, we shouldn't get too ambitious anyway [23:48:16] How many to get 25% of page views? [23:48:25] one if we choose the right one? [23:48:36] well, it's either twice over, or multiple to get 25% [23:48:41] 1% may be enough to do profiling on though [23:48:59] it gives you a reasonable picture of how caches fill [23:49:24] probably not one, actually. grrr [23:49:30] Yeah, for profiling it'll be fine [23:50:05] we can extrapolate to overall CPU usage [23:50:51] yup, we don't need absolute certainty from the Friday deploy; just more data than what we have today [23:50:59] If only 1% is running the new code, any CPU usage impact would be reduced by a factor of 100 [23:51:16] Making extrapolation hard [23:51:23] RoanKattouw: if it *is* noticeable, then that tells us something, doesn't it? [23:51:27] Yes [23:51:30] where did you get the figures for ja from? .. ah, he's gone [23:51:33] drat [23:51:41] If it's noticeable, we're probably in trouble [23:51:41] http://stats.wikimedia.org/EN/TablesPageViewsMonthly.htm [23:52:56] ????, ?? ?????? ?????????????? ?????????? ?????? ???????????????????? ???? ?????? ???????? ??????????
[23:52:59] grrrrrrr [23:53:14] someday someone will solve the keyboard layout issue in a way that doesn't suck [23:53:19] as much [23:53:29] we can extrapolate from profiling data to overall CPU usage [23:53:36] Right [23:53:40] just multiply by 100 [23:53:47] if the nlwiki deployment isn't noticeable, then we'll have to extrapolate from the profiling info, but we'll at least know we're not in for catastrophic failure on deploy [23:53:51] yes but boy the margin of error is going to get us [23:53:57] you wouldn't be able to pick up an nlwiki signal in the noise on ganglia [23:54:03] that's why you need profiling [23:54:14] Yea [23:54:17] or else split the apache cluster and send 1.17 wikis to a different set of apaches [23:55:33] That could work, although splitting the cluster 99%-1% would be a bit awkward I guess [23:56:00] my vote would be to rely on the profile data [23:56:02] But it would give us more information for sure [23:56:23] remember, we were planning earlier this week to deploy to the whole cluster without any of this data :) [23:56:39] right [23:56:57] but now what we know is, it's broke. and we want to watch it be broken so we can find and stab the problem... [23:57:19] sure....and during the second window (on Feb 15), we can still do a gradual rollout [23:57:36] Yeah [23:58:23] that's when we can say "alright....let's turn on dewiki and see what this thing can do" [23:58:58] mmm... I kind of thought on Tuesday we were supposed to be sure the site wouldn't go belly up, if we were gonna deploy [23:58:59] ...and then of course only swallow the enwiki elephant when we're good and ready [23:59:12] Yeah, dewiki is the 3rd largest wiki, and some of us speak German [23:59:17] *apergos sells tickets to people to watch robla swallow the en wiki elephant [23:59:20] Plus most Germans speak decent English anyway, it'll be fine
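To make the "just multiply by 100" extrapolation above concrete, here is the back-of-envelope arithmetic in code form (hypothetical numbers throughout, except nlwiki's ~1.17% share, which comes from the stats page cited earlier):

    <?php
    // Hypothetical extrapolation from one wiki's profiling data to the cluster.
    $nlwikiShare      = 0.0117; // nlwiki's share of page views (stats.wikimedia.org)
    $measuredCpuDelta = 0.4;    // hypothetical: extra CPU-seconds per second that
                                // the profiler attributes to nlwiki on 1.17
    // If per-request cost scales with traffic share, the cluster-wide impact
    // of deploying everywhere is roughly the measured delta divided by the share:
    $clusterDelta = $measuredCpuDelta / $nlwikiShare; // ~34 CPU-seconds per second
    echo "Estimated cluster-wide CPU delta: $clusterDelta s/s\n";

The caveat raised above still applies: the smaller the deployed share, the more the measurement noise is magnified by that division, which is why a 1% wiki is fine for profiling but poor for reading a CPU trend off ganglia.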