[00:36:03] !log aaron synchronized php-1.20wmf9/includes/filerepo/backend/FileBackendMultiWrite.php 'deployed c51a9a288b6dd5c0023a77f324c04707b23501c6' [00:36:12] Logged the message, Master [00:40:05] on wikipedia: [00:40:06] A database error has occurred. Did you forget to run maintenance/update.php after upgrading? See: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script [00:40:06] Query: REPLACE INTO `pc032` (keyname,value,exptime) VALUES ('zhwiki:pcache:idhash:2963962-0!*!*!*!*!4!zh!*','...','20130816003909') [00:40:06] Function: SqlBagOStuff::set [00:40:06] Error: 1637 Too many active concurrent transactions (10.0.0.221) [00:40:23] same here [00:48:22] binasher: super win! [00:49:02] hmm, log doesn't seem to be flooding [00:53:04] * binasher stops el purge script [00:54:23] AaronSchulz: it looks like you started working on sqlbag'o exception handling? :) [00:54:34] speak of the devil [00:54:51] "el purge" [00:55:08] * AaronSchulz was watching dberror log [00:55:25] binasher: what script? [00:57:10] AaronSchulz: purgeParserCache.php [00:57:46] makes sense [00:58:14] the log was flooding for spurts shortly after I said "hmm, log doesn't seem to be flooding" ;) [01:00:17] * AaronSchulz likes how fast grepping the logs is now [01:01:37] why is it faster? [01:04:55] * jeremyb waves TimStarling... do you want to review a couple of my config changes? [01:05:02] https://gerrit.wikimedia.org/r/#/q/project:operations/mediawiki-config+owner:jeremyb+status:open,n,z [01:21:05] danke Tim [01:21:19] !log tstarling synchronized wmf-config/InitialiseSettings.php 'remove wikimedia philippines event exemption' [01:21:30] Logged the message, Master [02:28:34] !log LocalisationUpdate completed (1.20wmf9) at Thu Aug 16 02:28:33 UTC 2012 [02:28:43] Logged the message, Master [02:41:33] Nemo_bis: LU works now... [02:56:34] !log synchronized payments cluster to 37e31eddf5a4 [02:56:43] Logged the message, Master [03:45:17] /clear [03:45:20] Erp. 
[03:46:09] * jeremyb claps [03:59:21] Louder. [13:42:19] I briefly got an error message about no slave available on enwikipedia, but it's gone now [13:42:36] Zomg slavery [13:43:02] free the slaves [13:43:41] Damianz/ TBloemink : Wikipedia is built on the hard work of slaves :P [13:43:48] :O [13:44:01] https://bugzilla.wikimedia.org/show_bug.cgi?id=39427 ? [13:44:11] BD error any on? [13:44:19] *anyone [13:44:23] e.g. Platonides ? [13:44:39] * bawolff is going to guess that's the same issue I had [13:44:54] unless it happens to still be there, mine went away [13:45:04] it is still there [13:45:11] Where's the error? [13:45:55] what do you mean where? [13:47:17] graphs look kind of spikey: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=MySQL%2520pmtpa&tab=m&vn= ( of course I don't really know how to interpret said graphs) [13:47:48] yes [13:47:54] let's find an op :) [13:49:00] apergos: ahmm ^ [13:49:08] That does seem to be a rather large chunk of load over the past hour. [13:51:49] All the graphs that are red seem to be s4 slaves [13:51:51] maybe related to the export of enwiki Asher was going to do? [13:52:07] enwiki is s1, so then it's not :P [13:52:54] Let's just blame Asher :D [13:53:11] db60 is en [13:55:02] db33 is s4 but has almost no load [13:55:33] hmm, it has less priority, too [13:55:59] it did a couple minutes ago though [15:41:08] Would you know the technical solution used on es.wikipedia.org to make image links go directly to Commons? [15:48:21] Dereckson, which one? [15:48:29] sample? [15:48:38] I probably implemented it :P [15:51:20] Someone like Reedy might want to take a look at https://gerrit.wikimedia.org/r/20035, patches a niggling regression. Minor, but that's the sort that annoys people most :) [15:53:15] Platonides: https://es.wikipedia.org/wiki/Kennedy_%28Minnesota%29 [15:53:36] links are directly to commons [15:54:00] if you disable javascript, links stay on es.
but I don't see anything in [[MediaWiki:Common.js]] [15:55:18] it's a gadget [15:55:33] https://es.wikipedia.org/wiki/MediaWiki:Gadget-a-commons-directo.js [15:56:17] Thank you. [15:57:23] you're welcome [16:19:18] tyteen4a03: fix your connection [16:21:01] 16 16:20:14 -!- mode/#wikipedia [+b *!tyteen4a03@*$##fix_your_connection] by jeremyb [16:22:34] !log reedy synchronized php-1.20wmf9/includes/QueryPage.php [16:22:43] Logged the message, Master [16:22:48] Jarry1250: ^ [16:22:50] Reedy: 20035 ? [16:22:57] yeah [16:23:34] Reedy: good shout, appears fixed [16:24:07] jup [16:30:08] jeremyb: noted [16:30:39] was trying out new perform settings, will stop bothering people now [16:30:58] tyteen4a03: k, just seemed to be a regular thing (not just today) [16:31:32] jeremyb: it happened a few days back, yes [16:31:51] I have no idea why my irc client retry connection for 99 times [16:32:26] well, just teach it not to excess flood. #freenode can help you figure that out probably (or tell you where the channel for your client is) [16:32:45] anyway, i have to run [16:32:49] jeremyb: see you [17:59:12] !log aaron synchronized wmf-config/filebackend.php 'Make internal reads come from swift for all wikis' [17:59:21] Logged the message, Master [18:00:51] and there it goes [18:02:37] whee!!! [18:04:26] * AaronSchulz watches ganglia [18:05:39] wow [18:05:41] holy HEAD request, batman! [18:05:45] swift queries / sec [18:05:56] eggzactly [18:06:21] yay [18:06:51] mmm, eggs [18:07:04] cpu is definitely up there [18:07:07] 40 [18:07:21] but... still leaves 60 [18:08:08] the 404 response times seem a little spiky maybe [18:08:32] * Nemo_bis runs to purge a bunch of DjVu to add some load [18:08:46] most of these HEADs are 404s [18:08:56] * apergos gives Nemo_bis an evil glare [18:09:07] AaronSchulz: what are all the 404 heads about? [18:09:13] which have to hit all 3 replicas and do not use memcached [18:09:43] aww [18:09:52] on the gets per second is the list in order of the stripes? 
[18:10:08] apergos: yes, I think so. [18:10:09] not sure, maybe people requesting thumbs for files that don't exist, or some bad url encoding or something, or worst case, heavily used files not in swift that should be [18:11:06] AaronSchulz: you mean thumbs for originals that don't exist, rather than just thumbs that don't exist, right? [18:11:23] apergos, AaronSchulz: I checked one of the symlinks and I can't find a file associated with it at all, not sure what's going on there. The files may have been deleted, or moved to commons, or the archives purged, or whatever [18:11:45] ok it's weird [18:11:49] I see some 404 urls in the logs [18:11:51] maplebed: I was thinking originals [18:11:58] how many bugs are fixed by just switching to swift? https://bugzilla.wikimedia.org/show_bug.cgi?id=28310#c8 [18:12:05] you have to stat the original to see if it exists before making the thumbnail [18:12:05] there shouldn't be any titles with non-UTF-8 in them any more except stuff that crept in long long ago and has not been touched since [18:12:20] apergos: Yeah these are all 2004/2005 [18:12:55] hmm [18:13:05] it's interesting that looking at the 404 latency graph the HEAD 404s are way low (much lower than they were) but the GET 404 latency has risen. I infer though, that the majority of the 404s are HEADs. [18:13:09] could you stuff em in a bugzilla report someplace so we don't lose em? [18:14:28] !log aaron synchronized wmf-config/swift.php 'Disabled redundant thumb copy/purge hooks.' [18:14:37] Logged the message, Master [18:14:39] smart [18:14:42] maplebed: ms-be3 has load avg in the range of 10 [18:14:42] forgot about those [18:14:43] fwiw, the new files gallery on commons is loading pretty fast. [18:14:53] paravoid: I think that's expected. [18:15:14] 1 and 2 are about the same [18:15:18] according to ganglia [18:15:45] 1-5 are all significantly higher than 6-12, the specific differences being no SSDs and more content. [18:15:57] ah I was about to ask about 7 9 11...
[18:15:57] heh [18:18:05] ms-be3 has insane i/o load imho [18:18:26] ms-be3 also just came back from being down for a day, so has some extra catching up to do. [18:18:37] oh [18:18:37] yes, I was expecting something like that [18:18:39] why was it down? [18:18:41] should stabilize in a while [18:18:44] I wouldn't object to actually stopping the object-server on that host in order to let it catch up more quickly. [18:18:47] random crash? [18:18:53] that's what I want to know, is how come these backends fall over [18:19:09] (with the object-server stopped, it'll fail-fast. and the proxy will ask a different server for the content) [18:19:45] how long does catchup usually take when it's loaded? [18:20:22] I'm not sure. I don't have many good numbers on that sort of thing. [18:20:46] well ballpark? few hours, a day [18:20:55] cause I have zero good numbers :-D [18:21:04] my tests were more around how long it takes a host with empty disks to join the cluster, [18:21:13] I see [18:21:15] not how long it takes when a host has been down for some amount of time. [18:21:34] I took ms-be1 out for a few hours yesterday after its reboot. [18:21:47] how long had it been down? [18:21:59] 12hrs maybe? [18:22:04] hmm [18:22:40] is it me or do I read that graph as saying 80% of hits 404? [18:23:23] paravoid: that's right. [18:23:30] looks about right yeah [18:23:53] erm, isn't that a problem? :) [18:23:58] potentially [18:24:03] that does seem excessively high. [18:24:35] there's a lot of hits with UA PHP-CloudFiles/1.7.10 [18:24:39] three requests that fail? before it gives up, is that right? [18:24:52] oh, all of them, that's MW, doh [18:24:56] heh [18:25:01] it better be [18:25:08] AaronSchulz: please roll back [18:25:16] awwww [18:25:19] the majority of the hits are for containers that don't exist. [18:25:24] uh oh [18:25:36] hmm, ok [18:25:37] how did you check for their existence? 
[18:25:42] it's hitting wikipedia-commons-shared-public.## [18:25:48] instead of wikipedia-commons-local-public.## [18:25:53] er? [18:26:11] what's shared-public and local-public? [18:26:11] paravoid: I tailed the log and eyeballed it (knowing the pattern of containers that exist) [18:26:18] I'm tailing the log too [18:26:23] me three [18:26:25] but the actual answer to your question is the swift cli and the list command. [18:26:28] I just don't know what I'm looking for [18:26:53] maplebed: and more specifically? login to which box and run what? (still a newbie :) [18:27:14] paravoid: np. log into either iron or any swift box (they are the only ones with the swift cli installed) [18:27:30] then http://wikitech.wikimedia.org/view/Swift/How_To#List_containers_and_contents [18:27:42] oh, slapped with the manual. heh, sorry :) [18:27:49] :D [18:28:04] so what is the shared-public space? [18:28:10] if you're logged in to iron, you'll have to replace the 127.0.0.1 with ms-fe.pmtpa.wmnet [18:28:16] apergos: the format for the containers is [18:28:17] !log aaron synchronized wmf-config/filebackend.php 'Use NFS for reads for shared-multiwrite' [18:28:27] Logged the message, Master [18:28:39] project-language-foo-zone [18:28:53] the zone is 'thumb' or 'archive' or 'public' (or 'temp'?) [18:29:04] public is where the originals are. [18:29:12] ok [18:29:16] and local/shared? [18:29:23] the foo part is something more mediawiki related, and I thought was only supposed to be local for our installation. [18:29:30] AaronSchulz can explain that part better. [18:29:39] but before we get into that, [18:29:50] I want to see if we can fix this and redeploy within our window. [18:29:55] ok [18:29:57] back in a sec. [18:30:06] sure thing [18:33:22] hey, are peons like myself allowed to edit stuff on wikitech.wm.org?
i wanted to expand the packaging articles with some useful tips, having acquired some experience this past week [18:33:30] maplebed: -U mw:thumbnail failed, seems to be -U mw:thumb, I presume it's a typo in the docs and fixing it [18:33:36] yes peons like us can edit [18:33:46] ask and you shall receive (an account with editing abilities) [18:34:00] ask here? [18:34:01] ori-l: yes please! and don't be afraid to ask me if you need any packaging help [18:34:17] paravoid: awesome, thanks! [18:34:20] yes. what is your wiki account name that you use generally? [18:34:25] apergos: ori.livneh [18:34:39] all lower case? [18:34:48] (probably it will make the first letter upper case anyways) [18:35:17] yeah, i think it's lowercase but the first letter is coerced [18:35:20] so either is okay [18:35:44] * ori-l brbs. [18:35:46] please give me an email address for this [18:35:56] before you go ori-l [18:35:59] maplebed: so? how are we moving forward? [18:36:14] apergos: ori@wikimedia.org [18:36:23] as soon as he's back we'll find out I guess [18:36:31] so Aaron has an idea of what he needs to change. He's going to do so and we'll try again. [18:36:52] the problem was that when other wikis use commons content, the region appears as 'shared' instead of 'local'. [18:37:05] you have mail ori-l [18:37:20] there's bound to be a reason for that [18:37:29] so the test of looking at commons special:newpages doesn't trip the bug; only looking at, say, an image on enwiki that's actually on commons is what fails. [18:37:30] so it's identifiable as form a remote repo I suppose [18:37:37] *from [18:38:28] so, wait [18:38:40] paravoid: re: the swift username, it's not the same on every cluster. you need to get both the username and password from the configs in order to query stuff. [18:38:51] the docs are just to give you an example in context. [18:39:07] (but it doesn't hurt to make it match, I suppose) [18:39:38] so, the "local" part in containers is basically useless, right? 
[18:39:47] it's always "local" in swift [18:40:12] yes. that got in there to make mediawiki happy. or in this case, sad. [18:40:17] heh [18:40:18] haha [18:40:43] and I'd presume it's basically impossible to rename containers? [18:42:00] hmm. I haven't done that before, but ... either way, renaming isn't what we want, because commons "local" is the same place as enwiki->commons "shared" [18:42:15] i.e. what's commons/shared to enwiki is commons/local to commons. [18:42:19] right, you want one naming scheme [18:42:20] if that makes any sense. [18:42:37] and any translation gets done before hitting the swift backend [18:43:21] right. the two translation places are rewrite.py (within the swift proxy) and mediawiki. currently they disagree on how to translate the enwiki-hitting-commons case. [18:43:35] woopsie [18:43:40] by renaming I meant dropping the "-local" part since it's obviously not local [18:44:49] well, the first iteration of swift actually didn't have the -local part. it was added specifically to make mediawiki happy. so I doubt that aaron would want us to drop it as part of the solution. 
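A minimal sketch of the container naming discussed above, assuming the project-language-foo-zone pattern quoted in the channel and the rule that Swift only ever stores the 'local' form. The function name, the zone list (taken from the zones mentioned in this log), and the translation step are illustrative assumptions, not the actual MediaWiki or rewrite.py code.

```python
# Hypothetical helper: builds a Swift container name of the form
# <project>-<language>-<repo>-<zone>[.<shard>], as described in the log.
# The zone names are the ones mentioned in the channel; treat the list
# as an assumption rather than an authoritative set.
ZONES = {"public", "thumb", "archive", "temp", "deleted"}


def container_name(project, lang, repo, zone, shard=None):
    """Return the container a request should hit in Swift."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    # Files exist once in Swift, under the owning wiki's 'local' repo.
    # When another wiki reads them, MediaWiki calls the repo 'shared',
    # so the name is mapped back to 'local' before it reaches Swift.
    if repo == "shared":
        repo = "local"
    name = f"{project}-{lang}-{repo}-{zone}"
    return f"{name}.{shard}" if shard is not None else name


if __name__ == "__main__":
    # enwiki reading a commons original and commons reading its own file
    # should end up in the same container.
    print(container_name("wikipedia", "commons", "shared", "public", "11"))
    print(container_name("wikipedia", "commons", "local", "public", "11"))
```

Both calls print wikipedia-commons-local-public.11, which is the container the log shows as existing; the failing requests were the ones that kept the 'shared' form.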
[18:45:25] mw needs to be able to distinguish between remotefilerepo things and localfilerepo things [18:45:55] the fact that in our case our remote repo is a local repo of one of the projects is an accident [18:45:57] so to speak [18:46:58] well, swift is a data store, the concept of locality doesn't make sense for swift aiui [18:47:42] no, it doesn't, it's all on the mw side [18:48:10] while that's true in our installation, what apergos says holds, in that enwiki is referencing an image in a remote wiki (just as any random wiki out there on the internet could too) [18:49:25] 'local/shared' are somewhat legacy names, but changing everything wasn't worth it [18:49:56] site-lang-local means "the main repo for site-lang"...other repos in the future would have better names, like "math" or something [18:50:22] oohhh math, how nice is that [18:50:31] (and no that's not sarcasm) [18:50:56] AaronSchulz: I thought deleted went somewhere other than public. [18:51:06] (cuz otherwise you could still read them, no?) [18:51:13] maplebed: hmm? [18:51:21] thumb going to local-public? [18:51:40] ah. patchset 2 is better. [18:51:41] that typo fix was being committed as you said that [18:51:54] one sec before you merge [18:52:08] I tend to make config changes via gerrit, and review the diff a second time before merging [18:52:14] so I already noticed that wtf [18:52:15] right [18:52:49] I just wanted to check that those containers actually exist. [18:52:58] and they do, complete with deleted having more shards. [18:53:08] so carry on. [18:54:01] (apergos and paravoid, public, thumb, and temp all have two hex digits for the shard but deleted has two alphanumerics (a-z0-9)) [18:54:21] I vaguely remember discussing the shard setup even :-) [18:54:25] errr... [18:54:28] no, that's not accurate. [18:54:31] ok that change is on testwiki now, not synced elsewhere [18:54:52] two hex digits?
[18:55:16] well whatever corresponds to the two levels of hash we have on the flat filesystem anyways [18:55:21] AaronSchulz: the deleted containers only go up to .tw. is that correct? [18:56:27] in theory up to z*, but I noticed that they don't go higher than tw or something in NFS when I was copying files over, meh [18:57:53] makes me uneasy [18:58:36] well, we can make the rest later, but if it only goes up to .tw in NFS, that's good enough for me. [18:58:39] !log aaron synchronized wmf-config/filebackend.php 'Fixed container mapping for shared repo' [18:58:49] Logged the message, Master [19:00:11] hrm [19:01:12] I see valid requests coming in [19:01:13] hrm? :) [19:01:32] eg GET /v1/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/wikipedia-commons-local-public.11/1/11/Cc-by_new_white.svg HTTP/1.0 200 - PHP-CloudFiles/1.7.10 [19:04:26] AaronSchulz: so you've switched it again? [19:04:42] yeah, see the log entry in -operations. [19:04:43] not yet [19:04:55] orly? [19:05:13] $readFromSwiftShared is still false [19:05:28] then why am I seeing GETs to the -public buckets? [19:05:43] from commons [19:06:16] but not from enwiki reading from commons? (I am guessing from the name of the config var instead of looking at the code :-P) [19:07:40] so... I've got lunch scheduled at noon and will have to bail soon. [19:08:23] okay, any tips before you go? :) [19:08:39] * AaronSchulz saw him leave :) [19:08:57] uh, okay [19:09:23] trying to get file listings on commons from testwiki in eval.php seems to have some sort of error [19:09:31] not sure why though [19:11:16] what command are you giving? [19:11:36] just what we were looking at. [19:11:46] logs at /home/w/log/syslog/swift (on fenari) [19:11:47] http://pastebin.com/vy4YT5jk [19:11:57] and trying to make sense of what queries come in. [19:12:13] I'll be back at 1:15-1:30ish. [19:12:18] ok [19:12:48] what's it whining with?
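A rough sketch of how the "two levels of hash" relate to the request seen above (container suffix .11, object path 1/11/Cc-by_new_white.svg). It assumes MediaWiki's usual md5-based hash directories and that the public container shard is simply the second-level directory; the function names and the default container base are hypothetical, and the deleted zone, which the channel notes is sharded differently, is not covered.

```python
import hashlib


def hash_path(filename, levels=2):
    """Return '<h1>/<h1h2>/<filename>' using the md5 of the file name.

    Assumes the name is already in its stored form (underscores, not
    spaces), as in the request line quoted in the log.
    """
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # Level 1 is the first hex digit, level 2 the first two hex digits.
    dirs = [digest[:i] for i in range(1, levels + 1)]
    return "/".join(dirs + [filename])


def public_object(filename, container_base="wikipedia-commons-local-public"):
    """Return (container, object path) for an original file.

    container_base is an illustrative default, not read from any config.
    """
    rel = hash_path(filename)
    shard = rel.split("/")[1]  # the two-hex-digit directory
    return f"{container_base}.{shard}", rel


if __name__ == "__main__":
    container, rel = public_object("Cc-by_new_white.svg")
    print(f"/v1/AUTH_<account>/{container}/{rel}")
```

With two hash levels this gives 256 possible shards per zone, which matches the "two hex digits" description for public, thumb, and temp.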
[19:13:28] getFileList() gives null, which is for errors [19:13:40] ohjoy [19:14:02] if I run it on commonswiki using local-multiwrite, it's ok [19:14:20] so it's get mapping somehow [19:14:22] meh [19:14:52] * AaronSchulz saw nothing in the swift-backend log [19:15:00] apergos: thanks! [19:15:05] ori-l: yw [19:16:29] no fail, no success no nothing [19:16:32] that's a problem [19:18:15] where's the config for testwiki? (sorry but it's been a while) [19:18:37] which config? [19:19:03] where the swift repo would be enabled [19:19:19] maybe it's a bunch of conditionals in the regular configs [19:19:31] wait a minute [19:20:12] brb [19:23:52] !log aaron synchronized wmf-config/filebackend.php [19:24:01] Logged the message, Master [19:24:59] what did you change? [19:25:42] the shared-swift shard config [19:25:55] it had the same problem with shared vs local [19:26:08] ohhh [19:26:14] * apergos looked at the diff [19:26:19] my eval.php test works with '$backend = FileBackendGroup::singleton()->get( 'shared-swift' );' now [19:26:25] heh [19:26:27] I guess so [19:26:33] oddly enough NFS gives null [19:26:37] * AaronSchulz still debugs [19:27:06] gah, same problem basically [19:27:08] * AaronSchulz fixes [19:30:56] !log aaron synchronized wmf-config/filebackend.php [19:31:03] ok that works now [19:31:05] Logged the message, Master [19:31:31] apergos: I think I can turn on $readFromSwiftShared now [19:32:15] apergos: any objections?
:) [19:32:53] go ahead [19:34:29] !log aaron synchronized wmf-config/filebackend.php 'Set $readFromSwiftShared = true;' [19:34:39] Logged the message, Master [19:36:08] (I'm here too) [19:36:31] okay, 404s looking good [19:36:34] even lower than they were [19:37:59] * AaronSchulz likes the 4hr ms7 load graph, not that it changed much this minute [19:38:28] no, try a shorter interval if you want change [19:40:35] well, 2h too, the other ones are kind of noisy [19:41:25] ah right [19:42:47] ms-be1011 has a 33% 404 rate, ms-be10 a 18.2%, both of them noticeably higher [19:50:57] so... [19:51:10] * AaronSchulz is hungry ;) [19:52:04] so I guess we are done then [19:52:12] I guess so [19:52:16] yay [19:52:20] I'll investigate the anomalies [19:52:28] there are other opsen around in the sf timezone if things get weird [19:52:31] but I don't think we need you for anything, go have lunch! [19:52:56] the be nodes never seem balanced IMO [19:53:04] would be nice if they were [19:53:31] ok but we don't have to solve that tonight [19:53:42] heh [20:00:15] apergos: i don't seem to have permission to edit the Pbuilder article [20:00:23] ah woops [20:00:24] sec [20:01:30] try that again please [20:01:44] apergos: works :) [20:01:45] ty [20:01:47] I always forget to set the rights afterwards [20:03:42] think I am calling it quits for the day [20:04:39] graphs in ganglia look reasonable to my eye [20:04:48] hi apergos paravoid [20:04:54] hi [20:04:56] ah hello [20:05:11] well once I hear from you that things look good, then I'll vanish [20:05:54] aren't we already beyond the last time for an 8h sleep [20:06:20] not yet [20:06:29] but I won't go directly from computer to sleep either [20:10:29] things look good from the log perspective. [20:10:34] now to poke at commons and enwiki a bit [20:11:36] ok [20:12:43] maplebed: did you see my comment above? [20:12:59] err... [20:13:00] looking. [20:13:12] oh, the difference in hit rate?
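Instead of tailing the log and eyeballing it, the 404 share per container could be tallied with something like the sketch below. It assumes log lines contain a fragment shaped like the request quoted earlier in the channel (METHOD /v1/ACCOUNT/CONTAINER/OBJECT HTTP/1.x STATUS ...); the real syslog format may carry extra fields, and the regex, the function name, and the sample lines are illustrative only.

```python
import re
from collections import Counter

# Matches the request fragment seen in the channel, e.g.
# "GET /v1/AUTH_x/wikipedia-commons-local-public.11/1/11/F.svg HTTP/1.0 200".
# Group 1: method, group 2: container, group 3: HTTP status.
LINE_RE = re.compile(
    r"\b(GET|HEAD|PUT|POST|DELETE) /v1/[^/\s]+/([^/\s]+)\S* HTTP/\d\.\d (\d{3})\b"
)


def not_found_rates(lines):
    """Return {container: fraction of requests answered with 404}."""
    total = Counter()
    missing = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a request line in the assumed format
        container = match.group(2)
        total[container] += 1
        if match.group(3) == "404":
            missing[container] += 1
    return {name: missing[name] / total[name] for name in total}


if __name__ == "__main__":
    # Made-up sample lines in the assumed format, not real log data.
    sample = [
        "HEAD /v1/AUTH_test/wikipedia-commons-shared-public.11/1/11/A.svg HTTP/1.0 404 - PHP-CloudFiles/1.7.10",
        "GET /v1/AUTH_test/wikipedia-commons-local-public.11/1/11/A.svg HTTP/1.0 200 - PHP-CloudFiles/1.7.10",
    ]
    for name, rate in sorted(not_found_rates(sample).items()):
        print(f"{name}: {rate:.0%} 404")
```

A container that shows close to 100% 404s across many requests is a hint that the name itself is wrong, which is how the shared-public versus local-public mismatch showed up here.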
[20:14:05] ms-be10 only came into service last week, so the difference doesn't surprise me. I think there is something not quite right with it though, because the network traffic is so radically different from the others. [20:14:09] I was gonna look at it after. [20:15:08] commons tests seem to be passing. [20:17:53] I can't see anything wrong. I'll post to the commons email list just in case. [20:20:45] binasher: http://en.wikipedia.org/wiki/Special:ArticleFeedbackv5Watchlist?ref=watchlist [20:20:49] all garbage :) [20:20:55] at least for me [20:21:29] 100% garbage! heh [20:22:01] watch this becomes the second largest consumer of space after revision [20:22:40] we need a distributed garbage store [20:22:59] hah [20:23:17] it could also be very lossy and use UDP and no checksums since if data is inconsistent, no one notices [20:24:38] AaronSchulz: limewire? [20:25:03] http://dev.mysql.com/doc/refman/5.1/en/blackhole-storage-engine.html [20:25:19] binasher: excellent write performance [20:25:26] for write-heavy loads [20:25:31] the best [20:25:57] apergos: I'm happy. I can't find anything not working. I'd say head off to nighty nighty land. [20:26:03] we'll wait for error reports. [20:27:42] night then! [20:27:45] thanks [20:27:48]