[01:37:04] are the devs aware that the API is currently acting funny? [01:52:07] oh dear I've found it: [01:52:20] it is a bug indeed [01:52:43] the 'repository' characteristic of images is not being returned [03:05:11] ugh it's only appearing after so many characters in the query [03:05:16] so hard to pin down [03:05:18] c'est la vie [04:14:30] ok, who broke the site ? [04:15:33] what's broken? [04:15:47] rendering cluster paged me [04:16:07] ipv6 en.wikipedia works for me [04:16:17] how much caching does wikipedia have? [04:16:25] lots [04:16:31] like, if it went down, would anyone notice :P [04:16:46] it would only knock offline new rescaling of images [04:18:29] swift is unhappy :( [04:21:48] more unhappy than usual? [04:21:58] what was the fix last time? [04:22:59] i have no idea ... [04:23:04] yeah, the ms-fe's are all totally in swap [04:23:07] like hardcore [04:24:06] also i just got back from dinner with a fair amount of wine .. worst time to have to debug [04:24:42] I'll look [04:24:52] i just restarted swift proxy-server on ms-fe4 [04:24:56] ms-fe3 is having the same problem though [04:24:59] and has been untouched [04:25:12] i just logged out of its serial [04:28:39] !log on ms-fe3: restarting swift-proxy due to swap [04:28:51] Logged the message, Master [04:31:19] still trying to get a shell on ms-fe2 [04:31:21] TimStarling: i love we came to the same conclusion - did you figure out anything ? 
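The "ms-fe's are all totally in swap" diagnosis above can be confirmed programmatically on Linux by reading the VmSwap field from /proc/&lt;pid&gt;/status. A minimal sketch; the function names and the 512 MiB threshold are illustrative inventions, not anything swift or the ops tooling actually uses:

```python
def swap_kb_from_status(status_text):
    """Extract VmSwap (in kB) from the text of /proc/<pid>/status.

    Returns 0 if the kernel does not report a VmSwap line.
    """
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            # Format: "VmSwap:     123456 kB"
            return int(line.split()[1])
    return 0


def badly_swapped(status_text, threshold_kb=512 * 1024):
    """Heuristic: flag a process whose swapped-out size exceeds the threshold."""
    return swap_kb_from_status(status_text) > threshold_kb
```

In practice you would read `/proc/<pid>/status` for each swift-proxy worker and restart the service when `badly_swapped` fires.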
[04:31:30] http://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&m=mem_report&s=by+name&c=Swift+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=3 [04:31:37] this is obviously a regular problem [04:31:44] yeah [04:31:51] very sawtoothed graph [04:31:51] nobody restarted it over the weekend, so it exploded [04:32:50] TimStarling: interesting , so ms-fe1 is precise and there's no sawtoothedness/swapdeath [04:33:22] also no CPU or network traffic [04:33:32] probably easier to avoid leaking memory when you're not doing anything ;) [04:34:16] !log on ms-fe2: restarted swift-proxy due to mem leak [04:34:27] Logged the message, Master [04:35:07] it's nice that they were reasonably responsive while they were swapping [04:35:22] often I just give up waiting and power cycle machines that are in swapdeath [04:36:00] it's easy enough to fix this semi-permanently, you know [04:36:30] just run swift-proxy from a restart loop, and disable swap [04:36:42] restart loop -- [04:36:50] disabling swap is not necessarily a bad idea [04:36:50] yeah, [04:36:53] #!/bin/bash [04:36:55] while true; do [04:36:59] hah [04:37:00] swift-proxy [04:37:03] sleep 1 [04:37:05] done [04:38:06] some people don't like restart loops, they think we should fix the applications [04:38:13] it's not very elegant [04:38:33] but it's like 3 lines of code and it'll fix the problem so well that nobody will even notice it's broken [04:39:32] i prefer the fix the problem solution myself [04:40:29] anyways, i'm off - thanks tim :) [04:43:25] bye [04:54:49] https://commons.wikimedia.org/wiki/Special:NewFiles [04:55:39] https://upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Thandie_Newton_2%2C_2010.jpg/92px-Thandie_Newton_2%2C_2010.jpg [04:55:47] I'm getting a Python traceback there. [04:57:13] TimStarling: ^ [04:58:48] seems kind of wrong to cache a 500 error [05:10:18] I'm also getting "HTTP Error 504: Gateway Time-out" when trying to upload through the API. 
[05:11:39] Brooke: it fixed itself [05:12:12] For that image, yes. [05:12:26] I tried loading an image on Commons and got: [05:12:30] A database error has occurred. Did you forget to run maintenance/update.php after upgrading? See: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script [05:12:30] Query: SELECT 1 FROM `image` WHERE img_name = 'New_Government_of_the_U.S._-_NARA_-_5730035.jpg' LIMIT 1 FOR UPDATE [05:12:31] Function: LocalFile::lock [05:12:32] Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.0.6.41) [05:18:28] Hi all, if anyone's around, I'm unable to upload a small 3 MB file to Commons at the moment. I'm using the basic upload form and I get a timeout. Is this a known issue? [05:18:50] Request: POST http://commons.wikimedia.org/wiki/Special:Upload, from 69.214.171.3 via cp1011.eqiad.wmnet (squid/2.7.STABLE9) to 10.64.0.131 (10.64.0.131) [05:18:50] Error: ERR_READ_TIMEOUT, errno [No Error] at Mon, 15 Oct 2012 05:03:19 GMT [05:19:03] I'm looking at it [05:19:09] Dmcdevit reported it also [05:19:10] Okay thx :-) [05:19:44] FYI russavia is reporting the same issue [06:26:44] TimStarling: FYI some thumbnails are also failing to generate, I get the following stack trace: http://pastebin.com/Jk8Ayjnq [06:52:57] bbl [07:06:39] I get a message that the servers are overloaded [07:26:20] yes, they are overloaded [07:31:38] ms-be11 is clearly out of workers [07:32:01] sorry for not getting to this earlier, this is my first time looking at swift ops [07:33:45] !log experimentally doubling the worker count on ms-be11 since wchan indicates that the worker pool is exhausted [07:33:56] Logged the message, Master [07:37:55] !log ms-be11 showed an immediate improvement in bandwidth out, but wchan still indicates that 48 is not enough, increasing to 100 [07:38:06] Logged the message, Master [07:53:38] !log increasing worker count to 100 on all swift backends, via puppet [07:53:49] Logged the message, Master [08:11:06] any NoSQL 
experts present? [08:18:25] mystery solved, it's all about Gertie the Dinosaur [08:20:32] hi TimStarling , sorry I know nothing about NoSQL / Swift / reddis etc [08:20:57] it's ok, I was only asking so I could troll them [08:21:03] just heard and looked at the concept [08:21:29] aren't you working on migrating us from memcached to Reddis ? [08:21:34] what's it called when you get slashdotted from one of those special event google banners? [08:21:41] because I think "googled" is already taken [08:22:56] go to google.com, what do you see? [08:23:16] it takes a while, it's a little game [08:23:38] I see some comic strip coming from the 1920's [08:23:41] little nemo [08:23:43] googledotted? [08:23:52] those banners have a name I think [08:24:04] doodles [08:24:20] anyway when you click it enough times it takes you to http://en.wikipedia.org/wiki/Winsor_McCay [08:24:21] so you could probably say that we have been "doodled" ;D [08:24:33] and now millions of people are trying to play the video there, which is Gertie the Dinosaur [08:24:43] and that is hosted on sdd of ms-be11 [08:24:54] and as a result, sdd of ms-be11 is massively overloaded [08:25:00] apparently we don't have any caching or anything [08:25:11] so we just have to serve the file straight out of disk, uncached [08:25:13] I think we used to have cache of videos [08:25:20] but got disabled cause of some trouble with varnish [08:25:40] no [08:25:48] ahh mark will tell :-] [08:25:55] ah yes, it's 57MB, that is the size of the object file on the backend [08:26:03] http://paste.tstarling.com/p/HlwRSm.html [08:26:05] holy f*ck [08:26:13] tim the investigator [08:26:14] see, ms-be11 has 1300 FDs open for this video [08:26:37] amazing [08:27:08] well, it would have had a lot less, but I quadrupled the number of workers on it [08:29:15] i'm getting a cached hit from the squids, even in the frontends [08:29:24] but it's loading super slow for some reason [08:29:26] try with a Range? 
[08:29:57] cheers Tim, it would have taken me quite a while to find that out [08:30:05] among other reasons because I never visit google's frontpage [08:30:35] i'm still waiting for the full object to come in [08:30:42] I have been working on it for a few hours [08:30:58] I didn't visit google's front page until I found it in the squid referrer logs [08:31:24] could it be that squid request the file in parallel and never manage to cache it ? [08:31:55] hmm, rendering.svc is still down according to nagios [08:31:59] what I want to know is: why isn't ms-be11 serving it out of the kernel cache? [08:32:06] the load on image scaler is nothing [08:32:13] I think the backends are the root cause there [08:33:23] iostat shows it pumping out 40MB/s from the underlying device [08:34:14] rendering doesn't work [08:34:38] sure, but I think if you fix the Gertie issue, rendering will start working [08:41:07] TimStarling: so imagescalers have too many requests waiting because swift's slow because ms-be11 is slow? [08:41:25] is that your working theory? [08:41:37] swift is certainly very slow to respond to any queries [08:42:45] shouldn't maybe gertie the dinosaur be removed from the article for now...? [08:46:35] I see swift doing a lot of fadvise64(.., POSIX_FADV_DONTNEED) calls [08:46:44] yeah, I've noticed those too [08:46:46] not sure yet on what kind of FDs [08:46:56] and asked the swiftstack people when I met them a month ago [08:47:11] they said that they do that sometimes, depending on the file size [08:47:16] argh [08:47:17] didn't exactly understand why [08:47:34] perhaps we should hack them out in swift on ms-be11 [08:49:36] read += len(chunk) [08:49:36] if read - dropped_cache > (1024 * 1024): [08:49:36] self.drop_cache(self.fp.fileno(), dropped_cache, [08:49:36] read - dropped_cache) [08:49:36] dropped_cache = read [08:50:39] are you kidding me [08:51:10] TimStarling: so imagescalers have too many requests waiting because swift's slow because ms-be11 is slow? 
[08:51:34] yes, on ms-be11 I saw a lot of established connections from rendering.svc [08:52:04] ok, maybe I saw those on the frontend, come to think of it [08:52:11] yeah [08:52:29] so, 57M is certainly bigger than 1M [08:52:30] but yes, my theory is that a single slow hard drive will eventually suck up all available rendering threads and a good deal of general swift cluster resources [08:52:32] hence the fadvise [08:52:41] hence the dropped cache [08:52:42] due to long timeouts and lack of concurrency limits [08:52:48] what the fuck [08:52:52] let's hack that out [08:52:56] okay. [08:52:58] doing it [08:52:58] yes, hack it [08:54:17] done, swift restarting [08:54:21] done [08:54:26] sdd utilisation dropped [08:54:35] it's similar to the other drives now [08:54:58] iowait down [08:55:04] no, it takes a few minutes to repool [08:55:21] ok, then it dropped for being not in use ;) [08:55:28] I think the frontends must declare it down [08:56:09] so now probably some other backend is having the problem [08:56:23] it serves more traffic than other backends atm [08:56:53] it's repooled now, there was a jump in the network out [08:57:15] indeed [08:57:26] iowait is still normal [08:57:29] sdd utilisation back at 100% [08:57:37] ah now it's not [08:57:48] yeah it looks fine now [08:58:00] I'm looking at ganglia [08:58:06] iostat with a short polling interval is always noisy [08:58:09] right [08:58:12] was about to say that [08:58:18] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms-be11.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1350291301&g=cpu_report&z=large&c=Swift%20pmtpa [08:58:20] sometimes I use iostat -xd 30 [08:58:35] well it did that for about 30s [08:58:43] perhaps to load that file once ;) [08:58:45] so, mark, add that to the list of how we're not prepared for videos [08:58:49] i did [08:59:52] KEEP_CACHE_SIZE = (5 * 1024 * 1024) [09:00:00] if response.content_length < KEEP_CACHE_SIZE and \ [09:00:01] 'X-Auth-Token' not in 
request.headers and \ [09:00:01] 'X-Storage-Token' not in request.headers: [09:00:01] file.keep_cache = True [09:00:21] which shortcuts self.drop_cache to nothing [09:00:32] but in this case it's a noop, since we do have an X-Auth-Token [09:00:42] not that it would matter, as we're above 5M too [09:01:22] so... ms-be11 is normal again, but swift is still slow as hell [09:01:25] and rendering is still down [09:01:44] the netapp copy finished btw, and the netapps are now doing snapmirror in sync mode [09:01:52] which possibly slows down nfs access [09:02:10] ms-be12 needs the same trick, sdf there is overloaded [09:02:39] doing ~53 MB/s out of sdf alone [09:03:43] done [09:03:50] (and thanks) [09:03:58] my test when I started: [09:03:59] Connect time: 0 ms [09:03:59] Request to response headers: 7 ms [09:03:59] Request to first data byte: 8 ms [09:03:59] Received 59646368 bytes, at 54000 bytes/s average [09:04:00] Request to end of data: 1099700 ms [09:04:00] Total time: 1099700 ms [09:04:18] i'm quite liking my new http test script, I've used it for various purposes already ;) [09:04:20] it's Gertie also: "GET /sdf1/24874/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/wikipedia-commons-local-public.3b/3/3b/Gertie_the_Dinosaur.ogv" [09:05:42] and ms-be1: /sdl1/24874/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/wikipedia-commons-local-public.3b/3/3b/Gertie_the_Dinosaur.ogv [09:06:17] debstack can get to work on that package ;) [09:11:27] mark: is that script in git? [09:11:32] no [09:11:47] it's only in /home/mark/firstbyte.py [09:12:21] it was the quick and dirty script I wrote after our http test tools discussion [09:13:21] i can clean it up a little and put it in git if you want [09:14:31] I think it would be useful [09:15:03] ok [09:19:42] that video still loads very slowly [09:20:01] maybe we are hitting another limit [09:20:18] you mean from swift? [09:20:22] or squids? 
[09:20:25] no from the squids [09:20:31] swift is very slow to load in everything [09:20:32] yeah, the CARP-balanced squid [09:20:39] ah better now [09:20:40] Received 59646368 bytes, at 1007000 bytes/s average [09:20:47] 1 MB/s [09:21:00] still not great, from fenari [09:21:32] swift timeouts completely argh [09:21:33] sq85 is the carp balanced squid [09:22:54] ms-be1 is also i/o waiting like crazy [09:22:57] live hacking there too [09:23:02] yeah tim said tht [09:23:14] oh I missed it [09:24:05] !log live-hacking swift on ms-be10, ms-be12, ms-be1 to remove fadvise calls [09:24:16] Logged the message, Master [09:24:17] i'm hungry, i need breakfast [09:24:31] I need coffee [09:24:37] yes that too [09:24:37] TimStarling: around? [09:24:42] breakfast includes coffee [09:24:50] liangent: yes [09:25:02] hmm now i'm getting a MISS from sq85 for some reason [09:25:10] it doesn't seem overloaded [09:25:15] no [09:25:20] TimStarling: some users are racing article creation with bots [09:25:37] i wonder if it's just evicting from its cache really quickly [09:25:40] including users from zh, vietnam, swedish [09:26:07] liangent: malicious users? [09:26:08] not sure whether this is affecting job queue or system load [09:26:20] or just normal articles? [09:26:27] TimStarling: normal articles [09:26:39] usually created from database [09:27:04] most are about towns currently [09:27:33] they just don't want to see their language get lower ranked in wikipedia rankings by article number [09:27:54] I think it's ok, as long as they don't use country data templates [09:28:06] and as long as the bots are single-threaded [09:28:21] TimStarling: zhwiki job queue rised by factor 12 since we talk about it last time. my bot wasn't active in template namespace since that time. can you have a look why so many new jobs were added? 
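The CARP balancing mentioned above (every request for a given URL lands on the same parent squid, here sq85) can be illustrated with a highest-random-weight (rendezvous) hash, the scheme CARP is based on: each frontend hashes (cache, url) and picks the cache with the highest score, so the mapping is deterministic without any shared state. Real squid CARP additionally applies per-cache load-factor weights, omitted in this sketch:

```python
import hashlib


def carp_pick(url, caches):
    """Pick the parent cache for a URL by highest-random-weight hashing:
    score every (cache, url) pair and take the maximum. Every frontend
    computes the same answer, so one hot URL always maps to one parent."""
    def score(cache):
        digest = hashlib.md5((cache + url).encode()).hexdigest()
        return int(digest, 16)
    return max(caches, key=score)
```

This determinism is also the downside seen in the log: one viral video concentrates all its traffic on a single CARP parent.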
[09:28:24] we don't want people attempting to insert large numbers of articles concurrently [09:29:17] HTTP/1.0 504 Gateway Time-out [09:29:17] Server: squid/2.7.STABLE9 [09:29:17] Date: Mon, 15 Oct 2012 09:28:55 GMT [09:29:17] Content-Type: text/html [09:29:17] Content-Length: 1346 [09:29:18] X-Squid-Error: ERR_CANNOT_FORWARD 11 [09:29:18] X-Cache: MISS from sq85.wikimedia.org [09:29:19] X-Cache-Lookup: MISS from sq85.wikimedia.org:3128 [09:29:55] yeah swift is dead [09:29:57] no idea why [09:30:02] looking [09:30:45] Merlissimo: we don't have logs of job queue insertions, only job queue removals [09:30:55] TimStarling: hmm some or articles created by them are using that french "database" templates [09:31:36] TimStarling: but the account name is added to the database table, i think [09:32:19] sq85 isn't caching that video for very long, that's for sure [09:32:29] looking at the database is more helpful, no username though [09:32:49] but there is an insertion time [09:32:52] then another bot is chasing the creation bot to add iw links... [09:36:03] swift latency seems to be normal, it's imagescalers that are not responding [09:36:34] I'm looking at srv220, 465 established connections, 20 apache processes (MaxClients is 20) [09:37:45] probably a backlog of not yet thumbed images [09:39:00] could be, although they're relatively idle in CPU [09:39:23] seems better than before though [09:39:25] yeah very [09:41:50] i restarted apache on srv224 [09:42:02] to see what effect it would have [09:44:01] there are lots of errors about curl getting from swift in the logs [09:44:02] but not anymore [09:44:12] backlog is sane [09:44:12] Oct 15 08:43:21 srv224 apache2[9598]: PHP Warning: SwiftFileBackend::getLocalCopy: Invalid response (): (curl error: 18) transfer closed with 3915839 bytes remaining to read: Failed to obtain [09:44:13] valid HTTP response. 
in /usr/local/apache/common-local/php-1.21wmf1/includes/filebackend/SwiftFileBackend.php on line 1364 [09:44:46] okay, I'm going to restart the rest [09:44:47] then [09:45:12] i assume it's gonna be fine soon [09:46:19] and there is the text [09:46:56] yeah I'm restarting apaches [09:48:26] alright [09:48:33] we'll have to investigate the caching of videos more [09:48:51] but breakfast doesn't need to wait for that [09:48:56] !log restarting all imagescaler apaches, did not recover after swift outage [09:48:57] so i'll be back later ;) [09:49:07] Logged the message, Master [09:49:22] are you gonna patch out the fadvise thing in the package? [09:49:44] probably [09:49:56] I was actually thinking of upgrading swift this week to a newer version [09:50:06] the swiftstack people promised to help with the leaks too :) [09:50:20] ok [09:50:29] brb too [09:50:35] want coffee. need coffee. [09:50:40] same here [09:53:12] Merlissimo: refreshLinks2 jobs can be split into smaller jobs if they have more than 10 pages in them [09:54:19] for example there was a [[Template:Country_data_United_Kingdom]] that was split into 50 jobs [09:56:06] that's the usual split, actually, $wgUpdateRowsPerJob / RefreshLinksJob2::MAX_TITLES_RUN = 50 [09:56:37] so as the job runners hit expensive jobs, the job queue size appears to expand by a factor of 50 [09:59:23] TimStarling: would it be possible to delete jobs manually caused by me after i moved all country data iws to subpages? i could do this in a way that jobs created by this change won't have any effect.
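The job-splitting arithmetic above ($wgUpdateRowsPerJob / RefreshLinksJob2::MAX_TITLES_RUN = 50) can be sketched as a simple batching function. This mirrors the numbers quoted in the log, not MediaWiki's actual implementation:

```python
def split_refresh_links_job(titles, max_titles_run=10):
    """Model how an expensive refreshLinks2 job fans out when a runner
    picks it up: a job carrying more than max_titles_run titles is split
    into batches of that size, so a 500-title job ($wgUpdateRowsPerJob)
    becomes 500 / 10 = 50 queue entries."""
    if len(titles) <= max_titles_run:
        return [titles]
    return [titles[i:i + max_titles_run]
            for i in range(0, len(titles), max_titles_run)]
```

This is why the queue size "appears to expand by a factor of 50" once runners reach the expensive template jobs: the backlog was always there, just counted as fewer rows.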
[10:00:14] it would be possible to remove all the refreshLinks jobs for country data templates [10:00:20] it's not so easy to tell why each was inserted [10:01:33] but I could remove them all and then reinsert a single copy [10:01:37] as a kind of duplicate removal [10:02:39] I guess all refreshLinks2 insertions should do that kind of duplicate removal [10:03:04] TimStarling: then i'll first write an update script that moves the langlinks all at once and then you could do this. [10:03:16] ok [10:04:06] liangent: should i announce this on zhwiki first? [10:05:39] not sure, anyway it won't affect most users, and all users affected should be technical [10:05:59] Merlissimo: or let me announce it in Chinese. can you give some examples? [10:06:05] of your bot changes [10:06:32] it's like the one you did last time manually [10:06:58] Merlissimo: and it's not running currently? [10:08:58] bbl [10:09:23] i have added automatic moving to subpages for country data to my bot framework when my bot wants to update langlinks. but now i'll do the change once. [10:10:08] currently the job queue size blocks my automatically running bot from doing changes on zhwiki template namespace [10:16:15] and back [10:19:43] TimStarling: is there a job queue size graph available ? [10:19:50] like the ts replag graph [10:20:38] Merlissimo: I just want to confirm it's working fine [10:20:56] liangent: probably [10:21:57] liangent: i am still working on my bot code. [10:22:37] but i am doing test edits first, of course [10:23:26] Merlissimo: ok let me say it with my example first [10:24:20] TimStarling: probably = you have data to generate one but nothing is available currently?
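The duplicate removal Tim describes above, removing all pending copies of a job and reinserting a single one, amounts to de-duplicating the queue on a job key. A toy model, with jobs reduced to (type, title) tuples, which real job queue rows are not:

```python
def deduplicate_jobs(queue):
    """Collapse every pending job with the same (type, title) key into a
    single entry, keeping first-inserted order, as sketched in the log
    for the country-data refreshLinks jobs."""
    seen = set()
    deduped = []
    for job in queue:
        key = (job[0], job[1])  # (job type, target title)
        if key not in seen:
            seen.add(key)
            deduped.append(job)
    return deduped
```

MediaWiki later grew exactly this kind of de-duplication for root jobs; here it is only meant to illustrate the "remove all, reinsert one" idea.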
[10:24:29] well bbl again [10:32:52] probably as in probably someone somewhere makes one of those [10:32:59] maybe on toolserver [10:33:09] maybe also asher made one [12:38:43] paravoid: thanks for the update and resolution to the issue [12:39:06] sorry for taking so long :) [12:39:31] no problem [12:59:29] liangent: job queue size is only available for enwiki. Reedy created it i think. but not for zhwiki. i requested such a graph last time, but nobody set one up i think. [12:59:54] I didn't create it [13:00:22] One of the first things maplebed did.. [13:00:46] Reedy: yes you asked me to ping other people because you had to sleep. but nobody did it. [13:01:05] I still didn't create it :p [13:42:09] hello there :-] [14:42:45] bminish: mark contacted tele2 and they've fixed the problem [14:42:49] not that you care anymore, but still [15:17:05] Howdy,.. what kind of MPM model is Wikipedia using, anybody here that knows? [15:17:28] And how large is a typical process/thread? [15:17:34] probably mod_php [15:17:40] or however it's called in apache2 [15:17:45] *whatever [15:18:00] not that,.. [15:18:27] Apache has several models for how it processes requests [15:18:36] prefork, threaded, and so forth [15:19:00] I don't think it's prefork [15:19:38] ii apache2 2.2.14-5ubuntu8.9 Apache HTTP Server metapackage [15:19:38] ii apache2-mpm-prefork 2.2.14-5ubuntu8.9 Apache HTTP Server - traditional non-threaded model [15:19:42] It doesn't seem very likely.. but memory usage is more critical in the threaded model [15:20:01] * jeblad_WMDE is chockolated [15:20:04] looks like prefork to me [15:20:33] well, that answers the question [15:20:42] He,.. you can speed up the webservers drastically by choosing another model [15:20:57] I'm sure there's good reason... [15:21:05] But nice, then we can do more ugly stuff.. 8D [15:21:16] mod_php5 works well with the threaded model?
[15:21:34] a threaded php is typically much slower [15:21:42] Not 100% sure, I usually play with ModPerl [15:21:44] and php is the important piece in wmf setup [15:21:53] is someone able to reliably answer http://lists.wikimedia.org/pipermail/wikimedia-l/2012-October/122332.html ? [15:22:32] do the 'ip' ratelimits in $wgRateLimits also affect (new?) users editing from the same IP, by cumulating their edits? [15:22:34] Prefork spins off a process and uses it once before it is killed. Maximum security, but it costs. [15:22:50] But thanks. [15:22:54] :) [15:24:48] Nemo_bis, the ip limit is per action and per ip [15:25:05] which means that when there are multiple newbies under the same ip, they are aggregated [15:30:28] Reedy, can you check another thing for me? Please? becauseyouaresoverykindandhelpfull.. :D [15:30:36] sure [15:30:38] :p [15:30:46] How large is a process for Wikipedia typically? [15:30:56] [ ] [15:30:58] ^ this big [15:31:03] hehee [15:31:33] I wonder if that's in our profiler/graphite [15:31:42] We have memory use for a single request about 35-45 MB [15:32:06] 'default' => 128 * 1024 * 1024, // 128MB [15:32:07] Seems like 43-44MB is typical [15:32:11] # Extra 60MB for zh wikis for converter tables [15:32:56] <^demon> (And those still OOM) [15:33:01] yup [15:33:09] we're nearly there for dedicated apaches... [15:33:30] jeblad_WMDE: I get a feeling we might have this somewhere. Probably better to ask Asher [15:33:43] hehe, seems like our processes are well behaved then.. We just thought we were growing out of bounds [15:34:06] <^demon> Well, each release of MediaWiki tends to consume more memory than the previous. [15:34:11] <^demon> Since more is better, of course :D [15:34:57] * jeblad_WMDE once had a linux box with 64MB ... [15:37:19] <^demon> I have 16GB on this laptop.
And Eclipse eats about 8 of those :p [15:38:52] Nemo_bis, I replied on foundation-l [15:50:44] Platonides: thanks [15:52:32] Platonides: does it also follow the local definition of autoconfirmed? [15:52:39] en.wiki's is stricter [16:26:57] paravoid: thanks again, now if only commercial and government entities could resolve issues as quickly and as transparently.. [17:52:21] Hi, is there a Steward who speaks Spanish? [17:57:32] Help!! [18:00:23] Deivismaster: #wikipedia-es ? [18:00:40] Yes... [18:01:18] Platonides: about? [18:02:18] Reedy: ready for the 1.21wmf2? [18:08:43] Reedy? [18:09:38] Deivismaster: try #wikimedia-stewards [18:12:31] 1 Fatal error: Call to a member function truncate() on a non-object in /usr/local/apache/common-local/php-1.21wmf2/extensions/CodeReview/ui/CodeRevisionListView.php on line 453 [18:12:35] Casualty no 1 [18:13:15] Aha [18:17:20] was $wgLang recently removed? [18:17:45] <^demon|lunch> Shouldn't have been. [18:17:51] <^demon|lunch> Not without major release-notes. [18:18:00] <^demon|lunch> (Big b/c break) [18:18:43] just wondering why https://gerrit.wikimedia.org/r/28063 was necessary [18:19:57] not that it's bad to do this sort of cleanup, but we shouldn't be doing it as part of a deploy [18:20:14] <^demon|lunch> ack -c 'wgLang[^a-z]' | grep -v ':0' | wc -l [18:20:14] (rather, we shouldn't need to) [18:20:15] <^demon|lunch> says 47. [18:20:35] <^demon|lunch> (Probably an easier way, but meh) [18:21:21] Probably part of siebrand's maintenance [18:22:54] oh, right, I see that now [18:23:10] https://gerrit.wikimedia.org/r/#/c/24651/ [18:23:29] 1 Catchable fatal error: Argument 1 passed to EditPage::toEditText() must implement interface Content, boolean given, called in /usr/local/apache/common-local/php-1.21wmf2/includes/EditPage.php on line 779 and defined in /usr/local/apache/common-local/php-1.21wmf2/includes/EditPage.php on line 1908 [18:24:59] $content = $this->getContentObject( false ); #TODO: track content object?!
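Returning to the $wgRateLimits question answered earlier: the 'ip' bucket keys its counter on (action, ip) with no user component, which is why several newbies editing from one IP are aggregated and can together exhaust a single allowance. A toy sliding-window model of that behaviour; the limit and window values are made up for illustration, and MediaWiki's real implementation differs in its storage and windowing:

```python
import time
from collections import defaultdict, deque


class IpRateLimiter:
    """Toy model of a per-(action, ip) rate limit: the key deliberately
    contains no user name, so distinct accounts behind one IP share the
    same sliding-window counter."""

    def __init__(self, limit=8, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self.hits = defaultdict(deque)  # (action, ip) -> hit timestamps

    def allow(self, action, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[(action, ip)]
        while q and now - q[0] > self.window_s:  # expire old hits
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

With limit=2, a third edit from a second account on the same IP inside the window is refused, while a different IP is unaffected.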
[18:24:59] $this->textbox1 = $this->toEditText( $content ); [18:25:15] Reedy: siebrand: when siebrand makes a change, and Reedy approves it, has anyone actually run the code in question? [18:26:36] Yeah, hence finding stuff that was removed that shouldn't have been... [18:29:16] Logged that error above as a bug.. It has gone away in the error logs again.. [18:46:32] Reedy: What's the RL-related stuff you were cherry-picking in then reverting? [18:47:08] RoanKattouw: Trying to fix teh array_map warning spam from the php suppression of the exception [18:47:23] I found a couple of bad paths (leading to 19 broken modules) in mobile frontend [18:47:33] Ala https://gerrit.wikimedia.org/r/#/c/28030/ [18:47:48] ha [18:47:52] Which I deployed to 1.21wmf1, not realising they had a bunch of untested features in it, so had to revert it [18:48:02] RoanKattouw: [18:58:20] Can someone please create on flourine: /a/mw-log/udp2log/resourceloader.log chmod 644 and owned by udp2log:udp2log [18:48:11] Yes, already on it :) [18:48:13] ^ so we can find out which the other offenders are [18:48:14] thanks :) [18:48:51] Reedy: /a/mw-log/udp2log doesn't exist [18:49:07] Sounds like you just want /a/mw-log/resourceloader.log ? [18:49:35] hah, yes please [18:49:49] !log Created fluorine:/a/mw-log/udp2log/resourceloader.log chmod 644 and owned by udp2log:udp2log as root per Reedy's request [18:50:01] Logged the message, Mr. Obvious [18:55:53] Now there is just the question is why the log is still empty... [19:01:15] Reedy: sitting with dominic and he's asking about 41028. (this morning's uploads) [19:01:37] Reedy: you say last file won't upload but actually it was in the middle of the list right? [19:03:06] !b 41028 [19:03:06] https://bugzilla.wikimedia.org/show_bug.cgi?id=41028 [19:03:55] anyway, what next? if it's a swift error then it's not a problem with the format or content of the file? [19:12:20] robla: any idea? [19:12:57] i guess we wait for aaron & co. ? 
[19:14:02] jeremyb: yeah, looks like one for AaronSchulz to poke [19:14:17] (who is afk at the moment) [19:16:54] jeremyb: it's the remaining file [19:16:57] so therefore it is the last file [19:17:27] Reedy: ok, thought maybe that was it [22:51:02] gn8 folks [23:13:04] spagewmf: hey, quick question about your keys [23:13:27] notpeter yes? Use the deploy one please [23:13:38] for deploy and access to stat1? [23:13:52] or do you want separate keys for deploy and stat1? [23:14:27] notpeter I can already ssh to stat1.wikimedia.org [23:17:04] yes. we have a key for you that is currently on stat1. do you want me to replace that key with the one you provided today and add it to fenari for deployment as well? or do you want to have separate keys, one for deployment on fenari, and another one (the one you currently use) for access to stat1? [23:19:47] notpeter, I think the latter. mutante in IM said "yes, something different from labs would be good" so I made a new id_rsa key just for deployment. [23:22:07] spagewmf: yea, labs keys should be different from production keys. but at that point we did not think of the one for stat-1. if that is the labs key, then it's better to use the new one for fenari and stat1 and keep the other one just for labs/gerrit [23:22:27] ..also, the new one is longer [23:23:14] mutante, notpeter, fine use the new deploy for stat1. Thanks! [23:25:27] spagewmf: cool. sounds good. sorry for the bother