[00:00:09] probably means they're in a busy idle loop [00:01:29] do we need to switch the symlink over to php-1.20wmf1? [00:03:13] eventually, but won't help now likely [00:03:24] Reedy's command should have used the 1.20wmf1 version [00:03:43] Yeah, it did #0 /home/wikipedia/common/php-1.20wmf1/includes/objectcache/ObjectCache.php(23): ObjectCache::newFromId('CACHE_A_ACCEL') [00:03:47] OH [00:03:49] I see a bug [00:03:53] CACHE_A_ACCEL? [00:04:19] no, ignore that [00:04:44] #0 [internal function]: ObjectCache::newAccelerator(Array) [00:04:44] #1 /usr/local/apache/common-local/php-1.20wmf1/includes/objectcache/ObjectCache.php(62): call_user_func('ObjectCache::ne...', Array) [00:06:39] Ok [00:07:01] What's php -n? [00:07:35] Calling without -n works fine [00:07:38] reedy@srv231:/usr/local/apache/common/multiversion$ php MWScript.php nextJobDB.php --wiki=aawiki [00:07:38] itwiktionary [00:07:47] http://www.php.net/manual/en/features.commandline.options.php [00:07:55] see 'n' [00:07:58] -n No php.ini file will be used [00:09:54] I see the problem [00:10:42] Go on? [00:44:26] !log aaron synchronized wmf-config/CommonSettings.php [00:44:29] Logged the message, Master [00:52:56] Sorry! This site is experiencing technical difficulties. Try waiting a few minutes and reloading. (Cannot contact the database server: Unknown error (10.0.6.41)) [00:53:30] Reedy: ^ ? [00:53:50] s4 master [00:53:59] Same. Been on and off 10 minutes or so [00:54:22] Same host? [00:54:27] binasher: ^ fancy looking into this? [00:54:46] aye, just got paged on it [00:55:06] please move any further discussion to ops [00:57:40] ah, was that due to db31? [00:57:44] binasher: ^^ ? [00:58:20] yes, looking at it. current status: don't panic [00:58:42] I'm not [00:58:48] Towel anyone? [00:58:50] just glad it wasn't swift [00:59:22] Ryan_Lane: why? :D [01:01:31] because only one person really knows it well [01:01:35] and he's on vacation [01:03:46] Reedy: https://gdash.wikimedia.org/dashboards/jobq/ [01:03:53] * AaronSchulz is going to head out [01:03:54] Yay [01:03:55] What was it? [01:03:59] magic [01:04:26] 0:45 - 0:55 is when db31 (s4 master) vanished [01:07:10] binasher: mind walking me through what you are doing? [01:07:20] I hate needing to page people in this situation [01:07:35] it seems like it was totally non responsive (dropped out of ganglia,) and i'm seeing no signs yet why, nor any issues with mysql.. nothing in dmesg or any logs [01:07:41] the hardware raid shows no errors [01:07:46] bbu is at 99% charge [01:08:33] did the process deadlock? [01:09:24] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&c=MySQL+pmtpa&h=db31.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [01:09:33] NOHZ: local_softirq_pending 100 [01:09:35] in dmesg [01:09:42] that's old [01:09:46] ah [01:10:03] unless i did the math wrong [01:10:16] but i think months ago [01:10:22] * Ryan_Lane nods [01:10:49] I hate that all of our logs report the same crap [01:12:02] I thought these had swappiness set to 0... [01:12:40] Swap: 976888 441848 535040 <— not that it's actually much. [01:15:18] swap use didn't increase just before or after the incident [01:15:23] * Ryan_Lane nods [01:15:38] the last sample from atop before shows a whole bunch of bzip2 processes running [01:15:45] o.O [01:16:00] what would be doing that? [01:16:11] and an apt-get, and puppet as the top processes writing 108.2M to disk [01:17:55] it doesn't show args to apt-get and i don't see anything in /var/log/apt that suggests an install… bzips could just be from an apt-get update maybe? are deb catalogs bz2 compressed? if so, probably a herring of some color [01:18:14] yeah [01:18:17] I believe they are [01:18:25] likely and update [01:18:27] *an [01:18:36] it would be a bunch of them, too [01:20:29] atop is incredibly useful. I need to learn how to use this more. [01:26:08] it does show that in the sample at 00:46:03 (it became unresponsive at 00:45), mysql allocated 138M of new virtual memory which is very unusual per previous samples, and wrote 200-300MB more to disk in that time than looks normal [01:27:23] now i'm going to look at queries in the binlog starting around 00:40 [01:28:46] actually looking at the slow log now [01:29:01] gah [01:30:25] ah. that makes sense. maybe a bunch of really shitty queries came in at once? [01:31:03] looks globalimagelinks related [01:31:27] db46 also seems to be having issues [01:32:01] the slow queries log is at /home/w/log? [01:32:35] db46 doesn't respond to a ping [01:33:10] i'll pull it from db.php but can you check the hw out while i continue looking at 31? [01:33:15] yep [01:33:16] on it [01:33:58] !log asher synchronized wmf-config/db.php 'pulling db46, host down' [01:34:01] Logged the message, Master [01:34:03] kernel fault [01:34:10] hrm [01:34:21] gimme a sec and I'll pastebin it somewhere [01:34:23] hopefully a nice coincidenc [01:34:59] in the ip stack? [01:35:12] http://pastebin.com/beNbVf6g [01:35:45] [2593479.501267] BUG: soft lockup - CPU#4 stuck for 61s! [swapper:0] [01:35:45] [ [01:36:09] let me reboot it [01:36:13] huh [01:36:24] oh slow q log is in /a/sqldata on each local db [01:37:02] also seeing this: http://pastebin.com/va2BPTrq [01:37:23] !log powercycling db46 [01:37:26] Logged the message, Master [01:40:36] Ryan_Lane: https://bugzilla.wikimedia.org/show_bug.cgi?id=36259 [01:40:52] Happened earlier. No log entry [01:41:06] not sure how I can help you with that [01:41:40] Bsadowski1: that's a software bug [01:41:44] I do ops. [01:41:47] Ah [01:41:48] :) [01:41:52] Who does the software? :P [01:41:58] the developers ;) [01:42:05] so, you entered that into the correct place [01:42:18] the developers are also often in this channel [01:42:36] the db31 issue seems related to the jobqueue graph aaron posted earlier [01:42:44] !log starting mysql on db46 [01:42:47] Logged the message, Master [01:43:02] stampede of SELECT /* Job::pop_type */ * FROM `job` WHERE job_cmd = 'enotifNotify' LIMIT 1 FOR UPDATE; queries (thousands) in the minute before trouble [01:43:14] ugh [01:43:20] we need a real queue :) [01:43:55] why didn't we get an alert on db46? [01:44:02] oh. I guess we did [01:44:19] only masters actually page [01:44:25] right [01:44:29] or i fucked up, but it seems to work [01:44:32] I didn't see the alert in irc [01:45:04] how did you end up finding the job queue stampede? [01:45:09] 50 select for updates per sec piling up and taking seconds to finish [01:45:23] TimStarling: didn't we remove the FOR UPDATE? [01:46:56] it's there in 1.19 [01:47:13] then thousands of getLinksFromPage queries.. all the activity that swamped it was just via the jobqueue, not directly from the site [01:47:54] TimStarling: what sort of task triggers GlobalUsage::getLinksFromPage? [01:49:16] page save [01:51:09] when a page is saved, there will be a GlobalUsage::getLinksFromPage query fetching the links from that page [01:51:42] then if there are differences between the result set and the new image links, there will be subsequent deletes and inserts [01:52:22] where by "links" we mean images used on that page [01:53:29] it hooks LinksUpdate, which does the same thing for the core links tables [01:54:02] hmm [01:54:28] I guess a refreshLinks job also triggers LinksUpdate, that would be what you're seeing [01:54:59] probably not a problematic query unless it comes in a flood of a hundred concurrent requests [01:55:14] using less job runners would be one way to address it, or some other job queue concurrency control [01:56:05] yeah, query looks fine in an explain [01:56:14] just concurrency [01:57:21] hewiki got the highest number of those queries, dewiki second [01:59:08] curious to check out recent changes but can't read hebrew or german.. wonder if it was defacement, bots, or legit [02:00:12] we could probably optimise that query [02:00:50] * TimStarling is dreaming about all sorts of fun optimisation projects in the course of composing a reply to a certain email [02:00:59] :) [02:01:24] if no images change, then maybe we should store a hash of the images out of a given page [02:01:28] then just compare the hashes [02:01:39] then do that for all the links tables [02:01:56] so then we would only have to fetch the actual link set when links actually change [02:02:00] nice [02:02:18] or it could be denormalised in some other way [02:02:23] what would you compute a hash from? concat of all image ids or something similar? [02:02:38] image titles [02:02:41] basically the row contents [02:02:56] we store image titles so that when an image moves, the links rows don't have to be updated [02:04:06] and i assume they're all already available within the context of a page save [02:04:33] yes [02:05:13] so the only heavy cost would be in a one time backfill [02:06:12] the hashes could start off as null, and if it's null when LinksUpdate runs, it would just ignore the hash and do what it does now [02:06:33] then it would populate the hash row as part of its normal update process [02:07:26] so the site would start off the same speed as it is now, and slowly get faster as templates and articles get edited [02:09:35] that would be fun to watch [02:14:03] !log LocalisationUpdate completed (1.20wmf1) at Thu Apr 26 02:14:03 UTC 2012 [02:14:06] Logged the message, Master [02:24:38] Links updates already get missed. You want to add a layer? :-/ [02:41:26] dinner time, night all [05:09:02] Error creating thumbnail: convert: Output file write error --- out of disk space? `/tmp/transform_216c189-1.jpg' @ error/jpeg.c/EmitMessage/235. [05:09:30] result of: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Blaeu_1652_-_Amsterdam.jpg/1280px-Blaeu_1652_-_Amsterdam.jpg [05:10:00] I got a 500 error when accessing the same. [05:14:34] AFBorchert: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=part_max_used&s=by+name&c=Image+scalers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=3 [05:25:13] jeremyb: thanks, the image scalers do not appear to be under load now, perhaps just one of the hosts has a problem [05:26:25] AFBorchert: maybe this is a better version: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=part_max_used&s=by+name&c=Image+scalers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=medium&hc=2 [05:26:29] AFBorchert: see the maxes [05:48:03] AFBorchert ? [05:57:57] wikipedia slow ? [05:58:22] any problem ? [05:58:23] Having trouble getting pages loaded. Netherlands. Examples: https://meta.wikimedia.org/wiki/Special:Contributions/Siebrand https://nl.wikimedia.org/wiki/WM:O [05:58:26] just had a nagios flood on LVS [05:58:34] all esams [05:58:41] i think someone's about to look at it? [05:59:04] 29 seconds for https://bits.wikimedia.org/nl.wikimedia.org/load.php?debug=false&lang=nl&modules=site&only=styles&skin=vector&* [05:59:06] wikipedia down atm. [05:59:27] tegra: country? [05:59:31] switzerland [06:00:16] same here: en-wp from Germany [06:00:46] AFBorchert: and dewiki too i guess? [06:00:52] AFBorchert: should be down across the board [06:01:08] mmm all down. [06:01:25] aharoni: coming to report an outage? ;) [06:01:33] boker or [06:01:42] status: down [06:01:47] in europe i think [06:01:56] khag sameakh. [06:02:12] * jeremyb wonders which chag [06:02:22] atzmaut [06:02:31] oh, hayom? [06:02:56] * aharoni is wondering whether google.com in other countries also shows an Israel Independence day doodle. [06:03:08] not here [06:03:51] only if i go to the .co.il page [06:04:51] down here [06:05:43] it's a global outage ? [06:06:22] tegra: certainly not, seems ok here [06:07:16] wikipedia-lb.esams.wikimedia.org [06:07:25] i can't ping. [06:07:25] right [06:07:45] maybe related to outages on http://www.internetpulse.net/ ? just a wild guess [06:08:16] mmm [06:08:17] (there do seem to be some esams problems that are not related to that page though) [06:10:56] some routing problem i think. [06:12:21] how do things look now guys? [06:12:24] no, I think it's an issue with one of the bits caches at esams. we're looking at it in wm ops [06:12:38] it did not improve yet [06:12:57] apergos: did you see tegra above about can't ping? [06:13:09] no [06:13:23] you can ping ? [06:13:47] i can ping pmtpa and eqiad but not esams (from US) [06:14:19] just took what tegra had been trying (wikipedia-lb.esams.wikimedia.org) and substituted in the other DCs [06:14:20] traceroute ends at r1ams2.core.init7.net [06:14:30] totally down atm [06:14:39] 1. 54-gw-vrrp2.rz.uni-ulm.de 0% 20 20 0 0 1 1 [06:14:42] 2. cdrip-core-int.rz.uni-ulm.de 0% 20 20 1 0 1 2 [06:14:46] 3. Ulm-N25-1-10GE-0-1-0-3.belwue.net 0% 20 20 1 1 1 1 [06:14:49] 4. Heidelberg-RZ-1-10GE-0-0-0-0.belwue. 0% 20 20 6 5 6 7 [06:14:52] 5. Mannheim1-10GE-4-0-0.belwue.net 0% 20 20 5 5 6 7 [06:14:53] AFBorchert: no pasting... [06:14:55] 6. Frankfurt-DECIX-1-10GE-0-0-0-0.belwu 0% 20 20 7 7 7 7 [06:14:58] 7. 80.81.193.67 0% 20 20 7 7 10 24 [06:15:01] 8. r1ams2.core.init7.net 0% 20 20 18 18 22 25 [06:15:03] oh.. [06:15:04] 9. ??? [06:15:06] init7 too [06:15:06] sorry [06:15:07] here [06:16:38] gw-wikimedia.init7.net [06:16:59] no request. [06:20:07] sometimes i can reach it. [06:21:10] overload on network cables? [06:21:33] dont know. [06:21:47] init7 problems. [06:22:38] http://en.wikipedia.org doesn't seem accessible from the Netherlands. [06:22:59] oh ok. [06:24:41] Joan you can't reach wikipedia-lb.esams.wikimedia.org too ? [06:25:13] well esams is itself in the netherlands... [06:25:28] joan is in the US [06:25:51] Seems better now. [06:26:03] Intermittent. [06:26:08] yes [06:26:09] for what it's worth: de.WP is not reachable according to my browser. [06:26:13] still lots of packetloss here [06:26:13] sometimes work. [06:28:08] totally down. [06:28:26] PL from the 3rd init7 box onwards ... [06:28:47] Works fine for me :D [06:28:58] Bsadowski1: where? [06:29:01] 95% packet loss at te-8-2.csw1-esams.wikimedia.org [06:29:14] I'm in the US [06:29:15] AFBorchert: yep, same here. [06:29:16] \o/ [06:29:53] init7 damn. [06:30:04] 7 is not a valid run level... [06:30:15] Packet loss seems to be at the load balancer [06:30:20] Yes. [06:30:21] jeremyb: yea, kind of strange company name :P [06:30:29] Seems better now, though. [06:30:48] multichill: traffic's been shifted over to US [06:31:02] Not for me it hasn't :-( [06:31:21] multichill: run dnsmasq? [06:31:24] jeremyb: on a dns level ... so sometime in the next hour we will get that change ;) [06:31:32] I know what it is. [06:31:36] j/k [06:31:38] I don't. [06:31:55] Was gonna joke and say like, some internet firewall. [06:31:56] :P [06:32:12] hi all, I came to report problems. no access to nl-wiki meta-wiki and en-wiki from the netherlands. no changes reported in both channels #cvn-wp-nl or #wikipedia-nl-vandalism at all [06:32:32] yes [06:32:51] Toolserver seem to be ok, that rules out most networking equipment [06:32:54] Moira2: would make sense to have few changes if no one can even read the wiki, right? ;) [06:33:22] ? [06:33:32] it seems to be a problem wit the bits caching servers [06:33:36] Moira2: IRC feeds [06:33:53] no i think it's a routing problem [06:33:57] multichill: this was a few mins ago but seems about the same now: http://dpaste.com/738014/plain/ [06:34:22] wikipedia also does not load when accessed from outside [06:34:25] via google [06:35:48] tegra: the "solution" is to use failover boxes in the us ... what the problem is, the admin will find out once everything is working again ;) [06:36:28] Moira2: yeap ... but I'm pretty sure here is one of the first places you will get to know once it works again ;) [06:36:41] it works again it seems [06:36:45] thanx guys and girls [06:37:08] works now. [06:37:17] :) [06:38:01] yep, failover [06:40:59] bye and thanks [12:17:34] hey folks [12:17:47] it seems that there is a problem with api since 1.20 [12:18:03] when I start my bot I get this [12:18:08] Site: Wikipedia (MediaWiki 1.20wmf1) [12:18:08] The remote server returned an error: (500) Internal Server Error. Retrying in 60 seconds. [12:18:29] is there a way to debug why the api query which was ok is now returning 500 [12:19:13] jeremyb: wtf [12:19:16] europe is down? [12:19:35] could you make a bit more descriptive topic because I am in europe and stuff loads [12:20:12] petan|wk: I guess the trafic is re-routed to the US [12:20:33] (that's the normal way if europe's systems are down) [12:20:53] it looks like wiki is down for europe from what was there [12:22:07] however my api problem is still happening now [12:26:25] What query are you doing to get the error 500? [12:26:31] Just saying it doesn't work is useless [12:28:35] traffic was back to normal a few hours ago, sorry the topic wasn't updated [12:28:56] apergos: there is also nothing in the server-admin-log AFAIS [12:29:46] ok sorry about that [12:30:08] apergos: no big deal :) [12:31:30] ZorroIII noticing you people too? [13:23:13] !log nikerabbit synchronized php-1.20wmf1/extensions/Narayam/ 'Updating Narayam' [13:23:15] Logged the message, Master [13:27:45] !log nikerabbit synchronized wmf-config/InitialiseSettings.php 'Narayamon tewiki bug 33480' [13:27:47] Logged the message, Master [13:28:29] ^demon: looks like special:version is not working correctly [13:28:51] <^demon> Howso? [13:29:21] I updated Narayam [13:29:21] Previous HEAD position was e2d49f2... Merge "Remove rules that did nothing, correct errors, add new rules for simplification, use single rule instead of multiple where possible" [13:29:26] HEAD is now at 30c7b8c... Localisation updates from http://translatewiki.net. [13:29:34] yet http://te.wikipedia.org/wiki/%E0%B0%AA%E0%B1%8D%E0%B0%B0%E0%B0%A4%E0%B1%8D%E0%B0%AF%E0%B1%87%E0%B0%95:%E0%B0%95%E0%B1%82%E0%B0%B0%E0%B1%8D%E0%B0%AA%E0%B1%81# points to the old commit [13:31:53] <^demon> Hmm. Oh well toss it in BZ? [13:32:50] ^demon: you know I've been filing approximately three new bugs every day for a whlie [13:34:13] <^demon> And I'm fixing them as fast as I can. But I didn't rewrite Special:Version to support git so I don't know what you want me to do. [13:36:08] ^demon: I didn't say I want you to fix it [14:24:52] apergos: would you be able to copy and extract some mediawiki release tarballs onto download.wm.o please? [14:26:08] can it wait a little, I'm in the middle of a discussion about an ops ticket [14:26:12] I'll ping you in a few [14:26:23] Yeah, it's not urgent. Thanks [14:30:30] Reedy: ok [14:30:36] looks like it's being handled [14:30:56] so tell me what to get and whether I should unpack them in some dir or what the drill is, etc [14:34:48] Reedy: ? [14:37:44] http://noc.wikimedia.org/~reedy/upload-1.17.4.tar [14:37:47] http://noc.wikimedia.org/~reedy/upload-1.18.3.tar [14:37:51] http://noc.wikimedia.org/~reedy/upload-1.19.0rc1.tar [14:38:17] Those need extracting into the mediawiki tarball folder [14:45:26] Reedy: check it, you should be all set I think [14:45:58] Yup, looks good. Thankyou! [14:46:44] sure [15:47:23] Reedy: hey can we put keys somewhere not on secure.wm.org ? [15:47:26] we're trying to kill that [15:47:37] Sure [15:47:46] I think someone suggested that before [15:48:28] LeslieCarr: we just need some kind of redirect for it for a while though [15:49:09] yeah, we can do that too, but also just not linking anything to it helps :) [15:49:17] secure.w.o is nice for cross-site requests [15:49:27] Secure needs to die [15:49:41] When it's got a new location, I can link to that location instead ;) [15:50:35] LeslieCarr: it'd probably be fine being moved to fenari [15:50:56] And hell, maybe into puppet too! [15:51:13] yay [15:51:31] Anything that speeds up secure.wm.o dieing is a good thing! :D [16:04:53] apergos: please could you copy http://noc.wikimedia.org/~reedy/upload-1.18.3.tar again and replace the files? Seems I made a tarball with 2 copies of the extensions [16:05:24] ok [16:05:41] I looked at thee contents of the file to make sure the directory structure was right but nothing else [16:06:40] try that [16:07:40] Reedy: [16:09:27] Yeah, it was mediawiki-1.18.3.tar.gz itself with issues (my fault) [16:09:28] Thanks [16:10:20] sure [18:33:41] !log catrope synchronizing Wikimedia installation... : Deploy AFTv5 updates [18:33:44] Logged the message, Master [19:06:17] !log catrope synchronized wmf-config/CommonSettings.php 'Add group permissions settings for AFTv5' [19:06:20] Logged the message, Master [20:15:52] "Due to high database server lag, changes newer than 84 seconds may not appear in this list." [20:16:15] "Changes newer than 27 seconds may not appear in this list. " :) [20:17:53] petan|wk: europe was down entirely not rerouted. sorry i forgot to revert the /topic when it was fixed [20:19:04] DaBPunkt: nothing in SAL about what? [20:19:45] * jeremyb points to http://wikitech.wikimedia.org/index.php?title=Server_admin_log&diff=46145&oldid=46142 [20:19:45] jeremyb: the fix/back-routing [20:23:39] DaBPunkt: see that link? [20:24:37] jeremyb: yes, that the "there is a problem"-line. But I could not find the "We fixed the problem"-line [20:24:51] DaBPunkt: ahhhh [20:25:49] DaBPunkt: well i'm pretty sure we fixed the problem and traffic is hitting esams again were not so close to the same point in time [20:26:15] so all 3 should be logged (divert, fix, revert) ideally [23:09:33] !log Recreated resources directory symlinks in bits docroot [23:09:36] Logged the message, Master [23:40:03] gn8 folks [23:46:37] !log asher synchronized wmf-config/db.php 'raising db58 weight' [23:46:39] Logged the message, Master