[00:01:18] New patchset: Asher; "frontend cache tier udplog data for graphite" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2523
[00:01:41] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2523
[00:01:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2523
[00:01:42] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2523
[00:01:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2523
[00:04:27] New patchset: Asher; "more capable and efficient version of sqstat" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2524
[00:04:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2524
[00:04:55] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2524
[00:04:56] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2524
[00:09:48] New patchset: Asher; "add path to proc name for silly nagios monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2525
[00:10:10] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2525
[00:10:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2525
[00:10:13] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2525
[00:10:13] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2525
[05:14:09] New patchset: Asher; "much more accurate at counting pageviews and edit submission" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2526
[05:15:34] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2526
[05:15:34] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2526
[05:54:33] New patchset: Asher; "fix edit tp99 reporting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2527
[05:54:54] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2527
[05:54:54] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2527
[05:54:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2527
[06:21:42] New patchset: Asher; "variable scope fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2528
[06:22:09] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2528
[06:22:10] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2528
[07:47:26] has anyone been looking into the enwiki search outage?
[07:52:09] !log restarted lsearchd on search{3,4,9}
[07:52:11] Logged the message, Master
[07:53:31] fulltext search on enwiki is still broken despite the nagios ok
[08:03:07] it is?
[08:03:53] (I just tried it and I got a bunch of results that seem reasonable)
[08:04:56] same here now
[08:05:15] do you just look at the pybal conf files to decide what to restart?
[08:05:16] though i just restarted lsearchd on 1, 3, 4, 9 again a little while ago
[08:05:31] looking at ganglia I really can't read anything into the tea leaves there
[08:06:44] i was looking at the pybal conf + tailing /var/log/pybal.log and watching the health checks fail, which are just a simple tcp port check for these i think
[08:06:53] ok
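
The triage method described just above (watching pybal's health checks against the pool configuration) comes down to a couple of commands on the LVS host. A minimal sketch, assuming the pybal log path quoted in the conversation; the grep pattern is approximate and the commands need root on the load balancer:

    # Watch pybal's health checks fail in real time (log path as quoted above;
    # the exact log wording is not shown in the transcript, so the pattern is a guess).
    tail -f /var/log/pybal.log | grep -Ei 'fail|depool'

    # Cross-check what LVS itself currently has pooled and how many connections
    # each realserver is carrying.
    ipvsadm -L -n
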
[08:07:39] and now why has the bottom fallen out of the api servers...
[08:08:42] network traffic nose dived
[08:09:29] yeah
[08:09:34] hi
[08:09:37] hello
[08:11:59] what's the status now?
[08:13:26] still see it can't depool servers cause too many are down
[08:13:31] (looking at lvs4)
[08:13:46] i can search, but dont get answers from API
[08:13:49] fucking search
[08:14:13] most api queries are search queries
[08:14:36] i think the search issue is killing the api apaches
[08:14:45] yes, most likely
[08:15:09] lots of
[08:15:10] cp1020.eqiad.wmnet 86390811 2012-02-12T08:14:40.678 3414 122.174.37.108 TCP_MISS/000 0 GET http://en.wikipedia.org/w/api.php?action=opensearch&search=tardy+ulnar+&namespace=0&suggest=
[08:15:19] right
[08:15:24] i think thats a disconnect after.. 34 seconds?
[08:17:23] search pool 1 looks ok in lvs
[08:18:04] hmmm
[08:18:12] check the network graph on http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Text+squids+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[08:18:25] looking
[08:18:34] woah
[08:18:35] pmtpa text squids are only used by esams misses now
[08:19:05] eqiad text squids (api squids) don't show that pattern at all
[08:19:30] I see it but I don't understand it
[08:22:24] re, the "how to decide what to restart": -> ipvsadm -l .. right?
[08:22:58] 77% of the last 500k requests came via the europe squids
[08:23:08] well that makes sense
[08:23:10] it's morning in europe
[08:23:16] and of those 56% were misses
[08:23:22] that's normal for api too
[08:23:27] or do you mean all requests?
[08:23:31] that's all requests though
[08:23:49] misses on the frontend squids?
[08:24:08] oh yeah, that's useless
[08:24:20] nevermind :)
[08:24:32] :)
[08:25:50] I wonder if the spike in requests I see on the api squids can be explained by squid retry behaviour or not
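
The percentages quoted above (roughly 77% of the last 500k requests via the European squids, and 56% of those being misses) are the kind of numbers that can be pulled straight from the sampled squid log. A rough sketch, assuming the whitespace-separated field layout of the example line earlier in the log (field 1 is the squid host, field 6 the cache status); the log path here is illustrative, not confirmed by the transcript:

    # Tally cache status over the last 500k sampled requests.
    # Field positions assumed from the example log line above; the path is illustrative.
    tail -n 500000 /a/squid/sampled-1000.log | \
      awk '{ split($6, s, "/"); n[s[1]]++ } END { for (k in n) printf "%-12s %d\n", k, n[k] }'

Filtering on the squid hostname in field 1 before the tally gives the same breakdown per data centre.
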
[08:27:06] mark: the esams squids send api requests to the pmtpa text squid pool?
[08:27:16] eqiad api squids have a spike of a few hundred extra requests
[08:27:19] binasher: yes
[08:27:30] pmtpa squids are not used otherwise
[08:27:46] esams squids send only to the pmtpa api squids, sq31-36
[08:27:58] but i'm seeing a spike in requests on the eqiad api backend squids too
[08:28:04] not on the regular text squids
[08:29:38] looking at a tcpdump on a pmtpa text squid port 3128, around 2/3 of requests coming through are for /w/api.php?action=opensearch
[08:29:55] that seems about right
[08:30:07] yes
[08:30:43] that's why api is affected by search downtime
[08:31:02] we should really do some performance/reliability work on the api at some point
[08:31:21] and perhaps introduce api keys, so we know who's hitting us the most, etc
[08:31:29] a fabulous idea
[08:31:35] I'm looking at sampled log
[08:31:44] not seeing anything jump out from there
[08:31:53] an example: http://torrus.wikimedia.org/torrus/CDN?path=%2FSquids%2Feqiad%2Ftext%2Fcp1004.eqiad.wmnet%2Fbackend%2FUsage%2FClient_requests
[08:31:55] which again seems to indicate that this is normal traffic
[08:33:00] performance / reliability work is probably needed most on search
[08:33:22] oh absolutely ;)
[08:33:28] Host : 10.2.1.22/80/3130
[08:33:28] Flags : no-query no-digest login=PASS connect-timeout=5 max-conn=1000 originserver
[08:33:28] Address[0] : 10.2.1.22
[08:33:28] Status : Up
[08:33:28] AVG RTT : 0 msec
[08:33:28] OPEN CONNS : 1000
[08:33:38] api squids are maxed out on their connections to the api backend
[08:33:46] ah - you were thinking the spike is just squid retrying failed connections?
[08:33:52] i'm not sure
[08:33:55] i'm trying to find out
[08:34:00] squid used to retry 10 times
[08:34:05] now it's 3 - if we configured that
[08:34:10] let's see if I can find it
[08:34:14] and you squashed that crap iirc
[08:34:34] i'm seeing tampa squids return "Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable" to esams
[08:34:43] poor things :-(
[08:34:44] yes, that's typical
[08:34:47] heh
[08:34:48] and then squid would retry up to 10 times
[08:34:50] making it worse
[08:35:01] now we patched squid to make that configurable
[08:35:03] and set it to 3
[08:35:12] but this happened when you came in, and we had to back out the patch
[08:35:18] i'm no longer sure if we reenabled that option
[08:35:33] cachemgr can tell me
[08:35:37] * mark looks
[08:36:28] it's 10 times...
[08:36:49] the patch is in
[08:36:50] i'll look into the option and setting it to max 2-3
[08:36:52] but Config.retry.maxforwards needs to be set
[08:36:54] :-( :-(
[08:36:56] yes
[08:37:00] and it's not I believe
[08:37:52] i'm gonna set it to two now
[08:38:10] ok
[08:39:11] trying it on cp1001
[08:39:36] checking in cachemgr
[08:40:27] looks like it might be working
[08:40:35] i'm gonna try it on the other api squids now
[08:40:38] not the rest of text yet
[08:42:33] !log Set maximum_forwards 2 in squid.conf, deployed to the API squids only so far, rest is pending
[08:42:36] Logged the message, Master
[08:43:56] I don't think it really helps anyway
[08:44:28] gonna deploy it to all squids now
[08:44:57] !log maximum_forwards change deployed to all squids
[08:44:59] Logged the message, Master
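
For reference, the knob toggled in the !log entries above is not a stock Squid directive: as discussed in the conversation, Wikimedia carried a local patch exposing Config.retry.maxforwards in squid.conf, capping how often a failed forward is retried (the unpatched behaviour, per the discussion, was up to 10 attempts). The deployed change amounts to a single line along these lines; a sketch, not the literal generated template:

    # squid.conf (locally patched Squid): cap forward retries so a struggling
    # backend is not hit up to 10 times for a single client request.
    maximum_forwards 2

Whether it actually took effect was then checked against the cachemgr forwarding statistics, as noted a little further down.
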
[08:49:01] https://graphite.wikimedia.org/render/?title=Average%20Search%20Latency%20-8hours&from=-8%20hours&width=800&height=600&until=now&areaMode=none&hideLegend=&lineWidth=1&target=alias%28color%28LuceneSearchSet.newFromQuery.tavg,%22blue%22%29,%22Search%20Latency%20%28ms%29%22%29
[08:50:17] I still see try #3 go up a bit in cachemgr, so I'm wondering if it's in effect at all
[08:50:35] hmm
[08:51:04] what timezone is that?
[08:51:39] i wish rainman would respond to questions or accept a pile of cash to spend a few days with us :/
[08:52:34] I think we need to prioritize finding someone
[08:52:50] since it's now actively affecting uptime
[08:53:35] peter should become rfe and take it over
[08:54:02] is peter a coder?
[08:54:14] hmm that's a good question, I dunno
[08:54:16] oh.. grr. that appears to be us est.. its the default in one of the graphite confs
[08:54:23] add &tz=utc
[08:54:25] heh
[08:54:28] but that's now
[08:54:29] please use UTC everywhere
[08:54:31] i'll fix that
[08:55:36] i think it does some rsync every sunday morning
[08:55:44] because search always goes down on sunday morning
[08:55:50] usually restarting lsearchd fixes it
[08:55:58] I still see rsync running
[08:56:14] we need someone willing to work through a lot of java code, and who understands ancient versions of lucene in a distributed query execution setup
[08:56:19] rsyncing indexes?
[08:56:21] perhaps it's data corruption during the rsync?
[08:56:25] eewwww
[08:56:34] i killed an rsync on one of the enwiki searchers that was running since 2010
[08:56:40] nice
[08:56:55] erm
[08:56:56] the java is less than fun
[08:56:56] search1:
[08:56:57] lsearch 29523 0.0 0.0 17288 1272 ? S 08:37 0:00 /usr/bin/rsync --bwlimit=4048 -W --delete -u -t -r rsync://searchidx2/search/snapshot/enwiki.nspart1.sub2/20120212070013 /a/search/indexes/update/enwiki.nspart1.sub2
[08:56:57] lsearch 29524 1.2 0.0 17548 880 ? S 08:37 0:15 /usr/bin/rsync --bwlimit=4048 -W --delete -u -t -r rsync://searchidx2/search/snapshot/enwiki.nspart1.sub2/20120212070013 /a/search/indexes/update/enwiki.nspart1.sub2
[08:57:24] I guess that's a fork()
[08:57:26] I think that's ok
[08:57:28] i hope it is
[08:57:30] yeah
[08:57:37] yeah, one does the work and the other waits or something
[08:57:41] yeah
[08:58:12] we could try putting maxclients up a little on the app servers
[08:58:21] no on the squids
[08:58:26] perhaps both
[08:58:40] 2012-02-12 08:42:37.614466 [ProxyFetch] srv291.pmtpa.wmnet (enabled/partially up/pooled): Fetch failed, 5.191 s
[08:58:48] see I'm still seeing that consistently on lvs4
[08:58:55] so... and max clients there is 40 I think
[08:59:17] we need to somehow convince the powers that be to open a search job rec on the platform team
[08:59:36] now that's sensible. almost toooooo sensibel
[08:59:41] sensible
[09:00:06] 40, yeah
[09:00:23] and it is of course maxed out
[09:00:34] nice
[09:00:40] the parent of the rsync is some java process
[09:00:45] so we're probably not looking at a cron job
[09:00:47] it's built in?
[09:00:52] ohhhhh dear
[09:01:26] apergos: are apaches returning 200s?
[09:01:31] for the queries that do come in
[09:01:31] https://graphite.wikimedia.org/render/?title=Average%20Search%20Latency%20-8days&from=-8%20days&width=800&height=600&until=now&areaMode=none&hideLegend=&lineWidth=1&target=alias%28color%28LuceneSearchSet.newFromQuery.tavg,%22blue%22%29,%22Search%20Latency%20%28ms%29%22%29&tz=utc
[09:01:54] every sunday eh
[09:01:57] looks like search always goes to shit sunday morning
[09:02:01] yes it does
[09:02:04] :-D
[09:02:13] just to screw with our weekend...
[09:02:23] if nothing else, i'm gonna move that weekly rsync to a weekday :P
[09:03:21] on search4 right now: lsearch 8617 1 0 2010 ? 00:00:07 /usr/bin/rsync --bwlimit=4048 -W --delete -u -t -r rsync://searchidx1/search/snapshot/enwiki.nspart1.sub2/20101219055823 /a/search/indexes/update/enwiki.nspart1.sub2
[09:03:21] lsearch 8618 8617 0 2010 ? 00:00:16 /usr/bin/rsync --bwlimit=4048 -W --delete -u -t -r rsync://searchidx1/search/snapshot/enwiki.nspart1.sub2/20101219055823 /a/search/indexes/update/enwiki.nspart1.sub2
[09:04:03] I don't know (about 200s) yet
[09:04:29] http://www.mediawiki.org/wiki/Extension:Lucene-search#Distributed_architecture
[09:05:09] searchidx1?
[09:05:10] wait
[09:05:17] does that even...
[09:05:19] I mean
[09:05:20] so basically
[09:05:30] every sunday morning, all search servers rsync the index all at the same time?
[09:05:39] I thought searchidx2 is the real indexer now?
[09:05:45] yeah
[09:05:47] it's supposed to be
[09:05:48] hmm.. the search indexes are 57GB on search4, the host has 16GB of ram, and java is using 2GB 32bit!
[09:05:49] so what the heck
[09:06:06] so the most used bits of the index have to be in the buffer cache
[09:06:09] binasher: yeah, rainman always wanted 32 bit javas because otherwise it would use double the memory
[09:06:15] who the fuck cares if it's limited to 2 GB?!
[09:06:23] true
[09:06:32] which gets flushed with the new crap when rsync writes a new index
[09:06:38] which isn't even usable yet
[09:06:46] I don't understand how that rsync can possibly be any good
[09:07:01] and then every search has to go to disk..
[09:07:19] I recall he wanted to try SSDs
[09:07:22] but that never happened
[09:07:55] how about we STOP the rsyncs
[09:07:57] and see what happens?
[09:08:05] we can CONT them later
[09:08:49] awesome
[09:09:13] but yeah, searchidx1 should be searchidx2, too
[09:09:39] unless
[09:09:41] crontab on searchidx2:
[09:09:42] 30 4 * * * /home/rainman/scripts/search-snapshot
[09:09:42] 30 9 * * 5 /home/rainman/scripts/search-snapshot-precursors
[09:09:43] 00 0 * * 6 /home/rainman/scripts/search-indexer-cron
[09:09:43] 0 2 * * * /home/rainman/scripts/search-import-private-cron
[09:09:43] 0 3 * * * /home/rainman/scripts/search-import-broken-cron
[09:09:43] 25 9 * * 1,2,3,4,5,6,7 /home/rainman/scripts/search-build-prefix
[09:09:43] # m h dom mon dow command
[09:10:00] yeah it has to be 2. it just has to
[09:10:14] getting new indexes around without killing search latency was a big problem at amazon.. writing custom code to copy it down that set posix_fadv_dontneed on files being written to, then a process to warm the buffer cache with the new file after the copy is done and it becomes the live index helped a lot
[09:10:47] hmm
[09:10:50] can't rsync do that?
[09:10:52] rsync can be patched to do the same thing
[09:10:53] would be a nice feature
[09:10:56] yeah
[09:11:38] should help with this.. maybe during the week i'll take a crack at that. sad that i'll have to build it for hardy
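
The approach described above has two halves: keep the rsync of the new snapshot from evicting the live index out of the page cache (the posix_fadvise(POSIX_FADV_DONTNEED) side, which is what the rsync fadvise patch linked a bit further down in this log does), and then deliberately warm the cache with the new files once they become the live index. The warming half needs nothing exotic; a minimal sketch, reusing the index path from the ps output above:

    # After the copy finishes and the snapshot is promoted, pull the new index
    # files into the page cache before lsearchd serves queries from them.
    # Path taken from the rsync command lines quoted above.
    find /a/search/indexes/update/enwiki.nspart1.sub2 -type f -exec cat {} + > /dev/null
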
[09:12:00] well
[09:12:09] peter was gonna build search in eqiad too
[09:12:10] that will be lucid
[09:12:15] then we can upgrade the rest
[09:12:23] some are karmic
[09:12:30] (even worse)
[09:12:39] yeah.. finally got the jar built with a current 64bit jre and into a deb on friday
[09:12:50] cool
[09:13:24] but he doesn't have much to go by on how to build new indexers and searchers except from looking at pmtpa
[09:13:35] I know
[09:13:56] that mediawiki.org link has some info
[09:14:07] (only just found out about it myself ;)
[09:15:19] I cannot find references to searchidx1 in either the global config file or the local one on search4
[09:15:22] how is that possible?
[09:17:26] it was probably removed at some point since 2010
[09:18:17] I am sure it was
[09:18:31] ah right, the rsyncs were from then
[09:18:38] decommissioning.pp:"searchidx1"
[09:18:45] :-D
[09:18:46] heh
[09:18:58] that mediawiki page recommends splitting up your index into more parts if it no longer fits in memory
[09:19:08] stab
[09:19:28] i wonder how many articles we had when that was true for us
[09:20:17] anyway
[09:20:19] stuff is still down
[09:20:22] shall we STOP that rsync?
[09:22:42] I see no rsync on search4
[09:23:01] right
[09:23:05] it finished on search1 too
[09:23:45] probably takes some time for the new index to warm up
[09:25:11] this is why i segregated the API off the normal text serving
[09:25:14] even if it's a bit of a hack
[09:26:40] yes I see at least some 200s to api opensearch requests on srv291
[09:27:07] then probably the best we can do is wait
[09:28:58] http://tobi.oetiker.ch/patches/rsync-3.0.9-2-fadvise.patch
[09:29:14] it looks like a small number of hosts not pooled now
[09:30:02] looking
[09:31:01] patches have been around since 2007 if not before that
[09:32:11] avg search latency is slowly going down
[09:32:43] hehe
[09:32:54] perhaps we should again segregate off the opensearch api queries from the other api queries ;)
[09:33:23] or make those backend api squids varnish
[09:33:34] and have some more control on the VCL level
[09:34:20] I like segregation right about now
[09:34:39] it's going to be varnish everywhere at some point, I guess, but segregation just seems smart
[09:36:23] the really easy way to accomplish that right now is to edit the api_php acl in squid.conf to only match opensearch
[09:36:32] then the other api queries will go to the normal squids and apaches (for now)
[09:38:44] dirty hack but...
[09:38:44] acl api_php urlpath_regex ^/w/api\.php.*[?&]action=opensearch
[09:38:48] this should do it right?
[09:39:14] trying it
[09:39:15] I'm not thrilled with the sq33/4/5 graphs still
[09:39:45] that looks right
[09:43:18] !log Restricted only opensearch API requests to the API backend apaches, other API requests now hit the main mediawiki cluster
[09:43:20] Logged the message, Master
[09:45:35] !log Restricted only opensearch API requests to the API squids
[09:45:36] Logged the message, Master
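
For context on what that one-line acl change does: Wikimedia's squid.conf was generated from templates, so the snippet below is only an illustrative sketch of how a urlpath_regex acl is typically wired to a backend peer in Squid, reusing the API service address and peer options shown earlier in this log. It is not the deployed configuration. With the acl tightened to action=opensearch, everything else falls through to the regular peers:

    # Illustrative squid.conf wiring (not the actual generated template):
    # only opensearch API hits go to the dedicated API backend peer.
    acl api_php urlpath_regex ^/w/api\.php.*[?&]action=opensearch
    cache_peer 10.2.1.22 parent 80 3130 no-query originserver login=PASS connect-timeout=5 max-conn=1000 name=api_appservers
    cache_peer_access api_appservers allow api_php
    cache_peer_access api_appservers deny all
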
[09:48:04] wow
[09:48:08] labs is pushing 800 Mbps
[09:48:46] why?
[09:49:21] dunno
[09:50:06] monitor_service_lvs_http { "api.svc.pmtpa.wmnet": ip_address => "10.2.1.22", check_command => "check_http_lvs!en.wikipedia.org!/wiki/Main_Page" }
[09:50:09] that doesn't look right
[09:52:31] ohh liking the sq34 graph more now
[09:53:05] but not sq36 :-(
[09:53:17] let me restart that
[09:55:29] bed time, night all
[09:55:32] night
[09:58:24] mark: you mean it should check an actual api.php link, right? suggesting api.php?action=query&meta=siteinfo
[09:58:37] for example
[10:03:37] New patchset: Mark Bergsma; "Give the API apaches a higher MaxClients setting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2529
[10:04:07] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2529
[10:04:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2529
[10:04:12] New patchset: Dzahn; "let the LVS HTTP check on api cluster check a real api.php URL instead of Main_Page" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2530
[10:08:35] !log Increased MaxClients to 100 on API apaches in Puppet
[10:08:37] Logged the message, Master
[10:14:25] well
[10:14:35] i'm gonna get out of bed now :P
[10:14:38] will check back in later
[10:14:41] ok
[10:14:45] I should get out of my pjs
[10:14:54] right now it seems... crappy but stable
[10:14:55] bye
[10:14:57] ish
[10:14:58] bye
[10:14:59] yeah
[10:15:01] bye, getting coffee
[10:25:52] hrm, another OK when it was OK already?
[10:26:37] i'll merge that change to check for api.php
[10:27:08] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2530
[10:27:08] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2530
[10:34:12] check_http -H en.wikipedia.org -I 10.2.1.22 -u "/w/api.php?action=query&meta=siteinfo" --> HTTP OK HTTP/1.1 200 OK ..
[10:34:37] * mutante bbl
[10:38:43] see ya
[10:39:25] my those are some much happier text squids now
[12:07:32] !log Rebalanced mw API app servers from load 120 to 150 in pybal list
[12:07:34] Logged the message, Master
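
The LVS check fix merged above (Gerrit change 2530) comes down to pointing the HTTP check at a real api.php URL instead of Main_Page, exactly as tested by hand with check_http at 10:34. A sketch of the corrected resource, assuming the monitor_service_lvs_http define takes the parameters shown in the one-liner quoted at 09:50; the authoritative version is the merged change in operations/puppet:

    # Sketch of the corrected LVS service check; parameter layout assumed from
    # the original one-liner above, the actual change is Gerrit change 2530.
    monitor_service_lvs_http { "api.svc.pmtpa.wmnet":
        ip_address    => "10.2.1.22",
        check_command => "check_http_lvs!en.wikipedia.org!/w/api.php?action=query&meta=siteinfo",
    }
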