[00:03:43] PROBLEM - Puppet freshness on carbon is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:01:44 PM UTC [01:25:33] (03PS2) 10Springle: icinga: update check_mysql-replication to v 0.2.6 [operations/puppet] - 10https://gerrit.wikimedia.org/r/116945 (owner: 10Matanya) [01:25:50] (03CR) 10Springle: [C: 032] icinga: update check_mysql-replication to v 0.2.6 [operations/puppet] - 10https://gerrit.wikimedia.org/r/116945 (owner: 10Matanya) [01:35:33] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [01:49:45] (03PS1) 10Hoo man: Fix title display in mwgrep [operations/puppet] - 10https://gerrit.wikimedia.org/r/118421 [01:49:53] ori: --^ Easy one ;) [01:54:13] PROBLEM - MySQL InnoDB on db1038 is CRITICAL: CRIT longest blocking idle transaction sleeps for 2147483647 seconds [01:54:25] heh [01:54:29] -1 [01:54:50] ... nice uptime :D [01:56:00] 68.05 years o.O [01:56:13] RECOVERY - MySQL InnoDB on db1038 is OK: OK longest blocking idle transaction sleeps for 0 seconds [01:57:02] 2147483647 seconds ago was 22nd February, 1946 according to Wolfram Alpha [01:57:18] that's about right [01:57:31] given that unix time runs out in '38 and started in '70 :P [01:59:01] db1038 needs upgrade to 5.5.34 to fix that [02:17:46] !log LocalisationUpdate completed (1.23wmf16) at 2014-03-13 02:17:46+00:00 [02:17:55] Logged the message, Master [02:32:34] bd808: Sorry, I'm running super late [02:32:55] Ryan_Lane: No worries. Such is life [02:33:21] I'm fighting with puppet in labs to kill time ;) [02:33:25] On a bus. Should be home soonish [02:33:52] :D [02:33:53] Puppet will do that to you [02:33:54] !log LocalisationUpdate completed (1.23wmf17) at 2014-03-13 02:33:54+00:00 [02:34:02] Logged the message, Master [02:41:57] bd808: ok, sorry about that [02:42:00] hangout? [02:42:19] Yeah. I'll pm the link [02:55:43] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [03:04:33] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [03:04:43] PROBLEM - Puppet freshness on carbon is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:01:44 PM UTC [03:14:52] bd808: See bz; just don't create the group at all since it already exists globally. :-) [03:15:59] bd808: See also https://gerrit.wikimedia.org/r/#/c/118071/ [03:16:33] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [03:16:36] Coren: I'll check it out [03:17:07] (03CR) 10coren: [C: 032] "Made (annoyingly) necessary by nova-network's lack of configured IPv6" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118301 (owner: 10Tim Landscheidt) [03:18:33] Coren: Ah ha! That is exactly the problem I'm hitting :) [03:18:55] (03PS5) 10coren: beta: skip l10nupdate user/group creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/118071 (owner: 10Hashar) [03:19:02] * Coren attaches the bz [03:19:57] Aaugh; tabspacezmixx0rz! [03:21:01] (03PS6) 10coren: beta: skip l10nupdate user/group creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/118071 (owner: 10Hashar) [03:21:12] Sorry about the spam. [03:22:23] bd808: So yeah, that change should exactly fix the issue you have. 
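The 2,147,483,647 seconds in the db1038 InnoDB alert above is INT_MAX for a signed 32-bit integer, i.e. an overflowed/sentinel value rather than a real idle time (hence the note that db1038 needs the 5.5.34 upgrade). A quick sanity check of the numbers joked about in the channel, plain Python with nothing WMF-specific:

    from datetime import datetime, timedelta

    INT32_MAX = 2**31 - 1                          # 2147483647
    years = INT32_MAX / (365.25 * 24 * 3600.0)     # ~68.05 Julian years
    then = datetime(2014, 3, 13, 1, 56) - timedelta(seconds=INT32_MAX)

    print(round(years, 2))   # 68.05, the "uptime" quoted above
    print(then.date())       # 1946-02-22, matching the Wolfram Alpha answer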
[03:23:16] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Mar 13 03:23:13 UTC 2014 (duration 23m 12s) [03:23:24] Logged the message, Master [04:14:33] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [05:56:43] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [06:05:43] PROBLEM - Puppet freshness on carbon is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:01:44 PM UTC [06:16:55] 23:23 <+logmsgbot> !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Mar 13 03:23:13 UTC 2014 (duration 23m 12s) <--- That's about half the time it took the last 4 days: http://paste.debian.net/87382/ [06:17:51] * greg-g makes a note, but there's probably enough variation in the work that this is normal, we just don't know [06:19:27] greg-g, I'm noticing extreme Parsoid slowness in prod, any idea what might be going on? [06:20:17] I know gwicke deployed but wasn't able to restart the parsoid services on all boxes, not sure if that's related [06:21:15] Eloquence: the boxes don't look busy: https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Parsoid+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [06:21:38] but.. there is that dip in performance [06:22:18] I'll page Gabriel and Roan, spending a full daylight cycle in Europe with massive performance regression would be no fun. [06:23:22] thanks [06:23:49] do you mind seeing if you can repro before I do so? [06:24:08] if I special:random on a few pages on en.wp with VE enabled, some of them will take 30+ secs [06:24:16] some will be zippy [06:26:30] yeah, confirm, The_Last_Day_(Doctor_Who), which is pretty short, is taking a very long time [06:30:18] good evening RoanKattouw [06:30:21] ^^ [06:31:04] Hey there [06:31:07] Do we have slowness on load or on save? [06:31:13] I just tested initial load [06:31:18] If this is a backend issue then saves should be consistently slow but loads should only be slow for some pages [06:31:18] I've not paged Gabriel yet but happy to do so if needed. [06:31:48] yeah, loads are extremely slow on some pages, even short ones. once a page has been loaded it's subsequently fast [06:32:20] I can send you a HAR of a 40 second request if that's helpful [06:32:34] https://en.wikipedia.org/w/api.php?format=json&action=visualeditor&paction=parse&page=Charles_Simmons_(gymnast) took 40 seconds [06:32:36] When did this start? [06:32:47] I noticed it ~20 mins ago [06:32:54] but https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Parsoid+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 looks odd [06:33:03] there's a suspicious drop in network traffic [06:33:27] That happened earlier today [06:33:31] I think there was a Parsoid deployment [06:33:47] 20:02 gwicke: deployed Parsoid 004c7acc with deploy f97820a2; restart todo [06:33:47] greg says the deploy didn't succeed and gwicke wanted to try again tomorrow. 
[06:33:50] yeah, but the services weren't all restarted successfully [06:33:58] yeah, that [06:34:15] The drop follows daily trends: https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=Parsoid+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [06:34:16] the restart never happened, he's planning to do it tomorrow during LD [06:34:29] What's weird is traffic didn't rise today [06:35:10] yeah, network and cpu haven't gone back up to normal [06:35:20] OK this network graph is weirder: https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=Parsoid+Varnish+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [06:35:27] just paged gwicke as well [06:35:40] The deployment was way later than that though [06:35:44] more clear there [06:35:57] Let's see if I can find Ori's client-side load time metrics [06:36:02] If those aren't still broken [06:36:14] deploy happend at 20 UTC [06:36:16] https://ganglia.wikimedia.org/latest/?r=4hr&cs=03%2F12%2F2014+17%3A00+&ce=03%2F13%2F2014+06%3A00+&m=cpu_report&s=by+name&c=Parsoid+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [06:36:35] it coincides [06:36:35] https://gdash.wikimedia.org/dashboards/ve/ [06:36:42] these look crazy. [06:36:55] i.e. they confirm what we're seeing in prod, I think. [06:36:56] (03PS1) 10Ryan Lane: Fix eqiad labs range [operations/puppet] - 10https://gerrit.wikimedia.org/r/118431 [06:37:19] yeah [06:38:44] hi [06:38:54] hi gwicke, see ^^ [06:39:20] so we didn't actually deploy anything since Monday last week as trebuchet was broken since then [06:39:43] today the only thing that effectively happened was a restart [06:40:02] so "20:02 gwicke: deployed Parsoid 004c7acc with deploy f97820a2; restart todo" is wrong? [06:40:14] that's what I thought had happened [06:40:35] trebuchet lied and didn't update the submodule with the code [06:40:39] the exact opposite happened? :) [06:40:43] great [06:40:56] nothing happened, only Coren's service restart as root worked [06:41:00] * greg-g ndos [06:41:02] (salt is also broken btw) [06:41:32] gwicke, something happened around 15:00 UTC per https://gdash.wikimedia.org/dashboards/ve/ [06:41:32] so, any ideas about the drop in performance? [06:41:45] the timing on the VE dashboard looks like it coincides with the restart [06:41:46] I checked how happy LVS/pybal are about the parsoidcache and parsoid clusters [06:41:49] Their answer is "never happier" [06:41:59] the load also looks fine [06:42:08] Heartbeat and trivial HTTP requests are fine on all backends [06:42:18] did you check bits etc? [06:42:37] this is not just a measure of parsoid load performance, but also RL etc [06:43:32] Yeah I haven't checked those yet [06:43:34] Also, there's the API [06:43:44] Gloria, are you noticing any site issues other than VE/Parsoid being slow? [06:44:05] I noticed that at [13:36] James_F, RoanKattouw is it just me or have you found ve loading/saving to be slow on mediawiki.org -- I ran into this on a couple different pages that I tried editing today. 
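For reproducing measurements like the 40-second action=visualeditor request above, a small timing helper against the Parsoid front end is handy. This is only a sketch: it assumes the parsoid-lb URL shape and the X-Parsoid-Performance response header that appear later in this log, uses the third-party requests library, and busts the Varnish cache with a throwaway query parameter (the trick suggested further down) so that an uncached parse is measured; the title/oldid pair is one of the examples tested below.

    import time
    import uuid
    import requests

    def time_parsoid(title, oldid):
        # Hit the Parsoid cache/LB directly; the random parameter keeps
        # Varnish from answering straight out of cache.
        url = "http://parsoid-lb.eqiad.wikimedia.org/enwiki/%s" % title
        params = {"oldid": oldid, "cachebust": uuid.uuid4().hex}
        start = time.time()
        resp = requests.get(url, params=params, timeout=120)
        elapsed = time.time() - start
        # Parsoid reports its own timing in this header, e.g.
        # "duration=32359; start=1394697324571" (milliseconds).
        print("%s: %.1fs client-side, %s" % (
            title, elapsed, resp.headers.get("X-Parsoid-Performance")))

    time_parsoid("Harlingen,_Friesland", 572387747)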
[06:44:20] http://en.wikipedia.org/wiki/Harlingen,_Friesland was snappy for me, 788 ms [06:44:29] Grah, that was the live site [06:44:35] API cluster is seeing a crazy spike: https://ganglia.wikimedia.org/latest/?c=API%20application%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [06:44:36] Thanks Parsoid, for based links :S [06:44:59] http://parsoid-lb.eqiad.wikimedia.org/enwiki/Harlingen%2C_Friesland?oldid=572387747 was 579ms [06:45:03] I haven't had any of my random articles able to load in VE yet [06:45:05] (03PS4) 10Ryan Lane: Simplify trebuchet developer environment creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112315 [06:45:06] Holy crap [06:45:12] Yeah they're getting hammered, no wonder Parsoid is slow [06:45:44] should I page some opsen? [06:45:57] cpu-wise the cluster looks fairly normal [06:46:05] is varnish bottlenecking? [06:46:24] Hmmmm [06:46:28] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Parsoid%2520eqiad&tab=m&vn= [06:46:32] I'm trying to get a measure of how slow an uncached page is [06:46:39] 11-12% cpu [06:46:46] https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=API+application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 api cluster last week, nothing weird for today [06:46:46] that's low average [06:46:56] gwicke: If the API is overloaded, that will slow down Parsoid, it will spend more time waiting for the API [06:47:13] * Jasper_Deng always had his doubts about having Parsoid as an external service, but he's not a VE designer [06:47:49] the API looks busier than normal [06:47:50] Hah, now I'm getting somewhere, http://parsoid-lb.eqiad.wikimedia.org/enwiki/INS_Pratap_(K92)?oldid=542183384 took 4 seconds to 302 me [06:48:20] also not very balanced, with many machines >= 50% cpu [06:48:36] Yeah [06:48:40] Going to check on LVS for the API cluster [06:49:11] maybe somebody started some expensive operations from their cell phone [06:49:58] OK I've now gotten http://parsoid-lb.eqiad.wikimedia.org/enwiki/Chamak-class_missile_boat to take forever [06:50:57] Loaded in 2 minutes [06:51:39] do we have response time metrics for the API? [06:51:51] (03PS5) 10Ryan Lane: Simplify trebuchet developer environment creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112315 [06:52:06] and top entry points? [06:52:39] gwicke, I only see a method breakdown at https://gdash.wikimedia.org/dashboards/apimethods/ [06:52:42] I'm not sure that we do [06:53:46] those labels are not very helpful [06:54:00] everything seems to be MediaWiki.API [06:54:33] browsing around graphite now [06:55:39] (03CR) 10Ryan Lane: [C: 032] Simplify trebuchet developer environment creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112315 (owner: 10Ryan Lane) [06:57:35] at first sight I don't see anything directly useful in graphite [06:57:38] WTF [06:57:41] The API cluster is reverse-weighted [06:57:49] The boxes with fewer CPUs have higher waits [06:57:51] *weight [06:58:10] fun [06:58:21] Also, why are our DSH groups never up to date *grumble* [06:58:32] there seem to always have been two classes of machines in the app server cluster [06:58:48] one heavily loaded, and the other maybe 2/3 to 1/2 the load [06:59:48] it's almost 9am in greece... 
if you want help from an opsen [07:00:00] * RoanKattouw generates list of API servers from pybal manifest [07:00:10] gwicke: https://noc.wikimedia.org/pybal/eqiad/api [07:00:12] the overall load avg on the apps server cluster is 64% currently [07:00:29] normal is around 20 [07:02:16] I don't know the # of cpus on those boxes [07:02:27] paravoid: ping [07:03:11] I'm going to crash to bed as soon as paravoid shows up [07:03:16] I just measured that [07:03:24] dsh -cM -f apiservers -- nproc | sort -k 2 -r [07:03:29] It's exactly inverse to the load settings [07:03:35] greg-g, crash now if you want, I'll still be around for a bit as well. [07:04:14] Eloquence: kk, godspeed [07:04:20] :) [07:04:26] I'm going to fix the weights [07:04:27] there is api.log on fluorine [07:05:48] !log Changed pybal weights for eqiad API cluster to # of CPUs on each machine; weights were backwards (machines with fewer CPUs had higher weights) [07:05:56] Logged the message, Mr. Obvious [07:06:02] (03PS1) 10Ryan Lane: Add missing config from role::deployment::salt_masters::labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/118432 [07:06:13] Since we're definitely in ops-y territory now I'll go ahead and page paravoid :) [07:06:54] OK so the API looks to have had two load spikes, the latter of which seems to have subsided while I was doing the reweighting [07:07:32] parsoid slowness continues [07:07:59] (03CR) 10Ryan Lane: [C: 032] Add missing config from role::deployment::salt_masters::labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/118432 (owner: 10Ryan Lane) [07:08:20] parsoid itself looks fine [07:08:24] actually, getting better [07:10:05] (not for me) [07:10:54] yeah, only sporadic [07:10:58] still getting lots of sloow requests [07:11:11] one 30s+ just now [07:11:18] actually paging faidon now [07:11:29] Hmm, so the only thing I can think of [07:11:37] Is Daniel disabling Poolcounter hosts in Tampa [07:12:16] RoanKattouw, https://gdash.wikimedia.org/dashboards/poolcounter/ [07:12:17] Which he doesn't seem to have actually done (?) [07:12:32] I did notice a spike in poolcounter latency, but it seems to average out over the long run [07:12:45] Whoa [07:12:51] That spike is bigger than you realize [07:12:54] That graph is log scale [07:13:11] PC *average* client latency jumped from ~1s to ~200s [07:14:54] the API ist still way too busy [07:15:17] Whoa, we just had another load spike there [07:16:49] the annoying thing is that it's so trivial to bring down the entire API with a handful of requests [07:17:10] makes it hard to find the culprit as it's a needle in a haystack [07:17:21] Well PC latency is really high [07:17:25] So I think something is going on there [07:17:34] But I don't know PC well enough to investigate [07:17:51] RoanKattouw, but that does seem to occur regularly in the weekly graph: https://graphite.wikimedia.org/render/?title=PoolCounter%20Client%20Average%20Latency%20(ms)%20log(2)%20-1week&from=-1week&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle(MediaWiki.PoolCounter.Client.*.tavg) [07:18:22] That's .... 
very odd [07:18:35] Daily, actually [07:18:41] But you're right this is perfectly regular [07:19:05] And the API has had crazier load spikes: https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=API+application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [07:19:10] I see no correlation between the parsoid load and the API load [07:19:11] API load is actually *low* right now [07:19:44] so it seems likely that something is using the API heavily [07:19:52] Hah and PC has dropped back down again [07:20:13] If we had more useful per-method API data in graphite (with actual labeling), we might know more [07:20:14] RoanKattouw, yeah. seems the issue is actually limited to Parsoid no? [07:20:35] For now my theory is that someone's hitting the API with a bunch of heavy parse requests every 24h around this time of day [07:20:47] Possibly in batches, waiting until an entire batch finishes [07:21:07] a bot ? [07:22:05] Maybe [07:22:19] We should do origin IP analysis on the API request logs for this time frame [07:22:42] Data point: http://parsoid-lb.eqiad.wikimedia.org/enwiki/Vidyut-class_missile_boat?oldid=584648112 just took 41s [07:22:46] That's still slow [07:22:50] hitting us with lots of parse requests is definitely something that a certain search engine asked us about (gwicke knows all about that) [07:22:55] But not 2 minutes [07:23:16] the VE slowness started over an hour after Coren's parsoid restart [07:23:39] actually more than two hours later [07:24:56] there were some no-op scaps around that time [07:25:22] I'm analyzing api.log for 06:00-06:59 [07:25:29] catrope@fluorine:/a/mw-log$ grep '^2014-03-13 06:' api.log | cut -d ' ' -f 8 | sort | uniq -c | sort -rn | head -n 25 [07:26:16] Ugh, that fields is internal IPs [07:26:21] Helpful [07:26:46] the problem with looking at the log is that volume is so decoupled from cost [07:28:06] Or... hold on [07:28:24] Really what seems to be going on is most top IPs are internal but some are external [07:28:42] Most top IPs are wtp* [07:28:50] hey [07:28:50] gwicke: There is duration info though [07:29:05] sorry, just saw the text :( [07:29:16] hi paravoid [07:29:47] OK here are some WTFs from the log: [07:29:50] https://gdash.wikimedia.org/dashboards/ve/ [07:30:02] 2014-03-13 06:25:34 mw1190 enwiki: API GET [IP REDACTED] [IP REDACTED] T=9606ms format=xml action=query list=search srsearch=alabama%20national-archives [07:30:05] Why [07:30:14] Why does a simple search query like that take 9.6 seconds [07:32:42] are there any stats on the number of client connections to the api? 
[07:32:53] Filtering api.log for >10s durations gives me only action=visualeditor (which blocks waiting for Parsoid) but also a few search queries [07:33:07] ap_rps [07:33:50] http://ganglia.wikimedia.org/latest/stacked.php?m=ap_rps&c=API%20application%20servers%20eqiad&r=hour&st=1394695990&host_regex= [07:34:25] paravoid: That big shift you see there is me reweighting the API apaches in LVS [07:34:33] We have boxes with 16 cores and boxes with 24, according to nproc [07:34:42] The 16-core ones had weight 20 and the 24-core ones had weight 10 [07:35:06] That seemed stupid and the load imbalance was very apparent in Ganglia, so I set the 16-core ones to weight 16 and the 24-core ones to weight 24 [07:35:33] 24 core ones are 12 core with HT enabled most probably [07:35:44] and I think API is balanced by memory [07:35:50] Aaah [07:36:08] I would be happy to revert the weighting change if the original weighting made sense in some way [07:36:38] But at the time we were in the middle of a load spike and the weight 20 boxes were at 100% CPU while the weight 10 ones weren't breaking a sweat [07:36:40] VE latency seems to recover lately [07:37:06] the new boxes have 64G of ram, the old ones have 12G :) [07:37:48] and I've been told that API operations are more memory intensive than CPU intensive [07:38:03] there are a few [07:39:02] paravoid: Re memory, the 16-core boxes have 11G of memory and the 24-core boxes have 62G [07:39:41] no, that's backwards [07:40:02] Oh, wait, yes it is [07:40:14] That makes the original weighting make more sense [07:40:25] In general I could see API load being more memory-intensive [07:40:38] An hour ago that was definitely not the case, and I didn't think to look for memory [07:40:42] * RoanKattouw reverts weight change [07:40:56] I also think 24 is not 24, it's just 12*2 threads ;) [07:41:05] two 6-core CPUs [07:41:05] gwicke, it seems to be recovering a bit but I'm still getting the occasional 40sec request [07:41:11] Right [07:41:21] so do we know a root cause yet? is it related to the API load? [07:41:23] paravoid: Are you editing the pybal manifest right now? [07:41:23] (03PS4) 10Matanya: nfs: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/109081 [07:41:31] no, go ahead [07:41:52] Eloquence: yes, there's API load that is starving the appservers [07:41:56] the API appservers [07:41:57] Eloquence, looks very much correlated with API (over)load to me [07:42:09] they hit 100% CPU at times, so this would explain delays [07:42:27] !log Reverted API pybal weights back to original values; apparently makes sense given amount of memory [07:42:35] Logged the message, Mr. Obvious [07:42:53] RoanKattouw> For now my theory is that someone's hitting the API with a bunch of heavy parse requests every 24h around this time of day [07:42:53] 18<RoanKattouw> Possibly in batches, waiting until an entire batch finishes [07:42:57] can we verify that theory? 
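For the record, the reweighting RoanKattouw logs above amounts to "weight = core count reported by nproc" for each API backend. A purely illustrative sketch of that mapping, assuming dsh -M style "hostname: value" output and the one-dict-per-server line format of the published pybal backend lists; the real change was an edit to the pybal configuration itself, and as the discussion below shows it was reverted, because the two hardware generations are really balanced by memory and nproc counts hyperthreads rather than physical cores.

    import sys

    # Read "host: nproc" pairs, e.g. piped from
    #   dsh -cM -f apiservers -- nproc
    # and emit pybal-style server entries weighted by reported core count.
    for line in sys.stdin:
        if not line.strip():
            continue
        host, cores = line.split(":", 1)
        print("{ 'host': '%s', 'weight': %d, 'enabled': True }"
              % (host.strip(), int(cores)))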
[07:43:04] With logs, possibly [07:43:15] I was working on doing some analysis taking running time into account [07:43:15] I'm grepping through logs already :) [07:43:20] ok :) [07:43:25] But I got distracted wondering why search requests routinely take >10s [07:43:31] starting around 3pm pacific [07:43:41] Also, action=visualeditor API requests block waiting for Parsoid, so they're red herrings [07:43:41] I missed that line above, but I had the same theory [07:43:56] I know that [search engine name redacted] was asking to hit our API with more frequent requests to do full parses on articles [07:43:58] They should be spread over lots of IPs though so hopefully that will drop out [07:44:15] as I recall we discouraged them from doing so and asked them to wait for Gabriel's magical cassandra beast to emerge from the darkness [07:44:23] Eloquence: But presumably not in 3 large batches around midnight PDT, then staying quiet the rest of the day? [07:44:32] hitting Parsoid would also be fine as that's cached [07:44:36] That sounds like something they would have mentioned [07:44:49] Also, these would have to be non-pcached parses [07:45:23] RoanKattouw, They didn't specify, and in the discussion we pointed them both to the existing parsoid web service and the magical future. [07:46:05] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=API%20application%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1394696706&g=network_report&z=large shows no growth in traffic [07:46:19] ...and logs confirm that [07:46:42] hrm [07:47:19] i need Joel Krauska's email, anyone has it? [07:47:48] matanya, jkrauska at wikimedia dot org [07:47:55] thanks [07:49:37] RoanKattouw, the specific API requests they were asking about were action=parse and action=expandtemplates [07:49:38] in the daily pattern the VE activation time now looks very much back to normal [07:49:46] Right [07:49:57] paravoid: What do the logs confirm? [07:51:29] no noticeable abnormalitie in traffic volume [07:52:45] gwicke, yeah, but I just did one random request and am still getting 40s+ load time. [07:52:56] on which page? [07:53:05] this was https://en.wikipedia.org/wiki/Gagebrook,_Tasmania?veaction=edit , now cached. [07:54:21] same just now on https://en.wikipedia.org/wiki/Guoba?veaction=edit [07:55:19] I just got one 16s load on Gagebrook [07:55:29] most are around 1s [07:55:39] all directly through the parsoid api [07:55:49] (append a random parameter for cache busting) [07:56:48] i get : X-Parsoid-Performance duration=32359; start=1394697324571 [07:56:54] in doubt it might make sense to restart some apaches [07:57:29] parsoid looks very much API blocked, there is little CPU load [07:58:00] do we have some tracking of avg response times by api box? 
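A sketch of the per-client accounting behind the fluorine:/home/catrope/apidata file mentioned above ("IP reqs time"). It assumes the api.log layout visible in the sample line quoted earlier: the client address in the 8th whitespace-separated field, as in the cut -f 8 one-liner, and the duration as T=<n>ms in the 9th. It also skips action=visualeditor entries, since those just block on Parsoid, as noted above.

    import re
    import sys
    from collections import defaultdict

    reqs = defaultdict(int)
    total_ms = defaultdict(int)

    for line in sys.stdin:                    # e.g. piped from api.log
        if "action=visualeditor" in line:     # blocks on Parsoid; skip
            continue
        fields = line.split()
        if len(fields) < 9:
            continue
        m = re.match(r"T=(\d+)ms", fields[8])
        if not m:
            continue
        client = fields[7]                    # same field as `cut -f 8`
        reqs[client] += 1
        total_ms[client] += int(m.group(1))

    # Top clients by aggregate API time, "IP reqs time" style.
    for ip in sorted(total_ms, key=total_ms.get, reverse=True)[:25]:
        print("%s %d %.1fs" % (ip, reqs[ip], total_ms[ip] / 1000.0))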
[07:58:27] no [07:58:37] but you can infer some of that from apache ganglia metrics [07:59:43] I'm now computing which clients are responsible for the most aggregate API time [08:00:07] it's also slow when trying with a local parsoid [08:00:29] api appservers are not that busy now [08:00:48] sometimes at least [08:02:13] the last attempts were all fast though [08:02:35] (03PS5) 10Matanya: nfs: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/109081 [08:04:48] most are around 1-2 seconds, but some take 16 [08:06:23] prod looks slower though [08:08:53] http://el.wikipedia.org/w/api.php?action=parse&oldid=4308370&format=json&prop=text|langlinks|categories|categorieshtml|langlinks|links|iwlinks|templates|images|externallinks|sections|revid|displaytitle|headitems|headhtml [08:09:15] that takes a while [08:10:06] so there's that, plus POSTs (so no params) with a Java UA [08:10:58] regularly in the tens of seconds [08:11:13] you can also do at least 50 of those per request with generators [08:13:02] paravoid, were there any changes in the LVS area recently? [08:13:06] Hmmm [08:13:10] I have some interesting data here [08:13:44] See fluorine:/home/catrope/apidata [08:13:45] the elwiki ones match up with the cpu spikes [08:13:51] in timestamps [08:14:09] That's "IP reqs time" for all of api.log [08:14:20] i.e. the past hour or so [08:14:37] The 5 top IPs are all in the same /24 [08:14:45] ...and they are the IPs hitting the elwiki links above [08:14:48] paravoid, those suspicious greeks again! :P [08:14:58] that and nothing else [08:15:08] who.is doesn't tell me much about those IPs [08:15:11] Eloquence: heh, the irony [08:15:15] Except that they're apparently reserved for the US [08:15:20] It's an ARIN range [08:15:45] it's a hosting provider [08:16:37] okay, they stopped doing requests at 07:13 UTC, which about when API recovered (+50-100s for their requests to finish...) [08:16:42] hah [08:16:44] that explains that [08:16:55] Hmm, according to this who.is data this IP block was allocated less than a month ago?!? [08:17:01] but it doesn't explain why Eloquence & gwicke are seeing intermittent issues moments ago [08:17:07] were* [08:18:10] getting mostly fast responses now, but still the occasional slow one. [08:18:34] but the slow one I just got was ~16s, not ~40s as before [08:18:45] gwicke: the answer to your question is "yes, but not in any area that should matter for this" [08:18:45] A ~16s response for a large article wouldn't be too weird [08:19:08] I'm still seeing around 16 seconds for a small article that normally takes 1s [08:19:11] morning akosiaris [08:19:20] good morning [08:19:24] also 40 seconds is pretty exactly the max [08:19:29] aah the answer to your question yesterday [08:19:35] RoanKattouw, this is https://en.wikipedia.org/wiki/Plaka_Pilipino?veaction=edit so not very large :) [08:19:39] which is somewhat suspicious [08:19:54] https://graphite.wikimedia.org/render/?title=VisualEditor%20activation%20time,%20one-minute%20sliding%20window,%20last%20day&vtitle=milliseconds&from=-1day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=alias%28color%28ve.performance.system.activation.median,%22blue%22%29,%22Median%22%29&target=alias%28color%28ve.performance.system.activation.75percentile,%22red%22%29,%2275th%20percentil [08:19:55] matanya: yes you can link dsh/apaches to dsh/apache-eqiad [08:20:00] and puppet will do what is right [08:20:16] oh, great. 
submitting a patch [08:20:24] and if we ever need to split them up in the future again we can do it then [08:20:39] thanks. one more question, what would be the fate of tarin? [08:21:06] i.e. https://rt.wikimedia.org/Ticket/Display.html?id=6265 [08:21:17] Hmm https://en.wikipedia.org/w/index.php?title=Plaka_Pilipino&oldid=462522756&veaction=edit loaded fine in 1.34s [08:21:28] (Older rev of the page because you've already put the latest one in cache) [08:22:55] matanya: we already got poolcounters in eqiad, so I 'd say just shutdown [08:23:00] RoanKattouw: re: search: http://gdash.wikimedia.org/dashboards/searchlatency/ [08:23:14] akosiaris: so no ganglia for misc's ? [08:23:31] OK so search is actually crazy slow [08:23:55] paravoid, I'm still seeing weird timings for prod parsoid [08:24:05] 16 seconds is a common number [08:24:10] and 40 again [08:24:29] we are directly pointing to the API LVS IP for most wikis [08:24:31] yes, VE latency graph confirms that something's not right [08:24:41] matanya: crap it is an aggregator for pmtpa ... [08:24:47] oh did that get fixed? [08:24:51] not going through varnish anymore? [08:24:58] yeah, a while ago [08:25:11] right now going through varnish looks faster though [08:25:11] I remember reverting the config for my patch since it was broken for https wikis [08:25:21] matanya: I see you already submitted patches for decom ersch, [08:25:26] i have [08:25:37] at least that's what's my local parsoid does, and it gets more consistent results [08:25:44] but not sure about tarin, becuse of ganglia [08:26:10] but if you say it is ok, i'll push the patch for tarin too [08:26:28] no dont, I had not seen that [08:26:46] https://graphite.wikimedia.org/render/?title=Top%2010%20API%20Methods%20by%20Max%2090%25%20Time%20%28ms%29%20log%282%29%20-1day&from=-1day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle%28substr%28highestMax%28maximumAbove%28MediaWiki.API.*.tp90,1%29,10%29,0,2%29%29 [08:26:50] seems like we will keep him until the end (well close to that) [08:26:51] is weird [08:26:53] probably broken? [08:27:05] didn't i say that in the ticket? [08:27:10] * matanya hits his head [08:27:19] ahhh [08:27:27] paravoid, that's interesting [08:27:35] coincides with the start of the issues [08:27:49] (the blue line, whatever that is) [08:28:03] that legend is awesome isn't it ;) [08:28:19] yeah, so close yet so far [08:28:46] https://graphite.wikimedia.org/render/?title=Top%2010%20API%20Methods%20by%20Max%2090%25%20Time%20%28ms%29%20log%282%29%20-1day&from=-1day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle%28highestMax%28maximumAbove%28MediaWiki.API.*.tp90,1%29,10%29,0%29 [08:28:59] MediaWiki.API.visualeditor is the blue line [08:29:09] Well yeah [08:29:16] action=visualeditor does a curl request to Parsoid [08:29:23] So it'll block for however long Parsoid takes [08:29:31] heh [08:29:50] That's why I used grep -v visualeditor for my api.log analysis, I didn't have that at first and it polluted the data [08:29:57] Turns out we had quite a lot of VE usage from a certain university NAT :) [08:30:10] alright, I'm heading off to bed. thanks all for the debugging so far. public service advisory - it is now 1:30 AM in SF, so if this is mostly an API issue rather than a Parsoid issue, perhaps RoanKattouw and gwicke can go to bed as well and ops can continue to investigate? 
Remember to get enough sleep, no matter how :) [08:30:16] could there be some network issues between the parsoid cluster and the API LVS? [08:31:47] Yeah I need to go to sleep [08:32:00] I'll follow soon as well [08:32:01] goodnight folks [08:32:09] gwicke: If switching to Varnish mitigates the issue, something weird is going on [08:32:21] Either in the network itself, or in LVS which is basically the network for all intents and purposes [08:32:57] paravoid: just as a sanity check, can you do a 'dsh -g parsoid service parsoid restart' as root ? [08:34:30] done [08:34:55] I got a 'service unavailable' error btw [08:35:07] did you run the restart in parallel? [08:35:10] springle: regarding d10, do you want to finish decom ? [08:35:16] no [08:35:30] hm, k [08:36:33] still getting slow responses occasionally [08:37:07] oh [08:37:09] hey, I have an idea [08:37:11] give me a sec [08:39:45] gwicke: how 'bout now? [08:39:58] pretty sure I fixed t [08:39:59] *it [08:40:24] yeah, looks much better [08:40:29] What was it? [08:40:30] and if I had my morning coffee, I'd have fixed it earlier [08:40:40] now I'm *very* curious [08:40:46] chris moved mw1201, mw1202, mw1203 to row D yesterday [08:41:21] there is a pybal "limitation" combined with chris not doing the exact right steps [08:41:26] moving them to row D means renumbering them [08:41:33] and also means changing them in DNS [08:41:54] so as soon as their IPs swapped in DNS, pybal completely lost track of the old IPs of those servers [08:42:02] they remained pooled, despite being "down" [08:42:06] those are the hosts that looked down in ganglia [08:42:43] yes, and I thought they were down [08:42:49] then I correlated that with SAL [08:43:31] and let me fix it on the backup LVS now as well [08:43:50] why didn't I see the same from my local parsoid? [08:44:06] hitting varnish you mean? 
[08:44:11] yeah [08:44:24] because varnish retries on failures/timeouts [08:44:50] ah, and probably does so more quickly than parsoid [08:44:54] yeah [08:45:02] you probably have a 15s timeout [08:45:08] you = parsoid [08:45:16] yeah, and something around 40 [08:45:16] you were getting 16s for pages that should parse in 1s you said [08:45:27] we do an exponential back-off [08:46:18] all makes sense now [08:46:29] I guess we should maybe also log API request retries [08:46:38] yeah [08:46:46] there's a few steps we can take [08:47:06] but ideally we'd have better monitoring on the api itself [08:47:24] so, that's funny [08:47:26] we have all the metrics [08:47:34] we just have no alerts [08:47:50] we csn add them with logstash [08:47:51] we do have check_graphite though and we could use that to poll the metric, if we find sensible values [08:48:35] (03PS2) 10Odder: Account creation throttle for ptwiki outreach event [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118364 [08:48:37] https://graphite.wikimedia.org/render/?title=Top%2010%20API%20Methods%20by%20Max%2090%25%20Time%20%28ms%29%20log%282%29%20-1day&from=-1day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle%28substr%28highestMax%28maximumAbove%28MediaWiki.API.*.tp90,1%29,10%29,0,2%29%29 confirms the fix [08:48:52] *nod* [08:49:13] !log API slowness (affecting Parsoid/VE, among others) there since ~12:00 UTC found and fixed [08:49:22] Logged the message, Master [08:49:31] if varnish exposes backend issues then that would be a good thing to monitor [08:49:39] 13:00 UTC, but oh well [08:50:07] anyway, I'm hitting the sack [08:50:12] see you tomorrow! [08:50:20] sorry for the trouble folks [08:50:26] and thanks for all the help [08:50:36] clearly not a parsoid/VE issue itself :) [08:50:44] glad that you found it [08:50:57] paravoid: Aaaaaah [08:50:59] Riiiight [08:51:11] I did see that mw120{1,2,3} had zero load in Ganglia and were up according to pybal [08:51:21] But I dismissed it and didn't look into it further [08:51:28] the signs were all there, I dismissed them too [08:51:51] until it just clicked, without me looking at anything :) [08:51:55] Also, monitoring won't necessarily help much for this scenario [08:51:56] one of these moments [08:52:02] Because icinga is able to get to mw1201 just fine [08:52:16] RoanKattouw, varnish knows the backend is broken [08:52:30] no, but it kind of sucks to have this issue appear clearly on graphite, yet get notified by users (Erik in this case) [08:52:47] so at minimum, it'd be nice to alert on increased latency for API operations [08:52:52] (although varnish might only see the LVS IP) [08:53:18] gwicke: Varnish only sees the LVS IP but retries more aggressively [08:53:29] The backend itself was probably fine when accessed directly [08:53:52] LVS monitoring / logging could be better I guess [08:54:00] and automatic depooling [08:54:12] gwicke: We have all that [08:54:23] But all of it was collectively making the same mistake it seems [08:54:32] yeah, the failure was with pybal not being that smart about DNS changes [08:55:22] LVS auto depooling is actually really fast [08:55:42] pybal has a persistent HTTP connection to each backend that it periodically reestabilishes [08:55:48] Sorry, TCP [08:55:59] indeed [08:56:04] If anything weird happens to that heartbeat connection --> instant depool [08:56:20] There are also periodic (every ~5s) checks using actual HTTP requests, those have to succeed [08:56:26] If 
any of those fails, depool [08:56:36] the bug is when you change the IP a hostname is pointed at behind its back [08:56:43] it gets all confused [08:56:49] In order to repool, a machine needs to have a functional heartbeat AND the periodic check needs to succeed 3(?) times in a row [08:57:07] or when the server returns error HTTP responses to app requests but not the test URL [08:57:43] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [08:57:56] okay, I'll send a proper outage report later today [08:58:01] and we can discuss on-list [08:58:03] sounds good? [08:58:13] yup, thanks! [08:58:20] sweet dreams :) [08:58:21] Yeah, let's [08:58:30] gwicke: Yeah the test URL thing.. we had the opposite issue a while ago [08:58:49] Where Parsoid was working just fine but was breaking for /_html or something [08:59:15] paravoid: wrote a check, the -t should be Top 10 API Methods by Max 90%25 Time (ms) log(2) -1day ? [09:06:22] (03CR) 10Alexandros Kosiaris: [C: 032] webserver: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/110454 (owner: 10Matanya) [09:06:43] PROBLEM - Puppet freshness on carbon is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:01:44 PM UTC [09:07:54] (03CR) 10Alexandros Kosiaris: [C: 032] gerrit: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/109088 (owner: 10Matanya) [09:08:07] (03PS1) 10Matanya: icinga: add alert for high latency in api requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/118435 [09:08:13] paravoid: ^ [09:09:26] (03CR) 10Alexandros Kosiaris: [C: 032] ganglia_new: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107128 (owner: 10Matanya) [10:25:52] (03PS4) 10Alexandros Kosiaris: swift: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/109625 (owner: 10Matanya) [10:28:25] (03CR) 10Alexandros Kosiaris: [C: 032] ldap: remove lookupvar and replace with top scope @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/116975 (owner: 10Matanya) [10:28:45] (03CR) 10Alexandros Kosiaris: [C: 032] swift: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/109625 (owner: 10Matanya) [10:29:17] akosiaris: catalog differ in action? [10:29:21] matanya: thanks, I'll have a look [10:30:00] akosiaris: i decided it would be a waste of time to link the dsh group, i'll need to break the recurse and invent new logic, not worth it [10:30:01] paravoid: yeah. It is really helpful albeit slow [10:30:18] thanks for this effort, both of you [10:30:40] where are we on the puppet3 efforts btw? [10:30:45] waiting for us to merge probably [10:30:56] but are any issues that you haven't tackled yet? 
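A simplified model of the depool/repool rules RoanKattouw describes above; this is not pybal's actual code, just the stated behaviour: any heartbeat or periodic HTTP check failure depools a backend immediately, and repooling requires a live heartbeat plus roughly three consecutive successful checks (the "3(?)" above).

    class Backend(object):
        REPOOL_THRESHOLD = 3          # consecutive successes ("3(?)")

        def __init__(self, host):
            self.host = host
            self.pooled = True
            self.successes = 0

        def on_check(self, heartbeat_ok, http_ok):
            if not (heartbeat_ok and http_ok):
                # Anything weird on the heartbeat connection, or a failed
                # ~5s HTTP check: instant depool.
                self.pooled = False
                self.successes = 0
                return
            self.successes += 1
            if not self.pooled and self.successes >= self.REPOOL_THRESHOLD:
                self.pooled = True

On a ~5 s check interval that means a broken backend drops out within one cycle and takes roughly 15 s of good behaviour to come back, which is the fast auto-depooling referred to above. None of that helps, of course, when the checks are aimed at the wrong IP, which is the actual bug here.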
[10:31:18] (03CR) 10Alexandros Kosiaris: [C: 032] ganglia: remove lookupvar and replace with top scope @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/116973 (owner: 10Matanya) [10:31:52] paravoid: https://etherpad.wikimedia.org/p/Puppet3 [10:32:04] (03CR) 10Alexandros Kosiaris: [C: 032] mwlib: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/111619 (owner: 10Matanya) [10:34:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:34:48] so basicly most is done [10:34:54] but still some edge cases [10:36:21] (03CR) 10Faidon Liambotis: [C: 04-1] icinga: add alert for high latency in api requests (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/118435 (owner: 10Matanya) [10:36:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:38:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:39:09] (03CR) 10Matanya: icinga: add alert for high latency in api requests (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/118435 (owner: 10Matanya) [10:40:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:42:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:44:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:46:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:48:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:48:38] hashar: here is somethingfor you [10:49:14] matanya: what ? [10:49:33] in manifests/gerrit.pp you are calling $bzpass in class gerrit::instance , but the template is called in other scope in class gerrit::jetty [10:49:39] hashar: You might also want to approve https://gerrit.wikimedia.org/r/118421 ? :P [10:49:46] in the template templates/gerrit/secure.config.erb [10:50:01] so it doesn't work well or nice :) [10:50:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:51:04] gerrit seems down from here [10:51:33] indeed [10:52:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:52:47] com.google.gerrit.sshd.GerritServerSession : Exception caught java.io.IOException: Connection reset by peer [10:54:07] damn peer [10:54:24] I still dont understand how the CIA still hasn't managed to catch him [10:54:38] hoo: I have no clue what mwgrep is sorry :/ [10:54:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:55:31] hashar: Oh, you should :P It lets you search all MediaWiki: pages for a certain string (eg. to find deprecated functions being used) [10:55:49] That's a trivial one line change... [10:55:54] hmmm, so it is now working without me doing anything [10:56:28] hoo: dont we have elastic search for that? 
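For reference, the alert matanya is drafting above (Gerrit 118435) boils down to polling graphite for the MediaWiki.API.*.tp90 series and complaining above a threshold. The sketch below is neither the existing check_graphite plugin nor that change, just an illustration of the idea against graphite's JSON render API, in Python 2 to match the era; the thresholds are placeholders, since sensible values still have to be found, as paravoid says.

    import sys
    import json
    import urllib2

    GRAPHITE = "https://graphite.wikimedia.org/render/"
    TARGET = "MediaWiki.API.*.tp90"        # per-method 90th percentile, ms
    WARN_MS, CRIT_MS = 5000, 15000         # placeholder thresholds

    url = "%s?target=%s&from=-10min&format=json" % (GRAPHITE, TARGET)
    series = json.load(urllib2.urlopen(url))

    worst = 0
    for s in series:
        values = [v for v, _ts in s["datapoints"] if v is not None]
        if values:
            worst = max(worst, max(values))

    # Standard Nagios/Icinga exit codes: 0 OK, 1 WARNING, 2 CRITICAL.
    if worst >= CRIT_MS:
        print("CRITICAL: API tp90 at %.0fms" % worst)
        sys.exit(2)
    if worst >= WARN_MS:
        print("WARNING: API tp90 at %.0fms" % worst)
        sys.exit(1)
    print("OK: API tp90 at %.0fms" % worst)
    sys.exit(0)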
[10:56:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:57:22] hashar: Yeah, that script uses exactly that ;) [10:57:41] I mean, we could use Special:Search :] [10:57:41] But with the recent elastic update, the result format has slightly changed, thus this change is needed [10:58:10] hashar: Why? That would require calling out to 800(?) wikis [10:58:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:58:39] commons got crazy. doesn't load gadgets, CSS and JS in a sane manner [10:59:29] hoo: you stole my thunder, i poked hashar first :P [10:59:56] * hoo hides :P [11:00:07] I said it before and I say it again: We need more hashar :D [11:00:28] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Thu Mar 13 11:00:21 UTC 2014 [11:01:19] * hashar vanishes [11:02:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 11:00:21 AM UTC [11:09:05] (03PS1) 10Faidon Liambotis: gdash: show API method names under apimethod view [operations/puppet] - 10https://gerrit.wikimedia.org/r/118442 [11:09:25] (03CR) 10Faidon Liambotis: [C: 032 V: 032] gdash: show API method names under apimethod view [operations/puppet] - 10https://gerrit.wikimedia.org/r/118442 (owner: 10Faidon Liambotis) [11:10:52] (03PS1) 10Matanya: ldap: puppet 3 compatibility fix: doamin is a fact [operations/puppet] - 10https://gerrit.wikimedia.org/r/118443 [11:30:20] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Thu Mar 13 11:30:17 UTC 2014 [11:38:08] (03PS1) 10Matanya: dsh: remove sage, long decomed [operations/puppet] - 10https://gerrit.wikimedia.org/r/118453 [11:52:34] (03PS2) 10Matanya: dsh: remove Sun fire hosts, long decomed [operations/puppet] - 10https://gerrit.wikimedia.org/r/118453 [11:57:50] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [12:06:50] PROBLEM - Puppet freshness on carbon is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:01:44 PM UTC [12:38:53] (03PS1) 10Cmjohnson: Updating ipv4 address for mw1208 - 1210 [operations/dns] - 10https://gerrit.wikimedia.org/r/118457 [12:40:03] !log shutting down and relocating mw1208, mw1209 and mw1210 to row D [12:40:13] Logged the message, Master [12:43:00] PROBLEM - Host mw1208 is DOWN: PING CRITICAL - Packet loss = 100% [12:47:10] PROBLEM - Host mw1209 is DOWN: PING CRITICAL - Packet loss = 100% [12:47:30] PROBLEM - Host mw1210 is DOWN: PING CRITICAL - Packet loss = 100% [13:01:46] (03CR) 10Andrew Bogott: "This looks correct. Note, though, that as far as I know the role::openstack manifests are unused and should maybe be purged..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/118431 (owner: 10Ryan Lane) [13:02:01] (03CR) 10Andrew Bogott: [C: 032] "This looks correct. Note, though, that as far as I know the role::openstack manifests are unused and should maybe be purged..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/118431 (owner: 10Ryan Lane) [13:36:50] PROBLEM - Host carbon is DOWN: PING CRITICAL - Packet loss = 100% [14:22:46] (03PS1) 10Andrew Bogott: Add another compute node. [operations/puppet] - 10https://gerrit.wikimedia.org/r/118464 [14:24:34] (03CR) 10Andrew Bogott: [C: 032] Add another compute node. [operations/puppet] - 10https://gerrit.wikimedia.org/r/118464 (owner: 10Andrew Bogott) [15:09:49] Coren, help me with some apt magic if you're up? 
[15:10:25] andrewbogott: I can help [15:10:34] what wrong ? [15:10:38] akosiaris: thanks -- where should I start? [15:10:45] Right now the labs box has the 'right' file, the one I want. [15:10:53] Production boxes (seven of them) have the wrong one. [15:11:04] gimme boxes and file [15:11:17] labs instance: 'eraseme.eqiad.wmflabs' [15:11:22] production instance: virt1006.eqiad.wmnet [15:11:32] file: /usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py [15:12:04] you should have a login on that labs box already [15:16:04] andrewbogott: manifests/openstack: 893 ? [15:16:15] could this be the reason ? [15:17:07] Oh, dammit, this is the second time that has bit me in exactly the same way [15:17:11] akosiaris: thanks, easy to fix. [15:17:30] ok. happy to be of service :-) [15:21:16] ^d: died again [15:22:54] <^d> It's up. [15:22:54] slow as hell [15:22:54] (03PS1) 10BryanDavis: Remove 1.23wmf6 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118472 [15:22:54] (03PS1) 10BryanDavis: Remove 1.23wmf7 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118473 [15:22:54] (03PS1) 10BryanDavis: Remove 1.23wmf8 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118474 [15:22:54] and now server unavilable [15:22:54] (03PS1) 10BryanDavis: Remove 1.23wmf9 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118475 [15:22:54] (03PS1) 10BryanDavis: Remove 1.23wmf10 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118476 [15:23:53] <^d> I've got no errors in gerrit or apache. [15:24:14] ^^ Many stale branches will die today. [15:24:47] <^d> bd808: While you're at it turn them from branches to tags in core :p [15:26:16] ^d: gerrit web ui is timing out for me as well. [15:26:19] ^d: Looks like it happened again [15:26:22] <^d> I'm aware. [15:26:29] <^d> There's no errors anywhere. [15:26:34] <^d> And CPU/memory usage looks normal. [15:27:00] Is gerrit's web interface behind a varnish? [15:27:09] <^d> No. [15:28:14] <^d> ssh is fine, hmm [15:30:07] Jesus the traceroute from my house is a mess: boi > sea > den > dfw > tpa > pmtpa > eqiad > gerrit (19 hops) [15:30:36] can you paste it? [15:30:39] and give me your IP [15:32:23] <^d> sf -> oak -> sf -> sj -> great oaks -> was -> eqiad -> gerrit [15:32:27] <^d> (From here) [15:32:55] traceroute6 gerrit.wikimedia.org: connect: No route to host :( [15:33:19] <^d> Hmm, we do have ipv6. [15:33:35] <^d> Same with no route to host. [15:33:38] my 6to4 gateway may be borked [15:34:17] <^d> UI seems somewhat back for me. Others? [15:34:36] ^d: works for me now [15:34:39] yep [15:35:01] <^d> I guess I just had to stare at the log long enough and it'd fix itself. [15:35:30] (03PS1) 10Matanya: sunfire hosts decommed, removed from dns [operations/dns] - 10https://gerrit.wikimedia.org/r/118480 [15:35:43] yeah, that's like starting stuff in gdb... you can be (almost) sure that it wont randomly crash then .P [15:36:50] ok, enough for today. 
see you later [15:36:58] (03PS1) 10Mark Bergsma: Add lvs3001-lvs3004 management [operations/dns] - 10https://gerrit.wikimedia.org/r/118481 [15:37:46] (03CR) 10Mark Bergsma: [C: 032] Add lvs3001-lvs3004 management [operations/dns] - 10https://gerrit.wikimedia.org/r/118481 (owner: 10Mark Bergsma) [15:38:34] (03PS1) 10Chad: Moving all "small" wikis into Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118482 [15:39:01] (03CR) 10Chad: "Per IRC this will have some wikis fall back to building temporarily, but it shouldn't take long at all." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118482 (owner: 10Chad) [15:43:04] (03PS1) 10Jgreen: update trusty installer yet again [operations/puppet] - 10https://gerrit.wikimedia.org/r/118498 [15:44:59] bblack: can you retry the ipv4 trace again? [15:45:30] !log announcing only Tampa's /23 from pmtpa/sdtpa [15:45:39] Logged the message, Master [15:45:43] er [15:45:53] bd808: can you try the ipv4 trace again? [15:45:56] bblack: nevermind, sorry :) [15:46:16] (03CR) 10Jgreen: [C: 032 V: 031] update trusty installer yet again [operations/puppet] - 10https://gerrit.wikimedia.org/r/118498 (owner: 10Jgreen) [15:47:00] <^d> wikitech timing out :\ [15:48:29] i was about to report that too [15:48:36] can't open wikitech page [15:49:19] * bd808|deploy is cutting 1.23wmf18 branch  [15:49:54] That idiom should really be "grafting a branch" [15:50:05] The tree gets bigger, not smaller [15:51:04] (03PS2) 10Hashar: contint: python-dev on labs slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/115605 [15:51:44] May I please get python-dev installed on the Jenkins slaves in labs please ? https://gerrit.wikimedia.org/r/#/c/115605/ thx :] [15:52:22] (03PS1) 10Andrew Bogott: Update custom virt-libvirt-driver to 1:2013.2.2-0ubuntu1~cloud0 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118551 [15:52:24] (03PS2) 10Hashar: deployment::target does not work in labs, skip it [operations/puppet] - 10https://gerrit.wikimedia.org/r/115624 [15:52:26] (03CR) 10Dzahn: [C: 032] "https://wikitech.wikimedia.org/w/index.php?title=Platform-specific_documentation/Sun_Fire_V20z/V40z&action=history" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118453 (owner: 10Matanya) [15:52:52] (03PS4) 10Hashar: Configuration for beta cluster caches in eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/115629 [15:53:26] (03PS2) 10Hashar: labs_lvm: logoutput => on_failure for all exec calls [operations/puppet] - 10https://gerrit.wikimedia.org/r/117199 [15:55:38] <^d> mutante: And wikitech-static seems out of date :( [15:55:42] (03CR) 10BryanDavis: [C: 032] Remove 1.23wmf6 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118472 (owner: 10BryanDavis) [15:55:52] (03CR) 10BryanDavis: [C: 032] Remove 1.23wmf7 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118473 (owner: 10BryanDavis) [15:55:56] ^d: i know, i reported as ticket a while ago [15:56:06] (03CR) 10BryanDavis: [C: 032] Remove 1.23wmf8 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118474 (owner: 10BryanDavis) [15:56:14] (03CR) 10BryanDavis: [C: 032] Remove 1.23wmf9 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118475 (owner: 10BryanDavis) [15:56:26] (03CR) 10BryanDavis: [C: 032] Remove 1.23wmf10 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118476 (owner: 10BryanDavis) [16:00:12] (03CR) 10Dzahn: [C: 032] contint: python-dev on labs slaves 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/115605 (owner: 10Hashar) [16:00:39] (03Merged) 10jenkins-bot: Remove 1.23wmf6 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118472 (owner: 10BryanDavis) [16:01:07] (03Merged) 10jenkins-bot: Remove 1.23wmf7 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118473 (owner: 10BryanDavis) [16:01:09] (03Merged) 10jenkins-bot: Remove 1.23wmf8 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118474 (owner: 10BryanDavis) [16:01:11] (03Merged) 10jenkins-bot: Remove 1.23wmf9 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118475 (owner: 10BryanDavis) [16:01:13] (03Merged) 10jenkins-bot: Remove 1.23wmf10 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118476 (owner: 10BryanDavis) [16:01:21] paravoid, VE API method latency seems to be back up in the last two hours according to http://gdash.wikimedia.org/dashboards/apimethods/ [16:01:24] third graph [16:03:15] heh, cmjohnson1 moved more servers [16:03:16] fixing [16:03:25] cmjohnson1: see ops-l & my incident report [16:03:28] gwicke: thanks... [16:03:54] paravoid: subbu noticed it, I'm just the messenger [16:04:36] tracepath to wikitech: Too many hops: pmtu 1500 [16:04:41] mark, a little help with virt0? it's suddenly unpingable [16:05:10] http://paste.debian.net/87533/ [16:05:11] ^ [16:05:19] (03CR) 10Andrew Bogott: [C: 032] Update custom virt-libvirt-driver to 1:2013.2.2-0ubuntu1~cloud0 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118551 (owner: 10Andrew Bogott) [16:05:43] Coren: you think it's routing and not a dos? [16:06:32] andrewbogott: I see no unusual traffic, and the server is unusually fast and snappy for me (which makes me suspect load is /lower/ than usual as you'd expect if part of the world can't reach it) [16:06:48] yeah, fair point. the lots seem totally fine [16:07:16] <^d> Was some other router change just made a bit ago by paravoid? [16:07:26] <^d> (If Coren is suspecting routing and it was the last change) [16:07:59] !log fixing pybal for mw1208/mw1209/mw1210 move; same issue as last time [16:08:02] cmjohnson1: ping [16:08:07] Logged the message, Master [16:08:11] gwicke: fixed [16:08:17] paravoid: ? [16:08:19] <^d> Coren: My traceroutes to wikitech end in tampa. [16:08:28] <^d> Before timing out [16:08:30] ^d, wikitech is in tampa [16:08:34] cmjohnson1: 18:03 < paravoid> cmjohnson1: see ops-l & my incident report [16:08:38] ^d: That's expected; but it looks like your packets aren't coming back. [16:08:49] Mine do, but through a different route entirely. [16:08:52] cmjohnson1: don't move servers before removing them from /h/w/conf/pybal/... [16:09:04] <^d> Coren: I can pastebin if that will help. [16:09:07] cmjohnson1: not disabling them, but removing them altogether [16:09:14] i didn't move these mw1021-mw1023 [16:09:35] mw1201, mw1202, mw1203 [16:09:37] typo [16:09:40] figured that out [16:09:56] and now it happened again with 1208-1210 :) [16:10:10] paravoid: is the thing you're talking about with cmjohnson1 related to the wikitech outage? [16:10:14] no [16:10:18] :( [16:10:21] Any ideas about that one, then? [16:10:23] paravoid: http://pastebin.com/S84sHSsK <-- wonky routing [16:10:43] what's "wonky" about it? [16:10:50] I was under the impression that moving the 3 at a time wouldn't cause any problems..i was misinformed. 
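To make the renumbering bug concrete: LVS real-server entries are keyed by IP address while the pybal configuration is keyed by hostname, so when a host's DNS record is changed behind pybal's back the entry for the old address is orphaned: nothing monitors it any more, yet it stays pooled, which is the "lost track of the old IPs" behaviour described earlier. A toy illustration, not pybal's implementation, and the reason the advice above is to remove hosts from the pybal files entirely before a move:

    import socket

    lvs_pool = {}    # ip -> hostname, standing in for the IPVS real servers

    def pool(hostname):
        # Resolution happens when the entry is (re)loaded.
        lvs_pool[socket.gethostbyname(hostname)] = hostname

    pool("mw1208.eqiad.wmnet")   # old row-C address goes into the pool
    # ... mw1208 is moved to row D and its A record is renumbered ...
    pool("mw1208.eqiad.wmnet")   # new address added; the stale entry for
                                 # the old address is never depooled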
[16:11:02] if helpful I am at the office and wikitech loads well from here [16:11:13] cmjohnson1: no, moving 3 at a time is fine indeed; there's an underlying bug that causes issues because of the IPs being renumbered [16:11:52] ack [16:12:09] paravoid: My packets are entering tampa from xo.net in FL, but returning via vtl in eqiad. I'm thinking that while it works for me it might explain why many others can't reach it. [16:12:24] that isn't a problem [16:12:33] paravoid: Known to not work for ^d, andrewbogott [16:12:39] what are your IPs? [16:12:48] and which direction is it broken? [16:12:55] in which* [16:13:48] i.e. ping virt0 from your connection and tcpdump. do you see the echo requests arriving? [16:13:48] I'm @ 50.181.252.248 [16:14:24] <^d> I'm @ 24.5.175.151 [16:14:39] 24.6.41.192 [16:14:48] and I suppose you have issues reaching Tampa in general, not just wikitech [16:14:49] I have been reading pages (static and geneated) from wikitech fine from here, logged in [16:14:51] e.g. fenari [16:15:28] paravoid, wikitech is getting my pings. [16:15:30] But I get no response [16:15:49] And, yes, cannot ping fenari either [16:16:02] (03PS1) 10BryanDavis: Add 1.23wmf18 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118553 [16:16:03] (03PS1) 10BryanDavis: Wikipedias to 1.23wmf17 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118554 [16:16:06] (03PS1) 10BryanDavis: Group0 wikis to 1.23wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118555 [16:16:56] nope, nothing in Tampa is reachable for me [16:17:12] <^d> Same for me. [16:17:17] mutante, ping wikitech? [16:17:24] 100% packet loss [16:17:28] is dickson? [16:17:37] mutante, I want to see if tcpdump gets it. [16:17:47] i get replies from dickson [16:17:53] i don't from fenari or stat1 [16:18:25] <^d> +1 to everything mutante said. [16:18:32] (03CR) 10BryanDavis: [C: 032] Add 1.23wmf18 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118553 (owner: 10BryanDavis) [16:18:33] andrewbogott: ping running [16:18:38] (03Merged) 10jenkins-bot: Add 1.23wmf18 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118553 (owner: 10BryanDavis) [16:18:39] mutante: yep, I see it. [16:20:36] it works again [16:20:36] fixed [16:20:39] ack [16:20:55] yes, better. thank you paravoid [16:20:56] <^d> yay [16:20:56] !log deactivating Tele2 from eqiad, blackholing traffic [16:21:04] Logged the message, Master [16:21:25] it's one of these days [16:21:30] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [16:21:37] aha, thanks [16:21:41] like I was saying... [16:23:50] PROBLEM - Puppet freshness on brewster is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 01:22:57 PM UTC [16:24:47] ACKNOWLEDGEMENT - Puppet freshness on brewster is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 01:22:57 PM UTC alexandros kosiaris debugging carbon install [16:32:20] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Thu Mar 13 16:32:13 UTC 2014 [16:45:24] greg-g: I'm ready to scap now [16:45:49] good luck pulling from gerrit... [16:46:14] hoo: I'm all prepped on tin [16:46:46] for a moment it seemed like gerrit went fully away... looks ok now (again) [16:46:48] more or less [16:49:08] !log bd808 Started scap: testwiki to php-1.23wmf18 and rebuild l10n cache [16:49:17] Logged the message, Master [16:49:46] <^d> gerrit, again. [16:49:57] <^d> So slow. 
[16:52:29] (03CR) 10Hashar: [C: 031] "New MediaWiki version is being deployed right now. Once it is done I guess anyone can easily merge and deploy this change." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118342 (owner: 10Dan-nl) [16:52:31] (03PS1) 10Alexandros Kosiaris: Set carbon as a 4disk RAID5 with GPT machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/118560 [16:53:47] gerrit down again [16:54:20] maybe just slow [16:54:32] .... [16:54:35] yea, i got page for it [16:54:35] chrismcmahon: The web ui is really slow at the moment [16:54:44] gerrit? [16:54:53] (just got page) [16:54:53] someone who knows about it pls log ;] [16:54:59] oh, its up, just slow. [16:55:08] !log gerrit is just really slow, but not quite down. [16:55:15] Logged the message, Master [16:55:21] ^d: I'll take a look at it if I can get on the bo [16:55:22] x [16:56:11] 24011 gerrit2 20 0 32.9g 8.1g 15m S 15 25.9 98:34.90 java [16:56:16] ddoesn't seem that crazy [16:57:08] ^d: mind if I restart it anyways? [16:57:24] <^d> You can. [16:57:25] apergos: That matches with what ^d found ~2 hours ago. Lod on box looked ok but client https requests were slow/timing out [16:57:42] s/Lod/Load/ [16:57:49] isn't there a proxy in front of it? [16:57:49] <^d> ssh is completely fine. [16:57:55] for http [16:57:57] <^d> it runs being apache as a reverse proxy. [16:58:04] maybe it has nothing to do with gerrit itself ? [16:58:06] <^d> no errors in that log ~2h ago [16:58:12] <^d> akosiaris: That's what I said ;-) [16:58:26] bd808|deploy: heya, yeah, sorry, had a last minute need to go to the eye place for carrie (broken glasses) [16:58:43] greg-g: No worries. I'm mid scap. [16:58:50] * greg-g nods [16:58:50] (03CR) 10Alexandros Kosiaris: [C: 032] Set carbon as a 4disk RAID5 with GPT machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/118560 (owner: 10Alexandros Kosiaris) [16:59:01] (03CR) 10Alexandros Kosiaris: [V: 032] Set carbon as a 4disk RAID5 with GPT machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/118560 (owner: 10Alexandros Kosiaris) [16:59:10] did so [16:59:26] !Log restarted gerrit which had been slow [16:59:35] Logged the message, Master [17:00:21] well it seems ok for the moment, maybe some particular query was slow? [17:00:26] some particular request I mean [17:05:26] !log bd808 Finished scap: testwiki to php-1.23wmf18 and rebuild l10n cache (duration: 16m 14s) [17:05:34] Logged the message, Master [17:05:38] Not bad for a new branch scap [17:06:01] I wonder if deleting 5 brances helped [17:06:12] indeed, not bad at all [17:06:18] we'll never know! [17:06:48] !log bits symlinks for 1.23wmf6 to 1.23wmf10 deleted [17:06:57] Logged the message, Master [17:09:10] PROBLEM - SSH on lvs4001 is CRITICAL: Server answer: [17:13:41] Anyone have a sure fire way to check for broken l10n cache for extensions? I'd love to know if the cache is good for 1.23wmf18 or not. [17:14:53] (03PS1) 10Jgreen: disable pxeboot install for tantalum [operations/puppet] - 10https://gerrit.wikimedia.org/r/118570 [17:15:37] MaxSem: ^^ [17:16:13] bd808|deploy, Special:AllMessages? [17:16:13] RECOVERY - SSH on lvs4001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [17:18:00] * bd808|deploy scans that page [17:19:23] I at least see a lot of en messages for extensions. That's a good sign. 
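On the "sure fire way to check for broken l10n cache" question above: the same data Special:AllMessages shows is available from the action API via meta=allmessages, so a few extension-owned message keys can be spot-checked per wiki. A minimal sketch assuming the requests library; the wiki URL and the message keys are illustrative, any keys belonging to deployed extensions would do.

    #!/usr/bin/env python
    """Spot-check that extension messages resolve, i.e. the l10n cache is usable.

    Sketch only: wiki URL and message keys are illustrative examples.
    """
    import requests

    API = "https://test.wikipedia.org/w/api.php"   # a wiki already on the new branch
    KEYS = ["centralauth-login-progress", "mobile-frontend-terms-text"]  # illustrative keys

    def broken_messages(api, keys, lang="en"):
        params = {
            "action": "query",
            "meta": "allmessages",
            "ammessages": "|".join(keys),
            "amlang": lang,
            "format": "json",
        }
        data = requests.get(api, params=params, timeout=30).json()
        bad = []
        for msg in data["query"]["allmessages"]:
            if "missing" in msg or not msg.get("*", "").strip():
                bad.append(msg["name"])
        return bad

    if __name__ == "__main__":
        bad = broken_messages(API, KEYS)
        if bad:
            print("Messages not resolving (cache may be broken): %s" % ", ".join(bad))
        else:
            print("All sampled messages resolved; l10n cache looks usable.")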
[17:19:32] (03CR) 10Jgreen: [C: 032 V: 031] disable pxeboot install for tantalum [operations/puppet] - 10https://gerrit.wikimedia.org/r/118570 (owner: 10Jgreen) [17:20:10] PROBLEM - SSH on lvs4001 is CRITICAL: Server answer: [17:20:56] greg-g: {{done}} with prep. [17:21:22] And I got the deploy notes generated this week \o/ [17:21:30] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [17:22:01] bd808|deploy: w00t! [17:22:04] seriously [17:22:10] RECOVERY - SSH on lvs4001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [17:22:56] That deploy note process and the branch cut really need to be automated. That would cut ~20 minutes off of the deploy prep time [17:24:15] glue 'em together! [17:24:28] If we round up to 30 minutes https://xkcd.com/1205/ says that's worth investing 5 days in fixing [17:24:47] 7.3gb free, snapsshot4 thanks you [17:24:55] apergos: Cool! [17:25:11] PROBLEM - SSH on lvs4001 is CRITICAL: Server answer: [17:25:22] paravoid, I'm just looking into our timeouts; would the broken API backends have accepted any connection at all? [17:25:34] no [17:25:38] just a timeout [17:25:47] so no TCP connection at all [17:25:57] no [17:26:10] RECOVERY - SSH on lvs4001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [17:30:14] paravoid: We should keep an eye on 404 errors for a while. I dropped 5 really old branches in the scap that just finished. The latest one would have been last active on a wiki on 2014-01-30 [17:30:21] okay [17:30:27] thanks for the headsup :) [17:31:20] yw :) I don't want something boring like this to be the way I earn my "I broke enwiki" shirt [17:32:37] paravoid, in local testing refused connections are retried immediately [17:33:06] this wouldn't be refused [17:33:09] this is just dropped [17:33:38] ah [17:34:11] we might not be able to do very much there [17:34:29] IMO that kind of thing should be handled by the load balancer [17:34:40] along with backend issue reporting [17:34:56] just lower the timeout? [17:35:12] no, that would cut off slow requests [17:35:40] *connect* timeout [17:35:57] there is no separate connect timeout as far as I can tell so far [17:42:52] RECOVERY - HTTP on carbon is OK: HTTP OK: HTTP/1.0 200 OK - 314 bytes in 0.005 second response time [17:52:07] bd808: I need to change venues (barking dog next door and hyper toddler down stairs). Go forth when ready and such. I'll be online in ~15. [17:52:13] * bd808 nods [17:52:18] No rush. I need beverage and bio break anyway [17:52:29] (03CR) 10Aude: [C: 04-1] adding Amsterdam Museum to the wgCopyUploadsDomains array. (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118342 (owner: 10Dan-nl) [17:59:50] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [18:03:13] (03PS2) 10Dan-nl: adding Amsterdam Museum to the wgCopyUploadsDomains array. 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118342 [18:04:00] (03CR) 10Dan-nl: "adjusting domain entry based on aude’s comment in ps1" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118342 (owner: 10Dan-nl) [18:05:30] (03CR) 10BryanDavis: [C: 032] Wikipedias to 1.23wmf17 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118554 (owner: 10BryanDavis) [18:05:33] !log rebuilding search index for itwiki [18:05:40] (03Merged) 10jenkins-bot: Wikipedias to 1.23wmf17 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118554 (owner: 10BryanDavis) [18:05:42] Logged the message, Master [18:06:28] !log bd808 rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.23wmf17 [18:06:37] Logged the message, Master [18:09:52] deploy! [18:12:57] (03CR) 10BryanDavis: [C: 032] Group0 wikis to 1.23wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118555 (owner: 10BryanDavis) [18:13:04] (03Merged) 10jenkins-bot: Group0 wikis to 1.23wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118555 (owner: 10BryanDavis) [18:14:15] !log bd808 rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.23wmf18 [18:14:22] Logged the message, Master [18:14:26] * greg-g is back [18:14:37] greg-g: no excitement [18:15:06] * greg-g crosses fingers [18:15:14] pedias on wmf17, group0 on wmf18 [18:15:27] if elasticsearch complains in the next 20 minutes, ignore it please [18:15:42] I think that monitor might be sad in a few seonds [18:15:47] Damn. Little apc barf [18:16:07] I thought it's waited long enough [18:16:27] seems to be settling down [18:17:04] * chrismcmahon is just happy for no Thursday morning bug drama. Seems like every Thu AM for the last month or so has been a scramble to fix something critical [18:17:23] * bd808|deploy knows on lots of chunks of wood [18:17:28] *knocks [18:19:09] paravoid, do you know a good black hole IP to test against? [18:19:43] springle: Do you know where the views for the labs replicas are defined? It's somewhere in puppet, but I have no idea where [18:21:05] hoo: he's asleep (Australian) [18:21:16] mh [18:21:29] Coren might know [18:21:41] oh well, nevermind, I can log in from production and view them live [18:21:45] Know what? [18:21:50] doubt I need to change [18:21:55] views for labs replica [18:22:00] paravoid, nm; testing with a local iptables rule now [18:22:02] where are they? puppet? [18:22:12] * aude actually has a change in mind [18:22:30] gwicke: yes that would work -- sorry, in the middle of something... [18:22:49] aude: maintain-replicas in operations/software [18:23:15] thanks [18:23:22] Coren: Thanks :) [18:23:45] hoo: The %customviews hash is probably what you're interested in. [18:24:26] ah, indeed :) [18:25:02] cmjohnson1: the fate of sockpuppet is recycle/donate/trash ? [18:25:20] yes it is [18:25:34] creating a ticket for that [18:26:20] matanya: for what? [18:26:39] unrack sockpuppet [18:26:52] please dont [18:26:57] ok [18:27:00] hmmm, where are the wikidata tables defined? [18:27:00] there are decom steps involved [18:27:01] the special ones? [18:27:07] so just making a 'unrack this' will disrupt that process. 
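The timeout thread above turns on the difference between a backend that refuses connections (the client gets an immediate error and can retry at once) and one whose packets are silently dropped (the client blocks for the full connect timeout, and without a separate connect timeout that can be the whole request timeout). A minimal sketch of seeing both cases from Python; the addresses are illustrative, and as noted in the log a local iptables DROP rule on a spare port reproduces the black-hole case without needing a special IP.

    #!/usr/bin/env python
    """Show a refused connection versus a silently dropped one.

    Sketch only: 127.0.0.1:59999 is assumed to have no listener (refused),
    and 192.0.2.1 is a documentation-range address that is normally never
    answered (dropped). A rule like
        iptables -A OUTPUT -p tcp --dport 59998 -j DROP
    gives the same drop behaviour for a local port of your choice.
    """
    import socket
    import time

    def try_connect(host, port, timeout=5.0):
        start = time.time()
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
            sock.close()
            outcome = "connected"
        except socket.timeout:
            outcome = "timed out (packets dropped; waited the full connect timeout)"
        except socket.error as exc:
            outcome = "failed fast: %s" % exc      # e.g. ECONNREFUSED
        print("%s:%s -> %s after %.2fs" % (host, port, outcome, time.time() - start))

    if __name__ == "__main__":
        try_connect("127.0.0.1", 59999)   # refused: error comes back immediately
        try_connect("192.0.2.1", 80)      # black hole: blocks until the timeout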
[18:27:23] matanya: unless you see that its already been fully decommisiosned and wiped, it needs tickets for those first =] [18:27:55] i know robh i was planning on pointing at wikitech and then and , and unrack :) [18:28:10] fwiw : https://rt.wikimedia.org/Ticket/Display.html?id=6924 [18:28:33] cool, if you link all that in the chain so no one unracks before its wiped thats cool [18:28:43] but was afaid you would put in a 'unrack this' and nothign else about wipe. [18:29:15] never do such unuseful tickets :P [18:29:20] then i amend my statement to 'by all means if you are doing all that stuff go for it' [18:29:29] thanks [18:29:34] heh =] [18:30:44] Coren: i can't find where/how to add but https://bugzilla.wikimedia.org/show_bug.cgi?id=56180 would be great to resolve [18:31:10] !log mw1201, mw1202, mw1203, mw1208, mw1209, mw1210 didn't get new branch during scap [18:31:19] Logged the message, Master [18:31:30] (03PS1) 10Jalexander: Enable Translate and EducationProgram extensions on leglateamwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118581 [18:31:51] aude: If none of the information needs to be redacted, all you need is add it to @fullviews. But that also requires Shaun to start replicating the raw table too. [18:32:28] Sean* [18:33:04] ok, [18:33:19] i don't see our other tables there, but maybe they were added before this script was made [18:33:58] If you've got other tables missing from there, add them to the bugzilla. [18:34:11] robh: https://rt.wikimedia.org/Ticket/Display.html?id=7038&results=ec31cc06044d30b11409475a18d5217a [18:34:14] ok [18:36:32] (03PS1) 10Aude: Add wikidata tables for fullview [operations/software] - 10https://gerrit.wikimedia.org/r/118582 [18:36:48] aude: Weird indeed [18:36:56] hope that's correct [18:37:04] ack, tab [18:37:06] (03CR) 10Alexandros Kosiaris: [C: 032] ldap: puppet 3 compatibility fix: doamin is a fact [operations/puppet] - 10https://gerrit.wikimedia.org/r/118443 (owner: 10Matanya) [18:37:31] (03PS2) 10Aude: Add wikidata tables for fullview [operations/software] - 10https://gerrit.wikimedia.org/r/118582 [18:37:49] !log ran sync-common on mw1201 manually [18:37:57] Logged the message, Master [18:38:08] 11 of my changes merged today, what a good day :) [18:38:12] !log running sync-common on mw1202,mw1203,mw1208,mw1209,mw1210 via dsh [18:38:20] Logged the message, Master [18:38:52] so this the second "servers moved but the appropriate places in our dsh/acl configs weren't updated" in 24 hours? [18:39:24] bd808|deploy: Looks fixed [18:39:34] (03CR) 10Alexandros Kosiaris: [C: 032] nginx: remove lookupvar and replace with top scope @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/112881 (owner: 10Matanya) [18:39:39] paravoid: when you finish being in the middle of something, some guidence regarding check_graghite would be appreciated [18:39:41] aude: Probably someone created these views per hand once? 
The script has no "default" for handling unknown tables, so I have no idea [18:39:51] that's what i think [18:40:15] i left out wb_changes (don't think it's needed, wasn't there before) [18:40:28] and wb_id_counters [18:40:43] (03CR) 10Alexandros Kosiaris: [C: 032] torrus: remove lookupvar and replace with top scope @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/112884 (owner: 10Matanya) [18:41:14] * aude back in a bit [18:41:48] (03CR) 10Alexandros Kosiaris: [C: 032] misc: remove lookupvar and replace with top scope @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/112887 (owner: 10Matanya) [18:42:28] (03CR) 10Alexandros Kosiaris: [C: 032] ganglia: remove lookupvar and replace with top scope @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/112888 (owner: 10Matanya) [18:42:52] bd808: I assume you've already thought of it, but could you jot down what the issue was/how you fixed it/what the real fix is? I assume it's yet another repurcusion of moved servers without some other conf file being updated. [18:43:35] greg-g: I'm still looking to see if it was sunspots or something else [18:43:41] kk [18:43:50] The hosts pulled when I asked them to explicitly [18:45:14] robh: can you pick at https://gerrit.wikimedia.org/r/#/c/110366/ ? [18:45:16] (03CR) 10Hoo man: [C: 031] Add wikidata tables for fullview [operations/software] - 10https://gerrit.wikimedia.org/r/118582 (owner: 10Aude) [18:45:27] this is your teritory [18:46:56] !log mw1201,mw1202,mw1203,mw1208,mw1209,mw1210 in mediawiki-installation dsh group on tin; not sure why they didn't get the scap request [18:47:05] Logged the message, Master [18:47:28] !log fix deployed for bug 62497 [18:47:37] Logged the message, Master [18:52:17] matanya: I was looking it over earlier this morning actually [18:52:19] (03CR) 10Hoo man: [C: 04-1] Add wikidata tables for fullview (031 comment) [operations/software] - 10https://gerrit.wikimedia.org/r/118582 (owner: 10Aude) [18:52:31] oh, good sign i hope [18:53:02] yep, earlier means like, 20 minutes ago [18:53:29] oh, your morning [18:53:44] I'm going to append a +1 shortly, but i'd like someone else to also review since it touches well, every single ssl connection. [18:53:46] everywhere. [18:54:01] since its linting it should be fine of course, but im paranoid. [18:54:14] yeah, makes sense [18:54:35] (03PS3) 10Aude: Add wikidata tables for fullview [operations/software] - 10https://gerrit.wikimedia.org/r/118582 [18:54:55] (03CR) 10Aude: Add wikidata tables for fullview (031 comment) [operations/software] - 10https://gerrit.wikimedia.org/r/118582 (owner: 10Aude) [18:55:56] aude: Thaught so as well... but I just checked ;) [18:56:01] * Thought [18:56:19] thanks :) [18:56:37] (03CR) 10RobH: [C: 031] "This seems like a straightforward (needed) lint update to bring this in line. 
Since it does touch every single SSL connection on the varn" [operations/puppet] - 10https://gerrit.wikimedia.org/r/110366 (owner: 10Matanya) [18:56:43] i can't imagine id_counters is too useful, but no harm to include it [18:57:23] matanya: thanks for cleaning up lint stuff btw, its pretty thankless so figured one couldnt hurt ;] [18:57:44] thanks for your kind words [18:58:21] folks tend to update lint stuff when they write or fix things (well, i do) but i dont think anyone else is intentionally ensuring our currently deployed stuff adheres to said standard [18:58:38] and its easy to forget and miss when updating things [18:59:08] but maybe this will help so our new ops hires dont see our repo and suddenly burst into tears. [18:59:30] i try to go over every patch submitted, but mostly they get merged before i comment [19:01:11] robh: you know the easiest way to make sure it is followed is force jenkins to -1 on non lint-clean patches :P [19:01:33] (03PS1) 10Jalexander: Allow legalteamwiki to use mobile site [operations/dns] - 10https://gerrit.wikimedia.org/r/118585 [19:02:18] matanya: thats not gonna pass the teams agreement ;] [19:02:33] for good reasons [19:02:52] hopefully once everythign is linted over it can be done [19:03:48] * matanya crosses fingers [19:08:55] gah. bd808: 2014-03-13 19:08:01 mw1210 mediawikiwiki: [b5483896] /wiki/Special:CentralAutoLogin/deleteCookies?type=icon Exception from line 468 of /usr/local/apache/common-local/php-1.23wmf18/includes/cache/LocalisationCache.php: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php. [19:09:01] (03CR) 10Hoo man: [C: 031] Add wikidata tables for fullview [operations/software] - 10https://gerrit.wikimedia.org/r/118582 (owner: 10Aude) [19:09:16] blerg [19:09:32] Did the localization cache get synched? [19:09:49] Not on those boxes [19:10:02] sync-common doesn't fix the cache [19:10:16] They need to run the rebuild. I'm on it [19:10:33] :/ [19:12:10] csteipp: I ran scap-rebuild-cdbs, they should be good now [19:15:15] < gwicke> do we have response time metrics for the API? <-- gwicke is https://gdash.wikimedia.org/dashboards/apimethods/ enough for that, or is there something else you wanted? 
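Back on the straggler appservers further up: sync-common pulled the new branch but not a rebuilt localisation cache, hence the "No localisation cache found for English" fatals until scap-rebuild-cdbs ran. A quick per-host sanity check is to confirm the branch's l10n CDB files exist and are non-empty; the cache path below is an assumption extrapolated from the common-local path in the error message.

    #!/usr/bin/env python
    """Check that a branch's localisation cache CDB files exist and look sane.

    Sketch only: the cache layout (cache/l10n/l10n_cache-<lang>.cdb under the
    branch directory) is assumed; adjust if the real layout differs.
    """
    import os
    import sys

    BRANCH = "php-1.23wmf18"
    CACHE_DIR = "/usr/local/apache/common-local/%s/cache/l10n" % BRANCH
    LANGS = ["en", "de", "it"]   # spot-check a few languages

    def missing_caches(cache_dir, langs):
        bad = []
        for lang in langs:
            path = os.path.join(cache_dir, "l10n_cache-%s.cdb" % lang)
            if not os.path.isfile(path) or os.path.getsize(path) == 0:
                bad.append(path)
        return bad

    if __name__ == "__main__":
        bad = missing_caches(CACHE_DIR, LANGS)
        for path in bad:
            print("missing or empty: %s" % path)
        sys.exit(1 if bad else 0)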
[19:17:46] greg-g, that's already pretty useful especially now that the labels are fixed [19:18:19] gwicke: cool, I just jotted down that quote from you last night to follow up on today, saw those graphs in faidon's email, so wanted to make sure that was sufficient [19:18:39] also need alerting and ideally per-backend stats [19:18:54] the latter is hard with LVS [19:19:12] gwicke: yeah,the former is hard since we don't yet have a reliable way of doing that with these kinds of metrics [19:19:32] we still use dumb hard lines for eg reqstats [19:19:47] that's still better than nothing at all [19:20:17] true, it just gets ignored (no one has ever responded to the reqstats that I can remember, in and of itself, it needs other alerts for it to be noticed) [19:21:59] just retitled https://bugzilla.wikimedia.org/show_bug.cgi?id=57882 to include graphite [19:24:23] gwicke: cool, added to post mortem page [19:26:14] I'm also wondering if LVS for backend balancing is the best solution wrt error reporting and -handling [19:26:51] could make sense to reconsider the choice [19:27:07] varnish support RR balancing too for example [19:27:11] *supports [19:29:20] (03PS1) 10Jgreen: switch tantalum back to trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/118587 [19:31:59] (03CR) 10Jgreen: [C: 032 V: 031] switch tantalum back to trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/118587 (owner: 10Jgreen) [19:48:57] mw1160 arwikisource: HTTP 100 (Continue) in 'SwiftFileBackend::doStoreInternal' [19:50:05] haha [19:50:45] maybe a content-length headers went AWOL and it turned into chunked-transfer... [20:16:21] (03PS1) 10BBlack: enable Daniel Kinzler shell access - RT#7028 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118590 [20:17:29] (03CR) 10BBlack: [C: 032 V: 032] enable Daniel Kinzler shell access - RT#7028 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118590 (owner: 10BBlack) [20:22:48] \o/ [20:48:48] query apergos [20:49:12] he is sleeping, i suppose [20:49:17] 11 pm? I mean I can try to answer but I'm pretty checked out [20:49:20] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:50:06] 10 to, apergos you can still reply :P [20:50:10] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 428228 bytes in 8.677 second response time [20:50:28] hahaha [20:50:38] we'll see what the brain ha to say about that, it's the final arbiter [20:50:44] (03PS1) 10Tim Landscheidt: Tools: Install package joe [operations/puppet] - 10https://gerrit.wikimedia.org/r/118595 [21:00:50] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [21:08:33] (03CR) 10Dzahn: "they SSH key is still set to "absent" and there is a "revoked" comment further down the file" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118590 (owner: 10BBlack) [21:16:25] ori: around ? [21:21:42] matanya: we could use a new version of https://gerrit.wikimedia.org/r/#/c/94138/ [21:22:00] i'm going to make it, Reedy pointed it out.. unless you already have on disk again ,heh [21:22:05] There's no restore button! :( [21:22:08] funny, was reading the ticket atm [21:22:11] haha [21:22:21] and have it [21:22:31] Reedy: i knew it:) [21:23:09] According to ganglia we have a nice base 2 number of servers online in tampa: 128 [21:23:12] Reedy: please push a new version of the CU-purge [21:23:29] Uh, no, ignore that [21:26:54] < 60 hosts online [21:26:56] matanya: rebase? 
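On the alerting thread above (the fixed warn/crit lines on reqstats.5xx, and the earlier check_graphite question): the basic shape is to pull a metric from graphite's JSON render API and compare a recent aggregate against thresholds. A minimal sketch of that shape; the graphite host is illustrative, the 250/500 thresholds are the ones visible in the icinga messages in this log, and a real check would want smarter baselining than a fixed line, which is the point being made above.

    #!/usr/bin/env python
    """Toy graphite threshold check in the spirit of check_graphite.

    Sketch only: the graphite endpoint is illustrative. Uses the JSON render
    API (/render?target=...&from=-10min&format=json) and Nagios exit codes.
    Requires the requests library.
    """
    import sys
    import requests

    GRAPHITE = "https://graphite.example.org"   # assumed endpoint
    METRIC = "reqstats.5xx"
    WARN, CRIT = 250.0, 500.0

    def recent_mean(base_url, target, window="-10min"):
        resp = requests.get(
            "%s/render" % base_url,
            params={"target": target, "from": window, "format": "json"},
            timeout=30,
        )
        series = resp.json()
        values = [v for v, _ts in series[0]["datapoints"] if v is not None]
        if not values:
            raise ValueError("no datapoints for %s" % target)
        return sum(values) / len(values)

    if __name__ == "__main__":
        try:
            mean = recent_mean(GRAPHITE, METRIC)
        except Exception as exc:
            print("UNKNOWN: %s" % exc)
            sys.exit(3)
        if mean >= CRIT:
            print("CRITICAL: %s = %.1f (>= %.1f)" % (METRIC, mean, CRIT))
            sys.exit(2)
        if mean >= WARN:
            print("WARNING: %s = %.1f (>= %.1f)" % (METRIC, mean, WARN))
            sys.exit(1)
        print("OK: %s = %.1f" % (METRIC, mean))
        sys.exit(0)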
[21:27:10] rebase what? [21:27:23] i through some comments, there [21:27:42] Ahh [21:27:47] I'll have a look [21:29:01] (03PS1) 10Matanya: tmh: decommed [operations/puppet] - 10https://gerrit.wikimedia.org/r/118606 [21:32:09] sweeeet [21:34:06] greg-g: https://gerrit.wikimedia.org/r/#/c/118601/ LD OK for VE? https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=104221&oldid=104073 [21:34:40] (03PS1) 10Matanya: tmh: decommed, left mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/118608 [21:37:05] mutante: if you merged those two ^ i'll be very close to 20 merges today :) [21:38:15] 218 so far, as it seems, but who counts :P [21:41:51] James_F|Away: yeah [21:42:41] (03PS1) 10Reedy: Remove image scalers, mw api, mw and bits appserver and job runner ganglia groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/118611 [21:43:17] https://it.wikipedia.org/wiki/Speciale:UltimeModifiche <-- is that bug already known? [21:43:35] (03CR) 10Matanya: [C: 031] Remove image scalers, mw api, mw and bits appserver and job runner ganglia groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/118611 (owner: 10Reedy) [21:43:57] Vito: What bug? [21:44:24] recent changes are a know bug, they bother vandals [21:44:31] *known [21:47:05] Reedy: see recentchangestext [21:48:01] while mediawiki:recentchangestext is ok its inclusion into special:recentchanges shows some tags [21:48:16] well not always [21:49:34] Reedy: wops, it's not recentchagestext but the one just below [21:49:35] sorry [21:51:09] greg-g: Thanks! [21:57:11] (03PS1) 10BBlack: Enable daniel's key (should have been in 2156ba5e) [operations/puppet] - 10https://gerrit.wikimedia.org/r/118614 [21:57:46] (03CR) 10BBlack: [C: 032 V: 032] Enable daniel's key (should have been in 2156ba5e) [operations/puppet] - 10https://gerrit.wikimedia.org/r/118614 (owner: 10BBlack) [21:58:38] (03PS3) 10Reedy: Remove pmtpa apaches, leaving management entries [operations/dns] - 10https://gerrit.wikimedia.org/r/117210 [21:59:06] Reedy: copy-pasting ? :P [21:59:46] Copy pasting what? [22:00:00] my change [22:00:09] Look at the change id [22:00:54] I pressed rebase [22:01:16] (03PS1) 10Hoo man: rm access revoked comment from Daniel's entry in admins::restricted [operations/puppet] - 10https://gerrit.wikimedia.org/r/118615 [22:01:17] bblack: ^ [22:01:22] Reedy: was just kidding, way too late here [22:01:44] Make more commits! [22:02:54] (03CR) 10BBlack: [C: 032 V: 032] rm access revoked comment from Daniel's entry in admins::restricted [operations/puppet] - 10https://gerrit.wikimedia.org/r/118615 (owner: 10Hoo man) [22:03:56] did like 10 today, totally out by now :) [22:06:47] recursive negetive bblack ? [22:15:00] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 991.94261509 [22:15:10] There are quite a few php OOM errors in the fatal log in the last hour that seem to be related to api [22:18:51] http://de.wikipedia.org/w/api.php?action=query&prop=revisions&generator=recentchanges&grclimit=500&rvprop=content&rvparse [22:19:18] Across multiple wikis (en, ru...) [22:19:27] It's hardly a surprise it fails [22:28:04] (03CR) 10Dzahn: [C: 032] "RT #7018" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118611 (owner: 10Reedy) [22:28:41] mine too please [22:31:02] it's still doing stuff, matanya [22:32:24] that was a general merge request, for any commit. 
you should ignore me until tomorrow, not making sense anymore [22:37:39] Reedy, bd808: I created a bug about those kinds of API end points earlier [22:38:01] gwicke: I wondered if it was releated [22:38:55] it is too easy to bring down the API [22:40:42] * hoo whispers action=parse [22:41:51] (03PS2) 10Dzahn: tmh: decommed [operations/puppet] - 10https://gerrit.wikimedia.org/r/118606 (owner: 10Matanya) [22:43:07] greg-g: gwicke: Can I quickly push in a small Wikidata deploy before the LD? [22:43:53] hoo: what is it? [22:44:16] greg-g: A 3-line fix prevent php fatals... like if ( !$foo ) { return null } [22:44:51] * preventing [22:44:51] in wmf18? [22:44:51] my English's broken tonight :P [22:45:09] greg-g: wmf17 (as that's what Wikidata is on atm) [22:45:19] and maybe also wmf18, but we can do that later on [22:45:36] wmf18 is the same Wikidata code as we only deploy every 2 week as you may know [22:45:45] can you confirm it on testwikidata (which is on 18) and do the fix there, then backport when it's confirmed fixed? [22:46:36] greg-g: I guess I can arrange that... will of course need more time, though [22:50:15] (03CR) 10Dzahn: [C: 032] "RT #6222" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118606 (owner: 10Matanya) [22:50:18] greg-g: If it's good to go, I can push the wmf18 change [22:50:37] i think he wanted you to test if it's good to go:) [22:50:51] it's always small [22:51:12] mutante: Yeah, he wanted me to use wmf18 to test... that's why I want to push that now :P [22:51:56] i got confused though by: < hoo> and maybe also wmf18, but we can do that later on [22:52:22] yeah :P wmf18 is non vital for us (at this moment) [22:52:32] us = Wikidata Team [22:53:40] hoo: yeah, deploy to wmf18 [22:53:42] is ok now [22:54:04] ok, waiting for Jenkins [22:54:51] but yeah, just generally in these cases: let's always 1) fix on testwikidata (latest wmfXX) 2) confirm fix 3) backport to wmfXX-1 [22:55:22] Sounds sane, indeed [22:55:25] not that ya'll are cowboys and always breaking the site (the opposite, actually), but it's just a good way of covering bases [22:55:29] * hoo goes to prepare a test case on testwikidata [22:56:07] Done (https://test.wikidata.org/w/index.php?title=Q139&diff=3260&oldid=1521) [22:58:08] !log disabling puppet on tmh[12] [22:58:16] Logged the message, Master [22:59:08] Krinkle: ^d just fyi, hoo is deploying a quick wikidata fix to wmf18 (waiting on jenkins) [22:59:20] OK [22:59:25] <^d> mmk [22:59:35] * gwicke is ready to go for parsoid [22:59:44] jenkins looks slow atm... [22:59:55] <^d> I'm going to start the jenkins dance myself too [23:00:11] gwicke: go for it [23:01:10] ah, here we go [23:01:28] bblack || mutante: can you restart the parsoids for me? [23:02:10] !log deployed Parsoid 004c7acc, second attempt after trebuchet no-op the last few times [https://github.com/trebuchet-deploy/trigger/issues/26] [23:02:18] Logged the message, Master [23:02:42] I'm done, just need somebody to restart the service [23:04:14] !log hoo synchronized php-1.23wmf18/extensions/Wikidata/ 'Fix a fatal error in ContentRetriever (for testwikidata first)' [23:04:21] Logged the message, Master [23:04:30] ok, done [23:04:39] and my test case seems to work :) [23:05:24] Who's next in the queue? I guess ^d? 
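On the "too easy to bring down the API" point: the request quoted earlier (generator=recentchanges with grclimit=500, rvprop=content and rvparse in a single call) has a much gentler client-side equivalent, small batches, no server-side parse, and API continuation between requests. A minimal sketch, with the wiki, batch size and page budget all illustrative.

    #!/usr/bin/env python
    """Fetch recent-changes revision text in small continued batches rather
    than one huge generator+rvparse request.

    Sketch only: wiki URL, batch size and page budget are illustrative.
    Requires the requests library.
    """
    import requests

    API = "https://de.wikipedia.org/w/api.php"
    BATCH = 20        # far below the grclimit=500 that blew the memory limit
    MAX_PAGES = 100   # total page budget for this run

    def iter_recent_revisions(api, batch, max_pages):
        session = requests.Session()
        params = {
            "action": "query",
            "generator": "recentchanges",
            "grclimit": batch,
            "prop": "revisions",
            "rvprop": "content",
            "format": "json",
            "continue": "",
        }
        seen = 0
        cont = {}
        while seen < max_pages:
            data = session.get(api, params=dict(params, **cont), timeout=60).json()
            for page in data.get("query", {}).get("pages", {}).values():
                revs = page.get("revisions")
                if revs:
                    yield page["title"], revs[0].get("*", "")
                    seen += 1
            if "continue" not in data:
                break
            cont = data["continue"]

    if __name__ == "__main__":
        for title, text in iter_recent_revisions(API, BATCH, MAX_PAGES):
            print("%s: %d bytes of wikitext" % (title, len(text)))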
[23:06:02] * ^d 's still waiting on jenkins [23:06:24] ok, whenever you're done (and greg-g is fine with that) I can go for wmf17 [23:06:30] (03CR) 10Chad: [C: 032] Moving all "small" wikis into Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118482 (owner: 10Chad) [23:06:50] robh || Coren: I need a service restart for Parsoid: 'dsh -g parsoid service parsoid restart' [23:07:31] thats done from tin for dsh now right? [23:07:56] (03Merged) 10jenkins-bot: Moving all "small" wikis into Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118482 (owner: 10Chad) [23:07:57] should work, yes; I'm normally running it from bast1001 [23:08:06] <^d> Huh, that's it? [23:08:08] doing now [23:08:19] <^d> Oh bleh nvm. [23:08:19] <^d> Ignore me. [23:08:53] !log restarting parsoid service [23:08:57] Logged the message, Master [23:08:57] ^d and Krinkle: whoever's ready :) [23:09:03] robh, thanks a bunch! [23:09:25] !log demon synchronized wmf-config/InitialiseSettings.php 'Moving all small wikis into cirrus' [23:09:25] gwicke: is it supposed to sit on the first one? [23:09:28] Logged the message, Master [23:09:29] ahh, there it goes [23:09:29] first one was really long, then some went faster, seems a mixed bag [23:09:29] robh, it should take some time for each [23:09:30] the idea is to have a rolling restart [23:09:39] so that the service as a whole does not go down [23:09:44] uptime, meh ;] [23:09:53] ;) [23:10:14] devs aren't responsible for uptime/stability [23:10:28] ^d: you go first [23:10:44] if we are going to argue that im gonna just cut and paste the wikitech thread. [23:10:59] he just did, I do believe :) [23:11:02] robh: :P [23:11:05] k :) [23:11:06] heh [23:11:06] robh: you mean say to me what I said on the thread? :) [23:11:25] yes, but i need you to be realllllly offended [23:11:27] oh right [23:11:29] out of scope offended [23:12:02] someone film it, i want greg-g to flip his standing desk and storm off ;] [23:12:38] gwicke: looks like the restarts are all finished [23:12:48] !log demon synchronized php-1.23wmf17/extensions/CirrusSearch/ [23:12:50] all good on yer end? [23:12:56] Logged the message, Master [23:13:12] robh: yup [23:13:37] cool [23:15:41] logs are looking good [23:15:56] ^d: Hm.. you not done yet, right? [23:15:57] <^d> Almost [23:16:07] thought for a second but misinterpreted greg-g saying 'did' start, as you being done as well. But then came another sync :) [23:16:08] my bad [23:17:08] !log demon synchronized php-1.23wmf18/extensions/CirrusSearch/ [23:17:09] Logged the message, Master [23:17:20] !log tmh1,tmh2: removed from monitoring, shutting down [23:17:21] Logged the message, Master [23:17:24] <^d> Krinkle: Done. [23:17:31] <^d> Just doing some maintenance scripts but off tin. [23:19:07] !log krinkle synchronized php-1.23wmf18/extensions/VisualEditor/ 'd121ac812fa155cd4554e95' [23:19:15] Logged the message, Master [23:20:00] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 12.8028174236 [23:20:33] Done [23:21:20] ok... [23:21:26] greg-g: Am I good to go? [23:21:42] hoo: yep! [23:22:07] ok :) [23:22:12] greg-g: can i change dsh list ?:) [23:22:41] i'll wait [23:24:14] mutante: let hoo finish ;) [23:24:24] mutante: then yeah, fixing dsh lists is great [23:24:30] waiting for jenkins... 
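The Parsoid restart above is deliberately rolling: dsh walks the group host by host, so the cluster never loses all backends at once. The same idea with plain ssh, one host at a time with a health check before moving on, looks roughly like this; the host list, service name and port are illustrative assumptions, and passwordless ssh with rights to restart the service is taken for granted.

    #!/usr/bin/env python
    """Rolling service restart: one host at a time, wait until it answers again.

    Sketch only: host list and health-check port are illustrative assumptions.
    """
    import socket
    import subprocess
    import time

    HOSTS = ["wtp1001.eqiad.wmnet", "wtp1002.eqiad.wmnet"]  # illustrative group
    SERVICE = "parsoid"
    PORT = 8000          # assumed HTTP port for the health check

    def wait_until_up(host, port, attempts=30, delay=2):
        for _ in range(attempts):
            try:
                socket.create_connection((host, port), timeout=2).close()
                return True
            except socket.error:
                time.sleep(delay)
        return False

    if __name__ == "__main__":
        for host in HOSTS:
            print("restarting %s on %s" % (SERVICE, host))
            subprocess.check_call(["ssh", host, "sudo", "service", SERVICE, "restart"])
            if not wait_until_up(host, PORT):
                # Stop the roll rather than take down further backends.
                raise SystemExit("%s did not come back on port %d" % (host, PORT))

Stopping as soon as a host fails its health check is the point of restarting one at a time: a bad deploy takes out one backend, not the whole pool.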
[23:27:34] (03PS1) 10Reedy: Decomision solr[1-3] [operations/puppet] - 10https://gerrit.wikimedia.org/r/118632 [23:28:46] !log hoo synchronized php-1.23wmf17/extensions/Wikidata/ 'Fix a fatal error in ContentRetriever' [23:28:53] Logged the message, Master [23:29:01] sweet, 4 deploys in under 30 minutes [23:29:06] Ok, I'm done [23:29:15] go team [23:29:20] and it works on WD... thanks, greg-g [23:29:22] :D [23:31:00] (03PS1) 10Reedy: Remove solr[1-3]. Leave mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/118634 [23:32:37] (03PS2) 10Reedy: Decommission solr[1-3] [operations/puppet] - 10https://gerrit.wikimedia.org/r/118632 [23:46:37] (03CR) 10Dzahn: [C: 032] "RT #6222" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118375 (owner: 10Matanya) [23:48:05] (03PS3) 10Reedy: Decommission solr[1-3] [operations/puppet] - 10https://gerrit.wikimedia.org/r/118632 [23:48:17] (03PS2) 10Reedy: Remove solr[1-3]. Leave mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/118634 [23:48:42] !log ersch: disable puppet, remove from monitoring [23:48:51] Logged the message, Master [23:53:15] (03CR) 10Ori.livneh: "@BBlack: Yes, we do. CentralAuth calls session_start multiple times, which triggers PHP bug ." [operations/puppet] - 10https://gerrit.wikimedia.org/r/117375 (owner: 10Ori.livneh) [23:53:28] bblack: ^ [23:56:34] !log shutting down ersch (former poolcounter) [23:56:43] Logged the message, Master [23:58:20] (03PS1) 10Reedy: Decomission sq67, sq68, sq69, sq70 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118641 [23:59:56] (03PS1) 10Reedy: Remove sq67, sq68, sq69, sq70. Leave mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/118642