[00:03:43] PROBLEM - Puppet freshness on carbon is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:01:44 PM UTC [01:25:33] (03PS2) 10Springle: icinga: update check_mysql-replication to v 0.2.6 [operations/puppet] - 10https://gerrit.wikimedia.org/r/116945 (owner: 10Matanya) [01:25:50] (03CR) 10Springle: [C: 032] icinga: update check_mysql-replication to v 0.2.6 [operations/puppet] - 10https://gerrit.wikimedia.org/r/116945 (owner: 10Matanya) [01:35:33] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [01:49:45] (03PS1) 10Hoo man: Fix title display in mwgrep [operations/puppet] - 10https://gerrit.wikimedia.org/r/118421 [01:49:53] ori: --^ Easy one ;) [01:54:13] PROBLEM - MySQL InnoDB on db1038 is CRITICAL: CRIT longest blocking idle transaction sleeps for 2147483647 seconds [01:54:25] heh [01:54:29] -1 [01:54:50] ... nice uptime :D [01:56:00] 68.05 years o.O [01:56:13] RECOVERY - MySQL InnoDB on db1038 is OK: OK longest blocking idle transaction sleeps for 0 seconds [01:57:02] 2147483647 seconds ago was 22nd February, 1946 according to Wolfram Alpha [01:57:18] that's about right [01:57:31] given that unix time runs out in '38 and started in '70 :P [01:59:01] db1038 needs upgrade to 5.5.34 to fix that [02:17:46] !log LocalisationUpdate completed (1.23wmf16) at 2014-03-13 02:17:46+00:00 [02:17:55] Logged the message, Master [02:32:34] bd808: Sorry, I'm running super late [02:32:55] Ryan_Lane: No worries. Such is life [02:33:21] I'm fighting with puppet in labs to kill time ;) [02:33:25] On a bus. Should be home soonish [02:33:52] :D [02:33:53] Puppet will do that to you [02:33:54] !log LocalisationUpdate completed (1.23wmf17) at 2014-03-13 02:33:54+00:00 [02:34:02] Logged the message, Master [02:41:57] bd808: ok, sorry about that [02:42:00] hangout? [02:42:19] Yeah. I'll pm the link [02:55:43] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [03:04:33] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [03:04:43] PROBLEM - Puppet freshness on carbon is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:01:44 PM UTC [03:14:52] bd808: See bz; just don't create the group at all since it already exists globally. :-) [03:15:59] bd808: See also https://gerrit.wikimedia.org/r/#/c/118071/ [03:16:33] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [03:16:36] Coren: I'll check it out [03:17:07] (03CR) 10coren: [C: 032] "Made (annoyingly) necessary by nova-network's lack of configured IPv6" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118301 (owner: 10Tim Landscheidt) [03:18:33] Coren: Ah ha! That is exactly the problem I'm hitting :) [03:18:55] (03PS5) 10coren: beta: skip l10nupdate user/group creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/118071 (owner: 10Hashar) [03:19:02] * Coren attaches the bz [03:19:57] Aaugh; tabspacezmixx0rz! [03:21:01] (03PS6) 10coren: beta: skip l10nupdate user/group creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/118071 (owner: 10Hashar) [03:21:12] Sorry about the spam. [03:22:23] bd808: So yeah, that change should exactly fix the issue you have. 
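The 2,147,483,647 seconds in the db1038 InnoDB alert above is INT_MAX for a signed 32-bit integer, i.e. an overflowed/sentinel value rather than a real idle time (hence the note that db1038 needs the 5.5.34 upgrade). A quick sanity check of the numbers joked about in the channel, plain Python with nothing WMF-specific:

    from datetime import datetime, timedelta

    INT32_MAX = 2**31 - 1                          # 2147483647
    years = INT32_MAX / (365.25 * 24 * 3600.0)     # ~68.05 Julian years
    then = datetime(2014, 3, 13, 1, 56) - timedelta(seconds=INT32_MAX)

    print(round(years, 2))   # 68.05, the "uptime" quoted above
    print(then.date())       # 1946-02-22, matching the Wolfram Alpha answer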
[03:23:16] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Mar 13 03:23:13 UTC 2014 (duration 23m 12s) [03:23:24] Logged the message, Master [04:14:33] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [05:56:43] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [06:05:43] PROBLEM - Puppet freshness on carbon is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:01:44 PM UTC [06:16:55] 23:23 <+logmsgbot> !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Mar 13 03:23:13 UTC 2014 (duration 23m 12s) <--- That's about half the time it took the last 4 days: http://paste.debian.net/87382/ [06:17:51] * greg-g makes a note, but there's probably enough variation in the work that this is normal, we just don't know [06:19:27] greg-g, I'm noticing extreme Parsoid slowness in prod, any idea what might be going on? [06:20:17] I know gwicke deployed but wasn't able to restart the parsoid services on all boxes, not sure if that's related [06:21:15] Eloquence: the boxes don't look busy: https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Parsoid+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [06:21:38] but.. there is that dip in performance [06:22:18] I'll page Gabriel and Roan, spending a full daylight cycle in Europe with massive performance regression would be no fun. [06:23:22] thanks [06:23:49] do you mind seeing if you can repro before I do so? [06:24:08] if I special:random on a few pages on en.wp with VE enabled, some of them will take 30+ secs [06:24:16] some will be zippy [06:26:30] yeah, confirm, The_Last_Day_(Doctor_Who), which is pretty short, is taking a very long time [06:30:18] good evening RoanKattouw [06:30:21] ^^ [06:31:04] Hey there [06:31:07] Do we have slowness on load or on save? [06:31:13] I just tested initial load [06:31:18] If this is a backend issue then saves should be consistently slow but loads should only be slow for some pages [06:31:18] I've not paged Gabriel yet but happy to do so if needed. [06:31:48] yeah, loads are extremely slow on some pages, even short ones. once a page has been loaded it's subsequently fast [06:32:20] I can send you a HAR of a 40 second request if that's helpful [06:32:34] https://en.wikipedia.org/w/api.php?format=json&action=visualeditor&paction=parse&page=Charles_Simmons_(gymnast) took 40 seconds [06:32:36] When did this start? [06:32:47] I noticed it ~20 mins ago [06:32:54] but https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Parsoid+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 looks odd [06:33:03] there's a suspicious drop in network traffic [06:33:27] That happened earlier today [06:33:31] I think there was a Parsoid deployment [06:33:47] 20:02 gwicke: deployed Parsoid 004c7acc with deploy f97820a2; restart todo [06:33:47] greg says the deploy didn't succeed and gwicke wanted to try again tomorrow. 
[06:33:50] yeah, but the services weren't all restarted successfully [06:33:58] yeah, that [06:34:15] The drop follows daily trends: https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=Parsoid+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [06:34:16] the restart never happened, he's planning to do it tomorrow during LD [06:34:29] What's weird is traffic didn't rise today [06:35:10] yeah, network and cpu haven't gone back up to normal [06:35:20] OK this network graph is weirder: https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=Parsoid+Varnish+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [06:35:27] just paged gwicke as well [06:35:40] The deployment was way later than that though [06:35:44] more clear there [06:35:57] Let's see if I can find Ori's client-side load time metrics [06:36:02] If those aren't still broken [06:36:14] deploy happend at 20 UTC [06:36:16] https://ganglia.wikimedia.org/latest/?r=4hr&cs=03%2F12%2F2014+17%3A00+&ce=03%2F13%2F2014+06%3A00+&m=cpu_report&s=by+name&c=Parsoid+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [06:36:35] it coincides [06:36:35] https://gdash.wikimedia.org/dashboards/ve/ [06:36:42] these look crazy. [06:36:55] i.e. they confirm what we're seeing in prod, I think. [06:36:56] (03PS1) 10Ryan Lane: Fix eqiad labs range [operations/puppet] - 10https://gerrit.wikimedia.org/r/118431 [06:37:19] yeah [06:38:44] hi [06:38:54] hi gwicke, see ^^ [06:39:20] so we didn't actually deploy anything since Monday last week as trebuchet was broken since then [06:39:43] today the only thing that effectively happened was a restart [06:40:02] so "20:02 gwicke: deployed Parsoid 004c7acc with deploy f97820a2; restart todo" is wrong? [06:40:14] that's what I thought had happened [06:40:35] trebuchet lied and didn't update the submodule with the code [06:40:39] the exact opposite happened? :) [06:40:43] great [06:40:56] nothing happened, only Coren's service restart as root worked [06:41:00] * greg-g ndos [06:41:02] (salt is also broken btw) [06:41:32] gwicke, something happened around 15:00 UTC per https://gdash.wikimedia.org/dashboards/ve/ [06:41:32] so, any ideas about the drop in performance? [06:41:45] the timing on the VE dashboard looks like it coincides with the restart [06:41:46] I checked how happy LVS/pybal are about the parsoidcache and parsoid clusters [06:41:49] Their answer is "never happier" [06:41:59] the load also looks fine [06:42:08] Heartbeat and trivial HTTP requests are fine on all backends [06:42:18] did you check bits etc? [06:42:37] this is not just a measure of parsoid load performance, but also RL etc [06:43:32] Yeah I haven't checked those yet [06:43:34] Also, there's the API [06:43:44] Gloria, are you noticing any site issues other than VE/Parsoid being slow? [06:44:05] I noticed that at [13:36] James_F, RoanKattouw is it just me or have you found ve loading/saving to be slow on mediawiki.org -- I ran into this on a couple different pages that I tried editing today. 
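For reproducing measurements like the 40-second action=visualeditor request above, a small timing helper against the Parsoid front end is handy. This is only a sketch: it assumes the parsoid-lb URL shape and the X-Parsoid-Performance response header that appear later in this log, uses the third-party requests library, and busts the Varnish cache with a throwaway query parameter (the trick suggested further down) so that an uncached parse is measured; the title/oldid pair is one of the examples tested below.

    import time
    import uuid
    import requests

    def time_parsoid(title, oldid):
        # Hit the Parsoid cache/LB directly; the random parameter keeps
        # Varnish from answering straight out of cache.
        url = "http://parsoid-lb.eqiad.wikimedia.org/enwiki/%s" % title
        params = {"oldid": oldid, "cachebust": uuid.uuid4().hex}
        start = time.time()
        resp = requests.get(url, params=params, timeout=120)
        elapsed = time.time() - start
        # Parsoid reports its own timing in this header, e.g.
        # "duration=32359; start=1394697324571" (milliseconds).
        print("%s: %.1fs client-side, %s" % (
            title, elapsed, resp.headers.get("X-Parsoid-Performance")))

    time_parsoid("Harlingen,_Friesland", 572387747)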
[06:44:20] http://en.wikipedia.org/wiki/Harlingen,_Friesland was snappy for me, 788 ms [06:44:29] Grah, that was the live site [06:44:35] API cluster is seeing a crazy spike: https://ganglia.wikimedia.org/latest/?c=API%20application%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [06:44:36] Thanks Parsoid, for based links :S [06:44:59] http://parsoid-lb.eqiad.wikimedia.org/enwiki/Harlingen%2C_Friesland?oldid=572387747 was 579ms [06:45:03] I haven't had any of my random articles able to load in VE yet [06:45:05] (03PS4) 10Ryan Lane: Simplify trebuchet developer environment creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112315 [06:45:06] Holy crap [06:45:12] Yeah they're getting hammered, no wonder Parsoid is slow [06:45:44] should I page some opsen? [06:45:57] cpu-wise the cluster looks fairly normal [06:46:05] is varnish bottlenecking? [06:46:24] Hmmmm [06:46:28] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Parsoid%2520eqiad&tab=m&vn= [06:46:32] I'm trying to get a measure of how slow an uncached page is [06:46:39] 11-12% cpu [06:46:46] https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=API+application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 api cluster last week, nothing weird for today [06:46:46] that's low average [06:46:56] gwicke: If the API is overloaded, that will slow down Parsoid, it will spend more time waiting for the API [06:47:13] * Jasper_Deng always had his doubts about having Parsoid as an external service, but he's not a VE designer [06:47:49] the API looks busier than normal [06:47:50] Hah, now I'm getting somewhere, http://parsoid-lb.eqiad.wikimedia.org/enwiki/INS_Pratap_(K92)?oldid=542183384 took 4 seconds to 302 me [06:48:20] also not very balanced, with many machines >= 50% cpu [06:48:36] Yeah [06:48:40] Going to check on LVS for the API cluster [06:49:11] maybe somebody started some expensive operations from their cell phone [06:49:58] OK I've now gotten http://parsoid-lb.eqiad.wikimedia.org/enwiki/Chamak-class_missile_boat to take forever [06:50:57] Loaded in 2 minutes [06:51:39] do we have response time metrics for the API? [06:51:51] (03PS5) 10Ryan Lane: Simplify trebuchet developer environment creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112315 [06:52:06] and top entry points? [06:52:39] gwicke, I only see a method breakdown at https://gdash.wikimedia.org/dashboards/apimethods/ [06:52:42] I'm not sure that we do [06:53:46] those labels are not very helpful [06:54:00] everything seems to be MediaWiki.API [06:54:33] browsing around graphite now [06:55:39] (03CR) 10Ryan Lane: [C: 032] Simplify trebuchet developer environment creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112315 (owner: 10Ryan Lane) [06:57:35] at first sight I don't see anything directly useful in graphite [06:57:38] WTF [06:57:41] The API cluster is reverse-weighted [06:57:49] The boxes with fewer CPUs have higher waits [06:57:51] *weight [06:58:10] fun [06:58:21] Also, why are our DSH groups never up to date *grumble* [06:58:32] there seem to always have been two classes of machines in the app server cluster [06:58:48] one heavily loaded, and the other maybe 2/3 to 1/2 the load [06:59:48] it's almost 9am in greece... 
if you want help from an opsen [07:00:00] * RoanKattouw generates list of API servers from pybal manifest [07:00:10] gwicke: https://noc.wikimedia.org/pybal/eqiad/api [07:00:12] the overall load avg on the apps server cluster is 64% currently [07:00:29] normal is around 20 [07:02:16] I don't know the # of cpus on those boxes [07:02:27] paravoid: ping [07:03:11] I'm going to crash to bed as soon as paravoid shows up [07:03:16] I just measured that [07:03:24] dsh -cM -f apiservers -- nproc | sort -k 2 -r [07:03:29] It's exactly inverse to the load settings [07:03:35] greg-g, crash now if you want, I'll still be around for a bit as well. [07:04:14] Eloquence: kk, godspeed [07:04:20] :) [07:04:26] I'm going to fix the weights [07:04:27] there is api.log on fluorine [07:05:48] !log Changed pybal weights for eqiad API cluster to # of CPUs on each machine; weights were backwards (machines with fewer CPUs had higher weights) [07:05:56] Logged the message, Mr. Obvious [07:06:02] (03PS1) 10Ryan Lane: Add missing config from role::deployment::salt_masters::labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/118432 [07:06:13] Since we're definitely in ops-y territory now I'll go ahead and page paravoid :) [07:06:54] OK so the API looks to have had two load spikes, the latter of which seems to have subsided while I was doing the reweighting [07:07:32] parsoid slowness continues [07:07:59] (03CR) 10Ryan Lane: [C: 032] Add missing config from role::deployment::salt_masters::labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/118432 (owner: 10Ryan Lane) [07:08:20] parsoid itself looks fine [07:08:24] actually, getting better [07:10:05] (not for me) [07:10:54] yeah, only sporadic [07:10:58] still getting lots of sloow requests [07:11:11] one 30s+ just now [07:11:18] actually paging faidon now [07:11:29] Hmm, so the only thing I can think of [07:11:37] Is Daniel disabling Poolcounter hosts in Tampa [07:12:16] RoanKattouw, https://gdash.wikimedia.org/dashboards/poolcounter/ [07:12:17] Which he doesn't seem to have actually done (?) [07:12:32] I did notice a spike in poolcounter latency, but it seems to average out over the long run [07:12:45] Whoa [07:12:51] That spike is bigger than you realize [07:12:54] That graph is log scale [07:13:11] PC *average* client latency jumped from ~1s to ~200s [07:14:54] the API ist still way too busy [07:15:17] Whoa, we just had another load spike there [07:16:49] the annoying thing is that it's so trivial to bring down the entire API with a handful of requests [07:17:10] makes it hard to find the culprit as it's a needle in a haystack [07:17:21] Well PC latency is really high [07:17:25] So I think something is going on there [07:17:34] But I don't know PC well enough to investigate [07:17:51] RoanKattouw, but that does seem to occur regularly in the weekly graph: https://graphite.wikimedia.org/render/?title=PoolCounter%20Client%20Average%20Latency%20(ms)%20log(2)%20-1week&from=-1week&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle(MediaWiki.PoolCounter.Client.*.tavg) [07:18:22] That's .... 
very odd [07:18:35] Daily, actually [07:18:41] But you're right this is perfectly regular [07:19:05] And the API has had crazier load spikes: https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=API+application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [07:19:10] I see no correlation between the parsoid load and the API load [07:19:11] API load is actually *low* right now [07:19:44] so it seems likely that something is using the API heavily [07:19:52] Hah and PC has dropped back down again [07:20:13] If we had more useful per-method API data in graphite (with actual labeling), we might know more [07:20:14] RoanKattouw, yeah. seems the issue is actually limited to Parsoid no? [07:20:35] For now my theory is that someone's hitting the API with a bunch of heavy parse requests every 24h around this time of day [07:20:47] Possibly in batches, waiting until an entire batch finishes [07:21:07] a bot ? [07:22:05] Maybe [07:22:19] We should do origin IP analysis on the API request logs for this time frame [07:22:42] Data point: http://parsoid-lb.eqiad.wikimedia.org/enwiki/Vidyut-class_missile_boat?oldid=584648112 just took 41s [07:22:46] That's still slow [07:22:50] hitting us with lots of parse requests is definitely something that a certain search engine asked us about (gwicke knows all about that) [07:22:55] But not 2 minutes [07:23:16] the VE slowness started over an hour after Coren's parsoid restart [07:23:39] actually more than two hours later [07:24:56] there were some no-op scaps around that time [07:25:22] I'm analyzing api.log for 06:00-06:59 [07:25:29] catrope@fluorine:/a/mw-log$ grep '^2014-03-13 06:' api.log | cut -d ' ' -f 8 | sort | uniq -c | sort -rn | head -n 25 [07:26:16] Ugh, that fields is internal IPs [07:26:21] Helpful [07:26:46] the problem with looking at the log is that volume is so decoupled from cost [07:28:06] Or... hold on [07:28:24] Really what seems to be going on is most top IPs are internal but some are external [07:28:42] Most top IPs are wtp* [07:28:50] hey [07:28:50] gwicke: There is duration info though [07:29:05] sorry, just saw the text :( [07:29:16] hi paravoid [07:29:47] OK here are some WTFs from the log: [07:29:50] https://gdash.wikimedia.org/dashboards/ve/ [07:30:02] 2014-03-13 06:25:34 mw1190 enwiki: API GET [IP REDACTED] [IP REDACTED] T=9606ms format=xml action=query list=search srsearch=alabama%20national-archives [07:30:05] Why [07:30:14] Why does a simple search query like that take 9.6 seconds [07:32:42] are there any stats on the number of client connections to the api? 
[07:32:53] Filtering api.log for >10s durations gives me only action=visualeditor (which blocks waiting for Parsoid) but also a few search queries [07:33:07] ap_rps [07:33:50] http://ganglia.wikimedia.org/latest/stacked.php?m=ap_rps&c=API%20application%20servers%20eqiad&r=hour&st=1394695990&host_regex= [07:34:25] paravoid: That big shift you see there is me reweighting the API apaches in LVS [07:34:33] We have boxes with 16 cores and boxes with 24, according to nproc [07:34:42] The 16-core ones had weight 20 and the 24-core ones had weight 10 [07:35:06] That seemed stupid and the load imbalance was very apparent in Ganglia, so I set the 16-core ones to weight 16 and the 24-core ones to weight 24 [07:35:33] 24 core ones are 12 core with HT enabled most probably [07:35:44] and I think API is balanced by memory [07:35:50] Aaah [07:36:08] I would be happy to revert the weighting change if the original weighting made sense in some way [07:36:38] But at the time we were in the middle of a load spike and the weight 20 boxes were at 100% CPU while the weight 10 ones weren't breaking a sweat [07:36:40] VE latency seems to recover lately [07:37:06] the new boxes have 64G of ram, the old ones have 12G :) [07:37:48] and I've been told that API operations are more memory intensive than CPU intensive [07:38:03] there are a few [07:39:02] paravoid: Re memory, the 16-core boxes have 11G of memory and the 24-core boxes have 62G [07:39:41] no, that's backwards [07:40:02] Oh, wait, yes it is [07:40:14] That makes the original weighting make more sense [07:40:25] In general I could see API load being more memory-intensive [07:40:38] An hour ago that was definitely not the case, and I didn't think to look for memory [07:40:42] * RoanKattouw reverts weight change [07:40:56] I also think 24 is not 24, it's just 12*2 threads ;) [07:41:05] two 6-core CPUs [07:41:05] gwicke, it seems to be recovering a bit but I'm still getting the occasional 40sec request [07:41:11] Right [07:41:21] so do we know a root cause yet? is it related to the API load? [07:41:23] paravoid: Are you editing the pybal manifest right now? [07:41:23] (03PS4) 10Matanya: nfs: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/109081 [07:41:31] no, go ahead [07:41:52] Eloquence: yes, there's API load that is starving the appservers [07:41:56] the API appservers [07:41:57] Eloquence, looks very much correlated with API (over)load to me [07:42:09] they hit 100% CPU at times, so this would explain delays [07:42:27] !log Reverted API pybal weights back to original values; apparently makes sense given amount of memory [07:42:35] Logged the message, Mr. Obvious [07:42:53] RoanKattouw> For now my theory is that someone's hitting the API with a bunch of heavy parse requests every 24h around this time of day [07:42:53] 18<RoanKattouw> Possibly in batches, waiting until an entire batch finishes [07:42:57] can we verify that theory? 
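For the record, the reweighting RoanKattouw logs above amounts to "weight = core count reported by nproc" for each API backend. A purely illustrative sketch of that mapping, assuming dsh -M style "hostname: value" output and the one-dict-per-server line format of the published pybal backend lists; the real change was an edit to the pybal configuration itself, and as the discussion below shows it was reverted, because the two hardware generations are really balanced by memory and nproc counts hyperthreads rather than physical cores.

    import sys

    # Read "host: nproc" pairs, e.g. piped from
    #   dsh -cM -f apiservers -- nproc
    # and emit pybal-style server entries weighted by reported core count.
    for line in sys.stdin:
        if not line.strip():
            continue
        host, cores = line.split(":", 1)
        print("{ 'host': '%s', 'weight': %d, 'enabled': True }"
              % (host.strip(), int(cores)))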
[07:43:04] With logs, possibly [07:43:15] I was working on doing some analysis taking running time into account [07:43:15] I'm grepping through logs already :) [07:43:20] ok :) [07:43:25] But I got distracted wondering why search requests routinely take >10s [07:43:31] starting around 3pm pacific [07:43:41] Also, action=visualeditor API requests block waiting for Parsoid, so they're red herrings [07:43:41] I missed that line above, but I had the same theory [07:43:56] I know that [search engine name redacted] was asking to hit our API with more frequent requests to do full parses on articles [07:43:58] They should be spread over lots of IPs though so hopefully that will drop out [07:44:15] as I recall we discouraged them from doing so and asked them to wait for Gabriel's magical cassandra beast to emerge from the darkness [07:44:23] Eloquence: But presumably not in 3 large batches around midnight PDT, then staying quiet the rest of the day? [07:44:32] hitting Parsoid would also be fine as that's cached [07:44:36] That sounds like something they would have mentioned [07:44:49] Also, these would have to be non-pcached parses [07:45:23] RoanKattouw, They didn't specify, and in the discussion we pointed them both to the existing parsoid web service and the magical future. [07:46:05] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=API%20application%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1394696706&g=network_report&z=large shows no growth in traffic [07:46:19] ...and logs confirm that [07:46:42] hrm [07:47:19] i need Joel Krauska's email, anyone has it? [07:47:48] matanya, jkrauska at wikimedia dot org [07:47:55] thanks [07:49:37] RoanKattouw, the specific API requests they were asking about were action=parse and action=expandtemplates [07:49:38] in the daily pattern the VE activation time now looks very much back to normal [07:49:46] Right [07:49:57] paravoid: What do the logs confirm? [07:51:29] no noticeable abnormalitie in traffic volume [07:52:45] gwicke, yeah, but I just did one random request and am still getting 40s+ load time. [07:52:56] on which page? [07:53:05] this was https://en.wikipedia.org/wiki/Gagebrook,_Tasmania?veaction=edit , now cached. [07:54:21] same just now on https://en.wikipedia.org/wiki/Guoba?veaction=edit [07:55:19] I just got one 16s load on Gagebrook [07:55:29] most are around 1s [07:55:39] all directly through the parsoid api [07:55:49] (append a random parameter for cache busting) [07:56:48] i get : X-Parsoid-Performance duration=32359; start=1394697324571 [07:56:54] in doubt it might make sense to restart some apaches [07:57:29] parsoid looks very much API blocked, there is little CPU load [07:58:00] do we have some tracking of avg response times by api box? 
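A sketch of the per-client accounting behind the fluorine:/home/catrope/apidata file mentioned above ("IP reqs time"). It assumes the api.log layout visible in the sample line quoted earlier: the client address in the 8th whitespace-separated field, as in the cut -f 8 one-liner, and the duration as T=<n>ms in the 9th. It also skips action=visualeditor entries, since those just block on Parsoid, as noted above.

    import re
    import sys
    from collections import defaultdict

    reqs = defaultdict(int)
    total_ms = defaultdict(int)

    for line in sys.stdin:                    # e.g. piped from api.log
        if "action=visualeditor" in line:     # blocks on Parsoid; skip
            continue
        fields = line.split()
        if len(fields) < 9:
            continue
        m = re.match(r"T=(\d+)ms", fields[8])
        if not m:
            continue
        client = fields[7]                    # same field as `cut -f 8`
        reqs[client] += 1
        total_ms[client] += int(m.group(1))

    # Top clients by aggregate API time, "IP reqs time" style.
    for ip in sorted(total_ms, key=total_ms.get, reverse=True)[:25]:
        print("%s %d %.1fs" % (ip, reqs[ip], total_ms[ip] / 1000.0))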
[07:58:27] no [07:58:37] but you can infer some of that from apache ganglia metrics [07:59:43] I'm now computing which clients are responsible for the most aggregate API time [08:00:07] it's also slow when trying with a local parsoid [08:00:29] api appservers are not that busy now [08:00:48] sometimes at least [08:02:13] the last attempts were all fast though [08:02:35] (03PS5) 10Matanya: nfs: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/109081 [08:04:48] most are around 1-2 seconds, but some take 16 [08:06:23] prod looks slower though [08:08:53] http://el.wikipedia.org/w/api.php?action=parse&oldid=4308370&format=json&prop=text|langlinks|categories|categorieshtml|langlinks|links|iwlinks|templates|images|externallinks|sections|revid|displaytitle|headitems|headhtml [08:09:15] that takes a while [08:10:06] so there's that, plus POSTs (so no params) with a Java UA [08:10:58] regularly in the tens of seconds [08:11:13] you can also do at least 50 of those per request with generators [08:13:02] paravoid, were there any changes in the LVS area recently? [08:13:06] Hmmm [08:13:10] I have some interesting data here [08:13:44] See fluorine:/home/catrope/apidata [08:13:45] the elwiki ones match up with the cpu spikes [08:13:51] in timestamps [08:14:09] That's "IP reqs time" for all of api.log [08:14:20] i.e. the past hour or so [08:14:37] The 5 top IPs are all in the same /24 [08:14:45] ...and they are the IPs hitting the elwiki links above [08:14:48] paravoid, those suspicious greeks again! :P [08:14:58] that and nothing else [08:15:08] who.is doesn't tell me much about those IPs [08:15:11] Eloquence: heh, the irony [08:15:15] Except that they're apparently reserved for the US [08:15:20] It's an ARIN range [08:15:45] it's a hosting provider [08:16:37] okay, they stopped doing requests at 07:13 UTC, which about when API recovered (+50-100s for their requests to finish...) [08:16:42] hah [08:16:44] that explains that [08:16:55] Hmm, according to this who.is data this IP block was allocated less than a month ago?!? [08:17:01] but it doesn't explain why Eloquence & gwicke are seeing intermittent issues moments ago [08:17:07] were* [08:18:10] getting mostly fast responses now, but still the occasional slow one. [08:18:34] but the slow one I just got was ~16s, not ~40s as before [08:18:45] gwicke: the answer to your question is "yes, but not in any area that should matter for this" [08:18:45] A ~16s response for a large article wouldn't be too weird [08:19:08] I'm still seeing around 16 seconds for a small article that normally takes 1s [08:19:11] morning akosiaris [08:19:20] good morning [08:19:24] also 40 seconds is pretty exactly the max [08:19:29] aah the answer to your question yesterday [08:19:35] RoanKattouw, this is https://en.wikipedia.org/wiki/Plaka_Pilipino?veaction=edit so not very large :) [08:19:39] which is somewhat suspicious [08:19:54] https://graphite.wikimedia.org/render/?title=VisualEditor%20activation%20time,%20one-minute%20sliding%20window,%20last%20day&vtitle=milliseconds&from=-1day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=alias%28color%28ve.performance.system.activation.median,%22blue%22%29,%22Median%22%29&target=alias%28color%28ve.performance.system.activation.75percentile,%22red%22%29,%2275th%20percentil [08:19:55] matanya: yes you can link dsh/apaches to dsh/apache-eqiad [08:20:00] and puppet will do what is right [08:20:16] oh, great. 
submitting a patch [08:20:24] and if we ever need to split them up in the future again we can do it then [08:20:39] thanks. one more question, what would be the fate of tarin? [08:21:06] i.e. https://rt.wikimedia.org/Ticket/Display.html?id=6265 [08:21:17] Hmm https://en.wikipedia.org/w/index.php?title=Plaka_Pilipino&oldid=462522756&veaction=edit loaded fine in 1.34s [08:21:28] (Older rev of the page because you've already put the latest one in cache) [08:22:55] matanya: we already got poolcounters in eqiad, so I 'd say just shutdown [08:23:00] RoanKattouw: re: search: http://gdash.wikimedia.org/dashboards/searchlatency/ [08:23:14] akosiaris: so no ganglia for misc's ? [08:23:31] OK so search is actually crazy slow [08:23:55] paravoid, I'm still seeing weird timings for prod parsoid [08:24:05] 16 seconds is a common number [08:24:10] and 40 again [08:24:29] we are directly pointing to the API LVS IP for most wikis [08:24:31] yes, VE latency graph confirms that something's not right [08:24:41] matanya: crap it is an aggregator for pmtpa ... [08:24:47] oh did that get fixed? [08:24:51] not going through varnish anymore? [08:24:58] yeah, a while ago [08:25:11] right now going through varnish looks faster though [08:25:11] I remember reverting the config for my patch since it was broken for https wikis [08:25:21] matanya: I see you already submitted patches for decom ersch, [08:25:26] i have [08:25:37] at least that's what's my local parsoid does, and it gets more consistent results [08:25:44] but not sure about tarin, becuse of ganglia [08:26:10] but if you say it is ok, i'll push the patch for tarin too [08:26:28] no dont, I had not seen that [08:26:46] https://graphite.wikimedia.org/render/?title=Top%2010%20API%20Methods%20by%20Max%2090%25%20Time%20%28ms%29%20log%282%29%20-1day&from=-1day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle%28substr%28highestMax%28maximumAbove%28MediaWiki.API.*.tp90,1%29,10%29,0,2%29%29 [08:26:50] seems like we will keep him until the end (well close to that) [08:26:51] is weird [08:26:53] probably broken? [08:27:05] didn't i say that in the ticket? [08:27:10] * matanya hits his head [08:27:19] ahhh [08:27:27] paravoid, that's interesting [08:27:35] coincides with the start of the issues [08:27:49] (the blue line, whatever that is) [08:28:03] that legend is awesome isn't it ;) [08:28:19] yeah, so close yet so far [08:28:46] https://graphite.wikimedia.org/render/?title=Top%2010%20API%20Methods%20by%20Max%2090%25%20Time%20%28ms%29%20log%282%29%20-1day&from=-1day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle%28highestMax%28maximumAbove%28MediaWiki.API.*.tp90,1%29,10%29,0%29 [08:28:59] MediaWiki.API.visualeditor is the blue line [08:29:09] Well yeah [08:29:16] action=visualeditor does a curl request to Parsoid [08:29:23] So it'll block for however long Parsoid takes [08:29:31] heh [08:29:50] That's why I used grep -v visualeditor for my api.log analysis, I didn't have that at first and it polluted the data [08:29:57] Turns out we had quite a lot of VE usage from a certain university NAT :) [08:30:10] alright, I'm heading off to bed. thanks all for the debugging so far. public service advisory - it is now 1:30 AM in SF, so if this is mostly an API issue rather than a Parsoid issue, perhaps RoanKattouw and gwicke can go to bed as well and ops can continue to investigate? 
Remember to get enough sleep, no matter how :) [08:30:16] could there be some network issues between the parsoid cluster and the API LVS? [08:31:47] Yeah I need to go to sleep [08:32:00] I'll follow soon as well [08:32:01] goodnight folks [08:32:09] gwicke: If switching to Varnish mitigates the issue, something weird is going on [08:32:21] Either in the network itself, or in LVS which is basically the network for all intents and purposes [08:32:57] paravoid: just as a sanity check, can you do a 'dsh -g parsoid service parsoid restart' as root ? [08:34:30] done [08:34:55] I got a 'service unavailable' error btw [08:35:07] did you run the restart in parallel? [08:35:10] springle: regarding d10, do you want to finish decom ? [08:35:16] no [08:35:30] hm, k [08:36:33] still getting slow responses occasionally [08:37:07] oh [08:37:09] hey, I have an idea [08:37:11] give me a sec [08:39:45] gwicke: how 'bout now? [08:39:58] pretty sure I fixed t [08:39:59] *it [08:40:24] yeah, looks much better [08:40:29] What was it? [08:40:30] and if I had my morning coffee, I'd have fixed it earlier [08:40:40] now I'm *very* curious [08:40:46] chris moved mw1201, mw1202, mw1203 to row D yesterday [08:41:21] there is a pybal "limitation" combined with chris not doing the exact right steps [08:41:26] moving them to row D means renumbering them [08:41:33] and also means changing them in DNS [08:41:54] so as soon as their IPs swapped in DNS, pybal completely lost track of the old IPs of those servers [08:42:02] they remained pooled, despite being "down" [08:42:06] those are the hosts that looked down in ganglia [08:42:43] yes, and I thought they were down [08:42:49] then I correlated that with SAL [08:43:31] and let me fix it on the backup LVS now as well [08:43:50] why didn't I see the same from my local parsoid? [08:44:06] hitting varnish you mean? 
[08:44:11] yeah [08:44:24] because varnish retries on failures/timeouts [08:44:50] ah, and probably does so more quickly than parsoid [08:44:54] yeah [08:45:02] you probably have a 15s timeout [08:45:08] you = parsoid [08:45:16] yeah, and something around 40 [08:45:16] you were getting 16s for pages that should parse in 1s you said [08:45:27] we do an exponential back-off [08:46:18] all makes sense now [08:46:29] I guess we should maybe also log API request retries [08:46:38] yeah [08:46:46] there's a few steps we can take [08:47:06] but ideally we'd have better monitoring on the api itself [08:47:24] so, that's funny [08:47:26] we have all the metrics [08:47:34] we just have no alerts [08:47:50] we csn add them with logstash [08:47:51] we do have check_graphite though and we could use that to poll the metric, if we find sensible values [08:48:35] (03PS2) 10Odder: Account creation throttle for ptwiki outreach event [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118364 [08:48:37] https://graphite.wikimedia.org/render/?title=Top%2010%20API%20Methods%20by%20Max%2090%25%20Time%20%28ms%29%20log%282%29%20-1day&from=-1day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle%28substr%28highestMax%28maximumAbove%28MediaWiki.API.*.tp90,1%29,10%29,0,2%29%29 confirms the fix [08:48:52] *nod* [08:49:13] !log API slowness (affecting Parsoid/VE, among others) there since ~12:00 UTC found and fixed [08:49:22] Logged the message, Master [08:49:31] if varnish exposes backend issues then that would be a good thing to monitor [08:49:39] 13:00 UTC, but oh well [08:50:07] anyway, I'm hitting the sack [08:50:12] see you tomorrow! [08:50:20] sorry for the trouble folks [08:50:26] and thanks for all the help [08:50:36] clearly not a parsoid/VE issue itself :) [08:50:44] glad that you found it [08:50:57] paravoid: Aaaaaah [08:50:59] Riiiight [08:51:11] I did see that mw120{1,2,3} had zero load in Ganglia and were up according to pybal [08:51:21] But I dismissed it and didn't look into it further [08:51:28] the signs were all there, I dismissed them too [08:51:51] until it just clicked, without me looking at anything :) [08:51:55] Also, monitoring won't necessarily help much for this scenario [08:51:56] one of these moments [08:52:02] Because icinga is able to get to mw1201 just fine [08:52:16] RoanKattouw, varnish knows the backend is broken [08:52:30] no, but it kind of sucks to have this issue appear clearly on graphite, yet get notified by users (Erik in this case) [08:52:47] so at minimum, it'd be nice to alert on increased latency for API operations [08:52:52] (although varnish might only see the LVS IP) [08:53:18] gwicke: Varnish only sees the LVS IP but retries more aggressively [08:53:29] The backend itself was probably fine when accessed directly [08:53:52] LVS monitoring / logging could be better I guess [08:54:00] and automatic depooling [08:54:12] gwicke: We have all that [08:54:23] But all of it was collectively making the same mistake it seems [08:54:32] yeah, the failure was with pybal not being that smart about DNS changes [08:55:22] LVS auto depooling is actually really fast [08:55:42] pybal has a persistent HTTP connection to each backend that it periodically reestabilishes [08:55:48] Sorry, TCP [08:55:59] indeed [08:56:04] If anything weird happens to that heartbeat connection --> instant depool [08:56:20] There are also periodic (every ~5s) checks using actual HTTP requests, those have to succeed [08:56:26] If 
any of those fails, depool [08:56:36] the bug is when you change the IP a hostname is pointed at behind its back [08:56:43] it gets all confused [08:56:49] In order to repool, a machine needs to have a functional heartbeat AND the periodic check needs to succeed 3(?) times in a row [08:57:07] or when the server returns error HTTP responses to app requests but not the test URL [08:57:43] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [08:57:56] okay, I'll send a proper outage report later today [08:58:01] and we can discuss on-list [08:58:03] sounds good? [08:58:13] yup, thanks! [08:58:20] sweet dreams :) [08:58:21] Yeah, let's [08:58:30] gwicke: Yeah the test URL thing.. we had the opposite issue a while ago [08:58:49] Where Parsoid was working just fine but was breaking for /_html or something [08:59:15] paravoid: wrote a check, the -t should be Top 10 API Methods by Max 90%25 Time (ms) log(2) -1day ? [09:06:22] (03CR) 10Alexandros Kosiaris: [C: 032] webserver: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/110454 (owner: 10Matanya) [09:06:43] PROBLEM - Puppet freshness on carbon is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:01:44 PM UTC [09:07:54] (03CR) 10Alexandros Kosiaris: [C: 032] gerrit: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/109088 (owner: 10Matanya) [09:08:07] (03PS1) 10Matanya: icinga: add alert for high latency in api requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/118435 [09:08:13] paravoid: ^ [09:09:26] (03CR) 10Alexandros Kosiaris: [C: 032] ganglia_new: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107128 (owner: 10Matanya) [10:25:52] (03PS4) 10Alexandros Kosiaris: swift: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/109625 (owner: 10Matanya) [10:28:25] (03CR) 10Alexandros Kosiaris: [C: 032] ldap: remove lookupvar and replace with top scope @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/116975 (owner: 10Matanya) [10:28:45] (03CR) 10Alexandros Kosiaris: [C: 032] swift: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/109625 (owner: 10Matanya) [10:29:17] akosiaris: catalog differ in action? [10:29:21] matanya: thanks, I'll have a look [10:30:00] akosiaris: i decided it would be a waste of time to link the dsh group, i'll need to break the recurse and invent new logic, not worth it [10:30:01] paravoid: yeah. It is really helpful albeit slow [10:30:18] thanks for this effort, both of you [10:30:40] where are we on the puppet3 efforts btw? [10:30:45] waiting for us to merge probably [10:30:56] but are any issues that you haven't tackled yet? 
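A simplified model of the depool/repool rules RoanKattouw describes above; this is not pybal's actual code, just the stated behaviour: any heartbeat or periodic HTTP check failure depools a backend immediately, and repooling requires a live heartbeat plus roughly three consecutive successful checks (the "3(?)" above).

    class Backend(object):
        REPOOL_THRESHOLD = 3          # consecutive successes ("3(?)")

        def __init__(self, host):
            self.host = host
            self.pooled = True
            self.successes = 0

        def on_check(self, heartbeat_ok, http_ok):
            if not (heartbeat_ok and http_ok):
                # Anything weird on the heartbeat connection, or a failed
                # ~5s HTTP check: instant depool.
                self.pooled = False
                self.successes = 0
                return
            self.successes += 1
            if not self.pooled and self.successes >= self.REPOOL_THRESHOLD:
                self.pooled = True

On a ~5 s check interval that means a broken backend drops out within one cycle and takes roughly 15 s of good behaviour to come back, which is the fast auto-depooling referred to above. None of that helps, of course, when the checks are aimed at the wrong IP, which is the actual bug here.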
[10:31:18] (03CR) 10Alexandros Kosiaris: [C: 032] ganglia: remove lookupvar and replace with top scope @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/116973 (owner: 10Matanya) [10:31:52] paravoid: https://etherpad.wikimedia.org/p/Puppet3 [10:32:04] (03CR) 10Alexandros Kosiaris: [C: 032] mwlib: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/111619 (owner: 10Matanya) [10:34:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:34:48] so basicly most is done [10:34:54] but still some edge cases [10:36:21] (03CR) 10Faidon Liambotis: [C: 04-1] icinga: add alert for high latency in api requests (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/118435 (owner: 10Matanya) [10:36:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:38:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:39:09] (03CR) 10Matanya: icinga: add alert for high latency in api requests (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/118435 (owner: 10Matanya) [10:40:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:42:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:44:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:46:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:48:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:48:38] hashar: here is somethingfor you [10:49:14] matanya: what ? [10:49:33] in manifests/gerrit.pp you are calling $bzpass in class gerrit::instance , but the template is called in other scope in class gerrit::jetty [10:49:39] hashar: You might also want to approve https://gerrit.wikimedia.org/r/118421 ? :P [10:49:46] in the template templates/gerrit/secure.config.erb [10:50:01] so it doesn't work well or nice :) [10:50:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:51:04] gerrit seems down from here [10:51:33] indeed [10:52:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:52:47] com.google.gerrit.sshd.GerritServerSession : Exception caught java.io.IOException: Connection reset by peer [10:54:07] damn peer [10:54:24] I still dont understand how the CIA still hasn't managed to catch him [10:54:38] hoo: I have no clue what mwgrep is sorry :/ [10:54:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:55:31] hashar: Oh, you should :P It lets you search all MediaWiki: pages for a certain string (eg. to find deprecated functions being used) [10:55:49] That's a trivial one line change... [10:55:54] hmmm, so it is now working without me doing anything [10:56:28] hoo: dont we have elastic search for that? 
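For reference, the alert matanya is drafting above (Gerrit 118435) boils down to polling graphite for the MediaWiki.API.*.tp90 series and complaining above a threshold. The sketch below is neither the existing check_graphite plugin nor that change, just an illustration of the idea against graphite's JSON render API, in Python 2 to match the era; the thresholds are placeholders, since sensible values still have to be found, as paravoid says.

    import sys
    import json
    import urllib2

    GRAPHITE = "https://graphite.wikimedia.org/render/"
    TARGET = "MediaWiki.API.*.tp90"        # per-method 90th percentile, ms
    WARN_MS, CRIT_MS = 5000, 15000         # placeholder thresholds

    url = "%s?target=%s&from=-10min&format=json" % (GRAPHITE, TARGET)
    series = json.load(urllib2.urlopen(url))

    worst = 0
    for s in series:
        values = [v for v, _ts in s["datapoints"] if v is not None]
        if values:
            worst = max(worst, max(values))

    # Standard Nagios/Icinga exit codes: 0 OK, 1 WARNING, 2 CRITICAL.
    if worst >= CRIT_MS:
        print("CRITICAL: API tp90 at %.0fms" % worst)
        sys.exit(2)
    if worst >= WARN_MS:
        print("WARNING: API tp90 at %.0fms" % worst)
        sys.exit(1)
    print("OK: API tp90 at %.0fms" % worst)
    sys.exit(0)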
[10:56:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:57:22] hashar: Yeah, that script uses exactly that ;) [10:57:41] I mean, we could use Special:Search :] [10:57:41] But with the recent elastic update, the result format has slightly changed, thus this change is needed [10:58:10] hashar: Why? That would require calling out to 800(?) wikis [10:58:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 10:30:23 AM UTC [10:58:39] commons got crazy. doesn't load gadgets, CSS and JS in a sane manner [10:59:29] hoo: you stole my thunder, i poked hashar first :P [10:59:56] * hoo hides :P [11:00:07] I said it before and I say it again: We need more hashar :D [11:00:28] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Thu Mar 13 11:00:21 UTC 2014 [11:01:19] * hashar vanishes [11:02:38] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 11:00:21 AM UTC [11:09:05] (03PS1) 10Faidon Liambotis: gdash: show API method names under apimethod view [operations/puppet] - 10https://gerrit.wikimedia.org/r/118442 [11:09:25] (03CR) 10Faidon Liambotis: [C: 032 V: 032] gdash: show API method names under apimethod view [operations/puppet] - 10https://gerrit.wikimedia.org/r/118442 (owner: 10Faidon Liambotis) [11:10:52] (03PS1) 10Matanya: ldap: puppet 3 compatibility fix: doamin is a fact [operations/puppet] - 10https://gerrit.wikimedia.org/r/118443 [11:30:20] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Thu Mar 13 11:30:17 UTC 2014 [11:38:08] (03PS1) 10Matanya: dsh: remove sage, long decomed [operations/puppet] - 10https://gerrit.wikimedia.org/r/118453 [11:52:34] (03PS2) 10Matanya: dsh: remove Sun fire hosts, long decomed [operations/puppet] - 10https://gerrit.wikimedia.org/r/118453 [11:57:50] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [12:06:50] PROBLEM - Puppet freshness on carbon is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:01:44 PM UTC [12:38:53] (03PS1) 10Cmjohnson: Updating ipv4 address for mw1208 - 1210 [operations/dns] - 10https://gerrit.wikimedia.org/r/118457 [12:40:03] !log shutting down and relocating mw1208, mw1209 and mw1210 to row D [12:40:13] Logged the message, Master [12:43:00] PROBLEM - Host mw1208 is DOWN: PING CRITICAL - Packet loss = 100% [12:47:10] PROBLEM - Host mw1209 is DOWN: PING CRITICAL - Packet loss = 100% [12:47:30] PROBLEM - Host mw1210 is DOWN: PING CRITICAL - Packet loss = 100% [13:01:46] (03CR) 10Andrew Bogott: "This looks correct. Note, though, that as far as I know the role::openstack manifests are unused and should maybe be purged..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/118431 (owner: 10Ryan Lane) [13:02:01] (03CR) 10Andrew Bogott: [C: 032] "This looks correct. Note, though, that as far as I know the role::openstack manifests are unused and should maybe be purged..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/118431 (owner: 10Ryan Lane) [13:36:50] PROBLEM - Host carbon is DOWN: PING CRITICAL - Packet loss = 100% [14:22:46] (03PS1) 10Andrew Bogott: Add another compute node. [operations/puppet] - 10https://gerrit.wikimedia.org/r/118464 [14:24:34] (03CR) 10Andrew Bogott: [C: 032] Add another compute node. [operations/puppet] - 10https://gerrit.wikimedia.org/r/118464 (owner: 10Andrew Bogott) [15:09:49] Coren, help me with some apt magic if you're up? 
[15:10:25] andrewbogott: I can help [15:10:34] what wrong ? [15:10:38] akosiaris: thanks -- where should I start? [15:10:45] Right now the labs box has the 'right' file, the one I want. [15:10:53] Production boxes (seven of them) have the wrong one. [15:11:04] gimme boxes and file [15:11:17] labs instance: 'eraseme.eqiad.wmflabs' [15:11:22] production instance: virt1006.eqiad.wmnet [15:11:32] file: /usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py [15:12:04] you should have a login on that labs box already [15:16:04] andrewbogott: manifests/openstack: 893 ? [15:16:15] could this be the reason ? [15:17:07] Oh, dammit, this is the second time that has bit me in exactly the same way [15:17:11] akosiaris: thanks, easy to fix. [15:17:30] ok. happy to be of service :-) [15:21:16] ^d: died again [15:22:54] <^d> It's up. [15:22:54] slow as hell [15:22:54] (03PS1) 10BryanDavis: Remove 1.23wmf6 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118472 [15:22:54] (03PS1) 10BryanDavis: Remove 1.23wmf7 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118473 [15:22:54] (03PS1) 10BryanDavis: Remove 1.23wmf8 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118474 [15:22:54] and now server unavilable [15:22:54] (03PS1) 10BryanDavis: Remove 1.23wmf9 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118475 [15:22:54] (03PS1) 10BryanDavis: Remove 1.23wmf10 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118476 [15:23:53] <^d> I've got no errors in gerrit or apache. [15:24:14] ^^ Many stale branches will die today. [15:24:47] <^d> bd808: While you're at it turn them from branches to tags in core :p [15:26:16] ^d: gerrit web ui is timing out for me as well. [15:26:19] ^d: Looks like it happened again [15:26:22] <^d> I'm aware. [15:26:29] <^d> There's no errors anywhere. [15:26:34] <^d> And CPU/memory usage looks normal. [15:27:00] Is gerrit's web interface behind a varnish? [15:27:09] <^d> No. [15:28:14] <^d> ssh is fine, hmm [15:30:07] Jesus the traceroute from my house is a mess: boi > sea > den > dfw > tpa > pmtpa > eqiad > gerrit (19 hops) [15:30:36] can you paste it? [15:30:39] and give me your IP [15:32:23] <^d> sf -> oak -> sf -> sj -> great oaks -> was -> eqiad -> gerrit [15:32:27] <^d> (From here) [15:32:55] traceroute6 gerrit.wikimedia.org: connect: No route to host :( [15:33:19] <^d> Hmm, we do have ipv6. [15:33:35] <^d> Same with no route to host. [15:33:38] my 6to4 gateway may be borked [15:34:17] <^d> UI seems somewhat back for me. Others? [15:34:36] ^d: works for me now [15:34:39] yep [15:35:01] <^d> I guess I just had to stare at the log long enough and it'd fix itself. [15:35:30] (03PS1) 10Matanya: sunfire hosts decommed, removed from dns [operations/dns] - 10https://gerrit.wikimedia.org/r/118480 [15:35:43] yeah, that's like starting stuff in gdb... you can be (almost) sure that it wont randomly crash then .P [15:36:50] ok, enough for today. 
see you later [15:36:58] (03PS1) 10Mark Bergsma: Add lvs3001-lvs3004 management [operations/dns] - 10https://gerrit.wikimedia.org/r/118481 [15:37:46] (03CR) 10Mark Bergsma: [C: 032] Add lvs3001-lvs3004 management [operations/dns] - 10https://gerrit.wikimedia.org/r/118481 (owner: 10Mark Bergsma) [15:38:34] (03PS1) 10Chad: Moving all "small" wikis into Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118482 [15:39:01] (03CR) 10Chad: "Per IRC this will have some wikis fall back to building temporarily, but it shouldn't take long at all." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118482 (owner: 10Chad) [15:43:04] (03PS1) 10Jgreen: update trusty installer yet again [operations/puppet] - 10https://gerrit.wikimedia.org/r/118498 [15:44:59] bblack: can you retry the ipv4 trace again? [15:45:30] !log announcing only Tampa's /23 from pmtpa/sdtpa [15:45:39] Logged the message, Master [15:45:43] er [15:45:53] bd808: can you try the ipv4 trace again? [15:45:56] bblack: nevermind, sorry :) [15:46:16] (03CR) 10Jgreen: [C: 032 V: 031] update trusty installer yet again [operations/puppet] - 10https://gerrit.wikimedia.org/r/118498 (owner: 10Jgreen) [15:47:00] <^d> wikitech timing out :\ [15:48:29] i was about to report that too [15:48:36] can't open wikitech page [15:49:19] * bd808|deploy is cutting 1.23wmf18 branch  [15:49:54] That idiom should really be "grafting a branch" [15:50:05] The tree gets bigger, not smaller [15:51:04] (03PS2) 10Hashar: contint: python-dev on labs slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/115605 [15:51:44] May I please get python-dev installed on the Jenkins slaves in labs please ? https://gerrit.wikimedia.org/r/#/c/115605/ thx :] [15:52:22] (03PS1) 10Andrew Bogott: Update custom virt-libvirt-driver to 1:2013.2.2-0ubuntu1~cloud0 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118551 [15:52:24] (03PS2) 10Hashar: deployment::target does not work in labs, skip it [operations/puppet] - 10https://gerrit.wikimedia.org/r/115624 [15:52:26] (03CR) 10Dzahn: [C: 032] "https://wikitech.wikimedia.org/w/index.php?title=Platform-specific_documentation/Sun_Fire_V20z/V40z&action=history" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118453 (owner: 10Matanya) [15:52:52] (03PS4) 10Hashar: Configuration for beta cluster caches in eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/115629 [15:53:26] (03PS2) 10Hashar: labs_lvm: logoutput => on_failure for all exec calls [operations/puppet] - 10https://gerrit.wikimedia.org/r/117199 [15:55:38] <^d> mutante: And wikitech-static seems out of date :( [15:55:42] (03CR) 10BryanDavis: [C: 032] Remove 1.23wmf6 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118472 (owner: 10BryanDavis) [15:55:52] (03CR) 10BryanDavis: [C: 032] Remove 1.23wmf7 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118473 (owner: 10BryanDavis) [15:55:56] ^d: i know, i reported as ticket a while ago [15:56:06] (03CR) 10BryanDavis: [C: 032] Remove 1.23wmf8 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118474 (owner: 10BryanDavis) [15:56:14] (03CR) 10BryanDavis: [C: 032] Remove 1.23wmf9 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118475 (owner: 10BryanDavis) [15:56:26] (03CR) 10BryanDavis: [C: 032] Remove 1.23wmf10 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118476 (owner: 10BryanDavis) [16:00:12] (03CR) 10Dzahn: [C: 032] contint: python-dev on labs slaves 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/115605 (owner: 10Hashar) [16:00:39] (03Merged) 10jenkins-bot: Remove 1.23wmf6 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118472 (owner: 10BryanDavis) [16:01:07] (03Merged) 10jenkins-bot: Remove 1.23wmf7 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118473 (owner: 10BryanDavis) [16:01:09] (03Merged) 10jenkins-bot: Remove 1.23wmf8 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118474 (owner: 10BryanDavis) [16:01:11] (03Merged) 10jenkins-bot: Remove 1.23wmf9 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118475 (owner: 10BryanDavis) [16:01:13] (03Merged) 10jenkins-bot: Remove 1.23wmf10 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118476 (owner: 10BryanDavis) [16:01:21] paravoid, VE API method latency seems to be back up in the last two hours according to http://gdash.wikimedia.org/dashboards/apimethods/ [16:01:24] third graph [16:03:15] heh, cmjohnson1 moved more servers [16:03:16] fixing [16:03:25] cmjohnson1: see ops-l & my incident report [16:03:28] gwicke: thanks... [16:03:54] paravoid: subbu noticed it, I'm just the messenger [16:04:36] tracepath to wikitech: Too many hops: pmtu 1500 [16:04:41] mark, a little help with virt0? it's suddenly unpingable [16:05:10] http://paste.debian.net/87533/ [16:05:11] ^ [16:05:19] (03CR) 10Andrew Bogott: [C: 032] Update custom virt-libvirt-driver to 1:2013.2.2-0ubuntu1~cloud0 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118551 (owner: 10Andrew Bogott) [16:05:43] Coren: you think it's routing and not a dos? [16:06:32] andrewbogott: I see no unusual traffic, and the server is unusually fast and snappy for me (which makes me suspect load is /lower/ than usual as you'd expect if part of the world can't reach it) [16:06:48] yeah, fair point. the lots seem totally fine [16:07:16] <^d> Was some other router change just made a bit ago by paravoid? [16:07:26] <^d> (If Coren is suspecting routing and it was the last change) [16:07:59] !log fixing pybal for mw1208/mw1209/mw1210 move; same issue as last time [16:08:02] cmjohnson1: ping [16:08:07] Logged the message, Master [16:08:11] gwicke: fixed [16:08:17] paravoid: ? [16:08:19] <^d> Coren: My traceroutes to wikitech end in tampa. [16:08:28] <^d> Before timing out [16:08:30] ^d, wikitech is in tampa [16:08:34] cmjohnson1: 18:03 < paravoid> cmjohnson1: see ops-l & my incident report [16:08:38] ^d: That's expected; but it looks like your packets aren't coming back. [16:08:49] Mine do, but through a different route entirely. [16:08:52] cmjohnson1: don't move servers before removing them from /h/w/conf/pybal/... [16:09:04] <^d> Coren: I can pastebin if that will help. [16:09:07] cmjohnson1: not disabling them, but removing them altogether [16:09:14] i didn't move these mw1021-mw1023 [16:09:35] mw1201, mw1202, mw1203 [16:09:37] typo [16:09:40] figured that out [16:09:56] and now it happened again with 1208-1210 :) [16:10:10] paravoid: is the thing you're talking about with cmjohnson1 related to the wikitech outage? [16:10:14] no [16:10:18] :( [16:10:21] Any ideas about that one, then? [16:10:23] paravoid: http://pastebin.com/S84sHSsK <-- wonky routing [16:10:43] what's "wonky" about it? [16:10:50] I was under the impression that moving the 3 at a time wouldn't cause any problems..i was misinformed. 
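To make the renumbering bug concrete: LVS real-server entries are keyed by IP address while the pybal configuration is keyed by hostname, so when a host's DNS record is changed behind pybal's back the entry for the old address is orphaned: nothing monitors it any more, yet it stays pooled, which is the "lost track of the old IPs" behaviour described earlier. A toy illustration, not pybal's implementation, and the reason the advice above is to remove hosts from the pybal files entirely before a move:

    import socket

    lvs_pool = {}    # ip -> hostname, standing in for the IPVS real servers

    def pool(hostname):
        # Resolution happens when the entry is (re)loaded.
        lvs_pool[socket.gethostbyname(hostname)] = hostname

    pool("mw1208.eqiad.wmnet")   # old row-C address goes into the pool
    # ... mw1208 is moved to row D and its A record is renumbered ...
    pool("mw1208.eqiad.wmnet")   # new address added; the stale entry for
                                 # the old address is never depooled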
[16:11:02] if helpful I am at the office and wikitech loads well from here [16:11:13] cmjohnson1: no, moving 3 at a time is fine indeed; there's an underlying bug that causes issues because of the IPs being renumbered [16:11:52] ack [16:12:09] paravoid: My packets are entering tampa from xo.net in FL, but returning via vtl in eqiad. I'm thinking that while it works for me it might explain why many others can't reach it. [16:12:24] that isn't a problem [16:12:33] paravoid: Known to not work for ^d, andrewbogott [16:12:39] what are your IPs? [16:12:48] and which direction is it broken? [16:12:55] in which* [16:13:48] i.e. ping virt0 from your connection and tcpdump. do you see the echo requests arriving? [16:13:48] I'm @ 50.181.252.248 [16:14:24] <^d> I'm @ 24.5.175.151 [16:14:39] 24.6.41.192 [16:14:48] and I suppose you have issues reaching Tampa in general, not just wikitech [16:14:49] I have been reading pages (static and geneated) from wikitech fine from here, logged in [16:14:51] e.g. fenari [16:15:28] paravoid, wikitech is getting my pings. [16:15:30] But I get no response [16:15:49] And, yes, cannot ping fenari either [16:16:02] (03PS1) 10BryanDavis: Add 1.23wmf18 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118553 [16:16:03] (03PS1) 10BryanDavis: Wikipedias to 1.23wmf17 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118554 [16:16:06] (03PS1) 10BryanDavis: Group0 wikis to 1.23wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118555 [16:16:56] nope, nothing in Tampa is reachable for me [16:17:12] <^d> Same for me. [16:17:17] mutante, ping wikitech? [16:17:24] 100% packet loss [16:17:28] is dickson? [16:17:37] mutante, I want to see if tcpdump gets it. [16:17:47] i get replies from dickson [16:17:53] i don't from fenari or stat1 [16:18:25] <^d> +1 to everything mutante said. [16:18:32] (03CR) 10BryanDavis: [C: 032] Add 1.23wmf18 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118553 (owner: 10BryanDavis) [16:18:33] andrewbogott: ping running [16:18:38] (03Merged) 10jenkins-bot: Add 1.23wmf18 symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118553 (owner: 10BryanDavis) [16:18:39] mutante: yep, I see it. [16:20:36] it works again [16:20:36] fixed [16:20:39] ack [16:20:55] yes, better. thank you paravoid [16:20:56] <^d> yay [16:20:56] !log deactivating Tele2 from eqiad, blackholing traffic [16:21:04] Logged the message, Master [16:21:25] it's one of these days [16:21:30] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [16:21:37] aha, thanks [16:21:41] like I was saying... [16:23:50] PROBLEM - Puppet freshness on brewster is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 01:22:57 PM UTC [16:24:47] ACKNOWLEDGEMENT - Puppet freshness on brewster is CRITICAL: Last successful Puppet run was Thu 13 Mar 2014 01:22:57 PM UTC alexandros kosiaris debugging carbon install [16:32:20] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Thu Mar 13 16:32:13 UTC 2014 [16:45:24] greg-g: I'm ready to scap now [16:45:49] good luck pulling from gerrit... [16:46:14] hoo: I'm all prepped on tin [16:46:46] for a moment it seemed like gerrit went fully away... looks ok now (again) [16:46:48] more or less [16:49:08] !log bd808 Started scap: testwiki to php-1.23wmf18 and rebuild l10n cache [16:49:17] Logged the message, Master [16:49:46] <^d> gerrit, again. [16:49:57] <^d> So slow. 
[16:52:29] (03CR) 10Hashar: [C: 031] "New MediaWiki version is being deployed right now. Once it is done I guess anyone can easily merge and deploy this change." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118342 (owner: 10Dan-nl) [16:52:31] (03PS1) 10Alexandros Kosiaris: Set carbon as a 4disk RAID5 with GPT machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/118560 [16:53:47] gerrit down again [16:54:20] maybe just slow [16:54:32] .... [16:54:35] yea, i got page for it [16:54:35] chrismcmahon: The web ui is really slow at the moment [16:54:44] gerrit? [16:54:53] (just got page) [16:54:53] someone who knows about it pls log ;] [16:54:59] oh, its up, just slow. [16:55:08] !log gerrit is just really slow, but not quite down. [16:55:15] Logged the message, Master [16:55:21] ^d: I'll take a look at it if I can get on the bo [16:55:22] x [16:56:11] 24011 gerrit2 20 0 32.9g 8.1g 15m S 15 25.9 98:34.90 java [16:56:16] ddoesn't seem that crazy [16:57:08] ^d: mind if I restart it anyways? [16:57:24] <^d> You can. [16:57:25] apergos: That matches with what ^d found ~2 hours ago. Lod on box looked ok but client https requests were slow/timing out [16:57:42] s/Lod/Load/ [16:57:49] isn't there a proxy in front of it? [16:57:49] <^d> ssh is completely fine. [16:57:55] for http [16:57:57] <^d> it runs being apache as a reverse proxy. [16:58:04] maybe it has nothing to do with gerrit itself ? [16:58:06] <^d> no errors in that log ~2h ago [16:58:12] <^d> akosiaris: That's what I said ;-) [16:58:26] bd808|deploy: heya, yeah, sorry, had a last minute need to go to the eye place for carrie (broken glasses) [16:58:43] greg-g: No worries. I'm mid scap. [16:58:50] * greg-g nods [16:58:50] (03CR) 10Alexandros Kosiaris: [C: 032] Set carbon as a 4disk RAID5 with GPT machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/118560 (owner: 10Alexandros Kosiaris) [16:59:01] (03CR) 10Alexandros Kosiaris: [V: 032] Set carbon as a 4disk RAID5 with GPT machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/118560 (owner: 10Alexandros Kosiaris) [16:59:10] did so [16:59:26] !Log restarted gerrit which had been slow [16:59:35] Logged the message, Master [17:00:21] well it seems ok for the moment, maybe some particular query was slow? [17:00:26] some particular request I mean [17:05:26] !log bd808 Finished scap: testwiki to php-1.23wmf18 and rebuild l10n cache (duration: 16m 14s) [17:05:34] Logged the message, Master [17:05:38] Not bad for a new branch scap [17:06:01] I wonder if deleting 5 brances helped [17:06:12] indeed, not bad at all [17:06:18] we'll never know! [17:06:48] !log bits symlinks for 1.23wmf6 to 1.23wmf10 deleted [17:06:57] Logged the message, Master [17:09:10] PROBLEM - SSH on lvs4001 is CRITICAL: Server answer: [17:13:41] Anyone have a sure fire way to check for broken l10n cache for extensions? I'd love to know if the cache is good for 1.23wmf18 or not. [17:14:53] (03PS1) 10Jgreen: disable pxeboot install for tantalum [operations/puppet] - 10https://gerrit.wikimedia.org/r/118570 [17:15:37] MaxSem: ^^ [17:16:13] bd808|deploy, Special:AllMessages? [17:16:13] RECOVERY - SSH on lvs4001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [17:18:00] * bd808|deploy scans that page [17:19:23] I at least see a lot of en messages for extensions. That's a good sign. 
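On the "sure fire way to check for broken l10n cache" question above: the same data Special:AllMessages shows is available from the action API via meta=allmessages, so a few extension-owned message keys can be spot-checked per wiki. A minimal sketch assuming the requests library; the wiki URL and the message keys are illustrative, any keys belonging to deployed extensions would do.

    #!/usr/bin/env python
    """Spot-check that extension messages resolve, i.e. the l10n cache is usable.

    Sketch only: wiki URL and message keys are illustrative examples.
    """
    import requests

    API = "https://test.wikipedia.org/w/api.php"   # a wiki already on the new branch
    KEYS = ["centralauth-login-progress", "mobile-frontend-terms-text"]  # illustrative keys

    def broken_messages(api, keys, lang="en"):
        params = {
            "action": "query",
            "meta": "allmessages",
            "ammessages": "|".join(keys),
            "amlang": lang,
            "format": "json",
        }
        data = requests.get(api, params=params, timeout=30).json()
        bad = []
        for msg in data["query"]["allmessages"]:
            if "missing" in msg or not msg.get("*", "").strip():
                bad.append(msg["name"])
        return bad

    if __name__ == "__main__":
        bad = broken_messages(API, KEYS)
        if bad:
            print("Messages not resolving (cache may be broken): %s" % ", ".join(bad))
        else:
            print("All sampled messages resolved; l10n cache looks usable.")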
[17:19:32] (03CR) 10Jgreen: [C: 032 V: 031] disable pxeboot install for tantalum [operations/puppet] - 10https://gerrit.wikimedia.org/r/118570 (owner: 10Jgreen) [17:20:10] PROBLEM - SSH on lvs4001 is CRITICAL: Server answer: [17:20:56] greg-g: {{done}} with prep. [17:21:22] And I got the deploy notes generated this week \o/ [17:21:30] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [17:22:01] bd808|deploy: w00t! [17:22:04] seriously [17:22:10] RECOVERY - SSH on lvs4001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [17:22:56] That deploy note process and the branch cut really need to be automated. That would cut ~20 minutes off of the deploy prep time [17:24:15] glue 'em together! [17:24:28] If we round up to 30 minutes https://xkcd.com/1205/ says that's worth investing 5 days in fixing [17:24:47] 7.3gb free, snapsshot4 thanks you [17:24:55] apergos: Cool! [17:25:11] PROBLEM - SSH on lvs4001 is CRITICAL: Server answer: [17:25:22] paravoid, I'm just looking into our timeouts; would the broken API backends have accepted any connection at all? [17:25:34] no [17:25:38] just a timeout [17:25:47] so no TCP connection at all [17:25:57] no [17:26:10] RECOVERY - SSH on lvs4001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [17:30:14] paravoid: We should keep an eye on 404 errors for a while. I dropped 5 really old branches in the scap that just finished. The latest one would have been last active on a wiki on 2014-01-30 [17:30:21] okay [17:30:27] thanks for the headsup :) [17:31:20] yw :) I don't want something boring like this to be the way I earn my "I broke enwiki" shirt [17:32:37] paravoid, in local testing refused connections are retried immediately [17:33:06] this wouldn't be refused [17:33:09] this is just dropped [17:33:38] ah [17:34:11] we might not be able to do very much there [17:34:29] IMO that kind of thing should be handled by the load balancer [17:34:40] along with backend issue reporting [17:34:56] just lower the timeout? [17:35:12] no, that would cut off slow requests [17:35:40] *connect* timeout [17:35:57] there is no separate connect timeout as far as I can tell so far [17:42:52] RECOVERY - HTTP on carbon is OK: HTTP OK: HTTP/1.0 200 OK - 314 bytes in 0.005 second response time [17:52:07] bd808: I need to change venues (barking dog next door and hyper toddler down stairs). Go forth when ready and such. I'll be online in ~15. [17:52:13] * bd808 nods [17:52:18] No rush. I need beverage and bio break anyway [17:52:29] (03CR) 10Aude: [C: 04-1] adding Amsterdam Museum to the wgCopyUploadsDomains array. (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118342 (owner: 10Dan-nl) [17:59:50] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [18:03:13] (03PS2) 10Dan-nl: adding Amsterdam Museum to the wgCopyUploadsDomains array. 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118342 [18:04:00] (03CR) 10Dan-nl: "adjusting domain entry based on aude’s comment in ps1" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118342 (owner: 10Dan-nl) [18:05:30] (03CR) 10BryanDavis: [C: 032] Wikipedias to 1.23wmf17 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118554 (owner: 10BryanDavis) [18:05:33] !log rebuilding search index for itwiki [18:05:40] (03Merged) 10jenkins-bot: Wikipedias to 1.23wmf17 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118554 (owner: 10BryanDavis) [18:05:42] Logged the message, Master [18:06:28] !log bd808 rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.23wmf17 [18:06:37] Logged the message, Master [18:09:52] deploy! [18:12:57] (03CR) 10BryanDavis: [C: 032] Group0 wikis to 1.23wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118555 (owner: 10BryanDavis) [18:13:04] (03Merged) 10jenkins-bot: Group0 wikis to 1.23wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118555 (owner: 10BryanDavis) [18:14:15] !log bd808 rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.23wmf18 [18:14:22] Logged the message, Master [18:14:26] * greg-g is back [18:14:37] greg-g: no excitement [18:15:06] * greg-g crosses fingers [18:15:14] pedias on wmf17, group0 on wmf18 [18:15:27] if elasticsearch complains in the next 20 minutes, ignore it please [18:15:42] I think that monitor might be sad in a few seonds [18:15:47] Damn. Little apc barf [18:16:07] I thought it's waited long enough [18:16:27] seems to be settling down [18:17:04] * chrismcmahon is just happy for no Thursday morning bug drama. Seems like every Thu AM for the last month or so has been a scramble to fix something critical [18:17:23] * bd808|deploy knows on lots of chunks of wood [18:17:28] *knocks [18:19:09] paravoid, do you know a good black hole IP to test against? [18:19:43] springle: Do you know where the views for the labs replicas are defined? It's somewhere in puppet, but I have no idea where [18:21:05] hoo: he's asleep (Australian) [18:21:16] mh [18:21:29] Coren might know [18:21:41] oh well, nevermind, I can log in from production and view them live [18:21:45] Know what? [18:21:50] doubt I need to change [18:21:55] views for labs replica [18:22:00] paravoid, nm; testing with a local iptables rule now [18:22:02] where are they? puppet? [18:22:12] * aude actually has a change in mind [18:22:30] gwicke: yes that would work -- sorry, in the middle of something... [18:22:49] aude: maintain-replicas in operations/software [18:23:15] thanks [18:23:22] Coren: Thanks :) [18:23:45] hoo: The %customviews hash is probably what you're interested in. [18:24:26] ah, indeed :) [18:25:02] cmjohnson1: the fate of sockpuppet is recycle/donate/trash ? [18:25:20] yes it is [18:25:34] creating a ticket for that [18:26:20] matanya: for what? [18:26:39] unrack sockpuppet [18:26:52] please dont [18:26:57] ok [18:27:00] hmmm, where are the wikidata tables defined? [18:27:00] there are decom steps involved [18:27:01] the special ones? [18:27:07] so just making a 'unrack this' will disrupt that process. 
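The timeout thread above turns on the difference between a backend that refuses connections (the client gets an immediate error and can retry at once) and one whose packets are silently dropped (the client blocks for the full connect timeout, and without a separate connect timeout that can be the whole request timeout). A minimal sketch of seeing both cases from Python; the addresses are illustrative, and as noted in the log a local iptables DROP rule on a spare port reproduces the black-hole case without needing a special IP.

    #!/usr/bin/env python
    """Show a refused connection versus a silently dropped one.

    Sketch only: 127.0.0.1:59999 is assumed to have no listener (refused),
    and 192.0.2.1 is a documentation-range address that is normally never
    answered (dropped). A rule like
        iptables -A OUTPUT -p tcp --dport 59998 -j DROP
    gives the same drop behaviour for a local port of your choice.
    """
    import socket
    import time

    def try_connect(host, port, timeout=5.0):
        start = time.time()
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
            sock.close()
            outcome = "connected"
        except socket.timeout:
            outcome = "timed out (packets dropped; waited the full connect timeout)"
        except socket.error as exc:
            outcome = "failed fast: %s" % exc      # e.g. ECONNREFUSED
        print("%s:%s -> %s after %.2fs" % (host, port, outcome, time.time() - start))

    if __name__ == "__main__":
        try_connect("127.0.0.1", 59999)   # refused: error comes back immediately
        try_connect("192.0.2.1", 80)      # black hole: blocks until the timeout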
[18:27:23] matanya: unless you see that its already been fully decommisiosned and wiped, it needs tickets for those first =] [18:27:55] i know robh i was planning on pointing at wikitech and then and , and unrack :) [18:28:10] fwiw : https://rt.wikimedia.org/Ticket/Display.html?id=6924 [18:28:33] cool, if you link all that in the chain so no one unracks before its wiped thats cool [18:28:43] but was afaid you would put in a 'unrack this' and nothign else about wipe. [18:29:15] never do such unuseful tickets :P [18:29:20] then i amend my statement to 'by all means if you are doing all that stuff go for it' [18:29:29] thanks [18:29:34] heh =] [18:30:44] Coren: i can't find where/how to add but https://bugzilla.wikimedia.org/show_bug.cgi?id=56180 would be great to resolve [18:31:10] !log mw1201, mw1202, mw1203, mw1208, mw1209, mw1210 didn't get new branch during scap [18:31:19] Logged the message, Master [18:31:30] (03PS1) 10Jalexander: Enable Translate and EducationProgram extensions on leglateamwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118581 [18:31:51] aude: If none of the information needs to be redacted, all you need is add it to @fullviews. But that also requires Shaun to start replicating the raw table too. [18:32:28] Sean* [18:33:04] ok, [18:33:19] i don't see our other tables there, but maybe they were added before this script was made [18:33:58] If you've got other tables missing from there, add them to the bugzilla. [18:34:11] robh: https://rt.wikimedia.org/Ticket/Display.html?id=7038&results=ec31cc06044d30b11409475a18d5217a [18:34:14] ok [18:36:32] (03PS1) 10Aude: Add wikidata tables for fullview [operations/software] - 10https://gerrit.wikimedia.org/r/118582 [18:36:48] aude: Weird indeed [18:36:56] hope that's correct [18:37:04] ack, tab [18:37:06] (03CR) 10Alexandros Kosiaris: [C: 032] ldap: puppet 3 compatibility fix: doamin is a fact [operations/puppet] - 10https://gerrit.wikimedia.org/r/118443 (owner: 10Matanya) [18:37:31] (03PS2) 10Aude: Add wikidata tables for fullview [operations/software] - 10https://gerrit.wikimedia.org/r/118582 [18:37:49] !log ran sync-common on mw1201 manually [18:37:57] Logged the message, Master [18:38:08] 11 of my changes merged today, what a good day :) [18:38:12] !log running sync-common on mw1202,mw1203,mw1208,mw1209,mw1210 via dsh [18:38:20] Logged the message, Master [18:38:52] so this the second "servers moved but the appropriate places in our dsh/acl configs weren't updated" in 24 hours? [18:39:24] bd808|deploy: Looks fixed [18:39:34] (03CR) 10Alexandros Kosiaris: [C: 032] nginx: remove lookupvar and replace with top scope @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/112881 (owner: 10Matanya) [18:39:39] paravoid: when you finish being in the middle of something, some guidence regarding check_graghite would be appreciated [18:39:41] aude: Probably someone created these views per hand once? 
The script has no "default" for handling unknown tables, so I have no idea [18:39:51] that's what i think [18:40:15] i left out wb_changes (don't think it's needed, wasn't there before) [18:40:28] and wb_id_counters [18:40:43] (03CR) 10Alexandros Kosiaris: [C: 032] torrus: remove lookupvar and replace with top scope @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/112884 (owner: 10Matanya) [18:41:14] * aude back in a bit [18:41:48] (03CR) 10Alexandros Kosiaris: [C: 032] misc: remove lookupvar and replace with top scope @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/112887 (owner: 10Matanya) [18:42:28] (03CR) 10Alexandros Kosiaris: [C: 032] ganglia: remove lookupvar and replace with top scope @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/112888 (owner: 10Matanya) [18:42:52] bd808: I assume you've already thought of it, but could you jot down what the issue was/how you fixed it/what the real fix is? I assume it's yet another repurcusion of moved servers without some other conf file being updated. [18:43:35] greg-g: I'm still looking to see if it was sunspots or something else [18:43:41] kk [18:43:50] The hosts pulled when I asked them to explicitly [18:45:14] robh: can you pick at https://gerrit.wikimedia.org/r/#/c/110366/ ? [18:45:16] (03CR) 10Hoo man: [C: 031] Add wikidata tables for fullview [operations/software] - 10https://gerrit.wikimedia.org/r/118582 (owner: 10Aude) [18:45:27] this is your teritory [18:46:56] !log mw1201,mw1202,mw1203,mw1208,mw1209,mw1210 in mediawiki-installation dsh group on tin; not sure why they didn't get the scap request [18:47:05] Logged the message, Master [18:47:28] !log fix deployed for bug 62497 [18:47:37] Logged the message, Master [18:52:17] matanya: I was looking it over earlier this morning actually [18:52:19] (03CR) 10Hoo man: [C: 04-1] Add wikidata tables for fullview (031 comment) [operations/software] - 10https://gerrit.wikimedia.org/r/118582 (owner: 10Aude) [18:52:31] oh, good sign i hope [18:53:02] yep, earlier means like, 20 minutes ago [18:53:29] oh, your morning [18:53:44] I'm going to append a +1 shortly, but i'd like someone else to also review since it touches well, every single ssl connection. [18:53:46] everywhere. [18:54:01] since its linting it should be fine of course, but im paranoid. [18:54:14] yeah, makes sense [18:54:35] (03PS3) 10Aude: Add wikidata tables for fullview [operations/software] - 10https://gerrit.wikimedia.org/r/118582 [18:54:55] (03CR) 10Aude: Add wikidata tables for fullview (031 comment) [operations/software] - 10https://gerrit.wikimedia.org/r/118582 (owner: 10Aude) [18:55:56] aude: Thaught so as well... but I just checked ;) [18:56:01] * Thought [18:56:19] thanks :) [18:56:37] (03CR) 10RobH: [C: 031] "This seems like a straightforward (needed) lint update to bring this in line. 
Since it does touch every single SSL connection on the varn" [operations/puppet] - 10https://gerrit.wikimedia.org/r/110366 (owner: 10Matanya) [18:56:43] i can't imagine id_counters is too useful, but no harm to include it [18:57:23] matanya: thanks for cleaning up lint stuff btw, its pretty thankless so figured one couldnt hurt ;] [18:57:44] thanks for your kind words [18:58:21] folks tend to update lint stuff when they write or fix things (well, i do) but i dont think anyone else is intentionally ensuring our currently deployed stuff adheres to said standard [18:58:38] and its easy to forget and miss when updating things [18:59:08] but maybe this will help so our new ops hires dont see our repo and suddenly burst into tears. [18:59:30] i try to go over every patch submitted, but mostly they get merged before i comment [19:01:11] robh: you know the easiest way to make sure it is followed is force jenkins to -1 on non lint-clean patches :P [19:01:33] (03PS1) 10Jalexander: Allow legalteamwiki to use mobile site [operations/dns] - 10https://gerrit.wikimedia.org/r/118585 [19:02:18] matanya: thats not gonna pass the teams agreement ;] [19:02:33] for good reasons [19:02:52] hopefully once everythign is linted over it can be done [19:03:48] * matanya crosses fingers [19:08:55] gah. bd808: 2014-03-13 19:08:01 mw1210 mediawikiwiki: [b5483896] /wiki/Special:CentralAutoLogin/deleteCookies?type=icon Exception from line 468 of /usr/local/apache/common-local/php-1.23wmf18/includes/cache/LocalisationCache.php: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php. [19:09:01] (03CR) 10Hoo man: [C: 031] Add wikidata tables for fullview [operations/software] - 10https://gerrit.wikimedia.org/r/118582 (owner: 10Aude) [19:09:16] blerg [19:09:32] Did the localization cache get synched? [19:09:49] Not on those boxes [19:10:02] sync-common doesn't fix the cache [19:10:16] They need to run the rebuild. I'm on it [19:10:33] :/ [19:12:10] csteipp: I ran scap-rebuild-cdbs, they should be good now [19:15:15] < gwicke> do we have response time metrics for the API? <-- gwicke is https://gdash.wikimedia.org/dashboards/apimethods/ enough for that, or is there something else you wanted? 
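Back on the straggler appservers further up: sync-common pulled the new branch but not a rebuilt localisation cache, hence the "No localisation cache found for English" fatals until scap-rebuild-cdbs ran. A quick per-host sanity check is to confirm the branch's l10n CDB files exist and are non-empty; the cache path below is an assumption extrapolated from the common-local path in the error message.

    #!/usr/bin/env python
    """Check that a branch's localisation cache CDB files exist and look sane.

    Sketch only: the cache layout (cache/l10n/l10n_cache-<lang>.cdb under the
    branch directory) is assumed; adjust if the real layout differs.
    """
    import os
    import sys

    BRANCH = "php-1.23wmf18"
    CACHE_DIR = "/usr/local/apache/common-local/%s/cache/l10n" % BRANCH
    LANGS = ["en", "de", "it"]   # spot-check a few languages

    def missing_caches(cache_dir, langs):
        bad = []
        for lang in langs:
            path = os.path.join(cache_dir, "l10n_cache-%s.cdb" % lang)
            if not os.path.isfile(path) or os.path.getsize(path) == 0:
                bad.append(path)
        return bad

    if __name__ == "__main__":
        bad = missing_caches(CACHE_DIR, LANGS)
        for path in bad:
            print("missing or empty: %s" % path)
        sys.exit(1 if bad else 0)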
[19:17:46] greg-g, that's already pretty useful especially now that the labels are fixed [19:18:19] gwicke: cool, I just jotted down that quote from you last night to follow up on today, saw those graphs in faidon's email, so wanted to make sure that was sufficient [19:18:39] also need alerting and ideally per-backend stats [19:18:54] the latter is hard with LVS [19:19:12] gwicke: yeah,the former is hard since we don't yet have a reliable way of doing that with these kinds of metrics [19:19:32] we still use dumb hard lines for eg reqstats [19:19:47] that's still better than nothing at all [19:20:17] true, it just gets ignored (no one has ever responded to the reqstats that I can remember, in and of itself, it needs other alerts for it to be noticed) [19:21:59] just retitled https://bugzilla.wikimedia.org/show_bug.cgi?id=57882 to include graphite [19:24:23] gwicke: cool, added to post mortem page [19:26:14] I'm also wondering if LVS for backend balancing is the best solution wrt error reporting and -handling [19:26:51] could make sense to reconsider the choice [19:27:07] varnish support RR balancing too for example [19:27:11] *supports [19:29:20] (03PS1) 10Jgreen: switch tantalum back to trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/118587 [19:31:59] (03CR) 10Jgreen: [C: 032 V: 031] switch tantalum back to trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/118587 (owner: 10Jgreen) [19:48:57] mw1160 arwikisource: HTTP 100 (Continue) in 'SwiftFileBackend::doStoreInternal' [19:50:05] haha [19:50:45] maybe a content-length headers went AWOL and it turned into chunked-transfer... [20:16:21] (03PS1) 10BBlack: enable Daniel Kinzler shell access - RT#7028 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118590 [20:17:29] (03CR) 10BBlack: [C: 032 V: 032] enable Daniel Kinzler shell access - RT#7028 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118590 (owner: 10BBlack) [20:22:48] \o/ [20:48:48] query apergos [20:49:12] he is sleeping, i suppose [20:49:17] 11 pm? I mean I can try to answer but I'm pretty checked out [20:49:20] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:50:06] 10 to, apergos you can still reply :P [20:50:10] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 428228 bytes in 8.677 second response time [20:50:28] hahaha [20:50:38] we'll see what the brain ha to say about that, it's the final arbiter [20:50:44] (03PS1) 10Tim Landscheidt: Tools: Install package joe [operations/puppet] - 10https://gerrit.wikimedia.org/r/118595 [21:00:50] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [21:08:33] (03CR) 10Dzahn: "they SSH key is still set to "absent" and there is a "revoked" comment further down the file" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118590 (owner: 10BBlack) [21:16:25] ori: around ? [21:21:42] matanya: we could use a new version of https://gerrit.wikimedia.org/r/#/c/94138/ [21:22:00] i'm going to make it, Reedy pointed it out.. unless you already have on disk again ,heh [21:22:05] There's no restore button! :( [21:22:08] funny, was reading the ticket atm [21:22:11] haha [21:22:21] and have it [21:22:31] Reedy: i knew it:) [21:23:09] According to ganglia we have a nice base 2 number of servers online in tampa: 128 [21:23:12] Reedy: please push a new version of the CU-purge [21:23:29] Uh, no, ignore that [21:26:54] < 60 hosts online [21:26:56] matanya: rebase? 
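On the alerting thread above (the fixed warn/crit lines on reqstats.5xx, and the earlier check_graphite question): the basic shape is to pull a metric from graphite's JSON render API and compare a recent aggregate against thresholds. A minimal sketch of that shape; the graphite host is illustrative, the 250/500 thresholds are the ones visible in the icinga messages in this log, and a real check would want smarter baselining than a fixed line, which is the point being made above.

    #!/usr/bin/env python
    """Toy graphite threshold check in the spirit of check_graphite.

    Sketch only: the graphite endpoint is illustrative. Uses the JSON render
    API (/render?target=...&from=-10min&format=json) and Nagios exit codes.
    Requires the requests library.
    """
    import sys
    import requests

    GRAPHITE = "https://graphite.example.org"   # assumed endpoint
    METRIC = "reqstats.5xx"
    WARN, CRIT = 250.0, 500.0

    def recent_mean(base_url, target, window="-10min"):
        resp = requests.get(
            "%s/render" % base_url,
            params={"target": target, "from": window, "format": "json"},
            timeout=30,
        )
        series = resp.json()
        values = [v for v, _ts in series[0]["datapoints"] if v is not None]
        if not values:
            raise ValueError("no datapoints for %s" % target)
        return sum(values) / len(values)

    if __name__ == "__main__":
        try:
            mean = recent_mean(GRAPHITE, METRIC)
        except Exception as exc:
            print("UNKNOWN: %s" % exc)
            sys.exit(3)
        if mean >= CRIT:
            print("CRITICAL: %s = %.1f (>= %.1f)" % (METRIC, mean, CRIT))
            sys.exit(2)
        if mean >= WARN:
            print("WARNING: %s = %.1f (>= %.1f)" % (METRIC, mean, WARN))
            sys.exit(1)
        print("OK: %s = %.1f" % (METRIC, mean))
        sys.exit(0)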
[21:27:10] rebase what? [21:27:23] i through some comments, there [21:27:42] Ahh [21:27:47] I'll have a look [21:29:01] (03PS1) 10Matanya: tmh: decommed [operations/puppet] - 10https://gerrit.wikimedia.org/r/118606 [21:32:09] sweeeet [21:34:06] greg-g: https://gerrit.wikimedia.org/r/#/c/118601/ LD OK for VE? https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=104221&oldid=104073 [21:34:40] (03PS1) 10Matanya: tmh: decommed, left mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/118608 [21:37:05] mutante: if you merged those two ^ i'll be very close to 20 merges today :) [21:38:15] 218 so far, as it seems, but who counts :P [21:41:51] James_F|Away: yeah [21:42:41] (03PS1) 10Reedy: Remove image scalers, mw api, mw and bits appserver and job runner ganglia groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/118611 [21:43:17] https://it.wikipedia.org/wiki/Speciale:UltimeModifiche <-- is that bug already known? [21:43:35] (03CR) 10Matanya: [C: 031] Remove image scalers, mw api, mw and bits appserver and job runner ganglia groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/118611 (owner: 10Reedy) [21:43:57] Vito: What bug? [21:44:24] recent changes are a know bug, they bother vandals [21:44:31] *known [21:47:05] Reedy: see recentchangestext [21:48:01] while mediawiki:recentchangestext is ok its inclusion into special:recentchanges shows some tags [21:48:16] well not always [21:49:34] Reedy: wops, it's not recentchagestext but the one just below [21:49:35] sorry [21:51:09] greg-g: Thanks! [21:57:11] (03PS1) 10BBlack: Enable daniel's key (should have been in 2156ba5e) [operations/puppet] - 10https://gerrit.wikimedia.org/r/118614 [21:57:46] (03CR) 10BBlack: [C: 032 V: 032] Enable daniel's key (should have been in 2156ba5e) [operations/puppet] - 10https://gerrit.wikimedia.org/r/118614 (owner: 10BBlack) [21:58:38] (03PS3) 10Reedy: Remove pmtpa apaches, leaving management entries [operations/dns] - 10https://gerrit.wikimedia.org/r/117210 [21:59:06] Reedy: copy-pasting ? :P [21:59:46] Copy pasting what? [22:00:00] my change [22:00:09] Look at the change id [22:00:54] I pressed rebase [22:01:16] (03PS1) 10Hoo man: rm access revoked comment from Daniel's entry in admins::restricted [operations/puppet] - 10https://gerrit.wikimedia.org/r/118615 [22:01:17] bblack: ^ [22:01:22] Reedy: was just kidding, way too late here [22:01:44] Make more commits! [22:02:54] (03CR) 10BBlack: [C: 032 V: 032] rm access revoked comment from Daniel's entry in admins::restricted [operations/puppet] - 10https://gerrit.wikimedia.org/r/118615 (owner: 10Hoo man) [22:03:56] did like 10 today, totally out by now :) [22:06:47] recursive negetive bblack ? [22:15:00] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 991.94261509 [22:15:10] There are quite a few php OOM errors in the fatal log in the last hour that seem to be related to api [22:18:51] http://de.wikipedia.org/w/api.php?action=query&prop=revisions&generator=recentchanges&grclimit=500&rvprop=content&rvparse [22:19:18] Across multiple wikis (en, ru...) [22:19:27] It's hardly a surprise it fails [22:28:04] (03CR) 10Dzahn: [C: 032] "RT #7018" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118611 (owner: 10Reedy) [22:28:41] mine too please [22:31:02] it's still doing stuff, matanya [22:32:24] that was a general merge request, for any commit. 
you should ignore me until tomorrow, not making sense anymore [22:37:39] Reedy, bd808: I created a bug about those kinds of API end points earlier [22:38:01] gwicke: I wondered if it was releated [22:38:55] it is too easy to bring down the API [22:40:42] * hoo whispers action=parse [22:41:51] (03PS2) 10Dzahn: tmh: decommed [operations/puppet] - 10https://gerrit.wikimedia.org/r/118606 (owner: 10Matanya) [22:43:07] greg-g: gwicke: Can I quickly push in a small Wikidata deploy before the LD? [22:43:53] hoo: what is it? [22:44:16] greg-g: A 3-line fix prevent php fatals... like if ( !$foo ) { return null } [22:44:51] * preventing [22:44:51] in wmf18? [22:44:51] my English's broken tonight :P [22:45:09] greg-g: wmf17 (as that's what Wikidata is on atm) [22:45:19] and maybe also wmf18, but we can do that later on [22:45:36] wmf18 is the same Wikidata code as we only deploy every 2 week as you may know [22:45:45] can you confirm it on testwikidata (which is on 18) and do the fix there, then backport when it's confirmed fixed? [22:46:36] greg-g: I guess I can arrange that... will of course need more time, though [22:50:15] (03CR) 10Dzahn: [C: 032] "RT #6222" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118606 (owner: 10Matanya) [22:50:18] greg-g: If it's good to go, I can push the wmf18 change [22:50:37] i think he wanted you to test if it's good to go:) [22:50:51] it's always small [22:51:12] mutante: Yeah, he wanted me to use wmf18 to test... that's why I want to push that now :P [22:51:56] i got confused though by: < hoo> and maybe also wmf18, but we can do that later on [22:52:22] yeah :P wmf18 is non vital for us (at this moment) [22:52:32] us = Wikidata Team [22:53:40] hoo: yeah, deploy to wmf18 [22:53:42] is ok now [22:54:04] ok, waiting for Jenkins [22:54:51] but yeah, just generally in these cases: let's always 1) fix on testwikidata (latest wmfXX) 2) confirm fix 3) backport to wmfXX-1 [22:55:22] Sounds sane, indeed [22:55:25] not that ya'll are cowboys and always breaking the site (the opposite, actually), but it's just a good way of covering bases [22:55:29] * hoo goes to prepare a test case on testwikidata [22:56:07] Done (https://test.wikidata.org/w/index.php?title=Q139&diff=3260&oldid=1521) [22:58:08] !log disabling puppet on tmh[12] [22:58:16] Logged the message, Master [22:59:08] Krinkle: ^d just fyi, hoo is deploying a quick wikidata fix to wmf18 (waiting on jenkins) [22:59:20] OK [22:59:25] <^d> mmk [22:59:35] * gwicke is ready to go for parsoid [22:59:44] jenkins looks slow atm... [22:59:55] <^d> I'm going to start the jenkins dance myself too [23:00:11] gwicke: go for it [23:01:10] ah, here we go [23:01:28] bblack || mutante: can you restart the parsoids for me? [23:02:10] !log deployed Parsoid 004c7acc, second attempt after trebuchet no-op the last few times [https://github.com/trebuchet-deploy/trigger/issues/26] [23:02:18] Logged the message, Master [23:02:42] I'm done, just need somebody to restart the service [23:04:14] !log hoo synchronized php-1.23wmf18/extensions/Wikidata/ 'Fix a fatal error in ContentRetriever (for testwikidata first)' [23:04:21] Logged the message, Master [23:04:30] ok, done [23:04:39] and my test case seems to work :) [23:05:24] Who's next in the queue? I guess ^d? 
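On the "too easy to bring down the API" point: the request quoted earlier (generator=recentchanges with grclimit=500, rvprop=content and rvparse in a single call) has a much gentler client-side equivalent, small batches, no server-side parse, and API continuation between requests. A minimal sketch, with the wiki, batch size and page budget all illustrative.

    #!/usr/bin/env python
    """Fetch recent-changes revision text in small continued batches rather
    than one huge generator+rvparse request.

    Sketch only: wiki URL, batch size and page budget are illustrative.
    Requires the requests library.
    """
    import requests

    API = "https://de.wikipedia.org/w/api.php"
    BATCH = 20        # far below the grclimit=500 that blew the memory limit
    MAX_PAGES = 100   # total page budget for this run

    def iter_recent_revisions(api, batch, max_pages):
        session = requests.Session()
        params = {
            "action": "query",
            "generator": "recentchanges",
            "grclimit": batch,
            "prop": "revisions",
            "rvprop": "content",
            "format": "json",
            "continue": "",
        }
        seen = 0
        cont = {}
        while seen < max_pages:
            data = session.get(api, params=dict(params, **cont), timeout=60).json()
            for page in data.get("query", {}).get("pages", {}).values():
                revs = page.get("revisions")
                if revs:
                    yield page["title"], revs[0].get("*", "")
                    seen += 1
            if "continue" not in data:
                break
            cont = data["continue"]

    if __name__ == "__main__":
        for title, text in iter_recent_revisions(API, BATCH, MAX_PAGES):
            print("%s: %d bytes of wikitext" % (title, len(text)))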
[23:06:02] * ^d 's still waiting on jenkins [23:06:24] ok, whenever you're done (and greg-g is fine with that) I can go for wmf17 [23:06:30] (03CR) 10Chad: [C: 032] Moving all "small" wikis into Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118482 (owner: 10Chad) [23:06:50] robh || Coren: I need a service restart for Parsoid: 'dsh -g parsoid service parsoid restart' [23:07:31] thats done from tin for dsh now right? [23:07:56] (03Merged) 10jenkins-bot: Moving all "small" wikis into Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118482 (owner: 10Chad) [23:07:57] should work, yes; I'm normally running it from bast1001 [23:08:06] <^d> Huh, that's it? [23:08:08] doing now [23:08:19] <^d> Oh bleh nvm. [23:08:19] <^d> Ignore me. [23:08:53] !log restarting parsoid service [23:08:57] Logged the message, Master [23:08:57] ^d and Krinkle: whoever's ready :) [23:09:03] robh, thanks a bunch! [23:09:25] !log demon synchronized wmf-config/InitialiseSettings.php 'Moving all small wikis into cirrus' [23:09:25] gwicke: is it supposed to sit on the first one? [23:09:28] Logged the message, Master [23:09:29] ahh, there it goes [23:09:29] first one was really long, then some went faster, seems a mixed bag [23:09:29] robh, it should take some time for each [23:09:30] the idea is to have a rolling restart [23:09:39] so that the service as a whole does not go down [23:09:44] uptime, meh ;] [23:09:53] ;) [23:10:14] devs aren't responsible for uptime/stability [23:10:28] ^d: you go first [23:10:44] if we are going to argue that im gonna just cut and paste the wikitech thread. [23:10:59] he just did, I do believe :) [23:11:02] robh: :P [23:11:05] k :) [23:11:06] heh [23:11:06] robh: you mean say to me what I said on the thread? :) [23:11:25] yes, but i need you to be realllllly offended [23:11:27] oh right [23:11:29] out of scope offended [23:12:02] someone film it, i want greg-g to flip his standing desk and storm off ;] [23:12:38] gwicke: looks like the restarts are all finished [23:12:48] !log demon synchronized php-1.23wmf17/extensions/CirrusSearch/ [23:12:50] all good on yer end? [23:12:56] Logged the message, Master [23:13:12] robh: yup [23:13:37] cool [23:15:41] logs are looking good [23:15:56] ^d: Hm.. you not done yet, right? [23:15:57] <^d> Almost [23:16:07] thought for a second but misinterpreted greg-g saying 'did' start, as you being done as well. But then came another sync :) [23:16:08] my bad [23:17:08] !log demon synchronized php-1.23wmf18/extensions/CirrusSearch/ [23:17:09] Logged the message, Master [23:17:20] !log tmh1,tmh2: removed from monitoring, shutting down [23:17:21] Logged the message, Master [23:17:24] <^d> Krinkle: Done. [23:17:31] <^d> Just doing some maintenance scripts but off tin. [23:19:07] !log krinkle synchronized php-1.23wmf18/extensions/VisualEditor/ 'd121ac812fa155cd4554e95' [23:19:15] Logged the message, Master [23:20:00] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 12.8028174236 [23:20:33] Done [23:21:20] ok... [23:21:26] greg-g: Am I good to go? [23:21:42] hoo: yep! [23:22:07] ok :) [23:22:12] greg-g: can i change dsh list ?:) [23:22:41] i'll wait [23:24:14] mutante: let hoo finish ;) [23:24:24] mutante: then yeah, fixing dsh lists is great [23:24:30] waiting for jenkins... 
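The Parsoid restart above is deliberately rolling: dsh walks the group host by host, so the cluster never loses all backends at once. The same idea with plain ssh, one host at a time with a health check before moving on, looks roughly like this; the host list, service name and port are illustrative assumptions, and passwordless ssh with rights to restart the service is taken for granted.

    #!/usr/bin/env python
    """Rolling service restart: one host at a time, wait until it answers again.

    Sketch only: host list and health-check port are illustrative assumptions.
    """
    import socket
    import subprocess
    import time

    HOSTS = ["wtp1001.eqiad.wmnet", "wtp1002.eqiad.wmnet"]  # illustrative group
    SERVICE = "parsoid"
    PORT = 8000          # assumed HTTP port for the health check

    def wait_until_up(host, port, attempts=30, delay=2):
        for _ in range(attempts):
            try:
                socket.create_connection((host, port), timeout=2).close()
                return True
            except socket.error:
                time.sleep(delay)
        return False

    if __name__ == "__main__":
        for host in HOSTS:
            print("restarting %s on %s" % (SERVICE, host))
            subprocess.check_call(["ssh", host, "sudo", "service", SERVICE, "restart"])
            if not wait_until_up(host, PORT):
                # Stop the roll rather than take down further backends.
                raise SystemExit("%s did not come back on port %d" % (host, PORT))

Stopping as soon as a host fails its health check is the point of restarting one at a time: a bad deploy takes out one backend, not the whole pool.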
[23:27:34] (03PS1) 10Reedy: Decomision solr[1-3] [operations/puppet] - 10https://gerrit.wikimedia.org/r/118632 [23:28:46] !log hoo synchronized php-1.23wmf17/extensions/Wikidata/ 'Fix a fatal error in ContentRetriever' [23:28:53] Logged the message, Master [23:29:01] sweet, 4 deploys in under 30 minutes [23:29:06] Ok, I'm done [23:29:15] go team [23:29:20] and it works on WD... thanks, greg-g [23:29:22] :D [23:31:00] (03PS1) 10Reedy: Remove solr[1-3]. Leave mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/118634 [23:32:37] (03PS2) 10Reedy: Decommission solr[1-3] [operations/puppet] - 10https://gerrit.wikimedia.org/r/118632 [23:46:37] (03CR) 10Dzahn: [C: 032] "RT #6222" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118375 (owner: 10Matanya) [23:48:05] (03PS3) 10Reedy: Decommission solr[1-3] [operations/puppet] - 10https://gerrit.wikimedia.org/r/118632 [23:48:17] (03PS2) 10Reedy: Remove solr[1-3]. Leave mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/118634 [23:48:42] !log ersch: disable puppet, remove from monitoring [23:48:51] Logged the message, Master [23:53:15] (03CR) 10Ori.livneh: "@BBlack: Yes, we do. CentralAuth calls session_start multiple times, which triggers PHP bug ." [operations/puppet] - 10https://gerrit.wikimedia.org/r/117375 (owner: 10Ori.livneh) [23:53:28] bblack: ^ [23:56:34] !log shutting down ersch (former poolcounter) [23:56:43] Logged the message, Master [23:58:20] (03PS1) 10Reedy: Decomission sq67, sq68, sq69, sq70 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118641 [23:59:56] (03PS1) 10Reedy: Remove sq67, sq68, sq69, sq70. Leave mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/118642