[12:01:23] New patchset: Mark Bergsma; "Cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2531 [12:01:48] New patchset: Mark Bergsma; "Move monitor_service to the service class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2532 [12:02:09] New patchset: Mark Bergsma; "Pull LVS service IPs from lvs.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2533 [12:02:30] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2531 [12:02:30] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2531 [12:02:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2532 [12:02:31] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2532 [12:02:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2532 [12:02:46] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2533 [12:02:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2533 [12:06:57] New patchset: Mark Bergsma; "Pull in LVS service IPs from lvs.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2534 [12:08:09] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2534 [12:08:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2534 [12:17:19] New patchset: Mark Bergsma; "Scoped variable lookups" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2535 [12:17:41] New patchset: Mark Bergsma; "Make varnish_xff_sources a class parameter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2536 [12:18:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2535 [12:18:01] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2535 [12:18:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2535 [12:18:34] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2536 [12:18:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2536 [12:24:44] New patchset: Mark Bergsma; "Use class parameters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2537 [12:25:15] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2537 [12:25:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2537 [12:55:10] New review: Demon; "Are we going to be doing python things in jenkins?" 
[integration/jenkins] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2289 [13:28:01] New patchset: Mark Bergsma; "Use an upstart manifest for managing varnish udp loggers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2538 [13:29:07] New patchset: Mark Bergsma; "Use an upstart manifest for managing varnish udp loggers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2538 [13:30:25] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2538 [13:30:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2538 [13:32:53] New patchset: Mark Bergsma; "Move out upstart_job dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2539 [13:33:43] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2539 [13:33:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2539 [13:38:30] New patchset: Mark Bergsma; "Using a service type causes conflicts, try exec instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2540 [13:39:02] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2540 [13:39:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2540 [13:40:26] New patchset: Mark Bergsma; "Fix dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2541 [13:40:58] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2541 [13:40:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2541 [13:48:17] New patchset: Mark Bergsma; "Don't start if already running" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2542 [13:49:16] New patchset: Mark Bergsma; "Don't start if already running" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2542 [13:49:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2542 [13:49:47] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2542 [13:49:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2542 [13:50:13] New patchset: Hashar; "gitreview file" [operations/software] (master) - https://gerrit.wikimedia.org/r/2543 [13:52:03] New patchset: Mark Bergsma; "su forks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2544 [13:52:34] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2544 [13:52:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2544 [14:00:58] New patchset: Mark Bergsma; "Make varnish_instance default undefined, never mind pid files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2545 [14:02:09] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2545 [14:02:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2545 [14:04:13] New review: Hashar; "I have looked at the jenkins gerrit plugin, patchsets are polled by jenkins so we do need a "patchse..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2495 [14:05:28] New review: Demon; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2543 [14:05:28] Change merged: Demon; [operations/software] (master) - https://gerrit.wikimedia.org/r/2543 [14:06:13] New patchset: Demon; "Chad's the best" [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2546 [14:13:17] Change abandoned: Demon; "Was just testing git-review" [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2546 [14:15:56] New patchset: Mark Bergsma; "Fix su invocation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2547 [14:16:26] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2547 [14:16:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2547 [14:19:54] New patchset: Mark Bergsma; "Fix status command" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2548 [14:20:26] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2548 [14:20:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2548 [14:22:17] New patchset: Mark Bergsma; "Specify path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2549 [14:22:49] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2549 [14:22:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2549 [14:26:34] New patchset: Mark Bergsma; "Remove status" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2550 [14:26:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2550 [14:27:06] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2550 [14:27:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2550 [14:31:07] New patchset: Hashar; "Force Jenkins request through HTTPS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2551 [14:45:22] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2551 [15:13:34] New patchset: Mark Bergsma; "Upstart cookbook recommends using start-stop-daemon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2552 [15:14:12] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2552 [15:14:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2552 [15:25:35] New patchset: Mark Bergsma; "With start-stop-daemon, varnishncsa needs to fork" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2553 [15:26:16] New patchset: Mark Bergsma; "With start-stop-daemon, varnishncsa needs to fork" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2553 [15:26:53] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2553 [15:26:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2553 [15:31:23] New patchset: Mark Bergsma; "Try expect fork" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2554 [15:32:09] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2554 [15:32:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2554 [15:52:50] New patchset: Mark Bergsma; "Comment for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2555 [15:53:26] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2555 [15:53:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2555 [16:28:01] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2551 [16:28:02] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2551 [16:32:04] New review: Dzahn; "confirmed. jenkins on http://integration.mediawiki.org/ci/ now forces HTTPS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2551 [19:14:14] RobH: how'd the move go saturday ? [19:14:49] move? [19:15:03] db9 move ? [19:15:07] ah [19:15:17] db9...went good [19:15:26] no hiccups everything came back up normally [19:15:50] yay :) [19:16:18] the only thing learned was mysql needs to be started seperate in safe mode [19:17:04] it's the fb version, which is in /usr/local/something [19:20:05] interesting [19:20:14] so it doesn't startup on load ? [19:20:34] /etc/init.d/whatever must point to something else [19:20:38] I have to look at that again [19:20:53] but yeah we always wind up starting the fb one from the command line [19:27:06] anyone remember the equivalent of "free" on os x ? [19:27:29] nope [19:43:54] LeslieCarr: by 'os x' do you mean 'objective C'? Or do you mean some other kind of 'free'? 
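The run of puppet patchsets earlier in this log ("su forks", "With start-stop-daemon, varnishncsa needs to fork", "Try expect fork") is all about how many times the logger process forks before Upstart should consider it started: Upstart's "expect fork" stanza tracks one fork, "expect daemon" tracks two, and an extra fork introduced by su throws the count off, which is why the job kept needing adjustment. For reference, a classic Unix daemon performs exactly two forks; the sketch below is a generic illustration of that pattern, not the varnishncsa or upstart code from these changes.

    import os
    import sys
    import time

    def daemonize():
        """Classic double-fork daemonization: two forks in total, which is what
        init systems mean by 'expect daemon' (as opposed to 'expect fork')."""
        if os.fork() > 0:       # first fork: the original process exits
            sys.exit(0)
        os.setsid()             # become session leader, detach from the tty
        if os.fork() > 0:       # second fork: session leader exits so the
            sys.exit(0)         # daemon can never reacquire a controlling tty
        os.chdir("/")
        os.umask(0o022)
        devnull = os.open(os.devnull, os.O_RDWR)
        for fd in (0, 1, 2):    # detach stdio from the old terminal
            os.dup2(devnull, fd)

    if __name__ == "__main__":
        daemonize()
        with open("/tmp/demo-daemon.log", "a") as log:
            log.write("daemon pid %d started\n" % os.getpid())
        time.sleep(60)          # stands in for the real worker loop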
[19:44:18] andrewbogott[gon: i mean as in "how much memory" and os x i mean apple's os called os x [19:44:49] In this case I am no help. [19:45:50] (Coding on OSX is mostly done in ObjC, and in ObjC you 'release' things that you 'free' in C. So my question wasn't a total nonsequitor.) [19:50:31] so, i'm thinking of switching over a class of machines to use initcwnd 10 [20:01:09] New review: Hashar; "Thanks for the merge. Jenkins has an other issue though, its cookie is not secure so browsing to ht..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2551 [20:21:35] anyone around from ops who can approve a moderation request on one of the mobile lists? [20:21:39] i don't have the password for it [20:21:44] tfinc: sure. [20:21:55] list address? [20:21:56] mobile-feedback [20:22:13] i'd like to get yuvi's request to be on that list approved [20:22:24] I can't find a list by that name. [20:22:31] New patchset: Lcarr; "Ensuring tcp setting tweaks are on all serving platforms" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2556 [20:22:43] https://lists.wikimedia.org/mailman/listinfo/mobile-feedback [20:22:44] maplebed: its not a public list [20:22:51] doesn't matter - the direct URL should still work. [20:23:11] maplebed: https://lists.wikimedia.org/mailman/admindb/mobile-feedback-l [20:23:15] -l :D [20:23:20] that'd do it. [20:23:55] that list should have never had a -l [20:24:17] tfinc: doesn't yuvi have an @wikimedia address? [20:24:38] he keeps all of his mailing lists on it [20:24:49] damn, there's a lot of moderated messages... [20:25:02] maplebed: the previous maintainer never di it [20:25:08] i'd love to clean it up [20:26:42] ok, all set. [20:27:08] woot! [20:27:12] thanks maplebed [20:31:38] New patchset: Pyoungmeister; "making new lucene classes and configs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2557 [20:35:16] New patchset: Pyoungmeister; "making new lucene classes and configs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2557 [20:43:59] apergos (or anyone really) any ideas on what would cause http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=&c=Miscellaneous+pmtpa&h=ms5.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS to happen? [20:44:39] my guess is that whatever it is is the root couse behind http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&hreg[]=^ms-fe[1-9]&mreg[]=swift_.*_404_avg>ype=line&title=Swift+average+query+response+time+-+404s&aggregate=1 [20:44:51] and http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=load_one&s=by+name&c=Image+scalers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [20:46:07] holy crap [20:46:12] no I don't off the top of my head [20:46:13] hmm [20:46:36] it's interesting how it's network without a lot of other activity [20:48:02] I dunno if that's true [20:48:17] oh wait, I take that back--I could have sworn I saw a disk utilization graph but now I dont [20:48:41] !log rebooting brewster [20:48:43] Logged the message, Master [20:49:10] dear microsoft solution, please work for me. my poor jetlagged brain is worthless todat. [20:49:13] *today [20:50:33] what do the scalers show, anything interesting? [20:50:55] I haven't looked at one closely yet. [20:51:16] both ms5 and swift show no significant change in the number of requests to the scalers, though their network use has also changed. [20:51:31] maybe someone's asking for scaling of a bunch of gargantuan images or something... [20:51:37] well, that didn't solve the problem. 
not that I thought it would :( [20:56:28] cpu_system is high [21:01:10] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=&s=by+name&c=Image+scalers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [21:01:12] heh [21:01:26] sadly there aren't a lot of connections coming out of ms5 except to ms-fe{1,2} and rendering.svc.pmtpa.wmnet [21:01:47] AaronSchulz: I'm inclined to believe that's a symptom, not a caues. [21:01:49] cause. [21:04:55] ms5 is running an *ancient* version of php [21:05:44] also . . . "61 updates are security updates." [21:05:58] probably not related but that makes me sad [21:06:39] I'm looking for something that went boom at 12:00UTC [21:06:48] yeah [21:06:50] New patchset: Demon; "Adding .gitreview" [mediawiki/extensions/Translate] (master) - https://gerrit.wikimedia.org/r/2558 [21:07:20] New review: Demon; "(no comment)" [mediawiki/extensions/Translate] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2558 [21:07:20] Change merged: Demon; [mediawiki/extensions/Translate] (master) - https://gerrit.wikimedia.org/r/2558 [21:07:52] do we have graphs of squid/varnish cache efficiency somewhere? this isn't an area I've explored yet [21:08:02] yeah the time is suspicious [21:10:03] oh, I guess i did ask you here earlier. [21:10:34] ww [21:12:18] * maplebed switches channels this time. [21:12:22] AaronSchulz: I don't know how to read http://noc.wikimedia.org/cgi-bin/ng/report.py?db=thumb [21:12:31] what is it telling us? [21:12:50] (and do you know what the unit is for the real/c column?) [21:13:46] real/c is cpu wall time spent in a MW function divided by the number of calls to the function [21:14:02] and what is ddjvu? [21:14:03] well, the term cpu is misleading there then [21:14:06] I am not seeing it. I looked at atop for before and after [21:14:12] it could just be anything being slow, like disk [21:14:17] ok so tcpdump is busier and count_gets.py is busier [21:14:29] not a lot of info there [21:14:50] the tcpdump and count_gets has been going on for weeks now, so I have a hard time blaming it. [21:15:08] (though it was spiking up in the CPU graph so I kicked it in case there was something wonky with it - no effect) [21:15:16] maplebed: hmm, media/DjVu.php it seems [21:15:36] * AaronSchulz isn't super familiar with the transform handler for that type of file [21:16:08] well they were only busier because there's more gets [21:16:13] = more traffic [21:16:15] that's how I see it [21:16:27] except that the number of gets has remained relatively normal (according to the graph) [21:16:36] although [21:16:36] and it's confirmed by the number of 404s on swift's graph. [21:16:50] odd that 60% of thumb.php time is for DjVu, but maybe it's normal, grrr...I with that page had historical numbers [21:17:20] tcpdump is using a lot les s now for whatever reason [21:17:37] !log copied a resolv.conf to brewster, apt-get upgrade on brewster and restarted lighttpd and squid on brewster [21:17:40] Logged the message, Mistress of the network gear. [21:18:07] I do find it odd that the amount of network traffic jumped without the number of requests jumping; makes me thing someone's doing something over NFS that has nothing to do with image scaling and the results of that something else are breaking image scaling. [21:18:30] and that sorta goes with the high cpu system but not high cpu user [21:18:35] * maplebed grabs a tcpdump caputure to analyze network endpoint statistics. [21:19:18] rsync of ms5 someplace? 
[21:19:23] nah [21:19:25] we don't do ms5 [21:19:27] hmmm [21:22:39] Reedy: looks like you already ran checkoutMediaWiki? [21:23:00] Aye [21:23:15] Pretty much the only thing to do is set test2 to 1.19 and sync it [21:23:23] and the xff, interwiki stuff too? [21:23:30] all copied [21:23:34] nice [21:23:35] tcpdump doesn't tell me anything beyond that ms5 talks to ms-fe*, rendering, and the image scalers. [21:23:42] yeah, same with netstat [21:23:42] maplebed: there are a fair amount of nfs timeouts starting at 12 today [21:23:47] in logs on srv219 [21:23:51] I was looking for those and not seeing them [21:23:56] ah I was looking at ms5 [21:23:57] bah [21:24:03] srv219:/var/log# grep "timed out" messages|awk '{print $2 " " $3}'|awk -F: '{print $1}'|sort|uniq -c [21:24:23] course they will only be on the clients :-/ [21:24:57] i vote we run a dist-upgrade on ms5 and reboot it [21:25:04] yeah, also true on srv221, but is that just another symptom? [21:25:10] no clue [21:25:52] showmount: so not informative. yeah those clients all have it mounted :-/ [21:28:15] !log reloading brewster [21:28:18] Logged the message, Mistress of the network gear. [21:28:36] yes, it's a symptom all right [21:28:42] but the overload is caused by..? [21:32:10] so when I look at the weekly for ms5 [21:32:15] it's a little more believable [21:32:20] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=&c=Miscellaneous+pmtpa&h=ms5.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [21:32:59] the little leap in max disk space used at the end is ... interesting [21:33:36] so yeah I guess it is an increase in scaled images [21:33:48] which may come from an increase in uploads [21:34:04] apergos: you may be on to something there. [21:34:09] time frame matches [21:34:11] so [21:34:21] first, find out if there's any bots or batched uploads going [21:34:40] that graph tangent does jump suspiciously close to noon. [21:34:54] yeah, it's pretty clear on the daily graph [21:35:18] I would think, though, that swift's disk usage would mirrror it... [21:35:29] it should [21:35:30] oh wait, we're not measuring disk usage, only object count. [21:35:33] hahaha [21:35:35] boooo [21:35:37] * maplebed goes to look. [21:35:54] if someone uploaded a batch of really big stuff, it might have this effect... [21:36:01] well [21:36:12] if someone uploaded some large things (nara images for example) [21:36:26] time to go look at locke [21:36:49] what will locke tell you? [21:37:55] * maplebed starts graphing bytes in addition to objects [21:38:04] that's where sampled 1000 is [21:40:15] not seeing anything obvious there [21:42:14] wouldn't a lot of resizing show up as user CPU? [21:42:20] somewhere? [21:42:40] yeah, it should. [21:42:56] the two CPU jumps I've seen are 'system' on ms5 and wio on srv* [21:45:09] ok, I've started counting bytes on swift in ganglia, though it's too late to look and see what happened at noon. [21:45:16] I'm asking in the commons channel about batch uploads [21:50:31] is the number of scaler requests higher than usual? [21:50:57] of course if it is that can just be people retrying after failure... now... but maybe not earlier [21:51:03] <^demon> apergos: I think I got the rules correct to migrate your dumps stuff to git :) [21:51:06] <^demon> I was having trouble friday [21:51:21] you mean and leave out the directory I deleted? [21:51:27] cool [21:51:41] apergos: the number of scaler requests doesn't look like it changed dramatically. 
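The grep/awk one-liner above buckets the NFS "timed out" messages from the client's /var/log/messages by day and hour, so the onset of the timeouts can be lined up against the ganglia graphs. A rough Python equivalent of the same survey (path and timestamp format assumed to be standard syslog, as on srv219):

    import re
    import sys
    from collections import Counter

    # Same idea as:
    #   grep "timed out" messages | awk '{print $2" "$3}' | awk -F: '{print $1}' | sort | uniq -c
    SYSLOG_TS = re.compile(r"^(\w{3})\s+(\d+)\s+(\d{2}):")   # e.g. "Feb 13 21:23:..."

    def count_timeouts(path="/var/log/messages"):
        """Count NFS 'timed out' kernel messages per day and hour."""
        buckets = Counter()
        with open(path, errors="replace") as log:
            for line in log:
                if "timed out" not in line:
                    continue
                m = SYSLOG_TS.match(line)
                if m:
                    month, day, hour = m.groups()
                    buckets["%s %s %s:00" % (month, day, hour)] += 1
        return buckets

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
        counts = count_timeouts(path)
        for bucket in sorted(counts):
            print("%6d  %s" % (counts[bucket], bucket))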
[21:52:02] (ms5 graphs the number of requests it sends back to the scalers) [21:52:15] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=load_one&s=by+name&c=Image+scalers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [21:52:47] <^demon> apergos: No, that directory didn't bother me. I was getting bit by this annoying bug in svn2git [21:52:49] I agree that looks annoying but I can't tell you what happened there, other than I don't think it's the number of requests [21:52:55] (it may well be the type of request? dunno.) [21:53:04] hi TimStarling ! [21:53:10] hi [21:53:25] I bet you'll know exactly what's going on with that link apergos just dropped in here. [21:53:26] :D [21:53:46] ^demon: I admit I'm curious about the bug [21:53:53] (symptom we're trying to fix - image scaling is taking ~30s instead of the normal ~300ms) [21:54:03] looks busy [21:54:18] <^demon> apergos: svn2git is very picky about how you write rules for path matching. If you leave the trailing / off a rule, it explodes with a totally unhelpful error. [21:54:42] how did you pick me for leaving off the trailing / ? [21:54:57] yeah. dunno why though. the number of requests going back to the scalers doesn't look like it's really changed dramatically: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=ms5.pmtpa.wmnet&v=%2037&m=Image%20scale%20requests%20per%20second&r=day&z=default&jr=&js=&st=1329169953&vl=qps&z=large [21:55:02] <^demon> Well I was hitting it with some other repos too when trying to export the release tags. [21:55:12] <^demon> But I happened to narrow down the problem when working on your repo :) [21:55:36] I'll be happy to start using git [21:56:03] oh dear [21:56:13] it looks like Swift isn't appending to the X-Forwarded-For header [21:56:54] I wonder if it is forwarding the User-Agent header, it looks fake at srv219 [21:57:10] <^demon> apergos: Well if this conversion I'm running right now works then you can switch today :) [21:57:11] can swift be quickly patched to send XFF? [21:57:33] TimStarling: not sure, but it will probably be annoying. [21:57:36] ^demon: that will mean tomorrow for me (I'm going to bed soon) [21:57:37] but might not be. [21:57:43] :P [21:58:11] TimStarling: which log (or thingy) were you looking at? [21:58:16] tcpdump [21:58:19] TimStarling: for us noobs, what negative effect does not passing on XFF have ? [21:58:23] ok. [21:58:33] LeslieCarr: it means you can't look at the client IPs with tcpdump [21:58:55] ah but nothing specific as far as "causes crazy crash" ? :) [21:59:00] no [21:59:40] the user agent header is probably fake, unless swift is sending "Mozilla/5.0" as the user agent for some unknown reason [22:00:17] TimStarling: those addresses are available in swift's logs. [22:00:25] (even if it's not passincg them on) [22:00:43] where? [22:00:50] * maplebed does a quick grep | cut to see if there's any pattern [22:00:56] I am on ms-fe1 looking for them [22:00:58] /var/log/syslog on the swift proxies (ms-fe1 and msn-fe2) [22:01:24] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2557 [22:01:25] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2557 [22:01:44] is there any indication here of whether a request was forwarded to the backend or not? [22:01:58] 404s fall through to ms5 [22:02:13] so a grep 'GET.* 404 ' should catch them. 
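Because 404s fall through to ms5 and the scalers, the 404 lines in the swift proxy's /var/log/syslog are a usable stand-in for backend load, and grouping them by client shows who is generating it; a sample log line and the result (the pdf hosts as top requestors) follow below. A sketch of that grep-and-count, assuming the client address is the field right after the "proxy-server" tag, so adjust the index if the log format differs:

    import sys
    from collections import Counter

    def top_404_clients(path="/var/log/syslog", limit=10):
        """Count GET requests that returned 404, per client address."""
        clients = Counter()
        with open(path, errors="replace") as log:
            for line in log:
                if "proxy-server" not in line or " GET " not in line:
                    continue
                fields = line.split()
                try:
                    status = fields[fields.index("GET") + 3]           # METHOD PATH PROTO STATUS
                    client = fields[fields.index("proxy-server") + 1]  # assumed client-address field
                except (ValueError, IndexError):
                    continue
                if status == "404":
                    clients[client] += 1
        return clients.most_common(limit)

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/syslog"
        for client, hits in top_404_clients(path):
            print("%8d  %s" % (hits, client))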
[22:02:58] top three requestors are pdf1 2 and 3 [22:04:03] Feb 13 22:02:46 ms-fe1 proxy-server 208.80.152.181 208.80.152.95 13/Feb/2012/22/02/46 GET /v1/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/wikipedia-commons-local-thumb.43/4/43/VictorHurtado-LaSublevacion-PasoDelEstrecho.jpg/1199px-VictorHurtado-LaSublevacion-PasoDelEstrecho.jpg HTTP/1.0 404 - mwlib - - - - - - 24.9039 [22:04:06] where's the client IP? [22:04:26] ah, pdf3, sorry [22:04:29] yeah. [22:04:55] pdf1+2+3 are the top clients [22:05:49] not a large percentage though, the issue is probably something else [22:15:04] New patchset: Diederik; "Check if country codes are valid and finish renaming stuff" [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2559 [22:16:37] New patchset: Diederik; "Adding ip-filtering support (not working)" [analytics/udp-filters] (refactoring) - https://gerrit.wikimedia.org/r/2560 [22:16:55] so looking at the error log on ms5 at just before noon there are a bunch of these [22:17:39] 2012/02/13 11:57:44 [error] 12552#0: *745902046 connect() failed (110: Connection timed out) while connecting to upstream, client: 10.0.6.211, server: up [22:17:39] load.wikimedia.org, request: "GET (blah blah) HTTP/1.1", upstream: "fastcg [22:17:39] i://127.0.0.1:9000", host: "ms5.pmtpa.wmnet" [22:17:58] from 10.0.6.210 and 211 of course [22:20:52] http://paste.tstarling.com/p/nEpPhf.html [22:21:50] hmm [22:24:51] a lot of the convert processes on the image scalers have a destination size of 1199 [22:25:47] those are all coming from pdf1 [22:30:00] ok they've stopped now [22:30:03] I guess the job finished [22:31:56] huh [22:32:00] http://ganglia.wikimedia.org/latest/?r=20min&cs=&ce=&m=&c=Miscellaneous+pmtpa&h=ms5.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [22:32:02] looky there [22:32:30] yeah, the load also went from the image scalers [22:32:35] yup [22:33:10] woosters: ^^^ [22:33:13] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=^ms-fe[1-9]&mreg[]=swift_.*_404_avg>ype=line&title=Swift+average+query+response+time+-+404s&z=large&aggregate=1&r=day [22:34:31] pdf eh? can we throttle that traffic? [22:35:15] apergos: so was ms5 overloaded? [22:35:35] I can increase the concurrency on the image scalers but if ms5 was toast then that wouldn't help [22:35:47] well load was 9 ish I guess, and I could type on it and give commands without a problem [22:36:04] hmm [22:36:09] load might have been higher, [22:36:14] anyways it was responsive enough [22:36:29] IO utilisation was not 100% [22:36:30] if you do that, be very incremental about it [22:37:41] here's an alternate view into the number of reusets from the pdf servers to ms-fe2 vs. all 404s http://pastebin.com/7AYeEbb4 [22:38:25] quite a leap [22:38:36] from about 2% of all traffic to about 6%? [22:38:37] but how many were repeats after failrues [22:38:44] anyways it's significant [22:40:06] they are not the first failures I see in the error log however [22:40:48] apergos: so just +50%? [22:42:28] ok, I'm going to get food. [22:42:29] bbl. [22:43:34] New patchset: Tim Starling; "Increasing MaxClients slightly on the image scalers. They can probably handle a lot more than this without risking OOM, but it's not clear whether ms5 can handle the extra load." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2563 [22:44:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2563 [22:44:15] so you're going to 15? 
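The caution in that MaxClients commit message ("They can probably handle a lot more than this without risking OOM, but it's not clear whether ms5 can handle the extra load") is the usual scaler sizing trade-off: MaxClients bounds how many Apache/convert workers run at once, and that count times the worst-case resident size of a scaling job has to stay under the machine's RAM, while the NFS backend also has to survive the added concurrency. A back-of-the-envelope helper with purely illustrative numbers, not measurements from these hosts:

    def safe_max_clients(ram_gib, reserved_gib, worker_peak_mib):
        """Rough upper bound on concurrent scaler workers before risking OOM."""
        usable_mib = (ram_gib - reserved_gib) * 1024
        return int(usable_mib // worker_peak_mib)

    if __name__ == "__main__":
        # Illustrative figures only: a 16 GiB scaler, 2 GiB kept back for the
        # OS and page cache, 600 MiB peak per ImageMagick convert on a large
        # source image.  The other limit, what ms5 can absorb, isn't captured here.
        print(safe_max_clients(ram_gib=16, reserved_gib=2, worker_peak_mib=600))   # -> 23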
(sorry, had to go find the value) [22:44:17] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2563 [22:44:18] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2563 [22:44:44] yeah that seems like a reasonable boost [22:44:52] I am not going to be around to keep an eye on it though [22:45:00] as usual it is almost 1 am here [22:45:21] pity NFS doesn't give you any usable service time stats [22:45:32] nope, just timeout messages :-/ [22:46:48] New patchset: Pyoungmeister; "small fix and redoing one of my worst naming decisions ever" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2564 [22:47:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2564 [22:47:59] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2564 [22:47:59] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2564 [22:49:51] Feb 13 22:49:08 srv224 apache2[25924]: [error] server reached MaxClients setting, consider raising the MaxClients setting [22:49:55] that's annoying already [22:50:23] let's see if it settles down [22:51:41] the load on the image scalers went back up well before I deployed that change [22:52:05] and the sizes look like normal wikipedia sizes now, not collection extension sizes [22:52:14] normal sizes is better [22:52:44] sure, but toast is toast [22:53:07] I am not yet seeing nfs timeouts, but it's only been 3 minutes [22:53:16] (on the one host I'm camped out on) [22:53:42] look at this wchan survey: http://paste.tstarling.com/p/aolmMZ.html [22:53:57] that says ms5 is the slow part [22:54:31] but increasing concurrency on a disk-limited service tends to increase its efficiency [22:59:41] it looks like you could bump it up a little more [23:00:04] to 20? [23:00:09] uh huh [23:00:54] anyone here know ruby ? [23:01:02] New patchset: Tim Starling; "Increasing scaler MaxClients to 20" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2566 [23:01:26] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2566 [23:01:26] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2566 [23:01:44] * AaronSchulz thought twitter used ruby [23:02:11] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2566 [23:02:22] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2566 [23:02:23] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2566 [23:02:42] New patchset: Pyoungmeister; "ruby is one hell of a language..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2567 [23:03:05] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2567 [23:03:05] New review: gerrit2; "Lint check passed." 
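The "wchan survey" pasted above is a snapshot of which kernel function each worker is sleeping in (ps -o wchan, or /proc/<pid>/wchan on Linux); when most apache2/convert processes are parked in NFS or IO wait channels rather than running, the bottleneck is the file server, which is the "ms5 is the slow part" reading. A small sketch of the same survey, with a hypothetical process-name filter:

    import os
    from collections import Counter

    def wchan_survey(name_filter=None):
        """Tally the kernel wait channel of every (matching) process on this host."""
        waits = Counter()
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open("/proc/%s/comm" % pid) as f:
                    comm = f.read().strip()
                if name_filter and name_filter not in comm:
                    continue
                with open("/proc/%s/wchan" % pid) as f:
                    chan = f.read().strip()
                waits[chan if chan not in ("", "0") else "running/runnable"] += 1
            except OSError:        # the process exited between listdir and open
                continue
        return waits

    if __name__ == "__main__":
        # e.g. survey only the apache2 workers, as in the paste discussed above
        for chan, n in wchan_survey("apache2").most_common():
            print("%5d  %s" % (n, chan))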
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2567 [23:03:18] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2567 [23:03:19] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2567 [23:08:27] New patchset: Asher; "auth against cn" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2568 [23:08:37] Reedy: is it time to deploy to test2? [23:08:53] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2568 [23:08:53] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2568 [23:08:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2568 [23:09:27] robla: go and bask in the error 500 ;) [23:10:24] well, that's a problem [23:10:28] indeed [23:10:43] scapping with a wiki set to 1.19 is giving some strange undefined global erorrs [23:10:47] squid thumbnail requests: http://torrus.wikimedia.org/torrus/CDN?path=%2FSquids%2Fpmtpa%2Fupload%2Fsq41.wikimedia.org%2Fbackend%2FUsage%2FServer_side_requests [23:11:45] chrismcmahon: so...there's some testing I'm hoping you can help out with once we deploy to test2 [23:11:55] however, we need a deploy to test2 first :) [23:11:59] robla: been planning on it [23:12:14] ah right, and the hit rate also changed: http://torrus.wikimedia.org/torrus/CDN?path=%2FSquids%2Fpmtpa%2Fupload%2Fsq41.wikimedia.org%2Fbackend%2FPerformance%2FHit_ratios [23:12:25] the graph is misleading, we really need a miss rate graph rather than a hit rate graph [23:12:27] I sincerely hope that drop off continues to hold [23:12:57] yes we do [23:12:58] chrismcmahon: we'll talk about qa in #wikimedia-dev I guess. Just wanted to make sure you were here to keep up with test2 status [23:13:22] and it's hit maxclients [23:13:34] robla: http://wikitech.wikimedia.org/view/Software_deployments is in my bookmarks :) [23:13:42] 9 minutes after restart [23:15:22] and nfs timeout [23:15:28] *sigh* [23:18:03] I can't do this any longer, too late for me [23:18:38] good night [23:19:00] there's some await numbera around 10-11, nothing else too outrageous [23:19:12] good night [23:24:12] how do we decide whether to send a request to pmtpa or eqiad? [23:24:36] maybe maplebed knows? [23:24:55] sorry, which kind of request? [23:25:55] upload, text, bits [23:26:01] HTTP [23:26:03] upload always goes to pmtpa [23:26:44] my memory is less sure about text and bits, but the easiest way to find out is to ask dns. [23:26:51] oh, hang on. a different way of answering: [23:27:05] routing rules choose one or the other for a specific address, but we don't balance between them yet. [23:27:43] TimStarling: robla just pointed me at that squid graph showing a drastic spike in qps - do you know why a similar sized spike doesn't appear in either swift or ms5's qps graph? [23:28:17] that is one of several mysteries :) [23:28:22] heh... [23:29:09] another mystery is: why is sq49:3128 only doing 1 req/s: http://torrus.wikimedia.org/torrus/CDN?path=%2FSquids%2Fpmtpa%2Fupload%2Fsq49.wikimedia.org%2Fbackend%2FUsage%2FClient_requests [23:30:34] TimStarling: I'm assuming we're not planning to reenable Collections for the foreseeable future (at least until we hear back from one of the PediaPress folks), correct? 
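The complaint about the hit-ratio graph deserves spelling out: the backends only ever see the misses, so a hit ratio sliding from, say, 98% to 96% reads as a two-point dip on a hit-rate graph while it actually doubles the request rate falling through to swift and ms5. That is why a miss-rate (or absolute miss qps) graph is the more useful view. Illustrative arithmetic, with made-up traffic numbers:

    def backend_qps(client_qps, hit_ratio):
        """Requests per second that miss the cache and reach the backend."""
        return client_qps * (1.0 - hit_ratio)

    if __name__ == "__main__":
        client_qps = 10000                 # made-up front-side request rate
        for ratio in (0.98, 0.96):
            print("hit ratio %.0f%% -> %.0f qps to the backend"
                  % (ratio * 100, backend_qps(client_qps, ratio)))
        # 98% -> 200 qps, 96% -> 400 qps: a two-point dip doubles backend load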
[23:31:09] robla: not for a while [23:31:23] I don't know what the PediaPress folks would be able to tell us, if the issue is demand driven [23:31:33] maybe there are some logs that they could analyse, I guess [23:32:34] k...thanks. I'll pretend I'm guillom for a little bit [23:32:45] (and let non-techies know) [23:33:58] thanks [23:34:39] maplebed: cachemgr output on a frontend upload squid: http://paste.tstarling.com/p/AWGcbb.html [23:35:19] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2556 [23:35:31] TimStarling: that says there are a bunch of them "out of rotation". [23:35:35] doesn't it? [23:36:05] yes: sq44, 45, 48, 49 [23:36:06] it doesn't say why [23:36:12] they work just fine, I just tested one of them [23:36:52] you know what would be funny? [23:36:54] TimStarling: increase the carp weight of one to 30 and i bet it starts getting traffic [23:37:24] it would be funny if more than one of our squid hostnames hashes to the same thing with that terrible excuse for a hashing algorithm that CARP uses [23:37:49] TimStarling: I'm not sure they are working right. [23:38:40] anyway, probably unrelated to the issue at hand [23:38:41] it's some kind of CARP lameness [23:38:45] oh, nevermind. [23:38:50] I misread output. [23:39:00] they all look fine [23:39:14] Change restored: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2556 [23:39:21] New patchset: Lcarr; "Ensuring tcp setting tweaks are on all serving platforms" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2556 [23:39:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2556 [23:40:37] so the reason the spike on sq41 doesn't correspond to a spike on ms5 is because it was only a spike of 5 req/s [23:40:42] TimStarling: graphs of all the upload squids, collapsed to the same scale: http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=bytes_out&s=by+name&c=Upload+squids+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [23:40:52] and it was isolated to sq41 [23:41:03] 41-50, 86 are all serving half of the traffic of the rest. [23:41:14] on port 80? [23:41:22] no, total. [23:41:27] "bytes_out" [23:41:52] the LVS weights are 20 and 30 so that's roughly expected [23:42:06] http://paste.tstarling.com/p/XcOpjt.html [23:42:16] notah, so they are. [23:42:45] so that's unrelated to what you were pointing out about the backend hashing. [23:43:09] yes [23:43:16] it could be hashing [23:43:20] I've read the CARP spec [23:43:37] it says "because we want this to be really high performance, we choose a really efficient and simple hashing algorithm" [23:43:49] it's a shift and rotate thing [23:44:11] doesn't really look that efficient, the most efficient hashing algorithms would operate in 64 bit blocks, presumably [23:44:29] rotate and xor I mean [23:50:04] TimStarling: did you enter a bug for the x-forwarded-for header in swift? [23:50:51] no [23:50:59] a bug against swift? [23:51:12] didn't ma rk say CARP for us is not terribly efficient, since it makes the hashes based on host name [23:51:17] and our host names are very similar [23:51:39] TimStarling: well, against the SwiftMedia extension specifically. [23:51:42] of course they are similar [23:52:01] how can you have a large cluster of servers without having similar hostnames? 
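The "shift and rotate thing" being joked about is the CARP (Cache Array Routing Protocol) hash: each parent cache name is folded into a 32-bit value by rotate-and-add over its characters, the URL gets the same treatment, the two are combined with XOR plus a final mix, and the parent with the highest weighted score gets the request. The sketch below follows the shape of the algorithm as described in the CARP draft and Squid's implementation, written from memory, so treat the exact constants as illustrative; the point is simply that near-identical hostnames such as sq41 through sq50 feed almost the same bytes into a very simple mixer.

    MASK = 0xFFFFFFFF

    def rotl(x, n):
        """32-bit rotate left."""
        return ((x << n) | (x >> (32 - n))) & MASK

    def member_hash(name):
        """CARP-style hash of a parent cache's hostname (rotate-and-add)."""
        h = 0
        for ch in name.encode("ascii"):
            h = (rotl(h, 19) + ch) & MASK
        h = (h + h * 0x62531965) & MASK        # final mixing step per the draft
        return rotl(h, 21)

    def score(url, mhash):
        """Combine a URL hash with a member hash; the highest score wins."""
        u = 0
        for ch in url.encode("ascii"):
            u = (rotl(u, 19) + ch) & MASK
        combined = (u ^ mhash) & MASK
        combined = (combined + combined * 0x62531965) & MASK
        return rotl(combined, 21)

    if __name__ == "__main__":
        hosts = ("sq41.wikimedia.org", "sq42.wikimedia.org", "sq49.wikimedia.org")
        url = "http://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Example.jpg"
        for host in hosts:
            mh = member_hash(host)
            print("%-22s member=%08x  score=%08x" % (host, mh, score(url, mh)))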
[23:52:28] I'm not saying that shouldn't be ;) [23:52:58] getting back to my geo DNS question, for one thing, I was wondering if wikipedia-lb.wikimedia.org is meant to resolve to wikipedia-lb.eqiad.wikimedia.org. from within pmtpa itself [23:52:59] just that CARP apparently doesn't work as well as it should because of that. unless I was confused about what ma rk was talking about [23:53:08] e.g. fenari [23:53:14] yes [23:53:41] we don't use geo-dns for in-country [23:53:44] since it doesn't work [23:53:52] so we just send all US traffic to eqiad? [23:53:56] I think we are planning on using anycast [23:54:06] but that's not done yet? [23:54:09] right [23:54:16] so all US traffic goes to eqiad right now [23:54:34] so what goes to pmtpa? [23:54:49] whichever services haven't been switched to eqiad [23:54:56] at some point nothing [23:55:14] until we use anycast, then we'll send some small percentage of traffic there [23:55:25] upload still goes to pmtpa [23:55:31] for now, yeah [23:55:33] residents of the Tampa Bay area? ;) [23:55:38] heh [23:56:07] I guess labs for now as well. we don't even have a labs cluster up in eqiad yet [23:59:06] just browsing the interface counters in torrus, my knowledge of the network is a bit out of date [23:59:39] and it sometimes comes up, e.g. when end users have problems that I need to debug
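As for the geo-DNS question, the quickest check of whether wikipedia-lb.wikimedia.org resolves to the eqiad or the pmtpa service address from a given host (fenari, for instance) is to ask DNS from that host and compare against the per-datacenter names. In the sketch below the eqiad name is the one mentioned in the log and the pmtpa counterpart is assumed by analogy:

    import socket

    NAMES = (
        "wikipedia-lb.wikimedia.org",          # the name users actually hit
        "wikipedia-lb.eqiad.wikimedia.org",    # datacenter-specific service name
        "wikipedia-lb.pmtpa.wikimedia.org",    # assumed pmtpa counterpart
    )

    def resolve(name):
        """Return the A-record addresses this host's resolver hands back for name."""
        try:
            return sorted(set(socket.gethostbyname_ex(name)[2]))
        except socket.gaierror as err:
            return ["unresolvable: %s" % err]

    if __name__ == "__main__":
        for name in NAMES:
            print("%-38s %s" % (name, ", ".join(resolve(name))))
        # If the first name resolves to the eqiad addresses, this resolver is
        # being steered to eqiad; run the same check from fenari to compare.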