[12:01:23] New patchset: Mark Bergsma; "Cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2531 [12:01:48] New patchset: Mark Bergsma; "Move monitor_service to the service class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2532 [12:02:09] New patchset: Mark Bergsma; "Pull LVS service IPs from lvs.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2533 [12:02:30] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2531 [12:02:30] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2531 [12:02:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2532 [12:02:31] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2532 [12:02:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2532 [12:02:46] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2533 [12:02:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2533 [12:06:57] New patchset: Mark Bergsma; "Pull in LVS service IPs from lvs.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2534 [12:08:09] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2534 [12:08:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2534 [12:17:19] New patchset: Mark Bergsma; "Scoped variable lookups" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2535 [12:17:41] New patchset: Mark Bergsma; "Make varnish_xff_sources a class parameter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2536 [12:18:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2535 [12:18:01] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2535 [12:18:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2535 [12:18:34] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2536 [12:18:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2536 [12:24:44] New patchset: Mark Bergsma; "Use class parameters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2537 [12:25:15] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2537 [12:25:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2537 [12:55:10] New review: Demon; "Are we going to be doing python things in jenkins?" 
[integration/jenkins] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2289 [13:28:01] New patchset: Mark Bergsma; "Use an upstart manifest for managing varnish udp loggers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2538 [13:29:07] New patchset: Mark Bergsma; "Use an upstart manifest for managing varnish udp loggers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2538 [13:30:25] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2538 [13:30:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2538 [13:32:53] New patchset: Mark Bergsma; "Move out upstart_job dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2539 [13:33:43] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2539 [13:33:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2539 [13:38:30] New patchset: Mark Bergsma; "Using a service type causes conflicts, try exec instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2540 [13:39:02] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2540 [13:39:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2540 [13:40:26] New patchset: Mark Bergsma; "Fix dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2541 [13:40:58] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2541 [13:40:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2541 [13:48:17] New patchset: Mark Bergsma; "Don't start if already running" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2542 [13:49:16] New patchset: Mark Bergsma; "Don't start if already running" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2542 [13:49:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2542 [13:49:47] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2542 [13:49:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2542 [13:50:13] New patchset: Hashar; "gitreview file" [operations/software] (master) - https://gerrit.wikimedia.org/r/2543 [13:52:03] New patchset: Mark Bergsma; "su forks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2544 [13:52:34] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2544 [13:52:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2544 [14:00:58] New patchset: Mark Bergsma; "Make varnish_instance default undefined, never mind pid files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2545 [14:02:09] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2545 [14:02:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2545 [14:04:13] New review: Hashar; "I have looked at the jenkins gerrit plugin, patchsets are polled by jenkins so we do need a "patchse..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2495 [14:05:28] New review: Demon; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2543 [14:05:28] Change merged: Demon; [operations/software] (master) - https://gerrit.wikimedia.org/r/2543 [14:06:13] New patchset: Demon; "Chad's the best" [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2546 [14:13:17] Change abandoned: Demon; "Was just testing git-review" [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2546 [14:15:56] New patchset: Mark Bergsma; "Fix su invocation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2547 [14:16:26] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2547 [14:16:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2547 [14:19:54] New patchset: Mark Bergsma; "Fix status command" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2548 [14:20:26] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2548 [14:20:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2548 [14:22:17] New patchset: Mark Bergsma; "Specify path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2549 [14:22:49] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2549 [14:22:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2549 [14:26:34] New patchset: Mark Bergsma; "Remove status" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2550 [14:26:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2550 [14:27:06] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2550 [14:27:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2550 [14:31:07] New patchset: Hashar; "Force Jenkins request through HTTPS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2551 [14:45:22] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2551 [15:13:34] New patchset: Mark Bergsma; "Upstart cookbook recommends using start-stop-daemon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2552 [15:14:12] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2552 [15:14:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2552 [15:25:35] New patchset: Mark Bergsma; "With start-stop-daemon, varnishncsa needs to fork" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2553 [15:26:16] New patchset: Mark Bergsma; "With start-stop-daemon, varnishncsa needs to fork" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2553 [15:26:53] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2553 [15:26:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2553 [15:31:23] New patchset: Mark Bergsma; "Try expect fork" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2554 [15:32:09] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2554 [15:32:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2554 [15:52:50] New patchset: Mark Bergsma; "Comment for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2555 [15:53:26] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2555 [15:53:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2555 [16:28:01] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2551 [16:28:02] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2551 [16:32:04] New review: Dzahn; "confirmed. jenkins on http://integration.mediawiki.org/ci/ now forces HTTPS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2551 [19:14:14] RobH: how'd the move go saturday ? [19:14:49] move? [19:15:03] db9 move ? [19:15:07] ah [19:15:17] db9...went good [19:15:26] no hiccups everything came back up normally [19:15:50] yay :) [19:16:18] the only thing learned was mysql needs to be started seperate in safe mode [19:17:04] it's the fb version, which is in /usr/local/something [19:20:05] interesting [19:20:14] so it doesn't startup on load ? [19:20:34] /etc/init.d/whatever must point to something else [19:20:38] I have to look at that again [19:20:53] but yeah we always wind up starting the fb one from the command line [19:27:06] anyone remember the equivalent of "free" on os x ? [19:27:29] nope [19:43:54] LeslieCarr: by 'os x' do you mean 'objective C'? Or do you mean some other kind of 'free'? 
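The run of puppet patchsets earlier in this log ("su forks", "With start-stop-daemon, varnishncsa needs to fork", "Try expect fork") is all about how many times the logger process forks before Upstart should consider it started: Upstart's "expect fork" stanza tracks one fork, "expect daemon" tracks two, and an extra fork introduced by su throws the count off, which is why the job kept needing adjustment. For reference, a classic Unix daemon performs exactly two forks; the sketch below is a generic illustration of that pattern, not the varnishncsa or upstart code from these changes.

    import os
    import sys
    import time

    def daemonize():
        """Classic double-fork daemonization: two forks in total, which is what
        init systems mean by 'expect daemon' (as opposed to 'expect fork')."""
        if os.fork() > 0:       # first fork: the original process exits
            sys.exit(0)
        os.setsid()             # become session leader, detach from the tty
        if os.fork() > 0:       # second fork: session leader exits so the
            sys.exit(0)         # daemon can never reacquire a controlling tty
        os.chdir("/")
        os.umask(0o022)
        devnull = os.open(os.devnull, os.O_RDWR)
        for fd in (0, 1, 2):    # detach stdio from the old terminal
            os.dup2(devnull, fd)

    if __name__ == "__main__":
        daemonize()
        with open("/tmp/demo-daemon.log", "a") as log:
            log.write("daemon pid %d started\n" % os.getpid())
        time.sleep(60)          # stands in for the real worker loop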
[19:44:18] andrewbogott[gon: i mean as in "how much memory" and os x i mean apple's os called os x [19:44:49] In this case I am no help. [19:45:50] (Coding on OSX is mostly done in ObjC, and in ObjC you 'release' things that you 'free' in C. So my question wasn't a total nonsequitor.) [19:50:31] so, i'm thinking of switching over a class of machines to use initcwnd 10 [20:01:09] New review: Hashar; "Thanks for the merge. Jenkins has an other issue though, its cookie is not secure so browsing to ht..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2551 [20:21:35] anyone around from ops who can approve a moderation request on one of the mobile lists? [20:21:39] i don't have the password for it [20:21:44] tfinc: sure. [20:21:55] list address? [20:21:56] mobile-feedback [20:22:13] i'd like to get yuvi's request to be on that list approved [20:22:24] I can't find a list by that name. [20:22:31] New patchset: Lcarr; "Ensuring tcp setting tweaks are on all serving platforms" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2556 [20:22:43] https://lists.wikimedia.org/mailman/listinfo/mobile-feedback [20:22:44] maplebed: its not a public list [20:22:51] doesn't matter - the direct URL should still work. [20:23:11] maplebed: https://lists.wikimedia.org/mailman/admindb/mobile-feedback-l [20:23:15] -l :D [20:23:20] that'd do it. [20:23:55] that list should have never had a -l [20:24:17] tfinc: doesn't yuvi have an @wikimedia address? [20:24:38] he keeps all of his mailing lists on it [20:24:49] damn, there's a lot of moderated messages... [20:25:02] maplebed: the previous maintainer never di it [20:25:08] i'd love to clean it up [20:26:42] ok, all set. [20:27:08] woot! [20:27:12] thanks maplebed [20:31:38] New patchset: Pyoungmeister; "making new lucene classes and configs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2557 [20:35:16] New patchset: Pyoungmeister; "making new lucene classes and configs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2557 [20:43:59] apergos (or anyone really) any ideas on what would cause http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=&c=Miscellaneous+pmtpa&h=ms5.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS to happen? [20:44:39] my guess is that whatever it is is the root couse behind http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&hreg[]=^ms-fe[1-9]&mreg[]=swift_.*_404_avg>ype=line&title=Swift+average+query+response+time+-+404s&aggregate=1 [20:44:51] and http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=load_one&s=by+name&c=Image+scalers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [20:46:07] holy crap [20:46:12] no I don't off the top of my head [20:46:13] hmm [20:46:36] it's interesting how it's network without a lot of other activity [20:48:02] I dunno if that's true [20:48:17] oh wait, I take that back--I could have sworn I saw a disk utilization graph but now I dont [20:48:41] !log rebooting brewster [20:48:43] Logged the message, Master [20:49:10] dear microsoft solution, please work for me. my poor jetlagged brain is worthless todat. [20:49:13] *today [20:50:33] what do the scalers show, anything interesting? [20:50:55] I haven't looked at one closely yet. [20:51:16] both ms5 and swift show no significant change in the number of requests to the scalers, though their network use has also changed. [20:51:31] maybe someone's asking for scaling of a bunch of gargantuan images or something... [20:51:37] well, that didn't solve the problem. 
not that I thought it would :( [20:56:28] cpu_system is high [21:01:10] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=&s=by+name&c=Image+scalers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [21:01:12] heh [21:01:26] sadly there aren't a lot of connections coming out of ms5 except to ms-fe{1,2} and rendering.svc.pmtpa.wmnet [21:01:47] AaronSchulz: I'm inclined to believe that's a symptom, not a caues. [21:01:49] cause. [21:04:55] ms5 is running an *ancient* version of php [21:05:44] also . . . "61 updates are security updates." [21:05:58] probably not related but that makes me sad [21:06:39] I'm looking for something that went boom at 12:00UTC [21:06:48] yeah [21:06:50] New patchset: Demon; "Adding .gitreview" [mediawiki/extensions/Translate] (master) - https://gerrit.wikimedia.org/r/2558 [21:07:20] New review: Demon; "(no comment)" [mediawiki/extensions/Translate] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2558 [21:07:20] Change merged: Demon; [mediawiki/extensions/Translate] (master) - https://gerrit.wikimedia.org/r/2558 [21:07:52] do we have graphs of squid/varnish cache efficiency somewhere? this isn't an area I've explored yet [21:08:02] yeah the time is suspicious [21:10:03] oh, I guess i did ask you here earlier. [21:10:34] ww [21:12:18] * maplebed switches channels this time. [21:12:22] AaronSchulz: I don't know how to read http://noc.wikimedia.org/cgi-bin/ng/report.py?db=thumb [21:12:31] what is it telling us? [21:12:50] (and do you know what the unit is for the real/c column?) [21:13:46] real/c is cpu wall time spent in a MW function divided by the number of calls to the function [21:14:02] and what is ddjvu? [21:14:03] well, the term cpu is misleading there then [21:14:06] I am not seeing it. I looked at atop for before and after [21:14:12] it could just be anything being slow, like disk [21:14:17] ok so tcpdump is busier and count_gets.py is busier [21:14:29] not a lot of info there [21:14:50] the tcpdump and count_gets has been going on for weeks now, so I have a hard time blaming it. [21:15:08] (though it was spiking up in the CPU graph so I kicked it in case there was something wonky with it - no effect) [21:15:16] maplebed: hmm, media/DjVu.php it seems [21:15:36] * AaronSchulz isn't super familiar with the transform handler for that type of file [21:16:08] well they were only busier because there's more gets [21:16:13] = more traffic [21:16:15] that's how I see it [21:16:27] except that the number of gets has remained relatively normal (according to the graph) [21:16:36] although [21:16:36] and it's confirmed by the number of 404s on swift's graph. [21:16:50] odd that 60% of thumb.php time is for DjVu, but maybe it's normal, grrr...I with that page had historical numbers [21:17:20] tcpdump is using a lot les s now for whatever reason [21:17:37] !log copied a resolv.conf to brewster, apt-get upgrade on brewster and restarted lighttpd and squid on brewster [21:17:40] Logged the message, Mistress of the network gear. [21:18:07] I do find it odd that the amount of network traffic jumped without the number of requests jumping; makes me thing someone's doing something over NFS that has nothing to do with image scaling and the results of that something else are breaking image scaling. [21:18:30] and that sorta goes with the high cpu system but not high cpu user [21:18:35] * maplebed grabs a tcpdump caputure to analyze network endpoint statistics. [21:19:18] rsync of ms5 someplace? 
[21:19:23] nah [21:19:25] we don't do ms5 [21:19:27] hmmm [21:22:39] Reedy: looks like you already ran checkoutMediaWiki? [21:23:00] Aye [21:23:15] Pretty much the only thing to do is set test2 to 1.19 and sync it [21:23:23] and the xff, interwiki stuff too? [21:23:30] all copied [21:23:34] nice [21:23:35] tcpdump doesn't tell me anything beyond that ms5 talks to ms-fe*, rendering, and the image scalers. [21:23:42] yeah, same with netstat [21:23:42] maplebed: there are a fair amount of nfs timeouts starting at 12 today [21:23:47] in logs on srv219 [21:23:51] I was looking for those and not seeing them [21:23:56] ah I was looking at ms5 [21:23:57] bah [21:24:03] srv219:/var/log# grep "timed out" messages|awk '{print $2 " " $3}'|awk -F: '{print $1}'|sort|uniq -c [21:24:23] course they will only be on the clients :-/ [21:24:57] i vote we run a dist-upgrade on ms5 and reboot it [21:25:04] yeah, also true on srv221, but is that just another symptom? [21:25:10] no clue [21:25:52] showmount: so not informative. yeah those clients all have it mounted :-/ [21:28:15] !log reloading brewster [21:28:18] Logged the message, Mistress of the network gear. [21:28:36] yes, it's a symptom all right [21:28:42] but the overload is caused by..? [21:32:10] so when I look at the weekly for ms5 [21:32:15] it's a little more believable [21:32:20] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=&c=Miscellaneous+pmtpa&h=ms5.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [21:32:59] the little leap in max disk space used at the end is ... interesting [21:33:36] so yeah I guess it is an increase in scaled images [21:33:48] which may come from an increase in uploads [21:34:04] apergos: you may be on to something there. [21:34:09] time frame matches [21:34:11] so [21:34:21] first, find out if there's any bots or batched uploads going [21:34:40] that graph tangent does jump suspiciously close to noon. [21:34:54] yeah, it's pretty clear on the daily graph [21:35:18] I would think, though, that swift's disk usage would mirrror it... [21:35:29] it should [21:35:30] oh wait, we're not measuring disk usage, only object count. [21:35:33] hahaha [21:35:35] boooo [21:35:37] * maplebed goes to look. [21:35:54] if someone uploaded a batch of really big stuff, it might have this effect... [21:36:01] well [21:36:12] if someone uploaded some large things (nara images for example) [21:36:26] time to go look at locke [21:36:49] what will locke tell you? [21:37:55] * maplebed starts graphing bytes in addition to objects [21:38:04] that's where sampled 1000 is [21:40:15] not seeing anything obvious there [21:42:14] wouldn't a lot of resizing show up as user CPU? [21:42:20] somewhere? [21:42:40] yeah, it should. [21:42:56] the two CPU jumps I've seen are 'system' on ms5 and wio on srv* [21:45:09] ok, I've started counting bytes on swift in ganglia, though it's too late to look and see what happened at noon. [21:45:16] I'm asking in the commons channel about batch uploads [21:50:31] is the number of scaler requests higher than usual? [21:50:57] of course if it is that can just be people retrying after failure... now... but maybe not earlier [21:51:03] <^demon> apergos: I think I got the rules correct to migrate your dumps stuff to git :) [21:51:06] <^demon> I was having trouble friday [21:51:21] you mean and leave out the directory I deleted? [21:51:27] cool [21:51:41] apergos: the number of scaler requests doesn't look like it changed dramatically. 
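The grep/awk one-liner above buckets the NFS "timed out" messages from the client's /var/log/messages by day and hour, so the onset of the timeouts can be lined up against the ganglia graphs. A rough Python equivalent of the same survey (path and timestamp format assumed to be standard syslog, as on srv219):

    import re
    import sys
    from collections import Counter

    # Same idea as:
    #   grep "timed out" messages | awk '{print $2" "$3}' | awk -F: '{print $1}' | sort | uniq -c
    SYSLOG_TS = re.compile(r"^(\w{3})\s+(\d+)\s+(\d{2}):")   # e.g. "Feb 13 21:23:..."

    def count_timeouts(path="/var/log/messages"):
        """Count NFS 'timed out' kernel messages per day and hour."""
        buckets = Counter()
        with open(path, errors="replace") as log:
            for line in log:
                if "timed out" not in line:
                    continue
                m = SYSLOG_TS.match(line)
                if m:
                    month, day, hour = m.groups()
                    buckets["%s %s %s:00" % (month, day, hour)] += 1
        return buckets

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
        counts = count_timeouts(path)
        for bucket in sorted(counts):
            print("%6d  %s" % (counts[bucket], bucket))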
[21:52:02] (ms5 graphs the number of requests it sends back to the scalers) [21:52:15] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=load_one&s=by+name&c=Image+scalers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [21:52:47] <^demon> apergos: No, that directory didn't bother me. I was getting bit by this annoying bug in svn2git [21:52:49] I agree that looks annoying but I can't tell you what happened there, other than I don't think it's the number of requests [21:52:55] (it may well be the type of request? dunno.) [21:53:04] hi TimStarling ! [21:53:10] hi [21:53:25] I bet you'll know exactly what's going on with that link apergos just dropped in here. [21:53:26] :D [21:53:46] ^demon: I admit I'm curious about the bug [21:53:53] (symptom we're trying to fix - image scaling is taking ~30s instead of the normal ~300ms) [21:54:03] looks busy [21:54:18] <^demon> apergos: svn2git is very picky about how you write rules for path matching. If you leave the trailing / off a rule, it explodes with a totally unhelpful error. [21:54:42] how did you pick me for leaving off the trailing / ? [21:54:57] yeah. dunno why though. the number of requests going back to the scalers doesn't look like it's really changed dramatically: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=ms5.pmtpa.wmnet&v=%2037&m=Image%20scale%20requests%20per%20second&r=day&z=default&jr=&js=&st=1329169953&vl=qps&z=large [21:55:02] <^demon> Well I was hitting it with some other repos too when trying to export the release tags. [21:55:12] <^demon> But I happened to narrow down the problem when working on your repo :) [21:55:36] I'll be happy to start using git [21:56:03] oh dear [21:56:13] it looks like Swift isn't appending to the X-Forwarded-For header [21:56:54] I wonder if it is forwarding the User-Agent header, it looks fake at srv219 [21:57:10] <^demon> apergos: Well if this conversion I'm running right now works then you can switch today :) [21:57:11] can swift be quickly patched to send XFF? [21:57:33] TimStarling: not sure, but it will probably be annoying. [21:57:36] ^demon: that will mean tomorrow for me (I'm going to bed soon) [21:57:37] but might not be. [21:57:43] :P [21:58:11] TimStarling: which log (or thingy) were you looking at? [21:58:16] tcpdump [21:58:19] TimStarling: for us noobs, what negative effect does not passing on XFF have ? [21:58:23] ok. [21:58:33] LeslieCarr: it means you can't look at the client IPs with tcpdump [21:58:55] ah but nothing specific as far as "causes crazy crash" ? :) [21:59:00] no [21:59:40] the user agent header is probably fake, unless swift is sending "Mozilla/5.0" as the user agent for some unknown reason [22:00:17] TimStarling: those addresses are available in swift's logs. [22:00:25] (even if it's not passincg them on) [22:00:43] where? [22:00:50] * maplebed does a quick grep | cut to see if there's any pattern [22:00:56] I am on ms-fe1 looking for them [22:00:58] /var/log/syslog on the swift proxies (ms-fe1 and msn-fe2) [22:01:24] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2557 [22:01:25] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2557 [22:01:44] is there any indication here of whether a request was forwarded to the backend or not? [22:01:58] 404s fall through to ms5 [22:02:13] so a grep 'GET.* 404 ' should catch them. 
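Because 404s fall through to ms5 and the scalers, the 404 lines in the swift proxy's /var/log/syslog are a usable stand-in for backend load, and grouping them by client shows who is generating it; a sample log line and the result (the pdf hosts as top requestors) follow below. A sketch of that grep-and-count, assuming the client address is the field right after the "proxy-server" tag, so adjust the index if the log format differs:

    import sys
    from collections import Counter

    def top_404_clients(path="/var/log/syslog", limit=10):
        """Count GET requests that returned 404, per client address."""
        clients = Counter()
        with open(path, errors="replace") as log:
            for line in log:
                if "proxy-server" not in line or " GET " not in line:
                    continue
                fields = line.split()
                try:
                    status = fields[fields.index("GET") + 3]           # METHOD PATH PROTO STATUS
                    client = fields[fields.index("proxy-server") + 1]  # assumed client-address field
                except (ValueError, IndexError):
                    continue
                if status == "404":
                    clients[client] += 1
        return clients.most_common(limit)

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/syslog"
        for client, hits in top_404_clients(path):
            print("%8d  %s" % (hits, client))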
[22:02:58] top three requestors are pdf1 2 and 3 [22:04:03] Feb 13 22:02:46 ms-fe1 proxy-server 208.80.152.181 208.80.152.95 13/Feb/2012/22/02/46 GET /v1/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/wikipedia-commons-local-thumb.43/4/43/VictorHurtado-LaSublevacion-PasoDelEstrecho.jpg/1199px-VictorHurtado-LaSublevacion-PasoDelEstrecho.jpg HTTP/1.0 404 - mwlib - - - - - - 24.9039 [22:04:06] where's the client IP? [22:04:26] ah, pdf3, sorry [22:04:29] yeah. [22:04:55] pdf1+2+3 are the top clients [22:05:49] not a large percentage though, the issue is probably something else [22:15:04] New patchset: Diederik; "Check if country codes are valid and finish renaming stuff" [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2559 [22:16:37] New patchset: Diederik; "Adding ip-filtering support (not working)" [analytics/udp-filters] (refactoring) - https://gerrit.wikimedia.org/r/2560 [22:16:55] so looking at the error log on ms5 at just before noon there are a bunch of these [22:17:39] 2012/02/13 11:57:44 [error] 12552#0: *745902046 connect() failed (110: Connection timed out) while connecting to upstream, client: 10.0.6.211, server: up [22:17:39] load.wikimedia.org, request: "GET (blah blah) HTTP/1.1", upstream: "fastcg [22:17:39] i://127.0.0.1:9000", host: "ms5.pmtpa.wmnet" [22:17:58] from 10.0.6.210 and 211 of course [22:20:52] http://paste.tstarling.com/p/nEpPhf.html [22:21:50] hmm [22:24:51] a lot of the convert processes on the image scalers have a destination size of 1199 [22:25:47] those are all coming from pdf1 [22:30:00] ok they've stopped now [22:30:03] I guess the job finished [22:31:56] huh [22:32:00] http://ganglia.wikimedia.org/latest/?r=20min&cs=&ce=&m=&c=Miscellaneous+pmtpa&h=ms5.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [22:32:02] looky there [22:32:30] yeah, the load also went from the image scalers [22:32:35] yup [22:33:10] woosters: ^^^ [22:33:13] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=^ms-fe[1-9]&mreg[]=swift_.*_404_avg>ype=line&title=Swift+average+query+response+time+-+404s&z=large&aggregate=1&r=day [22:34:31] pdf eh? can we throttle that traffic? [22:35:15] apergos: so was ms5 overloaded? [22:35:35] I can increase the concurrency on the image scalers but if ms5 was toast then that wouldn't help [22:35:47] well load was 9 ish I guess, and I could type on it and give commands without a problem [22:36:04] hmm [22:36:09] load might have been higher, [22:36:14] anyways it was responsive enough [22:36:29] IO utilisation was not 100% [22:36:30] if you do that, be very incremental about it [22:37:41] here's an alternate view into the number of reusets from the pdf servers to ms-fe2 vs. all 404s http://pastebin.com/7AYeEbb4 [22:38:25] quite a leap [22:38:36] from about 2% of all traffic to about 6%? [22:38:37] but how many were repeats after failrues [22:38:44] anyways it's significant [22:40:06] they are not the first failures I see in the error log however [22:40:48] apergos: so just +50%? [22:42:28] ok, I'm going to get food. [22:42:29] bbl. [22:43:34] New patchset: Tim Starling; "Increasing MaxClients slightly on the image scalers. They can probably handle a lot more than this without risking OOM, but it's not clear whether ms5 can handle the extra load." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2563 [22:44:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2563 [22:44:15] so you're going to 15? 
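The caution in that MaxClients commit message ("They can probably handle a lot more than this without risking OOM, but it's not clear whether ms5 can handle the extra load") is the usual scaler sizing trade-off: MaxClients bounds how many Apache/convert workers run at once, and that count times the worst-case resident size of a scaling job has to stay under the machine's RAM, while the NFS backend also has to survive the added concurrency. A back-of-the-envelope helper with purely illustrative numbers, not measurements from these hosts:

    def safe_max_clients(ram_gib, reserved_gib, worker_peak_mib):
        """Rough upper bound on concurrent scaler workers before risking OOM."""
        usable_mib = (ram_gib - reserved_gib) * 1024
        return int(usable_mib // worker_peak_mib)

    if __name__ == "__main__":
        # Illustrative figures only: a 16 GiB scaler, 2 GiB kept back for the
        # OS and page cache, 600 MiB peak per ImageMagick convert on a large
        # source image.  The other limit, what ms5 can absorb, isn't captured here.
        print(safe_max_clients(ram_gib=16, reserved_gib=2, worker_peak_mib=600))   # -> 23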
(sorry, had to go find the value) [22:44:17] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2563 [22:44:18] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2563 [22:44:44] yeah that seems like a reasonable boost [22:44:52] I am not going to be around to keep an eye on it though [22:45:00] as usual it is almost 1 am here [22:45:21] pity NFS doesn't give you any usable service time stats [22:45:32] nope, just timeout messages :-/ [22:46:48] New patchset: Pyoungmeister; "small fix and redoing one of my worst naming decisions ever" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2564 [22:47:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2564 [22:47:59] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2564 [22:47:59] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2564 [22:49:51] Feb 13 22:49:08 srv224 apache2[25924]: [error] server reached MaxClients setting, consider raising the MaxClients setting [22:49:55] that's annoying already [22:50:23] let's see if it settles down [22:51:41] the load on the image scalers went back up well before I deployed that change [22:52:05] and the sizes look like normal wikipedia sizes now, not collection extension sizes [22:52:14] normal sizes is better [22:52:44] sure, but toast is toast [22:53:07] I am not yet seeing nfs timeouts, but it's only been 3 minutes [22:53:16] (on the one host I'm camped out on) [22:53:42] look at this wchan survey: http://paste.tstarling.com/p/aolmMZ.html [22:53:57] that says ms5 is the slow part [22:54:31] but increasing concurrency on a disk-limited service tends to increase its efficiency [22:59:41] it looks like you could bump it up a little more [23:00:04] to 20? [23:00:09] uh huh [23:00:54] anyone here know ruby ? [23:01:02] New patchset: Tim Starling; "Increasing scaler MaxClients to 20" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2566 [23:01:26] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2566 [23:01:26] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2566 [23:01:44] * AaronSchulz thought twitter used ruby [23:02:11] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2566 [23:02:22] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2566 [23:02:23] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2566 [23:02:42] New patchset: Pyoungmeister; "ruby is one hell of a language..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2567 [23:03:05] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2567 [23:03:05] New review: gerrit2; "Lint check passed." 
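The "wchan survey" pasted above is a snapshot of which kernel function each worker is sleeping in (ps -o wchan, or /proc/<pid>/wchan on Linux); when most apache2/convert processes are parked in NFS or IO wait channels rather than running, the bottleneck is the file server, which is the "ms5 is the slow part" reading. A small sketch of the same survey, with a hypothetical process-name filter:

    import os
    from collections import Counter

    def wchan_survey(name_filter=None):
        """Tally the kernel wait channel of every (matching) process on this host."""
        waits = Counter()
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open("/proc/%s/comm" % pid) as f:
                    comm = f.read().strip()
                if name_filter and name_filter not in comm:
                    continue
                with open("/proc/%s/wchan" % pid) as f:
                    chan = f.read().strip()
                waits[chan if chan not in ("", "0") else "running/runnable"] += 1
            except OSError:        # the process exited between listdir and open
                continue
        return waits

    if __name__ == "__main__":
        # e.g. survey only the apache2 workers, as in the paste discussed above
        for chan, n in wchan_survey("apache2").most_common():
            print("%5d  %s" % (n, chan))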
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2567 [23:03:18] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2567 [23:03:19] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2567 [23:08:27] New patchset: Asher; "auth against cn" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2568 [23:08:37] Reedy: is it time to deploy to test2? [23:08:53] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2568 [23:08:53] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2568 [23:08:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2568 [23:09:27] robla: go and bask in the error 500 ;) [23:10:24] well, that's a problem [23:10:28] indeed [23:10:43] scapping with a wiki set to 1.19 is giving some strange undefined global erorrs [23:10:47] squid thumbnail requests: http://torrus.wikimedia.org/torrus/CDN?path=%2FSquids%2Fpmtpa%2Fupload%2Fsq41.wikimedia.org%2Fbackend%2FUsage%2FServer_side_requests [23:11:45] chrismcmahon: so...there's some testing I'm hoping you can help out with once we deploy to test2 [23:11:55] however, we need a deploy to test2 first :) [23:11:59] robla: been planning on it [23:12:14] ah right, and the hit rate also changed: http://torrus.wikimedia.org/torrus/CDN?path=%2FSquids%2Fpmtpa%2Fupload%2Fsq41.wikimedia.org%2Fbackend%2FPerformance%2FHit_ratios [23:12:25] the graph is misleading, we really need a miss rate graph rather than a hit rate graph [23:12:27] I sincerely hope that drop off continues to hold [23:12:57] yes we do [23:12:58] chrismcmahon: we'll talk about qa in #wikimedia-dev I guess. Just wanted to make sure you were here to keep up with test2 status [23:13:22] and it's hit maxclients [23:13:34] robla: http://wikitech.wikimedia.org/view/Software_deployments is in my bookmarks :) [23:13:42] 9 minutes after restart [23:15:22] and nfs timeout [23:15:28] *sigh* [23:18:03] I can't do this any longer, too late for me [23:18:38] good night [23:19:00] there's some await numbera around 10-11, nothing else too outrageous [23:19:12] good night [23:24:12] how do we decide whether to send a request to pmtpa or eqiad? [23:24:36] maybe maplebed knows? [23:24:55] sorry, which kind of request? [23:25:55] upload, text, bits [23:26:01] HTTP [23:26:03] upload always goes to pmtpa [23:26:44] my memory is less sure about text and bits, but the easiest way to find out is to ask dns. [23:26:51] oh, hang on. a different way of answering: [23:27:05] routing rules choose one or the other for a specific address, but we don't balance between them yet. [23:27:43] TimStarling: robla just pointed me at that squid graph showing a drastic spike in qps - do you know why a similar sized spike doesn't appear in either swift or ms5's qps graph? [23:28:17] that is one of several mysteries :) [23:28:22] heh... [23:29:09] another mystery is: why is sq49:3128 only doing 1 req/s: http://torrus.wikimedia.org/torrus/CDN?path=%2FSquids%2Fpmtpa%2Fupload%2Fsq49.wikimedia.org%2Fbackend%2FUsage%2FClient_requests [23:30:34] TimStarling: I'm assuming we're not planning to reenable Collections for the foreseeable future (at least until we hear back from one of the PediaPress folks), correct? 
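The complaint about the hit-ratio graph deserves spelling out: the backends only ever see the misses, so a hit ratio sliding from, say, 98% to 96% reads as a two-point dip on a hit-rate graph while it actually doubles the request rate falling through to swift and ms5. That is why a miss-rate (or absolute miss qps) graph is the more useful view. Illustrative arithmetic, with made-up traffic numbers:

    def backend_qps(client_qps, hit_ratio):
        """Requests per second that miss the cache and reach the backend."""
        return client_qps * (1.0 - hit_ratio)

    if __name__ == "__main__":
        client_qps = 10000                 # made-up front-side request rate
        for ratio in (0.98, 0.96):
            print("hit ratio %.0f%% -> %.0f qps to the backend"
                  % (ratio * 100, backend_qps(client_qps, ratio)))
        # 98% -> 200 qps, 96% -> 400 qps: a two-point dip doubles backend load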
[23:31:09] robla: not for a while [23:31:23] I don't know what the PediaPress folks would be able to tell us, if the issue is demand driven [23:31:33] maybe there are some logs that they could analyse, I guess [23:32:34] k...thanks. I'll pretend I'm guillom for a little bit [23:32:45] (and let non-techies know) [23:33:58] thanks [23:34:39] maplebed: cachemgr output on a frontend upload squid: http://paste.tstarling.com/p/AWGcbb.html [23:35:19] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2556 [23:35:31] TimStarling: that says there are a bunch of them "out of rotation". [23:35:35] doesn't it? [23:36:05] yes: sq44, 45, 48, 49 [23:36:06] it doesn't say why [23:36:12] they work just fine, I just tested one of them [23:36:52] you know what would be funny? [23:36:54] TimStarling: increase the carp weight of one to 30 and i bet it starts getting traffic [23:37:24] it would be funny if more than one of our squid hostnames hashes to the same thing with that terrible excuse for a hashing algorithm that CARP uses [23:37:49] TimStarling: I'm not sure they are working right. [23:38:40] anyway, probably unrelated to the issue at hand [23:38:41] it's some kind of CARP lameness [23:38:45] oh, nevermind. [23:38:50] I misread output. [23:39:00] they all look fine [23:39:14] Change restored: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2556 [23:39:21] New patchset: Lcarr; "Ensuring tcp setting tweaks are on all serving platforms" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2556 [23:39:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2556 [23:40:37] so the reason the spike on sq41 doesn't correspond to a spike on ms5 is because it was only a spike of 5 req/s [23:40:42] TimStarling: graphs of all the upload squids, collapsed to the same scale: http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=bytes_out&s=by+name&c=Upload+squids+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [23:40:52] and it was isolated to sq41 [23:41:03] 41-50, 86 are all serving half of the traffic of the rest. [23:41:14] on port 80? [23:41:22] no, total. [23:41:27] "bytes_out" [23:41:52] the LVS weights are 20 and 30 so that's roughly expected [23:42:06] http://paste.tstarling.com/p/XcOpjt.html [23:42:16] notah, so they are. [23:42:45] so that's unrelated to what you were pointing out about the backend hashing. [23:43:09] yes [23:43:16] it could be hashing [23:43:20] I've read the CARP spec [23:43:37] it says "because we want this to be really high performance, we choose a really efficient and simple hashing algorithm" [23:43:49] it's a shift and rotate thing [23:44:11] doesn't really look that efficient, the most efficient hashing algorithms would operate in 64 bit blocks, presumably [23:44:29] rotate and xor I mean [23:50:04] TimStarling: did you enter a bug for the x-forwarded-for header in swift? [23:50:51] no [23:50:59] a bug against swift? [23:51:12] didn't ma rk say CARP for us is not terribly efficient, since it makes the hashes based on host name [23:51:17] and our host names are very similar [23:51:39] TimStarling: well, against the SwiftMedia extension specifically. [23:51:42] of course they are similar [23:52:01] how can you have a large cluster of servers without having similar hostnames? 
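The "shift and rotate thing" being joked about is the CARP (Cache Array Routing Protocol) hash: each parent cache name is folded into a 32-bit value by rotate-and-add over its characters, the URL gets the same treatment, the two are combined with XOR plus a final mix, and the parent with the highest weighted score gets the request. The sketch below follows the shape of the algorithm as described in the CARP draft and Squid's implementation, written from memory, so treat the exact constants as illustrative; the point is simply that near-identical hostnames such as sq41 through sq50 feed almost the same bytes into a very simple mixer.

    MASK = 0xFFFFFFFF

    def rotl(x, n):
        """32-bit rotate left."""
        return ((x << n) | (x >> (32 - n))) & MASK

    def member_hash(name):
        """CARP-style hash of a parent cache's hostname (rotate-and-add)."""
        h = 0
        for ch in name.encode("ascii"):
            h = (rotl(h, 19) + ch) & MASK
        h = (h + h * 0x62531965) & MASK        # final mixing step per the draft
        return rotl(h, 21)

    def score(url, mhash):
        """Combine a URL hash with a member hash; the highest score wins."""
        u = 0
        for ch in url.encode("ascii"):
            u = (rotl(u, 19) + ch) & MASK
        combined = (u ^ mhash) & MASK
        combined = (combined + combined * 0x62531965) & MASK
        return rotl(combined, 21)

    if __name__ == "__main__":
        hosts = ("sq41.wikimedia.org", "sq42.wikimedia.org", "sq49.wikimedia.org")
        url = "http://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Example.jpg"
        for host in hosts:
            mh = member_hash(host)
            print("%-22s member=%08x  score=%08x" % (host, mh, score(url, mh)))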
[23:52:28] I'm not saying that shouldn't be ;) [23:52:58] getting back to my geo DNS question, for one thing, I was wondering if wikipedia-lb.wikimedia.org is meant to resolve to wikipedia-lb.eqiad.wikimedia.org. from within pmtpa itself [23:52:59] just that CARP apparently doesn't work as well as it should because of that. unless I was confused about what ma rk was talking about [23:53:08] e.g. fenari [23:53:14] yes [23:53:41] we don't use geo-dns for in-country [23:53:44] since it doesn't work [23:53:52] so we just send all US traffic to eqiad? [23:53:56] I think we are planning on using anycast [23:54:06] but that's not done yet? [23:54:09] right [23:54:16] so all US traffic goes to eqiad right now [23:54:34] so what goes to pmtpa? [23:54:49] whichever services haven't been switched to eqiad [23:54:56] at some point nothing [23:55:14] until we use anycast, then we'll send some small percentage of traffic there [23:55:25] upload still goes to pmtpa [23:55:31] for now, yeah [23:55:33] residents of the Tampa Bay area? ;) [23:55:38] heh [23:56:07] I guess labs for now as well. we don't even have a labs cluster up in eqiad yet [23:59:06] just browsing the interface counters in torrus, my knowledge of the network is a bit out of date [23:59:39] and it sometimes comes up, e.g. when end users have problems that I need to debug
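As for the geo-DNS question, the quickest check of whether wikipedia-lb.wikimedia.org resolves to the eqiad or the pmtpa service address from a given host (fenari, for instance) is to ask DNS from that host and compare against the per-datacenter names. In the sketch below the eqiad name is the one mentioned in the log and the pmtpa counterpart is assumed by analogy:

    import socket

    NAMES = (
        "wikipedia-lb.wikimedia.org",          # the name users actually hit
        "wikipedia-lb.eqiad.wikimedia.org",    # datacenter-specific service name
        "wikipedia-lb.pmtpa.wikimedia.org",    # assumed pmtpa counterpart
    )

    def resolve(name):
        """Return the A-record addresses this host's resolver hands back for name."""
        try:
            return sorted(set(socket.gethostbyname_ex(name)[2]))
        except socket.gaierror as err:
            return ["unresolvable: %s" % err]

    if __name__ == "__main__":
        for name in NAMES:
            print("%-38s %s" % (name, ", ".join(resolve(name))))
        # If the first name resolves to the eqiad addresses, this resolver is
        # being steered to eqiad; run the same check from fenari to compare.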