[00:00:12] 	 New patchset: J; "remove transcoding stub" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17365
[00:25:16] 	 RECOVERY - MySQL Slave Delay on db63 is OK: OK replication delay 0 seconds
[00:26:19] 	 RECOVERY - MySQL Replication Heartbeat on db63 is OK: OK replication delay 0 seconds
[00:28:43] 	 maplebed: around?
[00:42:40] 	 PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[00:56:37] 	 PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[01:11:37] 	 PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[01:22:09] 	 New patchset: Asher; "adding db12 to decom list" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17373
[01:22:53] 	 Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17373
[01:30:22] 	 binasher: \o/
[01:30:40] 	 death to db12!
[01:31:01] 	 what was wrong with it?
[01:31:05] 	 old?
[01:31:43] 	 yeah, nothing more than that
[01:32:07] 	 binasher: I just subscribed to performance-l^H^H
[01:32:22] 	 terry said it was public :)
[01:34:33] 	 i should probably join it too..
[01:35:18] 	 hahahahaha
[01:35:44] 	 sorry, I found this incredibly funny :)
[01:36:03] 	 says a lot about my humour I guess
[01:36:18] 	 * paravoid is doing VM migrations
[01:36:30] 	 "I'm not slacking off, my VM is migrating!"
[01:36:32] 	 block by block? ;)
[01:41:31] 	 PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 223 seconds
[01:43:01] 	 PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 282 seconds
[01:45:24] 	 paravoid: back if you're still here.
[01:46:10] 	 I am
[01:46:30] 	 I'm doing VM migrations and I'd like to migrate (which involves reboot) two of your instances
[01:47:06] 	 su-fe1 and su-be1
[01:47:12] 	 just thought of giving you the heads-up
[01:48:34] 	 PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 615s
[01:49:46] 	 maplebed: ^
[01:50:12] 	 sorry, got a phone call.
[01:50:29] 	 yeah, go ahead. it'll be interesting to see if they come back.
[01:50:29] 	 np
[01:50:51] 	 you're welcome to reboot anything in both the swift-upgrade and swift projects.
[01:50:58] 	 I should kill the swift3 project, i think.
[01:51:22] 	 * maplebed goes to do that now.
[01:51:52] 	 oh hey, looks like I already did. good for me!
[01:52:08] 	 hah
[01:53:04] 	 any idea why https://labsconsole.wikimedia.org/wiki/Special:NovaInstance wouldn't be showing my instances?
[01:53:13] 	 RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds
[01:54:25] 	 RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 4s
[01:54:26] 	 maplebed: log out and log in again
[01:54:43] 	 RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 22 seconds
[01:54:54] 	 grumblegrumble. funny thing is, https://labsconsole.wikimedia.org/wiki/Nova_Resource:Swift does show my instances.
[01:55:46] 	 but you're right. logging out and back in again now shows them.
[01:56:34] 	 it's known
[01:57:26] 	 jeremyb: ryan's assertion is that the regular lose-stuff bug is fixed and the new one is triggered by changing prefs.
[01:57:36] 	 so far as I understand it at least.
[01:57:42] 	 maplebed: that's irrelevant i think
[01:57:53] 	 * jeremyb is searching
[01:57:54] 	 which, changing the preferences or ryan's assertion?
[01:57:55] 	 :P
[01:58:06] 	 which ryan?
[01:58:10] 	 oh
[01:58:10] 	 that
[01:58:13] 	 ;)
[01:58:19] 	 changing your preferences will trigger the bug
[01:58:27] 	 ah
[01:58:28] 	 I didn't change my prefs though.
[01:58:31] 	 this is a different bug
[01:58:35] 	 you have credentials
[01:58:39] 	 keep refreshing
[01:58:49] 	 well, I see them now that I logged out and back in again.
[01:58:49] 	 we're having scaling issues with nova
[01:58:58] 	 but I loaded the page several times with no instances showing before doing that.
[01:59:00] 	 oh, yeah, scaling
[01:59:06] 	 our version of diablo has an issue with 500 errors
[01:59:07] 	 i can't keep it all straight ;P
[01:59:14] 	 due to the number of instances we've created over time
[01:59:23] 	 when we upgrade the issue will be gone
[02:11:56] 	 !log deployed OAUTHAuth 8992e6f541cd7adf8111ccd87f25a481d3759b33 on labsconsole
[02:12:06] 	 Logged the message, Master
[02:40:06] 	 PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[02:40:06] 	 PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[02:40:06] 	 PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[03:00:12] 	 PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours
[03:07:24] 	 RECOVERY - swift-account-replicator on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[03:21:12] 	 PROBLEM - Puppet freshness on calcium is CRITICAL: Puppet has not run in the last 10 hours
[06:50:05] 	 PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[06:51:13] 	 TimStarling: I've been looking over the Tidy options for a while but can't seem to find out what ridiculous feature this is. Any ideas?
[06:52:36] 	 I can try to find the source if you like
[06:53:07] 	 That would be very nice.
[06:53:57] 	 Thx for the revert btw. I've been sticking my head into the VisualEditor lately, hadn't been looking at production wikis extensively for almost 3 weeks now.
[06:54:19] 	 (I joined the ve team as of last week - currently in SF getting on-boarded with everything)
[06:57:47] 	 good to know, I was afraid I was going to piss you off
[06:59:56] 	 nah, I may be front-end, but I'm still a half-robot :P. This makes perfect sense, it is/was broken
[07:00:43] 	 now, on the other hand, the toolserver SGE sending me 492 e-mails when I'm away for dinner because crontab is broken - that pisses me off.
[07:06:24] 	 so it seems unconditional, but we can patch it out
[07:06:36] 	 /* flush the current buffer only if it is known to be safe,
[07:06:36] 	 i.e. it will not introduce some spurious white spaces.
[07:06:36] 	 See bug #996484 */
[07:06:36] 	 else if ( mode & NOWRAP ||
[07:06:36] 	 nodeIsBR(node) || AfterSpace(doc->lexer, node))
[07:06:37] 	 PCondFlushLine( doc, indent );
[07:07:11] 	 mode & NOWRAP is presumably true, we don't wrap our output
[07:07:11] 	 interesting
[07:07:21] 	 right
[07:07:28] 	 hmm, there is a mode & PREFORMATTED outside this
[07:07:39] 	 wrapper here is, like, <pre>, <textarea>, etc.?
[07:08:20] 	 I guess it doesn't do the new lines for <pre>, basically making the assumption that css is never used to style anything beyond the default soft-recommended w3c stylesheet
[07:08:35] 	 and thus insert them happily for anything else.
[07:09:01] 	 I experimented a bit with the tidy cli to figure out what the conditions are that trigger it.
[07:10:30] 	 mode = PREFORMATTED is used for pre, textarea and script
[07:10:38] 	 that's how it suppresses line breaks in them
[07:10:49] 	 right
[07:11:03] 	 would it take them out otherwise? Or just not add them
[07:11:45] 	 that would be on the parser side
[07:12:26] 	 right
[07:13:03] 	 yeah, there is TrimSpaces() in parser.c that does this
[07:13:03] 	 and it's called unconditionally from the generic block level element parser, which I suppose is what <p> is
[07:16:47] 	 it only skips <pre> tags
[07:17:47] 	 so yes, it seems like avoiding space character removal would need a patch as well
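A minimal sketch of the parser-side patch being discussed, assuming a hypothetical tri-state config option (call it TidyMediawikiClean; it does not exist in tidy today):

    /* Hypothetical sketch, not real tidy code: make the whitespace trimming
     * that TrimSpaces() in parser.c performs skippable, gated on an invented
     * tri-state option "TidyMediawikiClean", so all content is left alone
     * the way preformatted content already is. */
    static void TrimSpaces( TidyDocImpl* doc, Node *element )
    {
        if ( cfgAutoBool( doc, TidyMediawikiClean ) == TidyYesState )
            return;  /* leave whitespace untouched, as for <pre> today */

        /* ... existing trimming logic unchanged ... */
    }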
[07:18:09] 	 So the patch you propose is to make it never insert newlines (basically like the PREFORMATTED behavior) - and to not trim them either (also like <pre>)
[07:18:43] 	 Do we have patches for Tidy already, or would this be a first?
[07:20:47] 	 yes, I think treating everything like <pre> would work, don't you?
[07:20:56] 	 no, we don't have patches for tidy
[07:21:25] 	 I did do a patch once, but it was accepted upstream pretty quickly, and that was years ago
[07:21:43] 	 That sounds good.
[07:21:44] 	 if we want this to be upstream then everything will have to be configurable
[07:22:01] 	 TimStarling: yeah, I think that makes sense. Prettification can be useful, but selective whitespace changes like this seem odd and out of place.
[07:22:44] 	 well, it's in a module called pretty print
[07:22:55] 	 copyright from 1998
[07:22:59] 	 It looks like we have some kind of post-processing inside MediaWiki though, not sure how that works (we wrap it inside a doctype with html/head/title/body etc.) - but I can't find how that wrapper is being removed. It is added around it and apparently it is magically removed on the other end.
[07:23:07] 	 I don't think they ever expected it to be used for cleaning up wiki markup when they wrote it in 1998
[07:24:18] 	 CSS would have been new at the time
[07:24:29] 	 right, that sounds fair. But even then I'd argue that changing whitespace like that is wrong. Stylesheets are separate for a reason. Just assuming that any non-PRE element doesn't care about whitespace is wrong.
[07:24:30] 	  I'm surprised this hasn't shown problems before
[07:24:38] 	 1998.
[07:25:03] 	 just checked CSS1, it does have a white-space property
[07:25:10] 	 Right
[07:25:12] 	 so yeah, we can accuse them of being negligent ;)
[07:25:32] 	 But it doesn't do nice tree indentation either. So what does it pretty print?
[07:25:57] 	 So there is a Tidy "pretty print" module? Or is this another C lib being called?
[07:26:16] 	 try tidy -i
[07:26:40] 	 modems were slow back then
[07:27:14] 	 interesting
[07:28:22] 	 But -i is not by default / and not in wmf's wgTidyOpts, right ?
[07:28:43] 	 right
[07:29:02] 	 so the new line is in Tidy core, or is it half-enabled somehow by default in Tidy
[07:29:30] 	 adding newlines is unconditional in tidy's core
[07:29:40] 	 alrighty
[07:29:58] 	 after every non-preformatted element, it prints a newline
[07:31:49] 	 So what would you recommend as the path to take here? Put a patched version in wikimedia's ubuntu repo? Or propose upstream and wait for reasonable response time? Or doesn't it need to be in the apt repo?
[07:32:40] 	 http://tidy.cvs.sourceforge.net/viewvc/tidy/tidy/src/pprint.c?view=markup#l1353
[07:33:25] 	 best to at least try upstream
[07:33:38] 	 that is the code for the -i mode though, right?
[07:33:49] 	 no, for all modes
[07:33:56] 	 oh, interesting
[07:34:57] 	 see ShouldIndent
[07:35:13] 	 right
[07:35:25] 	     TidyTriState indentContent = cfgAutoBool( doc, TidyIndentContent );
[07:35:26] 	     if ( indentContent == TidyNoState )
[07:35:26] 	         return no;
[07:35:31] 	 hello :)
[07:35:41] 	 maybe our configuration variable could work the same way
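Continuing the sketch above: the pretty-printer side could consult the same hypothetical TidyMediawikiClean option the way ShouldIndent() consults TidyIndentContent in the lines quoted just before this (any name not quoted from tidy's source is invented):

    /* Hypothetical sketch modeled on ShouldIndent(): gate the flush that
     * emits the unconditional newline after non-preformatted elements. */
    static Bool ShouldFlushLine( TidyDocImpl* doc )
    {
        TidyTriState cleanMode = cfgAutoBool( doc, TidyMediawikiClean );
        if ( cleanMode == TidyYesState )
            return no;   /* behave like PREFORMATTED: no extra newline */
        return yes;      /* default: tidy's current behaviour */
    }

    /* ...and the branch quoted earlier from pprint.c would become: */
    else if ( mode & NOWRAP ||
              nodeIsBR(node) || AfterSpace(doc->lexer, node) )
    {
        if ( ShouldFlushLine( doc ) )
            PCondFlushLine( doc, indent );
    }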
[07:37:24] 	 it looks like nothing in the src directory has been updated since 2009
[07:37:29] 	 but you never know
[07:57:45] 	 PROBLEM - Puppet freshness on analytics1006 is CRITICAL: Puppet has not run in the last 10 hours
[08:00:00] 	 PROBLEM - Puppet freshness on cp1032 is CRITICAL: Puppet has not run in the last 10 hours
[08:39:54] 	 PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[09:24:04] 	 PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[09:24:29] 	 New patchset: Hashar; "missing php5 packages on formey" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17387
[09:25:08] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17387
[09:37:53] 	 New patchset: Hashar; "missing php5 packages on formey" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17387
[09:38:32] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17387
[09:38:37] 	 New review: Hashar; "Explicitly declare the list of PHP5 packages for svn server has to be independent of changes that co..." [operations/puppet] (production) C: 0;  - https://gerrit.wikimedia.org/r/17387
[09:49:07] 	 PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[10:43:10] 	 PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[10:57:07] 	 PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[11:12:07] 	 PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[11:13:59] 	 New review: Demon; "Rather than installing packages we don't need, wouldn't it make more sense to fix php.ini to not loa..." [operations/puppet] (production) C: -1;  - https://gerrit.wikimedia.org/r/17387
[11:44:41] 	 New patchset: Alex Monk; "(bug 37226) Add Portal namespaces to bswiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17398
[11:48:59] 	 New patchset: Alex Monk; "(bug 38943) Add import sources for ndswiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17399
[12:01:01] 	 New review: Mark Bergsma; "Not a bad start. Please fix the issues I commented on, then we can merge this and improve it further." [operations/puppet] (production); V: 0 C: 0;  - https://gerrit.wikimedia.org/r/17342
[12:09:03] 	 Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17035
[12:09:13] 	 Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17036
[12:09:22] 	 Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16532
[12:38:02] 	 New review: TheDJ; "You removed the TTF mime-type definition. I presume this was not intended." [operations/puppet] (production) C: -1;  - https://gerrit.wikimedia.org/r/17149
[12:40:31] 	 PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[12:40:31] 	 PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[12:40:31] 	 PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[13:01:40] 	 PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours
[13:22:31] 	 PROBLEM - Puppet freshness on calcium is CRITICAL: Puppet has not run in the last 10 hours
[13:31:06] 	 New patchset: Hashar; "let Jenkins update the integration.mw.org root dir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17413
[13:31:45] 	 any root around to merge + apply change 17413 above on gallium.wikimedia.org please ?
[13:31:45] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17413
[13:31:55] 	 a simple directory fix I need for jenkins
[13:36:13] 	 Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17413
[13:47:17] 	 PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[13:47:19] 	 mark: Dank je ;-)
[13:48:38] 	 RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[13:52:23] 	 PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[14:08:08] 	 RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time
[14:10:05] 	 New patchset: RobH; "removed a misc server from autopart" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17420
[14:10:46] 	 New review: Hashar; "Would simply need to purge the packages indeed:" [operations/puppet] (production) C: 0;  - https://gerrit.wikimedia.org/r/17387
[14:10:46] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17420
[14:10:57] 	 Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17420
[14:32:54] 	 RECOVERY - SSH on calcium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[14:38:03] 	 New patchset: RobH; "updates for smokeping" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17423
[14:38:42] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17423
[14:40:02] 	 Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17423
[14:45:54] 	 New patchset: RobH; "calcium tasked in site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17424
[14:46:34] 	 New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/17424
[14:50:30] 	 New patchset: RobH; "calcium tasked in site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17424
[14:51:09] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17424
[14:51:24] 	 Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17424
[14:55:24] 	 RECOVERY - Puppet freshness on calcium is OK: puppet ran at Thu Aug 2 14:55:07 UTC 2012
[14:58:37] 	 grmf, my irc box has been down for two hours and isn't getting up
[14:58:47] 	 and I have no local backup of my irssi config
[15:00:11] 	 paravoid: is it physically near you?
[15:00:39] 	 kind of
[15:00:46] 	 anyway
[15:01:15] 	 RECOVERY - NTP on calcium is OK: NTP OK: Offset -0.01633191109 secs
[15:02:03] 	 paravoid:  I spent a bit of yesterday fretting because I thought I lost my .dircproxyrc when a labs instance crashed… but then I remembered that there are shared homedirs in labs.
[15:02:22] 	 oh hi
[15:02:32] 	 I just pinged you on the other channel :)
[15:02:36] 	 so, re: migrations
[15:02:51] 	 it's basically grunt work
[15:02:57] 	 see virt0:~/cold-migrate
[15:04:14] * andrewbogott  sees...
[15:07:07] 	 and, like clockwork, my network goes down :(
[15:08:00] 	 so, I'm basically doing nova-manage vm list |grep -v virt[6-8] and migrating those
[15:08:13] 	 as I said it's repetitive and boring
[15:08:24] 	 and I can handle it if you have more interesting things to do
[15:08:26] 	 I don't mind at all
[15:09:41] 	 paravoid:  The very first thing that script does is 'destroy' so I am assuming that destroy doesn't mean what I think it means.
[15:10:11] 	 heheh, no it doesn't
[15:10:15] 	 PROBLEM - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:15] 	 PROBLEM - Apache HTTP on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:15] 	 PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:15] 	 PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:15] 	 PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:16] 	 PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:16] 	 PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:17] 	 PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:17] 	 PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:18] 	 it's libvirt's term for "shutdown"
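(For context: "destroy" in libvirt's C API is a hard power-off of the running domain, not deletion. A minimal sketch of the underlying calls; the connection URI is the usual qemu one and the domain name is a placeholder:)

    /* Minimal libvirt sketch: virDomainDestroy() hard-stops a running domain
     * (libvirt's "destroy" = power off); it does not delete its definition
     * or its disks. "instance-00000042" is a placeholder name. */
    #include <stdio.h>
    #include <libvirt/libvirt.h>

    int main(void)
    {
        virConnectPtr conn = virConnectOpen("qemu:///system");
        if (!conn) { fprintf(stderr, "connect failed\n"); return 1; }

        virDomainPtr dom = virDomainLookupByName(conn, "instance-00000042");
        if (dom) {
            virDomainDestroy(dom);  /* ungraceful stop, like pulling the plug */
            virDomainFree(dom);
        }
        virConnectClose(conn);
        return 0;
    }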
[15:10:18] 	 PROBLEM - Apache HTTP on mw4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:18] 	 PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:19] 	 PROBLEM - Apache HTTP on srv202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:19] 	 PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:20] 	 PROBLEM - Apache HTTP on srv208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:20] 	 PROBLEM - Apache HTTP on srv197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:21] 	 PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:21] 	 PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:22] 	 PROBLEM - Apache HTTP on srv195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:22] 	 PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:23] 	 PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:23] 	 PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:24] 	 PROBLEM - Apache HTTP on srv209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:24] 	 PROBLEM - Apache HTTP on mw56 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:25] 	 PROBLEM - Apache HTTP on srv201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:26] 	 PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:26] 	 PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:26] 	 PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:27] 	 paravoid:  Actually, grunt work sounds pretty good right now, I've been feeling extra dumb this week.
[15:10:27] 	 PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:27] 	 PROBLEM - Apache HTTP on mw6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:28] 	 PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:28] 	 PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:29] 	 PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:29] 	 PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:33] 	 PROBLEM - Apache HTTP on srv210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:33] 	 PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:33] 	 PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:33] 	 PROBLEM - Apache HTTP on srv198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:33] 	 PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:34] 	 PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:34] 	 PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:35] 	 PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:35] 	 PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:36] 	 PROBLEM - Apache HTTP on mw10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:36] 	 PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:37] 	 PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:37] 	 PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:37] 	 uh oh
[15:10:38] 	 PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:38] 	 PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:39] 	 PROBLEM - Apache HTTP on srv207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:39] 	 PROBLEM - Apache HTTP on srv211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:40] 	 PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:40] 	 PROBLEM - Apache HTTP on srv285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:42] 	 PROBLEM - Apache HTTP on srv213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:42] 	 PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:42] 	 PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:42] 	 PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:43] 	 PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:43] 	 PROBLEM - Apache HTTP on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:44] 	 PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:44] 	 PROBLEM - Apache HTTP on srv242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:45] 	 PROBLEM - Apache HTTP on srv230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:45] 	 PROBLEM - Apache HTTP on srv243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:46] 	 PROBLEM - Apache HTTP on srv258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:46] 	 PROBLEM - Apache HTTP on srv239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:47] 	 PROBLEM - Apache HTTP on srv244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:47] 	 PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:47] 	 So it's just a matter of running that script for a given instance, waiting 20 minutes, and then running the next one?  Are you trying to e.g. check in with users on IRC before shutting down instances?
[15:10:48] 	 PROBLEM - Apache HTTP on srv233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:48] 	 PROBLEM - Apache HTTP on srv245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:49] 	 PROBLEM - Apache HTTP on srv228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:51] 	 PROBLEM - Apache HTTP on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:51] 	 PROBLEM - Apache HTTP on srv247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:51] 	 PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:51] 	 PROBLEM - Apache HTTP on srv287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:51] 	 PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:52] 	 PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:52] 	 PROBLEM - Apache HTTP on srv238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:53] 	 PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:00] 	 PROBLEM - Apache HTTP on srv237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:00] 	 PROBLEM - Apache HTTP on srv231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:00] 	 PROBLEM - Apache HTTP on srv234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:00] 	 PROBLEM - Apache HTTP on srv226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:00] 	 PROBLEM - Apache HTTP on srv240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:01] 	 PROBLEM - Apache HTTP on srv241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:01] 	 PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:02] 	 PROBLEM - Apache HTTP on srv246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:09] 	 PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:18] 	 PROBLEM - Apache HTTP on mw16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:18] 	 PROBLEM - Apache HTTP on srv227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:23] 	 wth is going on
[15:11:27] 	 PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:27] 	 PROBLEM - Apache HTTP on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:27] 	 PROBLEM - Apache HTTP on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:36] 	 RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.498 second response time
[15:11:36] 	 RECOVERY - Apache HTTP on srv201 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time
[15:11:36] 	 RECOVERY - Apache HTTP on srv208 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.269 second response time
[15:11:36] 	 RECOVERY - Apache HTTP on srv195 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.372 second response time
[15:11:36] 	 RECOVERY - Apache HTTP on srv197 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.408 second response time
[15:11:37] 	 RECOVERY - Apache HTTP on srv202 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.639 second response time
[15:11:37] 	 PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:38] 	 PROBLEM - Apache HTTP on mw9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:38] 	 RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.252 second response time
[15:11:39] 	 RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.423 second response time
[15:11:39] 	 RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.416 second response time
[15:11:40] 	 RECOVERY - Apache HTTP on mw47 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.720 second response time
[15:11:40] 	 RECOVERY - Apache HTTP on srv209 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.449 second response time
[15:11:40] 	 looks like I don't need to report a site outage
[15:11:41] 	 RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.079 second response time
[15:11:41] 	 RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.004 second response time
[15:11:45] 	 RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.703 second response time
[15:11:45] 	 RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60616 bytes in 3.440 seconds
[15:11:45] 	 RECOVERY - Apache HTTP on mw51 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.734 second response time
[15:11:45] 	 RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.149 second response time
[15:11:54] 	 RECOVERY - Apache HTTP on srv210 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time
[15:11:54] 	 RECOVERY - Apache HTTP on srv198 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time
[15:11:55] 	 RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 48012 bytes in 0.139 seconds
[15:11:55] 	 RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 60541 bytes in 0.277 seconds
[15:11:56] 	 RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.393 second response time
[15:11:56] 	 RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 48003 bytes in 0.935 seconds
[15:11:56] 	 RECOVERY - Apache HTTP on srv207 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time
[15:11:56] 	 RECOVERY - Apache HTTP on srv285 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time
[15:11:56] 	 RECOVERY - Apache HTTP on srv211 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time
[15:11:57] 	 RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.961 second response time
[15:11:57] 	 RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.161 second response time
[15:11:58] 	 RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.670 second response time
[15:12:03] 	 RECOVERY - Apache HTTP on srv258 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time
[15:12:03] 	 RECOVERY - Apache HTTP on srv243 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time
[15:12:03] 	 RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time
[15:12:03] 	 RECOVERY - Apache HTTP on srv242 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time
[15:12:03] 	 RECOVERY - Apache HTTP on srv233 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time
[15:12:04] 	 RECOVERY - Apache HTTP on srv245 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time
[15:12:04] 	 RECOVERY - Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.184 second response time
[15:12:05] 	 RECOVERY - Apache HTTP on srv228 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.600 second response time
[15:12:05] 	 RECOVERY - Apache HTTP on srv239 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.884 second response time
[15:12:05] 	 hey
[15:12:06] 	 RECOVERY - Apache HTTP on srv213 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.549 second response time
[15:12:13] 	 RECOVERY - Apache HTTP on srv244 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.152 second response time
[15:12:13] 	 RECOVERY - Apache HTTP on srv247 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.460 second response time
[15:12:13] 	 RECOVERY - Apache HTTP on srv238 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time
[15:12:13] 	 RECOVERY - Apache HTTP on srv287 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.350 second response time
[15:12:13] 	 RECOVERY - Apache HTTP on srv229 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.800 second response time
[15:12:13] 	 RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39542 bytes in 3.147 seconds
[15:12:21] 	 RECOVERY - Apache HTTP on srv234 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time
[15:12:21] 	 RECOVERY - Apache HTTP on srv231 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.061 second response time
[15:12:21] 	 RECOVERY - Apache HTTP on srv237 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.066 second response time
[15:12:21] 	 RECOVERY - Apache HTTP on srv241 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.226 second response time
[15:12:21] 	 RECOVERY - Apache HTTP on srv226 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.803 second response time
[15:12:22] 	 RECOVERY - Apache HTTP on srv246 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.238 second response time
[15:12:39] 	 RECOVERY - Apache HTTP on srv227 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.062 second response time
[15:12:48] 	 RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time
[15:12:51] 	 New patchset: Hashar; "https://integration.mediawiki.org/nightly/ browseable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17427
[15:13:06] 	 RECOVERY - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60419 bytes in 0.176 seconds
[15:13:06] 	 RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.099 second response time
[15:13:06] 	 RECOVERY - Apache HTTP on mw56 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.533 second response time
[15:13:06] 	 RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.238 second response time
[15:13:06] 	 RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.285 second response time
[15:13:25] 	 RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.936 second response time
[15:13:26] 	 RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.174 second response time
[15:13:26] 	 RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.906 second response time
[15:13:30] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17427
[15:13:42] 	 RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39535 bytes in 0.120 seconds
[15:13:42] 	 RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.364 second response time
[15:13:42] 	 RECOVERY - Apache HTTP on srv240 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.012 second response time
[15:13:51] 	 db31 had a weird load spike
[15:13:51] 	 RECOVERY - Apache HTTP on mw13 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.080 second response time
[15:13:51] 	 RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.137 second response time
[15:14:09] 	 RECOVERY - Apache HTTP on mw16 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time
[15:14:16] 	 appservers have an increased load but it may be just cascading from db31
[15:14:18] 	 RECOVERY - Apache HTTP on mw7 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time
[15:14:18] 	 RECOVERY - Apache HTTP on mw5 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time
[15:14:19] 	 and it's the s4 masters
[15:14:25] 	 master
[15:14:27] 	 RECOVERY - Apache HTTP on mw9 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time
[15:14:27] 	 RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.744 second response time
[15:14:36] 	 RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time
[15:14:36] 	 RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.941 second response time
[15:14:37] 	 RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.802 second response time
[15:14:37] 	 RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.154 second response time
[15:14:37] 	 RECOVERY - Apache HTTP on mw4 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.008 second response time
[15:14:37] 	 RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.149 second response time
[15:14:37] 	 RECOVERY - Apache HTTP on mw6 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.072 second response time
[15:14:45] 	 RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time
[15:14:45] 	 RECOVERY - Apache HTTP on mw14 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.020 second response time
[15:14:45] 	 RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.363 second response time
[15:14:54] 	 RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time
[15:14:54] 	 RECOVERY - Apache HTTP on mw10 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.571 second response time
[15:14:54] 	 RECOVERY - Apache HTTP on mw12 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time
[15:14:55] 	 RECOVERY - Apache HTTP on mw1 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.583 second response time
[15:14:55] 	 RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.612 second response time
[15:14:55] 	 RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.856 second response time
[15:14:55] 	 RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.436 second response time
[15:14:56] 	 RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.038 second response time
[15:15:00] 	 yeah, app server network doubled for a brief period
[15:15:12] 	 RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.194 second response time
[15:15:12] 	 RECOVERY - Apache HTTP on mw3 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.041 second response time
[15:15:17] 	 network and load
[15:15:21] 	 RECOVERY - Apache HTTP on mw15 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time
[15:15:40] 	 New patchset: J; "Add webm mimetype to apache configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17149
[15:15:43] 	 load is still relatively up, although dropping as time passes
[15:16:15] 	 PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:16:20] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17149
[15:17:36] 	 RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.932 second response time
[15:17:45] 	 RECOVERY - Apache HTTP on mw2 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.087 second response time
[15:21:49] 	 ok, everything looks fine now
[15:21:53] 	 but no idea what caused that
[15:24:49] 	 maybe a transient network issue that made nagios wild ?
[15:25:02] 	 oh no, there was definitely something
[15:25:53] 	 there were a few complaints in #-tech
[15:26:25] 	 s4 being commons it's likely to have a wider range of people seeing it
[15:27:10] 	 well sure. but if apaches go nuts then everyone sees it ;)
[15:28:01] 	 New review: Mark Bergsma; "Please fix those indentation errors, and preferably let someone else review your changes if it's not..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17423
[15:28:59] 	 could someone review https://gerrit.wikimedia.org/r/#/c/17427/  please ? That is to make https://integration.mediawiki.org/nightly/ browseable by users.
[15:29:06] 	 New patchset: Mark Bergsma; "Revert "Restart the NTP client if hit by the leap second bug"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17430
[15:29:45] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17430
[15:29:56] 	 Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17430
[15:31:27] 	 mark: paravoid: can one of you review / merge a simple apache configuration change on gallium please ?  https://gerrit.wikimedia.org/r/#/c/17427/   makes https://integration.mediawiki.org/nightly/  browseable :-D
[15:32:29] 	 Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17427
[15:33:45] 	 hey RobH - did you have any chance to look at yttrium's problem?
[15:34:34] 	 sorry, got sidetracked with other things
[15:34:47] 	 let me find an rt ticket and dump the info we have thus far
[15:36:43] 	 the ticket's at https://rt.wikimedia.org/Ticket/Display.html?id=3221
[15:41:46] 	 mark: do you know if I should reload ms7? apergos hasn't appeared today
[15:41:52] 	 Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17149
[15:51:59] 	 PROBLEM - Apache HTTP on srv234 is CRITICAL: Connection refused
[16:13:53] 	 PROBLEM - Apache HTTP on srv196 is CRITICAL: Connection refused
[16:20:33] 	 !log authdns-update for smokeping
[16:20:42] 	 Logged the message, RobH
[16:21:50] 	 RECOVERY - Apache HTTP on srv234 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time
[16:25:24] 	 Reedy: is it possible to test !wikipedia on test?
[16:25:32] 	 Reedy: (re: shorturl)
[16:25:48] 	 unfortunately not, no :(
[16:26:14] 	 okay
[16:26:21] 	 thankfully it's trivial enough
[16:27:14] 	 !wikipedia == not wikipedia? as in sister projects?
[16:29:30] 	 yes
[16:30:25] 	 New review: Faidon; "thanks!" [operations/apache-config] (master); V: 0 C: 2;  - https://gerrit.wikimedia.org/r/17191
[16:30:32] 	 New review: Faidon; "thanks!" [operations/apache-config] (master); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/17191
[16:30:32] 	 Change merged: Faidon; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/17191
[16:37:54] 	 something sucky is happening on the apaches.. memcached latency is up a ton as of 1500 - http://bit.ly/MB17yh ( https://graphite.wikimedia.org/render/?title=Memcached%20Gets%20Avg%20and%2099th%20Percentile%20Latency%20(ms)%20log(2)%20-1day&from=-1day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=cactiStyle(MWMemcached.get.tp99)&target=cactiStyle(MWMemcached.get.tavg) 
[16:38:16] 	 "yay"
[16:39:10] 	 I wonder if this is a bad time to do a graceful-all :)
[16:39:16] 	 outbound traffic from the appservers went up by ~700mbps then too
[16:39:27] 	 maybe its a great time!
[16:41:51] 	 ok then
[16:41:53] 	 RECOVERY - Apache HTTP on srv196 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.050 second response time
[16:42:17] 	 Why do I need to login to access the dashboard but not /render ^.^
[16:43:12] 	 paravoid: if you are doing a graceful-all, that sort of thing is good to !log
[16:43:20] 	 I was just writing the log line
[16:43:26] 	 :)
[16:43:27] 	 I did so the other time, so yeah
[16:43:35] 	 I figured as much
[16:43:40] 	 PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[16:43:50] 	 Damianz: so that https://gdash.wikimedia.org can have no auth at all
[16:43:52] 	 I do wonder though why sync-apache doesn't log automatically
[16:43:55] 	 Damianz: do you have an account?
[16:44:13] 	 Oooh that's sorta shiny
[16:44:14] 	 paravoid: that's a good point
[16:44:18] 	 And yeah I have an account for it :)
[16:44:29] 	 paravoid: {{sofixit}} ;)
[16:44:37] 	 !log sync-apache/apache-graceful-all for BZ #38905 (ShortUrl on !Wikipedia)
[16:44:46] 	 Logged the message, Master
[16:44:52] 	 RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms
[16:44:56] 	 don't know how, too busy to care right now
[16:45:07] 	 heh
[16:45:08] 	 heh
[16:45:09] 	 let's see
[16:45:18] 	 Damianz: the graphite dashboard allows people to write / overwrite certain things with no security model of its own, so i didn't want it completely open to the public
[16:45:30] 	 but do want the graphs to be public
[16:45:38] 	 Makes sense
[16:46:06] 	 I actually have something sorta like that for work but pipe stuff through a script that does a cURL in the background then caches it in the frontend :) <3 graphite
[16:46:31] 	 Damianz: there's some varnish in there too
[16:46:39] 	 Damianz: and mod_proxy
[16:46:40] 	 binasher: btw, how did you catch the increased memcached latency? are you polling the graphs in your day?
[16:47:07] 	 yeah, gdash goes via a varnish instance that caches graphs for a short time.. i think i set it to 2 minutes but don't remember
[16:48:37] 	 PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[16:48:47] 	 paravoid: which script did you use?
[16:48:58] 	 /h/w/bin/apache-graceful-all already has dologmsg calls
[16:49:38] 	 paravoid: yeah, graph polling, but specifically the pcache graphs - http://gdash.wikimedia.org/dashboards/pcache/ - see the 4th graph down
[16:49:40] 	 PROBLEM - Apache HTTP on srv279 is CRITICAL: Connection refused
[16:51:28] 	 PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[16:51:46] 	 RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time
[16:52:12] 	 paravoid: so then i was worried that pc1 was having trouble, but found no sign of it.. even logged timing for all queries for 3 minutes and found 99% time for selects = 2.7ms and for writes = 4ms.  it's pretty sad when a mysql db pretending to be a cache is incredibly faster than a memcached cluster
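(The graphs being polled here come from graphite's render API. A rough libcurl sketch of fetching one series programmatically; the URL is shortened from the one pasted earlier, and format=json is an assumption about what this graphite install accepts:)

    /* Rough sketch: poll a graphite render URL with libcurl, as one might to
     * watch the memcached tp99 series discussed above. */
    #include <stdio.h>
    #include <curl/curl.h>

    int main(void)
    {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        if (!curl) return 1;

        curl_easy_setopt(curl, CURLOPT_URL,
            "https://graphite.wikimedia.org/render/"
            "?target=cactiStyle(MWMemcached.get.tp99)&from=-1day&format=json");
        /* libcurl's default write callback prints the response to stdout */
        CURLcode rc = curl_easy_perform(curl);
        if (rc != CURLE_OK)
            fprintf(stderr, "fetch failed: %s\n", curl_easy_strerror(rc));

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return rc == CURLE_OK ? 0 : 1;
    }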
[16:52:56] 	 Reedy: /usr/local/bin/sync-apache & /home/wikipedia/bin/apache-graceful-all
[16:54:04] 	 if [ `cat /etc/cluster` == pmtpa ]; then
[16:54:04] 	         /home/wikipedia/bin/dologmsg "$USER is doing a graceful restart of all apaches"
[16:54:04] 	 fi
[16:54:51] 	 if [ -e /etc/redhat-release ]; then
[16:54:51] 	         echo "$*" > /dev/udp/$host/53412
[16:54:53] 	 I love our scripts
[16:55:31] 	 seems legit
[16:55:49] 	 Reedy: nothing listening on 53412 on fenari
[16:55:53] 	 that explains it :)
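(The quoted script logs by writing to /dev/udp/$host/53412 in bash; a rough C equivalent of that one-liner, which shows why a missing listener makes the message vanish silently, UDP being fire-and-forget:)

    /* Rough C equivalent of `echo "$*" > /dev/udp/$host/53412`. If nothing
     * listens on the port, the datagram is silently dropped: hence the
     * missing !log entries. */
    #include <netdb.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int send_log(const char *host, const char *msg)
    {
        struct addrinfo hints = { .ai_family = AF_INET,
                                  .ai_socktype = SOCK_DGRAM };
        struct addrinfo *res;
        if (getaddrinfo(host, "53412", &hints, &res) != 0)
            return -1;

        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd >= 0) {
            sendto(fd, msg, strlen(msg), 0, res->ai_addr, res->ai_addrlen);
            close(fd);
        }
        freeaddrinfo(res);
        return fd >= 0 ? 0 : -1;
    }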
[16:56:04] 	 hmm
[16:56:28] 	 scap uses $BINDIR/dologmsg "!log $USER Finished syncing Wikimedia installation... : $*"
[16:56:36] 	 BINDIR=/usr/local/bin
[16:56:50] 	 ooooof course.
[16:57:06] 	 Doesn't "dologmsg" imply you don't need !log at the start?
[17:00:05] 	 it still writes it to IRC
[17:00:48] 	 the two dologmsgs are different
[17:01:49] 	 indeed
[17:02:09] 	 /home/wikipedia/bin is definitely wrong
[17:02:19] 	 but these seem to be in a svn repo that only has 1 commit :/
[17:03:38] 	 guess we could symlink the /h/w/bin one to the other
[17:03:53] 	 j^: ping?
[17:05:02] 	 j^: you're patching at a high rate and we/I've been unable to keep up :) just to make sure, could you do a quick summary of all the things that you're waiting on from us?
[17:16:58] 	 RECOVERY - Apache HTTP on srv279 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time
[17:22:43] 	 preilly: not sure what that is, I didn't start any scripts then or something
[17:59:01] 	 PROBLEM - Puppet freshness on analytics1006 is CRITICAL: Puppet has not run in the last 10 hours
[18:00:58] 	 PROBLEM - Puppet freshness on cp1032 is CRITICAL: Puppet has not run in the last 10 hours
[18:39:46] 	 LeslieCarr: hii
[18:40:11] 	 hey dschoon
[18:40:13] 	 wazzzzuuuuuup
[18:40:30] 	 are you cuddling with row C? they need love
[18:41:01] 	 PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[18:41:31] 	 hehehe
[18:41:37] 	 ok, what's wrong ? :)
[18:41:47] 	 me and row c don't like to talk about our relationship
[18:42:15] 	 well, we have a bunch of boxes in there. (25)
[18:42:28] 	 and we're waiting for ops handoff
[18:42:38] 	 i saw you were the last person working on them, so i figured i'd ask
[18:42:40] 	 https://rt.wikimedia.org/Ticket/Display.html?id=3067
[18:43:04] 	 so analytics1011 to analytics1027 was done a while ago
[18:43:08] 	 sorry, should have updated
[18:43:19] 	 any other machines i haven't gotten the port assignments for
[18:43:28] 	 i should resolve that ticket
[18:44:09] 	 sweet. what happens next? (i'm not familiar enough with the ops hardware setup workflow to know)
[18:45:24] 	 someone needs to install them ...
[18:45:37] 	 let me see if their macs are in the dhcp files
[18:47:15] 	 "".148 to .157 from 10.64.16.0/22 as that is private1-b."" afair
[18:47:17] 	 nope they're not
[18:47:29] 	 yeah
[18:47:56] 	 dschoon: so open a ticket saying "install all the analytics machines, bitches"
[18:48:08] 	 RT-1985
[18:48:15] 	 (that is at least for 10 of them)
[18:48:22] 	 feel free to bump that
[18:48:45] 	 better a new ticket because 1985 is only the cisco machines
[18:48:52] 	 ah yeah
[18:48:57] 	 those were weird
[18:49:04] 	 like sometimes sda and sdb would install and sometimes not ?
[18:49:13] 	 true. changing ticket subjects is also an option
[18:49:38] 	 yes, skipped sda and sdb in the partman recipe
[18:49:45] 	 but there was some inconsistency
[18:50:22] 	 i think paravoid solved this, i think sda had some weird virtual cisco only partitions IIRC
[18:51:21] 	 aye, I heard something about that too
[18:51:28] 	 I can open a new ticket
[18:52:17] 	 or would it be better to change the subject?
[18:52:25] 	 mutante?
[18:52:44] 	 LeslieCarr: is there anything you can look at on pmtpa switches for signs of trouble? the site is still having some trouble
[18:53:02] 	 oh
[18:53:07] 	 what kind of stuff
[18:53:13] 	 i can check out generically
[18:53:29] 	 tcp segment retrans counters seem to be going up a lot on the apaches, but maybe that's always the case
[18:54:23] 	 ottomata: a new one is fine in this case, since they arent all Cisco
[18:54:26] 	 ok cool
[18:54:54] 	 I'm not exactly sure what I should put, except "please install analytics1011-1027"
[18:55:15] 	 install precise maybe?
[18:55:29] 	 memcached get times went up a lot at 15:00, along with raw amount of network traffic from the apaches, and requests started timing out, but i haven't found a specific cause, or request / cpu / ram changes on the apache
[18:55:30] 	 ottomata: details on what OS you would like, if you want RAID, which level, and how it should be partitioned, with/without LVM...
[18:55:42] 	 ok, yeah, I can get you that, but you guys know which boxes to use?
[18:56:00] 	 yeah, the hostnames should be enough
[18:57:56] 	 can one of the opsen tell me which file in puppet contains the squid log configuration and in particular i am looking for settings about the http referer header?
[18:58:23] 	 binasher: hrm, don't necessarily see anything… looking more
[18:59:51] 	 drdee_: that would be manifests/squid.pp . well at least i see some logrotate setup there
[18:59:51] 	 drdee, i'm pretty sure it is not in puppet
[19:00:02] 	 its in a php template on fenari
[19:00:06] 	 or something weird like that, lemme see
[19:00:06] 	 source => "puppet:///files/logrotate/squid-frontend";
[19:00:42] 	 ja, drdee_, if you log into fenari
[19:00:56] 	 ok
[19:00:56] 	 the squid config template file is at  /home/w/conf/squid/frontend.conf.php.
[19:00:59] 	 asher, robh, notpeter:  OK if I reboot your labs instances?
[19:01:02] 	 ty!
[19:01:11] 	 andrewbogott: wont hurt me any
[19:01:26] 	 andrewbogott: sure
[19:01:36] 	 ottomata:  Same question:  mind if I reboot your labs instances in the next few minutes?
[19:01:48] 	 yes!
[19:02:01] 	 please wait while the metrics meeting is running
[19:02:07] 	 'k
[19:02:13] 	 maybe somebody wants to show reportcard
[19:02:32] 	 andrewbogotto, i think you can reboot any of them except for reportcard2 and kripke
[19:02:39] 	 the other ones are mine and are not used for the reportcard stuff
[19:02:56] 	 ewwwwww
[19:03:01] 	 andrewbogotto
[19:03:16] 	 mutant otto + andrew B
[19:04:14] 	 ottomata, drdee_:  metrics meeting is over now… does that mean reportcard is fair game now?
[19:04:17] 	 is there a way to relate tickets together in rt
[19:04:22] 	 ?
[19:04:30] 	 yeah go ahead and reboot :)
[19:04:30] 	 links tab
[19:04:45] 	 ottomata: links, you can refer them, or make them parents or children of other tickets
[19:04:50] 	 drdee_:  thanks!
[19:05:39] 	 ok cool, danke
[19:05:48] 	 guess I have to create the ticket before I can link it
[19:09:19] 	 andrewbogott: it's now safe to reboot the kripke and/or reportcard2 labs instances
[19:09:41] 	 dschoon:  ok, good to know.
[19:12:21] 	 mutante: where are the debian build rules located for squid?
[19:13:00] 	 if it's still in svn trunk/debs/squid
[19:13:07] 	 ty Reedy
[19:13:24] 	 drdee: it's in git seemingly https://gerrit.wikimedia.org/r/gitweb?p=operations/debs/squid.git
[19:13:32] 	 lol.
[19:14:11] 	 drdee: drdee_: it's in git seemingly https://gerrit.wikimedia.org/r/gitweb?p=operations/debs/squid.git
[19:14:28] 	 tyank!
[19:24:14] 	 question for the opsen: erosen, dartar and myself have on three separate occasions looked at the referer http header from the squid logs, and surprisingly in approx. 90% of the cases the referer value is '-', which we find hard to believe :) does anybody have an idea what could cause this? i did look at how squid is built and it has the --enable-referer-log option enabled so that is looking good, and obviously the log format does
[19:24:14] 	 contain the directive for the referer header as well.
[19:24:38] 	 PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[19:27:56] 	 drdee: when it's a reverse proxied request, does it become X-PROXY-REFERER or some such?
[19:28:52] 	 dschoon: I wouldn't know, hope some of the ops folks have the answer.....
[19:30:27] 	 why would reverse proxy mess with referer?
[19:31:22] 	 because the referring page is the one that triggered the sub-request, not the one that sent them to that page? dunno. it almost makes sense?
[19:31:43] 	 can someone take a look at https://rt.wikimedia.org/Ticket/Display.html?id=3366 please?
[19:37:34] 	 Looks like faidon answered you :p
[19:40:32] 	 PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100%
[19:45:38] 	 RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 35.45 ms
[19:48:14] 	 btw, mutante:
[19:48:15] 	 https://rt.wikimedia.org/Ticket/Display.html?id=3367
[19:49:33] 	 PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[19:50:53] 	 PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100%
[19:53:34] 	 New patchset: preilly; "test reset of X-Forwarded-For" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17453
[19:53:45] 	 notpeter: can you push ^^
[19:54:17] 	 Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17453
[19:55:31] 	 notpeter: can you force push it
[19:55:48] 	 doing so now
[19:56:03] 	 notpeter: thanks
[19:56:12] 	 notpeter: just let me know when it's done
[19:57:00] 	 RobH: are you in the colo today?
[19:57:12] 	 nope, will be tomorrow though
[19:57:16] 	 why?
[19:57:20] 	 ok.  ms-be1003 and 1005.
[19:57:26] 	 =/
[19:57:29] 	 they're misbehaving.
[19:57:44] 	 ok, whats wrong with them?
[19:57:50] 	 i dont see any ticket
[19:57:55] 	 I haven't made one yet.
[19:57:58] 	 ahh
[19:57:59] 	 ok
[19:58:05] 	 1005 is being weird - first it couldn't talk to the disks,
[19:58:18] 	 then it could, but when I ran puppet it seems like it spontaneously rebooted.
[19:58:30] 	 PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100%
[19:58:33] 	 1003 I'm trying a fresh install
[19:58:45] 	 it looked ok but on reboot complained about disks not being present.
[19:59:18] 	 I'm thinking maybe just reseat all the cables in both?
[20:00:23] 	 though that seems a little 'turn it off and back on again'ish, I don't really have any better ideas (and am open to suggestions)
[20:02:05] 	 gonna reinstall 1005 too, just for kicks.
[20:04:03] 	 RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 35.44 ms
[20:05:19] 	 New patchset: Demon; "Bunch of Gerrit UI fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16841
[20:06:00] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16841
[20:08:06] 	 RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 35.37 ms
[20:10:39] 	 RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[20:11:06] 	 PROBLEM - swift-object-replicator on ms-be1005 is CRITICAL: Connection refused by host
[20:11:06] 	 PROBLEM - swift-account-replicator on ms-be1005 is CRITICAL: Connection refused by host
[20:11:15] 	 PROBLEM - swift-object-server on ms-be1005 is CRITICAL: Connection refused by host
[20:11:15] 	 PROBLEM - swift-account-reaper on ms-be1005 is CRITICAL: Connection refused by host
[20:11:24] 	 PROBLEM - swift-container-server on ms-be1005 is CRITICAL: Connection refused by host
[20:11:42] 	 PROBLEM - swift-account-server on ms-be1005 is CRITICAL: Connection refused by host
[20:11:42] 	 PROBLEM - swift-object-updater on ms-be1005 is CRITICAL: Connection refused by host
[20:11:51] 	 PROBLEM - swift-container-updater on ms-be1005 is CRITICAL: Connection refused by host
[20:11:51] 	 PROBLEM - SSH on ms-be1005 is CRITICAL: Connection refused
[20:12:09] 	 PROBLEM - swift-object-auditor on ms-be1005 is CRITICAL: Connection refused by host
[20:12:09] 	 PROBLEM - swift-container-auditor on ms-be1005 is CRITICAL: Connection refused by host
[20:12:09] 	 PROBLEM - swift-account-auditor on ms-be1005 is CRITICAL: Connection refused by host
[20:12:27] 	 PROBLEM - swift-container-replicator on ms-be1005 is CRITICAL: Connection refused by host
[20:13:39] 	 RECOVERY - Puppet freshness on ms-be1003 is OK: puppet ran at Thu Aug 2 20:13:28 UTC 2012
[20:15:36] 	 RECOVERY - swift-object-server on ms-be1003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[20:15:37] 	 RECOVERY - swift-account-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[20:15:45] 	 RECOVERY - swift-container-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[20:15:45] 	 RECOVERY - swift-object-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[20:16:03] 	 RECOVERY - swift-account-reaper on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[20:16:03] 	 RECOVERY - swift-container-server on ms-be1003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[20:16:12] 	 RECOVERY - swift-object-auditor on ms-be1003 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[20:16:12] 	 RECOVERY - SSH on ms-be1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[20:16:39] 	 RECOVERY - swift-container-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[20:16:39] 	 RECOVERY - swift-object-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[20:16:39] 	 RECOVERY - swift-account-server on ms-be1003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[20:16:57] 	 RECOVERY - swift-container-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[20:18:20] 	 Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16841
[20:19:29] 	 ^demon: ugh
[20:19:34] 	 that change was bad
[20:19:37] 	 it needs a followup
[20:19:42] 	 err: /Stage[main]/Gerrit::Jetty/File[/var/lib/gerrit2/review_site/etc/GerritSiteHeader.html]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///files/gerrit/skin/GerritSite.html at /var/lib/git/operations/puppet/manifests/gerrit.pp:163
[20:20:05] <^demon>	 Whoops, sorry
[20:20:06] 	 RECOVERY - SSH on ms-be10 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[20:20:19] 	 it's sooooo much less ugly now!
[20:21:54] 	 New patchset: Demon; "Fixup to I90fc1c45, forgot the word Header" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17458
[20:22:25] <^demon>	 Ryan_Lane: Don't merge, I'll add the missing image file too
[20:22:34] 	 cool
[20:22:38] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17458
[20:22:48] 	 j^: yep, we know it's a little broken right now :)
[20:22:52] 	 missing files
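(The failure above is a puppet file resource whose puppet:///files/... source doesn't exist in the repository. A small Python sketch of a pre-merge check that would catch it, assuming the conventional layout where puppet:///files/<path> maps to files/<path> in the same repo.)

    import os, re, sys

    SOURCE_RE = re.compile(r"puppet:///files/([^'\"\s]+)")

    def missing_sources(repo_root):
        # Scan every manifest for puppet:///files/... sources and report
        # the ones with no matching file under files/ in the same repo.
        for dirpath, _, filenames in os.walk(os.path.join(repo_root, 'manifests')):
            for name in filenames:
                if not name.endswith('.pp'):
                    continue
                path = os.path.join(dirpath, name)
                with open(path) as f:
                    for lineno, line in enumerate(f, 1):
                        for rel in SOURCE_RE.findall(line):
                            if not os.path.exists(os.path.join(repo_root, 'files', rel)):
                                yield "%s:%d: missing files/%s" % (path, lineno, rel)

    if __name__ == '__main__':
        for problem in missing_sources(sys.argv[1]):
            print(problem)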
[20:25:57] 	 PROBLEM - NTP on ms-be1003 is CRITICAL: NTP CRITICAL: Offset unknown
[20:26:07] 	 Ryan_Lane: https://labsconsole.wikimedia.org/wiki/Special:NovaAddress is 500ing on me
[20:26:15] 	 yeah
[20:26:17] 	 known issue
[20:26:24] <^demon>	 Trying to push, it's just sitting there :(
[20:26:26] 	 Oh now it works
[20:26:27] 	 it's due to scaling issues with our current version of nova
[20:26:42] 	 New patchset: Demon; "Fixup to I90fc1c45, copy of other UI fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17458
[20:26:42] 	 ^demon: did you break gerrit? heh
[20:26:44] <^demon>	 Thur we go.
[20:26:46] 	 \o/
[20:26:55] <^demon>	 Meh, s/copy/couple/
[20:27:22] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17458
[20:28:12] 	 RECOVERY - Puppet freshness on ms-be10 is OK: puppet ran at Thu Aug 2 20:28:00 UTC 2012
[20:28:21] 	 RECOVERY - swift-object-replicator on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[20:28:30] 	 ^demon: Krinkle is complaining about certain features of the new stylesheet so I'm showing him how to change the CSS on the labs instance, is that alright?
[20:28:30] 	 RECOVERY - swift-account-server on ms-be1005 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[20:29:10] <^demon>	 Yeah, just show him how to fiddle with /var/lib/gerrit2/review_site/etc/GerritSite.css
[20:29:17] <^demon>	 No restart of gerrit required for changes to that.
[20:29:23] 	 Invalid query:
[20:29:23] 	 line 0:-1 no viable alternative at input ''
[20:29:31] 	 Cool, thanks
[20:29:42] 	 RECOVERY - swift-container-server on ms-be10 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[20:30:00] 	 RECOVERY - swift-container-updater on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[20:30:06] 	 oh, duh. nvm
[20:30:09] 	 RECOVERY - swift-object-updater on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[20:30:09] 	 RECOVERY - swift-account-server on ms-be10 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[20:30:09] 	 RECOVERY - swift-container-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[20:30:27] 	 RECOVERY - swift-account-auditor on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[20:30:31] <^demon>	 RoanKattouw: If you guys pick a better color for that sky-blue highlight, that's in gerrit.config, selectionColor.
[20:30:36] 	 PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:30:44] 	 Thanks
[20:30:45] 	 RECOVERY - swift-object-auditor on ms-be10 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[20:30:54] 	 RECOVERY - swift-object-server on ms-be10 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[20:30:54] 	 RECOVERY - swift-account-reaper on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[20:30:54] 	 RECOVERY - swift-container-auditor on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[20:31:03] 	 RECOVERY - swift-object-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[20:31:03] 	 RECOVERY - swift-container-replicator on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[20:31:12] 	 RECOVERY - swift-account-replicator on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[20:31:39] 	 RECOVERY - swift-container-server on ms-be1005 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[20:31:39] 	 RECOVERY - swift-account-reaper on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[20:31:39] 	 RECOVERY - swift-object-server on ms-be1005 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[20:31:52] <^demon>	 Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/17458/ is ready to fix the header problem.
[20:31:57] 	 RECOVERY - swift-container-updater on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[20:32:06] 	 RECOVERY - swift-object-updater on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[20:32:15] 	 RECOVERY - swift-object-auditor on ms-be1005 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[20:32:15] 	 RECOVERY - swift-account-auditor on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[20:32:15] 	 RECOVERY - swift-container-auditor on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[20:33:02] 	 New patchset: Demon; "Fixup to I90fc1c45, couple of other UI fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17458
[20:33:09] 	 PROBLEM - NTP on ms-be10 is CRITICAL: NTP CRITICAL: Offset unknown
[20:33:43] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17458
[20:34:39] 	 RECOVERY - NTP on ms-be1003 is OK: NTP OK: Offset 0.001237273216 secs
[20:34:47] 	 Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17458
[20:34:48] 	 RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 35.72 ms
[20:35:34] 	 ^demon: running
[20:35:59] <^demon>	 Wheee.
[20:36:02] <^demon>	 That's so much better
[20:36:21] 	 it would be nice to have a wikimedia specific logo :(
[20:36:43] 	 New theme is a slight improvement
[20:37:48] 	 PROBLEM - swift-container-server on ms-be1003 is CRITICAL: Connection refused by host
[20:37:57] 	 PROBLEM - SSH on ms-be1003 is CRITICAL: Connection refused
[20:38:06] 	 PROBLEM - swift-account-server on ms-be1003 is CRITICAL: Connection refused by host
[20:38:15] 	 PROBLEM - swift-container-updater on ms-be1003 is CRITICAL: Connection refused by host
[20:38:24] 	 PROBLEM - swift-object-updater on ms-be1003 is CRITICAL: Connection refused by host
[20:38:33] 	 PROBLEM - swift-object-server on ms-be1003 is CRITICAL: Connection refused by host
[20:38:33] 	 PROBLEM - swift-container-auditor on ms-be1003 is CRITICAL: Connection refused by host
[20:38:43] 	 PROBLEM - swift-container-replicator on ms-be1003 is CRITICAL: Connection refused by host
[20:38:51] 	 PROBLEM - swift-object-replicator on ms-be1003 is CRITICAL: Connection refused by host
[20:39:00] 	 PROBLEM - swift-account-reaper on ms-be1003 is CRITICAL: Connection refused by host
[20:39:00] 	 PROBLEM - swift-account-auditor on ms-be1003 is CRITICAL: Connection refused by host
[20:39:00] 	 PROBLEM - swift-object-auditor on ms-be1003 is CRITICAL: Connection refused by host
[20:39:33] 	 ^demon: can we have a bear as the logo?
[20:39:51] 	 we all know gerrit is a beast. we should make a logo that reflects it
[20:39:59] 	 a fire-breathing bear
[20:40:57] <^demon>	 Gerrit the Grizzly.
[20:42:21] 	 i feel like we need to ask the oatmeal guy for that kind of logo
[20:43:49] 	 PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[20:44:53] 	 binasher: there are a couple of memcached-related errors appearing in the apache logs
[20:45:03] 	 yeah there are
[20:45:05] 	 12 PHP Warning:  stream_select() [function.stream-select]: No stream arrays were passed in /usr/local/apache/common-local/php-1.20wmf8/includes/objectcache/MemcachedClient.php on line 1119
[20:45:05] 	 12 PHP Warning:  stream_select() [function.stream-select]: cannot cast a filtered stream on this system in /usr/local/apache/common-local/php-1.20wmf8/includes/objectcache/MemcachedClient.php on line 1119
[20:45:45] 	 Reedy: are those uncommon?
[20:45:54] 	 yup
[20:47:38] 	 LeslieCarr: heh. he might do it if we asked for one. jorm is going to draw one for us over the weekend, though
[20:48:32] 	 Reedy: those were while persistent was on, there's a bug in the client that's only triggered when using pfsockopen
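(The "No stream arrays were passed" warning above is stream_select() being called with an empty stream set. A rough Python analogue of the guard such a client needs; illustrative only, not the MediaWiki client itself.)

    import select

    def wait_readable(socks, timeout):
        # select() with three empty lists raises OSError on Windows and just
        # sleeps for the timeout on Unix, so guard the empty case explicitly
        # instead of letting a dead persistent connection leave us with no
        # streams to watch.
        if not socks:
            return []
        readable, _, _ = select.select(socks, [], [], timeout)
        return readable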
[20:48:37] 	 New review: Anomie; "Now gerrit has horizontal scrollbars :(" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16841
[20:48:47] 	 ah
[20:48:51] 	 :)
[20:48:54] 	 RECOVERY - NTP on ms-be10 is OK: NTP OK: Offset -0.01242733002 secs
[20:49:52] 	 New patchset: Hashar; "gerrit: body color in black, links lighter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17462
[20:50:32] 	 paravoid: there's a problem with the php5-memcached lucid pkg you built - /var/lib/dpkg/info/php5-memcached.postinst: 6: dpkg-maintscript-helper: not found
[20:50:33] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17462
[20:50:52] 	 LeslieCarr: I think we should just ask for him to release *that* logo under cc-by-sa ;)
[20:51:26] 	 :)
[20:52:26] 	 Damianz: it'll be licensed however the other ones are
[20:52:41] 	 I have a feeling it won't be trademarked
[20:52:41] 	 Change abandoned: Hashar; "Timo working on tweaking the CSS already." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17462
[20:56:27] 	 Ryan_Lane: Lawyers love you :D
[20:57:54] 	 PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[20:58:06] 	 Damianz: heh
[21:02:30] 	 ok, we are just being attacked by someone who is adding randomized query strings to bypass the squids
[21:02:34] 	 well, that's good!
[21:02:41] 	 hehehe whew ?
[21:02:48] 	 yeah!
[21:02:59] 	 bwahaha, and they started at 15:09
[21:03:24] 	 interesting what the internal impact was
[21:03:28] 	 re: memcached
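(One standard mitigation for the cache-busting described above is to normalize the cache key, dropping query parameters that don't legitimately vary the response so randomized junk collapses onto a single cached object. A Python sketch; the parameter allowlist is invented for illustration.)

    from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

    # Hypothetical allowlist of parameters that legitimately vary the response.
    ALLOWED_PARAMS = {'title', 'action', 'oldid', 'uselang'}

    def cache_key(url):
        # Drop unrecognized query parameters so ?x=<random> variants all
        # collapse onto one cache entry instead of missing every time.
        parts = urlsplit(url)
        kept = sorted((k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                      if k in ALLOWED_PARAMS)
        return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ''))

    assert (cache_key('http://example.org/w/index.php?title=Foo&x=8f3a9c')
            == cache_key('http://example.org/w/index.php?title=Foo&x=1b2c3d'))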
[21:03:50] 	 New patchset: Hashar; "making David Schoonover a Jenkins admin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17465
[21:04:30] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17465
[21:08:15] 	 can someone please review/merge change 17465? dsc is going to be a Jenkins guru :-)
[21:10:36] 	 ty, gerrit-wm!
[21:10:56] 	 (and also ty whoever merges 17465 <3)
[21:12:27] 	 PROBLEM - LVS HTTP IPv4 on wikisource-lb.esams.wikimedia.org is CRITICAL: Connection refused
[21:12:27] 	 PROBLEM - LVS HTTP IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:28] 	 PROBLEM - LVS HTTP IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:28] 	 PROBLEM - LVS HTTP IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:28] 	 PROBLEM - LVS HTTP IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:28] 	 PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:28] 	 PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:29] 	 PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:29] 	 PROBLEM - LVS HTTP IPv4 on mediawiki-lb.esams.wikimedia.org is CRITICAL: Connection refused
[21:12:30] 	 PROBLEM - LVS HTTPS IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:30] 	 PROBLEM - Frontend Squid HTTP on knsq24 is CRITICAL: Connection refused
[21:12:31] 	 PROBLEM - Frontend Squid HTTP on knsq26 is CRITICAL: Connection refused
[21:12:31] 	 PROBLEM - Frontend Squid HTTP on knsq27 is CRITICAL: Connection refused
[21:12:32] 	 PROBLEM - Frontend Squid HTTP on knsq23 is CRITICAL: Connection refused
[21:12:32] 	 PROBLEM - Frontend Squid HTTP on knsq28 is CRITICAL: Connection refused
[21:12:33] 	 PROBLEM - LVS HTTPS IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:33] 	 PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:34] 	 PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:34] 	 PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:36] 	 PROBLEM - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:36] 	 PROBLEM - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:36] 	 PROBLEM - LVS HTTP IPv6 on wikimedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:36] 	 PROBLEM - LVS HTTP IPv6 on wikibooks-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:37] 	 PROBLEM - LVS HTTP IPv6 on wikiversity-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:37] 	 PROBLEM - LVS HTTP IPv4 on wikiquote-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused
[21:12:38] 	 PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:38] 	 PROBLEM - LVS HTTPS IPv6 on wikinews-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:39] 	 PROBLEM - LVS HTTPS IPv4 on wikisource-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:39] 	 PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:40] 	 PROBLEM - LVS HTTP IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:45] 	 PROBLEM - Frontend Squid HTTP on cp1006 is CRITICAL: Connection refused
[21:12:45] 	 PROBLEM - Frontend Squid HTTP on cp1004 is CRITICAL: Connection refused
[21:12:45] 	 PROBLEM - Frontend Squid HTTP on cp1008 is CRITICAL: Connection refused
[21:12:45] 	 PROBLEM - Frontend Squid HTTP on cp1013 is CRITICAL: Connection refused
[21:12:45] 	 PROBLEM - Frontend Squid HTTP on cp1009 is CRITICAL: Connection refused
[21:12:46] 	 PROBLEM - Frontend Squid HTTP on cp1007 is CRITICAL: Connection refused
[21:12:46] 	 PROBLEM - Frontend Squid HTTP on cp1010 is CRITICAL: Connection refused
[21:12:47] 	 PROBLEM - Frontend Squid HTTP on cp1012 is CRITICAL: Connection refused
[21:12:47] 	 PROBLEM - Frontend Squid HTTP on cp1017 is CRITICAL: Connection refused
[21:12:48] 	 PROBLEM - Frontend Squid HTTP on cp1014 is CRITICAL: Connection refused
[21:12:48] 	 PROBLEM - Frontend Squid HTTP on cp1015 is CRITICAL: Connection refused
[21:12:49] 	 PROBLEM - Frontend Squid HTTP on cp1011 is CRITICAL: Connection refused
[21:12:49] 	 PROBLEM - Frontend Squid HTTP on cp1018 is CRITICAL: Connection refused
[21:12:50] 	 PROBLEM - Frontend Squid HTTP on cp1020 is CRITICAL: Connection refused
[21:12:50] 	 PROBLEM - Frontend Squid HTTP on cp1016 is CRITICAL: Connection refused
[21:12:51] 	 PROBLEM - Frontend Squid HTTP on cp1001 is CRITICAL: Connection refused
[21:12:51] 	 PROBLEM - LVS HTTPS IPv4 on mediawiki-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:52] 	 PROBLEM - LVS HTTP IPv4 on mediawiki-lb.eqiad.wikimedia.org is CRITICAL: Connection refused
[21:12:52] 	 PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:53] 	 PROBLEM - LVS HTTP IPv4 on mediawiki-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused
[21:12:53] 	 PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:54] 	 PROBLEM - LVS HTTPS IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:54] 	 PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:55] 	 PROBLEM - LVS HTTP IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:55] 	 PROBLEM - LVS HTTP IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:56] 	 PROBLEM - LVS HTTP IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:56] 	 PROBLEM - LVS HTTPS IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:12:57] 	 PROBLEM - Frontend Squid HTTP on cp1003 is CRITICAL: Connection refused
[21:12:57] 	 PROBLEM - LVS HTTP IPv4 on wikisource-lb.eqiad.wikimedia.org is CRITICAL: Connection refused
[21:13:03] 	 PROBLEM - Frontend Squid HTTP on cp1002 is CRITICAL: Connection refused
[21:13:03] 	 PROBLEM - LVS HTTPS IPv4 on mediawiki-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:13:03] 	 PROBLEM - LVS HTTP IPv6 on wikibooks-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:13:03] 	 PROBLEM - LVS HTTPS IPv6 on wikibooks-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:13:12] 	 PROBLEM - LVS HTTP IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:13:12] 	 PROBLEM - LVS HTTPS IPv6 on foundation-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:13:12] 	 PROBLEM - LVS HTTP IPv4 on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: Connection refused
[21:13:12] 	 PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:13:12] 	 PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:13:13] 	 PROBLEM - LVS HTTP IPv4 on wikinews-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused
[21:13:13] 	 PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:13:14] 	 PROBLEM - LVS HTTP IPv4 on wikiquote-lb.esams.wikimedia.org is CRITICAL: Connection refused
[21:13:14] 	 PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:13:15] 	 PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[21:13:21] 	 PROBLEM - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused
[21:13:21] 	 PROBLEM - LVS HTTP IPv4 on wikimedia-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused
[21:13:22] 	 PROBLEM - LVS HTTP IPv4 on wikibooks-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused
[21:13:22] 	 PROBLEM - LVS HTTPS IPv4 on wikinews-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:13:22] 	 PROBLEM - LVS HTTP IPv4 on wikinews-lb.eqiad.wikimedia.org is CRITICAL: Connection refused
[21:13:22] 	 PROBLEM - LVS HTTP IPv4 on wikiquote-lb.eqiad.wikimedia.org is CRITICAL: Connection refused
[21:13:22] 	 PROBLEM - LVS HTTP IPv4 on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: Connection refused
[21:13:23] 	 PROBLEM - LVS HTTP IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:13:23] 	 PROBLEM - LVS HTTP IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[21:14:39] 	 RobH: ping?
[21:17:08] 	 hey Danny_B: it's down here too
[21:18:53] 	 all the ganglia graphs have flatlined
[21:20:03] 	 well, that's what happens when you block any incoming connection :P
[21:21:54] 	 Wikipedia's back!
[21:23:30] 	 it's up now, too
[21:23:42] 	 nope
[21:23:48] 	  en.wikipedia.org is Up
[21:24:03] 	 What happened?
[21:24:38] 	 you know, everyone else gets vacation, the wiki takes a 5 minute bathroom break and everyone's upset!
[21:25:34] 	 rofl
[21:26:08] 	 in reality, looks like a bad squid config was pushed, then when it was rolled back, the site came back up
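(The usual safeguard against pushing a bad squid config is a parse check in the deploy path. A minimal Python wrapper around squid's own parser, assuming the deploy host has the squid binary available.)

    import subprocess, sys

    def config_parses(conf_path):
        # squid -k parse only parses the config and exits; a non-zero exit
        # means the file is broken and should never reach production.
        result = subprocess.run(['squid', '-k', 'parse', '-f', conf_path],
                                capture_output=True, text=True)
        if result.returncode != 0:
            sys.stderr.write(result.stderr)
        return result.returncode == 0

    if __name__ == '__main__':
        sys.exit(0 if config_parses(sys.argv[1]) else 1)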
[21:41:34] 	 RECOVERY - Frontend Squid HTTP on sq71 is OK: HTTP OK HTTP/1.0 200 OK - 27541 bytes in 0.015 seconds
[21:41:34] 	 RECOVERY - Frontend Squid HTTP on sq78 is OK: HTTP OK HTTP/1.0 200 OK - 27544 bytes in 0.009 seconds
[21:41:44] 	 RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 48003 bytes in 0.211 seconds
[21:41:52] 	 RECOVERY - Frontend Squid HTTP on sq37 is OK: HTTP OK HTTP/1.0 200 OK - 27544 bytes in 0.005 seconds
[21:41:53] 	 RECOVERY - LVS HTTPS IPv6 on wikinews-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 61020 bytes in 0.033 seconds
[21:42:19] 	 RECOVERY - LVS HTTP IPv6 on wiktionary-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 61021 bytes in 0.009 seconds
[21:42:37] 	 RECOVERY - LVS HTTP IPv4 on wikinews-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 69490 bytes in 0.003 seconds
[21:43:45] 	 RobH: it looks like faidon has a solution to the puppet issue (https://rt.wikimedia.org/Ticket/Display.html?id=3366) when you've got a sec, can you give it a whirl?
[21:43:49] 	 RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39540 bytes in 0.168 seconds
[21:44:26] 	 RECOVERY - Frontend Squid HTTP on sq62 is OK: HTTP OK HTTP/1.0 200 OK - 27544 bytes in 0.006 seconds
[21:44:34] 	 RECOVERY - Frontend Squid HTTP on sq36 is OK: HTTP OK HTTP/1.0 200 OK - 27543 bytes in 0.019 seconds
[21:45:46] 	 RECOVERY - Frontend Squid HTTP on sq59 is OK: HTTP OK HTTP/1.0 200 OK - 27543 bytes in 0.015 seconds
[21:45:46] 	 RECOVERY - Frontend Squid HTTP on sq74 is OK: HTTP OK HTTP/1.0 200 OK - 27544 bytes in 0.016 seconds
[21:46:04] 	 RECOVERY - Frontend Squid HTTP on sq33 is OK: HTTP OK HTTP/1.0 200 OK - 27543 bytes in 0.006 seconds
[21:47:09] 	 !log added Dsc to the Gerrit 'integration' group. David now has access to the various integration/* repositories.
[21:47:17] 	 Logged the message, Master
[21:47:44] 	 RECOVERY - Frontend Squid HTTP on cp1012 is OK: HTTP OK HTTP/1.0 200 OK - 27544 bytes in 0.146 seconds
[21:48:28] 	 RECOVERY - Frontend Squid HTTP on amssq34 is OK: HTTP OK HTTP/1.0 200 OK - 27738 bytes in 0.473 seconds
[21:48:37] 	 RECOVERY - Frontend Squid HTTP on sq76 is OK: HTTP OK HTTP/1.0 200 OK - 27543 bytes in 0.015 seconds
[21:50:08] 	 RECOVERY - Frontend Squid HTTP on sq75 is OK: HTTP OK HTTP/1.0 200 OK - 27545 bytes in 0.018 seconds
[21:50:17] 	 RECOVERY - Frontend Squid HTTP on amssq37 is OK: HTTP OK HTTP/1.0 200 OK - 27734 bytes in 0.473 seconds
[21:50:17] 	 RECOVERY - Frontend Squid HTTP on amssq38 is OK: HTTP OK HTTP/1.0 200 OK - 27733 bytes in 0.472 seconds
[21:50:34] 	 RECOVERY - Frontend Squid HTTP on cp1011 is OK: HTTP OK HTTP/1.0 200 OK - 27544 bytes in 0.144 seconds
[22:02:18] 	 New patchset: J; "make validnames check case insensitive" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17470
[22:02:58] 	 New patchset: Aaron Schulz; "Switched all wikis to multiwrite backend." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17471
[22:02:58] 	 New review: RobLa; "Thanks Antoine!" [operations/puppet] (production) C: 1;  - https://gerrit.wikimedia.org/r/17465
[22:02:58] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17470
[22:03:09] 	 Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17471
[22:03:19] 	 RECOVERY - swift-account-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[22:05:10] 	 !log dist-upgrading sodium
[22:05:19] 	 Logged the message, Master
[22:14:33] 	 New patchset: Ryan Lane; "Ensure we don't accidentally upgrade" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17472
[22:14:34] 	 PROBLEM - swift-container-server on ms-be10 is CRITICAL: Connection refused by host
[22:14:43] 	 PROBLEM - swift-account-reaper on ms-be10 is CRITICAL: Connection refused by host
[22:14:43] 	 PROBLEM - swift-object-server on ms-be10 is CRITICAL: Connection refused by host
[22:14:43] 	 PROBLEM - swift-account-replicator on ms-be10 is CRITICAL: Connection refused by host
[22:14:52] 	 PROBLEM - swift-object-updater on ms-be10 is CRITICAL: Connection refused by host
[22:14:52] 	 PROBLEM - swift-container-replicator on ms-be10 is CRITICAL: Connection refused by host
[22:15:01] 	 PROBLEM - swift-container-updater on ms-be10 is CRITICAL: Connection refused by host
[22:15:10] 	 PROBLEM - swift-account-auditor on ms-be10 is CRITICAL: Connection refused by host
[22:15:13] 	 Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17472
[22:15:29] 	 PROBLEM - swift-object-auditor on ms-be10 is CRITICAL: Connection refused by host
[22:15:29] 	 PROBLEM - SSH on ms-be10 is CRITICAL: Connection refused
[22:15:46] 	 PROBLEM - swift-account-server on ms-be10 is CRITICAL: Connection refused by host
[22:16:04] 	 PROBLEM - swift-container-auditor on ms-be10 is CRITICAL: Connection refused by host
[22:16:04] 	 PROBLEM - swift-object-replicator on ms-be10 is CRITICAL: Connection refused by host
[22:21:37] 	 RECOVERY - swift-object-server on ms-be10 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[22:21:37] 	 RECOVERY - swift-account-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[22:21:37] 	 RECOVERY - swift-container-server on ms-be10 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[22:41:25] 	 PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[22:41:25] 	 PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[22:41:25] 	 PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[22:48:55] 	 !log upgrading glusterfs for project storage to 3.3
[22:49:03] 	 Logged the message, Master
[22:52:40] 	 PROBLEM - LDAP on virt0 is CRITICAL: Connection refused
[22:53:22] 	 o.O
[22:53:43] 	 New patchset: Bhartshorne; "when someone mislabels a file (i.e. named foo.png but it's actually jpg) the thumb name ends .png.jpeg.  adding .jpeg as an allowed extension." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17478
[22:54:29] 	 Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17478
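(The patch above works around files whose extension disagrees with their actual format. Detecting that mismatch only takes a magic-byte sniff; a minimal Python sketch for the common image types.)

    # Standard magic-byte signatures for the common image types.
    SIGNATURES = {
        b'\x89PNG\r\n\x1a\n': 'png',
        b'\xff\xd8\xff': 'jpeg',
        b'GIF87a': 'gif',
        b'GIF89a': 'gif',
    }

    def sniff_image_type(path):
        # Read just enough of the file to identify its real format,
        # regardless of what the filename claims.
        with open(path, 'rb') as f:
            head = f.read(8)
        for sig, kind in SIGNATURES.items():
            if head.startswith(sig):
                return kind
        return None

    def extension_matches(path):
        ext = path.rsplit('.', 1)[-1].lower().replace('jpg', 'jpeg')
        return sniff_image_type(path) == ext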
[22:55:01] 	 wow. virt0's ldap server is being slammed
[22:55:11] 	 nscd must be broken on some instances
[23:01:34] 	 nope. it's the stupid script on labs-nfs1
[23:01:40] 	 I really need to make that only manage keys
[23:02:06] 	 Ryan_Lane: there's a stack for that
[23:02:11] 	 heh
[23:04:22] 	 RECOVERY - LDAP on virt0 is OK: TCP OK - 0.015 second response time on port 389
[23:24:48] 	 New patchset: Ori.livneh; "Escape dot (.) in beacon regexp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17479
[23:25:26] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17479
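(Why the escape in that patchset matters: in a regexp an unescaped dot matches any character, so a beacon pattern quietly over-matches. A tiny Python demonstration; the path 'beacon.gif' is illustrative, not the actual pattern from the patch.)

    import re

    # Unescaped, the dot matches any character, so the pattern also fires
    # on URLs that merely resemble the beacon path.
    loose = re.compile(r'/beacon.gif')
    strict = re.compile(r'/beacon\.gif')  # equivalently: re.escape('/beacon.gif')

    assert loose.search('/beaconXgif')        # false positive
    assert not strict.search('/beaconXgif')   # fixed by escaping the dot
    assert strict.search('/beacon.gif')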
[23:27:56] 	 New patchset: Bhartshorne; "turning off the swiftcleaner" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17481
[23:28:37] 	 Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17481
[23:29:25] 	 maplebed: you're too fast, was about to amend you
[23:29:27] 	 ;P
[23:29:34] 	 doh.
[23:29:36] 	 sorry.
[23:29:42] 	 what'd you want to change?
[23:29:46] 	 you didn't know it was on the way
[23:29:51] 	 sift -> swift
[23:29:55] 	 heh...
[23:41:13] 	 Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17479
[23:52:22] 	 New patchset: Pyoungmeister; "mediawiki and application server modules." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17342
[23:53:00] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17342
[23:56:37] 	 New patchset: Pyoungmeister; "mediawiki and application server modules." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17342
[23:57:16] 	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17342