[00:00:23] Right, so gwicke's going first [00:00:43] I nominate Flow (who?) for second [00:00:47] I'll take sloppy thirds [00:01:04] alright, deploying [00:01:05] Unless the Flow team doesn't show up [00:01:22] spagewmf1: I assume you have something to do with this [00:01:26] gwicke: i can do it if no one from ops is around [00:01:31] and done [00:01:46] gwicke: should i restart? [00:01:58] ori, let's see if there are some that don't come back up [00:02:03] only those need to be restarted [00:02:07] okay [00:02:31] I'm kind of around :) [00:02:41] rdwrer or spagewmf1, you can go ahead [00:02:49] Right, flow is being slow flow [00:02:50] I'll go [00:02:53] paravoid, ahhhh.. ;) [00:03:03] it's a bit late [00:03:14] LIGHTNING DEPLOYYYYY [00:03:19] yeah, am I'm guessing [00:03:38] paravoid, could we chat about debs tomorrow? [00:03:53] rdwrer: I think our fix got in OK, let me check [00:03:58] I need to look at the emails first [00:04:21] k [00:04:42] I'd like to figure out the repo situation soon so that we can start publishing it [00:05:03] there is now testreduce (the rt test server), mathoid, parsoid and soon storoid [00:05:28] pdf renderer potentially too [00:06:10] rdwrer: good work on following the rules [00:06:50] greg-g: I am nothing if not lawful good [00:07:05] !log mholmquist synchronized php-1.23wmf13/extensions/MultimediaViewer/resources/mmv/mmv.lightboxinterface.js 'Fix for arrow keys in MultimediaViewer' [00:07:09] Lo, Thor - shine brightly on this deploy [00:07:13] Logged the message, Master [00:07:32] I'm done, just waiting for caches to clear so I can confirm [00:07:44] FLOWWWWWWWWWW TIIIIIIIIIME [00:07:51] rdwrer: thanks [00:08:10] * gwicke forgot to call service-restart [00:08:21] ori, restarting now [00:08:46] oh wow, salt does not emit broken output any more with one char every two lines [00:09:54] https://gist.github.com/gwicke/374c20f20efbcf4b4022 [00:10:14] looking into wtp1019 and wtp1016 [00:10:32] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused [00:10:38] gwicke: so apparently the fix for extreme Math slowness is to set $wgMathDisableTexFilter [00:10:40] * ori restarts wtp1015 [00:10:42] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused [00:10:50] does this make things worse off than the pre-refactored state?
[00:10:52] PROBLEM - Parsoid on wtp1014 is CRITICAL: Connection refused [00:11:03] PROBLEM - Parsoid on wtp1023 is CRITICAL: Connection refused [00:11:03] PROBLEM - Parsoid on wtp1021 is CRITICAL: Connection refused [00:11:03] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [00:11:03] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused [00:11:06] oh oh [00:11:09] pages that took like 2-3 seconds to parse can take like 26 [00:11:11] AGH [00:11:12] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [00:11:15] Guys I fucked it up [00:11:21] I read the wrong status update for mediawiki.org [00:11:23] people may have noticed ;) [00:11:24] rdwrer, don't worry about parsoid [00:11:32] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.002 second response time [00:11:33] I'll go after flow [00:11:35] it is unrelated / different cluster etc [00:11:42] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.007 second response time [00:11:43] gwicke: I think he means something else [00:11:45] rdwrer: almost there, waiting on zuul [00:11:51] greg-g, k [00:11:52] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [00:11:56] gwicke: i restarted it on all the ones that caused CRITICALs [00:12:02] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [00:12:03] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [00:12:03] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [00:12:03] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.006 second response time [00:12:04] this is a little ridiculous though [00:12:07] https://en.wikipedia.org/w/index.php?title=Real_projective_line&oldid=543928239 [00:12:13] ori, thanks [00:12:17] probably the worst perf regression in a while :( [00:12:43] AaronSchulz, my understanding is that caching is disabled completely? [00:13:15] ori, the upstart config looks all fine, and according to the docs upstart should send a kill after five seconds [00:13:18] it keeps shelling out to do syntax checks even if a png was made already [00:13:35] I have not been able to reproduce the behavior yet on non-prod machines [00:13:45] and can't test on prod machines [00:14:00] AaronSchulz: ah, that sounds stupid [00:14:16] !log ebernhardson synchronized php-1.23wmf13/extensions/Flow [00:14:19] rdwrer: all done, you're back up. [00:14:24] Logged the message, Master [00:14:37] gwicke: five seconds after what? [00:14:48] ori, five seconds after sigterm [00:14:58] if the process has not exited yet, it sends a sigkill [00:15:06] gwicke: and what is actually happening? [00:15:21] ori, hard to tell [00:15:23] it is sending sigterm, but no sigkill when the process fails to exit? [00:15:35] I doubt that [00:16:16] one theory is that re-forking of children throws off the upstart ptrace stuff [00:16:24] are you setting SO_REUSEADDR on the listening socket?
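
Some background on the re-forking theory gwicke floats here, before ori's SO_REUSEADDR question gets picked up below: upstart's `expect fork` and `expect daemon` stanzas trace a daemon's initial forks to learn which pid to supervise, and assume the fork count is fixed at startup. A master that forks replacement workers later can leave upstart supervising a worker's pid rather than the master's, so a restart sends SIGTERM to the wrong process. A minimal sketch of the pattern, not the actual Parsoid server code; pool size and port are invented:

```javascript
// Hypothetical sketch of the re-forking pattern under discussion, not
// the actual Parsoid server code; the pool size and port are invented.
// The master forks a pool of workers and forks a replacement whenever
// one dies.  Upstart only traces forks at startup, so the later forks
// are invisible to its pid tracking.
var cluster = require('cluster');
var http = require('http');

if (cluster.isMaster) {
    for (var i = 0; i < 4; i++) {
        cluster.fork();
    }
    cluster.on('exit', function (worker, code, signal) {
        console.log('worker ' + worker.process.pid + ' died, re-forking');
        cluster.fork();  // the kind of fork that throws off the tracking
    });
} else {
    http.createServer(function (req, res) {
        res.end('hello from worker ' + process.pid + '\n');
    }).listen(8000);
}
```

Upstart has no reliable way to follow those later forks, which is why the systemd/cgroups comparison comes up further down the thread.
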
[00:16:35] I verified that the expect setting is fine [00:16:36] LIGHTNING DEPLOY AGAAAAAAAAIN [00:16:42] Reedy: patch for that error here https://gerrit.wikimedia.org/r/#/c/113301/ [00:16:46] ori, afaik that is the node default [00:16:48] !log mholmquist synchronized php-1.23wmf14/extensions/MultimediaViewer/resources/mmv/mmv.lightboxinterface.js 'Fix for arrow keys in MultimediaViewer' [00:16:57] Logged the message, Master [00:17:05] gwicke: i don't think so; it was an issue with the statsd upstart job [00:17:06] Effin a [00:17:09] Still testing [00:17:21] OK we're good, /me done [00:17:24] whew [00:17:25] gwicke: it would try to restart, fail to bind the port, and exit [00:17:33] ebernhardson: you all tested and such? [00:17:35] statsd being a nodejs app [00:17:52] ori, http://nodejs.org/api/net.html#net_server_listen_port_host_backlog_callback [00:17:59] "Note: All sockets in Node set SO_REUSEADDR already" [00:18:20] greg-g: yup [00:18:27] cool [00:18:30] superm401: you're up [00:18:39] Alright, doing the cherrypick [00:19:12] ori, I have seen two copies of node running before after a restart, which made me suspect that upstart sometimes kills the wrong process [00:19:29] when a worker dies it is re-forked [00:19:45] maybe that throws off the upstart pid tracking [00:20:09] was not an issue with the init.d script [00:20:26] ahhh could very well be [00:21:22] the systemd folks mention that their cgroups stuff is more reliable for forking daemons [00:22:12] I might just revert to the new init script on Ubuntu if upstart continues to create issues [00:22:40] https://gerrit.wikimedia.org/r/#/c/110666/32/debian/parsoid.init [00:32:56] (PS1) Springle: remove es[123] for decom [operations/puppet] - https://gerrit.wikimedia.org/r/113306 [00:33:00] greg-g, I'm getting a sync-dir error: [00:33:08] mflaschen@tin:/a/common (master)$ sync-dir 'Sync for GENDER fix to jQueryMsg' php-1.23wmf13/resources/mediawiki/ [00:33:10] Target file is not a directory [00:33:14] ori, this is parsoid.log from wtp1001, which is not currently reachable: [00:33:15] https://gist.github.com/gwicke/4b17f1837258027fb392 [00:33:27] superm401: order of parameters :) [00:33:30] superm401: dir comes first [00:33:35] Doh [00:34:22] (CR) Springle: [C: 2] remove es[123] for decom [operations/puppet] - https://gerrit.wikimedia.org/r/113306 (owner: Springle) [00:34:35] !log mflaschen synchronized php-1.23wmf13/resources/mediawiki/ 'Sync for GENDER fix to jQueryMsg' [00:34:43] Logged the message, Master [00:36:14] !log mflaschen synchronized php-1.23wmf13/tests/qunit/suites/resources/mediawiki/mediawiki.jqueryMsg.test.js 'Sync for GENDER fix to jQueryMsg' [00:36:22] Logged the message, Master [00:40:29] !log mflaschen synchronized php-1.23wmf14/resources/mediawiki/ 'Sync for GENDER fix to jQueryMsg' [00:40:37] Logged the message, Master [00:41:35] (PS3) Yurik: Updated whitelisted language lists to match config [operations/puppet] - https://gerrit.wikimedia.org/r/113168 (owner: QChris) [00:41:51] ori, some more parsoids look unhappy [00:41:54] !log mflaschen synchronized php-1.23wmf14/tests/qunit/suites/resources/mediawiki/mediawiki.jqueryMsg.test.js 'Sync for GENDER fix to jQueryMsg' [00:41:58] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Parsoid%2520eqiad&tab=m&vn= [00:42:03] Logged the message, Master [00:42:14] Done, greg-g, sorry I ran over.
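
To unpack the SO_REUSEADDR exchange above: node does set SO_REUSEADDR on listening sockets by default, as the linked docs say, but that only lets a new process bind past sockets lingering in TIME_WAIT; it does not allow binding while the old process is still alive and listening. That matches ori's statsd anecdote, where the replacement process would "fail to bind the port, and exit". A minimal sketch of the defensive retry pattern from the node net docs, with the port and retry delay invented:

```javascript
// Minimal sketch of the EADDRINUSE retry pattern; port and delay are
// invented.  SO_REUSEADDR (node's default) covers TIME_WAIT sockets,
// not a live listener left behind by a botched restart, so the bind
// can still fail and, without an error handler, crash the process.
var net = require('net');

var server = net.createServer(function (socket) {
    socket.end('ok\n');
});

server.on('error', function (err) {
    if (err.code !== 'EADDRINUSE') {
        throw err;
    }
    console.log('port in use, retrying in 1s...');
    setTimeout(function () {
        server.listen(8000);
    }, 1000);
});

server.listen(8000);
```

A loop like this keeps the new instance alive long enough for the old one to die; it papers over a broken restart rather than fixing it.
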
[00:42:31] wtp1001, 1006, 1014, 1015, 1020, 1021, 1023 [00:43:36] only wtp1001 seems to be all down [00:44:12] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.007 second response time [00:44:55] i restarted them [00:45:01] but not doing that again, this is silly [00:45:04] ori, thanks! [00:46:08] (PS1) Spage: Add qa_automation group and grant it Flow rights [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113311 [00:46:09] it is definitely annoying, and not what I had hoped for by moving to upstart [00:46:35] Upstart? More like restart. [00:46:39] Gloria: Hush. [00:47:03] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [00:47:13] i restarted that one already [00:47:39] ori, based on the log my suspicion is that upstart is sending sigterm to random workers [00:48:09] it does not seem to send a sigkill [00:48:28] but at the same time some workers don't seem to exit in time before the restart [00:48:29] gwicke: as an ugly workaround, try having a pre-start clause that kills any lingering instances [00:48:51] which explains why the service is then wedged [00:49:59] yeah, but then moving to start-stop-daemon might actually be better [00:52:27] (PS1) Springle: x1 depool db1030 for maintenance [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113313 [00:52:57] (CR) Springle: [C: 2] x1 depool db1030 for maintenance [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113313 (owner: Springle) [00:53:20] (CR) Yurik: [C: -1] "I placed all the missmatched languages into patch https://gerrit.wikimedia.org/r/#/c/113168/ (there were a few more i found). I think we " [operations/puppet] - https://gerrit.wikimedia.org/r/113167 (owner: QChris) [00:53:57] !log springle synchronized wmf-config/db-eqiad.php 'x1 depool db1030 for maintenance' [00:54:05] Logged the message, Master [00:57:51] (PS1) Springle: reassign db1030 to s6 [operations/puppet] - https://gerrit.wikimedia.org/r/113314 [01:01:21] (CR) Springle: [C: 2] reassign db1030 to s6 [operations/puppet] - https://gerrit.wikimedia.org/r/113314 (owner: Springle) [01:06:38] gwicke: did you determine how git-deploy/salt is attempting to restart the process? [01:08:04] ori, salt is using some python module it seems [01:08:18] I reported a bug against it as it was preferring init.d over upstart [01:08:50] apparently it looks for the files itself [01:09:39] !log restarting EventLogging on vanadium [01:09:49] Logged the message, Master [01:15:26] !log xtrabackup clone db1010 to db1030 [01:15:39] Logged the message, Master [01:30:32] (CR) CSteipp: [C: 1] "Thanks" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113311 (owner: Spage) [01:42:26] (PS1) GWicke: Wait 60 seconds before killing the parsoid master [operations/puppet] - https://gerrit.wikimedia.org/r/113316 [01:47:09] (CR) GWicke: "Also see https://gerrit.wikimedia.org/r/#/c/113318/ for a related parsoid change" [operations/puppet] - https://gerrit.wikimedia.org/r/113316 (owner: GWicke) [02:22:19] springle: Have you seen bug 61319? It seems like the page table for enwiki on db1056 (maybe others too?) is somehow out of sync.
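
Pulling the parsoid restart thread above together: gwicke's reading is that upstart SIGTERMs a (possibly wrong) process, never follows up with SIGKILL, and some workers outlive the restart, which is why the service ends up wedged. The Gerrit change above, "Wait 60 seconds before killing the parsoid master", presumably lengthens upstart's kill timeout, the gap between SIGTERM and SIGKILL that gwicke quotes as five seconds by default. A hypothetical sketch of the node-side half of that contract, not the actual change: the master drains workers on SIGTERM and force-kills stragglers inside the grace period. Pool size, port, and the 55-second figure are invented.

```javascript
// Hypothetical sketch, not the actual Gerrit change: on SIGTERM the
// master stops re-forking, asks workers to drain, and SIGKILLs any
// stragglers just inside a 60-second upstart kill timeout.
var cluster = require('cluster');
var http = require('http');

if (cluster.isMaster) {
    var shuttingDown = false;

    cluster.on('exit', function (worker) {
        if (!shuttingDown) {
            cluster.fork();  // normal operation: replace dead workers
        } else if (Object.keys(cluster.workers).length === 0) {
            process.exit(0); // last worker gone, master exits cleanly
        }
    });

    for (var i = 0; i < 4; i++) {
        cluster.fork();
    }

    process.on('SIGTERM', function () {
        shuttingDown = true;
        Object.keys(cluster.workers).forEach(function (id) {
            cluster.workers[id].disconnect();  // let in-flight work finish
        });
        setTimeout(function () {
            Object.keys(cluster.workers).forEach(function (id) {
                cluster.workers[id].process.kill('SIGKILL');
            });
            process.exit(0);
        }, 55 * 1000);  // grace period, chosen to beat the 60s timeout
    });
} else {
    http.createServer(function (req, res) {
        res.end('ok\n');
    }).listen(8000);
}
```

With the master exiting promptly and predictably, a pre-start cleanup clause like the one ori suggests becomes a belt-and-braces measure rather than the main line of defence.
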
[02:22:40] (PS1) Aaron Schulz: Set Memcached retry_timeout to -1 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113321 [02:23:12] (PS2) Aaron Schulz: Set Memcached retry_timeout to -1 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113321 [02:23:33] (CR) Aaron Schulz: [C: 2] Set Memcached retry_timeout to -1 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113321 (owner: Aaron Schulz) [02:23:44] (Merged) jenkins-bot: Set Memcached retry_timeout to -1 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113321 (owner: Aaron Schulz) [02:24:53] !log aaron synchronized wmf-config/mc.php 'Set Memcached retry_timeout to -1' [02:25:02] Logged the message, Master [02:25:25] * AaronSchulz wonders if cygwin doesn't disable Nagle :s [02:26:49] anomie: no hadn't seen it. looking [02:27:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:03] !log LocalisationUpdate completed (1.23wmf13) at 2014-02-14 02:27:02+00:00 [02:27:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:09] Logged the message, Master [02:27:22] PROBLEM - Apache HTTP on srv300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:03] PROBLEM - Apache HTTP on mw122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:03] PROBLEM - Apache HTTP on srv277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:04] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:53] RECOVERY - Apache HTTP on srv277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.565 second response time [02:28:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.566 second response time [02:28:53] RECOVERY - Apache HTTP on mw122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.587 second response time [02:28:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.570 second response time [02:29:02] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.395 second response time [02:29:03] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.261 second response time [02:29:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:53] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.957 second response time [02:30:04] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:04] PROBLEM - Apache HTTP on mw113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:04] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:04] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:22] PROBLEM - Apache HTTP on mw61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:52] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301
Moved Permanently - 808 bytes in 0.967 second response time [02:31:02] PROBLEM - Apache HTTP on mw124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:31:03] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:31:03] PROBLEM - Apache HTTP on srv254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:31:12] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.030 second response time [02:31:22] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.066 second response time [02:31:53] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.062 second response time [02:32:02] RECOVERY - Apache HTTP on mw113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.113 second response time [02:32:03] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.228 second response time [02:32:03] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.280 second response time [02:32:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:32:22] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.748 second response time [02:32:30] (PS1) Springle: depol db1056 for pt-table-sync checks bug 61319 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113322 [02:32:53] RECOVERY - Apache HTTP on mw124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.971 second response time [02:32:53] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.051 second response time [02:32:54] (CR) Springle: [C: 2] depol db1056 for pt-table-sync checks bug 61319 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113322 (owner: Springle) [02:33:00] (Merged) jenkins-bot: depol db1056 for pt-table-sync checks bug 61319 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113322 (owner: Springle) [02:34:03] PROBLEM - Apache HTTP on mw122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:34:23] !log springle synchronized wmf-config/db-eqiad.php 'depool db1056 for pt-table-sync bug 61319' [02:34:31] Logged the message, Master [02:35:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.206 second response time [02:35:53] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.569 second response time [02:35:53] RECOVERY - Apache HTTP on mw122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.005 second response time [02:36:02] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.132 second response time [02:36:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:36:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:36:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:37:03] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.225 second response time [02:37:53] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.030 second response time [02:38:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket
timeout after 10 seconds [02:38:03] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:38:53] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.066 second response time [02:39:03] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.208 second response time [02:39:03] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.278 second response time [02:39:03] PROBLEM - Apache HTTP on mw124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:39:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:39:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.953 second response time [02:39:53] RECOVERY - Apache HTTP on mw124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.992 second response time [02:40:32] PROBLEM - Apache HTTP on mw62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:41:22] RECOVERY - Apache HTTP on mw62 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.100 second response time [02:42:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:42:35] that is a lot of things [02:43:03] PROBLEM - Apache HTTP on mw124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:43:22] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.208 second response time [02:43:53] RECOVERY - Apache HTTP on mw124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.535 second response time [02:44:50] bd808|BUFFER: I can't get graphs to go back very far in Kibana. Is there some low elastic query result size limit? 
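
On bd808's Kibana question just above: one plausible culprit is elasticsearch's default result size, since a search returns only 10 hits unless the request sets `size` explicitly. A hypothetical query to check that directly; the host, port and index name are invented, and the real logstash index naming may differ.

```javascript
// Hypothetical query against a logstash-style index showing the `size`
// parameter; elasticsearch defaults to 10 hits per search.  The host,
// port and index name below are made up for illustration.
var http = require('http');

var body = JSON.stringify({
    query: { match_all: {} },
    sort: [{ '@timestamp': { order: 'desc' } }],
    size: 500   // explicit; the default is 10
});

var req = http.request({
    host: 'logstash.example.org',
    port: 9200,
    path: '/logstash-2014.02.14/_search',
    method: 'POST',
    headers: {
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(body)
    }
}, function (res) {
    var data = '';
    res.on('data', function (chunk) { data += chunk; });
    res.on('end', function () {
        var hits = JSON.parse(data).hits;
        console.log(hits.hits.length + ' of ' + hits.total + ' events');
    });
});

req.end(body);
```
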
[02:45:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:02] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.245 second response time [02:46:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:53] PROBLEM - Apache HTTP on mw76 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:53] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.569 second response time [02:47:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:42] PROBLEM - Apache HTTP on srv299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:52] RECOVERY - Apache HTTP on mw76 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.163 second response time [02:48:02] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.098 second response time [02:48:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.205 second response time [02:48:22] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.280 second response time [02:48:32] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.012 second response time [02:50:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:50:03] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:03] PROBLEM - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:52] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.062 second response time [02:52:02] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.100 second response time [02:52:02] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.248 second response time [02:52:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:52:08] Attention everyone: I'm syncing a one-file change to MultimediaViewer to fix errors in prod. 
[02:52:53] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.066 second response time [02:53:53] RECOVERY - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 4405 bytes in 0.550 second response time [02:54:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:54:53] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.028 second response time [02:55:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:55:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:55:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:55:03] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:55:52] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.061 second response time [02:56:02] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.136 second response time [02:56:03] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.278 second response time [02:56:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:56:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:56:22] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.279 second response time [02:56:53] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.958 second response time [02:57:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:57:53] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.542 second response time [02:57:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.568 second response time [02:57:54] any roots around to poke a parsoid box? [02:58:02] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.242 second response time [02:58:03] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.276 second response time [02:58:05] (PS1) Chad: Remove old public key [operations/puppet] - https://gerrit.wikimedia.org/r/113326 [02:59:22] PROBLEM - Apache HTTP on srv300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:59:30] Are you getting hit, too, gwicke?
[02:59:38] I'm having a hell of a time sync-file'ing [02:59:51] rdwrer, nope [03:00:06] just a box that didn't restart correctly and is not accepting any traffic [03:00:12] jgage, ori bblack springle ^^ apaches flapping, also gwicke would like some help if you're around :) [03:00:27] !log mholmquist synchronized php-1.23wmf14/extensions/MultimediaViewer/resources/mmv/ui/mmv.ui.metadataPanel.js 'Fix for arrow keys in MultimediaViewer' [03:00:35] Logged the message, Master [03:00:42] relatively low prio compared to actual site breakage [03:01:03] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:22] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.116 second response time [03:01:42] PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:53] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.960 second response time [03:02:02] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.279 second response time [03:02:02] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:19] gwicke, not noticing any actual user-facing impact yet [03:02:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:31] it's all pmtpa stuff. don't know why [03:02:32] Eloquence, there is none [03:02:42] from the parsoid side at least [03:02:42] PROBLEM - Apache HTTP on mw73 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:46] hence low prio [03:02:50] yeah, I meant the apaches [03:03:02] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.095 second response time [03:03:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.205 second response time [03:03:05] ah, didn't check those [03:03:12] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.538 second response time [03:03:42] RECOVERY - Apache HTTP on mw73 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.420 second response time [03:03:50] so not paging any opsen just yet [03:04:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:04:11] enwiki seems to work fine [03:04:20] rdwrer, which branch is faulty? 
[03:04:32] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.062 second response time [03:05:50] gwicke: Not totally sure, this is above my pay grade [03:05:58] I just pushed to wmf14 [03:06:02] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.239 second response time [03:06:02] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:06:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:06:06] But I doubt that was the issue, it was an extension update [03:06:19] Anyway I'm off now [03:06:51] hmm [03:06:53] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.568 second response time [03:06:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.569 second response time [03:09:02] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:03] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:32] PROBLEM - Apache HTTP on mw64 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:53] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.569 second response time [03:10:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.031 second response time [03:11:22] RECOVERY - Apache HTTP on mw64 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.955 second response time [03:13:03] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:13:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:13:53] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.017 second response time [03:13:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.025 second response time [03:14:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:14:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:14:22] PROBLEM - Apache HTTP on srv300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:14:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:14:53] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.958 second response time [03:14:53] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.028 second response time [03:15:13] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.612 second response time [03:15:13] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.068 second response time [03:16:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.207 second response time [03:16:22] PROBLEM - Apache HTTP on srv292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:03] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:03] PROBLEM - 
Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:12] RECOVERY - Apache HTTP on srv292 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.958 second response time [03:18:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:18:34] mwdeploy rsync on pmtpa app servers [03:18:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.569 second response time [03:18:53] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.191 second response time [03:19:02] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.167 second response time [03:19:52] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.993 second response time [03:19:53] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.060 second response time [03:20:38] !log LocalisationUpdate completed (1.23wmf14) at 2014-02-14 03:20:37+00:00 [03:20:44] Logged the message, Master [03:21:22] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=API+application+servers+pmtpa&m=cpu_report&s=by+name&mc=2&g=network_report [03:25:05] just a symptom [03:26:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:26:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:27:02] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:27:42] springle, this is all due to the LocalisationUpdates? [03:27:52] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.555 second response time [03:28:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:29:03] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.243 second response time [03:29:03] PROBLEM - Apache HTTP on mw113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:29:52] unsure. if i'm reading librenms properly, cr2-pmtpa saturated [03:29:52] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.956 second response time [03:30:02] RECOVERY - Apache HTTP on mw113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.105 second response time [03:30:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:30:06] whether due to localisation or not... 
[03:30:22] PROBLEM - Apache HTTP on srv298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:30:42] PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:31:12] RECOVERY - Apache HTTP on srv298 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.919 second response time [03:31:53] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.064 second response time [03:31:58] (PS2) Yurik: Zero: 470-01 now handles M & Zero, on both Opera & regular [operations/puppet] - https://gerrit.wikimedia.org/r/113299 [03:32:42] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.167 second response time [03:33:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:33:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:34:01] (CR) BBlack: [C: 2 V: 2] Zero: 470-01 now handles M & Zero, on both Opera & regular [operations/puppet] - https://gerrit.wikimedia.org/r/113299 (owner: Yurik) [03:34:02] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.238 second response time [03:34:32] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused [03:35:02] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:35:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:35:03] PROBLEM - Apache HTTP on mw113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:35:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:36:02] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.132 second response time [03:36:03] RECOVERY - Apache HTTP on mw113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.244 second response time [03:36:03] PROBLEM - Apache HTTP on mw59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:36:53] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.571 second response time [03:36:53] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.958 second response time [03:37:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.135 second response time [03:38:07] (PS2) Yurik: Removed obsolete carrier 405-25 [operations/puppet] - https://gerrit.wikimedia.org/r/113289 [03:39:12] PROBLEM - Apache HTTP on mw67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:39:40] (CR) BBlack: [C: 2 V: 2] Removed obsolete carrier 405-25 [operations/puppet] - https://gerrit.wikimedia.org/r/113289 (owner: Yurik) [03:39:45] springle, odd, network graph seems fine now.
but the LDAP alert above may need more urgent attention [03:40:02] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.274 second response time [03:40:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:40:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:40:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:40:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:40:11] yeah something else going on [03:40:12] RECOVERY - Apache HTTP on mw67 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.993 second response time [03:40:30] (PS4) QChris: Updated whitelisted language lists to match config [operations/puppet] - https://gerrit.wikimedia.org/r/113168 [03:41:02] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.098 second response time [03:41:42] PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:53] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.568 second response time [03:42:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:42:03] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:42:32] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.035 second response time on port 389 [03:42:42] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.204 second response time [03:42:49] (CR) BBlack: [C: 2 V: 2] Updated whitelisted language lists to match config [operations/puppet] - https://gerrit.wikimedia.org/r/113168 (owner: QChris) [03:42:52] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.960 second response time [03:43:02] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:43:19] !log restarted opendj on virt0 [03:43:27] Logged the message, Master [03:43:53] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.561 second response time [03:43:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.063 second response time [03:44:00] thanks springle, can get back into wikitech now [03:44:02] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.136 second response time [03:44:03] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.175 second response time [03:44:03] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:44:22] PROBLEM - Apache HTTP on mw61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:44:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:44:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.030 second response time [03:45:02] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.140 second response time [03:45:14] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.029 second response time [03:50:53] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.920 second
response time [03:51:02] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.204 second response time [03:51:42] PROBLEM - Apache HTTP on srv299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:52:02] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:02] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.158 second response time [03:53:02] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:54:02] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.138 second response time [03:54:02] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:54:02] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:54:32] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.563 second response time [03:54:53] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.030 second response time [03:55:02] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.204 second response time [03:56:22] PROBLEM - Apache HTTP on mw61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:56:22] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:57:12] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.926 second response time [03:57:22] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.279 second response time [03:57:32] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:58:02] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:58:53] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.062 second response time [03:59:02] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.248 second response time [04:00:02] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:02:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:02:38] TimStarling, got some time to look into the cluster flappiness above? [04:02:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.465 second response time [04:03:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.209 second response time [04:03:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:03:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:03:20] yes [04:03:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:03:32] pmtpa servers? [04:03:40] yes [04:03:53] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.066 second response time [04:04:32] I notice git.wm.o just went down as well, is that still in tampa? 
[04:05:02] PROBLEM - Apache HTTP on mw116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:05:06] also ldap on labs died earlier [04:05:12] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.531 second response time [04:05:24] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-14 04:05:23+00:00 [04:05:32] Logged the message, Master [04:05:53] RECOVERY - Apache HTTP on mw116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.540 second response time [04:06:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:06:22] PROBLEM - Apache HTTP on mw63 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:06:29] gitblit you mean? it's antimony.eqiad.wmnet [04:07:02] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.239 second response time [04:07:06] yeah, see alert above - git.wikimedia.org now throwing errors/timing out [04:07:27] amusingly, the documented method of restart for gitblit doesn't work [04:07:44] what is librenms? [04:08:02] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.274 second response time [04:08:02] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:08:04] and how do I log in to it? [04:08:09] the observium replacement [04:08:12] RECOVERY - Apache HTTP on mw63 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.240 second response time [04:08:22] PROBLEM - Apache HTTP on mw61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:08:27] go to observium.wikimedia.org, should redirect. same pw as before [04:08:53] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.976 second response time [04:08:53] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.995 second response time [04:09:02] PROBLEM - Apache HTTP on mw116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:02] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:22] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.168 second response time [04:09:39] ok, I'm in [04:09:39] bbl, thanks for poking [04:09:53] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.995 second response time [04:09:53] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.034 second response time [04:13:02] RECOVERY - Apache HTTP on mw116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.208 second response time [04:13:02] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:13:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:13:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:13:22] PROBLEM - Apache HTTP on mw61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:14:02] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.243 second response time [04:14:22] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.142 second response time [04:14:32] RECOVERY - 
gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 474987 bytes in 8.653 second response time [04:16:02] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.226 second response time [04:16:02] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:16:19] the servers are idle, it's not an overload [04:16:53] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.991 second response time [04:17:02] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.465 second response time [04:20:02] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:20:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:20:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:21:02] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.143 second response time [04:21:02] PROBLEM - Apache HTTP on mw116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:21:32] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [04:21:57] ^ that one is me, doing a reinstall [04:22:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:22:53] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.529 second response time [04:22:53] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.985 second response time [04:22:59] and there's no detectable packet loss from neon to mw61 etc. [04:23:02] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:23:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:23:22] PROBLEM - Apache HTTP on mw63 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:24:02] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:24:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.534 second response time [04:25:02] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.165 second response time [04:25:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:25:03] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:25:12] RECOVERY - Apache HTTP on mw63 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.023 second response time [04:25:52] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.245 second response time [04:25:53] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.530 second response time [04:25:53] RECOVERY - Apache HTTP on mw116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.538 second response time [04:26:02] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:26:42] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [04:26:42] PROBLEM - Apache HTTP on mw73 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:27:02] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:27:03] PROBLEM - Apache HTTP on srv253 is CRITICAL:
CRITICAL - Socket timeout after 10 seconds [04:27:03] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:27:42] RECOVERY - Apache HTTP on mw73 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.155 second response time [04:28:02] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.175 second response time [04:28:02] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.229 second response time [04:28:42] PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:28:52] PROBLEM - SSH on virt1002 is CRITICAL: Connection refused [04:29:02] PROBLEM - Disk space on virt1002 is CRITICAL: Connection refused by host [04:29:02] PROBLEM - DPKG on virt1002 is CRITICAL: Connection refused by host [04:29:12] PROBLEM - puppet disabled on virt1002 is CRITICAL: Connection refused by host [04:29:22] PROBLEM - RAID on virt1002 is CRITICAL: Connection refused by host [04:29:42] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.277 second response time [04:30:02] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.242 second response time [04:30:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:30:03] PROBLEM - Apache HTTP on srv254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:30:42] PROBLEM - Apache HTTP on srv299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:30:53] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.521 second response time [04:30:53] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.956 second response time [04:31:02] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:31:02] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:31:03] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:31:53] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.520 second response time [04:32:02] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.687 second response time [04:32:02] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:42] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.858 second response time [04:32:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.996 second response time [04:32:53] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.015 second response time [04:32:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.102 second response time [04:33:02] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.350 second response time [04:33:53] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.028 second response time [04:34:02] PROBLEM - Apache HTTP on mw116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:34:42] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:34:53] RECOVERY - Apache HTTP on mw116 is 
OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.074 second response time [04:35:42] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.160 second response time [04:36:02] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:36:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:36:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:36:03] PROBLEM - Apache HTTP on srv254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:36:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:36:53] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.920 second response time [04:37:02] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:37:03] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:37:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:37:53] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.534 second response time [04:37:53] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.084 second response time [04:38:02] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.137 second response time [04:38:02] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.171 second response time [04:38:02] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.204 second response time [04:38:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.536 second response time [04:39:02] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.160 second response time [04:41:02] PROBLEM - Apache HTTP on mw116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:41:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:41:12] PROBLEM - NTP on virt1002 is CRITICAL: NTP CRITICAL: No response from NTP server [04:42:02] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.166 second response time [04:42:02] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:42:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:42:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:42:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:43:02] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:44:02] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.159 second response time [04:44:02] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:44:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:44:03] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:44:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.568 second response time [04:44:53] RECOVERY - Apache 
HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.960 second response time [04:44:53] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.100 second response time [04:45:02] RECOVERY - Apache HTTP on mw116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.183 second response time [04:45:02] PROBLEM - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:45:53] RECOVERY - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 4405 bytes in 0.562 second response time [04:45:55] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.029 second response time [04:46:02] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.170 second response time [04:46:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.240 second response time [04:48:52] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [04:49:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:49:03] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:49:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:49:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:49:53] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.483 second response time [04:50:02] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.158 second response time [04:50:02] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.201 second response time [04:50:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:51:02] PROBLEM - Apache HTTP on mw116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:51:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:51:53] RECOVERY - SSH on virt1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [04:51:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.959 second response time [04:52:02] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [04:53:02] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.170 second response time [04:53:02] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:54:02] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:02] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:02] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:02] PROBLEM - Apache HTTP on srv254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:02] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:02] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:03] PROBLEM - Apache HTTP on mw124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:03] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:53] RECOVERY - Apache HTTP on srv249 is OK: 
HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.520 second response time [04:55:53] RECOVERY - Apache HTTP on mw116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.538 second response time [04:55:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.536 second response time [04:55:53] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.520 second response time [04:56:02] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.193 second response time [04:56:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.205 second response time [04:56:02] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.241 second response time [04:56:03] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.244 second response time [04:56:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:57:02] RECOVERY - Apache HTTP on mw124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.142 second response time [04:57:03] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.205 second response time [04:58:22] RECOVERY - RAID on virt1002 is OK: OK: Active: 16, Working: 16, Failed: 0, Spare: 0 [04:58:53] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.944 second response time [04:59:02] RECOVERY - Disk space on virt1002 is OK: DISK OK [04:59:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:59:03] RECOVERY - DPKG on virt1002 is OK: All packages OK [04:59:12] RECOVERY - puppet disabled on virt1002 is OK: OK [04:59:53] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.098 second response time [05:00:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:00:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:01:02] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.936 second response time [05:01:32] if i run the check_http check for a random apache in tampa in a loop, every so often i'll get a result >5s, occasionally even a timeout [05:01:41] on neon, that is [05:02:02] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:02:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:02:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:02:03] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:02:07] on iron too, actually [05:02:16] watch -n5 /usr/bin/time -f '%E' /usr/lib/nagios/plugins/check_http -H en.wikipedia.org -I 10.0.11.60 -u / [05:02:41] so it's not icinga's fault and not neon's fault [05:02:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.194 second response time [05:04:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:04:03] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:04:18] (03PS1) 10Andrew Bogott: Add a second compute node. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/113330 [05:04:29] (03CR) 10Andrew Bogott: [C: 032] Change eqiad instance IP range. [operations/puppet] - 10https://gerrit.wikimedia.org/r/113136 (owner: 10Andrew Bogott) [05:04:52] PROBLEM - Apache HTTP on mw60 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:04:53] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.572 second response time [05:04:53] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.571 second response time [05:05:02] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:42] RECOVERY - Apache HTTP on mw60 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.530 second response time [05:06:02] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.083 second response time [05:06:02] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.155 second response time [05:06:02] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.206 second response time [05:06:02] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.241 second response time [05:06:03] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:06:10] (03PS2) 10Andrew Bogott: Add a second compute node. [operations/puppet] - 10https://gerrit.wikimedia.org/r/113330 [05:06:53] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.570 second response time [05:07:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.133 second response time [05:07:02] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.226 second response time [05:07:59] (03CR) 10Andrew Bogott: [C: 032] Add a second compute node. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/113330 (owner: 10Andrew Bogott) [05:08:52] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.099 second response time [05:09:02] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:09:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:09:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:09:03] PROBLEM - Apache HTTP on mw124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:09:42] PROBLEM - Apache HTTP on srv299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:09:53] RECOVERY - Apache HTTP on mw124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.575 second response time [05:10:02] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:10:22] PROBLEM - Apache HTTP on mw67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:10:42] PROBLEM - Apache HTTP on srv252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:11:02] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:11:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:11:12] RECOVERY - Apache HTTP on mw67 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.534 second response time [05:11:32] RECOVERY - Apache HTTP on srv252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.087 second response time [05:11:42] PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:11:53] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.099 second response time [05:12:02] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.120 second response time [05:12:03] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.277 second response time [05:12:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:12:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:12:32] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.533 second response time [05:12:53] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.567 second response time [05:13:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.566 second response time [05:13:53] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.050 second response time [05:14:12] RECOVERY - NTP on virt1002 is OK: NTP OK: Offset -0.001201152802 secs [05:14:32] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.093 second response time [05:15:02] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:03] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:53] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.569 second response time 
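
The one-liner springle pastes above at [05:02:16] times a single Tampa apache with the stock Nagios plugin. A minimal standalone sketch of the same probe, assuming the standard nagios-plugins path and reusing the host/IP from the transcript; it prints only samples slower than the five seconds springle mentions:

    #!/bin/bash
    # Probe one apache directly, bypassing icinga, and log only slow responses.
    while true; do
        start=$(date +%s.%N)
        /usr/lib/nagios/plugins/check_http -H en.wikipedia.org -I 10.0.11.60 -u / >/dev/null
        elapsed=$(echo "$(date +%s.%N) - $start" | bc)
        # flag anything slower than 5 seconds
        if [ "$(echo "$elapsed > 5" | bc)" = "1" ]; then
            echo "$(date -u '+%F %T') slow probe: ${elapsed}s"
        fi
        sleep 2
    done

Run long enough, the roughly one-in-a-dozen slow responses described below show up no matter where the script runs, which is exactly what rules out icinga and neon as the culprits.
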
[05:16:02] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.204 second response time [05:16:02] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:17:02] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.260 second response time [05:17:02] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:17:33] in fact, i get it on the actual apache itself [05:17:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.567 second response time [05:17:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.098 second response time [05:19:02] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:19:52] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.961 second response time [05:19:53] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.031 second response time [05:19:53] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.069 second response time [05:20:52] (03Abandoned) 10Tim Landscheidt: Tools: Remove SGE shadow master [operations/puppet] - 10https://gerrit.wikimedia.org/r/112671 (owner: 10Tim Landscheidt) [05:20:53] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.098 second response time [05:21:02] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:21:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:21:22] PROBLEM - Apache HTTP on srv298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:22:22] RECOVERY - Apache HTTP on srv298 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.127 second response time [05:22:53] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.101 second response time [05:23:02] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.237 second response time [05:23:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:03] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:03] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:52] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.556 second response time [05:24:02] PROBLEM - Apache HTTP on srv254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:24:22] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.168 second response time [05:24:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.104 second response time [05:25:02] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:25:32] * springle wonders about 02:24 logmsgbot: aaron synchronized wmf-config/mc.php 'Set Memcached retry_timeout to -1'  [05:26:02] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:26:22] PROBLEM - Apache HTTP on mw61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:26:26] (03PS1) 10Andrew Bogott: Add cron entries to update puppet repos on labs 
puppetmasters. [operations/puppet] - 10https://gerrit.wikimedia.org/r/113332 [05:26:27] yeah, that's a good guess [05:27:02] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.088 second response time [05:27:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:27:22] PROBLEM - Apache HTTP on mw63 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:27:53] (03PS1) 10Ori.livneh: Revert "Set Memcached retry_timeout to -1" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113333 [05:28:02] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.206 second response time [05:28:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:28:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:28:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:28:12] RECOVERY - Apache HTTP on mw63 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.917 second response time [05:28:12] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "Set Memcached retry_timeout to -1" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113333 (owner: 10Ori.livneh) [05:28:12] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.115 second response time [05:28:27] !log ori updated /a/common to {{Gerrit|Iac9f51209}}: Revert "Set Memcached retry_timeout to -1" [05:28:34] Logged the message, Master [05:28:50] stracing mw32 nutcracker seems to have unusual waits.. but then again dont know what usual is [05:29:02] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.130 second response time [05:29:02] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.241 second response time [05:29:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:29:03] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:29:12] !log ori synchronized wmf-config/mc.php 'Iac9f51209: Revert 'Set Memcached retry_timeout to -1'' [05:29:19] Logged the message, Master [05:29:51] aaron made the change to reduce logspam; it has been an issue for a long while (several months). so safe to revert. 
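
While springle straces nutcracker on mw32 above, twemproxy itself has a couple of built-in checks that help here; a hedged sketch (the config path and the 22222 stats port are twemproxy defaults, not confirmed by this log):

    # validate the config (including server_retry_timeout) without touching
    # the running daemon
    nutcracker --test-conf --conf-file /etc/nutcracker/nutcracker.yml
    # twemproxy serves cumulative counters as JSON on its stats port; the
    # server_ejects and server_timedout counters show whether the
    # retry_timeout change altered server-ejection behaviour
    nc localhost 22222 | python -mjson.tool | grep -E 'server_(ejects|timedout)'
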
[05:29:52] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.413 second response time [05:29:53] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.423 second response time [05:29:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.423 second response time [05:29:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.422 second response time [05:29:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.422 second response time [05:29:53] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.421 second response time [05:30:31] (03PS1) 10Tim Landscheidt: Revert "Add Apple Touch icon for Labs" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113334 [05:31:07] springle: good catch [05:31:32] (03PS2) 10Tim Landscheidt: Revert "Add Apple Touch icon for Labs" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113334 [05:33:23] (03CR) 10Tim Landscheidt: "Reverted in If18d4215b603f5461451c27ccb8e2a8165f2b0d0." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111434 (owner: 10Odder) [05:46:48] can't immediately see the relationship between retry_timeout and server_retry_timeout in nutcracker [05:47:01] twemproxy rather [05:47:30] but server_retry_timeout is int64_t and a couple of spots in the code would break with negative values [05:48:32] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100% [05:49:02] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [05:51:12] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused [05:52:12] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.456 second response time [06:08:30] !log Reloading Zuul to deploy Ie02531143511f418a6 [06:08:38] Logged the message, Master [06:23:20] springle: i'll e-mail the log to aaron [06:36:15] ori: thanks. also see greg-g's email to ops@ [06:37:00] oh, i missed that. thanks [06:37:54] springle: A question re LabsDB: In replication prod-DBs => sanitizer => LabsDB, if the link prod-DBs => sanitizer stalls/breaks, but sanitizer => LabsDB keeps working, what will Seconds_Behind_Master on LabsDB show? [06:40:02] scfc_de: Seconds_Behind_Master is virtually useless at the second level [06:47:44] scfc_de: we use http://www.percona.com/doc/percona-toolkit/2.2/pt-heartbeat.html . but to tell (like for labsdb users) one needs access to heartbeat schema [06:48:01] which i don't know whether we allow or not, actually [06:56:18] We would have to wrap any access to Seconds_Behind_Master in a SECURITY DEFINER anyway, so we could do the same with pt-heartbeat if it's non-public. I'll add a note about pt-heartbeat to https://bugzilla.wikimedia.org/48694 and https://bugzilla.wikimedia.org/48628. Thanks! [06:57:57] yw [07:26:24] ori: I still don't understand those warnings at all [07:27:00] what do they have to do with twemproxy? [08:14:14] (03CR) 10Fabriceflorin: "Thanks, MZ. I'm comfortable with the proposal to remove AFT5 on enwiki and frwiki on Monday, March 3, 2014. This should give enough time f" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112639 (owner: 10MZMcBride) [08:28:53] I regularly get messages like "Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon.
Please try again in a few minutes." I take it it is no news [08:33:17] apergos: ^ [08:33:23] Same for me. [08:36:23] For example on http://en.wikipedia.org/wiki/Selberg_sieve [08:36:53] If you report this error to the Wikimedia System Administrators, please include the details below. [08:36:53] Request: GET http://en.wikipedia.org/wiki/Selberg_sieve, from 91.198.174.72 via amssq57 amssq57 ([91.198.174.67]:3128), Varnish XID 921749154 [08:36:53] Forwarded for: 90.146.67.180, 91.198.174.72 [08:36:53] Error: 503, Service Unavailable at Fri, 14 Feb 2014 08:36:01 GMT [08:40:43] average: I tried dewiki, eowiki, jawiki and they all seem to work. Could you find other wikis that are affected? [08:47:10] I see it. [08:47:21] Thanks! [08:47:36] looking [08:56:53] Wow, still the 503 [08:57:03] http://wikimania2014.wikimedia.org/wiki/Special:MyLanguage/Main_Page [08:57:16] but not on HTTPS [08:59:37] !b 61364 [08:59:45] https://bugzilla.wikimedia.org/show_bug.cgi?id=61364 [09:19:21] (03PS1) 10Whym: Add autopatrol and related settings for Japanese Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113338 [09:27:38] (03PS2) 10Whym: Enable autopatrol and patrolling of RecentChanges on Japanese Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113338 [09:28:38] (03PS3) 10Whym: Enable autopatrol and patrolling of RecentChanges on Japanese Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113338 [09:46:06] (03CR) 10Odder: [C: 04-1] "I suggest you give the ability to remove users from the autopatrolled group to a local group (such as sysops or bureaucrats); with the cur" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113338 (owner: 10Whym) [11:18:08] (03Abandoned) 10QChris: Add zero tag for carrier 413-02 for simlpewiki on zerodot [operations/puppet] - 10https://gerrit.wikimedia.org/r/113167 (owner: 10QChris) [11:18:10] (03CR) 10Alexandros Kosiaris: [C: 032] Wait 60 seconds before killing the parsoid master [operations/puppet] - 10https://gerrit.wikimedia.org/r/113316 (owner: 10GWicke) [11:24:02] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.004 second response time [11:32:12] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused [11:34:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:34:57] (03PS1) 10Ori.livneh: mwgrep: use a filtered boolean query [operations/puppet] - 10https://gerrit.wikimedia.org/r/113351 [11:36:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:38:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:40:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:42:10] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [11:42:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:44:27] please don't touch wtp1004. Investigating [11:44:45] !log restart parsoid on wtp1022. 
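
On the replication-lag question above ([06:37:54]-[06:47:44]): pt-heartbeat measures delay from a timestamp row the master rewrites every second, so unlike Seconds_Behind_Master it keeps counting in exactly the scenario scfc_de describes — if the prod-DBs => sanitizer link stalls, the heartbeat row stops advancing and the measured lag on LabsDB grows, while Seconds_Behind_Master would happily report 0. A hedged sketch of how it is typically consumed (hostname and user are placeholders; heartbeat.heartbeat is the percona-toolkit default schema/table):

    # one-shot: print the replica's current delay in seconds and exit
    pt-heartbeat --check --host labsdb-replica --user lag_check --ask-pass \
        --database heartbeat --table heartbeat
    # or watch it continuously
    pt-heartbeat --monitor --host labsdb-replica --user lag_check --ask-pass \
        --database heartbeat --table heartbeat

This is also why springle notes that labsdb users would need access to the heartbeat schema: the lag lives in a table, not in SHOW SLAVE STATUS.
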
[11:44:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:44:53] Logged the message, Master [11:45:09] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.012 second response time [11:45:10] ok, I was just looking at it [11:46:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:46:54] ACKNOWLEDGEMENT - Parsoid on wtp1004 is CRITICAL: Connection refused alexandros kosiaris Investigating parsoid restarting on an ephemeral port [11:47:08] ah [11:47:38] seems like a bug. It happened the other day, and now again [11:47:50] I remember the report from the other day [11:47:50] !log depooled wtp1004 [11:47:52] yuck [11:47:58] Logged the message, Master [11:48:10] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [11:48:11] I have a few questions concerning the mathoid-debian package, basically there are two options: either install the dependent node modules while building the package with npm, or ship a set of files that contain the required modules. The first option requires an npm version > 1.3? ... at least newer than the current precise version of npm; furthermore the machine that builds the package must be connected to the internet [11:48:27] hmmm so this time it shows up everywhere... nice [11:48:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:49:10] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.008 second response time [11:49:12] !log restarted parsoid on wtp1001 [11:49:19] Logged the message, Master [11:50:10] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [11:50:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:51:06] physikerwelt: When building packages, fetching stuff from the internet is a bad security practice.
I'd rather you fetched them manually, verified them and then just included them in the repo used to build the deb [11:52:30] akosiaris: ok thanks for the quick and definitive answer [11:52:48] you are welcome :-) [11:52:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:53:50] (03PS4) 10ArielGlenn: Add shell account for santhosh, admins restricted + stats1002 [operations/puppet] - 10https://gerrit.wikimedia.org/r/112912 [11:54:09] RECOVERY - Puppet freshness on analytics1024 is OK: puppet ran at Fri Feb 14 11:54:01 UTC 2014 [11:55:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:54:01 AM UTC [11:57:40] (03CR) 10ArielGlenn: [C: 032] Add shell account for santhosh, admins restricted + stats1002 [operations/puppet] - 10https://gerrit.wikimedia.org/r/112912 (owner: 10ArielGlenn) [11:57:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:54:01 AM UTC [11:58:49] RECOVERY - Puppet freshness on analytics1024 is OK: puppet ran at Fri Feb 14 11:58:40 UTC 2014 [12:00:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:58:40 AM UTC [12:01:09] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.008 second response time [12:02:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:58:40 AM UTC [12:03:04] (03CR) 10Whym: "@Odder thanks for your suggestions. Allowing sysops to remove the flag makes sense, and it was originally our intention. I would like to" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113338 (owner: 10Whym) [12:28:14] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [12:28:54] RECOVERY - Puppet freshness on analytics1024 is OK: puppet ran at Fri Feb 14 12:28:43 UTC 2014 [12:31:14] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.007 second response time [12:53:21] omg scfc_de the censor [12:58:15] Nemo_bis: ? [13:08:16] carthago delenda est [13:14:57] Ah. [13:24:44] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:15] apergos: hey can you have a look ^^?
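
A minimal sketch of the vendoring route akosiaris recommends for the mathoid deb above: fetch the node modules once on a networked machine, review them, and commit them so the package itself builds offline (paths and the commit message are illustrative, not the actual mathoid packaging):

    cd mathoid
    npm install --production   # one-off fetch, on a machine with network access
    npm ls                     # review exactly what was pulled in before shipping it
    git add node_modules
    git commit -m "Vendor node_modules so the deb builds without network access"
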
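
ms-be1005 gets powercycled over its management interface a few lines below; a hedged sketch of what that out-of-band poke usually looks like (the .mgmt hostname and the credentials are assumptions, not taken from this log):

    # query power state, then cycle it via IPMI when the OS -- and here even
    # the mgmt console -- is unresponsive
    ipmitool -I lanplus -H ms-be1005.mgmt.eqiad.wmnet -U root -a chassis power status
    ipmitool -I lanplus -H ms-be1005.mgmt.eqiad.wmnet -U root -a chassis power cycle
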
[13:25:18] I'm doing some other stuff [13:25:21] yes [13:30:02] (03CR) 10Hashar: "The trailing / is defined in RFC 1738 section-3.1 "Common Internet Scheme Syntax" https://tools.ietf.org/html/rfc1738#section-3.1" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/106110 (owner: 10Jeremyb) [13:30:23] !log powercycling ms-be1005, unresponsive even on mgmt console [13:30:32] Logged the message, Master [13:32:04] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [13:33:06] thanks [13:37:42] (03PS1) 10Springle: s6 pool db1030 depool db1010 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113360 [13:38:05] (03CR) 10Springle: [C: 032] s6 pool db1030 depool db1010 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113360 (owner: 10Springle) [13:38:16] (03Merged) 10jenkins-bot: s6 pool db1030 depool db1010 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113360 (owner: 10Springle) [13:39:07] !log springle synchronized wmf-config/db-eqiad.php 's6 pool db1030 depool db1010' [13:39:15] Logged the message, Master [13:43:14] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [13:46:54] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:44] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [13:52:46] (03PS4) 10Whym: Enable autopatrol and patrolling of RecentChanges on Japanese Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113338 [13:54:21] (03PS1) 10Springle: s1 repool db1056 warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113361 [13:54:45] (03CR) 10Springle: [C: 032] s1 repool db1056 warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113361 (owner: 10Springle) [13:54:51] (03Merged) 10jenkins-bot: s1 repool db1056 warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113361 (owner: 10Springle) [13:55:30] !log springle synchronized wmf-config/db-eqiad.php 's1 repool db1056 warm up' [13:55:38] Logged the message, Master [14:13:08] (03PS1) 10Springle: s1 db1056 full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113363 [14:13:33] (03CR) 10Springle: [C: 032] s1 db1056 full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113363 (owner: 10Springle) [14:13:39] (03Merged) 10jenkins-bot: s1 db1056 full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113363 (owner: 10Springle) [14:14:14] !log springle synchronized wmf-config/db-eqiad.php 's1 db1056 full steam' [14:14:22] Logged the message, Master [14:22:28] hmm so systemd on Ubuntu as well... 
eventually [14:22:28] http://www.markshuttleworth.com/archives/1316 [14:25:42] !log beginning cirrus reindex of all wikipedias running cirrus except enwiki [14:25:51] Logged the message, Master [14:49:24] PROBLEM - Disk space on virt11 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 44204 MB (3% inode=99%): [14:52:39] paravoid: guess you saw it: Ubuntu is to switch from Upstart to the Systemd init system [14:53:35] yup, I did [14:57:14] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [15:12:26] grrrr @ systemd :P [15:26:14] !log aborted reindex due to https://bugzilla.wikimedia.org/show_bug.cgi?id=61377 [15:26:22] Logged the message, Master [15:44:05] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138 [15:49:17] bd808: ^^^^ [15:49:54] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 34: active_shards: 68: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [15:50:13] <^d> Self-healing? ^ [15:50:57] I'm not sure what happened. The secondary index for today's logs was rebuilding when I logged in [15:51:44] Also I should set up a highlight filter for "logstash" [15:59:56] manybubbles: Replica of today's logstash index just relocated from 1001 to 1002. Something is up in my cluster for sure. [16:00:15] they do get to relocate on their own, you know [16:00:21] that shouldn't cause failures [16:01:55] I'm about to venture outside for the first time in a while. [16:01:57] wish me luck [16:15:57] !log db1034 swapping cables [16:16:06] Logged the message, Master [16:18:04] manybubbles|away: safe journeys [16:48:11] (03CR) 10Ryan Lane: [C: 031] Add cron entries to update puppet repos on labs puppetmasters. [operations/puppet] - 10https://gerrit.wikimedia.org/r/113332 (owner: 10Andrew Bogott) [16:48:19] * werdna is pung [16:55:19] (03CR) 10Odder: "This breaks Things. https://meta.wikimedia.org/wiki/Tech/News/2014/01" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111426 (owner: 10TTO) [17:04:30] Does anybody have a few minutes to help me figure out why the elasticsearch and redis monitors on the logstash100[123] nodes aren't populating graphs in ganglia? [17:05:19] I can see the *.pyconf files in /etc/ganglia/conf.d and don't know the next troubleshooting step [17:11:27] !log Restarted ganglia-monitor on logstash1001 to see if that makes the elasticsearch and redis metrics show up in ganglia [17:11:36] Logged the message, Master [17:12:53] bd808: stop gmond and start it in the foreground by running gmond -d999 [17:13:08] well, since you've already restarted it, you can wait and see if that fixes things [17:14:03] ori: Thanks. I'll give it 5 minutes and then do the foreground run if needed [17:14:54] (03CR) 10Ori.livneh: [C: 032] mwgrep: use a filtered boolean query [operations/puppet] - 10https://gerrit.wikimedia.org/r/113351 (owner: 10Ori.livneh) [17:15:28] Redis is showing up now so fingers crossed that I just need to bump the ganglia-monitor service on the others [17:20:01] (03CR) 10Helder.wiki: "=/" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111426 (owner: 10TTO) [17:21:56] (03CR) 10Ottomata: "Thanks Antoine."
(031 comment) [operations/debs/git-fat] (debian) - 10https://gerrit.wikimedia.org/r/113018 (owner: 10Ottomata) [17:39:15] (03CR) 10Matthias Mullie: [C: 031] Add qa_automation group and grant it Flow rights [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113311 (owner: 10Spage) [17:39:43] (03PS1) 10Legoktm: Revert "Add local interwiki for metawiki" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113377 [17:39:50] Nemo_bis: ^ [17:41:27] (03CR) 10Odder: [C: 031] "Yes." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113377 (owner: 10Legoktm) [17:50:20] (03PS6) 10Ottomata: Initial debian release [operations/debs/git-fat] (debian) - 10https://gerrit.wikimedia.org/r/113018 [17:53:53] !log Running gmond in foreground on logstash1001 to debug elasticsearch reporting [17:54:02] Logged the message, Master [17:59:57] (03CR) 10Alexandros Kosiaris: Initial debian release (031 comment) [operations/debs/git-fat] (debian) - 10https://gerrit.wikimedia.org/r/113018 (owner: 10Ottomata) [18:00:51] bd808: if the module doesn't throw exceptions, the next step is tcpdump [18:00:57] bd808: on the ganglia aggregator [18:01:05] which iirc is logstash1003, but you should check [18:01:12] ori: It's the elasticsearch module [18:01:34] greg-g: so do we unbreak things on a Friday? [18:01:41] https://gerrit.wikimedia.org/r/#/c/113377/ specifically [18:02:13] ori: I think Nik made a change that requires a newer version of elasticsearch than I'm running, but I'm going to double check the release notes to make sure that's the problem and not something else [18:02:26] twkozlowski: yeah, for that kind of thing [18:13:20] (03CR) 10Gergő Tisza: [C: 031] Start sampling detailed network performance for Multimedia Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112452 (owner: 10Gilles) [18:18:02] dr0ptp4kt: yurik: where does one learn about portalwiki? and the phases? [18:19:29] jeremyb, we have only outlined the first phase of the portalwiki - which is basically to relocate meta zero config pages to a separate wiki and set it up for further dev, but it seems we might have to continue developing on meta until ops create a separate cluster [18:20:28] yurik: see https://bugzilla.wikimedia.org/61222 [18:20:54] yurik: legal didn't want to use legal.wm.o in case they had some future broader use. is that not a concern for your proposed domain name? [18:21:31] hm, not sure what you mean - we would be totally fine with zero.wikimedia.org [18:24:09] !log Starting ganglia-monitor on logstash1001. Filed bug 61384 about problem in elasticsearch_monitoring.py affecting the logstash cluster. [18:24:17] Logged the message, Master [18:26:00] (03PS1) 10Aaron Schulz: Set wgMathDisableTexFilter to reduce performance problems [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113382 [18:26:56] (03PS2) 10Aaron Schulz: Set wgMathDisableTexFilter to fix performance regression [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113382 [18:27:01] (03CR) 10Aaron Schulz: [C: 032] Set wgMathDisableTexFilter to fix performance regression [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113382 (owner: 10Aaron Schulz) [18:27:08] (03Merged) 10jenkins-bot: Set wgMathDisableTexFilter to fix performance regression [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113382 (owner: 10Aaron Schulz) [18:27:41] yurik: so not "portal.wikimedia.org" then?
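
Pulling together the ganglia debugging steps ori suggests above (foreground gmond at [17:12:53], tcpdump on the aggregator at [18:00:51]); the interface name and UDP port 8649 are ganglia defaults and assumptions, not confirmed here:

    # on the misbehaving node: stop the service, rerun in the foreground with
    # maximum debug output to watch the python modules load and report
    service ganglia-monitor stop
    gmond -d999
    # on the aggregator (logstash1003, per the log): confirm metric packets arrive
    tcpdump -n -i eth0 udp port 8649
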
[18:27:54] jeremyb, yurik, i would prefer that we use portal.wikimedia.org to eliminate confusion about zero.wikipedia.org and zero.wikimedia.org and a potential of rebranding the program at some point down the road. note, no one is planning a rebrand, it's just that the term "zero" has negative connotations sometimes. hence, something other than 'zero', even if it isn't 'portal', seems advisable. my recommendation earlier was partners.wikimedia.org [18:27:55] which aligns with what we do [18:28:32] jeremy b + yurik, i'm stepping away from my computer for a while, just so you know [18:28:33] "portal" can have many meanings [18:28:40] dr0ptp4kt: sure [18:28:51] <^d> dr0ptp4kt: domainsdontmatter.wikimedia.org? :) [18:29:18] ^d: they really matter if you're trying to get someone to remember it or type it manually... [18:29:22] :-) [18:29:44] <^d> I don't think eiximenis is in use anymore. [18:29:46] <^d> ;-) [18:29:51] !log aaron synchronized wmf-config/CommonSettings.php 'Set wgMathDisableTexFilter to fix performance regression' [18:29:58] Logged the message, Master [18:30:18] don't forget to get domainsdontmatter.wikimedia.org SSL cert, thx [18:30:41] <3 mutante [18:30:45] <^d> AaronSchulz: I have PoolCounter wrapping for texvccheck that should be working its way into the next wmf branch. [18:33:33] paravoid: ping [18:36:30] ^d: still terrible [18:36:39] ok, https://en.wikipedia.org/w/index.php?title=Real_projective_line&oldid=543928239&forceprofile=true is still slow [18:36:58] like 36 on math that was already rendered... [18:37:01] *36sec [18:42:25] <^d> AaronSchulz: :( [18:43:26] ori: if there are any more mc details can you leave them on bug 56882...since that still confuses me? Can it just be re-done for eqiad only? [18:43:36] * AaronSchulz looks at the Math code [18:45:01] AaronSchulz: i'll look [18:45:25] the logs show the spam went away when it was on [18:47:38] AaronSchulz: Sean mentioned that tampa is using nutcracker, which I guess is an earlier version of twemproxy [18:48:59] so somehow not using backoff after failure in low-traffic tampa made nutcracker unhappy? Were apaches waiting on it or something? How did they get seen as down? [18:49:30] still doesn't seem to make sense [18:50:23] AaronSchulz: if an HTTP request takes more than 10 seconds to get the 301 to /wiki/Main_Page, an alert is issued [18:50:43] some requests were taking more than 10 seconds [18:50:47] most were under 1 second [18:51:00] but i ran it with 'watch' with a 2-second delay between calls [18:51:11] how it was seen, example 21:36 <+icinga-wm> PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:21] and one every dozen or so requests would take ~10 seconds [18:51:28] they were all doing that during that time, but only tampa [18:51:38] 0.32% 0.122560 84 - FileBackendStore::storeInternal-global-swift-eqiad [18:52:04] ^d: hmm, it must be regenerating and re-storing the pngs each time [18:52:04] AaronSchulz: make the change locally on one of the tampa apaches [18:52:16] and !log it so people know to expect the alerts [18:52:55] ^d: I mean by the monitoring, of course I saw the logs :) [18:53:20] trying to figure out how apaches were delayed [18:53:45] <^d> I really want to finish wrapping more of this in PoolCounter. [18:53:46] ori: though I'm tempted to just do an eqiad conditional and be done with it, I don't really care about tampa at all [18:53:53] <^d> So worst it can do is fail math and not take apaches with it. [18:53:58] ^d search is the new domains, i know.
unless it isn't indexed! [18:54:08] i guess we just index the homepage and nothing else [18:54:12] AaronSchulz: that's fine [18:54:20] 1.83% 0.696300 1 - FileBackendStore::doQuickOperationsInternal-global-swift-eqiad [18:54:31] <^d> dr0ptp4kt: Huh? Context? [18:54:34] <^d> Aw, left. [18:54:37] ^d: you know if I didn't hack Math to batch store all files in swift at the end this would be even worse [18:54:42] AaronSchulz: i think if you spent several hours chasing it down it'll end up being something particular and nearly-irrelevant about tampa [18:54:53] it would really suck if that was the case *and* we were still writing to tampa ;) [18:55:00] that page would have timed out [19:02:52] !log Updated /src/scap on tin to b2d8042 [19:02:59] Logged the message, Master [19:03:22] greg-g: Ok to push latest scap scripts to the cluster and test with a no-op scap? [19:05:09] (03PS1) 10Aaron Schulz: Set retry_timeout to -1 for memcached in eqiad only [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113387 [19:05:25] ori: ^ [19:05:58] (03CR) 10Ori.livneh: [C: 031] "Aaron <3s ternary expressions" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113387 (owner: 10Aaron Schulz) [19:07:24] (03CR) 10Aaron Schulz: [C: 032] Set retry_timeout to -1 for memcached in eqiad only [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113387 (owner: 10Aaron Schulz) [19:07:33] (03Merged) 10jenkins-bot: Set retry_timeout to -1 for memcached in eqiad only [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113387 (owner: 10Aaron Schulz) [19:09:08] !log aaron synchronized wmf-config/mc.php 'Set retry_timeout to -1 for memcached in eqiad only' [19:09:17] Logged the message, Master [19:09:44] idle greg-g is idle [19:09:58] bd808: he okayed it earlier [19:10:40] bd808: yessir [19:10:42] sorry sir [19:10:46] was multitasking sir [19:11:13] greg-g: thanks. :) [19:11:51] !log Updating scap on mediawiki-installation dsh hosts [19:11:57] Logged the message, Master [19:12:14] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [19:13:12] fully, the fire alarm in my apartment went off a minute ago [19:13:25] then off again [19:13:27] *funny [19:13:53] ori: Lots and lots of "Already up-to-date" responses. Is puppet pulling automatically for eventual consistency? [19:14:24] AaronSchulz: You may or may not be on fire. Please check :) [19:14:51] Unstable waveforms must be collapsed via observation [19:15:04] * AaronSchulz looks at odd stuff on http://ganglia.wikimedia.org/latest/ [19:16:27] meh, that gluster spike seems random [19:24:03] twkozlowski: still on? [19:25:16] !log bd808 started scap: no-diff scap to test script changes [19:25:24] Logged the message, Master [19:25:35] mutante: Yeah [19:26:31] twkozlowski: 113073 the bugzilla sidebar thing, looks good to you? shall we merge it even without andre?:) [19:26:34] would [19:26:50] but you gave -1 before, so [19:27:45] oh, it looks good [19:27:48] i guess the question is more, like [19:27:55] is it more important to fix that for weekend [19:28:09] or to wait for more labs testing by andre because it's just his PS1 [19:28:12] ok [19:28:23] the memcached log spam is mostly tampa anyway...so I guess that log will stay fat for a while [19:28:38] what's a few gigs on fluorine? ;) [19:28:44] * AaronSchulz goes back to Math [19:28:52] mutante: I'll try to set Labs up now. But yeah, would be cool to get it in for the weekend. How much more time do I have before your weekend starts?
:) [19:28:54] !log no-diff scap updated 366 JSON l10n files [19:29:04] Logged the message, Master [19:29:13] andre__: oh, i thought you are away :) [19:29:31] on and off... [19:29:38] andre__: eh, so i can either just merge that PS2 , i fixed the tabs [19:29:46] andre__: or you can tell me to come back in like 2 hours [19:30:01] i'll just be afk in between as well [19:32:20] mutante, two hours sounds good. I might be away then, but hopefully have tested everything [19:32:34] just realizing there's a bit more cleanup work to do, e.g. removing /extensions/Sitemap from Gerrit [19:33:33] andre__: alright, i'll be off and back in a bit, feel free to leave query messages,i'll read them [19:33:33] ok <> in the extension repo is definitely crap with msysgit [19:38:54] bd808: was the lack of a scap completed message from logmsgbot intentional? [19:39:06] greg-g: Still running [19:39:11] bd808: also, I assume you're done based on your last... oh [19:39:18] Our "no-op" picked up l10n updates [19:39:19] the !log no-diff updated json files thing confused me [19:39:27] * greg-g nods [19:39:59] On a related note… scap is f'ing slow [19:40:23] yep! [19:45:38] (03CR) 10Cmcmahon: [C: 031] Add qa_automation group and grant it Flow rights [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113311 (owner: 10Spage) [19:45:46] All but 42 minutes yesterday [19:45:59] To push new mw version (all code), l10n cache for 2 versions [19:47:21] I think it's going to be >30m today just for l10n updates [19:47:58] The rsync fanout was horribly slow [19:48:49] also the use of dsh instead of salt cause a YMMV [19:48:54] *causes, gah [19:49:37] AaronSchulz: Do you think using salt for messaging would be a significant difference? [19:50:29] Does salt have a "batch size" option? I don't think we want all the rsyncs running at once with the current number of servers [19:50:38] depends on your ssh-agent [19:50:43] !log bd808 finished scap: no-diff scap to test script changes (duration: 25m 26s) [19:50:50] i just wouldn't worry about it yet [19:50:51] Logged the message, Master [19:51:02] if it is fast already then it won't make much difference [19:51:11] greg-g: ^^ {{done}} [19:51:28] $ cat sync-common [19:51:28] #!/bin/bash [19:51:28] /usr/local/bin/scap-1 [19:51:54] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 33: active_shards: 66: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 1 [19:51:54] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 33: active_shards: 66: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 1 [19:52:03] grrr [19:52:10] heh [19:52:14] bd808: if it ain't scap, it's logstash :) [19:52:14] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 33: active_shards: 66: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 1 [19:52:49] kaldari: they're done with the "no-op" scap test [19:52:50] Something nasty keeps happening there. 
A shard replica is flapping [19:53:23] * greg-g goes to do some lunch type thing [19:53:27] ya'll play nice now [19:53:38] * bd808 thinks that sounds like a good idea [19:54:23] (03PS1) 10Ori.livneh: webperf: Adapt NavigationTiming Graphite reporter to schema rev. 7494934 [operations/puppet] - 10https://gerrit.wikimedia.org/r/113392 [19:54:38] (03CR) 10Ori.livneh: [C: 032 V: 032] webperf: Adapt NavigationTiming Graphite reporter to schema rev. 7494934 [operations/puppet] - 10https://gerrit.wikimedia.org/r/113392 (owner: 10Ori.livneh) [19:55:13] !log during scap test snapshot[1234] reported "sudo: no tty present and no askpass program specified" [19:55:21] Logged the message, Master [19:56:47] bd808: that's been a problem for ages. might be worth filing an RT [19:57:34] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for db78 [operations/dns] - 10https://gerrit.wikimedia.org/r/113140 (owner: 10Cmjohnson) [19:59:38] !log ori synchronized php-1.23wmf14/extensions/NavigationTiming 'Update NavigationTiming for schema revision to 7494934' [19:59:46] Logged the message, Master [20:00:38] ori: Filed snapshot scap errors as https://rt.wikimedia.org/Ticket/Display.html?id=6847 [20:00:49] bd808: sweet, thanks [20:01:28] !log ori synchronized php-1.23wmf13/extensions/NavigationTiming 'Update NavigationTiming for schema revision to 7494934' [20:01:37] Logged the message, Master [20:08:35] https://bugzilla.wikimedia.org/show_bug.cgi?id=36623 can be closed? [20:08:43] * twkozlowski dunno what Ubuntu version is used nowadays [20:09:51] not yet, just a little bit longer [20:10:14] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [20:10:40] the one-off thing even though it's practically done [20:11:59] twkozlowski: that'll be resolved like when tampa is dead [20:12:31] technically, you can argue if it matters that an unused thing isn't upgraded but also not decom'ed yet, etc ..bla bla [20:14:08] mutante: oh, I don't care that much. Just dug that bug out of Bugzilla's bottomless... well, database [20:14:48] yep, you could link it to a Tampa tracking .. if you want .. [20:16:59] I can't find any bug that includes the word 'Tampa [20:17:22] -z [20:18:52] twkozlowski: ugh, true.. Bugzilla ticket: (no value) [20:19:21] https://bugzilla.wikimedia.org/show_bug.cgi?id=45528 mutante ? [20:19:23] twkozlowski: well, https://wikitech.wikimedia.org/wiki/Tampa_cluster is the same thing and RT #6099 and [20:19:44] https://wikitech.wikimedia.org/w/index.php?title=Tampa_cluster&action=history [20:20:52] twkozlowski: maybe, thx, i linked that, better than none [20:21:07] for me it's 6099 [20:21:28] and i'd like to keep the wikitech site in sync [20:21:33] for the public part [20:28:50] Sorry! We could not process your edit due to a loss of session data. Please try again. If it still does not work, try logging out and logging back in. [20:28:53] :( [20:28:54] is that a simple timeout? [20:29:16] jgage: Nope... more of a session timeout thingy [20:29:38] oddly it seems to have accepted my edit despite the error message [20:29:54] oh no, it gave me a preview [20:30:03] all is not lost [20:30:56] yeah ok i just had to save the edit a second itme [20:30:57] time [20:33:38] twkozlowski: please check bugzilla sidebar:) [20:34:06] yay [20:34:20] Reedy: good?
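
Back to bd808's batch-size question at [19:50:29]: salt does have one, which is the usual answer to "don't start every rsync at once". A hedged sketch; the targeting expression and fanout numbers are illustrative, not taken from the actual scap tooling:

    # run the command on at most 25 minions at a time; -b also takes a
    # percentage, e.g. -b 10%
    salt -b 25 -G 'cluster:appserver' cmd.run '/usr/local/bin/scap-1'
    # rough dsh equivalent: concurrent mode with a bounded fork limit
    dsh -g mediawiki-installation -c -F 25 -- /usr/local/bin/scap-1
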
[20:34:33] deployed https://gerrit.wikimedia.org/r/#/c/113073/ [20:34:39] Certainly better than it was [20:34:51] cool [20:35:29] that's good enough for the weekend:) [20:35:52] Deploy and go home [20:35:55] * Reedy high fives mutante [20:36:10] hah:) thx [20:36:39] i can wait a bit to see if it breaks suddenly:) [20:39:06] mutante: yay, works like it did before [20:39:46] nice [20:40:07] * twkozlowski also suggests that someone deploys https://gerrit.wikimedia.org/r/#/c/113377/ [20:40:19] greg-g: ^^ [20:41:34] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [20:41:44] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [20:41:44] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [20:41:54] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100% [20:41:56] Forgot about that [20:42:02] (03PS2) 10Legoktm: Revert "Add local interwiki for metawiki" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113377 [20:42:06] (03CR) 10Reedy: [C: 032] Revert "Add local interwiki for metawiki" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113377 (owner: 10Legoktm) [20:42:14] PROBLEM - Host text-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1 [20:42:16] (03Merged) 10jenkins-bot: Revert "Add local interwiki for metawiki" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113377 (owner: 10Legoktm) [20:42:18] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:18] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:24] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1:c [20:42:27] PROBLEM - Host upload-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::2:b [20:42:30] PROBLEM - Host bits-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1:a [20:42:34] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:34] PROBLEM - Host cp4005 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:34] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:34] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:34] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:35] That doesn't look good [20:42:35] PROBLEM - Host cp4013 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:35] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:36] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:36] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:37] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:37] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:38] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:38] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:39] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:39] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:40] uh [20:42:48] Bang. IPv6 is gone. [20:42:50] are people on this or should I be looking?
[20:42:54] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:54] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:55] PROBLEM - Host lvs4004 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:55] PROBLEM - Host lvs4002 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:55] PROBLEM - Host upload-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [20:43:00] D'oh! [20:43:04] PROBLEM - Host bits-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [20:43:08] PROBLEM - Host text-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [20:43:11] wt [20:43:11] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [20:43:32] all of ulsfo [20:43:34] awesome [20:43:37] transit down? [20:43:41] !log reedy synchronized wmf-config/InitialiseSettings.php 'Revert Add local interwiki for metawiki' [20:43:41] I think ulsfo just lost its transit. [20:43:42] seems so [20:43:49] Logged the message, Master [20:44:08] texts were a bit delayed there [20:44:33] and yeah I had an old ssh session open into a ulsfo machine and my session's hung :( [20:44:57] uff [20:45:06] so move stuff off to eqiad I guess? [20:45:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [20:45:20] yeah I'll do the dns change [20:45:37] k [20:45:45] unless someone has a bright idea about what's wrong with the network and/or how to fix it quickly :) [20:45:54] eh no [20:46:04] PROBLEM - Host backup4001 is DOWN: PING CRITICAL - Packet loss = 100% [20:46:05] !log ULSFO down, traffic to Asia etc affected. Being worked on [20:46:11] Logged the message, Master [20:46:11] think of the backups! [20:46:30] AFAICT, ulsfo just completely dropped off the 'net inside and out [20:47:00] That's not good [20:47:36] (03PS1) 10BBlack: ulsfo outage, temporarily s/ulsfo/eqiad/ [operations/dns] - 10https://gerrit.wikimedia.org/r/113456 [20:48:02] ah the easy way [20:48:07] (03CR) 10BBlack: [C: 032 V: 032] ulsfo outage, temporarily s/ulsfo/eqiad/ [operations/dns] - 10https://gerrit.wikimedia.org/r/113456 (owner: 10BBlack) [20:48:46] ... dafu? Looks like the routes are gone. [20:49:08] the dns stuff is done, TTLs notwithstanding [20:50:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [20:50:42] [payments stuff is me running updates] [20:51:14] Shall we poke GTT? That definitely looks like their pipe just went boom. [20:51:41] did you see the device reboot notices just arriving in email? 
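
With the temporary s/ulsfo/eqiad/ change above deployed ("the dns stuff is done, TTLs notwithstanding"), the failover can be checked from outside; a small sketch (ns0.wikimedia.org is a real authoritative server, but exactly which records the geo map rewrites is an assumption here):

    # ask the authoritative server directly, bypassing caches; resolvers that
    # previously mapped to ulsfo should now get an eqiad text-lb address
    dig +short en.wikipedia.org @ns0.wikimedia.org
    # cached answers persist until their TTL expires, which bounds how long
    # stragglers keep hitting the dead site
    dig en.wikipedia.org | grep -A2 'ANSWER SECTION'
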
[20:51:41] maybe it got the measles [20:51:54] RECOVERY - Host cp4019 is UP: PING WARNING - Packet loss = 44%, RTA = 74.37 ms [20:51:56] beyond my scope (and endurance, 11 pm here and desperately trying to get some dinner in me) [20:52:04] RECOVERY - Host cp4005 is UP: PING OK - Packet loss = 0%, RTA = 75.36 ms [20:52:04] RECOVERY - Host cp4017 is UP: PING OK - Packet loss = 0%, RTA = 73.39 ms [20:52:04] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 72.05 ms [20:52:04] RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 75.22 ms [20:52:04] RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 74.60 ms [20:52:05] RECOVERY - Host cp4001 is UP: PING OK - Packet loss = 0%, RTA = 73.84 ms [20:52:05] RECOVERY - Host cp4014 is UP: PING OK - Packet loss = 0%, RTA = 73.96 ms [20:52:06] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 75.19 ms [20:52:06] RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 72.07 ms [20:52:07] RECOVERY - Host lvs4002 is UP: PING OK - Packet loss = 0%, RTA = 72.19 ms [20:52:07] RECOVERY - Host cp4003 is UP: PING OK - Packet loss = 0%, RTA = 73.82 ms [20:52:08] RECOVERY - Host cp4018 is UP: PING OK - Packet loss = 0%, RTA = 72.19 ms [20:52:08] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 75.02 ms [20:52:09] RECOVERY - Host cp4006 is UP: PING OK - Packet loss = 0%, RTA = 72.05 ms [20:52:09] RECOVERY - Host lvs4004 is UP: PING OK - Packet loss = 0%, RTA = 72.65 ms [20:52:10] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 73.81 ms [20:52:10] RECOVERY - Host cp4008 is UP: PING OK - Packet loss = 0%, RTA = 74.18 ms [20:52:11] RECOVERY - Host cp4013 is UP: PING OK - Packet loss = 0%, RTA = 74.58 ms [20:52:14] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 75.05 ms [20:52:14] RECOVERY - Host cp4010 is UP: PING OK - Packet loss = 0%, RTA = 73.31 ms [20:52:14] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 73.73 ms [20:52:14] RECOVERY - Host cp4016 is UP: PING OK - Packet loss = 0%, RTA = 72.00 ms [20:52:14] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 75.07 ms [20:52:44] RECOVERY - Host upload-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 70.60 ms [20:52:48] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 71.85 ms [20:52:51] RECOVERY - Host bits-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 71.86 ms [20:53:11] wow, they all rebooted [20:53:13] power fail? [20:53:14] RECOVERY - Host text-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 72.48 ms [20:53:25] RECOVERY - Host upload-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 73.80 ms [20:53:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [20:53:34] RECOVERY - Host bits-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 73.22 ms [20:53:37] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 73.24 ms [20:53:40] RECOVERY - Host text-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 71.97 ms [20:53:42] I'm seeing 4minute uptimes on machines I look at manually [20:53:50] oh? [20:53:56] wow [20:54:04] I was about to say. 
[20:54:14] PROBLEM - Varnish HTTP upload-frontend on cp4006 is CRITICAL: Connection refused [20:54:15] PROBLEM - Varnish HTTP upload-frontend on cp4005 is CRITICAL: Connection refused [20:54:15] PROBLEM - Varnish HTTP mobile-backend on cp4020 is CRITICAL: Connection refused [20:54:15] PROBLEM - Varnish HTCP daemon on cp4015 is CRITICAL: Connection refused by host [20:54:15] PROBLEM - Varnish HTTP text-backend on cp4018 is CRITICAL: Connection refused [20:54:15] PROBLEM - Varnish HTTP text-frontend on cp4017 is CRITICAL: Connection refused [20:54:15] PROBLEM - Varnish HTTP upload-backend on cp4005 is CRITICAL: Connection refused [20:54:15] It looks like all the racks just went dark. [20:54:16] PROBLEM - Varnish HTTP upload-frontend on cp4015 is CRITICAL: Connection refused [20:54:16] PROBLEM - Varnish HTTP text-frontend on cp4018 is CRITICAL: Connection refused [20:54:17] PROBLEM - Varnish HTTP upload-backend on cp4015 is CRITICAL: Connection refused [20:54:17] PROBLEM - Varnish HTTP upload-frontend on cp4013 is CRITICAL: Connection refused [20:54:18] PROBLEM - Varnish HTTP mobile-frontend on cp4020 is CRITICAL: Connection refused [20:54:18] PROBLEM - RAID on cp4020 is CRITICAL: Connection refused by host [20:54:19] PROBLEM - Varnish traffic logger on cp4017 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [20:54:24] PROBLEM - Varnish HTTP mobile-frontend on cp4019 is CRITICAL: Connection refused [20:54:25] PROBLEM - Varnish HTCP daemon on cp4019 is CRITICAL: Connection refused by host [20:54:25] PROBLEM - puppet disabled on cp4020 is CRITICAL: Connection refused by host [20:54:25] PROBLEM - DPKG on cp4018 is CRITICAL: Connection refused by host [20:54:25] PROBLEM - puppet disabled on cp4006 is CRITICAL: Connection refused by host [20:54:25] PROBLEM - RAID on lvs4003 is CRITICAL: Connection refused by host [20:54:25] PROBLEM - Disk space on cp4019 is CRITICAL: Connection refused by host [20:54:26] PROBLEM - Varnish traffic logger on cp4011 is CRITICAL: Connection refused by host [20:54:26] PROBLEM - Varnish HTCP daemon on cp4010 is CRITICAL: Connection refused by host [20:54:27] PROBLEM - Varnishkafka log producer on cp4020 is CRITICAL: Connection refused by host [20:54:27] PROBLEM - Varnish traffic logger on cp4018 is CRITICAL: Connection refused by host [20:54:28] PROBLEM - Disk space on lvs4003 is CRITICAL: Connection refused by host [20:54:28] PROBLEM - DPKG on cp4020 is CRITICAL: Connection refused by host [20:54:29] PROBLEM - Varnishkafka log producer on cp4011 is CRITICAL: Connection refused by host [20:54:29] PROBLEM - puppet disabled on cp4013 is CRITICAL: Connection refused by host [20:54:30] PROBLEM - Varnish HTCP daemon on cp4006 is CRITICAL: Connection refused by host [20:54:30] PROBLEM - Disk space on cp4018 is CRITICAL: Connection refused by host [20:54:31] PROBLEM - RAID on cp4013 is CRITICAL: Connection refused by host [20:54:31] PROBLEM - Varnish traffic logger on cp4015 is CRITICAL: Connection refused by host [20:54:32] PROBLEM - Varnish traffic logger on cp4014 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [20:54:32] PROBLEM - Varnish traffic logger on cp4010 is CRITICAL: Connection refused by host [20:54:33] PROBLEM - Varnish HTCP daemon on cp4013 is CRITICAL: Connection refused by host [20:54:33] PROBLEM - Varnish traffic logger on cp4019 is CRITICAL: Connection refused by host [20:54:34] PROBLEM - Disk space on cp4020 is CRITICAL: Connection refused by host [20:54:48] all the cp4* hosts are 5 mins all right [20:54:54]
PROBLEM - puppet disabled on cp4011 is CRITICAL: Connection refused by host [20:54:54] PROBLEM - puppet disabled on lvs4003 is CRITICAL: Connection refused by host [20:54:54] PROBLEM - puppet disabled on cp4019 is CRITICAL: Connection refused by host [20:54:54] PROBLEM - Varnish traffic logger on cp4013 is CRITICAL: Connection refused by host [20:54:54] PROBLEM - puppet disabled on cp4010 is CRITICAL: Connection refused by host [20:54:55] PROBLEM - RAID on cp4018 is CRITICAL: Connection refused by host [20:54:55] PROBLEM - Varnishkafka log producer on cp4019 is CRITICAL: Connection refused by host [20:54:56] PROBLEM - Disk space on cp4015 is CRITICAL: Connection refused by host [20:54:56] PROBLEM - DPKG on cp4006 is CRITICAL: Connection refused by host [20:55:05] PROBLEM - Varnish HTTP upload-frontend on cp4014 is CRITICAL: Connection refused [20:55:05] PROBLEM - Varnish HTTP upload-backend on cp4013 is CRITICAL: Connection refused [20:55:05] PROBLEM - Varnish HTTP text-backend on cp4008 is CRITICAL: Connection refused [20:55:05] PROBLEM - Varnish HTTP mobile-frontend on cp4011 is CRITICAL: Connection refused [20:55:05] PROBLEM - Varnish HTTP upload-backend on cp4006 is CRITICAL: Connection refused [20:55:22] yeah so UL might be better to contact than GTT [20:55:55] can we call mark? [20:56:25] PROBLEM - LVS HTTP IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: Connection refused [20:56:26] I don't have his cell number, and officewiki is down :) [20:56:29] PROBLEM - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: Connection refused [20:56:46] * greg-g puts that in todo for Monday: put all of ops in phone [20:56:47] and the lvs's too so that's all of em [20:56:51] (5 mins, now 7) [20:56:54] RECOVERY - puppet disabled on cp4010 is OK: OK [20:56:54] PROBLEM - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.379 second response time [20:56:54] apergos: can you call mark please? [20:57:04] RECOVERY - DPKG on cp4010 is OK: All packages OK [20:57:04] kaldari: so, no deploys right now :/ [20:57:04] RECOVERY - RAID on cp4010 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [20:57:10] sec [20:57:14] RECOVERY - Varnish HTTP text-backend on cp4010 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.149 second response time [20:57:15] RECOVERY - Varnish HTTP text-frontend on cp4010 is OK: HTTP OK: HTTP/1.1 200 OK - 198 bytes in 0.152 second response time [20:57:15] RECOVERY - Disk space on cp4010 is OK: DISK OK [20:57:20] greg-g: ok [20:57:25] RECOVERY - Varnish HTCP daemon on cp4010 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [20:57:25] RECOVERY - LVS HTTP IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27847 bytes in 0.218 second response time [20:57:30] RECOVERY - Varnish traffic logger on cp4010 is OK: PROCS OK: 2 processes with command name varnishncsa [20:57:32] greg-g: the patch still hasn't been merged anyway :P [20:57:37] kaldari: ulsfo is down (well, maybe recovering now) [20:57:47] Cleaning staff unplugged the racks to plug in the floor polisher? 
[20:57:54] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 70053 bytes in 0.620 second response time [20:58:14] PROBLEM - LVS HTTP IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: Connection refused [20:58:20] we'll leave DNS failed over till we get to the bottom of it in any case [20:58:25] RECOVERY - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 69937 bytes in 0.398 second response time [20:59:04] RECOVERY - Varnish HTTP upload-backend on cp4013 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.154 second response time [20:59:04] RECOVERY - Disk space on cp4013 is OK: DISK OK [20:59:04] RECOVERY - DPKG on cp4013 is OK: All packages OK [20:59:14] RECOVERY - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 27898 bytes in 0.455 second response time [20:59:18] RECOVERY - Varnish HTTP upload-frontend on cp4013 is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 0.154 second response time [20:59:18] RECOVERY - LVS HTTP IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 541 bytes in 0.150 second response time [20:59:18] bblack: yeah [20:59:24] RECOVERY - RAID on cp4013 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [20:59:24] RECOVERY - puppet disabled on cp4013 is OK: OK [20:59:24] RECOVERY - Varnish HTCP daemon on cp4013 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [20:59:34] RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 563 bytes in 0.382 second response time [20:59:40] back up for me. [20:59:54] RECOVERY - Varnish traffic logger on cp4013 is OK: PROCS OK: 2 processes with command name varnishncsa [20:59:54] RECOVERY - RAID on cp4018 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:00:04] RECOVERY - Varnish HTCP daemon on cp4018 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [21:00:05] RECOVERY - puppet disabled on cp4018 is OK: OK [21:00:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [21:00:15] Mark Bergsma: +31-654282595 [21:00:15] RECOVERY - Varnish HTTP text-frontend on cp4018 is OK: HTTP OK: HTTP/1.1 200 OK - 197 bytes in 0.148 second response time [21:00:15] RECOVERY - LVS HTTP IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 541 bytes in 0.148 second response time [21:00:24] RECOVERY - DPKG on cp4018 is OK: All packages OK [21:00:24] RECOVERY - Disk space on cp4018 is OK: DISK OK [21:00:25] RECOVERY - Varnish traffic logger on cp4018 is OK: PROCS OK: 2 processes with command name varnishncsa [21:00:29] Yeah, from what I see in the SEL the boxes just out and lost power. 
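(The "SEL" here is the System Event Log kept by each box's out-of-band management controller, which stays readable even when the OS is down. A minimal sketch of pulling it with ipmitool; the management hostname and credentials are placeholders, not real names:)

    #!/bin/bash
    # Dump the last few System Event Log entries from a host's BMC,
    # out of band. The -mgmt hostname and password variable are
    # assumed names for this sketch.
    ipmitool -I lanplus -H cp4011-mgmt.example.org -U root \
        -P "$IPMI_PASSWORD" sel elist | tail -n 20

(A burst of "AC lost" or "System is turning off" records carrying the same timestamp on every host in the site points at facility power rather than individual machines, which is exactly the conclusion drawn from the iDRAC excerpt just below.)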
[21:00:47] I love how icinga signs all the texts "<3, Icinga" :P It makes my valentine's day special-er :) [21:00:54] RECOVERY - LVS HTTP IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 27848 bytes in 0.221 second response time [21:00:55] ha [21:01:44] PROBLEM - Host payments1003 is DOWN: PING CRITICAL - Packet loss = 100% [21:01:50] greg-g: hmm, office.wm.org works for me [21:02:05] PROBLEM - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.380 second response time [21:02:16] andre__: yeah, you aren't in asia/west coast :) [21:02:21] andre__: it was ulsfo [21:02:54] greg-g: ahah oh well makes sense [21:03:04] RECOVERY - Varnish traffic logger on cp4006 is OK: PROCS OK: 2 processes with command name varnishncsa [21:03:05] RECOVERY - RAID on cp4006 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:03:05] RECOVERY - Disk space on cp4006 is OK: DISK OK [21:03:05] RECOVERY - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 27898 bytes in 0.449 second response time [21:03:14] RECOVERY - Varnish HTTP upload-frontend on cp4006 is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 0.148 second response time [21:03:16] :) [21:03:24] RECOVERY - Varnish HTCP daemon on cp4006 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [21:03:25] RECOVERY - puppet disabled on cp4006 is OK: OK [21:03:54] RECOVERY - DPKG on cp4006 is OK: All packages OK [21:03:55] I think we shall need to have /words/ with UL. [21:04:05] RECOVERY - Varnish HTCP daemon on cp4020 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [21:04:05] RECOVERY - Varnish traffic logger on cp4020 is OK: PROCS OK: 2 processes with command name varnishncsa [21:04:14] RECOVERY - Varnish HTTP mobile-frontend on cp4020 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 0.151 second response time [21:04:15] RECOVERY - RAID on cp4020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:04:24] RECOVERY - puppet disabled on cp4020 is OK: OK [21:04:25] RECOVERY - DPKG on cp4020 is OK: All packages OK [21:04:25] RECOVERY - Disk space on cp4020 is OK: DISK OK [21:04:54] RECOVERY - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27898 bytes in 0.459 second response time [21:05:04] RECOVERY - Varnishkafka log producer on cp4012 is OK: PROCS OK: 1 process with command name varnishkafka [21:05:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [21:05:24] RECOVERY - Host payments1003 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [21:06:04] RECOVERY - Varnish HTTP text-backend on cp4008 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.147 second response time [21:06:24] PROBLEM - NTP on cp4013 is CRITICAL: NTP CRITICAL: Offset unknown [21:06:24] PROBLEM - NTP on cp4012 is CRITICAL: NTP CRITICAL: Offset unknown [21:06:25] PROBLEM - NTP on cp4018 is CRITICAL: NTP CRITICAL: Offset unknown [21:06:25] PROBLEM - NTP on cp4007 is CRITICAL: NTP CRITICAL: Offset unknown [21:06:46] Timestamp = 2014-02-14 19:42:14 [21:06:46] Message = System is turning off. [21:06:46] FQDD = iDRAC.Embedded.1#HostPowerCtrl [21:06:54] PROBLEM - NTP on cp4005 is CRITICAL: NTP CRITICAL: Offset unknown [21:06:54] PROBLEM - NTP on cp4011 is CRITICAL: NTP CRITICAL: Offset unknown [21:06:55] Coren: a few of them, of the strong variety [21:06:56] hey [21:06:57] what's going on? 
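(The "Offset unknown" NTP criticals above are a normal cold-boot symptom: ntpd needs a few polling intervals after startup before it selects a sync peer and can report an offset at all, which is why these alerts clear on their own a few minutes later. A sketch of checking peer state by hand; the internal hostname is an assumption:)

    # Query a host's NTP peer table; an asterisk in the first column
    # marks the selected sync peer. Until one appears, the monitoring
    # check has no offset to report.
    ntpq -p cp4013.ulsfo.wmnet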
[21:06:59] I'm out [21:07:04] PROBLEM - NTP on cp4010 is CRITICAL: NTP CRITICAL: Offset unknown [21:07:05] PROBLEM - NTP on cp4019 is CRITICAL: NTP CRITICAL: Offset unknown [21:07:05] PROBLEM - NTP on bast4001 is CRITICAL: NTP CRITICAL: Offset unknown [21:07:05] PROBLEM - NTP on cp4015 is CRITICAL: NTP CRITICAL: Offset unknown [21:07:06] with no access to my SSH keys [21:07:13] paravoid_: we failed over dns to eqiad from ulsfo, it just dropped [21:07:17] paravoid: We lost all power to ULSFO for a while. [21:07:19] no clear real cause yet, maybe power? [21:07:24] RECOVERY - Varnish traffic logger on cp4011 is OK: PROCS OK: 2 processes with command name varnishncsa [21:07:41] greg-g: Definitely power according to the RAC logs. [21:07:46] * greg-g nods [21:07:46] k [21:07:54] RECOVERY - puppet disabled on cp4011 is OK: OK [21:07:57] did anyone call united layer? [21:08:04] RECOVERY - Varnish HTTP mobile-frontend on cp4011 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 0.151 second response time [21:08:05] RECOVERY - DPKG on cp4011 is OK: All packages OK [21:08:05] RECOVERY - Disk space on cp4011 is OK: DISK OK [21:08:05] RECOVERY - Varnish HTCP daemon on cp4011 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [21:08:05] RECOVERY - RAID on cp4011 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:08:09] I dunno about the layout enough to figure out if we lost one rack or both? [21:08:14] PROBLEM - Host payments1002 is DOWN: PING CRITICAL - Packet loss = 100% [21:08:29] "Hey, WMF here; 3.4 billion users couldn't access Wikipedia for 10 minutes. Cheers." [21:08:32] I believe we lost all machines [21:08:46] RobH: ping? [21:09:03] List of things you expect your DC to not do: lose all power without warning on all your gear. [21:09:04] RECOVERY - DPKG on lvs4003 is OK: All packages OK [21:09:13] is Rob around? [21:09:24] RECOVERY - Disk space on lvs4003 is OK: DISK OK [21:09:25] RECOVERY - RAID on lvs4003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:09:27] RobH was still sick, I think, though better. [21:09:34] PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.365 second response time [21:09:54] RECOVERY - puppet disabled on lvs4003 is OK: OK [21:09:54] RECOVERY - NTP on cp4005 is OK: NTP OK: Offset -0.0002267360687 secs [21:09:54] RECOVERY - NTP on cp4011 is OK: NTP OK: Offset 0.0005394220352 secs [21:10:05] RECOVERY - NTP on cp4015 is OK: NTP OK: Offset 0.0004215240479 secs [21:10:05] RECOVERY - NTP on cp4019 is OK: NTP OK: Offset 0.0003026723862 secs [21:10:05] RECOVERY - NTP on cp4010 is OK: NTP OK: Offset -0.000821352005 secs [21:10:07] cp4xx in ganglia say uptime 119 days [21:10:11] monitoring flake? 
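(Rather than eyeballing boxes one at a time, the uptime question can be answered for the whole site in one shot from the salt master. A minimal sketch, reusing the cp4* naming from the alerts above:)

    # Ask every ulsfo cache host for its real uptime in one command.
    # Freshly rebooted machines will report minutes; anything still
    # claiming 119 days would point at stale monitoring data instead.
    salt 'cp4*' cmd.run 'uptime'

(Ganglia's 119-day figure turns out to be exactly that kind of staleness, as the manual checks just below confirm.)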
[21:10:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [21:10:24] RECOVERY - NTP on cp4013 is OK: NTP OK: Offset -0.0003784894943 secs [21:10:24] RECOVERY - NTP on cp4012 is OK: NTP OK: Offset 0.0007705688477 secs [21:10:25] RECOVERY - Host payments1002 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [21:10:25] RECOVERY - NTP on cp4018 is OK: NTP OK: Offset 0.0005210638046 secs [21:10:25] RECOVERY - NTP on cp4007 is OK: NTP OK: Offset -0.003658771515 secs [21:10:28] paravoid_: no, I definitely had bad gateway [21:10:34] RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 566 bytes in 0.392 second response time [21:10:42] paravoid_: nvm me [21:11:07] 17 minutes according to log + Jorm's OK [21:11:13] root@cp4011:~# uptime [21:11:13] 21:11:02 up 22 min, 1 user, load average: 0.08, 0.19, 0.13 [21:11:33] I think ganglia just reports what it sees and not what the box says. [21:11:47] paravoid_: yeah ganglia's wrong, they all have short uptimes on manual check [21:15:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [21:15:24] PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.384 second response time [21:15:41] can people call UL to see what happened? [21:15:45] Jeff_Green: btw, you said all the payments stuff was 'just you', still the case? [21:15:47] or, preferably, call RobH and have him do that? [21:16:11] paravoid_: brandon is trying one number, he doesn't have an access code though, we'll see how far he gets [21:16:19] * greg-g plays cross channel relay [21:16:34] heh, sorry, I can't join other channels from here :) [21:16:43] yeah, no worries :) [21:16:46] we have ops people in SF, don't we [21:17:12] other than sick RobH, jgage is next. [21:17:24] RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 564 bytes in 0.375 second response time [21:17:29] or contract Leslie-Carr :) [21:17:43] paravoid_: We're surprisingly short on opsen physically at the office nowadays. [21:17:50] I am aware [21:18:04] RECOVERY - Varnish HTTP upload-frontend on cp4014 is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 0.147 second response time [21:18:14] RECOVERY - Varnish HTTP upload-backend on cp4015 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.149 second response time [21:18:15] RECOVERY - Varnish HTTP upload-frontend on cp4015 is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 0.150 second response time [21:18:15] RECOVERY - Varnish HTCP daemon on cp4015 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [21:18:24] RECOVERY - Varnish HTTP mobile-frontend on cp4019 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 0.144 second response time [21:18:24] RECOVERY - Disk space on cp4019 is OK: DISK OK [21:18:25] RECOVERY - Varnish HTCP daemon on cp4019 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [21:18:25] RECOVERY - Varnish traffic logger on cp4015 is OK: PROCS OK: 2 processes with command name varnishncsa [21:18:25] RECOVERY - Varnish traffic logger on cp4019 is OK: PROCS OK: 2 processes with command name varnishncsa [21:18:25] RECOVERY - Varnish traffic logger on cp4014 is OK: PROCS OK: 2 processes with command name varnishncsa [21:18:53] alright, so, we're failed over to eqiad, thus we're back to where we were 3 weeks ago, so I'm going to tell kaldari he can deploy his decently important bug fix soon.
Unless I hear yelling. [21:18:54] RECOVERY - Disk space on cp4015 is OK: DISK OK [21:18:54] RECOVERY - puppet disabled on cp4019 is OK: OK [21:19:04] RECOVERY - puppet disabled on cp4015 is OK: OK [21:19:04] RECOVERY - DPKG on cp4019 is OK: All packages OK [21:19:05] RECOVERY - DPKG on cp4015 is OK: All packages OK [21:19:05] RECOVERY - RAID on cp4015 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:19:05] RECOVERY - RAID on cp4019 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:19:14] RECOVERY - Varnish HTTP text-frontend on cp4017 is OK: HTTP OK: HTTP/1.1 200 OK - 198 bytes in 0.149 second response time [21:19:24] RECOVERY - Varnish traffic logger on cp4017 is OK: PROCS OK: 2 processes with command name varnishncsa [21:20:04] RECOVERY - Varnish traffic logger on cp4005 is OK: PROCS OK: 2 processes with command name varnishncsa [21:20:10] paravoid_: I talked to ulsfo briefly on the phone, they did confirm a "power event" [21:20:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [21:20:15] RECOVERY - Varnish HTTP upload-backend on cp4005 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.150 second response time [21:20:15] RECOVERY - Varnish HTTP upload-frontend on cp4005 is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 0.145 second response time [21:20:42] they claim everything's kosher now and it's sorted out, but, it would be nice to get more-official confirmation of the situation than just "whatever this guy said to me on the phone with no identification at all" [21:21:07] eh, what about their redundancy? [21:22:22] I'm going to sign off now, since I'm not at home anyway [21:22:23] MaxSem: "don't ask, don't tell" [21:22:29] paravoid_: ok, thanks for checking in [21:22:51] please try to get a proper postmortem from UL [21:22:59] what happened and if we're sure it won't happen again [21:23:16] probably better to wait for robh to get back [21:23:35] it's not any hurry now, but it'd be nice to get this information before tuesday [21:24:00] twkozlowski: another one for the technews ^^ :) "The ULSFO caching center went offline momentarily causing access to all Wikimedia hosted sites to fail for Oceania and the West Coast of the US for around 10(?) minutes." [21:24:33] don't forget southeast/east asia [21:25:11] and western territories of canada [21:25:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [21:25:21] s/US/North America/ [21:25:27] * jgage returns from lunch to see what all the alerts are about [21:25:31] * greg-g is a us-centric jerk [21:25:52] "On February 14, all sites were broken for about 15 minutes for users in Southeast Asia and western North America due to a cache server problem." [21:26:10] ulsfo gobbledygook [21:26:21] "caching center"? [21:26:32] twkozlowski: you forgot oceania [21:26:36] :) [21:26:44] anyway [21:26:46] bye now [21:26:48] later [21:26:57] twkozlowski, not "cache server problem" but whole bloody datacenter [21:27:03] :) [21:27:07] what Max said [21:27:08] yes, cache servers [21:27:32] and LVS! 
[21:27:38] ;) [21:27:49] try explaining that to your average non-geeky Wikipedian ;-) [21:28:27] !log Upgraded and restarted elasticsearch on logstash1002 [21:28:31] * twkozlowski notes with some irritation that he still doesn't know what the Feb 11 outage was about and how long it was [21:28:34] Logged the message, Master [21:28:59] greg-g: we've taken it off Tech News #8 temporarily because of lack of info [21:29:14] !log Upgraded and restarted elasticsearch on logstash1001 [21:29:25] Logged the message, Master [21:29:47] twkozlowski: for now, you could say something like "due to database issues" [21:30:21] I poked the people investigating, but there's (I believe) still ongoing investigation on root cause [21:30:33] !log Upgraded and restarted elasticsearch on logstash1003 [21:30:40] Logged the message, Master [21:30:56] twkozlowski: so, I apologize, and feel your frustration [21:31:04] I deeply sympathize [21:31:07] ;) [21:31:10] bd808: props for !logging verbosely [21:32:06] is there documentation on where/how to view the RAC logs? i'm not finding it. [21:32:21] RAC? [21:32:34] mentioned in scrollback as confirming power outage [21:32:44] ah, dunno [21:32:46] out of band management thingos [21:33:01] greg-g: database issues sounds good; how long did it take? [21:33:56] twkozlowski: not sure, started at 2014-01-11 22:10 UTC [21:34:04] RECOVERY - Varnish HTTP upload-backend on cp4006 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.149 second response time [21:34:24] RECOVERY - Varnishkafka log producer on cp4020 is OK: PROCS OK: 1 process with command name varnishkafka [21:34:54] * twkozlowski checks channel logs yet again [21:35:32] greg-g: 02-11 surely :) [21:35:59] Did something funky happen to the job queue about 24 hours ago? There are 6 a/v files queued to be transcoded that never happened (Not something that causes a problem since they can be easily restarted, just curious) [21:36:25] twkozlowski: heh, yeah, that was a copy paste from the initial email :) [21:37:19] twkozlowski: on a side note because this discussion is reminding me, thank you so much for all of your work on the tech news. Hugely helpful. [21:37:24] RECOVERY - Varnishkafka log producer on cp4011 is OK: PROCS OK: 1 process with command name varnishkafka [21:37:35] sounds like nobody has contacted UL yet? i'll look into how to do that. [21:38:14] * twkozlowski hugs Jamesofur [21:38:22] Jamesofur: ++ [21:38:23] https://meta.wikimedia.org/w/index.php?title=Tech/News/2014/08&diff=7496055&oldid=7496052 greg-g [21:38:27] :D /hugs [21:39:29] * jgage is calling UL [21:39:29] ditto, thanks much twkozlowski [21:39:51] jgage: brandon called, we got a quick message [21:40:08] * ori cheers jgage on [21:40:38] jgage: to be clear, he didn't get much, if you can get more, please plese do [21:40:47] greg-g: so you see that's all the info I have right now; not much to go with [21:40:58] all they said is power outage, restored, email has been sent. "UnitedLayer SF7 Power Event on 2/14/14 [21:41:03] " to ops@ [21:41:27] twkozlowski: that made me laugh, then frown [21:41:29] greg-g: edits welcome, and encouraged :) [21:41:44] * greg-g looks at his IRC logs [21:42:01] the gist of which is "At approximately 12:50 PM today during a routine maintenance of a UPS unit in SF7, a power event caused the UPS to drop load to some customer circuits. 
We are still looking into the root cause and impact" [21:42:21] some meaning "all" in our case :P [21:43:07] that's a pretty nasty power event [21:43:14] !log logstash fatalmonitor dashboard working again after restarts to backend [21:43:21] Logged the message, Master [21:44:23] (03PS12) 10BBlack: Handle HTTPS for Zero traffic [operations/puppet] - 10https://gerrit.wikimedia.org/r/102316 (owner: 10Yurik) [21:44:23] Clearly wasn't routine if they managed to flub it that hard. [21:45:06] we're not dual-feed? [21:45:11] (03CR) 10BBlack: "^ PS12 is just a manual rebase onto all the other recent changes" [operations/puppet] - 10https://gerrit.wikimedia.org/r/102316 (owner: 10Yurik) [21:46:05] !log disabled puppetd on logstash1002 to test ganglia monitor fix [21:46:14] Logged the message, Master [21:48:54] RECOVERY - Varnishkafka log producer on cp4019 is OK: PROCS OK: 1 process with command name varnishkafka [21:50:04] none of the ulsfo power strips are in librenms :\ [21:51:15] * jgage makes RT 6850 about that [21:51:26] twkozlowski: looking at my irc logs, looks like that one was ~20 minutes [21:51:55] Good greg-g. Now we only need to know what happened :) [21:52:22] since it was a Parsoid problem, did it affect VE editing somehow? [21:52:31] twkozlowski: ohhhh, that was separate [21:52:34] :) [21:52:39] there were two right around the same time :) [21:52:55] the one I've been talking about, and then the parsoid one, gwicke should be getting that posted soon [21:54:08] twkozlowski, what I heard is that people got an error in about 1 out of 20 requests [21:55:47] as in page loads, similar to what happened on Feb 13? [21:55:51] [crazy week, eh?] [21:56:45] twkozlowski, as in VE edits [21:56:56] parsoid does not affects normal page loads [21:57:02] *affect [21:57:19] gwicke: greg says that's not what he had in mind :) [21:57:34] so first outage: 20 minutes due to database issues, errors for 1 out of 20 requests [21:57:48] then on the same day, around the same time, problems with VE edits due to ? [21:57:50] twkozlowski: s/, errors for 1 out of 20 requests// [21:58:01] that errors 1 in 20 was for the parsoid one [21:58:19] boy would this be clearer if we had finished incident reports on wiki [21:58:19] okay [21:58:22] * greg-g smikes [21:58:27] * greg-g smiles, even [21:58:53] * gwicke feels the pressure [21:59:26] gwicke: so errors = couldn't save edits? [21:59:47] gwicke: not just you :) [22:00:25] twkozlowski, or got an error when loading content into VE [22:00:54] On February 11, users experienced problems with VisualEditor for about 20 minutes due to database issues." [22:01:02] perhaps not quite... [22:01:11] this is a highly unscientific number, quoted from what I remember about the IRC backlog- I think Eloquence did some manual testing at the time [22:01:15] The Parsoid cluster outage? [22:01:18] - database issues [22:01:40] gwicke: the parsoid thing wasn't caused by db stuff was it? [22:02:03] no, that was log files filling up the disk and taking out about 3/4 of the parsoid nodes [22:02:27] * gwicke writes up a report [22:02:35] :) [22:02:39] that's what I thought [22:02:52] twkozlowski: Oh, I should write VE items for Tech/News shouldn't I? [22:03:04] James_F: We did earlier today [22:03:13] Eurgh. Wrongly. [22:03:18] https://meta.wikimedia.org/wiki/Tech/News/2014/08 [22:03:24] see, that's the problem with VE [22:03:33] You add stuff on a Friday evening :-) [22:03:44] Thursday evening from our POV. :-P [22:03:48] Forthcoming change !!!!== "you can now".
[22:03:57] (03PS1) 10Ori.livneh: Migrate EventLogging to "%{..}x"-style format specifiers [operations/puppet] - 10https://gerrit.wikimedia.org/r/113470 [22:04:09] * James_F fixes. [22:04:31] oh, that's because of bug status in BZ [22:04:41] Yeah. FIXED !== DEPLOYED. [22:04:45] (If only we had that state.) [22:05:02] (03PS1) 10BryanDavis: Make elasticsearch ganglia monitor compatible with logstash [operations/puppet] - 10https://gerrit.wikimedia.org/r/113471 [22:05:14] RECOVERY - Varnish HTTP mobile-backend on cp4020 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.151 second response time [22:05:37] yay greg-g, thanks [22:06:32] James_F: for site requests, I know that people keep their bugs open until their patches are deployed [22:06:55] but I guess to each their own, maybe that won't work for you [22:06:56] (03CR) 10Ori.livneh: [C: 032] Migrate EventLogging to "%{..}x"-style format specifiers [operations/puppet] - 10https://gerrit.wikimedia.org/r/113470 (owner: 10Ori.livneh) [22:07:06] (03PS2) 10Ori.livneh: eventlogging: manage /etc/eventlogging.d recursively [operations/puppet] - 10https://gerrit.wikimedia.org/r/113277 [22:07:12] (03CR) 10Ori.livneh: [C: 032 V: 032] eventlogging: manage /etc/eventlogging.d recursively [operations/puppet] - 10https://gerrit.wikimedia.org/r/113277 (owner: 10Ori.livneh) [22:07:12] twkozlowski: For site requests merged and deployed are the same state (or should be unless someone screwed up). [22:07:45] True. [22:08:06] (03CR) 10BryanDavis: "Tested by manual application on logstash1002. Needs to be verified not to disrupt reporting on the Search cluster." [operations/puppet] - 10https://gerrit.wikimedia.org/r/113471 (owner: 10BryanDavis) [22:08:56] James_F: I'd introduce a DEPLOYED status in Bugzilla if we had an automatic way to set it :) [22:09:40] andre__: I'd be delighted to set it manually for VE. [22:09:47] andre__: Others might not want to. :-) [22:09:51] hmm. [22:09:55] andre__: That would be cool. Want to add it to the deploy system wish list at https://etherpad.wikimedia.org/p/DeploymentSystemRequirements? [22:10:10] andre__: Also, https://www.mediawiki.org/wiki/MediaWiki_1.23/wmf14/Changelog#VisualEditor from Reedy is a good opportunity to set the flag… [22:10:12] James_F, DEPLOYED_PHASE1, DEPLOYED_PHASE2, and DEPLOYED_PHASE3? :P [22:10:46] DEPLOYED_TO_ONE_PRODUCTION_BRANCH_BUT_NOT_THE_OTHER, etc. [22:10:46] andre__: I think DEPLOYED_TEST, DEPLOYED_SOME, DEPLOYED_ALL, and RELEASED (for MW releases) would make more sense. [22:10:53] ah, phases are now called groups on https://www.mediawiki.org/wiki/MediaWiki_1.23/Roadmap [22:11:04] James_F, hmm, good point. [22:11:18] andre__: Can we have a multi-level state like for "RESOLVED"? [22:11:26] :( [22:11:32] James_F, no :( [22:11:33] andre__: So "DEPLOYED" / "ALL" [22:11:35] Boo. [22:11:41] * James_F grumps about Bugzilla. [22:11:49] Hmm. That would be interesting, yeah [22:11:59] I'd rather not tie those BZ status to explicit stages of deployment that may change (probably will) [22:12:13] <^d> greg-g: +1 [22:12:16] greg-g: Hence TEST vs. ALL. [22:12:18] I'd rather like a "this gerrit change fixed it, click here to see where it lives" [22:12:30] <^d> Again, greg-g+1 [22:12:30] greg-g: We already have that – but not in Bugzilla. [22:12:46] greg-g: ("Included in" in gerrit.) [22:12:48] James_F: no we don't, we just have what branch it is in, which is not a 1:1 for deployed [22:12:58] wmf13 is where, again? :P [22:13:12] greg-g: Right now? Phases 1 and 2 but not 0. 
there needs to be a bit more logic before that's helpful [22:13:19] James_F: and you know that through? [22:13:26] greg-g: [[Deployments]]. :-) [22:13:27] gerrit? BZ? no, a by-hand wiki page :) [22:13:53] Actually, I know that from grrrit-wm's reports of Reedy's merges. [22:13:59] so yeah, one can piece things together, but Yuvi is working on a 'dashboard' for lack of a better word that'll help with that [22:14:05] James_F: :P [22:14:07] Oh, he is? [22:14:14] I should send him my thoughts. [22:14:18] well, he started, then got distracted, it was a weekend thing for him [22:14:19] greg-g: https://noc.wikimedia.org/conf/highlight.php?file=wikiversions.dat [22:14:24] greg-g: What? I stalk grrrit-wm a lot. :-) [22:14:24] bd808: :P [22:15:12] https://commons.wikimedia.org/wiki/File:PersonalDashboard_v1.jpg <- my 10 second thoughts on what the dashboard should be like [22:15:14] twkozlowski: OK if I mark https://meta.wikimedia.org/wiki/Tech/News/2014/08 for translation? [22:15:14] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-db1047 [22:15:21] No! [22:15:25] bd808: ^^ [22:15:31] * ori handles EL stuff [22:15:38] James_F: I'd wait for Guillaume to wake up :-) [22:15:43] thanks or [22:15:46] i [22:15:54] He's got a real knack for simplifying messages [22:16:01] twkozlowski: Bah. All those translators working on bad messages. :-( [22:16:20] twkozlowski: https://wikitech.wikimedia.org/wiki/Incident_documentation/20140211-Parsoid#Summary [22:16:29] greg-g: That's cute. I like it. [22:16:49] bd808: thanks dear [22:17:24] greg-g: We'd probably want some hooks into new BZ tickets tagged against something. [22:17:25] thanks much gwicke [22:17:29] greg-g: No idea how that would work. [22:17:34] James_F: yeah /me shrugs on that [22:17:34] You can now convert block images between some different types (like thumbnail, framed and frameless). [22:17:43] James_F: block images? [22:17:46] twkozlowski: Yes. [22:17:57] Q: What are block images? [22:18:15] Images (actually, media item transclusions, but I was simplifying) that are blocks, not inline. [22:18:24] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [22:18:33] The examples of the types follow the clause. [22:19:04] James_F: I had a look at it last week [22:19:07] greg-g: Some metrics on performance, not just fatals, would be good too. [22:19:30] James_F: so it's kind of file properties like thumbnail, frame, etc [22:19:51] twkozlowski: When last week? IIRC the code only merged on Friday. [22:19:51] James_F: let me know when we have those :) [22:20:14] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are running. [22:20:18] greg-g: Dan's working on some for VE right now. Not split by phase, though. :-( [22:20:21] James_F: when I reviewed VE's product on BZ :) [22:20:31] Dan who? [22:21:07] twkozlowski: Given that I write https://www.mediawiki.org/wiki/VisualEditor/status specifically so that people know what's in each VE release, maybe read that instead to avoid wasting your time? :-) [22:21:19] greg-g: Garry. [22:22:12] James_F: ..... from where? [22:22:15] !log enabled puppetd on logstash1002 [22:22:22] Logged the message, Master [22:22:36] James_F: That's posted every two weeks; Tech News is a weekly [22:22:48] twkozlowski: No, it's posted weekly.
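(The wikiversions.dat file linked a few lines up is the authoritative answer to "wmf13 is where?": it maps every wiki database name to the MediaWiki branch it currently runs. A sketch of querying it from a deployment host; the file path and the exact "<dbname> php-<version>" line format are assumptions here:)

    # List wikis currently pinned to the 1.23wmf13 branch.
    # The path varies by deployment setup; adjust as needed.
    grep 'php-1.23wmf13' /srv/mediawiki/wikiversions.dat | awk '{print $1}' | head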
[22:22:54] James_F: in any case, Guillaume's the one who's doing VE items [22:22:58] !log manually applied Ie56d3a5 on logstash100[123] hosts and restarted gmond [22:23:05] Logged the message, Master [22:23:10] 2013-01-16, then 2013-01-30, that's two weeks [22:23:11] twkozlowski: More specifically, it's posted every week that MW deploys. [22:23:50] "You can now change file properties like thumbnail and frame with VisualEditor." [22:23:55] James_F: ^^ ? [22:24:07] twkozlowski: They're not called "file properties". [22:24:12] twkozlowski: That's just… confusing. [22:24:54] twkozlowski, in case you also want to report on Parsoid: https://www.mediawiki.org/wiki/Parsoid/Deployments [22:25:12] twkozlowski: Parsoid is probably more interesting in some ways. [22:26:10] MW manual says 'file format' [22:30:07] !log kaldari synchronized php-1.23wmf14/extensions/VectorBeta/ 'sync update for VectorBeta on wmf14' [22:30:16] twkozlowski: It does? [22:30:16] Logged the message, Master [22:31:03] Yeah; I didn't even know that ;) [22:32:32] James_F: can you OK this? https://meta.wikimedia.org/w/index.php?title=Tech/News/2014/08&diff=7496614&oldid=7496373 [22:34:01] twkozlowski: "with VisualEditor" seems superfluous when it's in the VE section. :-) [22:34:04] RECOVERY - ElasticSearch health check on logstash1003 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 35: active_shards: 70: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [22:34:05] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 35: active_shards: 70: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [22:34:14] W00t! ^^ [22:34:25] RECOVERY - ElasticSearch health check on logstash1002 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 35: active_shards: 70: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [22:34:28] bd808: Yay? [22:34:33] Hopefully the damn thing will stay up now [22:34:38] Hopefully. :-) [22:36:37] Something "bad" happened last night that got the elasticsearch nodes behind logstash wedged up against their max jvm heap size. [22:37:01] twkozlowski: Better? [22:37:27] is it images or files? [22:38:19] You will soon be able to create and edit redirect pages suggests you can't right now [22:38:42] twkozlowski: It's "media items". [22:38:45] twkozlowski: But that sucks. [22:39:25] twkozlowski: Well, you can't (in VisualEditor). The entire section is about VE. 
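(For context on the elasticsearch restarts being !logged above: the "status: green" in the recovery messages is the cluster health that the icinga check wraps, and the heap problem bd808 describes is visible through the same HTTP API. A minimal sketch, assuming the endpoints are reachable locally and hedged against version differences in the stats layout:)

    # Check cluster health and per-node JVM heap pressure over the
    # plain HTTP API; a node wedged against its max heap will show
    # heap_used_percent pinned near 100.
    curl -s 'http://localhost:9200/_cluster/health?pretty'
    curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' | grep -A2 heap_used_percent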
[22:40:19] * twkozlowski is skeptical [22:40:57] James_F: if you translate stuff, the message doesn't say it's in the VE section :) [22:41:03] s/if/when [22:41:30] twkozlowski: it should, add to qqq if relevant for understanding [22:45:44] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC [22:45:58] (03PS2) 10Chad: Remove old public key [operations/puppet] - 10https://gerrit.wikimedia.org/r/113326 [22:53:16] !log kaldari synchronized php-1.23wmf13/extensions/VectorBeta/ 'sync update for VectorBeta on wmf13' [22:53:26] Logged the message, Master [22:57:34] PROBLEM - Disk space on labstore4 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error [23:00:57] hoi Reasonator gives me Bad Request [23:00:58] Your browser sent a request that this server could not understand. [23:01:20] GerardM-: no known issues currently... [23:01:37] are you sure you're sending a valid request? :) [23:01:42] yes [23:01:47] Magnus has the same issue [23:01:49] try http://tools.wmflabs.org/ [23:02:01] try reasonator.info [23:02:07] ( I should say, no known production cluster issues right now) [23:02:11] oh [23:02:15] huh, that's nice [23:02:17] Coren: ^^ [23:02:32] cmjohnson1: good timing, anything stupid in tampa right now that would affect wmflabs? [23:03:37] greg-g: there was a fpl wave down earlier [23:03:38] ori, who would be the right person to tweak the disk space alert for parsoid hosts? [23:03:47] cmjohnson1: that's over though, right? [23:04:19] gwicke: whomever is on RT duty [23:04:19] s/over/fixed/ [23:04:39] ori, ok; apergos: ping [23:04:44] * cmjohnson1 checking [23:05:12] * apergos points out that at 1 am I'm not tweaking anything... (sorry) [23:05:16] hi [23:05:23] apergos: you shoulda stayed silent :) [23:05:33] the labs thing appears to have fs issues, something you can help with? [23:05:33] hehe [23:05:35] I shoulda but I was pinged and happened to pass by [23:05:44] Sveta: yeah, looking into it [23:05:50] someone in a real tz should have a look at it [23:06:23] yeah, just hard to pinpoint somebody [23:06:37] greg-g, Coren might be also looking into it [23:06:42] also gwicke as you know the logs are rotated once an hour now (at least last I checked they seemed to be doing that) [23:06:45] * gwicke tries to open an rt ticket [23:06:55] Sveta: yep, thanks much [23:07:00] apergos, I know, thanks for setting that up! [23:07:01] nod [23:07:27] it's only a stopgap. but it would take some serious log explosion for parsoid to fill up 6gb in an hour [23:07:40] apergos, just thought that it makes sense to lower the disk space threshold to something that gives us enough time to react [23:07:47] agree [23:08:33] greg-g: I don't see anything that says it's been fixed but if it were a problem for labs it would have been a problem all day. it's the ashburn to tampa link. I doubt very much that's it [23:08:35] cmjohnson1: btw, you're off the hook [23:08:38] heh, thanks much [23:08:47] greg-g thx!
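(The hourly rotation apergos mentions is the stopgap from the earlier Parsoid incident, where logs filled the disk. A minimal sketch of what such a setup could look like; the log path, rotation count and size cap are guesses, and since stock logrotate only runs daily from cron, the hourly cron entry is the part doing the real work:)

    #!/bin/bash
    # Rotate Parsoid logs aggressively and run logrotate every hour.
    # /var/log/parsoid/ is an assumed location for this sketch.
    cat <<'EOF' > /etc/logrotate.d/parsoid
    /var/log/parsoid/*.log {
        rotate 24
        size 500M
        missingok
        compress
        copytruncate
    }
    EOF
    cat <<'EOF' > /etc/cron.hourly/parsoid-logrotate
    #!/bin/sh
    /usr/sbin/logrotate /etc/logrotate.d/parsoid
    EOF
    chmod +x /etc/cron.hourly/parsoid-logrotate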
[23:10:54] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [23:16:34] RECOVERY - Disk space on labstore4 is OK: DISK OK [23:16:34] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms [23:29:17] apergos, https://rt.wikimedia.org/Ticket/Display.html?id=6851&results=bb1c6fbf4f72ec9a7649095081b3bfc2 [23:31:54] !log tools Rebooted labstore4 -- XFS done got broken agun [23:32:01] Logged the message, Master [23:32:01] thanks Coren [23:34:01] gwicke: https://github.com/wikimedia/operations-puppet/blob/production/modules/base/manifests/monitoring/host.pp#L68 [23:34:15] that won't be easy to customize [23:34:55] ori, k [23:35:26] we likely also have that data in ganglia, but then there is no way to define alerts on it [23:49:12] gwicke: on https://wikitech.wikimedia.org/wiki/Incident_documentation/20140211-Parsoid#Summary, is "I" you? [23:49:59] legoktm: no, that's Roan [23:50:01] let me fix [23:50:21] thanks [23:51:44] done [23:52:29] ty [23:54:56] (03PS1) 10BBlack: Move traffic back to ulsfo, reverts fa5fa2e8 [operations/dns] - 10https://gerrit.wikimedia.org/r/113479 [23:54:58] (03CR) 10BBlack: [C: 032 V: 032] Move traffic back to ulsfo, reverts fa5fa2e8 [operations/dns] - 10https://gerrit.wikimedia.org/r/113479 (owner: 10BBlack) [23:55:13] (03CR) 10TTO: "I am not sure, since you didn't really explain what was broken, but I suspect this is bug 61357?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113377 (owner: 10Legoktm) [23:59:04] !log aaron synchronized php-1.23wmf13/extensions/Math '9e75a1b' [23:59:11] Logged the message, Master [23:59:14] RECOVERY - Varnish HTTP text-backend on cp4018 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.152 second response time
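(On the threshold question in that RT ticket: the "Disk space" alerts seen throughout this log ultimately come down to the standard Nagios check_disk plugin, configured once for all hosts in the base monitoring puppet class linked above, which is why a per-role override "won't be easy to customize". The underlying check with a more conservative threshold would look roughly like this; the percentages are illustrative:)

    # Warn when less than 20% of the disk is free and go critical
    # below 10%, on local filesystems only, to buy reaction time
    # before a log explosion fills the disk.
    /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -l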