[00:00:23] Right, so gwicke's going first [00:00:43] I nominate Flow (who?) for second [00:00:47] I'll take sloppy thirds [00:01:04] alright, deploying [00:01:05] Unless the Flow team doesn't show up [00:01:22] spagewmf1: I assume you have something to do with this [00:01:26] gwicke: i can do it if no one from ops is around [00:01:31] and done [00:01:46] gwicke: should i restart? [00:01:58] ori, let's see if there are some that don't come back up [00:02:03] only those need to be restarted [00:02:07] okay [00:02:31] I'm kind of around :) [00:02:41] rdwrer or spagewmf1, you can go ahead [00:02:49] Right, flow is being slow flow [00:02:50] I'll go [00:02:53] paravoid, ahhhh.. ;) [00:03:03] it's a bit late [00:03:14] LIGHTNING DEPLOYYYYY [00:03:19] yeah, am I'm guessing [00:03:38] paravoid, could we chat about debs tomorrow? [00:03:53] rdwrer: I think our fix got in OK, let me check [00:03:58] I need to look at the emails first [00:04:21] k [00:04:42] I'd like to figure out the repo situation soon so that we can start publishing it [00:05:03] there is now testreduce (the rt test server), mathoid, parsoid and soon storoid [00:05:28] pdf renderer potentially too [00:06:10] rdwrer: good work on following the rules [00:06:50] greg-g: I am nothing if not lawful good [00:07:05] !log mholmquist synchronized php-1.23wmf13/extensions/MultimediaViewer/resources/mmv/mmv.lightboxinterface.js 'Fix for arrow keys in MultimediaViewer' [00:07:09] Lo, Thor - shine brightly on this deploy [00:07:13] Logged the message, Master [00:07:32] I'm done, just waiting for caches to clear so I can confirm [00:07:44] FLOWWWWWWWWWW TIIIIIIIIIME [00:07:51] rdwrer: thanks [00:08:10] * gwicke forgot to call service-restart [00:08:21] ori, restarting now [00:08:46] oh wow, salt does not emit broken output any more with one char every two lines [00:09:54] https://gist.github.com/gwicke/374c20f20efbcf4b4022 [00:10:14] looking into wtp1019 and wtp1016 [00:10:32] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused [00:10:38] gwicke: so apparently the fix for extreme Math slowness is to set $wgMathDisableTexFilter [00:10:40] * ori restarts wtp1015 [00:10:42] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused [00:10:50] does this make things worse off than the pre-refactored state?
[00:10:52] PROBLEM - Parsoid on wtp1014 is CRITICAL: Connection refused [00:11:03] PROBLEM - Parsoid on wtp1023 is CRITICAL: Connection refused [00:11:03] PROBLEM - Parsoid on wtp1021 is CRITICAL: Connection refused [00:11:03] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [00:11:03] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused [00:11:06] oh oh [00:11:09] pages that took like 2-3 seconds to parse can take like 26 [00:11:11] AGH [00:11:12] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [00:11:15] Guys I fucked it up [00:11:21] I read the wrong status update for mediawiki.org [00:11:23] people may have noticed ;) [00:11:24] rdwrer, don't worry about parsoid [00:11:32] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.002 second response time [00:11:33] I'll go after flow [00:11:35] it is unrelated / different cluster etc [00:11:42] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.007 second response time [00:11:43] gwicke: I think he means something else [00:11:45] rdwrer: almost there, waiting on zuul [00:11:51] greg-g, k [00:11:52] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [00:11:56] gwicke: i restarted it on all the ones that caused CRITICALs [00:12:02] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [00:12:03] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [00:12:03] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [00:12:03] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.006 second response time [00:12:04] this is a little ridiculous though [00:12:07] https://en.wikipedia.org/w/index.php?title=Real_projective_line&oldid=543928239 [00:12:13] ori, thanks [00:12:17] probably the worst perf regression in a while :( [00:12:43] AaronSchulz, my understanding is that caching is disabled completely? [00:13:15] ori, the upstart config looks all fine, and according to the docs upstart should send a kill after five seconds [00:13:18] it keeps shelling out to do syntax checks even if a png was made already [00:13:35] I have not been able to reproduce the behavior yet on non-prod machines [00:13:45] and can't test on prod machines [00:14:00] AaronSchulz: ah, that sounds stupid [00:14:16] !log ebernhardson synchronized php-1.23wmf13/extensions/Flow [00:14:19] rdwrer: all done, you're back up. [00:14:24] Logged the message, Master [00:14:37] gwicke: five seconds after what? [00:14:48] ori, five seconds after sigterm [00:14:58] if the process has not exited yet, it sends a sigkill [00:15:06] gwicke: and what is actually happening? [00:15:21] ori, hard to tell [00:15:23] it is sending sigterm, but no sigkill when the process fails to exit? [00:15:35] I doubt that [00:16:16] one theory is that re-forking of children throws off the upstart ptrace stuff [00:16:24] are you setting SO_REUSEADDR on the listening socket?
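
Some background on the re-forking theory gwicke floats here, before ori's SO_REUSEADDR question gets picked up below: upstart's `expect fork` and `expect daemon` stanzas trace a daemon's initial forks to learn which pid to supervise, and assume the fork count is fixed at startup. A master that forks replacement workers later can leave upstart supervising a worker's pid rather than the master's, so a restart sends SIGTERM to the wrong process. A minimal sketch of the pattern, not the actual Parsoid server code; pool size and port are invented:

```javascript
// Hypothetical sketch of the re-forking pattern under discussion, not
// the actual Parsoid server code; the pool size and port are invented.
// The master forks a pool of workers and forks a replacement whenever
// one dies.  Upstart only traces forks at startup, so the later forks
// are invisible to its pid tracking.
var cluster = require('cluster');
var http = require('http');

if (cluster.isMaster) {
    for (var i = 0; i < 4; i++) {
        cluster.fork();
    }
    cluster.on('exit', function (worker, code, signal) {
        console.log('worker ' + worker.process.pid + ' died, re-forking');
        cluster.fork();  // the kind of fork that throws off the tracking
    });
} else {
    http.createServer(function (req, res) {
        res.end('hello from worker ' + process.pid + '\n');
    }).listen(8000);
}
```

Upstart has no reliable way to follow those later forks, which is why the systemd/cgroups comparison comes up further down the thread.
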
[00:16:35] I verified that the expect setting is fine [00:16:36] LIGHTNING DEPLOY AGAAAAAAAAIN [00:16:42] Reedy: patch for that error here https://gerrit.wikimedia.org/r/#/c/113301/ [00:16:46] ori, afaik that is the node default [00:16:48] !log mholmquist synchronized php-1.23wmf14/extensions/MultimediaViewer/resources/mmv/mmv.lightboxinterface.js 'Fix for arrow keys in MultimediaViewer' [00:16:57] Logged the message, Master [00:17:05] gwicke: i don't think so; it was an issue with the statsd upstart job [00:17:06] Effin a [00:17:09] Still testing [00:17:21] OK we're good, /me done [00:17:24] whew [00:17:25] gwicke: it would try to restart, fail to bind the port, and exit [00:17:33] ebernhardson: you all tested and such? [00:17:35] statsd being a nodejs app [00:17:52] ori, http://nodejs.org/api/net.html#net_server_listen_port_host_backlog_callback [00:17:59] "Note: All sockets in Node set SO_REUSEADDR already" [00:18:20] greg-g: yup [00:18:27] cool [00:18:30] superm401: you're up [00:18:39] Alright, doing the cherrypick [00:19:12] ori, I have seen two copies of node running before after a restart, which made me suspect that upstart sometimes kills the wrong process [00:19:29] when a worker dies it is re-forked [00:19:45] maybe that throws off the upstart pid tracking [00:20:09] was not an issue with the init.d script [00:20:26] ahhh could very well be [00:21:22] the systemd folks mention that their cgroups stuff is more reliable for forking daemons [00:22:12] I might just revert to the new init script on Ubuntu if upstart continues to create issues [00:22:40] https://gerrit.wikimedia.org/r/#/c/110666/32/debian/parsoid.init [00:32:56] (PS1) Springle: remove es[123] for decom [operations/puppet] - https://gerrit.wikimedia.org/r/113306 [00:33:00] greg-g, I'm getting a sync-dir error: [00:33:08] mflaschen@tin:/a/common (master)$ sync-dir 'Sync for GENDER fix to jQueryMsg' php-1.23wmf13/resources/mediawiki/ [00:33:10] Target file is not a directory [00:33:14] ori, this is parsoid.log from wtp1001, which is not currently reachable: [00:33:15] https://gist.github.com/gwicke/4b17f1837258027fb392 [00:33:27] superm401: order of parameters :) [00:33:30] superm401: dir comes first [00:33:35] Doh [00:34:22] (CR) Springle: [C: 2] remove es[123] for decom [operations/puppet] - https://gerrit.wikimedia.org/r/113306 (owner: Springle) [00:34:35] !log mflaschen synchronized php-1.23wmf13/resources/mediawiki/ 'Sync for GENDER fix to jQueryMsg' [00:34:43] Logged the message, Master [00:36:14] !log mflaschen synchronized php-1.23wmf13/tests/qunit/suites/resources/mediawiki/mediawiki.jqueryMsg.test.js 'Sync for GENDER fix to jQueryMsg' [00:36:22] Logged the message, Master [00:40:29] !log mflaschen synchronized php-1.23wmf14/resources/mediawiki/ 'Sync for GENDER fix to jQueryMsg' [00:40:37] Logged the message, Master [00:41:35] (PS3) Yurik: Updated whitelisted language lists to match config [operations/puppet] - https://gerrit.wikimedia.org/r/113168 (owner: QChris) [00:41:51] ori, some more parsoids look unhappy [00:41:54] !log mflaschen synchronized php-1.23wmf14/tests/qunit/suites/resources/mediawiki/mediawiki.jqueryMsg.test.js 'Sync for GENDER fix to jQueryMsg' [00:41:58] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Parsoid%2520eqiad&tab=m&vn= [00:42:03] Logged the message, Master [00:42:14] Done, greg-g, sorry I ran over.
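
To unpack the SO_REUSEADDR exchange above: node does set SO_REUSEADDR on listening sockets by default, as the linked docs say, but that only lets a new process bind past sockets lingering in TIME_WAIT; it does not allow binding while the old process is still alive and listening. That matches ori's statsd anecdote, where the replacement process would "fail to bind the port, and exit". A minimal sketch of the defensive retry pattern from the node net docs, with the port and retry delay invented:

```javascript
// Minimal sketch of the EADDRINUSE retry pattern; port and delay are
// invented.  SO_REUSEADDR (node's default) covers TIME_WAIT sockets,
// not a live listener left behind by a botched restart, so the bind
// can still fail and, without an error handler, crash the process.
var net = require('net');

var server = net.createServer(function (socket) {
    socket.end('ok\n');
});

server.on('error', function (err) {
    if (err.code !== 'EADDRINUSE') {
        throw err;
    }
    console.log('port in use, retrying in 1s...');
    setTimeout(function () {
        server.listen(8000);
    }, 1000);
});

server.listen(8000);
```

A loop like this keeps the new instance alive long enough for the old one to die; it papers over a broken restart rather than fixing it.
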
[00:42:31] wtp1001, 1006, 1014, 1015, 1020, 1021, 1023 [00:43:36] only wtp1001 seems to be all down [00:44:12] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.007 second response time [00:44:55] i restarted them [00:45:01] but not doing that again, this is silly [00:45:04] ori, thanks! [00:46:08] (PS1) Spage: Add qa_automation group and grant it Flow rights [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113311 [00:46:09] it is definitely annoying, and not what I had hoped for by moving to upstart [00:46:35] Upstart? More like restart. [00:46:39] Gloria: Hush. [00:47:03] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [00:47:13] i restarted that one already [00:47:39] ori, based on the log my suspicion is that upstart is sending sigterm to random workers [00:48:09] it does not seem to send a sigkill [00:48:28] but at the same time some workers don't seem to exit in time before the restart [00:48:29] gwicke: as an ugly workaround, try having a pre-start clause that kills any lingering instances [00:48:51] which explains why the service is then wedged [00:49:59] yeah, but then moving to start-stop-daemon might actually be better [00:52:27] (PS1) Springle: x1 depool db1030 for maintenance [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113313 [00:52:57] (CR) Springle: [C: 2] x1 depool db1030 for maintenance [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113313 (owner: Springle) [00:53:20] (CR) Yurik: [C: -1] "I placed all the missmatched languages into patch https://gerrit.wikimedia.org/r/#/c/113168/ (there were a few more i found). I think we " [operations/puppet] - https://gerrit.wikimedia.org/r/113167 (owner: QChris) [00:53:57] !log springle synchronized wmf-config/db-eqiad.php 'x1 depool db1030 for maintenance' [00:54:05] Logged the message, Master [00:57:51] (PS1) Springle: reassign db1030 to s6 [operations/puppet] - https://gerrit.wikimedia.org/r/113314 [01:01:21] (CR) Springle: [C: 2] reassign db1030 to s6 [operations/puppet] - https://gerrit.wikimedia.org/r/113314 (owner: Springle) [01:06:38] gwicke: did you determine how git-deploy/salt is attempting to restart the process? [01:08:04] ori, salt is using some python module it seems [01:08:18] I reported a bug against it as it was preferring init.d over upstart [01:08:50] apparently it looks for the files itself [01:09:39] !log restarting EventLogging on vanadium [01:09:49] Logged the message, Master [01:15:26] !log xtrabackup clone db1010 to db1030 [01:15:39] Logged the message, Master [01:30:32] (CR) CSteipp: [C: 1] "Thanks" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113311 (owner: Spage) [01:42:26] (PS1) GWicke: Wait 60 seconds before killing the parsoid master [operations/puppet] - https://gerrit.wikimedia.org/r/113316 [01:47:09] (CR) GWicke: "Also see https://gerrit.wikimedia.org/r/#/c/113318/ for a related parsoid change" [operations/puppet] - https://gerrit.wikimedia.org/r/113316 (owner: GWicke) [02:22:19] springle: Have you seen bug 61319? It seems like the page table for enwiki on db1056 (maybe others too?) is somehow out of sync.
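
Pulling the parsoid restart thread above together: gwicke's reading is that upstart SIGTERMs a (possibly wrong) process, never follows up with SIGKILL, and some workers outlive the restart, which is why the service ends up wedged. The Gerrit change above, "Wait 60 seconds before killing the parsoid master", presumably lengthens upstart's kill timeout, the gap between SIGTERM and SIGKILL that gwicke quotes as five seconds by default. A hypothetical sketch of the node-side half of that contract, not the actual change: the master drains workers on SIGTERM and force-kills stragglers inside the grace period. Pool size, port, and the 55-second figure are invented.

```javascript
// Hypothetical sketch, not the actual Gerrit change: on SIGTERM the
// master stops re-forking, asks workers to drain, and SIGKILLs any
// stragglers just inside a 60-second upstart kill timeout.
var cluster = require('cluster');
var http = require('http');

if (cluster.isMaster) {
    var shuttingDown = false;

    cluster.on('exit', function (worker) {
        if (!shuttingDown) {
            cluster.fork();  // normal operation: replace dead workers
        } else if (Object.keys(cluster.workers).length === 0) {
            process.exit(0); // last worker gone, master exits cleanly
        }
    });

    for (var i = 0; i < 4; i++) {
        cluster.fork();
    }

    process.on('SIGTERM', function () {
        shuttingDown = true;
        Object.keys(cluster.workers).forEach(function (id) {
            cluster.workers[id].disconnect();  // let in-flight work finish
        });
        setTimeout(function () {
            Object.keys(cluster.workers).forEach(function (id) {
                cluster.workers[id].process.kill('SIGKILL');
            });
            process.exit(0);
        }, 55 * 1000);  // grace period, chosen to beat the 60s timeout
    });
} else {
    http.createServer(function (req, res) {
        res.end('ok\n');
    }).listen(8000);
}
```

With the master exiting promptly and predictably, a pre-start cleanup clause like the one ori suggests becomes a belt-and-braces measure rather than the main line of defence.
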
[02:22:40] (PS1) Aaron Schulz: Set Memcached retry_timeout to -1 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113321 [02:23:12] (PS2) Aaron Schulz: Set Memcached retry_timeout to -1 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113321 [02:23:33] (CR) Aaron Schulz: [C: 2] Set Memcached retry_timeout to -1 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113321 (owner: Aaron Schulz) [02:23:44] (Merged) jenkins-bot: Set Memcached retry_timeout to -1 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113321 (owner: Aaron Schulz) [02:24:53] !log aaron synchronized wmf-config/mc.php 'Set Memcached retry_timeout to -1' [02:25:02] Logged the message, Master [02:25:25] * AaronSchulz wonders if cygwin doesn't disable Nagle :s [02:26:49] anomie: no hadn't seen it. looking [02:27:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:03] !log LocalisationUpdate completed (1.23wmf13) at 2014-02-14 02:27:02+00:00 [02:27:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:09] Logged the message, Master [02:27:22] PROBLEM - Apache HTTP on srv300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:03] PROBLEM - Apache HTTP on mw122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:03] PROBLEM - Apache HTTP on srv277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:04] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:53] RECOVERY - Apache HTTP on srv277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.565 second response time [02:28:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.566 second response time [02:28:53] RECOVERY - Apache HTTP on mw122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.587 second response time [02:28:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.570 second response time [02:29:02] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.395 second response time [02:29:03] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.261 second response time [02:29:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:53] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.957 second response time [02:30:04] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:04] PROBLEM - Apache HTTP on mw113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:04] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:04] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:22] PROBLEM - Apache HTTP on mw61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:52] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301
Moved Permanently - 808 bytes in 0.967 second response time [02:31:02] PROBLEM - Apache HTTP on mw124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:31:03] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:31:03] PROBLEM - Apache HTTP on srv254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:31:12] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.030 second response time [02:31:22] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.066 second response time [02:31:53] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.062 second response time [02:32:02] RECOVERY - Apache HTTP on mw113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.113 second response time [02:32:03] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.228 second response time [02:32:03] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.280 second response time [02:32:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:32:22] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.748 second response time [02:32:30] (PS1) Springle: depol db1056 for pt-table-sync checks bug 61319 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113322 [02:32:53] RECOVERY - Apache HTTP on mw124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.971 second response time [02:32:53] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.051 second response time [02:32:54] (CR) Springle: [C: 2] depol db1056 for pt-table-sync checks bug 61319 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113322 (owner: Springle) [02:33:00] (Merged) jenkins-bot: depol db1056 for pt-table-sync checks bug 61319 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113322 (owner: Springle) [02:34:03] PROBLEM - Apache HTTP on mw122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:34:23] !log springle synchronized wmf-config/db-eqiad.php 'depool db1056 for pt-table-sync bug 61319' [02:34:31] Logged the message, Master [02:35:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.206 second response time [02:35:53] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.569 second response time [02:35:53] RECOVERY - Apache HTTP on mw122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.005 second response time [02:36:02] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.132 second response time [02:36:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:36:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:36:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:37:03] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.225 second response time [02:37:53] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.030 second response time [02:38:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket
timeout after 10 seconds [02:38:03] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:38:53] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.066 second response time [02:39:03] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.208 second response time [02:39:03] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.278 second response time [02:39:03] PROBLEM - Apache HTTP on mw124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:39:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:39:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.953 second response time [02:39:53] RECOVERY - Apache HTTP on mw124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.992 second response time [02:40:32] PROBLEM - Apache HTTP on mw62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:41:22] RECOVERY - Apache HTTP on mw62 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.100 second response time [02:42:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:42:35] that is a lot of things [02:43:03] PROBLEM - Apache HTTP on mw124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:43:22] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.208 second response time [02:43:53] RECOVERY - Apache HTTP on mw124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.535 second response time [02:44:50] bd808|BUFFER: I can't get graphs to go back very far in Kibana. Is there some low elastic query result size limit? 
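
On bd808's Kibana question just above: one plausible culprit is elasticsearch's default result size, since a search returns only 10 hits unless the request sets `size` explicitly. A hypothetical query to check that directly; the host, port and index name are invented, and the real logstash index naming may differ.

```javascript
// Hypothetical query against a logstash-style index showing the `size`
// parameter; elasticsearch defaults to 10 hits per search.  The host,
// port and index name below are made up for illustration.
var http = require('http');

var body = JSON.stringify({
    query: { match_all: {} },
    sort: [{ '@timestamp': { order: 'desc' } }],
    size: 500   // explicit; the default is 10
});

var req = http.request({
    host: 'logstash.example.org',
    port: 9200,
    path: '/logstash-2014.02.14/_search',
    method: 'POST',
    headers: {
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(body)
    }
}, function (res) {
    var data = '';
    res.on('data', function (chunk) { data += chunk; });
    res.on('end', function () {
        var hits = JSON.parse(data).hits;
        console.log(hits.hits.length + ' of ' + hits.total + ' events');
    });
});

req.end(body);
```
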
[02:45:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:02] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.245 second response time [02:46:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:53] PROBLEM - Apache HTTP on mw76 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:53] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.569 second response time [02:47:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:42] PROBLEM - Apache HTTP on srv299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:52] RECOVERY - Apache HTTP on mw76 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.163 second response time [02:48:02] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.098 second response time [02:48:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.205 second response time [02:48:22] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.280 second response time [02:48:32] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.012 second response time [02:50:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:50:03] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:03] PROBLEM - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:52] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.062 second response time [02:52:02] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.100 second response time [02:52:02] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.248 second response time [02:52:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:52:08] Attention everyone: I'm syncing a one-file change to MultimediaViewer to fix errors in prod. 
[02:52:53] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.066 second response time [02:53:53] RECOVERY - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 4405 bytes in 0.550 second response time [02:54:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:54:53] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.028 second response time [02:55:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:55:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:55:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:55:03] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:55:52] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.061 second response time [02:56:02] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.136 second response time [02:56:03] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.278 second response time [02:56:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:56:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:56:22] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.279 second response time [02:56:53] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.958 second response time [02:57:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:57:53] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.542 second response time [02:57:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.568 second response time [02:57:54] any roots around to poke a parsoid box? [02:58:02] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.242 second response time [02:58:03] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.276 second response time [02:58:05] (PS1) Chad: Remove old public key [operations/puppet] - https://gerrit.wikimedia.org/r/113326 [02:59:22] PROBLEM - Apache HTTP on srv300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:59:30] Are you getting hit, too, gwicke?
[02:59:38] I'm having a hell of a time sync-file'ing [02:59:51] rdwrer, nope [03:00:06] just a box that didn't restart correctly and is not accepting any traffic [03:00:12] jgage, ori bblack springle ^^ apaches flapping, also gwicke would like some help if you're around :) [03:00:27] !log mholmquist synchronized php-1.23wmf14/extensions/MultimediaViewer/resources/mmv/ui/mmv.ui.metadataPanel.js 'Fix for arrow keys in MultimediaViewer' [03:00:35] Logged the message, Master [03:00:42] relatively low prio compared to actual site breakage [03:01:03] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:22] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.116 second response time [03:01:42] PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:53] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.960 second response time [03:02:02] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.279 second response time [03:02:02] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:19] gwicke, not noticing any actual user-facing impact yet [03:02:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:31] it's all pmtpa stuff. don't know why [03:02:32] Eloquence, there is none [03:02:42] from the parsoid side at least [03:02:42] PROBLEM - Apache HTTP on mw73 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:46] hence low prio [03:02:50] yeah, I meant the apaches [03:03:02] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.095 second response time [03:03:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.205 second response time [03:03:05] ah, didn't check those [03:03:12] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.538 second response time [03:03:42] RECOVERY - Apache HTTP on mw73 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.420 second response time [03:03:50] so not paging any opsen just yet [03:04:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:04:11] enwiki seems to work fine [03:04:20] rdwrer, which branch is faulty? 
[03:04:32] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.062 second response time [03:05:50] gwicke: Not totally sure, this is above my pay grade [03:05:58] I just pushed to wmf14 [03:06:02] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.239 second response time [03:06:02] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:06:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:06:06] But I doubt that was the issue, it was an extension update [03:06:19] Anyway I'm off now [03:06:51] hmm [03:06:53] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.568 second response time [03:06:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.569 second response time [03:09:02] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:03] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:32] PROBLEM - Apache HTTP on mw64 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:53] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.569 second response time [03:10:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.031 second response time [03:11:22] RECOVERY - Apache HTTP on mw64 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.955 second response time [03:13:03] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:13:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:13:53] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.017 second response time [03:13:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.025 second response time [03:14:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:14:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:14:22] PROBLEM - Apache HTTP on srv300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:14:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:14:53] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.958 second response time [03:14:53] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.028 second response time [03:15:13] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.612 second response time [03:15:13] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.068 second response time [03:16:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.207 second response time [03:16:22] PROBLEM - Apache HTTP on srv292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:03] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:03] PROBLEM - 
Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:12] RECOVERY - Apache HTTP on srv292 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.958 second response time [03:18:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:18:34] mwdeploy rsync on pmtpa app servers [03:18:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.569 second response time [03:18:53] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.191 second response time [03:19:02] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.167 second response time [03:19:52] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.993 second response time [03:19:53] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.060 second response time [03:20:38] !log LocalisationUpdate completed (1.23wmf14) at 2014-02-14 03:20:37+00:00 [03:20:44] Logged the message, Master [03:21:22] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=API+application+servers+pmtpa&m=cpu_report&s=by+name&mc=2&g=network_report [03:25:05] just a symptom [03:26:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:26:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:27:02] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:27:42] springle, this is all due to the LocalisationUpdates? [03:27:52] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.555 second response time [03:28:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:29:03] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.243 second response time [03:29:03] PROBLEM - Apache HTTP on mw113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:29:52] unsure. if i'm reading librenms properly, cr2-pmtpa saturated [03:29:52] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.956 second response time [03:30:02] RECOVERY - Apache HTTP on mw113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.105 second response time [03:30:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:30:06] whether due to localisation or not... 
[03:30:22] PROBLEM - Apache HTTP on srv298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:30:42] PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:31:12] RECOVERY - Apache HTTP on srv298 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.919 second response time [03:31:53] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.064 second response time [03:31:58] (PS2) Yurik: Zero: 470-01 now handles M & Zero, on both Opera & regular [operations/puppet] - https://gerrit.wikimedia.org/r/113299 [03:32:42] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.167 second response time [03:33:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:33:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:34:01] (CR) BBlack: [C: 2 V: 2] Zero: 470-01 now handles M & Zero, on both Opera & regular [operations/puppet] - https://gerrit.wikimedia.org/r/113299 (owner: Yurik) [03:34:02] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.238 second response time [03:34:32] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused [03:35:02] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:35:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:35:03] PROBLEM - Apache HTTP on mw113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:35:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:36:02] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.132 second response time [03:36:03] RECOVERY - Apache HTTP on mw113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.244 second response time [03:36:03] PROBLEM - Apache HTTP on mw59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:36:53] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.571 second response time [03:36:53] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.958 second response time [03:37:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.135 second response time [03:38:07] (PS2) Yurik: Removed obsolete carrier 405-25 [operations/puppet] - https://gerrit.wikimedia.org/r/113289 [03:39:12] PROBLEM - Apache HTTP on mw67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:39:40] (CR) BBlack: [C: 2 V: 2] Removed obsolete carrier 405-25 [operations/puppet] - https://gerrit.wikimedia.org/r/113289 (owner: Yurik) [03:39:45] springle, odd, network graph seems fine now.
but the LDAP alert above may need more urgent attention [03:40:02] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.274 second response time [03:40:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:40:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:40:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:40:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:40:11] yeah something else going on [03:40:12] RECOVERY - Apache HTTP on mw67 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.993 second response time [03:40:30] (PS4) QChris: Updated whitelisted language lists to match config [operations/puppet] - https://gerrit.wikimedia.org/r/113168 [03:41:02] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.098 second response time [03:41:42] PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:53] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.568 second response time [03:42:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:42:03] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:42:32] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.035 second response time on port 389 [03:42:42] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.204 second response time [03:42:49] (CR) BBlack: [C: 2 V: 2] Updated whitelisted language lists to match config [operations/puppet] - https://gerrit.wikimedia.org/r/113168 (owner: QChris) [03:42:52] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.960 second response time [03:43:02] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:43:19] !log restarted opendj on virt0 [03:43:27] Logged the message, Master [03:43:53] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.561 second response time [03:43:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.063 second response time [03:44:00] thanks springle, can get back into wikitech now [03:44:02] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.136 second response time [03:44:03] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.175 second response time [03:44:03] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:44:22] PROBLEM - Apache HTTP on mw61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:44:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:44:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.030 second response time [03:45:02] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.140 second response time [03:45:14] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.029 second response time [03:50:53] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.920 second
response time [03:51:02] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.204 second response time [03:51:42] PROBLEM - Apache HTTP on srv299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:52:02] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:02] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.158 second response time [03:53:02] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:54:02] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.138 second response time [03:54:02] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:54:02] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:54:32] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.563 second response time [03:54:53] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.030 second response time [03:55:02] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.204 second response time [03:56:22] PROBLEM - Apache HTTP on mw61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:56:22] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:57:12] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.926 second response time [03:57:22] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.279 second response time [03:57:32] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:58:02] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:58:53] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.062 second response time [03:59:02] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.248 second response time [04:00:02] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:02:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:02:38] TimStarling, got some time to look into the cluster flappiness above? [04:02:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.465 second response time [04:03:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.209 second response time [04:03:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:03:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:03:20] yes [04:03:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:03:32] pmtpa servers? [04:03:40] yes [04:03:53] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.066 second response time [04:04:32] I notice git.wm.o just went down as well, is that still in tampa? 
[04:05:02] PROBLEM - Apache HTTP on mw116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:05:06] also ldap on labs died earlier [04:05:12] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.531 second response time [04:05:24] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-14 04:05:23+00:00 [04:05:32] Logged the message, Master [04:05:53] RECOVERY - Apache HTTP on mw116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.540 second response time [04:06:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:06:22] PROBLEM - Apache HTTP on mw63 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:06:29] gitblit you mean? it's antimony.eqiad.wmnet [04:07:02] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.239 second response time [04:07:06] yeah, see alert above - git.wikimedia.org now throwing errors/timing out [04:07:27] amusingly, the documented method of restart for gitblit doesn't work [04:07:44] what is librenms? [04:08:02] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.274 second response time [04:08:02] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:08:04] and how do I log in to it? [04:08:09] the observium replacement [04:08:12] RECOVERY - Apache HTTP on mw63 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.240 second response time [04:08:22] PROBLEM - Apache HTTP on mw61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:08:27] go to observium.wikimedia.org, should redirect. same pw as before [04:08:53] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.976 second response time [04:08:53] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.995 second response time [04:09:02] PROBLEM - Apache HTTP on mw116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:02] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:22] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.168 second response time [04:09:39] ok, I'm in [04:09:39] bbl, thanks for poking [04:09:53] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.995 second response time [04:09:53] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.034 second response time [04:13:02] RECOVERY - Apache HTTP on mw116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.208 second response time [04:13:02] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:13:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:13:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:13:22] PROBLEM - Apache HTTP on mw61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:14:02] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.243 second response time [04:14:22] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.142 second response time [04:14:32] RECOVERY - 
gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 474987 bytes in 8.653 second response time [04:16:02] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.226 second response time [04:16:02] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:16:19] the servers are idle, it's not an overload [04:16:53] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.991 second response time [04:17:02] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.465 second response time [04:20:02] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:20:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:20:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:21:02] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.143 second response time [04:21:02] PROBLEM - Apache HTTP on mw116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:21:32] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [04:21:57] ^ that one is me, doing a reinstall [04:22:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:22:53] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.529 second response time [04:22:53] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.985 second response time [04:22:59] and there's no detectable packet loss from neon to mw61 etc. [04:23:02] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:23:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:23:22] PROBLEM - Apache HTTP on mw63 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:24:02] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:24:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.534 second response time [04:25:02] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.165 second response time [04:25:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:25:03] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:25:12] RECOVERY - Apache HTTP on mw63 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.023 second response time [04:25:52] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.245 second response time [04:25:53] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.530 second response time [04:25:53] RECOVERY - Apache HTTP on mw116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.538 second response time [04:26:02] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:26:42] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [04:26:42] PROBLEM - Apache HTTP on mw73 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:27:02] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:27:03] PROBLEM - Apache HTTP on srv253 is CRITICAL:
CRITICAL - Socket timeout after 10 seconds [04:27:03] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:27:42] RECOVERY - Apache HTTP on mw73 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.155 second response time [04:28:02] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.175 second response time [04:28:02] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.229 second response time [04:28:42] PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:28:52] PROBLEM - SSH on virt1002 is CRITICAL: Connection refused [04:29:02] PROBLEM - Disk space on virt1002 is CRITICAL: Connection refused by host [04:29:02] PROBLEM - DPKG on virt1002 is CRITICAL: Connection refused by host [04:29:12] PROBLEM - puppet disabled on virt1002 is CRITICAL: Connection refused by host [04:29:22] PROBLEM - RAID on virt1002 is CRITICAL: Connection refused by host [04:29:42] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.277 second response time [04:30:02] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.242 second response time [04:30:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:30:03] PROBLEM - Apache HTTP on srv254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:30:42] PROBLEM - Apache HTTP on srv299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:30:53] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.521 second response time [04:30:53] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.956 second response time [04:31:02] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:31:02] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:31:03] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:31:53] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.520 second response time [04:32:02] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.687 second response time [04:32:02] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:42] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.858 second response time [04:32:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.996 second response time [04:32:53] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.015 second response time [04:32:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.102 second response time [04:33:02] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.350 second response time [04:33:53] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.028 second response time [04:34:02] PROBLEM - Apache HTTP on mw116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:34:42] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:34:53] RECOVERY - Apache HTTP on mw116 is 
OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.074 second response time [04:35:42] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.160 second response time [04:36:02] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:36:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:36:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:36:03] PROBLEM - Apache HTTP on srv254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:36:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:36:53] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.920 second response time [04:37:02] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:37:03] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:37:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:37:53] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.534 second response time [04:37:53] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.084 second response time [04:38:02] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.137 second response time [04:38:02] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.171 second response time [04:38:02] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.204 second response time [04:38:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.536 second response time [04:39:02] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.160 second response time [04:41:02] PROBLEM - Apache HTTP on mw116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:41:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:41:12] PROBLEM - NTP on virt1002 is CRITICAL: NTP CRITICAL: No response from NTP server [04:42:02] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.166 second response time [04:42:02] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:42:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:42:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:42:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:43:02] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:44:02] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.159 second response time [04:44:02] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:44:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:44:03] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:44:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.568 second response time [04:44:53] RECOVERY - Apache 
HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.960 second response time [04:44:53] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.100 second response time [04:45:02] RECOVERY - Apache HTTP on mw116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.183 second response time [04:45:02] PROBLEM - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:45:53] RECOVERY - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 4405 bytes in 0.562 second response time [04:45:55] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.029 second response time [04:46:02] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.170 second response time [04:46:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.240 second response time [04:48:52] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [04:49:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:49:03] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:49:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:49:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:49:53] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.483 second response time [04:50:02] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.158 second response time [04:50:02] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.201 second response time [04:50:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:51:02] PROBLEM - Apache HTTP on mw116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:51:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:51:53] RECOVERY - SSH on virt1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [04:51:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.959 second response time [04:52:02] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [04:53:02] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.170 second response time [04:53:02] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:54:02] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:02] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:02] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:02] PROBLEM - Apache HTTP on srv254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:02] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:02] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:03] PROBLEM - Apache HTTP on mw124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:03] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:53] RECOVERY - Apache HTTP on srv249 is OK: 
HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.520 second response time [04:55:53] RECOVERY - Apache HTTP on mw116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.538 second response time [04:55:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.536 second response time [04:55:53] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.520 second response time [04:56:02] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.193 second response time [04:56:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.205 second response time [04:56:02] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.241 second response time [04:56:03] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.244 second response time [04:56:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:57:02] RECOVERY - Apache HTTP on mw124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.142 second response time [04:57:03] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.205 second response time [04:58:22] RECOVERY - RAID on virt1002 is OK: OK: Active: 16, Working: 16, Failed: 0, Spare: 0 [04:58:53] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.944 second response time [04:59:02] RECOVERY - Disk space on virt1002 is OK: DISK OK [04:59:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:59:03] RECOVERY - DPKG on virt1002 is OK: All packages OK [04:59:12] RECOVERY - puppet disabled on virt1002 is OK: OK [04:59:53] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.098 second response time [05:00:03] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:00:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:01:02] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.936 second response time [05:01:32] if i run the check_http check for a random apache in tampa in a loop, every so often i'll get a result >5s, occasionally even a timeout [05:01:41] on neon, that is [05:02:02] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:02:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:02:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:02:03] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:02:07] on iron too, actually [05:02:16] watch -n5 /usr/bin/time -f '%E' /usr/lib/nagios/plugins/check_http -H en.wikipedia.org -I 10.0.11.60 -u / [05:02:41] so it's not icinga's fault and not neon's fault [05:02:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.194 second response time [05:04:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:04:03] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:04:18] (03PS1) 10Andrew Bogott: Add a second compute node. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/113330 [05:04:29] (03CR) 10Andrew Bogott: [C: 032] Change eqiad instance IP range. [operations/puppet] - 10https://gerrit.wikimedia.org/r/113136 (owner: 10Andrew Bogott) [05:04:52] PROBLEM - Apache HTTP on mw60 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:04:53] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.572 second response time [05:04:53] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.571 second response time [05:05:02] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:42] RECOVERY - Apache HTTP on mw60 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.530 second response time [05:06:02] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.083 second response time [05:06:02] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.155 second response time [05:06:02] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.206 second response time [05:06:02] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.241 second response time [05:06:03] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:06:10] (03PS2) 10Andrew Bogott: Add a second compute node. [operations/puppet] - 10https://gerrit.wikimedia.org/r/113330 [05:06:53] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.570 second response time [05:07:02] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.133 second response time [05:07:02] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.226 second response time [05:07:59] (03CR) 10Andrew Bogott: [C: 032] Add a second compute node. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/113330 (owner: 10Andrew Bogott) [05:08:52] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.099 second response time [05:09:02] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:09:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:09:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:09:03] PROBLEM - Apache HTTP on mw124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:09:42] PROBLEM - Apache HTTP on srv299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:09:53] RECOVERY - Apache HTTP on mw124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.575 second response time [05:10:02] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:10:22] PROBLEM - Apache HTTP on mw67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:10:42] PROBLEM - Apache HTTP on srv252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:11:02] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:11:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:11:12] RECOVERY - Apache HTTP on mw67 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.534 second response time [05:11:32] RECOVERY - Apache HTTP on srv252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.087 second response time [05:11:42] PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:11:53] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.099 second response time [05:12:02] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.120 second response time [05:12:03] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.277 second response time [05:12:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:12:03] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:12:32] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.533 second response time [05:12:53] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.567 second response time [05:13:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.566 second response time [05:13:53] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.050 second response time [05:14:12] RECOVERY - NTP on virt1002 is OK: NTP OK: Offset -0.001201152802 secs [05:14:32] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.093 second response time [05:15:02] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:03] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:03] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:03] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:53] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.569 second response time 
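
The one-liner springle pastes above at [05:02:16] times a single Tampa apache with the stock Nagios plugin. A minimal standalone sketch of the same probe, assuming the standard nagios-plugins path and reusing the host/IP from the transcript; it prints only samples slower than the five seconds springle mentions:

    #!/bin/bash
    # Probe one apache directly, bypassing icinga, and log only slow responses.
    while true; do
        start=$(date +%s.%N)
        /usr/lib/nagios/plugins/check_http -H en.wikipedia.org -I 10.0.11.60 -u / >/dev/null
        elapsed=$(echo "$(date +%s.%N) - $start" | bc)
        # flag anything slower than 5 seconds
        if [ "$(echo "$elapsed > 5" | bc)" = "1" ]; then
            echo "$(date -u '+%F %T') slow probe: ${elapsed}s"
        fi
        sleep 2
    done

Run long enough, the roughly one-in-a-dozen slow responses described below show up no matter where the script runs, which is exactly what rules out icinga and neon as the culprits.
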
[05:16:02] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.204 second response time [05:16:02] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:17:02] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.260 second response time [05:17:02] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:17:33] in fact, i get it on the actual apache itself [05:17:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.567 second response time [05:17:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.098 second response time [05:19:02] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:19:52] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.961 second response time [05:19:53] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.031 second response time [05:19:53] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.069 second response time [05:20:52] (03Abandoned) 10Tim Landscheidt: Tools: Remove SGE shadow master [operations/puppet] - 10https://gerrit.wikimedia.org/r/112671 (owner: 10Tim Landscheidt) [05:20:53] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.098 second response time [05:21:02] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:21:22] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:21:22] PROBLEM - Apache HTTP on srv298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:22:22] RECOVERY - Apache HTTP on srv298 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.127 second response time [05:22:53] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.101 second response time [05:23:02] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.237 second response time [05:23:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:03] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:03] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:52] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.556 second response time [05:24:02] PROBLEM - Apache HTTP on srv254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:24:22] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.168 second response time [05:24:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.104 second response time [05:25:02] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:25:32] * springle wonders about 02:24 logmsgbot: aaron synchronized wmf-config/mc.php 'Set Memcached retry_timeout to -1'  [05:26:02] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:26:22] PROBLEM - Apache HTTP on mw61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:26:26] (03PS1) 10Andrew Bogott: Add cron entries to update puppet repos on labs 
puppetmasters. [operations/puppet] - 10https://gerrit.wikimedia.org/r/113332 [05:26:27] yeah, that's a good guess [05:27:02] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.088 second response time [05:27:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:27:22] PROBLEM - Apache HTTP on mw63 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:27:53] (03PS1) 10Ori.livneh: Revert "Set Memcached retry_timeout to -1" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113333 [05:28:02] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.206 second response time [05:28:03] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:28:03] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:28:03] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:28:12] RECOVERY - Apache HTTP on mw63 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.917 second response time [05:28:12] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "Set Memcached retry_timeout to -1" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113333 (owner: 10Ori.livneh) [05:28:12] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.115 second response time [05:28:27] !log ori updated /a/common to {{Gerrit|Iac9f51209}}: Revert "Set Memcached retry_timeout to -1" [05:28:34] Logged the message, Master [05:28:50] stracing mw32 nutcracker seems to have unusual waits.. but then again dont know what usual is [05:29:02] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.130 second response time [05:29:02] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.241 second response time [05:29:03] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:29:03] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:29:12] !log ori synchronized wmf-config/mc.php 'Iac9f51209: Revert 'Set Memcached retry_timeout to -1'' [05:29:19] Logged the message, Master [05:29:51] aaron made the change to reduce logspam; it has been an issue for a long while (several months). so safe to revert. 
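
While springle straces nutcracker on mw32 above, twemproxy itself has a couple of built-in checks that help here; a hedged sketch (the config path and the 22222 stats port are twemproxy defaults, not confirmed by this log):

    # validate the config (including server_retry_timeout) without touching
    # the running daemon
    nutcracker --test-conf --conf-file /etc/nutcracker/nutcracker.yml
    # twemproxy serves cumulative counters as JSON on its stats port; the
    # server_ejects and server_timedout counters show whether the
    # retry_timeout change altered server-ejection behaviour
    nc localhost 22222 | python -mjson.tool | grep -E 'server_(ejects|timedout)'
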
[05:29:52] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.413 second response time [05:29:53] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.423 second response time [05:29:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.423 second response time [05:29:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.422 second response time [05:29:53] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.422 second response time [05:29:53] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.421 second response time [05:30:31] (03PS1) 10Tim Landscheidt: Revert "Add Apple Touch icon for Labs" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113334 [05:31:07] springle: good catch [05:31:32] (03PS2) 10Tim Landscheidt: Revert "Add Apple Touch icon for Labs" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113334 [05:33:23] (03CR) 10Tim Landscheidt: "Reverted in If18d4215b603f5461451c27ccb8e2a8165f2b0d0." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111434 (owner: 10Odder) [05:46:48] can't immediately see the relationship between retry_timeout and server_retry_timeout in nutcracker [05:47:01] twemproxy rather [05:47:30] but server_retry_timeout is int64_t and a couple of spots in the code would break with negative values [05:48:32] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100% [05:49:02] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [05:51:12] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused [05:52:12] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.456 second response time [06:08:30] !log Reloading Zuul to deploy Ie02531143511f418a6 [06:08:38] Logged the message, Master [06:23:20] springle: i'll e-mail the log to aaron [06:36:15] ori: thanks. also see greg-g's email to ops@ [06:37:00] oh, i missed that. thanks [06:37:54] springle: A question re LabsDB: In replication prod-DBs => sanitizer => LabsDB, if the link prod-DBs => sanitizer stalls/breaks, but sanitizer => LabsDB keeps working, what will Seconds_Behind_Master on LabsDB show? [06:40:02] scfc_de: Seconds_Behind_Master is virtually useless at the second level [06:47:44] scfc_de: we use http://www.percona.com/doc/percona-toolkit/2.2/pt-heartbeat.html . but to tell (like for labsdb users) one needs access to heartbeat schema [06:48:01] which i don't know whether we allow or not, actually [06:56:18] We would have to wrap any access to Seconds_Behind_Master in a SECURITY DEFINER anyway, so we could do the same with pt-heartbeat if it's non-public. I'll add a note about pt-heartbeat to https://bugzilla.wikimedia.org/48694 and https://bugzilla.wikimedia.org/48628. Thanks! [06:57:57] yw [07:26:24] ori: I still don't understand those warnings at all [07:27:00] what do they have to do with twemproxy? [08:14:14] (03CR) 10Fabriceflorin: "Thanks, MZ. I'm comfortable with the proposal to remove AFT5 on enwiki and frwiki on Monday, March 3, 2014. This should give enough time f" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112639 (owner: 10MZMcBride) [08:28:53] I regularly get messages like "Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon.
Please try again in a few minutes." I take it it is no news [08:33:17] apergos: ^ [08:33:23] Same for me. [08:36:23] For example on http://en.wikipedia.org/wiki/Selberg_sieve [08:36:53] If you report this error to the Wikimedia System Administrators, please include the details below. [08:36:53] Request: GET http://en.wikipedia.org/wiki/Selberg_sieve, from 91.198.174.72 via amssq57 amssq57 ([91.198.174.67]:3128), Varnish XID 921749154 [08:36:53] Forwarded for: 90.146.67.180, 91.198.174.72 [08:36:53] Error: 503, Service Unavailable at Fri, 14 Feb 2014 08:36:01 GMT [08:40:43] average: I tried dewiki, eowiki, jawiki and they all seem to work. Could you find other wikis that are affected? [08:47:10] I see it. [08:47:21] Thanks! [08:47:36] looking [08:56:53] Wow, still the 503 [08:57:03] http://wikimania2014.wikimedia.org/wiki/Special:MyLanguage/Main_Page [08:57:16] but not on HTTPS [08:59:37] !b 61364 [08:59:45] https://bugzilla.wikimedia.org/show_bug.cgi?id=61364 [09:19:21] (03PS1) 10Whym: Add autopatrol and related settings for Japanese Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113338 [09:27:38] (03PS2) 10Whym: Enable autopatrol and patrolling of RecentChanges on Japanese Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113338 [09:28:38] (03PS3) 10Whym: Enable autopatrol and patrolling of RecentChanges on Japanese Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113338 [09:46:06] (03CR) 10Odder: [C: 04-1] "I suggest you give the ability to remove users from the autopatrolled group to a local group (such as sysops or bureaucrats); with the cur" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113338 (owner: 10Whym) [11:18:08] (03Abandoned) 10QChris: Add zero tag for carrier 413-02 for simlpewiki on zerodot [operations/puppet] - 10https://gerrit.wikimedia.org/r/113167 (owner: 10QChris) [11:18:10] (03CR) 10Alexandros Kosiaris: [C: 032] Wait 60 seconds before killing the parsoid master [operations/puppet] - 10https://gerrit.wikimedia.org/r/113316 (owner: 10GWicke) [11:24:02] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.004 second response time [11:32:12] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused [11:34:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:34:57] (03PS1) 10Ori.livneh: mwgrep: use a filtered boolean query [operations/puppet] - 10https://gerrit.wikimedia.org/r/113351 [11:36:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:38:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:40:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:42:10] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [11:42:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:44:27] please don't touch wtp1004. Investigating [11:44:45] !log restart parsoid on wtp1022. 
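
On the replication-lag question above ([06:37:54]-[06:47:44]): pt-heartbeat measures delay from a timestamp row the master rewrites every second, so unlike Seconds_Behind_Master it keeps counting in exactly the scenario scfc_de describes — if the prod-DBs => sanitizer link stalls, the heartbeat row stops advancing and the measured lag on LabsDB grows, while Seconds_Behind_Master would happily report 0. A hedged sketch of how it is typically consumed (hostname and user are placeholders; heartbeat.heartbeat is the percona-toolkit default schema/table):

    # one-shot: print the replica's current delay in seconds and exit
    pt-heartbeat --check --host labsdb-replica --user lag_check --ask-pass \
        --database heartbeat --table heartbeat
    # or watch it continuously
    pt-heartbeat --monitor --host labsdb-replica --user lag_check --ask-pass \
        --database heartbeat --table heartbeat

This is also why springle notes that labsdb users would need access to the heartbeat schema: the lag lives in a table, not in SHOW SLAVE STATUS.
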
[11:44:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:44:53] Logged the message, Master [11:45:09] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.012 second response time [11:45:10] ok, I was just looking at it [11:46:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:46:54] ACKNOWLEDGEMENT - Parsoid on wtp1004 is CRITICAL: Connection refused alexandros kosiaris Investigating parsoid restarting on an ephemeral port [11:47:08] ah [11:47:38] seems like a bug. It happened the other day, and now again [11:47:50] I remember the report from the other day [11:47:50] !log depooled wtp1004 [11:47:52] yuck [11:47:58] Logged the message, Master [11:48:10] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [11:48:11] I have a few questions concerning the mathoid-debian package, basically there are two options: either install the dependent node modules while building the package with npm, or ship a set of files that contain the required modules. The first option requires an npm version > 1.3? ... at least newer than the current precise version of npm; furthermore the machine that builds the package must be connected to the internet [11:48:27] hmmm so this time it shows up everywhere... nice [11:48:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:49:10] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.008 second response time [11:49:12] !log restarted parsoid on wtp1001 [11:49:19] Logged the message, Master [11:50:10] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [11:50:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:51:06] physikerwelt: When building packages, fetching stuff from the internet is a bad security practice.
I'd rather you fetched them manually, verified them and then just included them in the repo used to build the deb [11:52:30] akosiaris: ok thanks for the quick and definitive answer [11:52:48] you are welcome :-) [11:52:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:28:13 AM UTC [11:53:50] (03PS4) 10ArielGlenn: Add shell account for santhosh, admins restricted + stats1002 [operations/puppet] - 10https://gerrit.wikimedia.org/r/112912 [11:54:09] RECOVERY - Puppet freshness on analytics1024 is OK: puppet ran at Fri Feb 14 11:54:01 UTC 2014 [11:55:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:54:01 AM UTC [11:57:40] (03CR) 10ArielGlenn: [C: 032] Add shell account for santhosh, admins restricted + stats1002 [operations/puppet] - 10https://gerrit.wikimedia.org/r/112912 (owner: 10ArielGlenn) [11:57:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:54:01 AM UTC [11:58:49] RECOVERY - Puppet freshness on analytics1024 is OK: puppet ran at Fri Feb 14 11:58:40 UTC 2014 [12:00:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:58:40 AM UTC [12:01:09] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.008 second response time [12:02:49] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 11:58:40 AM UTC [12:03:04] (03CR) 10Whym: "@Odder thanks for your suggestions. Allowing sysops to remove the flag makes sense, and it was originally our intention. I would like to" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113338 (owner: 10Whym) [12:28:14] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [12:28:54] RECOVERY - Puppet freshness on analytics1024 is OK: puppet ran at Fri Feb 14 12:28:43 UTC 2014 [12:31:14] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.007 second response time [12:53:21] omg scfc_de the censor [12:58:15] Nemo_bis: ? [13:08:16] carthago delenda est [13:14:57] Ah. [13:24:44] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:15] apergos: hey can you have a look ^^?
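
A minimal sketch of the vendoring route akosiaris recommends for the mathoid deb above: fetch the node modules once on a networked machine, review them, and commit them so the package itself builds offline (paths and the commit message are illustrative, not the actual mathoid packaging):

    cd mathoid
    npm install --production   # one-off fetch, on a machine with network access
    npm ls                     # review exactly what was pulled in before shipping it
    git add node_modules
    git commit -m "Vendor node_modules so the deb builds without network access"
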
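
ms-be1005 gets powercycled over its management interface a few lines below; a hedged sketch of what that out-of-band poke usually looks like (the .mgmt hostname and the credentials are assumptions, not taken from this log):

    # query power state, then cycle it via IPMI when the OS -- and here even
    # the mgmt console -- is unresponsive
    ipmitool -I lanplus -H ms-be1005.mgmt.eqiad.wmnet -U root -a chassis power status
    ipmitool -I lanplus -H ms-be1005.mgmt.eqiad.wmnet -U root -a chassis power cycle
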
[13:25:18] I'm doing some other stuff [13:25:21] yes [13:30:02] (03CR) 10Hashar: "The trailing / is defined in RFC 1738 section-3.1 "Common Internet Scheme Syntax" https://tools.ietf.org/html/rfc1738#section-3.1" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/106110 (owner: 10Jeremyb) [13:30:23] !log powercycling ms-be1005, unresponsive even on mgmt console [13:30:32] Logged the message, Master [13:32:04] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [13:33:06] thanks [13:37:42] (03PS1) 10Springle: s6 pool db1030 depool db1010 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113360 [13:38:05] (03CR) 10Springle: [C: 032] s6 pool db1030 depool db1010 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113360 (owner: 10Springle) [13:38:16] (03Merged) 10jenkins-bot: s6 pool db1030 depool db1010 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113360 (owner: 10Springle) [13:39:07] !log springle synchronized wmf-config/db-eqiad.php 's6 pool db1030 depool db1010' [13:39:15] Logged the message, Master [13:43:14] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [13:46:54] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:44] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [13:52:46] (03PS4) 10Whym: Enable autopatrol and patrolling of RecentChanges on Japanese Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113338 [13:54:21] (03PS1) 10Springle: s1 repool db1056 warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113361 [13:54:45] (03CR) 10Springle: [C: 032] s1 repool db1056 warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113361 (owner: 10Springle) [13:54:51] (03Merged) 10jenkins-bot: s1 repool db1056 warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113361 (owner: 10Springle) [13:55:30] !log springle synchronized wmf-config/db-eqiad.php 's1 repool db1056 warm up' [13:55:38] Logged the message, Master [14:13:08] (03PS1) 10Springle: s1 db1056 full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113363 [14:13:33] (03CR) 10Springle: [C: 032] s1 db1056 full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113363 (owner: 10Springle) [14:13:39] (03Merged) 10jenkins-bot: s1 db1056 full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113363 (owner: 10Springle) [14:14:14] !log springle synchronized wmf-config/db-eqiad.php 's1 db1056 full steam' [14:14:22] Logged the message, Master [14:22:28] hmm so systemd on Ubuntu as well... 
eventually [14:22:28] http://www.markshuttleworth.com/archives/1316 [14:25:42] !log beginning cirrus reindex of all wikipedias running cirrus except enwiki [14:25:51] Logged the message, Master [14:49:24] PROBLEM - Disk space on virt11 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 44204 MB (3% inode=99%): [14:52:39] paravoid: guess you saw it: Ubuntu is to switch from Upstart to the Systemd init system [14:53:35] yup, I did [14:57:14] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [15:12:26] grrrr @ systemd :P [15:26:14] !log aborted reindex due to https://bugzilla.wikimedia.org/show_bug.cgi?id=61377 [15:26:22] Logged the message, Master [15:44:05] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138 [15:49:17] bd808: ^^^^ [15:49:54] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 34: active_shards: 68: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [15:50:13] <^d> Self-healing? ^ [15:50:57] I'm not sure what happened. The secondary index for today's logs was rebuilding when I logged in [15:51:44] Also I should set up a highlight filter for "logstash" [15:59:56] manybubbles: Replica of today's logstash index just relocated from 1001 to 1002. Something is up in my cluster for sure. [16:00:15] they do get to relocate on their own, you know [16:00:21] that shouldn't cause failures [16:01:55] I'm about to venture outside for the first time in a while. [16:01:57] wish me luck [16:15:57] !log db1034 swapping cables [16:16:06] Logged the message, Master [16:18:04] manybubbles|away: safe journeys [16:48:11] (03CR) 10Ryan Lane: [C: 031] Add cron entries to update puppet repos on labs puppetmasters. [operations/puppet] - 10https://gerrit.wikimedia.org/r/113332 (owner: 10Andrew Bogott) [16:48:19] * werdna is pung [16:55:19] (03CR) 10Odder: "This breaks Things. https://meta.wikimedia.org/wiki/Tech/News/2014/01" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111426 (owner: 10TTO) [17:04:30] Does anybody have a few minutes to help me figure out why the elasticsearch and redis monitors on the logstash100[123] nodes aren't populating graphs in ganglia? [17:05:19] I can see the *.pyconf files in /etc/ganglia/conf.d and don't know the next troubleshooting step [17:11:27] !log Restarted ganglia-monitor on logstash1001 to see if that makes the elasticsearch and redis metrics show up in ganglia [17:11:36] Logged the message, Master [17:12:53] bd808: stop gmond and start it in the foreground by running gmond -d999 [17:13:08] well, since you've already restarted it, you can wait and see if that fixes things [17:14:03] ori: Thanks. I'll give it 5 minutes and then do the foreground run if needed [17:14:54] (03CR) 10Ori.livneh: [C: 032] mwgrep: use a filtered boolean query [operations/puppet] - 10https://gerrit.wikimedia.org/r/113351 (owner: 10Ori.livneh) [17:15:28] Redis is showing up now so fingers crossed that I just need to bump the ganglia-monitor service on the others [17:20:01] (03CR) 10Helder.wiki: "=/" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111426 (owner: 10TTO) [17:21:56] (03CR) 10Ottomata: "Thanks Antoine."
(031 comment) [operations/debs/git-fat] (debian) - 10https://gerrit.wikimedia.org/r/113018 (owner: 10Ottomata) [17:39:15] (03CR) 10Matthias Mullie: [C: 031] Add qa_automation group and grant it Flow rights [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113311 (owner: 10Spage) [17:39:43] (03PS1) 10Legoktm: Revert "Add local interwiki for metawiki" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113377 [17:39:50] Nemo_bis: ^ [17:41:27] (03CR) 10Odder: [C: 031] "Yes." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113377 (owner: 10Legoktm) [17:50:20] (03PS6) 10Ottomata: Initial debian release [operations/debs/git-fat] (debian) - 10https://gerrit.wikimedia.org/r/113018 [17:53:53] !log Running gmond in foreground on logstash1001 to debug elasticsearch reporting [17:54:02] Logged the message, Master [17:59:57] (03CR) 10Alexandros Kosiaris: Initial debian release (031 comment) [operations/debs/git-fat] (debian) - 10https://gerrit.wikimedia.org/r/113018 (owner: 10Ottomata) [18:00:51] bd808: if the module doesn't throw exceptions, the next step is tcpdump [18:00:57] bd808: on the ganglia aggregator [18:01:05] which iirc is logstash1003, but you should check [18:01:12] ori: It's the elasticsearch module [18:01:34] greg-g: so do we unbreak things on a Friday? [18:01:41] https://gerrit.wikimedia.org/r/#/c/113377/ specifically [18:02:13] ori: I think Nik made a change that requires a newer version of elasticsearch than I'm running, but I'm going to double check the release notes to make sure that's the problem and not something else [18:02:26] twkozlowski: yeah, for that kind of thing [18:13:20] (03CR) 10Gergő Tisza: [C: 031] Start sampling detailed network performance for Multimedia Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112452 (owner: 10Gilles) [18:18:02] dr0ptp4kt: yurik: where does one learn about portalwiki? and the phases? [18:19:29] jeremyb, we have only outlined the first phase of the portalwiki - which is basically to relocate meta zero config pages to a separate wiki and set it up for further dev, but it seems we might have to continue developing on meta until ops create a separate cluster [18:20:28] yurik: see https://bugzilla.wikimedia.org/61222 [18:20:54] yurik: legal didn't want to use legal.wm.o in case they had some future broader use. is that not a concern for your proposed domain name? [18:21:31] hm, not sure what you mean - we would be totally fine with zero.wikimedia.org [18:24:09] !log Starting ganglia-monitor on logstash1001. Filed bug 61384 about problem in elasticsearch_monitoring.py affecting the logstash cluster. [18:24:17] Logged the message, Master [18:26:00] (03PS1) 10Aaron Schulz: Set wgMathDisableTexFilter to reduce performance problems [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113382 [18:26:56] (03PS2) 10Aaron Schulz: Set wgMathDisableTexFilter to fix performance regression [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113382 [18:27:01] (03CR) 10Aaron Schulz: [C: 032] Set wgMathDisableTexFilter to fix performance regression [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113382 (owner: 10Aaron Schulz) [18:27:08] (03Merged) 10jenkins-bot: Set wgMathDisableTexFilter to fix performance regression [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113382 (owner: 10Aaron Schulz) [18:27:41] yurik: so not "portal.wikimedia.org" then?
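
Pulling together the ganglia debugging steps ori suggests above (foreground gmond at [17:12:53], tcpdump on the aggregator at [18:00:51]); the interface name and UDP port 8649 are ganglia defaults and assumptions, not confirmed here:

    # on the misbehaving node: stop the service, rerun in the foreground with
    # maximum debug output to watch the python modules load and report
    service ganglia-monitor stop
    gmond -d999
    # on the aggregator (logstash1003, per the log): confirm metric packets arrive
    tcpdump -n -i eth0 udp port 8649
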
[18:27:54] jeremyb, yurik, i would prefer that we use portal.wikimedia.org to eliminate confusion about zero.wikipedia.org and zero.wikimedia.org and a potential of rebranding the program at some point down the road. note, no one is planning a rebrand, it's just that the term "zero" has negative connotations sometimes. hence, something other than 'zero', even if it isn't 'portal', seems advisable. my recommendation earlier was partners.wikimedia.org [18:27:55] which aligns with what we do [18:28:32] jeremy b + yurik, i'm stepping away from my computer for a while, just so you know [18:28:33] "portal" can have many meanings [18:28:40] dr0ptp4kt: sure [18:28:51] <^d> dr0ptp4kt: domainsdontmatter.wikimedia.org? :) [18:29:18] ^d: they really matter if you're trying to get someone to remember it or type it manually... [18:29:22] :-) [18:29:44] <^d> I don't think eiximenis is in use anymore. [18:29:46] <^d> ;-) [18:29:51] !log aaron synchronized wmf-config/CommonSettings.php 'Set wgMathDisableTexFilter to fix performance regression' [18:29:58] Logged the message, Master [18:30:18] don't forget to get domainsdontmatter.wikimedia.org SSL cert, thx [18:30:41] <3 mutante [18:30:45] <^d> AaronSchulz: I have PoolCounter wrapping for texvccheck that should be working its way into the next wmf branch. [18:33:33] paravoid: ping [18:36:30] ^d: still terrible [18:36:39] ok, https://en.wikipedia.org/w/index.php?title=Real_projective_line&oldid=543928239&forceprofile=true is still slow [18:36:58] like 36 on math that was already rendered... [18:37:01] *36sec [18:42:25] <^d> AaronSchulz: :( [18:43:26] ori: if there are any more mc details can you leave them on bug 56882...since that still confuses me? Can it just be re-done for eqiad only? [18:43:36] * AaronSchulz looks at the Math code [18:45:01] AaronSchulz: i'll look [18:45:25] the logs show the spam went away when it was on [18:47:38] AaronSchulz: Sean mentioned that tampa is using nutcracker, which I guess is an earlier version of twemproxy [18:48:59] so somehow not using backoff after failure in low-traffic tampa made nutcracker unhappy? Were apaches waiting on it or something? How did they get seen as down? [18:49:30] still doesn't seem to make sense [18:50:23] AaronSchulz: if an HTTP request takes more than 10 seconds to get the 301 to /wiki/Main_Page, an alert is issued [18:50:43] some requests were taking more than 10 seconds [18:50:47] most were under 1 second [18:51:00] but i ran it with 'watch' with a 2-second delay between calls [18:51:11] how it was seen, example 21:36 <+icinga-wm> PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:21] and one every dozen or so requests would take ~10 seconds [18:51:28] they were all doing that during that time, but only tampa [18:51:38] 0.32% 0.122560 84 - FileBackendStore::storeInternal-global-swift-eqiad [18:52:04] ^d: hmm, it must be regenerating and re-storing the pngs each time [18:52:04] AaronSchulz: make the change locally on one of the tampa apaches [18:52:16] and !log it so people know to expect the alerts [18:52:55] ^d: I mean by the monitoring, of course I saw the logs :) [18:53:20] trying to figure out how apaches were delayed [18:53:45] <^d> I really want to finish wrapping more of this in PoolCounter. [18:53:46] ori: though I'm tempted to just do an eqiad conditional and be done with it, I don't really care about tampa at all [18:53:53] <^d> So worst it can do is fail math and not take apaches with it. [18:53:58] ^d search is the new domains, i know.
unless it isn't indexed! [18:54:08] i guess we just index the homepage and nothing else [18:54:12] AaronSchulz: that's fine [18:54:20] 1.83% 0.696300 1 - FileBackendStore::doQuickOperationsInternal-global-swift-eqiad [18:54:31] <^d> dr0ptp4kt: Huh? Context? [18:54:34] <^d> Aw, left. [18:54:37] ^d: you know if I didn't hack Math to batch store all files in swift at the end this would be even worse [18:54:42] AaronSchulz: i think if you spent several hours chasing it down it'll end up being something particular and nearly-irrelevant about tampa [18:54:53] it would really suck if that was the case *and* we were still writing to tampa ;) [18:55:00] that page would have timed out [19:02:52] !log Updated /src/scap on tin to b2d8042 [19:02:59] Logged the message, Master [19:03:22] greg-g: Ok to push latest scap scripts to the cluster and test with a no-op scap? [19:05:09] (03PS1) 10Aaron Schulz: Set retry_timeout to -1 for memcached in eqiad only [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113387 [19:05:25] ori: ^ [19:05:58] (03CR) 10Ori.livneh: [C: 031] "Aaron <3s ternary expressions" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113387 (owner: 10Aaron Schulz) [19:07:24] (03CR) 10Aaron Schulz: [C: 032] Set retry_timeout to -1 for memcached in eqiad only [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113387 (owner: 10Aaron Schulz) [19:07:33] (03Merged) 10jenkins-bot: Set retry_timeout to -1 for memcached in eqiad only [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113387 (owner: 10Aaron Schulz) [19:09:08] !log aaron synchronized wmf-config/mc.php 'Set retry_timeout to -1 for memcached in eqiad only' [19:09:17] Logged the message, Master [19:09:44] idle greg-g is idle [19:09:58] bd808: he okayed it earlier [19:10:40] bd808: yessir [19:10:42] sorry sir [19:10:46] was multitasking sir [19:11:13] greg-g: thanks. :) [19:11:51] !log Updating scap on mediawiki-installation dsh hosts [19:11:57] Logged the message, Master [19:12:14] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [19:13:12] fully, the fire alarm in my apartment went off a minute ago [19:13:25] then off again [19:13:27] *funny [19:13:53] ori: Lots and lots of "Already up-to-date" responses. Is puppet pulling automatically for eventual consistency? [19:14:24] AaronSchulz: You may or may not be on fire. Please check :) [19:14:51] Unstable waveforms must be collapsed via observation [19:15:04] * AaronSchulz looks at odd stuff on http://ganglia.wikimedia.org/latest/ [19:16:27] meh, that gluster spike seems random [19:24:03] twkozlowski: still on? [19:25:16] !log bd808 started scap: no-diff scap to test script changes [19:25:24] Logged the message, Master [19:25:35] mutante: Yeah [19:26:31] twkozlowski: 113073 the bugzilla sidebar thing, looks good to you? shall we merge it even without andre?:) [19:26:34] would [19:26:50] but you gave -1 before, so [19:27:45] oh, it looks good [19:27:48] i guess the question is more, like [19:27:55] is it more important to fix that for weekend [19:28:09] or to wait for more labs testing by andre because it's just his PS1 [19:28:12] ok [19:28:23] the memcached log spam is mostly tampa anyway...so I guess that log will stay fat for a while [19:28:38] what's a few gigs on fluorine? ;) [19:28:44] * AaronSchulz goes back to Math [19:28:52] mutante: I'll try to set Labs up now. But yeah, would be cool to get it in for the weekend. How much more time do I have before your weekend starts?
:) [19:28:54] !log no-diff scap updated 366 JSON l10n files [19:29:04] Logged the message, Master [19:29:13] andre__: oh, i thought you are away :) [19:29:31] on and off... [19:29:38] andre__: eh, so i can either just merge that PS2 , i fixed the tabs [19:29:46] andre__: or you can tell me to come back in like 2 hours [19:30:01] i'll just be afk in between as well [19:32:20] mutante, two hours sounds good. I might be away then, but hopefully have tested everything [19:32:34] just realizing there's a bit more cleanup work to do, e.g. removing /extensions/Sitemap from Gerrit [19:33:33] andre__: alright, i'll be off and back in a bit, feel free to leave query messages,i'll read them [19:33:33] ok <> in the extension repo is definitely crap with msysgit [19:38:54] bd808: was the lack of a scap completed message from logmsgbot intentional? [19:39:06] greg-g: Still running [19:39:11] bd808: also, I assume you're done based on your last... oh [19:39:18] Our "no-op" picked up l10n updates [19:39:19] the !log no-diff updated json files thing confused me [19:39:27] * greg-g nods [19:39:59] On a related note… scap is f'ing slow [19:40:23] yep! [19:45:38] (03CR) 10Cmcmahon: [C: 031] Add qa_automation group and grant it Flow rights [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113311 (owner: 10Spage) [19:45:46] All but 42 minutes yesterday [19:45:59] To push new mw version (all code), l10n cache for 2 versions [19:47:21] I think it's going to be >30m today just for l10n updates [19:47:58] The rsync fanout was horribly slow [19:48:49] also the use of dsh instead of salt cause a YMMV [19:48:54] *causes, gah [19:49:37] AaronSchulz: Do you think using salt for messaging would be a significant difference? [19:50:29] Does salt have a "batch size" option? I don't think we want all the rsyncs running at once with the current number of servers [19:50:38] depends on your ssh-agent [19:50:43] !log bd808 finished scap: no-diff scap to test script changes (duration: 25m 26s) [19:50:50] i just wouldn't worry about it yet [19:50:51] Logged the message, Master [19:51:02] if it is fast already then it won't make much difference [19:51:11] greg-g: ^^ {{done}} [19:51:28] $ cat sync-common [19:51:28] #!/bin/bash [19:51:28] /usr/local/bin/scap-1 [19:51:54] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 33: active_shards: 66: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 1 [19:51:54] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 33: active_shards: 66: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 1 [19:52:03] grrr [19:52:10] heh [19:52:14] bd808: if it ain't scap, it's logstash :) [19:52:14] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 33: active_shards: 66: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 1 [19:52:49] kaldari: they're done with the "no-op" scap test [19:52:50] Something nasty keeps happening there. 
A shard replica is flapping [19:53:23] * greg-g goes to do some lunch type thing [19:53:27] ya'll play nice now [19:53:38] * bd808 thinks that sounds like a good idea [19:54:23] (03PS1) 10Ori.livneh: webperf: Adapt NavigationTiming Graphite reporter to schema rev. 7494934 [operations/puppet] - 10https://gerrit.wikimedia.org/r/113392 [19:54:38] (03CR) 10Ori.livneh: [C: 032 V: 032] webperf: Adapt NavigationTiming Graphite reporter to schema rev. 7494934 [operations/puppet] - 10https://gerrit.wikimedia.org/r/113392 (owner: 10Ori.livneh) [19:55:13] !log during scap test snapshot[1234] reported "sudo: no tty present and no askpass program specified" [19:55:21] Logged the message, Master [19:56:47] bd808: that's been a problem for ages. might be worth filing an RT [19:57:34] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for db78 [operations/dns] - 10https://gerrit.wikimedia.org/r/113140 (owner: 10Cmjohnson) [19:59:38] !log ori synchronized php-1.23wmf14/extensions/NavigationTiming 'Update NavigationTiming for schema revision to 7494934' [19:59:46] Logged the message, Master [20:00:38] ori: Filed snapshot scap errors as https://rt.wikimedia.org/Ticket/Display.html?id=6847 [20:00:49] bd808: sweet, thanks [20:01:28] !log ori synchronized php-1.23wmf13/extensions/NavigationTiming 'Update NavigationTiming for schema revision to 7494934' [20:01:37] Logged the message, Master [20:08:35] https://bugzilla.wikimedia.org/show_bug.cgi?id=36623 can be closed? [20:08:43] * twkozlowski dunno what Ubuntu version is used nowadays [20:09:51] not yet, just a little bit longer [20:10:14] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [20:10:40] the one-off thing even though it's practically done [20:11:59] twkozlowski: that'll be resolved like when tampa is dead [20:12:31] technically, you can argue if it matters that an unused thing isn't upgraded but also not decom'ed yet, etc ..bla bla [20:14:08] mutante: oh, I don't care that much. Just dug that bug out of Bugzilla's bottomless... well, database [20:14:48] yep, you could link it to a Tampa tracking .. if you want .. [20:16:59] I can't find any bug that includes the word 'Tampa [20:17:22] -z [20:18:52] twkozlowski: ugh, true.. Bugzilla ticket: (no value) [20:19:21] https://bugzilla.wikimedia.org/show_bug.cgi?id=45528 mutante ? [20:19:23] twkozlowski: well, https://wikitech.wikimedia.org/wiki/Tampa_cluster is the same thing and RT #6099 and [20:19:44] https://wikitech.wikimedia.org/w/index.php?title=Tampa_cluster&action=history [20:20:52] twkozlowski: maybe, thx, i linked that, better than none [20:21:07] for me it's 6099 [20:21:28] and i'd like to keep the wikitech site in sync [20:21:33] for the public part [20:28:50] Sorry! We could not process your edit due to a loss of session data. Please try again. If it still does not work, try logging out and logging back in. [20:28:53] :( [20:28:54] is that a simple timeout? [20:29:16] jgage: Nope... more of a session timeout thingy [20:29:38] oddly it seems to have accepted my edit despite the error message [20:29:54] oh no, it gave me a preview [20:30:03] all is not lost [20:30:56] yeah ok i just had to save the edit a second itme [20:30:57] time [20:33:38] twkozlowski: please check bugzilla sidebar:) [20:34:06] yay [20:34:20] Reedy: good?
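
Back to bd808's batch-size question at [19:50:29]: salt does have one, which is the usual answer to "don't start every rsync at once". A hedged sketch; the targeting expression and fanout numbers are illustrative, not taken from the actual scap tooling:

    # run the command on at most 25 minions at a time; -b also takes a
    # percentage, e.g. -b 10%
    salt -b 25 -G 'cluster:appserver' cmd.run '/usr/local/bin/scap-1'
    # rough dsh equivalent: concurrent mode with a bounded fork limit
    dsh -g mediawiki-installation -c -F 25 -- /usr/local/bin/scap-1
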
[20:34:33] deployed https://gerrit.wikimedia.org/r/#/c/113073/ [20:34:39] Certainly better than it was [20:34:51] cool [20:35:29] that's good enough for the weekend:) [20:35:52] Deploy and go home [20:35:55] * Reedy high fives mutante [20:36:10] hah:) thx [20:36:39] i can wait a bit to see if it breaks suddenly:) [20:39:06] mutante: yay, works like it did before [20:39:46] nice [20:40:07] * twkozlowski also suggests that someone deploys https://gerrit.wikimedia.org/r/#/c/113377/ [20:40:19] greg-g: ^^ [20:41:34] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [20:41:44] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [20:41:44] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [20:41:54] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100% [20:41:56] Forgot about that [20:42:02] (03PS2) 10Legoktm: Revert "Add local interwiki for metawiki" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113377 [20:42:06] (03CR) 10Reedy: [C: 032] Revert "Add local interwiki for metawiki" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113377 (owner: 10Legoktm) [20:42:14] PROBLEM - Host text-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1 [20:42:16] (03Merged) 10jenkins-bot: Revert "Add local interwiki for metawiki" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113377 (owner: 10Legoktm) [20:42:18] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:18] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:24] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1:c [20:42:27] PROBLEM - Host upload-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::2:b [20:42:30] PROBLEM - Host bits-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1:a [20:42:34] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:34] PROBLEM - Host cp4005 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:34] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:34] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:34] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:35] That doesn't look good [20:42:35] PROBLEM - Host cp4013 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:35] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:36] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:36] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:37] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:37] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:38] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:38] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:39] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:39] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:40] uh [20:42:48] Bang. IPv6 is gone. [20:42:50] are people on this or should I be looking?
[20:42:54] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:54] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:55] PROBLEM - Host lvs4004 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:55] PROBLEM - Host lvs4002 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:55] PROBLEM - Host upload-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [20:43:00] D'oh! [20:43:04] PROBLEM - Host bits-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [20:43:08] PROBLEM - Host text-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [20:43:11] wt [20:43:11] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [20:43:32] all of ulsfo [20:43:34] awesome [20:43:37] transit down? [20:43:41] !log reedy synchronized wmf-config/InitialiseSettings.php 'Revert Add local interwiki for metawiki' [20:43:41] I think ulsfo just lost its transit. [20:43:42] seems so [20:43:49] Logged the message, Master [20:44:08] texts were a bit delayed there [20:44:33] and yeah I had an old ssh session open into a ulsfo machine and my session's hung :( [20:44:57] uff [20:45:06] so move stuff off to eqiad I guess? [20:45:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [20:45:20] yeah I'll do the dns change [20:45:37] k [20:45:45] unless someone has a bright idea about what's wrong with the network and/or how to fix it quickly :) [20:45:54] eh no [20:46:04] PROBLEM - Host backup4001 is DOWN: PING CRITICAL - Packet loss = 100% [20:46:05] !log ULSFO down, traffic to Asia etc affected. Being worked on [20:46:11] Logged the message, Master [20:46:11] think of the backups! [20:46:30] AFAICT, ulsfo just completely dropped off the 'net inside and out [20:47:00] That's not good [20:47:36] (03PS1) 10BBlack: ulsfo outage, temporarily s/ulsfo/eqiad/ [operations/dns] - 10https://gerrit.wikimedia.org/r/113456 [20:48:02] ah the easy way [20:48:07] (03CR) 10BBlack: [C: 032 V: 032] ulsfo outage, temporarily s/ulsfo/eqiad/ [operations/dns] - 10https://gerrit.wikimedia.org/r/113456 (owner: 10BBlack) [20:48:46] ... dafu? Looks like the routes are gone. [20:49:08] the dns stuff is done, TTLs notwithstanding [20:50:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [20:50:42] [payments stuff is me running updates] [20:51:14] Shall we poke GTT? That definitely looks like their pipe just went boom. [20:51:41] did you see the device reboot notices just arriving in email? 
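
With the temporary s/ulsfo/eqiad/ change above deployed ("the dns stuff is done, TTLs notwithstanding"), the failover can be checked from outside; a small sketch (ns0.wikimedia.org is a real authoritative server, but exactly which records the geo map rewrites is an assumption here):

    # ask the authoritative server directly, bypassing caches; resolvers that
    # previously mapped to ulsfo should now get an eqiad text-lb address
    dig +short en.wikipedia.org @ns0.wikimedia.org
    # cached answers persist until their TTL expires, which bounds how long
    # stragglers keep hitting the dead site
    dig en.wikipedia.org | grep -A2 'ANSWER SECTION'
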
[20:51:41] maybe it got the measles [20:51:54] RECOVERY - Host cp4019 is UP: PING WARNING - Packet loss = 44%, RTA = 74.37 ms [20:51:56] beyond my scope (and endurance, 11 pm here and desperately trying to get some dinner in me) [20:52:04] RECOVERY - Host cp4005 is UP: PING OK - Packet loss = 0%, RTA = 75.36 ms [20:52:04] RECOVERY - Host cp4017 is UP: PING OK - Packet loss = 0%, RTA = 73.39 ms [20:52:04] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 72.05 ms [20:52:04] RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 75.22 ms [20:52:04] RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 74.60 ms [20:52:05] RECOVERY - Host cp4001 is UP: PING OK - Packet loss = 0%, RTA = 73.84 ms [20:52:05] RECOVERY - Host cp4014 is UP: PING OK - Packet loss = 0%, RTA = 73.96 ms [20:52:06] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 75.19 ms [20:52:06] RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 72.07 ms [20:52:07] RECOVERY - Host lvs4002 is UP: PING OK - Packet loss = 0%, RTA = 72.19 ms [20:52:07] RECOVERY - Host cp4003 is UP: PING OK - Packet loss = 0%, RTA = 73.82 ms [20:52:08] RECOVERY - Host cp4018 is UP: PING OK - Packet loss = 0%, RTA = 72.19 ms [20:52:08] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 75.02 ms [20:52:09] RECOVERY - Host cp4006 is UP: PING OK - Packet loss = 0%, RTA = 72.05 ms [20:52:09] RECOVERY - Host lvs4004 is UP: PING OK - Packet loss = 0%, RTA = 72.65 ms [20:52:10] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 73.81 ms [20:52:10] RECOVERY - Host cp4008 is UP: PING OK - Packet loss = 0%, RTA = 74.18 ms [20:52:11] RECOVERY - Host cp4013 is UP: PING OK - Packet loss = 0%, RTA = 74.58 ms [20:52:14] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 75.05 ms [20:52:14] RECOVERY - Host cp4010 is UP: PING OK - Packet loss = 0%, RTA = 73.31 ms [20:52:14] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 73.73 ms [20:52:14] RECOVERY - Host cp4016 is UP: PING OK - Packet loss = 0%, RTA = 72.00 ms [20:52:14] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 75.07 ms [20:52:44] RECOVERY - Host upload-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 70.60 ms [20:52:48] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 71.85 ms [20:52:51] RECOVERY - Host bits-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 71.86 ms [20:53:11] wow, they all rebooted [20:53:13] power fail? [20:53:14] RECOVERY - Host text-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 72.48 ms [20:53:25] RECOVERY - Host upload-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 73.80 ms [20:53:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [20:53:34] RECOVERY - Host bits-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 73.22 ms [20:53:37] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 73.24 ms [20:53:40] RECOVERY - Host text-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 71.97 ms [20:53:42] I'm seeing 4minute uptimes on machines I look at manually [20:53:50] oh? [20:53:56] wow [20:54:04] I was about to say. 
[20:54:14] PROBLEM - Varnish HTTP upload-frontend on cp4006 is CRITICAL: Connection refused [20:54:15] PROBLEM - Varnish HTTP upload-frontend on cp4005 is CRITICAL: Connection refused [20:54:15] PROBLEM - Varnish HTTP mobile-backend on cp4020 is CRITICAL: Connection refused [20:54:15] PROBLEM - Varnish HTCP daemon on cp4015 is CRITICAL: Connection refused by host [20:54:15] PROBLEM - Varnish HTTP text-backend on cp4018 is CRITICAL: Connection refused [20:54:15] PROBLEM - Varnish HTTP text-frontend on cp4017 is CRITICAL: Connection refused [20:54:15] PROBLEM - Varnish HTTP upload-backend on cp4005 is CRITICAL: Connection refused [20:54:15] It looks like all the racks just went dark. [20:54:16] PROBLEM - Varnish HTTP upload-frontend on cp4015 is CRITICAL: Connection refused [20:54:16] PROBLEM - Varnish HTTP text-frontend on cp4018 is CRITICAL: Connection refused [20:54:17] PROBLEM - Varnish HTTP upload-backend on cp4015 is CRITICAL: Connection refused [20:54:17] PROBLEM - Varnish HTTP upload-frontend on cp4013 is CRITICAL: Connection refused [20:54:18] PROBLEM - Varnish HTTP mobile-frontend on cp4020 is CRITICAL: Connection refused [20:54:18] PROBLEM - RAID on cp4020 is CRITICAL: Connection refused by host [20:54:19] PROBLEM - Varnish traffic logger on cp4017 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [20:54:24] PROBLEM - Varnish HTTP mobile-frontend on cp4019 is CRITICAL: Connection refused [20:54:25] PROBLEM - Varnish HTCP daemon on cp4019 is CRITICAL: Connection refused by host [20:54:25] PROBLEM - puppet disabled on cp4020 is CRITICAL: Connection refused by host [20:54:25] PROBLEM - DPKG on cp4018 is CRITICAL: Connection refused by host [20:54:25] PROBLEM - puppet disabled on cp4006 is CRITICAL: Connection refused by host [20:54:25] PROBLEM - RAID on lvs4003 is CRITICAL: Connection refused by host [20:54:25] PROBLEM - Disk space on cp4019 is CRITICAL: Connection refused by host [20:54:26] PROBLEM - Varnish traffic logger on cp4011 is CRITICAL: Connection refused by host [20:54:26] PROBLEM - Varnish HTCP daemon on cp4010 is CRITICAL: Connection refused by host [20:54:27] PROBLEM - Varnishkafka log producer on cp4020 is CRITICAL: Connection refused by host [20:54:27] PROBLEM - Varnish traffic logger on cp4018 is CRITICAL: Connection refused by host [20:54:28] PROBLEM - Disk space on lvs4003 is CRITICAL: Connection refused by host [20:54:28] PROBLEM - DPKG on cp4020 is CRITICAL: Connection refused by host [20:54:29] PROBLEM - Varnishkafka log producer on cp4011 is CRITICAL: Connection refused by host [20:54:29] PROBLEM - puppet disabled on cp4013 is CRITICAL: Connection refused by host [20:54:30] PROBLEM - Varnish HTCP daemon on cp4006 is CRITICAL: Connection refused by host [20:54:30] PROBLEM - Disk space on cp4018 is CRITICAL: Connection refused by host [20:54:31] PROBLEM - RAID on cp4013 is CRITICAL: Connection refused by host [20:54:31] PROBLEM - Varnish traffic logger on cp4015 is CRITICAL: Connection refused by host [20:54:32] PROBLEM - Varnish traffic logger on cp4014 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [20:54:32] PROBLEM - Varnish traffic logger on cp4010 is CRITICAL: Connection refused by host [20:54:33] PROBLEM - Varnish HTCP daemon on cp4013 is CRITICAL: Connection refused by host [20:54:33] PROBLEM - Varnish traffic logger on cp4019 is CRITICAL: Connection refused by host [20:54:34] PROBLEM - Disk space on cp4020 is CRITICAL: Connection refused by host [20:54:48] all the cp4* hosts are 5 mins all right [20:54:54]
PROBLEM - puppet disabled on cp4011 is CRITICAL: Connection refused by host [20:54:54] PROBLEM - puppet disabled on lvs4003 is CRITICAL: Connection refused by host [20:54:54] PROBLEM - puppet disabled on cp4019 is CRITICAL: Connection refused by host [20:54:54] PROBLEM - Varnish traffic logger on cp4013 is CRITICAL: Connection refused by host [20:54:54] PROBLEM - puppet disabled on cp4010 is CRITICAL: Connection refused by host [20:54:55] PROBLEM - RAID on cp4018 is CRITICAL: Connection refused by host [20:54:55] PROBLEM - Varnishkafka log producer on cp4019 is CRITICAL: Connection refused by host [20:54:56] PROBLEM - Disk space on cp4015 is CRITICAL: Connection refused by host [20:54:56] PROBLEM - DPKG on cp4006 is CRITICAL: Connection refused by host [20:55:05] PROBLEM - Varnish HTTP upload-frontend on cp4014 is CRITICAL: Connection refused [20:55:05] PROBLEM - Varnish HTTP upload-backend on cp4013 is CRITICAL: Connection refused [20:55:05] PROBLEM - Varnish HTTP text-backend on cp4008 is CRITICAL: Connection refused [20:55:05] PROBLEM - Varnish HTTP mobile-frontend on cp4011 is CRITICAL: Connection refused [20:55:05] PROBLEM - Varnish HTTP upload-backend on cp4006 is CRITICAL: Connection refused [20:55:22] yeah so UL might be better to contact than GTT [20:55:55] can we call mark? [20:56:25] PROBLEM - LVS HTTP IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: Connection refused [20:56:26] I don't have his cell number, and officewiki is down :) [20:56:29] PROBLEM - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: Connection refused [20:56:46] * greg-g puts that in todo for Monday: put all of ops in phone [20:56:47] and the lvs's too so that's all of em [20:56:51] (5 mins, now 7) [20:56:54] RECOVERY - puppet disabled on cp4010 is OK: OK [20:56:54] PROBLEM - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.379 second response time [20:56:54] apergos: can you call mark please? [20:57:04] RECOVERY - DPKG on cp4010 is OK: All packages OK [20:57:04] kaldari: so, no deploys right now :/ [20:57:04] RECOVERY - RAID on cp4010 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [20:57:10] sec [20:57:14] RECOVERY - Varnish HTTP text-backend on cp4010 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.149 second response time [20:57:15] RECOVERY - Varnish HTTP text-frontend on cp4010 is OK: HTTP OK: HTTP/1.1 200 OK - 198 bytes in 0.152 second response time [20:57:15] RECOVERY - Disk space on cp4010 is OK: DISK OK [20:57:20] greg-g: ok [20:57:25] RECOVERY - Varnish HTCP daemon on cp4010 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [20:57:25] RECOVERY - LVS HTTP IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27847 bytes in 0.218 second response time [20:57:30] RECOVERY - Varnish traffic logger on cp4010 is OK: PROCS OK: 2 processes with command name varnishncsa [20:57:32] greg-g: the patch still hasn't been merged anyway :P [20:57:37] kaldari: ulsfo is down (well, maybe recovering now) [20:57:47] Cleaning staff unplugged the racks to plug in the floor polisher? 
[20:57:54] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 70053 bytes in 0.620 second response time [20:58:14] PROBLEM - LVS HTTP IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: Connection refused [20:58:20] we'll leave DNS failed over till we get to the bottom of it in any case [20:58:25] RECOVERY - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 69937 bytes in 0.398 second response time [20:59:04] RECOVERY - Varnish HTTP upload-backend on cp4013 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.154 second response time [20:59:04] RECOVERY - Disk space on cp4013 is OK: DISK OK [20:59:04] RECOVERY - DPKG on cp4013 is OK: All packages OK [20:59:14] RECOVERY - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 27898 bytes in 0.455 second response time [20:59:18] RECOVERY - Varnish HTTP upload-frontend on cp4013 is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 0.154 second response time [20:59:18] RECOVERY - LVS HTTP IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 541 bytes in 0.150 second response time [20:59:18] bblack: yeah [20:59:24] RECOVERY - RAID on cp4013 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [20:59:24] RECOVERY - puppet disabled on cp4013 is OK: OK [20:59:24] RECOVERY - Varnish HTCP daemon on cp4013 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [20:59:34] RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 563 bytes in 0.382 second response time [20:59:40] back up for me. [20:59:54] RECOVERY - Varnish traffic logger on cp4013 is OK: PROCS OK: 2 processes with command name varnishncsa [20:59:54] RECOVERY - RAID on cp4018 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:00:04] RECOVERY - Varnish HTCP daemon on cp4018 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [21:00:05] RECOVERY - puppet disabled on cp4018 is OK: OK [21:00:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [21:00:15] Mark Bergsma: +31-654282595 [21:00:15] RECOVERY - Varnish HTTP text-frontend on cp4018 is OK: HTTP OK: HTTP/1.1 200 OK - 197 bytes in 0.148 second response time [21:00:15] RECOVERY - LVS HTTP IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 541 bytes in 0.148 second response time [21:00:24] RECOVERY - DPKG on cp4018 is OK: All packages OK [21:00:24] RECOVERY - Disk space on cp4018 is OK: DISK OK [21:00:25] RECOVERY - Varnish traffic logger on cp4018 is OK: PROCS OK: 2 processes with command name varnishncsa [21:00:29] Yeah, from what I see in the SEL the boxes just out and lost power. 
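(The "SEL" here is the System Event Log kept by each box's out-of-band management controller, which stays readable even when the OS is down. A minimal sketch of pulling it with ipmitool; the management hostname and credentials are placeholders, not real names:)

    #!/bin/bash
    # Dump the last few System Event Log entries from a host's BMC,
    # out of band. The -mgmt hostname and password variable are
    # assumed names for this sketch.
    ipmitool -I lanplus -H cp4011-mgmt.example.org -U root \
        -P "$IPMI_PASSWORD" sel elist | tail -n 20

(A burst of "AC lost" or "System is turning off" records carrying the same timestamp on every host in the site points at facility power rather than individual machines, which is exactly the conclusion drawn from the iDRAC excerpt just below.)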
[21:00:47] I love how icinga signs all the texts "<3, Icinga" :P It makes my valentine's day special-er :) [21:00:54] RECOVERY - LVS HTTP IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 27848 bytes in 0.221 second response time [21:00:55] ha [21:01:44] PROBLEM - Host payments1003 is DOWN: PING CRITICAL - Packet loss = 100% [21:01:50] greg-g: hmm, office.wm.org works for me [21:02:05] PROBLEM - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.380 second response time [21:02:16] andre__: yeah, you aren't in asia/west coast :) [21:02:21] andre__: it was ulsfo [21:02:54] greg-g: ahah oh well makes sense [21:03:04] RECOVERY - Varnish traffic logger on cp4006 is OK: PROCS OK: 2 processes with command name varnishncsa [21:03:05] RECOVERY - RAID on cp4006 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:03:05] RECOVERY - Disk space on cp4006 is OK: DISK OK [21:03:05] RECOVERY - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 27898 bytes in 0.449 second response time [21:03:14] RECOVERY - Varnish HTTP upload-frontend on cp4006 is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 0.148 second response time [21:03:16] :) [21:03:24] RECOVERY - Varnish HTCP daemon on cp4006 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [21:03:25] RECOVERY - puppet disabled on cp4006 is OK: OK [21:03:54] RECOVERY - DPKG on cp4006 is OK: All packages OK [21:03:55] I think we shall need to have /words/ with UL. [21:04:05] RECOVERY - Varnish HTCP daemon on cp4020 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [21:04:05] RECOVERY - Varnish traffic logger on cp4020 is OK: PROCS OK: 2 processes with command name varnishncsa [21:04:14] RECOVERY - Varnish HTTP mobile-frontend on cp4020 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 0.151 second response time [21:04:15] RECOVERY - RAID on cp4020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:04:24] RECOVERY - puppet disabled on cp4020 is OK: OK [21:04:25] RECOVERY - DPKG on cp4020 is OK: All packages OK [21:04:25] RECOVERY - Disk space on cp4020 is OK: DISK OK [21:04:54] RECOVERY - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27898 bytes in 0.459 second response time [21:05:04] RECOVERY - Varnishkafka log producer on cp4012 is OK: PROCS OK: 1 process with command name varnishkafka [21:05:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [21:05:24] RECOVERY - Host payments1003 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [21:06:04] RECOVERY - Varnish HTTP text-backend on cp4008 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.147 second response time [21:06:24] PROBLEM - NTP on cp4013 is CRITICAL: NTP CRITICAL: Offset unknown [21:06:24] PROBLEM - NTP on cp4012 is CRITICAL: NTP CRITICAL: Offset unknown [21:06:25] PROBLEM - NTP on cp4018 is CRITICAL: NTP CRITICAL: Offset unknown [21:06:25] PROBLEM - NTP on cp4007 is CRITICAL: NTP CRITICAL: Offset unknown [21:06:46] Timestamp = 2014-02-14 19:42:14 [21:06:46] Message = System is turning off. [21:06:46] FQDD = iDRAC.Embedded.1#HostPowerCtrl [21:06:54] PROBLEM - NTP on cp4005 is CRITICAL: NTP CRITICAL: Offset unknown [21:06:54] PROBLEM - NTP on cp4011 is CRITICAL: NTP CRITICAL: Offset unknown [21:06:55] Coren: a few of them, of the strong variety [21:06:56] hey [21:06:57] what's going on? 
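(The "Offset unknown" NTP criticals above are a normal cold-boot symptom: ntpd needs a few polling intervals after startup before it selects a sync peer and can report an offset at all, which is why these alerts clear on their own a few minutes later. A sketch of checking peer state by hand; the internal hostname is an assumption:)

    # Query a host's NTP peer table; an asterisk in the first column
    # marks the selected sync peer. Until one appears, the monitoring
    # check has no offset to report.
    ntpq -p cp4013.ulsfo.wmnet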
[21:06:59] I'm out [21:07:04] PROBLEM - NTP on cp4010 is CRITICAL: NTP CRITICAL: Offset unknown [21:07:05] PROBLEM - NTP on cp4019 is CRITICAL: NTP CRITICAL: Offset unknown [21:07:05] PROBLEM - NTP on bast4001 is CRITICAL: NTP CRITICAL: Offset unknown [21:07:05] PROBLEM - NTP on cp4015 is CRITICAL: NTP CRITICAL: Offset unknown [21:07:06] with no access to my SSH keys [21:07:13] paravoid_: we failed over dns to eqiad from ulsfo, it just dropped [21:07:17] paravoid: We lost all power to ULSFO for a while. [21:07:19] no clear real cause yet, maybe power? [21:07:24] RECOVERY - Varnish traffic logger on cp4011 is OK: PROCS OK: 2 processes with command name varnishncsa [21:07:41] greg-g: Definitely power according to the RAC logs. [21:07:46] * greg-g nods [21:07:46] k [21:07:54] RECOVERY - puppet disabled on cp4011 is OK: OK [21:07:57] did anyone call united layer? [21:08:04] RECOVERY - Varnish HTTP mobile-frontend on cp4011 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 0.151 second response time [21:08:05] RECOVERY - DPKG on cp4011 is OK: All packages OK [21:08:05] RECOVERY - Disk space on cp4011 is OK: DISK OK [21:08:05] RECOVERY - Varnish HTCP daemon on cp4011 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [21:08:05] RECOVERY - RAID on cp4011 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:08:09] I dunno about the layout enough to figure out if we lost one rack or both? [21:08:14] PROBLEM - Host payments1002 is DOWN: PING CRITICAL - Packet loss = 100% [21:08:29] "Hey, WMF here; 3.4 billion users couldn't access Wikipedia for 10 minutes. Cheers." [21:08:32] I believe we lost all machines [21:08:46] RobH: ping? [21:09:03] List of things you expect your DC to not do: lose all power without warning on all your gear. [21:09:04] RECOVERY - DPKG on lvs4003 is OK: All packages OK [21:09:13] is Rob around? [21:09:24] RECOVERY - Disk space on lvs4003 is OK: DISK OK [21:09:25] RECOVERY - RAID on lvs4003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:09:27] RobH was still sick, I think, though better. [21:09:34] PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.365 second response time [21:09:54] RECOVERY - puppet disabled on lvs4003 is OK: OK [21:09:54] RECOVERY - NTP on cp4005 is OK: NTP OK: Offset -0.0002267360687 secs [21:09:54] RECOVERY - NTP on cp4011 is OK: NTP OK: Offset 0.0005394220352 secs [21:10:05] RECOVERY - NTP on cp4015 is OK: NTP OK: Offset 0.0004215240479 secs [21:10:05] RECOVERY - NTP on cp4019 is OK: NTP OK: Offset 0.0003026723862 secs [21:10:05] RECOVERY - NTP on cp4010 is OK: NTP OK: Offset -0.000821352005 secs [21:10:07] cp4xx in ganglia say uptime 119 days [21:10:11] monitoring flake? 
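(Rather than eyeballing boxes one at a time, the uptime question can be answered for the whole site in one shot from the salt master. A minimal sketch, reusing the cp4* naming from the alerts above:)

    # Ask every ulsfo cache host for its real uptime in one command.
    # Freshly rebooted machines will report minutes; anything still
    # claiming 119 days would point at stale monitoring data instead.
    salt 'cp4*' cmd.run 'uptime'

(Ganglia's 119-day figure turns out to be exactly that kind of staleness, as the manual checks just below confirm.)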
[21:10:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [21:10:24] RECOVERY - NTP on cp4013 is OK: NTP OK: Offset -0.0003784894943 secs [21:10:24] RECOVERY - NTP on cp4012 is OK: NTP OK: Offset 0.0007705688477 secs [21:10:25] RECOVERY - Host payments1002 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [21:10:25] RECOVERY - NTP on cp4018 is OK: NTP OK: Offset 0.0005210638046 secs [21:10:25] RECOVERY - NTP on cp4007 is OK: NTP OK: Offset -0.003658771515 secs [21:10:28] paravoid_: no, I definitely had bad gateway [21:10:34] RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 566 bytes in 0.392 second response time [21:10:42] paravoid_: nvm me [21:11:07] 17 minutes according to log + Jorm's OK [21:11:13] root@cp4011:~# uptime [21:11:13] 21:11:02 up 22 min, 1 user, load average: 0.08, 0.19, 0.13 [21:11:33] I think ganglia just reports what it sees and not what the box says. [21:11:47] paravoid_: yeah ganglia's wrong, they all have short uptimes on manual check [21:15:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [21:15:24] PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.384 second response time [21:15:41] can people call UL to see what happened? [21:15:45] Jeff_Green: btw, you said all the payments stuff was 'just you', still the case? [21:15:47] or, preferably, call RobH and have him do that? [21:16:11] paravoid_: brandon is trying one number, he doesn't have an access code though, we'll see how far he gets [21:16:19] * greg-g plays cross channel relay [21:16:34] heh, sorry, I can't join other channels from here :) [21:16:43] yeah, no worries :) [21:16:46] we have ops people in SF, don't we [21:17:12] other than sick RobH, jgage is next. [21:17:24] RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 564 bytes in 0.375 second response time [21:17:29] or contract Leslie-Carr :) [21:17:43] paravoid_: We're surprisingly short on opsen physically at the office nowadays. [21:17:50] I am aware [21:18:04] RECOVERY - Varnish HTTP upload-frontend on cp4014 is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 0.147 second response time [21:18:14] RECOVERY - Varnish HTTP upload-backend on cp4015 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.149 second response time [21:18:15] RECOVERY - Varnish HTTP upload-frontend on cp4015 is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 0.150 second response time [21:18:15] RECOVERY - Varnish HTCP daemon on cp4015 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [21:18:24] RECOVERY - Varnish HTTP mobile-frontend on cp4019 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 0.144 second response time [21:18:24] RECOVERY - Disk space on cp4019 is OK: DISK OK [21:18:25] RECOVERY - Varnish HTCP daemon on cp4019 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [21:18:25] RECOVERY - Varnish traffic logger on cp4015 is OK: PROCS OK: 2 processes with command name varnishncsa [21:18:25] RECOVERY - Varnish traffic logger on cp4019 is OK: PROCS OK: 2 processes with command name varnishncsa [21:18:25] RECOVERY - Varnish traffic logger on cp4014 is OK: PROCS OK: 2 processes with command name varnishncsa [21:18:53] alright, so, we're failed over to eqiad, thus we're back to where we were 3 weeks ago, so I'm going to tell kaldari he can deploy his decently important bug fix soon.
Unless I hear yelling. [21:18:54] RECOVERY - Disk space on cp4015 is OK: DISK OK [21:18:54] RECOVERY - puppet disabled on cp4019 is OK: OK [21:19:04] RECOVERY - puppet disabled on cp4015 is OK: OK [21:19:04] RECOVERY - DPKG on cp4019 is OK: All packages OK [21:19:05] RECOVERY - DPKG on cp4015 is OK: All packages OK [21:19:05] RECOVERY - RAID on cp4015 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:19:05] RECOVERY - RAID on cp4019 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:19:14] RECOVERY - Varnish HTTP text-frontend on cp4017 is OK: HTTP OK: HTTP/1.1 200 OK - 198 bytes in 0.149 second response time [21:19:24] RECOVERY - Varnish traffic logger on cp4017 is OK: PROCS OK: 2 processes with command name varnishncsa [21:20:04] RECOVERY - Varnish traffic logger on cp4005 is OK: PROCS OK: 2 processes with command name varnishncsa [21:20:10] paravoid_: I talked to ulsfo briefly on the phone, they did confirm a "power event" [21:20:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [21:20:15] RECOVERY - Varnish HTTP upload-backend on cp4005 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.150 second response time [21:20:15] RECOVERY - Varnish HTTP upload-frontend on cp4005 is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 0.145 second response time [21:20:42] they claim everything's kosher now and it's sorted out, but, it would be nice to get more-official confirmation of the situation than just "whatever this guy said to me on the phone with no identification at all" [21:21:07] eh, what about their redundancy? [21:22:22] I'm going to sign off now, since I'm not at home anyway [21:22:23] MaxSem: "don't ask, don't tell" [21:22:29] paravoid_: ok, thanks for checking in [21:22:51] please try to get a proper postmortem from UL [21:22:59] what happened and if we're sure it won't happen again [21:23:16] probably better to wait for robh to get back [21:23:35] it's not any hurry now, but it'd be nice to get this information before tuesday [21:24:00] twkozlowski: another one for the technews ^^ :) "The ULSFO caching center went offline momentarily causing access to all Wikimedia hosted sites to fail for Oceania and the West Coast of the US for around 10(?) minutes." [21:24:33] don't forget southeast/east asia [21:25:11] and western territories of canada [21:25:14] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [21:25:21] s/US/North America/ [21:25:27] * jgage returns from lunch to see what all the alerts are about [21:25:31] * greg-g is a us-centric jerk [21:25:52] "On February 14, all sites were broken for about 15 minutes for users in Southeast Asia and western North America due to a cache server problem." [21:26:10] ulsfo gobbledygook [21:26:21] "caching center"? [21:26:32] twkozlowski: you forgot oceania [21:26:36] :) [21:26:44] anyway [21:26:46] bye now [21:26:48] later [21:26:57] twkozlowski, not "cache server problem" but whole bloody datacenter [21:27:03] :) [21:27:07] what Max said [21:27:08] yes, cache servers [21:27:32] and LVS! 
[21:27:38] ;) [21:27:49] try explaining that to your average non-geeky Wikipedian ;-) [21:28:27] !log Upgraded and restarted elasticsearch on logstash1002 [21:28:31] * twkozlowski notes with some irritation that he still doesn't know what the Feb 11 outage was about and how long it was [21:28:34] Logged the message, Master [21:28:59] greg-g: we've taken it off Tech News #8 temporarily because of lack of info [21:29:14] !log Upgraded and restarted elasticsearch on logstash1001 [21:29:25] Logged the message, Master [21:29:47] twkozlowski: for now, you could say something like "due to database issues" [21:30:21] I poked the people investigating, but there's (I believe) still ongoing investigation on root cause [21:30:33] !log Upgraded and restarted elasticsearch on logstash1003 [21:30:40] Logged the message, Master [21:30:56] twkozlowski: so, I apologize, and feel your frustration [21:31:04] I deeply sympathize [21:31:07] ;) [21:31:10] bd808: props for !logging verbosely [21:32:06] is there documentation on where/how to view the RAC logs? i'm not finding it. [21:32:21] RAC? [21:32:34] mentioned in scrollback as confirming power outage [21:32:44] ah, dunno [21:32:46] out of band management thingos [21:33:01] greg-g: database issues sounds good; how long did it take? [21:33:56] twkozlowski: not sure, started at 2014-01-11 22:10 UTC [21:34:04] RECOVERY - Varnish HTTP upload-backend on cp4006 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.149 second response time [21:34:24] RECOVERY - Varnishkafka log producer on cp4020 is OK: PROCS OK: 1 process with command name varnishkafka [21:34:54] * twkozlowski checks channel logs yet again [21:35:32] greg-g: 02-11 surely :) [21:35:59] Did something funky happen to the job queue about 24 hours ago? There are 6 a/v files queued to be transcoded that never happened (Not something that causes a problem since they can be easily restarted, just curious) [21:36:25] twkozlowski: heh, yeah, that was a copy paste from the initial email :) [21:37:19] twkozlowski: on a side note because this discussion is reminding me, thank you so much for all of your work on the tech news. Hugely helpful. [21:37:24] RECOVERY - Varnishkafka log producer on cp4011 is OK: PROCS OK: 1 process with command name varnishkafka [21:37:35] sounds like nobody has contacted UL yet? i'll look into how to do that. [21:38:14] * twkozlowski hugs Jamesofur [21:38:22] Jamesofur: ++ [21:38:23] https://meta.wikimedia.org/w/index.php?title=Tech/News/2014/08&diff=7496055&oldid=7496052 greg-g [21:38:27] :D /hugs [21:39:29] * jgage is calling UL [21:39:29] ditto, thanks much twkozlowski [21:39:51] jgage: brandon called, we got a quick message [21:40:08] * ori cheers jgage on [21:40:38] jgage: to be clear, he didn't get much, if you can get more, please plese do [21:40:47] greg-g: so you see that's all the info I have right now; not much to go with [21:40:58] all they said is power outage, restored, email has been sent. "UnitedLayer SF7 Power Event on 2/14/14 [21:41:03] " to ops@ [21:41:27] twkozlowski: that made me laugh, then frown [21:41:29] greg-g: edits welcome, and encouraged :) [21:41:44] * greg-g looks at his IRC logs [21:42:01] the gist of which is "At approximately 12:50 PM today during a routine maintenance of a UPS unit in SF7, a power event caused the UPS to drop load to some customer circuits. 
We are still looking into the root cause and impact" [21:42:21] some meaning "all" in our case :P [21:43:07] that's a pretty nasty power event [21:43:14] !log logstash fatalmonitor dashboard working again after restarts to backend [21:43:21] Logged the message, Master [21:44:23] (03PS12) 10BBlack: Handle HTTPS for Zero traffic [operations/puppet] - 10https://gerrit.wikimedia.org/r/102316 (owner: 10Yurik) [21:44:23] Clearly wasn't routine if they managed to flub it that hard. [21:45:06] we're not dual-feed? [21:45:11] (03CR) 10BBlack: "^ PS12 is just a manual rebase onto all the other recent changes" [operations/puppet] - 10https://gerrit.wikimedia.org/r/102316 (owner: 10Yurik) [21:46:05] !log disabled puppetd on logstash1002 to test ganglia monitor fix [21:46:14] Logged the message, Master [21:48:54] RECOVERY - Varnishkafka log producer on cp4019 is OK: PROCS OK: 1 process with command name varnishkafka [21:50:04] none of the ulsfo power strips are in librenms :\ [21:51:15] * jgage makes RT 6850 about that [21:51:26] twkozlowski: looking at my irc logs, looks like that one was ~20 minutes [21:51:55] Good greg-g. Now we only need to know what happened :) [21:52:22] since it was a Parsoid problem, did it affect VE editing somehow? [21:52:31] twkozlowski: ohhhh, that was separate [21:52:34] :) [21:52:39] there were two right around the same time :) [21:52:55] the one I've been talking about, and then the parsoid one, gwicke should be getting that posted soon [21:54:08] twkozlowski, what I heard is that people got an error in about 1 out of 20 requests [21:55:47] as in page loads, similar to what happened on Feb 13? [21:55:51] [crazy week, eh?] [21:56:45] twkozlowski, as in VE edits [21:56:56] parsoid does not affects normal page loads [21:57:02] *affect [21:57:19] gwicke: greg says that's not what he had in mind :) [21:57:34] so first outage: 20 minutes due to database issues, errors for 1 out of 20 requests [21:57:48] then on the same day, around the same time, problems with VE edits due to ? [21:57:50] twkozlowski: s/, errors for 1 out of 20 requests// [21:58:01] that errors 1 in 20 was for the parsoid one [21:58:19] boy would this be clearer if we had finished incident reports on wiki [21:58:19] okay [21:58:22] * greg-g smikes [21:58:27] * greg-g smiles, even [21:58:53] * gwicke feels the pressure [21:59:26] gwicke: so errors = couldn't save edits? [21:59:47] gwicke: not just you :) [22:00:25] twkozlowski, or got an error when loading content into VE [22:00:54] On February 11, users experienced problems with VisualEditor for about 20 minutes due to database issues." [22:01:02] perhaps not quite... [22:01:11] this is a highly unscientific number, quoted from what I remember about the IRC backlog- I think Eloquence did some manual testing at the time [22:01:15] The Parsoid cluster outage? [22:01:18] - database issues [22:01:40] gwicke: the parsoid thing wasn't caused by db stuff was it? [22:02:03] no, that was log files filling up the disk and taking out about 3/4 of the parsoid nodes [22:02:27] * gwicke writes up a report [22:02:35] :) [22:02:39] that's what I thought [22:02:52] twkozlowski: Oh, I should write VE items for Tech/News shouldn't I? [22:03:04] James_F: We did earlier today [22:03:13] Eurgh. Wrongly. [22:03:18] https://meta.wikimedia.org/wiki/Tech/News/2014/08 [22:03:24] see, that's the problem with VE [22:03:33] You add stuff on a Friday evening :-) [22:03:44] Thursday evening from our POV. :-P [22:03:48] Forthcoming change !!!!== "you can now".
[22:03:57] (03PS1) 10Ori.livneh: Migrate EventLogging to "%{..}x"-style format specifiers [operations/puppet] - 10https://gerrit.wikimedia.org/r/113470 [22:04:09] * James_F fixes. [22:04:31] oh, that's because of bug status in BZ [22:04:41] Yeah. FIXED !== DEPLOYED. [22:04:45] (If only we had that state.) [22:05:02] (03PS1) 10BryanDavis: Make elasticsearch ganglia monitor compatible with logstash [operations/puppet] - 10https://gerrit.wikimedia.org/r/113471 [22:05:14] RECOVERY - Varnish HTTP mobile-backend on cp4020 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.151 second response time [22:05:37] yay greg-g, thanks [22:06:32] James_F: for site requests, I know that people keep their bugs open until their patches are deployed [22:06:55] but I guess to each their own, maybe that won't work for you [22:06:56] (03CR) 10Ori.livneh: [C: 032] Migrate EventLogging to "%{..}x"-style format specifiers [operations/puppet] - 10https://gerrit.wikimedia.org/r/113470 (owner: 10Ori.livneh) [22:07:06] (03PS2) 10Ori.livneh: eventlogging: manage /etc/eventlogging.d recursively [operations/puppet] - 10https://gerrit.wikimedia.org/r/113277 [22:07:12] (03CR) 10Ori.livneh: [C: 032 V: 032] eventlogging: manage /etc/eventlogging.d recursively [operations/puppet] - 10https://gerrit.wikimedia.org/r/113277 (owner: 10Ori.livneh) [22:07:12] twkozlowski: For site requests merged and deployed are the same state (or should be unless someone screwed up). [22:07:45] True. [22:08:06] (03CR) 10BryanDavis: "Tested by manual application on logstash1002. Needs to be verified not to disrupt reporting on the Search cluster." [operations/puppet] - 10https://gerrit.wikimedia.org/r/113471 (owner: 10BryanDavis) [22:08:56] James_F: I'd introduce a DEPLOYED status in Bugzilla if we had an automatic way to set it :) [22:09:40] andre__: I'd be delighted to set it manually for VE. [22:09:47] andre__: Others might not want to. :-) [22:09:51] hmm. [22:09:55] andre__: That would be cool. Want to add it to the deploy system wish list at https://etherpad.wikimedia.org/p/DeploymentSystemRequirements? [22:10:10] andre__: Also, https://www.mediawiki.org/wiki/MediaWiki_1.23/wmf14/Changelog#VisualEditor from Reedy is a good opportunity to set the flag… [22:10:12] James_F, DEPLOYED_PHASE1, DEPLOYED_PHASE2, and DEPLOYED_PHASE3? :P [22:10:46] DEPLOYED_TO_ONE_PRODUCTION_BRANCH_BUT_NOT_THE_OTHER, etc. [22:10:46] andre__: I think DEPLOYED_TEST, DEPLOYED_SOME, DEPLOYED_ALL, and RELEASED (for MW releases) would make more sense. [22:10:53] ah, phases are now called groups on https://www.mediawiki.org/wiki/MediaWiki_1.23/Roadmap [22:11:04] James_F, hmm, good point. [22:11:18] andre__: Can we have a multi-level state like for "RESOLVED"? [22:11:26] :( [22:11:32] James_F, no :( [22:11:33] andre__: So "DEPLOYED" / "ALL" [22:11:35] Boo. [22:11:41] * James_F grumps about Bugzilla. [22:11:49] Hmm. That would be interesting, yeah [22:11:59] I'd rather not tie those BZ status to explicit stages of deployment that may change (probably will) [22:12:13] <^d> greg-g: +1 [22:12:16] greg-g: Hence TEST vs. ALL. [22:12:18] I'd rather like a "this gerrit change fixed it, click here to see where it lives" [22:12:30] <^d> Again, greg-g+1 [22:12:30] greg-g: We already have that – but not in Bugzilla. [22:12:46] greg-g: ("Included in" in gerrit.) [22:12:48] James_F: no we don't, we just have what branch it is in, which is not a 1:1 for deployed [22:12:58] wmf13 is where, again? :P [22:13:12] greg-g: Right now? Phases 1 and 2 but not 0. 
there needs to be a bit more logic before that's helpful [22:13:19] James_F: and you know that through? [22:13:26] greg-g: [[Deployments]]. :-) [22:13:27] gerrit? BZ? no, a by-hand wiki page :) [22:13:53] Actually, I know that from grrrit-wm's reports of Reedy's merges. [22:13:59] so yeah, one can piece things together, but Yuvi is working on a 'dashboard' for lack of a better word that'll help with that [22:14:05] James_F: :P [22:14:07] Oh, he is? [22:14:14] I should send him my thoughts. [22:14:18] well, he started, then got distracted, it was a weekend thing for him [22:14:19] greg-g: https://noc.wikimedia.org/conf/highlight.php?file=wikiversions.dat [22:14:24] greg-g: What? I stalk grrrit-wm a lot. :-) [22:14:24] bd808: :P [22:15:12] https://commons.wikimedia.org/wiki/File:PersonalDashboard_v1.jpg <- my 10 second thoughts on what the dashboard should be like [22:15:14] twkozlowski: OK if I mark https://meta.wikimedia.org/wiki/Tech/News/2014/08 for translation? [22:15:14] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-db1047 [22:15:21] No! [22:15:25] bd808: ^^ [22:15:31] * ori handles EL stuff [22:15:38] James_F: I'd wait for Guillaume to wake up :-) [22:15:43] thanks or [22:15:46] i [22:15:54] He's got a real knack for simplifying messages [22:16:01] twkozlowski: Bah. All those translators working on bad messages. :-( [22:16:20] twkozlowski: https://wikitech.wikimedia.org/wiki/Incident_documentation/20140211-Parsoid#Summary [22:16:29] greg-g: That's cute. I like it. [22:16:49] bd808: thanks dear [22:17:24] greg-g: We'd probably want some hooks into new BZ tickets tagged against something. [22:17:25] thanks much gwicke [22:17:29] greg-g: No idea how that would work. [22:17:34] James_F: yeah /me shrugs on that [22:17:34] You can now convert block images between some different types (like thumbnail, framed and frameless). [22:17:43] James_F: block images? [22:17:46] twkozlowski: Yes. [22:17:57] Q: What are block images? [22:18:15] Images (actually, media item transclusions, but I was simplifying) that are blocks, not inline. [22:18:24] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [22:18:33] The examples of the types follow the clause. [22:19:04] James_F: I had a look at it last week [22:19:07] greg-g: Some metrics on performance, not just fatals, would be good too. [22:19:30] James_F: so it's kind of file properties like thumbnail, frame, etc [22:19:51] twkozlowski: When last week? IIRC the code only merged on Friday. [22:19:51] James_F: let me know when we have those :) [22:20:14] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are running. [22:20:18] greg-g: Dan's working on some for VE right now. Not split by phase, though. :-( [22:20:21] James_F: when I reviewed VE's product on BZ :) [22:20:31] Dan who? [22:21:07] twkozlowski: Given that I write https://www.mediawiki.org/wiki/VisualEditor/status specifically so that people know what's in each VE release, maybe read that instead to avoid wasting your time? :-) [22:21:19] greg-g: Garry. [22:22:12] James_F: ..... from where? [22:22:15] !log enabled puppetd on logstash1002 [22:22:22] Logged the message, Master [22:22:36] James_F: That's posted every two weeks; Tech News is a weekly [22:22:48] twkozlowski: No, it's posted weekly.
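(The wikiversions.dat file linked a few lines up is the authoritative answer to "wmf13 is where?": it maps every wiki database name to the MediaWiki branch it currently runs. A sketch of querying it from a deployment host; the file path and the exact "<dbname> php-<version>" line format are assumptions here:)

    # List wikis currently pinned to the 1.23wmf13 branch.
    # The path varies by deployment setup; adjust as needed.
    grep 'php-1.23wmf13' /srv/mediawiki/wikiversions.dat | awk '{print $1}' | head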
[22:22:54] James_F: in any case, Guillaume's the one who's doing VE items [22:22:58] !log manually applied Ie56d3a5 on logstash100[123] hosts and restarted gmond [22:23:05] Logged the message, Master [22:23:10] 2013-01-16, then 2013-01-30, that's two weeks [22:23:11] twkozlowski: More specifically, it's posted every week that MW deploys. [22:23:50] "You can now change file properties like thumbnail and frame with VisualEditor." [22:23:55] James_F: ^^ ? [22:24:07] twkozlowski: They're not called "file properties". [22:24:12] twkozlowski: That's just… confusing. [22:24:54] twkozlowski, in case you also want to report on Parsoid: https://www.mediawiki.org/wiki/Parsoid/Deployments [22:25:12] twkozlowski: Parsoid is probably more interesting in some ways. [22:26:10] MW manual says 'file format' [22:30:07] !log kaldari synchronized php-1.23wmf14/extensions/VectorBeta/ 'sync update for VectorBeta on wmf14' [22:30:16] twkozlowski: It does? [22:30:16] Logged the message, Master [22:31:03] Yeah; I didn't even know that ;) [22:32:32] James_F: can you OK this? https://meta.wikimedia.org/w/index.php?title=Tech/News/2014/08&diff=7496614&oldid=7496373 [22:34:01] twkozlowski: "with VisualEditor" seems superfluous when it's in the VE section. :-) [22:34:04] RECOVERY - ElasticSearch health check on logstash1003 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 35: active_shards: 70: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [22:34:05] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 35: active_shards: 70: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [22:34:14] W00t! ^^ [22:34:25] RECOVERY - ElasticSearch health check on logstash1002 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 35: active_shards: 70: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [22:34:28] bd808: Yay? [22:34:33] Hopefully the damn thing will stay up now [22:34:38] Hopefully. :-) [22:36:37] Something "bad" happened last night that got the elasticsearch nodes behind logstash wedged up against their max jvm heap size. [22:37:01] twkozlowski: Better? [22:37:27] is it images or files? [22:38:19] You will soon be able to create and edit redirect pages suggests you can't right now [22:38:42] twkozlowski: It's "media items". [22:38:45] twkozlowski: But that sucks. [22:39:25] twkozlowski: Well, you can't (in VisualEditor). The entire section is about VE. 
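(For context on the elasticsearch restarts being !logged above: the "status: green" in the recovery messages is the cluster health that the icinga check wraps, and the heap problem bd808 describes is visible through the same HTTP API. A minimal sketch, assuming the endpoints are reachable locally and hedged against version differences in the stats layout:)

    # Check cluster health and per-node JVM heap pressure over the
    # plain HTTP API; a node wedged against its max heap will show
    # heap_used_percent pinned near 100.
    curl -s 'http://localhost:9200/_cluster/health?pretty'
    curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' | grep -A2 heap_used_percent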
[22:40:19] * twkozlowski is skeptical [22:40:57] James_F: if you translate stuff, the message doesn't say it's in the VE section :) [22:41:03] s/if/when [22:41:30] twkozlowski: it should, add to qqq if relevant for understanding [22:45:44] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC [22:45:58] (03PS2) 10Chad: Remove old public key [operations/puppet] - 10https://gerrit.wikimedia.org/r/113326 [22:53:16] !log kaldari synchronized php-1.23wmf13/extensions/VectorBeta/ 'sync update for VectorBeta on wmf13' [22:53:26] Logged the message, Master [22:57:34] PROBLEM - Disk space on labstore4 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error [23:00:57] hoi Reasonator gives me Bad Request [23:00:58] Your browser sent a request that this server could not understand. [23:01:20] GerardM-: no known issues currently... [23:01:37] are you sure you're sending a valid request? :) [23:01:42] yes [23:01:47] Magnus has the same issue [23:01:49] try http://tools.wmflabs.org/ [23:02:01] try reasonator.info [23:02:07] ( I should say, no known production cluster issues right now) [23:02:11] oh [23:02:15] huh, that's nice [23:02:17] Coren: ^^ [23:02:32] cmjohnson1: good timing, anything stupid in tampa right now that would affect wmflabs? [23:03:37] greg-g: there was a fpl wave down earlier [23:03:38] ori, who would be the right person to tweak the disk space alert for parsoid hosts? [23:03:47] cmjohnson1: that's over though, right? [23:04:19] gwicke: whomever is on RT duty [23:04:19] s/over/fixed/ [23:04:39] ori, ok; apergos: ping [23:04:44] * cmjohnson1 checking [23:05:12] * apergos points out that at 1 am I'm not tweaking anything... (sorry) [23:05:16] hi [23:05:23] apergos: you shoulda stayed silent :) [23:05:33] the labs thing appears to have fs issues, something you can help with? [23:05:33] hehe [23:05:35] I shoulda but I was pinged and happened to pass by [23:05:44] Sveta: yeah, looking into it [23:05:50] someone in a real tz should have a look at it [23:06:23] yeah, just hard to pinpoint somebody [23:06:37] greg-g, Coren might be also looking into it [23:06:42] also gwicke as you know the logs are rotated once an hour now (at least last I checked they seemed to be doing that) [23:06:45] * gwicke tries to open an rt ticket [23:06:55] Sveta: yep, thanks much [23:07:00] apergos, I know, thanks for setting that up! [23:07:01] nod [23:07:27] it's only a stopgap. but it would take some serious log explosion for parsoid to fill up 6gb in an hour [23:07:40] apergos, just thought that it makes sense to lower the disk space threshold to something that gives us enough time to react [23:07:47] agree [23:08:33] greg-g: I don't see anything that says it's been fixed but if it were a problem for labs it would have been a problem all day. it's the ashburn to tampa link. I doubt very much that's it [23:08:35] cmjohnson1: btw, you're off the hook [23:08:38] heh, thanks much [23:08:47] greg-g thx!
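(The hourly rotation apergos mentions is the stopgap from the earlier Parsoid incident, where logs filled the disk. A minimal sketch of what such a setup could look like; the log path, rotation count and size cap are guesses, and since stock logrotate only runs daily from cron, the hourly cron entry is the part doing the real work:)

    #!/bin/bash
    # Rotate Parsoid logs aggressively and run logrotate every hour.
    # /var/log/parsoid/ is an assumed location for this sketch.
    cat <<'EOF' > /etc/logrotate.d/parsoid
    /var/log/parsoid/*.log {
        rotate 24
        size 500M
        missingok
        compress
        copytruncate
    }
    EOF
    cat <<'EOF' > /etc/cron.hourly/parsoid-logrotate
    #!/bin/sh
    /usr/sbin/logrotate /etc/logrotate.d/parsoid
    EOF
    chmod +x /etc/cron.hourly/parsoid-logrotate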
[23:10:54] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [23:16:34] RECOVERY - Disk space on labstore4 is OK: DISK OK [23:16:34] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms [23:29:17] apergos, https://rt.wikimedia.org/Ticket/Display.html?id=6851&results=bb1c6fbf4f72ec9a7649095081b3bfc2 [23:31:54] !log tools Rebooted labstore4 -- XFS done got broken agun [23:32:01] Logged the message, Master [23:32:01] thanks Coren [23:34:01] gwicke: https://github.com/wikimedia/operations-puppet/blob/production/modules/base/manifests/monitoring/host.pp#L68 [23:34:15] that won't be easy to customize [23:34:55] ori, k [23:35:26] we likely also have that data in ganglia, but then there is no way to define alerts on it [23:49:12] gwicke: on https://wikitech.wikimedia.org/wiki/Incident_documentation/20140211-Parsoid#Summary, is "I" you? [23:49:59] legoktm: no, that's Roan [23:50:01] let me fix [23:50:21] thanks [23:51:44] done [23:52:29] ty [23:54:56] (03PS1) 10BBlack: Move traffic back to ulsfo, reverts fa5fa2e8 [operations/dns] - 10https://gerrit.wikimedia.org/r/113479 [23:54:58] (03CR) 10BBlack: [C: 032 V: 032] Move traffic back to ulsfo, reverts fa5fa2e8 [operations/dns] - 10https://gerrit.wikimedia.org/r/113479 (owner: 10BBlack) [23:55:13] (03CR) 10TTO: "I am not sure, since you didn't really explain what was broken, but I suspect this is bug 61357?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113377 (owner: 10Legoktm) [23:59:04] !log aaron synchronized php-1.23wmf13/extensions/Math '9e75a1b' [23:59:11] Logged the message, Master [23:59:14] RECOVERY - Varnish HTTP text-backend on cp4018 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.152 second response time
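(On the threshold question in that RT ticket: the "Disk space" alerts seen throughout this log ultimately come down to the standard Nagios check_disk plugin, configured once for all hosts in the base monitoring puppet class linked above, which is why a per-role override "won't be easy to customize". The underlying check with a more conservative threshold would look roughly like this; the percentages are illustrative:)

    # Warn when less than 20% of the disk is free and go critical
    # below 10%, on local filesystems only, to buy reaction time
    # before a log explosion fills the disk.
    /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -l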