[00:00:07] just a couple of minor things I put in comments in gerrit
[00:00:35] the masterPos parameter doesn't seem to be set anywhere
[00:01:02] if it's set on all jobs, it would break duplicate removal, so don't do that
[00:01:31] it's set in refreshLinksJob2, which spawns the other jobs
[00:01:52] I didn't see code that directly inserts the later
[00:02:36] oh yeah, so it is
[00:03:55] and I complained about some code being too compact and clever
[00:04:10] you know I like stupid things ;)
[00:05:20] TimStarling: I actually thought it was dumb and not clever...there is the overhead of setting the var each time
[00:05:47] if ( $first ) {
[00:05:54] $start = 0;
[00:05:56] not that the overhead matters in PHP, but in principle :)
[00:05:57] $first = false;
[00:05:59] } else {
[00:06:07] $start = $title->getArticleID();
[00:06:08] }
[00:08:11] that would be a simple way to do it, maybe not quite simple enough for Evelyn to understand, but I'm sure she'll be up to that level soon ;)
[00:09:37] ok maybe not that exact code but you get the idea
[00:10:59] you know what was the main VM hotspot when I was benchmarking PPFrame_DOM::expand() last week?
[00:11:05] New patchset: Asher; "adding dhcp stuffs for db10(49-50)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17922
[00:11:19] converting things that were not boolean to boolean for use in if()
[00:11:31] heh
[00:11:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17922
[00:11:48] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17922
[00:12:01] it needs a switch() statement for about 8 different types, which is translated to a binary tree of conditional jumps
[00:12:22] so it was a big hotspot for branch misprediction
[00:13:15] nuno thought it would be hashtable lookups to fetch local variables
[00:13:34] but actually there is no local symbol hashtable, it's lazy-initialised
[00:13:45] local variables are converted to static array indexes at compile time
[00:15:09] yeah
[00:15:11] TimStarling: I think that job insert code has an old bug btw
[00:15:43] nope, just something I introduced
[00:16:01] it doesn't deal with the last 1-9 items ;)
[00:16:18] at least it works in spirit
[00:16:39] it's the thought that counts
[00:27:18] New patchset: Asher; "expand range of dbs in netboot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17923
[00:27:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17923
[00:30:02] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17923
[00:38:04] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[00:46:53] TimStarling: you know the job duplicate checking is broken right?
[00:47:13] hmm, maybe I had heard something about that
[00:47:32] I know that it doesn't work with refreshLinks2, so it's mostly useless anyway
[00:47:49] it uses insertFields() for the duplicate check, which includes ('job_timestamp' => $dbw->timestamp())
[00:48:18] so duplicate refresh jobs with any other timestamp won't be seen
[00:48:23] it does at least have unset( $fields['job_id'] ); ;)
[00:48:39] maybe an unset for timestamp could be tossed in removeDuplicates()
[00:49:13] of course with masterPos it would still be useless
[00:49:40] right, but job_timestamp is newer than refreshLinks2
[00:50:04] it was really useful when it was just refreshLinks, but there were other problems with that scheme of course
[00:53:05] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[00:53:44] !log planet (en and fr) configs Updated to revision 115645.
[00:53:53] Logged the message, Master
[01:06:09] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[01:15:58] mutante: Were you working on wikitech? It's still broken
[01:16:02] from within function "User::saveSettings". Database returned error "1054: Unknown column 'user_options' in 'field list' (localhost)".
[01:16:04] When logging in
[01:16:44] Oh
[01:16:49] Does 1.17 still use that old crap?
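(Editor's note on the job queue points raised at 00:15-00:16 and 00:47-00:49. What follows is a rough sketch in Python, not the MediaWiki code: every function name here is hypothetical, the field names are borrowed from the log, and the batch size of 10 is only inferred from the "last 1-9 items" remark. It illustrates two ideas: comparing jobs on a key that leaves out job_id and job_timestamp, and flushing the final partial batch after a batched insert loop.)

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Fields that differ on every row and so should not take part in the
# duplicate comparison. Names are taken from the log; the list is an assumption.
VOLATILE_FIELDS = ("job_id", "job_timestamp")


def dedup_key(fields: Dict[str, Any]) -> Tuple[Tuple[str, Any], ...]:
    """Build a hashable comparison key that ignores the volatile fields."""
    return tuple(sorted((k, v) for k, v in fields.items() if k not in VOLATILE_FIELDS))


def remove_duplicates(jobs: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first job seen for each comparison key, preserving order."""
    seen = set()
    unique = []
    for job in jobs:
        key = dedup_key(job)
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique


def insert_in_batches(rows: Iterable[Any],
                      insert: Callable[[List[Any]], None],
                      batch_size: int = 10) -> int:
    """Call insert() once per full batch, then once more for any remainder.

    Returns the number of insert() calls made.
    """
    batch: List[Any] = []
    calls = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            insert(batch)
            calls += 1
            batch = []
    if batch:
        # Without this final flush the trailing partial batch is dropped.
        insert(batch)
        calls += 1
    return calls
```

(As noted at 00:49:13, a parameter that differs on every job, such as masterPos, would still make every key unique, so dropping the timestamp alone would not be enough in that case.)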
[01:17:08] use/check [01:18:04] apparently so [01:19:38] That's not as broken as it was anyway ;) [01:22:11] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [01:40:48] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 206 seconds [01:41:59] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 242 seconds [01:49:02] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 665s [01:53:41] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 14 seconds [01:54:53] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 31s [01:56:23] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 4 seconds [02:51:09] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [02:51:09] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [03:08:42] RECOVERY - Puppet freshness on bayes is OK: puppet ran at Tue Aug 7 03:08:15 UTC 2012 [03:12:00] RECOVERY - Puppet freshness on niobium is OK: puppet ran at Tue Aug 7 03:11:47 UTC 2012 [03:14:15] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:14:24] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:14:42] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:15:09] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:15:17] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:15:26] RECOVERY - Puppet freshness on srv190 is OK: puppet ran at Tue Aug 7 03:15:09 UTC 2012 [03:15:35] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.279 second response time [03:15:35] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 
301 Moved Permanently - 2.852 second response time [03:15:54] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.954 second response time [03:16:30] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.314 second response time [03:18:18] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.181 second response time [03:22:03] RECOVERY - Puppet freshness on srv238 is OK: puppet ran at Tue Aug 7 03:21:45 UTC 2012 [03:23:32] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:24:54] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.241 second response time [03:25:29] RECOVERY - Puppet freshness on mw27 is OK: puppet ran at Tue Aug 7 03:25:20 UTC 2012 [03:27:44] RECOVERY - Puppet freshness on srv242 is OK: puppet ran at Tue Aug 7 03:27:28 UTC 2012 [03:57:15] morning [04:02:22] hey paravoid [04:03:17] paravoid: do you think you could generate the graphs for the c_rehash issue? or maybe we just implement my fix first and then c_rehash later? [04:03:35] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 54%, RTA = 3577.24 ms [04:03:39] (i mean puppet dependancy graph) [04:05:32] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 128.81 ms [04:39:58] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16501 [04:50:38] New patchset: Faidon; "Fix a couple of jobrunner puppet errors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17927 [04:51:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17927 [04:51:39] hey Ryan_Lane... have a couple mins? 
[04:51:48] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17927 [04:51:54] or any wiki sysop for that matter [04:51:59] err, wrong channel [04:52:00] (labs) [04:56:36] err, i think never mind [04:56:48] * jeremyb seems to have figured out the right ldap incantation ;) [05:04:15] jeremyb: what's up? [05:04:28] Ryan_Lane: i think maybe i've got it handled... see #-labs [06:34:50] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 4211.40 ms [06:36:20] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 133.67 ms [06:50:36] New patchset: Faidon; "(RT 3325) olivneh restricted => mortals" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17040 [06:51:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17040 [06:51:41] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17040 [07:05:17] PROBLEM - Apache HTTP on srv272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:05:17] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.22:11000 (Connection timed out) [07:06:30] hrmmm [07:06:33] paravoid: ^^ [07:06:48] argh [07:07:00] it's in rotation too. (not a spare) [07:07:41] PROBLEM - Memcached on srv272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:09:16] (thanks) [07:09:31] it's swapping to death [07:09:38] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.546 second response time [07:09:53] recovery? 8.5s? srsly? :) [07:09:59] haha [07:10:32] RECOVERY - Memcached on srv272 is OK: TCP OK - 0.008 second response time on port 11000 [07:11:30] !log powercycled srv272, swapdeath [07:11:39] Logged the message, Master [07:13:59] what is sockpuppet used for besides ca signing? 
[07:14:12] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [07:14:26] nothing, it's a transition left in the middle iirc [07:14:33] huh [07:14:42] that's why i never see it flapping ;) [07:17:50] yep that would be it [07:18:06] PROBLEM - Apache HTTP on srv272 is CRITICAL: Connection refused [07:18:27] hmm spec tests? (looking at puppet module docs) [07:19:09] apergos: related to https://en.wikipedia.org/wiki/RSpec i guess? [07:19:36] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [07:20:24] yes, I think so [07:20:49] that's going to have to be pretty far down the priority queue but it looks interesting [07:21:05] is it some kind of criterion for the forge? [07:22:10] http://puppetlabs.com/blog/test-driven-development-with-puppet/ [07:22:18] unit testing [07:22:32] might save us some heartache down the road [07:22:57] yeah, we've talked about it a bit [07:23:08] hashar even did some proof-testing [07:23:16] proof of concept I mean [07:26:38] sweet [07:28:40] New patchset: Nikerabbit; "Initial version of solr for ttmserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16732 [07:29:18] New review: Nikerabbit; "Did whitespace cleanup." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16732 [07:29:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16732 [07:58:02] * jeremyb wonders if RT 2361 has anything interesting [07:58:19] (digging about why a 400 response == OK) [07:59:45] !g 441ca9342ce60fa5182d13561faea333edc4b0b3 | aude [07:59:45] aude: https://gerrit.wikimedia.org/r/#q,441ca9342ce60fa5182d13561faea333edc4b0b3,n,z [08:03:30] jeremyb: looks like a hack to me but interesting..... 
[08:51:50] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours [09:22:52] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [09:35:55] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [09:42:13] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 3638.43 ms [09:43:16] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 117.41 ms [09:58:52] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:38:56] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [10:53:56] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [11:08:19] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [11:23:37] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [12:23:54] notpeter_: let's merge those modules [12:38:14] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [12:52:11] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [12:52:11] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [13:15:08] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:16:29] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.831 second response time [13:16:56] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:18:17] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [13:49:29] RECOVERY - MySQL disk space on storage3 is OK: DISK OK [13:55:45] time to kick wikibugs i hear? [13:56:53] (irc bot @ mchenry) [13:58:56] wikibugs is working.. 
[14:12:52] errr, yeah [14:13:00] i guess i wasn't looking too closely [14:13:04] 07 13:36:46 * Nikerabbit kicks wikibugs [14:13:04] 07 13:36:53 < Nikerabbit> it's dead [14:13:16] but then it did speak soon after that [14:35:39] !log starting otrs dump on db49 [14:35:49] Logged the message, Master [15:07:26] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:43] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:43] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:43] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:43] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:02] PROBLEM - Apache HTTP on mw4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:02] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:02] PROBLEM - Apache HTTP on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:02] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:02] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:17] yeah, WP is slowwwwwww [15:08:20] PROBLEM - Apache HTTP on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:28] PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:28] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:38] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:38] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:46] PROBLEM - Apache HTTP on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:46] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:46] PROBLEM - 
Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:46] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:47] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:47] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:47] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:48] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:48] PROBLEM - Apache HTTP on mw56 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:49] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:51] Hope it's not the memcache issue again.... [15:08:56] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:56] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:56] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:56] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:56] PROBLEM - Apache HTTP on mw9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:57] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:57] PROBLEM - Apache HTTP on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:58] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:58] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:59] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:04] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:05] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [15:09:05] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:05] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:14] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:14] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:14] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:14] PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:14] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:15] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:15] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:23] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:23] PROBLEM - Apache HTTP on mw16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:23] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:40] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:40] PROBLEM - Apache HTTP on mw10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:50] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:50] PROBLEM - Apache HTTP on mw6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:58] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:08] PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:26] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39496 bytes in 6.851 seconds [15:10:35] RECOVERY - LVS HTTPS IPv4 on 
wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49605 bytes in 2.667 seconds [15:10:44] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:11:46] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:11:46] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:11:56] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:11:56] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39682 bytes in 3.699 seconds [15:12:05] PROBLEM - Apache HTTP on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:12:05] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:12:05] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60841 bytes in 3.068 seconds [15:12:31] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:12:41] PROBLEM - Apache HTTP on srv288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:12:50] PROBLEM - Apache HTTP on srv264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:13:25] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable [15:14:01] RECOVERY - Apache HTTP on srv288 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.860 second response time [15:14:28] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable [15:14:37] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable [15:14:46] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:13] PROBLEM 
- Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:31] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:40] RECOVERY - Apache HTTP on srv264 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.221 second response time [15:15:59] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:59] PROBLEM - Apache HTTP on srv280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:06] issue with secure? [15:16:06] 208.80.152.75 via sq71.wikimedia.org (squid/2.7.STABLE9) to () [15:16:06] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Tue, 07 Aug 2012 15:15:48 GMT [15:16:16] PROBLEM - Apache HTTP on srv282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:17] PROBLEM - Apache HTTP on srv289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:34] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:34] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:34] RECOVERY - Apache HTTP on srv229 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.187 second response time [15:16:44] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:52] PROBLEM - Apache HTTP on srv239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:52] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:17:13] Seemed to be issues with apache, but some lvs boxes are out as well from the look of it. 
Hope it's not network again =/ [15:17:38] PROBLEM - Apache HTTP on srv210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:17:38] RECOVERY - Apache HTTP on srv282 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.108 second response time [15:17:46] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 61850 bytes in 9.300 seconds [15:18:40] ganglia is super slow [15:18:43] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.662 second response time [15:18:44] PROBLEM - Apache HTTP on srv228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:19:01] PROBLEM - Apache HTTP on srv209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:19:37] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:19:43] New patchset: Alex Monk; "(bug 19569) Add Portal namespace to urwiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16432 [15:19:46] RECOVERY - Apache HTTP on srv289 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.058 second response time [15:20:04] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.809 second response time [15:20:04] RECOVERY - Apache HTTP on srv239 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.247 second response time [15:20:13] PROBLEM - Apache HTTP on srv246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:13] PROBLEM - Apache HTTP on srv276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:13] PROBLEM - Apache HTTP on srv259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:13] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60649 bytes in 5.571 seconds [15:20:23] RECOVERY - Apache HTTP on srv209 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.179 second response time [15:20:31] PROBLEM - Apache HTTP on srv205 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [15:20:31] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49412 bytes in 2.713 seconds [15:21:34] RECOVERY - Apache HTTP on srv246 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.650 second response time [15:21:44] RECOVERY - Apache HTTP on srv259 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.685 second response time [15:21:44] PROBLEM - Apache HTTP on srv225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:21:44] PROBLEM - Apache HTTP on srv244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:21:52] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:01] RECOVERY - Apache HTTP on srv205 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.501 second response time [15:22:26] hi [15:22:32] what's happenig? [15:22:32] hey [15:22:35] woosters, hi [15:22:35] just got page [15:22:37] mark? [15:23:04] RECOVERY - Apache HTTP on srv228 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.298 second response time [15:23:13] PROBLEM - Apache HTTP on srv269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:13] PROBLEM - Apache HTTP on srv274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:49] RECOVERY - Apache HTTP on srv210 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.879 second response time [15:23:52] Quite a drop in cluster app server load [15:24:25] just came on [15:24:25] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39689 bytes in 6.464 seconds [15:24:34] RECOVERY - Apache HTTP on srv244 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.148 second response time [15:24:44] PROBLEM - Apache HTTP on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:24:44] PROBLEM - Apache HTTP on srv233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:24:52] PROBLEM - check_job_queue on neon 
is CRITICAL: (Service Check Timed Out) [15:25:01] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49602 bytes in 3.666 seconds [15:25:10] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:28] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39489 bytes in 5.197 seconds [15:25:55] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39496 bytes in 2.224 seconds [15:26:03] meltdown in progress:/ [15:26:04] RECOVERY - Apache HTTP on srv276 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.183 second response time [15:26:04] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.113 second response time [15:26:04] RECOVERY - Apache HTTP on srv269 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.167 second response time [15:26:14] PROBLEM - Apache HTTP on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:26:22] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:27:07] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39682 bytes in 8.533 seconds [15:27:34] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.245 second response time [15:27:34] RECOVERY - Apache HTTP on srv274 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.806 second response time [15:27:43] RECOVERY - Apache HTTP on srv260 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.472 second response time [15:28:00] cr2-pmtpa has 90-100ms on the local LAN [15:28:06] from hosts to their default gw [15:28:54] things seem to be recovering, though [15:28:58] can now hit random page [15:29:04] RECOVERY - Apache HTTP on srv225 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.265 second response time 
[15:29:13] PROBLEM - Apache HTTP on srv275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:13] PROBLEM - Apache HTTP on srv240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:18] arg [15:29:19] no [15:29:22] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 61850 bytes in 9.756 seconds [15:29:40] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49412 bytes in 8.579 seconds [15:29:40] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:34] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:43] PROBLEM - Apache HTTP on srv258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:43] PROBLEM - Apache HTTP on srv231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:44] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 49598 bytes in 6.207 seconds [15:32:04] RECOVERY - Apache HTTP on srv275 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.468 second response time [15:32:04] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 49405 bytes in 2.684 seconds [15:32:04] RECOVERY - Apache HTTP on srv240 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.426 second response time [15:32:32] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49602 bytes in 6.290 seconds [15:32:45] oh yeah, seeing 90-100ms between pc1 perhaps all of the apaches [15:32:58] PROBLEM - Apache HTTP on srv200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:58] PROBLEM - Apache HTTP on srv212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:33:06] paravoid: is it just cr2? 
[15:33:13] don't know yet [15:33:34] RECOVERY - Apache HTTP on srv258 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.278 second response time [15:33:34] RECOVERY - Apache HTTP on srv231 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.686 second response time [15:33:41] fenari seems to have 200ms lag to everything [15:33:43] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:33:43] PROBLEM - Apache HTTP on srv264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:33:43] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:34:19] RECOVERY - Apache HTTP on srv212 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.723 second response time [15:34:28] PROBLEM - Apache HTTP on srv204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:34:39] 220ms between eqiad hosts and 10.2.1.1 [15:34:55] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39496 bytes in 2.308 seconds [15:35:05] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:35:05] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.162 second response time [15:35:13] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.699 second response time [15:35:13] PROBLEM - Apache HTTP on srv280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:35:22] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:35:49] RECOVERY - Apache HTTP on srv200 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.526 second response time [15:35:49] RECOVERY - Apache HTTP on srv204 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.433 second response time [15:35:51] it's either a cr or a switch that's borked [15:35:58] PROBLEM - Apache HTTP on srv195 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [15:36:01] but I don't know the topology near enough to know [15:36:25] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39686 bytes in 6.371 seconds [15:36:34] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.537 second response time [15:37:10] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:37:20] RECOVERY - Apache HTTP on srv195 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.886 second response time [15:38:13] RECOVERY - Apache HTTP on srv264 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.266 second response time [15:38:13] PROBLEM - Apache HTTP on srv230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:38:55] New patchset: Alex Monk; "(bug 34135#c4) Let admins change FlaggedRevs stable settings on cawikinews." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17918 [15:39:07] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:39:45] hey paravoid [15:39:52] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 49595 bytes in 8.823 seconds [15:39:53] i'm trying switching over to cr1 mastership [15:39:53] oh hi LeslieCarr [15:40:05] don't [15:40:09] cr2 is vrrp master for everything [15:40:10] oh ? [15:40:28] I'm pinging cr1-sdtpa from bast1001 [15:40:28] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39489 bytes in 3.448 seconds [15:40:37] and is also lagging [15:40:40] does it pass through cr2? [15:40:59] not on that side... 
[15:41:04] RECOVERY - Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.479 second response time [15:41:13] PROBLEM - Apache HTTP on srv243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:41:43] 2 ae0.cr1-eqiad.wikimedia.org (208.80.154.193) 0.170 ms 0.147 ms 0.147 ms [15:41:43] 3 xe-0-0-1.cr1-sdtpa.wikimedia.org (208.80.154.210) 124.371 ms 124.371 ms 124.365 ms [15:41:58] PROBLEM - Apache HTTP on srv211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:41:59] PROBLEM - Apache HTTP on srv195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:42:25]