[00:00:07] just a couple of minor things I put in comments in gerrit
[00:00:35] the masterPos parameter doesn't seem to be set anywhere
[00:01:02] if it's set on all jobs, it would break duplicate removal, so don't do that
[00:01:31] it's set in refreshLinksJob2, which spawns the other jobs
[00:01:52] I didn't see code that directly inserts the later
[00:02:36] oh yeah, so it is
[00:03:55] and I complained about some code being too compact and clever
[00:04:10] you know I like stupid things ;)
[00:05:20] TimStarling: I actually thought it was dumb and not clever...there is the overhead of setting the var each time
[00:05:47] if ( $first ) {
[00:05:54] $start = 0;
[00:05:56] not that the overhead matters in PHP, but in principle :)
[00:05:57] $first = false;
[00:05:59] } else {
[00:06:07] $start = $title->getArticleID();
[00:06:08] }
[00:08:11] that would be a simple way to do it, maybe not quite simple enough for Evelyn to understand, but I'm sure she'll be up to that level soon ;)
[00:09:37] ok maybe not that exact code but you get the idea
[00:10:59] you know what was the main VM hotspot when I was benchmarking PPFrame_DOM::expand() last week?
[00:11:05] New patchset: Asher; "adding dhcp stuffs for db10(49-50)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17922
[00:11:19] converting things that were not boolean to boolean for use in if()
[00:11:31] heh
[00:11:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17922
[00:11:48] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17922
[00:12:01] it needs a switch() statement for about 8 different types, which is translated to a binary tree of conditional jumps
[00:12:22] so it was a big hotspot for branch misprediction
[00:13:15] nuno thought it would be hashtable lookups to fetch local variables
[00:13:34] but actually there is no local symbol hashtable, it's lazy-initialised
[00:13:45] local variables are converted to static array indexes at compile time
[00:15:09] yeah
[00:15:11] TimStarling: I think that job insert code has an old bug btw
[00:15:43] nope, just something I introduced
[00:16:01] it doesn't deal with the last 1-9 items ;)
[00:16:18] at least it works in spirit
[00:16:39] it's the thought that counts
[00:27:18] New patchset: Asher; "expand range of dbs in netboot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17923
[00:27:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17923
[00:30:02] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17923
[00:38:04] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[00:46:53] TimStarling: you know the job duplicate checking is broken right?
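(Note on the 00:16:01 remark that the job insert code "doesn't deal with the last 1-9 items": a minimal sketch of that class of bug and its usual fix, written in Python for illustration rather than in MediaWiki's PHP. The batch size of 10 is only inferred from the "1-9" wording, and `insert_in_batches` and `flush` are hypothetical names, not functions from the code under review.)

```python
from typing import Callable, Iterable, List


def insert_in_batches(
    rows: Iterable[dict],
    flush: Callable[[List[dict]], None],
    batch_size: int = 10,  # assumed; inferred from "the last 1-9 items"
) -> int:
    """Send rows to `flush` in batches and return how many rows were sent."""
    batch: List[dict] = []
    sent = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            flush(batch)
            sent += len(batch)
            batch = []  # start a fresh list; the flushed one is left untouched
    # The step that is easy to forget: a partial batch left over after the
    # loop still has to be flushed, or the trailing rows are silently lost.
    if batch:
        flush(batch)
        sent += len(batch)
    return sent
```

With 23 rows and a batch size of 10, the final flush is what delivers the trailing 3 rows.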
[00:47:13] hmm, maybe I had heard something about that
[00:47:32] I know that it doesn't work with refreshLinks2, so it's mostly useless anyway
[00:47:49] it uses insertFields() for the duplicate check, which includes ('job_timestamp' => $dbw->timestamp())
[00:48:18] so duplicate refresh jobs with any other timestamp won't be seen
[00:48:23] it does at least have unset( $fields['job_id'] ); ;)
[00:48:39] maybe an unset for timestamp could be tossed in removeDuplicates()
[00:49:13] of course with masterPos it would still be useless
[00:49:40] right, but job_timestamp is newer than refreshLinks2
[00:50:04] it was really useful when it was just refreshLinks, but there were other problems with that scheme of course
[00:53:05] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[00:53:44] !log planet (en and fr) configs Updated to revision 115645.
[00:53:53] Logged the message, Master
[01:06:09] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[01:15:58] mutante: Were you working on wikitech? It's still broken
[01:16:02] from within function "User::saveSettings". Database returned error "1054: Unknown column 'user_options' in 'field list' (localhost)".
[01:16:04] When logging in
[01:16:44] Oh
[01:16:49] Does 1.17 still use that old crap?
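(Note on the duplicate check discussed at 00:47:49 to 00:48:39: a rough Python sketch of why including `job_timestamp` in the comparison hides duplicates, and of the suggested fix of dropping it alongside `job_id`. This is an illustration, not MediaWiki's removeDuplicates(); `dedupe_key`, `remove_duplicates`, and the sample rows are hypothetical. As noted at 00:49:13, a per-job parameter such as masterPos would still make otherwise identical jobs look different.)

```python
from typing import Dict, List, Tuple

# Fields that differ between otherwise identical jobs and so must be
# ignored when looking for duplicates (per the discussion above).
VOLATILE_FIELDS = ("job_id", "job_timestamp")


def dedupe_key(fields: Dict[str, object]) -> Tuple[Tuple[str, str], ...]:
    """Build a hashable key from every field except the volatile ones."""
    stable = {k: v for k, v in fields.items() if k not in VOLATILE_FIELDS}
    return tuple(sorted((k, repr(v)) for k, v in stable.items()))


def remove_duplicates(jobs: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep the first job for each distinct key, preserving order."""
    seen = set()
    unique = []
    for job in jobs:
        key = dedupe_key(job)
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique
```

If `job_timestamp` were left in the key, two refresh jobs for the same title queued a minute apart would both survive, which is the behaviour described in the log.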
[01:17:08] use/check
[01:18:04] apparently so
[01:19:38] That's not as broken as it was anyway ;)
[01:22:11] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[01:40:48] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 206 seconds
[01:41:59] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 242 seconds
[01:49:02] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 665s
[01:53:41] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 14 seconds
[01:54:53] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 31s
[01:56:23] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 4 seconds
[02:51:09] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[02:51:09] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[03:08:42] RECOVERY - Puppet freshness on bayes is OK: puppet ran at Tue Aug 7 03:08:15 UTC 2012
[03:12:00] RECOVERY - Puppet freshness on niobium is OK: puppet ran at Tue Aug 7 03:11:47 UTC 2012
[03:14:15] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:14:24] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:14:42] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:15:09] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:15:17] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:15:26] RECOVERY - Puppet freshness on srv190 is OK: puppet ran at Tue Aug 7 03:15:09 UTC 2012
[03:15:35] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.279 second response time
[03:15:35] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.852 second response time
[03:15:54] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.954 second response time
[03:16:30] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.314 second response time
[03:18:18] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.181 second response time
[03:22:03] RECOVERY - Puppet freshness on srv238 is OK: puppet ran at Tue Aug 7 03:21:45 UTC 2012
[03:23:32] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:24:54] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.241 second response time
[03:25:29] RECOVERY - Puppet freshness on mw27 is OK: puppet ran at Tue Aug 7 03:25:20 UTC 2012
[03:27:44] RECOVERY - Puppet freshness on srv242 is OK: puppet ran at Tue Aug 7 03:27:28 UTC 2012
[03:57:15] morning
[04:02:22] hey paravoid
[04:03:17] paravoid: do you think you could generate the graphs for the c_rehash issue? or maybe we just implement my fix first and then c_rehash later?
[04:03:35] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 54%, RTA = 3577.24 ms
[04:03:39] (i mean puppet dependancy graph)
[04:05:32] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 128.81 ms
[04:39:58] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16501
[04:50:38] New patchset: Faidon; "Fix a couple of jobrunner puppet errors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17927
[04:51:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17927
[04:51:39] hey Ryan_Lane... have a couple mins?
[04:51:48] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17927
[04:51:54] or any wiki sysop for that matter
[04:51:59] err, wrong channel
[04:52:00] (labs)
[04:56:36] err, i think never mind
[04:56:48] * jeremyb seems to have figured out the right ldap incantation ;)
[05:04:15] jeremyb: what's up?
[05:04:28] Ryan_Lane: i think maybe i've got it handled... see #-labs
[06:34:50] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 4211.40 ms
[06:36:20] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 133.67 ms
[06:50:36] New patchset: Faidon; "(RT 3325) olivneh restricted => mortals" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17040
[06:51:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17040
[06:51:41] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17040
[07:05:17] PROBLEM - Apache HTTP on srv272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:05:17] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.22:11000 (Connection timed out)
[07:06:30] hrmmm
[07:06:33] paravoid: ^^
[07:06:48] argh
[07:07:00] it's in rotation too. (not a spare)
[07:07:41] PROBLEM - Memcached on srv272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:09:16] (thanks)
[07:09:31] it's swapping to death
[07:09:38] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.546 second response time
[07:09:53] recovery? 8.5s? srsly? :)
[07:09:59] haha
[07:10:32] RECOVERY - Memcached on srv272 is OK: TCP OK - 0.008 second response time on port 11000
[07:11:30] !log powercycled srv272, swapdeath
[07:11:39] Logged the message, Master
[07:13:59] what is sockpuppet used for besides ca signing?
[07:14:12] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[07:14:26] nothing, it's a transition left in the middle iirc
[07:14:33] huh
[07:14:42] that's why i never see it flapping ;)
[07:17:50] yep that would be it
[07:18:06] PROBLEM - Apache HTTP on srv272 is CRITICAL: Connection refused
[07:18:27] hmm spec tests? (looking at puppet module docs)
[07:19:09] apergos: related to https://en.wikipedia.org/wiki/RSpec i guess?
[07:19:36] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time
[07:20:24] yes, I think so
[07:20:49] that's going to have to be pretty far down the priority queue but it looks interesting
[07:21:05] is it some kind of criterion for the forge?
[07:22:10] http://puppetlabs.com/blog/test-driven-development-with-puppet/
[07:22:18] unit testing
[07:22:32] might save us some heartache down the road
[07:22:57] yeah, we've talked about it a bit
[07:23:08] hashar even did some proof-testing
[07:23:16] proof of concept I mean
[07:26:38] sweet
[07:28:40] New patchset: Nikerabbit; "Initial version of solr for ttmserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16732
[07:29:18] New review: Nikerabbit; "Did whitespace cleanup." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16732
[07:29:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16732
[07:58:02] * jeremyb wonders if RT 2361 has anything interesting
[07:58:19] (digging about why a 400 response == OK)
[07:59:45] !g 441ca9342ce60fa5182d13561faea333edc4b0b3 | aude
[07:59:45] aude: https://gerrit.wikimedia.org/r/#q,441ca9342ce60fa5182d13561faea333edc4b0b3,n,z
[08:03:30] jeremyb: looks like a hack to me but interesting.....
[08:51:50] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[09:22:52] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours
[09:35:55] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[09:42:13] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 3638.43 ms
[09:43:16] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 117.41 ms
[09:58:52] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[10:38:56] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[10:53:56] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[11:08:19] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[11:23:37] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[12:23:54] notpeter_: let's merge those modules
[12:38:14] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[12:52:11] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[12:52:11] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[13:15:08] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:16:29] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.831 second response time
[13:16:56] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:18:17] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time
[13:49:29] RECOVERY - MySQL disk space on storage3 is OK: DISK OK
[13:55:45] time to kick wikibugs i hear?
[13:56:53] (irc bot @ mchenry)
[13:58:56] wikibugs is working..
[14:12:52] errr, yeah
[14:13:00] i guess i wasn't looking too closely
[14:13:04] 07 13:36:46 * Nikerabbit kicks wikibugs
[14:13:04] 07 13:36:53 < Nikerabbit> it's dead
[14:13:16] but then it did speak soon after that
[14:35:39] !log starting otrs dump on db49
[14:35:49] Logged the message, Master
[15:07:26] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:07:43] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:07:43] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:07:43] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:07:43] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:02] PROBLEM - Apache HTTP on mw4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:02] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:02] PROBLEM - Apache HTTP on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:02] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:02] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:17] yeah, WP is slowwwwwww
[15:08:20] PROBLEM - Apache HTTP on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:28] PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:28] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:38] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:38] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:46] PROBLEM - Apache HTTP on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:46] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:46] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:46] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:47] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:47] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:47] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:48] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:48] PROBLEM - Apache HTTP on mw56 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:49] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:51] Hope it's not the memcache issue again....
[15:08:56] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:56] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:56] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:56] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:56] PROBLEM - Apache HTTP on mw9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:57] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:57] PROBLEM - Apache HTTP on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:58] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:58] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:59] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:04] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:05] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:05] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:05] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:14] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:14] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:14] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:14] PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:14] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:15] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:15] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:23] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:23] PROBLEM - Apache HTTP on mw16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:23] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:40] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:40] PROBLEM - Apache HTTP on mw10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:50] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:50] PROBLEM - Apache HTTP on mw6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:58] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:08] PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:26] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39496 bytes in 6.851 seconds
[15:10:35] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49605 bytes in 2.667 seconds
[15:10:44] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:46] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:46] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:56] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:56] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39682 bytes in 3.699 seconds
[15:12:05] PROBLEM - Apache HTTP on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:12:05] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:12:05] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60841 bytes in 3.068 seconds
[15:12:31] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:12:41] PROBLEM - Apache HTTP on srv288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:12:50] PROBLEM - Apache HTTP on srv264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:13:25] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable
[15:14:01] RECOVERY - Apache HTTP on srv288 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.860 second response time
[15:14:28] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable
[15:14:37] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable
[15:14:46] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:15:13] PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:15:31] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:15:40] RECOVERY - Apache HTTP on srv264 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.221 second response time
[15:15:59] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:15:59] PROBLEM - Apache HTTP on srv280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:16:06] issue with secure?
[15:16:06] 208.80.152.75 via sq71.wikimedia.org (squid/2.7.STABLE9) to ()
[15:16:06] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Tue, 07 Aug 2012 15:15:48 GMT
[15:16:16] PROBLEM - Apache HTTP on srv282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:16:17] PROBLEM - Apache HTTP on srv289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:16:34] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:16:34] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:16:34] RECOVERY - Apache HTTP on srv229 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.187 second response time
[15:16:44] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:16:52] PROBLEM - Apache HTTP on srv239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:16:52] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:17:13] Seemed to be issues with apache, but some lvs boxes are out as well from the look of it. Hope it's not network again =/
[15:17:38] PROBLEM - Apache HTTP on srv210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:17:38] RECOVERY - Apache HTTP on srv282 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.108 second response time
[15:17:46] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 61850 bytes in 9.300 seconds
[15:18:40] ganglia is super slow
[15:18:43] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.662 second response time
[15:18:44] PROBLEM - Apache HTTP on srv228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:19:01] PROBLEM - Apache HTTP on srv209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:19:37] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:19:43] New patchset: Alex Monk; "(bug 19569) Add Portal namespace to urwiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16432
[15:19:46] RECOVERY - Apache HTTP on srv289 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.058 second response time
[15:20:04] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.809 second response time
[15:20:04] RECOVERY - Apache HTTP on srv239 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.247 second response time
[15:20:13] PROBLEM - Apache HTTP on srv246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:20:13] PROBLEM - Apache HTTP on srv276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:20:13] PROBLEM - Apache HTTP on srv259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:20:13] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60649 bytes in 5.571 seconds
[15:20:23] RECOVERY - Apache HTTP on srv209 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.179 second response time
[15:20:31] PROBLEM - Apache HTTP on srv205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:20:31] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49412 bytes in 2.713 seconds
[15:21:34] RECOVERY - Apache HTTP on srv246 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.650 second response time
[15:21:44] RECOVERY - Apache HTTP on srv259 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.685 second response time
[15:21:44] PROBLEM - Apache HTTP on srv225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:21:44] PROBLEM - Apache HTTP on srv244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:21:52] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:22:01] RECOVERY - Apache HTTP on srv205 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.501 second response time
[15:22:26] hi
[15:22:32] what's happenig?
[15:22:32] hey
[15:22:35] woosters, hi
[15:22:35] just got page
[15:22:37] mark?
[15:23:04] RECOVERY - Apache HTTP on srv228 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.298 second response time
[15:23:13] PROBLEM - Apache HTTP on srv269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:23:13] PROBLEM - Apache HTTP on srv274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:23:49] RECOVERY - Apache HTTP on srv210 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.879 second response time
[15:23:52] Quite a drop in cluster app server load
[15:24:25] just came on
[15:24:25] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39689 bytes in 6.464 seconds
[15:24:34] RECOVERY - Apache HTTP on srv244 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.148 second response time
[15:24:44] PROBLEM - Apache HTTP on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:24:44] PROBLEM - Apache HTTP on srv233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:24:52] PROBLEM - check_job_queue on neon is CRITICAL: (Service Check Timed Out)
[15:25:01] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49602 bytes in 3.666 seconds
[15:25:10] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:28] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39489 bytes in 5.197 seconds
[15:25:55] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39496 bytes in 2.224 seconds
[15:26:03] meltdown in progress:/
[15:26:04] RECOVERY - Apache HTTP on srv276 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.183 second response time
[15:26:04] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.113 second response time
[15:26:04] RECOVERY - Apache HTTP on srv269 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.167 second response time
[15:26:14] PROBLEM - Apache HTTP on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:22] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:07] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39682 bytes in 8.533 seconds
[15:27:34] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.245 second response time
[15:27:34] RECOVERY - Apache HTTP on srv274 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.806 second response time
[15:27:43] RECOVERY - Apache HTTP on srv260 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.472 second response time
[15:28:00] cr2-pmtpa has 90-100ms on the local LAN
[15:28:06] from hosts to their default gw
[15:28:54] things seem to be recovering, though
[15:28:58] can now hit random page
[15:29:04] RECOVERY - Apache HTTP on srv225 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.265 second response time
[15:29:13] PROBLEM - Apache HTTP on srv275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:29:13] PROBLEM - Apache HTTP on srv240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:29:18] arg
[15:29:19] no
[15:29:22] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 61850 bytes in 9.756 seconds
[15:29:40] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49412 bytes in 8.579 seconds
[15:29:40] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:30:34] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:30:43] PROBLEM - Apache HTTP on srv258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:30:43] PROBLEM - Apache HTTP on srv231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:30:44] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 49598 bytes in 6.207 seconds
[15:32:04] RECOVERY - Apache HTTP on srv275 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.468 second response time
[15:32:04] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 49405 bytes in 2.684 seconds
[15:32:04] RECOVERY - Apache HTTP on srv240 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.426 second response time
[15:32:32] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49602 bytes in 6.290 seconds
[15:32:45] oh yeah, seeing 90-100ms between pc1 perhaps all of the apaches
[15:32:58] PROBLEM - Apache HTTP on srv200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:32:58] PROBLEM - Apache HTTP on srv212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:33:06] paravoid: is it just cr2?
[15:33:13] don't know yet
[15:33:34] RECOVERY - Apache HTTP on srv258 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.278 second response time
[15:33:34] RECOVERY - Apache HTTP on srv231 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.686 second response time
[15:33:41] fenari seems to have 200ms lag to everything
[15:33:43] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:33:43] PROBLEM - Apache HTTP on srv264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:33:43] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:34:19] RECOVERY - Apache HTTP on srv212 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.723 second response time
[15:34:28] PROBLEM - Apache HTTP on srv204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:34:39] 220ms between eqiad hosts and 10.2.1.1
[15:34:55] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39496 bytes in 2.308 seconds
[15:35:05] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:35:05] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.162 second response time
[15:35:13] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.699 second response time
[15:35:13] PROBLEM - Apache HTTP on srv280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:35:22] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:35:49] RECOVERY - Apache HTTP on srv200 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.526 second response time
[15:35:49] RECOVERY - Apache HTTP on srv204 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.433 second response time
[15:35:51] it's either a cr or a switch that's borked
[15:35:58] PROBLEM - Apache HTTP on srv195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:36:01] but I don't know the topology near enough to know
[15:36:25] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39686 bytes in 6.371 seconds
[15:36:34] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.537 second response time
[15:37:10] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:37:20] RECOVERY - Apache HTTP on srv195 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.886 second response time
[15:38:13] RECOVERY - Apache HTTP on srv264 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.266 second response time
[15:38:13] PROBLEM - Apache HTTP on srv230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:38:55] New patchset: Alex Monk; "(bug 34135#c4) Let admins change FlaggedRevs stable settings on cawikinews." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17918
[15:39:07] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:39:45] hey paravoid
[15:39:52] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 49595 bytes in 8.823 seconds
[15:39:53] i'm trying switching over to cr1 mastership
[15:39:53] oh hi LeslieCarr
[15:40:05] don't
[15:40:09] cr2 is vrrp master for everything
[15:40:10] oh ?
[15:40:28] I'm pinging cr1-sdtpa from bast1001
[15:40:28] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39489 bytes in 3.448 seconds
[15:40:37] and is also lagging
[15:40:40] does it pass through cr2?
[15:40:59] not on that side...
[15:41:04] RECOVERY - Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.479 second response time [15:41:13] PROBLEM - Apache HTTP on srv243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:41:43] 2 ae0.cr1-eqiad.wikimedia.org (208.80.154.193) 0.170 ms 0.147 ms 0.147 ms [15:41:43] 3 xe-0-0-1.cr1-sdtpa.wikimedia.org (208.80.154.210) 124.371 ms 124.371 ms 124.365 ms [15:41:58] PROBLEM - Apache HTTP on srv211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:41:59] PROBLEM - Apache HTTP on srv195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:42:25] correct [15:42:25] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:42:34] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:42:34] RECOVERY - Apache HTTP on srv243 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.538 second response time [15:42:36] and the next hop is 224ms, doubles that [15:43:19] RECOVERY - Apache HTTP on srv195 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.834 second response time [15:43:26] hey paravoid - the crosslink between floors is basically full [15:43:39] okay, that would explain it [15:43:46] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60834 bytes in 2.592 seconds [15:43:47] I was trying to find my way around observium :/ [15:43:48] in 2 weeks (assuming the equipment has arrived) i'm headed to tampa to dwdm that link ... 
[15:43:57] it's good practice anyways ;) [15:44:04] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:13] PROBLEM - Apache HTTP on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:13] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:19] however it's hosted down there and being slow :) [15:44:40] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:49] RECOVERY - Apache HTTP on srv211 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.294 second response time [15:45:07] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:07] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:25] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39689 bytes in 3.896 seconds [15:45:30] !log Moving text squid traffic from pmtpa to eqiad [15:45:34] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.801 second response time [15:45:39] Logged the message, Master [15:45:48] LeslieCarr: if it's arrived, why don't we install it today? [15:45:50] hah, mark out of nowhere [15:45:58] don't think it's there yet... [15:46:01] i was just doing grocery shopping [15:46:10] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49605 bytes in 9.900 seconds [15:46:14] cmjohnson1: so, any packages arrived recently ? 
:) [15:46:28] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39682 bytes in 4.852 seconds [15:46:43] :( [15:47:13] PROBLEM - Apache HTTP on srv277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:14] PROBLEM - Apache HTTP on srv234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:14] PROBLEM - Apache HTTP on srv240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:14] PROBLEM - Apache HTTP on srv272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:22] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:31] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49410 bytes in 4.140 seconds [15:48:12] mark: moved them to get the pipe empty again? [15:48:16] PROBLEM - Apache HTTP on srv282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:48:25] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39496 bytes in 7.274 seconds [15:48:34] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.504 second response time [15:48:43] PROBLEM - Apache HTTP on srv275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:48:44] PROBLEM - Apache HTTP on srv238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:48:44] PROBLEM - Apache HTTP on srv270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:48:44] yeah [15:49:28] PROBLEM - Apache HTTP on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:49:37] RECOVERY - Apache HTTP on srv282 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.339 second response time [15:49:40] but why would it fill now... 
[15:49:51] because tampa hasn't served text in a few months [15:49:54] until yesterday [15:49:55] PROBLEM - Apache HTTP on srv289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:49:55] PROBLEM - BGP status on csw1-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.247, [15:50:04] RECOVERY - Apache HTTP on srv275 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.187 second response time [15:50:04] RECOVERY - Apache HTTP on srv238 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.192 second response time [15:50:04] RECOVERY - Apache HTTP on srv277 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.186 second response time [15:50:05] RECOVERY - Apache HTTP on srv270 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.188 second response time [15:50:05] RECOVERY - Apache HTTP on srv240 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.197 second response time [15:50:05] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.203 second response time [15:50:05] RECOVERY - Apache HTTP on srv234 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.564 second response time [15:50:40] it worked for almost 24h [15:50:49] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.142 second response time [15:50:58] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39489 bytes in 7.788 seconds [15:51:07] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:51:08] (and I'm still trying to find my way around observium) [15:51:56] my guess is that now is when people are waking up in the US and starting to pound on wikipedia [15:52:01] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 15426 bytes in 0.469 seconds [15:52:46] RECOVERY - Apache HTTP on srv289 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.148 second response time 
[15:52:55] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 15487 bytes in 0.532 seconds [15:52:58] latency from apaches to pc1 is around 60ms.. a lot better than it was, but still bad [15:53:04] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 15480 bytes in 0.296 seconds [15:53:04] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:53:13] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 61850 bytes in 1.298 seconds [15:53:31] RECOVERY - Apache HTTP on mw7 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.669 second response time [15:53:58] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39682 bytes in 1.439 seconds [15:54:07] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.707 second response time [15:54:07] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.901 second response time [15:54:07] RECOVERY - Apache HTTP on mw15 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.339 second response time [15:54:07] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.110 second response time [15:54:07] RECOVERY - Apache HTTP on mw10 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.530 second response time [15:54:08] RECOVERY - Apache HTTP on mw5 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.032 second response time [15:54:08] RECOVERY - Apache HTTP on mw3 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.973 second response time [15:54:09] RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.087 second response time [15:54:09] RECOVERY - Apache HTTP on mw4 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.971 second response time [15:54:10] RECOVERY - Apache HTTP on mw16 is OK: HTTP OK 
- HTTP/1.1 301 Moved Permanently - 8.072 second response time [15:54:16] RECOVERY - Apache HTTP on mw12 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.725 second response time [15:54:16] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.795 second response time [15:54:16] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.719 second response time [15:54:16] RECOVERY - Apache HTTP on mw2 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.302 second response time [15:54:17] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.615 second response time [15:54:25] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39496 bytes in 6.507 seconds [15:54:34] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.163 second response time [15:54:52] RECOVERY - Apache HTTP on mw1 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.091 second response time [15:54:52] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.098 second response time [15:54:52] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.104 second response time [15:54:52] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.622 second response time [15:54:52] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.986 second response time [15:54:53] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.120 second response time [15:54:53] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.929 second response time [15:54:54] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.440 second response time [15:54:54] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.475 second response time [15:55:00] getting 
better. [15:55:01] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [15:55:37] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [15:55:37] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [15:55:37] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time [15:55:37] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [15:55:37] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [15:56:57] https://observium.wikimedia.org/graphs/type=device_bits/id=62/from=1344182173/to=1344354974/ [15:57:02] gah, useless [15:57:10] https://observium.wikimedia.org/device/device=62/tab=port/port=3479/ [15:57:12] haha [15:57:16] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60837 bytes in 1.017 seconds [15:58:10] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.990 second response time [15:59:39] binasher: yes, I think there's consensus of moving away from pmtpa [15:59:43] :P [16:01:11] :D [16:02:28] down to sub-10ms latencies [16:03:05] and sdtpa [16:03:14] Though, I can't remember which is the "worse" one [16:04:13] paravoid: nuke it from orbit [16:08:18] mark: so, according to the graph we've been congested since we moved to tampa yesterday [16:19:20] heh [16:20:07] that explains a few things [16:23:12] ? [16:23:13] !log moving bits traffic from pmtpa to eqiad [16:23:22] Logged the message, Mistress of the network gear. [16:27:39] hey binasher, with the recent move of the gerrit db, to what server did the db move? 
[16:29:05] drdee db1048 is the master, db1046 slave [16:29:16] ty [16:33:50] New patchset: Pyoungmeister; "Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into apachemodules" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17959 [16:34:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17959 [16:35:16] Change abandoned: Pyoungmeister; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17959 [17:00:56] New patchset: Pyoungmeister; "mediawiki and application server modules." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17342 [17:01:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17342 [17:04:40] mark, notpeter: I don't understand why we do this ::cron , ::packages, ::init subclasses [17:04:55] I was looking at it today as well, because of the videoscaler/imagescaler stuff [17:05:40] or ::nice [17:05:50] esp. now that they are in separate classes it's just confusing [17:05:55] imho [17:06:07] er, classes in separate files I meant [17:06:27] doesn't style guide say one class per file? [17:07:02] anywho, I'm game for any number of comments/corrections/changes [17:07:14] I have not made these before. I'd like to learn to do it properly [17:07:45] New review: Faidon; "The jobrunner stuff have been changed since this change was originally made (there's a class instead..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/17342 [17:07:56] the one class per file is correct [17:08:05] it's not style guide, it's how the module autoloader works [17:08:26] what I'm saying is that I don't see the point for separate ::packages ::cron ::nice ::init [17:08:28] heh, yes. that's what I meant. 
"according to the thing I read" [17:08:35] which we already did, so it was nothing in your change [17:10:17] how would you structure it? [17:10:58] class imagescaler { shit here } [17:11:09] heh :) [17:11:27] hhhmmm, I mean, the fecal content of our manifests has been low.... [17:11:30] ;) [17:11:41] yeah, that's legit [17:11:47] we can just smoosh it all together [17:11:59] I like the more files/subclasses as it helps me find what I'm looking for [17:12:02] but so can comments.... [17:12:06] there might be a reason I'm missing [17:12:12] hence my ping above :) [17:13:03] I like more files, but we have 3-line classes install a package and having that in a separate file is just going to confuse us I think [17:14:00] well, I'm about to merge in, and then you can go to town improving. [17:14:06] mark is going to as well [17:14:13] but yeah [17:14:15] I see your point [17:14:24] I just gave you a -1 :) [17:14:27] for a different reason [17:14:36] sorry :-) [17:15:59] ok. I'm going to merge and fix in a subsequent commit. is that cool? (none of this code will be live when merged [17:16:02] ) [17:17:39] sure [17:18:08] why did you rebase? [17:18:17] we kinda lost the info that you forked off an earlier commit that way [17:18:31] otoh, gerrit is pretty bad at properly handling that... [17:19:05] what would have been the proper way? [17:19:20] conflict resolution is not one of my core compitencies ;) [17:19:50] did it actually had a conflict? 
[17:19:54] yes [17:20:03] oh [17:20:54] ah right, the jobrunner commits also touched role/applicationserver.pp [17:21:07] apache.pp [17:21:16] which will not actually be used [17:21:19] as naming was wrong [17:21:39] paravoid: back [17:21:50] yeah I actually hate the one class per file thing for that reason [17:22:00] since I do like to make subclasses to logically group stuff [17:22:16] otherwise people tend to put eeeeverything in one big class and make a big mess [17:22:23] which is hard to decouple later [17:22:34] but now that is frustrated by the one-class-per-file rule [17:22:37] yet another thing I hate [17:22:47] it should not be enforced rigidly [17:23:46] I like logically grouping stuff [17:23:57] but ::packages, ::cron ::init are not the groups I'd have made [17:24:01] no [17:24:06] not saying those are [17:24:13] okay [17:24:42] the one class per file is not that bad [17:24:43] and don't worry, not totally happy with the modules yet [17:24:51] it needs some getting used to [17:24:53] but I figured, let's just merge them and then work on it [17:25:13] it actually helps splitting things logically [17:25:25] because if you don't, you go through the pain of switching files back and forth :P [17:25:26] i'm sure it sometimes helps [17:25:38] it sometimes frustrates as well [17:25:46] and again, *I* want to be the judge of that ;) [17:25:50] enforcing quality through pain and everything :P [17:26:08] puppet is currently enforcing pain, not quality ;) [17:26:37] we want quality pain [17:26:40] I'm drinking all the puppet koolaid! [17:27:15] you're certainly its main defender these days ;) [17:27:30] I think I was before... well... ;) [17:28:07] ok, I'm gonna merge this up and then correct the jobrunner thing. [17:28:22] feel free to give me todos or just go to town changing [17:28:48] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17342 [17:28:54] i'm going to make food [17:32:16] andrew_wmf: eh?
what do you mean by newer Asterisk doesn't allow letters in the dialplan? [17:34:27] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 190 seconds [17:34:32] please see my updated reply paravoid [17:34:40] oh man. fuck the way that puppet deals with groups [17:34:43] this is bullshit [17:34:44] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 190 seconds [17:35:26] andrew_wmf: that's not what I asked [17:35:59] andrew_wmf: what do you mean "the newer version of Asterisk does not support having letters in the dial plan"? [17:36:12] it's a difference in 1.6 [17:36:25] paravoid: do you know when the pain started ? [17:37:15] andrew_wmf: I see nothing like that in the upgrade notes and I'd be very amazed if that's the case [17:37:31] since dialplan works with sip extensions, and sip is not just numbers [17:39:48] i'll get back to you on that shortly [17:41:22] andrew_wmf: Executing [faidon@from-isdn:1] Goto(... [17:41:27] works for me [17:41:53] New patchset: MaxSem; "Real WLM updater script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17964 [17:42:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17964 [17:45:59] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 17 seconds [17:46:27] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 9 seconds [17:48:28] New patchset: Pyoungmeister; "didn't mean to commit this. not nearly done." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17965 [17:48:30] andrew_wmf: also, kinda late for 1.6, isn't it? 11 is about to get released... [17:48:59] we are on 1.4 now [17:49:07] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17965 [17:49:27] I know, but if you're upgrading you might just as well upgrade to 1.8 or 10 [17:49:29] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17965 [17:49:31] this dial plan change will allow us to do 1.6 and add the web gui [17:49:42] what web gui? [17:49:43] 1.8 will be a month or two later [17:50:01] there is a web front end for freepbx we will be using for administration [17:50:18] anyway, Asterisk works fine with letters in the dialplan and hasn't changed anything either from 1.4->1.6 or any subsequent versions [17:52:14] I don't mind changing my config, but I just wanted you to know that the stated reasoning is wrong in case this saves you time dealing with the migration [17:52:41] there's a technical reason why it is so, i'll get back to you on it [17:58:23] New patchset: Pyoungmeister; "some cleanup for the mediawiki module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17966 [17:59:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17966 [17:59:26] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17966 [17:59:51] Change abandoned: Pyoungmeister; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16440 [18:00:06] Change abandoned: Pyoungmeister; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16426 [18:02:35] New patchset: Pyoungmeister; "switching to new jobrunner class in applicationserver role class." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17967 [18:03:18] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17967 [18:03:40] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17967 [18:05:21] New review: Dsc; "Never before have I seen a more functional, elegant, and timely patchset for such a critical problem..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/17465 [18:13:04] notpeter: there's "git rm" you know :) [18:13:53] which is a shortcut for "rm foo; git add foo", read "remove the file, then add the change to the index" [18:17:34] git rm --cached ftw [18:22:44] paravoid: is this re: my tweet? [18:22:57] yes [18:24:01] gotcha. that would have prevented my accidental checkin of all kinds of junk [18:24:05] wooo [18:24:05] live and learn [18:24:13] and do dumb things on pre-prod code :) [18:46:18] New review: Asher; "blind trust and execution of .sql fetched over http which could contain SYSTEM shell execution, etc...." [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/17964 [18:52:22] New patchset: Catrope; "Give James Alexander shell access on singer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17973 [18:52:36] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours [18:53:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17973 [19:05:47] Whoops [19:06:03] Just got a fatal error but didn't copy it [19:23:30] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [19:28:49] hey, when you guys update tickets on rt, do you allow it to simply BCC everyone in ops, as it is configured to do by default? [19:28:58] i feel pretty spammy every time i do that. [19:30:50] i have my rt stuff filtered away [19:30:55] into its own folder [19:31:13] LeslieCarr: "Trash"? :P [19:31:21] i kid! i kid! [19:31:49] hehehe [19:34:48] Trash? 
Straight to /dev/null [19:34:51] duhh ;) [19:36:33] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [19:59:30] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [20:39:32] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [20:51:05] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: host 91.198.174.244, sessions up: 2, down: 2, shutdown: 0BRPeering with AS64600 not established - BRPeering with AS64600 not established - BR [20:51:14] RECOVERY - BGP status on csw1-esams is OK: OK: host 91.198.174.247, sessions up: 5, down: 0, shutdown: 0 [20:52:35] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0 [20:53:00] mark: LeslieCarr: something to worry about ? [20:53:06] looking [20:53:11] that's lvs's [20:53:31] i never have a clue what those messages mean. kinda like power phases ;) [20:53:52] uhoh [20:54:17] i don't like it when you say that... [20:54:25] looks like some license expiration [20:54:32] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [20:54:38] say what? [20:54:38] also, we're sending full tables to a switch it looks like [20:54:40] DFSG > * [20:55:07] it's a stupid little switch [20:55:14] however, i don't see this supposed huge set of routes [20:55:22] Totally need switches handling layer3 stuff when you have shiny routers :D [20:59:44] hehe, it's getting fixed [21:00:25] it was pretty long for one of them [21:00:31] 07 15:49:56 <+nagios-wm> PROBLEM - BGP status on csw1-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.247, [21:00:35] 07 20:51:16 <+nagios-wm> RECOVERY - BGP status on csw1-esams is OK: OK: host 91.198.174.247, sessions up: 5, down: 0, shutdown: 0 [21:08:19] New review: Pyoungmeister; "a generic solr module should probably not contain a specific schema. can you please put that in file..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/16732 [21:09:32] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [21:14:42] New patchset: Alex Monk; "(bug 38690) Finish removal of editor and reviewer groups from trwiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16987 [21:24:33] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [21:52:52] New review: Nemo bis; "Note that the bot is back since August 3 for no understandable reason." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2675 [22:05:09] !log streaming a hotbackup of db1001 to db1049 (new enwiki snapshot host in eqiad) [22:05:18] Logged the message, Master [22:06:31] hey, i got bumped to mortal yesterday but i still can't ssh into fenari -- could this simply be because puppet hasn't run? [22:31:06] New patchset: Kaldari; "Turning on curation toolbar for testwiki. Hope I'm doing this right." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18054 [22:37:31] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18054 [22:38:51] yo ops [22:38:58] data loss on emery seems high [22:38:59] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=emery.wikimedia.org&v=1.41509&m=packet_loss_90th&r=hour&z=default&jr=&js=&st=1344378808&vl=%25&z=large [22:39:13] but not high enough for nagios... [22:39:32] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [22:39:36] ori-l: getting better not worse as i read it [22:40:56] hrm. yeah. i think i misread the scale of the y axis [22:41:09] now worse again [22:41:50] yeah, but i think it's business-as-usual [22:42:12] * jeremyb agrees [22:45:13] puppet-freshness on the other hand... 
[22:45:16] CRITICAL [22:45:22] (business as usual) [22:45:24] I like how opening Graphite results in one of: my browser crashing; flash crashing; graphite crashing. [22:48:38] binasher: are all the metrics in ganglia available in graphite? [22:50:03] i am guessing no, and in fact, that they're unrelated, now that i look [22:50:14] yeah [22:50:37] I don't think they share much [22:51:05] drat. i wanted to drill into that emery graph [22:53:29] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [22:53:29] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [23:59:52] New patchset: MaxSem; "WLM updater script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17964