[00:00:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:11:45] New patchset: Ryan Lane; "Adding php-luasandbox to labsconsole" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22856 [00:12:17] !log Added rev_sha1 to revision table on liquidthreads_labswikimedia [00:12:23] !log added scribunto to labsconsole [00:12:26] Logged the message, Mr. Obvious [00:12:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22856 [00:12:35] Logged the message, Master [00:14:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [00:15:23] mutante, also, for the OTRS 3.1 upgrade, we're going to need someone from ops as an ops contact - Jeff's going to be busy with the fundraiser, would you be able to do it or know who else might be able to? [00:16:10] I know this ultimately depends on what CT says, but I might as well find out who can do it first [00:16:59] Thehelpfulone: i have no OTRS experience, so probably not [00:17:56] hmm I think I don't know if OTRS experience itself is needed - Reedy mentioned on the bug https://bugzilla.wikimedia.org/show_bug.cgi?id=22622#c29 that it's something to do with perl [00:18:39] also, your userpage https://www.mediawiki.org/wiki/User:Mutante needs an update :) [00:19:03] true, thanks [00:19:21] !log added SpamBlacklist to labsconsole [00:19:30] Logged the message, Master [00:20:08] New patchset: Krinkle; "misc deployment scripts: Minor clean up" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22858 [00:20:18] Thehelpfulone: wow, we have " the Inventor of OTRS" working on the upgrade and he signed an NDA? that sounds like it might work:) [00:20:47] Thehelpfulone: i can just say repeat what Sam said, tell us what you need ..via RT [00:20:51] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/22858 [00:20:54] I know right? yesterday almost all hope was lost when Philippe said there's no engineering resources for it and he's not returned the NDA! [00:21:33] mutante, meh, the issue with RT is that it's not public - I can't tell what the updates are, and neither can anyone else outside of ops [00:22:19] Ryan_Lane, speaking of OTRS, is the plan to do the test server on labs first? I see there's an OTRS project - can we get Martin access to this? [00:22:33] we can't, really [00:22:37] private data [00:22:37] (presuming he needs/wants it)? [00:22:43] Ryan_Lane, he signed an NDA? [00:22:51] no private data in labs right now [00:23:06] so it needs to happen on a production server [00:23:08] oh I see [00:23:27] then I presume to get him access to the production server it needs an RT ticket? 
;-) [00:23:29] we have plans for private data in labs, but it's a ways out [00:23:41] I think there's some people working this right now [00:23:48] Thehelpfulone: i know, but that is not my personal decision and as long as that is the tool we use in our team thats the way to make sure other ops actually see it [00:23:49] and I believe we have tickets in [00:24:03] I'd like to make most of RT public [00:24:08] we need to do LDAP auth first [00:24:15] then we need to open up specific queues [00:24:24] * mutante nods [00:24:50] Ryan_Lane, I agree, but last time I discussed with CT he said it's not a "high priority" which is ops speak for not happening anytime soon :( [00:25:00] unfortunately, yep [00:25:06] apparently there's private data across lots of different queues though? [00:25:08] I would love to have a pile of public rt queues [00:25:20] how much work would it take to actually make RT public? [00:25:30] is implementing LDAP authentication difficult? [00:25:58] it seems to be a pain in rt [00:26:01] it's undocumented [00:26:11] a pain in the rt? :-) [00:26:15] heh [00:26:44] is LDAP the only way we can do authentication? doesn't RT have it's own system like Bugzilla does? [00:26:55] we don't want to manage accounts [00:27:04] labsconsole does that for us already [00:27:44] doing a quick google search, what about http://requesttracker.wikia.com/wiki/ExternalAuthentication#2._RT::Authen::ExternalAuth [00:28:01] there's some stuff at http://wiki-archive.bestpractical.com/view/LdapSiteConfigSettings too [00:28:18] yeah, look at the docs, though :) [00:28:26] also, the ubuntu install of rt is…. different [00:29:38] http://requesttracker.wikia.com/wiki/ExternalAuth#CPAN_installation - are those the wrong docs? [00:33:08] Ryan_Lane, see above - what method of installation do you plan to use? [00:33:52] hopefully not cpan [00:34:02] I've looked at these docs before, though [00:34:07] it's not a terribly simple process [00:34:47] At least it's not RT... oh god that app sucks [00:35:46] puppet feature "rudimentary CPAN support", "Added by Jim Blomo about 5 years ago. " :/ [00:36:48] is there anything better than RT? [00:37:07] "better" [00:37:47] every bug system sucks [00:37:49] i dunno, how is Mantis? [00:37:50] every single one [00:37:57] there's no such thing as a good one [00:38:22] RT works well with email input etc, which is useful for ops [00:38:32] we need that for procurement [00:38:46] the NASA likes RT :) [00:39:40] http://bestpractical.com/rt/praise.html [00:40:19] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [00:40:19] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [00:40:19] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [00:41:04] I'll pick a random one - what about bugnet? [00:41:23] Ryan_Lane: we should write our own. Obviously. [00:41:41] heh, well http://www.thegeekstuff.com/2010/08/bug-tracking-system/ things bugzilla is the best [00:41:54] again, it's "best" [00:42:02] thinks* [00:42:05] again, no such thing as a good one [00:43:43] it can suck and still suck less than all others [00:44:28] http://www.youtube.com/watch?v=d85p7JZXNy8 [00:44:58] Advantages are they're free, unlike kayako which sucks and is stupidly expensive. 
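For context on the RT::Authen::ExternalAuth approach linked above: the wiring normally goes in RT_SiteConfig.pm. A minimal sketch, assuming the Ubuntu request-tracker3.8 package layout; the LDAP server, base DN and attribute names below are placeholders, not the real labsconsole/LDAP values.

```bash
# Hedged sketch of RT::Authen::ExternalAuth LDAP wiring; every value is a placeholder.
cat >> /etc/request-tracker3.8/RT_SiteConfig.pm <<'EOF'
Set($ExternalAuthPriority, ['wmf_ldap']);
Set($ExternalInfoPriority, ['wmf_ldap']);
Set($ExternalSettings, {
    'wmf_ldap' => {
        'type'            => 'ldap',
        'server'          => 'ldap.example.org',             # placeholder
        'base'            => 'ou=people,dc=example,dc=org',  # placeholder
        'filter'          => '(objectClass=inetOrgPerson)',
        'attr_match_list' => [ 'Name', 'EmailAddress' ],
        'attr_map'        => {
            'Name'         => 'uid',
            'EmailAddress' => 'mail',
            'RealName'     => 'cn',
        },
    },
});
EOF
```

The attr_match_list / attr_map split is what lets RT match an incoming login against an LDAP attribute and then pull the rest of the account details from the directory, so accounts never have to be managed in RT itself.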
[00:46:01] non-free should not even be an option, or i would have said Atlassian JIRA (toolserver uses it, heh:P) /me hides [00:46:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:46:19] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [00:46:19] The world does not need more java [00:56:42] Change merged: Krinkle; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21393 [00:58:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.566 seconds [01:04:28] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [01:05:22] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [01:09:50] !log upgrading OATHAuth to 795cef09cab6ecb0e9ded35df06b2877ccc22c1a on labsconsole [01:09:59] Logged the message, Master [01:14:50] New patchset: Jgreen; "second attempt to mount netapp to locke for fundraising banner log archiving" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22872 [01:15:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22872 [01:16:32] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22872 [01:20:59] !log enabling ConfirmEdit with FancyCaptcha on labsconsole [01:21:10] Logged the message, Master [01:22:14] New patchset: Jgreen; "fixed include" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22873 [01:22:59] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22873 [01:27:08] !log enabling $wgEmailConfirmToEdit on labsconsole [01:27:17] Logged the message, Master [01:33:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:39:27] !log renaming mailing list chaptercommittee-l to affcom , rebuilding archives... [01:39:36] Logged the message, Master [01:41:22] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 240 seconds [01:42:18] New patchset: Dzahn; "mail alias and HTTP redirect for list rename: chaptercommittee-l -> affcom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22874 [01:43:05] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22874 [01:43:07] New patchset: Dzahn; "mail alias and HTTP redirect for list rename: chaptercommittee-l -> affcom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22874 [01:43:53] New review: Dzahn; "for RT-3477" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/22874 [01:43:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22874 [01:45:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.763 seconds [01:46:46] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (33037) [01:47:41] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (32244) [01:57:27] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds [01:58:03] PROBLEM - Puppet freshness on cp1025 is CRITICAL: Puppet has not run in the last 10 hours [02:10:48] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [02:10:57] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [02:10:57] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [02:11:42] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [02:21:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:11] New patchset: Jgreen; "adding account file_mover to aluminium/grosley/storage3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22876 [02:32:44] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22876 [02:33:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.399 seconds [02:36:00] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours [02:37:30] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (34761) [02:38:24] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (33571) [02:42:08] New patchset: Jgreen; "Revert "adding account file_mover to aluminium/grosley/storage3" . . . because our account creation classes are too broken to use--they create inconsistent GIDs across hosts, thwarting sane use of nfs/netapp." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/22877 [02:46:50] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22877 [02:54:20] New patchset: Jgreen; "removing nfs mount from locke" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22878 [02:59:37] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22878 [03:04:48] RECOVERY - Puppet freshness on search1001 is OK: puppet ran at Thu Sep 6 03:04:19 UTC 2012 [03:08:42] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [03:09:36] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [03:32:54] RECOVERY - Puppet freshness on mw74 is OK: puppet ran at Thu Sep 6 03:32:32 UTC 2012 [03:48:30] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (11976), zhwiki (51853) [03:51:57] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (51286) [05:21:30] PROBLEM - Host srv266 is DOWN: PING CRITICAL - Packet loss = 100% [06:15:18] PROBLEM - Puppet freshness on cp1022 is CRITICAL: Puppet has not run in the last 10 hours [06:41:15] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [06:41:15] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:41:15] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [06:41:15] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [06:41:15] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [06:41:16] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [06:41:16] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [06:41:17] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [06:41:17] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [07:50:36] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.244, [08:28:46] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [08:29:49] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [08:37:01] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [08:38:22] PROBLEM - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:39:52] RECOVERY - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 61471 bytes in 7.374 seconds [08:43:34] hmmm that didn't page [08:43:35] weird [10:41:12] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [10:41:12] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [10:41:12] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [10:47:12] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:58:04] paging is broken we think. or have you been getting pages recently? 
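On the file_mover revert above (the account classes handing out inconsistent GIDs across hosts): a quick way to see the mismatch is to compare what each host actually assigned. A rough sketch, assuming ssh access to the three hosts named in the commit message:

```bash
# If the uid/gid pairs differ between hosts, files shared over NFS/netapp
# will not map to the same owner everywhere -- the problem the revert describes.
for h in aluminium grosley storage3; do
    printf '%s: ' "$h"
    ssh "$h" id file_mover
done
```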
[10:58:19] I haven't [10:58:49] yeah I didn't think about it til Rob said something yesterday but I haven't either [11:01:13] poor apergos [11:01:16] still jetlagged [11:01:24] 4 am again [11:01:32] yeah, I noticed [11:01:37] I even drank caffeine yesterday afternoon [11:01:39] no difference [11:01:47] still fell asleep at 10pm [11:45:13] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (106422), zhwiki (49409) [11:46:07] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (105850), zhwiki (49296) [11:59:19] PROBLEM - Puppet freshness on cp1025 is CRITICAL: Puppet has not run in the last 10 hours [12:12:22] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [12:12:22] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [12:16:43] RECOVERY - Puppet freshness on cp1025 is OK: puppet ran at Thu Sep 6 12:16:20 UTC 2012 [12:16:53] so [12:16:58] so? [12:17:09] i wanted to talk to you about what to send where on the eqiad upload varnishes [12:17:16] i mirrored squid's current config now [12:17:23] that is, thumbs/temp/originals to swift, rest to ms7 [12:17:31] correct [12:17:37] what is temp anyway? [12:17:48] some MW temp cache space or something [12:17:50] don't remember exactly [12:17:55] PROBLEM - BGP status on csw2-esams is CRITICAL: (Service Check Timed Out) [12:19:07] RECOVERY - Varnish HTTP upload-backend on cp1025 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.053 seconds [12:19:14] while you guys are both in here, do either of you remember ben testing with an r510 (or some dell with an h700/h800) for swift? [12:19:33] yes and he couldn't make it but Asher said that he was wrong. [12:19:43] do we knw what he did? [12:19:52] RECOVERY - Varnish HTCP daemon on cp1025 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [12:19:54] I hunted around on wikitech but couldn't find notes there [12:19:56] i don't remember that [12:20:03] but we have R720XDs in esams for swift [12:20:21] what controllers do those have? [12:20:33] don't remember, but I had them check they could do JBOD [12:20:37] before purchase [12:20:37] ok [12:21:09] can't buy R510s anymore [12:21:28] I guess asher might know the back story, I'll see what he remembers [12:21:30] thanks [12:21:40] I have a mail from Asher that CT forwarded me [12:21:53] that said that he made R510s with JBOD just fine [12:21:55] you both in SF now? [12:21:59] only one [12:22:14] I'm just on SF time :-) [12:22:17] kind of [12:22:18] hehe [12:22:27] that's what I'm on: "kind of" [12:22:41] my sleep schedule is completely fucked up due to the previous two-three weeks [12:22:48] PROBLEM - BGP status on csw2-esams is CRITICAL: (Service Check Timed Out) [12:22:51] it won't get better [12:23:08] heh [12:23:26] the deployment windows every day of the week at 20:00 localtime surely didn't help :-) [12:23:32] no it didn't [12:23:43] mark: so, what did you want to ask about the Varnishes? [12:24:04] just what you think I should send where [12:24:11] but you've already answered it I guess [12:27:43] okay [12:27:52] you said you fixed it already? 
[12:27:56] two days ago [12:27:59] heh [12:28:02] am about to test it [12:28:16] since you were not online then I thought you were flying actually ;) [12:29:08] oh yeah, I saw your ping, ponged you a few hours after [12:30:20] so, we have more pending changes [12:30:31] I'll try to keep varnish configs up-to-date [12:31:49] ok [12:33:18] RECOVERY - Puppet freshness on cp1022 is OK: puppet ran at Thu Sep 6 12:32:54 UTC 2012 [12:35:15] RECOVERY - Varnish HTCP daemon on cp1022 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [12:35:42] RECOVERY - Varnish HTTP upload-backend on cp1022 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.053 seconds [12:35:51] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [12:35:51] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [12:36:45] RECOVERY - NTP on cp1025 is OK: NTP OK: Offset -0.05371642113 secs [12:37:39] PROBLEM - BGP status on csw2-esams is CRITICAL: (Service Check Timed Out) [12:42:54] PROBLEM - Host cp1021 is DOWN: PING CRITICAL - Packet loss = 100% [12:43:30] RECOVERY - Host cp1021 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [12:45:45] PROBLEM - Host cp1024 is DOWN: PING CRITICAL - Packet loss = 100% [12:45:54] PROBLEM - Host cp1023 is DOWN: PING CRITICAL - Packet loss = 100% [12:46:30] RECOVERY - Host cp1023 is UP: PING OK - Packet loss = 0%, RTA = 26.65 ms [12:46:39] RECOVERY - Host cp1024 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [12:47:33] PROBLEM - Varnish HTTP upload-frontend on cp1021 is CRITICAL: Connection refused [12:48:00] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:49:03] PROBLEM - Host cp1025 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:21] PROBLEM - Host cp1026 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:39] RECOVERY - Host cp1025 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [12:49:48] PROBLEM - Varnish HTTP upload-frontend on cp1024 is CRITICAL: Connection refused [12:49:48] RECOVERY - Host cp1026 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [12:50:33] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:51:00] PROBLEM - Host cp1028 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:09] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:51:09] PROBLEM - Host cp1027 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:45] RECOVERY - Host cp1027 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [12:52:03] RECOVERY - Host cp1028 is UP: PING OK - Packet loss = 0%, RTA = 27.23 ms [12:52:12] RECOVERY - Varnish HTTP upload-frontend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds [12:53:24] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:54:09] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [12:54:36] PROBLEM - Varnish HTTP upload-frontend on cp1026 is CRITICAL: Connection refused [12:55:21] PROBLEM - Varnish HTTP upload-frontend on cp1027 is CRITICAL: Connection refused [12:55:48] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:56:06] RECOVERY - Varnish HTTP upload-frontend on cp1024 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds [12:56:15] PROBLEM - Varnish 
HTTP upload-frontend on cp1028 is CRITICAL: Connection refused [12:56:24] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:56:42] RECOVERY - Varnish HTTP upload-frontend on cp1025 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [12:58:21] RECOVERY - Varnish HTTP upload-frontend on cp1027 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds [12:58:57] RECOVERY - Varnish HTTP upload-frontend on cp1022 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [13:02:17] New patchset: Mark Bergsma; "Don't start the loggers until Varnish is running" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22888 [13:03:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22888 [13:04:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22888 [13:06:54] RECOVERY - Varnish HTTP upload-frontend on cp1028 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds [13:07:21] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [13:09:36] PROBLEM - Host cp1021 is DOWN: PING CRITICAL - Packet loss = 100% [13:10:21] RECOVERY - Host cp1021 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [13:16:03] RECOVERY - Varnish HTTP upload-frontend on cp1026 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [13:16:39] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [13:18:54] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [13:25:48] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [13:26:33] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa [13:29:08] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [13:29:08] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [13:48:49] heading in to the office [13:52:14] streaming in varnish seems to work now [13:52:20] oh really? [13:52:22] how cool! [13:52:27] i just checked with a > 1 GB ogv on commons [13:52:39] since it's bigger than 64, the streaming code kicked in and delivered it to me just fine [13:52:45] i'll have to check concurrency better [13:52:54] :-) [13:52:57] yay for ditching squid [13:53:05] and the varnish instance that caches it went from completely empty to 1.3 GB on one disk cache hehe [13:53:11] /dev/sda3 139G 1.3G 138G 1% /srv/sda3 [13:53:12] /dev/sdb3 139G 36M 139G 1% /srv/sdb3 [13:53:24] how does it do disk cache? files? [13:53:26] one single big file? [13:53:29] single big file [13:53:41] divided in silos [13:53:49] does it make since to give it a block device directly? [13:54:03] giving a block device doesn't help, like with squid [13:54:07] sinc ethen the kernel doesn't cache it [13:54:11] and varnish kind of relies on that [13:54:20] for squid it doesn't matter, as it does its own memory caching [13:54:28] right [13:54:30] then it's nice that the kernel isn't also caching it [13:54:37] so we just use a single file on an otherwise empty xfs fs [13:54:56] have you booked tickets for VUG btw? [13:54:59] no [13:55:24] but you will? 
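For reference, the "single big file on an otherwise empty xfs fs" layout described above corresponds to varnish's persistent storage backend, one silo file per cache filesystem. A sketch only: the flags below are illustrative varnish 3.x options, not the actual production command line, though /srv/sda3 and /srv/sdb3 match the df output quoted above.

```bash
# One persistent silo per dedicated xfs filesystem; sizes and VCL path are made up.
varnishd -a :3128 -T 127.0.0.1:6083 \
  -f /etc/varnish/upload-backend.vcl \
  -s main1=persistent,/srv/sda3/varnish.persist,130G \
  -s main2=persistent,/srv/sdb3/varnish.persist,130G
```

Keeping the silo on a normal filesystem rather than handing varnish a raw block device is what lets the kernel page cache do the memory-level caching varnish relies on, as explained in the exchange above.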
[13:55:27] i don't know [13:55:31] i don't really want to go [13:55:35] haha [13:55:59] it's gonna be a busy period, I don't like too much travel as I can't get work done [14:12:36] New patchset: Pyoungmeister; "adding udp2log-log4j.jar to classpath to fully support udp2log logging" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/22898 [14:13:19] Change abandoned: Pyoungmeister; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22734 [14:13:37] Change merged: Pyoungmeister; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/22898 [14:22:52] well I can load a single large file from two clients at least, while it's being fetched from swift [14:23:03] !log reimaging search1001, 1002, 1003 [14:23:14] Logged the message, notpeter [14:26:15] * apergos lurks for the rest of the swift conversation [14:26:52] PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:27:11] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:27:44] it's not a swift conversation [14:28:04] New patchset: Pyoungmeister; "lucene: a couple of small tweaks to get udp2log results up and running" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22899 [14:28:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22899 [14:29:34] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22899 [14:31:40] PROBLEM - SSH on search1003 is CRITICAL: Connection refused [14:31:58] PROBLEM - Lucene disk space on search1003 is CRITICAL: Connection refused by host [14:32:34] RECOVERY - Host search1002 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [14:32:52] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [14:35:10] PROBLEM - NTP on search1001 is CRITICAL: NTP CRITICAL: No response from NTP server [14:36:31] PROBLEM - Lucene disk space on search1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:36:31] PROBLEM - Lucene disk space on search1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:37:16] PROBLEM - SSH on search1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:37:16] PROBLEM - SSH on search1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:37:52] PROBLEM - Lucene on search1003 is CRITICAL: Connection timed out [14:40:16] RECOVERY - SSH on search1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:40:16] RECOVERY - SSH on search1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:40:16] RECOVERY - SSH on search1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:42:44] New patchset: Jgreen; "adding netapp mount back to locke (again)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22901 [14:43:36] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22901 [14:43:46] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22901 [14:48:22] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused [14:55:08] !log ms-be6 going down to remove ssd drives [14:55:17] Logged the message, Master [14:56:28] RECOVERY - Lucene disk space on search1001 is OK: DISK OK [14:59:37] RECOVERY - Lucene disk space on search1002 is OK: DISK OK [15:02:19] PROBLEM - NTP on search1003 is CRITICAL: NTP CRITICAL: No response from NTP server [15:05:37] RECOVERY - Lucene disk space on search1003 is OK: DISK OK [15:06:13] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [15:10:43] cmjohnson1: yay you are here [15:11:43] New patchset: Ottomata; "Ungh, RT 3460 is for halfak, not aaron shulz." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22902 [15:12:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22902 [15:12:40] RECOVERY - NTP on search1001 is OK: NTP OK: Offset -0.01202392578 secs [15:12:43] notpeter, could you merge that one? [15:12:43] https://gerrit.wikimedia.org/r/22902 [15:13:06] apergos: yes...almost finished w/os [15:13:15] sweet [15:14:28] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22902 [15:14:33] thank you! [15:14:33] doneski [15:14:35] yup! [15:21:58] RECOVERY - NTP on search1003 is OK: NTP OK: Offset -0.006657481194 secs [15:24:04] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:24:16] apergos: ms-be6 is all yours [15:24:40] yay [15:24:46] ready for puppet cert? [15:24:56] yes [15:26:26] ok doing the first puppet run now [15:26:36] let's see what happens shall we? [15:27:30] * cmjohnson1 is optimistic [15:27:38] * apergos is pessimistic [15:27:43] that should balance us out :-D [15:27:45] * ^demon is realistic [15:28:34] RECOVERY - Lucene on search1003 is OK: TCP OK - 0.027 second response time on port 8123 [15:28:50] doo dee doo dee doo [15:29:19] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.030 second response time on port 8123 [15:29:19] PROBLEM - NTP on ms-be6 is CRITICAL: NTP CRITICAL: Offset unknown [15:29:28] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.027 second response time on port 8123 [15:30:05] cmjohnson1: so. both srv281 and srv266 are dead again... [15:30:11] test complete! [15:30:12] :) [15:30:38] notpeter [15:30:55] yep [15:31:01] *sigh* [15:31:16] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [15:31:16] also search32 is dead again as well [15:31:34] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:31:43] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [15:31:43] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:31:43] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:31:45] but hey! search32 seems to be staying up! [15:31:48] cmjohnson1: really? 
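The "ready for puppet cert?" / "first puppet run" exchange above is the usual new-host sequence with the puppet 2.x tools; a sketch, assuming the cert is signed on the puppetmaster and that ms-be6's FQDN is ms-be6.pmtpa.wmnet (a guess):

```bash
# On the puppetmaster: list pending certificate requests, then sign the new host.
puppetca --list
puppetca --sign ms-be6.pmtpa.wmnet   # FQDN is an assumption

# On ms-be6 itself: run the agent once in the foreground so any failures are visible.
puppetd --test --verbose
```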
[15:31:52] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [15:32:10] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [15:32:22] i see an amber led...didn't check it yet...just assumed it was bad [15:32:28] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [15:32:28] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [15:32:37] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [15:32:37] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [15:33:02] aaaaaaand it sucks to be us. [15:33:14] lemme see exactly which ones it whined about and in which way [15:33:27] notpeter: same b.s. [15:33:28] Record: 2 [15:33:29] Date/Time: 09/05/2012 19:31:42 [15:33:29] Source: system [15:33:29] Severity: Critical [15:33:29] Description: Multi-bit memory errors detected on a memory device at location DIMM_A2. [15:33:30] ------------------------------------------------------------------------------- [15:34:08] blerg [15:34:08] ok [15:37:34] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [15:37:57] sdj, sdg, show errors at the driver level [15:38:37] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [15:38:48] let's replace the drives and cycle it several times and see what happens...maybe somewhere along the way we killed the drives [15:39:20] apergos: can you try and force mount them? [15:39:32] rather manual mount [15:39:43] not done looking [15:39:57] gimme a couple more minutes, we had more mount errors than this [15:40:16] those two were the ones with errors at the driver level, not the xfs level [15:43:50] c d l and i show as empty [15:44:43] eww uck the device names don't match the mount dirs [15:44:49] anyways, it's just for testing [15:45:51] so in the end the devices not mounted are sdj,g,l,k [15:47:40] well ...backplane is next [15:47:52] hi w [15:47:59] want some traffic on varnish? [15:48:34] RECOVERY - NTP on ms-be6 is OK: NTP OK: Offset 0.003421783447 secs [15:48:41] sdj fails, hostbyte=DID_NO_CONNECT etc... (still checking the other three) [15:50:27] mark ; yes ;-P [15:50:38] sdg the same, [15:52:27] pick a small country ;) [15:53:40] how about an african one? [15:53:53] what about malaysia ;-p [15:54:06] sdl, sdk the same [15:54:17] sure [15:54:26] bahasa melayu [15:54:31] what time is it there [15:54:34] cmjohnson1: ^^ so those I assume are the same four disks as before, ids may be different because of pulling the two ssds [15:54:36] or bahasa indonesia [15:54:42] so backplane is next. [15:54:50] almost midnight [15:54:54] meh [15:54:56] no traffic [15:54:56] woosters, whilst you are here, for the OTRS upgrade I imagine we're going to need an ops contact - who do you think would be able to be one, given that Jeff will be busy with the fundraiser now? [15:55:16] mark, maybe somewhere small in europe? 
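The disk triage narrated above (driver-level errors on sdj/sdg/sdl/sdk, four filesystems missing, manual mount attempts) boils down to a few checks; the device names are from the conversation, while the swift mount-point path and partition number are guesses at the layout.

```bash
# Look for the driver-level failures (e.g. hostbyte=DID_NO_CONNECT) on the suspect disks.
dmesg | grep -E 'sd[jgkl]|DID_NO_CONNECT'

# See which of the data filesystems actually made it into the mount table.
mount | grep swift

# Try mounting one of the missing devices by hand; partition and path are illustrative.
mount -t xfs /dev/sdj1 /srv/swift-storage/sdj1
```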
[15:55:21] not in europe [15:55:26] perhaps argentina [15:55:38] it's bigger, but still ok [15:55:39] egypt [15:55:56] i'll do argentina, it's 1% of traffic apparently [15:55:57] cmjohnson1: also see yer mail (from dell) [15:56:06] also needs to be a country we'll hear from if there are issues [15:56:34] what are we testing btw mark? [15:56:40] thehelpfulone - let me discuss with some of my team members and get back to u [15:56:42] I will happily be on that call, I would like robh to be on it also at aminimum, you if you think it makes sense. [15:56:50] woosters, sure thanks [15:57:29] * apergos goes to update the ticket [15:57:43] thehelpfulone - he is testing the latest varnish build with some much needed fixes [15:58:00] is there a way to find out when puppet ran for the last time on a given host ? [15:58:14] hashar: login to it? [15:58:18] The last Puppet run was at Thu Feb 16 19:13:00 UTC 2012 (32 minutes ago). [15:58:20] ;) [15:58:30] motd of course! ;-D [16:01:24] woosters: so I tested the same videos we had issues with months ago [16:01:28] seem to work just fine now [16:01:31] updated. [16:01:41] real traffic is of course the real test [16:01:46] but it's looking good so far [16:02:00] that is good news [16:02:06] the next question is, ttf-lyx package is supposed to be installed on imagescaler box per https://gerrit.wikimedia.org/r/22705 but it is not on the box. [16:02:17] could it be that the change did not get pulled on the puppetmaster ? [16:02:43] hashar: hi, let me check that one for you [16:03:12] hashar: it is on sockpuppet.. so it's not that [16:03:34] mutante: got some input on the bug report https://bugzilla.wikimedia.org/show_bug.cgi?id=38299#c32 [16:04:24] hashar: running puppet on srv219 .. hold on [16:05:02] !log Sending upload traffic from Argentina to upload-lb.eqiad (Varnish with streaming and persistence patches) [16:05:03] hashar: package installs fail because there are unmet dependencies [16:05:08] ohh [16:05:11] Logged the message, Master [16:05:12] hashar: cm-super: Depends: cm-super-minimal but it is not going to be installed [16:05:35] grmblbl did my tests on Precise :( [16:05:37] cmjohnson1: as soon as robh gets in I will ask him what he thinks about the call; in the meantime can you go ahead and request they send the backplane/expander? I dunno what your schedule is, are you in tomorrow if it shows up? [16:06:07] apergos: yes, i am here tomorrow when it will show [16:07:03] hashar: wait..trying to fix manually [16:07:25] mutante: you are my hero [16:07:32] mutante: I am not sure why there is that probeem [16:08:00] maybe some packages are conflicting each other [16:08:19] dpkg: error processing /var/cache/apt/archives/cm-super-minimal_0.3.4-3_all.deb (--unpack): trying to overwrite '/usr/share/texmf/doc', which is also in package tex-common 0:2.06ubuntu0.1 [16:08:36] apergos / paravoid: there will be more swift traffic now, as varnish is requesting thumbs/originals/temps directly from swift, bypassing the squids there [16:08:42] (with an empty cache of course) [16:08:51] hashar: yes, cm-super-minimal vs. tex-common [16:08:58] we are doomed [16:09:21] I am removing cm-super [16:09:24] how much traffic are you passing through the varnishes again? 
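On the side question above — how to tell when puppet last ran on a host — the motd line quoted is one answer; puppet's own state files are another. A sketch, assuming a stock puppet 2.x layout (the exact path may differ on these hosts):

```bash
# The motd carries the "last Puppet run was at ..." line quoted above.
ssh srv219 head -n 5 /etc/motd

# Or check the timestamp on puppet's run summary (file path is an assumption).
ssh srv219 ls -l /var/lib/puppet/state/last_run_summary.yaml
```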
[16:09:29] percentagewise or something [16:09:34] ma [16:09:36] grr [16:09:37] 1% [16:09:37] mark [16:09:41] ok thanks [16:09:47] so not a lot yet [16:10:01] it should be just fine but thanks for the heads up [16:10:43] nice, there's a new parameter fetch_streamed [16:10:46] New patchset: Hashar; "(bug 38299) remove cm-super" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22907 [16:10:53] every time varnish streams a large object [16:11:02] mutante: https://gerrit.wikimedia.org/r/22907 removes cm-super [16:11:31] !log srv219 - apt-get -f install, removing cm-super, dist-upgrading, [16:11:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22907 [16:11:40] Logged the message, Master [16:12:05] hashar: ok, gotcha, if they are provided by ttf-lyx anyways.. cool [16:12:35] New review: Dzahn; "yep, this conflicted" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/22907 [16:12:36] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22907 [16:12:52] I should have tested on a Lucid instance with the imagescaler class [16:13:39] grmlbl out for a few minutes, brb [16:13:58] !log srv219 - apt-get auto-remove unneeded packages, running puppet [16:14:07] Logged the message, Master [16:16:41] hashar: ii ttf-lyx 1.6.5-1ubuntu1. and puppet runs fine [16:19:29] I'm gonna shut down ms-be6 again cmjohnson1 since it's not doing anything useful, unless you want to get something off it first [16:20:03] no...i am good. i have a dell tech here for mw8...once he is gone I will call support and get the backplane sent [16:20:13] great [16:22:17] !Log powering off ms-be6 til we try the next round of replacements [16:22:27] mutante: thanks :-) [16:23:33] mutante: will wait for puppet to kick in and will then close the bug ;) [16:23:35] hashar: i am fixing the other ones [16:23:52] hashar: eh, it does not remove cm-super by itself [16:24:11] but no worries, i can do them, they are just 6, right? [16:24:16] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:48] hashar: btw, also this one failed, f.e. on srv224 install php-luasandbox' [16:25:44] mutante: though it is installed on the other servers :/ [16:27:52] hrmm, need to remove apt lock file and stuff [16:29:46] hashar: srv224, well it has ttf-lyx, but it has other issues with the php-luasandbox [16:29:57] so far so good [16:35:30] I am not sure what luasandbox is for on imagescaler though :) [16:36:50] mark: so in my quest to get crap off of ms7, I'm looking at the monitorurl setting for the apaches in squid's upload-settings.php, it's set to http://upload.wikimedia.org/pybaltestfile.txt [16:36:57] how does that get handled? 
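The srv219 cleanup in the !log entries above — getting past the cm-super-minimal / tex-common file conflict so ttf-lyx can install — amounts to roughly the following; the package names come from the conversation, the exact order is a sketch.

```bash
# Let dpkg/apt finish the half-configured state, drop the conflicting packages,
# then pull in the font package the imagescaler change actually wants.
apt-get -f install
apt-get remove --purge cm-super cm-super-minimal
apt-get install ttf-lyx
apt-get autoremove   # clear out now-unneeded dependencies, per the second !log
```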
[16:38:45] my question is quite literal: what retrieves it, from where, and how is this treated as a system health check [16:42:16] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [16:42:16] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [16:42:16] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [16:42:16] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:42:16] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [16:42:17] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [16:42:17] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [16:42:18] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [16:42:18] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [16:44:15] New patchset: preilly; "add DTAC Thailand (DT)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22913 [16:45:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22913 [16:55:24] New patchset: Cmjohnson; "Replaced mother board on mw8. updating mac address in dhcpd file hardware ethernet 77;" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22914 [16:56:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22914 [16:56:39] notpeter: can you merge that change plz? 22914 [16:57:10] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22914 [16:57:12] done [17:03:13] swift looks not happy now in the stats [17:03:41] apergos: in the squid config, it's simply squid that retrieves that file to see if that backend is still active [17:04:00] it can be changed into any other object that is guaranteed to be in swift always [17:05:33] anybody know how or if it is possible to set upa github <-> gerrit mirror? [17:05:55] Not too way... [17:05:56] ok my question was how it does the retrieval, I mean upload.wikimedia.org would go through lvs to the squids again, [17:06:01] *two [17:06:14] wouldn't it? [17:06:19] it retrieves it from the backend as it's configured in squid [17:06:35] one way is fine [17:06:38] ottomata: Mirroring out is possible as of 2.5 (?), mirroring back hasn't been written by anyone (hence our gerrit contractor posting) [17:06:39] using that hostname as the request [17:06:41] don't really care which [17:06:49] ^demon should be able to confirm [17:07:00] ah he's online now too :) [17:07:03] <^demon> Mirroring from gerrit to github is possible, and I've been working on it this week. [17:07:09] oh cool! [17:07:15] i have two cases where I"d like to do that [17:07:19] can I work with you as a trial? [17:07:20] so Host header? [17:07:29] <^demon> ottomata: Well we'll be replicating all the repos :) [17:07:42] <^demon> But I may need a guinea pig, thanks for volunteering [17:07:43] apergos: yeah, or as a proxy request, that i'm not sure of [17:07:51] ok. thanks, that cleared it up [17:08:09] <^demon> ottomata: Pulling stuff back from github into gerrit is a little more involved, and something we'll have the contractor work on like Reedy said. 
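To make the answer above concrete: the monitor URL is just an object squid fetches from each configured backend to decide whether that backend is healthy. A manual approximation (whether squid sends a plain Host header or a proxy-style request was left open above, and the backend FQDN here is a guess):

```bash
# Fetch the monitor object from one specific backend with the upload hostname;
# an HTTP 200 is what lets that backend count as alive.
curl -sI -H 'Host: upload.wikimedia.org' \
  'http://ms7.wikimedia.org/pybaltestfile.txt' | head -n 1
```

As noted above, the object can be swapped for anything guaranteed to always exist in swift once ms7 goes away.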
[17:09:02] i'm fine with making gerrit the master [17:09:04] don't really care [17:10:32] Hi, not sure if this is the right place to ask, but is there any movement towards getting wikipedia.org (and associated domains) dnssec signed? DANE (http://tools.ietf.org/html/rfc6698) now being a full RFC it would be a step towards more trustable HTTPS everywhere, and with Jimmy talking about making wikipedia HTTPS only as a responce to the snoopers charter it would be nice to have... [17:11:14] no, there's no movement towards dnssec [17:11:54] Jimmy should really talk to ops before suggesting things [17:13:24] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [17:14:30] is there no interestin dnssec fir any particular reason? or has it just not been considered due to many other things going on (tm). [17:14:39] mostly the latter [17:15:10] also dnssec has its risks associated with it; it needs to be managed well or it can cause significant downtime [17:15:21] so it's not a decision we can take lightly [17:15:51] yes, you need a reliable system to manage key roll over and so on. [17:15:56] yeah [17:16:34] http://www.theregister.co.uk/2012/09/06/jimmy_wales_complains_about_uk_snoopers_charter/ [17:17:00] jimmy wales, talking purely in a personal capacity [17:17:12] Is that even possible at the moment for all (including non-logged in) users? [17:17:16] sorry, talking in a purely personal capacity ;) [17:17:18] PROBLEM - SSH on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:22] does wikipedia use a HSM for the existign https stuff? [17:17:27] PROBLEM - Memcached on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:32] <^demon> mark: At least he qualified it with "err, I shouldn't speak for our technical staff" this time :) [17:17:42] jasperw: no [17:18:32] * csteipp would really like to see us implement dnssec [17:18:38] fwiw i use zkt on my zones: http://www.hznet.de/dns/zkt/ [17:19:05] which works well, but i suspect the wiki* setup is a little more complicated than my odd handful of zones :) [17:19:30] jasperw, please read https://www.mediawiki.org/wiki/Wikipmediawiki [17:20:22] ahhh computer is dyyying [17:20:27] oops [17:20:28] wrong chat [17:20:44] ok, wiki* == wikimedia (or wikimedia servers) [17:21:03] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [17:24:05] So what happened to operations/mediawiki-config master? [17:24:26] It's not listed on https://gerrit.wikimedia.org/r/#/admin/projects/operations/mediawiki-config,branches [17:24:32] And I'm getting fatal: Couldn't find remote ref master [17:26:34] * apergos wishes sqiud config was in puppet = in gerrit so they could commit and request a review [17:27:45] it's in git though now [17:27:46] I think [17:27:56] as of 2 weeks ago by paravoid I think [17:28:11] so you can request a review just fine, but not in gerrit ;) [17:28:32] or perhaps he didn't do it yet, i'm not sure [17:28:56] it is in git [17:29:11] I could add and not commit, then ask him or someone to look at the diff I guess [17:29:47] or you put it in a branch [17:29:57] New review: Sumanah; "No, Reedy isn't the only person who *can* do stuff like this, but too often he is. I'm going to ask..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21326 [17:30:10] right! [17:31:50] Am I the only one having trouble getting mw-config master right now? :/ [17:32:26] Krenair: ZOMG me too [17:32:26] ^demon: WTF is wrong with mediawiki-config? 
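On the DNSSEC question above: checking where a zone currently stands only takes a couple of queries. dig is stock BIND tooling; the resolver choice below is arbitrary.

```bash
# Is there a DS record at the parent, i.e. is the zone part of the signed chain?
dig +short DS wikipedia.org @8.8.8.8

# Does the zone return RRSIGs when queried with the DNSSEC-OK bit set?
dig +dnssec SOA wikipedia.org @8.8.8.8 | grep -c RRSIG
```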
[17:32:29] Nope, it's not got any branches for me [17:32:33] catrope@roanLaptop:~/mediawiki/git/mediawiki-config (master)$ git pull [17:32:34] Your configuration specifies to merge with the ref 'master' [17:32:36] from the remote, but no such ref was fetched. [17:32:42] https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git;a=tree 404 [17:33:54] <^demon> RoanKattouw: Ugh, no clue. Haven't touched it. [17:34:08] <^demon> We attempted some git gc last night, wonder if that fubared it. [17:34:14] <^demon> Which would *suck* royally. [17:34:50] We have it on fenari [17:34:55] So the data isn't lost [17:35:29] <^demon> What the HELL [17:35:35] <^demon> https://gerrit.wikimedia.org/r/#/admin/projects/operations/mediawiki-config,branches [17:35:49] <^demon> master and refs/meta/config just *disappeared* [17:36:16] master is something we can reinstate fairly easily [17:36:20] But r/m/c is concercin [17:36:22] g [17:36:25] *concerning [17:36:53] <^demon> Well the repo's permissions weren't complicated. [17:36:59] <^demon> I'm just more worried why we had data loss. [17:37:58] <^demon> I can think of no plausible reason why this would've happened. [17:38:31] <^demon> It was just a `git gc --quiet`, no --prune options or anything. [17:42:52] What about the replication host? [17:43:00] Maybe it was the git fsck [17:43:01] Did it replicate the disappearances there too? [17:43:51] <^demon> Damianz: We only fsck'd core, not all repos. [17:43:58] New patchset: Demon; "Disabling `git gc` cron for gerrit for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22924 [17:44:00] <^demon> RoanKattouw: Shouldn't have, a gc is local. [17:44:08] <^demon> I can log into formey and check [17:44:46] <^demon> Actually though, the cron went on both hosts so we might've gotten bitten [17:44:48] <^demon> Lemme login now [17:44:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22924 [17:45:10] Only if it messed up in the same way [17:45:20] Also, let me add that running any sort of gc / repack on a backup host is a seriously bad idea [17:46:11] <^demon> Yup, both got hosed with this. [17:46:20] OK [17:46:25] Replication isn't a backup solution anyway, it's a HA solution [17:46:39] Grant Create Reference and we can push it back up from fenari [17:48:07] Depends on what you want it to be [17:48:10] <^demon> RoanKattouw: Granted, plus ownership, review, submit [17:48:18] I don't think formey is used as an HA solution, more like a backup [17:48:21] OK pushing from fenari [17:48:32] <^demon> It's a backup really, yeah [17:49:01] Which is why we shouldn't gc/repack it or touch it in any way ever [17:49:10] Also, is refs/review/* intact? [17:49:53] <^demon> I doubt it. [17:49:54] !log Pushing master of operations/mediawiki-config.git into gerrit from fenari [17:50:03] Because that would be truly lost [17:50:03] Logged the message, Mr. Obvious [17:50:23] ^demon: I need Forge {Author,Committer} Identity too [17:50:27] <^demon> Actually, everything still appears to be in refs/* [17:50:51] <^demon> Done [17:51:02] https://gerrit.wikimedia.org/r/gitweb?p=operations%2Fmediawiki-config.git;a=commit;h=0f50fa923ecf15ac32aba993699f443e1c407a50 seems to work [17:51:11] So refs/review seems to still be there [17:51:15] OK pushed [17:51:20] <^demon> refs/changes/* is all there. [17:51:22] <^demon> afaict. [17:51:58] master is back [17:54:02] New review: Kaldari; "I believe it was just closed and locked in 2003." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/22534 [18:06:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22924 [18:10:29] New patchset: RobH; "allocating ersch as secondary poolcounter server (tampa)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22931 [18:11:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22931 [18:17:24] !log stopping search indexing on searchidx1001 and searchidx2 to sync to eqiad [18:17:34] Logged the message, notpeter [18:23:41] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22931 [18:23:51] woooooo self review! \o/ [18:29:10] !log installing ersch as poolcounter server [18:29:19] Logged the message, RobH [18:30:05] $ connect com2 [18:30:05] connect: com2 port is currently in use [18:30:10] didnt miss that. [18:37:45] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [18:38:26] heh [18:40:17] !log authdns-update for ersch ip and removing old decom servers from pdns templates [18:40:26] Logged the message, RobH [18:42:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 235 seconds [18:44:47] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22856 [18:47:21] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (19678) [18:48:03] New patchset: Jeremyb; "change all $ircecho_server to use the chat record" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22698 [18:48:24] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (19211) [18:48:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22698 [18:49:27] * jeremyb wonders if mutante's been doing something to make it keep trying to merge? otherwise gerrit seems to be freaking out in a way I never saw before. (usually it just says it once?) [18:49:46] anyway, i realized it's not actually a CNAME so i fixed the commit msg [18:49:52] jeremyb: no, i have been wondering about that myself. it keeps retrying that [18:49:55] ARGHWEQHIOUWe;oiawr;lk [18:49:55] connect: com2 port is currently in use [18:50:03] HATE DRAC 5 >_< [18:50:13] crap [18:50:14] RobH: and DRAC 4 ??? [18:50:17] seems this repo is broken too [18:50:19] <^demon> mutante, jeremyb: It keeps retrying when there's dependencies. It's kinda annoying. [18:50:28] <^demon> Ryan_Lane: Which one? [18:50:31] jeremyb: dunno dont recall it [18:50:37] ^demon: i've seen deps before but never seen the perpetual retry [18:50:38] https://gerrit.wikimedia.org/r/#/c/22698/ [18:51:06] <^demon> That's not broken, I've seen it before. [18:51:11] <^demon> Not sure if it's a feature or a bug. [18:51:11] ah [18:51:12] good [18:51:25] <^demon> It happens when there's dependencies or somesuch. [18:51:38] <^demon> It keeps trying to remerge, thinking it'll suddenly magically succeed. [18:51:44] <^demon> So yeah, I'm pretty sure it's a "feature" [18:51:59] can't it just detect if there's still a dep and be silent if there is? [18:52:15] <^demon> jeremyb: File a bug upstream ;-) [18:52:15] (unless maybe it's a different dep? 
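Circling back to the mediawiki-config repair logged a bit earlier: once Create Reference and Forge Author/Committer were granted on the project, reinstating the lost branch from the surviving copy on fenari is a plain push. Only the idea is from the log; the working-copy path and remote name below are guesses.

```bash
# On fenari, from the checkout of operations/mediawiki-config:
cd /home/wikipedia/common      # path is an assumption
git remote -v                  # confirm which remote points at gerrit
git push gerrit master         # recreate refs/heads/master on the server
```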
[18:52:28] ^demon: maybe later ;) [18:52:44] anyway, going away for a while, have to get some work done [18:53:00] i also thought i have seen before how it does not merge because of the dep, but the retying is new [18:53:16] <^demon> It doesn't happen in all scenarios. [18:53:27] <^demon> I've probably seen it less than 10 times. [18:53:52] I really hope that doesn't send an email out each time [18:53:59] Damianz: it does! [18:54:05] Oh god... [18:54:05] <^demon> Oh yes, it certainly does. [18:54:46] <^demon> Gerrit's policy towards e-mail notifs seems to be "if anything happens, ever, send an e-mail, someone might care." [18:55:01] Damianz: re: ascii cats. find . | xargs cowsay -f hellokitty [18:55:14] surely a hook could suppress [18:55:31] but someone would have to write it! [18:55:33] <^demon> There's no hook for that. [18:55:47] <^demon> Would need an @ExtensionPoint or somesuch [18:55:51] * ^demon has bigger fish to fry [18:55:54] bye [18:56:37] Heh I never knew cowsay could do cats :D [18:57:05] hehe, yes, it has tons of "fonts" :) [18:57:14] you can write your own:) [18:57:15] I wonder how hard it is to get commits in upstream, would suck to effectivly have to fork gerrit. [18:57:29] <^demon> Damianz: Not hard, at all. [18:57:34] :) [18:57:37] <^demon> I've contributed several :) [18:57:55] <^demon> https://gerrit-review.googlesource.com/#/q/owner:%22Chad+Horohoe%22,n,z [18:58:43] That's a few several, they need a better theme though heh [19:00:07] <^demon> Yeah, they still use the puke green :) [19:05:21] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 286 seconds [19:06:47] New patchset: Aaron Schulz; "Added global backend config for things like math." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22941 [19:31:15] New patchset: Ottomata; "udp2log.pp - ensuring that udp-filter is installed instead of latest." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22947 [19:32:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22947 [19:37:08] hioy, could someone merge that pretty please? [19:37:08] https://gerrit.wikimedia.org/r/22947 [19:37:09] notpeter? [19:38:27] sup? [19:38:47] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22947 [19:38:50] thanks [19:38:51] dunno why [19:38:53] but sure :) [19:51:37] New review: Helder.wiki; "Is this supposed to be live once it is "merged"?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21475 [19:53:21] !log stopping puppet on oxygen. (puppet upgraded udp-filter before I was ready, I have to make sure ip filtering still works before I can turn it back on) [19:53:30] Logged the message, Master [20:06:40] New review: Reedy; "IT's there as is in the config, and I know InitialiseSettings has been sync'd numerous times since" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21475 [20:12:15] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 27 seconds [20:14:39] RECOVERY - MySQL disk space on storage3 is OK: DISK OK [20:14:48] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [20:14:49] New patchset: Ori.livneh; "Remove udp2log instance from vanadium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22988 [20:15:43] New review: gerrit2; "Lint check passed." 
[20:15:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22988
[20:16:07] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22988
[20:17:25] jeff_green: storage3 is back up and you are able to access the array
[20:17:33] excellent, thanks
[20:17:39] so it was two failed disks?
[20:17:48] yep.. the 2nd disk crashed it
[20:18:05] qwality
[20:18:55] TimStarling: binasher Looks like we've not been getting anything in the memcached error logs on fluorine since the 29th/30th August
[20:18:59] Seems a bit suspect
[20:20:25] New patchset: Jgreen; "deprecating misc::fundraising::impressionlog::compress" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22989
[20:21:17] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22989
[20:32:50] New patchset: Jgreen; "remove misc::fundraising::impressionlog::compress" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22992
[20:33:42] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22992
[20:37:36] RECOVERY - mysqld processes on storage3 is OK: PROCS OK: 1 process with command name mysqld
[20:40:36] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 92082 seconds
[20:41:57] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[20:41:57] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[20:41:57] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[20:47:57] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[20:50:55] about to run scap
[21:01:21] New patchset: Dereckson; "(bug 39942) Disables UseRCPatrol on fi.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22999
[21:22:07] New review: Dereckson; "shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/22999
[21:29:33] New patchset: RobH; "ersch set to autopart with 250gb raid 1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23005
[21:30:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23005
[21:31:11] apergos: can you guess the best kind of review?
[21:31:14] New review: Alex Monk; "Hello?" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/12556
[21:31:17] New review: RobH; "self review \o/" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/23005
[21:31:17] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23005
[21:31:22] review by someone *else*!
[21:31:57] i amuse the hell out of me.
[21:34:42] It's when you self-review and give criticism then reject the change we'll worry
[21:35:04] heh… that i like to see
[21:51:45] Damianz: now i feel the need to do so.
[22:00:01] anyone wanna review a dns change for me?
[22:00:24] /tmp/atg-msbe13-diff.txt on sockpuppet, db70 being renamed to ms-be13, new internal ip but same old mgmt ip
[22:02:48] RobH: you can telnet to poolcounterd on port 7531 and type STATS FULL
[22:03:09] binasher: ohhh, thanks!
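For reference, the check binasher describes looks roughly like this from a shell — a minimal sketch, with the hostname purely illustrative and port 7531 taken from the message above:

```bash
# Interactive: connect, type STATS FULL, and leave with the telnet escape
# character (Ctrl+] then "quit" at the telnet> prompt).
telnet ersch 7531

# Non-interactive alternative, if netcat is available:
echo "STATS FULL" | nc -q 1 ersch 7531
```

On a freshly pooled server the counters should be mostly zeros and start moving once traffic reaches it, which matches the next message in the log.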
[22:03:43] on a newly started instance it should have a lot of 0's but that should change quickly after adding it to the site
[22:04:01] yep, i can add and watch it spin into service, at least i can compare output against tarin too
[22:04:02] thank you
[22:04:12] (that's gonna go on the wikitech page right? :-P)
[22:04:25] it will once I confirm it works for me in pushing it live, yep
[22:04:32] yay!
[22:05:01] stinks the only test is once it's live, though i suppose the telnet before pool shows service is responding
[22:05:07] so seems a good basic test.
[22:06:16] apergos: looks good to me. maybe just the newlines in 10.in-addr-arpa between 209/210,211/212 etc.
[22:07:07] line 213,216,219,222
[22:09:52] wtf is the telnet esc sequence
[22:11:13] lol
[22:11:33] telnet...old.
[22:11:39] and the one i thought it was, it was, but isn't working
[22:11:41] awesome
[22:11:47] apergos: so is https://gerrit.wikimedia.org/r/#/c/22941/ ok to you?
[22:12:04] ctrl+]?
[22:12:12] PROBLEM - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is CRITICAL: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2688*
[22:12:22] just a sec
[22:12:50] I thought about just re-using commons, but that's evil
[22:12:57] Damianz: yea i tried that, seems it's just not working when i do it
[22:13:07] i think it's cuz poolcounter is technically still outputting from my show command
[22:13:11] even though it's not scrolling, oh well
[22:13:26] apergos: also, how many files are in math/ ?
[22:13:29] Yeah... telnet sorta sucks like that
[22:13:33] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[22:13:33] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[22:13:46] math-render is ok I guess,
[22:13:50] lemme look at the last bit
[22:14:30] yeah that seems ok
[22:14:49] * apergos thinks about this for a minute
[22:16:03] we'll have to migrate the contents to the container(s) but also I think rewrite.py is going to need to be fixed up, it's not set to handle anything outside of thumbs, temp, orig
[22:16:19] lemme look at the file count in math, for your other question
[22:16:31] apergos: sure, rewrite will need to change after migration
[22:16:42] PROBLEM - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is CRITICAL: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2588*
[22:16:50] I'd like for us to strengthen the rewrite regex too
[22:16:54] good
[22:16:59] similar to the squid ones
[22:17:04] well after the migration but before the config goes out
[22:18:02] waiting for the count to come back
[22:21:21] PROBLEM - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is CRITICAL: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2575*
[22:24:21] PROBLEM - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is CRITICAL: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2638*
[22:27:21] RECOVERY - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is OK: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y OK - 2400
[22:36:30] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[22:36:30] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[23:14:37] New patchset: Catrope; "Set $wgForceUIAsContentMsg for bewikimedia, per request on IRC" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23015
[23:17:28] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23015
[23:21:31] New patchset: Catrope; "...and anonnotice too" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23018
[23:21:49] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23018
[23:33:48] RECOVERY - poolcounter on helium is OK: PROCS OK: 1 process with command name poolcounterd
[23:46:12] PROBLEM - poolcounter on ersch is CRITICAL: PROCS CRITICAL: 0 processes with command name poolcounterd
[23:52:30] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours
[23:53:33] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours
[23:58:30] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours
[23:58:30] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours
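The poolcounter alerts above are plain process-count checks, so they can be reproduced by hand on the affected host. A minimal sketch; the plugin path is the usual Debian location rather than anything confirmed in this log:

```bash
# Is poolcounterd running at all?
pgrep -fl poolcounterd || echo "poolcounterd is not running"

# Roughly what the monitoring check does: expect exactly one poolcounterd process.
/usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 -C poolcounterd
```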