[00:00:59] * andrewbogott struggles with the notion that proxy == load balancer [00:01:04] heh [00:01:06] I guess I understand why they're the same, almost [00:01:28] Since a balancer has to redirect traffic transparently which I guess is what a proxy does. [00:01:36] *light bulb* [00:01:39] yep [00:02:18] though a reverse proxy doesn't necessarily need to load balance [00:06:32] oh, hey, the code *is* released! [00:06:32] https://github.com/rackspace/atlas-lb [00:07:23] Change on 12mediawiki a page Wikimedia Labs/Reverse proxy for web services was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=509903 edit summary: /* Alternatives to implementing this ourselves */ [00:09:34] no haproxy adapter :( [00:21:28] Change on 12mediawiki a page Wikimedia Labs/Reverse proxy for web services was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=509920 edit summary: [00:21:35] screw atlas [00:21:43] I think we should implement this [00:22:28] https://www.mediawiki.org/wiki/Wikimedia_Labs/Reverse_proxy_for_web_services#Other_suggestions <— we can switch pybal from twisted to eventlet, and add an openstack-like API [00:22:56] Is the API they define reasonable? Reimplementing an existing API would be slightly nicer than making a new one [00:23:01] true [00:23:22] yes, though LVS wouldn't support a lot of it [00:23:22] Or just writing a custom backend for Atlas [00:23:57] we need to update pybal for IPv6 [00:24:14] and fundraising would like to be able to pool/depool via API [00:24:33] we wouldn't want to use atlas in production [00:24:49] Just on account of it being java? 
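The pool/depool-via-API idea mentioned above could sit on top of a small piece of state along these lines — a hypothetical Python sketch, not pybal's actual code (the class and method names here are mine):

```python
# Hypothetical sketch (not real pybal code): an in-memory pool of backend
# servers with pool/depool operations, the kind of state an OpenStack-like
# HTTP API for a load balancer would manipulate.

class ServicePool:
    """Tracks which backend servers are eligible to receive traffic."""

    def __init__(self, servers):
        # Map hostname -> pooled flag; everything starts pooled.
        self.servers = {host: True for host in servers}

    def depool(self, host):
        if host not in self.servers:
            raise KeyError("unknown server: %s" % host)
        self.servers[host] = False

    def pool(self, host):
        if host not in self.servers:
            raise KeyError("unknown server: %s" % host)
        self.servers[host] = True

    def active(self):
        """Return the sorted list of currently pooled servers."""
        return sorted(h for h, pooled in self.servers.items() if pooled)


pool = ServicePool(["web1", "web2", "web3"])
pool.depool("web2")
print(pool.active())  # ['web1', 'web3']
```

An HTTP layer (eventlet or otherwise) would just translate API calls into `pool()`/`depool()` operations on an object like this.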
[00:24:54] yep [00:24:57] also, pybal supports bgp [00:25:23] so, it can advertise service IP addresses to the routers, and if an LVS server dies, the router automatically moves the traffic [00:26:28] RECOVERY Free ram is now: OK on bastion-prod1 bastion-prod1 output: OK: 90% free memory [00:27:48] RECOVERY Total Processes is now: OK on bastion-prod1 bastion-prod1 output: PROCS OK: 80 processes [00:28:38] RECOVERY dpkg-check is now: OK on bastion-prod1 bastion-prod1 output: All packages OK [00:28:57] the initial modifications (moving from twisted to eventlet) shouldn't be major [00:29:14] the default way that pybal configures itself is via files [00:29:18] RECOVERY Current Load is now: OK on bastion-prod1 bastion-prod1 output: OK - load average: 0.05, 0.20, 0.23 [00:29:34] could probably abstract that, make it driver based, and make it pull that from the database [00:29:58] RECOVERY Current Users is now: OK on bastion-prod1 bastion-prod1 output: USERS OK - 0 users currently logged in [00:30:28] andrewbogott: http://svn.wikimedia.org/viewvc/mediawiki/trunk/pybal/ [00:30:38] RECOVERY Disk Space is now: OK on bastion-prod1 bastion-prod1 output: DISK OK [00:34:01] Ryan_Lane: That code hasn't been touched in years because it works perfectly, or because it was replaced with something else? [00:34:15] hm [00:34:20] I know it's been touched since then [00:34:24] I wonder if that's the wrong spot [00:34:56] hm 0.1+r74215 [00:35:24] in general it works [00:35:32] and hasn't needed many modifications [00:35:47] it for sure needs modifications for IPv6, though [00:35:54] so, yeah, that's the right code [00:37:15] ah. crap [00:37:21] the bgp code is written as twisted too [00:37:56] http://svn.wikimedia.org/viewvc/mediawiki/trunk/routing/twistedbgp/ [00:38:50] though, really, that's just the way the code is controlled [01:16:43] cool, a bastion-prod1 instance? [01:16:58] well, I'm going to separate users [01:17:09] ? 
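The "abstract that, make it driver based" idea from this exchange could look roughly like the following — an illustrative sketch, not pybal's real classes: the balancer only talks to a driver interface, and whether the server list comes from a file or a database is a backend detail.

```python
# Hypothetical driver-based config loading, as discussed above.
# Names are illustrative; pybal's actual configuration code differs.

class ConfigDriver:
    """Interface: return the list of backend servers for a service."""
    def get_servers(self, service):
        raise NotImplementedError


class FileConfigDriver(ConfigDriver):
    """Current-style source: one hostname per line in <path>/<service>."""
    def __init__(self, path):
        self.path = path

    def get_servers(self, service):
        with open("%s/%s" % (self.path, service)) as f:
            return [line.strip() for line in f if line.strip()]


class DictConfigDriver(ConfigDriver):
    """Stand-in for a database-backed driver (a dict plays the DB here)."""
    def __init__(self, table):
        self.table = table

    def get_servers(self, service):
        return list(self.table.get(service, []))


def load_balancer_config(driver, service):
    # The rest of the balancer never knows where the list came from.
    return driver.get_servers(service)
```

Swapping the file driver for a database driver then requires no change to the balancer itself.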
[01:17:13] it still won't be allowed to access production [01:18:19] Separate users as in dividing people who work to make code into production and those that don't? [01:18:48] no. people who have access to production, and those that don't [01:18:53] to prevent escalation attacks [01:19:01] It's strictly for security reasons [01:19:09] okay... [01:19:45] but isn't production denying our public key? [01:19:54] or whichever key [01:19:56] Yours, yes [01:19:58] production is denying ssh access completely [01:19:59] But it accepts mine [01:20:04] from labs [01:20:16] hmm [01:20:25] If I were to forward my key to bastion , then a root on bastion can "steal" my key [01:20:45] well, it can get access to your agent [01:20:51] it can't actually steal the real key :) [01:20:54] They can take over the agent process, forward it somewhere else, then use it to log into prod [01:20:59] Yeah that's why I said steal in quotes [01:21:04] * Ryan_Lane nods [01:21:06] lol [01:21:06] They can use your key through your agent process on bastion [01:21:14] right [01:21:25] So in theory, people have a separate key for production access so this can't happen [01:21:32] In practice, people may accidentally forward both keys [01:21:44] I should probably name that something else [01:21:51] looks reasonable [01:22:05] Or the production setup may be buggy in that it still accepts some people's old keys for certain purposes *cough* [01:22:06] bastion-limited, bastion-restricted? [01:22:21] it shouldn't [01:22:26] we use ensure => absent [01:22:27] Anyway, the solution is to have separate bastion VMs for people with prod login and people without [01:22:48] and how is 1 instance going to do that? 
[01:22:50] Ryan_Lane: Taking that subtopic to the private chan [01:23:06] yeah, logged channel here [01:24:25] So yeah the theory is that if bastion-restricted has all people with prod keys and only those people, root escalation on the bastion VM isn't enough [01:24:47] Because the bastion VM that the attacker is using is not the same VM that all the users with prod keys are using [01:25:00] So you'd actually have to have an escalation cross-VM, which is less likely [01:25:27] Of course people should *still* take care not to forward their prod key, but this offers additional protection [01:25:28] so we are going to be denied access to that instance? [01:25:40] Yes, and vice versa [01:26:06] People without prod access can't log into bastion-restricted, or we'd risk them stealing our keys [01:26:17] People with prod access can't log into the "normal" bastion host, or they'd risk getting their key stolen [01:26:30] okay [01:26:35] too far-fetched [01:26:45] but in the name of security, it's nothing [01:27:07] well, it's not a large inconvenience, and it adds an actual protective measure [01:27:58] Ryan_Lane: Do we also have a way to disallow agent forwarding within labs? [01:28:05] Or out of the bastion hosts specifically [01:28:14] no, and we don't always want to [01:28:21] people should really know better, though [01:28:25] Sure [01:28:29] PROBLEM host: bastion-prod1 is DOWN address: bastion-prod1 CRITICAL - Host Unreachable (bastion-prod1) [01:28:42] You shouldn't ever need to, in theory, always go through bastion [01:28:55] well, unless you need to hit gerrit [01:29:04] As yourself? 
[01:29:12] Hmm, right [01:29:16] though, I think we should have people clone anonymously there [01:29:24] and git pull from there to their local system [01:29:29] then push from local [01:29:57] Yeah there's no reason to check out the puppet manifests on the VM, you can't run them from there anyway [01:30:10] You're gonna have to check in your puppet changes in order for them to be run on your VM [01:30:17] actually, we plan on changing that [01:30:22] Oh? [01:30:29] yes. branch-per-project [01:30:35] Well KO [01:30:36] you'll make your changes locally, and run them locally [01:30:43] Huh, OK' [01:30:59] So if the changes literally only exist in a git clone locally on the VM, they'll still be run? [01:31:07] yep [01:31:12] They don't need to be in gerrit or in the central repo at all? [01:31:14] That's cool [01:31:17] yeah [01:31:20] that's the goal [01:31:22] In that case it totally makes sense to clone&push from the VM [01:31:40] But people should be told to just create a new private key for labs, I know of people that have already done this [01:31:41] hm. I see why Special:NovaAddress is so freaking slow [01:31:50] I'm doing *so many* ldap lookups [01:32:11] yeah, we tell people that too [01:32:31] but, it isn't totally necessary to forward your agent anyway [01:32:57] keep your repo on the local system, make changes on the instances, pull from the instances to local, push to gerrit [01:33:08] git is decentralized [01:33:31] you don't *have* to push in from the instances [01:33:56] PROBLEM Total Processes is now: CRITICAL on bastion-restricted1 bastion-restricted1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [01:34:17] Well yeah, but pulling from the VM to local is inconvenient [01:34:22] Especially since the VM usually won't have a public IP [01:34:23] it's just an alias [01:34:28] proxycommand [01:34:36] PROBLEM dpkg-check is now: CRITICAL on bastion-restricted1 bastion-restricted1 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
[01:34:37] or keep your repo on bastion [01:34:45] we have 300GB of storage there [01:34:59] Sure, there are ways to do it [01:35:06] plus, the home directory share there will be 50GB [01:35:09] and can be increased [01:35:14] I'm just saying that half our target audience drops out as soon as you say "proxycommand" [01:35:16] PROBLEM Current Load is now: CRITICAL on bastion-restricted1 bastion-restricted1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [01:35:18] yeah [01:35:22] Is /home available on bastion and the VMs? [01:35:30] no [01:35:52] So how does the home dir stuff work? [01:35:53] so, git pull ; git review [01:36:06] PROBLEM Current Users is now: CRITICAL on bastion-restricted1 bastion-restricted1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [01:36:46] PROBLEM Disk Space is now: CRITICAL on bastion-restricted1 bastion-restricted1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [01:37:26] PROBLEM Free ram is now: CRITICAL on bastion-restricted1 bastion-restricted1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [01:51:47] New patchset: Ryan Lane; "Up version of user-management tools" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3102 [01:51:58] New patchset: Ryan Lane; "Adding support to restrict instances by puppet variable" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3103 [01:52:09] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3102 [01:52:09] New review: gerrit2; "Lint check passed." 
[operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3103 [01:52:37] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3102 [01:52:39] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3102 [01:53:02] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3103 [01:53:04] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3103 [01:56:26] RECOVERY Free ram is now: OK on mobile-enwp mobile-enwp output: OK: 37% free memory [02:00:10] New patchset: Ryan Lane; "We need restricted_to and restricted_from" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3104 [02:00:16] RECOVERY Current Load is now: OK on bastion-restricted1 bastion-restricted1 output: OK - load average: 0.24, 0.08, 0.09 [02:00:20] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3104 [02:00:25] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3104 [02:00:27] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3104 [02:01:13] RECOVERY Current Users is now: OK on bastion-restricted1 bastion-restricted1 output: USERS OK - 1 users currently logged in [02:01:14] !log bastion restricted login on bastion1 to people in the bastion group, and am disallowing ops [02:01:16] Logged the message, Master [02:01:30] !log bastion restricted login on bastion-restricted1 to people in the ops group [02:01:31] Logged the message, Master [02:01:43] RECOVERY Disk Space is now: OK on bastion-restricted1 bastion-restricted1 output: DISK OK [02:02:28] hm [02:02:28] wtf [02:02:33] RECOVERY Free ram is now: OK on bastion-restricted1 bastion-restricted1 output: OK: 92% free memory [02:02:56] seems I just broke bastion [02:03:44] hm. dns disappeared for a bit. 
that's weird [02:03:50] probably the office network [02:03:52] yeah [02:03:53] RECOVERY Total Processes is now: OK on bastion-restricted1 bastion-restricted1 output: PROCS OK: 82 processes [02:04:04] http access seems to be down for me wikimedia-wide [02:04:12] odd [02:04:33] RECOVERY dpkg-check is now: OK on bastion-restricted1 bastion-restricted1 output: All packages OK [02:04:33] works for me [02:05:23] yeah, back up [02:05:27] some small hiccup [02:14:33] PROBLEM Free ram is now: WARNING on mobile-enwp mobile-enwp output: Warning: 18% free memory [02:28:00] New patchset: Ryan Lane; "Adding requirement for pam_access" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3105 [02:28:11] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3105 [02:28:19] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3105 [02:28:21] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3105 [02:39:33] RECOVERY Free ram is now: OK on mobile-enwp mobile-enwp output: OK: 21% free memory [03:12:11] !log dumps Deleted instance dumps-nfs1 to make way for gluster storage [03:12:12] Logged the message, Master [03:17:23] PROBLEM host: dumps-nfs1 is DOWN address: dumps-nfs1 check_ping: Invalid hostname/address - dumps-nfs1 [04:46:24] PROBLEM Free ram is now: WARNING on mobile-enwp mobile-enwp output: Warning: 17% free memory [06:06:24] RECOVERY Free ram is now: OK on mobile-enwp mobile-enwp output: OK: 26% free memory [06:19:24] PROBLEM Free ram is now: WARNING on mobile-enwp mobile-enwp output: Warning: 18% free memory [06:42:45] PROBLEM Current Load is now: WARNING on bots-sql3 bots-sql3 output: WARNING - load average: 8.26, 8.56, 6.39 [06:44:51] RECOVERY Free ram is now: OK on mobile-enwp mobile-enwp output: OK: 25% free memory [06:51:41] RECOVERY Disk Space is now: OK on aggregator1 aggregator1 output: DISK OK [06:59:41] PROBLEM 
Disk Space is now: WARNING on aggregator1 aggregator1 output: DISK WARNING - free space: / 540 MB (5% inode=94%): [07:02:41] RECOVERY Current Load is now: OK on bots-sql3 bots-sql3 output: OK - load average: 0.56, 2.59, 4.86 [07:24:41] PROBLEM Disk Space is now: CRITICAL on aggregator1 aggregator1 output: DISK CRITICAL - free space: / 284 MB (2% inode=94%): [07:56:52] hi [08:07:07] hi [08:08:28] i wanted to know if i can use the labs to run a program that requires 6GB RAM. [08:08:55] whym: yes [08:09:07] but it should be related to wikimedia heh [08:09:19] petan|wk: and can it be a long-standing daemon? [08:09:27] I think so [08:09:46] what is it for? [08:09:52] https://github.com/whym/RevDiffSearch [08:10:05] isn't it possible to use some cache on disk? [08:10:07] it is a search engine over all diffs [08:10:10] to avoid using so much ram [08:10:23] ok but why it needs 6gb? [08:10:37] it sounds like it is poorly optimized :P [08:10:46] what is it written in? [08:10:55] basically it's a tradeoff between search speed and ram consumption [08:10:57] I think it shouldn't be problem to run it [08:11:28] yeah i'm aware it's not best optimized [08:11:30] Ryan_Lane: ^ [08:11:51] whym: there is a lot of disk space, ram is a bit of problem afaik [08:12:09] is this a tool? [08:12:12] but if it's useful for wikimedia somehow it would be probably ok [08:12:26] petan|wk: maybe i can tweak it to slow it down and use less ram [08:13:03] what will it be used for? [08:13:15] ah. for wikihadoop? [08:13:39] Ryan_Lane: it is for mainly analytics work [08:13:55] have you worked with our analytics team at all? [08:14:05] right now it's like a really bad idea to launch a 6GB instance [08:14:15] Ryan_Lane: i've been working with Diederik van Liere [08:14:19] * Ryan_Lane nods [08:14:34] we need to expand hardware [08:14:42] when we do that it'll be fine to launch larger instances [08:15:05] ok [08:15:18] does it work to make multiple smaller instances? 
[08:15:24] or is one larger one needed? [08:15:42] Ryan_Lane: one larger one in the current implementation [08:15:51] PROBLEM Free ram is now: WARNING on mobile-enwp mobile-enwp output: Warning: 19% free memory [08:16:00] large instances have 8GB of RAM, xlarge instances have 16 [08:16:21] when we switch over to the cisco hardware, each hardware node will have 200GB or so of ram [08:17:18] maybe i can start with a limited experiment with smaller wikis than enwiki [08:17:52] so that i'll be able to quickly catch up when the hardware is expanded [08:20:16] i've been registered to the labs a while ago but don't have access to creating instances [08:20:23] can i request it here? [08:23:01] well, you should likely be added to an analytics project [08:23:10] you should get drdee to do so [08:24:48] Ryan_Lane: is it possible to restrict sudo? [08:25:07] I was thinking about that today [08:25:12] ok [08:25:23] I restricted access to the bastion nodes [08:25:24] Ryan_Lane: i'll ask him, thanks [08:25:36] I may be able to handle sudo the same way [08:25:51] RECOVERY Free ram is now: OK on mobile-enwp mobile-enwp output: OK: 22% free memory [08:25:51] did you read my email on wikitech [08:25:57] eh [08:25:59] huggle [08:26:01] that one [08:26:20] we have a problem with login to SUL [08:26:30] eh? [08:26:38] is it problem if our application ask users for their name and password? [08:26:50] no [08:26:51] because web application of huggle will need to do that [08:26:59] name and password of SUL [08:27:04] production [08:27:16] on toolserver it's not allowed [08:27:25] Krinkle told me it's not allowed on labs too [08:27:32] correct [08:27:37] it isn't [08:27:42] aha [08:27:45] so how can we do that [08:27:53] I need to put that into the terms of use, actually [08:28:16] right, it's ok, but I need to know how we can sort it out [08:29:03] we need oauth/openid on wikis [08:29:13] ok, is it going to happen soon? 
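Going back to the RAM question above: the on-disk cache suggested for RevDiffSearch can be illustrated with Python's standard `shelve` module — a generic sketch of the RAM-vs-disk tradeoff, not the tool's actual design:

```python
# Generic illustration of trading RAM for disk: keep a big index in an
# on-disk key/value store instead of an in-memory dict. Lookups become
# slower (they hit disk) but the resident memory stays small.
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "diffindex")
index = shelve.open(path)            # dict-like mapping, backed by a file
index["rev:1234"] = ["added foo", "removed bar"]  # written to disk
index.sync()                         # flush pending writes
assert index["rev:1234"] == ["added foo", "removed bar"]
index.close()
```

Whether that query-speed hit is acceptable is exactly the tradeoff described in the conversation.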
[08:29:16] or, just development of huggle can be in labs [08:29:22] because it's quite a blocker for us [08:29:34] development of huggle is in labs [08:29:54] but we need to be able to login people to SUL :) [08:30:05] it requires access to SUL? [08:30:13] huggle is editing wiki sites [08:30:26] in order to edit them you need to be logged in [08:30:28] yeah, can't it authenticate to bets? [08:30:30] *beta [08:30:34] yes [08:30:42] that would be fine [08:30:54] authentication to production SUL would not be [08:30:55] ok, but our goal is to enable it for production, not beta [08:31:07] there is no point in developing it if it's not going to work [08:31:08] that version shouldn't live in labs [08:31:17] ok, where is can live [08:31:19] * it [08:31:30] hm. going to need to figure that out [08:31:56] would it be possible to install oauth on prod? [08:32:08] it needs to be written [08:32:11] ah [08:32:14] openid as a provider is likely easier [08:32:36] but it doesn't really solve the issue [08:32:42] we really need oauth [08:32:53] I've been saying this for a while [08:32:59] I had an idea [08:33:05] there's no major push behind it [08:33:14] to create a restricted project on labs, where people from ops would have access to [08:33:21] production would live there [08:33:22] Hmm, does MobileFrontend slow down performance significantly? [08:33:29] the version which ask for password etc [08:33:47] people from ops would be able to review source and updates before updating sw [08:33:56] so that it should be secure [08:34:28] petan|wk: we'll need to figure this out [08:34:34] hm... 
[08:34:41] oauth would solve this problem a lot better :) [08:34:53] yes, but I think it's going to take years for it to happen [08:35:04] proposal is 2 years old [08:35:09] still no progress [08:35:37] well, we'll need to figure something out [08:35:59] in fact I don't see much problem on application which ask for a password, as long as it's open source and is managed by trustworth people [08:36:18] problem is if it lives on instance where everyone has root [08:36:34] yes [08:36:57] so if it could be managed by some "approved" people only, it shouldn't be security issue [08:36:57] well, either way, we aren't really ready for that right now [08:37:09] preferably by people from ops [08:37:17] hm... [08:38:18] I will use my personal server for hosting then... I hope it's on going to have a big load [08:38:35] it would be cool to move it to labs in future once it's solved [08:38:44] we can do development there now [08:39:29] that won't be ok either [08:39:33] we'll end up blocking it [08:39:43] huh? [08:40:22] this isn't the first time an app like this has been blocked [08:40:35] it's the thing that prompted the first discussion of oauth [08:40:45] huggle already ask people for login data [08:40:54] it's being used for 7 years [08:40:55] it's a locally installed app right now [08:41:04] once it's a web app it's different [08:41:12] the credentials need to be stored somewhere [08:41:15] yes but this local app could have a password logger if developers were evil [08:41:26] which makes it a place where credentials can be stolen en masse [08:41:45] Ryan_Lane: we don't need to store credentials [08:41:50] we just pass it to login api [08:41:56] then we store the session data only [08:42:03] like the current app does it [08:42:06] lemme find the old thread [08:42:11] hm... 
[08:42:30] I don't see a way to steal any credentials from there [08:42:58] only problem is if devs would steal them by changing the code, but that could happen even now [08:43:09] christ, how do I search wikitech-l archives? [08:43:15] huggle is being used for long time, and no account was compromised so far [08:43:25] Ryan_Lane: google [08:43:38] wikitech-l isn't indexed [08:44:30] god damn it, I wish they'd let crawlers index it [08:44:58] http://www.gossamer-threads.com/lists/wiki/wikitech/172528?search_string=oauth;#172528 [08:45:11] ko [08:45:13] ok [08:45:42] actually: http://www.gossamer-threads.com/lists/wiki/wikitech/172478 [08:46:12] problem is in this thread is same as we have now with app [08:46:21] people can't trust the provider [08:46:27] yes/no [08:46:43] with a downloaded application you can see exactly what it's doing [08:46:47] you can't with a website [08:46:49] no [08:46:53] you can't see it, it's a binary [08:47:02] you can download the source, and compile it [08:47:11] you can watch your network traffic [08:47:12] how you can be sure the binary is same as source [08:47:15] you can stick it into a debugger [08:47:28] ok, so it's a problem of trustworth [08:47:38] not a problem with app itself [08:47:39] if you compile the binary from the source, you know it's right because you trust the compiler [08:47:43] yes [08:47:54] so if it was running on restricted instance in labs, it should be ok [08:48:07] maybe. 
like I said, we aren't set up for this [08:48:09] because only selected people would have access to it [08:48:23] but, if it's run externally, we'll almost definitely have to block it [08:48:46] right, I know a lot of stuff we can't do, tell me what we can do [08:48:58] give me some time to figure it out [08:49:12] I'll ask internally how we can handle this [08:49:16] ok [08:50:30] well, I'm off to bed [08:51:14] * Ryan_Lane waves [09:16:51] PROBLEM Free ram is now: WARNING on mobile-enwp mobile-enwp output: Warning: 18% free memory [10:22:30] . [10:22:36] @search log [10:22:36] Results (found 9): morebots, labs-morebots, credentials, logging, terminology, newgrp, initial-login, requests, hyperon, [10:41:18] Hydriz [10:41:23] what is rc for incubator name [10:41:29] ? [10:41:38] you mean irc channel on irc.wm? [10:42:00] if thats the case, then #incubator.wikimedia [10:42:22] ok [10:46:14] @regsearch [10:46:14] Could you please tell me what I should search for :P [10:46:16] @search [10:46:16] Could you please tell me what I should search for :P [10:46:18] !ping [10:46:18] pong [10:46:20] lag [10:46:23] ok [10:46:52] @ [10:46:55] @recentchanges+ [10:46:55] Invalid wiki [10:46:59] @recentchanges+ incubator_wiki [10:47:00] Wiki inserted [10:47:21] should work [10:47:52] What wikis are actually defined? [10:47:57] lot of [10:48:09] no list [10:48:14] at least not publicly available [10:48:29] but not all? [10:48:32] no [10:48:36] :( [10:48:39] I was lazy to define all [10:48:43] if you want you can [10:49:16] yay works [10:49:24] okay, how to do that? [10:49:26] syntax: [10:49:42] #blah.wikipedia|http://gfdgs.wikipedia.org/w/index.php?diff=|blah_wikipedia [10:49:52] just make a list of all wikis and put it to pastebin [10:49:57] there is 800+ wikis [10:50:11] lol [10:50:25] but when its on? [10:50:27] can we do it? [10:50:34] if you make it I will update it [10:50:44] :( [10:50:49] what you mean? 
[10:50:58] I don't understand you [10:51:05] Like, now wm-bot is on, is there any syntax to just insert it? [10:51:08] no [10:51:11] sigh [10:51:12] security reason [10:51:20] it needs to be inserted to configs [10:51:22] nevermind then :P [10:51:24] ok [10:51:38] Then... [10:51:47] Can we also enable the bot to monitor deletions? [10:51:53] yes [10:51:59] or maybe monitor pages in a namespace [10:52:08] it can do that [10:52:12] how? [10:52:14] just type @RC+ wiki Page [10:52:25] @RC+ en_wikipedia A* [10:52:25] Inserted new item to feed of changes [10:52:45] okay :) [10:52:45] many pages starting with a:P [10:52:51] Change on 12en_wikipedia a page Anne McCaffrey bibliography was modified, changed by SGBailey link https://en.wikipedia.org/w/index.php?diff=481662120 edit summary: /* Dragonriders of Pern series */ Fmt of Alternate dragonseye - get rid of blank lines. [10:52:55] here we go [10:52:55] Change on 12en_wikipedia a page Amtshainersdorf railway station was modified, changed by Slambo link https://en.wikipedia.org/w/index.php?diff=481662125 edit summary: fix punctuation around footnote markers per [[WP:REFPUNCT]]; move language icon out of link title [10:53:02] @RC- en_wikipedia A* [10:53:02] Deleted item from feed [10:53:11] that's main space [10:53:33] then it automatically announces deletions? 
[10:53:40] nah [10:53:43] that's not working yet [10:53:47] oh [10:53:52] so far it check the edits only [10:54:04] hopes its implemented :P [10:54:07] no [10:57:28] @channellist [10:57:28] I am now in following channels: #huggle, #wikimedia-dev, #mediawiki-move, #wikimedia-tech, #wm-bot, #wikimedia-labs, #wikimedia-operations, ##matthewrbowker, ##matthewrbot, #wikipedia-zh-help, #wikimedia-toolserver, ##Alpha_Quadrant, #wikimedia-mobile, #mediawiki, #wikipedia-cs, #wikipedia-cs-rc, #wiki-hurricanes-zh, #wikinews-zh, #wikipedia-zh-helpers, #wikipedia-en-afc, ##thesecretlair, ##addshore, #wikipedia-bag, #wikimedia-incubator, [11:02:39] !ping [11:02:39] pong [11:20:51] PROBLEM Free ram is now: WARNING on mobile-enwp mobile-enwp output: Warning: 17% free memory [11:31:43] PROBLEM Current Load is now: WARNING on mobile-enwp mobile-enwp output: WARNING - load average: 5.09, 5.61, 5.08 [11:56:53] PROBLEM Current Load is now: CRITICAL on mobile-enwp mobile-enwp output: CHECK_NRPE: Socket timeout after 10 seconds. 
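The feed-definition syntax quoted above (`#channel|diff-url-prefix|wiki_name`) is simple to parse — a hypothetical helper for that line format, not wm-bot's actual code:

```python
# Parse a wm-bot-style feed definition line of the form
#   #channel|diff-url-prefix|wiki_name
# The format comes from the conversation above; the function is illustrative.

def parse_feed_line(line):
    """Split a '#channel|url-prefix|wiki' definition into its parts."""
    channel, diff_url, wiki = line.strip().split("|")
    return {"channel": channel, "diff_url": diff_url, "wiki": wiki}


entry = parse_feed_line(
    "#blah.wikipedia|http://gfdgs.wikipedia.org/w/index.php?diff=|blah_wikipedia")
# entry["wiki"] is "blah_wikipedia"
```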
[12:01:53] PROBLEM Current Load is now: WARNING on mobile-enwp mobile-enwp output: WARNING - load average: 7.65, 7.11, 6.71 [12:25:53] RECOVERY Free ram is now: OK on mobile-enwp mobile-enwp output: OK: 27% free memory [12:28:23] RECOVERY SSH is now: OK on deployment-webs1 deployment-webs1 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [12:31:33] RECOVERY Current Users is now: OK on deployment-webs1 deployment-webs1 output: USERS OK - 0 users currently logged in [12:31:33] RECOVERY Current Load is now: OK on deployment-webs1 deployment-webs1 output: OK - load average: 0.06, 0.16, 0.08 [12:31:33] RECOVERY Total Processes is now: OK on deployment-webs1 deployment-webs1 output: PROCS OK: 112 processes [12:31:38] RECOVERY dpkg-check is now: OK on deployment-webs1 deployment-webs1 output: All packages OK [12:32:43] RECOVERY Disk Space is now: OK on deployment-webs1 deployment-webs1 output: DISK OK [12:32:43] RECOVERY Free ram is now: OK on deployment-webs1 deployment-webs1 output: OK: 90% free memory [12:38:53] PROBLEM Free ram is now: WARNING on mobile-enwp mobile-enwp output: Warning: 16% free memory [13:08:53] RECOVERY Free ram is now: OK on mobile-enwp mobile-enwp output: OK: 21% free memory [13:56:43] PROBLEM Current Load is now: WARNING on mobile-enwp mobile-enwp output: WARNING - load average: 5.86, 5.97, 6.02 [16:16:53] PROBLEM Current Load is now: CRITICAL on mobile-enwp mobile-enwp output: CHECK_NRPE: Socket timeout after 10 seconds. [16:21:44] PROBLEM Current Load is now: WARNING on mobile-enwp mobile-enwp output: WARNING - load average: 6.18, 6.59, 6.63 [17:26:46] can anyone get into bastion.wmflabs.org? [17:31:53] PROBLEM Current Load is now: CRITICAL on mobile-enwp mobile-enwp output: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:41:12] 03/13/2012 - 17:41:12 - Creating a home directory for midom at /export/home/bastion/midom [17:42:13] 03/13/2012 - 17:42:12 - Updating keys for midom [17:42:13] 03/13/2012 - 17:42:12 - Creating a home directory for ariel at /export/home/bastion/ariel [17:43:13] 03/13/2012 - 17:43:12 - Updating keys for ariel [18:11:29] 03/13/2012 - 18:11:29 - Creating a project directory for publicdata [18:11:29] 03/13/2012 - 18:11:29 - Creating a home directory for ariel at /export/home/publicdata/ariel [18:12:29] 03/13/2012 - 18:12:29 - Updating keys for ariel [18:27:33] PROBLEM dpkg-check is now: CRITICAL on bots-cb bots-cb output: CHECK_NRPE: Socket timeout after 10 seconds. [18:29:53] PROBLEM Current Load is now: CRITICAL on bots-cb bots-cb output: CHECK_NRPE: Socket timeout after 10 seconds. [18:30:13] PROBLEM Total Processes is now: CRITICAL on bots-cb bots-cb output: CHECK_NRPE: Socket timeout after 10 seconds. [18:31:03] PROBLEM Free ram is now: CRITICAL on bots-cb bots-cb output: CHECK_NRPE: Socket timeout after 10 seconds. [18:32:13] PROBLEM Disk Space is now: CRITICAL on bots-cb bots-cb output: CHECK_NRPE: Socket timeout after 10 seconds. [18:37:06] RECOVERY Disk Space is now: OK on bots-cb bots-cb output: DISK OK [18:37:23] RECOVERY dpkg-check is now: OK on bots-cb bots-cb output: All packages OK [18:37:57] !instances | apergos [18:37:57] apergos: https://labsconsole.wikimedia.org/wiki/Help:Instances [18:37:59] !security [18:37:59] https://labsconsole.wikimedia.org/wiki/Help:Security_Groups [18:38:33] likely won't need security groups for instances in this project [18:40:03] RECOVERY Total Processes is now: OK on bots-cb bots-cb output: PROCS OK: 110 processes [18:40:32] cute bot [18:40:53] RECOVERY Free ram is now: OK on bots-cb bots-cb output: OK: 71% free memory [18:41:15] what is the point of multiple instances for a project? 
basically I don't really understand the distinction between projects and instances [18:41:31] OK, say you're building a replica of the WMF cluster [18:41:41] Then you'd have a project called wmfclone [18:41:45] sure [18:42:03] Inside that project, you would have multiple instances (VMs), like wmfclone-apache1, wmfclone-squid1, wmfclone-mysql1 etc [18:42:10] deployment-prep has 10 instances now I think [18:42:22] They have Squid, Apache, MySQL, almost everything [18:42:32] except an nginx proxy for HTTPS IIRC [18:42:33] so an instance is some vm that has some of the features you want for the project? [18:42:39] An instance is a VM [18:42:47] and you group them together by calling them a project? [18:42:51] And a project may or may not involve having multiple VMs that work together [18:42:55] as well as I suppose managing access [18:42:58] Yes [18:42:59] and resources for them all [18:43:06] Access management is project-wide, as is resource management [18:43:17] And firewall rules also look at project, by default [18:43:22] ok [18:43:38] So by default every VM can talk only to VMs in the same project, and everything else is firewalled [18:44:16] ah [18:44:24] but labsconsole has a web interface where you can open ports for inter-project communication if you need it [18:44:44] great [18:44:47] So to quote Ryan, what we call "project" is really a "security group" [18:44:52] ok [18:44:58] I'll just think of it that way [18:44:59] thanks [18:45:52] RoanKattouw: let's not use security group for two things :) [18:46:02] a project is a security separation [18:46:10] a security group is a set of firewall rules [18:46:18] fucking EC2 and their shitty terminology [18:46:19] Oh, sorry [18:46:24] I misunderstood then [18:46:36] Wait that is really bad terminology [18:46:41] "group" should refer to people [18:47:03] ah, ec2 sticks it to us eh? 
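The project/instance/firewall relationship explained in this exchange can be summarized in a toy model — this is a sketch of the semantics described above, not the Nova API:

```python
# Toy model of the relationship explained above: a project groups
# instances, same-project traffic is allowed by default, and cross-project
# traffic is firewalled unless a port has been explicitly opened.

class Project:
    def __init__(self, name):
        self.name = name
        self.instances = []       # the VMs belonging to this project
        self.open_ports = set()   # ports opened to traffic from outside

    def add_instance(self, hostname):
        self.instances.append(hostname)

    def allows(self, source_project, port):
        """Is traffic from source_project allowed in on this port?"""
        return source_project.name == self.name or port in self.open_ports


wmfclone = Project("wmfclone")
wmfclone.add_instance("wmfclone-apache1")
wmfclone.add_instance("wmfclone-mysql1")

bots = Project("bots")
assert wmfclone.allows(wmfclone, 3306)   # intra-project: allowed by default
assert not wmfclone.allows(bots, 3306)   # cross-project: firewalled
wmfclone.open_ports.add(80)              # like opening a port via labsconsole
assert wmfclone.allows(bots, 80)
```

Access and resource management attach at the project level too, which is why "project" and (EC2's) "security group" get conflated in the conversation.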
[18:47:12] awesome [18:47:16] Well Nova is API-compatible with EC2, on purpose [18:47:26] sure, and that makes good sense [18:47:36] it just sucks that they couldn't have gotten that right for all of us [18:48:04] Yeah [18:53:55] yeah, we're stuck with the EC2 terminology, since it's the elephant [18:54:20] they didn't do a terrible job, but the terminology is a little confusing [18:54:33] they don't use projects, though, they use accounts, I think [18:54:43] PROBLEM Current Load is now: WARNING on bots-cb bots-cb output: WARNING - load average: 0.78, 3.60, 17.99 [18:54:45] I much prefer project to account, though [18:54:52] I think project came from nova [18:55:05] mm yeah account would be a disaster [19:19:43] RECOVERY Current Load is now: OK on bots-cb bots-cb output: OK - load average: 0.44, 0.59, 4.03 [19:27:43] PROBLEM host: robh-1 is DOWN address: robh-1 CRITICAL - Host Unreachable (robh-1) [19:30:02] 03/13/2012 - 19:30:02 - Updating keys for robh [19:30:03] 03/13/2012 - 19:30:03 - Updating keys for robh [19:30:05] 03/13/2012 - 19:30:05 - Updating keys for robh [19:30:12] 03/13/2012 - 19:30:11 - Updating keys for robh [19:30:16] 03/13/2012 - 19:30:16 - Updating keys for robh [19:31:47] PROBLEM Current Load is now: WARNING on mobile-enwp mobile-enwp output: WARNING - load average: 6.80, 6.16, 6.02 [19:34:12] PROBLEM Current Load is now: CRITICAL on robh1 robh1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:34:47] PROBLEM Current Users is now: CRITICAL on robh1 robh1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:35:27] PROBLEM Disk Space is now: CRITICAL on robh1 robh1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:36:17] PROBLEM Free ram is now: CRITICAL on robh1 robh1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:37:37] PROBLEM Total Processes is now: CRITICAL on robh1 robh1 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:38:27] PROBLEM dpkg-check is now: CRITICAL on robh1 robh1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [20:21:58] PROBLEM Current Load is now: CRITICAL on mobile-enwp mobile-enwp output: CHECK_NRPE: Socket timeout after 10 seconds. [20:26:57] PROBLEM Current Load is now: WARNING on mobile-enwp mobile-enwp output: WARNING - load average: 14.00, 10.93, 9.92 [22:06:57] PROBLEM Current Load is now: CRITICAL on mobile-enwp mobile-enwp output: CHECK_NRPE: Socket timeout after 10 seconds. [22:11:47] PROBLEM Current Load is now: WARNING on mobile-enwp mobile-enwp output: WARNING - load average: 8.86, 9.57, 9.17 [22:34:02] PROBLEM Current Users is now: CRITICAL on publicdata-administration publicdata-administration output: Connection refused by host [22:34:42] PROBLEM Disk Space is now: CRITICAL on publicdata-administration publicdata-administration output: Connection refused by host [22:35:22] PROBLEM Free ram is now: CRITICAL on publicdata-administration publicdata-administration output: Connection refused by host [22:35:43] is this sort of spam usual on instance creation? [22:36:52] PROBLEM Total Processes is now: CRITICAL on publicdata-administration publicdata-administration output: Connection refused by host [22:37:32] PROBLEM dpkg-check is now: CRITICAL on publicdata-administration publicdata-administration output: Connection refused by host [22:38:22] PROBLEM Current Load is now: CRITICAL on publicdata-administration publicdata-administration output: Connection refused by host [22:39:37] apergos: yes [22:39:49] apergos: because there's some issue with puppet and nrpe on first puppet run [22:39:55] and no one has bothered to fix it [22:40:18] ok so I don't need to worry at least [22:43:50] so I can't seem to ssh into the instance... is it just too soon? [22:44:09] actually lemme see if it still says "pending" on the page [22:44:23] no, it says running [22:44:35] Permission denied (publickey). 
[22:44:39] (from bastion-restricted) [22:54:26] lemme see [22:54:40] could be too early [22:54:49] check the instance's console log to see if puppet finished running [22:55:01] ok [22:55:04] Magic [22:55:29] 03/13/2012 - 22:55:28 - Creating a home directory for laner at /export/home/publicdata/laner [22:55:41] must. improve. interface. [22:55:48] Why do I always read laner as lamer.... [22:55:50] can't really tell [22:55:56] cause I've never seen a normal boot up [22:56:04] 99% [Working] Fetched 3691kB in 0s (8856kB/s) Failed to fetch http://ubuntu.wikimedia.org/ubuntu/pool/main/r/ruby1.8/libreadline-ruby1.8_1.8.7.249-2ubuntu0.1_amd64.deb Hash Sum mismatch E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing? [22:56:08] it died [22:56:09] that's the last stuff in the console output [22:56:10] delete/recreate [22:56:11] ah [22:56:25] there's some weird bug [22:56:26] what did I say? 10 times harder for me than everyone else? [22:56:28] 03/13/2012 - 22:56:28 - Updating keys for laner [22:56:31] holding true to form [22:56:34] this happens for everyone occasionally [22:56:48] yes but I have gotten *every* bug so far. [22:56:51] Will be more fun when we have dynamically building apt stuff... [22:56:52] true :) [22:56:52] you have to admit... [22:56:54] :-D [22:57:08] Also, the nova credentials bug is ANNOYING [22:57:16] og [22:57:18] err [22:57:20] er? [22:57:20] apergos: use a small [22:57:24] rather than a medium [22:57:27] ok [22:57:29] it seems to happen more often with mediums [22:57:36] ... [22:57:39] oh gee [22:57:40] I'm wondering if it's some memory issue with kvm [22:57:50] since we're basically running out of memory [22:58:07] hello [22:58:13] hyperon: howdy [22:58:17] Don't we have some new beefy nodes on the way? [22:58:20] should I start from scratch? [22:58:30] or is it repairable?
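The diagnosis above (check the instance's console output, look for the failed package fetch) is mechanical enough to script. A sketch, assuming you have saved the console output to hand (the sample text is inlined here for illustration; in practice it would come from labsconsole):

```python
# Sketch: scan a saved console log for the apt/cloud-init failure that
# means the instance's first boot died and it must be deleted/recreated.
# The sample log below reproduces the failure pasted in the channel.
SAMPLE_CONSOLE_LOG = """\
99% [Working] Fetched 3691kB in 0s (8856kB/s)
Failed to fetch http://ubuntu.wikimedia.org/ubuntu/pool/main/r/ruby1.8/libreadline-ruby1.8_1.8.7.249-2ubuntu0.1_amd64.deb Hash Sum mismatch
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
"""

FAILURE_MARKERS = ("Failed to fetch", "Hash Sum mismatch")

def first_boot_failed(console_log):
    """True if the console log shows a fatal package-fetch failure."""
    return any(marker in console_log for marker in FAILURE_MARKERS)

if first_boot_failed(SAMPLE_CONSOLE_LOG):
    print("first boot failed: delete and recreate the instance")
```

The marker strings are just the two errors seen in this particular log, not an exhaustive list of ways a first boot can die.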
[22:58:49] [22:56:10] delete/recreate [22:59:13] yay, two computers and one internet connection later [22:59:25] i am actually capable of contributing [22:59:27] yeah but I'm not clear what that's about [22:59:42] and he asked me to use a small so I wonder if that means I should start over [22:59:46] Yeah [22:59:59] Delete the instance and start again as it's not set up. [23:00:02] apergos: you have to [23:00:06] ok [23:00:09] once puppet fails an initial run it's fucked [23:00:14] actually, this one is worse than that [23:00:19] oh great :-/ [23:00:21] cloud init failed because it couldn't install ruby [23:00:35] Ryan_Lane: is the mysql openstack thing still a nondone [23:00:43] hyperon: yep [23:00:46] ok [23:01:02] uh oh [23:01:07] i guess y'all can yell at me for that... [23:01:20] * Damianz strings hyperon up and gets the darts out [23:01:27] I deleted it and it showed it as still running [23:01:32] and I get a chance to delete it again [23:01:47] I would accidentally it if I could figure out how :-P [23:01:49] hyperon: it's fine. we haven't started targeting that yet [23:02:03] so, if you finish it before we get to it, great, if not, we'll do it :) [23:02:13] apergos: that's not a bug [23:02:20] apergos: it takes a little while to delete instances [23:02:42] I guess it should show delete pending or something [23:02:44] you can create a new one while that one is waiting to die [23:02:45] oh okay, i guess...what do you mean by targeting? [23:02:55] I wanna give it the same name [23:02:57] apergos: yeah, openstack is a little silly there [23:02:59] you can [23:03:02] it deleted it from dns [23:03:04] ok [23:03:15] hyperon: it's not something we're ready to tackle yet [23:03:38] (we being staff) [23:03:53] ah, so has labs reorganized or something? [23:03:57] and now we wait again :-D [23:04:05] hyperon: what do you mean?
[23:04:52] separate question [23:05:03] still don't know what you mean [23:05:41] actually, never mind [23:05:46] i will resume work [23:05:52] PROBLEM host: publicdata-administration is DOWN address: publicdata-administration CRITICAL - Host Unreachable (publicdata-administration) [23:12:12] I guess I need to wait for the first puppet run now [23:30:43] PROBLEM Current Load is now: WARNING on bots-sql3 bots-sql3 output: WARNING - load average: 7.33, 6.34, 5.55 [23:31:43] RECOVERY host: publicdata-administration is UP address: publicdata-administration PING OK - Packet loss = 0%, RTA = 7.97 ms [23:40:43] RECOVERY Current Load is now: OK on bots-sql3 bots-sql3 output: OK - load average: 3.07, 3.97, 4.84 [23:41:22] finally I'm in [23:41:25] now I can go to bed [23:45:10] I'm not slacking, I'm waiting for puppet to do what I told it to do hours ago [23:46:25] kick it [23:46:36] it won't make it go faster but you'll feel better
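The "wait for the first puppet run" step above is just a poll-with-timeout loop. A generic sketch of that pattern; the predicate here is a stand-in (in practice it might try to ssh in or check the console log), and none of these names are labs-specific:

```python
# Generic poll-until-ready helper, of the kind you'd use while waiting
# for an instance's first puppet run to finish. Predicate is a stand-in.
import time

def wait_for(predicate, timeout=600, interval=10):
    """Poll predicate() until it returns True or timeout (seconds) expires.
    Returns True on success, False if the deadline passed first."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Toy usage: a fake check that becomes "ready" on the third poll.
state = {"polls": 0}
def fake_ready():
    state["polls"] += 1
    return state["polls"] >= 3

print(wait_for(fake_ready, timeout=5, interval=0))  # → True
```

Kicking it, as noted, won't make it go faster, but a bounded poll at least tells you when to give up and go to bed.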