[00:00:59] * andrewbogott struggles with the notion that proxy == load balancer [00:01:04] heh [00:01:06] I guess I understand why they're the same, almost [00:01:28] Since a balancer has to redirect traffic transparently which I guess is what a proxy does. [00:01:36] *light bulb* [00:01:39] yep [00:02:18] though a reverse proxy doesn't necessarily need to load balance [00:06:32] oh, hey, the code *is* released! [00:06:32] https://github.com/rackspace/atlas-lb [00:07:23] Change on 12mediawiki a page Wikimedia Labs/Reverse proxy for web services was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=509903 edit summary: /* Alternatives to implementing this ourselves */ [00:09:34] no haproxy adapter :( [00:21:28] Change on 12mediawiki a page Wikimedia Labs/Reverse proxy for web services was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=509920 edit summary: [00:21:35] screw atlas [00:21:43] I think we should implement this [00:22:28] https://www.mediawiki.org/wiki/Wikimedia_Labs/Reverse_proxy_for_web_services#Other_suggestions <— we can switch pybal from twisted to eventlet, and add an openstack-like API [00:22:56] Is the API they define reasonable? Reimplementing an existing API would be slightly nicer than making a new one [00:23:01] true [00:23:22] yes, though LVS wouldn't support a lot of it [00:23:22] Or just writing a custom backend for Atlas [00:23:57] we need to update pybal for IPv6 [00:24:14] and fundraising would like to be able to pool/depool via API [00:24:33] we wouldn't want to use atlas in production [00:24:49] Just on account of it being java? 
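The pool/depool-via-API idea mentioned above could sit on top of a small piece of state along these lines — a hypothetical Python sketch, not pybal's actual code (the class and method names here are mine):

```python
# Hypothetical sketch (not real pybal code): an in-memory pool of backend
# servers with pool/depool operations, the kind of state an OpenStack-like
# HTTP API for a load balancer would manipulate.

class ServicePool:
    """Tracks which backend servers are eligible to receive traffic."""

    def __init__(self, servers):
        # Map hostname -> pooled flag; everything starts pooled.
        self.servers = {host: True for host in servers}

    def depool(self, host):
        if host not in self.servers:
            raise KeyError("unknown server: %s" % host)
        self.servers[host] = False

    def pool(self, host):
        if host not in self.servers:
            raise KeyError("unknown server: %s" % host)
        self.servers[host] = True

    def active(self):
        """Return the sorted list of currently pooled servers."""
        return sorted(h for h, pooled in self.servers.items() if pooled)


pool = ServicePool(["web1", "web2", "web3"])
pool.depool("web2")
print(pool.active())  # ['web1', 'web3']
```

An HTTP layer (eventlet or otherwise) would just translate API calls into `pool()`/`depool()` operations on an object like this.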
[00:24:54] yep [00:24:57] also, pybal supports bgp [00:25:23] so, it can advertise service IP addresses to the routers, and if an LVS server dies, the router automatically moves the traffic [00:26:28] RECOVERY Free ram is now: OK on bastion-prod1 bastion-prod1 output: OK: 90% free memory [00:27:48] RECOVERY Total Processes is now: OK on bastion-prod1 bastion-prod1 output: PROCS OK: 80 processes [00:28:38] RECOVERY dpkg-check is now: OK on bastion-prod1 bastion-prod1 output: All packages OK [00:28:57] the initial modifications (moving from twisted to eventlet) shouldn't be major [00:29:14] the default way that pybal configures itself is via files [00:29:18] RECOVERY Current Load is now: OK on bastion-prod1 bastion-prod1 output: OK - load average: 0.05, 0.20, 0.23 [00:29:34] could probably abstract that, make it driver based, and make it pull that from the database [00:29:58] RECOVERY Current Users is now: OK on bastion-prod1 bastion-prod1 output: USERS OK - 0 users currently logged in [00:30:28] andrewbogott: http://svn.wikimedia.org/viewvc/mediawiki/trunk/pybal/ [00:30:38] RECOVERY Disk Space is now: OK on bastion-prod1 bastion-prod1 output: DISK OK [00:34:01] Ryan_Lane: That code hasn't been touched in years because it works perfectly, or because it was replaced with something else? [00:34:15] hm [00:34:20] I know it's been touched since then [00:34:24] I wonder if that's the wrong spot [00:34:56] hm 0.1+r74215 [00:35:24] in general it works [00:35:32] and hasn't needed many modifications [00:35:47] it for sure needs modifications for IPv6, though [00:35:54] so, yeah, that's the right code [00:37:15] ah. crap [00:37:21] the bgp code is written as twisted too [00:37:56] http://svn.wikimedia.org/viewvc/mediawiki/trunk/routing/twistedbgp/ [00:38:50] though, really, that's just the way the code is controlled [01:16:43] cool, a bastion-prod1 instance? [01:16:58] well, I'm going to separate users [01:17:09] ? 
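The "abstract that, make it driver based" idea from this exchange could look roughly like the following — an illustrative sketch, not pybal's real classes: the balancer only talks to a driver interface, and whether the server list comes from a file or a database is a backend detail.

```python
# Hypothetical driver-based config loading, as discussed above.
# Names are illustrative; pybal's actual configuration code differs.

class ConfigDriver:
    """Interface: return the list of backend servers for a service."""
    def get_servers(self, service):
        raise NotImplementedError


class FileConfigDriver(ConfigDriver):
    """Current-style source: one hostname per line in <path>/<service>."""
    def __init__(self, path):
        self.path = path

    def get_servers(self, service):
        with open("%s/%s" % (self.path, service)) as f:
            return [line.strip() for line in f if line.strip()]


class DictConfigDriver(ConfigDriver):
    """Stand-in for a database-backed driver (a dict plays the DB here)."""
    def __init__(self, table):
        self.table = table

    def get_servers(self, service):
        return list(self.table.get(service, []))


def load_balancer_config(driver, service):
    # The rest of the balancer never knows where the list came from.
    return driver.get_servers(service)
```

Swapping the file driver for a database driver then requires no change to the balancer itself.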
[01:17:13] it still won't be allowed to access production [01:18:19] Separate users as in dividing people who work to make code into production and those that don't? [01:18:48] no. people who have access to production, and those that don't [01:18:53] to prevent escalation attacks [01:19:01] It's strictly for security reasons [01:19:09] okay... [01:19:45] but isn't production denying our public key? [01:19:54] or whichever key [01:19:56] Yours, yes [01:19:58] production is denying ssh access completely [01:19:59] But it accepts mine [01:20:04] from labs [01:20:16] hmm [01:20:25] If I were to forward my key to bastion , then a root on bastion can "steal" my key [01:20:45] well, it can get access to your agent [01:20:51] it can't actually steal the real key :) [01:20:54] They can take over the agent process, forward it somewhere else, then use it to log into prod [01:20:59] Yeah that's why I said steal in quotes [01:21:04] * Ryan_Lane nods [01:21:06] lol [01:21:06] They can use your key through your agent process on bastion [01:21:14] right [01:21:25] So in theory, people have a separate key for production access so this can't happen [01:21:32] In practice, people may accidentally forward both keys [01:21:44] I should probably name that something else [01:21:51] looks reasonable [01:22:05] Or the production setup may be buggy in that it still accepts some people's old keys for certain purposes *cough* [01:22:06] bastion-limited, bastion-restricted? [01:22:21] it shouldn't [01:22:26] we use ensure => absent [01:22:27] Anyway, the solution is to have separate bastion VMs for people with prod login and people without [01:22:48] and how is 1 instance going to do that? 
[01:22:50] Ryan_Lane: Taking that subtopic to the private chan [01:23:06] yeah, logged channel here [01:24:25] So yeah the theory is that if bastion-restricted has all people with prod keys and only those people, root escalation on the bastion VM isn't enough [01:24:47] Because the bastion VM that the attacker is using is not the same VM that all the users with prod keys are using [01:25:00] So you'd actually have to have an escalation cross-VM, which is less likely [01:25:27] Of course people should *still* take care not to forward their prod key, but this offers additional protection [01:25:28] so we are going to be denied access to that instance? [01:25:40] Yes, and vice versa [01:26:06] People without prod access can't log into bastion-restricted, or we'd risk them stealing our keys [01:26:17] People with prod access can't log into the "normal" bastion host, or they'd risk getting their key stolen [01:26:30] okay [01:26:35] too far-fetched [01:26:45] but in the name of security, it's nothing [01:27:07] well, it's not a large inconvenience, and it adds an actual protective measure [01:27:58] Ryan_Lane: Do we also have a way to disallow agent forwarding within labs? [01:28:05] Or out of the bastion hosts specifically [01:28:14] no, and we don't always want to [01:28:21] people should really know better, though [01:28:25] Sure [01:28:29] PROBLEM host: bastion-prod1 is DOWN address: bastion-prod1 CRITICAL - Host Unreachable (bastion-prod1) [01:28:42] You shouldn't ever need to, in theory, always go through bastion [01:28:55] well, unless you need to hit gerrit [01:29:04] As yourself? 
[01:29:12] Hmm, right [01:29:16] though, I think we should have people clone anonymously there [01:29:24] and git pull from there to their local system [01:29:29] then push from local [01:29:57] Yeah there's no reason to check out the puppet manifests on the VM, you can't run them from there anyway [01:30:10] You're gonna have to check in your puppet changes in order for them to be run on your VM [01:30:17] actually, we plan on changing that [01:30:22] Oh? [01:30:29] yes. branch-per-project [01:30:35] Well KO [01:30:36] you'll make your changes locally, and run them locally [01:30:43] Huh, OK' [01:30:59] So if the changes literally only exist in a git clone locally on the VM, they'll still be run? [01:31:07] yep [01:31:12] They don't need to be in gerrit or in the central repo at all? [01:31:14] That's cool [01:31:17] yeah [01:31:20] that's the goal [01:31:22] In that case it totally makes sense to clone&push from the VM [01:31:40] But people should be told to just create a new private key for labs, I know of people that have already done this [01:31:41] hm. I see why Special:NovaAddress is so freaking slow [01:31:50] I'm doing *so many* ldap lookups [01:32:11] yeah, we tell people that too [01:32:31] but, it isn't totally necessary to forward your agent anyway [01:32:57] keep your repo on the local system, make changes on the instances, pull from the instances to local, push to gerrit [01:33:08] git is decentralized [01:33:31] you don't *have* to push in from the instances [01:33:56] PROBLEM Total Processes is now: CRITICAL on bastion-restricted1 bastion-restricted1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [01:34:17] Well yeah, but pulling from the VM to local is inconvenient [01:34:22] Especially since the VM usually won't have a public IP [01:34:23] it's just an alias [01:34:28] proxycommand [01:34:36] PROBLEM dpkg-check is now: CRITICAL on bastion-restricted1 bastion-restricted1 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
[01:34:37] or keep your repo on bastion [01:34:45] we have 300GB of storage there [01:34:59] Sure, there are ways to do it [01:35:06] plus, the home directory share there will be 50GB [01:35:09] and can be increased [01:35:14] I'm just saying that half our target audience drops out as soon as you say "proxycommand" [01:35:16] PROBLEM Current Load is now: CRITICAL on bastion-restricted1 bastion-restricted1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [01:35:18] yeah [01:35:22] Is /home available on bastion and the VMs? [01:35:30] no [01:35:52] So how does the home dir stuff work? [01:35:53] so, git pull ; git review [01:36:06] PROBLEM Current Users is now: CRITICAL on bastion-restricted1 bastion-restricted1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [01:36:46] PROBLEM Disk Space is now: CRITICAL on bastion-restricted1 bastion-restricted1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [01:37:26] PROBLEM Free ram is now: CRITICAL on bastion-restricted1 bastion-restricted1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [01:51:47] New patchset: Ryan Lane; "Up version of user-management tools" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3102 [01:51:58] New patchset: Ryan Lane; "Adding support to restrict instances by puppet variable" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3103 [01:52:09] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3102 [01:52:09] New review: gerrit2; "Lint check passed." 
[operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3103 [01:52:37] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3102 [01:52:39] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3102 [01:53:02] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3103 [01:53:04] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3103 [01:56:26] RECOVERY Free ram is now: OK on mobile-enwp mobile-enwp output: OK: 37% free memory [02:00:10] New patchset: Ryan Lane; "We need restricted_to and restricted_from" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3104 [02:00:16] RECOVERY Current Load is now: OK on bastion-restricted1 bastion-restricted1 output: OK - load average: 0.24, 0.08, 0.09 [02:00:20] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3104 [02:00:25] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3104 [02:00:27] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3104 [02:01:13] RECOVERY Current Users is now: OK on bastion-restricted1 bastion-restricted1 output: USERS OK - 1 users currently logged in [02:01:14] !log bastion restricted login on bastion1 to people in the bastion group, and am disallowing ops [02:01:16] Logged the message, Master [02:01:30] !log bastion restricted login on bastion-restricted1 to people in the ops group [02:01:31] Logged the message, Master [02:01:43] RECOVERY Disk Space is now: OK on bastion-restricted1 bastion-restricted1 output: DISK OK [02:02:28] hm [02:02:28] wtf [02:02:33] RECOVERY Free ram is now: OK on bastion-restricted1 bastion-restricted1 output: OK: 92% free memory [02:02:56] seems I just broke bastion [02:03:44] hm. dns disappeared for a bit. 
that's weird [02:03:50] probably the office network [02:03:52] yeah [02:03:53] RECOVERY Total Processes is now: OK on bastion-restricted1 bastion-restricted1 output: PROCS OK: 82 processes [02:04:04] http access seems to be down for me wikimedia-wide [02:04:12] odd [02:04:33] RECOVERY dpkg-check is now: OK on bastion-restricted1 bastion-restricted1 output: All packages OK [02:04:33] works for me [02:05:23] yeah, back up [02:05:27] some small hiccup [02:14:33] PROBLEM Free ram is now: WARNING on mobile-enwp mobile-enwp output: Warning: 18% free memory [02:28:00] New patchset: Ryan Lane; "Adding requirement for pam_access" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3105 [02:28:11] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3105 [02:28:19] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3105 [02:28:21] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3105 [02:39:33] RECOVERY Free ram is now: OK on mobile-enwp mobile-enwp output: OK: 21% free memory [03:12:11] !log dumps Deleted instance dumps-nfs1 to make way for gluster storage [03:12:12] Logged the message, Master [03:17:23] PROBLEM host: dumps-nfs1 is DOWN address: dumps-nfs1 check_ping: Invalid hostname/address - dumps-nfs1 [04:46:24] PROBLEM Free ram is now: WARNING on mobile-enwp mobile-enwp output: Warning: 17% free memory [06:06:24] RECOVERY Free ram is now: OK on mobile-enwp mobile-enwp output: OK: 26% free memory [06:19:24] PROBLEM Free ram is now: WARNING on mobile-enwp mobile-enwp output: Warning: 18% free memory [06:42:45] PROBLEM Current Load is now: WARNING on bots-sql3 bots-sql3 output: WARNING - load average: 8.26, 8.56, 6.39 [06:44:51] RECOVERY Free ram is now: OK on mobile-enwp mobile-enwp output: OK: 25% free memory [06:51:41] RECOVERY Disk Space is now: OK on aggregator1 aggregator1 output: DISK OK [06:59:41] PROBLEM 
Disk Space is now: WARNING on aggregator1 aggregator1 output: DISK WARNING - free space: / 540 MB (5% inode=94%): [07:02:41] RECOVERY Current Load is now: OK on bots-sql3 bots-sql3 output: OK - load average: 0.56, 2.59, 4.86 [07:24:41] PROBLEM Disk Space is now: CRITICAL on aggregator1 aggregator1 output: DISK CRITICAL - free space: / 284 MB (2% inode=94%): [07:56:52] hi [08:07:07] hi [08:08:28] i wanted to know if i can use the labs to run a program that requires 6GB RAM. [08:08:55] whym: yes [08:09:07] but it should be related to wikimedia heh [08:09:19] petan|wk: and can it be a long-standing daemon? [08:09:27] I think so [08:09:46] what is it for? [08:09:52] https://github.com/whym/RevDiffSearch [08:10:05] isn't it possible to use some cache on disk? [08:10:07] it is a search engine over all diffs [08:10:10] to avoid using so much ram [08:10:23] ok but why it needs 6gb? [08:10:37] it sounds like it is poorly optimized :P [08:10:46] what is it written in? [08:10:55] basically it's a tradeoff between search speed and ram consumption [08:10:57] I think it shouldn't be problem to run it [08:11:28] yeah i'm aware it's not best optimized [08:11:30] Ryan_Lane: ^ [08:11:51] whym: there is a lot of disk space, ram is a bit of problem afaik [08:12:09] is this a tool? [08:12:12] but if it's useful for wikimedia somehow it would be probably ok [08:12:26] petan|wk: maybe i can tweak it to slow it down and use less ram [08:13:03] what will it be used for? [08:13:15] ah. for wikihadoop? [08:13:39] Ryan_Lane: it is for mainly analytics work [08:13:55] have you worked with our analytics team at all? [08:14:05] right now it's like a really bad idea to launch a 6GB instance [08:14:15] Ryan_Lane: i've been working with Diederik van Liere [08:14:19] * Ryan_Lane nods [08:14:34] we need to expand hardware [08:14:42] when we do that it'll be fine to launch larger instances [08:15:05] ok [08:15:18] does it work to make multiple smaller instances? 
[08:15:24] or is one larger one needed? [08:15:42] Ryan_Lane: one larger one in the current implementation [08:15:51] PROBLEM Free ram is now: WARNING on mobile-enwp mobile-enwp output: Warning: 19% free memory [08:16:00] large instances have 8GB of RAM, xlarge instances have 16 [08:16:21] when we switch over to the cisco hardware, each hardware node will have 200GB or so of ram [08:17:18] maybe i can start with a limited experiment with smaller wikis than enwiki [08:17:52] so that i'll be able to quickly catch up when the hardware is expanded [08:20:16] i've been registered to the labs a while ago but don't have access to creating instances [08:20:23] can i request it here? [08:23:01] well, you should likely be added to an analytics project [08:23:10] you should get drdee to do so [08:24:48] Ryan_Lane: is it possible to restrict sudo? [08:25:07] I was thinking about that today [08:25:12] ok [08:25:23] I restricted access to the bastion nodes [08:25:24] Ryan_Lane: i'll ask him, thanks [08:25:36] I may be able to handle sudo the same way [08:25:51] RECOVERY Free ram is now: OK on mobile-enwp mobile-enwp output: OK: 22% free memory [08:25:51] did you read my email on wikitech [08:25:57] eh [08:25:59] huggle [08:26:01] that one [08:26:20] we have a problem with login to SUL [08:26:30] eh? [08:26:38] is it problem if our application ask users for their name and password? [08:26:50] no [08:26:51] because web application of huggle will need to do that [08:26:59] name and password of SUL [08:27:04] production [08:27:16] on toolserver it's not allowed [08:27:25] Krinkle told me it's not allowed on labs too [08:27:32] correct [08:27:37] it isn't [08:27:42] aha [08:27:45] so how can we do that [08:27:53] I need to put that into the terms of use, actually [08:28:16] right, it's ok, but I need to know how we can sort it out [08:29:03] we need oauth/openid on wikis [08:29:13] ok, is it going to happen soon? 
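Going back to the RAM question above: the on-disk cache suggested for RevDiffSearch can be illustrated with Python's standard `shelve` module — a generic sketch of the RAM-vs-disk tradeoff, not the tool's actual design:

```python
# Generic illustration of trading RAM for disk: keep a big index in an
# on-disk key/value store instead of an in-memory dict. Lookups become
# slower (they hit disk) but the resident memory stays small.
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "diffindex")
index = shelve.open(path)            # dict-like mapping, backed by a file
index["rev:1234"] = ["added foo", "removed bar"]  # written to disk
index.sync()                         # flush pending writes
assert index["rev:1234"] == ["added foo", "removed bar"]
index.close()
```

Whether that query-speed hit is acceptable is exactly the tradeoff described in the conversation.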
[08:29:16] or, just development of huggle can be in labs [08:29:22] because it's quite a blocker for us [08:29:34] development of huggle is in labs [08:29:54] but we need to be able to login people to SUL :) [08:30:05] it requires access to SUL? [08:30:13] huggle is editing wiki sites [08:30:26] in order to edit them you need to be logged in [08:30:28] yeah, can't it authenticate to bets? [08:30:30] *beta [08:30:34] yes [08:30:42] that would be fine [08:30:54] authentication to production SUL would not be [08:30:55] ok, but our goal is to enable it for production, not beta [08:31:07] there is no point in developing it if it's not going to work [08:31:08] that version shouldn't live in labs [08:31:17] ok, where is can live [08:31:19] * it [08:31:30] hm. going to need to figure that out [08:31:56] would it be possible to install oauth on prod? [08:32:08] it needs to be written [08:32:11] ah [08:32:14] openid as a provider is likely easier [08:32:36] but it doesn't really solve the issue [08:32:42] we really need oauth [08:32:53] I've been saying this for a while [08:32:59] I had an idea [08:33:05] there's no major push behind it [08:33:14] to create a restricted project on labs, where people from ops would have access to [08:33:21] production would live there [08:33:22] Hmm, does MobileFrontend slow down performance significantly? [08:33:29] the version which ask for password etc [08:33:47] people from ops would be able to review source and updates before updating sw [08:33:56] so that it should be secure [08:34:28] petan|wk: we'll need to figure this out [08:34:34] hm... 
[08:34:41] oauth would solve this problem a lot better :) [08:34:53] yes, but I think it's going to take years for it to happen [08:35:04] proposal is 2 years old [08:35:09] still no progress [08:35:37] well, we'll need to figure something out [08:35:59] in fact I don't see much problem on application which ask for a password, as long as it's open source and is managed by trustworth people [08:36:18] problem is if it lives on instance where everyone has root [08:36:34] yes [08:36:57] so if it could be managed by some "approved" people only, it shouldn't be security issue [08:36:57] well, either way, we aren't really ready for that right now [08:37:09] preferably by people from ops [08:37:17] hm... [08:38:18] I will use my personal server for hosting then... I hope it's on going to have a big load [08:38:35] it would be cool to move it to labs in future once it's solved [08:38:44] we can do development there now [08:39:29] that won't be ok either [08:39:33] we'll end up blocking it [08:39:43] huh? [08:40:22] this isn't the first time an app like this has been blocked [08:40:35] it's the thing that prompted the first discussion of oauth [08:40:45] huggle already ask people for login data [08:40:54] it's being used for 7 years [08:40:55] it's a locally installed app right now [08:41:04] once it's a web app it's different [08:41:12] the credentials need to be stored somewhere [08:41:15] yes but this local app could have a password logger if developers were evil [08:41:26] which makes it a place where credentials can be stolen en masse [08:41:45] Ryan_Lane: we don't need to store credentials [08:41:50] we just pass it to login api [08:41:56] then we store the session data only [08:42:03] like the current app does it [08:42:06] lemme find the old thread [08:42:11] hm... 
[08:42:30] I don't see a way to steal any credentials from there [08:42:58] only problem is if devs would steal them by changing the code, but that could happen even now [08:43:09] christ, how do I search wikitech-l archives? [08:43:15] huggle is being used for long time, and no account was compromised so far [08:43:25] Ryan_Lane: google [08:43:38] wikitech-l isn't indexed [08:44:30] god damn it, I wish they'd let crawlers index it [08:44:58] http://www.gossamer-threads.com/lists/wiki/wikitech/172528?search_string=oauth;#172528 [08:45:11] ko [08:45:13] ok [08:45:42] actually: http://www.gossamer-threads.com/lists/wiki/wikitech/172478 [08:46:12] problem is in this thread is same as we have now with app [08:46:21] people can't trust the provider [08:46:27] yes/no [08:46:43] with a downloaded application you can see exactly what it's doing [08:46:47] you can't with a website [08:46:49] no [08:46:53] you can't see it, it's a binary [08:47:02] you can download the source, and compile it [08:47:11] you can watch your network traffic [08:47:12] how you can be sure the binary is same as source [08:47:15] you can stick it into a debugger [08:47:28] ok, so it's a problem of trustworth [08:47:38] not a problem with app itself [08:47:39] if you compile the binary from the source, you know it's right because you trust the compiler [08:47:43] yes [08:47:54] so if it was running on restricted instance in labs, it should be ok [08:48:07] maybe. 
like I said, we aren't set up for this [08:48:09] because only selected people would have access to it [08:48:23] but, if it's run externally, we'll almost definitely have to block it [08:48:46] right, I know a lot of stuff we can't do, tell me what we can do [08:48:58] give me some time to figure it out [08:49:12] I'll ask internally how we can handle this [08:49:16] ok [08:50:30] well, I'm off to bed [08:51:14] * Ryan_Lane waves [09:16:51] PROBLEM Free ram is now: WARNING on mobile-enwp mobile-enwp output: Warning: 18% free memory [10:22:30] . [10:22:36] @search log [10:22:36] Results (found 9): morebots, labs-morebots, credentials, logging, terminology, newgrp, initial-login, requests, hyperon, [10:41:18] Hydriz [10:41:23] what is rc for incubator name [10:41:29] ? [10:41:38] you mean irc channel on irc.wm? [10:42:00] if thats the case, then #incubator.wikimedia [10:42:22] ok [10:46:14] @regsearch [10:46:14] Could you please tell me what I should search for :P [10:46:16] @search [10:46:16] Could you please tell me what I should search for :P [10:46:18] !ping [10:46:18] pong [10:46:20] lag [10:46:23] ok [10:46:52] @ [10:46:55] @recentchanges+ [10:46:55] Invalid wiki [10:46:59] @recentchanges+ incubator_wiki [10:47:00] Wiki inserted [10:47:21] should work [10:47:52] What wikis are actually defined? [10:47:57] lot of [10:48:09] no list [10:48:14] at least not publicly available [10:48:29] but not all? [10:48:32] no [10:48:36] :( [10:48:39] I was lazy to define all [10:48:43] if you want you can [10:49:16] yay works [10:49:24] okay, how to do that? [10:49:26] syntax: [10:49:42] #blah.wikipedia|http://gfdgs.wikipedia.org/w/index.php?diff=|blah_wikipedia [10:49:52] just make a list of all wikis and put it to pastebin [10:49:57] there is 800+ wikis [10:50:11] lol [10:50:25] but when its on? [10:50:27] can we do it? [10:50:34] if you make it I will update it [10:50:44] :( [10:50:49] what you mean? 
[10:50:58] I don't understand you [10:51:05] Like, now wm-bot is on, is there any syntax to just insert it? [10:51:08] no [10:51:11] sigh [10:51:12] security reason [10:51:20] it needs to be inserted to configs [10:51:22] nevermind then :P [10:51:24] ok [10:51:38] Then... [10:51:47] Can we also enable the bot to monitor deletions? [10:51:53] yes [10:51:59] or maybe monitor pages in a namespace [10:52:08] it can do that [10:52:12] how? [10:52:14] just type @RC+ wiki Page [10:52:25] @RC+ en_wikipedia A* [10:52:25] Inserted new item to feed of changes [10:52:45] okay :) [10:52:45] many pages starting with a:P [10:52:51] Change on 12en_wikipedia a page Anne McCaffrey bibliography was modified, changed by SGBailey link https://en.wikipedia.org/w/index.php?diff=481662120 edit summary: /* Dragonriders of Pern series */ Fmt of Alternate dragonseye - get rid of blank lines. [10:52:55] here we go [10:52:55] Change on 12en_wikipedia a page Amtshainersdorf railway station was modified, changed by Slambo link https://en.wikipedia.org/w/index.php?diff=481662125 edit summary: fix punctuation around footnote markers per [[WP:REFPUNCT]]; move language icon out of link title [10:53:02] @RC- en_wikipedia A* [10:53:02] Deleted item from feed [10:53:11] that's main space [10:53:33] then it automatically announces deletions? 
[10:53:40] nah [10:53:43] that's not working yet [10:53:47] oh [10:53:52] so far it check the edits only [10:54:04] hopes its implemented :P [10:54:07] no [10:57:28] @channellist [10:57:28] I am now in following channels: #huggle, #wikimedia-dev, #mediawiki-move, #wikimedia-tech, #wm-bot, #wikimedia-labs, #wikimedia-operations, ##matthewrbowker, ##matthewrbot, #wikipedia-zh-help, #wikimedia-toolserver, ##Alpha_Quadrant, #wikimedia-mobile, #mediawiki, #wikipedia-cs, #wikipedia-cs-rc, #wiki-hurricanes-zh, #wikinews-zh, #wikipedia-zh-helpers, #wikipedia-en-afc, ##thesecretlair, ##addshore, #wikipedia-bag, #wikimedia-incubator, [11:02:39] !ping [11:02:39] pong [11:20:51] PROBLEM Free ram is now: WARNING on mobile-enwp mobile-enwp output: Warning: 17% free memory [11:31:43] PROBLEM Current Load is now: WARNING on mobile-enwp mobile-enwp output: WARNING - load average: 5.09, 5.61, 5.08 [11:56:53] PROBLEM Current Load is now: CRITICAL on mobile-enwp mobile-enwp output: CHECK_NRPE: Socket timeout after 10 seconds. 
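The feed-definition syntax quoted above (`#channel|diff-url-prefix|wiki_name`) is simple to parse — a hypothetical helper for that line format, not wm-bot's actual code:

```python
# Parse a wm-bot-style feed definition line of the form
#   #channel|diff-url-prefix|wiki_name
# The format comes from the conversation above; the function is illustrative.

def parse_feed_line(line):
    """Split a '#channel|url-prefix|wiki' definition into its parts."""
    channel, diff_url, wiki = line.strip().split("|")
    return {"channel": channel, "diff_url": diff_url, "wiki": wiki}


entry = parse_feed_line(
    "#blah.wikipedia|http://gfdgs.wikipedia.org/w/index.php?diff=|blah_wikipedia")
# entry["wiki"] is "blah_wikipedia"
```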
[12:01:53] PROBLEM Current Load is now: WARNING on mobile-enwp mobile-enwp output: WARNING - load average: 7.65, 7.11, 6.71 [12:25:53] RECOVERY Free ram is now: OK on mobile-enwp mobile-enwp output: OK: 27% free memory [12:28:23] RECOVERY SSH is now: OK on deployment-webs1 deployment-webs1 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [12:31:33] RECOVERY Current Users is now: OK on deployment-webs1 deployment-webs1 output: USERS OK - 0 users currently logged in [12:31:33] RECOVERY Current Load is now: OK on deployment-webs1 deployment-webs1 output: OK - load average: 0.06, 0.16, 0.08 [12:31:33] RECOVERY Total Processes is now: OK on deployment-webs1 deployment-webs1 output: PROCS OK: 112 processes [12:31:38] RECOVERY dpkg-check is now: OK on deployment-webs1 deployment-webs1 output: All packages OK [12:32:43] RECOVERY Disk Space is now: OK on deployment-webs1 deployment-webs1 output: DISK OK [12:32:43] RECOVERY Free ram is now: OK on deployment-webs1 deployment-webs1 output: OK: 90% free memory [12:38:53] PROBLEM Free ram is now: WARNING on mobile-enwp mobile-enwp output: Warning: 16% free memory [13:08:53] RECOVERY Free ram is now: OK on mobile-enwp mobile-enwp output: OK: 21% free memory [13:56:43] PROBLEM Current Load is now: WARNING on mobile-enwp mobile-enwp output: WARNING - load average: 5.86, 5.97, 6.02 [16:16:53] PROBLEM Current Load is now: CRITICAL on mobile-enwp mobile-enwp output: CHECK_NRPE: Socket timeout after 10 seconds. [16:21:44] PROBLEM Current Load is now: WARNING on mobile-enwp mobile-enwp output: WARNING - load average: 6.18, 6.59, 6.63 [17:26:46] can anyone get into bastion.wmflabs.org? [17:31:53] PROBLEM Current Load is now: CRITICAL on mobile-enwp mobile-enwp output: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:41:12] 03/13/2012 - 17:41:12 - Creating a home directory for midom at /export/home/bastion/midom [17:42:13] 03/13/2012 - 17:42:12 - Updating keys for midom [17:42:13] 03/13/2012 - 17:42:12 - Creating a home directory for ariel at /export/home/bastion/ariel [17:43:13] 03/13/2012 - 17:43:12 - Updating keys for ariel [18:11:29] 03/13/2012 - 18:11:29 - Creating a project directory for publicdata [18:11:29] 03/13/2012 - 18:11:29 - Creating a home directory for ariel at /export/home/publicdata/ariel [18:12:29] 03/13/2012 - 18:12:29 - Updating keys for ariel [18:27:33] PROBLEM dpkg-check is now: CRITICAL on bots-cb bots-cb output: CHECK_NRPE: Socket timeout after 10 seconds. [18:29:53] PROBLEM Current Load is now: CRITICAL on bots-cb bots-cb output: CHECK_NRPE: Socket timeout after 10 seconds. [18:30:13] PROBLEM Total Processes is now: CRITICAL on bots-cb bots-cb output: CHECK_NRPE: Socket timeout after 10 seconds. [18:31:03] PROBLEM Free ram is now: CRITICAL on bots-cb bots-cb output: CHECK_NRPE: Socket timeout after 10 seconds. [18:32:13] PROBLEM Disk Space is now: CRITICAL on bots-cb bots-cb output: CHECK_NRPE: Socket timeout after 10 seconds. [18:37:06] RECOVERY Disk Space is now: OK on bots-cb bots-cb output: DISK OK [18:37:23] RECOVERY dpkg-check is now: OK on bots-cb bots-cb output: All packages OK [18:37:57] !instances | apergos [18:37:57] apergos: https://labsconsole.wikimedia.org/wiki/Help:Instances [18:37:59] !security [18:37:59] https://labsconsole.wikimedia.org/wiki/Help:Security_Groups [18:38:33] likely won't need security groups for instances in this project [18:40:03] RECOVERY Total Processes is now: OK on bots-cb bots-cb output: PROCS OK: 110 processes [18:40:32] cute bot [18:40:53] RECOVERY Free ram is now: OK on bots-cb bots-cb output: OK: 71% free memory [18:41:15] what is the point of multiple instances for a project? 
basically I don't really understand the distinction between projects and instances [18:41:31] OK, say you're building a replica of the WMF cluster [18:41:41] Then you'd have a project called wmfclone [18:41:45] sure [18:42:03] Inside that project, you would have multiple instances (VMs), like wmfclone-apache1, wmfclone-squid1, wmfclone-mysql1 etc [18:42:10] deployment-prep has 10 instances now I think [18:42:22] They have Squid, Apache, MySQL, almost everything [18:42:32] except an nginx proxy for HTTPS IIRC [18:42:33] so an instance is some vm that has some of the features you want for the project? [18:42:39] An instance is a VM [18:42:47] and you group them together by calling them a project? [18:42:51] And a project may or may not involve having multiple VMs that work together [18:42:55] as well as I suppose managing access [18:42:58] Yes [18:42:59] and resources for them all [18:43:06] Access management is project-wide, as is resource management [18:43:17] And firewall rules also look at project, by default [18:43:22] ok [18:43:38] So by default every VM can talk only to VMs in the same project, and everything else is firewalled [18:44:16] ah [18:44:24] but labsconsole has a web interface where you can open ports for inter-project communication if you need it [18:44:44] great [18:44:47] So to quote Ryan, what we call "project" is really a "security group" [18:44:52] ok [18:44:58] I'll just think of it that way [18:44:59] thanks [18:45:52] RoanKattouw: let's not use security group for two things :) [18:46:02] a project is a security separation [18:46:10] a security group is a set of firewall rules [18:46:18] fucking EC2 and their shitty terminology [18:46:19] Oh, sorry [18:46:24] I misunderstood then [18:46:36] Wait that is really bad terminology [18:46:41] "group" should refer to people [18:47:03] ah, ec2 sticks it to us eh? 
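The project/instance/firewall relationship explained in this exchange can be summarized in a toy model — this is a sketch of the semantics described above, not the Nova API:

```python
# Toy model of the relationship explained above: a project groups
# instances, same-project traffic is allowed by default, and cross-project
# traffic is firewalled unless a port has been explicitly opened.

class Project:
    def __init__(self, name):
        self.name = name
        self.instances = []       # the VMs belonging to this project
        self.open_ports = set()   # ports opened to traffic from outside

    def add_instance(self, hostname):
        self.instances.append(hostname)

    def allows(self, source_project, port):
        """Is traffic from source_project allowed in on this port?"""
        return source_project.name == self.name or port in self.open_ports


wmfclone = Project("wmfclone")
wmfclone.add_instance("wmfclone-apache1")
wmfclone.add_instance("wmfclone-mysql1")

bots = Project("bots")
assert wmfclone.allows(wmfclone, 3306)   # intra-project: allowed by default
assert not wmfclone.allows(bots, 3306)   # cross-project: firewalled
wmfclone.open_ports.add(80)              # like opening a port via labsconsole
assert wmfclone.allows(bots, 80)
```

Access and resource management attach at the project level too, which is why "project" and (EC2's) "security group" get conflated in the conversation.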
[18:47:12] awesome [18:47:16] Well Nova is API-compatible with EC2, on purpose [18:47:26] sure, and that makes good sense [18:47:36] it just sucks that they couldn't have gotten that right for all of us [18:48:04] Yeah [18:53:55] yeah, we're stuck with the EC2 terminology, since it's the elephant [18:54:20] they didn't do a terrible job, but the terminology is a little confusing [18:54:33] they don't use projects, though, they use accounts, I think [18:54:43] PROBLEM Current Load is now: WARNING on bots-cb bots-cb output: WARNING - load average: 0.78, 3.60, 17.99 [18:54:45] I much prefer project to account, though [18:54:52] I think project came from nova [18:55:05] mm yeah account would be a disaster [19:19:43] RECOVERY Current Load is now: OK on bots-cb bots-cb output: OK - load average: 0.44, 0.59, 4.03 [19:27:43] PROBLEM host: robh-1 is DOWN address: robh-1 CRITICAL - Host Unreachable (robh-1) [19:30:02] 03/13/2012 - 19:30:02 - Updating keys for robh [19:30:03] 03/13/2012 - 19:30:03 - Updating keys for robh [19:30:05] 03/13/2012 - 19:30:05 - Updating keys for robh [19:30:12] 03/13/2012 - 19:30:11 - Updating keys for robh [19:30:16] 03/13/2012 - 19:30:16 - Updating keys for robh [19:31:47] PROBLEM Current Load is now: WARNING on mobile-enwp mobile-enwp output: WARNING - load average: 6.80, 6.16, 6.02 [19:34:12] PROBLEM Current Load is now: CRITICAL on robh1 robh1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:34:47] PROBLEM Current Users is now: CRITICAL on robh1 robh1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:35:27] PROBLEM Disk Space is now: CRITICAL on robh1 robh1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:36:17] PROBLEM Free ram is now: CRITICAL on robh1 robh1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:37:37] PROBLEM Total Processes is now: CRITICAL on robh1 robh1 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:38:27] PROBLEM dpkg-check is now: CRITICAL on robh1 robh1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [20:21:58] PROBLEM Current Load is now: CRITICAL on mobile-enwp mobile-enwp output: CHECK_NRPE: Socket timeout after 10 seconds. [20:26:57] PROBLEM Current Load is now: WARNING on mobile-enwp mobile-enwp output: WARNING - load average: 14.00, 10.93, 9.92 [22:06:57] PROBLEM Current Load is now: CRITICAL on mobile-enwp mobile-enwp output: CHECK_NRPE: Socket timeout after 10 seconds. [22:11:47] PROBLEM Current Load is now: WARNING on mobile-enwp mobile-enwp output: WARNING - load average: 8.86, 9.57, 9.17 [22:34:02] PROBLEM Current Users is now: CRITICAL on publicdata-administration publicdata-administration output: Connection refused by host [22:34:42] PROBLEM Disk Space is now: CRITICAL on publicdata-administration publicdata-administration output: Connection refused by host [22:35:22] PROBLEM Free ram is now: CRITICAL on publicdata-administration publicdata-administration output: Connection refused by host [22:35:43] is this sort of spam usual on instance creation? [22:36:52] PROBLEM Total Processes is now: CRITICAL on publicdata-administration publicdata-administration output: Connection refused by host [22:37:32] PROBLEM dpkg-check is now: CRITICAL on publicdata-administration publicdata-administration output: Connection refused by host [22:38:22] PROBLEM Current Load is now: CRITICAL on publicdata-administration publicdata-administration output: Connection refused by host [22:39:37] apergos: yes [22:39:49] apergos: because there's some issue with puppet and nrpe on first puppet run [22:39:55] and no one has bothered to fix it [22:40:18] ok so I don't need to worry at least [22:43:50] so I can't seem to ssh into the instance... is it just too soon? [22:44:09] actually lemme see if it still says "pending" on the page [22:44:23] no, it says running [22:44:35] Permission denied (publickey). 
[22:44:39] (from bastion-restricted) [22:54:26] lemme see [22:54:40] could be too early [22:54:49] check the instance's console log to see if puppet finished running [22:55:01] ok [22:55:04] Magic [22:55:29] 03/13/2012 - 22:55:28 - Creating a home directory for laner at /export/home/publicdata/laner [22:55:41] must. improve. interface. [22:55:48] Why do I always read laner as lamer.... [22:55:50] can't really tell [22:55:56] cause I've never seen a normal boot up [22:56:04] 99% [Working] Fetched 3691kB in 0s (8856kB/s) Failed to fetch http://ubuntu.wikimedia.org/ubuntu/pool/main/r/ruby1.8/libreadline-ruby1.8_1.8.7.249-2ubuntu0.1_amd64.deb Hash Sum mismatch E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing? [22:56:08] it died [22:56:09] that's the last stuff in the console output [22:56:10] delete/recreate [22:56:11] ah [22:56:25] there's some weird bug [22:56:26] what did I say? 10 times harder for me than everyone else? [22:56:28] 03/13/2012 - 22:56:28 - Updating keys for laner [22:56:31] holding true to form [22:56:34] this happens for everyone occasionally [22:56:48] yes but I have gotten *every* bug so far. [22:56:51] Will be more fun when we have dynamically building apt stuff... [22:56:52] true :) [22:56:52] you have to admit... [22:56:54] :-D [22:57:08] Also, the nova credentials bug is ANNOYING [22:57:16] og [22:57:18] err [22:57:20] er? [22:57:20] apergos: use a small [22:57:24] rather than a medium [22:57:27] ok [22:57:29] it seems to happen more often with mediums [22:57:36] ... [22:57:39] oh gee [22:57:40] I'm wondering if it's some memory issue with kvm [22:57:50] since we're basically running out of memory [22:58:07] hello [22:58:13] hyperon: howdy [22:58:17] Don't we have some new beefy nodes on the way? [22:58:20] should I start from scratch? [22:58:30] or is it repairable?
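The diagnosis above (check the instance's console output, look for the failed package fetch) is mechanical enough to script. A sketch, assuming you have saved the console output to hand (the sample text is inlined here for illustration; in practice it would come from labsconsole):

```python
# Sketch: scan a saved console log for the apt/cloud-init failure that
# means the instance's first boot died and it must be deleted/recreated.
# The sample log below reproduces the failure pasted in the channel.
SAMPLE_CONSOLE_LOG = """\
99% [Working] Fetched 3691kB in 0s (8856kB/s)
Failed to fetch http://ubuntu.wikimedia.org/ubuntu/pool/main/r/ruby1.8/libreadline-ruby1.8_1.8.7.249-2ubuntu0.1_amd64.deb Hash Sum mismatch
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
"""

FAILURE_MARKERS = ("Failed to fetch", "Hash Sum mismatch")

def first_boot_failed(console_log):
    """True if the console log shows a fatal package-fetch failure."""
    return any(marker in console_log for marker in FAILURE_MARKERS)

if first_boot_failed(SAMPLE_CONSOLE_LOG):
    print("first boot failed: delete and recreate the instance")
```

The marker strings are just the two errors seen in this particular log, not an exhaustive list of ways a first boot can die.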
[22:58:49] [22:56:10] delete/recreate [22:59:13] yay, two computers and one internet connection later [22:59:25] i am actually capable of contributing [22:59:27] yeah but I'm not clear what that's about [22:59:42] and he asked me to use a small so I wonder if that means I should start over [22:59:46] Yeah [22:59:59] Delete the instance and start again as it's not set up. [23:00:02] apergos: you have to [23:00:06] ok [23:00:09] once puppet fails an initial run it's fucked [23:00:14] actually, this one is worse than that [23:00:19] oh great :-/ [23:00:21] cloud init failed because it couldn't install ruby [23:00:35] Ryan_Lane: is the mysql openstack thing still a nondone [23:00:43] hyperon: yep [23:00:46] ok [23:01:02] uh oh [23:01:07] i guess y'all can yell at me for that... [23:01:20] * Damianz strings hyperon up and gets the darts out [23:01:27] I deleted it and it showed it as still running [23:01:32] and I get a chance to delete it again [23:01:47] I would accidentally it if I could figure out how :-P [23:01:49] hyperon: it's fine. we haven't started targeting that yet [23:02:03] so, if you finish it before we get to it, great, if not, we'll do it :) [23:02:13] apergos: that's not a bug [23:02:20] apergos: it takes a little while to delete instances [23:02:42] I guess it should show delete pending or something [23:02:44] you can create a new one while that one is waiting to die [23:02:45] oh okay, i guess...what do you mean by targeting? [23:02:55] I wanna give it the same name [23:02:57] apergos: yeah, openstack is a little silly there [23:02:59] you can [23:03:02] it deleted it from dns [23:03:04] ok [23:03:15] hyperon: it's not something we're ready to tackle yet [23:03:38] (we being staff) [23:03:53] ah, so has labs reorganized or something? [23:03:57] and now we wait again :-D [23:04:05] hyperon: what do you mean?
[23:04:52] separate question [23:05:03] still don't know what you mean [23:05:41] actually, never mind [23:05:46] i will resume work [23:05:52] PROBLEM host: publicdata-administration is DOWN address: publicdata-administration CRITICAL - Host Unreachable (publicdata-administration) [23:12:12] I guess I need to wait for the first puppet run now [23:30:43] PROBLEM Current Load is now: WARNING on bots-sql3 bots-sql3 output: WARNING - load average: 7.33, 6.34, 5.55 [23:31:43] RECOVERY host: publicdata-administration is UP address: publicdata-administration PING OK - Packet loss = 0%, RTA = 7.97 ms [23:40:43] RECOVERY Current Load is now: OK on bots-sql3 bots-sql3 output: OK - load average: 3.07, 3.97, 4.84 [23:41:22] finally I'm in [23:41:25] now I can go to bed [23:45:10] I'm not slacking, I'm waiting for puppet to do what I told it to do hours ago [23:46:25] kick it [23:46:36] it won't make it go faster but you'll feel better
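The "wait for the first puppet run" step above is just a poll-with-timeout loop. A generic sketch of that pattern; the predicate here is a stand-in (in practice it might try to ssh in or check the console log), and none of these names are labs-specific:

```python
# Generic poll-until-ready helper, of the kind you'd use while waiting
# for an instance's first puppet run to finish. Predicate is a stand-in.
import time

def wait_for(predicate, timeout=600, interval=10):
    """Poll predicate() until it returns True or timeout (seconds) expires.
    Returns True on success, False if the deadline passed first."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Toy usage: a fake check that becomes "ready" on the third poll.
state = {"polls": 0}
def fake_ready():
    state["polls"] += 1
    return state["polls"] >= 3

print(wait_for(fake_ready, timeout=5, interval=0))  # → True
```

Kicking it, as noted, won't make it go faster, but a bounded poll at least tells you when to give up and go to bed.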