[11:22:17] !log tools puppet breakage due to me introducing the openstack-mitaka-jessie repo by mistake. Cleaning up already
[11:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:27:37] Amir1: ok to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/500928 ?
[15:27:44] Since that old alias is broken
[15:29:00] sure thing
[15:29:13] then how does wikilabels work now?
[15:34:18] I'm not sure that value is actually applied anywhere
[15:34:30] But that old alias points at a server that isn't running pg
[15:34:37] It was :) But I shut it down
[15:46:56] Amir1: that's out
[15:47:43] Nice, thanks
[15:47:51] Checking the page, it works
[18:01:34] is anyone hanging trying to do `ssh tools-dev.wmflabs.org`?
[18:03:57] One of my teammates had that problem earlier. I think the hostname changed?
[18:04:28] notconfusing: load is pretty crazy on that host… tools-login seems ok at the moment
[18:04:33] so you might be happier switching over
[18:06:12] bstorm_, /home seems pretty much ruined on tools-dev.wmflabs.org aka tools-sgebastion-08. Might be of interest to you :(
[18:06:32] (I can ssh in with a root key but not as myself, presumably because of the different location of $home)
[18:06:34] andrewbogott: thanks, that'll work in the meantime.
[18:07:26] andrewbogott: will take a look in a sec
[18:18:03] This server has not been rebooted since the change that fixed lsof procs going into D state. That's clearly happened a fair bit.
[18:19:20] I'll reboot it to see if it can be recovered that way. That's about the only thing that is going to clear some of that stuck stuff
[18:20:28] bd808: do you know any reason I shouldn't reboot it (I don't use this bastion very often)
[18:20:40] I see a couple of current logins
[18:21:12] However, I question whether those logins are actually usable
[18:22:09] Actually, let's see if I can remount it. The lsof processes often block it, but just in case
[18:23:08] That worked :)
[18:31:09] notconfusing: that host tools-dev should be recovered now.
[18:35:45] bstorm_: wait what? you fixed D-state?
[18:36:07] I did a force remount of the NFS. The things that were in a D state recovered after that
[18:36:12] ah
[18:36:28] Also, the root cause of a lot of the D state processes was fixed by gtirloni
[18:36:39] how?
[18:36:41] This was still affected because that didn't clean itself up
[18:36:59] sorry, too busy recently, couldn't follow the recent changes
[18:37:20] Basically, a mw-restarter thing (not actually 100% on what it did) was running lsof processes against filesystems. He added an exception so it didn't do that on NFS.
[18:37:36] I see
[18:37:46] When doing it on an NFS client, the lsof procs would get stuck (and block remounting)
[18:37:59] So any network jitter would lead to a problem
[19:34:46] bstorm_: sorry I missed the ping. Was eating lunch. Generally rebooting a bastion is fine, but when we can it is nice to give a 5 minute warning via wall first. When things are really messed up though, reboot as needed :)
[19:35:08] :)
[19:35:13] no worries
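
(A side note on the D-state hunt above: processes in uninterruptible sleep ignore signals, even SIGKILL, which is why the force remount was the only thing that cleared them. Below is a minimal sketch of how one might enumerate such processes by scanning /proc — illustrative only, not the tooling actually used on the bastions.)

```python
#!/usr/bin/env python3
"""List processes in uninterruptible sleep (D state) by scanning /proc.

Illustrative sketch only; not the actual tooling used on the bastions.
"""
import os

def d_state_processes():
    """Yield (pid, comm) for every process whose state field is 'D'."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # /proc/<pid>/stat looks like "pid (comm) state ..."; comm may
        # contain spaces, so split on the last closing paren instead.
        comm = stat[stat.index("(") + 1:stat.rindex(")")]
        state = stat[stat.rindex(")") + 2]
        if state == "D":
            yield int(entry), comm

if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(f"{pid}\t{comm}")
```

(Anything this prints is blocked inside the kernel — here, on NFS I/O from the stuck lsof calls — and only goes away when that I/O completes, hence the remount-or-reboot choice discussed above.)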
[19:54:07] what kind of "know who is sending requests" options are there for http services available to cloud instances? It seems ldap is undesirable, as users would hardcode their passwords. oauth seems pretty heavy to put in an http proxy but might be plausible. Asking people to set a reasonable User-Agent header won't work, as most users simply wouldn't do it
[19:55:01] ebernhardson: I'm not sure I understand the use case here
[19:55:12] ebernhardson: http auth?
[19:55:27] chicocvenancio: we are standing up an http service, accessible to cloud instances, that can be overloaded. When it's overloaded we need to know who is doing it
[19:55:41] ldap is definitely not a good idea and most likely against the rules
[19:56:04] ebernhardson: will it run in WMCS?
[19:56:04] chicocvenancio: it isn't if you set up your own ldap instance iirc
[19:56:11] yeah
[19:56:16] chicocvenancio: no, it runs on real hardware and is accessible only to cloud IPs
[19:56:21] You can't use Wikimedia's anymore though, like you used to
[19:56:45] has a foo.wikimedia.org domain name and lives in the WMF network
[19:57:12] ebernhardson: we use service-specific credentials for the wiki replicas and kubernetes
[19:57:38] I was kind of figuring that the cirrus replica would need to follow a similar pattern
[19:57:45] ebernhardson: is this going to be accessed by a user or a program?
[19:57:58] bd808: is there some sort of generic frontend for having users register credentials, perhaps?
[19:58:12] if it were only that simple :)
[19:58:40] Zppix: either, depends how users decide to use it. The first use case is most likely one of the heaviest: allow users to search across several TB of data in thousands of shards
[19:58:44] In Toolforge we have a service daemon that will look for accounts that are missing credentials and issue them
[19:58:47] bd808: hey, while you're doing that frontend credential thing can you make all my stuff self-patch xD
[19:59:39] ebernhardson: i wonder if you could do http auth and have a script that will allow users to add new http auth creds
[20:00:00] a generic credential issuing service doesn't seem that hard :P
[20:00:06] * ebernhardson isn't writing it though :P
[20:00:11] ebernhardson: The easiest thing to do I think is add provisioning the credentials to that same service in Toolforge. This is actually how we hand out all wiki replica username/password creds today
[20:00:27] bd808: that seems a bit overkill for one project though?
[20:00:33] bd808: any reference repos or something i can look at? It's not particularly clear
[20:00:37] the part we will need to figure out is how to get the authorized credentials over to the proxy server
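
(The credential-issuing daemon pattern bd808 describes — notice accounts without credentials, mint a password, keep only a hash server-side — could look roughly like the sketch below. Everything here, the sqlite schema, the salted-SHA-256 hashing, the function names, is an assumption for illustration and not the actual maintain-dbusers code.)

```python
import hashlib
import secrets
import sqlite3

def ensure_schema(db: sqlite3.Connection) -> None:
    # Durable backing store; a real service would also mirror entries
    # into whatever the proxy reads (redis, per the flow discussed below).
    db.execute(
        "CREATE TABLE IF NOT EXISTS credentials ("
        "account TEXT PRIMARY KEY, salt TEXT, pw_hash TEXT)"
    )

def issue_credentials(db: sqlite3.Connection, account: str):
    """Issue a password for an account that doesn't have one yet.

    Returns the plaintext exactly once, to be delivered to the user the
    way wiki replica creds are; only a salted hash is kept server-side.
    """
    if db.execute(
        "SELECT 1 FROM credentials WHERE account = ?", (account,)
    ).fetchone():
        return None  # already provisioned
    password = secrets.token_urlsafe(24)
    salt = secrets.token_hex(16)
    pw_hash = hashlib.sha256((salt + password).encode()).hexdigest()
    db.execute(
        "INSERT INTO credentials (account, salt, pw_hash) VALUES (?, ?, ?)",
        (account, salt, pw_hash),
    )
    db.commit()
    return password

# usage sketch:
#   db = sqlite3.connect("creds.db"); ensure_schema(db)
#   pw = issue_credentials(db, "tool-foo")
```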
[20:01:18] ebernhardson: i know you already said oauth, but that would be a good one, considering it could prevent impersonation
[20:01:26] Zppix: ebernhardson is working on standing up a new wiki-replica-like shared service that will give Cloud users access to a replica of the production CirrusSearch data
[20:02:05] so functionally the same thing as the wiki replicas, but with a different data set (and hosting platform)
[20:02:16] oauth might be possible, but ideally it needs to fit inside an nginx proxy. i can investigate that a bit
[20:02:18] ebernhardson: oh
[20:02:34] oauth is not going to work, I don't think
[20:02:34] ebernhardson: i wouldn't think oauth would be that heavy
[20:02:41] bd808: why not?
[20:02:57] the oauth we have is a 3-legged exchange that requires human interaction on metawiki
[20:03:09] not something you can script
[20:03:19] right, this service only exposes a REST api with json responses
[20:03:25] bd808: we don't use tokens?
[20:03:29] no
[20:03:34] that's oauth 2
[20:03:38] Ah
[20:04:14] OAuth2 could have been called "a completely different protocol that is re-using the term OAuth" :)
[20:04:15] * chicocvenancio votes to prioritize oauth 2 into mediawiki, but this is ot
[20:04:40] oauth 2 was rejected by csteipp some time ago as a bad idea; we have a different security team now though, so who knows
[20:05:16] right now we don't even have an official maintainer for OAuth, so adding another thing seems unlikely
[20:05:31] yeah, I'm aware
[20:05:41] it would solve some of my problems though
[20:05:54] in PAWS, Matrix, Discourse and a few other things
[20:06:24] bd808: so the core suggestion seems to be some sort of ldap (but not the one used for normal auth) with cirrus-replica-specific credentials, and something to populate it?
[20:06:58] mostly the thing we need is just to know who to talk to when it falls over :)
[20:07:09] ebernhardson: not ldap, just user/password or a fixed token.
[20:07:12] (to advise on better ways)
[20:07:50] the thing to figure out is how to set up the nginx side so that it can check a header or http basic auth against a list of accounts to allow
[20:08:53] the existing elasticsearch cluster in Toolforge does this with HTTP Basic auth, but the way I built that won't scale to a larger user group because the central auth file is managed with Puppet
[20:09:14] so we need a different system for recording and validating the user credentials
[20:09:51] hmm, yea, managing the user list in puppet sounds painful
[20:09:51] We could do something like the dynamicproxy instances do. That involves a service, redis, and lua
[20:10:45] btw the instances are up with elasticsearch running now :) firewalls only allow cumin hosts to the ssl terminators right now though
[20:11:31] ebernhardson: don't forget the carbon actuators
[20:12:05] if we did that, the flow would be something like: maintain-dbusers generates a username & password for an account, that data is sent to the auth manager service, it records the data in redis (and some other more durable backing store like sqlite or mysql), nginx proxy has lua to look up the user from redis
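
(One concrete way to wire the redis-lookup step of that flow without writing lua: nginx's stock auth_request module can delegate each incoming request to a small internal service, allowing on 2xx and denying on 401. The sketch below assumes the key layout cred:<user> → {salt, pw_hash} to match the issuing sketch above; it is an alternative to the nginx+lua approach discussed here, not what was actually built.)

```python
"""Hypothetical auth_request backend: nginx sends each request here
first; 204 means allow, 401 means deny."""
import base64
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

import redis  # pip install redis

r = redis.Redis(decode_responses=True)

class AuthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        header = self.headers.get("Authorization", "")
        user = password = ""
        if header.startswith("Basic "):
            try:
                user, _, password = (
                    base64.b64decode(header[6:]).decode().partition(":")
                )
            except Exception:
                pass  # malformed header; falls through to 401
        cred = r.hgetall(f"cred:{user}") if user else {}
        if cred:
            digest = hashlib.sha256(
                (cred.get("salt", "") + password).encode()
            ).hexdigest()
            if digest == cred.get("pw_hash"):
                self.send_response(204)  # allow
                self.end_headers()
                return
        self.send_response(401)  # nginx relays the denial to the client
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8099), AuthHandler).serve_forever()
```

(On the nginx side this would be roughly `auth_request /_auth;` plus an internal location proxying to 127.0.0.1:8099 with the request body stripped — standard auth_request wiring, details depending on the deployment.)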
[20:13:09] bd808: as long as i have you, we've been discussing how to temper user expectations, especially around instant updates. Based on DBA advice it seems that even if we don't guarantee instant updates, if users get them they will expect them and yell loudly when they aren't instant. We've been talking about injecting a couple hours of delay into the update jobs, pulling weekly dumps to the servers, and other options.
[20:13:35] (by yell loudly i mean build things on top of it that depend on things we didn't guarantee)
[20:14:15] artificial delay seems annoying, but I understand the underlying problem
[20:14:45] it is hard to make folks understand that prod and non-prod may disagree and that should not be unexpected
[20:16:06] indeed
[20:16:46] bd808: some nginx+lua+redis sounds doable, might be a bit fragile initially but can probably be made to mostly work
[20:18:05] ebernhardson: there may be other things we can do too. I haven't dug around in nginx authz options for a long time
[20:18:25] but if we can do it in lua we can make that authz in nginx :)
[20:18:29] bd808: how much less useful do you think a service that does weekly updates to the searchable content would be? One of the benefits there is that it severs the link between prod hosts and the cloud service, depending only on already publicly available information (existing dumps). On the downside, it's only updated weekly and requires a large indexing job, which is also not ideal
[20:19:23] ebernhardson: it's better than nothing obviously, but it will make using it as a "grep for things to fix in the wikis" a more challenging customer experience
[20:20:05] I'm sure folks will find other really cool ways to use this data, but that grep case is an easy one to explain
[20:20:25] it's also the only actual use case we have right now :) The rest is a build-it-and-they-will-come scenario
[20:21:36] will ponder; since things are finally being stood up I'm trying to nail down the impl. I prefer real updates as well, but even figuring out how to properly delay updates is a pain
[20:22:30] ebernhardson: I'll try to make some time tonight to look at options on the nginx side and write up something to share with you
[20:22:40] bd808: thanks!
[20:51:03] bd808: oauth2 is provisionally on the roadmap for the MediaWiki REST API, actually
[20:51:19] tgr: cool :)
[20:51:35] * chicocvenancio smiles
[20:52:47] ebernhardson: bd808: we do have some private data in elastic, I think from phab; just a sanity check that it can't leak through here?
[20:53:32] chasemp: not that I'm aware of, but ebernhardson is obviously more clueful
[20:54:06] we would not be replicating the phab index(es)
[20:54:16] roger
[20:55:36] the actual method of getting data into the replica cluster is one of the things that ebernhardson said they were still working through, but we have not talked about anything outside of the CirrusSearch indexes
[20:56:12] cool cool, not trying to harsh your mellow, just wanted to put it on the radar
[20:56:24] As I understand it, all of that data is already public via documented url params on the prod wikis
[20:56:47] chasemp: sure, I took it as a well-intentioned inquiry from Security :)
[20:59:34] re: oauth1, you might want to use it to set up some kind of local auth, if you want to freeride on central settings (reject users who have been centrally blocked and so on)
[21:00:12] using it as an actual authentication method would be insane levels of overhead (that would not really be different for oauth2 either)
[21:00:59] unless you want to make your replica an oauth *server*, which seems way too much effort for just enforcing API policy
[21:02:08] chasemp: right, this won't be mirroring the prod clusters directly; rather, it would either be the cirrus update process writing to the secondary cluster, or something like that
[21:02:20] heard, tx
[21:02:23] tgr: In this particular use case, we are just wanting an "api key" of some sort so that we can block runaway scripts if needed
[21:03:32] it's not really about gatekeeping to keep people out of the data. Instead it's about having a way to cut off a pathologically bad client without resorting to locking everyone else out too
[21:04:03] right, this data is all public and we want to make it as accessible as possible, we just need ways to deal with the problems that will certainly arise
[21:04:47] bd808: blocking all kinds of supplementary accounts of a malicious user who gets blocked on-wiki has been a trendy topic recently
[21:05:20] tgr: fair enough. Out of scope here (I hope)
[21:06:30] since this isn't writable from the cloud side i hope it's out of scope
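
(The "cut off a pathologically bad client" requirement could live in the same hypothetical auth backend sketched earlier: a revocation flag plus a counter per key in redis, so an operator can block one runaway script without touching anyone else. Key names and the limit are invented for illustration.)

```python
import time

import redis  # pip install redis

r = redis.Redis(decode_responses=True)

def allowed(user: str, limit_per_min: int = 600) -> bool:
    """Return False if the key is revoked or over its per-minute budget.

    'blocked:<user>' and the fixed-window counter are illustrative; a
    real deployment might prefer a sliding window or nginx's limit_req.
    """
    if r.exists(f"blocked:{user}"):
        return False  # operator cut this key off explicitly
    window = int(time.time() // 60)
    key = f"rate:{user}:{window}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 120)  # keep the counter a bit past its window
    return count <= limit_per_min

# Cutting someone off is then a one-liner from an admin shell:
#   r.set("blocked:some-tool", "runaway script, see task number")
```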
[21:24:35] Should https://toolsadmin.wikimedia.org/ credentials work on WikiTech? I'm getting 'Auto-creation of a local account failed: Automatic account creation is not allowed.'
[21:32:15] RhinosF1, wikitech account creation is disabled right now
[21:32:58] that confusingly extends to attaching existing accounts too, which is what RhinosF1 is running into
[21:33:09] Krenair, at least it's not me. Why? How long for?
[21:33:32] well, from wikitech's perspective it doesn't really properly exist
[21:33:54] Krenair, what doesn't exist?
[21:34:07] you have an LDAP user but no corresponding entry in wikitech's user table
[21:34:08] RhinosF1: let me find the decent explanation I wrote on Monday...
[21:34:20] this can occur if you created via toolsadmin instead of wikitech
[21:34:27] or were imported from SVN or whatever
[21:34:34] RhinosF1: https://phabricator.wikimedia.org/T200184#5075876
[21:34:48] * RhinosF1 looking, bd808
[21:34:58] under normal circumstances you could just provide your credentials to log into wikitech and it would figure the rest out, creating you a local user account
[21:35:00] TL;DR is security incident response, and we don't know but are working on fixes
[21:35:04] Krenair, I created through Toolsadmin
[21:35:09] yes
[21:35:39] unfortunately right now account creation is disabled, so that's not possible
[21:37:42] I'll wait then. I wasn't planning on doing anything soon.
[21:45:04] Signed up to the wikitech mailing list so I'll keep an eye on that. Thanks for your help Krenair/bd808!!
[21:45:23] thanks for your understanding RhinosF1
[21:45:28] you're welcome
[21:47:01] No problem bd808
[21:49:41] bd808: actually, why is account autocreation disabled?
[21:50:15] tgr: no idea. by-product of one of the other switches we flipped on purpose?
[21:50:39] probably, yes; by default * has createaccount so there is no need for autocreate
[21:50:47] seems harmless to enable though
[21:53:25] bd808: btw have you seen https://phabricator.wikimedia.org/T200184#5080193 ?
[21:54:20] tgr: yes. I knew that was possible, but honestly we don't have anyone who can sit and interact with all the people to get them the passwords in a secure manner
[21:55:28] and honestly we still have no guidance on who is "safe" to allow in via that method
[21:58:39] make the outreachy/gsoc admins wiki sysops and let them do this?
[22:00:22] well, only gsoc at this point, but still, it's pretty crippling to not be able to provide developer accounts to people during the application period
[22:02:27] bd808: also, does password reset still need to be disabled now that T219277 has been closed?
[22:02:29] T219277: Wikitech password reset flow - https://phabricator.wikimedia.org/T219277
[22:03:20] would the new sysops be told how to identify those they should not let in?
[22:07:41] tgr: you'd have to ask John. I'm not in control of what is on and off here.
[23:12:19] Hey there... just came in to say hi to everyone!