[03:13:04] bd808, zhuyifei1999_: Woo! Thanks for the quick review and merge. :D
[04:35:55] Ivy: np
[15:02:00] (PS1) Rush: neutron dummies for rabbit and db [labs/private] - https://gerrit.wikimedia.org/r/419196
[15:55:49] (PS3) Chico Venancio: Add Chicocvenancio's key for Cloud Services [labs/private] - https://gerrit.wikimedia.org/r/405376 (https://phabricator.wikimedia.org/T185273)
[16:14:21] (CR) Rush: [V: 2 C: 2] neutron dummies for rabbit and db [labs/private] - https://gerrit.wikimedia.org/r/419196 (owner: Rush)
[16:53:56] gehel: So I reprovisioned deployment-maps01 and everything managed to install there (after I helped scap along a bit by half-manually deploying kartotherian and tilerator). But now both services crash immediately on startup, so they get into restart loops. And I can't figure out what the errors are
[16:54:13] (I have to miss the meeting in 5 mins because of a conflict, so I'm telling you here)
[16:56:31] RoanKattouw: I'll have a look... maybe an issue I have seen already...
[16:56:54] I stopped both services because I didn't want the instance to run out of disk space logging all those restarts
[16:56:58] RoanKattouw: that might be that the data isn't loaded yet
[16:57:12] there are some manual steps that have not been puppetized: https://wikitech.wikimedia.org/wiki/Maps#Manual_steps
[16:57:15] Oh right, I forgot to import data
[16:57:27] Oooh thanks
[16:57:49] they don't really make sense to automate, as we only do them once in forever
[16:58:01] and the data load itself isn't something we want to do in puppet
[16:59:14] ping me again if it isn't the data load.
[16:59:31] and note that we need to modify the procedure to load only a subset of data
[16:59:49] Paul is the right person if you need help on that side
[17:01:03] Thank you very much for that documentation
[17:01:09] That'll be super helpful
[18:08:49] !log deployment-prep Enable Extension:JADE, T176333
[18:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[18:08:53] T176333: Deploy JADE prototype in Beta Cluster - https://phabricator.wikimedia.org/T176333
[18:18:23] (CR) Rush: "Arturo is on clinic so hopefully can roll this out" [labs/private] - https://gerrit.wikimedia.org/r/405376 (https://phabricator.wikimedia.org/T185273) (owner: Chico Venancio)
[18:28:27] Hey, is creating diffusion repositories for tools with - in their names allowed? https://toolsadmin.wikimedia.org/tools/id/weapon-of-mass-description/repos/id/tool-weapon-of-mass-description seems to be stuck
[18:29:02] bd808,
[18:32:38] Urbanecm: yes
[18:33:01] As the repo is created as a number on the server, but you can have a display name
[18:35:28] Then... why is nothing happening?
[19:05:59] Urbanecm: reload the page. there's a display bug
[19:06:12] https://phabricator.wikimedia.org/source/tool-weapon-of-mass-description/
[19:35:17] I see quarry has a new limit on # of results. I think it is set far too low :( It's broken my workflow for ORES.
[19:38:45] I can't find any sort of relevant task to comment on.
[19:39:16] zhuyifei1999_, ^
[19:40:16] halfak: https://github.com/wikimedia/analytics-quarry-web/commit/d5e28455adfd53fe36f862ad3cd03d5258165cff ?
[19:40:30] T188564
[19:40:31] T188564: Quarry should refuse to save results that are way too large - https://phabricator.wikimedia.org/T188564
[19:41:37] halfak: is that the relevant change?
[19:41:44] Looks like it.
[19:41:59] 65536 is too small? :P
[19:42:17] * chicocvenancio was about to make a similar comment
[19:42:30] yeah. I regularly generate random samples in the range of 200k for ORES.
[19:42:38] Just one column of ints.
[19:42:43] That I then work with elsewhere
[19:42:50] I don't think this limit makes sense :\
[19:43:03] These sampling queries usually take less than 10 seconds.
[19:43:24] E.g. I was just about to sample 92k revisions from the last year of enwiktionary for use in an ORES training set
[19:43:25] It's because people do stupid shit
[19:43:37] You could do it from the command line ;)
[19:43:40] Right. It's just that this filter is filtering out less stupid shit.
[19:43:45] Quarry is important
[19:43:55] Result sets are persistent and easy to review.
[19:44:01] It's part of being open and transparent.
[19:47:04] It's also a shared resource
[19:47:46] halfak: maybe the interesting thing to try to figure out is how to measure the size in a way that isn't row based
[19:47:50] halfak: what limit do you suggest?
[19:48:09] I can increase it for sure, but definitely not unbounded
[19:49:13] it's way too frequent for people to SELECT * FROM revision;
[19:50:40] bd808, +1
[19:50:51] another method I just thought of would be signal handlers for SIGALRM. if a certain save takes > 60 seconds then kill it
[19:50:54] I think the issue here is that I'm not abusing a shared resource, but I'm getting caught by the filter.
[19:50:57] Reedy, ^
[19:51:08] does that sound sane to you?
[19:51:15] halfak: I don't disagree
[19:51:25] zhuyifei1999_, doesn't sound crazy
[19:51:34] But the point is, other people do... And you're unfortunately being caught in the crossfire :)
[19:51:42] k, will implement tonight
[19:51:46] Reedy, right which is a problem, right?
[19:51:52] Maybe
[19:51:58] But.. In your case, you've other ways of running queries
[19:52:02] You can do sql on the command line
[19:52:08] You've access to analytics and production slaves
[19:52:11] Reedy, not like this. We'd have to overhaul all of ORES.
[19:52:13] You don't have to do it in a nice gui
[19:52:23] o_0
[19:52:29] Reedy, it's not the nice gui that is important. It's the public storage of results.
[19:52:42] You can download ores and type "make models" and it will work (assuming you have dependencies worked out)
[19:52:51] It's about having a totally public pipeline.
[19:53:00] zhuyifei1999_: could add a "high limits" thing like we do in MW for trusted peoples
[19:53:16] We'd need to figure out some other way to store query results (and the queries next to them) in public.
[19:53:37] Reedy: I'd need to figure out how to store the user right grants
[19:55:12] (maybe in quarry's internal db..., and then how to grant them safely (preferably with an interface so no need to write SQL directly))
[19:55:52] zhuyifei1999_, yuvipanda and I discussed this in the past and decided it was a pain.
[19:56:06] zhuyifei1999_, just wondering, could you use the query estimator to solve for this?
[19:56:17] I think we could have a higher threshold based on number of columns x rows
[19:56:33] though time_spent seems a lot better to me
[19:56:38] query estimator?
[19:56:46] zhuyifei1999_, "explain"
[19:56:56] Estimates complexity and output size
[19:57:22] explain doesn't say how much output data are in the set does it? only the # of rows
[19:57:31] it's a guess
[19:57:56] Hmm... yeah, hard to say.
[19:58:40] zhuyifei1999_, yeah, bad idea. Looks like it can't really account for "where"
[19:58:53] halfak: PAWS may work for the time being
[19:59:13] chicocvenancio, actually, I think we're gonna need to be blocked until this is resolved.
[19:59:21] I don't want to have two standards in our workflow.
[19:59:33] Either we're converting 100% to a new standard or sticking with the old one.
[20:00:13] And FWIW, quarry has been amazing until this point for what we need it for
[20:00:37] chicocvenancio: you won't believe how fast `time sql --cluster analytics enwiki 'select page_id from page;' > /dev/null` runs. it definitely finishes before the query killer kicks in
[20:01:26] halfak: I'll change to that time limit, and can you test after that?
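The SIGALRM idea floated above can be sketched with Python's standard signal module. This is only an illustration of the proposed pattern, not Quarry's actual code: the 60-second budget and the save_results callable are hypothetical stand-ins.

```python
import signal


class SaveTimeout(Exception):
    """Raised when saving a result set exceeds its time budget."""


def _on_alarm(signum, frame):
    raise SaveTimeout("result save took too long")


def save_with_timeout(save_results, budget_seconds=60):
    """Run save_results(), aborting via SIGALRM after budget_seconds.

    Caveats: only works in the main thread of the main interpreter,
    only on Unix, and signal.alarm() counts whole seconds only
    (signal.setitimer() would allow fractions).
    """
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(budget_seconds)
    try:
        return save_results()
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

A save that blocks past its budget gets interrupted with SaveTimeout instead of grinding on until the OOM killer steps in, which matches the "what matters is that saving results takes forever" concern raised later in the discussion.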
[20:01:31] zhuyifei1999_: in paws
[20:01:45] zhuyifei1999_, sure! will do :)
[20:01:53] zhuyifei1999_, want me to write up a task for you?
[20:02:17] up to you. if not I'll just bug: T188564
[20:02:18] T188564: Quarry should refuse to save results that are way too large - https://phabricator.wikimedia.org/T188564
[20:02:29] zhuyifei1999_, maybe you could leave a limit at something really big too -- e.g. 1 million.
[20:02:42] OK I'll leave it be :)
[20:03:06] I'll never ask for a sample above 1 million for sure
[20:03:06] the limit doesn't matter. what matters is that saving results takes forever
[20:03:26] and eventually gets killed by the OOM killer
[20:03:43] Ahh gotcha. So it seems that timing that is the critical bit.
[20:04:08] zhuyifei1999_: what do you think about setting that to something like 650k but multiplying rows by number of columns?
[20:04:16] We'll be measuring column size * rows essentially.
[20:05:26] chicocvenancio: some rows can be large, like those that store strings. I simply didn't expect a legitimate use of 65k+ rows before
[20:05:51] but yeah now I see this
[20:05:59] * halfak is 2 legit 2 quit
[20:06:12] https://en.wikipedia.org/wiki/MC_Hammer
[20:06:18] yes, I didn't see it at the time either
[20:06:47] zhuyifei1999_ & chicocvenancio: I've got to say it's damn amazing of you to respond to me so quickly. :)))
[20:08:17] np I'll do the patch either 1h later or tonight, depending on whether I'll be available 1h later
[20:14:10] halfak: I remember ORES was killed once by some regex and broken timeouts. was it fixed by SIGALRM?
[20:14:37] Oh yes it was :)
[20:14:45] zhuyifei1999_, ^
[20:14:55] k ;)
[20:15:05] We used to use thread-based timeouts. Want to see the code change or do you have it figured out?
[20:16:41] I've done it already in other code a few years ago, but sure, curious how you did it anyway
[20:16:56] (^ was some bots)
[20:18:19] found it https://github.com/wiki-ai/ores/pull/215/files
[20:24:49] Ahh yes. Sorry was getting distracted
[20:25:04] SignalTimeout == no fractions of seconds anymore :|
[20:42:14] (CR) BryanDavis: [C: 1] Add Chicocvenancio's key for Cloud Services [labs/private] - https://gerrit.wikimedia.org/r/405376 (https://phabricator.wikimedia.org/T185273) (owner: Chico Venancio)
[21:05:24] halfak: how is stopit installed in ORES? pip? I don't see an apt package with the name
[21:06:00] Oh yeah. Via pip. We deploy with a repository of pre-built wheels these days.
[21:06:13] But the pattern is relatively simple if you want to re-implement.
[21:06:14] * halfak gets
[21:06:39] hmm. I guess it's easier for me to use python's builtin signal module instead
[21:06:52] in that case
[21:07:35] https://stackoverflow.com/questions/492519/timeout-on-a-function-call
[21:07:58] zhuyifei1999_, maybe some time later we can talk about converting quarry's python dependencies to use wheels. It's really nice.
[21:07:59] yeah ik that
[21:08:08] sure thanks
[21:08:28] I think it'll be easy too. Let's see if I boldly submit a PR. :D
[21:08:48] Where's the deployment configuration repo?
[21:09:04] * zhuyifei1999_ is not the person who knows about packaging...
[21:09:13] the deployment is configured by puppet
[21:09:37] hmm.. usually there will be a repo with configuration specific to our environment.
[21:09:44] * halfak knows how yuvipanda thinks
[21:09:50] https://phabricator.wikimedia.org/source/ores-deploy-wheels/ FYI
[21:10:04] https://github.com/wikimedia/puppet/tree/production/modules/quarry
[21:10:52] I'll bug you later to show me how you do deploys. :)
[21:10:57] k
[21:36:22] Hi. I have a question. So I am on maurelio@bastion-01:~$ and now I want to switch to deployment-tin. Doing maurelio@bastion-01:~$ ssh -A maurelio@deployment-tin.deployment-prep.equiad.wmnet isn't working. What am I doing wrong?
[21:37:11] (note that I have a .config file with proxy command set, just trying to do that without touching that file in case I want to go to other projects not in the file)
[21:38:22] Your key isn't forwarded?
[21:38:53] Reedy: the error I got indeed is a publickey one
[21:39:00] but ssh should do it right?
[21:39:16] only if called with -A when you logged in to the bastion
[21:39:25] ssh -A maurelio@deployment-tin.deployment-prep.equiad.wmnet
[21:39:29] Why are you trying to go from one host to another?
[21:39:43] if you have the proxycommand...
[21:39:45] chicocvenancio: indeed, I did ssh -A maurelio@primary.bastion.wmflabs.org
[21:39:56] Just logout and connect "directly"
[21:40:10] Reedy: the point is not to have to modify the file each time I want to switch to another instance :)
[21:40:18] Well, if you do, you've set it wrong
[21:40:26] or, you're doing it wrong
[21:40:33] For example..
[21:40:34] In prod
[21:40:39] I don't connect to a bastion, then to tin
[21:40:49] from my laptop, I go "straight" to tin.eqiad.wmnet
[21:40:54] is it `-A`? I think I get that wrong
[21:41:03] -A is key forwarding, yes
[21:41:13] well, agent
[21:41:22] Hauskatze: the implication of forwarding is that roots have access to your agent socket
[21:41:27] Reedy: thanks
[21:41:58] Reedy: indeed, I can do ssh -A deployment-tin directly using the config set
[21:42:15] Host deployment-tin
[21:42:16] ProxyCommand ssh -a -W %h:%p maurelio@primary.bastion.wmflabs.org
[21:42:16] UseRoaming no
[21:42:16] User maurelio
[21:42:16] you shouldn't need to use -A
[21:42:25] Right, your config is too simple
[21:42:37] (as was to be expected :) )
[21:42:47] I copied greg-g's file :)
[21:42:50] somewhat
[21:43:14] You want something like... Host *.eqiad.wmnet
[21:43:33] then ssh deployment-tin.eqiad.wmnet (possibly with username@ if necessary)
[21:44:22] * Hauskatze rm -Rf .ssh/* to clean the whole lot of stuff there
[21:44:37] careful of your ssh keys...
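Reedy's wildcard suggestion, combined with the ProxyCommand Hauskatze pasted, might look something like this in ~/.ssh/config. This is an illustrative sketch only: the pattern, bastion hostname, and username are taken from the discussion, not from a verified working config.

```
# Sketch: one wildcard rule instead of a per-host entry, so any
# matching instance works without editing the file each time.
Host *.eqiad.wmnet
    User maurelio
    # -a keeps the agent from being forwarded to the bastion itself
    ProxyCommand ssh -a -W %h:%p maurelio@primary.bastion.wmflabs.org
```

With a rule like this, `ssh deployment-tin.eqiad.wmnet` would tunnel through the bastion automatically, with no -A needed on the command line.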
[21:45:13] Reedy: with my current config on git I do ssh deployment-tin and I go directly there, so it's not an issue with my config file I think; most probably I'm not doing it right
[21:45:25] in any case, thanks
[21:45:38] yes, because you have a specific rule for that host
[21:46:01] ^ and that host alone
[21:46:26] Hauskatze: though
[21:46:33] -a disables forwarding
[21:46:39] (it's in your proxycommand)
[21:47:51] why would you want to forward your agent to deployment-tin ?
[21:48:37] So you can lazily hop to later hosts
[22:01:49] Platonides: do you have a minute?
[22:09:00] a minute, yes
[22:09:06] what do you want?
[22:09:10] woah
[22:09:13] -A to deployment-tin?
[22:09:20] that's worth avoiding
[22:11:31] Platonides: I'll talk to you in es-ops
[22:11:39] ok
[22:16:15] Krenair: it is the recommended method in https://wikitech.wikimedia.org/wiki/Help:Access#Accessing_instances_with_ProxyCommand_ssh_option_(recommended)
[22:17:15] with -a
[22:17:24] okay yes, not -A but -a
[22:23:12] -a and -A are completely different
[22:23:15] in fact, opposite
[22:24:47] yep
[22:25:09] well, my .ssh/config thing uses -a not -A so it's okay I guess
[22:34:21] the ssh -A on the command line to deployment-tin
[22:34:30] will take precedence over the ssh_config
[22:34:37] actually, the -a was applied to the bastion
[22:34:56] but you would be forwarding to the final host
[22:37:06] I think that with the config file ssh deployment-tin would be enough, as the "-a" thing is already there, right? I mean, "ProxyCommand ssh -a -W %h:%p maurelio@primary.bastion.wmflabs.org"
[22:39:05] ?
[22:40:42] nevermind [22:46:25] you are actually running two ssh instances [22:46:36] the one to deployment-tin with agent forwarding [22:46:50] and the background one to bastion that is providing the tunnel, without [22:48:43] with a modern ssh client, using the "ProxyJump" config is much easier to reason about and setup in your config -- https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Proxies_and_Jump_Hosts#Jump_Hosts_--_Passing_Through_a_Gateway_or_Two [22:50:10] oh, I didn't know about that new option [22:50:15] ProxyJump supports 'fancy' things like passing through hostA to hostB to hostC really easily. Make your own ssh TOR router :) [22:50:29] bd808: will have a look, just following Help:Access@wikitech :) [22:50:47] most of the docs on wikitech could use some love :) [22:51:05] there is a lot of stuff there that was written in 2013 [22:51:28] I'd love to help updating them but I lack the knowledge [22:51:40] just -W was a great advance over the netcats you had to use on ProxyCommands back then! [22:52:11] pretty much the best reference on openssh is https://en.wikibooks.org/wiki/OpenSSH
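As a hedged sketch, the ProxyJump alternative bd808 recommends could replace the ProxyCommand stanza discussed above like this (requires OpenSSH 7.3 or newer; hostnames and username are the illustrative ones from the discussion):

```
# Sketch: ProxyJump equivalent of the earlier ProxyCommand setup.
# The client manages the jump itself; no nested ssh command needed.
Host *.eqiad.wmnet
    User maurelio
    ProxyJump primary.bastion.wmflabs.org

# Chaining gateways (hostA -> hostB -> target) is a comma-separated list:
#   ProxyJump hostA.example.org,hostB.example.org
```

Note that ProxyJump, like the `ssh -a` ProxyCommand, does not forward your agent to the intermediate bastion; agent forwarding to the final host still requires an explicit -A, which is what the earlier discussion advises avoiding.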