[01:20:52] successful update from jessie to stretch. this is a simple machine, mostly just hosts files on a web server so not too painful \o/
[01:23:25] My process is exiting with the message "Killed". Is it getting hit by oom killer?
[01:23:38] this is on toolforge
[01:47:59] brion, a replacement or an in-place upgrade?
[01:51:54] Krenair: in-place upgrade via apt
[01:52:04] if that had trouble i was going to wipe and replace ;)
[01:52:27] :/
[01:52:35] we'll see if it explodes later
[01:53:55] I think we generally advise replacements instead of in-place upgrades but ok
[01:55:39] one thing to be aware of is the instance's image will still be listed as jessie
[01:59:57] which probably just means it may show up in a ticket about upgrading jessie instances in a few months, might need a comment to indicate it's an in-place upgrade
[02:23:32] heh. fair enough!
[02:23:42] i'll give it a reinstall later when i've the time to clean up a few things
[07:13:31] audiodude: Could be because it used too much memory. If you submitted the job using jsub, try increasing the memory limit.
[10:08:14] in a Kubernetes continuous job, is there some way to gracefully react to a shutdown/restart event?
[10:08:42] anything I can listen for in my Python code to do some cleanup work before Kubernetes destroys the container?
[10:09:55] or do I need to build some independent mechanism (e.g. monitor a certain Redis key in the job, and remember to always write “shutdown” to that key before I restart the worker externally)?
[11:58:42] that reminds me of another question – if I restart a webservice, how much time does it get to shut down before it’s forcibly killed?
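Note on the shutdown questions above: Kubernetes normally sends SIGTERM to a container's main process before destroying it, waits for the pod's termination grace period (30 seconds by default unless the pod spec overrides it), and only then sends SIGKILL. What grace period Toolforge configures is not stated in this log. A minimal Python sketch of reacting to that signal in a continuous job could look like the following; `run_worker`, `do_one_unit_of_work` and `cleanup` are hypothetical names for illustration, not part of any Toolforge or Kubernetes API.

```python
import signal
import threading

# Set once something (e.g. Kubernetes) asks the process to stop.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Only record the request; the main loop does the actual cleanup."""
    shutdown_requested.set()


# Must be registered from the main thread.
signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(do_one_unit_of_work, cleanup, poll_seconds=1.0):
    """Hypothetical worker loop: work until SIGTERM arrives, then clean up."""
    try:
        while not shutdown_requested.is_set():
            do_one_unit_of_work()
            # wait() returns early as soon as the event is set
            shutdown_requested.wait(poll_seconds)
    finally:
        cleanup()
```

The cleanup has to finish within the grace period, because the SIGKILL that follows cannot be caught. A webservice running under uWSGI receives the signal in the uWSGI process rather than in application code, so this sketch applies to plain Python jobs only.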
[11:58:54] I have some web requests taking up to a minute that I would prefer to not be interrupted
[11:59:07] (standard Python 3.4 uWSGI webservice, Kubernetes backend)
[13:33:03] !log hat-imagescalers deleting 'bonny' and 'docker-registry-test' after discussion with the creators
[13:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Hat-imagescalers/SAL
[13:57:41] !log git disabling puppet on gerrit-test5 temporarily
[13:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL
[15:21:18] I will let you know when I see lucaswerkmeister and I will deliver that message to them
[15:21:18] @notify lucaswerkmeister https://pracucci.com/graceful-shutdown-of-kubernetes-pods.html
[15:25:20] bd808: thanks <3
[15:26:39] lol
[15:27:11] hah, good timing apparently :D
[15:27:15] (I just checked the logs)
[15:43:36] CropTool got hit by an autoblock today (https://commons.wikimedia.org/wiki/Commons:Administrators%27_noticeboard/Blocks_and_protections#Toolforge_IPs_blocked?) Do you think it would help to add the Toolforge IP range to https://commons.wikimedia.org/wiki/MediaWiki:Autoblock_whitelist ? If so, what IP range should I request the Commons admins to add?
[15:46:26] danmichaelo: I guess the range could be this one https://github.com/wikimedia/operations-mediawiki-config/blob/d26e8858d5e12c445b3653d5c928593d8ebda49d/wmf-config/CommonSettings.php#L3552-L3555
[15:46:39] no opinion from me on whether whitelisting makes sense though, I can’t judge that
[15:48:25] lucaswerkmeister: Thanks! They already have 10.0.0.0/8 on the whitelist, so I will try requesting the other one to be added.
[16:06:16] bd808: I have a task that keeps getting killed after the big move. Where can I find logs why it's getting killed?
[16:24:40] multichill: that's not always easy to figure out in grid engine.
In theory the accounting logs can tell you some information about why the monitor killed something (qacct -j )
[16:26:10] I don't think we actually have hard limits on CPU time, so it's usually RAM related. But it can also be problems with the job scheduler itself and jobs getting killed as collateral damage when one of the admins tries to fix the scheduler on a particular exec node
[16:27:50] bd808: It's quite a memory-consuming job. I had problems with running out of memory too in the past
[16:28:26] For example job 1915552 got killed with exit status 137.
[16:29:27] Using quite a bit of memory, maxvmem 7.111GB. Would it be possible the executing host ( tools-sgeexec-0935.tools.eqiad.wmflabs ) ran out of memory? Timestamp is Sun Apr 14 15:23:58 2019
[16:30:14] Apr 14 15:23:57 tools-sgeexec-0935 kernel: [363381.673563] Thread Pool Wor invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0, order=0, oom_score_adj=0
[16:31:11] multichill: yeah, 7G could mean that the kernel OOMKiller got involved. Our exec nodes only have 8G of total ram
[16:31:47] Did that change with the switch? Because this job used to run without incident on the old grid
[16:55:07] multichill: exec nodes are the same size, but I think that the new grid tracks resource usage a bit differently
[17:12:28] bd808: Bummer. Disabled the tool. Any plans for bigger exec nodes? ;-)
[17:25:45] multichill: that's always possible, but today I think it's an outlier case. The "graduation" plan has historically been for things that outgrow the grid to move into dedicated Cloud VPS projects
[17:58:22] I'm trying to figure out how to set up a `deploy` user for a labs instance, and be able to SSH in as that user. I have created the user on the instance and added my pubkey to `/etc/ssh/userkeys/deploy`, and I've tried logging in by setting either ProxyCommand or ProxyJump to get into bastion as `ragesoss`, and then get into the instance as `deploy`.
I get into bastion, but the final step fails after the instance accepts the key, with `Connection closed by UNKNOWN port 65535`.
[17:59:34] Are there additional steps beyond creating the `deploy` user and adding pubkeys to `/etc/ssh/userkeys/`, in order to do this sort of thing?
[18:01:20] afaik you are not expected to manually create any system users, because LDAP is used as backend and puppet would create those system users
[18:01:42] the puppet classes related to deployment have code that sets up a deploy-user and key for them
[18:02:10] and there is keyholder to hold the keys for them
[18:02:55] if this is about things deployed with scap that is
[18:03:02] also it sets up proper sudo rules the user needs
[18:05:12] i think you need one instance with role deployment_server and another to deploy to, which uses scap::target
[18:06:33] okay. Krenair and valhallasw previously seemed to think it would be possible to set things up in a way that I could go through a `deploy` user for Programs & Events Dashboard.
[18:07:28] You're probably missing the security access rule ragesoss
[18:07:32] i am not sure. they might know better.
[18:08:39] The standard rule is that for a user to be able to log in to an instance, they must be in the project-$project LDAP group.
[18:09:31] Local users cannot be in such a group unless they also exist in LDAP, but that's a source of trouble for other reasons.
[18:10:46] The scap::target manifest that mutante mentioned is a good place to start to see how this works with prod/beta deployments.
[18:11:26] the final step would be to go to "Project Hiera" and set the deployment server to be used there
[18:11:29] It boils down to setting up the user, their keys, which you've done
[18:11:45] I do not believe ragesoss is setting up a standard wikimedia deployment server mutante
[18:12:06] ok.
just seems almost more work to recreate manually
[18:12:18] right, this is for the cloud VPS server that runs outreachdashboard.wmflabs.org
[18:12:20] But then crucially it creates a security::access::config rule.
[18:12:42] Krenair: the Hiera example was from doing it for Phabricator
[18:12:44] however, I think you're going about this entirely the wrong way.
[18:13:09] You should not be SSHing out of the bastion as anything other than your own user, or potentially root.
[18:13:24] If you need a custom user, SSH into your instance as your own user and run sudo.
[18:15:10] root would work fine, although I don't see my key in `/etc/ssh/userkeys/root`.
[18:15:50] There is a way to fix that but software deployments should not run as root either.
[18:15:56] the Rails deployment tool that the dashboard uses isn't compatible with going through sudo after login.
[18:16:11] I suggest you fix the tool or find a new one.
[18:16:50] that's not very practical in the short to medium term.
[18:20:26] I'm surprised puppet lets you put stuff under /etc/ssh/userkeys.
[18:20:43] Unless you are doing this through puppet
[18:21:07] maybe puppet will wipe it out, I guess?
[18:21:23] I just did it on the instance manually as root.
[18:21:49] (but of course, it didn't do any good without the security rule that you noted)
[18:23:10] I'd be kind of surprised if puppet did not purge unpuppetised resources under this directory
[18:23:26] Indeed ssh::server's file { '/etc/ssh/userkeys': contains recurse => true, purge => true
[18:23:42] yeah, it's gone now.
[18:24:02] I suggest you puppetise anything important
[18:26:47] that will also save you time in the long term. you don't have to keep redoing it each time there is a new instance
[18:26:55] for OS upgrades and other reasons
[18:27:33] that sounds like a pretty big project.
[18:27:52] maybe not if you can copy the setup from an existing project
[18:27:57] that deploys something else
[18:27:58] Depends how much stuff you have.
[18:28:00] but mostly uses the same setup
[18:28:24] But I will say that doing the pretty big project once into puppet is better than doing it several times over the lifetime of the system as instances need to be replaced.
[18:28:55] is there a starting point in the documentation you'd recommend, re: puppet?
[18:29:02] that, if you spend the time once it pays off in the future
[18:30:03] i would say find an existing project that uses a deployment server and see which puppet classes it applies
[18:30:14] you don't necessarily have to write entirely new code
[18:30:26] just find out which classes you need and put them on the instance using the UI
[18:30:43] then run the puppet agent to apply it
[18:31:09] paladox: ^ maybe you can show the example from where you deploy phab in cloud VPS
[18:31:21] I just use my user in the cloud
[18:31:41] it was far easier than trying to mess around with users/ssh.
[18:32:00] paladox: when i cleaned up the planet project the other day.. i found our old settings in Hiera ..setting our own puppetmaster and also deployment_server
[18:32:23] * paladox wonders why deployment_server was set :)
[18:32:34] so separate from the user used.. it is about the roles and the Hiera setting
[18:32:51] paladox: because we wanted to test "rawdog" change i think :)
[18:33:33] heh, i thought rawdog is in puppet not a scap deploy repo?
[18:33:40] the point is..
you have set deployment_server to something in hiera before, right
[18:33:46] and had a deployment server
[18:33:55] in whatever project
[18:34:15] ah yeh
[18:34:23] i have mine as phab-tin.phabricator.eqiad.wmflabs
[18:34:25] so just any random example for how you did that
[18:34:32] would be helpful for ragesoss
[18:35:15] i assume "applied deployment_server role" in Horizon on one instance
[18:35:34] and then "set Hiera to that deployment server on the other instance and add some scap class"
[18:36:23] i have "profile::mediawiki::deployment::server" set
[18:36:49] https://tools.wmflabs.org/openstack-browser/server/phab-tin.phabricator.eqiad.wmflabs
[18:36:58] i see.. so that is part of role::deployment_server without getting all the other stuff it does
[18:37:42] ragesoss: so yea.. you can start by just simply testing to apply a puppet class to an instance..in Horizon.. and use the one paladox mentioned above and then run 'puppet agent -tv' and see what happens
[18:37:57] so it should be GUI only and then one command
[18:38:17] (in an ideal world .. but you can go from there)
[18:39:01] I've done adding puppet classes and deploying them before, but I assume most of the work here would involve new stuff... probably not going to find examples of other puppet roles for a Ruby stack.
[18:39:33] afaict it shouldn't matter _what_ code you deploy for this part
[18:39:41] but also i don't fully know what you need
[18:39:45] Puppet itself is ruby
[18:42:04] I am confused.
[18:43:19] I think I will need to try to make sense of https://wikitech.wikimedia.org/wiki/Puppet first, perhaps.
[18:51:44] thanks, all! still feeling stuck, but I appreciate the help.
[18:54:33] There's several options to get this working in a hacky way but
[18:54:43] ragesoss: you don't need to care about all the complexity on that wiki page.
it's more about finding the right class and then slap it on an instance
[18:54:49] yeah there's not many easy and clean ways to do it
[18:55:17] and that classes can be uploaded to gerrit and merged ..if you need new ones
[18:58:22] Are we talking about "right class" as in a class that can add a deploy user, or the "right class" to set up a fresh instance for standing up my app?
[18:58:31] or something else?
[18:59:40] ragesoss: ideally .. both. one that does the setup for deployment.. each on the server side and the receiving side.. which should already exist. and then on top of it a new one that does all other things your specific case needs. and then they would be combined into a single 'role'
[18:59:59] that would mean half of it you should be able to reuse and the other half would have to be created
[19:00:42] in the end you would have a single 'role' that includes multiple 'profiles' and you can add that role class on a new instance.. and everything magically happens
[19:10:44] ragesoss: it might be more useful for you to do a generic non-wmf puppet tutorial. (puppet.com/learn-puppet ) if you want the basics like "how to install a package using puppet" . that wikitech page we have is about many things you don't even need and would be confusing ..it's more for people who run puppet masters and prod stuff
[19:11:27] i would skip that for now.. assuming what you want is more practical code examples how to do things like "install this package, clone that code. change those permissions"
[19:11:51] and of course you can git clone operations/puppet from gerrit and look around in the existing classes to get an idea
[19:12:04] yeah, that sounds right. I've been poking through the wmf puppet repo, https://github.com/wikimedia/puppet, but it's hard to find anything to orient on.
[19:13:48] ragesoss: if you go to ./modules/role/manifests/ you will see all the "roles". that is the highest level. the thing you apply to an instance.
you will see these classes mostly just include multiple other classes. the 'profiles'. they are collections of profiles that, together, do a certain thing
[19:14:23] then if you go to modules/profile/manifests/ you can find what the actual profiles do .. and they use modules
[19:16:08] the modules are things like "setup a webserver" or some other service but should stay generic so they can be reused by others
[19:16:23] they are all the other stuff in ./modules/
[20:47:25] bd808: Nah, I just need to rewrite it to use a database backend
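Note on the exit status 137 reported for the killed grid job earlier in this log: by the usual shell convention, a status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9, SIGKILL. That matches the oom-killer line quoted from the exec node, although the status alone does not say who sent the signal. A small Python sketch of the arithmetic, with `describe_exit_status` as a hypothetical helper name and assuming the accounting log follows that shell convention:

```python
import signal


def describe_exit_status(status):
    """Explain a shell-style exit status; values above 128 mean 'killed by a signal'."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Number does not map to a signal known on this platform
            name = "unknown signal"
        return "killed by signal {} ({})".format(signum, name)
    return "exited normally with code {}".format(status)


print(describe_exit_status(137))  # killed by signal 9 (SIGKILL)
```

SIGKILL cannot be caught or handled, so a job killed this way gets no chance to clean up; lowering memory use (for example by moving state into a database, as suggested in the last message) is the practical fix.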