[15:53:58] !log toolsbeta Adding toolsbeta-proxy-3 to the list of slave proxies in hiera (T267140)
[15:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[15:54:02] T267140: [toolsbeta] Rebuild servers to learn how to take down the services without downtime - https://phabricator.wikimedia.org/T267140
[16:28:08] !log cloudinfra add myself as user and projectadmin
[16:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL
[16:40:51] !log toolsbeta Moving active proxy from proxy-1 to proxy-3 (T267140)
[16:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[16:40:55] T267140: [toolsbeta] Rebuild servers to learn how to take down the services without downtime - https://phabricator.wikimedia.org/T267140
[19:06:30] bd808: Why is my screen getting killed on the bastion-dev host?
[19:07:04] bstorm: ^ that may be related to things you have done there?
[19:07:38] multichill: bstorm has been working on some automated long-running process culling
[19:07:39] multichill: you should have gotten an email about it. It's a script I set up to kill long-running processes on the bastion
[19:07:41] I've had a screen on toolserver/toollabs/toolforge for the last 10 years and this is the first time it's getting killed
[19:07:50] Good chance it'll need tuning
[19:08:14] I don't like being your guinea pig. Can you please not kill my sessions?
[19:08:29] I didn't. It's a service
[19:08:36] It killed mine as well
[19:08:45] Who set up this service?
[19:09:12] You, right? Can you please fix it?
[19:09:43] T266300
[19:09:44] T266300: Establish a systemd timer to remove long-running processes on the bastion in a random and somewhat friendly way - https://phabricator.wikimedia.org/T266300
[19:10:08] That's part of how it is supposed to work at this time. It will only kill something that is more than three days old right now
[19:10:56] The issue is trying to find a way to stop long-running processes without making NFS suck
[19:11:14] I want to make the experience of using a bastion better, but I still want to discourage long-running processes
[19:12:02] multichill: a constructive thing you could do to advance your cause is explain how and why having a long-running screen session allows you to do your work.
[19:12:30] +1
[19:14:08] bd808: That seems like the world turned upside down. I assume you know what screen is? I always have it open with just a couple of terminals so I can easily reconnect, check logs, check jobs, check other things, and that for multiple accounts
[19:14:47] multichill: I do know what screen is. And I run tmux on my laptop for the same purpose
[19:15:23] if you are saying the reason is that you do not like typing `ssh ...` then I guess that is what you can report as the use case
[19:16:24] I'm used to always having a screen running on a stepping stone with everything ready so I can just start. I do of course have to ssh in to reconnect to it
[19:20:44] brb, in meeting
[19:37:11] multichill: I'm very interested in your thoughts on this in general and can continue on the task if you want. The basic notion is that folks have seen the network restrictions to NFS as something not to mess with because the poor performance on the bastions prevents people from misusing them. I'm trying to find a different way to prevent that directly rather than relying on poor performance.
[19:39:01] My first stab at it is a service that is randomly killing 3+ day old processes.
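(Editorial aside before the conversation continues: a minimal sketch of what such a bastion process culler could look like. This is an illustrative assumption, not the script actually deployed; the real implementation is the puppet change linked later in the log, and the age threshold, exclusions, and random selection below are guesses based on the behavior described in T266300.)

```bash
#!/bin/bash
# Hypothetical bastion process culler (sketch only, not the deployed script).
# Kill one randomly chosen non-root process older than three days,
# skipping screen/tmux session servers.
MAX_AGE=$((3 * 24 * 3600))   # three days, in seconds

ps -eo pid=,etimes=,user=,comm= \
  | awk -v max="$MAX_AGE" '$2 > max && $3 != "root" && $4 !~ /^(screen|SCREEN|tmux)/ {print $1}' \
  | shuf -n 1 \
  | xargs --no-run-if-empty kill
```

Picking a single random victim per run, rather than sweeping everything at once, is one way to read the "random and somewhat friendly" wording of T266300; whether screen/tmux are actually excluded is exactly what is being debated below.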
I understand that is frustrating for you because you have a workflow that includes long-running screens. There may be other criteria I can train this on than simple running time, and other things we can do. It may also be able to filter `screen` commands and `tmux` stuff
[19:40:19] The goal isn't to stop screen and tmux workflows so much as to prevent using a bastion as a server in itself or running really long, intensive workloads off the job grid.
[19:41:24] We cannot guarantee a screen will always be running because reboots and things are required for maintenance sometimes, but it doesn't *have* to be a 3 day limit either
[19:51:17] !log tools.lexeme-forms deployed uncommitted JS fix, to be committed later if it works as intended
[19:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[20:15:36] bstorm: Of course no jobs on the bastion, and every once in a while it reboots. That's how it has always been
[20:16:19] multichill: I just proposed a patch to the script and added you as a reviewer
[20:28:19] !log tools.meetbot Stopped k8s webservice, started grid engine webservice, stopped grid engine webservice, started k8s webservice. This cleared the stuck state in the Toolforge front proxy which had a routing entry for a grid engine webservice for this tool. (T267368)
[20:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.meetbot/SAL
[20:36:46] bstorm: Thanks, maybe you can add mysql (client) too?
[20:37:01] To leave open for 3 days?
[20:37:40] It's just a shell on a database?
[20:38:27] Yeah, but the DBAs generally don't want us to leave idle connections open for a long time anyway. It often makes for problems when doing maintenance on them.
[20:38:49] Also, each connection does consume resources.
[20:39:06] It auto-disconnects after a certain time
[20:39:07] The query killer should kill it within 3 days if it were running something
[20:39:24] Yeah, for the analytics servers it's something like 4 hours
[20:39:39] So no need to kill the mysql client because the mysql server is already taking care of that
[20:40:12] Propose it on the review! We can maybe do that. I dunno at the moment. :)
[20:40:29] It's worth considering
[20:41:23] Just curious what everyone thinks
[20:43:23] https://gerrit.wikimedia.org/r/c/operations/puppet/+/639617
[20:50:23] I finally wrote up a bug for T267369 after fixing my 3rd or 4th tool stuck in this difficult-to-diagnose state.
[20:50:23] T267369: Front proxy can keep bad routing info for webservices previously running on the grid engine - https://phabricator.wikimedia.org/T267369
[20:56:15] bstorm: (As someone who hasn't been impacted) It looks reasonable to me.
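(Editorial aside on the 20:28 tools.meetbot entry and T267369: the workaround amounts to cycling the tool's webservice between backends so the Toolforge front proxy drops the stale grid engine route. A rough sketch of the commands is below; the tool name comes from the log, but the exact `webservice` type arguments depend on the tool's configuration and are omitted here.)

```bash
# Sketch of the backend-cycling workaround described in the 20:28 !log entry.
become meetbot                          # switch to the tool account on a bastion
webservice --backend=kubernetes stop    # stop the current Kubernetes webservice
webservice --backend=gridengine start   # briefly start a grid engine webservice...
webservice --backend=gridengine stop    # ...then stop it, so the proxy clears the stale grid route
webservice --backend=kubernetes start   # bring the Kubernetes webservice back up
```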