[13:04:36] !log tools.wikibase-databridge-storybook npm run build-storybook && tar -C storybook-static -c . | ssh toolforge sudo -i -u tools.wikibase-databridge-storybook sh -c rm -rf www/static/*; tar -C www/static/ -x # deploy locally built storybook (T237527)
[13:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibase-databridge-storybook/SAL
[13:59:56] !log tools introduce the `profile::toolforge::proxies` hiera key in the global puppet config
[14:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:01:08] !log tools drop hiera references to `tools-test-proxy-01.tools.eqiad.wmflabs`. That VM no longer exists
[14:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:45:06] !log tools cleaned up a container that was taking up 16G of disk space on tools-worker-1020 in order to re-run the puppet client
[19:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:55:31] !log tools drained tools-worker-1002,8,15,32 to rebalance the cluster
[19:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:09:30] does anyone know why the toolforge servers are so much slower these days?
[21:14:29] !help does anyone know why the toolforge servers are so much slower these days?
[21:14:29] Examknow: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[21:15:24] Examknow: depends on which servers and when?
[21:16:07] Are you talking about the bastions?
[21:16:14] yes
[21:16:56] Ok, that would either be because someone is running more on the bastions directly than they should, or because of contention on NFS.
[21:17:19] I can't even log in on my first try, so there's something going on there right now
[21:17:25] bstorm_: Will it speed up later?
[21:17:25] I'll try to fix that...
[21:17:34] bstorm_: thank you
[21:19:04] bstorm_: the servers just went down completely :(
[21:19:22] no, but something could be crippling them
[21:19:40] It's been up for 41 days
[21:19:49] Someone is running a lot on there
[21:19:49] hrm
[21:19:54] yeah
[21:20:24] Something just started that is slamming it
[21:20:56] bstorm_: is there any way to monitor the processes?
[21:23:56] I'm digging around to see what's running
[21:24:37] bstorm_: where?
[21:24:45] On the bastion
[21:25:02] oh
[21:28:43] I see a bunch of python procs in uninterruptible sleep
[21:28:48] that would slow down NFS
[21:29:20] yeah
[21:29:21] digging a bit more. It might actually be the heavy mem usage going on
[21:29:41] my instance just came back online
[21:33:42] ?
[21:33:48] You mean you just got into ssh?
[21:34:00] The issue is definitely NFS contention
[21:34:15] I have no slowness as root
[21:36:34] do you have root access?
[21:37:06] Yes, that's what I mean
[21:37:34] oh I don't
[21:38:42] If I touch the NFS, it hangs, though :)
[21:38:51] Trying to figure out what process is doing this...
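
A rough sketch of the kind of check being described here (not necessarily the exact command that was run): processes in uninterruptible sleep (state "D") on a bastion are usually blocked on NFS I/O, and can be listed like this:

    # List processes whose state starts with "D" (uninterruptible sleep);
    # the wchan column shows where in the kernel each one is blocked,
    # which helps confirm an NFS wait.
    ps -eo pid,user,stat,wchan:32,etime,cmd --sort=etime | awk 'NR==1 || $3 ~ /^D/'
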
[21:47:23] I was about to say
[21:47:30] I'm seeing slowness logging into the bastion
[21:47:34] and even `cd` as root
[21:47:59] Krenair: I can hardly even log into the bastions
[21:48:10] NFS problems would explain this
[21:48:23] Examknow, yeah well I can, it's just unusably slow
[21:48:24] Yes...and it's obvious from nfsiostat...just trying to find which process
[21:48:25] NFS = Need for Speed :-( :-( :-(
[21:51:35] something started at around 2000, maybe 1958?
[21:52:34] based on https://grafana.wikimedia.org/d/yKBG_upWz/wmcs-nfs-server-details?orgId=1&var-server=cloudstore1009&refresh=30s&from=now-3h&to=now
[21:54:49] better than plain guessing :)
[21:55:12] Nah, that's cloudstore1009. This is labstore1004
[21:55:26] huh, well, something might be up there too but ok
[21:55:41] cloudstore1009 is scratch and maps. Not worried about them
[21:55:47] They won't affect your home dir and login
[21:56:03] unless you are on a maps host...but not toolforge
[21:58:16] 2100-2106 then
[21:59:28] process 1289 on the bastion perhaps? gzip running since 2101
[21:59:50] gzip could do it perhaps. I was looking at it a bit earlier
[22:00:02] I'll kill it and see :-p
[22:00:41] It's dead
[22:00:45] -rw-r--r-- 1 tools.ftl tools.ftl 6.4G Nov 26 21:00 /data/project/ftl/data/logs/ftl-log-2019-Jul2-Nov26
[22:01:02] that looks much better now
[22:01:34] Yup
[22:01:40] I'll contact that user
[22:01:46] this was all still technically guessing
[22:01:54] I love NFS
[22:01:59] 😛
[22:02:22] I almost killed that one a few minutes ago as well, but I was trying to get some kind of hard number
[22:02:27] Just wasn't much
[22:02:48] it was on the assumption that whatever was the source of the problems started and then immediately wrought havoc instead of lying low
[22:03:19] and also that the source of the problems was running on the bastion
[22:03:35] so I just took the time on the graph when disk IO spiked and looked for processes on the bastion that began around then; that gzip stood out
[22:04:36] Examknow, Wurgl ^ FYI it should be better now
[22:04:53] Thanks a lot
[22:04:59] Krenair: Thank you
[22:05:38] email sent
[22:08:36] Thanks for the help, Krenair.
[22:09:35] no worries, I just happened to try to log into the bastion at the wrong moment :P
[22:11:10] Switching between two files within vi took a minute or so. Just typing Ctrl-W Ctrl-W
[22:11:32] Wurgl, you mean before or right now?
[22:11:43] before
[22:11:49] yeah
[22:11:58] Now it is excellent
[22:11:59] when this sort of stuff happens, any kind of filesystem operation becomes horrific
[22:12:24] Which includes simply logging in, because the home dir is on NFS
[22:12:27] Even the distance (me: Europe; server: USA) is no problem
[22:12:37] The reason I could use it was that I logged in directly as root, which is not on NFS
[22:12:44] mm
[22:12:54] That's how I knew what to look for...just not which one it was yet.
[22:13:02] Usually it's an scp that's massive
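
A rough sketch of the hunt described above (not necessarily the commands that were actually run), assuming the Grafana graph gives the minute the IO spike started:

    # Client-side view of NFS load on the tools home mount
    # (nfsiostat is usually shipped with the NFS client utilities):
    nfsiostat 5 3 /data/project
    # Processes sorted by start time; look for anything heavy (tar, cp, scp,
    # gzip, ...) that began around the time the graph spiked:
    ps -eo pid,user,lstart,etime,%mem,cmd --sort=start_time | tail -n 40
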
[22:13:03] I should add a key to root
[22:13:03] Root is cheating! :-)
[22:13:17] do I need to generate a special one on my yubikey or..?
[22:14:20] Krenair: are you sure you cannot already ssh to root@login.tools.wmflabs.org?
[22:14:32] I'm not sure without checking
[22:14:37] It's cloud-wide root that would be required
[22:14:45] root@login.tools.wmflabs.org: Permission denied (publickey,hostbased).
[22:14:51] Ok
[22:14:53] well, I could do it either way technically, right?
[22:15:04] either add it to the extra_root_keys key in the tools project hiera
[22:15:08] Krenair, bstorm_: Who has access to root?
[22:15:08] Yeah
[22:15:19] or make a change against labs/private.git to make it a global root thing
[22:15:23] employees of WMCS and certain users who are tools admins
[22:15:34] I'm a foundation employee
[22:16:12] Examknow ^
[22:16:43] Krenair is an admin in a lot of places.
[22:17:39] Trying to see if there's some better way to quickly track a heavy NFS user. Nethogs and similar tooling just shows you that root is using it...because nfsd and the NFS client run as root :-p
[22:17:43] I'm a volunteer, cloud-wide admin of sorts (though I haven't gotten around to making a key for root everywhere), and also more recently a tools admin
[22:18:28] it needs an NDA and stuff
[22:19:17] bstorm_, Krenair: ah
[22:20:05] I run into login problems a lot more rarely than I used to years ago, so direct root access is not much of a priority anymore
[22:20:27] I remember a few times having to sort out instances remotely over Salt to get sshd to function again :D
[22:21:24] https://wikitech.wikimedia.org/wiki/Help:Access_policies
[22:21:48] In terms of who has admin rights, Examknow ^^
[22:22:38] Can we look at what users have NFS file handles open at any given time, then repeat a few times to find long-lived ones and make some deductions from there? Still a guessing game, though...
[22:22:58] Yeah. Every login will have something on NFS
[22:23:19] I was drilling down into longer-running ones, but that was wrong :)
[22:23:40] so we can probably ignore like open directories and stuff
[22:25:24] Yeah
[22:25:48] Mind you, I'll probably just add gzip and friends to my initial check 😛
[22:25:51] oh so you tried looking into the older file handles and that turned out to be a red herring?
[22:25:58] Yeah
[22:26:32] do we have some high IO patterns where lots of handles are being opened and closed? :/
[22:27:16] We have before, but that is generally not on the bastion
[22:27:30] Usually, that has been on a non-tools VPS client host
[22:27:35] Running something like Apache
[22:28:05] On the bastion, it's almost always a tar, cp between filesystems, scp, etc.
[22:28:10] of something extremely large
[22:28:10] so what was wrong with looking at older open files?
[22:28:25] Nothing :) I just had the wrong files
[22:28:29] oh :D
[22:31:39] copying the VIAF files: 6GB in gzip format: http://viaf.org/viaf/data/viaf-20191104-clusters-rdf.xml.gz
[22:31:49] And then: gunzip!
[22:31:55] Bye bye NFS
[22:37:00] yeah, that'll do it
[22:37:17] what is our recommendation for doing that kind of stuff?
[22:37:32] Doing it on scratch is not a bad idea
[22:37:38] Then it doesn't impact user home dirs
[22:37:41] would ionice help with this?
[22:37:45] It'll still hang scratch
[22:37:46] (does that even work with NFS?)
[22:37:55] ionice is weird.
[22:38:01] I'm not sure it would work for NFS
[22:38:24] It is dependent on kernel scheduling peculiarities, so I wouldn't count on it on the NFS clients
[22:38:28] It works on the server
[22:38:56] Overall, I've usually suggested using scratch
[22:39:21] It is on a faster, newer server and not the home dirs. However, that is not a backed-up area, of course
[22:39:50] Scratch is on 10G ethernet
[22:39:54] the homes are not
[22:40:26] so when do we get 10G ethernet for tools homes? :P
[22:41:12] BTW: That VIAF example was a joke. I do this at home.
[22:41:22] :)
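
A minimal sketch of the scratch suggestion above, assuming scratch is mounted at /data/scratch on Toolforge hosts and using the VIAF dump purely as the example file:

    # Do the large download and decompression under scratch rather than the
    # tool's home on /data/project, so the heavy write load stays off the
    # home-directory NFS server.
    mkdir -p "/data/scratch/$USER/viaf" && cd "/data/scratch/$USER/viaf"
    wget http://viaf.org/viaf/data/viaf-20191104-clusters-rdf.xml.gz
    gunzip viaf-20191104-clusters-rdf.xml.gz

Note that scratch is still NFS (served from cloudstore1009), so a job like this can still slow scratch itself down, but it keeps home directories and logins responsive.
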
[22:43:51] Krenair: Well, we have to upgrade the NFS servers soon (OS-wise), which should help with some problems we've had with kernel versions. If that can be rolled into 10G upgrades it might be nice, but they may need to be re-racked to do it.
[22:44:04] That might be more of a disruption, but either way.
[22:44:22] It would definitely help matters to have that.
[22:44:49] Right now, dumps, scratch and maps are all on 10G ethernet. Tools and most project NFS is not.
[22:47:06] Another note is that if a user needs something massive gzipped, it is better that they put in a phab task to have someone do it server-side. If the WMCS team does it directly on the server with ionice, it shouldn't impact NFS much, if at all.
[22:47:23] That's what I suggested to the user over email.
[22:47:30] If they still need it
[22:51:55] A different question: Is there any way to allow a user (read-only) access to other user databases? Like accessing s51412__data from a different account than tools.persondata
[22:52:12] GRANT statements in SQL …
[22:56:34] I'd rather that wasn't done even if there is a way to do it, instead favoring an API endpoint or something?
[22:57:34] !log tools push upgraded webservice 0.52 to the buster and jessie repos for container rebuilds T236202
[22:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:57:38] T236202: Modify webservice and maintain-kubeusers to allow switching to the new cluster - https://phabricator.wikimedia.org/T236202
[22:58:53] Wurgl: give the database a name ending in _p and then it is readable by everyone. Or add the second tool as a maintainer of the first tool; then the second tool can read the credentials of the first tool and get read-write access.
[23:00:43] Making a process for granting arbitrary rights to other accounts on toolsdb is not something we are likely to work on.
[23:00:48] read-only for just one other tool
[23:01:25] I was asked to split one tool.
[23:02:32] sure, additional read-only access for one tool is your request today. But there are ~2500 tools and ~1800 maintainers who could all ask for such things.
[23:02:41] The "original" tool needs access to the data of the new tool. Actually the new one is a side effect of the old one, but both are very different in usage and purpose
[23:03:14] Hmm … allow GRANT statements and make the users responsible.
[23:05:28] And yes, I am thinking of an API, but that just causes a lot of headaches without a real solution
[23:06:01] yeah, if it can be safely made self-service that could work. My MySQL knowledge is too weak to know what all the edge cases might be
[23:09:50] The syntax is a little bit hairy. Mostly because it is very seldom used, so you have to read the manuals again and again
[23:25:16] !log tools rebuilding docker images to include the new webservice 0.52 in all versions instead of just the stretch ones T236202
[23:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:25:20] T236202: Modify webservice and maintain-kubeusers to allow switching to the new cluster - https://phabricator.wikimedia.org/T236202
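
Picking up the ToolsDB question from 22:51-22:58 above, a rough sketch of the _p option. Assumptions: the ToolsDB host name is tools.db.svc.eqiad.wmflabs, and s51412__data_p is a hypothetical world-readable copy of s51412__data, not an existing database.

    # A ToolsDB database whose name ends in _p is readable by every user,
    # so the second tool can query it with its own credentials file:
    mysql --defaults-file="$HOME/replica.my.cnf" \
          -h tools.db.svc.eqiad.wmflabs \
          -e 'SHOW TABLES' s51412__data_p

The alternative mentioned above (adding the second tool as a maintainer of tools.persondata and reusing its credentials) avoids keeping a second copy of the data, but grants read-write rather than read-only access.
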