[13:55:04] <arturo>	 !log admin [codfw1dev] rebooting cloudnet2003-dev into linux kernel 4.14 for testing stuff related to T247135
[13:55:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:55:06] <stashbot>	 T247135: codfw1dev unavailable? - https://phabricator.wikimedia.org/T247135
[17:02:11] <arturo>	 !log admin [codfw1dev] deleting address scopes, bad interaction with our custom NAT setup T247135
[17:02:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[17:02:13] <stashbot>	 T247135: codfw1dev unavailable? - https://phabricator.wikimedia.org/T247135
[17:34:04] <bstorm_>	 !log tools.jembot truncated the error.log file T247315
[17:34:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.jembot/SAL
[17:34:06] <stashbot>	 T247315: 2019-03-10: tools and misc nfs share cleanup - https://phabricator.wikimedia.org/T247315
[17:39:12] <bstorm_>	 !log tools.robokobot truncated virgule.err for robokobot T247315
[17:39:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.robokobot/SAL
[17:39:14] <stashbot>	 T247315: 2019-03-10: tools and misc nfs share cleanup - https://phabricator.wikimedia.org/T247315
[19:03:57] <Urbanecm>	 Hi everyone, I just got this error while using https://tools.wmflabs.org/meta/stalktoy/177.51.66.143. Might that be caused by some Cloud-side issue? https://usercontent.irccloud-cdn.com/file/FxAKBkqR/image.png
[19:04:00] <Urbanecm>	 cc SQL 
[19:04:47] <SQL>	 As Urbanecm mentions, I'm seeing similar log entries.
[19:04:50] <SQL>	 2020-03-10 18:58:22: (mod_fastcgi.c.421) FastCGI-stderr: PHP Warning:  mysqli::__construct(): (HY000/2002): php_network_getaddresses: getaddrinfo failed: Name or service not known in /data/project/ipcheck/public_html/oauth.php on line 95
[19:04:58] <SQL>	 line 95 being a connect to the meta db
[19:06:37] <SQL>	 Gotta go unfortunately
[20:51:12] <SQL>	 Urbanecm: Put in a ticket: https://phabricator.wikimedia.org/T247352
[21:00:03] <Urbanecm>	 thanks
[21:00:10] <SQL>	 np
[21:15:58] <bd808>	 SQL: thanks for the detail on that ticket. This looks like it could be some issue with the Docker images and/or networking on the 2020 Kubernetes cluster. We will look into it.
[21:16:14] <SQL>	 bd808: no problem. Thanks for looking into it :)
[21:19:13] <Davod>	 Hello, I want to run a tool under my tool account, but I seen limited resources and it crashed.
[21:19:52] <bd808>	 Davod: maybe this will help? -- https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration#Lower_default_resource_limits_for_webservice
[21:20:31] <bd808>	 there are now --cpu and --mem arguments you can give to `webservice start` to increase the quota for a Kubernetes container.
[21:20:49] <Davod>	 Actually, it is a cron job using the Grid.
[21:21:03] <bd808>	 Davod: ok, that's https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Allocating_additional_memory
[21:22:00] <Davod>	 Indeed, but there is also limited storage space, as some tools used hundred of gigabytes.
[21:23:40] <bd808>	 Davod: yes, some tools use a lot of local disk. We are not out of disk however
[21:24:07] <bd808>	 Davod: Maybe you can back up and explain what resources you are running out of and what you are doing when that happens?
[21:25:03] <Davod>	 Let me test
[21:28:50] <Davod>	 To be specific, I own tools.esfichataxon to process the Wikidata latest JSON dump (74 GB gzipped) with Wikidata Toolkit client.
[21:29:27] <Davod>	 Using the Bastion servers runs fine (but I stoped it as it slowed down it)
[21:29:37] <bd808>	 Davod: first thing I can say about that is that if you try to de-compress that dump to your tool's $HOME or even copy it, bad things will happen
[21:30:22] <Davod>	 Yep, I know that, as I see the dump may be HUUUUGE
[21:30:38] <Davod>	 this si why I hacven't uncompresseded
[21:30:55] <Davod>	 Filesystem                                                         Size  Used Avail Use% Mounted on
[21:30:55] <Davod>	 nfs-tools-project.svc.eqiad.wmnet:/srv/tools/shared/tools/project  8.0T  6.9T  719G  91% /mnt/nfs/labstore-secondary-tools-project
[21:32:56] <bd808>	 Davod: the dumps are mounted at /public/dumps from a separate NFS server cluster
[21:34:17] <Davod>	 oohh, I havemnt found that in the documentation I read
[21:34:54] <bd808>	 Davod: /public/dumps/public/other/wikibase/wikidatawiki
[21:35:29] <bd808>	 Those files are identical to the ones you would be able to download from https://dumps.wikimedia.org/other/wikibase/wikidatawiki/
[21:36:25] <bd808>	 The file system layout is the same as well, just replace https://dumps.wikimedia.org/ in any URL with /public/dumps/public/
[21:36:25] <Davod>	 Thanks
[21:40:55] <Davod>	 So, now, the problem comes when I run Java under the Grid engine... let me test again, with 2GB
[21:41:03] <Davod>	 of memory assigned
[21:43:30] <Davod>	 Could not allocate metaspace: 1073741824 bytes
[21:43:48] <bd808>	 Davod: java is hungry for RAM. You should ask for 4G to run that job
[21:43:59] <Davod>	 This is the output whatn a ran
[21:44:00] <Davod>	 jsub -mem 2g java -jar bin/wdtk-client.jar -a json --fProp P171 -i /public/dumps/public/wikidatawiki/entities latest-all.json.gz -n -z gz --fLang en,es
[21:44:09] <Davod>	 Indeed
[21:45:12] <bd808>	 We have an open ticket somewhere about trying to figure out why jdk11 really won't run on the job gird with less than 3G of ram.
[21:45:17] <bd808>	 *job grid
[21:45:26] <Davod>	 Whm
[21:45:39] <bd808>	 but there should be no problem finding a place to run your job with a 4G reserve
[21:46:00] <Davod>	 The only "but" is I'll run the task only once.
[21:47:22] <Davod>	 As I want to extract only specific data into a JSON file, namely, taxonomy, where is rarely updated, unless a new species is added to the database.
[21:48:13] <Davod>	 The other "alternative" is running at the Bastion server but...
[21:49:08] <Examknow>	 bastions are being really slow right now
[21:50:02] <Krenair>	 Examknow, hi
[21:50:07] <Davod>	 However, the whole login from another session became slow when I ran the program
[21:50:07] <Krenair>	 can you name one specifically?
[21:50:24] <Examknow>	 server hostname?
[21:50:30] <Krenair>	 yeah
[21:50:48] <Examknow>	 tools-sgebastion-07
[21:51:05] <Krenair>	 at first glance it doesn't look too busy Examknow 
[21:51:06] <SQL>	 FWIW if it helps, tools-sgebastion-07 was VERY slow recently
[21:51:12] <bd808>	 Davod: yes, this type of task is the right thing to figure out how to run on the job grid. Running it on a bastion will just get someone complaining and an admin killing your job.
[21:51:26] <Krenair>	 SQL, sometimes it does get busy yeah
[21:51:35] <Examknow>	 Krenair it could also be my connection
[21:51:40] <Davod>	 Yep, this is why is not an alternative
[21:51:41] <Krenair>	 often doing something that should've been run elsewhere :)
[21:51:42] <bd808>	 SQL: maybe because Davod was trying to download and extract data form the wikidata dumps :)
[21:51:43] <Krenair>	 possibly
[21:51:49] <Krenair>	 can you be more specific about what seems slow about it?
[21:51:58] * SQL points people at "webservice shell"
[21:51:59] <Examknow>	 just the ssh session
[21:52:12] <Krenair>	 well
[21:52:39] <SQL>	 bd808: could be related :P
[21:52:56] <Krenair>	 Examknow, is it slow to respond when you try to run a command or tab-complete and things?
[21:52:59] <Davod>	 I effectively noticied severe slowdowns when I ran heavy tasks at the Bastion server (when I tested them and see the stdout)
[21:53:15] <Krenair>	 yes, please don't run heavy tasks on the bastion servers
[21:53:23] <Davod>	 Yep, everything became slow when I ran those tasks in the past
[21:53:28] <SQL>	 Davod: IIRC, you are not supposed to run heavy tasks on the bastions
[21:53:36] <Examknow>	 I am not even running anything right now
[21:53:41] <Examknow>	 just typing code
[21:53:58] <Examknow>	 Krenair: It just is slow to respond to my typing
[21:54:00] <Krenair>	 so it's slow to display your text on the screen in a text editor?
[21:54:09] <bd808>	 it's NFS lag. I just killed a wget that was eating up disk IO
[21:54:24] <Examknow>	 yes and on the command line as well
[21:54:55] <AntiComposite>	 we really need a wiki page explaining NFS lag
[21:55:14] <Krenair>	 Examknow, is it better now?
[21:55:23] <bd808>	 AntiComposite: agreed. want to start one somewhere?
[21:55:33] <Davod>	 Do you killed a task I ran right now?
[21:55:59] <bd808>	 Davod: if it was a wget download from http://tools.wmflabs.org/wdumps/download/124 then yes
[21:56:13] <bd808>	 Davod: if that's you, run that as a gird job please
[21:56:20] <Davod>	 Yep... fortunately I haven't used Axel
[21:56:22] <AntiComposite>	 (also, the dumps are in nfs alreadY)
[21:56:50] <Davod>	 Where can I find those dumps (triggered by me)?
[21:56:59] <AntiComposite>	 oh, something else then, nevermind
[21:57:24] <Krenair>	 I did try listing my home dir and things that have been slowed down before during high NFS usage
[21:57:32] <bd808>	 Davod: downloading is ok, just do not do it directly from the bastion
[21:58:12] <Krenair>	 it didn't appear particularly slow to me :/
[21:58:15] <Davod>	 I found it!
[21:59:04] <bd808>	 The big issue that happens here is that we have network quotas setup for each instance talking to the NFS server. The login.toolforge.org bastion (tools-sgebastion-07) has a single quota for all the users using it.
[21:59:46] <bd808>	 so when one tool or user is reading or writing a lot of data to NFS (/home, /project) then nobody else can get their NFS reads/writes fit in
[22:00:27] <bd808>	 The grid engine has ~30 instances to spread things out across
[22:00:37] <Davod>	 ahh OK
[22:00:56] <bd808>	 so there is a) more total bandwidth to the NFS server, and b) less interactive work by humans that will notice slowness
[22:01:37] <bd808>	 b) is the really important part here honestly
[22:01:51] <bd808>	 software is a lot more tolerant of waiting than people are :)
[22:02:30] <Davod>	 So, I leave the dumps there
[22:03:09] <Davod>	 I seen wdumps (/data/project/wdumps) is using more than 700GB
[22:04:01] <bd808>	 templatetiger has historically used the most disk. But the at rest space used is not really th most urgent problem
[22:04:58] * bd808 says that as b.storm and j.eh are actively trying to recover disk space from tools
[22:05:12] <Davod>	 So.. WTF is templatetiger?
[22:05:18] <bd808>	 a tool :)
[22:05:37] <bd808>	 https://tools.wmflabs.org/admin/tool/templatetiger
[22:05:54] <Davod>	 Yep, but what does?
[22:06:11] <bd808>	 "Templatetiger extracts all templates from dumps. It intends to analyse the values contained in the templates and represent them in new ways, apply filters and do other useful stuff. Perhaps it will be obsolet with WikiDATA, but it can be useful to fill WikiDATA with correct informations from different Wikipedias."
[22:06:27] <bd808>	 see also https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Vorlagenauswertung/en
[22:06:30] <Davod>	 I just read the description Whm...
[22:09:25] <Davod>	 I have another question... are those NFS mounts public? I want to access them from my computer without downloading via HTTP
[22:10:25] <bd808>	 Davod: no, NFS only works inside the Wikimedia network. From home you would need to download the data from https://dumps.wikimedia.org/
[22:10:33] <bd808>	 or one of it's mirrors
[22:10:45] <bd808>	 https://dumps.wikimedia.org/mirrors.html
[22:11:25] <AntiComposite>	 bd808, do you think that NFS load explanation would be best on [[wikitech:Help:Troubleshooting Tolforge]], [[wikitech:Help:Shared Storage]], or a new page?
[22:11:52] <bd808>	 AntiComposite: a fine question. Where would you think to look for it first?
[22:12:18] <AntiComposite>	 Not knowing what the issue is already, probably troubleshooting
[22:13:30] <bd808>	 seems like a reasonable first place
[22:13:58] <bd808>	 srrodlund can always move it if she thinks there is a better location :)
[22:18:36] <bd808>	 AntiComposite: if you think of it, ping [[User:SRodlund]] on the talk page of whatever wikitech page you add the info to for a review. She can help clean up the wording and reorganize things if needed
[22:18:54] * bd808 loves that Sarah R is working on making our docs more readable
[22:19:32] * bd808 notices that srrodlund has arrived as though summoned :)
[22:21:08] <bd808>	 if y'all don't know her, srrodlund is a tech writer working on the Foundation's Technical Engagement team. She and folks she has helped organize have been steadily working on the organization and navigation for Toolforge and Cloud VPS help docs for ~2 years.
[22:21:27] <bd808>	 and I really think their work is making a difference!
[22:21:41] <srrodlund>	 :-)
[22:42:07] <AntiComposite>	 bd808, srrodlund: https://wikitech.wikimedia.org/wiki/Help:Troubleshooting_Toolforge#Bastion_is_slow
[22:43:36] <bd808>	 AntiComposite: thanks!
[22:44:27] <bd808>	 the bit about admins logged in as root not being affected is not 100% true, but the nit picking to explain that would be long and boring :)
[22:45:10] <srrodlund>	 AntiComposite: Thanks I will take a look at it today or early tomorrow!
[22:45:22] <srrodlund>	 I'm really glad you are contributing to the documentation! 
[22:47:40] <AntiComposite>	 yeah, I just removed that bit, it's not really relevant to most users anyway
[22:57:31] <Krenair>	 bd808, well as an admin without much experience with NFS, would I be correct in thinking that it still affects us if we try to interact with an MFS mount, whereas normal users a) default to logging in with a home directory on NFS and b) aren't able to touch much of the system that isn't on NFS anyway?
[23:05:59] <bd808>	 Krenair: I think a) is the main advantage for someone holding an ssh key for root@. b) is sort of a restatement of a)? The main interactive lag comes from having things in your session that interact with files from NFS on all actions. Like, for instance, having a shell script of some sort that runs as part of your $PS1 prompt.
[23:06:08] <bd808>	 oh-my-zsh kind of things
[23:07:19] <Examknow>	 Krenair: Sorry bout that
[23:07:23] <Examknow>	 yes it is better now
[23:07:25] <Examknow>	 thanks
[23:57:12] * AntiComposite wonders aloud if some sort of OONFS Killer could be set up to prevent this from happening every week
[23:58:58] <Platonides>	 just a better i/o scheduling strategy should help
[23:59:09] <Platonides>	 one that took uid fairness into account