[00:30:05] !log wmflabsdotorg Updated DNS to make tools-dev.wmflabs.org a CNAME of stretch-dev.tools.wmflabs.org
[00:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wmflabsdotorg/SAL
[00:30:51] !log tools DNS record created for trusty-dev.tools.wmflabs.org (Trusty secondary bastion)
[00:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:08:19] I have an issue with ssh into login.tools.wmflabs.org
[10:08:44] returns Connection closed by 185.15.56.48
[10:08:56] Any help on that, please?
[10:12:27] Eugene233: I am not the right person to ask, but an email was sent about the login server to the cloud list, maybe it could be related?
[10:15:04] I will just have to wait then, because I am not subscribed to the list
[10:53:26] jynus: Eugene233 is not in the Shell user group
[10:53:33] Could you please add him if you have the rights?
[10:54:05] xSavitar: https://lists.wikimedia.org/pipermail/cloud-announce/
[10:54:15] you can read it without being subscribed
[10:54:30] mutante: Okay!
[10:54:31] Eugene233: ^
[10:54:43] https://lists.wikimedia.org/pipermail/cloud-announce/2019-March/000142.html
[10:55:19] Thanks
[10:55:40] xSavitar: The legacy Ubuntu Trusty bastion will still be reachable as
[10:55:41] "login-trusty.tools.wmflabs.org"
[10:55:45] try whether that one works for you
[10:55:52] I can connect
[10:55:54] if it does and the other one does not.. then it's about the upgrade
[10:55:56] Eugene233 can't
[10:56:05] I think it's because he's not in the shell user group
[10:56:13] Both bastions are still accessible
[10:56:16] likely it is an issue where the client and server can't agree on a secure cipher or so
[10:56:31] https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation says so :)
[10:56:43] Hmmm....
[10:56:49] if he was not in the shell user group, that should affect both bastions the same way
[10:56:57] so let him try whether there is a difference between them or not
[10:57:06] Eugene233: I think mutante has made a valid point
[10:57:13] I remember changing my ssh keys
[10:57:18] Very recently actually
[10:57:28] it might be that the client version he uses is older
[10:57:35] Another reason too
[10:57:59] Eugene233: try login-trust.tools.wmflabs.org, does that work?
[10:58:01] mutante: returns the same thing
[10:58:07] login-trusty, sorry
[10:58:55] did you ever log in successfully in the past?
[10:59:00] or is this the first time
[11:01:05] Eugene233: ^^
[11:01:26] mutante: First time
[11:02:23] Eugene233: ok, well that changes things. then please create a request: https://wikitech.wikimedia.org/wiki/Help:Getting_Started#Get_started_with_Toolforge
[11:02:31] https://toolsadmin.wikimedia.org/auth/login/?next=/tools/membership/apply
[11:09:33] mutante: maybe if Eugene233 makes the request, you can process it for him? :)
[11:09:56] mutante: I have made the request
[11:10:17] Nice one Eugene233, follow mutante's advice from here :)
[11:12:08] xSavitar: I am not on the team that usually handles these requests, so I'd rather leave it at that
[11:12:19] just happened to see the question here
[11:12:29] Oh okay, thanks a lot mutante
[11:12:34] np
[11:34:51] hey xSavitar, may I help you?
[11:35:14] oh, is Eugene233 having issues?
[11:35:19] arturo: Yeah, actually it's Eugene233
[11:35:24] Yes
[11:37:03] arturo: yes I am
[11:38:28] Eugene233: interesting `Mar 8 10:48:39 tools-sgebastion-07 sshd[25086]: Failed publickey for eugene233 from x.x.x.x port 5801 ssh2: RSA SHA256:4obOCRnWOzk6R5hLh/I+aBA5vw1ARHYVhd4PI6lDfH0`
[11:38:35] it seems you are using the wrong ssh key?
[11:57:05] arturo: I think I am using the same public key I have on wikitech
[12:12:31] Eugene233: Can you double-check?
[12:13:54] https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-openstack
[13:23:27] I'm working on a tool for editing that's supposed to work on all Wikimedia wikis
[13:23:34] what's the best way to identify a user in that case?
[13:23:44] (e.g. write "this batch was created by user X" to toolsdb)
[13:23:54] should I use the metawiki user_id, or get the CentralAuth ID and use that?
[13:25:15] centralauth's COUNT(*) FROM globaluser (57M) is much larger than metawiki's COUNT(*) FROM user (23M), but I'm not sure if those are users I need to worry about or if it's just a bunch of legacy accounts
[13:28:14] lucaswerkmeister: I don't know, but if I had to guess, I would say centralauth. if there are unattacked accounts, you can ask people to attach their accounts before using your tool? This is not an informed comment, check documentation/code for more reliable details
[13:28:22] *unattached
[13:31:15] okay thanks, I'll try to go ahead with centralauth then
[13:54:49] xSavitar: I have double-checked
[13:55:15] I can even SSH to gerrit
[13:55:22] what's the issue?
[14:16:54] Eugene233: Gerrit and Wikitech do not share ssh key configs
[14:17:00] Iirc
[14:32:04] According to an error I saw, a tool has max_user_connections:1 for a database backend. That seems like a mistake, or is that meant as the default?
[14:32:11] Got this on https://tools.wmflabs.org/meta/crossactivity/Krinkle
[14:32:18] Exception: SQLSTATE[HY000] [1226] User 's52256' has exceeded the 'max_user_connections' resource (current value: 1)
[14:32:37] Got it for both s6.web.db.svc.eqiad.wmflabs and s8.web.db.svc.eqiad.wmflabs.
[14:34:16] chicocvenancio: that is correct
[14:35:03] chicocvenancio: the only thing wikitech shares is ldap, with horizon, openstack, toolsadmin and a few other things (and other things I have no clue exist)
[14:37:29] Eugene233: the key used by gerrit is set in the gerrit web UI, as opposed to the wikitech UI
[14:37:39] they could be different ones
[14:39:01] Krinkle: that would be https://phabricator.wikimedia.org/T217853
[14:40:06] mutante, chicocvenancio, Zppix: thanks for following up, I got distracted with something else
[14:40:42] jynus: hm.. that runs on the same account?
[14:40:49] I don't see it in the interface
[14:41:49] arturo: no problemo
[14:42:41] I don't know what you mean; that account was creating 100% cpu over 16 cores, and yesterday people complained about that here
[14:42:51] ^@Krinkle
[14:43:43] jynus: k, https://phabricator.wikimedia.org/T217853#5011201
[14:49:27] jynus: Can you see the mapping between s* names and tools shell accounts? e.g. does it belong to tools.meta or tools.drtrigonbot?
[14:50:02] I probably see the same thing you do, I just use https://tools.wmflabs.org/contact/
[14:50:52] it returns #Pywikibot-catfiles tool-labs-tools-drtrigonbot---catimages pywikibot-catimages pywikibot-catfiles gsoc-catfiles file-metadata
[14:50:57] for 52256
[14:51:37] <arturo> Eugene233: interesting `Mar 8 10:48:39 tools-sgebastion-07 sshd[25086]: Failed publickey for eugene233 from x.x.x.x port 5801 ssh2: RSA SHA256:4obOCRnWOzk6R5hLh/I+aBA5vw1ARHYVhd4PI6lDfH0`
[14:51:37] <arturo> it seems you are using the wrong ssh key?
[14:51:41] Actually arturo, I think something is wrong there
[14:51:51] I looked up Eugene233's key in LDAP, and stuck it through ssh-keygen -lf
[14:52:03] 2048 SHA256:4obOCRnWOzk6R5hLh/I+aBA5vw1ARHYVhd4PI6lDfH0 agboreugene@gmail.com (RSA)
[14:52:26] Could it be a blip earlier? I've had my key rejected a few times due to ldap issues.
[14:52:44] So why is there a log error saying that exact key failed?
[14:52:48] might be that, yeah paladox
[15:02:38] jynus: thanks, didn't know about that.
[15:02:48] I wonder where it gets that phab info from.
[15:03:35] jynus: How do you know it's Pywikibot-catfiles btw, did it say that in the query in a comment or so?
[15:04:20] I don't know which tool produces the query; from my side I only see the account (52256)
[15:04:39] I can increase the limit if it is creating issues, but what do I do when people complain about lag?
[15:04:45] Oh, okay. Then it's probably not related to Pywikibot-catfiles.
[15:05:02] I don't know where tools/contact gets that from, but tools.meta has no relation to Pywikibot-catfiles as far as I know.
[15:05:28] Krenair: the same key error is present many times in the logs. I can investigate more later
[15:06:22] yesterday I got "https://tools.wmflabs.org/guc has lag, I cannot use it"
[15:06:25] k
[15:06:49] so I am okay with any solution, as long as it is agreed by mostly everybody
[15:07:29] You can get this result using `ldapsearch -LLLx uid=eugene233 sshPublicKey | tail -n 7 | sed -Ee "s/^(sshPublicKey:)? //" | tr -d "\n" | ssh-keygen -lf -`
[15:07:43] jynus: yeah, that's fine, don't change it, they need to fix it.
[15:07:56] note I did not block the account fully, just reduced the rate to the web server; they have 2 other servers they can use
[15:07:58] jynus: but they (I've assigned it to Patho now) will need to know what was wrong. Can you give them a query maybe?
[15:08:09] or is it not a specific query causing issues?
[15:08:15] are they doing writes?
[15:08:19] tools.db?
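The `ldapsearch | ssh-keygen -lf -` pipeline above checks the key stored server-side; the other half of the comparison is the fingerprint of the key the client is actually offering. A minimal sketch of that client-side check, assuming the key lives at the conventional `~/.ssh/id_rsa` path (adjust if yours differs):

```shell
# Print the SHA256 fingerprint of the local public key; this should match
# the fingerprint in the server's "Failed publickey" log line if the
# client is offering the key you think it is.
ssh-keygen -lf ~/.ssh/id_rsa.pub

# ssh -v also reports which keys it offers during the handshake, e.g.
# lines starting with "Offering public key:".
```

If the two fingerprints match, the key pair itself is fine and the failure lies elsewhere (account permissions, LDAP blips, server config).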
[15:08:22] it was a select
[15:08:34] to wikireplica, did not touch tools access either
[15:08:40] ok
[15:09:07] that doesn't really narrow it down for them, given it's a dozen different tools with hundreds of different queries :)
[15:09:29] it's a dozen different tools, so probably that is the most important issue
[15:09:40] they should use different db accounts for different tools
[15:09:56] so that if one misbehaves, it doesn't affect the others
[15:09:57] is every special page in mw a different tool?
[15:10:08] special page in mw?
[15:10:16] it's made by one person, all for the same purpose. Just different features.
[15:10:40] See sidebar at
[15:10:44] I mean, not for every page, but for each functionality, as you said it had "dozens of different tools"
[15:11:46] anything where, if one goes down, the other should still be up
[15:11:50] if one tool is more popular, because it has a single page that is popular (like GUC), or because it has 10 pages that are together popular, we typically increase the conn limit, right? (Assuming it does good queries)
[15:11:58] yes
[15:12:09] that's the case here.
[15:12:25] the code is together as one tool, it's not going to be split for the users or the one author that maintains it.
[15:12:39] not sure about "typically", but I think we never refused one if the request was reasonable
[15:12:42] but yeah, it needs to be fixed for this bug.
[15:12:51] it can technically be a single tool
[15:13:02] so the bug is that it has a select somewhere that causes timeouts, or lag? I don't know how that works.
[15:13:11] no, heavy cpu usage
[15:13:24] right, but from the tool perspective, what are they doing wrong?
[15:13:25] which leads to eating all resources
[15:13:37] cpu doesn't mean anything to them :)
[15:13:40] using too much cpu; normally that is due to expensive queries
[15:13:52] usually queries not using indexes
[15:14:29] I have not done all the research, I sent the ticket so we can follow up
[15:14:55] OK. Yeah np.
[15:14:57] but after the rate-limit, the cpu went from 100% on 16 virtual cores to 30-50%
[15:15:06] Will it be possible to get a sample of a query that is problematic?
[15:15:12] yes, I am getting that
[15:15:16] I had it in my log
[15:15:17] cool cool
[15:15:19] no worries
[15:15:33] Thanks!
[15:15:55] the thing is, I could have done nothing, as it was not so bad as to cause server issues
[15:16:05] but other tools were being affected
[15:16:26] so we need a protocol, where people say what they prefer to do :-)
[15:16:51] but "no lag/slow queries" and "do not reduce limits" are incompatible :-)
[17:49:27] !log tools Re-enabled 4 queue instances that had been disabled by LDAP failures during job initialization (T217280)
[17:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:49:30] T217280: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280
[18:14:26] zhuyifei1999_: I have this inexplicable disk usage.
[18:14:44] du shows 4GB being used, yet the disk is full
[18:14:57] toolforge?
[18:15:00] or a vps?
[18:15:03] vps
[18:15:27] did all processes using the files free their fds of the files that were deleted?
[18:16:03] No idea.
[18:16:35] I use fopen, fread, flock, fwrite, and fclose
[18:16:59] On shutdown, an auto-called function closes and deletes all created files.
[18:17:07] out of inodes?
[18:17:19] Reedy: dunno. It's php
[18:17:54] Cyberpower678: by "disk is full" do you mean 'df' is full or 'df -i' is full?
[18:17:55] So it's using unlink to delete the file. I believe unlink alters the inode
[18:18:18] unlink does not alter the inode. it alters the directory entry containing the inode
[18:18:25] zhuyifei1999_: the first one
[18:18:45] then that's the case of fds not being freed
[18:18:58] see lsof -p
[18:19:30] zhuyifei1999_: what do I use for pid?
[18:19:47] the process that you suspect opened the files
[18:19:58] I have more than 30 processes
[18:20:03] uh
[18:20:51] then just `lsof`
[18:21:11] that should do all processes it can find
[18:22:12] it should like some fds being '(deleted)'
[18:22:20] *like => list
[18:22:51] zhuyifei1999_: I see multiple gigs of deleted files in /tmp
[18:23:17] well, then you didn't close them properly :P
[18:24:51] if you want to free the space, you can either restart the processes, reboot, or `truncate -s 0 /proc/<pid>/fd/<fd>`, but that risks crashing the process if it doesn't expect the truncation
[18:26:49] ^ /proc/<pid>/fd/<fd> is an easy way to revive a deleted file as long as the file is still open in at least one process
[18:27:09] zhuyifei1999_: how would I close them properly? IABot doesn't create files that are gigabytes in size.
[18:27:17] It looks like /dev/null logs
[18:27:59] the close(2) syscall. idk how php handles fds...
[18:28:56] What is fds?
[18:29:04] file descriptors
[18:29:38] a number used by the syscall interface to identify an opened file of a particular process
[18:30:06] Well PHP should be handling them, otherwise it's a bug with PHP. I'm following the documentation of the functions.
[18:31:50] it could be, yes, but I'm not too familiar with php to confirm or deny that. but given the huge userbase of php, that is unlikely to have never been noticed
[18:34:52] I see no lingering files generated by PHP itself. So the program is closing and removing them correctly.
[18:35:21] Besides, this is a new issue. IABot has been running autonomously on that VPS for months without issues
[18:35:55] without issues... maybe the file is just growing in size but you didn't notice?
[18:35:57] (or it has been slowly building those files?)
[18:36:06] did you check graphite?
[18:36:13] or nagf?
[18:36:16] * chicocvenancio jinxes zhuyifei1999_
[18:36:20] or grafana?
[18:36:26] chicocvenancio: huh?
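The "du says 4GB, df says full" symptom discussed above can be reproduced in a few lines on any Linux box: unlink removes the directory entry, not the inode, so the space is only released when the last open file descriptor is closed. A minimal sketch (Linux-only, since it reads `/proc`; `tail -f` stands in for the process holding the file open):

```shell
tmp=$(mktemp)
tail -f "$tmp" > /dev/null 2>&1 &   # background process holds the file open
pid=$!
sleep 1                             # give tail a moment to open the file
rm "$tmp"                           # directory entry gone, inode still alive
ls -l "/proc/$pid/fd" | grep deleted  # the lingering fd shows as '(deleted)'
kill "$pid"                         # closing the last fd finally frees the space
```

This is exactly the state `lsof` reports with `(deleted)` entries, and why `sudo killall php` later in the log frees the disk.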
[18:37:05] (/me doesn't get the reference)
[18:37:31] it's just that we wrote the same thing at almost the same time
[18:37:39] https://www.reddit.com/r/AskReddit/comments/1myheh/when_two_people_say_the_same_thing_at_the_same/
[18:37:41] oh
[18:38:08] I misread your name. I thought cp678 said that
[18:38:36] zhuyifei1999_: no. IABot generates little files and deletes them after use. Even if they weren't deleted, IABot truncates them to 0 before using them again.
[18:38:53] Cyberpower678: what is the instance
[18:38:55] ?
[18:39:08] did you check the contents of the fds?
[18:39:59] zhuyifei1999_: I'm trying. But your command won't work.
[18:40:21] flock 1189 cyberpower678 1u REG 254,3 2441572352 68 /tmp/tmpfGUdjbA (deleted)
[18:40:22] flock 1189 cyberpower678 2u REG 254,3 2441572352 68 /tmp/tmpfGUdjbA (deleted)
[18:40:25] what do you mean by 'doesn't work'?
[18:40:44] then the link is /proc/1189/fd/1
[18:41:02] that's stdout of pid 1189 by convention
[18:41:08] I tried to call /proc/1189/fd/1u but I get an error
[18:41:15] it's 1, not 1u
[18:41:31] zhuyifei1999_: all stdout and stderr is going out to /dev/null
[18:41:40] So it shouldn't be doing that.
[18:42:12] zhuyifei1999_: that's the stderr file
[18:42:20] I just opened it
[18:42:29] this is indeed something that happened suddenly today https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1552070517.02&target=cyberbot.cyberbot-exec-iabot-01.diskspace.root.inodes_avail&target=cyberbot.cyberbot-exec-iabot-01.diskspace.root.byte_avail
[18:42:31] But I don't understand why it's hanging in limbo
[18:44:36] zhuyifei1999_: If you want, I can let you in. I'm confused here.
[18:44:51] 2 hours later. no time right now
[18:44:55] or 1
[18:45:10] (more like 1.5)
[18:45:47] * Cyberpower678 doesn't get why routing to /dev/null still creates a file that hangs around.
[18:45:57] it shouldn't
[18:46:05] how is the process started?
[18:46:42] zhuyifei1999_: here's a crontab entry: * * * * * flock -n /home/IABot/flock/enwikidead.lockfile php /home/IABot/IABot/deadlink.php enwiki dead &> /dev/null
[18:47:27] which host is this?
[18:47:40] cyberbot-exec-iabot-01
[18:47:46] I'll check if I can log in. will debug later
[18:48:29] yeah, I'm in. will debug later
[18:53:58] sudo killall php definitely frees up the disk
[19:01:45] Cyberpower678: actually, change of plans. I'm getting lunch at home rather than outside, so I'm able to look into it now
[19:01:56] cool
[19:02:47] * Cyberpower678 is getting lunch too. Undoubtedly, the stderr flooding is an IABot bug I am looking into as we speak, but it shouldn't be filling the disk space since it's going to /dev/null
[19:03:38] uh, do I not have sudo?
[19:04:40] zhuyifei1999_: probably not. I revoked that for security
[19:04:45] Gimme a sec.
[19:05:04] k
[19:05:38] there's no way I can check processes running under your account otherwise :P
[19:08:47] zhuyifei1999_: remind me how to add it on Horizon?
[19:09:04] I got it
[19:09:22] sudo access granted
[19:09:51] ok
[19:13:09] Cyberpower678: you didn't 'sudo killall php', right?
[19:13:19] cuz disk is still 100% full
[19:13:29] killall php frees the disk, but it fills pretty fast again
[19:13:46] Like I said, the newest beta has an issue that I'm working to fix.
[19:15:37] oh, so it was freed, but got immediately filled again?!
[19:15:45] Yep
[19:15:50] no wonder that 1.6T disk usage :P
[19:15:56] yep
[19:16:00] * Cyberpower678 hides
[19:17:17] is it the obvious solution, that the redirect to null only applies to `flock` but not to `php`?
[19:17:50] Hello. I noticed that the `python-mwclient` package is not available on stretch. What's the reason behind that?
[19:18:44] It seems fairly well maintained from what I can see.
[19:18:45] I just tested with `flock -n test sleep 100 &> /dev/null` and that works correctly
[19:19:39] both for flock and sleep
[19:19:45] but that doesn't produce any output?
[19:21:27] !log tools depooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
[19:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:23:35] Cyberpower678: I'm gonna strace cron and see what it's doing
[19:23:50] Okay
[19:24:09] You can play around with it. Cron should be saved to /home/IABot/crontab
[19:24:53] oh, I probably won't be modifying anything
[19:25:09] in the worst case I'll just use gdb
[19:26:23] gosh, strace was absolutely flooded when the cron timer hits the minute
[19:29:58] [pid 4028] 19:29:01 execve("/bin/sh", ["/bin/sh", "-c", "flock -n /home/IABot/flock/sdiywikimaster.lockfile php /home/IABot/IABot/deadlink.php sdiywiki master &> /dev/null"], [/* 5 vars */]
[19:30:19] so sh -c
[19:31:43] ah
[19:31:49] I can reproduce this
[19:33:22] Cyberpower678: apparently, dash's &> is different from bash's &>
[19:33:39] try `sh -c 'flock -n test sleep 100 &> /dev/null'`
[19:33:40] Niharika: I don't think there is a regular Stretch package for it. Is there a reason why you can't pip install it?
[19:33:53] you see they aren't redirected at all
[19:35:17] try SHELL=/bin/bash
[19:35:43] Ran it. Nothing was returned
[19:36:15] yes, then see the pid?
[19:36:19] and lsof it
[19:36:30] do you see it redirected?
[19:36:37] I mean fd 1 and 2
[19:37:14] I'm not sure what I'm supposed to be looking at.
[19:37:14] oh, ^ was assuming your 'ran it' refers to `sh -c 'flock -n test sleep 100 &> /dev/null'`
[19:37:15] doesn't 2>&1 do the same thing as &>? I suppose that might work on non-bash as well.
[19:37:33] SHELL=/bin/bash goes into the crontab
[19:37:39] https://unix.stackexchange.com/questions/94456/how-to-change-cron-shell-sh-to-bash
[19:37:44] and crontab(5)
[19:37:52] valhallasw`cloud: let me check
[19:37:57] zhuyifei1999_: at the top?
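The `&>` difference diagnosed above comes down to shell grammar: bash treats `&>` as "redirect both stdout and stderr", while a POSIX shell like dash (Debian's `/bin/sh`, which cron uses by default) parses the same text as `cmd &` followed by an empty `> file` redirection, so the command is backgrounded and nothing is silenced. A quick sketch of both spellings:

```shell
# bash: '&>' redirects stdout and stderr together, so this is silent.
bash -c 'echo noisy &> /dev/null'

# Under dash the same line backgrounds 'echo noisy' and then redirects an
# empty command, so "noisy" still reaches the terminal:
#   dash -c 'echo noisy &> /dev/null'    # prints "noisy" when sh is dash

# The portable spelling behaves the same in every POSIX shell:
sh -c 'echo noisy > /dev/null 2>&1'
```

This is why the strace of cron's `execve("/bin/sh", ["-c", "... &> /dev/null"])` explained the un-silenced stderr, and why both fixes in the log work: `SHELL=/bin/bash` in the crontab, or rewriting the redirect as `> /dev/null 2>&1`.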
[19:41:33] yeah
[19:41:44] valhallasw`cloud: yeah, `> /dev/null 2>&1` works as well
[19:42:52] I don't know dash syntax as well as bash's, so idk why &> is parsed like `& >`
[19:46:45] Cyberpower678: works?
[19:46:55] `&>` is a bash-ism. It's not posix shell syntax
[19:47:15] zhuyifei1999_: I haven't tried it.
[19:48:12] * zhuyifei1999_ is too bash-ed and knows little about what is a bashism and what is posix
[19:48:30] (except from those manuals in section 1p)
[19:48:30] https://mywiki.wooledge.org/Bashism -- that's a fairly good list
[19:48:36] !log tools repooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
[19:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:48:37] * zhuyifei1999_ looks
[19:49:45] wow, I don't know many of these bashisms
[19:49:58] bash has some cool magic :)
[19:50:37] yeah
[19:50:55] many cool bash things were borrowed from ksh
[19:54:50] zhuyifei1999_: alright, I added SHELL=/bin/bash to the crontab, loaded it, and ran sudo killall php
[19:55:18] Since it only takes about 5-10 minutes to fill the disk, we should know pretty soon if it works
[19:57:07] ok
[20:00:08] looks good so far https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1552075191.412&target=cyberbot.cyberbot-exec-iabot-01.diskspace.root.byte_free&from=-10minutes
[20:00:58] Disk space seems to be holding at about 35% usage
[20:12:58] :)
[20:49:45] zhuyifei1999_: still stable. Thanks for the help
[21:10:11] np
[23:07:20] Trying to migrate webservices and grid jobs from trusty to stretch. Having massive problems (shell hangs when listing home directory, ssh connection intermittently refused). Any idea what's going on?
[23:08:51] dschwen, connection refused or intermittent auth problems?
[23:17:29] dschwen: we are having some ongoing problems with the LDAP directory getting overloaded. Shell hangs could actually be related (conversion of numeric ids to names), or it could be a separate NFS issue.
More details about which server and what actions would be helpful.
[23:20:35] The login-trusty and login-stretch boxes
[23:20:49] LDAP overload could explain both issues
[23:23:16] By the way, am I getting this right? I also have to transfer the crontabs?
[23:23:47] I have a bunch of jsub in my crontabs and I'm changing --release=trusty to --release=stretch when I'm copying them over. Is that correct?
[23:25:38] dschwen: yes, you have to manually move the crontab from trusty to stretch. You can just delete the `--release=...` parts of any cron lines.
[23:26:10] Ok, thanks!
[23:27:03] *) -l release=trusty
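The crontab move described above can be done mechanically. A sketch, assuming cron lines that carry an explicit `--release=trusty` flag as in the conversation (the `crontab.stretch` filename is just an illustrative scratch file):

```shell
# Dump the old crontab and rewrite the release flag, then review the
# result before loading it on the stretch bastion.
crontab -l | sed 's/--release=trusty/--release=stretch/g' > crontab.stretch

# Per the advice above, the flag can also simply be dropped, since the
# grid default moves to stretch:
#   crontab -l | sed 's/ --release=trusty//g' > crontab.stretch

# Then, on the stretch bastion:
#   crontab crontab.stretch
```

Reviewing the rewritten file before `crontab`-ing it in avoids silently loading a mangled schedule.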