[06:10:55] hi all. is there any chance we get dotnet core installed on the web grid? I'm so sick of mono bugs
[06:16:11] sick of mono bugs <= +1. who would have thought running out of memory = infinite loop
[07:37:49] !log tools tools-worker-1015.tools.eqiad.wmflabs having severe NFS issues (all NFS accessing processes are stuck in D state). Draining.
[07:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[07:43:46] !log tools D states are not responding to SIGKILL. Will reboot.
[07:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[07:44:52] !log tools I saved dmesg and process list to a few files in /root if that helps debugging
[07:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[07:48:17] !log tools systemd stuck in D state. :(
[07:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:34:39] !log tools hard rebooted tools-worker-1015 via horizon
[08:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:37:52] !log tools uncordon tools-worker-1015.tools.eqiad.wmflabs
[08:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:41:36] not sure if known already, labstore1004 is at 1% free for /srv/tools
[09:22:52] (labstore1004:/srv/tools acked in -admin by gtir.loni)
[09:31:09] !log tools.iabot commented cronjobs, stop webservices and truncated Worker*.err files (T216988)
[09:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.iabot/SAL
[09:31:17] T216988: labstore1004 - DISK CRITICAL - free space: /srv/tools 115904 MB (1% inode=79%): - https://phabricator.wikimedia.org/T216988
[09:31:40] godog: thanks for the heads up, we're not receiving alerts from labstore1004 for some reason
[09:32:46] ah, no problem gtirloni
[09:33:21] I was looking at icinga alerts, didn't see the notification itself
[09:41:37] !log tools rebooted tools-sgeexec-09{16,22,40}
[09:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:50:17] !log tools rebooted tools-sgeexec-09{16,22,40} (T216988)
[09:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:50:23] T216988: labstore1004 - DISK CRITICAL - free space: /srv/tools 115904 MB (1% inode=79%): - https://phabricator.wikimedia.org/T216988
[10:32:22] !log admin restarted nfsd on labstore1004
[10:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:41:20] tools.wmflabs.org doesn't work
[10:43:41] annika: I'm checking
[10:47:33] I'm getting no response when sending commands to sgebastion
[10:48:08] error: commlib error: got select error (Connection refused)
[10:48:08] error: unable to send message to qmaster using port 6444 on host "tools-sgegrid-master.tools.eqiad.wmflabs": got send error
[10:57:33] JamesR: do you still get it now? I've started it
[10:57:44] gtirloni: checking
[10:59:12] gtirloni: qstat looks ok, however I get nothing when sending crontab -e
[10:59:34] JamesR: does it get stuck or do we get an empty crontab file?
[10:59:43] stuck
[11:00:17] ok, I can reproduce
[11:03:47] gtirloni: tools.wmflabs.org is working again, but multiple webservice tools are down
[11:04:38] yep, unfortunately I'm having to restart a bunch of servers after the NFS service was restarted. some servers recovered fine without a reboot but most did not. I expect things should be fine in 15-20min, I'm working on it
[11:12:32] we having tool.wmflabs conniptions?
[11:12:42] seeing irc bots falling away
[11:12:45] yes, they are working on it
[11:12:51] thanks JamesR
[11:13:04] channel subject was silent
[11:13:11] no worries - I thought my bots were going crazy ;)
[11:13:48] hard to tell among all the real loons, no good baseline :-)
[11:30:58] sDrewthedoff: unfortunately, I have to agree. we need to go back to the drawing board :(
[11:31:11] JamesR sDrewthedoff: how is it looking now?
[11:33:49] gtirloni: I'm getting permission error when trying to access a folder in my tool
[11:34:16] JamesR: what's your tool's name? is it on the old/new cluster? kubernetes or grid engine?
[11:34:16] take not working either
[11:34:57] gtirloni: tool name is james
[11:35:28] i've moved to sge
[11:36:31] gtirloni, I am just looking at irc bots, some are having issues
[11:36:46] for some reason its webservice wasn't running, i started it.. can you check now? and also, what's the directory that is giving you permission denied?
[11:37:10] sDrewthedoff: ok, let me know the tool names and any log file I should look at please
[11:37:14] and maybe these bots have disappeared from irc.wikimedia.org checking now
[11:37:31] gtirloni: I don't use webservice, only grid for perl/python bots. the folder is aivhelperbot.
[11:39:03] JamesR: it's missing 'x' permissions so nobody can enter it, you can try `chmod +x` on it to allow access https://www.irccloud.com/pastebin/HYeplH0r/
[11:39:41] gtirloni: thought I had tried that already! sorry!
[11:40:22] sDrewthedoff: I've noticed k8s didn't detect some hosts went down and was still reporting pods as running, so I had to restart some things. Just a hint of what might be wrong, just restarting them might fix the issue
[11:40:28] JamesR: no worries
[11:41:51] gtirloni, have we morphed login to now be stretch by default?
[11:42:29] if yes, then my crontab jobs have disappeared
[11:43:02] sDrewthedoff: not yet, it still goes to tools-bastion-03, which is the trusty cluster
[11:43:36] see if you can find them on login-trusty.tools.wmflabs.org
[11:43:55] nope, I had killed them there
[11:45:06] sDrewthedoff: let me check, what's the tool name?
[11:45:06] Dvorapa: is there a known outage that could be affecting PAWS?
[11:45:43] chicocvenancio: yes, nfs issues required a restart.. dealing with the fallout now
[11:46:25] ugh, it isn't letting me connect to stretch
[11:46:31] gtirloni: btw https://tools.wmflabs.org/sge-status/ also down
[11:46:46] Thanks, gtirloni likely paws will fix itself, I'll keep an eye
[11:48:14] billinghurst@tools-bastion-03:~$ ssh login-stretch.tools.wmflabs.org
[11:48:14] Permission denied (publickey,hostbased).
[11:49:32] went through removing the old fingerprint per instructions, and it won't let me in now
[11:59:28] opened T217015 for sge-status
[11:59:29] T217015: sge-status - https://phabricator.wikimedia.org/T217015
[12:00:15] !tools PAWS: killed proxy pod to attempt to get it to see routes to open notebooks servers T217010
[12:00:16] T217010: HTTP Error 502 / 503 when logging into https://paws.wmflabs.org - https://phabricator.wikimedia.org/T217010
[12:00:41] chicocvenancio: thanks!
[12:03:18] sDrewthedoff: I'm not sure what could be wrong, there might be a certain delay between updating your SSH key and it actually being active. just to check, you updated it here? https://toolsadmin.wikimedia.org/profile/settings/ssh-keys
[12:06:35] nope, it isn't my ssh that I am updating
[12:06:51] it is a new one on the server 172.16
[12:07:04] it is a new one on the server 172.16.7.167
[12:08:59] right, that's the new stretch bastion
[12:09:34] and it asks you to add it to your known_hosts
[12:10:11] and backs up your old to known_hosts.old
[12:10:21] oh, are you trying to login to login-stretch from login-trusty?
[12:11:04] login to stretch from the bastion
[12:11:10] that won't work
[12:11:21] login-stretch is a bastion, just for the stretch cluster
[12:11:30] login-trusty is the original bastion, for the trusty cluster
[12:11:41] try logging in to login-stretch from your laptop/desktop
[12:13:18] it had worked previously, but okay
[12:19:56] !log tools.wikiloves deleted local crontab on tools-bastion-03 (T217019)
[12:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikiloves/SAL
[12:20:00] T217019: wikiloves - Incorrect usage of crontab in Toolforge - https://phabricator.wikimedia.org/T217019
[12:20:07] Hmm, it seems I have forgotten my LDAP password. What's the best way to reset it?
[12:20:51] JamesR, forgotten password form on wikitech?
[12:21:07] !log tools PAWS: killed proxy and hub pods to attempt to get it to see routes to open notebooks servers to no avail. Restarted BernhardHumm's notebook pod T217010
[12:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:21:10] T217010: HTTP Error 502 / 503 when logging into https://paws.wmflabs.org - https://phabricator.wikimedia.org/T217010
[12:21:26] sDrewthedoff: that's weird, I don't think we have host-based authentication enabled but if you remember it working before (I honestly don't know), please open a phab task so someone more knowledgeable than me can check that
[12:22:05] Krenair: I have already tried today, and it's linked to one of two email addresses. Clearly I used the wrong one on first attempt, now locked out for 24 hours.
[12:25:13] gtirloni, I don't need it now that it is moved, so not worth chasing
[12:25:23] Never mind, found the email :)
[12:25:50] sDrewthedoff: got it, makes sense.
[12:26:15] * gtirloni wonders if there's more fallout to deal with
[12:26:29] 99242 0.40826 sulwatcher tools.stewar dr 02/11/2019 22:44:45 continuous@tools-sgeexec-0912. 1 <-- it is scheduled for deletion since 30 minutes or so (tools.stewardbots)
[12:26:48] Melos: I'll check that, one moment
[12:30:31] COIBot has not come back into IRC, though I think it is on its own instance on wmflabs, not through tools
[12:32:12] perhaps an email should be sent as multiple tools are still down (usually getting 502 error)
[12:32:41] Dvorapa: will do, good idea
[12:51:00] Melos: that job is gone now, does anything else need attention?
[12:51:20] gtirloni: no, thank you :)
[12:51:29] cool, thank you. and sorry
[12:54:12] gtirloni: kmlexport is getting the following error: `Restarting webservice...............ERROR: Pod resisted shutdown`
[12:54:25] !log tools PAWS: Restarted Criscod notebook pod T217010
[12:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:54:29] T217010: HTTP Error 502 / 503 when logging into https://paws.wmflabs.org - https://phabricator.wikimedia.org/T217010
[12:55:00] Dvorapa: checking
[12:56:09] Dvorapa: I see there's no pod running and trying `webservice start` gives me `Your job is already running`... this looks like a bug in webservice. `kmlexport` is a grid engine job, right?
[12:56:42] yes, so far a gridengine job (there were numerous issues with k8s)
[12:57:11] (issues with k8s have phab tasks, no need to worry about them)
[12:58:08] got it
[12:58:37] the error message is misleading.. it's saying the `pod` can't be stopped which leads us to think it's kubernetes.. the real issue is that the grid engine job wouldn't die, so I force killed it
[12:59:15] now start/stop/restart seem to work
[13:00:07] we had 32 jobs in the queue waiting to run, now it's 22... it should be <5 soon (which is the baseline, AFAIK)
[13:01:05] gtirloni: cool, thank you
[13:03:25] Dvorapa: tracking issue here T217025
[13:03:25] T217025: webservice: misleading error message `Pod resisted shutdown` - https://phabricator.wikimedia.org/T217025
[13:04:50] ok
[13:11:50] !log tools PAWS: Stopped AABot notebook pod T217010
[13:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:11:56] T217010: HTTP Error 502 / 503 when logging into https://paws.wmflabs.org - https://phabricator.wikimedia.org/T217010
[14:25:24] gtirloni: FYI, it seems the NFS thing caused several of my bot's sge jobs to go into a weird state, apparently not running and qstat tells me they're using 0 memory and 0 CPU time. Attempting to qdel them now for restart.
[14:25:53] anomie: ok, if qdel doesn't work please let me know and I can check them
[14:26:16] The NFS thing is only affecting grid right? not k8s
[14:26:26] Zppix: it affected k8s too
[14:26:35] gtirloni: is there a ticket?
[14:27:32] Zppix: T216988
[14:27:33] T216988: labstore1004 - DISK CRITICAL - free space: /srv/tools 115904 MB (1% inode=79%): - https://phabricator.wikimedia.org/T216988
[14:27:43] thanks
[14:28:36] gtirloni: Some went away right away. One (250145) hung around for a while but eventually went away. 250143, 250144, 250150, and 331841 are still there. Also 250146 and 250147 were unaffected for whatever reason.
[14:29:22] gtirloni: is there a list of affected tools?
[14:29:34] Zppix: no
[14:29:46] anomie: I've killed the remaining ones
[14:29:52] thanks
[14:31:14] was it a case of everything with a file handle open at the time getting broken?
[15:52:49] gtirloni: FYI, low priority weirdness: The resubmitted jobs for my bot are running normally (log files and redis show appropriate data), but for some reason qstat is showing 0 memory and CPU usage for 351099, 351100, 351101, and 351103 (but 351096-8 and 351102 are showing proper usage).
[18:04:36] !log wikidata-dev wikidata-shex set up ConfirmEdit
[18:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-dev/SAL
[19:16:23] Hello, I am not able to ssh or http to my VPS whgi.wikidumpparse.eqiad.wmflabs. When I attempt either, I get no response or error message; it just hangs forever (doesn't even time out). Any thoughts? I see that the service status is "degraded", is it a known issue?
[19:17:21] $ ssh whgi.wikidumpparse.eqiad.wmflabs
[19:17:21] The authenticity of host 'whgi.wikidumpparse.eqiad.wmflabs (172.16.1.145)' can't be established.
[19:17:21] ECDSA key fingerprint is SHA256:OdPqbOuHb97w+0aYKh58AGSI0YGWelaT30oXbMqgOHk.
[19:17:21] Are you sure you want to continue connecting (yes/no)? yes
[19:17:21] Warning: Permanently added 'whgi.wikidumpparse.eqiad.wmflabs,172.16.1.145' (ECDSA) to the list of known hosts.
[19:17:21] Permission denied (publickey).
[19:17:25] The host seems to be up etc
[19:17:34] Is your ssh config still correct?
[19:20:20] Thanks reedy, yeah, if I log into tools-labs I can reach it. But using the old ssh config, going through the bastion first, I can't. Should I still be using `Host *.eqiad.wmflabs: ProxyCommand ssh -a -W %h:%p maximilianklein@primary.bastion.wmflabs.org` ?
[19:26:00] Hmm now from tools-labs the error has changed to "connection refused" while i was trying to transfer my public key
[19:26:08] notconfusing: some VMs might have stale NFS mounts, I've rebooted that VM. please try now
[19:26:27] heh
[19:26:33] That's probably the reboot then
[19:32:13] gtirloni, Reedy: I can log in again thanks! (screen sessions didn't persist though). Thanks for the quick help.
[19:33:52] notconfusing: I tried unmounting NFS but it was in a very bad state :/ sorry for the reboot but it was necessary
[19:34:45] gtirloni: no problem. thanks for getting it back up
[20:10:07] Hi, I'm trying to move my web service from grid to kubernetes but when I do: webservice --backend=kubernetes start, I just get a blank page... any help?
[20:11:26] jem: have you looked in error.log to see if there are any log messages?
[20:12:30] without listing a runtime type you should be getting a php5.6 Docker container and that should send runtime errors to $HOME/error.log
[20:16:03] Thanks, bd808, checking
[20:22:43] Hmmm... the messages are weird, including a chdir error... I'll check carefully
[20:25:20] jem: if you want a second set of eyes, just let me know which tool you are working on
[20:29:45] Thanks, bd808 :) I'll try to figure it out myself and if I'm stuck I'll "be back"
[20:29:57] +1
[21:02:21] Is it possible to change my shell account name? (from royh to hvaara)
[21:20:40] hvaara: it is probably possible, yes. There is no self-service system for this, but you can create a task in Phabricator asking for changes to your Developer Account's shell name.
[21:28:02] bd808: Great! Thanks! :)
[21:41:47] !log tools depooling tools-sgeexec-0911, tools-sgeexec-0912, tools-sgeexec-0913 to test T217066
[21:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:41:52] T217066: Rebooting more than 2 instances via Horizon sends one instance (at least?) into an error state - https://phabricator.wikimedia.org/T217066
[22:23:07] hey all, I'm lost in the trusty deprecation docs. I need to schedule Python scripts to run at regular intervals. Was doing that through cron. Now I'm not sure whether I'm supposed to migrate to a new grid, or to Kubernetes? Do I have a choice? If so, what are the benefits or drawbacks of each? The docs suggest I should move to Kubernetes: "When possible, we recommend migrating web services to Kubernetes instead of the new grid"
[22:23:08] https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation#Move_a_grid_engine_webservice
[22:23:44] J-Mo: you have the option in this case. We prefer that people use Kubernetes where possible because it's just better, but SGE (Son of Grid Engine) is available as an option.
[22:23:57] If you were using crontab before you were using Grid Engine
[22:25:07] thanks harej. Are there examples of what the new crontab and execution syntax looks like? I *think* I can follow the instructions on migrating, but not sure how to run my jobs when I'm done.
[22:27:16] for example, here's what one of my current jobs looks like on trusty: 0 16 * * * jsub -l release=trusty -N sendTeahouseInvites ~/venv/bin/python /data/project/hostbot/bot/send_th_invites.py "th_invites"
[22:28:16] J-Mo: the crontab syntax should not need to change at all when moving from Trusty to Stretch. -- https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation#Move_a_cron_job
[22:29:06] in that particular example, the "-l release=trusty" part can be removed. I think you get a warning right now on either grid when providing a release=... flag
[22:29:48] awesome, thanks harej and bd808. I'll be back if I get into the weeds again.
[22:31:07] J-Mo: if you can think of less confusing words for the instructions at https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation edits are welcome as well! :)
[22:32:24] cool cool. If I figure it out, I'll try to leave some breadcrumbs for other confused folks who come after
[23:20:20] !log tools Depooled tools-sgeexec-0914 and tools-sgeexec-0915 for T217066
[23:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:20:24] T217066: Rebooting more than 2 instances via Horizon sends one instance (at least?) into an error state - https://phabricator.wikimedia.org/T217066
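
Note on the 22:27-22:29 cron exchange: the advice there was that the crontab line keeps its shape and only the `-l release=trusty` part can go. A minimal sketch of that edit follows, assuming the flag stands alone as in the pasted job (it would need adjusting if other resources were packed into the same `-l` value). The helper name `strip_release_flag` and its regular expression are hypothetical illustrations, not part of any Toolforge tooling.

```python
import re

# Hypothetical helper (not part of Toolforge): matches a standalone
# "-l release=<name>" resource flag together with its leading whitespace.
_RELEASE_FLAG = re.compile(r"\s+-l\s+release=\S+")


def strip_release_flag(cron_line: str) -> str:
    """Return cron_line with any '-l release=...' flag removed; the rest is untouched."""
    return _RELEASE_FLAG.sub("", cron_line)


if __name__ == "__main__":
    # The job line quoted in the log at 22:27:16.
    old = ('0 16 * * * jsub -l release=trusty -N sendTeahouseInvites '
           '~/venv/bin/python /data/project/hostbot/bot/send_th_invites.py "th_invites"')
    print(strip_release_flag(old))
```

The schedule fields, the `-N` job name, and the command are left exactly as written, which matches the statement in the log that the crontab syntax itself does not change between Trusty and Stretch.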