[01:35:26] groups: cannot find name for group ID 50380
[01:35:36] I guess I'm in a group that no longer exists :D
[01:35:38] Krenair that's known (ldap issues)
[01:35:43] uh wrong ping
[01:35:46] Krinkle ^^
[01:35:50] wow, that was.. fast.
[01:35:59] Krinkle see https://phabricator.wikimedia.org/T217280
[01:36:25] Okay
[08:31:09] !log quarry restarted uwsgi to deal with 502 nginx errors `sudo systemctl restart uwsgi-quarry-web`
[08:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL
[13:04:14] hauskatze: thanks for the ping, I'll take a look
[13:04:40] hi valhallasw`cloud :)
[13:04:50] well gerrit is down so it'll have to wait
[13:06:18] hauskatze: ok, this is stupid. I forgot to update the crontab to use the new virtualenv
[13:06:50] valhallasw`cloud: don't flog yourself for that, crontab is easily missed :)
[13:07:19] probably best not to enable it for now valhallasw`cloud while gerrit is down?
[13:08:28] mmm, maybe. It should be OK -- if an exception is thrown it will just try again the next time
[13:08:37] but with a two-week backlog maybe better to not risk it :P
[13:09:20] I'm fine with what you decide
[15:21:41] Looks like KrinkleBot for protecting file pages on commons is stuck again on the grid. Not on a shell. Could someone restart it? It's on a 15min cron. The current job probably just needs to be killed
[15:31:44] valhallasw`cloud, ^ perhaps?
[15:36:36] Krenair: let me take a look. Is this the Trusty or the Stretch grid?
[15:37:25] never mind, Stretch grid. The job is in an error state, I'll qdel it
[15:38:13] !log tools.krinklebot qdel job 870644 0.25000 fileprotec tools.krinkl; in error state, requested by Krinkle on IRC.
[15:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.krinklebot/SAL
[16:33:17] * Krinkle is back on a proper device now
[16:33:19] Thanks valhallasw`cloud
[16:59:05] !help
[16:59:05] MarioFinale: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[16:59:30] Is any cloud admin here? I have 3 tasks stuck in dRr.
[16:59:48] 3 grid jobs*
[17:00:29] MarioFinale: let me take a look. Which tool is this?
[17:00:57] tools.periodibot Jobs: 203211 203212 203214
[17:01:37] thanks
[17:02:22] MarioFinale: judging from the d in the status you intended to stop the jobs, correct?
[17:02:42] yep
[17:13:31] MarioFinale: I'm not quite sure what's going on. The process is stuck in something called 'uninterruptible sleep', which generally has something to do with NFS, but NFS seems OK
[17:16:23] weird
[17:18:50] It happened to 3 running jobs simultaneously. And one of those jobs were in a different host.
[17:20:11] was
[17:20:49] should I create a phab task about this?
[17:21:48] MarioFinale: yes, please. I'll prod around some more -- NFS does seem to be the cause
[17:28:20] Done: T218486
[17:28:21] T218486: Grid jobs stuck on host - https://phabricator.wikimedia.org/T218486
[17:30:11] valhallasw`cloud: I need to go, thanks for the help!
[19:46:58] Looking at the job that valhallasw`cloud killed for me, interesting metadata:
[19:46:58] $ qacct -j 870644
[19:47:04] qsub_time Thu Jan 1 00:00:00 1970
[19:47:30] taskid undefined
[19:47:39] cpu 0.000s
[19:47:39] mem 0.000Bs
[19:47:40] etc.
[19:52:50] hey valhallasw`cloud given that gerrit is back can we get forrestbot to work again? :)
[19:53:07] Krinkle: interesting. The NFS share is also super slow -- I'm trying to copy the (only 70M) accounting data file but it's taking forever
[19:53:39] Is there a way yet for a tool to opt out of NFS somehow?
[19:53:44] We did that for VPS projects.
[19:54:12] I think toolforge itself is on nfs iirc
[19:54:15] I know it's complicated, just wondering if there's an option or recommendation here maybe for the subset of users who are willing to do a bit of work to make it possible.
[19:54:16] not really -- SGE is tightly bound to NFS. It should be possible with k8s in theory
[19:54:47] for k8s you could do something like 'copy all relevant files to /tmp and run the tool from there'
[19:55:01] for SGE as well, but then you run the risk of never cleaning up those files
[19:55:05] right
[19:55:43] it's not a long running process though in my case. I've not seen the python process (usually takes maybe 30 seconds) itself get stuck. If NFS fails, it just exits cleanly as I'd expect.
[19:56:12] The issue is always the (to me, mysterious) ways in which jsub schedules it but it won't start, which means the cron job that starts it again with -once 15min later won't do anything.
[19:56:43] It's becoming a real bottleneck to whack-a-mole every other week or so to keep it running. I wonder if maybe there's a better way to do this for me.
[19:57:16] See https://github.com/Krinkle/pywiki-fileprotectionsync#readme for the crontab I use currently
[19:59:05] Krinkle: yeah, that is really annoying. Unfortunately no simple solution for that -- SGE is just always a bit broken (although more so lately)
[20:02:02] Krinkle: I'm not sure what happened with the accounting file. The same occurs for other tools every now and then as well (52/200k entries in the accounting file), especially given that I had just qdel'ed it, which should lead to a clean exit
[20:03:31] 03/16/2019 00:01:12|worker|tools-sgegrid-master|W|job 870644.1 failed on host tools-sgeexec-0934.tools.eqiad.wmflabs general assumedly before job because: can't get password entry for user "tools.krinklebot". Either user does not exist or error with NIS/LDAP etc.
[20:04:05] so the cause seems to be that the job never started
[20:06:30] Is it possible someone can check into why I'm having issues connecting to the cloud vps bastion? The only error I get is “Connection Failure”
[20:10:36] valhallasw`cloud: hm.. so the 0/undefined values I find in qacct, is that reflecting that the job never started, or is that a separate issue due to NFS (where the logs are) not being able to find it quick enough or got lost?
[20:10:52] Krinkle: I believe the issue is that the job never started (due to LDAP issues)
[20:11:04] if NFS was broken, I would expect no line at all
[20:11:30] Zppix: bastion.wmflabs.org works for me
[20:11:38] hauskatze: (late) -- good idea.
[20:11:40] valhallasw`cloud: weird
[20:11:49] valhallasw`cloud: ah, okay. Thank you :)
[20:12:16] valhallasw`cloud: I can't get a connection to bastion.wmflabs.org, or directly to both bastions
[20:12:34] Zppix: ssh -vvvv bastion.wmflabs.org and take a look at the verbose output
[20:13:38] 4 '-v's? My SSH client only goes up to 3 >.>
[20:14:33] Could it be that I'm trying to use an ed25519 ssh key?
[20:15:20] Zppix: the verbose output will tell you
[20:16:10] Zppix, ed25519 keys should work fine
[20:16:50] in fact yes, it does work fine, that's what I'm using: debug1: Server accepts key: pkalg ssh-ed25519 blen 51
[20:18:06] Krenair: I just switched to rsa and it works fine, so I think it is the key, though I just generated it and pasted it with no space into prefs
[20:18:09] It's weird
[20:19:35] Zppix, this is "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILgpf4aMCKXu4bgtdwPj6df0cp7QCbo4DxTw3hiawn9O Generated By Termius" ?
[20:19:50] Krenair: yes
[20:22:12] Zppix, paste your ssh -vvv bastion.wmflabs.org
[20:24:46] [21:11:39] hauskatze: (late) -- good idea. <-- sorry, I was on the phone; thanks :)
[20:25:44] Krenair: it works now
[20:25:46] Weird
[20:31:24] hauskatze: buzzing away happily.
[20:31:46] valhallasw`cloud: I see, much thanks
[20:33:16] valhallasw`cloud: ehm, does the bot no longer use https://tools.wmflabs.org/forrestbot/forrestbot.log.txt ?
[20:33:27] hauskatze: it does, but I ran it manually
[20:33:36] so the output went to my screen instead of the log file :-)
[20:33:45] perfect then
[20:39:23] Hello. After the migration to Stretch, I noticed that my cronjob has stopped working. Basically my gen_stats.py needs the toolforge pkg, which only works when the virtualenv is activated. For that I wrote a small script: https://github.com/Jayprakash-SE/indic-wsstats/blob/master/cronjob.sh . It was working fine on Trusty.
[20:40:27] I saw the cron.error log. It says ImportError: No module named toolforge
[20:41:09] even though I installed the toolforge pkg in the virtualenv.
[20:44:10] Jayprakash12345: did you create a new virtualenv on Stretch? unfortunately you can't reuse the same venv across both platforms
[20:45:27] Yeah, I followed https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation#Rebuild_virtualenv_for_python_users
[20:46:14] and if you run cronjob.sh from the terminal, does it work as expected?
[20:47:09] No, it is not working. But when I run it after webservice --backend=kubernetes python shell, it is working fine.
[20:47:24] ah!
[20:47:37] confusingly, webservice --backend=kubernetes runs yet another platform (debian jessie)
[20:48:16] so if you want to run the tool from a crontab, you need to create the virtualenv on the bastion, not in the webservice shell
[20:52:06] I followed https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web#Using_virtualenv_with_webservice_shell when I set up my tool.
[20:52:41] Yes; that is for a web-based tool. However, you're running your tool from a crontab, which means it runs in a different environment.
[20:52:53] Oh
[20:54:04] So how can I set up the virtualenv on the bastion instead of in the webservice shell?
[20:57:34] Jayprakash12345: run the same commands as before, but without calling webservice --backend=kubernetes python shell
[20:57:48] Jayprakash12345: and probably place the virtualenv in a different place, as www/python/venv suggests it's for a web tool
[20:57:58] Got it, trying
[21:09:37] valhallasw: Thank you very much, now the cronjob is working fine :) Thanks
[21:30:08] valhallasw: Hi, now my tool is down. I used webservice --backend=kubernetes python start to start the app.
[21:30:42] Jaypraka_: I'm confused -- I thought you were running the app from a crontab. Or do you also have a Python-based webapp?
[21:32:31] Yes, I have a flask web app, and I use a cronjob to run gen_stats.py to generate a json file, which is used in the web app
[21:33:22] See https://github.com/Jayprakash-SE/indic-wsstats
[21:34:46] Jaypraka_: aaah
[21:34:58] Jaypraka_: in that case you will need two virtualenvs. One for the cronjob and one for the webservice
[21:35:25] for example, www/python/venv for the webservice. That one should be created through webservice shell
[21:35:50] and e.g. crontab/venv for the crontab. That one should be created from the bastion
[21:35:59] sorry -- it's a bit of a confusing situation
[21:36:24] Ok, I am trying.
[21:47:05] valhallasw: Now the tool is up. And the cronjob is also working. Thank you very much once again :)
[21:50:24] Jaypraka_: hi - do you think the author of https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/OATHAuth/+/492435/ will check back to fix the issues?
[21:54:01] hauskatze: Yes, I should send a message to her. We have a telegram group.
[21:54:22] Jaypraka_: perfect - when she fixes, I can +2 if I'm around
[22:06:00] I have added my stuck web service job 859159 to T218486 if somebody wants to analyse. Otherwise, can it be killed? The webservice has been down since 1 am.
[22:06:00] T218486: Grid jobs stuck on host - https://phabricator.wikimedia.org/T218486
[22:34:39] !log tools clearing errored out queues again
[22:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL