[00:11:05] bd808: followed your advice from the ticket
[00:11:06] Run command: webservice ruby25 start -- $HOME/start.sh
[00:11:06] start.sh contents:
[00:11:06] cd $HOME/www/ruby/src
[00:11:06] bundle exec rackup -p 8000
[00:12:45] hmmm.. that should work in theory
[00:14:09] ProcReader: what is the tool name you are working under? I can peek at the Kubernetes ingress wiring to see if it looks correct
[00:14:39] drvstats
[00:15:36] ProcReader: your pod is in CrashLoopBackOff state right now
[00:15:40] that may be the issue
[00:15:52] starting and then crashing for some reason
[00:16:30] hmm, when I run the app within a shell, it runs fine, locally on my Mac too, so I don't think it'd be the app itself crashing?
[00:16:56] is there anything in your dotfiles it needs?
[00:17:25] webservice start won't read .profile, .bashrc, etc.
[00:18:16] no, just the gems. the app file currently running is effectively just a dummy "get / do "PONG" end", and a dummy SQL query in another route
[00:19:04] "OSError: [Errno 8] Exec format error" is the error. That's from the bootstrapping code
[00:19:37] * bd808 steps into the tool to see if he can figure out what's up
[00:19:37] was that caused in the last 20 mins or so? i did run a bad exec about an hour ago trying to figure something out
[00:20:13] your service.manifest looks ok
[00:20:40] I wonder if I've broken something in webservice-runner...
[00:21:41] webservice-runner is the code that runs inside the container to start things up, and not a lot of webservices actually use extra_args, so it's possible that I did something badly in a recent update
[00:22:05] ProcReader: so what I'm doing is `kubectl get po` to see the pod state
[00:22:34] and then `kubectl logs drvstats-5574b5bcb-rfb8h` to see why it is dying
[00:22:57] * bd808 has a hunch...
[00:23:07] no, bad hunch
[00:23:31] I thought maybe that your shell wrapper wasn't executable, but you did that correctly
[00:25:52] ProcReader: your start script is missing a #!... line
[00:26:40] try adding `#!/bin/bash` as the first line
[00:26:46] ahh, lemme see
[00:27:28] really interestingly, ./start.sh actually does something without the shebang in the interactive shell
[00:27:35] it really should not, but it does
[00:28:35] webservice-runner does a low-level `os.execv(self.extra_args[0], self.extra_args)` call so there is no controlling shell.
[00:28:45] guess we're getting somewhere: `kubectl logs drvstats-5574b5bcb-xf466` doesn't show any errors now, but the page is still 502 Bad Gateway
[00:30:48] hmmm. I see the `[2020-06-26 00:27:07] INFO WEBrick::HTTPServer#start: pid=6 port=8000` line in the log output
[00:32:03] worst case, if you still have your tool running that rails script, I guess seeing if that still works is probably a quick way to rule out any issues outside my container
[00:33:10] WEBrick? Not puma?
[00:34:07] Well, shouldn't matter either way
[00:34:27] `curl localhost:8000` works inside the pod
[00:34:43] so what's up with the ingress?
[00:35:07] Is it serving / or /$toolname?
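A minimal sketch of the shebang fix discussed above, assuming the wrapper lives at $HOME/start.sh as shown at the top of the log; the `set -euo pipefail`, the quoting, and the `exec` are additions of this sketch, not from the chat.

    #!/bin/bash
    # webservice-runner launches this file with a bare os.execv(), so there is
    # no shell to fall back on: without the #! line the kernel refuses to run
    # it, which surfaces as "OSError: [Errno 8] Exec format error".
    set -euo pipefail
    cd "$HOME/www/ruby/src"
    exec bundle exec rackup -p 8000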
[00:35:43] * bstorm tries to sneak away like she planned
[00:35:46] --canonical is set so it's redirecting if you go to the legacy URL
[00:36:42] everything in the ingress and service object looks right
[00:37:48] !log tools.drvstats Hard restart while debugging 502 gateway errors
[00:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.drvstats/SAL
[00:38:03] bstorm: planning to change to puma before actual prod (just a quick rackup in the meantime for testing)
[00:38:05] ugh back to crashing
[00:38:10] also, is that a rat or a squirrel or what
[00:38:45] oops the crash is my bad
[00:38:46] I can't remember what they're called lol
[00:39:33] !log tools.drvstats Hard restart #2 while debugging 502 gateway errors
[00:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.drvstats/SAL
[00:40:24] does rackup only bind to 127.0.0.1 by default?
[00:40:48] the curl works inside the container, but not outside. this is now my hunch
[00:40:59] that it is bound to loopback in the container
[00:41:14] i'll modify the command to make it explicit
[00:41:16] and see if that helps
[00:41:39] I think it needs `--host 0.0.0.0`
[00:41:53] https://github.com/rack/rack/commit/076711a837cda3f07889cab05cb89964ce2314f0
[00:42:44] ah wow, https://drvstats.toolforge.org/, that worked!
[00:42:55] wizard! :)
[00:42:59] thanks so much bd808 :)
[00:43:15] ProcReader: np, now you have to write the tutorial ;)
[00:43:22] yes I was just about to say haha
[00:44:41] when you find the hard problems on the ruby side, b.storm is a ruby nerd at heart and can probably be nerd sniped into helping debug
[00:46:45] i'll certainly keep that in mind ;) -- hopefully I don't run into many problems on the ruby side
[00:46:49] for her sake :D
[09:33:49] !log tools.zppixbot update python for bot (not cron) to py37 -- T254246
[09:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[09:33:51] T254246: Upgrade ZppixBot docker image to python 3.7 - https://phabricator.wikimedia.org/T254246
[09:45:46] !log tools.zppixbot update complete -- T254246
[09:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[09:45:48] T254246: Upgrade ZppixBot docker image to python 3.7 - https://phabricator.wikimedia.org/T254246
[12:12:32] !log toolsbeta puppetmaster live-hacking with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/608005 (T120210)
[12:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:12:35] T120210: tools-mail: check SPF of sender before forwarding email - https://phabricator.wikimedia.org/T120210
[19:27:54] !log tools.drvstats Hard restart #2 while debugging 502 gateway errors
[19:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.drvstats/SAL
[19:35:53] some kind of issue with toolforge rn or just me? taking like 5 secs to get a response to shell commands
[19:55:25] looks like there was someone running something stupid on the bastion
[19:56:31] if the bastion is slow, that's typically NFS lag from someone running a bot or script from the bastion
[19:56:37] https://grafana-labs.wikimedia.org/d/QSE7tV-Wk/toolforge-bastions?orgId=1&from=now-6h&to=now&var-Host=tools-sgebastion-07
[19:56:46] that'll cause high load and increased NFS latency
[20:17:25] aren't people meant to use dev. for high-load stuff?
[20:32:45] ProcReader: yes, but humans are unpredictable and also do not follow directions well.
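Pulling the drvstats fixes together: a sketch of a working start.sh under the assumptions above (shebang present, and a rackup/Rack version that defaults to binding only localhost, per the commit linked at 00:41:53). The exact final file is not shown in the chat; this is an illustration.

    #!/bin/bash
    # Bind to 0.0.0.0 so the Kubernetes service/ingress can reach port 8000;
    # with the default loopback bind, `curl localhost:8000` works inside the
    # pod but requests proxied from outside get a 502 from the ingress.
    cd "$HOME/www/ruby/src"
    exec bundle exec rackup --host 0.0.0.0 -p 8000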
[21:14:16] !log tools.zppixbot-test shut down for T256502
[21:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[21:14:19] T256502: Rebuild ZppixBot-test - https://phabricator.wikimedia.org/T256502
[21:17:42] !log tools.quickcategories deployed e1c86c5b27 (update pagepile url)
[21:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.quickcategories/SAL
[21:21:24] !log tools.zppixbot-test backup dbs, rm everything else python related -- T256502
[21:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[21:21:27] T256502: Rebuild ZppixBot-test - https://phabricator.wikimedia.org/T256502
[21:57:34] !log paws applied the metrics manifests to kubernetes to enable metrics-server, cadvisor, etc. T256361
[21:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[21:57:36] T256361: PAWS: get new service and cluster metrics into prometheus - https://phabricator.wikimedia.org/T256361
[21:57:55] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Returning_the_status_of_a_particular_job doesn't seem to be working for me
[21:58:07] I have: "Job 7141324 exited because of signal SIGINT"
[21:58:14] but
[21:58:15] tools.newusers@tools-sgebastion-07:~$ qstat -j 7141324
[21:58:15] Following jobs do not exist:
[21:58:15] 7141324
[22:13:03] legoktm: did the job exit recently? We rotate the data file for that state lookup
[22:13:44] bd808: it exited probably within ~10 min of me trying to look at the exit status
[22:13:56] hmm
[22:14:39] sigint almost always means OOM
[22:14:54] bd808: 7142037 exited about a minute ago and doesn't exist according to `qstat -j`
[22:15:44] yeah, I guessed as much. At this point the `qstat -j` questions are more to figure out whether the documentation is out of date or something isn't working as expected
[22:16:16] `qstat -j '*'` has stuff, but not nearly as much as I would expect
[22:17:21] it should list all the things that you can see at https://sge-status.toolforge.org/
[22:17:33] and it pretty obviously does not
[22:18:43] hmm.. or does it
[22:18:59] /usr/bin/qstat -j '*' | grep job_number | wc -l == 757 jobs
[22:25:53] legoktm: I am not sure why, but qstat seems to only show running jobs even when looked up by id and not any historical jobs
[22:26:24] we haven't done anything purposefully to change the grid for a long time
[22:26:32] should I file a bug?
[22:27:06] I wonder if tracking historic jobs got messed up by nfs restarts or something?
[22:27:10] legoktm: sure
[22:28:56] !log tools.sge-jobs Hard restart to set --canonical
[22:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sge-jobs/SAL
[22:29:36] https://sge-jobs.toolforge.org/ is still working
[22:29:40] which is good
[22:30:09] but it uses different magic to look at the audit logs
[22:31:25] !log tools.zppixbot-test starting things back up after T256502
[22:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[22:31:27] T256502: Rebuild ZppixBot-test - https://phabricator.wikimedia.org/T256502
[22:32:42] filed as https://phabricator.wikimedia.org/T256513
[22:39:28] bstorm: btw, compiling rust on the grid works out pretty well: https://paste.centos.org/view/raw/266d40d7 I'll post some more details on the task in a bit
[22:39:38] Cool :)
[22:47:52] I don't think qstat ever showed historical info... just errored and hung jobs
[22:47:57] I added more info in the ticket
[22:48:12] I think you are thinking of qacct
[22:48:23] legoktm: ^^
[22:49:37] hmm
[22:49:45] maybe I misinterpreted the documentation then
[22:50:02] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Returning_the_status_of_a_particular_job shows `qstat -j ###` but then the next part says "Common shell exit code numbers[1] returned e.g. by qacct..."
[22:50:20] We haven't revisited that documentation in quite a while
[22:50:36] We did when I did the grid rebuild, but even then, I didn't dig into it all that much
[22:50:52] Mostly just things I changed. Maybe it could use more context on some of that
[22:51:10] what if I add "qstat will only show information about currently running jobs. For historical jobs, use qacct." ?
[22:51:19] Please do!
[22:51:27] I think that should clear it up
[22:51:49] Maybe also mention that qacct is slow as a dying turtle
[22:52:18] that's unfair to sick turtles :)
[22:52:22] "It's not broken. That's just what working looks like on a massive file on NFS with a slow connection."
[22:53:46] https://wikitech.wikimedia.org/w/index.php?title=Help%3AToolforge%2FGrid&type=revision&diff=1871328&oldid=1870115
[22:58:30] Looks awesome, thanks!
[22:59:10] And now that it's in the docs and that phab task it'll be easier to find if anyone else wonders :-D
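Since the documentation update points people at qacct for finished jobs, a small illustrative lookup (job id taken from the discussion above; substitute your own). As noted in the chat, qacct reads a very large accounting file over NFS, so expect it to be slow.

    # Full accounting record for a finished grid job
    qacct -j 7141324
    # Or pull just the fields that usually matter when a job died
    qacct -j 7141324 | grep -E 'exit_status|failed|maxvmem'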