[00:32:38] I'll follow up with Erutuon
[01:41:10] https://tools.wmflabs.org/checker/ seems to be down, likely due to the name change.
[02:00:02] !help enwp10 web tool is down, getting 403 NOT AUTHORIZED responses since the name change
[02:00:02] If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-kanban
[02:00:35] https://enwp10.toolforge.org/cgi-bin/pindex.fcgi
[02:01:24] [telegram] JesseW: the link works for me?
[02:02:13] [telegram] audiodude: have you tried to contact the maintainers?
[02:02:18] they are me
[02:02:26] :)
[02:03:43] I did test it before the forced switchover and it was working on the new path
[02:03:52] hmm...
[02:04:07] do you mind if I poke around inside the tool a bit?
[02:04:17] go for it
[02:04:41] * bd808 waits for his ssh agent to wake up from some nap
[02:04:49] If I delete a deployment-prep instance, do I need to do anything to update shinken or will it eventually figure it out?
[02:05:08] dpifke: it should figure it out after ~30 minutes
[02:05:28] bd808: I was trying to find access and error logs but I don't know where they are for lighttpd
[02:05:45] Cool, thanks.
[02:05:56] I was looking at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web/Lighttpd
[02:05:57] audiodude: errors should go to $HOME/error.log
[02:06:10] we turned off access.log logging by default
[02:06:15] ah okay
[02:06:29] too many multi-gig files nobody actually read :)
[02:06:46] makes sense
[02:07:06] so the 403 to me could just be a permissions issue somehow?
[02:07:45] audiodude: I think it is your custom mappings in $HOME/.lighttpd.conf
[02:08:05] your index document is a CGI script, correct?
[02:08:18] yes
[02:08:59] My current guess is that you need to remove the "/enwp10" from your mappings
[02:09:15] and I think that https://enwp10.toolforge.org/enwp10/cgi-bin/pindex.fcgi confirms that :)
[02:09:56] oh hey
[02:10:00] okay great thanks
[02:10:03] :D
[02:10:28] yw :)
[02:11:55] very long-lived tools that are using a lot of .lighttpd.conf trickery are harder for us to reason about as we try to make incremental changes to how this stuff all works
[02:12:21] it's not bad that they exist, but it makes automating changes difficult
[02:12:27] that makes sense. And as you can see from the deprecation notice, we're trying to EOL this particular version
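Editor's note: the actual enwp10 .lighttpd.conf is not shown in this log, but the general shape of the breakage is worth recording. Under the old tools.wmflabs.org/enwp10 URLs, custom rules tended to match a leading "/enwp10" path component; after the move to enwp10.toolforge.org the tool is served from "/", so that prefix no longer matches anything. A hypothetical before/after using lighttpd's url.redirect directive (the real mappings may use different directives):

    # old-style rule, written for tools.wmflabs.org/enwp10/...
    url.redirect = ( "^/enwp10/?$" => "/enwp10/cgi-bin/pindex.fcgi" )

    # equivalent rule for enwp10.toolforge.org, with the "/enwp10" prefix dropped
    url.redirect = ( "^/?$" => "/cgi-bin/pindex.fcgi" )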
[02:23:17] bd808: Hm, it still doesn't come up for me...
[02:23:45] * bd808 tries again
[02:23:53] just times out
[02:24:08] it was fast for me once, but does seem to be just spinning now
[02:24:27] * bd808 peeks under the hood
[02:24:44] ah, just came up for me
[02:25:32] coming up fine for me now. What did you do? :-)
[02:25:56] I looked at it. I have this effect on servers. They fear me :)
[02:26:30] Helpful!
[02:26:39] something is going on here though. does the tool do some heavy lifting? it is acting like it is overloaded
[02:26:50] it does some pretty heavy lifting, yes.
[02:27:02] Or at least, it's always been slow.
[02:27:56] It's doing WhatLinksHere on all the pages linked from a given page.
[02:28:07] which can range from ~5 to ~500
[02:29:29] Here's the (pretty small) code: https://github.com/legoktm/checker/blob/master/app.py
[02:29:56] oh hi
[02:29:59] !log tools.checker Hard restart to pick up "replicas: 2" setting in $HOME/service.template
[02:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.checker/SAL
[02:30:07] legoktm: I poked your tool
[02:30:13] thank you!
[02:30:20] and gave it 2x the execution space :)
[02:31:05] hmmm... still spotty response to page reloads
[02:31:17] I still don't know what that tool does, but it usually only breaks every year or so
[02:31:28] lol
[02:31:43] heh. you just kept MZ's toy working then legoktm?
[02:31:59] it's very useful for making sure wikisource works are properly (and completely) published
[02:32:09] pretty much
[02:32:23] I think on a plain page load it makes an API request and queries the meta_p database
[02:33:05] yes, and on a page load with params it does the pagelinks query
[02:33:48] heh. it's a Flask app that looks more like a port of a PHP page
[02:35:23] I think it would be a lot faster with a tiny bit of redis cache mixed in
[02:35:26] it was originally a CGI script and I just adjusted enough of it to use flask so I could move it to k8s
[02:35:55] caching the database list and the API call?
[02:35:58] yeah that sounds like a good idea
[02:36:14] yeah. even just for like 10m would help it I think
[02:36:31] let me see which flask caching library is maintained...
[02:36:53] yeah. I found one at some point...
[02:37:25] I've rolled my own so many times that I don't remember in which tool I finally realized there was an easier way
[02:38:32] https://pythonhosted.org/Flask-Caching/ is the one I used in some other tool, there's a flask-cache which used to be more popular but is unmaintained
[02:39:10] my last home-grown is https://phabricator.wikimedia.org/source/tool-k8s-status/browse/master/k8s/cache.py
[02:39:33] Maybe we should wrap one into your toolforge lib
[02:41:30] legoktm: I guess https://pypi.org/project/cachelib/ is the one I used there
[02:41:47] which is formerly werkzeug.cache
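Editor's note: a minimal sketch of the "cache it for ~10 minutes" idea discussed above, using cachelib (the library bd808 mentions). The function name, cache key, and Redis host are illustrative assumptions, not the actual checker code:

    # pip install cachelib redis
    from cachelib import RedisCache

    # key_prefix keeps this tool's entries separate on a shared Redis server
    cache = RedisCache(host="tools-redis", key_prefix="checker:")

    def get_wiki_list():
        """Return the list of wikis, cached for ~10 minutes."""
        wikis = cache.get("wiki-list")
        if wikis is None:
            wikis = query_meta_p()  # placeholder for the expensive meta_p query
            cache.set("wiki-list", wikis, timeout=10 * 60)
        return wikis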
[02:47:50] bd808: so replicas: 2 means that there are two instances of the app running and k8s auto load balances between the two?
[02:47:57] yup
[02:48:20] if you do `kubectl get pod` you can see 2 running
[02:48:52] and `kubectl get svc` will show you the service object that does the round-robin load balancing
[02:49:38] traffic hits the Ingress object, it asks the Service object how to find the backing pod to route to
[02:50:26] it's a nicer way to scale up a uwsgi tool than trying to run more concurrent threads in the uwsgi container
[02:50:40] how much of an impact do you think this will make for popular tools? I expect most to be limited by SQL queries ultimately
[02:50:58] "it depends" I think
[02:51:09] maybe more specifically I'm asking when it makes sense to add that to a tool
[02:51:38] if you are db-bound it probably makes things worse
[02:52:23] but if the tool is mostly calling web APIs of some kind and also acting like it is queuing a lot then it should help
[02:53:47] I added the feature flag for scaling replicas initially for the fourohfour tool which handles missing webservice calls for all tools
[02:54:01] I wanted it to queue as little as possible
[02:54:37] gotcha
[02:55:18] I can't think of any other tools offhand, then, that would benefit from that
[02:55:39] !log tools.checker enable redis caching for mostly static data
[02:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.checker/SAL
[02:55:56] Also, service.template is my new favorite thing
[02:56:06] https://github.com/legoktm/checker/commit/38f4726decd61c4f65953b7bdebf8e1b5d1290b8 - it makes all the command line invocations so much easier
[02:57:21] yes, service.template is what I should have designed for .webservicerc in the first place :)
[02:57:53] and you can have multiple (and use --template=...) if you need that for some reason
[02:58:50] * bd808 needs to get this worked into a brand new "how do I use webservice" wiki page
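Editor's note: a sketch of what a $HOME/service.template like checker's might contain, assuming the keys mirror webservice's command-line options; only the "replicas: 2" value is confirmed by this log, the other keys are illustrative:

    # hypothetical $HOME/service.template
    backend: kubernetes
    type: python3.7
    replicas: 2   # the setting picked up by the hard restart logged above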
[02:59:32] the only thing that would be nice is if it could read them out of ~/www/python/src/ so I don't need to symlink them into the root, but when I thought about proposing that I realized it wouldn't know which directory to look in if it didn't know the type
[02:59:53] heh yeah
[03:00:22] it could have a cascade though
[03:00:35] that wouldn't be too ugly to add into the script
[03:01:09] look for it in $HOME, then $HOME/www/python/src, then $HOME/www/js/source, etc
[03:01:19] that would still be quick
[03:01:40] legoktm: send a patch :)
[03:01:58] Oooh, OK, will do
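Editor's note: a minimal sketch of the proposed cascade, in Python since webservice is a Python script; the search order is the one suggested above, and the helper name is illustrative:

    import os.path

    # directories to check, relative to the tool's $HOME, in priority order
    SEARCH_DIRS = ("", "www/python/src", "www/js/source")

    def find_service_template(home):
        """Return the first service.template found in the cascade, or None."""
        for subdir in SEARCH_DIRS:
            candidate = os.path.join(home, subdir, "service.template")
            if os.path.exists(candidate):
                return candidate
        return None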
[03:04:15] legoktm: b.storm said something in passing today about a hardcoded tools.wmflabs.org thing in pywikibot somewhere. Does that ring a bell with you?
[03:05:28] https://codesearch.wmflabs.org/pywikibot/?q=wmflabs.org&i=nope&files=&repos=
[03:05:40] Looks like it has some tool URLs in a few scripts
[03:05:44] we are looking at the same page :)
[03:06:00] Oh but
[03:06:09] http://pywikibot.org/ isn't working
[03:06:33] That's supposed to redirect to tools.wmflabs.org/pywikibot/
[03:06:44] how/where is that setup?
[03:06:49] (It could've been broken for a while tbh)
[03:06:56] Should be in puppet
[03:07:33] https://gerrit.wikimedia.org/g/operations/puppet/+/72af44178d05199e0ac1162cd4637f93265b6a13/modules/ncredir/files/nc_redirects.dat#282
[03:07:34] ah. it's in modules/ncredir/files/nc_redirects.dat
[03:08:14] WMNL owns the domain, but it should point to WMF DNS
[03:08:17] that really should still work
[03:08:35] It's totally possible something else broke it and I just noticed now
[03:10:01] the file is protocol-agnostic. I wonder if that is causing some problem?
[03:10:21] all the others are https://... rather than //...
[03:10:45] legoktm: should I make the patch or are you already on it?
[03:11:28] I'm creating a phab ticket
[03:12:12] heh. I was just going to make a patch and add a.rturo as reviewer so he would deploy in the EU morning :)
[03:14:01] https://phabricator.wikimedia.org/T257536
[03:14:05] bd808: please do :)
[03:14:43] I'll write the pwb *.toolforge.org patch
[03:20:57] (PS1) Legoktm: Update URLs for new toolforge.org domain [pywikibot/core] - https://gerrit.wikimedia.org/r/610553
[03:22:55] legoktm: we should get those domains added to the LE cert too
[03:23:25] is that easy to do now? I don't think it was at the time
[03:23:55] if the dns is hosted by WMF then yes. If it's external then no
[03:24:22] `git grep pywikibot` in operations/dns.git is a no hit :/
[03:24:57] hmm.. but NS says it is local
[03:25:00] https://phabricator.wikimedia.org/T106311#1523595
[03:25:19] or https://phabricator.wikimedia.org/T106311#1523674
[03:26:09] acme-chief should make this just a hiera setting to add for the cert
[03:26:22] but in 2016 that was not a thing :)
[03:27:19] ah. ok. there is a CNAME record hosted externally that makes it work
[03:27:34] so yeah, that will keep acme-chief from doing its thing
[03:28:04] it does the LE auth using DNS challenges so that it can do wildcards
[03:28:30] so the DNS has to be one that acme-chief knows how to update when LE sends a new challenge token
[03:29:06] it's not possible to use the file-based /.well-known version?
[03:29:31] My guess is that WMNL holds the domain for historic/legacy reasons and no one wants to deal with properly transferring it over
[03:29:40] in the LE protocol, yes. In acme-chief I don't think that is implemented
[03:31:02] acme-chief is basically a custom LE client that gets the cert issued and then pretends to be a puppetmaster so that Puppet can fetch files from it.
[03:31:21] it's slick, but also complicated magic
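Editor's note: an illustration of the DNS-01 mechanics described above. Let's Encrypt asks the client to publish a TXT record at _acme-challenge.<domain>, which is why acme-chief must control the zone it validates. A quick way to look at such a record, using dnspython 2.x and a hypothetical domain:

    import dns.resolver  # pip install dnspython

    # during an active DNS-01 challenge, the validation token appears here
    for rdata in dns.resolver.resolve("_acme-challenge.example.org", "TXT"):
        print(rdata.to_text())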
[03:32:18] ok, let me file a task at least with these notes
[03:37:15] https://phabricator.wikimedia.org/T257537 and I cc'd Kren.air and siebrand
[04:39:54] !log tools.refill Restarted to pick up toolforge.org migration related changes to $HOME/.lighttpd.conf (T257481)
[04:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.refill/SAL
[09:16:07] !log admin manually increasing sysctl value of net.nf_conntrack_max in cloudnet servers (T257552)
[09:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[09:16:10] T257552: prometheus: node exporter stopped collecting nf_conntrack_entries metric - https://phabricator.wikimedia.org/T257552
[11:11:53] !log admin [codfw1dev] rebooting cloudnet2003-dev for testing sysctl/puppet behavior (T257552)
[11:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:11:56] T257552: cloudnet: prometheus node exporter stopped collecting nf_conntrack_entries metric - https://phabricator.wikimedia.org/T257552
[11:23:47] !log admin [codfw1dev] rebooting cloudnet2003-dev again for testing sysctl/puppet behavior (T257552)
[11:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:23:50] T257552: cloudnet: prometheus node exporter stopped collecting nf_conntrack_entries metric - https://phabricator.wikimedia.org/T257552
[15:56:23] andrewbogott: Thanks. I'm sorry I'm requesting lots of projects
[15:56:33] you're doing lots of good stuff!
[18:13:54] Please, request to global kickban Joaquinito01
[18:15:11] bd808 MacFan4000 Please, request to global kickban Joaquinito01
[18:15:28] @kb Guest27
[18:15:33] that's a troll
[18:23:15] give me op please
[18:23:33] give me op please
[18:23:35] give me op please
[18:23:39] MacFan4000: ^
[18:24:21] Reedy Please, request to use Wikimedia Toolforge
[18:24:45] Please, with using apache traffic server
[18:30:32] @kb Guest36820
[18:30:33] Sorry but I don't see this user in a channel
[18:30:38] ah too late
[18:31:06] bd808: git-review for toolforge should be fixed/fixable now: https://phabricator.wikimedia.org/T257496#6294709
[18:32:07] just needs someone to install it on toolforge bastions (the right way)
[18:50:08] mutante: awesome. So we just need to `apt-get install git-review` to pick up the new version in theory?
[18:51:00] bd808: apt-get update; apt-get install git-review . ACK, i just did the same on some stat hosts https://phabricator.wikimedia.org/T257609
[18:51:21] just didn't think i should just do it on tools bastion without at least pinging.. and not sure how many bastions there are
[18:52:22] I'll run a clush for it. We have a template for doing that kind of thing
[18:52:37] cool:)
[18:53:54] !log tools Updating git-review to 1.27 via clush across cluster (T257496)
[18:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:53:58] T257496: toolforge needs git-review 1.27+ to work with new Gerrit deploy - https://phabricator.wikimedia.org/T257496
[18:57:27] mutante: heh. We must have ensure => latest on that somewhere. The hosts all came back with "git-review is already the newest version (1.27.0-1)." :)
[18:59:37] bd808: hah! well.. if it works it works
[19:00:46] bd808: confirmed. you do have ensure => latest in profile::toolforge::grid::exec_environ
[19:01:33] convenient but could also mean surprises
[19:02:34] yeah. it's historical at this point. So far it has not caused major malfunctions, but like you say it may sometime down the road
[19:07:24] yea, i can see an argument for both sides
[20:12:41] To figure out the Kubernetes symlink problem yesterday, I rewrote my server to take paths as arguments and report when they don't point to a directory or file as expected.
[20:13:23] Now the server is crashing with the message '/data/project/templatehoard/git/templatehoard-server/cbor is not a directory'
[20:14:44] This is a symlink to a directory, /mnt/nfs/labstore-secondary-tools-project/templatehoard/www/static/dump/20200701/
[20:15:01] Perhaps the target isn't valid inside Kubernetes
[20:15:27] Erutuon: can I have a tl;dr of the problem?
[20:16:13] or is there a ticket about it?
[20:17:22] My Rust server checks that `~/git/cbor` is a directory or a symlink to one, and finds it is not, but it is a symlink to a directory when I am logged in via SSH.
[20:18:06] er, `~/git/templatehoard-server/cbor`
[20:18:08] sorry
[20:18:15] no ticket yet
[20:18:31] because for all I know I could be doing something wrong here
[20:18:56] this is under the templatehoard tool
[20:19:39] is it possible for you to add a sleep indefinitely before that check? you can enter the container and see if that is the case
[20:20:16] ("if that is the case" referring to `~/git/templatehoard-server/cbor` being ENOENT)
[20:21:18] I could add a sleep to the bash script
[20:21:31] sleeping in the server itself would require recompiling
[20:21:46] I'm new to Kubernetes so I don't know how to enter the container
[20:22:28] `kubectl exec -it <pod-name> -- /bin/bash`, iirc
[20:22:45] `<pod-name>` you can get from `kubectl get pods`
[20:26:09] Okay, inside there, `/mnt/nfs/labstore-secondary-tools-project` doesn't exist so the symlink points to a nonexistent file
[20:28:06] isn't `/mnt/nfs/labstore-secondary-tools-project` `/data/project`? the latter should exist everywhere
[20:28:28] the correct path would be /data/project/templatehoard/www/static/dump/20200701
[20:28:49] so does `/data/project/templatehoard/www/static/dump/20200701` exist?
[20:28:54] yeah
[20:29:09] so use that :) you figured it out
[20:29:12] so in the server script I have to spell out `/data/project` instead of using `~`?
[20:29:26] er `/data/project/templatehoard`
[20:29:41] doesn't ~ expand to `/data/project/templatehoard`?
[20:29:57] are you using `realpath`?
[20:30:04] no
[20:30:07] to both
[20:30:48] my server's bash script contains `RUST_LOG=all ~/bin/server --cbor ~/git/templatehoard-server/cbor ...`
[20:31:32] oh but the problem is when I created the symlink, it used the `/mnt/nfs/labstore-secondary-tools-project` path rather than `/data/project`
[20:31:42] I just used `ln -s`
[20:31:50] looks like a fish problem
[20:31:53] https://www.irccloud.com/pastebin/k20Q8wvu/
[20:33:08] aha
[20:33:28] gah
[20:34:30] wonder if it's a bug or intended behavior
[20:37:00] strangely, it doesn't realpathify $HOME
[20:37:39] hmm, I think this is fixed in Fish 3.1.2
[20:37:56] the server is running Fish 2.4.0, quite old
[20:39:48] I recall that the older Fish resolved symlinks unexpectedly when I was cd-ing and the newer one didn't
[20:40:17] well, that's what you get for trying to run stable systems for years ;)
[20:40:19] https://github.com/fish-shell/fish-shell/commit/0f0bb1e10f0f9d749b3d4a48de3a9d86376a7825#diff-ff25c22e06664808e8ba36fcd39b6439R775
[20:40:26] when I installed it using a newer PPA
[20:40:38] hi Erutuon
[20:41:02] oh hi!
[20:41:07] so that would be issue https://github.com/fish-shell/fish-shell/issues/3350
[20:41:34] Erutuon: I missed you yesterday, but it seems like the issue is a path one?
[20:41:42] I wonder if it would break anything to install a more recent version of Fish
[20:41:52] I believe k8s only has access to it as /data/project, not the /mnt/nfs path...
[20:42:02] yeah
[20:42:23] the overhead of installing a new version... oh good luck, highly don't recommend that rabbit hole
[20:42:30] it was ultimately an issue with Fish shell using the /mnt/nfs path for ~
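Editor's note: the failure mode above, summarized: fish 2.4.0 resolved the link target to the NFS server's real path when the symlink was created, and that path is not mounted inside the Kubernetes containers. A quick way to audit such a link, sketched in Python; the path is the one from this discussion:

    import os

    link = "/data/project/templatehoard/git/templatehoard-server/cbor"
    target = os.readlink(link)  # reads the link's target without resolving it
    print(target)
    # a target under /mnt/nfs/... works on the bastions but not in k8s;
    # recreating the link against /data/project/... avoids the mismatch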
[20:43:17] well, it isn't hard on Ubuntu 18, which I was running
[20:43:21] don't know about Debian though
[20:43:40] https://software.opensuse.org/download.html?project=shells%3Afish%3Arelease%3A3&package=fish
[20:43:44] we don't install PPAs
[20:43:59] ahh
[20:45:18] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Local_package_policy
[20:46:01] fish is my default shell on my laptop, but I've found that all its really nice convenience features often cause headaches on servers, so I stick to bash for everything else
[20:47:17] yeah, I used to use zsh a lot and was too lazy to install it everywhere and eventually switched back to bash
[20:47:38] can't get too spoiled by convenience features ;)
[20:49:42] well, I'm less enthusiastic now I realize it's an older version
[20:50:31] eventually the bastions will be upgraded to Debian Buster and that has 3.0.2
[20:50:49] depending on how motivated you are you can always compile it yourself :)
[20:54:18] I think I copied the binary over when I started out on Toolforge, but maybe ran into some library issue
[20:54:45] just use bash folks :)
[20:55:05] fun shells are for your laptop, not a shared server environment
[20:56:06] yeah if you copy over some binary, unless it's statically compiled it's almost guaranteed you'll get a library issue
[20:56:15] legoktm: I kind of doubt the bastions will ever be Buster. I think they will more likely be the release that comes after Buster. WMCS has no roadmap plan to rebuild Toolforge at the OS level this year.
[20:56:26] now I'm getting 502 from templatehoard
[20:56:42] and nobody statically compiles most programs
[20:57:17] bd808: ack, bullseye then (yes, Debian picked two "bu..." names back to back)
[20:57:44] which is basically, if you copy a binary, it's probably not going to work
[20:57:54] oh whew, never mind, forgot to pass the port number
[20:58:07] * zhuyifei1999_ goes to fix https://lore.kernel.org/bpf/cover.1594065127.git.zhuyifei@google.com/T/#m6c38fde3d4908f0a7e1efdd0a8003f13eeeccbe4 ;)
[20:58:37] heh, Rust is statically compiled by default
[20:58:50] so I probably *could* just copy it over
[20:59:12] in fact, I have copied over some Rust tools and they work just fine
[21:00:28] I use Bash for all my scripts at the moment, despite having started a ticket about getting Fish working on the Grid Engine...
[21:00:29] you mean the rust compiler or stuff built with Rust?
[21:00:54] built with the Rust build tools
[21:01:09] ripgrep, fd-find, bat
[21:02:14] that's interesting.
[21:02:40] though /me still likes dynamic linking better ;)
[21:04:07] my $0.02USD: static linking is great for speed of code and runtime memory usage and horrible for security patching
[21:04:49] makes sense
[21:06:16] Rust is slow to compile (if you don't have compatible library artifacts in your build directories) so it can be painful to recompile
[21:07:06] zhuyifei1999_: TIL where your day job is :)
[21:07:25] however at least it's usually just `cargo build --release` or `cargo install`
[21:07:33] legoktm: lol
[21:08:25] Erutuon: are you using service.template btw? you can put the arguments to your binary in there so they don't get accidentally forgotten in the future
[21:08:32] (or just hardcode port 8000)
[21:09:03] legoktm: spread the good word on service.template :)
[21:09:05] I am, but the only argument at the moment is the name of my bash script that calls the binary
[21:09:16] I haven't found any documentation so I don't know how to do more with it
[21:09:25] Erutuon: https://wikitech.wikimedia.org/wiki/User:Legoktm/Rust_on_Toolforge#webservice
[21:09:39] right, but how do I access $PORT from the service.template?
[21:09:50] if you're running on k8s it's always 8000
[21:09:54] ohh
[21:10:53] and do I have to put each space-separated argument on a separate line?
[21:10:57] https://wikitech.wikimedia.org/w/index.php?title=User%3ALegoktm%2FRust_on_Toolforge&type=revision&diff=1873068&oldid=1872477
[21:11:09] I believe so, it's a yaml file
[21:11:30] you could also do uh, extra_args: ["arg1", "arg2"] I think
[21:12:19] for reference, it gets executed with: os.execv(self.extra_args[0], self.extra_args)
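Editor's note: putting the advice above together, a hypothetical service.template for a tool like templatehoard. The wrapper-script path and the type are placeholders; the extra_args list and the hardcoded port follow the exchange above (per the os.execv note, the first list item is the executable):

    # hypothetical $HOME/service.template
    backend: kubernetes
    type: python3.7   # placeholder: whichever webservice type the tool already uses
    extra_args:
      - /data/project/templatehoard/bin/run-server.sh   # wrapper that execs the Rust binary
      # the server can hardcode port 8000, so no $PORT lookup is needed on k8s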
[21:16:01] hmm, and is there a way to pipe stderr to a file?
[21:16:37] maybe it is by default...
[21:17:49] I think you might need a shell wrapper for that
[21:18:47] so is stderr just thrown away by default?
[21:19:21] stderr goes to the Kubernetes output buffer by default
[21:19:41] which you can get back using `kubectl logs <pod-name>`
[21:20:10] In a more ideal world we would be sending that also to something like an ELK cluster
[21:20:49] but multi-tenant ELK stacks are hard and we haven't made that a major project yet
[21:22:08] maybe my server should just have a log file argument
[21:22:36] the more you can avoid reading/writing to the NFS $HOME, the better your tool will run
[21:22:53] but yeah if you need crash logs right now you have to make them yourself
[21:23:52] mostly what I'm interested in is search queries
[21:24:46] though the log also showed me the error message about the broken symlink
[21:33:35] any way to tell kubectl to save logs from templatehoard-* pods to somewhere in my home directory?
[21:34:56] not in an ongoing way, no
[21:35:48] the Kubernetes way™ to do that would be with a sidecar container in the pod that watched the event stream and serialized it to disk storage
[21:36:05] but this circles back to "the more you can avoid reading/writing to the NFS $HOME, the better your tool will run"
[21:37:29] Erutuon: I guess the reasonably easy way to do that would be to make your wrapper script exec your process as a shell pipeline that used `>>` or `|tee -a` to write to a file in your tool's $HOME
[21:38:07] yeah, at the moment my Bash script uses `>>`
[21:38:46] the next-level way would be to use a logging framework in your app and set up output filters in that framework to write some/all things to disk
[21:39:00] but I'm hoping to be able to use service.template to specify the arguments to my server
[21:39:10] stashbot does the latter, with only error+ messages ending up on disk
[21:39:11] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[21:39:19] * bd808 pets stashbot
[21:43:27] guess I'll have to do something like that
[21:44:22] there are probably some more flexible Rust libraries for logging
[21:48:04] Erutuon: https://github.com/rust-lang/log and then one of the implementation modules for that interface
[21:48:27] yeah, I'm using that, with the env_logger backend
[21:49:47] flexi_logger might work, if I can figure it out
[21:52:20] or one of the even fancier frameworks
[21:56:49] env_logger seems to minimize writes, because there's sometimes a delay between a request and when it appears in the file
[21:57:06] might use BufWriter internally
[22:01:03] Erutuon: you will get that lag from NFS too, even if things are unbuffered. Your code writes from NFS client A and you read via NFS client B, so there are multiple layers of file cache to be invalidated before you can see the output
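Editor's note: the "logging framework with output filters" pattern bd808 describes, sketched in Python since stashbot is written in Python; in Erutuon's Rust server the log facade plus a backend such as flexi_logger would play the same role. Everything goes to stderr, where `kubectl logs` can retrieve it, and only ERROR and above is persisted to the NFS-backed $HOME, keeping writes rare; the log file path is hypothetical:

    import logging

    root = logging.getLogger()
    root.setLevel(logging.DEBUG)

    # everything still goes to stderr, where Kubernetes captures it
    stderr_handler = logging.StreamHandler()
    stderr_handler.setLevel(logging.DEBUG)
    root.addHandler(stderr_handler)

    # only ERROR and above are written to disk (hypothetical path)
    file_handler = logging.FileHandler("/data/project/example/error.log")
    file_handler.setLevel(logging.ERROR)
    root.addHandler(file_handler)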