[03:05:20] bd808: any chance there exists a means to inquire as to why the webservice just hangs after just a few hours of uptime?
[03:45:12] Too busy to respond?
[03:45:22] The web service, not Bryan
[03:46:01] Is it possible for someone to take the webservice offline by making too big a request?
[03:46:30] Like if I showed up and asked the bot to archive links on 1,000 World War II-sized articles
[11:35:09] !log tools running `grid-configurator --all-domains` which basically added tools-sgebastion-10,11 as submit hosts and removed tools-sgegrid-master,shadow as submit hosts
[11:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:38:42] !log tools added 'will get out of space in X days' panel to the dashboard https://grafana-labs.wikimedia.org/goto/kBlGd0uGk (T279990), we got <5days xd
[14:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:38:46] T279990: [tools] prometheus out of space - https://phabricator.wikimedia.org/T279990
[14:41:49] I'm trying to follow the Kubernetes Example deployment.yaml instructions, but I get: "error when creating "deployment.yaml": deployments.apps is forbidden"
[14:42:10] inductiveload: which example are you using?
[14:42:29] the stashbot one
[14:42:43] (it's a continuous thing, not a cronjob)
[14:42:51] * bd808 goes to see what lies he wrote on wikitech :)
[14:43:39] :-D I ran "kubectl create --validate=True -f deployment.yaml"
[14:44:04] is the full file you are using available somewhere so I can take a look?
[14:44:49] inductiveload: hmmm.. I'm not seeing the obvious problem in the example. Can I take a look at your deployment.yaml file? Either in a pastebin or you can tell me the tool name and I can look on the bastion.
[14:45:03] https://dpaste.org/0ZDq
[14:45:49] inductiveload: your tool is named "thumb-poke", not "thumbpoke"
[14:45:59] yes ^ that
[14:46:30] urrrgggghhh
[14:46:47] The error message you got is really badly written, but the problem is that your deployment.yaml has the wrong namespace name in it
[14:47:06] makes sense in retrospect ^_^
[14:47:27] easy enough to do wrong and easy enough to fix :)
[14:47:37] I guess it's because the service account does not have access to the wrongly named namespace? that message is certainly confusing, but I don't see how we could do anything to it :/
[14:47:42] "deployment.apps/thumb-poke created"
[14:47:49] \o/
[14:47:50] \o/
[14:48:39] i thought the namespace was just a thing inside k8s, didn't realise it had to be _exactly_ the tool name
[14:49:28] > Each tool has been granted control of a Kubernetes "namespace". Your tool can only create and control objects in its namespace. A tool's namespace is the same as the tool's name with "tool-" appended to the beginning (e.g. tool-admin, tool-stashbot, tool-hay, etc).
[14:49:34] it is "just a thing" inside k8s, but also the credentials we issue to each tool are limited to using a namespace that exactly matches the tool account's name
[14:49:36] it actually is written there, just lower down
[15:01:22] can you hand a k8s container a python virtualenv to use?
[15:04:03] yes, create the venv in a kubernetes interactive pod (`webservice shell` works), and then just use the full path of its python executable /bin/python in the deployment
[15:04:45] ok, so webservice doesn't actually need to look like a "standard" server to work?
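A minimal sketch of the venv-in-a-pod workflow described above, assuming the tool is named thumb-poke (so its home is /data/project/thumb-poke) and that a requirements.txt and worker.py exist; those file names are illustrative:

    # On the bastion, as the tool account: open an interactive shell in a Kubernetes pod
    webservice shell

    # Inside the pod: build the venv on the tool's NFS home so it survives pod restarts
    python3 -m venv /data/project/thumb-poke/venv
    /data/project/thumb-poke/venv/bin/pip install --upgrade pip
    /data/project/thumb-poke/venv/bin/pip install -r /data/project/thumb-poke/requirements.txt
    exit

    # The deployment then launches the venv's interpreter by its full path, e.g.
    #   command: ["/data/project/thumb-poke/venv/bin/python3", "worker.py"]   # illustrative

The full path keeps working because the tool's home directory is mounted at the same path inside the pod.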
[15:05:34] i basically have a background thread and a worker process (via redis). there is currently not actually a web frontend as such
[15:06:51] `webservice shell` is just the easiest way to get an interactive shell inside a kubernetes container, I don't remember if running something else as a webservice will get confused (read: constantly restarting) or not if there is no webservice running
[15:07:25] there will eventually be a webservice too, so worst case I can serve a dummy page
[15:08:10] bd808: is there a way to make a query that will help enlighten me why the webservice hangs up after just a few hours?
[15:13:01] Cyberpower678: hmmm... I can't think of any way to "make a query", but I would expect that some kind of errors would be logged. One thing that comes to mind as a possibility is lighttpd worker thread exhaustion if you are leaking php processes somehow or having some of them hang.
[15:14:07] I figured it's a leaked process, but I wouldn't know where to look to patch it, without knowing what query was sent to the UI to cause it. :-)
[15:15:48] bd808: so to rephrase, any way to find these leaked processes?
[15:19:44] you can turn on access logs if you'd like -- https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web/Lighttpd#Web_logs -- but I would also recommend setting some reminder for yourself to turn them back off in a few days.
[15:20:13] and ~/error.log should show if lighttpd is running out of processes
[15:45:54] hmm, so it seems it's _almost_ working, but kubectl logs ... is empty despite something clearly crashing in the container
[15:48:25] bd808: thanks a bunch, as usual
[15:49:48] inductiveload: try kubectl describe pod, you can get the pod name from kubectl get pods
[15:50:27] that just says "Back-off restarting failed container"
[15:50:34] but not why
[16:31:25] bd808: the error.log is giving no indication that the webservice is running out of processes. :-(
[16:32:07] Yet, the webservice just hangs up with the eventual 502 bad gateway, after serving for just a few hours.
[16:32:25] *504
[16:36:25] Cyberpower678: your error.log is completely empty. did you delete the file at some point after starting the webservice? If so you detached the inode that the lighttpd process is writing to from the filesystem. You'll have to restart the webservice to restore log output.
[16:36:53] bd808: I truncated it to clear the errors from yesterday
[16:37:12] Before doing so I checked for lighttpd errors. There were none.
[16:37:33] Only the onslaught of PHP Fatal errors from that weird mysqli bug
[16:39:16] truncate doesn't detach the inode, does it?
[16:39:22] bd808 ^
[16:39:44] ok, so now both containers are "running", but only one shows logs (the worker shows some celery spew, but I think the other service isn't working)
[16:39:57] Cyberpower678: so just to make sure we are on the same page here... you restarted the service yesterday, after which you truncated the error.log file, and now the webservice is up and running, but you want to know what errors may have happened before you truncated the error.log?
[16:40:44] Cyberpower678: I say working fine because both https://iabot.toolforge.org/ and https://iabot.toolforge.org/toolinfo.json load immediately for me
[16:41:01] I restarted the webservice a couple times yesterday and today, and after the latest restart a moment ago, I truncated the file. Before doing so I checked for those errors you mentioned and found none. Just entries about me restarting the webservice repeated.
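A quick illustration of the truncate-versus-delete distinction being debated here, using the error.log path from the conversation; the inode number shown by ls is the thing that must not change:

    ls -li ~/error.log          # note the inode number and the current size
    truncate -s 0 ~/error.log   # empties the file in place
    ls -li ~/error.log          # same inode, size 0: a process holding the file open keeps writing to it

    # Deleting and recreating the file is the case bd808 was worried about: the running
    # process keeps writing to the old, now-unlinked inode and the new file stays empty.
    # rm ~/error.log && touch ~/error.log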
[16:41:38] Doing the restart so I can see the access.log since the doc says it needed a webservice restart
[16:42:55] Cyberpower678: the error.log truncation removes any trace of when the service started (that gets logged there), which in turn makes getting debugging support from me difficult because I can't see this information that is only in your head.
[16:43:20] leave it be until it hangs again I guess and then hope that there is some trace in error.log
[16:43:44] I have a snippet of the restart entries containing the PIDs of the old webservices.
[16:43:48] If that helps.
[16:44:18] 2021-04-16 16:27:05: (server.c.1828) server stopped by UID = 53156 PID = 9938
[16:46:01] The access log will at least help me to identify the last queries to the webservice before it goes down.
[18:41:55] Hi - where do these special DNS entries come from? https://openstack-browser.toolforge.org/project/deployment-prep
[18:42:11] I don't see most of them at https://horizon.wikimedia.org/project/proxy/
[18:43:18] such as forceupdate.beta.wmflabs.org
[18:44:03] Krinkle: horizon, dns -> zones
[18:44:04] bd808: it's hanging right now.
[18:44:11] I haven't touched anything.
[18:44:47] Majavah: ah, "Record sets"
[18:44:51] I did check there but I missed "record sets"
[18:44:56] thx!
[18:46:15] Cyberpower678: your ~/error.log is 100% empty still. No php errors, no system errors. I'd say that confirms my suspicion that you ended up detaching the stderr stream from the named inode
[18:46:41] I thought truncate doesn't do that.
[18:46:59] Cyberpower678: my only recommendation at this point is to restart and let it actually collect data
[18:47:00] I was told that using truncate is a great way to empty out logs without detaching them from the process.
[18:48:16] bd808: I'm sure it wasn't detached, but I restarted it anyway.
[18:48:34] There were no errors the last 6 times I restarted either.
[18:51:02] "2021-04-16 18:47:30: (server.c.1751) [note] graceful shutdown started" -- that vindicates you Cyberpower678. The inode was attached
[18:51:20] so why was there nothing at all in the log? That seems funny
[18:51:36] I don't know. :-(
[19:13:38] I'm looking at the last use of deployment.wikimedia.beta and found these in Puppet for openstack: https://codesearch.wmcloud.org/operations/?q=wikimedia.beta.wmflabs.org&i=nope&files=wmfkeystonehooks&excludeFiles=&repos=
[19:13:47] (cc andrewbogott )
[19:14:06] are those interchangeable with another beta wiki to your knowledge, or do we need to migrate something as part of T198673
[19:14:06] T198673: Remove deployment.wikimedia.beta.wmflabs.org wiki (deploymentwiki) - https://phabricator.wikimedia.org/T198673
[19:14:26] for most purposes meta.wikimedia.beta or en.wikipedia.beta tends to be used
[20:09:53] Krinkle: those look to be dummy/testing defaults. The operational script in our prod would be editing pages on wikitech.wikimedia.org
[20:10:38] that's the helper class that is used to create/delete the Nova Resource namespace pages
[20:25:43] bd808: looks like it started hanging again
[20:32:15] Cyberpower678: https://iabot.toolforge.org/toolinfo.json responds immediately, so if it is hung up I don't think the problem is the lighttpd server. It could be the fcgi container for php though
[20:33:06] it is super weird not to see any php error log output at all
[20:50:12] https://iabot.toolforge.org/index.php isn't though. :O
[20:50:30] Ugh.
[20:50:51] Okay, this makes things even more difficult.
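One way to reproduce the check bd808 is running here: compare the URL that was still answering against the PHP page that hung. If the first returns quickly while the second times out, lighttpd itself is fine and the php-cgi workers are the bottleneck. The URLs are the ones from the log; the 10-second cap is arbitrary:

    # Kept responding in the log, which suggests lighttpd itself is healthy
    curl -sS -o /dev/null --max-time 10 \
         -w 'toolinfo.json: HTTP %{http_code} in %{time_total}s\n' \
         https://iabot.toolforge.org/toolinfo.json

    # Needs a free php-cgi worker; hangs or 504s when all of them are wedged
    curl -sS -o /dev/null --max-time 10 \
         -w 'index.php: HTTP %{http_code} in %{time_total}s\n' \
         https://iabot.toolforge.org/index.php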
[20:51:01] Cyberpower678: yeah, I added a bd808-test.php file that only dumps php_info() and it's stuck too, so it feels like the problem is likely in the fcgi php container
[20:51:52] bd808, don't you just love how my very presence seems to cause problems. :p
[20:52:11] Your backend is running on tools-sgewebgrid-lighttpd-0914 at the moment. (Found via https://sge-status.toolforge.org/) I'm trying to find a bit more about what is going on there
[21:07:57] Cyberpower678: I could only get trace dumps from 3 of the 6 php-cgi processes, but all 3 are stuck in the same place. https://phabricator.wikimedia.org/P15398
[21:09:06] Ack, one of my critical workers is STILL down
[21:09:24] I thought that thing fired up yesterday.
[21:10:05] * bd808 walks slowly away from a 4397 line php source file
[21:10:12] But that aside, maybe I should put a wait timeout in there for the web UI processes.
[21:10:51] bd808 Lol, the size of it scared you off?
[21:10:55] you have a hard sleep in a loop with no break condition
[21:11:57] no emergency break condition is probably a better way to say it. so once it gets wedged that php worker is dead for all time
[21:12:18] More than 450 pending requests.
[21:13:13] Well it won't get wedged, it's waiting for the worker to return a response to its request. Once I get the worker up, it will clear out the pending requests, and the hung processes will free.
[21:14:08] Not a problem for CLI workers, but for Web UI workers, I should put some kind of timeout in place so the FCGI containers won't hang up.
[21:14:18] is there an architecture diagram for this stuff anywhere?
[21:14:38] Nope. Ocaasi and I are working to document IABot thoroughly.
[21:17:11] ok. i'm done looking at this Cyberpower678. My current opinion is that the "bug" is in your code and/or system design and not the fault of any Toolforge infrastructure. If you can show more concrete evidence that there is a problem that is not self-induced I may help look again.
[21:17:53] bd808: thank you for looking. That trace was very useful.
[21:18:08] I would still be looking in the wrong places if it weren't for those.
[21:23:55] And there we go. The webservice has been freed.
[21:24:41] bd808: however, is it possible to increase PHP process from 3 to 6?
[21:24:51] *process limit?
[21:26:41] This is unrelated to the hangup.
[21:27:39] the php fcgi process count is hard coded in the webservice source
[21:28:04] can you explain why you think you need more web workers?
[21:30:31] Some WebUI requests can take a minute. The UI jobs are coded to reject large requests and instruct users to submit bot jobs instead, but sometimes when they are actively using the page analysis tool, it will easily saturate the 3-process limit, forcing other users to wait for one of them to finish.
[21:31:21] If it's not an easy change, don't worry too much about it, it's not a critical request, but it would smooth out operations a bit.
[22:17:53] Cyberpower678: maybe you should throw out if in a web process and tell the user they should request it later?
[22:18:09] Platonides?
[22:18:41] instead of that sleep(2);
[22:19:08] Platonides: that's the timeout I mentioned.
[22:19:21] Implement a wait timeout for web UI requests.
[23:15:32] !log tools cleaned up all source files for the grid with the old domain name to enable future node creation T277653
[23:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:15:38] T277653: Toolforge: migrate grid to Debian Buster - https://phabricator.wikimedia.org/T277653
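The log doesn't say how the trace dumps in P15398 were captured. As one generic way to see where wedged php-cgi workers are blocked, assuming shell access to the exec node (tools-sgewebgrid-lighttpd-0914 here), that the tool's processes run as the tools.iabot account, and permission to attach a debugger:

    # List the tool's php-cgi workers on the exec node
    pgrep -u tools.iabot php-cgi

    # Dump a native backtrace from each one; a worker showing the same frames on
    # repeated samples is blocked rather than merely busy
    for pid in $(pgrep -u tools.iabot php-cgi); do
        echo "=== php-cgi PID ${pid} ==="
        gdb -p "${pid}" -batch -ex 'bt'
    done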