[00:17:18] Hey folks! So, Weeklypedia (https://weekly.hatnote.com) is broken right now due to the upgrade of the recentchanges table from rc_type to rc_source. I've got the queries updated, but I'm having trouble restarting the service.
[00:17:19] When I try to use toolforge webservice to restart, I get:
[00:17:21] requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://k8s.tools.eqiad1.wikimedia.cloud:6443/apis/apps/v1/namespaces/tool-weeklypedia/deployments
[00:17:22] Might be due to this being Python 2, but doesn't look like that on its face. I can get the code updated of course, though the weekly email is supposed to go out tomorrow night, so time is tight.
[00:42:57] That usually means the job already exists, can you paste the exact command you're trying
[02:18:24] understandable, but I checked that:
[02:18:25] $ toolforge webservice python2 status
[02:18:27] DEPRECATED: 'python2' type is deprecated.
[02:18:28] See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes
[02:18:30] for currently supported types.
[02:18:31] Your webservice is not running
[02:18:33] And:
[02:18:34] $ toolforge webservice python2 start
[02:18:36] is what yielded the original error
[07:00:15] well, `kubectl get pods` / `kubectl get deployments` show something is still running
[07:01:03] you could either delete that deployment manually with `kubectl delete deployment weeklypedia` or restart with `kubectl rollout restart deployment weeklypedia`
[07:01:12] the latter is probably safer
[07:08:05] @telemoud: I fixed the inconsistent state that tool got in, so it operates normally again. but yes, you should migrate away from python 2 at some point.
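For context on the rc_type to rc_source change mentioned above: MediaWiki's recentchanges table replaced numeric rc_type filtering with the string rc_source column, so replica queries need to swap conditions like `rc_type = 0` for `rc_source = 'mw.edit'` (likewise 1 becomes 'mw.new' and 3 becomes 'mw.log'). A minimal sketch of what an updated query could look like, run through the Toolforge `sql` replica helper; the columns shown are illustrative, not Weeklypedia's actual queries:

```
# sketch only: swap numeric rc_type filters for string rc_source values
sql enwiki <<'EOF'
SELECT rc_title, rc_timestamp
FROM recentchanges
WHERE rc_source = 'mw.edit'   -- was: rc_type = 0 (edits)
ORDER BY rc_timestamp DESC
LIMIT 10;
EOF
```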
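And for the 409 Conflict itself: as noted above, the Deployment object already exists in the tool's Kubernetes namespace, so `webservice ... start` cannot create a new one. A minimal sketch of the manual cleanup described in the exchange, run from a Toolforge bastion as the tool account; the deployment name comes from the conversation, and whether to restart or delete depends on what state the object is in:

```
become weeklypedia                               # switch to the tool account
kubectl get pods                                 # see what is still running
kubectl get deployments                          # the stale Deployment causing the 409
kubectl rollout restart deployment weeklypedia   # safer option: restart in place
# or, if the object itself is broken:
kubectl delete deployment weeklypedia
toolforge webservice python2 start               # then start again (and plan the python2 migration)
```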
[08:41:31] !log toolsbeta test reboot of toolsbeta-nfs-4
[08:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:47:18] hi, I don't seem to be able to create any VMs in Horizon successfully today. So far I've created 3 and they never boot up to the point where sshd is running and accepting connections
[11:47:25] this one has been stuck for 12 minutes now: https://horizon.wikimedia.org/project/instances/42fac45f-256d-4383-8e0b-36b6ec91bf0d/console
[11:47:34] the one before, I waited 20 minutes before deleting it
[11:47:51] anyone around who could take a look?
[11:48:25] which project? horizon urls are specific to the project you have selected
[11:48:37] catalyst-dev
[11:48:59] I'll have a look
[11:49:06] ty!
[11:54:41] jnuche: somehow that instance is not responsive even on the serial console
[11:57:22] I did a hard reboot and it's continuing now, let's see if I can find what went wrong the first time
[11:58:11] what happened? IRCCloud lost connection to IRC
[11:58:54] instances for me are never getting to the point they accept ssh connections
[11:59:18] taavi was saying one of the instances wasn't even reachable with the serial console
[11:59:42] journal shows this line after what horizon had:
[11:59:42] > Sep 25 11:39:11 k3s-control-plane01 kernel: clocksource: Long readout interval, skipping watchdog check: cs_nsec: 5562949299 wd_nsec: 5562949406
[12:00:43] jnuche: by any chance, did you attach the "k3s-control-plane01-data" volume to the VM around the time it froze?
[12:02:10] taavi: can't say exactly when it happens, but tofu is indeed attaching a volume after OpenStack reports the VM has been created: https://gitlab.wikimedia.org/repos/test-platform/catalyst/catalyst-tofu/-/blob/main/modules/node/main.tf?ref_type=heads#L52
[12:02:33] that piece of config hasn't changed in a month, was still working fine yesterday
[12:03:12] Can someone check the logs for deployment 20250925-120109-60idmwtm3o or 20250925-115803-auxbzihn7t getting `add-dangling-edits-to-group(failed): 422 Client Error: Unprocessable Entity for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/v1/tool/cluebotng-review/jobs/ (422): no details` 🤷‍♀️
[12:04:33] that might be me, one sec
[12:04:51] Hmm Error creating: pods "add-reviews-from-report-29311995-5gsg6" is forbidden: maximum cpu usage per Container is 3, but limit is 4
[12:05:03] same as I get in lima-kilo since yesterday strangely
[12:05:14] oh, that should not happen, looking
[12:05:27] it's using default btw
[12:06:20] gotta go out for a bit, but feel free to play with anything under cluebotng-review, there is something a bit odd... was just trying to do a new deploy but noticed the scheduled jobs failed last night
[12:56:08] a disk quota bump is also blocking me atm
[12:56:42] any chance anyone has any free bandwidth right now? :) https://phabricator.wikimedia.org/T405363
[13:10:37] taavi: any progress? is it worth trying to recreate an instance again?
[13:19:01] jnuche: as I said, seems like something gets unhappy about hotplugging that volume into a VM that's running at that moment. at least rebooting the VM works around that
[13:19:03] cc andrewbogott
[13:21:34] dcaro: also happening for 20250925-125606-a71b64ll3h on cluebotng-trainer, so seems to be something genericish across tool accounts
[13:22:03] yep, there's some issue with cronjobs in deployments (they create ok manually), I've reverted the last deployment, re-checking
[13:22:40] unrelated but `jobs dump` is not carrying `port` through though it exists in `show`, so dump/reload (was going to test with this) blows up... will make a ticket for that in a bit
[13:23:26] Damianz: it should be working now, both
[13:23:28] will look into that
[13:29:13] taavi: I see, thanks for investigating. It's unfortunate because it was working until recently. I will try changing the behavior of our tofu stack
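A rough shell sketch of the manual equivalent of what the tofu module does here, using the standard OpenStack CLI, in case it helps reproduce or work around the freeze. The server and volume names are the ones from the log; waiting for ACTIVE before the hot-plug and the hard reboot are only the workarounds described above, not a confirmed fix for the tofu stack:

```
# wait until the instance reports ACTIVE before hot-plugging the data volume
openstack server show k3s-control-plane01 -f value -c status

# attach the volume (this is roughly what the tofu module does right after creation)
openstack server add volume k3s-control-plane01 k3s-control-plane01-data

# if the guest freezes after the hot-plug, a hard reboot recovers it
openstack server reboot --hard k3s-control-plane01
```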
[13:48:58] Will check in a few min, just eating lunch
[14:28:55] dcaro: looks good, both deployments succeeded now
[14:31:32] https://phabricator.wikimedia.org/T405601 for the dump issue
[14:32:01] If I had a euro for every bug I'd have at least 5 euro... probably not the most lucrative business model
[16:12:17] @lucaswerkmeister / @taavi good info, working now, thanks!
[16:17:35] Is it intended that a deployment is `successful` while Runs are still `pending`... that seems questionable
[16:17:59] They are slowly changing state, which would imply the deployment is actually still `running`
[16:22:08] it should not
[16:26:55] https://phabricator.wikimedia.org/T405620 here you go then
[16:27:28] I was very confused why things had not updated even though the deployment had finished, until I checked `toolforge components deployment show`
[16:33:48] Definitely reproducible as well, 20250925-163154-5rpfbytgfe is doing the same
[19:16:08] !log damian-scripts@tools-bastion-15 tools.cluebotng bot deployed @ refs/tags/v1.2.0
[19:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng/SAL
[19:17:45] !log damian-scripts@tools-bastion-15 tools.cluebotng bot deployed @ refs/tags/v1.2.1
[19:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng/SAL
[21:20:16] !log damian-scripts@tools-bastion-15 tools.cluebotng bot deployed @ refs/tags/v1.2.2
[21:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng/SAL
[21:20:40] !log damian@tools-bastion-15 tools.cluebotng bot deployed @ v1.2.2
[21:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng/SAL
[21:53:01] dcaro: looks like that haproxy session issue is happening again if you're around to grab the logs mentioned in the ticket
[21:53:34] ingress maxed out at 2k sessions, tools down, previous ticket T405280
[21:53:35] T405280: [infra,haproxy,ingress] 2025-09-23 Ingress hitting the backend session limit and started replying with 5xxs - https://phabricator.wikimedia.org/T405280
[21:53:41] (or anyone else with shell there)
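For whoever does have shell on the ingress nodes, a minimal sketch of how the session counts could be checked against the limit through haproxy's runtime socket; the socket path is an assumption and may differ on this setup, and the actual logs to collect are the ones listed in T405280:

```
# on the ingress haproxy host (socket path is an assumption, adjust as needed)
echo "show info" | sudo socat stdio /run/haproxy/haproxy.sock | grep -Ei 'CurrConns|Maxconn'

# per-proxy current / max / limit sessions (scur, smax, slim columns of the stats CSV)
echo "show stat" | sudo socat stdio /run/haproxy/haproxy.sock | cut -d, -f1,2,5,6,7
```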