[00:17:18] Hey folks! So, Weeklypedia (https://weekly.hatnote.com) is broken right now due to the upgrade of the recentchanges table from rc_type to rc_source. I've got the queries updated, but I'm having trouble restarting the service.
[00:17:19] When I try to use toolforge webservice to restart, I get:
[00:17:21] requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://k8s.tools.eqiad1.wikimedia.cloud:6443/apis/apps/v1/namespaces/tool-weeklypedia/deployments
[00:17:22] Might be due to this being Python 2, but doesn't look like that on its face. I can get the code updated of course, though the weekly email is supposed to go out tomorrow night, so time is tight.
[00:42:57] That usually means the job already exists, can you paste the exact command you're trying
[02:18:24] understandable, but I checked that:
[02:18:25] $ toolforge webservice python2 status
[02:18:27] DEPRECATED: 'python2' type is deprecated.
[02:18:28] See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes
[02:18:30] for currently supported types.
[02:18:31] Your webservice is not running
[02:18:33] And:
[02:18:34] $ toolforge webservice python2 start
[02:18:36] is what yielded the original error
[07:00:15] well, `kubectl get pods` / `kubectl get deployments` show something is still running
[07:01:03] you could either delete that deployment manually with `kubectl delete deployment weeklypedia` or restart with `kubectl rollout restart deployment weeklypedia`
[07:01:12] the latter is probably safer
[07:08:05] @telemoud: I fixed the inconsistent state that tool got in, so it operates normally again. but yes, you should migrate away from python 2 at some point.
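For context on the rc_type to rc_source change mentioned above: MediaWiki's recentchanges table replaced numeric rc_type filtering with the string rc_source column, so replica queries need to swap conditions like `rc_type = 0` for `rc_source = 'mw.edit'` (likewise 1 becomes 'mw.new' and 3 becomes 'mw.log'). A minimal sketch of what an updated query could look like, run through the Toolforge `sql` replica helper; the columns shown are illustrative, not Weeklypedia's actual queries:

```
# sketch only: swap numeric rc_type filters for string rc_source values
sql enwiki <<'EOF'
SELECT rc_title, rc_timestamp
FROM recentchanges
WHERE rc_source = 'mw.edit'   -- was: rc_type = 0 (edits)
ORDER BY rc_timestamp DESC
LIMIT 10;
EOF
```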
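And for the 409 Conflict itself: as noted above, the Deployment object already exists in the tool's Kubernetes namespace, so `webservice ... start` cannot create a new one. A minimal sketch of the manual cleanup described in the exchange, run from a Toolforge bastion as the tool account; the deployment name comes from the conversation, and whether to restart or delete depends on what state the object is in:

```
become weeklypedia                               # switch to the tool account
kubectl get pods                                 # see what is still running
kubectl get deployments                          # the stale Deployment causing the 409
kubectl rollout restart deployment weeklypedia   # safer option: restart in place
# or, if the object itself is broken:
kubectl delete deployment weeklypedia
toolforge webservice python2 start               # then start again (and plan the python2 migration)
```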
[08:41:31] !log toolsbeta test reboot of toolsbeta-nfs-4
[08:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:47:18] hi, I don't seem to be able to create any VMs in Horizon successfully today. So far I've created 3 and they never boot up to the point where sshd is running and accepting connections
[11:47:25] this one has been stuck for 12 minutes now: https://horizon.wikimedia.org/project/instances/42fac45f-256d-4383-8e0b-36b6ec91bf0d/console
[11:47:34] the one before, I waited 20 minutes before deleting it
[11:47:51] anyone around who could take a look?
[11:48:25] which project? horizon urls are specific to the project you have selected
[11:48:37] catalyst-dev
[11:48:59] I'll have a look
[11:49:06] ty!
[11:54:41] jnuche: somehow that instance is not responsive even on the serial console
[11:57:22] I did a hard reboot and it's continuing now, let's see if I can find what went wrong the first time
[11:58:11] what happened? IRCCloud lost connection to IRC
[11:58:54] instances for me are never getting to the point they accept ssh connections
[11:59:18] taavi was saying one of the instances wasn't even reachable with the serial console
[11:59:42] journal shows this line after what horizon had:
[11:59:42] > Sep 25 11:39:11 k3s-control-plane01 kernel: clocksource: Long readout interval, skipping watchdog check: cs_nsec: 5562949299 wd_nsec: 5562949406
[12:00:43] jnuche: by any chance, did you attach the "k3s-control-plane01-data" volume to the VM around the time it froze?
[12:02:10] taavi: can't say exactly when it happens, but tofu is indeed attaching a volume after OpenStack reports the VM has been created: https://gitlab.wikimedia.org/repos/test-platform/catalyst/catalyst-tofu/-/blob/main/modules/node/main.tf?ref_type=heads#L52
[12:02:33] that piece of config hasn't changed in a month, was still working fine yesterday
[12:03:12] Can someone check the logs for deployment 20250925-120109-60idmwtm3o or 20250925-115803-auxbzihn7t getting `add-dangling-edits-to-group(failed): 422 Client Error: Unprocessable Entity for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/v1/tool/cluebotng-review/jobs/ (422): no details` 🤷‍♀️
[12:04:33] that might be me, one sec
[12:04:51] Hmm Error creating: pods "add-reviews-from-report-29311995-5gsg6" is forbidden: maximum cpu usage per Container is 3, but limit is 4
[12:05:03] same as I get in lima-kilo since yesterday strangely
[12:05:14] oh, that should not happen, looking
[12:05:27] it's using default btw
[12:06:20] gotta go out for a bit, but feel free to play with anything under cluebotng-review, there is something a bit odd... was just trying to do a new deploy but noticed the scheduled jobs failed last night
[12:56:08] a disk quota bump is also blocking me atm
[12:56:42] any chance anyone has any free bandwidth right now? :) https://phabricator.wikimedia.org/T405363
[13:10:37] taavi: any progress? is it worth trying to recreate an instance again?
[13:19:01] jnuche: as I said, seems like something gets unhappy about hotplugging that volume into a VM that's running at that moment. at least rebooting the VM works around that
[13:19:03] cc andrewbogott
[13:21:34] dcaro: also happening for 20250925-125606-a71b64ll3h on cluebotng-trainer, so seems to be something genericish across tool accounts
[13:22:03] yep, there's some issue with cronjobs in deployments (they create ok manually), I've reverted the last deployment, re-checking
[13:22:40] unrelated but `jobs dump` is not carrying `port` through though it exists in `show`, so dump/reload (was going to test with this) blows up... will make a ticket for that in a bit
[13:23:26] Damianz: it should be working now, both
[13:23:28] will look into that
[13:29:13] taavi: I see, thanks for investigating. It's unfortunate because it was working until recently. I will try changing the behavior of our tofu stack
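A rough shell sketch of the manual equivalent of what the tofu module does here, using the standard OpenStack CLI, in case it helps reproduce or work around the freeze. The server and volume names are the ones from the log; waiting for ACTIVE before the hot-plug and the hard reboot are only the workarounds described above, not a confirmed fix for the tofu stack:

```
# wait until the instance reports ACTIVE before hot-plugging the data volume
openstack server show k3s-control-plane01 -f value -c status

# attach the volume (this is roughly what the tofu module does right after creation)
openstack server add volume k3s-control-plane01 k3s-control-plane01-data

# if the guest freezes after the hot-plug, a hard reboot recovers it
openstack server reboot --hard k3s-control-plane01
```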
[13:48:58] Will check in a few min, just eating lunch
[14:28:55] dcaro: looks good, both deployments succeeded now
[14:31:32] https://phabricator.wikimedia.org/T405601 for the dump issue
[14:32:01] If I had a euro for every bug I'd have at least 5 euro... probably not the most lucrative business model
[16:12:17] @lucaswerkmeister / @taavi good info, working now, thanks!
[16:17:35] Is it intended that a deployment is `successful` while Runs are still `pending`... that seems questionable
[16:17:59] They are slowly changing state, which would imply the deployment is actually still `running`
[16:22:08] it should not
[16:26:55] https://phabricator.wikimedia.org/T405620 here you go then
[16:27:28] I was very confused why things had not updated even though the deployment had finished, until I checked `toolforge components deployment show`
[16:33:48] Definitely reproducible as well, 20250925-163154-5rpfbytgfe is doing the same
[19:16:08] !log damian-scripts@tools-bastion-15 tools.cluebotng bot deployed @ refs/tags/v1.2.0
[19:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng/SAL
[19:17:45] !log damian-scripts@tools-bastion-15 tools.cluebotng bot deployed @ refs/tags/v1.2.1
[19:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng/SAL
[21:20:16] !log damian-scripts@tools-bastion-15 tools.cluebotng bot deployed @ refs/tags/v1.2.2
[21:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng/SAL
[21:20:40] !log damian@tools-bastion-15 tools.cluebotng bot deployed @ v1.2.2
[21:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng/SAL
[21:53:01] dcaro: looks like that haproxy session issue is happening again if you're around to grab the logs mentioned in the ticket
[21:53:34] ingress maxed out at 2k sessions, tools down, previous ticket T405280
[21:53:35] T405280: [infra,haproxy,ingress] 2025-09-23 Ingress hitting the backend session limit and started replying with 5xxs - https://phabricator.wikimedia.org/T405280
[21:53:41] (or anyone else with shell there)
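For whoever does have shell on the ingress nodes, a minimal sketch of how the session counts could be checked against the limit through haproxy's runtime socket; the socket path is an assumption and may differ on this setup, and the actual logs to collect are the ones listed in T405280:

```
# on the ingress haproxy host (socket path is an assumption, adjust as needed)
echo "show info" | sudo socat stdio /run/haproxy/haproxy.sock | grep -Ei 'CurrConns|Maxconn'

# per-proxy current / max / limit sessions (scur, smax, slim columns of the stats CSV)
echo "show stat" | sudo socat stdio /run/haproxy/haproxy.sock | cut -d, -f1,2,5,6,7
```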