[06:57:18] morning
[07:22:48] greetings
[07:27:51] morning
[07:29:47] getting ready for the tools nfs switch in 30m
[07:32:10] tempting to not reboot the currently stuck nfs workers
[07:37:05] oh, sorry, I started rebooting them xd
[07:37:16] I can stop though
[07:37:34] dcaro: no all good, thank you!
[07:37:35] (kinda like before getting coffee I run the cookbook every morning)
[07:37:41] lol
[07:37:59] actually more like lolsob
[07:39:25] 🤞 for the upgrade
[07:39:32] (to fix it so I don't do it anymore xd)
[07:40:24] heheh yeah 🤞 indeed
[07:43:01] 🤞
[07:43:18] * volans hides for the duration of the upgrade :-P
[07:43:59] hahah I noticed users are at least quite lenient with toolforge/us
[08:01:21] ok starting
[08:06:54] we're off to a not so great start https://phabricator.wikimedia.org/P83774
[08:07:51] that seems to be due to this: https://phabricator.wikimedia.org/P83775
[08:08:47] thank you
[08:09:14] godog: for some reason, the roles are applied on the `tools-nfs-` prefix level, not on the VM directly, and the cookbook doesn't seem to be able to handle that
[08:10:04] mmhh ok I'm tempted to move the puppet settings from the tools-nfs- prefix onto tools-nfs-2
[08:10:07] and try again
[08:10:13] godog: try now? I applied 'role::wmcs::nfs::standalone' to the nfs-3 VM directly
[08:10:28] ok trying again
[08:10:59] yeah that's better, thank you taavi !
[08:11:10] making a note to fix this later
[08:12:38] ok partial success
[08:13:17] hm?
[08:15:07] wmcs-prepare-cinder-volume failed, I'm finishing manually
[08:17:28] waiting for dns to propagate
[08:18:23] [for later] why did that fail?
[08:21:43] indeed I'll document that too
[08:21:54] ok flip is done, I'll start rebooting nfs workers
[08:22:43] taavi: would you mind checking on non-nfs-workers ?
[08:22:52] will do
[08:23:23] thx
[08:25:38] from the alert it seems that toolforge is unhappy
[08:25:56] yep, until we reboot the nfs workers it will not be happy I think
[08:26:16] yeah I see a couple of pages re: 5xx
[08:27:01] so I started wmcs.toolforge.k8s.reboot on a bunch of nfs workers, currently waiting for phase drain on the first one
[08:27:05] let me know when the first nfs worker is rebooted, we can check if that one comes healthy
[08:27:13] *come up
[08:27:21] ok, it'll be tools-k8s-worker-nfs-1:
[08:27:55] ack
[08:28:03] still in wait_drain though
[08:28:14] we can split the workers if you want
[08:28:18] godog: we're maxing out the haproxy connection limits with all the nfs-enabled tools so even non-nfs web services are not doing great, but otherwise the non-nfs workers are doing just fine
[08:28:25] I will next look at the non-k8s nfs clients
[08:28:32] taavi: thank you!
[08:28:38] dcaro: sure
[08:29:25] I did these https://phabricator.wikimedia.org/P83778
[08:29:47] damn sorting xd
[08:29:54] mmhh except that the command line with two spaces doesn't work
[08:29:59] anyways
[08:30:27] ok my bad there's some garbage in there, the cookbook has exited anyways so we can coordinate, dcaro
[08:31:38] dcaro: nfs-1 is back up
[08:31:39] np
[08:32:52] dcaro: recommendations/suggestions on how we can split the workers ?
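
[editor's note] Since wmcs-prepare-cinder-volume failed at 08:15 and the volume was finished by hand, here is a minimal sketch of what that manual finish might look like. The device name /dev/sdb, the label, the mount point /srv and the filesystem are assumptions for illustration, not taken from the cookbook or the pastes above.

    # hedged sketch: finish the cinder volume prep by hand
    # (device, label, mount point and filesystem are assumptions)
    lsblk                                                # confirm which device is the new volume
    blkid /dev/sdb || mkfs.ext4 -L tools-nfs /dev/sdb    # format only if it has no filesystem yet
    mkdir -p /srv
    mount /dev/sdb /srv
    grep -q 'LABEL=tools-nfs' /etc/fstab || \
        echo 'LABEL=tools-nfs /srv ext4 defaults 0 2' >> /etc/fstab
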
[08:33:25] let's get the list of remaining workers and split it in two
[08:33:32] ok
[08:33:40] all workers https://etherpad.wikimedia.org/p/2025-10-workers-reboot
[08:33:53] (sorted the same way the cookbook did I think)
[08:34:28] ok I'll keep going with the first half
[08:35:33] ack, I think we missed 22 for some reason
[08:36:16] yes that's my bad copy/pasta
[08:36:53] it seems like the nfs server on clouddumps1001 is also having some sort of an issue
[08:37:14] https://phabricator.wikimedia.org/P83780
[08:37:31] that's the client thing again?
[08:37:32] ah yes I know what's going on, clouddumps1001 needs kicking of stuck nfs clients
[08:37:36] indeed
[08:38:25] this https://phabricator.wikimedia.org/T404833#11189123
[08:38:26] it needs what?
[08:38:31] * taavi looks
[08:39:06] 458 echo expire> /proc/fs/nfsd/clients/1398/ctl
[08:39:13] something similar to that, but with the right client id
[08:39:39] that did it, thanks
[08:39:50] ack
[08:40:40] thoughts on just rebooting nfs workers as opposed to waiting for draining? I suspect we'll be here all day otherwise
[08:41:12] +1
[08:41:39] if you are curious on what clients reconnected: on tools-nfs-3 grep ^name: /proc/fs/nfsd/clients/*/info
[08:41:57] +1
[08:42:21] ok! what cookbook is recommended for that ?
[08:42:57] wmcs.vps.instance.force_reboot can hard reboot individual vms
[08:43:49] ack thanks, will it attempt soft reboot first too ? either acpi or ssh I suppose ?
[08:44:09] that one does not. it just issues the openstack hard reboot command
[08:44:44] ok thank you, I'll test with a gentler 'reboot'
[08:47:06] https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1195629
[08:48:02] it seems to be failing to soft-reboot though :/, a bit quick
[08:48:27] most often with stuck nfs issues you need a full hard reboot
[08:48:29] just timing out
[08:48:31] yep
[08:48:37] it has a 30s timeout
[08:49:15] yeah indeed
[08:49:24] hard reboot it is
[08:52:14] yeah I think also hard reboot -> will need clouddumps1001 action
[09:07:17] 78 did take a long time to come back up :/
[09:09:50] ack
[09:09:53] availability is recovering
[09:10:49] 🎉
[09:11:29] or not, spoke too soon
[09:11:32] 81 is taking its time to reboot too
[09:11:57] hahahaha, I can see myself jumping on every spike up and down
[09:12:18] is someone looking at those stuck clouddumps1001 clients?
[09:12:24] taavi: I am yes
[09:12:30] fixing them as we go
[09:12:43] thanks
[09:12:51] """fixing"""
[09:17:54] last nfs worker just rebooted
[09:18:20] \o/
[09:18:29] haproxy availability seems at similar levels to before
[09:19:05] memory usage has spiked though it seems
[09:19:10] can confirm, I kicked all 'courtesy' clients from clouddumps1001
[09:19:32] let's see if the memory usage stabilizes as before https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&from=now-1h&to=now&timezone=utc&var-cluster_datasource=P8433460076D33992&var-cluster=tools
[09:19:50] dcaro: I would expect memory usage to go down as k8s processes all the disruption with nodes suddenly going down and pods failing health checks
[09:20:00] (i.e. that matches with what we've seen before)
[09:20:00] yep
[09:21:01] currently 0 stuck workers :)
[09:21:13] don't jinx it :-)
[09:21:21] lolz and also \o/
[09:21:56] godog: this went relatively well overall, thank you!!
[09:22:18] taavi: sure np! I'm glad it went relatively well
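
[editor's note] The clouddumps1001 fix above boils down to force-expiring stuck nfsd clients by hand, as shown at [08:39:06] and [08:41:39]. A minimal sketch of that procedure, run as root on the NFS server; the id 1398 comes from the paste, while the hostname pattern in the loop is purely illustrative and assumes the client's hostname appears in its info file.

    # list current NFSv4 clients and their ids
    grep ^name: /proc/fs/nfsd/clients/*/info
    # force-expire a single stuck client by id
    echo expire > /proc/fs/nfsd/clients/1398/ctl
    # or expire every client whose info matches a pattern (pattern is an assumption)
    for ctl in /proc/fs/nfsd/clients/*/ctl; do
        if grep -q 'tools-k8s-worker' "$(dirname "$ctl")/info"; then
            echo expire > "$ctl"
        fi
    done
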
[09:22:37] "only" ~1h or low availability
[09:22:40] I think the main thing to fix for the next time is to figure out why the initial mount failed
[09:22:45] s/or/of/
[09:23:26] yeah AFAICS there's a race now between the cookbook and prepare-volume, where /dev/sdb didn't appear yet
[09:23:41] will document more on the task, I'll take a ~10m break
[09:25:14] kudos godog!
[09:43:07] back
[09:43:11] dcaro: cheers
[09:43:18] thanks for the help folks, appreciate it
[10:07:49] mmhh my bad tools-nfs-3 is significantly smaller in cpu/mem than tools-nfs-2, I can resize the instance tomorrow though
[10:15:33] * taavi lunch
[10:18:26] * godog lunch too
[10:39:25] * volans errand + lunch
[13:07:11] I have some code increasing the log tailing in toolforge https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/997
[13:07:30] that will prevent us from losing logs due to rate limiting, and instead continue receiving them at the rate limit
[13:08:09] there's a chance that the alloy process will not be able to keep up if there are many logs, and will die from memory exhaustion (looking at options there), but when it restarts it should come up again and continue where it left off
[13:08:13] (if the logs are still around)
[13:08:59] there's some issue with one-off jobs though, as the logs are not around for very long; if the job is generating too many lines (ex. ~5000) with these settings it will not be able to send all of them before the log gets deleted
[13:09:05] also looking into that
[13:09:16] but for cronjobs and continuous jobs this should be quite helpful
[15:14:00] hmm... my loki tests killed lima-kilo xd
[15:14:22] starting 4 jobs with just a loop echoing constantly crashed something
[15:14:58] oh... 2-json.log: no space left on device" message=
[15:14:59] of course
[15:15:57] huh `/dev/sr0 48M 48M 0 100% /mnt/lima-cidata`
[15:16:04] that's just 48M
[15:24:03] `/dev/vda1 50G 49G 0 100% /`
[15:24:05] that's the issue
[15:46:00] if there are no objections I'll reboot cloudcumin2001 (there is no one logged in, no screen/tmux or cookbook running)
[15:46:15] ack from me
[15:46:21] SGTM volans
[15:47:33] k proceeding
[15:54:07] {done} lmk when could be a good time for cloudcumin1001, I see some people logged in, although nothing really running right now
[15:54:46] +1 from me to reboot volans
[15:54:53] I logged out
[15:55:12] you've a vim open in a tmux :D
[15:56:17] doh! thank you, {{done}}
[15:56:31] feel free to kill anything I might have open
[15:57:06] lme brb
[15:59:40] thx, proceeding
[17:35:58] * dcaro off
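
[editor's note] On the /dev/sdb race mentioned at [09:23:26]: one way to avoid it next time would be to wait for the attached device before running the prepare-volume step. A rough sketch under stated assumptions — the device path and the 60-second budget are illustrative, and the actual prepare-volume invocation is not shown in the log.

    # hedged sketch: wait for the attached cinder volume to appear before
    # running wmcs-prepare-cinder-volume (device path and timeout are assumptions)
    dev=/dev/sdb
    for _ in $(seq 1 60); do
        [ -b "$dev" ] && break
        sleep 1
    done
    [ -b "$dev" ] || { echo "timed out waiting for $dev" >&2; exit 1; }
    udevadm settle
    # then run the normal prepare-volume step as the cookbook does
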