[07:34:23] greetings
[08:19:30] dhinus: I'm looking into the toolforge scheduling failure
[08:21:12] godog: thanks!
[11:02:26] people of the cloud!
[11:02:32] how are you doing?
[11:03:12] hopefully a quick one. I rolled out a patch that modifies the DSCP marking we use for "low priority" traffic
[11:03:16] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1279339
[11:03:24] this is used on the cloudcephosd nodes
[11:03:40] for some reason, despite puppet saying it restarts ferm when the file on disk changes, the old rules persist
[11:03:58] I've tested on some non-cloud hosts and they needed a manual "systemctl restart ferm.service" to apply the change
[11:04:43] so I guess my question is whether that is risky/ok to do across the cloudcephosd's? I'm guessing it's ok; they won't have a lot of other things adding iptables rules in the way, say, the openstack nodes might?
[11:08:26] hey topranks, checking
[11:13:36] I had a chat with Moritz on this, and it makes some sense. When puppet does a "ferm refresh" it signals ferm itself to reload the rules (it doesn't trigger a systemctl restart)
[11:14:14] and due to some "ferm fun" with how it parses all the files, it doesn't see changes in mangle/POSTROUTING, so it doesn't do anything
[11:15:19] I can confirm that's the case: ferm-status thinks there are no changes, but there are
[11:15:44] the correct fix is to move cloudceph to nftables :-)
[11:16:09] I ran and verified "systemctl restart ferm" on cloudcephosd1037 and things check out, so +1 on my end to proceed topranks, or I can do it too
[11:16:19] moritzm: heh, agreed
[11:16:27] godog: no that is ok, I'll take a look
[11:16:37] thanks for checking!
[11:16:54] topranks: sure np, thank you for taking care of it
[11:17:08] cloudcephosd1037 looks good, I can confirm it's using the new marking
[11:17:48] re: cloudceph on nftables, that's T361913 just for the record
[11:17:49] T361913: Migrate cloudceph servers to nftables - https://phabricator.wikimedia.org/T361913
[11:38:21] FYI folks, that's been done now and all looking good, thanks
[11:41:37] sweet, thank you topranks!
[12:09:23] I'm going to add a new non-nfs worker to toolforge; I think what caused T425696 is cordoning 106, which pushed cpu harder on the non-nfs workers
[12:09:24] T425696: restarted pod failed to schedule due to resource constraints - https://phabricator.wikimedia.org/T425696
[12:09:47] cc taavi ^ JFYI
[12:11:24] https://w.wiki/Mt5s this is what I mean, all but one of the non-nfs workers are > 80% cpu allocated (requests, not limits)
[12:13:55] godog: non-NFS workloads are still able to run on NFS workers if necessary. so does that mean we don't have a single 3-CPU slot anywhere on the cluster?
[12:15:02] mmhh ok, I wasn't aware non-nfs can spill over to nfs workers, nevermind
[12:15:45] to your question, I am checking
[12:16:08] https://www.youtube.com/watch?v=QY4KKG4TBFo "we are checking"
[12:26:46] heh, occasionally some tools can't schedule, looks like cpu/mem https://phabricator.wikimedia.org/P92441
[12:26:59] and https://w.wiki/Mt9i
[15:30:33] Oops, TIL why I wanted to keep sysop on wikitech
[15:30:46] does anyone here still have it?
[15:32:44] bliviero: here's the stub of that 'exceptions to terms of use' page https://wikitech.wikimedia.org/wiki/Wikitech:Cloud_Services_Terms_of_use/exceptions -- the main thing we're missing is the actual name of the tool.
[15:45:38] andrewbogott: Can grant you it again easily enough...
[16:06:19] andrewbogott: thanx for the stub!
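
(For reference: a minimal sketch of the manual check-and-restart discussed above, assuming the DSCP marking rule lives in the mangle table's POSTROUTING chain as described; run on each affected cloudcephosd host.)

    # Show the currently loaded marking rules; per the log, ferm-status
    # may report "no changes" even when the on-disk config differs.
    iptables -t mangle -S POSTROUTING

    # A plain ferm reload misses mangle/POSTROUTING changes, so force a
    # full service restart to load the new rules.
    systemctl restart ferm.service

    # Confirm the new DSCP marking is now in place.
    iptables -t mangle -S POSTROUTING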
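
(For context on the scheduling discussion: one way to inspect per-node CPU allocation and pending pods, assuming kubectl access to the Toolforge cluster; the node/pod/namespace placeholders are hypothetical.)

    # Summarise requested vs allocatable resources on one worker; the
    # "Allocated resources" section shows CPU/memory requests and limits.
    kubectl describe node <worker-node> | grep -A 8 'Allocated resources'

    # Show why a pending pod can't schedule; the Events section will
    # report e.g. "Insufficient cpu" for resource-constrained failures.
    kubectl -n <tool-namespace> describe pod <pod-name>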
[16:39:55] Reedy: at the moment I just want to add a single link, you can grant me sysop or add the link yourself, whatever's easier
[21:00:10] taavi, regarding that rust build, I opened https://github.com/vexxhost/magnum-cluster-api/issues/1017 and will resort to something desperate/hacky if they don't go for it.