[09:32:46] ok to upgrade cloudcumin1001 to Bookworm later? the remaining issues with SSH access to some codfw1dev hosts appear to be unrelated
[09:39:01] +1 for me
[09:39:33] ok, I need to wrap up some other things now, will post a note when I start
[10:38:22] I'll start in 10m
[10:38:57] ack
[10:40:03] +1
[10:48:40] starting now
[10:50:49] * volans around
[10:59:02] so, for the NFS tracing I've sent yesterday my (hopefully) final patch, it has quite some changes since the last implementation because while digging into bpf, chatting with valentin and checking a fosdem talk he suggested, I think we have now a much more robust and efficient solution
[10:59:16] in addition to covering the additional mountpoints/symlinks
[10:59:18] very nice
[11:00:23] if you don't feel comfortable reviewing that we can also go the way of disabling puppet in the toolforge nfs workers, deploy to toolsbeta for a couple of hours and only if that works move to toolforge too
[11:00:44] (generic you)
[11:01:45] yeah I looked at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1231034 though I'm not familiar enough with bpf to give a meaningful review
[11:01:57] +1 to toolsbeta -> tools deploy
[11:03:56] yeah, I wish I was comfortable reviewing that but I am not
[11:04:09] also, one thing that worries me is sudden extra load on the Loki instance after deploying it to tools
[11:04:23] that's fair, though I'd like an agreement on the two open comments, namely if it's ok to skip tracking tools' usage of their own home given that we know it's on NFS and it's expected that they use it (this will reduce load on loki btw)
[11:06:49] and the fact that with the current implementation I'm storing in loki the resolved path (so /mnt/nfs/...) even if the user opened /public/... for example. This should simplify the aggregation on the loki side, but we can easily store the actual called path if we think it is more useful
[11:08:01] I think storing the canonical mount path (/mnt/nfs/...) is actually better as that means it'll always be the same path for the same files, instead of one of several different options
[11:08:17] and yeah, I see no point in storing detailed data about accessing files in the tool's own home directory
[11:12:17] +1 to what taavi said
[11:16:57] thanks all! then I'll go for the toolsbeta approach if there are no objections
[11:17:17] glad we're in agreement on the two open questions
[11:18:53] +1 for toolsbeta, then maybe I would deploy to tools early next week, so we don't have the weekend coming up. how long do you plan to keep it running in tools?
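To make the two points agreed on above concrete (storing the canonical /mnt/nfs/... path rather than whatever alias the caller opened, and not tracking a tool's access to its own home), here is a minimal user-space sketch of that filtering logic. The actual patch does this with bpf inside the tracer; the mount root, home-directory layout and helper names below are assumptions for illustration only.

```python
import os

# Assumed canonical NFS mount root and tool-home layout; the real values come
# from the puppet-managed mount configuration, not from constants like these.
NFS_ROOT = "/mnt/nfs"
TOOL_HOME_TEMPLATE = NFS_ROOT + "/tools-home/{tool}"  # hypothetical layout


def canonical_nfs_path(opened_path: str) -> str | None:
    """Resolve symlinks (e.g. /public/...) to the canonical /mnt/nfs/... path.

    Returns None when the file does not live on NFS, so the event is dropped.
    Logging the resolved path means the same file always aggregates under the
    same key in Loki, regardless of which alias the caller used.
    """
    resolved = os.path.realpath(opened_path)
    if resolved == NFS_ROOT or resolved.startswith(NFS_ROOT + os.sep):
        return resolved
    return None


def should_trace(opened_path: str, tool: str) -> str | None:
    """Return the path to log, or None for events we agreed not to track."""
    resolved = canonical_nfs_path(opened_path)
    if resolved is None:
        return None
    own_home = TOOL_HOME_TEMPLATE.format(tool=tool)
    # A tool touching its own home is expected NFS usage; skipping it reduces
    # the event volume sent to Loki.
    if resolved == own_home or resolved.startswith(own_home + os.sep):
        return None
    return resolved
```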
[11:19:21] cloudcumin1001 is updated to bookworm, keyholder is rearmed and I could successfully run 'uname -a' against O{project:deployment-prep} via Cumin
[11:19:41] if anyone has a cookbook to test, please run it for additional confirmation, but it seems all fine to me
[11:19:43] * volans re-applying patch
[11:19:52] ack
[11:20:51] {done}, testing
[11:22:42] cumin seems to work fine, I don't have a cookbook handy to test
[11:22:54] I'm sure others do ;)
[11:27:30] tried wmcs.ceph.osd.show_info and wmcs.toolforge.toolsdb.get_cluster_status, they worked fine
[11:29:17] in T317001 (2022) we blocked an apple /21 from accessing dumps due to network link saturation, any opposition to removing that block now?
[11:29:17] T317001: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001
[11:29:55] +1 to removing the block
[11:33:54] +1
[11:34:56] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1237189
[11:43:07] we need dnsmasq 2.92 for routed Ganeti (they kindly implemented a feature for us required for our use as a DHCP relay), for bookworm I've upgraded all uses of dnsmasq and uploaded 2.92 to the main component of bookworm-wikimedia
[11:43:34] for trixie there are two use cases (cloudnet and cloudvirt), but the current version of dnsmasq is also much closer (they already use 2.91)
[11:44:30] is there a simple way to confirm 2.92 works fine with cloudnet and cloudvirt, before I move 2.92 to apt.wikimedia.org?
[11:51:36] moritzm: we could upgrade codfw1dev and test if VMs still schedule there
[11:55:37] sounds good to me, I'll upgrade cloudnet2*dev and cloudvirt2*dev to 2.92 later today
[12:12:45] * volans lunch, will merge and do the deploy dance of the nfs stuff right after
[14:03:58] puppet disabled on O{project:tools name:nfs}, proceeding with the toolsbeta deploy
[15:26:11] andrewbogott: that ENC thing I mentioned being T416588
[15:26:11] T416588: Deleting ENC projects is broken - https://phabricator.wikimedia.org/T416588
[15:27:43] Oh it's that problem again. I hate that.
[15:27:56] I think you're right that a second table to cache the relationship is probably the way
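On the "second table to cache the relationship" idea for T416588, the sketch below only illustrates the general pattern being suggested: a cache table whose rows are tied to the project row, so deleting the project cleans them up instead of leaving orphans. Every table and column name here is hypothetical; the task does not describe the real ENC schema.

```python
import sqlite3

# All names here are hypothetical; this only illustrates the idea of a second
# table caching the project<->prefix relationship with cascading deletes.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this for cascades

conn.execute("CREATE TABLE projects (name TEXT PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE project_prefixes (
        project TEXT NOT NULL REFERENCES projects(name) ON DELETE CASCADE,
        prefix  TEXT NOT NULL,
        PRIMARY KEY (project, prefix)
    )
    """
)

conn.execute("INSERT INTO projects VALUES ('demo-project')")
conn.execute("INSERT INTO project_prefixes VALUES ('demo-project', 'demo-prefix')")

# Deleting the project now removes the cached relationship rows as well.
conn.execute("DELETE FROM projects WHERE name = 'demo-project'")
remaining = conn.execute("SELECT COUNT(*) FROM project_prefixes").fetchone()[0]
print(remaining)  # -> 0
```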
[15:38:41] if there are no objections I'll proceed with re-enabling puppet on tools' nfs workers to deploy the nfs tracing fix
[15:38:49] from all tests it seems to work fine for me
[15:39:46] !topic 🦄🎉 https://etherpad.wikimedia.org/p/WMCS-2026-02-12 | Channel is logged at https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud-admin/ | ping cteam | clinic duty: andrewbogott
[15:40:01] welp
[15:41:06] am I misremembering or did the dumps https servers have some sort of bandwidth limiting at some point?
[15:53:12] I thought so but sure don't remember the specifics
[15:54:08] ah, I'm blind, it is in the nginx config
[15:54:48] (that rate limit has been 5 MBps/client since 2019, https://gerrit.wikimedia.org/r/c/operations/puppet/+/555632)
[16:44:15] ok proceeding now (also !logged)
[18:23:42] I've a very quick loki patch to bump the query limits a bit for infra-tracing if anyone is still around: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1136
[18:26:16] ship it
[18:26:33] <3
[19:11:42] shipped, all good
[19:11:51] gotta go now
[21:13:18] andrewbogott: re https://toolsadmin.wikimedia.org/tools/membership/status/2113 -- that new account is a service user to hold ssh keys that are used from a GitHub Action to talk to Cloud VPS and Toolforge for video2commons. Context is in the mentioned T410730 task.
[21:13:19] T410730: Continuous integration and continuous delivery - https://phabricator.wikimedia.org/T410730
[21:14:43] hmmmm
[21:15:08] basically the same as https://wikitech.wikimedia.org/wiki/Help:Toolforge/Auto-update_a_tool_from_GitHub/GitLab#Using_GitHub_Actions
[21:16:33] the account isolation is for limiting the compromise radius if the ssh secrets leak from GitHub
[21:16:38] yeah, makes sense. It feels vaguely new to me to have a toolforge account not attached to a human, though? Or, I guess, two toolforge accounts attached to the same human
[21:16:49] I don't think I have any technical objection, just reacted automatically to the novelty :)
[21:17:27] not really new in my mind, but if folks on the SRE side object to this I guess there are a lot of things to reconsider
[21:18:00] I don't actually object, I'm just absorbing.
[21:18:07] I have at least 2 alts that are for this sort of thing in Cloud VPS
[21:18:19] the associated SUL account is already marked as a bot account I guess? So that's pretty normal
[21:18:56] the way that video2commons is spread across Toolforge and Cloud VPS is kind of weird here
[21:19:13] AIBot is similar I guess
[21:19:15] yeah
[21:19:19] *IABot
[21:19:32] anyway, I approved it, we will see if it trips over backend edge cases.
[21:19:56] I really did misunderstand what it was at first blush, I thought it was meant as a shared account for multiple human users.
[21:20:10] thinking of it as a bot account clarifies
[21:21:24] :nod: thanks for re-reviewing :)
[21:21:55] thank you for prodding!
[21:23:16] I saw the link on the phab task and was going to approve, but then I saw your comment and didn't want to force past you.
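For context on what the service user's ssh keys are for: a CI job following the wikitech page linked above essentially opens an ssh session to the Toolforge bastion and runs a deploy command as the tool. Below is a rough sketch of such a step using paramiko; the username, key path and deploy command are placeholders, not the actual video2commons setup.

```python
import paramiko

# Placeholder values; a real job would read these from GitHub Actions secrets,
# and the deploy command depends on how the tool itself is set up.
BASTION = "login.toolforge.org"
CI_USER = "ci-service-user"          # the dedicated service account
KEY_PATH = "deploy_key"              # private key material from a CI secret
DEPLOY_COMMAND = "become video2commons ./deploy.sh"  # hypothetical command

client = paramiko.SSHClient()
# In a real pipeline you would pin the bastion's host key instead of
# auto-accepting it; AutoAddPolicy is only used here to keep the sketch short.
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(BASTION, username=CI_USER, key_filename=KEY_PATH)

_, stdout, stderr = client.exec_command(DEPLOY_COMMAND)
print(stdout.read().decode())
print(stderr.read().decode())
client.close()
```

The point of the dedicated account, as noted in the discussion, is blast-radius containment: if the key leaks from GitHub, only that service user's access is compromised rather than a human maintainer's.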