[07:18:38] greetings
[08:39:31] taavi: T411023 is fixed FWIW, you'd have to rebase the sandbox branch on top of production of course if you'd like to try
[08:39:32] T411023: pontoon join-stack should ask for puppetserver host key verification - https://phabricator.wikimedia.org/T411023
[08:40:26] morning
[08:56:43] * godog errand
[09:02:45] morning
[09:23:12] I'm enabling infra-tracing-loki on one of the tools nfs workers to check it's all good, and will probably enable them all after that
[09:23:30] unless there is a better way to gradually enable them without having to edit each one's hiera one by one
[09:24:28] s/-loki/-nfs/
[09:24:55] +1
[09:26:30] you might be able to use hiera prefixes to enable 10 workers at once, like tools-k8s-worker-nfs-6*
[09:26:42] but I'm also fine with enabling on all if the first one is working fine
[09:27:05] lol, nice trick, I might
[09:30:10] for the haproxy patch, here it is, hoping I got it right: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211610
[09:37:58] lol, taavi you're faster than me posting here :D it seems that prometheus::blackbox::check::http doesn't support skipping cert verification, checking how hard it is to add it
[09:38:11] sorry :D
[09:39:38] I'm going to restart the toolsdb primary to apply a config change T409922
[09:39:39] T409922: [toolsdb] crash recovery can fail because of insufficient innodb_log_file_size - https://phabricator.wikimedia.org/T409922
[09:39:57] I considered sending an email to cloud-announce but it seems overkill, I'll write a message in #-cloud
[09:42:04] do we have alternative ways of adding the check? it would be nice to have it
[09:44:06] I restarted the toolsdb replica yesterday and it was fast, but it's taking some time on the primary...
[09:49:06] lesson for next time: always fail over the primary to the other host before restarting
[09:50:50] aaaand we're back
[09:51:04] volans: it would be possible to manually add a blackbox probe and the matching alerts, although I suspect it's easier to just modify prometheus::blackbox::check::http instead (cc godog)
[10:03:01] can I get a +1? no specific reason to upgrade but I just remembered about it https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/91
[10:04:56] it's a noop right?
[10:04:59] yes
[10:05:02] if I'm reading the pipeline correctly
[10:05:06] you can check it in the CI output
[10:05:08] lgtm
[10:22:47] yes, it's likely easier to change puppet to make the deployed alert not check expiry
[10:28:46] I'll try sending a simple patch, although the logic there is a bit convoluted
[10:29:17] it is yeah :| happy to help testing in Pontoon too
[10:53:59] godog: this is my attempt, maybe you have cleaner ideas in mind
[10:54:00] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211628
[10:55:05] also I'm not 100% sure if the tls config in the case of an insecure check needs other settings anyway
[10:58:46] ack, checking
[11:04:03] volans: LGTM overall, though for tls_config I think might as well always send server_name unless I'm missing sth?
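For context on the tls_config exchange just above: at the blackbox exporter level, skipping certificate verification while still sending SNI comes down to a module stanza along these lines. This is a hedged sketch of what the generated exporter config could look like, not the literal output of the patch under review; the module name and server_name value are placeholders.

```yaml
# Sketch of a blackbox exporter HTTP module that skips certificate verification
# but still sets server_name (SNI). Module name and hostname are placeholders.
modules:
  http_insecure_sketch:
    prober: http
    timeout: 10s
    http:
      method: GET
      tls_config:
        insecure_skip_verify: true
        server_name: "example.wmcloud.org"
```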
[11:04:57] godog: any clue why running puppet on a Pontoon node could be failing with this? https://phabricator.wikimedia.org/P85703 the mentioned line in profile::firewall is setting up the rule to allow ssh from the bastions
[11:05:14] that's the part I'm not sure about, I'll check the docs more carefully
[11:06:56] taavi: checking
[11:09:57] taavi: I bet it's because of using nftables via 'profile::firewall::provider' for the cloudgw role, that rings a bell from a while back when we tried to convert the instance profile to firewall::service
[11:10:24] as for how to fix or work around it, not sure yet
[11:11:56] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184792 this for reference
[11:13:05] aha
[11:13:37] I found a workaround for the wmflib::hosts2ips type error, but that still fails since I guess pontoon instances still include profile::wmcs::instance?
[11:13:48] thanks, I think that's enough for me to try to find a workaround at least
[11:14:30] no problem! yes that's correct, profile::wmcs::instance is still included since there are wmcs instances after all
[11:15:00] a bit of in-between with Pontoon using production hiera and a mix of wmcs classes heh
[11:15:08] * godog bbiab
[11:15:11] yep
[11:15:20] makes sense
[11:25:09] sigh, so I managed to make compilation succeed but there's a new problem: new debian cloud images (and so cloud-vps) use systemd-networkd, but the real cloudgws still use ifupdown
[11:25:26] * taavi is starting to think that pontoon + cloudgw are not a great combo
[11:40:48] godog: fixed, if you have an easy way to test it in pontoon that would be great to make sure it actually works as expected
[11:45:11] volans: ok! will give it a spin
[11:46:10] <3 thanks
[11:48:15] taavi: heh yes I ran into the same problem when testing single-nic for cloudcephosd, good that you got compilation going though
[11:51:23] do you recall if there was a task about getting nftables working on wmcs instances?
[11:52:25] I don't think we have a task yet
[11:52:56] just my goofy attempt above tbh
[12:03:51] * godog lunch
[12:19:32] I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211651 and the pcc for that seems to suggest I've uncovered another puppet typing rabbit hole
[12:19:34] * taavi lunch
[12:50:38] oh no /o\
[14:21:40] I'm currently at ~50% of the nfs workers with infra-tracing-nfs enabled and all seems good so far
[14:38:18] * volans going for the full deployment, if anything breaks or screams please ping me :)
[14:52:06] that latest Puppet rabbit hole is now T411102
[14:52:07] T411102: Typing issues around nftables::service - https://phabricator.wikimedia.org/T411102
[14:58:35] the gift that keeps on giving
[16:40:39] dhinus: the clouddb plugs are moved. If we move clouddumps now will you be around for a bit to troubleshoot if disaster strikes?
[16:41:12] yep, I'm still around for 1 hour or so
[16:41:19] cool
[16:41:27] can I repool all clouddbs?
[16:45:59] yep
[16:46:10] and the clouddumps host is moved now too
[16:50:41] ack
[16:51:22] So I think we're all good -- there's one elastic box left but that doesn't require supervision, just waiting.
[16:56:23] dhinus: ok, clouddumps1002 is misbehaving, discussion on -dcops
[16:59:45] andrewbogott: ack, the last time it was unreachable for more than a few minutes it had some big impact on toolforge
[16:59:52] but for now things are working :fingerscrossed:
[16:59:58] yeah :/
[17:02:15] pending pods in k8s are growing
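On the gradual rollout discussed at [09:26:30] and [14:21:40]: the prefix-hiera approach would mean setting a flag on a prefix such as tools-k8s-worker-nfs-6* so that all matching instances pick it up at once, roughly as sketched below. The hiera key shown is a placeholder and not the real key used to enable infra-tracing-nfs.

```yaml
# Hypothetical prefix hiera for tools-k8s-worker-nfs-6*; the key name below is a
# placeholder standing in for whatever flag actually toggles infra-tracing-nfs.
profile::toolforge::infra_tracing_nfs::enabled: true
```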
[17:18:46] dhinus: clouddumps1002 should (finally) be back up; do we need to do worker reboots now?
[17:19:13] I hope not, I see the number of pending pods is going down
[17:19:46] oh good
[17:20:35] I think we're good
[17:21:13] https://usercontent.irccloud-cdn.com/file/OFATl3TU/Screenshot%202025-11-26%20at%2018.20.56.png
[17:21:43] cool
[17:22:33] I could also double-check a question we had last time: https://phabricator.wikimedia.org/T391369#11410304
[17:26:41] the puppet alert is going to go away in a few minutes, I see the number of puppet_agent_failed{project="tools"} is going down
[18:36:44] * dhinus off
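The alert referenced at [17:26:41] is presumably driven by the puppet_agent_failed metric; a minimal Prometheus rule of that shape is sketched below. The alert name, threshold, duration, and labels are illustrative, not the deployed WMCS definition.

```yaml
groups:
  - name: puppet_agent_sketch
    rules:
      - alert: PuppetAgentFailedTools
        # Illustrative only: fires while any instance in the tools project
        # has reported a failed puppet run for 30 minutes.
        expr: sum(puppet_agent_failed{project="tools"}) > 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Puppet agent runs failing in the tools project"
```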