[07:18:38] greetings
[08:39:31] taavi: T411023 is fixed FWIW, you'd have to rebase the sandbox branch on top of production of course if you'd like to try
[08:39:32] T411023: pontoon join-stack should ask for puppetserver host key verification - https://phabricator.wikimedia.org/T411023
[08:40:26] morning
[08:56:43] * godog errand
[09:02:45] morning
[09:23:12] I'm enabling infra-tracing-loki on one of the tools nfs workers to check it's all good, and will probably enable them all after that
[09:23:30] unless there is a better way to gradually enable them without having to edit each one's hiera one by one
[09:24:28] s/-loki/-nfs/
[09:24:55] +1
[09:26:30] you might be able to use hiera prefixes to enable 10 workers at once, like tools-k8s-worker-nfs-6*
[09:26:42] but I'm also fine with enabling on all if the first one is working fine
[09:27:05] lol, nice trick, I might
[09:30:10] for the haproxy patch, here it is, hoping I got it right: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211610
[09:37:58] lol, taavi you're faster than me posting here :D it seems that prometheus::blackbox::check::http doesn't support skipping cert verification, checking how hard it is to add it
[09:38:11] sorry :D
[09:39:38] I'm going to restart the toolsdb primary to apply a config change T409922
[09:39:39] T409922: [toolsdb] crash recovery can fail because of insufficient innodb_log_file_size - https://phabricator.wikimedia.org/T409922
[09:39:57] I considered sending an email to cloud-announce but it seems overkill, I'll write a message in #-cloud
[09:42:04] do we have alternative ways of adding the check? it would be nice to have it
[09:44:06] I restarted the toolsdb replica yesterday and it was fast, but it's taking some time on the primary...
[09:49:06] lesson for next time: always fail over the primary to the other host before restarting
[09:50:50] aaaand we're back
[09:51:04] volans: it would be possible to manually add a blackbox probe and the matching alerts, although I suspect it's easier to just modify prometheus::blackbox::check::http instead (cc godog)
[10:03:01] can I get a +1? no specific reason to upgrade but I just remembered about it https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/91
[10:04:56] it's a noop right?
[10:04:59] yes
[10:05:02] if I'm reading the pipeline correctly
[10:05:06] you can check it in the CI output
[10:05:08] lgtm
[10:22:47] yes, it's likely easier to change puppet to make the deployed alert not check expiry
[10:28:46] I'll try sending a simple patch, although the logic there is a bit convoluted
[10:29:17] it is yeah :| happy to help testing in Pontoon too
[10:53:59] godog: this is my attempt, maybe you have cleaner ideas in mind
[10:54:00] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211628
[10:55:05] also I'm not 100% sure if the tls config in the case of an insecure check needs other settings anyway
[10:58:46] ack, checking
[11:04:03] volans: LGTM overall, though for tls_config I think might as well always send server_name unless I'm missing sth?
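For context on the tls_config exchange just above: at the blackbox exporter level, skipping certificate verification while still sending SNI comes down to a module stanza along these lines. This is a hedged sketch of what the generated exporter config could look like, not the literal output of the patch under review; the module name and server_name value are placeholders.

```yaml
# Sketch of a blackbox exporter HTTP module that skips certificate verification
# but still sets server_name (SNI). Module name and hostname are placeholders.
modules:
  http_insecure_sketch:
    prober: http
    timeout: 10s
    http:
      method: GET
      tls_config:
        insecure_skip_verify: true
        server_name: "example.wmcloud.org"
```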
[11:04:57] godog: any clue why running puppet on a Pontoon node could be failing with this? https://phabricator.wikimedia.org/P85703 the mentioned line in profile::firewall is setting up the rule to allow ssh from the bastions
[11:05:14] that's the part I'm not sure about, I'll check the docs more carefully
[11:06:56] taavi: checking
[11:09:57] taavi: I bet it's because of using nftables via 'profile::firewall::provider' for the cloudgw role, that rings a bell from a while back when we tried to convert the instance profile to firewall::service
[11:10:24] as for how to fix or work around it, not sure yet
[11:11:56] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184792 this for reference
[11:13:05] aha
[11:13:37] I found a workaround for the wmflib::hosts2ips type error, but that still fails since I guess pontoon instances still include profile::wmcs::instance?
[11:13:48] thanks, I think that's enough for me to try to find a workaround at least
[11:14:30] no problem! yes that's correct, profile::wmcs::instance is still included since there are wmcs instances after all
[11:15:00] a bit of in-between with Pontoon using production hiera and a mix of wmcs classes heh
[11:15:08] * godog bbiab
[11:15:11] yep
[11:15:20] makes sense
[11:25:09] sigh, so I managed to make compilation succeed but there's a new problem: new debian cloud images (and so cloud-vps) use systemd-networkd, but the real cloudgws still use ifupdown
[11:25:26] * taavi is starting to think that pontoon + cloudgw are not a great combo
[11:40:48] godog: fixed, if you have an easy way to test it in pontoon that would be great to make sure it actually works as expected
[11:45:11] volans: ok! will give it a spin
[11:46:10] <3 thanks
[11:48:15] taavi: heh yes I ran into the same problem when testing single-nic for cloudcephosd, good that you got compilation going though
[11:51:23] do you recall if there was a task about getting nftables working on wmcs instances?
[11:52:25] I don't think we have a task yet
[11:52:56] just my goofy attempt above tbh
[12:03:51] * godog lunch
[12:19:32] I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211651 and the pcc for that seems to suggest I've uncovered another puppet typing rabbit hole
[12:19:34] * taavi lunch
[12:50:38] oh no /o\
[14:21:40] I'm currently at ~50% of the nfs workers with infra-tracing-nfs enabled and all seems good so far
[14:38:18] * volans going for the full deployment, if anything breaks or screams please ping me :)
[14:52:06] that latest Puppet rabbit hole is now T411102
[14:52:07] T411102: Typing issues around nftables::service - https://phabricator.wikimedia.org/T411102
[14:58:35] the gift that keeps on giving
[16:40:39] dhinus: the clouddb plugs are moved. If we move clouddumps now will you be around for a bit to troubleshoot if disaster strikes?
[16:41:12] yep, I'm still around for 1 hour or so
[16:41:19] cool
[16:41:27] can I repool all clouddbs?
[16:45:59] yep
[16:46:10] and the clouddumps host is moved now too
[16:50:41] ack
[16:51:22] So I think we're all good -- there's one elastic box left but that doesn't require supervision, just waiting.
[16:56:23] dhinus: ok, clouddumps1002 is misbehaving, discussion on -dcops
[16:59:45] andrewbogott: ack, the last time it was unreachable for more than a few minutes it had some big impact on toolforge
[16:59:52] but for now things are working :fingerscrossed:
[16:59:58] yeah :/
[17:02:15] pending pods in k8s are growing
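On the gradual rollout discussed at [09:26:30] and [14:21:40]: the prefix-hiera approach would mean setting a flag on a prefix such as tools-k8s-worker-nfs-6* so that all matching instances pick it up at once, roughly as sketched below. The hiera key shown is a placeholder and not the real key used to enable infra-tracing-nfs.

```yaml
# Hypothetical prefix hiera for tools-k8s-worker-nfs-6*; the key name below is a
# placeholder standing in for whatever flag actually toggles infra-tracing-nfs.
profile::toolforge::infra_tracing_nfs::enabled: true
```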
[17:18:46] dhinus: clouddumps1002 should (finally) be back up; do we need to do worker reboots now?
[17:19:13] I hope not, I see the number of pending pods is going down
[17:19:46] oh good
[17:20:35] I think we're good
[17:21:13] https://usercontent.irccloud-cdn.com/file/OFATl3TU/Screenshot%202025-11-26%20at%2018.20.56.png
[17:21:43] cool
[17:22:33] I could also double-check a question we had last time: https://phabricator.wikimedia.org/T391369#11410304
[17:26:41] the puppet alert is going to go away in a few minutes, I see the number of puppet_agent_failed{project="tools"} is going down
[18:36:44] * dhinus off
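The alert referenced at [17:26:41] is presumably driven by the puppet_agent_failed metric; a minimal Prometheus rule of that shape is sketched below. The alert name, threshold, duration, and labels are illustrative, not the deployed WMCS definition.

```yaml
groups:
  - name: puppet_agent_sketch
    rules:
      - alert: PuppetAgentFailedTools
        # Illustrative only: fires while any instance in the tools project
        # has reported a failed puppet run for 30 minutes.
        expr: sum(puppet_agent_failed{project="tools"}) > 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Puppet agent runs failing in the tools project"
```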