[08:02:14] morning!
[08:42:24] taavi: 2022 :) https://gerrit.wikimedia.org/r/c/operations/homer/public/+/767476
[08:44:09] huh, nice find :)
[08:51:56] quick review (emailer alert) https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/32
[09:02:55] dcaro: done
[09:05:24] thanks!
[09:06:45] I was also looking at it and I had the same comments :)
[09:07:20] (but I forgot about "increase" and I was suggesting an uglier 3600*5*rate)
[09:28:41] fixed it https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/32
[10:56:17] /me lunch
[12:13:31] topranks: at your convenience, could you have a look at T396940 and venture an opinion about where those .private records came from and if we want/need them?
[12:13:32] T396940: cloudcephosd1xxxx.private.eqiad.wikimedia.cloud - https://phabricator.wikimedia.org/T396940
[12:14:08] dcaro: anything I need to know about reimaging a cloudcephmon? It's safe as long as I only do one at a time, yes?
[12:14:39] andrewbogott: yes, it's been a while though since I did my last reimage of one of those
[12:34:47] shall I restart tools-k8s-worker-nfs-39 ?
[12:39:17] * andrewbogott does
[12:40:58] * andrewbogott 's sleep schedule has shifted so that now he has lots of questions at exactly lunchtime
[12:59:20] sorry, okok, /me was in a meeting
[13:01:05] dcaro: please remember to rotate the etherpad
[13:01:12] oh yes
[13:11:07] jobs-emailer patches are ready, quick reviews :) https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/32 (alert) and https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-emailer/-/merge_requests/24 (fix)
[14:21:26] thanks topranks
[14:58:47] andrewbogott: I see several ceph hosts in the cloud-private network https://netbox.wikimedia.org/ipam/prefixes/656/ip-addresses/
[14:59:26] yeah, it's a scattering
[14:59:35] maybe we should just remove those right now so they're not there to surprise us later?
[14:59:58] i think eventually it'd be great to move ceph traffic that's currently in cloud-hosts to cloud-private
[15:00:15] so maybe we should do it the other way and enable the interfaces now, even if they're not used right now
[15:00:23] I suspect that we started adding the nodes bit by bit, while reimaging/adding new ones
[15:00:36] taavi: that will just set me/us up for more surprises any time we reimage
[15:00:42] but did not reimage all yet, so the very old ones that were reimaged got the ip, and the newer ones got it too
[15:00:51] but the chunks in-between did not get them yet?
[15:01:01] andrewbogott: how so?
[15:01:11] ummm
[15:01:16] I'm missing context or you are, not sure which
[15:01:39] https://phabricator.wikimedia.org/T396940
[15:01:51] ^ that diff appeared after reimaging cloudcephosd1016-1024
[15:02:07] yes, because the interface was configured in netbox but not by puppet
[15:02:14] right
[15:02:15] if you flip the puppet switch to enable the interfaces, that will not happen
[15:03:01] oh, so the puppet code is written for that, just not applied everywhere?
[15:03:28] "enable the interface" is a matter of applying a single profile
[15:03:33] ok
[15:03:55] + provisioning the addresses in netbox/dns, like we do for all our other hosts already
[15:04:04] And we won't have issues with some nodes talking on different networks and not being able to see each other because...
[15:04:22] nothing is configured to use the cloud-private addresses for now
[15:04:42] yep, but how will we transition? Will it require a total ceph shutdown?
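[Editor's note: the cloud-private audit being discussed here (which cloudceph hosts already have addresses in the prefix at https://netbox.wikimedia.org/ipam/prefixes/656/ip-addresses/, and whether those addresses are attached to interface objects) can be scripted against the Netbox API. Below is a minimal sketch using pynetbox; the token and prefix string are placeholders, not real values, and attribute handling assumes the addresses are bound to physical device interfaces.]

```python
#!/usr/bin/env python3
"""Sketch: list addresses in the cloud-private prefix and show whether each
one is attached to an interface object or just exists standalone in IPAM.

Placeholders: NETBOX_TOKEN and CLOUD_PRIVATE_PREFIX must be filled in; the
prefix value is the one shown at /ipam/prefixes/656/ in Netbox.
"""
import pynetbox

NETBOX_URL = "https://netbox.wikimedia.org"
NETBOX_TOKEN = "..."          # read-only API token (placeholder)
CLOUD_PRIVATE_PREFIX = "..."  # e.g. the CIDR of the cloud-private prefix

nb = pynetbox.api(NETBOX_URL, token=NETBOX_TOKEN)

for ip in nb.ipam.ip_addresses.filter(parent=CLOUD_PRIVATE_PREFIX):
    if ip.assigned_object:
        # attached to a device interface, so puppet/reimage will see it
        where = f"{ip.assigned_object.device.name}:{ip.assigned_object.name}"
    else:
        # allocated in IPAM but not bound to any interface object
        where = "UNASSIGNED"
    print(f"{ip.address:<20} {ip.dns_name or '':<45} {where}")
```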
[15:04:48] Maybe I should not think that far ahead today
[15:15:56] can I get a +1 on T397059?
[15:15:56] T397059: Floating IP request for diffscan - https://phabricator.wikimedia.org/T397059
[15:18:34] done
[15:22:34] taavi: previous conversation made me think I would see that interface enabled on select cloudcephosds but I don't... did y'all think it's puppetized for some hosts but not all? Or fully not puppetized?
[15:22:57] andrewbogott: i think it's very unlikely that it's enabled for some but not all
[15:23:10] what you're looking for is the inclusion of `profile::wmcs::cloud_private_subnet`
[15:23:53] ok. Then I'm still confused by the inconsistent behavior from reimaging, but I will dig deeper in a bit.
[15:25:21] i think the inconsistency comes from the fact that some but not all of the addresses at https://netbox.wikimedia.org/ipam/prefixes/652/ip-addresses/ are actually attached to "interface" netbox objects, and some just exist independently
[15:32:12] I'm not seeing the difference yet, can you point me to examples?
[15:33:01] https://netbox.wikimedia.org/ipam/ip-addresses/14908/ is assigned to an interface, https://netbox.wikimedia.org/ipam/ip-addresses/17222/ is not
[15:33:42] oh, I see. I was only looking at ceph examples
[16:40:42] hmm…. perhaps the best idea is to do a puppetdb import of all the cloud hosts in netbox?
[16:41:01] and then any IPs remaining that are allocated - but not assigned to a host - we delete?
[16:46:22] If I add profile::wmcs::cloud_private_subnet to the ceph roles and wait a while, will all the right things magically appear in puppetdb?
[16:46:37] Or only things that are already defined in netbox?
[17:21:54] andrewbogott: Do you have any ideas about why an instance newly provisioned by Magnum would not be able to connect to the user_data service? T396935
[17:21:54] T396935: Magnum created instances failing to talk to OpenStack user_data service - https://phabricator.wikimedia.org/T396935
[17:22:26] hmm... pint is complaining about jobs_emailer_emails_sent_total not existing in tools prometheus, but it's there :/
[17:26:55] the runbook for that alert is not very useful...
[17:27:55] bd808: that template looks reasonable to me; did you already double-check the security groups on all the cluster instances?
[17:28:31] andrewbogott: sorry I got distracted there, and I've now properly read the backscroll
[17:28:50] I think possibly the best way to proceed is what you said
[17:29:01] add the "profile::wmcs::cloud_private_subnet" to the ceph roles and wait
[17:29:29] ok. I'm in the middle of a different ceph upgrade but when that's stable I'll get you to follow along while I try that in codfw.
[17:29:32] _however_ I think for that to work we need to make sure that we have an IP allocated for each ceph node with the correct DNS name, puppet will look up that name and add the dns if it finds it
[17:29:52] oh, the pint error went away :/
[17:29:59] we can script something up to allocate an IP for each host to make that work
[17:31:22] ahah! So that is the other way around from what I was thinking.
[17:31:45] yeah it's not ideal, I believe that's how it works
[17:35:43] andrewbogott: The security groups are all the defaults, or maybe better said I have not touched any of them via Horizon or the opentofu config. I assumed this would "just work" because I don't have any notes from the last Magnum cluster I worked on about needing to tweak things.
[17:35:55] I guess I should try some more poking about
[17:36:32] this is in 'zuul' or 'zuul3'?
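[Editor's note: for the "we can script something up to allocate an IP for each host" idea at 17:29:59, a minimal pynetbox sketch follows. The host list, prefix ID (656, taken from the Netbox URL earlier in the log) and the `.private.eqiad.wikimedia.cloud` naming pattern are illustrative assumptions, not a verified workflow.]

```python
#!/usr/bin/env python3
"""Sketch: allocate a cloud-private address per ceph host so puppet can find
it by DNS name.  Host names, prefix ID and the naming pattern are placeholders.
"""
import pynetbox

nb = pynetbox.api("https://netbox.wikimedia.org", token="...")  # placeholder token

# The cloud-private prefix to allocate from (ID as seen in the Netbox URL).
prefix = nb.ipam.prefixes.get(656)

hosts = ["cloudcephosd1001", "cloudcephosd1002"]  # placeholder host list

for host in hosts:
    dns_name = f"{host}.private.eqiad.wikimedia.cloud"  # assumed convention
    existing = nb.ipam.ip_addresses.get(dns_name=dns_name)
    if existing:
        print(f"{dns_name} already allocated: {existing.address}")
        continue
    # Ask Netbox for the next free address in the prefix, tagged with the
    # DNS name that puppet will look up.
    ip = prefix.available_ips.create({"dns_name": dns_name, "status": "active"})
    print(f"allocated {ip.address} for {dns_name}")
```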
[17:36:40] `zuul`
[17:37:51] hm I would expect it to just work too, there are security group rules to allow communication within the project which should be all you need
[17:37:56] I haven't seen the failure you're getting
[17:38:13] Everything is torn down at the moment. I will try to create the cluster again and see what state things end up in.
[17:38:34] ok. I am interested but I have other things I need to clear out of my head first
[17:44:00] I think things got further this time. So the classic "openstack hates me when I haven't bugged a.ndrew for a few days" I guess.
[17:44:04] * dcaro off
[17:44:06] cya tomorrow!
[17:48:58] bd808: Typically I attribute flaky magnum to T392031 which is I think actually a Heat bug which is why I'm hoping to cut Heat out of the picture sometime soon.
[17:48:58] T392031: openstack magnum (or heat) resource leak - https://phabricator.wikimedia.org/T392031
[17:49:20] but that is not going especially smoothly
[17:51:23] andrewbogott: :nod: and of course by the transient nature of roadblocks I'm trying to do Magnum things at the same time you are trying to figure out how to do that provider switch.
[17:51:54] Yeah, I wouldn't suggest that you wait for the switch though, not until I see it work in testing.
[17:54:20] I got further just now than I did on Friday or Saturday. And the failure is new and actionable! `VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds allowed gigabytes quota.` Looks like I need 80G of volume quota for each Magnum managed node.
[17:56:02] that's a lot, but easy to fix
[17:56:09] or maybe can be overridden in the template
[17:59:49] I'm pretty sure there will be a place to tweak that in the template. The toolforge k8s nodes are using a flavor that gives them 140G of ephemeral storage, so maybe 80G is not too wild.
[18:04:13] I need to run a surprisingly time-consuming errand, back when I'm back
[19:46:56] * andrewbogott back, annoyed at ceph-mgr
[20:39:00] T397098
[20:39:01] T397098: Increase volume storge quota for zuul project - https://phabricator.wikimedia.org/T397098
[21:16:16] * bd808 has exciting new OpenStack errors to decipher
[21:17:13] "Invalid service type for OpenStack-API-Version header" in case anyone has that one memorized already. It's a 406 error coming from the tofu magnum automation
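[Editor's note: for the VolumeSizeExceedsAvailableQuota error and the quota bump tracked in T397098, a rough openstacksdk sketch is below. The clouds.yaml entry name, project name and new limit are placeholders, admin credentials are required, and this is not necessarily how the WMCS quota workflow actually handles such requests.]

```python
#!/usr/bin/env python3
"""Sketch: inspect and raise a project's block-storage (Cinder) quota using
the openstacksdk cloud-layer helpers.  All names and numbers are placeholders.
"""
import openstack

# "eqiad1-admin" is a hypothetical clouds.yaml entry with admin credentials.
conn = openstack.connect(cloud="eqiad1-admin")

project = "zuul"

# Show the current block-storage quota set for the project.
quotas = conn.get_volume_quotas(project)
print(f"current volume quotas for {project}: {quotas}")

# Example only: e.g. five 80G Magnum node volumes plus some headroom.
new_limit = 500
conn.set_volume_quotas(project, gigabytes=new_limit)
print(f"gigabytes quota for {project} raised to {new_limit}")
```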