[07:06:08] the update of cumin2002 to bookworm will start in 10 minutes
[07:18:31] \o
[07:44:06] Heads up: I'm about to start a new Flink test run. Expected duration is ~2 hours; I'll be hitting mw-api-int (from dse-k8s-eqiad) with 75 rps.
[08:25:32] cumin2002 is now running bookworm, you can use it again
[08:27:50] \o/
[09:31:05] I am planning to migrate the bacula director. If everything goes well it will take 5 minutes, but are there any special requirements? Both backups and recoveries could be unavailable for longer, should a problem arise.
[09:33:12] I guess that goes for topranks, _joe_ as a heads up too?
[09:46:49] jynus: ack, thanks
[09:49:19] topranks: _joe_, codfw depool test for the k8s wikikube upgrade https://phabricator.wikimedia.org/T397148#10926714 in 10 minutes or so, we'll go single DC for a little bit
[09:49:48] <_joe_> heh topranks I might disappear shortly after that
[09:49:59] <_joe_> but as I said, for a short while in case
[10:09:17] claime: Raine: I'll start the depool in 1min if no objections
[10:09:27] ack
[10:09:40] thanks
[10:10:59] ah, nice... immediate cookbook fail :)
[10:12:35] huh '^^
[10:21:05] I am not starting the migration, please don't accidentally delete files from production in the next minutes
[10:21:07] *now
[10:21:22] jayme: need a hand?
[10:22:07] claime: hmm... no. A cookbook rewrite/fix. There's a patch from 2023 and it seems the cookbook has never been run since then
[10:22:25] :/ ack
[10:23:17] claime: Raine: I think we need to postpone and try again tomorrow. I'll try to fix the cookbook today so that we can at least test-cookbook it tomorrow
[10:23:31] ack
[10:23:46] I'll update the deployment cal
[10:23:51] thanks!
[10:24:39] done
[10:26:23] the bacula migration will take some time, as puppet has to run on all backup-related hosts, including the clients
[10:56:45] So I got an error: "Error: The CRL issued by 'CN=Wikimedia_Internal_Root_CA,OU=Cloud Services,O=Wikimedia Foundation\, Inc,L=San Francisco,ST=California,C=US' has expired, verify time is synchronized"
[10:57:09] I tried to regenerate the puppet cert, but it also gave me an error
[10:57:23] backup1009.eqiad.wmnet: Error: Could not download CA certificate: Bad Request
[10:59:00] I see the host is still on puppet5
[10:59:17] it was a newly installed host, dc ops handled it until now
[10:59:27] is there a migration process?
[11:00:04] I would ask, if it's a new host, why was it installed with puppet5...
[11:00:11] I have no idea
[11:00:22] but this is blocking the migration process
[11:00:30] yes, there was a cookbook to migrate, but it's been a long time since we used it, as all new and existing hosts were already migrated
[11:00:33] let me see
[11:01:01] sre.puppet.migrate-host, but I would double check before using it
[11:01:24] last run nov. '24
[11:02:38] spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
[11:02:49] "Error: Could not download CA certificate: Bad Request"
[11:05:19] I think it was on puppet7 already
[11:05:27] just its role was renamed
[11:05:39] are we talking about backup1009?
[11:05:43] yes
[11:06:02] and it is not defaulting to puppet7 yet?
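As context for the cookbook attempt above: a minimal sketch of how a spicerack cookbook such as sre.puppet.migrate-host is typically invoked from a cumin host. The exact arguments this particular cookbook takes are an assumption on my part (it had not been run since Nov '24), so checking its help output first is the safer path, as suggested in the chat.

```
# Sketch only, not verified against the current cookbook options:
# inspect the help text first, then point the cookbook at the affected
# host. The positional host argument here is an assumption.
sudo cookbook sre.puppet.migrate-host --help
sudo cookbook sre.puppet.migrate-host backup1009.eqiad.wmnet
```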
[11:06:09] it was bought in 2022, doesn't seem like a new one
[11:06:24] I think it was puppet 7, we migrated everything to 7
[11:06:30] but when renaming the role, it lost the config
[11:09:23] see: git grep 'profile::puppet::agent::force_puppet7' hieradata/ | grep backup
[11:09:31] in a puppet checkout, to check it
[11:09:40] yes, I just merged it
[11:10:17] still, I am getting "Error: Could not download CA certificate: Bad Request"
[11:10:50] on both a puppet run and sre.puppet.migrate-host
[11:11:07] that change would not migrate a host; the cookbook was there for that purpose, and it was telling you when to make the patch and merge it. If it doesn't work, something else might be wrong with the host's puppet config
[11:11:41] * volans will need to step away in a few minutes, sorry
[11:11:43] should I revert the bacula migration?
[11:11:52] we won't have backups anyway
[11:12:00] as backup1009 is where most backups are stored
[11:12:24] I don't have context on the migration details, which hosts are involved, etc...
[11:12:44] that's ok, I am just asking if you know if backup1009 is solvable
[11:13:02] it's ok if you don't
[11:13:30] what's the current status of backup1009? is it holding any data? is it "new" despite being from 2022?
[11:13:42] nope, it is the main backup host
[11:13:56] 90% of backups live there
[11:14:07] sorry, I'm lost, you said above: "jynus| it was a newly installed host, dc ops handled it until now"
[11:14:16] yes, that was a mistake
[11:14:23] I meant backup1014
[11:14:44] I don't know what happened to backup1009, it was working until today
[11:14:51] ah ok
[11:15:07] what I believe happened is that its role got renamed
[11:15:09] then I guess we need to understand what happened to backup1009 first
[11:15:19] reverting, based on what you said, to puppet5
[11:15:39] but why did it fail? A new role should be puppet7 by default, right?
[11:15:44] when was the role renamed?
[11:15:52] just as part of the migration
[11:15:58] not sure about the defaults off the top of my head
[11:16:16] then I would check in the puppet logs what puppet changed
[11:16:20] and try to revert those bits
[11:16:39] I cannot revert just those bits
[11:16:44] in fact, I cannot run puppet
[11:16:51] that's the main issue
[11:17:29] I can run the install setup?
[11:17:37] sorry, have to step away now, can be back in a bit
[11:17:42] ok, no worries
[11:18:42] I cannot migrate it back to puppet7 because I cannot run puppet, and I cannot run puppet because I cannot migrate it to puppet7
[11:18:59] reverting won't help break that loop
[11:19:48] what I can do is set up another host as the main backup host, at least for new backups
[11:20:31] try to restore the puppet.conf file to how it was before (see the diff in the puppet logs), merge the patch to make it a puppet7 host
[11:20:44] thanks, that will work
[11:20:47] will try that
[11:20:51] and try to run puppet (you might need to install the correct puppet version too from the component, this being a bullseye host)
[11:21:01] I did already merge the patch with the hiera key
[11:21:07] see modules/profile/manifests/puppet/agent.pp
[11:22:08] basically, what I think happened is the host reverted to puppet5, and that caused all those issues
[11:25:09] jynus: the "fix forward" section of the task description of https://phabricator.wikimedia.org/T349619 has step-by-step instructions
[11:25:32] thanks
[11:26:10] can I ask to change the defaults to puppet7, so it doesn't happen again (I am guessing that's the source of the issue)?
[11:26:14] and if there is a new (or renamed) role, then we need "profile::puppet::agent::force_puppet7: true" in its hieradata/role/common/foo.yaml
[11:26:21] yep, that is done
[11:26:29] we'd love to change the default!
[11:26:44] sorry, I mean that I did it for the host
[11:26:51] but there are still two dozen buster nodes, which is blocking this
[11:26:52] will do it more cleanly later
[11:27:22] but they should now really be close to moving to bullseye or later (and thus allowing Puppet 7)
[11:27:22] the task is the most useful resource, thanks
[11:28:03] mwmaint is close to decom, mwdebug is being worked on by Effie, Luca and myself are close to moving the new maps nodes into production, and Andrew is working on cloudceph reimages
[11:28:30] sorry, I got a bit nervous because of the downtime of backups
[11:28:38] and I thought it was a hard fix
[11:28:46] now I can see the light and am not worried anymore
[11:29:13] initially we ran into various corner cases with the conversion cookbook, so these instructions were actually followed quite a few times :-)
[11:29:29] so the rename of the role caught me by surprise
[11:29:36] knowing it, it won't in the future
[11:30:15] "on the old puppetmaster clean the node" (which is the puppet5 one?)
[11:30:20] moritzm: ^
[11:31:00] puppetmaster1001.eqiad.wmnet
[11:31:17] blast from the past :-)
[11:31:38] thanks, I now understand it. I think some of that happened while I was not around (vacation or sickness)
[11:31:48] so I may not have seen this happening to others
[11:32:11] so I think this added to my confusion
[11:33:28] possibly, it's been a while; looking at the git history, the backup roles were migrated by myself in Feb 2024
[11:33:30] backup1009 is now happy, thanks moritzm and volans
[11:33:36] great :-)
[11:33:42] oh, that part I remember
[11:33:53] that is why I was confused to be back on puppet5
[11:34:02] I didn't know about the rename issue
[11:34:22] you saved the migration!
[11:35:10] I will now first move the hiera key from the host to the role
[11:35:22] to make the change definitive
[11:35:26] great
[11:35:55] I was worried I had to reimage the backup host (which was a possibility, but not something I wanted to do in the middle of this migration)
[11:36:24] the other thing I will do is set up a standby director, as I thought the migration was going to be trivial
[11:36:38] but this is too slow to do live in case of an outage
[12:09:44] moritzm: feel free to merge my change
[12:10:54] I merged my change
[12:11:45] Notice: /Stage[main]/Bacula::Director/Service[bacula-director]/ensure: ensure changed 'stopped' to 'running' 👏👏👏
[12:12:26] jelto: puppet-merge only showed my change, we must have narrowly avoided the race
[12:12:50] yep
[12:43:09] hey folks, I am dropping some old Tegola tile cache containers from thanos swift for https://phabricator.wikimedia.org/T396584
[12:44:26] marostegui: fyi, JennH is about to get ready to move the next pc host for https://phabricator.wikimedia.org/T378715#10902792
[13:01:23] topranks, _joe_: bacula migration finished, backup1001 should no longer be used, I updated the docs and the issue: https://phabricator.wikimedia.org/T387892#10928077
[13:01:32] <_joe_> thank you
[13:01:55] I will send an email; there will be a lot of follow-ups, but monitoring, generating backups and restoring were tested
[13:02:00] I will now take a break
[13:15:43] topranks: did you get my CR on puppet-merge?
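To recap the puppet7 fix discussed between 11:26 and 11:35 above in one place: a minimal sketch, assuming a local checkout of the puppet repository, with foo.yaml standing in for the real role file (the placeholder name used in the chat).

```
# Check which backup roles/hosts already force the Puppet 7 agent
# (command quoted from the conversation above).
git grep 'profile::puppet::agent::force_puppet7' hieradata/ | grep backup

# A new or renamed role needs the key in its role hieradata, e.g.
# hieradata/role/common/foo.yaml (placeholder name) containing:
#   profile::puppet::agent::force_puppet7: true
```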
[13:16:32] that merge is taking some time :)
[13:16:51] vgutierrez: I did not, only my own
[13:16:57] moritzm: ok to merge "Moritz Mühlenhoff: memcached: Switch to profile::memcached::firewall_src_sets (cf6d49f0ee)"?
[13:17:03] however, something distracted me and I left the prompt / lock open for a few mins
[13:17:10] vgutierrez: please merge along
[13:17:15] it's done now if you want to run again
[13:17:18] merging
[13:17:35] sry bout that!
[14:46:35] FYI, I'll be releasing a new version of conftool (5.3.0) in a bit, with the only new feature vs. 5.2.0 being the one for T395696.
[14:46:35] as usual, I'll be smoke-testing various non-mutating commands as I go
[14:46:36] T395696: Move ExternalStore config out of mediawiki config - https://phabricator.wikimedia.org/T395696
[14:47:28] swfrench-wmf: thanks <3
[15:36:25] XioNoX: sorry, I was in an interview. Ideally we should get a bit more of a heads-up if possible, as we need to depool parsercache. The day before is just enough heads-up for us to accommodate that
[15:56:21] marostegui: sounds good. JennH, can you ping marostegui on the task when you're able to do the host moves? or schedule some slots
[15:57:01] yep, will do
[16:01:25] FYI, conftool 5.3.0 is live on all applicable hosts (i.e., minus buster hosts). No issues encountered during testing, but please feel free to flag here in case anything comes up.
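One quick way to confirm the conftool 5.3.0 rollout mentioned above is to check the installed package on an applicable host; a minimal sketch, assuming conftool ships as a Debian package (the exact package name is not given in the chat, hence the broad grep).

```
# Confirm which conftool version a host ended up with after the rollout;
# grepping the full package list avoids guessing the exact package name.
dpkg -l | grep -i conftool
```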