[07:15:06] When switching s2 in codfw I've got the master stuck, I am getting some stacktraces for the bug report
[07:16:34] thanks
[07:48:38] @marostegui should we reclone db2212 after https://phabricator.wikimedia.org/T396852 or do we feel confident pooling it in?
[07:50:25] federico3: go!
[07:50:50] marostegui: go for pooling in or go for recloning? :D
[07:51:14] Is it replicating well?
[07:52:57] it's not started yet, only booted up by dc ops
[07:53:45] I think you can pool it
[07:53:56] The host didn't crash, it simply didn't come back after reboot
[07:55:49] (my point is that it could have corrupted some data during the crash)
[07:55:59] It didn't crash
[07:56:00] anyhow I'll start replication and let it sit for 2 days
[07:56:07] You rebooted it for the kernel upgrade and it never came back, right?
[07:56:38] i mean the mysql process itself didn't crash, but the host failed to power cycle
[07:56:51] Yes, but mariadb was stopped when you rebooted it
[07:57:04] Anyway, there is no reason to believe it is corrupted as mariadb stopped nicely
[07:58:10] during the attempted boot-ups errors around the raid controller popped up and dc ops reseated the connectors, but anyhow we'll see if it behaves
[07:58:44] ok
[08:08:15] db2168 and db2216 are replicas and still need the kernel upgrade, but the other hosts in the same sections are updated. Should I reboot them manually? Or perhaps we could use them as an opportunity to test the *OS upgrade* in auto_schema? The one in https://gitlab.wikimedia.org/repos/data_persistence/dbtools/scripts/-/merge_requests/4
[08:10:37] hey folks good morning
[08:10:50] I see some thanos-swift related issues in https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=thanos
[08:11:07] more precisely, there seems to have been an issue with thanos-be2007 for a long time
[08:11:20] afaics the host is pooled and serving traffic, do you know otherwise?
[08:12:51] ok, 9 days of uptime, maybe it is still in some sort of WIP
[08:13:22] the /dev/sdl partition may need some tweaking, cc: Emperor
[08:17:30] federico3: I'd say test the os upgrade with them
[08:17:55] marostegui: ok!
[08:19:02] federico3: are you aware of this? https://phabricator.wikimedia.org/T385141#10911418
[08:19:30] BTW do we have automation for upgrades of x1 and x3? They have some hosts left: https://zarcillo.wikimedia.org/kernel-updates/6.1.140
[08:20:56] marostegui: yes: we have/had a bug in the clone+pool process, as it should have set up the new host
[08:22:04] federico3: Yeah, what I mean is, can you pool it?
[08:22:11] I can start a cloning for it today
[08:22:30] federico3: but it is replicating
[08:24:13] right now afaik we don't have the ability to run the last parts of the cloning cookbook (update zarcillo instances, do icinga checks, dbctl etc) independently
[08:24:48] or I could do it by hand if we don't want to wait
[08:25:04] yes, let's pool it manually and then we can address it
[08:27:26] dhinus: Do you plan to keep updating more clouddb* hosts this week?
[08:28:44] marostegui: yep I was planning to do 2 per day, starting today
[08:28:51] dhinus: great! thank you
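(A minimal sketch of the kind of manual pooling federico3 and marostegui agree on above, assuming dbctl's usual `instance … pool` / `config commit` subcommands; the percentage and commit message are illustrative, not the exact commands that were run.)

```python
# Sketch only: pool a replica by hand with dbctl instead of the last
# steps of the cloning cookbook. Values below are made up for illustration.
import subprocess

HOST = "db2212"  # the rebooted replica discussed above

def run(cmd: list[str]) -> None:
    """Echo and run a command, aborting on the first failure."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Read-only sanity check of the instance's current dbctl state.
run(["dbctl", "instance", HOST, "get"])

# Pool at a low percentage first, then commit so the change takes effect.
run(["dbctl", "instance", HOST, "pool", "-p", "10"])
run(["dbctl", "config", "commit", "-m", f"Pool {HOST} after reboot"])
```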
[09:23:19] we'd like to reimage cumin2002 to Bookworm, are there any long-running maintenance tasks or systemd timers to account for? Would something like next Tuesday work?
[09:23:29] (for DB maint and backup tasks)
[09:32:05] tuesday is the best day for backups, as none run from cumin that day - they run from dbprov
[09:32:55] now, I don't know if there is an up-to-date client package for bookworm
[09:43:26] elukey: thanks for the ping, those hosts are still with dc-ops in T392908 (and probably do need looking at, I'll try and get to them later today)
[09:43:28] T392908: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908
[09:47:42] jynus: which package exactly, happy to check?
[09:48:14] wmf-mariadb1011-client ?
[09:48:59] can I start a rolling restart on s1 codfw to update only db2216, including its OS?
[09:49:17] marostegui, Amir1 ^^^
[09:50:13] hmmh, there's no wmf-mariadb1011-client package on either distro, neither bullseye nor bookworm?
[09:52:51] that's why I said up to date
[09:53:05] I think the current one is wmf-mariadb105-client ?
[09:54:46] it probably exists in some way, as it comes from the same source as the server: wmf-mariadb1011
[09:55:16] federico3: yes
[09:55:30] moritzm: we normally don't compile the client anymore, we just use whatever comes with the distro
[09:55:45] I can of course compile it if we prefer that
[09:58:22] then no need
[09:58:46] sounds good, no need then
[10:05:45] next Tuesday also works for DB maintenance things, or do these only happen from eqiad anyway?
[10:06:04] the window would be something like an hour
[10:06:33] I don't think we are using cumin2002 at the moment, I think we are all on 1002
[10:06:39] I mean for DB maintenance
[10:12:30] marostegui: the only thing is that the firmware file for T396648 is only uploaded to cumin2002, so you can't update it from cumin1002 (which confused me a lot). But the db2150 update must wait until the switchover anyway (just saying to prevent more confusion if you try to update it from cumin2002)
[10:12:30] T396648: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648
[10:13:22] Amir1: I will try to switch over that host this week
[10:14:01] Thank you! let me know if I can help with anything
[10:14:54] Amir1: In any case I'd expect that firmware file to be re-uploaded to cumin2002 after the reimage?
[10:14:55] moritzm: ^
[10:16:35] yeah, the firmware files will be kept either by using a reuse partman recipe or by restoring the files from a copy
[10:17:32] but unrelated to that, we can also simply copy the SSD firmware to cumin1003, I'm not sure which file this is using exactly, but if you tell me we can also sync it over
[10:17:49] Amir1 federico3 we really have to get the grants for 1003 deployed
[10:17:55] Can we get that done this week please?
[10:18:00] the root cause is probably that the SSD firmware isn't downloaded from the vendor website (yet?)
[11:12:24] Amir1: we can take 20 mins today to review https://gitlab.wikimedia.org/ladsgroup/db-password-rotation/-/merge_requests/3 together if you want
[11:12:54] the reboot + os upgrade of db2216 ran
[11:14:52] with MariaDB 10.6.20 https://grafana.wikimedia.org/goto/7Hlce3YHg?orgId=1
[11:15:52] and yet it did not upgrade the packages, only the os
[11:16:18] federico3: didn't you run apt full-upgrade in the end?
[11:19:32] that's because it is not a package upgrade, it is a different package - one needs to remove the old one and install the new one (or let puppet do it)
[11:19:58] assuming that's wanted, ofc
[11:20:05] marostegui: I ran the script with the addition of the dist-upgrade in it, but the upgrade did not start, likely due to an escaping glitch
[11:20:13] jynus: No, you are talking about a mariadb migration, not an upgrade
[11:20:38] sorry, I didn't know the context
[11:25:04] federico3: So let's double check then, because it should also upgrade mariadb
[12:07:34] Innodb_os_log_pending_writes was removed in MariaDB 10.8, and it is bad data on 10.11, should I remove it?
[12:09:40] same for Innodb_os_log_pending_fsyncs
[12:19:12] Hi folks; I'm now re-imaging ms-be2080 onto the new-style VLAN, after which it'll want to go back into the swift rings, so could I get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138832 please?
[12:19:37] marostegui: it should be fixed in a bit
[12:21:29] jynus: I wasn't aware! So then yeah, let's get rid of it
[12:22:42] I left a note, but set them to 0 on the graphs - there will be a need for a further review of graphs later on
[12:22:55] Thanks!
[12:23:15] maybe the important ones should be on git and that way we can discuss and introduce the improvements suggested by federico3 in a more controlled way
[12:23:43] especially if in the future we alert based on some of those metrics
[12:29:33] the metrics or the alerts on git?
[12:30:01] the graph definitions and the alert definitions
[12:37:08] marostegui: found the glitch https://gitlab.wikimedia.org/repos/data_persistence/dbtools/scripts/-/merge_requests/5
[12:42:47] for the cumin script, I suggest running it on a couple of hosts, making sure everything is done correctly, then moving on to the next
[13:00:52] Amir1: rolling_restart.py ?
[13:01:32] Amir1: or are you talking about the cumin grants addition I guess?
[13:06:51] cumin grants
[13:07:01] Emperor: o/ whenever you have a moment, could you lemme know if this use case is good or if it may cause issues? No rush, even tomorrow :)
[13:07:20] elukey: which use case, sorry?
[13:07:39] good point, I forgot to paste https://phabricator.wikimedia.org/T396584 :D
[13:07:42] https://phabricator.wikimedia.org/T396584
[13:09:30] ah, that's a capacity question, I will have to stare hard at usage for a bit :-/
[13:11:37] marostegui: can I start the reboots in s7 codfw?
[13:12:22] federico3: sure
[13:16:41] elukey: see updates on the ticket, but I'm tempted to say "delete all the old cruft you're not using first" :)
[13:18:02] Sorry to ping again, but anyone got two ticks to eyeball https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138832 please?
[13:18:15] Emperor: TIL about the old cruft :D Good point, I am going to ask for confirmation from the Content Transform team!
[13:21:42] eyeball> thanks Amir.1 :)
[13:22:01] marostegui: aha, apt-get upgrade ran on db2150.codfw.wmnet
[13:22:27] so issue fixed?
[13:22:41] yep: Setting up wmf-mariadb106 (10.6.22+deb12u1)
[13:22:47] excellent!!!
[13:24:03] grafana reported mariadb 10.6.17 and now .22
[13:24:20] that's great
[13:48:56] marostegui: the current script does the upgrade + reboot *if* the current kernel is below the update threshold. That means we can use it to upgrade the OS + mariadb version even if the kernel is already updated (just by setting a very high kernel version I guess)
[13:49:27] federico3: that's great thanks
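(A minimal sketch of the kernel-threshold check described in the 13:48 message, assuming versions are compared as dotted numeric tuples; the function name, versions and thresholds below are illustrative, not taken from the actual script.)

```python
# Sketch only: decide whether a host should go through the upgrade + reboot
# path based on its running kernel. Setting a very high threshold forces the
# path on every host, which is the trick mentioned above.
def kernel_below_threshold(running: str, threshold: str) -> bool:
    def parse(version: str) -> tuple[int, ...]:
        # Keep the leading dotted numeric part, e.g. "6.1.0-37-amd64" -> (6, 1, 0)
        head = version.split("-", 1)[0]
        return tuple(int(part) for part in head.split(".") if part.isdigit())
    return parse(running) < parse(threshold)

# Illustrative values only.
assert kernel_below_threshold("6.1.119-1", "6.1.140")       # needs the upgrade
assert not kernel_below_threshold("6.1.140-1", "6.1.140")   # already up to date
assert kernel_below_threshold("6.1.140-1", "999.0.0")       # very high threshold forces it
```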
[14:24:57] marostegui: not 100% sure how to best visualize the change, but does this make sense? https://grafana.wikimedia.org/goto/gKwvd6LNg?orgId=1
[14:28:13] marostegui: can I start https://gitlab.wikimedia.org/ladsgroup/db-password-rotation/-/merge_requests/3 on one host?
[14:48:52] hello data-persistence friends - any objections if I release a new version of conftool at some point during my day today? this would be conftool 5.3.0, with the only dbctl-affecting change being A.mir1's [0].
[14:48:52] [0] https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/80
[14:49:35] (as usual, I'd run basic non-mutating dbctl commands as a "smoke test" on the hosts to which I'll deploy)
[14:54:55] I'm also happy to wait until tomorrow, and do so at an earlier time more amenable to your TZs if you'd prefer to be around :)
[15:50:18] jynus: I'm trying out this alertmanager filter to monitor our monitoring https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=instance%3D~%22db1.*%7Cdb2.*%7Ces1.*%7Ces2.*%7Cpc1.*%7Cpc2.*%22
[15:58:33] whatever works for you :-D
[18:13:30] fed.erico3: please look at swfrench-wmf's request when you get a chance tomorrow?
[18:19:34] ^ and as a related note, once I upload the 5.3.0 packages to apt.wikimedia.org, it looks like the zarcillo images will need to be rebuilt as well
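(As a footnote on the 15:50 alertmanager filter: a tiny sketch of what that instance regex selects. The host names and ports below are invented for illustration; the actual matching is done by alertmanager itself, which anchors `=~` matchers, hence `fullmatch` here.)

```python
import re

# The instance filter from the alerts.wikimedia.org URL above: core DB,
# external storage and parser cache hosts in eqiad (1xxx) and codfw (2xxx).
INSTANCE_RE = re.compile(r"db1.*|db2.*|es1.*|es2.*|pc1.*|pc2.*")

# Invented instance labels, just to show which kinds of hosts the filter keeps.
instances = ["db2212:9104", "pc1011:9104", "es1030:9104", "ms-be2080", "thanos-be2007"]
print([i for i in instances if INSTANCE_RE.fullmatch(i)])
# -> ['db2212:9104', 'pc1011:9104', 'es1030:9104']
```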