[13:20:50] dcaro: the cookbook to reactivate reimaged hosts is wmcs.ceph.osd.reactivate
[13:21:01] So for instance last night I ran "sudo cookbook wmcs.ceph.osd.reactivate --cluster-name eqiad1 --osd-hostname cloudcephosd1006"
[13:21:54] 👍
[13:21:59] from cumin1002?
[13:22:04] cloudcumin
[13:22:28] not sure if it will do the test if you re-run it now, it might
[13:22:36] or it might skip since there's nothing to do
[13:24:04] hmm. it did run the check
[13:24:06] 2025-07-16 18:42:22,591 andrew 2982072 [DEBUG cumin.transports.clustershell.SyncEventHandler:783 in ev_hup] node=cloudcephosd1006.eqiad.wmnet, rc=0, command='sudo -i prometheus-node-pinger'
[13:24:11] and the rc is 0
[13:24:26] running it now, it shows the error, let's see the return code
[13:24:29] https://www.irccloud.com/pastebin/Pei8w1OI/
[13:25:33] I think it does not return 1 on failure
[13:28:46] this should fix it https://gerrit.wikimedia.org/r/c/operations/puppet/+/1170342
[13:31:56] this broke it https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1076978, and of course, the test done is only for the positive case xd
[13:34:15] ok, let's merge and make sure it works, then we can check if fixing the nic name gets jumbo frames working again
[13:36:29] https://www.irccloud.com/pastebin/zped963b/
[13:36:32] tested 👍
[13:37:02] great, I'll merge the nic name change...
[13:37:55] yes please :)
[13:38:06] puppet-merge is waiting for you to confirm something
[13:39:30] oh, I did already
[13:39:36] it might have been deploying
[13:39:39] try again :)
[13:42:46] oh, my patch is wrong, I was looking at the wrong host. trying again...
[13:43:40] hmm
[13:44:23] oh the names of the interfaces, sorry I did not catch it
[13:48:55] ok, 1006 should have the right nics puppetized now, did that fix jumbo frames?
[13:49:06] 👀
[13:49:41] yep \o/
[13:49:44] https://www.irccloud.com/pastebin/VrbthRrA/
[13:49:52] ok!
[13:50:09] So that's a good fix but doesn't explain the outage
[13:50:15] (since the nic names were right, then)
[13:50:25] https://usercontent.irccloud-cdn.com/file/keFMFkFV/image.png
[13:50:26] going down now
[13:50:29] Btw, now I'm seeing some 'prometheus-node-pinger.service is in failed status' alerts. Maybe that's the patch applying?
[13:50:50] there were some issues that day though
[13:50:52] https://usercontent.irccloud-cdn.com/file/VcQLPVau/image.png
[13:50:58] andrewbogott: yep, it should clear soon
[13:51:06] (that graph is from the 11th)
[13:51:26] those are the second reimages I think
[13:52:40] the first batch of reimages (1006-8) had the wrong nics for a bit and then I fixed them... so I'd expect to see some noise
[13:52:45] let's see if I can find a timestamp...
[13:53:53] it's this https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167905
[13:54:02] not sure if those timestamps are UTC or local? Probably local
[13:54:50] so 16:50
[13:55:02] what day is that graph from?
[13:55:58] the last one is from the 11th, so yep, the jumbos would be the reimages
[13:56:35] so the spikes in that graph are probably me reverting to bullseye
[13:56:40] and then fixing the nic names after
[13:56:59] yep https://sal.toolforge.org/production?p=1&q=&d=2025-07-11
[13:57:01] so all the jumbo frame mess is symptom, not cause
[13:57:40] I think it's what caused the big issue later (the second outage at 14:30)
[13:58:01] or at least part of it, 1007 disk dying + noout did the rest
[13:58:45] oh and cloudcephosd1037 dying too
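A minimal sketch of how the return-code problem discussed above (13:24-13:28) could be reproduced by hand, assuming shell access to the OSD host; the hostname is only an example, and prometheus-node-pinger is the check the cookbook runs via cumin:

    # run the same check the cookbook runs and print its exit status;
    # before the fix the error text was printed but the script still exited 0,
    # so cumin reported rc=0 and the cookbook treated the node as healthy
    ssh cloudcephosd1006.eqiad.wmnet 'sudo -i prometheus-node-pinger; echo "rc=$?"'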
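A rough sketch of the kind of jumbo-frames check referenced at 13:48-13:49 (the actual paste is not reproduced in this log), assuming MTU 9000 on the ceph cluster network; the peer address is a placeholder:

    # confirm the interface picked up MTU 9000 from the corrected puppet NIC names
    ip -br link show | grep -w 9000
    # 8972 bytes of ICMP payload + 28 bytes of headers = 9000; -M do forbids
    # fragmentation, so the ping only succeeds if jumbo frames pass end to end
    ping -M do -s 8972 -c 3 <peer-osd-cluster-ip>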
[14:00:12] I maybe missed something... there was a second different issue after we started the reverts?
[14:00:45] 1037 crashed all its osds at 15:09 UTC
[14:01:06] (with the bluefs error I pasted in the task)
[14:01:19] ok. I guess, again, I thought that was from the reimage. But I certainly didn't check timestamps
[14:03:03] is sal in UTC?
[14:03:08] it is
[14:03:43] hmmm... then according to ceph crash info, and the mon logs, the crashes on 1037 happened after reimaging to bullseye, but they report bookworm
[14:03:52] interesting
[14:04:17] ah no, they report bullseye
[14:05:12] so yep, might have been the reimage, but after the reimage
[14:05:44] there's also https://usercontent.irccloud-cdn.com/file/3bHcthHD/image.png
[14:05:50] that's from the 11th too
[14:06:05] the osds with >10s there are from 1036 and 1037
[14:07:03] is there any chance that what it's calling a 'crash' is just the OSDs coming back online and it noticing the flap?
[14:08:01] it might mean that the data got corrupted by the reimage, and coming online noticed it
[14:11:12] the osd daemon did die several times over it
[14:11:28] (8 daemons died 43 times combined)
[14:21:32] *47
[14:46:41] dcaro: andrewbogott: I've just created T399858
[14:46:41] T399858: Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858
[14:47:03] thanks!
[15:35:21] andrewbogott: 1006 is still increasing its memory usage bit by bit
[15:35:47] ok. so probably at some point it will hit a limit and panic
[15:36:01] I just pasted a graph to T399858 with the "disk utilization"
[15:36:02] T399858: Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858
[15:36:15] that seems to have dipped again like on the previous upgrade
[15:36:29] (on 1006)
[15:40:27] that one makes no sense.... there's more than 2 procs
[15:41:13] could be a reporting glitch
[15:41:58] but comparing "htop" on 1006 and 1007, I see a lot more threads on 1007
[15:43:58] hmm, how do you see that? (I see the same amount more or less, and similar cpu usage)
[15:44:34] I'm also trying to understand if it's just a different view
[15:45:31] the total is similar, so probably just a different view
[15:45:54] I'm worried about the memory though :/
[15:49:27] we can try to enable memory autotuning
[15:49:30] see if it has an effect
[15:49:32] 'osd_memory_target_autotune'
[15:50:55] ah, I think that's only available if you use cephadm
[15:50:57] I added the memory graphs to the task, and indeed they seem to be good predictors of the crashes
[15:51:33] maybe for some reason in bookworm it misdetects the total available memory?
[15:53:18] the memory limit is hardcoded, independently of the host
[15:53:22] (as of right now)
[15:54:10] then I have no idea...
[16:12:04] the node_procs_running comes from /proc/stat
[16:12:11] from the kernel directly
[16:13:10] it seems to be reporting lower, yes (manually ran grep procs_running /proc/stat a bunch of times xd)
[16:24:39] * dhinus off
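A short sketch of standard commands that could be used to cross-check the OSD crash timeline and memory settings discussed above; the crash id is a placeholder:

    # list recorded crashes (timestamps are UTC) and inspect one of the 1037 entries
    ceph crash ls
    ceph crash info <crash-id>        # includes the ceph version/distro of the crashed daemon
    # what memory target the OSDs are currently configured with
    ceph config get osd osd_memory_target
    # compare the kernel's own view with node_exporter's node_procs_running metric
    grep procs_running /proc/stat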
[17:07:35] andrewbogott: zigo did a talk here earlier today about what's new with openstack and os packaging in trixie, might be of interest to you once the recording is available
[17:18:39] definitely
[17:28:35] andrewbogott: T399882 is a thing that I am now curious about in the PAWS and Quarry Magnum clusters. Maybe the flannel-cni failure stuff jnuche found is just noise?
[17:28:35] T399882: Flannel networking broken in Magnum cluster because upstream containers are missing - https://phabricator.wikimedia.org/T399882
[17:37:11] I think someone (taavi and dcaro maybe?) ran into that when they last rebuilt PAWS... you should ask again when they're awake :)
[17:38:09] paws uses calico I think
[17:38:36] yep https://github.com/toolforge/paws/blob/b680d91921f31c0ba46b7bf89e68f8c8261d2125/tofu/127b.tf#L26
[17:42:10] cool. that gives me hope that changing will work out better :)
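A hedged sketch of how one could confirm which CNI a Magnum cluster is actually running, assuming kubectl access with that cluster's kubeconfig:

    # calico or flannel shows up as a daemonset and pods in kube-system
    kubectl -n kube-system get daemonsets | grep -Ei 'calico|flannel'
    kubectl -n kube-system get pods -o wide | grep -Ei 'calico|flannel'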