[06:44:02] greetings
[06:58:01] o/
[07:34:37] morning, anyone familiar with minio setup in lima-kilo? I think that my loki patch is failing because there is only a pod loki-tools-minio-0 and not an equivalent for loki-tracing
[07:57:31] andrewbogott: thanks!
[07:58:03] FYI getting ready to resize tools-nfs-3 shortly
[08:04:16] ok instance is back, nfs clients better come back by themselves OR ELSE
[08:05:00] morning
[08:05:18] :D
[08:06:15] volans: do you have a patch I can check?
[08:07:59] wording LGTM, I just bolded a section to make it a bit easier to spot.
[08:08:34] dcaro, yes, but because I have to use a separate namespace I'm starting to think that it will not work, at least staring at https://github.com/grafana/loki/blob/main/production/helm/loki/templates/_helpers.tpl#L204
[08:08:38] my patch: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/commit/e986d62ab447775b59cd3ded781982df5a6df397
[08:10:16] I think it should go in its own directory, as a new component instead
[08:10:28] (at least in the long term)
[08:11:10] as in copying the whole `logging` into `infra-logging` for example (or whatever you want to call it)
[08:11:33] that was my first approach, I was told to try to reuse the same one, just adding a new instance :D
[08:12:55] ok decent result: nfs clients came back, were stuck in D state for ~5min and unlocked themselves; the nfs server didn't require a puppet run on reboot to get its IP
[08:13:31] ~3m availability hit for toolforge overall as reported by haproxy
[08:13:34] volans: okok, might be good to try to decide on that at some point
[08:13:42] godog: that's awesome 🎉
[08:14:08] great godog!
[08:14:16] heheh thank you folks, appreciate it
[08:14:47] dcaro: eh...
I'm starting to think minio integration into loki would not create the second minio pod, but I might be wrong and not seeing the yaml-magic setup for it
[08:15:46] looking, might be some hardcoded thing in the charts or similar, things get tricky very quickly
[08:19:31] I see, I think you should use a different loki deployment, in a different namespace too
[08:19:49] duplicate that also (not only the `loki-tools`, but also `loki`)
[08:19:59] otherwise you'll have issues with the network policies and all that
[08:20:07] and we would be mixing logging systems
[08:21:20] same for alloy if you want to deploy through k8s
[08:22:09] I don't need alloy inside k8s
[08:22:24] logs will come from the VMs, not k8s
[08:22:31] most likely
[08:23:29] then yep, no need to duplicate alloy
[08:25:54] hmm, not sure how to override the namespace of the nested charts, looking
[08:26:28] let me test it just in case
[08:26:53] the existing code has nameOverride
[08:27:10] and I get pods named loki-tools-minio-0, loki-tools-0, etc.
[08:27:32] morning
[08:28:04] do you get pods loki-tracing-minio-0?
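For context on the nameOverride discussion above, the values for a second loki instance might look roughly like the sketch below. This is an assumption-heavy illustration: `nameOverride` and `minio.enabled` are real keys in the upstream grafana/loki chart, but the exact layout used in toolforge-deploy may differ.

```yaml
# Hypothetical values fragment for a second loki instance ("loki-tracing"),
# mirroring the existing loki-tools one.
loki:
  nameOverride: loki-tracing   # renames pods/services: loki-tracing-0, ...
  minio:
    enabled: true              # the bundled minio subchart; it inherits the
                               # parent release's namespace, which is why a
                               # separate namespace per instance was suggested
```

The catch discussed in the chat is that the nested minio subchart follows the release namespace rather than getting its own, so two instances in one namespace collide.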
[08:29:11] no, that's the problem with my patch
[08:29:24] and from looking at loki's source code I don't see how I can get a second one
[08:30:18] hmm
[08:30:31] from the loki-tracing-0 pod logs while deploying I get this
[08:30:32] │ ts=2025-10-15T08:30:08.760358949Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-tracing-memberlist: lookup loki-tracing-memberlist on 10.96.0.10:53: no such host" │
[08:30:39] │ level=warn ts=2025-10-15T08:30:08.760381723Z caller=memberlist_client.go:660 phase=startup msg="joining memberlist cluster" attempts=3 max_attempts=10 err="1 error occurred:\n\t* Failed to resolve loki-tracing-memberlist: lookup loki-tracing-memberlist on 10.96.0.10:53: no such host\n\n" │
[08:30:58] sounds like you are missing some networkpolicies
[08:31:45] there's a `loki-allow-dns` in the loki-tools values
[08:32:23] you probably will need that one with the right selector for the loki-tracing pods
[08:32:42] it has no selector, just the namespace actually xd
[08:34:09] I made various tries, I think this one too, but I can retry
[08:34:57] hmpf... now my deployment is a mess xd, can't reinstall
[08:35:35] looking, I might have messed something up specifically
[08:36:26] :( sorry
[08:40:28] np
[08:40:33] it's rebuildable :)
[08:45:23] I'm trying to debug using a debug container and such
[08:48:57] it seems they are in a deadlock of sorts: the loki-tracing-* pods try to fetch the list of members of the loki-tracing-memberlist service, but get nothing, so they don't come up, so they don't get listed under the service
[08:49:07] there might be some config somewhere with the wrong name or something
[08:50:25] you don't think it's the missing minio pod?
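The `loki-allow-dns` policy mentioned above ("no selector, just namespace") would look something like the sketch below, recreated for the new namespace. This is an illustrative reconstruction, not the actual policy from toolforge-deploy; names and labels are assumptions.

```yaml
# Sketch of an allow-DNS egress policy for the hypothetical loki-tracing
# namespace, matching the chat's description: empty podSelector (= all pods
# in the namespace), egress to port 53 only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: loki-allow-dns
  namespace: loki-tracing
spec:
  podSelector: {}        # empty selector: applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - ports:             # no "to" clause: any destination, DNS ports only
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Without such a policy the pods cannot resolve `loki-tracing-memberlist` via the cluster DNS at 10.96.0.10, which matches the "no such host" errors in the pasted logs.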
[08:50:56] no, I don't think it gets to try to start minio
[08:51:13] (I might be wrong of course)
[08:51:43] because I got the same logs yesterday and didn't spot anything missing (I did try a bunch of variants)
[08:52:06] but if we think creating a new component is better/easier I have no problem with that, and you don't have to waste time on this
[08:55:21] it actually does try
[08:55:22] and fails
[08:55:24] https://www.irccloud.com/pastebin/tRd8HsIw/
[08:56:23] doh, I missed that somehow
[08:57:01] to fix that you have to change the registry-admission component config and add loki-tracing to the list of excluded namespaces
[08:57:28] I saw a loki there but not loki-tools, hence why I didn't try to add it there
[08:57:33] my bad, I should have tried anyway
[08:57:56] * volans tries
[08:58:15] minio starts for me now
[08:58:25] yep, the naming is confusing xd
[08:58:38] the mix of "loki" and "loki-tools" for the existing one is pretty confusing, yes
[08:59:22] yep, now the others are starting to get healthy, so minio it was :)
[08:59:35] maybe I cheered too soon
[09:00:02] I still see the `│ level=warn ts=2025-10-15T08:59:44.261164429Z caller=memberlist_client.go:660 phase=startup msg="joining memberlist cluster" attempts=5 max_attempts=10 err="1 error occurred:\n\t* Failed to resolve loki-tracing-memberlist: lookup loki-tracing-memberlist on 10.96.0.10:53: server misbehaving\n\n" │`
[09:00:19] oh, now all three went healthy
[09:00:22] onto the next problem :D sorry for taking your time, at least for minio I should have tried that bit too
[09:01:10] np, we are here to help each other, sometimes another pair of eyes is all that's needed
[10:47:50] * dcaro lunch
[14:09:12] andrewbogott: good morning, please LMK when it's good to go and I'll kick off cloudcephosd1050
[14:11:57] godog: I'm in a bug triage meeting for a few more minutes, but if you know what to do you should go ahead and try it
[14:12:08] (did I already paste the cookbook command someplace?)
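The registry-admission fix discussed at 08:57 above amounts to a one-line values change. The fragment below is a guess at its shape: the key name `excludedNamespaces` and the exact entries are assumptions, since the real component config is not shown in the chat.

```yaml
# Hypothetical registry-admission values fragment: the minio image was being
# rejected by the admission webhook until the new namespace was excluded.
excludedNamespaces:
  - loki          # the existing instance (whose pods are named loki-tools-*)
  - loki-tracing  # the new instance being added
```

The confusion noted in the chat ("I saw a loki there but not loki-tools") comes from the existing exclusion using the namespace name (`loki`) while the pods carry the release name (`loki-tools`).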
[14:12:24] andrewbogott: ok! yes you did in https://phabricator.wikimedia.org/T405478#11211785 and I'll follow that
[14:12:32] cool
[14:14:11] you'll want to run it in a screen session because it'll take all day
[14:14:29] hehe makes sense, will do
[14:16:59] so far it looks like a totally normal osd node
[14:18:28] indeed
[14:19:53] Cluster still has (3604322) misplaced objects, at the current 1599 obj/s should take 0:37:33.024290 to finish, waiting 0:00:10 (timeout=8:00:00, elapsed=0:03:40.489739)...
[14:21:50] andrewbogott: did I get it right from your comment above that I can ctrl-c at this point and only one osd will be added for now?
[14:22:07] yeah, it's fine to ctrl-c, all the cookbook is doing now is waiting for ceph
[14:22:23] ok will do
[14:28:47] is the current loki instance in toolforge accessible via a grafana instance? I don't see it in the data sources of grafana-rw.wmcloud.org
[14:29:04] no
[14:31:18] ack
[14:36:23] dhinus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1196464/ for fixing the an-redacteddb alert destination
[14:38:58] taavi: thanks, left a comment there
[16:57:31] where is the toolforge puppet private repo? (where does it look up profile::toolforge::k8s::secrets, basically). There is an entry in labs/private but I guess there is another non-public one
[16:59:11] volans: on the puppetserver itself, local commits overlaid on the labs/private clone.
[16:59:43] ah, like any other standalone puppet server, got it, thx
[17:31:21] * dcaro off
[17:33:03] * dhinus off
[17:41:41] Raymond_Ndibe: if you get idle and want something to work on, you can try tackling T400616 to get rid of the old ci images, I think there's nothing else needed for it
[17:41:42] T400616: [jobs-cli,builds-cli,toolforge-cli,components-cli,envvars-cli,webservice-cli] move the packaging scripts to bookworm - https://phabricator.wikimedia.org/T400616