[08:00:50] morning
[08:03:02] good morning, new day new problem :D while adding a basic ingress-nginx rule to the new loki I'm getting "admission webhook "ingress-admission.tools.wmcloud.org" denied the request: Ingress host must be .toolforge.org", I guess hitting this:
[08:03:06] https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/blob/main/server/ingressadmission.go#L61
[08:03:45] but I don't get why given I'm in a local env and hostnames are different. Does it ring a bell?
[08:03:49] greetings
[08:05:49] volans: probably the message should be changed to use the configured domain, currently you can configure them like https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/ingress-admission/values/local.yaml?ref_type=heads#L7
[08:06:36] as in, it should not hardcode that it should be `.toolforge.org`, but instead that it should match any of the configured domains
[08:07:09] ok, so you're saying the message is misleading but the error real and I have a domain that is not .local?
[08:28:17] probably :), what is the ingress object you are trying to create?
[08:28:30] (sorry for the delay)
[08:29:14] this one: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/loki-tracing/components/logging/values/loki-tracing/common.yaml.gotmpl#L34
[08:29:25] and the hostname seems to be correctly loaded from local.yaml as loki-tracing.loki.svc.cluster.local
[08:33:52] hmpf....
I really think we should move it to its own component :/, I might send a patch after you get it working
[08:34:22] the ingress expects something in the form .local, but it might not be allowing subdomains, looking
[08:34:52] subdomRe := regexp.MustCompile(fmt.Sprintf("^%s\\.(%s)", req.Namespace[5:], domstr))
[08:35:23] yep, that means it's using the namespace without the first 5 chars (that'd be `tool-`) and then the domain
[08:35:39] I think you can just skip the loki-tracing namespace from the admission controller completely
[08:35:50] (it's meant for user tools)
[08:36:10] https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/blob/main/deployment/chart/values.yaml#L18
[08:36:16] that's the value
[08:37:09] probably simpler than trying to make it support non-tool namespaces
[08:37:24] got it, sending a patch
[08:38:25] the alert `Reduced availability for job openstack in cloud@codfw` just triggered again, let me have a quick look
[08:41:18] yep, prometheus openstack exporter is crashing
[08:41:23] https://www.irccloud.com/pastebin/ugs10ymH/
[08:41:33] morning
[08:52:25] does anyone know if codfw should be working 100%, or are there any tests going on?
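For reference, a minimal Go sketch of the check quoted at [08:34:52], to show why `loki-tracing.loki.svc.cluster.local` is rejected. Only the `subdomRe` line comes from the log; the function name `subdomainAllowed`, the way `domstr` is built (escaped domains joined with `|`), the single configured domain `local`, and the `tool-example` namespace are assumptions for illustration, not the actual ingress-admission code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// subdomainAllowed reproduces only the regex quoted in the log.
// Hypothetical helper: how domstr is built is an assumption
// (each configured domain escaped, then joined with "|").
func subdomainAllowed(namespace, host string, domains []string) bool {
	// The quoted code slices off the first 5 characters of the namespace,
	// so shorter namespaces are rejected here to avoid a slice panic.
	if len(namespace) < 5 {
		return false
	}
	quoted := make([]string, len(domains))
	for i, d := range domains {
		quoted[i] = regexp.QuoteMeta(d)
	}
	domstr := strings.Join(quoted, "|")

	// Same pattern as in the log: "<namespace minus 5 chars>.<domain>".
	subdomRe := regexp.MustCompile(fmt.Sprintf("^%s\\.(%s)", namespace[5:], domstr))
	return subdomRe.MatchString(host)
}

func main() {
	domains := []string{"local"} // assumed local-env configuration

	// A tool namespace: "tool-example"[5:] is "example".
	fmt.Println(subdomainAllowed("tool-example", "example.local", domains)) // true

	// The Loki case: "loki-tracing"[5:] is "tracing", so the host cannot match.
	fmt.Println(subdomainAllowed("loki-tracing", "loki-tracing.loki.svc.cluster.local", domains)) // false
}
```

Dropping the first five characters of `loki-tracing` leaves `tracing`, so the pattern becomes `^tracing\.(local)`, which the Loki hostname cannot match. That is consistent with the suggestion to exclude the namespace from the admission controller rather than adapt the check.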
[08:54:49] I'm not aware of any tests off the top of my head no
[09:01:01] dcaro: thanks for the approval, I'm in a meeting right now but will ask how to deploy that after it :)
[09:04:14] volans: added a note in the MR, let me know how it goes or if you have more questions 👍
[10:07:12] there's an alert ToolsToolsDBReplicationError, looking
[10:08:08] "Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug"
[10:08:56] "START REPLICA;" fixed it ¯\_(ツ)_/¯
[10:09:01] it's not the first time I've seen this
[10:09:24] as documented in the alert runbook: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication#If_the_replication_is_NOT_running
[10:11:11] better link: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication#Replication_stops_because_of_%22invalid_event%22
[10:16:26] hurray for docs
[10:17:39] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/278
[10:17:50] and https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/280
[11:01:27] dcaro: thanks for the detailed explanation on the MR, only question is, to make sure I understand it, when running the cookbook before merging the bump version MR, will it take the new version somehow?
[11:13:12] I see it does (git checkout branch 'bump_ingress-admission')
[11:21:47] ok I did run the cookbook for toolsbeta and I see it also runs a bunch of tests. All went well, can I proceed with tools too or should I test something else too before?
[11:47:02] volans: yep :)
[11:47:07] * dcaro back from lunch, sorry
[11:48:02] so we're deploying before merging?
[11:48:26] no worries
[11:49:53] yep, the idea is that if it fails the tests, then we don't merge, so any commit we have is actually functional (up to the functional tests coverage at least)
[11:50:05] so we can easily revert to any of them
[11:50:57] ok, I guess I don't need to do any manual test for this simple change right?
[11:51:13] not really, the functional tests will cover enough
[11:51:23] (probably more than needed)
[11:51:29] great, thanks, kicking the tools deploy then
[11:59:04] 🛳️
[12:28:11] dcaro, I'm reimaging codfw1dev cloudcontrols so wouldn't guarantee stability there
[12:29:03] the exporter is failing to fetch anything there
[12:29:03] https://phabricator.wikimedia.org/T407470#11280383
[12:29:20] and saw a bunch of stack traces in the logs
[12:29:46] andrewbogott: but yep, we can wait until it's stable to see if it keeps failing
[12:30:36] Dang
[12:58:38] andrewbogott: thank you for the assist re: cinder 'reserved' volumes on T406688, appreciated
[12:58:38] T406688: Cinder volumes getting stuck on 'reserved' after detach - https://phabricator.wikimedia.org/T406688
[13:04:55] godog: I would've got there a lot faster if I'd started by reading all your notes :/
[13:05:28] Since folks on https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/issues/40 say that the issue went away with a later release I'm hopeful that that was true for us as well, but I'm going to dig around and see if I can find a timestamp anyway
[13:06:49] definitely, with newer volumes I haven't experienced the same bug which leads me to believe the bug is already fixed for us
[13:41:35] volans: I'm blocked by "gnt-instance console" not working on ganeti01 in eqiad; history on this task suggests you might know how to fix? https://phabricator.wikimedia.org/T309724#11279385
[13:45:41] andrewbogott: re: cloudcephosd1050 I was thinking of going ahead with putting it in service, I can't see anything obviously wrong so far
[13:47:06] I agree that it's looking good!
How about first we set up a second single-nic OSD and pool one of its drives? Since 1nic<->1nic communication is the only remaining test case I can think of.
[13:48:02] sure that's easy enough, I'll send out the patch for 1051
[13:48:05] andrewbogott: I remember the issue but I don't know the latest context, have you already checked with mori.tz?
[13:49:19] volans: nope, let's try! moritzm, I'm getting caught on the issue that appears to be the same as T309724, can you suggest how to diagnose?
[13:49:20] T309724: SSH host key verification failures in Ganeti intra node SSH calls after Bullseye update - https://phabricator.wikimedia.org/T309724
[13:50:55] andrewbogott: actually no I stand corrected, I'd rather go ahead with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194967 first, I'll run pcc and merge it
[13:51:06] andrewbogott: that said I can get a console
[13:51:10] if you just do:
[13:51:16] sudo gnt-instance console --show-cmd cloudbackup1002-dev.eqiad.wmnet
[13:51:32] and then copy that command and change -oStrictHostKeyChecking=yes to no
[13:51:43] yeah, that
[13:51:47] it gives you a loging prompt at cloudbackup1002-dev
[13:51:58] *login
[13:52:25] oh nice, thank you for the workaround
[13:52:29] I didn't know about --show-cmd
[13:52:52] So that sorts my immediate problem
[13:54:13] well, wait
[13:54:27] works for you?
[13:55:16] oh, wait, needs sudo
[13:55:52] ctrl+] to exit
[14:01:05] this thing where adding a drive causes the nic to be renumbered is really weird and funny!
[14:01:45] predictable nic names ftw ;)
[14:16:36] I hope someday to meet someone who wants to explain why this is better than just having 'eth0' like we did for the first 30 years
[16:03:48] what's the standard way to test exposed services in lima-kilo including ingress/egress rules (so no kubectl port-forward AFAIK)?
[16:04:37] that will be followed by: what's the standard way to expose a service outside k8s to the hosts :)
[16:05:55] you can check how api-gateway exposes the toolforge api
[16:07:50] essentially using a node-port in lima-kilo
[16:09:15] (then kind configuration exposes that one)
[16:12:45] in prod/toolsbeta they are running on the ingress nodes, and that is balanced by haproxy (vms -k8s-haproxy-*)
[16:15:58] the kind config comes from https://codesearch.wmcloud.org/search/?q=extra_forward_ports
[16:23:25] ok, I'll take a look, thanks!
[17:35:54] * dcaro off
[17:35:56] ya
[17:36:00] *cya