[08:00:50] <dcaro>	 morning
[08:03:02] <volans>	 good morning, new day new problem :D while adding a basic ingress-nginx rule to the new loki I'm getting "admission webhook "ingress-admission.tools.wmcloud.org" denied the request: Ingress host must be <toolname>.toolforge.org", I guess hitting this:
[08:03:06] <volans>	 https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/blob/main/server/ingressadmission.go#L61
[08:03:45] <volans>	 but I don't get why given I'm in a local env and hostnames are different. Does it ring a bell?
[08:03:49] <godog>	 greetings
[08:05:49] <dcaro>	 volans: probably the message should be changed to use the configured domain, currently you can configure them like https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/ingress-admission/values/local.yaml?ref_type=heads#L7
[08:06:36] <dcaro>	 as in, it should not hardcode that it should be `.toolforge.org`, but instead that it should match any of the configured domains
[08:07:09] <volans>	 ok, so you're saying the message is misleading but the error is real and I have a domain that is not .local?
[08:28:17] <dcaro>	 probably :), what is the ingress object you are trying to create?
[08:28:30] <dcaro>	 (sorry for the delay)
[08:29:14] <volans>	 this one: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/loki-tracing/components/logging/values/loki-tracing/common.yaml.gotmpl#L34
[08:29:25] <volans>	 and the hostname seems to be correctly loaded from local.yaml as loki-tracing.loki.svc.cluster.local
[08:33:52] <dcaro>	 hmpf.... I really think we should move it to its own component :/, I might send a patch after you get it working
[08:34:22] <dcaro>	 the ingress expects something in the form <name>.local, but it might not be allowing subdomains, looking
[08:34:52] <volans>	 subdomRe := regexp.MustCompile(fmt.Sprintf("^%s\\.(%s)", req.Namespace[5:], domstr))
[08:35:23] <dcaro>	 yep, that means it's using the namespace without the first 5 chars (that'd be `tool-`) and then the domain
[08:35:39] <dcaro>	 I think you can just skip the loki-tracing namespace from the admission controller completely
[08:35:50] <dcaro>	 (it's meant for user tools)
[08:36:10] <dcaro>	 https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/blob/main/deployment/chart/values.yaml#L18
[08:36:16] <dcaro>	 that's the value
[08:37:09] <dcaro>	 probably simpler than trying to make it support non-tool namespaces
[08:37:24] <volans>	 got it, sending a patch
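A minimal sketch of why the request was denied, built around the pattern volans quoted above. Everything except the `fmt.Sprintf` pattern is an assumption: `hostMatches` is a hypothetical helper, `domstr` is assumed to be the configured domains joined with `|`, and the length guard and `regexp.QuoteMeta` calls are additions for the example, not something the log confirms about the controller.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// hostMatches is a hypothetical helper, not the admission controller's code.
// It drops the first 5 characters of the namespace (the `tool-` prefix) and
// requires the host to start with "<rest>.<one of the domains>".
func hostMatches(namespace, host string, domains []string) bool {
	// Guard added for the sketch: slicing a shorter namespace would panic.
	if len(namespace) < 5 {
		return false
	}
	quoted := make([]string, len(domains))
	for i, d := range domains {
		quoted[i] = regexp.QuoteMeta(d)
	}
	// Assumption: domstr is an alternation of the configured domains.
	domstr := strings.Join(quoted, "|")
	re := regexp.MustCompile(fmt.Sprintf("^%s\\.(%s)", regexp.QuoteMeta(namespace[5:]), domstr))
	return re.MatchString(host)
}

func main() {
	domains := []string{"local"}
	// A tool namespace with a matching host is accepted.
	fmt.Println(hostMatches("tool-mytool", "mytool.local", domains)) // true
	// "loki-tracing"[5:] is "tracing", so the loki host can never match.
	fmt.Println(hostMatches("loki-tracing", "loki-tracing.loki.svc.cluster.local", domains)) // false
}
```

Under these assumptions a non-tool namespace cannot pass the check regardless of the configured domains, which fits the suggestion to exclude the namespace from the admission controller rather than change the regex.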
[08:38:25] <dcaro>	 the alert `Reduced availability for job openstack in cloud@codfw` just triggered again, let me have a quick look
[08:41:18] <dcaro>	 yep, prometheus openstack exporter is crashing
[08:41:23] <dcaro>	 https://www.irccloud.com/pastebin/ugs10ymH/
[08:41:33] <dhinus>	 morning
[08:52:25] <dcaro>	 does anyone know if codfw should be working 100%, or are there any tests going on?
[08:54:49] <godog>	 I'm not aware of any tests off the top of my head no
[09:01:01] <volans>	 dcaro: thanks for the approval, I'm in a meeting right now but will ask how to deploy that after it :)
[09:04:14] <dcaro>	 volans: added a note in the MR, let me know how it goes or if you have more questions 👍
[10:07:12] <dhinus>	 there's an alert ToolsToolsDBReplicationError, looking
[10:08:08] <dhinus>	 "Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug"
[10:08:56] <dhinus>	 "START REPLICA;" fixed it ¯\_(ツ)_/¯
[10:09:01] <dhinus>	 it's not the first time I've seen this
[10:09:24] <dhinus>	 as documented in the alert runbook: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication#If_the_replication_is_NOT_running
[10:11:11] <dhinus>	 better link: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication#Replication_stops_because_of_%22invalid_event%22
[10:16:26] <dcaro>	 hurray for docs
[10:17:39] <dcaro>	 quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/278
[10:17:50] <dcaro>	 and https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/280
[11:01:27] <volans>	 dcaro: thanks for the detailed explanation on MR, only question is, to make sure I understand it, when running the cookbook before merging the bump version MR, will it take the new version somehow?
[11:13:12] <volans>	 I see it does (git checkout branch 'bump_ingress-admission')
[11:21:47] <volans>	 ok I did run the cookbook for toolsbeta and I see it also runs a bunch of tests. All went well, can I proceed with tools too or should I test something else too before?
[11:47:02] <dcaro>	 volans: yep :)
[11:47:07] * dcaro back from lunch, sorry
[11:48:02] <volans>	 so we're deploying before merging?
[11:48:26] <volans>	 no worries
[11:49:53] <dcaro>	 yep, the idea is that if it fails the tests, then we don't merge, so any commit we have is actually functional (up to the functional tests coverage at least)
[11:50:05] <dcaro>	 so we can easily revert to any of them
[11:50:57] <volans>	 ok, I guess I don't need to do any manual test for this simple change right?
[11:51:13] <dcaro>	 not really, the functional tests will cover enough
[11:51:23] <dcaro>	 (probably more than needed)
[11:51:29] <volans>	 great, thanks, kicking the tools deploy then
[11:59:04] <dcaro>	 🛳️
[12:28:11] <andrewbogott>	 dcaro, I'm reimaging codfw1dev cloudcontrols so wouldn't guarantee stability there
[12:29:03] <dcaro>	 the exporter is failing to fetch anything there 
[12:29:03] <dcaro>	 https://phabricator.wikimedia.org/T407470#11280383
[12:29:20] <dcaro>	 and saw a bunch of stack traces in the logs
[12:29:46] <dcaro>	 andrewbogott: but yep, we can wait until it's stable to see if it keeps failing
[12:30:36] <andrewbogott>	 Dang
[12:58:38] <godog>	 andrewbogott: thank you for the assist re: cinder 'reserved' volumes on T406688, appreciated
[12:58:38] <stashbot>	 T406688: Cinder volumes getting stuck on 'reserved' after detach - https://phabricator.wikimedia.org/T406688
[13:04:55] <andrewbogott>	 godog: I would've got there a lot faster if I'd started by reading all your notes :/
[13:05:28] <andrewbogott>	 Since folks on https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/issues/40 say that the issue went away with a later release I'm hopeful that that was true for us as well, but I'm going to dig around and see if I can find a timestamp anyway
[13:06:49] <godog>	 definitely, with newer volumes I haven't experienced the same bug which leads me to believe the bug is already fixed for us
[13:41:35] <andrewbogott>	 volans: I'm blocked by "gnt-instance console" not working on ganeti01 in eqiad; history on this task suggests you might know how to fix?  https://phabricator.wikimedia.org/T309724#11279385
[13:45:41] <godog>	 andrewbogott: re: cloudcephosd1050 I was thinking of going ahead with putting it in service, I can't see anything obviously wrong so far
[13:47:06] <andrewbogott>	 I agree that it's looking good! How about first we set up a second single-nic OSD and pool one of its drives? Since 1nic<->1nic communication is the only remaining test case I can think of.
[13:48:02] <godog>	 sure that's easy enough, I'll send out the patch for 1051
[13:48:05] <volans>	 andrewbogott: I remember the issue but I don't know the latest context, have you already checked with mori.tz?
[13:49:19] <andrewbogott>	 volans: nope, let's try!  moritzm, I'm getting caught on the issue that appears to be the same as T309724, can you suggest how to diagnose?
[13:49:20] <stashbot>	 T309724: SSH host key verification failures in Ganeti intra node SSH calls after Bullseye update - https://phabricator.wikimedia.org/T309724
[13:50:55] <godog>	 andrewbogott: actually no I stand corrected, I'd rather go ahead with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194967 first, I'll run pcc and merge it
[13:51:06] <volans>	 andrewbogott: that said I can get a console
[13:51:10] <volans>	 if you just do:
[13:51:16] <volans>	 sudo gnt-instance console --show-cmd cloudbackup1002-dev.eqiad.wmnet
[13:51:32] <volans>	 and then copy that command and change -oStrictHostKeyChecking=yes to no
[13:51:43] <moritzm>	 yeah, that
[13:51:47] <volans>	 it gives you a login prompt at cloudbackup1002-dev
[13:52:25] <andrewbogott>	 oh nice, thank you for the workaround
[13:52:29] <andrewbogott>	 I didn't know about --show-cmd
[13:52:52] <andrewbogott>	 So that sorts my immediate problem
[13:54:13] <andrewbogott>	 well, wait
[13:54:27] <andrewbogott>	 works for you?
[13:55:16] <andrewbogott>	 oh, wait, needs sudo
[13:55:52] <volans>	 ctrl+] to exit
[14:01:05] <andrewbogott>	 this thing where adding a drive causes the nic to be renumbered is really weird and funny!
[14:01:45] <volans>	 predictable nic names ftw ;)
[14:16:36] <andrewbogott>	 I hope someday to meet someone who wants to explain why this is better than just having 'eth0' like we did for the first 30 years
[16:03:48] <volans>	 what's the standard way to test exposed services in lima-kilo including ingress/egress rules (so no kubectl port-forward AFAIK)?
[16:04:37] <volans>	 that will be followed by: what's the standard way to expose a service outside k8s to the hosts :)
[16:05:55] <dcaro>	 you can check how api-gateway exposes the toolforge api
[16:07:50] <dcaro>	 essentially using a node-port in lima-kilo
[16:09:15] <dcaro>	 (then kind configuration exposes that one)
[16:12:45] <dcaro>	 in prod/toolsbeta they are running on the ingress nodes, and that is balanced by haproxy (vms <tools/toolsbeta-test>-k8s-haproxy-*)
[16:15:58] <dcaro>	 the kind config comes from https://codesearch.wmcloud.org/search/?q=extra_forward_ports
[16:23:25] <volans>	 ok, I'll take a look, thanks!
[17:35:54] * dcaro off
[17:35:56] <dcaro>	 cya