[07:14:50] greetings
[07:47:09] morning
[08:02:15] FYI this morning I'll be giving the nfs server switch in toolsbeta another try, no dns / network change this time so slightly simpler
[08:03:45] ack
[08:24:08] hello!
[08:31:56] I'm playing with the new trixie-based lima-kilo, and I feel like there was a performance regression at least on macos
[08:32:12] tests are running but very slow, and sometimes they fail
[08:32:43] it's possible something changed on my setup, I haven't run tests in lima-kilo in a while
[08:32:50] I'll compare with the old bookworm-based version
[08:33:37] hmm but maybe the bookworm one won't build anymore
[08:38:08] of course volume is in status 'reserved' now again
[08:38:23] I have this suspicion it has to do with opentofu
[08:42:59] godog: which volume?
[08:46:28] dhinus: toolsbeta-nfs
[09:11:51] not particularly proud of https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1194135 but it is better than me finishing the migration manually when that step fails
[09:14:15] dhinus: I have not noticed anything special, tests have been getting slower though, the full test run is pretty slow, that's why we split it in per-component related tests
[09:23:35] godog: when it failed because the state was "reserved", was the state still "reserved" a few seconds later? maybe it just takes a while to go from reserved to available after "volume_detach()"?
[09:27:22] dhinus: yes when I checked on horizon manually it was still reserved
[09:28:20] mmhh thinking about it more I'll change the cookbook to check the volume state and change it if state == reserved
[09:29:45] the detach api call is async, so it's still a bit risky to run a command straight after
[09:30:09] unless the python implementation of detach() is already waiting, but I don't think so
[09:32:03] the python method is actually using the CLI "server remove volume"
[09:38:47] have you tried detaching from the CLI and/or from horizon? does it also get stuck in "reserved"?
[10:00:38] no I haven't tried detaching other than with the cookbook
[10:11:03] I'm getting disconnected kinda often from bast1003, anyone having the same issues?
[10:11:51] no idea tbh, I'm using european bastions
[10:12:05] 6003 right now to be exact
[10:15:43] I have no issues with bast1003
[10:21:03] okok, I had it time out three times today while deploying things (in the middle, while I was getting output)
[10:21:10] maybe it's on my provider
[10:25:28] maybe just try a different bastion, if it's something in the network path between your ISP and eqiad, then it should already solve it
[11:43:25] FYI I'm trying another nfs switch on toolsbeta, this time back to the previous trixie vm
[12:25:15] that actually worked far better than I expected (i.e. as intended)
[13:06:00] I think I'm ready to do tools on Mon, will get things going wrt announcement
[13:07:18] I'm torn between 1h or 2h window, I'm fairly sure we can do the switch in 1h and clean up / reboot, though 2h will certainly give more peace of mind
[13:10:51] based on my past experience of underestimating, I would go for 2h :D
[13:11:45] lol that's fair, ok
[13:16:30] always better to reserve a longer slot and then finish early than risk not finishing it in time
[13:20:28] godog: that 'stuck in reserved' thing reminds me of a familiar cinder issue although in theory that issue is fixed. Are you still stuck on that or did it work on your most recent attempt?
[13:21:22] andrewbogott: I see what you did there re: being stuck, though with this change at least the cookbook finishes unattended https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1194135/3/cookbooks/wmcs/nfs/migrate_service.py
[13:22:08] oh, nice. Had it been 'reserved' all along, or was that something that happened during the switchover?
[13:22:08] I'll do a bunch of tests manually
[13:22:32] heh good question, my understanding so far is after detach volume goes reserved
[13:23:31] Hm... when things are working properly it should go from attached to available very quickly. But it may pass through 'reserved' briefly en route.
[13:23:51] But when cinder is working properly I've never seen a volume be 'reserved' for more than a second or two.
[13:24:26] happen to have the uuid of that volume handy? I want to see if the logs show anything
[13:24:37] yes one sec
[13:24:57] 648504db-18c2-4cee-b731-567dcb4dadf6
[13:25:08] I'll try the detach manually now
[13:25:44] yep volume is now reserved
[13:26:01] what I did was this, i.e. same as the cookbook
[13:26:02] wmcs-openstack server remove volume 19c9ecd1-6fb2-4a2d-954a-c1dc6c956034 648504db-18c2-4cee-b731-567dcb4dadf6
[13:26:10] dhinus: ^
[13:27:03] andrewbogott: I was also trying to make sense of this "reserved" state, but my only theory is that it could have been originated by the "attach" action that the cookbook triggers too soon (without waiting for the detach to complete)
[13:27:33] I also searched logstash and didn't find anything but I didn't look too deep
[13:27:37] dhinus: possible although that seems like a bug
[13:27:46] yeah, all I see is happy 200s in the logs
[13:27:59] detach is async, so calling detach + attach immediately after seems prone to errors
[13:28:13] but I'm still not understanding exactly why the "reserved"
[13:28:33] another data point is that I have not seen this behavior with the nfs volumes in testlabs
[13:28:41] I've always understood "reserved" to mean "attaching..."
[13:29:19] Does/could the cookbook have a wait-for-status nap after it detaches?
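Editor's sketch of the wait-for-status idea raised above, assuming a hypothetical `get_status` callable that returns the volume's current status string (for example by wrapping a "volume show" call). None of these names come from the cookbook; this only illustrates polling after an async detach, and it can only detect a volume that stays 'reserved', not unstick it.

```python
import time
from typing import Callable


def wait_for_volume_status(
    get_status: Callable[[], str],  # hypothetical: returns the current volume status
    wanted: str = "available",
    timeout: float = 60.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns `wanted` or `timeout` seconds pass.

    Returns the last status seen, so the caller can tell a clean detach
    ("available") from a volume that is still "reserved" at the deadline.
    """
    deadline = time.monotonic() + timeout
    status = get_status()
    while status != wanted and time.monotonic() < deadline:
        sleep(interval)
        status = get_status()
    return status


if __name__ == "__main__":
    # Simulated status sequence instead of a real OpenStack call.
    fake = iter(["reserved", "reserved", "available"])
    print(wait_for_volume_status(lambda: next(fake), sleep=lambda _: None))
```

A caller would run this right after the detach and only attempt the attach if the returned status is the wanted one; anything else is a signal to stop or to fall back to whatever manual state fix is in use.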
[13:30:01] it doesn't atm, though even if it did I'm not sure it'll finish, the volume is now in status reserved
[13:30:16] after I issued the server remove volume command like the cookbook does
[13:30:29] yeah if it goes to reserved and gets stuck there, there's not much we can do
[13:30:58] but there must be an explanation why it doesn't detach cleanly
[13:31:02] dhinus: right, but you were thinking it got stuck because of getting a second request while mid-transition
[13:31:10] andrewbogott: yes that was my original theory
[13:31:12] seems like it, I'm ok to leave things as is for investigation if (collective) you would like, or reattach
[13:31:35] but if manually detaching (without attempting to reattach) also triggers the "reserved" status, my theory is wrong
[13:32:02] godog: can you leave it detached for a minute?
[13:32:15] dhinus: sure, i'll standby
[13:32:46] ok I just wanted to run a "volume show", which still shows "reserved"
[13:32:54] you can reattach unless andrewbogott wants to debug more
[13:33:17] a bit of googling suggests that the stuck state happens because of a bad interaction between cinder and nova. So e.g. cinder is telling nova to attach it and nova fails to respond for one reason or another and cinder is left hanging.
[13:34:15] so it's possible that the 'curse' is on the VM it's trying to attach to, rather than on the volume itself.
[13:34:29] (or more likely on the hypervisor hosting that VM)
[13:34:49] if it was reproducible, we could test /that/ by moving the VM to a different cloudvirt
[13:34:58] ...but this all depends on how curious you are
[13:35:45] sure I'm ok to test while we're at it andrewbogott
[13:35:51] I'll reattach
[13:36:03] ok, you're reattaching to the 'from' VM right?
[13:36:10] what's the name or uuid of the 'to' VM?
[13:37:04] toolsbeta-nfs-4 or toolsbeta-nfs-5?
[13:37:11] so atm I'm considering only the from vm, that's -4
[13:37:23] i.e. I've run 'wmcs-openstack server remove volume 19c9ecd1-6fb2-4a2d-954a-c1dc6c956034 648504db-18c2-4cee-b731-567dcb4dadf6' manually, which still reproduces the issue
[13:37:56] great, I will push 19c9ecd1-6fb2-4a2d-954a-c1dc6c956034 around
[13:38:05] * godog nods
[13:38:11] it's on 1059 currently
[13:39:04] ok, now it's on 1046
[13:39:12] want to try again?
[13:39:19] sure
[13:40:01] {{done}}
[13:40:26] yeah still reserved looks like
[13:40:30] by the way: earlier in the day nova reports "[None req-1a367538-75d3-4256-a764-337663f88bb5 novaadmin admin - - default default] Failed to detach device sdb from instance 19c9ecd1-6fb2-4a2d-954a-c1dc6c956034 from the persistent domain config. Libvirt did not report any error but the device is still in the config."
[13:40:49] and it says it again right now
[13:40:55] so that's at least a log message about the problem
[13:41:30] interesting, yeah seems to check ok
[13:41:33] check out rather
[13:41:56] I'll reattach unless there are objections?
[13:42:02] go ahead
[13:42:03] Yeah, go ahead
[13:42:11] I'm out of ideas for the moment, going to eat breakfast and ponder.
[13:42:46] {{done}} ok
[13:42:47] Any reason to think this is trixie-specific? Have you seen things work right on trixie before? (Really the VM OS should be a passive player in all this unless libvirt is doing something very fancy)
[13:43:36] I have not seen this behaviour while running the cookbooks in 'testlabs' project
[13:43:44] with trixie VMs too
[13:44:08] ok
[13:50:14] dhinus, A preview of things to come: I need mcrouter packages for Trixie and it looks like you made the Bookworm packages. Right now I'm stuck on one of the dependency libraries not building; I'm going to try setting up a build VM so I can show you what I see.
[13:53:29] ha yes, I have bad memories of that mcrouter build, but in the end it did work :)
[13:54:43] godog: here's the code that produces that log message
[13:54:45] there are some notes from the bookworm build in the gerrit patch https://gerrit.wikimedia.org/r/c/operations/debs/mcrouter/+/959212
[13:54:50] https://www.irccloud.com/pastebin/JbwgOZm0/
[13:55:08] so you deserve congratulations!
[13:56:41] lol
[13:57:01] dhinus, yep, my new build is based off of that patch. Things were going pretty swell until I got to a bit of upstream c++ that doesn't compile
[13:57:33] I remember I had many similar issues, and I went manually to each upstream dep to find a reasonable commit to compile
[13:58:38] oh, ok! That was what I started doing last night, just trying to find an older point in git that builds. But... does that really mean they're committing unbuildable code?
[13:58:55] according to the patch the commits I used were "the latest available release tag in each upstream repo"
[13:59:19] ok, I will try tags rather than head
[13:59:23] so yes probably they commit unbuildable code (at least unbuildable for some version of debian)
[13:59:37] but apparently using release tags was more stable, at least when I tried
[13:59:40] oh yeah, I guess they're building on ubuntu and not debian
[13:59:57] and we don't know for sure which version of ubuntu, hence which version of underlying libs...
[14:00:00] but right now it's breaking on a c++ syntax issue which does not impress me
[14:00:14] but probably the c++ spec changes every few weeks
[14:00:46] I remember packaging it for bookworm was quite a nightmare, I forgot the details but I think it was mostly trial and error with different combinations of commits
[14:01:04] oh good
[14:01:06] (for onlookers, the 'they' above is facebook)
[14:01:27] "move fast and break things" :P
[16:36:40] I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194167 but something unrelated is broken on apt1002: "aptmethod error receiving 'https://deb.nodesource.com/node_22.x/dists/trixie/Release.gpg'"
[16:36:44] I pinged in #-sre
[16:39:49] maybe https://github.com/nodesource/distributions/issues/1865
[16:41:05] not sure, because git blame shows that repo was added to our list after that
[16:52:10] maybe nodesource pulled some of the packages? if I replace 22 with 20, and trixie with bookworm, the URL is working
[17:00:50] maybe, that would be the whole repo though
[17:03:07] maybe they tried refreshing the key and removed it xd, I could see myself doing that by mistake
[17:05:50] I will leave an update in the task for someone else to pick up tomorrow...
[17:06:39] T405742
[17:06:40] T405742: tofu-provisioning: Failed to install provider - https://phabricator.wikimedia.org/T405742
[17:19:43] * dhinus off, back next Tuesday!
[17:39:18] * dcaro off
[18:41:12] dduvall: I think I filled T406271 but it's been ages since we've done anything with qos so if you're willing to run a side-by-side performance test I'd like to hear if it really does anything :)
[18:41:13] T406271: Grant gitlab-runners-staging access to fast-iops volume type and a 4xiops instance flavor - https://phabricator.wikimedia.org/T406271
[18:41:46] andrewbogott: awesome, ty! i'll give it a whirl
[18:43:15] oh except I only did half of it because I'm a poor reader...
[18:57:41] ok, now you should have the flavor as well as the volume type