[11:05:32] * taavi --> lunch, then upgrading toolsbeta
[12:07:33] starting the k8s upgrade
[12:26:14] filed T413874
[12:26:14] T413874: toolforge jobs logs: requests.exceptions.HTTPError: 400 Client Error: Bad Request for url - https://phabricator.wikimedia.org/T413874
[12:56:07] aha, apparently there's a k8s bug where a 1.31 kubelet can't talk to a 1.30 apiserver, which is causing Issues when only some of the api servers have been upgraded
[12:56:23] (that "bug" being https://github.com/kubernetes/kubernetes/pull/123905#issuecomment-2260639787)
[12:56:53] I will just plow through upgrading all the control nodes
[13:17:17] taavi: the next thing on my list is rebuilding a toolsbeta etcd node, let me know when you're satisfied that the upgrade is stable.
[13:17:27] Meanwhile... breakfast
[13:38:28] andrewbogott: upgrade is done, except the last run of the test suite
[13:39:43] great, I'll start in a few
[14:22:05] * dhinus paged, but I don't see any active incidents in victorops
[14:22:31] ok there's a resolved one
[14:23:03] cloudvirtdown for cloudvirtlocal1003, seems to have recovered
[14:24:14] I'm reimaging it
[14:24:22] sorry to disturb!
[14:36:34] np
[15:17:49] I'm done messing with etcd
[15:18:26] andrewbogott: the toolschecker etcd health check is still failing?
[15:18:52] I think that's just tools-checker needing a puppet refresh, will force that
[15:19:40] yep
[15:19:45] yeah, that's better :P
[15:19:57] etcd is still on bullseye, right?
[15:20:03] yes
[15:20:09] we should fix that :P
[15:20:13] yeah
[15:21:02] oh, unrelated: do you have a plan for 'Object storage quota by 'objects' is 80.93% full for project tools-logging' or should I just increase the quota?
[15:24:43] andrewbogott: yeah, that needs just bumping I think - it seems to be growth from adding the tracing instance, that should level off in about a month when we have data for the entire retention period
[15:25:06] ok, I'll adjust it
[15:25:10] ty!
[15:30:59] andrewbogott: the alert was an object count one, not a storage space one?
[15:31:16] yep, just realizing that
[15:49:08] curious, everything is working properly but "openstack network agent list --agent-type l3 --router cloudinstances2b-gw --long" now shows us as in standby/standby.
[15:49:12] That's not super reassuring
[15:51:48] uh, what exactly are you doing?
[15:55:17] I upgraded cloudnet1005 (the standby) to Trixie. Next I want to flip 1005 over to being the active host and then upgrade 1006...
[15:55:41] but when I stopped neutron services on 1006 (formerly the active host) it started reporting standby/standby
[15:56:23] I don't see any actual network interruption, I assume it's just the reporting that's wrong.
[15:57:31] that's concerning
[15:58:30] yep
[16:00:37] andrewbogott: seems like the roll reboot cloudnets cookbook sets the neutron admin state instead of stopping services directly, which might be worth trying? (so take the services on 1006 back up again, then set them as down in neutron)
[16:01:19] sure, let me look at that cookbook too...
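
The kubelet/apiserver mismatch at 12:56 is Kubernetes' version-skew rule: a kubelet must not be newer than any apiserver it talks to, which bites during a partial control-plane upgrade. A quick way to confirm the mixed-version state, as a sketch using stock kubectl:

    # kubelet version on every node
    kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion

    # version of whichever apiserver answers this request; repeat a few times
    # to sample each control node behind the load balancer
    kubectl get --raw /version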
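
For the toolschecker etcd health check at 15:18, a manual spot-check against the cluster might look like the following (endpoint and certificate paths are placeholders, not taken from the log):

    # etcd v3 health check; cert paths depend on the local PKI setup
    ETCDCTL_API=3 etcdctl \
        --endpoints=https://<etcd-node>:2379 \
        --cacert=/path/to/ca.pem --cert=/path/to/client.pem --key=/path/to/client-key.pem \
        endpoint health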
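
The forced puppet refresh mentioned at 15:18 could be done with the stock agent CLI (any site-specific wrapper the team actually uses isn't named in the log):

    # force an immediate catalog run instead of waiting for the next scheduled one
    sudo puppet agent --test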
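
The quota bump discussed at 15:21-15:31 is an object-count cap, not a byte cap. The log doesn't show which tool was used; assuming the object storage is Ceph radosgw and the project maps to a radosgw user, a sketch with hypothetical uid and limit:

    # raise the object-count quota for the project's radosgw user (values are placeholders)
    radosgw-admin quota set --quota-scope=user --uid=tools-logging --max-objects=2000000
    radosgw-admin quota enable --quota-scope=user --uid=tools-logging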
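
The cookbook behavior described at 16:00 (setting neutron admin state down rather than stopping services, so the HA router fails over cleanly and reports active/standby correctly) maps onto stock openstackclient commands; a sketch, with the hostname and agent ID as placeholders:

    # find the L3 agent running on the node about to be taken down
    openstack network agent list --agent-type l3 --host cloudnet1006.eqiad.wmnet

    # mark it administratively down instead of killing the neutron services
    openstack network agent set --disable <agent-id>

    # re-enable it once the upgrade/reimage is done
    openstack network agent set --enable <agent-id>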