[09:18:40] FYI, I'll use the current depooled status of ulsfo for the switch maint to reimage bast4006 to trixie, it should be back in ~20m
[09:34:24] slyngs: ^^
[09:35:08] Perfect, I'll hold off on repooling until it's back
[09:43:20] or quickly use a different bastion
[09:43:26] I'll report back when it's completed
[09:46:17] I don't think I use bast4006 anyway :-)
[09:49:55] yeah, for anyone using FIDO keys it makes sense to only use one bastion for all sites and then simply pick the one network-topologically closest to oneself
[10:12:11] fabfur, slyngs: bast4006.wikimedia.org is reimaged
[10:12:28] Thank you
[10:12:43] my other post-switch-maintenance steps for Ganeti are also completed, so from my PoV ulsfo can be repooled
[10:14:24] thanks!
[10:18:29] The number of requests to the rest gateway with trust level E (or F) dropped drastically at 7:30 UTC this morning. This only happened on codfw; eqiad looks unchanged. Any idea what happened? Was a new rule put in place at the edge, rejecting such requests before they reach the gateway? https://grafana.wikimedia.org/goto/fflc31ndxcfswe?orgId=1
[10:37:11] fabfur, moritzm: any idea what may have caused this? 80% of API traffic just vanished from codfw.
[10:47:19] we repooled ulsfo, but not at that time... only some minutes ago, so it shouldn't be the cause
[10:47:53] anything relevant on SAL at that time?
[11:54:44] congrats all on the ulsfo migration!!
[12:04:42] jynus: not sure what you're trying to do, but you can group the hosts from tasks like https://phabricator.wikimedia.org/T421719 by using cumin. For example: `sudo cumin 'es[1033,1045-1046,1051-1053,1057].eqiad.wmnet' 'facter -p lldp.parent'`
[12:05:04] that will group them by switch and thus by rack
[15:00:29] XioNoX: in most cases I cannot reimage more than 1 host at a time, for redundancy reasons
[15:00:38] and with a lot of time in between
[15:05:07] ah, that fact is useful to know
[17:08:19] mutante: wondering if you have ideas/opinions about preventing T421147 in the future.
[17:08:20] T421147: Codesearch stuck at Feb 12th? - https://phabricator.wikimedia.org/T421147
[17:08:57] perhaps some sensible way to remove these lock files if they're above a certain age and no other proc is active, or something like that?
[17:09:49] or we could try to figure out how they were left behind in the first place, but it seems somewhat inevitable at scale that a service can get killed for whatever reason and end up midway through a git command, with a lockfile left behind.
[17:10:44] I guess the reason git itself doesn't recover from this automatically when it encounters the lockfile later is that it's not trivial to tell whether another "git" process is active and will unlock it, vs. knowing for sure that those past processes are long gone.
[17:11:09] especially once you factor in NFS, or containers that don't "see" each other's procs but may share a mount of sorts.
[17:11:21] but as sysadmins we can presumably know that with more certainty than Git itself can
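
A sketch of the single-bastion setup suggested at 09:49:55, assuming an OpenSSH client with `ProxyJump` support; the site suffixes shown are examples, and bast4006 stands in for whichever bastion is network-topologically closest to you:

```
# a hypothetical ~/.ssh/config addition: route all production hosts
# through one bastion, so the FIDO key only ever authenticates to
# that single jump host (extend the Host pattern for other sites)
cat >> ~/.ssh/config <<'EOF'
Host *.eqiad.wmnet *.codfw.wmnet *.ulsfo.wmnet
    ProxyJump bast4006.wikimedia.org
EOF
```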
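
For context on the cumin one-liner at 12:04:42: cumin clusters hosts whose command output is identical, so `facter -p lldp.parent` effectively buckets them by uplink switch. A hand-rolled equivalent, as a sketch only; it assumes direct SSH access to the targets and that `facter -p` is runnable there (it may need sudo to read puppet facts):

```
# group the hosts from the example by their uplink switch (and hence
# rack): print "switch host" pairs, then sort so hosts that share a
# switch end up adjacent
for h in es1033 es1045 es1046 es1051 es1052 es1053 es1057; do
    printf '%s %s\n' "$(ssh "$h.eqiad.wmnet" facter -p lldp.parent)" "$h"
done | sort
```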
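
The age-plus-no-active-process idea from 17:08:57 could look roughly like the sketch below. Everything in it is an assumption, not the deployed fix: `/srv/codesearch` is a hypothetical repository root, the 60-minute threshold is arbitrary, and, per the caveat at 17:11:09, `pgrep` only sees processes on its own host/namespace, so this is only safe where the repositories aren't shared over NFS or across containers:

```
#!/bin/sh
# a rough sketch: delete git lock files older than 60 minutes, but
# only if no git process is currently running on this host. pgrep
# cannot see processes in other containers/hosts sharing the mount.
if ! pgrep -x git >/dev/null; then
    find /srv/codesearch -name '*.lock' -path '*/.git/*' \
        -mmin +60 -print -delete
fi
```

A real version would also want to match git's helper processes (e.g. `git-remote-https`), handle bare repositories that have no `.git` directory, and probably live inside the service's own deployment rather than a generic cron, for exactly the visibility reasons discussed above.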