[09:47:42] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11463124 (ayounsi) @Jhancock.wm I'll leave it to you and @RobH to procure the needed equipment. If you prefer a fiber run between the two devi...
[10:32:11] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11463290 (cmooney) >>! In T410717#11463123, @ayounsi wrote: > If a copper run is fine, then it's an SFP-T (that you probably have in stock) on...
[13:09:59] netbox, netops, Infrastructure-Foundations, Patch-For-Review: Automatically run Capirca Netbox script regularly - https://phabricator.wikimedia.org/T361549#11463998 (ayounsi) Once the two patches above are deployed, comes the question on how to run it regularly. There are 2 possible options : *...
[13:20:23] netops, Infrastructure-Foundations, SRE: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11464054 (fgiunchedi) >>! In T399180#11432250, @cmooney wrote: >>>! In T399180#11432052, @fgiunchedi wrote: >> I think the easiest would be to: >> >> * Remove the spuri...
[13:50:36] netops, Infrastructure-Foundations, SRE: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11464225 (ayounsi) a:Papaul @Papaul would you be ok to work with Nokia's support to figure out what those inbound errors mean ? Thanks
[14:11:53] topranks: o/ https://phabricator.wikimedia.org/T412807 is really weird, IIUC the debian installer does dhcp correctly but it ends up setting the IP to the wrong interface?
[14:12:23] elukey: I don't think we do DHCP in the debian installer anymore?
unless I'm mistaken
[14:12:49] we default to HTTP boot, so the IP details are passed to the debian installer through the ipxe loading of the boot image?
[14:13:48] topranks: this is something that I am missing then - I thought that we passed all the info indeed, but not the final IP settings
[14:14:15] no we don't do DHCP in the d-i stage
[14:14:37] that wouldn't work on the Nokia switches as they aren't putting the option 82 info into the DHCP discover packets
[14:14:48] right okok
[14:15:01] ok I totally missed this bit
[14:15:07] we DHCP once, from the PXE/UEFI stage, and that system puts its UUID in that packet, which we use to identify
[14:15:35] yeah that part was clear, I didn't connect to the fact that we couldn't do dhcp later on in d-i as well of course
[14:15:42] but Debian wouldn't put that ID in the DHCP, so what happens is the IP that is assigned by DHCP in the first stage is somehow passed to d-i
[14:16:54] I hit the same issue on an sretest host a few weeks back, however in that case the host had two working network ports (for my testing), I shut one down on the switch side and tried again and the second time the IP was set on the right interface (once the others were all 'down')
[14:16:55] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167883
[14:17:01] here it's adding it on the down interface though
[14:17:42] elukey: yep, and there were more recent patches to make it work for legacy bios mode too
[14:18:34] ah ok so with "down interface" you mean in the case of the db host
[14:19:11] so the ipxe script in that case is the one setting the wrong interface, for $reasons
[14:19:20] is that your point?
[14:22:26] elukey: iirc, ipxe is sending stuff, but it's D-I with `netcfg/choose_interface=auto` that is applying the IP config on the wrong NIC
[14:22:57] XioNoX: ah ok thanks
[14:23:07] I don't suppose you know how to fix it?
[14:23:25] eno1np0:
[14:23:27] there isn't a "choose_interface=auto_but_not_the_wrong_one" option?
[14:24:17] eh, "just_work_please"
[14:24:50] XioNoX: TIL thanks
[14:24:58] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11464421 (Papaul) @ayounsi what else needs to be done here?
[14:25:14] topranks: I know that you are looking forward to check all the netcfg di options right!? :D
[14:25:49] maybe the logs have something interesting?
[14:26:00] not exciting as I'd hoped, seems the other option apart from auto is "select" and you give the actual interface name
[14:28:19] topranks: is it possible that something is connected (or even just an optic in the port) of the wrong interface that makes it think it should use it?
[14:28:37] `ethtool eno1np0` could maybe tell us some more info?
[14:28:39] CAS-SSO, Infrastructure-Foundations: Provide an official Docker image for CAS-SSO - https://phabricator.wikimedia.org/T412826 (Arendpieter) NEW
[14:28:41] there is no ethtool in the busybox to check that
[14:28:52] but yeah I thought of that possibility alright, could be
[14:28:55] we could have dc-ops check
[14:29:01] yeah
[14:30:24] CAS-SSO, Infrastructure-Foundations, Security-Team, SRE: SSO kill switch for crucial services - https://phabricator.wikimedia.org/T233938#11464449 (Arendpieter)
[14:30:31] or the other way around, the eno3 has an issue that makes it take time to boot up on time
[14:33:53] there is another option: #d-i netcfg/link_wait_timeout string 10
[14:34:03] might help if that's the issue, but I have my doubts that's the problem
[14:34:38] though the docs do say "# netcfg will choose an interface that has link if possible. This makes it skip displaying a list if there is more than one interface."
[14:36:51] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11464499 (ayounsi) I was working on that as we speak.
As sretest2003 was reclaimed to test hosts I was able to run some more tests. Running the still not m...
[14:43:50] maybe the host is telling us to start using its 10G port
[14:54:53] someone knows what's up with CI on https://gerrit.wikimedia.org/r/c/operations/software/homer/+/1218739 ?
[15:06:13] XioNoX or we can remove the extra 10G nic and avoid the corner case (if not provided by the motherboard itself). Kind of nuclear option but more streamlined with what we have
[15:07:21] re: CI - can you repro locally?
[15:08:37] I use python 3.13 locally so it's not exactly the same errors but look like yes
[15:08:44] but also I don't think it's related to my change?
[15:24:17] it seems something related to a new version of sphinx, if I had to guess
[15:28:57] trying to repro locally
[15:37:07] I get Sphinx==9.0.4 installed on a fresh tox run
[15:37:46] released Dec 4th, but I don't see much in the changelog
[15:38:51] mmm maybe it is sphinx 9.x the issue
[15:43:46] yeah I think that is the issue
[15:43:52] filing a patch in a bit
[15:44:03] we can just restrict it to < 9.0.0
[15:46:06] XioNoX: https://gerrit.wikimedia.org/r/c/operations/software/homer/+/1218785 should do it
[15:46:39] nice, thanks!
[15:48:46] np :)
[15:49:38] CI is also super slow, new PS at 16:22, CI result at 16:38...
[16:54:02] merged my change, yours is green now
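
Note on the netcfg options discussed between 14:22 and 14:34: a minimal preseed sketch of the three settings mentioned, in the form they take in Debian's example preseed file. The interface name `eth1` is a placeholder, not the name of the correct NIC on the affected host, and nothing in the log confirms that either override fixes T412807; the two alternatives are left commented out for that reason.

```
# Behaviour described in the log: netcfg picks an interface that has link.
d-i netcfg/choose_interface select auto

# Alternative: name the interface explicitly (eth1 is a placeholder).
#d-i netcfg/choose_interface select eth1

# Alternative: wait longer for link detection, in seconds.
#d-i netcfg/link_wait_timeout string 10
```

Naming the interface avoids the auto-selection entirely but ties the preseed to one hardware layout, whereas the timeout only helps if the intended NIC is slow to bring its link up.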