[07:29:28] greetings
[11:11:29] hrm, `kubectl get pod -A` is hanging for me on toolforge
[11:18:10] taavi: works for me in tools-k8s-control-7
[11:18:52] dhinus: huh, indeed. so why does it not work from a bastion?
[11:19:45] no idea :/
[11:19:56] * dhinus lunch, bbl
[12:03:52] well, now it works again but still feels really sluggish
[12:24:21] topranks: are these interface errors on the cloudnet interface that drastically increased a few months ago something to be worried about? https://librenms.wikimedia.org/graphs/to=1778156400/id=20117/type=port_errors/from=1746620400/
[12:24:40] noticed while I was going through everything on the network path of those strange retransmitted k8s packets
[12:27:13] taavi: they are not "errors" as such (though LibreNMS shows them on that graph). The equivalent grafana dashboard shows no errors, so there are no link problems, checksum errors etc
[12:27:55] what they are are discards:
[12:27:56] https://grafana-rw.wikimedia.org/d/5p97dAASz/queue-and-error-stats-by-network-device?orgId=1&from=now-3M&to=now&timezone=utc&var-site=000000006&var-device=cloudsw1-d5-eqiad:9804&var-interface=xe-0%2F0%2F11&refresh=30s
[12:29:16] so effectively what they show is that we have micro-bursts of traffic
[12:29:33] (sorry, that dash may take a while to load; you can view it over a week too)
[12:30:01] the rate-over-time outbound to that host is ~2Gb/sec, so not exceeding our 10G line rate on average
[12:31:44] but the discards/drops tell us that at times in between our 1min sampling of the interface, the packet rate out of that interface exceeds the line rate, and does so for long enough that the buffer holding packets until the link is free fills up as well
[12:32:19] the result is the switch has to drop packets
[12:32:50] the actual number, I would say, is not a major concern
[12:33:00] even taking a period where we have a relatively high number of drops:
[12:33:02] https://grafana-rw.wikimedia.org/d/5p97dAASz/queue-and-error-stats-by-network-device?orgId=1&from=2026-05-06T11:47:54.106Z&to=2026-05-06T14:02:11.130Z&timezone=utc&var-site=000000006&var-device=cloudsw1-d5-eqiad:9804&var-interface=xe-0%2F0%2F11&refresh=30s
[12:33:21] the actual percentage of packets dropped is still only 0.000353%
[12:34:03] the number of transmitted packets is consistently several hundred thousand per second, and the drops show 20-100 packets dropped per second once in a while
[12:34:07] so I don't think it's a big concern
[12:34:10] fair
[12:34:12] thanks anyway
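(A minimal sketch, not from the log, of the arithmetic behind a drop percentage like the 0.000353% above: the ratio of the discard counter delta to the transmit counter delta between two samples. The counter values here are invented for illustration.)

```python
# Sketch: deriving a drop percentage from two samples of an
# interface's transmit and discard counters (values are made up).

def drop_pct(tx_start, tx_end, drops_start, drops_end):
    """Percentage of outbound packets discarded between two samples."""
    tx = tx_end - tx_start
    drops = drops_end - drops_start
    return 100.0 * drops / (tx + drops)

# e.g. ~300k pps sustained over a 60s sampling interval, with one
# micro-burst that discarded ~60 packets somewhere inside it:
print(f"{drop_pct(0, 300_000 * 60, 0, 60):.6f}%")  # ~0.000333%
```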
[12:34:14] what we can do to address it:
[12:34:39] 1) upgrade the switches (planned). the Trident 3 switches we're replacing them with have 3x the buffer memory, so that will likely fix it
[12:34:56] 2) upgrade the cloudnet line rate; if the line rate is higher, the link will be idle more and fewer packets need to buffer
[12:35:21] 3) do some qos work. this is where qos kicks in: we can signal to the network which packets we'd prefer it to drop when this happens
[12:35:35] probably here it's not a big worry, let's just leave it and hopefully it disappears after 1
[13:46:03] * taavi current side track counter: 4 or 5
[13:51:33] * taavi wonders about the difference between 'br-internal' and 'br-int' bridges on cloudnets
[13:57:53] br-int turns out to be an ovs-internal thing, while br-internal is the neutron/ovs interface for the legacy vlan
[13:58:33] one more reason to look forward to switching off the legacy vlan
[14:49:39] moritzm: re https://gerrit.wikimedia.org/r/c/operations/puppet/+/1283025 , the role is still in use on cloud vps
[14:50:01] well, it's the deployment-prep mail server at least
[15:55:10] does anybody have any idea that could explain the k8s scheduling error reported in T425696?
[15:55:11] T425696: restarted pod failed to schedule due to resource constraints - https://phabricator.wikimedia.org/T425696
[15:56:30] "74 Insufficient cpu" is surprising, as according to grafana we're only at 57% cpu requests overall
[16:25:36] andrewbogott: I had a brief look at packaging the magnum driver you linked; the python part is fine, but dealing with the rust dependencies 'properly' gets difficult very quickly, even on unstable
[16:26:55] Ok. Thanks for looking. Does it have rust dependencies at runtime or just at build time?
[16:27:45] build time, I would assume
[16:28:22] i.e. I imagine it's going to compile at build time, at which point it will behave like any other native-backed python library
[16:36:22] yeah, ok. So it might need a bespoke build host
[16:37:55] not really, I think
[16:38:26] either someone needs to spend some time fixing those missing rust packages, or we run the build somewhere that can access the internet and pull from cargo directly
[16:42:16] I was able to build (the wheel) on a VM with a few extra 'apt install' lines. Do you have any concerns about that being the 'official' build process for now? (Other than it being clumsy, obv.)
[16:44:46] where and how would the wheel be installed?
[16:53:59] I'm just assuming that if I can build the wheel then I must have the rust compilation bits sorted.
[16:54:11] We wouldn't literally install it in wheel form
[16:54:50] I will try to figure out what wheel2deb actually does; I imagine it explodes the wheel into its component parts.
[21:15:15] andrewbogott: seeing https://github.com/V4bel/dirtyfrag/blob/master/README.md. Are we affected?
[21:16:19] nope, we were ahead of time via puppet
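(A speculative note on the T425696 puzzle above, not an answer from the log: kube-scheduler fits a pod against each node's allocatable-minus-requested CPU individually, so a modest cluster-wide request average can hide the fact that no single node has enough unreserved CPU for the pod. A toy illustration with invented node figures:)

```python
# Hypothetical sketch: why "Insufficient cpu" on every node can coexist
# with ~57% cluster-wide CPU requests. The scheduler checks each node
# separately; the average hides per-node fragmentation. Numbers invented.

pod_request_mcpu = 2000  # the restarted pod asks for 2 CPUs

# (allocatable, already-requested) CPU in millicores, per node
nodes = [(4000, 2100), (4000, 2300), (4000, 2500)]

avg_used = sum(r for _, r in nodes) / sum(a for a, _ in nodes)
print(f"cluster-wide cpu requests: {avg_used:.1%}")  # 57.5%, looks fine

fits = [a - r >= pod_request_mcpu for a, r in nodes]
print(f"nodes with room for the pod: {sum(fits)}/{len(nodes)}")  # 0/3
```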