-
Notifications
You must be signed in to change notification settings - Fork 201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SR-IOV improvements (VF-LAG) #439
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @daniloegea, thanks for figuring out resolution to the multiple calls of netplan apply --sriov-only
. I only have two suggestions here.
One is that we should check if the bond type is one of the supported types for VF LAG. According to the NVIDIA docs it must be either Active-Backup
, Balance-XOR
or LACP
.
Second suggestion is in the in-lined comment.
1ab7b52
to
28008e1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! This approach is looking good to me and I think we should continue this path.
I've added a bunch of inline comments already. But let's revisit a full review once the branch is cleaned up and has improved testing.
Note: Out-of-band we were also discussing the use of https://docs.kernel.org/networking/devlink/netdevsim.html to simulate SR-IOV devices in order to run integration tests. This would need a bigger refactoring though, as we cannot access those simulated devices through PCI. IMO, we should postpone this refactoring and additional testing to a later point in time and independent PR. Let's fix the issues at hand first.
The mlx5 driver added support for debugfs in https://github.com/torvalds/linux/commit/7f46a0b7327a | ||
It's available since kernel 5.19 https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.19 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: Let's ask the openstack team if there is any better (more stable) interface to query this information, which would potentially also be available on older Kernels. If not, then debugfs is the best we can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They weren't aware of anything better during our last meeting.
b4613d9
to
7a18c01
Compare
Hi Lukas, thanks for looking at it. I think it's ready for another review. Please look at all the commits again as I did some significant changes and added new commits. I also prepared a PPA for Mantic with all these patches https://launchpad.net/~danilogondolfo/+archive/ubuntu/netplan-sriov |
7a18c01
to
8a52193
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM!
I left a few nitpicks inline, nothing urgent. I assume you tested in on the ConnectX-5 hardware we used for development, so should be fine.
Still, it would be great to get confirmation from @mkalcok that this is working as expected, using the PPA provided by Danilo.
The mlx5 driver added support for debugfs in https://github.com/torvalds/linux/commit/7f46a0b7327a | ||
It's available since kernel 5.19 https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.19 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They weren't aware of anything better during our last meeting.
Add a new internal function to expose the bond mode from netdef.bond_params.mode and a Python binding for it.
netplan apply will be called multiple times in parallel when the udev rules matches multiple device. That might cause problems during boot. Move the call to "netplan apply --sriov-only" to a service so it will be called just once for all the interfaces. The sriov-rebind service must wait for this new service to finish before start running. This change was tests with real hardware and appears to work fine.
The eswitch mode will be modified with devlink regardless its current mode. It causes issues when VF LAG is enabled in the PF. This particular problem will only happen on Mellanox interfaces (hopefully) due to the VF LAG feature. Retrieving the eswitch mode with devlink seems to not work on all NICs. If it fails to get the information, the new method will return 'undetermined' and we set the eswitch mode anyway. In a setup with Mellanox NICs with eswitch mode set to switchdev and VF LAG enabled, netplan apply will break the setup even if there is nothing to change in the SR-IOV NICs. This change will prevent it to happen.
When a PF is part of a bond interface, the mlx5 driver will try to activate the VF LAG feature. While it happens, VFs can't be bound or the process will fail. This change adds a verification step to netplan rebind so, if the PF is part of a bond, it will wait for a few seconds until the VF LAG feature is reported as active by the driver. This is done through the debugfs interface provided by the mlx5 driver in newer version of the kernel (5.19+).
All the subcommands support the parameter --debug but it's not really implemented (only the global netplan --debug is). The main reason for that is because the global --debug also enables glib debugging messages and I'd like to not have all that noise in the output of netplan rebind as --debug will be used by default in the systemd service unit. But the real reason for that is because it's really tricky to get the messages from the root logger from stderr in unit tests. Changing the root logger to log them to stdout will break many tests.
And add a few more tests and delete tests that don't make sense anymore.
8a52193
to
27346fc
Compare
Today I tested this change with the provided PPA and it seems to work as expected. The only concern that I have, and it may be caused by something outside of Netplan's control, is the delays that I see between individual steps. Following is a snippet from
So there's about 100 seconds between last VF getting un-bound and first PF switching to |
Hi Martin, thanks a lot for testing it. Do you see anything useful in the I'm not completely sure about how long each operation takes to complete. But I believe it's just how long the driver takes to unblock the write operations to the sysfs files and it seems to be piling up... |
I'll double-check @daniloegea I suspect that this delay is not coming from netplan. We are double checking actual connectivity via the configured bond, I'll get back to you as soon as we confirm that everything works. I have one more question that might be bit outside of scope of this change, but is there a way to wait until netplan is done with this setup? Particularly at boot, we'll need a way to tell |
I assume we should be able to sort a system unit after the corresponding Netplan services to make that work?
How and when is |
Looks like we might want some |
When activating VF LAG and at the same time changing the e-switch mode to switchdev, the e-switch change MUST be performed before the bonding. Trying to change it after the bond is created will fail and the driver will report that the e-switch is busy. In fact, all the SR-IOV initial setup must happen before the networking configuration. Change the order in which we call netplan apply --sriov-only to happen before network-pre.target. Also, add DefaultDependencies=no to the service to prevent the creation of dependency cycles.
2c82bb1
to
b8f98f9
Compare
I added the DefaultDependencies=no. It fixes the dependency cycles found during tests. |
The openstack team confirmed that their VMs come up with VF-LAG enabled, using this branch. So let's get it merged. |
We have had requests from users on Ubuntu Jammy that are interested in these improvements to resolve issues with consistent upbringing of VFs, embedded switch mode and VF LAG on their systems. Are there any plans to bring Netplan 1.0, or to backport these fixes, to Ubuntu Jammy? |
Backporting the isolated fixes might be an option. Would you mind creating a bug report on the Ubuntu package about this? https://launchpad.net/ubuntu/+source/netplan.io Our full-version backport policy is usually to support the latest Ubuntu LTS, which would be Noble in this case. Porting 1.0 back to Jammy is especially tricky due to the breaking ABI changes and SOVER bump of libnetplan0 -> libnetplan1, introducing a transition in a stable release. |
Main points of this PR:
The udev rule logic was moved to a systemd service that will call netplan apply --sriov-only a single time once all the devices are available. Previously, this command was being called in parallel for all the PFs found in the system (as far as I can tell, it’s not called for each VF).
The netplan-sriov-rebind.service file will only run when netplan-sriov-apply.service is completed. Although, it’s not enough to fix the VF LAG activation. The rebind operation must monitor the state of the VF LAG feature and only when it’s active VF can be bound to the driver.
netplan rebind was changed to support running specific quirks. For the Mellanox case, it will check if the PFs are associated with a bond in Netplan and, if it is, it will monitor the mlx5 lag state file from the debugfs interface and wait until it’s active (or timeout). Only when it’s active it will proceed and call the bind() method. This seems to be enough to activate the VF LAG feature reliably during boot. It will also check if the bond mode is one of the VF LAG supported modes.
When the system is up and running and VF LAG is activated, calling netplan apply will wrack havoc with the SR-IOV configuration. It happens because the code will try to change the eswitch mode with devlink even if it wasn’t changed in the YAML configuration. So I implemented some checks to see if the desired switch mode is already the current mode and don’t try to change it. Unfortunately the devlink dev eswitch show command seems to not work with different drivers (such as the intel ixgbe), but it works with mlx5_core.
Description
A good write-up on the VF-LAG initialization: https://www.stackhpc.com/vflag-kayobe.html
Checklist
make check
successfully.make check-coverage
).