Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SR-IOV improvements (VF-LAG) #439

Merged
merged 7 commits into from
Feb 28, 2024
Merged

Conversation

daniloegea
Copy link
Collaborator

@daniloegea daniloegea commented Jan 25, 2024

Main points of this PR:

  • The udev rule logic was moved to a systemd service that will call netplan apply --sriov-only a single time once all the devices are available. Previously, this command was being called in parallel for all the PFs found in the system (as far as I can tell, it’s not called for each VF).

  • The netplan-sriov-rebind.service file will only run when netplan-sriov-apply.service is completed. Although, it’s not enough to fix the VF LAG activation. The rebind operation must monitor the state of the VF LAG feature and only when it’s active VF can be bound to the driver.

  • netplan rebind was changed to support running specific quirks. For the Mellanox case, it will check if the PFs are associated with a bond in Netplan and, if it is, it will monitor the mlx5 lag state file from the debugfs interface and wait until it’s active (or timeout). Only when it’s active it will proceed and call the bind() method. This seems to be enough to activate the VF LAG feature reliably during boot. It will also check if the bond mode is one of the VF LAG supported modes.

  • When the system is up and running and VF LAG is activated, calling netplan apply will wrack havoc with the SR-IOV configuration. It happens because the code will try to change the eswitch mode with devlink even if it wasn’t changed in the YAML configuration. So I implemented some checks to see if the desired switch mode is already the current mode and don’t try to change it. Unfortunately the devlink dev eswitch show command seems to not work with different drivers (such as the intel ixgbe), but it works with mlx5_core.

Description

A good write-up on the VF-LAG initialization: https://www.stackhpc.com/vflag-kayobe.html

Checklist

  • Runs make check successfully.
  • Retains 100% code coverage (make check-coverage).
  • New/changed keys in YAML format are documented.
  • (Optional) Adds example YAML for new feature.
  • (Optional) Closes an open bug in Launchpad.

Copy link

@mkalcok mkalcok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @daniloegea, thanks for figuring out resolution to the multiple calls of netplan apply --sriov-only. I only have two suggestions here.

One is that we should check if the bond type is one of the supported types for VF LAG. According to the NVIDIA docs it must be either Active-Backup, Balance-XOR or LACP.

Second suggestion is in the in-lined comment.

netplan_cli/cli/commands/sriov_rebind.py Outdated Show resolved Hide resolved
@daniloegea daniloegea force-pushed the sriov_udev_and_rebind branch 2 times, most recently from 1ab7b52 to 28008e1 Compare January 29, 2024 14:24
Copy link
Collaborator

@slyon slyon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! This approach is looking good to me and I think we should continue this path.
I've added a bunch of inline comments already. But let's revisit a full review once the branch is cleaned up and has improved testing.

Note: Out-of-band we were also discussing the use of https://docs.kernel.org/networking/devlink/netdevsim.html to simulate SR-IOV devices in order to run integration tests. This would need a bigger refactoring though, as we cannot access those simulated devices through PCI. IMO, we should postpone this refactoring and additional testing to a later point in time and independent PR. Let's fix the issues at hand first.

src/sriov.c Outdated Show resolved Hide resolved
src/sriov.c Outdated Show resolved Hide resolved
tests/generator/test_ethernets.py Show resolved Hide resolved
netplan_cli/cli/sriov.py Outdated Show resolved Hide resolved
netplan_cli/cli/sriov.py Outdated Show resolved Hide resolved
netplan_cli/cli/commands/sriov_rebind.py Outdated Show resolved Hide resolved
Comment on lines +110 to +144
The mlx5 driver added support for debugfs in https://github.com/torvalds/linux/commit/7f46a0b7327a
It's available since kernel 5.19 https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.19
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: Let's ask the openstack team if there is any better (more stable) interface to query this information, which would potentially also be available on older Kernels. If not, then debugfs is the best we can do.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They weren't aware of anything better during our last meeting.

netplan_cli/cli/commands/sriov_rebind.py Outdated Show resolved Hide resolved
netplan_cli/cli/commands/sriov_rebind.py Outdated Show resolved Hide resolved
src/sriov.c Outdated Show resolved Hide resolved
@daniloegea daniloegea force-pushed the sriov_udev_and_rebind branch 5 times, most recently from b4613d9 to 7a18c01 Compare February 9, 2024 10:24
@daniloegea daniloegea marked this pull request as ready for review February 9, 2024 11:36
@daniloegea daniloegea requested a review from slyon February 9, 2024 11:37
@daniloegea
Copy link
Collaborator Author

daniloegea commented Feb 9, 2024

Hi Lukas, thanks for looking at it. I think it's ready for another review. Please look at all the commits again as I did some significant changes and added new commits.

I also prepared a PPA for Mantic with all these patches https://launchpad.net/~danilogondolfo/+archive/ubuntu/netplan-sriov

Copy link
Collaborator

@slyon slyon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!

I left a few nitpicks inline, nothing urgent. I assume you tested in on the ConnectX-5 hardware we used for development, so should be fine.

Still, it would be great to get confirmation from @mkalcok that this is working as expected, using the PPA provided by Danilo.

tests/test_sriov.py Outdated Show resolved Hide resolved
Comment on lines +110 to +144
The mlx5 driver added support for debugfs in https://github.com/torvalds/linux/commit/7f46a0b7327a
It's available since kernel 5.19 https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.19
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They weren't aware of anything better during our last meeting.

netplan_cli/cli/commands/sriov_rebind.py Outdated Show resolved Hide resolved
netplan_cli/cli/commands/sriov_rebind.py Outdated Show resolved Hide resolved
src/sriov.c Show resolved Hide resolved
Add a new internal function to expose the bond mode from
netdef.bond_params.mode and a Python binding for it.
netplan apply will be called multiple times in parallel when the udev
rules matches multiple device. That might cause problems during boot.

Move the call to "netplan apply --sriov-only" to a service so it will be
called just once for all the interfaces. The sriov-rebind service must
wait for this new service to finish before start running.

This change was tests with real hardware and appears to work fine.
The eswitch mode will be modified with devlink regardless its current
mode. It causes issues when VF LAG is enabled in the PF. This particular
problem will only happen on Mellanox interfaces (hopefully) due to the
VF LAG feature.

Retrieving the eswitch mode with devlink seems to not work on all NICs.
If it fails to get the information, the new method will return
'undetermined' and we set the eswitch mode anyway.

In a setup with Mellanox NICs with eswitch mode set to switchdev and VF
LAG enabled, netplan apply will break the setup even if there is nothing
to change in the SR-IOV NICs. This change will prevent it to happen.
When a PF is part of a bond interface, the mlx5 driver will try to
activate the VF LAG feature. While it happens, VFs can't be bound or the
process will fail.

This change adds a verification step to netplan rebind so, if the PF is
part of a bond, it will wait for a few seconds until the VF LAG feature
is reported as active by the driver.

This is done through the debugfs interface provided by the mlx5 driver
in newer version of the kernel (5.19+).
All the subcommands support the parameter --debug but it's not really
implemented (only the global netplan --debug is).

The main reason for that is because the global --debug also enables glib
debugging messages and I'd like to not have all that noise in the output
of netplan rebind as --debug will be used by default in the systemd
service unit.

But the real reason for that is because it's really tricky to get the
messages from the root logger from stderr in unit tests. Changing the
root logger to log them to stdout will break many tests.
And add a few more tests and delete tests that don't make sense anymore.
@mkalcok
Copy link

mkalcok commented Feb 21, 2024

Today I tested this change with the provided PPA and it seems to work as expected. The only concern that I have, and it may be caused by something outside of Netplan's control, is the delays that I see between individual steps. Following is a snippet from dmesg:

[  126.053468] pci 0000:03:0f.5: [15b3:1018] type 00 class 0x020000
[  126.053554] pci 0000:03:0f.5: enabling Extended Tags
[  126.054920] pci 0000:03:0f.5: Adding to iommu group 218
[  126.055293] mlx5_core 0000:03:0f.5: enabling device (0000 -> 0002)
[  126.055413] mlx5_core 0000:03:0f.5: firmware version: 16.35.2000
[  126.366847] mlx5_core 0000:03:0f.5: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[  126.382849] mlx5_core 0000:03:0f.5: Assigned random MAC address 72:03:0c:2f:38:0c
[  126.567068] mlx5_core 0000:03:0f.5: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 basic)
[  126.580817] mlx5_core 0000:03:0f.5 enp3s0f1v59: renamed from eth0
# ^Last VF being un-bound

[  222.821313] mlx5_core 0000:03:00.0: E-Switch: Disable: mode(LEGACY), nvfs(60), necvfs(0), active vports(61)
[  224.755201] mlx5_core 0000:03:00.0: E-Switch: Supported tc chains and prios offload
[  226.392238] mlx5_core 0000:03:00.0 enp3s0f0np0: Link down
[  226.396783] mlx5_core 0000:03:00.0 enp3s0f0np0: Dropping C-tag vlan stripping offload due to S-tag vlan
[  226.396788] mlx5_core 0000:03:00.0 enp3s0f0np0: Disabling HW_VLAN CTAG FILTERING, not supported in switchdev mode
[  232.743648] mlx5_core 0000:03:00.0 enp3s0f0np0: Link up
[  233.703233] mlx5_core 0000:03:00.0: E-Switch: Enable: mode(OFFLOADS), nvfs(60), necvfs(0), active vports(60)
# ^ PF1 switched to e-switch mode

[  320.650100] mlx5_core 0000:03:00.1: E-Switch: Disable: mode(LEGACY), nvfs(60), necvfs(0), active vports(61)
[  322.486716] mlx5_core 0000:03:00.1: E-Switch: Supported tc chains and prios offload
[  323.173694] mlx5_core 0000:03:00.1 enp3s0f1np1: Link down
[  323.179134] mlx5_core 0000:03:00.1 enp3s0f1np1: Dropping C-tag vlan stripping offload due to S-tag vlan
[  323.179138] mlx5_core 0000:03:00.1 enp3s0f1np1: Disabling HW_VLAN CTAG FILTERING, not supported in switchdev mode
[  328.352086] mlx5_core 0000:03:00.1 enp3s0f1np1: Link up
[  330.213742] mlx5_core 0000:03:00.1: E-Switch: Enable: mode(OFFLOADS), nvfs(60), necvfs(0), active vports(60)
# ^ PF2 switched to e-switch mode

So there's about 100 seconds between last VF getting un-bound and first PF switching to eswitch mode, then another ~90 seconds before second PF swithces to eswitch mode.
I don't see any obvious reason from the code why it should take this long, do you have any thoughts on this?

@daniloegea
Copy link
Collaborator Author

Hi Martin, thanks a lot for testing it.

Do you see anything useful in the netplan-sriov-rebind.service and netplan-sriov-apply.service journals? I enabled debugging by default in netplan-sriov-rebind.service.

I'm not completely sure about how long each operation takes to complete. But I believe it's just how long the driver takes to unblock the write operations to the sysfs files and it seems to be piling up...

@mkalcok
Copy link

mkalcok commented Feb 22, 2024

I'll double-check @daniloegea I suspect that this delay is not coming from netplan. We are double checking actual connectivity via the configured bond, I'll get back to you as soon as we confirm that everything works.

I have one more question that might be bit outside of scope of this change, but is there a way to wait until netplan is done with this setup? Particularly at boot, we'll need a way to tell nova (or other hypervisor) to wait until netplan is finished, otherwise it will try to start VMs and bind VFs to them. This would either fail because the VFs are not ready yet, or it will screw up with VF-LAG setup because it will bind VFs that are supposed to be unbound.

@slyon
Copy link
Collaborator

slyon commented Feb 22, 2024

I assume we should be able to sort a system unit after the corresponding Netplan services to make that work?

After=netplan-sriov-apply.service
After=netplan-sriov-rebind.service

How and when is nova started? Is it controlled through systemd?

@slyon
Copy link
Collaborator

slyon commented Feb 27, 2024

Looks like we might want some DefaultDependencies=no changes in the newly generated services, too.

When activating VF LAG and at the same time changing the e-switch mode
to switchdev, the e-switch change MUST be performed before the bonding.
Trying to change it after the bond is created will fail and the driver
will report that the e-switch is busy.

In fact, all the SR-IOV initial setup must happen before the networking
configuration.

Change the order in which we call netplan apply --sriov-only to happen
before network-pre.target.

Also, add DefaultDependencies=no to the service to prevent the creation
of dependency cycles.
@daniloegea
Copy link
Collaborator Author

I added the DefaultDependencies=no. It fixes the dependency cycles found during tests.

@slyon slyon changed the title SR-IOV improvements SR-IOV improvements (VF-LAG) Feb 28, 2024
@slyon
Copy link
Collaborator

slyon commented Feb 28, 2024

The openstack team confirmed that their VMs come up with VF-LAG enabled, using this branch. So let's get it merged.

@slyon slyon merged commit 32b19b8 into canonical:main Feb 28, 2024
14 of 15 checks passed
@fnordahl
Copy link
Member

We have had requests from users on Ubuntu Jammy that are interested in these improvements to resolve issues with consistent upbringing of VFs, embedded switch mode and VF LAG on their systems.

Are there any plans to bring Netplan 1.0, or to backport these fixes, to Ubuntu Jammy?

cc @narindergupta

@slyon
Copy link
Collaborator

slyon commented Apr 29, 2024

Are there any plans to bring Netplan 1.0, or to backport these fixes, to Ubuntu Jammy?

Backporting the isolated fixes might be an option. Would you mind creating a bug report on the Ubuntu package about this? https://launchpad.net/ubuntu/+source/netplan.io

Our full-version backport policy is usually to support the latest Ubuntu LTS, which would be Noble in this case. Porting 1.0 back to Jammy is especially tricky due to the breaking ABI changes and SOVER bump of libnetplan0 -> libnetplan1, introducing a transition in a stable release.

@slyon slyon added the stable Might be merged in a stable branch label Sep 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stable Might be merged in a stable branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants