title | authors | reviewers | creation-date | last-updated | |
---|---|---|---|---|---|
software bridge management |
|
15-02-2024 |
15-02-2024 |
When NIC is configured to switchdev mode, a VF representor net device is created for each VF on it. These representors are used by a software switch (OVS, Linux bridge) to control traffic and configure hardware offloads. The software bridge is an essential part of using NICs in switchdev mode.
sriov-network-operator can set switchdev mode for a NIC and create VFs on it, but it doesn't provide any functionality to create and configure software bridges.
This document contains a proposal to add limited support for software bridges configuration to the sriov-network-operator.
This feature assumes integration with ovs-cni and accelerated-bridge-cni.
Depends on switchdev and systemd modes refactoring feature.
SRIOV Legacy mode is no longer actively developed, and we need to encourage users to migrate to switchdev mode, which is actively developed and will continue to receive new features and improvements.
To promote switching to switchdev configurations, we need to provide a nice UX for the end user. This requires providing an easy way to configure software switches, which are prerequisites for NICs in switchdev mode.
- As a user, I expect that sriov-network-operator will install
ovs-cni
andaccelerated-bridge-cni
to hosts. - As a user, I want to create
OVSNetwork
CR, which will result in creation ofNetworkAttachmentDefinition
CR that usesovs-cni
and contains required resource request. - As a user, I want to create
BridgeNetwork
CR which will result in creation ofNetworkAttachmentDefinition
CR that usesaccelerated-bridge-cni
and contains the required resource requests. - As a user, I want to define configuration for software bridges inside the
SriovNetworkNodePolicy
CR and expect that the operator will create required bridges, configure them, and attach uplinks (physical functions).
- handle installation of
ovs-cni
andaccelerated-bridge-cni
- add
OVSNetwork
andBridgeNetwork
CRDs as an API for end users to simplify creation ofNetworkAttachmentDefinition
CR forovs-cni
andaccelerated-bridge-cni
- extend
SriovNetworkNodePolicy
CRD to support configuration of software bridges (bridge-level configuration) - extend
SriovNetworkNodeState
(spec and status) CRD to support configuration of software bridges (bridge-level configuration) - support configuration of software bridge in both modes (operator's
configurationMode
setting):daemon
andsystemd
- implementation should be compatible with Externally Managed PF feature
-
replace
SriovNetworkPoolConfig
CRD -
change API for host-level settings, e.g.
ovs-hw-offload
Note: we may need to extend this API to support additional options
-
add support for VF-lag use-case
- deploy
ovs-cni
andaccelerated-bridge-cni
with init containers ofsriov-network-config-daemon
Pod - define
OVSNetwork
andBridgeNetwork
CRDs and implement controllers for them which will createNetworkAttachmentDefinition
CRs - extend
SriovNetworkNodePolicy
andSriovNetworkNodeState
(spec and status) CRDs to support configuration of software bridges (bridge-level configuration)
Implementation should be compatible with the following workflows:
- Fully automatic workflow
- NIC configuration only flow (Externally managed bridge)
- Externally managed NIC flow
This workflow assumes that sriov-network-operator handles PFs and VFs configuration, creation and configuration of software bridge, announcement of SRIOV resources with device plugin and preparation of NetworkAttachmentDefinition
CR.
-
User creates
SriovNetworkNodePolicy
where:- eswitch mode set to
switchdev
- configuration for selected software bridge is defined
- eswitch mode set to
-
The operator populates
SriovNetworkNodeState
for matching nodes with PF and bridge configurations -
sriov-network-config-daemon
applies PF configuration, create and configure bridge, attach PF to the bridge, applies VF configurationNote 1: if the operator runs in the
systemd
mode then bridge creation should happen in thepre
phase.Note 2: we should create udev rule which will set
NM_UNMANAGED=1
for bridges created by the operator -
sriov-network-config-daemon
should report information about software bridges in the status field of theSriovNetworkNodeState
CR. -
SRIOV resources are announced by the Device plugin
-
User creates
OVSNetwork
orBridgeNetwork
CR to createNetworkAttachmentDefinition
CR, which relies on resources announced by the Device Plugin
Note: created software bridge should be removed during the PF configuration reset
This workflow is kept to support existing HW offloading use-case.
In this case, sriov-network-operator handles PFs and VFs configuration, announcement of SRIOV resources with device plugin and preparation of NetworkAttachmentDefinition
CR.
In some scenarios, it may be compatible with configurationMode: daemon
, but it is supposed to be used when the operator runs in systemd
mode.
-
User creates
SriovNetworkNodePolicy
where:- eswitch mode set to
switchdev
Note:
SriovNetworkNodePolicy
CR should not include bridge configuration - eswitch mode set to
-
The operator populates
SriovNetworkNodeState
for matching nodes with PF configurations -
pre
systemd service creates VFs and configure PF -
NetworkManager or systemd-networkd or environment-specific scripts create bridge
-
post
systemd service binds VFs to required driver and proceed with other configuration steps -
SRIOV resources are announced by the Device plugin
-
User creates
OVSNetwork
orBridgeNetwork
CR to createNetworkAttachmentDefinition
CR which relies on resource announced by the Device Plugin
Note: it is possible to use externally created NetworkAttachmentDefinition
CR that contains configuration for any CNI plugin that support
resources from NICs in switchdev mode.
Note 1: bridge is not removed during the PF configuration reset
Note 2: information about bridges is not reported in the status field of the SriovNetworkNodeState
CR
In this case, sriov-network-operator handles announcement of SRIOV resources with device plugin and preparation of NetworkAttachmentDefinition
CR.
-
User configures PFs, creates VFs, creates bridge and attaches PFs to the bridge with custom scripts.
-
User creates
SriovNetworkNodePolicy
where:- eswitch mode set to
switchdev
- numVFs set to be equal or less than amount of precreated VFs
externallyManaged
set to true for PFs
- eswitch mode set to
-
sriov-network-config-daemon
configure VFs -
SRIOV resources are announced by the Device plugin
-
User creates
OVSNetwork
orBridgeNetwork
CR to createNetworkAttachmentDefinition
CR, which relies on resources announced by the Device Plugin
Variable | Description |
---|---|
OVS_CNI_IMAGE |
contains full image name for ovs-cni |
ACCELERATED_BRIDGE_CNI_IMAGE |
contains full image name for accelerated-bridge-cni |
OVSDB_SOCKET_PATH |
path to the OVSDB socket, used to path to the sriov config daemon |
manageSoftwareBridges
- control state of the feature (default: false
)
--ovsdb-socket-path
- path to the OVSDB socket, defaults to /var/run/openvswitch/db.sock
--manage-software-bridges
- enables management of the OVS and linux bridges by the operator
// OVSNetworkSpec defines the desired state of OvsNetwork
type OVSNetworkSpec struct {
// Namespace of the NetworkAttachmentDefinition custom resource
NetworkNamespace string `json:"networkNamespace,omitempty"`
// OVS Network device plugin endpoint resource name
ResourceName string `json:"resourceName"`
// Capabilities to be configured for this network.
// Capabilities supported: (mac|ips), e.g. '{"mac": true}'
Capabilities string `json:"capabilities,omitempty"`
// IPAM configuration to be used for this network.
IPAM string `json:"ipam,omitempty"`
// name of the OVS bridge, if not set OVS will automatically select bridge
// based on VF PCI address
Bridge string `json:"bridge,omitempty"`
// Vlan to assign for the OVS port
Vlan uint `json:"vlan,omitempty"`
// Mtu for the OVS port
MTU uint `json:"mtu",omitempty`
// Trunk configuration for the OVS port
Trunk []*TrunkConfig `json:"trunk,omitempty"`
// The type of interface on ovs.
InterfaceType string `json:"interfaceType,omitempty"`
// MetaPluginsConfig configuration to be used in order to chain metaplugins
MetaPluginsConfig string `json:"metaPlugins,omitempty"`
}
// OVSTrunkConfig contains configuration for OVS trunk
type TrunkConfig struct {
MinID *uint `json:"minID,omitempty"`
MaxID *uint `json:"maxID,omitempty"`
ID *uint `json:"id,omitempty"`
}
// OvsNetworkStatus defines the observed state of OvsNetwork
type OvsNetworkStatus struct {
}
// OvsNetwork is the Schema for the ovsnetworks API
type OvsNetwork struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec OvsNetworkSpec `json:"spec,omitempty"`
Status OvsNetworkStatus `json:"status,omitempty"`
}
// BridgeNetworkSpec defines the desired state of BridgeNetwork
type BridgeNetworkSpec struct {
// Namespace of the NetworkAttachmentDefinition custom resource
NetworkNamespace string `json:"networkNamespace,omitempty"`
// OVS Network device plugin endpoint resource name
ResourceName string `json:"resourceName"`
// Capabilities to be configured for this network.
// Capabilities supported: (mac|ips), e.g. '{"mac": true}'
Capabilities string `json:"capabilities,omitempty"`
// IPAM configuration to be used for this network.
IPAM string `json:"ipam,omitempty"`
// name of the Linux bridge, if not set will automatically select bridge
// based on VF PCI address
Bridge string `json:"bridge,omitempty"`
// VLAN ID for VF
Vlan uint `json:"vlan,omitempty"`
// VLAN Trunk configuration
Trunk []TrunkConfig `json:"trunk,omitempty"`
// enable setting matching vlan tags on the bridge uplink interface, default is false
SetUplinkVlan bool `json:"setUplinkVlan,omitempty"`
// MTU for VF and representor
MTU uint `json:"mtu,omitempty"`
// MetaPluginsConfig configuration to be used in order to chain metaplugins
MetaPluginsConfig string `json:"metaPlugins,omitempty"`
}
// BridgeNetworkStatus defines the observed state of BridgeNetwork
type BridgeNetworkStatus struct {
}
// BridgeNetwork is the Schema for the ovsnetworks API
type BridgeNetwork struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec BridgeNetworkSpec `json:"spec,omitempty"`
Status BridgeNetworkStatus `json:"status,omitempty"`
}
//+kubebuilder:object:root=true
// BridgeNetworkList contains a list of BridgeNetwork
type BridgeNetworkList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []BridgeNetwork `json:"items"`
}
// SriovNetworkNodePolicySpec defines the desired state of SriovNetworkNodePolicy
type SriovNetworkNodePolicySpec struct {
// ...existing fields...
// contains spec for the software bridge
Bridge Bridge `json:"bridge,omitempty"`
}
// contains spec for the bridge
// only one bridge type can be set
type Bridge struct {
// contains optional config for OVS bridge
Ovs *OVSConfig `json:"ovs,omitempty"`
// contains optional config for Linux bridge
Linux *LinuxBridgeConfig `json:"linux,omitempty"`
}
// OVSConfig optional configuration for OVS bridge and uplink Interface
type OVSConfig struct {
// contains bridge level settings
Bridge OVSBridgeConfig `json:"bridge,omitempty"`
// contains settings for uplink (PF)
Uplink OVSUplinkConfig `json:"uplink,omitempty"`
}
// OVSBridgeConfig contains some options from the Bridge table in OVSDB
type OVSBridgeConfig struct {
DatapathType string `json:"datapathType,omitempty"`
ExternalIDs map[string]string `json:"externalIDs,omitempty"`
OtherConfig map[string]string `json:"otherConfig,omitempty"`
}
// OVSUplinkConfig contains PF interface configuration for the bridge
type OVSUplinkConfig struct {
Interface OVSInterfaceConfig `json:"interface,omitempty"`
// can be extended to support OVSPortConfig which will include
// settings from the OVS Port table
}
// OVSInterfaceConfig contains some options from the Interface table of the OVSDB for PF
type OVSInterfaceConfig struct {
Type string `json:"type,omitempty"`
Options map[string]string `json:"options,omitempty"`
ExternalIDs map[string]string `json:"externalIDs,omitempty"`
OtherConfig map[string]string `json:"otherConfig,omitempty"`
}
// LinuxBridgeConfig optional configuration for Linux bridge and uplink interface
type LinuxBridgeConfig struct {
Bridge BridgeConfig `json:"bridge,omitempty"`
Uplink map[string]string `json:"uplink,omitempty"` // TODO clarify required settings
}
// BridgeConfig contains some options for linux bridge
type BridgeConfig struct {
VlanFiltering bool `json:"vlanFiltering,omitempty"`
// +kubebuilder:validation:Enum=802.1Q;802.1ad
VlanProtocol string `json:"vlanProtocol,omitempty"`
}
Note 1: multiple policies can match single PF (vf range use-case), bridge settings from the policy with
the higher priority (which is the one with the lowest .spec.priority
value) will be applied to PF.
Note 2: multiple NICs can match the same policy on a host. In this case a separate bridge will be created for each NIC.
type SriovNetworkNodeStateSpec struct {
// ...existing fields...
Interfaces Interfaces `json:"interfaces,omitempty"`
Bridges Bridges `json:"bridges,omitempty"`
}
// SriovNetworkNodeStateStatus defines the observed state of SriovNetworkNodeState
type SriovNetworkNodeStateStatus struct {
Interfaces InterfaceExts `json:"interfaces,omitempty"`
Bridges Bridges `json:"bridges,omitempty"`
SyncStatus string `json:"syncStatus,omitempty"`
LastSyncError string `json:"lastSyncError,omitempty"`
}
// Bridges contains list of bridges
type Bridges struct {
OVS []OVSConfigExt `json:"ovs,omitempty"`
Linux []LinuxBridgeConfigExt `json:"linux,omitempty"`
}
type OVSConfigExt struct {
// name of the bridge
Name string `json:"name"`
// bridge-level configuration for the bridge
Bridge OVSBridgeConfig `json:"bridge,omitempty"`
// uplink-level bridge configuration for each uplink(PF).
// in the initial implementation will always contain one element
Uplinks []OVSUplinkConfigExt `json:"uplinks,omitempty"`
}
type OVSUplinkConfigExt struct {
// pci address of the PF
PciAddress string `json:"pciAddress"`
// name of the PF interface
Name string `json:"name,omitempty"`
// configuration from the Interface OVS table for the PF
Interface OVSInterfaceConfig `json:"interface,omitempty"`
}
type LinuxBridgeConfigExt struct {
Name string `json:"name"`
Bridge BridgeConfig `json:"bridge,omitempty"`
Uplinks []LinuxBridgeUPlinkConfigExt `json:"uplinks,omitempty"`
}
type LinuxBridgeUPlinkConfigExt struct {
PciAddress string `json:"pciAddress"`
Name string `json:"name,omitempty"`
Uplink map[string]string `json:"uplink,omitempty"`
}
type VirtualFunction struct {
// ...existing fields...
// contains VF representor name for NICs in switchdev mode
RepresentorName string `json:"representorName,omitempty"`
}
SriovNetworkNodeState.spec
and SriovNetworkNodeState.status
should be extended to contain the same Bridges
struct.
Note: The Bridges
struct in the SriovNetworkNodeState.status
can later be extended based on user feedback
to report additional information required to improve UX.
The feature is only supported on baremetal clusters
The proposed implementation requires changes in ovs-cni
and accelerated-bridge-cni
. We need to change their behavior when deviceID
argument is provided in CNI ARGS.
If deviceID
is set and bridge
arg is empty, the cni plugin should try to automatically select the right bridge by following the chain:
VF (PCI address is in deviceID
arg) > PF > Bond (if PF is part of the bond) > Bridge
Note: accelerated-bridge-cni
already has similar logic, but now it selects the bridge from the predefined list of bridges.
The feature assumes phased implementation.
Add support for NIC configuration only flow and Externally managed NIC flow for Open vSwitch
Requirements:
- add bridge auto-selection logic to
ovs-cni
- define
OVSNetwork
CRD and implement controller for it
Add support for Fully automatic workflow for Open vSwitch
Requirements:
- requirements from phase 1
- extend
SriovNetworkNodePolicy
andSriovNetworkNodeState
CRDs to support configuration for ovs (bridge-level configuration) - modify code:
- add support for ovs bridges creation
- add support for reporting information about configured ovs bridges on the node
- add support for removing auto-created ovs bridges during the PF reset
Add support for NIC configuration only flow, Externally managed NIC flow and Fully automatic workflow for Linux bridge
Requirements:
- add bridge auto-selection logic to
accelerated-bridge-cni
- define
BridgeNetwork
CRD and implement controller for it - extend
SriovNetworkNodePolicy
andSriovNetworkNodeState
CRDs to support configuration for linux bridge (bridge-level configuration) - modify code:
- add support for linux bridges creation
- add support for reporting information about configured linux bridges on the node
- add support for removing auto-created linux bridges during the PF reset
This feature doesn't contain any breaking changes. Automatic upgrades should be safe and will not require any manual steps.
Downgrading without PF configuration reset may be problematic and may keep the node in an inconsistent state. It is recommended to reset PFs attached to bridges first and then do a downgrade.
New functionality should be covered with unit tests.
Manual and automatic e2e testing will require hardware with NICs that support switchev mode and hardware offloading for software bridges.
type SriovNetworkPoolConfigSpec struct {
// OvsHardwareOffloadConfig describes the OVS HWOL configuration for selected Nodes
OvsHardwareOffloadConfig OvsHardwareOffloadConfig `json:"ovsHardwareOffloadConfig,omitempty"`
// NodeSelector only valid for the fields below
NodeSelector map[string]string `json:"nodeSelector,omitempty"`
Bridges []BridgeConf `json:"bridges,omitempty"`
}
type BridgeConf struct {
// same configuration as in the main option
Bridge *Bridge `json:"bridge"`
// NicSelector uses the same type as SriovNetworkNodePolicySpec
NicSelector SriovNetworkNicSelector `json:"nicSelector"`
}
The main problem with this option is that we can achieve the reliable scheduling of workloads only in a complicated way.
The scheduler considers information from the device plugins to ensure that required resources are available on the host before putting workloads on it.
In the main option outlined in this doc, a configuration of the bridge is a part of the SriovNetworkNodePolicy
and that means that the bridge is for sure available on the host if the host announces resource name defined in SriovNetworkNodePolicy
.
In that alternative option, there is no warranty that NodeSelector + NicSelector for a configuration of bridges is in sync with the NodeSelector + NicSelectors from a policy, so we can't rely on the sriov resource name from the policy to do a reliable scheduling - host can have SRIOV VFs, but may miss a bridge.
To solve this problem, we can create a device plugin, which will expose information about available bridges in the form of resources,
e.g., <prefix>/<bridge_type>_<pool_config_name>
. A user must explicitly request SRIOV + bridge resources while creating a Pod.