Add ovs-hardware-offload enhancement #296

Closed

Conversation

zshi-redhat
Contributor

No description provided.

daemon and SR-IOV hardware NIC.

- OpenvSwitch daemon is managed by Cluster Network Operator (CNO) and runs in
`ovs-node` DaemonSet. A boolean configurable field will be added in CNO

Is this boolean node-specific? AFAIK, it is not mandatory for all nodes to have offload, and only OVS running on those specific nodes should have the offload config enabled.

Contributor Author

It will be a cluster-wide configuration for ovn-kubernetes.
I think it's worth discussing whether we want per-node or per-role control of offload enablement.
One possible way of doing per-role control is to add a pool selector in spec.defaultNetwork.ovnKubernetesConfig to specify which MachineConfigPool shall be enabled with hw-offload, as sketched below.
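A minimal sketch of that option, assuming purely hypothetical field names (hwOffload and hwOffloadPoolSelector are not an agreed API, and the pool label is only an example):

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      # hypothetical fields, shown only to illustrate the pool-selector idea
      hwOffload: true
      hwOffloadPoolSelector:
        matchLabels:
          machineconfiguration.openshift.io/role: worker-offload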

Contributor

Is it possible to make it pod-specific? Can offload and non-offload interfaces work in parallel in one OVS instance?

Contributor Author

OVS hw offload will be supported on the primary pod interface with an SR-IOV device. Having a second interface in one OVS instance implies that ovn-k8s-cni-overlay is used for an additional pod interface, which is not considered in this enhancement.
Traffic will only be offloaded when an SR-IOV VF (in switchdev mode) is attached to the pod's primary network, so OVS hw offload can be requested on a per-pod basis by using a customized default net-attach-def (see the sketch below).
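As a rough illustration (the net-attach-def name `switch`, the namespace, and the image are examples, not part of the proposal), a pod would opt in through the Multus default-network annotation:

apiVersion: v1
kind: Pod
metadata:
  name: offload-example
  annotations:
    # points the pod's default (primary) network at the customized net-attach-def
    v1.multus-cni.io/default-network: default/switch
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest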


Also, the configuration is global to OVS, so all datapath flows that the NIC vendor supports for hw offload will be offloaded.

Contributor Author

@moshe010 For a non-switchdev pod that uses a veth pair as its primary interface, how would OVS decide not to offload the traffic? For example, does it check whether the hw-tc-offload feature is enabled or not on the attached device (veth or VF representor)?

Contributor Author

Rethinking this, it might make sense to enable OVS hw offload for a specific MachineConfigPool instead of the whole cluster, to limit the potential impact of "the whole cluster is down due to a hw-offload failure" and to account for "not all cluster nodes have hw-offload capable NICs".

@pliurh @krsacme @moshe010 @danwinship wdyt?


Yes, it should be configurable on a node group rather than the whole cluster, to allow mixed capability in the cluster.

@adrianchiris Aug 19, 2020

@moshe010 For a non-switchdev pod that uses a veth pair as its primary interface, how would OVS decide not to offload the traffic? For example, does it check whether the hw-tc-offload feature is enabled or not on the attached device (veth or VF representor)?

When OVS is configured with hardware offload, it uses TC as its default datapath. If the DP rule cannot be handled in TC, then the OVS kernel DP is used.

In TC, if the underlying device is capable of hardware offload, TC will try to offload the rule to the device (unless otherwise specified via tc-policy, i.e. skip-sw / skip-hw). For a veth device, which is not capable of hardware offloads, the flow will not be offloaded (but would still be handled by TC).

+1 on a per-node-group configuration for this. I would not move the entire cluster to OVS with the TC datapath until it has been properly validated.

Contributor Author

When OVS is configured with hardware offload, it uses TC as its default datapath. If the DP rule cannot be handled in TC, then the OVS kernel DP is used.

In TC, if the underlying device is capable of hardware offload, TC will try to offload the rule to the device (unless otherwise specified via tc-policy, i.e. skip-sw / skip-hw). For a veth device, which is not capable of hardware offloads, the flow will not be offloaded (but would still be handled by TC).

I was thinking of having the default tc-policy configured to skip-sw, so that OVS does either hw offload or the kernel datapath until the TC datapath is properly validated, and of exposing tc-policy as a configurable option in case people want to try the TC software datapath.

+1 on a per-node-group configuration for this. I would not move the entire cluster to OVS with the TC datapath until it has been properly validated.

To identify a group of nodes, we have two options:
one is to use an existing MachineConfigPool; the other is to define customized labels for a node group.
The first may not contain exactly the hw-offload capable nodes (some nodes may have a capable NIC, others may not, as the pool is not grouped according to hw-offload capability). The second would allow us to accurately define a hw-offload capable node pool. I think the second is what we should do.

is `false`.

- Setting `tc-policy` in CNO `ovs-node` DaemonSet is equivalent to executing
`ovs-vsctl set Open_vSwitch . other_config:tc-policy=skip_sw`. The default

The default value `none` uses both: it writes the flow to software and, if the hardware supports it, to hardware as well. Any particular reason to use skip-sw?

Contributor Author

No particular reason; this is just an example of tc-policy. It will be configurable by the cluster admin.

Contributor

Can we just set hw-offload=true and tc-policy= on all the nodes by default, and not expose the knob to users? Is there any drawback to using tc-flower on a node without hardware offloading?

Contributor Author

OVS hw offload is experimental in the OVS project and is not yet GA-supported on RHEL; until then, I think we may want to have it disabled by default or controlled per cluster.

If tc-policy is configured to allow both sw and hw, the traffic can fall back to the OVS software datapath if a flow cannot be offloaded to hw.

Contributor Author

@krsacme Using skip-sw allows OVS to fall back to the OVS kernel datapath (the original datapath used on OpenShift) on hw-offload failure; this excludes the TC software datapath. There are uncertainties about using the TC software datapath: is it mature enough? What does the performance look like compared with the OVS kernel datapath? What is the debuggability like?


There are uncertainties about using the TC software datapath

I don't know of uncertainties in using TC, but it would be recommended to use the configuration that offload is validated with. OVS defaults to TC rather than the OVS kernel datapath when offload is enabled; Mellanox is in a better position to answer this.

What does the performance look like compared with the OVS kernel datapath?

There is no analysis that I know of comparing "none" and "skip_sw". It would be better to do that in order to choose the mode.

Contributor Author

There are uncertainties about using the TC software datapath

I don't know of uncertainties in using TC, but it would be recommended to use the configuration that offload is validated with. OVS defaults to TC rather than the OVS kernel datapath when offload is enabled; Mellanox is in a better position to answer this.

Would the kernel be the default datapath type if not specified? Or has it changed to use TC as the default, and in which OVS version?


But it would be recommended to use the configuration that offload is validated with. OVS defaults to TC rather than the OVS kernel datapath when offload is enabled; Mellanox is in a better position to answer this.

When hw-offload is enabled, the default datapath is TC. We recommend using tc-policy=none, which is how this feature is validated when SR-IOV is used.

Contributor Author

Is there any known issue when using tc-policy=skip-sw, or is it just not validated?


Is there any known issue when using tc-policy=skip-sw, or is it just not validated?

It's not validated.


#### SR-IOV Operator

- SR-IOV Operator will support new device type `switchdev` in Operator

Are any udev rules created in order to retain the name of the PF device? After changing the mode to switchdev, the PF device's name will change. Since the VF representors are already renamed, there is no need for udev rules for them, but PF naming has to be managed appropriately.

Contributor Author

Yes, the PF will be renamed to its original name when provisioned to switchdev mode, using udev rules. Otherwise, the bond interface (ifcfg-bond) will not be able to find the right interface.
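For context, a policy selecting switchdev mode as proposed here might look roughly like the sketch below; the resource name, NIC selector, and the `deviceType: switchdev` value are assumptions based on this enhancement, and the final API shape may differ:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: switchdev-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: switchdev_network
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4
  nicSelector:
    pfNames: ["ens1f0"]
  # proposed new device type; the exact field name/value may differ in the final API
  deviceType: switchdev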


In OpenStack we do have udev rules for the VF representors, but in ovn-kubernetes we rename the VF representor to the container ID, the same as a veth. It may be better to use udev rules and keep the VF representor names consistent, because it is easier to debug, and also on a driver restart the kernel restores the old name and I am not sure what OVS would then see. We also need udev to disable NetworkManager, as it can remove the tc qdisc; see [1].
[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1815060

Contributor Author

@moshe010 if using shared-gateway mode, do you think we still have this issue (NetworkManager restarting the interface, which causes qdisc rules to be removed)?


I think so. We solve it in OpenStack by adding a udev rule [1] to unmanage the VF representors.

[1] - https://github.com/openstack/os-net-config/blob/master/os_net_config/sriov_config.py#L355-L359


I think so. In OpenStack we add a udev rule to disable NM on the VF representors [1].

[1] - https://github.com/openstack/os-net-config/blob/master/os_net_config/sriov_config.py#L355-L359

Contributor Author

Added a sentence in the config daemon section: it will create/apply udev rules for both the PF device and the VF representors.

- SR-IOV Operator will support new device type `switchdev` in Operator
[SriovNetworkNodePolicy API](https://github.com/openshift/sriov-network-operator/blob/master/pkg/apis/sriovnetwork/v1/sriovnetworknodepolicy_types.go#L33).

- SR-IOV config daemon (a DaemonSet managed by SR-IOV Operator) will support

The switchdev mode configuration has to be applied again on node reboot (along with creating the VFs). How is this achieved?

Contributor Author

The SR-IOV config daemon will reconcile the existing policies when it is launched after a reboot.
It should also apply the configuration before the network service starts (when the bond interface is created).

Contributor

Just for reference, the Mellanox team has proposed a PR in SNO for this feature: openshift/sriov-network-operator#75

Contributor Author

A systemd service will be used to apply the switchdev mode config as well as the number of VFs. This is to guarantee the ordering: numVfs -> switchdev config -> bond config -> enslave bond in OVS.


- Multus supports overwriting cluster network definition by specifying a
customized default net-attach-def in pod annotation. Customized net-attach-def
will still use `ovn-k8s-cni-overlay` CNI plugin, but with an special annotation
that requests a SR-IOV device.
Contributor

I think it may be better if we could have an admission controller do the modification automatically, which would make it transparent to users.

Contributor Author

There needs to be a place where the user can request a hw-offload pod/interface. Currently the design relies on the user specifying the default cluster-network annotation; how would the admission controller know whether the user wants a hw-offload pod?

Contributor

We can use a pod annotation to indicate whether a pod needs HW offloading.

Contributor

The wording here is not clear about who is writing this annotation and why (i.e., that it needs to be specified on the pods that will have offload, and that any pod that doesn't have the annotation won't get hw offload). I agree with Peng that we shouldn't require the admin to write out a complete (and correct) net-attach-def. There should be some simple way to declare it.

Does this need any security / authorization associated with it?

Contributor Author

Re the wording, I'll update it and add the owner of each action to make it clear who does what.

The current pod creation flow is the same as how a normal SR-IOV pod is created: the user first defines an SR-IOV net-attach-def via the SR-IOV Operator API, then specifies the net-attach-def in the pod annotation. If we were to add another way to declare the use of an OVS hw-offload pod, it would be inconsistent. The customized net-attach-def for OVS hw offload is, in my opinion, not complex, as shown below:

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: switch
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/switchdev_network
spec:
  config: '{"cniVersion":"0.3.1","name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay","ipam":{},"dns":{}}'

It has the same spec.config body as the default net-attach-def in the Multus CNI config; the only difference is the annotation, which varies per resource name (chosen by the cluster network admin). So it will not be a fixed net-attach-def that can be pre-defined. But I agree that, with the help of an admission controller (whether it's multus or sriov), this could be achieved by adding a different pod annotation to declare the use of OVS hw offload, containing the desired SR-IOV resource name; the admission controller would then inspect each pod's annotations and create and attach the customized net-attach-def for the ovs-hw-offload pod.

Re security / authorization, it will have the same security/authorization as applied to a normal net-attach-def, such as namespace isolation. An extra security check may need to be added for the customized net-attach-def to make sure its body is not configured with any unexpected values in ipam/dns.

Contributor Author
@zshi-redhat Aug 17, 2020

Updated so that the customized net-attach-def is created automatically by the sriov-operator, which knows the switchdev resource name to use in the net-attach-def. With this, the user only needs to add the default-network annotation to the pod and does not need to create the customized net-attach-def themselves.


- Multus will support specifying net-attach-def with customized annotations for
[default cluster network in pod annotations](https://github.com/intel/multus-cni/blob/master/doc/configuration.md#specify-default-cluster-network-in-pod-annotations).
Customized annotations will contain a resource name requesting SR-IOV device.
Contributor

Same as above, maybe we can create an admission controller to mutate the pods automatically.



- Mellanox SR-IOV VF-LAG provides hardware link-aggregation (LAG) for hardware
offloaded ports. It allows offloading of OVS rules to a linux `bond` network
device that combines two idential and hardware offload capable Physical
Functions (PFs). In order to use SR-IOV VF-LAG, a bond interface needs to be
Contributor

It's not clear if you're describing a required implementation detail of OVS offload ("you have to create a bond device even if you aren't using bonding") or if you're describing how OVS offload and bonding interact with each other. ("If you're using bonding and want OVS offload too, then you have to set things up like THIS, rather than like THAT.")

Contributor Author

Updated, please take a second look.
It now starts with: To use OVS Hardware Offload with link redundancy, a bond interface needs to be created and used as the geneve endpoint of the OVS bridge.

### Test Plan

- The changes in each relevant component must pass full e2e suite.
- OVS hardware offload functional tests must pass on supported NICs.
Contributor

What are these "OVS hardware offload functional tests"?

So QE will have hardware to test this with? Do they have this already?

We don't plan to do full e2e runs on clusters with hardware offload configured?

Contributor Author

There are no defined functional tests in any of the OVS hw-offload related repos. The basic functional test I was thinking of includes the following steps:

  1. Set up a baremetal cluster with OVS hw offload enabled.
  2. Create a pod with an SR-IOV device working in switchdev mode.
  3. Check that the OVS flows for the pod are offloaded to the hardware NIC and that traffic between pods is not blocked.

QE has the same model of Mellanox CX-5 NIC as we have in the baremetal CI; it will be used for running the OVS hw-offload testing. The NIC is currently used for SR-IOV testing.

Regarding e2e with hardware offload enabled, it will be added to the current OVN CI once the relevant change is ready.


### Graduation Criteria

Tech Preview:
Contributor

Is this for graduating to Tech Preview or from Tech Preview?

Contributor Author

It's to Tech Preview. Updated.

- Cluster Network Operator has API definition to enable OVS hardware offload
- OVN-Kubernetes can configure VF and associated representor
- Document how to use customized net-attach-def with Multus
- SR-IOV Operator has API definition to configure `switchdev` device type
Contributor

This is very low-level, and it doesn't actually include the requirement "... and it actually works" 🙂. I would just say that the criteria is that you can bring up a cluster using hardware offload, and you can do X, Y, and Z on it. (eg, you can bring up a mix of pods using offload and not using offload, on different nodes, and they can all communicate with each other, and you can apply NetworkPolicies to block some traffic, and that works too, ...)

Contributor Author

Updated


### Goals

Enable OpenFlow hardware offload for OVN datapath
Contributor

Does this mean that the goal targets not only OVS but also other OpenFlow devices (i.e., an OpenFlow hardware switch, like ovs-db/OSP)? Or is it just OVS offloading?


I think this should be changed to OVS hardware offload.

Contributor Author

@s1061123 it will be just OVS offloading.

Contributor Author

@moshe010 will update.

Contributor Author

updated



### Non-Goals

Using hardware offload for any part of the OpenShift control plane

Can you explain how the control plane is configured today? Is it all managed by OVS? If so, then I don't think you can control what will be offloaded or not; the hardware offload is a global configuration for OVS.

Contributor Author

Yes, we cannot control what will be offloaded once it's enabled on an OVS instance, but what can be controlled is which nodes have OVS enabled with hw-offload and which pods get a switchdev VF.

I think by saying we do not offload any part of the OpenShift control plane, it means the OVN infra pods (ovnkube-master, ovnkube-node, ovs-node) are not using VFs for flow offloading. @danwinship did I understand that correctly?

Contributor
@danwinship May 8, 2020

Yes, that's what I meant: all pods managed by core OCP operators will still use non-accelerated veths; only pods created by the end user will use accelerated VFs.

(In particular, if an admin tries to add an annotation to a "core OCP" pod to make it accelerated, then the operator that owns that pod will just revert the change. So if we wanted to support a use case like "make prometheus have an accelerated interface", then we'd need to add some explicit way to configure that. There's no reason we couldn't do that later if we wanted to, but it requires some additional stuff beyond what's already in this proposal.)



- When enabling OVS Hardware Offload with SR-IOV VF LAG, MCO needs to apply
linux bond configuration. It's important that SR-IOV Operator (installed day 2)
makes sure `switchdev` configurations are applied before linux bond

We may want to add other FW configs when using switchdev mode to improve performance; can we add them to the SR-IOV operator?

Contributor Author

How would the customer get the firmware? I'm not sure we should expose an interface for such configuration at the OpenShift level.


I think @moshe010 means that additional vendor-specific operations may be required (performed by a sriov-network-operator plugin) to improve performance.

Contributor Author

Yes, we can add it via SR-IOV Operator plugins.

Contributor Author

@moshe010 @adrianchiris what other FW configs should be used to improve performance?


For now I'm not aware of anything specific we would want to do in OpenShift.

There were requests in OSP to address specific use cases.

Contributor Author

@adrianchiris do you have a reference to the OSP-specific use case? Just trying to understand it.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zshi-redhat
To complete the pull request process, please assign mrunalp
You can assign the PR to them by writing /assign @mrunalp in a comment when ready.


be created and used as geneve endpoint of OVS bridge. Mellanox SR-IOV VF-LAG
provides hardware link-aggregation (LAG) for hardware offloaded ports. It
allows offloading of OVS rules to a linux `bond` network device that combines
two idential and hardware offload capable Physical Functions (PFs). With the
@adrianchiris Aug 19, 2020

typo: idential -> identical

Contributor Author

Will fix in the next update.

@squeed
Contributor

squeed commented Nov 9, 2020

Reviewing the upstream documentation, I see that this has a complicated interaction with multus and device plugins.

If a pod is scheduled to a node with ovs hardware offload enabled, and that pod does not have the NetAttachDefinition applied, will it still function?

@zshi-redhat
Contributor Author

zshi-redhat commented Nov 9, 2020

Reviewing the upstream documentation, I see that this has a complicated interaction with multus and device plugins.

If a pod is scheduled to a node with ovs hardware offload enabled, and that pod does not have the NetAttachDefinition applied, will it still function?

@squeed Yes, it will still work, with a regular veth interface attached to the pod.
However, there is a difference in which OVS datapath is used, according to the tc-policy (skip_hw, skip_sw, none) configured for OVS. The default tc-policy is none: if hw offload doesn't work, it falls back to the OVS TC software datapath, and if TC doesn't work, it falls back to the OVS kernel datapath.

@squeed
Contributor

squeed commented Nov 10, 2020

However, there is a difference on which ovs datapath is used according to the tc-policy

I see. I'm not familiar with how all these pieces fit together. Assuming there will "always" be non-offload pods on the node, will they be able to reach each other?

I'm wondering if we should add the requisite device plugin support into ovn-kubernetes directly.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 9, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 11, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closed this PR.

