Proposal: Network Devices #1239
Comments
/cc @samuelkarp |
Could you explain why CNI can't be extended to support your use case? I also wonder if OCI hooks can be used. |
CNI is about network interface creation and configuration: https://github.com/containernetworking/cni/blob/main/SPEC.md#cni-operations
CNI is also an implementation detail of container runtimes and has some limitations; in Kubernetes, projects use annotations and different out-of-band methods to pass this additional information for other interfaces (more in containernetworking/cni#891). I think that most of the problems in this area come from conflating network device and network configuration; my proposal is to decouple those by adding a new field to Pods. |
I'm not well versed in this area; I had this conversation with @samuelkarp and he thought it was worth at least opening this debate. |
I think they can. But it moves control from a declarative model (like the rest of the OCI spec) to an imperative one via the hook implementation. If the goal for the runtime spec is to allow a bundle author to specify the attributes of the container and for a runtime (such as runc) to implement them, I do think it'd be nice to include some aspects of networking in that as well. However, networking is fairly complex. @aojea I'm still not entirely clear on exactly what you'd like to see here (e.g., just interface moves? veth creation? setting up routes? etc.). Can you elaborate a bit more? |
Just interface moves: being able to reference any netdevice on the host and move it into the container network namespace. |
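For concreteness, a minimal sketch of what such an interface move looks like at the netlink level, assuming the vishvananda/netlink library (which runc and many CNI plugins use); the interface name and namespace path are hypothetical:

```go
package main

import (
	"fmt"
	"os"

	"github.com/vishvananda/netlink"
)

// moveLinkToNetns moves the host interface named ifName into the network
// namespace referenced by nsPath (e.g. /proc/<pid>/ns/net). Sketch only.
func moveLinkToNetns(ifName, nsPath string) error {
	// Look up the netdevice by name on the host.
	link, err := netlink.LinkByName(ifName)
	if err != nil {
		return fmt.Errorf("lookup %s: %w", ifName, err)
	}
	// Open the target network namespace to get a file descriptor for it.
	ns, err := os.Open(nsPath)
	if err != nil {
		return fmt.Errorf("open netns %s: %w", nsPath, err)
	}
	defer ns.Close()
	// Equivalent to `ip link set <ifName> netns <ns>`.
	return netlink.LinkSetNsFd(link, int(ns.Fd()))
}

func main() {
	// Hypothetical interface name and namespace path, for illustration.
	if err := moveLinkToNetns("eth3", "/proc/4242/ns/net"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```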
After spending a few weeks exploring different options, I can see how all the new patterns enabled by CDI (https://github.com/cncf-tags/container-device-interface) can benefit Kubernetes and all container environments by instructing runtimes to move a specific netdevice, referenced by name, into the container network namespace. @elezar WDYT? Right now you have to do an exotic dance between annotations and out-of-band operations just to get the information to the CNI plugin so it can move one interface into the network namespace; if the container runtimes can declaratively move the netdevice specified by name into the network namespace, everything will be much simpler. |
My main use case is to model GPUs and their relation to the high-speed NICs used for GPUDirect.
It is complex to model this relation in systems like Kubernetes, since traditionally NICs are treated as part of the CNI; but in this case the NICs are only netdevices associated with the GPUs, consumed directly by the GPUs and not by the Kubernetes cluster or users. If the OCI spec supports "netdevices", it becomes possible to use mechanisms like CDI (https://github.com/cncf-tags/container-device-interface) to mutate the OCI spec and add this bundle declaratively to the Pod: a user can create a Pod or a Container requesting one or multiple GPUs, and the CDI driver can mutate the OCI spec to add the associated NICs/netdevices, without users having to do error-prone manual plumbing; device drivers can always check the node topology and assign the best NIC or NICs for each case. cc: @klueska |
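As a hedged sketch of that flow, such a CDI spec could look like the YAML below; `deviceNodes` is standard CDI, while the `netDevices` edit is hypothetical (it does not exist in CDI today) and the vendor, device, and interface names are made up:

```yaml
# Sketch only. deviceNodes is standard CDI; the netDevices containerEdit is
# hypothetical and illustrates what this proposal would enable.
cdiVersion: "0.6.0"
kind: "vendor.example/gpu"
devices:
  - name: gpu0
    containerEdits:
      deviceNodes:
        - path: /dev/gpu0   # the GPU character device (illustrative path)
      netDevices:           # hypothetical edit: the NIC paired with this GPU
        - name: eth3
```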
/cc |
Are you proposing that we add libcontainer's Network type to the OCI bundle, or that we add a new structure defining existing host interfaces that are expected to be moved (and possibly renamed) to a container's network namespace? |
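(The comment above collapsed a "runc Network type" snippet. Abbreviated from memory, it looks roughly like this; the authoritative definition is the network.go link in the issue description below:)

```go
package configs

// Network defines configuration for a container's networking stack.
// Abbreviated from runc's libcontainer/configs/network.go; the exact
// field set may differ between runc versions.
type Network struct {
	// Type sets the network's type, commonly veth or loopback.
	Type string `json:"type"`
	// Name of the network interface.
	Name string `json:"name"`
	// The bridge to use.
	Bridge string `json:"bridge"`
	// MacAddress to set on the interface.
	MacAddress string `json:"mac_address"`
	// Address contains the IPv4 address and mask for the interface.
	Address string `json:"address"`
	// Gateway sets the default gateway for the interface.
	Gateway string `json:"gateway"`
	// Mtu sets the MTU for the interface.
	Mtu int `json:"mtu"`
	// HostInterfaceName is the host-side name of a created veth pair.
	HostInterfaceName string `json:"host_interface_name"`
}
```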
The latter. Network Type and network configuration are just what I want to avoid; they are unbounded and contentious. On the other side, moving host interfaces to container namespaces is IMHO well defined and solves important use cases very easily. My reasoning is that, the same way block devices are moved into the container namespace, network devices can be moved "declaratively" too, and it should be possible to define some of their properties. Especially interesting is the case where some devices have both an RDMA device and a netdevice; this will solve that problem really well, instead of having to split the responsibility for the RDMA device to the OCI runtime and the netdevice to the CNI, which is always going to be racy. |
@aojea this might be worth running by Kata or the other virtualized runtimes. |
are those implementing the OCI runtime spec? |
Yes, the communication between the runtimes is via OCI. The CreateTask API in containerd uses the OCI runtime spec to communicate with the lower-level runtimes. Unless something has changed :-P @mikebrow keep me honest ha! |
Then it is unrelated; the one that implements the OCI spec is containerd in this case. |
containerd/kata both implement the oci spec |
https://github.com/kata-containers/kata-containers/blob/main/docs/Limitations.md |
I see value in passing a list of netdevices to the OCI runtimes; however, I would rather have the CNI plugins create/move the netdevs to the appropriate location. While this helps the Windows container networking stack, I would want to know if alignment exists in Kata and other OCI runtimes as well. It seems like we have a disconnect here in regards to the virtualized OCI runtimes for networking. |
Creating and configuring netdevs are on purpose out of scope; this is a one-to-one mapping to the "block devices" API and functionality, where you have /dev/gpu1 or /dev/sound0 or similar and you can reference them and move them into a container. In this case the OS does not represent netdevices as files (see description), but it allows userspace to reference them and change their properties, so I'm proposing to provide the same functionality. |
The spec describes devices that are container based, but there is another class of devices, network devices, that are defined per network namespace (see "Linux Device Drivers, Second Edition", Chapter 14, "Network Drivers").
Network devices are also used for providing connectivity to network namespaces, and container runtimes commonly use the CNI specification to provide this capability of adding a network device to the namespace and configuring its networking parameters.
Runc already has the concept of a network device and how to configure it, in addition to the CNI specification: https://github.com/opencontainers/runc/tree/main/libcontainer
https://github.com/opencontainers/runc/blob/main/libcontainer/configs/network.go#L3-L51
The spec already has a reference to the network in https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#network, which references network devices, but it does not allow specifying the network devices that will be part of the namespace.
However, there are cases where a Kubernetes Pod or container may want to add existing network devices to the namespace in a declarative way. It is important to mention that network device configuration or creation is a non-goal and is left out of the spec on purpose.
The use cases for adding network devices to namespaces have become more common lately with the new AI accelerator devices that are presented to the system as network devices but are not really used as ordinary network devices. Ref: https://lwn.net/Articles/955001/ (available Jan 4th without subscription)
The proposal is to be able to add existing network devices to a Linux network namespace by referencing them (https://docs.kernel.org/networking/netdevices.html), in a similar way to the existing definition of devices.
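As an illustration of what "referencing them" could look like in the spec's Go types, here is a sketch mirroring how the existing device types are declared in specs-go; the names are illustrative and the authoritative shape is in the proposal linked below (#1240):

```go
package specs

// Sketch only: an illustrative shape, not the final spec.

// LinuxNetDevice references an existing host network interface that the
// runtime should move into the container's network namespace.
type LinuxNetDevice struct {
	// Name is an optional new name for the interface inside the
	// container; if empty, the host name is kept.
	Name string `json:"name,omitempty"`
}

// NetDevices would sit under `linux` in config.json, keyed by the host
// interface name, next to the existing `devices` list.
type Linux struct {
	// ...existing fields elided...
	NetDevices map[string]LinuxNetDevice `json:"netDevices,omitempty"`
}
```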
Linux defines a structure for this in https://man7.org/linux/man-pages/man7/netdevice.7.html (struct ifreq), though we only need the index or the name to be able to reference an interface.
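For illustration, referencing an interface by name or by index with Go's standard library (the interface name is hypothetical):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Netdevices are not files under /dev; userspace addresses them by
	// name or by interface index.
	byName, err := net.InterfaceByName("eth3") // hypothetical name
	if err != nil {
		fmt.Println("lookup by name:", err)
		return
	}
	fmt.Printf("name=%s index=%d mtu=%d\n", byName.Name, byName.Index, byName.MTU)

	// The index is the other stable handle for the same device.
	byIndex, err := net.InterfaceByIndex(byName.Index)
	if err != nil {
		fmt.Println("lookup by index:", err)
		return
	}
	fmt.Println("round-trip ok:", byIndex.Name == byName.Name)
}
```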
Proposal: #1240
runc prototype: https://github.com/opencontainers/runc/compare/main...aojea:runc:netdevices?expand=1