
KNI [Kubernetes Networking Interface] Initial Draft KEP #4477

Closed
wants to merge 36 commits

Conversation

MikeZappa87

@MikeZappa87 commented Feb 2, 2024

  • One-line PR description: This is the first draft of the KNI KEP, user stories and additions to be discussed as a community
  • Other comments:

MikeZappa87 and others added 24 commits January 11, 2024 13:28
Since I'm going to be helping co-author this KEP I shouldn't
be an approver, but I can still be a reviewer.
Signed-off-by: Shane Utt <[email protected]>
chore: cleanup template text and blank space
docs: kni add goal for container network namespaces
@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot added the do-not-merge/work-in-progress, cncf-cla: yes, kind/kep, and sig/network labels on Feb 2, 2024
@MikeZappa87
Author

Why do some KEPs need user stories and others don't?
Most KEPs don't have a distinction between a user story and a developer story; I feel we should have both, thoughts?

There are KEPs that are straightforward and are mostly a description of the problem and the solution, e.g. https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2595-expanded-dns-config. There are other KEPs that are controversial or can be disruptive; those raise a lot of questions, because we want to be completely sure what problem we are solving, what the consequences and tradeoffs are, and what the alternatives are (see the number of comments on other KEPs).


If I read back, a lot of "user stories" aren't even properly formatted.

I keep saying it: "we need reviewers". People want to help and contribute, but I don't see anybody doing reviews. Reviewing is free, you learn a lot, and you start to have the context to know why some things are the way they are, because you were in the discussion when they were decided ... if you review a KEP and the user story is not clear, just make it clear in your review before it merges ... if you feel that something merged and the assumption turned out to be wrong, then open an issue and fix it; we reverted ClusterCIDR kubernetes/kubernetes#121229 and provided an alternative out of tree because we realized it was not the right thing for the project ...

Let's go to the spirit of the user-story norm: we don't want to be pedantic, as in perfect agile, we just want to know the problem we are trying to solve. Feel free to use whatever wording and context you want to provide; the important thing is to be clear about what problem we are solving for end users, how all these changes are going to benefit Kubernetes users, and what is going to be improved ... we went through this with the KPNG KEP too #3788 (comment)

Are you able to provide an example of an acceptable user story?

Fair, let me put an example here so we are on the same page. I also want to recommend @danwinship's KEP, which should be a reference for all of us: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/3866-nftables-proxy

User story 1

"As a kubernetes user that need to deploy AI/ML or Telco workloads, I need to move inside my Pods some Network interfaces from the host so my applications can use them, an I'd like to have a better UX and reliability using netdevices as the existing one with other type of devices. The existing projects solving this problem, typically multus or some network plugins, have to depend on out of band communications channels like annotations or CRDs and, since the Pod creation is a local and imperative request from the Kubelet to the container runtime through the CRI API, when the runtimes makes the CNI ADD request, this needs one or more additional roundtrips to the apiserver that cause a local process to depend on the control plane, limiting the performance, scalability and reliability of the solution. and making it painful to troubleshoot"

Questions to this user story:

  • Only for existing netdevices on the host, or do we want creation of netdevices too?
  • Only physical, or both physical and virtual netdevices?
  • Some of these netdevices require provisioning and configuration; is that part of the API too, or is the netdevice plugin able to do this without more data?
  • Is a netdevice a CNI thing or a container runtime thing? It cannot be the kubelet, because the container runtime creates the network namespace, or can it? Is this simpler or more complex? How do we prove it?

Alternative 1: Device plugin like

https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/

Problem: runtime spec does not have the concept of netdevice opencontainers/runtime-spec#1239

  • Pros
  • Cons

Alternative 2: DRA

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation
Is this good enough to solve all the problems?

  • Pros
  • Cons

Alternative 3: new API

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-block-volume
spec:
  netDevices:
    - name: eth0
      hostInterface: enps0
      type: physical
  • Pros
  • Cons

Who consumes the API, and how? Is it the CNI plugin? If not, are the runtimes going to ...

Alternative 4: NRI plugins

It seems to be implemented only in containerd and cri-o; what about kata and others, do they need it?

  • Pros
  • Cons

...

References:

This is excellent! This is what I have been wanting :-)

@uablrek

uablrek commented Feb 20, 2024

Can you squash the commits? I find it hard to comment on lines in the KEP, because I don't know which commit to use.

So I include quotes instead below.

the container runtime with network namespace creation and CNI plugins and the OCI runtime which does additional network setup for kernel isolated pods.

Can you give examples? I am pretty well aware of how things work (or so I thought), and my interpretation is:

  • "container runtime" examples crio, containerd
  • CNI-plugin, examples Cilium, Calico
  • OCI runtime, examples runc, crun

Is that correct?

And what is "kernel isolated pods"? That I don't know. Is it kata containers for instance (VMs)?

@uablrek

uablrek commented Feb 20, 2024

however the current approach leaves files in the host filesystem such as CNI binaries and CNI configuration. These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

I guess "the pod" refers to a CNI-plugin agent POD, e.g. calico-node-6x5z8? To me the phrasing intends to imply that general POD creation suffers from this, and would be improved by KNI. That is not true, and IMO you should scratch the

These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

And, the files are not "usually downloaded", they are copied from the init-container to the host fs.

However, I agree that files in the host filesystem should be avoided.


## Motivation

Kubernetes networking is an area of complexity and multiple layers which has created several challenges and areas of improvement. These challenges include deployment of the CNI plugins, troubleshooting networking issues and development of new functionality.
Contributor

(linewrap! it's too hard to comment on lines that are too long)

Member

Yes, please. The current formatting has too many discussion points per line, it's difficult to follow and comment on.

@MikeZappa87
Author

however the current approach leaves files in the host filesystem such as CNI binaries and CNI configuration. These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

I guess "the pod" refers to a CNI-plugin agent POD, e.g. calico-node-6x5z8? To me the phrasing intends to imply that general POD creation suffers from this, and would be improved by KNI. That is not true, and IMO you should scratch the

These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

And, the files are not "usually downloaded", they are copied from the init-container to the host fs.

However, I agree that files in the host filesystem should be avoided.

In most cases the init container downloads the primary plugin (flannel, cilium); however, when they leverage the community plugins, those are downloaded too. This usually happens at a different time though, typically at container runtime install; the containerd install script does this.

I want to avoid this and have all dependencies inside the container image, so that nothing needs to be in the host filesystem.
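
For reference, the pattern being avoided looks roughly like the sketch below (all names are illustrative); the hostPath mounts of /opt/cni/bin and /etc/cni/net.d are exactly the host-filesystem dependency in question:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-cni                 # hypothetical network plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: example-cni
  template:
    metadata:
      labels:
        app: example-cni
    spec:
      initContainers:
        - name: install-cni
          image: example.com/example-cni:latest   # hypothetical image
          # copy the plugin binary and config onto the host filesystem
          command: ["sh", "-c", "cp /cni-bin/* /host/opt/cni/bin/ && cp /cni-conf/10-example.conflist /host/etc/cni/net.d/"]
          volumeMounts:
            - name: host-cni-bin
              mountPath: /host/opt/cni/bin
            - name: host-cni-conf
              mountPath: /host/etc/cni/net.d
      containers:
        - name: agent
          image: example.com/example-cni:latest
      volumes:
        - name: host-cni-bin
          hostPath:
            path: /opt/cni/bin
        - name: host-cni-conf
          hostPath:
            path: /etc/cni/net.d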

@uablrek

uablrek commented Feb 21, 2024

For most cases, the init container downloads the primary plugin aka flannel, cilium however when they leverage the community plugins they are downloaded. This usually happens at a different time though usually container runtime install. containerd install script does this.

No, they copy their primary plugin and the community plugins they need from their init-container (which usually uses the same image as the "main" one, but with different start-up). The only community plugin that is required on start is "loopback".

Examples of what CNI plugins install besides their primary plugins:

  • Cilium: doesn't install, or need, any community plugins.
  • Kindnet: bridge, host-local, ptp, portmap
  • Calico: bandwidth, host-local, portmap, tuning and (to my surprise) flannel
  • Antrea: bandwidth portmap
  • Flannel: doesn't install any community plugins, but hangs forever if "portmap" isn't there (a bug)

But none of them downloads anything on installation. I set ip ro replace default dev lo to make sure.

But in any case, this is misleading:

These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

It should be something like:

It would save a second or so on node reboot if cni-plugin init-containers didn't have to copy plugins to the host fs.
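
(For anyone who wants to reproduce the check described above, roughly: blackhole egress on the node, re-roll the plugin daemonset, and confirm the binaries still appear on the host. The daemonset name is a placeholder.)

# run on the node: blackhole egress so nothing can be downloaded
ip route replace default dev lo
# restart the network plugin daemonset (placeholder name)
kubectl -n kube-system rollout restart daemonset/<cni-daemonset>
# the plugin binaries still show up on the host: they were copied from the image, not downloaded
ls /opt/cni/bin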

This adds a section to the KNI KEP for "ongoing considerations"
where we can put ideas and concerns that don't need to be resolved
just yet, but are good to come back to as we progress. It then adds
an idea for using Kubernetes controllers as an alternative to gRPC
APIs for some of the KNI implementation to the ongoing considerations.

Signed-off-by: Shane Utt <[email protected]>
@uablrek

uablrek commented Feb 22, 2024

I discovered that you can add sequence diagrams in github with mermaid, and I really like sequence diagrams 😄

I have created some really simplified ones that describe network setup with and without KNI.

And, I know that KNI can do more than call a CNI-plugin, so please don't start a discussion on that. It's just an example and it must be there for backward compatibility.

I have not worked with a CRI-plugin, so I may have got the interaction with the OCI runtime (runc, crun) all wrong.

Current network setup

sequenceDiagram
participant api as API server
participant kubelet as Kubelet
participant cri as CRI plugin
participant oci as OCI runtime
participant cni as CNI plugin
api->>kubelet: Notify POD update
kubelet->>cri: Create POD
activate cri
cri->>oci: Create sandbox
activate oci
oci-->>cri: response
deactivate oci
cri->>cni: ADD cmd (exec with files)
activate cni
cni-->>cri: response
deactivate cni
cri->>oci: Create containers
activate oci
oci-->>cri: response
deactivate oci
cri-->>kubelet: response
deactivate cri
kubelet->>api: Update status

Network setup with KNI

sequenceDiagram
participant api as API server
participant kubelet as Kubelet
participant cri as CRI plugin
participant oci as OCI runtime
participant kni as KNI agent
participant cni as CNI plugin
api->>kubelet: Notify POD update
kubelet->>cri: Create POD
activate cri
cri->>oci: Create sandbox
activate oci
oci-->>cri: response
deactivate oci
cri-->>kubelet: response
deactivate cri
kubelet->>kni: Setup network (gRPC)
activate kni
kni-->>api: read objects
kni->>cni: ADD cmd (exec w stdin)
activate cni
cni-->>kni: response
deactivate cni
kni-->>kubelet: response
deactivate kni
kubelet->>cri: Create containers
activate cri
cri->>oci: Create containers
activate oci
oci-->>cri: response
deactivate oci
cri-->>kubelet: response
deactivate cri
kubelet->>api: Update status


Currently networking happens in three layers of the stack, Kubernetes itself by means of kube-proxy or another controller based solution, the container runtime with network namespace creation and CNI plugins and the OCI runtime which does additional network setup for kernel isolated pods. All of this communication happens through non network specific APIs which to the reader of the code makes it hard to determine where ‘networking’ is happening. Having networking in several layers presents an issue when needing to troubleshoot issues as one needs to check several areas and some cannot be done via kubectl logs such as the CNI execution logs. This becomes more of an effort as multiple uncoordinated processes are making changes to the same resource, the network namespace of either the root or pod. The KNI aims at reducing the complexity by consolidating the networking into a single layer and having a uniform process for both namespaced and kernel isolated pods through a gRPC API. Leveraging gRPC will allow users the ability to migrate away from the current execution model that the CNI currently leverages.

The next challenge is the deployment of the CNI plugins that provide the network setup and teardown of the pod. The idiomatic way to deploy a workload in Kubernetes is that everything should be in a pod; however the current approach leaves files in the host filesystem such as CNI binaries and CNI configuration. These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state. Since all existing K8s network plugins are running as daemonsets we will take this approach as well, where all the dependencies are packaged into the container image thus adopting a well known approach. This will have added benefits of the network pod startup being much faster as nothing should need to be downloaded.
@bleggett commented Feb 22, 2024

A key part of what current CNI provides is a framework for different vendors to independently add extensions (CNI plugins) to a node, and have guarantees about when in the lifecycle of pod creation those are invoked, and what context they have from other plugins.

There may be multiple daemonsets from multiple vendors installing multiple CNI plugins on the same node or the node may come from a cloud provider with a CNI already installed and some other vendor might want to chain another extension onto that - any model we adopt must reckon with this as a first-class concern.

That, for me, is critical to retain for this to be a 1-1 replacement for the existing CNI - we can probably do something simpler than the current model of "kubelet talks to CRI, CRI impl reads shared node config with plugin list, serially execs the plugins as arbitrary privileged binaries", as well.

At the very least, moving that "list of extensions + CNI config" to an etcd-managed kube resource would fix currently non-fixable TOCTOU bugs we have in Istio, for instance: istio/istio#46848

At a minimum, it's important for me that any CNI/KNI extension support meets the following basic criteria:

  • I am able to validate that KNI is "ready" on a node
  • I am able to subscribe or register an extension with KNI from an in-cluster daemonset, and have guarantees TOCTOU errors will not silently unregister my extension.
  • I have a guarantee that if the above things are true, my extension will be invoked for every pod creation after that point, and that if my extension (or any other extension) fails during invocation, the pod will not be scheduled on the node.
  • I am able to get things like interfaces, netns paths, and assigned IPs for that pod, as context for my invocation.

Ideally as a "this is a good enough replacement" test, I would want to see current CNI plugins from a few different projects implemented as KNI extensions, and installed/managed on the same node. If we can do that, we are effectively proving this can be a well-scoped replacement for the status quo.
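
To make the "etcd-managed kube resource" idea above concrete, a registration object could look something like the sketch below. This is purely hypothetical; neither the KEP nor KNI defines such an API today, and every name here is invented:

apiVersion: kni.example.io/v1alpha1      # invented group/version
kind: NodeNetworkExtension
metadata:
  name: istio-cni
spec:
  nodeSelector:
    kubernetes.io/os: linux
  priority: 100                          # ordering relative to other extensions
  endpoint: unix:///var/run/istio-cni/kni.sock
  failurePolicy: Fail                    # pod sandbox creation fails if this extension fails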

@MikeZappa87
Author

For most cases, the init container downloads the primary plugin aka flannel, cilium however when they leverage the community plugins they are downloaded. This usually happens at a different time though usually container runtime install. containerd install script does this.

No, they copy their primary plugin and the community plugins they need from their init-container (which usually uses the same image as the "main" one, but with different start-up). The only community plugins that is required on start is "loopback".

Examples what cni-plugins install beside their primary plugins:

  • Cilium: doesn't install, or need, any community plugins.
  • Kindnet: bridge, host-local, ptp, portmap
  • Calico: bandwidth host-local portmap tuning and (to my suprise) flannel
  • Antrea: bandwidth portmap
  • Flannel: doesn't install any community plugins, but hangs forever if "portmap" isn't there (a bug)

But none of them downloads anything on installation. I set ip ro replace default dev lo to make sure.

But in any case, this is misleading:

These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

It should be something like:

It would save a second or so on node reboot if cni-plugin init-containers wouldn't have to copy plugins to the host fs.

I believe flannel requires the bridge plugin; I just installed it to make sure I wasn't crazy. From the performance testing I have done, KNI is faster, taking a consistent 1 second vs 9-23 seconds for the network pod setup. The pod network setup is faster as well.

@MikeZappa87
Author

I discovered that you can add sequence diagrams in github with mermaid, and I really like sequence diagrams 😄

I have created some really simplified ones that describe network setup with and without KNI.

And, I know that KNI can do more than call a CNI-plugin, so please don't start a discussion on that. It's just an example and it must be there for backward compatibility.

I have not worked with a CRI-plugin, so I may have got the interaction with the OCI runtime (runc, crun) all wrong.

Current network setup

sequenceDiagram
participant api as API server
participant kubelet as Kubelet
participant cri as CRI plugin
participant oci as OCI runtime
participant cni as CNI plugin
api->>kubelet: Notify POD update
kubelet->>cri: Create POD
activate cri
cri->>oci: Create sandbox
activate oci
oci-->>cri: response
deactivate oci
cri->>cni: ADD cmd (exec with files)
activate cni
cni-->>cri: response
deactivate cni
cri->>oci: Create containers
activate oci
oci-->>cri: response
deactivate oci
cri-->>kubelet: response
deactivate cri
kubelet->>api: Update status

Network setup with KNI

sequenceDiagram
participant api as API server
participant kubelet as Kubelet
participant cri as CRI plugin
participant oci as OCI runtime
participant kni as KNI agent
participant cni as CNI plugin
api->>kubelet: Notify POD update
kubelet->>cri: Create POD
activate cri
cri->>oci: Create sandbox
activate oci
oci-->>cri: response
deactivate oci
cri-->>kubelet: response
deactivate cri
kubelet->>kni: Setup network (gRPC)
activate kni
kni-->>api: read objects
kni->>cni: ADD cmd (exec w stdin)
activate cni
cni-->>kni: response
deactivate cni
kni-->>kubelet: response
deactivate kni
kubelet->>cri: Create containers
activate cri
cri->>oci: Create containers
activate oci
oci-->>cri: response
deactivate oci
cri-->>kubelet: response
deactivate cri
kubelet->>api: Update status

We actually have a backwards-compatible model, with a roadmap to deploy side by side and then with libkni. However, your diagram is pretty much spot on. If you want, I can set up some time with you to go over the migration stories with demos.


Kubernetes networking is an area of complexity and multiple layers which has created several challenges and areas of improvement. These challenges include deployment of the CNI plugins, troubleshooting networking issues and development of new functionality.

Currently networking happens in three layers of the stack, Kubernetes itself by means of kube-proxy or another controller based solution, the container runtime with network namespace creation and CNI plugins and the OCI runtime which does additional network setup for kernel isolated pods. All of this communication happens through non network specific APIs which to the reader of the code makes it hard to determine where ‘networking’ is happening. Having networking in several layers presents an issue when needing to troubleshoot issues as one needs to check several areas and some cannot be done via kubectl logs such as the CNI execution logs. This becomes more of an effort as multiple uncoordinated processes are making changes to the same resource, the network namespace of either the root or pod. The KNI aims at reducing the complexity by consolidating the networking into a single layer and having a uniform process for both namespaced and kernel isolated pods through a gRPC API. Leveraging gRPC will allow users the ability to migrate away from the current execution model that the CNI currently leverages.


What is meant by namespaced AND kernel-isolated pods? Isn't that the same thing?

Author

What I was trying to do here is separate the virtualized OCI runtimes from the non-virtualized ones, i.e. kata vs runc. However, we just got off a call with KubeVirt, where you could use either. Both will leverage network namespaces, but the virtualized cases have the additional kernel isolation.

@bleggett commented Feb 22, 2024

If both will use (at least) network namespacing it's probably less confusing to just say that (and provide more detail later in the doc for other cases that use addl isolation, if need be).


Currently networking happens in three layers of the stack, Kubernetes itself by means of kube-proxy or another controller based solution, the container runtime with network namespace creation and CNI plugins and the OCI runtime which does additional network setup for kernel isolated pods. All of this communication happens through non network specific APIs which to the reader of the code makes it hard to determine where ‘networking’ is happening. Having networking in several layers presents an issue when needing to troubleshoot issues as one needs to check several areas and some cannot be done via kubectl logs such as the CNI execution logs. This becomes more of an effort as multiple uncoordinated processes are making changes to the same resource, the network namespace of either the root or pod. The KNI aims at reducing the complexity by consolidating the networking into a single layer and having a uniform process for both namespaced and kernel isolated pods through a gRPC API. Leveraging gRPC will allow users the ability to migrate away from the current execution model that the CNI currently leverages.

The next challenge is the deployment of the CNI plugins that provide the network setup and teardown of the pod. The idiomatic way to deploy a workload in Kubernetes is that everything should be in a pod; however the current approach leaves files in the host filesystem such as CNI binaries and CNI configuration. These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state. Since all existing K8s network plugins are running as daemonsets we will take this approach as well, where all the dependencies are packaged into the container image thus adopting a well known approach. This will have added benefits of the network pod startup being much faster as nothing should need to be downloaded.


These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

This is confusing me a bit - typically the CNI config is something workload pods are not aware of at all, via init containers or otherwise - only the CRI implementation handles them?

Author

The pod I was referring to was the network plugin daemonset pod (flannel, calico, ...). I can try to clean this up to be clearer.


Ah ok thanks, yeah - "node agent daemonset" or "privileged node-bound pod" or whatever. Something general and consistent, since I don't think it has to be an init container specifically, in the Pod sense.


Please see also #4477 (comment) and the follow-up comments.

@MikeZappa87
Author

Can you squash the commits? I find it hard to comment lines in the KEP, because I don't know which commit to use?

So I include quotes instead below.

the container runtime with network namespace creation and CNI plugins and the OCI runtime which does additional network setup for kernel isolated pods.

Can you give examples? I am pretty well aware of how things work (or so I thought), and my interpretation is:

  • "container runtime" examples crio, containerd
  • CNI-plugin, examples Cilium, Calico
  • OCI runtime, examples runc, crun

Is that correct?

And what is "kernel isolated pods"? That I don't know. Is it kata containers for instance (VMs)?

I was referring to kernel-isolated pods as pods that leverage kata or kubevirt. The additional network setup happens after the CNI ADD for both the kata and kubevirt use cases. This is done in a couple of ways; for Kata, the setup happens through the execution of the kata-runtime via containerd/cri-o. In containerd it's via StartTask, so it's not clear that additional networking is happening.

I can squash the commits. It might also make sense to move the contents to a Google doc?


Currently networking happens in three layers of the stack, Kubernetes itself by means of kube-proxy or another controller based solution, the container runtime with network namespace creation and CNI plugins and the OCI runtime which does additional network setup for kernel isolated pods. All of this communication happens through non network specific APIs which to the reader of the code makes it hard to determine where ‘networking’ is happening. Having networking in several layers presents an issue when needing to troubleshoot issues as one needs to check several areas and some cannot be done via kubectl logs such as the CNI execution logs. This becomes more of an effort as multiple uncoordinated processes are making changes to the same resource, the network namespace of either the root or pod. The KNI aims at reducing the complexity by consolidating the networking into a single layer and having a uniform process for both namespaced and kernel isolated pods through a gRPC API. Leveraging gRPC will allow users the ability to migrate away from the current execution model that the CNI currently leverages.

The next challenge is the deployment of the CNI plugins that provide the network setup and teardown of the pod. The idiomatic way to deploy a workload in Kubernetes is that everything should be in a pod; however the current approach leaves files in the host filesystem such as CNI binaries and CNI configuration. These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state. Since all existing K8s network plugins are running as daemonsets we will take this approach as well, where all the dependencies are packaged into the container image thus adopting a well known approach. This will have added benefits of the network pod startup being much faster as nothing should need to be downloaded.
Member

Since all existing K8s network plugins are running as daemonsets we will take this approach as well [...]

This is definitely not true ...?
CNI plugin binaries can be pre-installed to the host and run entirely on the host without a daemonset and there are major cluster operators working this way.

Through CRI, even plumbing the node IPAM can be done without any pod.
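
For example, containerd's CRI plugin can render the CNI config from a golang template using the node's pod CIDR, so no pod is involved in plumbing it at all. A rough sketch (the directory paths are the defaults, the template path is illustrative):

# /etc/containerd/config.toml excerpt (containerd 1.x CRI plugin)
[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/opt/cni/bin"
  conf_dir = "/etc/cni/net.d"
  # golang template rendered with the node's PodCIDR values
  conf_template = "/etc/containerd/cni-template.conflist"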

@bleggett commented Feb 22, 2024

CNI plugin binaries can be pre-installed to the host and run entirely on the host without a daemonset and there are major cluster operators working this way.

Yep, some environments bake the nodes and don't let you change them (and the corollary of that is you can't use other CNIs or meshes). Some do.

Either way, the current gate is "you need node privilege to install CNI plugins" - some envs have K8S policies that allow pods to assume node-level privileges, some do not and prebake everything. That k8s-based gate should be maintained for KNI, so either option works, depending on operator/environment need.

I don't think there's a world in which we can effectively separate "able to add new networking extensions" from "require privileged access to the node". That's worth considering tho.

If extensions are driven by K8S resources it makes policy enforcement via operator-managed admission hooks potentially a bit easier, I guess.

Member

Either way, the current gate is "you need node privilege to install CNI plugins" [...]

I would expect that to be true, given the position they fill in the stack (!)

Author

How you deploy the CNI plugins can be opinionated. However, the piece that is important is that you no longer need to have any files in the host filesystem such as CNI binaries or CNI configuration files.

Member

However, the piece that is important is that you no longer need to have any files in the host filesystem such as CNI binaries or CNI configuration files.

But why is that important? I mean taken to an extreme ... You cannot run Kubernetes without files on the host filesystem anyhow, and configuring the pod network is an incredibly privileged place to operate.

@bleggett commented Feb 22, 2024

But why is that important? I mean taken to an extreme ... You cannot run Kubernetes without files on the host filesystem anyhow, and configuring the pod network is an incredibly privileged place to operate.

It's important because allowing privileged pods to run on a node (where privileged necessarily means "can mutate node state", networking or otherwise) is an operator choice today, and it seems wrong to take that choice away from operators.

This is how, for instance, it is possible to install Cilium in AWS today. Or Calico in GKE. Or Openshift, etc etc.

Anyone that doesn't want to allow privileged pods on a node can already choose to leverage existing K8S APIs to preclude that as a matter of operational policy - it's rather orthogonal to CNI.

@BenTheElder left a comment

Wouldn't it be "less layers" to just put opinionated networking in containerd/cri-o?

CNI enables extensibility, and we've got a large and healthy ecosystem around it.

This feels like a big technical-debt inducing, disruptive undertaking in search of a concrete need.

I don't see any discussion of goals that cannot be accomplished by improving the current tools, except some non-specific hand-waving about reducing layers.

There also seem to be some bad assumptions, e.g. that "all plugins are populated by daemonset" (and therefore network readiness is blocked by pulling / starting this pod) which is not true, e.g. off the top of my head: GKE does not do this, kind does not do this (you can pre-populate plugin binaries / config or containerd CNI config template with very fast node network readiness), and then KNI is suggested to ... run as a container image, which will ... block network readiness on pulling a container...?

At the very least, I'd suggest spelling out the use-cases and why they cannot be accomplished without a new RPC service.


Another area that KNI will improve is ‘network readiness’. Currently the container runtime is involved with providing both network and runtime readiness with the Status CRI RPC. The container runtime defines the network as ‘ready’ by the presence of the CNI network configuration in the host file system. The more recent CNI specification does include a status verb, however this is still bound by the current limitations, files on disk and execution model. The KNI will provide an RPC that can be implemented so the kubelet will call the KNI via gRPC.

KNI aims to help the community and other proposals in the Kubernetes ecosystem. We will do this by providing necessary information via the gRPC service. We should be the API that provides the “what networks are available on this node” so that another effort can make the kube-scheduler aware of networks. We should also provide IPAM status as a common issue, is that the IPAM runs out of assignable IP addresses and pods are no longer able to be scheduled on that node until intervention. We should provide visibility into this so that we can indicate “no more pods” as setting the node to not ready will evict the healthy pods. While the future state of KNI could aim to propose changes to the kube-scheduler, it's not a part of our initial work and instead should try to assist other efforts such as DRA/device plugin to provide the information they need.
Member

We should also provide IPAM status as a common issue, is that the IPAM runs out of assignable IP addresses and pods are no longer able to be scheduled on that node until intervention.

Typically this is handled by coordinating the assigned IP range and the max pods setting in the kubelet.
Cluster operators already have the tools to prevent this issue; why would you allow a node to be configured to have more pods than IPs?


KNI aims to help the community and other proposals in the Kubernetes ecosystem. We will do this by providing necessary information via the gRPC service. We should be the API that provides the “what networks are available on this node” so that another effort can make the kube-scheduler aware of networks. We should also provide IPAM status as a common issue, is that the IPAM runs out of assignable IP addresses and pods are no longer able to be scheduled on that node until intervention. We should provide visibility into this so that we can indicate “no more pods” as setting the node to not ready will evict the healthy pods. While the future state of KNI could aim to propose changes to the kube-scheduler, it's not a part of our initial work and instead should try to assist other efforts such as DRA/device plugin to provide the information they need.

The community may ask for more features, as we are taking a bold approach to reimagining Kubernetes networking by reducing the amount of layers involved in networking. We should prioritize feature parity with the current CNI model and then capture future work. KNI aims to be the foundational network api that is specific for Kubernetes and should make troubleshooting easier, deploying more friendly and innovate faster while reducing the need to make changes to core Kubernetes.
Member

The community may ask for more features, as we are taking a bold approach to reimagining Kubernetes networking by reducing the amount of layers involved in networking.

Reducing the layers ... by adding a new GRPC service? Why isn't this CRI? (An existing Kubernetes-Specific gRPC service for Pods, which includes some networking related bits today ...)

Author

Adding this to the CRI-API is a design detail. The only piece of information relevant to networking is the PodIP coming back in PodSandboxStatus. This talks about eliminating the network setup/teardown and netns creation from the container/oci runtimes.

Member

This talks about eliminating the network setup/teardown and netns creation from the container/oci runtimes.

Has this been a significant obstacle for implementers? Examples?

What blocks a CNI revision from handling netns/teardown?


Since the network runtime can be run separated from the container runtime, you can package everything into a pod and not need to have binaries on disk. This allows the CNI plugins to be isolated in the pod and the pod will never need to mount /opt/cni/bin or /etc/cni/net.d. This offers a potentially more ability to control execution. Keep in mind CNI is the implementation however when this is used chaining is still available.

## Ongoing Considerations
Member

There doesn't seem to be any discussion of how we might improve CNI or CRI instead, and why that isn't sufficient versus this entirely new API and RPC service.

Author

This could live in the CRI-API as multiple services; no one has indicated that we must use a new API. However, CNI 2.0 being closer to K8s has been talked about for years now. This is that effort.

@bleggett commented Mar 4, 2024

I do agree we could probably be more clear up front in the KEP about why the current CNI model (slurping out-of-k8s config files from well-known paths, TOCTOU errors, telling CRI implementations via out-of-band mechanisms to exec random binaries on the node by-proxy) is something we could explicitly improve on with KNI, and that KNI is basically "CNI 2.0" - it is the proposed improvement to current CNI.


### User Stories

We are constantly adding these user stories, please join the community sync to discuss.
Member

We are constantly adding these user stories [...]

Where?



### Goals

- Design a cool looking t-shirt
- Provide a RPC for the Attachment and Detachment of interface[s] for a Pod
Member

This seems like a significant departure from current expectations around pod networking.

Author

Can you clarify this? We already do something similar with CNI ADD/DEL, just via an execution model. This is leveraging gRPC to communicate with the gRPC server, which would be flannel, calico, cilium.
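
For discussion, a KNI attach/detach service could be shaped roughly like the sketch below. This is only an illustration; the KEP has not defined any proto yet, and every name here is invented:

syntax = "proto3";
package kni.v1alpha1;   // invented package name

service KNI {
  rpc AttachNetwork(AttachNetworkRequest) returns (AttachNetworkResponse);      // analogous to CNI ADD
  rpc DetachNetwork(DetachNetworkRequest) returns (DetachNetworkResponse);      // analogous to CNI DEL
  rpc QueryPodNetwork(QueryPodNetworkRequest) returns (QueryPodNetworkResponse);
}

message AttachNetworkRequest {
  string pod_name = 1;
  string pod_namespace = 2;
  string netns_path = 3;                  // network namespace created by the runtime
  map<string, string> annotations = 4;
}

message AttachNetworkResponse {
  repeated string interfaces = 1;
  repeated string ip_addresses = 2;
}

message DetachNetworkRequest {
  string pod_name = 1;
  string pod_namespace = 2;
  string netns_path = 3;
}

message DetachNetworkResponse {}

message QueryPodNetworkRequest {
  string pod_name = 1;
  string pod_namespace = 2;
}

message QueryPodNetworkResponse {
  repeated string interfaces = 1;
  repeated string ip_addresses = 2;
  repeated string routes = 3;
}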

- Design a cool looking t-shirt
- Provide a RPC for the Attachment and Detachment of interface[s] for a Pod
- Provide a RPC for the Querying of Pod network information (interfaces, network namespace path, ip addresses, routes, ...)
- Provide a RPC to prevent additional scheduling of pods if IPAM is out of IP addresses without evicting running pods
Member

Surely this is just reporting status up through CRI?

Author

This is the last place I'll mention it: the decision around CRI vs a new API is a design detail.

approvers:

see-also:
- "/keps/sig-aaa/1234-we-heard-you-like-keps"
Member

delete these, or update with relevant keps?

same below for replaces

milestone:
alpha: "v1.31"
beta: "v1.32"
stable: "v1.33"
Member

This seems unlikely, without even defining an API yet?

- "@shaneutt"
owning-sig: sig-network
participating-sigs:
- sig-network
Member

Surely at least SIG Node should be participating (there's no way this doesn't affect kubelet, CRI)..?

I would also tag Cluster Lifecycle at least as FYI / advisory since cluster lifecycle folks will know about and have suggestions re: node readiness and cluster configuration.

@BenTheElder
Member

If use cases and design are still being collected, a working group may be more appropriate than a KEP.

@MikeZappa87
Author

If use cases and design are still being collected, a working group may be more appropriate than a KEP.

Let's sync up on Slack and schedule some time to discuss.

@uablrek

uablrek commented Feb 23, 2024

Are there any serious KNI use-cases that don't include multi-networking? KNI doesn't enable multiple interfaces, but I think that something like KNI is a prerequisite for K8s multi-net. But to independently motivate both KNI and K8s-multi-net with multi-networking use-cases is very confusing. I hope Antonio's workshop at KubeCon will sort this out (great initiative! But I can't attend myself unfortunately).

But the comment referred to above must be considered if multiple interfaces are handled via KNI: who cleans up when a POD is deleted?

Obviously, someone who knows what interfaces are in the POD. So if something other than the kubelet calls the KNI to add various interfaces, the kubelet would be unaware of them and there is a problem.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jun 2, 2024