
KNI [Kubernetes Networking Interface] Initial Draft KEP #4477

Closed
wants to merge 36 commits

Conversation

MikeZappa87

@MikeZappa87 commented Feb 2, 2024

  • One-line PR description: This is the first draft of the KNI KEP, user stories and additions to be discussed as a community
  • Other comments:

MikeZappa87 and others added 24 commits January 11, 2024 13:28
Since I'm going to be helping co-author this KEP I shouldn't
be an approver, but I can still be a reviewer.
Signed-off-by: Shane Utt <[email protected]>
chore: cleanup template text and blank space
docs: kni add goal for container network namespaces
@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot added the do-not-merge/work-in-progress, cncf-cla: yes, kind/kep, and sig/network labels on Feb 2, 2024
@MikeZappa87
Author

Why do some KEPs need user stories and others don't?
Most KEPs don't have a distinction between a user story and a developer story; I feel we should have both, thoughts?

There are KEPs that are straightforward and are mostly a description of the problem and the solution, e.g. https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2595-expanded-dns-config. There are other KEPs that are controversial or can be disruptive; those raise a lot of questions, because we want to be completely sure what problem we are solving, what the consequences and tradeoffs are, and what the alternatives are (see the number of comments on other KEPs).


If I read back, a lot of "user stories" aren't even properly formatted.

I keep saying it: "we need reviewers". People want to help and contribute, but I don't see anybody doing reviews. Reviewing is free, you learn a lot, and you start to have the context to know why some things are the way they are, because you were in the discussion when they were decided ... if you review a KEP and the user story is not clear, just make it clear in your review before it merges ... if you feel that something merged and the assumption turned out to be wrong, then open an issue and fix it; we reverted ClusterCIDR kubernetes/kubernetes#121229 and provided an alternative out of tree because we realized it was not the right thing for the project ...

Let's go to the spirit of the user-story norm: we don't want to be pedantic, as in perfect agile, we just want to know the problem we are trying to solve. Feel free to use whatever wording and context you want to provide; the important thing is to be clear about what problem we are solving for end users, how all these changes are going to benefit Kubernetes users, and what is going to be improved ... we went through this with the KPNG KEP too #3788 (comment)

Are you able to provide an example of an acceptable user story?

Fair, let me put an example here so we are on the same page. I also want to recommend @danwinship's KEP, which should be a reference for all of us: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/3866-nftables-proxy

User story 1

"As a kubernetes user that need to deploy AI/ML or Telco workloads, I need to move inside my Pods some Network interfaces from the host so my applications can use them, an I'd like to have a better UX and reliability using netdevices as the existing one with other type of devices. The existing projects solving this problem, typically multus or some network plugins, have to depend on out of band communications channels like annotations or CRDs and, since the Pod creation is a local and imperative request from the Kubelet to the container runtime through the CRI API, when the runtimes makes the CNI ADD request, this needs one or more additional roundtrips to the apiserver that cause a local process to depend on the control plane, limiting the performance, scalability and reliability of the solution. and making it painful to troubleshoot"

Questions to this user story:

  • Only for existing netdevices on the host, or do we want creation of netdevices too?
  • Only physical, or both physical and virtual netdevices?
  • Some of these netdevices require provisioning and configuration; is that part of the API too, or is the netdevice plugin able to do this without more data?
  • Is a netdevice a CNI thing or a container runtime thing? It cannot be the kubelet, because the container runtime creates the network namespace, or can it? Is this simpler or more complex? How do we prove it?

Alternative 1: Device plugin like

https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/

Problem: runtime spec does not have the concept of netdevice opencontainers/runtime-spec#1239

  • Pros
  • Cons

Alternative 2: DRA

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation
Is this good enough to solve all the problems?

  • Pros
  • Cons

Alternative 3: new API

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-block-volume
spec:
  netDevices:
    - name: eth0
      hostInterface: enps0
      type: physical
  • Pros
  • Cons

Who consumes the API, and how? Is it the CNI plugin? If not, are the runtimes going to ...

Alternative 4: NRI plugins

It seems to be implemented only in containerd and cri-o; what about kata and others, do they need it?

  • Pros
  • Cons

...

References:

This is excellent! This is what I have been wanting :-)

@uablrek

uablrek commented Feb 20, 2024

Can you squash the commits? I find it hard to comment on lines in the KEP, because I don't know which commit to use.

So I include quotes instead below.

the container runtime with network namespace creation and CNI plugins and the OCI runtime which does additional network setup for kernel isolated pods.

Can you give examples? I am pretty well aware of how things work (or so I thought), and my interpretation is:

  • "container runtime" examples crio, containerd
  • CNI-plugin, examples Cilium, Calico
  • OCI runtime, examples runc, crun

Is that correct?

And what is "kernel isolated pods"? That I don't know. Is it kata containers for instance (VMs)?

@uablrek

uablrek commented Feb 20, 2024

however the current approach leaves files in the host filesystem such as CNI binaries and CNI configuration. These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

I guess "the pod" refers to a CNI-plugin agent POD, e.g. calico-node-6x5z8? To me the phrasing intends to imply that general POD creation suffers from this, and would be improved by KNI. That is not true, and IMO you should scratch the

These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

And, the files are not "usually downloaded", they are copied from the init-container to the host fs.

However, I agree that files in the host filesystem should be avoided.


## Motivation

Kubernetes networking is an area of complexity and multiple layers which has created several challenges and areas of improvement. These challenges include deployment of the CNI plugins, troubleshooting networking issues and development of new functionality.
Contributor

(linewrap! it's too hard to comment on lines that are too long)

Member

Yes, please. The current formatting has too many discussion points per line, it's difficult to follow and comment on.

@MikeZappa87
Author

however the current approach leaves files in the host filesystem such as CNI binaries and CNI configuration. These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

I guess "the pod" refers to a CNI-plugin agent POD, e.g. calico-node-6x5z8? To me the phrasing intends to imply that general POD creation suffers from this, and would be improved by KNI. That is not true, and IMO you should scratch the

These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

And, the files are not "usually downloaded", they are copied from the init-container to the host fs.

However, I agree that files in the host filesystem should be avoided.

In most cases the init container downloads the primary plugin (flannel, cilium); however, when they leverage the community plugins, those are downloaded too. This usually happens at a different time though, typically at container runtime install; the containerd install script does this.

I want to avoid this and have all dependencies inside the container image, so that nothing needs to be in the host filesystem.
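
For reference, the pattern being avoided looks roughly like the sketch below (all names are illustrative); the hostPath mounts of /opt/cni/bin and /etc/cni/net.d are exactly the host-filesystem dependency in question:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-cni                 # hypothetical network plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: example-cni
  template:
    metadata:
      labels:
        app: example-cni
    spec:
      initContainers:
        - name: install-cni
          image: example.com/example-cni:latest   # hypothetical image
          # copy the plugin binary and config onto the host filesystem
          command: ["sh", "-c", "cp /cni-bin/* /host/opt/cni/bin/ && cp /cni-conf/10-example.conflist /host/etc/cni/net.d/"]
          volumeMounts:
            - name: host-cni-bin
              mountPath: /host/opt/cni/bin
            - name: host-cni-conf
              mountPath: /host/etc/cni/net.d
      containers:
        - name: agent
          image: example.com/example-cni:latest
      volumes:
        - name: host-cni-bin
          hostPath:
            path: /opt/cni/bin
        - name: host-cni-conf
          hostPath:
            path: /etc/cni/net.d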

@uablrek

uablrek commented Feb 21, 2024

For most cases, the init container downloads the primary plugin aka flannel, cilium however when they leverage the community plugins they are downloaded. This usually happens at a different time though usually container runtime install. containerd install script does this.

No, they copy their primary plugin and the community plugins they need from their init-container (which usually uses the same image as the "main" one, but with different start-up). The only community plugin that is required on start is "loopback".

Examples of what CNI plugins install besides their primary plugins:

  • Cilium: doesn't install, or need, any community plugins.
  • Kindnet: bridge, host-local, ptp, portmap
  • Calico: bandwidth, host-local, portmap, tuning and (to my surprise) flannel
  • Antrea: bandwidth portmap
  • Flannel: doesn't install any community plugins, but hangs forever if "portmap" isn't there (a bug)

But none of them downloads anything on installation. I set ip ro replace default dev lo to make sure.

But in any case, this is misleading:

These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

It should be something like:

It would save a second or so on node reboot if cni-plugin init-containers didn't have to copy plugins to the host fs.
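
(For anyone who wants to reproduce the check described above, roughly: blackhole egress on the node, re-roll the plugin daemonset, and confirm the binaries still appear on the host. The daemonset name is a placeholder.)

# run on the node: blackhole egress so nothing can be downloaded
ip route replace default dev lo
# restart the network plugin daemonset (placeholder name)
kubectl -n kube-system rollout restart daemonset/<cni-daemonset>
# the plugin binaries still show up on the host: they were copied from the image, not downloaded
ls /opt/cni/bin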

This adds a section to the KNI KEP for "ongoing considerations"
where we can put ideas and concerns that don't need to be resolved
just yet, but are good to come back to as we progress. It then adds
an idea for using Kubernetes controllers as an alternative to gRPC
APIs for some of the KNI implementation to the ongoing considerations.

Signed-off-by: Shane Utt <[email protected]>
@uablrek

uablrek commented Feb 22, 2024

I discovered that you can add sequence diagrams in github with mermaid, and I really like sequence diagrams 😄

I have created some really simplified ones that describe network setup with and without KNI.

And, I know that KNI can do more than call a CNI-plugin, so please don't start a discussion on that. It's just an example and it must be there for backward compatibility.

I have not worked with a CRI-plugin, so I may have got the interaction with the OCI runtime (runc, crun) all wrong.

Current network setup

sequenceDiagram
participant api as API server
participant kubelet as Kubelet
participant cri as CRI plugin
participant oci as OCI runtime
participant cni as CNI plugin
api->>kubelet: Notify POD update
kubelet->>cri: Create POD
activate cri
cri->>oci: Create sandbox
activate oci
oci-->>cri: response
deactivate oci
cri->>cni: ADD cmd (exec with files)
activate cni
cni-->>cri: response
deactivate cni
cri->>oci: Create containers
activate oci
oci-->>cri: response
deactivate oci
cri-->>kubelet: response
deactivate cri
kubelet->>api: Update status

Network setup with KNI

sequenceDiagram
participant api as API server
participant kubelet as Kubelet
participant cri as CRI plugin
participant oci as OCI runtime
participant kni as KNI agent
participant cni as CNI plugin
api->>kubelet: Notify POD update
kubelet->>cri: Create POD
activate cri
cri->>oci: Create sandbox
activate oci
oci-->>cri: response
deactivate oci
cri-->>kubelet: response
deactivate cri
kubelet->>kni: Setup network (gRPC)
activate kni
kni-->>api: read objects
kni->>cni: ADD cmd (exec w stdin)
activate cni
cni-->>kni: response
deactivate cni
kni-->>kubelet: response
deactivate kni
kubelet->>cri: Create containers
activate cri
cri->>oci: Create containers
activate oci
oci-->>cri: response
deactivate oci
cri-->>kubelet: response
deactivate cri
kubelet->>api: Update status


Currently networking happens in three layers of the stack, Kubernetes itself by means of kube-proxy or another controller based solution, the container runtime with network namespace creation and CNI plugins and the OCI runtime which does additional network setup for kernel isolated pods. All of this communication happens through non network specific APIs which to the reader of the code makes it hard to determine where ‘networking’ is happening. Having networking in several layers presents an issue when needing to troubleshoot issues as one needs to check several areas and some cannot be done via kubectl logs such as the CNI execution logs. This becomes more of an effort as multiple uncoordinated processes are making changes to the same resource, the network namespace of either the root or pod. The KNI aims at reducing the complexity by consolidating the networking into a single layer and having a uniform process for both namespaced and kernel isolated pods through a gRPC API. Leveraging gRPC will allow users the ability to migrate away from the current execution model that the CNI currently leverages.

The next challenge is the deployment of the CNI plugins that provide the network setup and teardown of the pod. The idiomatic way to deploy a workload in Kubernetes is that everything should be in a pod; however the current approach leaves files in the host filesystem such as CNI binaries and CNI configuration. These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state. Since all existing K8s network plugins are running as daemonsets we will take this approach as well, where all the dependencies are packaged into the container image thus adopting a well known approach. This will have added benefits of the network pod startup being much faster as nothing should need to be downloaded.
@bleggett commented Feb 22, 2024

A key part of what current CNI provides is a framework for different vendors to independently add extensions (CNI plugins) to a node, and have guarantees about when in the lifecycle of pod creation those are invoked, and what context they have from other plugins.

There may be multiple daemonsets from multiple vendors installing multiple CNI plugins on the same node or the node may come from a cloud provider with a CNI already installed and some other vendor might want to chain another extension onto that - any model we adopt must reckon with this as a first-class concern.

That, for me, is critical to retain for this to be a 1-1 replacement for the existing CNI - we can probably do something simpler than the current model of "kubelet talks to CRI, CRI impl reads shared node config with plugin list, serially execs the plugins as arbitrary privileged binaries", as well.

At the very least, moving that "list of extensions + CNI config" to an etcd-managed kube resource would fix currently non-fixable TOCTOU bugs we have in Istio, for instance: istio/istio#46848

At a minimum, it's important for me that any CNI/KNI extension support meets the following basic criteria:

  • I am able to validate that KNI is "ready" on a node
  • I am able to subscribe or register an extension with KNI from an in-cluster daemonset, and have guarantees TOCTOU errors will not silently unregister my extension.
  • I have a guarantee that if the above things are true, my extension will be invoked for every pod creation after that point, and that if my extension (or any other extension) fails during invocation, the pod will not be scheduled on the node.
  • I am able to get things like interfaces, netns paths, and assigned IPs for that pod, as context for my invocation.

Ideally as a "this is a good enough replacement" test, I would want to see current CNI plugins from a few different projects implemented as KNI extensions, and installed/managed on the same node. If we can do that, we are effectively proving this can be a well-scoped replacement for the status quo.
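
To make the "etcd-managed kube resource" idea above concrete, a registration object could look something like the sketch below. This is purely hypothetical; neither the KEP nor KNI defines such an API today, and every name here is invented:

apiVersion: kni.example.io/v1alpha1      # invented group/version
kind: NodeNetworkExtension
metadata:
  name: istio-cni
spec:
  nodeSelector:
    kubernetes.io/os: linux
  priority: 100                          # ordering relative to other extensions
  endpoint: unix:///var/run/istio-cni/kni.sock
  failurePolicy: Fail                    # pod sandbox creation fails if this extension fails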

@MikeZappa87
Author

For most cases, the init container downloads the primary plugin aka flannel, cilium however when they leverage the community plugins they are downloaded. This usually happens at a different time though usually container runtime install. containerd install script does this.

No, they copy their primary plugin and the community plugins they need from their init-container (which usually uses the same image as the "main" one, but with different start-up). The only community plugins that is required on start is "loopback".

Examples what cni-plugins install beside their primary plugins:

  • Cilium: doesn't install, or need, any community plugins.
  • Kindnet: bridge, host-local, ptp, portmap
  • Calico: bandwidth host-local portmap tuning and (to my suprise) flannel
  • Antrea: bandwidth portmap
  • Flannel: doesn't install any community plugins, but hangs forever if "portmap" isn't there (a bug)

But none of them downloads anything on installation. I set ip ro replace default dev lo to make sure.

But in any case, this is misleading:

These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

It should be something like:

It would save a second or so on node reboot if cni-plugin init-containers wouldn't have to copy plugins to the host fs.

I believe flannel requires the bridge plugin; I just installed it to make sure I wasn't crazy. From the performance testing I have done, KNI is faster, taking a consistent 1 second vs 9-23 seconds for the network pod setup. The pod network setup is faster as well.

@MikeZappa87
Author

I discovered that you can add sequence diagrams in github with mermaid, and I really like sequence diagrams 😄

I have created some really simplified ones that describe network setup with and without KNI.

And, I know that KNI can do more than call a CNI-plugin, so please don't start a discussion on that. It's just an example and it must be there for backward compatibility.

I have not worked with a CRI-plugin, so I may have got the interaction with the OCI runtime (runc, crun) all wrong.

Current network setup

sequenceDiagram
participant api as API server
participant kubelet as Kubelet
participant cri as CRI plugin
participant oci as OCI runtime
participant cni as CNI plugin
api->>kubelet: Notify POD update
kubelet->>cri: Create POD
activate cri
cri->>oci: Create sandbox
activate oci
oci-->>cri: response
deactivate oci
cri->>cni: ADD cmd (exec with files)
activate cni
cni-->>cri: response
deactivate cni
cri->>oci: Create containers
activate oci
oci-->>cri: response
deactivate oci
cri-->>kubelet: response
deactivate cri
kubelet->>api: Update status

Network setup with KNI

sequenceDiagram
participant api as API server
participant kubelet as Kubelet
participant cri as CRI plugin
participant oci as OCI runtime
participant kni as KNI agent
participant cni as CNI plugin
api->>kubelet: Notify POD update
kubelet->>cri: Create POD
activate cri
cri->>oci: Create sandbox
activate oci
oci-->>cri: response
deactivate oci
cri-->>kubelet: response
deactivate cri
kubelet->>kni: Setup network (gRPC)
activate kni
kni-->>api: read objects
kni->>cni: ADD cmd (exec w stdin)
activate cni
cni-->>kni: response
deactivate cni
kni-->>kubelet: response
deactivate kni
kubelet->>cri: Create containers
activate cri
cri->>oci: Create containers
activate oci
oci-->>cri: response
deactivate oci
cri-->>kubelet: response
deactivate cri
kubelet->>api: Update status

We actually have a backwards-compatible model, with a roadmap to deploy side by side and then with libkni. However, your diagram is pretty much spot on. If you want, I can set up some time with you to go over the migration stories with demos.


Kubernetes networking is an area of complexity and multiple layers which has created several challenges and areas of improvement. These challenges include deployment of the CNI plugins, troubleshooting networking issues and development of new functionality.

Currently networking happens in three layers of the stack, Kubernetes itself by means of kube-proxy or another controller based solution, the container runtime with network namespace creation and CNI plugins and the OCI runtime which does additional network setup for kernel isolated pods. All of this communication happens through non network specific APIs which to the reader of the code makes it hard to determine where ‘networking’ is happening. Having networking in several layers presents an issue when needing to troubleshoot issues as one needs to check several areas and some cannot be done via kubectl logs such as the CNI execution logs. This becomes more of an effort as multiple uncoordinated processes are making changes to the same resource, the network namespace of either the root or pod. The KNI aims at reducing the complexity by consolidating the networking into a single layer and having a uniform process for both namespaced and kernel isolated pods through a gRPC API. Leveraging gRPC will allow users the ability to migrate away from the current execution model that the CNI currently leverages.


What is meant by namespaced AND kernel-isolated pods? Isn't that the same thing?

Author

What I was trying to do here is separate the virtualized OCI runtimes from the non-virtualized ones, i.e. kata vs runc. However, we just got off a call with KubeVirt, where you could use either. Both will leverage network namespaces, but the virtualized cases have the additional kernel isolation.

@bleggett commented Feb 22, 2024

If both will use (at least) network namespacing it's probably less confusing to just say that (and provide more detail later in the doc for other cases that use addl isolation, if need be).


Currently networking happens in three layers of the stack, Kubernetes itself by means of kube-proxy or another controller based solution, the container runtime with network namespace creation and CNI plugins and the OCI runtime which does additional network setup for kernel isolated pods. All of this communication happens through non network specific APIs which to the reader of the code makes it hard to determine where ‘networking’ is happening. Having networking in several layers presents an issue when needing to troubleshoot issues as one needs to check several areas and some cannot be done via kubectl logs such as the CNI execution logs. This becomes more of an effort as multiple uncoordinated processes are making changes to the same resource, the network namespace of either the root or pod. The KNI aims at reducing the complexity by consolidating the networking into a single layer and having a uniform process for both namespaced and kernel isolated pods through a gRPC API. Leveraging gRPC will allow users the ability to migrate away from the current execution model that the CNI currently leverages.

The next challenge is the deployment of the CNI plugins that provide the network setup and teardown of the pod. The idiomatic way to deploy a workload in Kubernetes is that everything should be in a pod; however the current approach leaves files in the host filesystem such as CNI binaries and CNI configuration. These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state. Since all existing K8s network plugins are running as daemonsets we will take this approach as well, where all the dependencies are packaged into the container image thus adopting a well known approach. This will have added benefits of the network pod startup being much faster as nothing should need to be downloaded.


These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

This is confusing me a bit - typically the CNI config is something workload pods are not aware of at all, via init containers or otherwise - only the CRI implementation handles them?

Author

The pod I was referring to was the network plugin daemonset pod (flannel, calico, ...). I can try to clean this up to be clearer.


Ah ok thanks, yeah - "node agent daemonset" or "privileged node-bound pod" or whatever. Something general and consistent, since I don't think it has to be an init container specifically, in the Pod sense.


Please see also #4477 (comment) and the follow-up comments.

@MikeZappa87
Author

Can you squash the commits? I find it hard to comment lines in the KEP, because I don't know which commit to use?

So I include quotes instead below.

the container runtime with network namespace creation and CNI plugins and the OCI runtime which does additional network setup for kernel isolated pods.

Can you give examples? I am pretty well aware of how things work (or so I thought), and my interpretation is:

  • "container runtime" examples crio, containerd
  • CNI-plugin, examples Cilium, Calico
  • OCI runtime, examples runc, crun

Is that correct?

And what is "kernel isolated pods"? That I don't know. Is it kata containers for instance (VMs)?

I was referring to kernel-isolated pods as pods that leverage kata or kubevirt. The additional network setup happens after the CNI ADD for both the kata and kubevirt use cases. This is done in a couple of ways; for Kata, the setup happens through the execution of the kata-runtime via containerd/cri-o. In containerd it's via StartTask, so it's not clear that additional networking is happening.

I can squash the commits. It might also make sense to move the contents to a Google doc?


Currently networking happens in three layers of the stack, Kubernetes itself by means of kube-proxy or another controller based solution, the container runtime with network namespace creation and CNI plugins and the OCI runtime which does additional network setup for kernel isolated pods. All of this communication happens through non network specific APIs which to the reader of the code makes it hard to determine where ‘networking’ is happening. Having networking in several layers presents an issue when needing to troubleshoot issues as one needs to check several areas and some cannot be done via kubectl logs such as the CNI execution logs. This becomes more of an effort as multiple uncoordinated processes are making changes to the same resource, the network namespace of either the root or pod. The KNI aims at reducing the complexity by consolidating the networking into a single layer and having a uniform process for both namespaced and kernel isolated pods through a gRPC API. Leveraging gRPC will allow users the ability to migrate away from the current execution model that the CNI currently leverages.

The next challenge is the deployment of the CNI plugins that provide the network setup and teardown of the pod. The idiomatic way to deploy a workload in Kubernetes is that everything should be in a pod; however the current approach leaves files in the host filesystem such as CNI binaries and CNI configuration. These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state. Since all existing K8s network plugins are running as daemonsets we will take this approach as well, where all the dependencies are packaged into the container image thus adopting a well known approach. This will have added benefits of the network pod startup being much faster as nothing should need to be downloaded.
Member

Since all existing K8s network plugins are running as daemonsets we will take this approach as well [...]

This is definitely not true ...?
CNI plugin binaries can be pre-installed to the host and run entirely on the host without a daemonset and there are major cluster operators working this way.

Through CRI, even plumbing the node IPAM can be done without any pod.
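
For example, containerd's CRI plugin can render the CNI config from a golang template using the node's pod CIDR, so no pod is involved in plumbing it at all. A rough sketch (the directory paths are the defaults, the template path is illustrative):

# /etc/containerd/config.toml excerpt (containerd 1.x CRI plugin)
[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/opt/cni/bin"
  conf_dir = "/etc/cni/net.d"
  # golang template rendered with the node's PodCIDR values
  conf_template = "/etc/containerd/cni-template.conflist"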

@bleggett commented Feb 22, 2024

CNI plugin binaries can be pre-installed to the host and run entirely on the host without a daemonset and there are major cluster operators working this way.

Yep, some environments bake the nodes and don't let you change them (and the corollary of that is you can't use other CNIs or meshes). Some do.

Either way, the current gate is "you need node privilege to install CNI plugins" - some envs have K8S policies that allow pods to assume node-level privileges, some do not and prebake everything. That k8s-based gate should be maintained for KNI, so either option works, depending on operator/environment need.

I don't think there's a world in which we can effectively separate "able to add new networking extensions" from "require privileged access to the node". That's worth considering tho.

If extensions are driven by K8S resources it makes policy enforcement via operator-managed admission hooks potentially a bit easier, I guess.

Member

Either way, the current gate is "you need node privilege to install CNI plugins" [...]

I would expect that to be true, given the position they fill in the stack (!)

Author

How you deploy the CNI plugins can be opinionated. However, the piece that is important is that you no longer need to have any files in the host filesystem such as CNI binaries or CNI configuration files.

Member

However, the piece that is important is that you no longer need to have any files in the host filesystem such as CNI binaries or CNI configuration files.

But why is that important? I mean taken to an extreme ... You cannot run Kubernetes without files on the host filesystem anyhow, and configuring the pod network is an incredibly privileged place to operate.

@bleggett commented Feb 22, 2024

But why is that important? I mean taken to an extreme ... You cannot run Kubernetes without files on the host filesystem anyhow, and configuring the pod network is an incredibly privileged place to operate.

It's important because allowing privileged pods to run on a node (where privileged necessarily means "can mutate node state", networking or otherwise) is an operator choice today, and it seems wrong to take that choice away from operators.

This is how, for instance, it is possible to install Cilium in AWS today. Or Calico in GKE. Or Openshift, etc etc.

Anyone that doesn't want to allow privileged pods on a node can already choose to leverage existing K8S APIs to preclude that as a matter of operational policy - it's rather orthogonal to CNI.

@BenTheElder left a comment

Wouldn't it be "less layers" to just put opinionated networking in containerd/cri-o?

CNI enables extensibility, and we've got a large and healthy ecosystem around it.

This feels like a big technical-debt inducing, disruptive undertaking in search of a concrete need.

I don't see any discussion of goals that cannot be accomplished by improving the current tools, except some non-specific hand-waving about reducing layers.

There also seem to be some bad assumptions, e.g. that "all plugins are populated by daemonset" (and therefore network readiness is blocked by pulling / starting this pod) which is not true, e.g. off the top of my head: GKE does not do this, kind does not do this (you can pre-populate plugin binaries / config or containerd CNI config template with very fast node network readiness), and then KNI is suggested to ... run as a container image, which will ... block network readiness on pulling a container...?

At the very least, I'd suggest spelling out the use-cases and why they cannot be accomplished without a new RPC service.


Another area that KNI will improve is ‘network readiness’. Currently the container runtime is involved with providing both network and runtime readiness with the Status CRI RPC. The container runtime defines the network as ‘ready’ by the presence of the CNI network configuration in the host file system. The more recent CNI specification does include a status verb, however this is still bound by the current limitations, files on disk and execution model. The KNI will provide an RPC that can be implemented so the kubelet will call the KNI via gRPC.

KNI aims to help the community and other proposals in the Kubernetes ecosystem. We will do this by providing necessary information via the gRPC service. We should be the API that provides the “what networks are available on this node” so that another effort can make the kube-scheduler aware of networks. We should also provide IPAM status as a common issue, is that the IPAM runs out of assignable IP addresses and pods are no longer able to be scheduled on that node until intervention. We should provide visibility into this so that we can indicate “no more pods” as setting the node to not ready will evict the healthy pods. While the future state of KNI could aim to propose changes to the kube-scheduler, it's not a part of our initial work and instead should try to assist other efforts such as DRA/device plugin to provide the information they need.
Member

We should also provide IPAM status as a common issue, is that the IPAM runs out of assignable IP addresses and pods are no longer able to be scheduled on that node until intervention.

Typically this is handled by coordinating the assigned IP range and the max pods setting in the kubelet.
Cluster operators already have the tools to prevent this issue; why would you allow a node to be configured to have more pods than IPs?


KNI aims to help the community and other proposals in the Kubernetes ecosystem. We will do this by providing necessary information via the gRPC service. We should be the API that provides the “what networks are available on this node” so that another effort can make the kube-scheduler aware of networks. We should also provide IPAM status as a common issue, is that the IPAM runs out of assignable IP addresses and pods are no longer able to be scheduled on that node until intervention. We should provide visibility into this so that we can indicate “no more pods” as setting the node to not ready will evict the healthy pods. While the future state of KNI could aim to propose changes to the kube-scheduler, it's not a part of our initial work and instead should try to assist other efforts such as DRA/device plugin to provide the information they need.

The community may ask for more features, as we are taking a bold approach to reimagining Kubernetes networking by reducing the amount of layers involved in networking. We should prioritize feature parity with the current CNI model and then capture future work. KNI aims to be the foundational network api that is specific for Kubernetes and should make troubleshooting easier, deploying more friendly and innovate faster while reducing the need to make changes to core Kubernetes.
Member

The community may ask for more features, as we are taking a bold approach to reimagining Kubernetes networking by reducing the amount of layers involved in networking.

Reducing the layers ... by adding a new GRPC service? Why isn't this CRI? (An existing Kubernetes-Specific gRPC service for Pods, which includes some networking related bits today ...)

Author

Adding this to the CRI-API is a design detail. The only piece of information relevant to networking is the PodIP coming back in PodSandboxStatus. This talks about eliminating the network setup/teardown and netns creation from the container/oci runtimes.

Member

This talks about eliminating the network setup/teardown and netns creation from the container/oci runtimes.

Has this been a significant obstacle for implementers? Examples?

What blocks a CNI revision from handling netns/teardown?


Since the network runtime can be run separated from the container runtime, you can package everything into a pod and not need to have binaries on disk. This allows the CNI plugins to be isolated in the pod and the pod will never need to mount /opt/cni/bin or /etc/cni/net.d. This offers a potentially more ability to control execution. Keep in mind CNI is the implementation however when this is used chaining is still available.

## Ongoing Considerations
Member

There doesn't seem to be any discussion of how we might improve CNI or CRI instead, and why that isn't sufficient versus this entirely new API and RPC service.

Author

This could live in the CRI-API as multiple services; no one has indicated that we must use a new API. However, CNI 2.0 being closer to K8s has been talked about for years now. This is that effort.

@bleggett commented Mar 4, 2024

I do agree we could probably be more clear up front in the KEP about why the current CNI model (slurping out-of-k8s config files from well-known paths, TOCTOU errors, telling CRI implementations via out-of-band mechanisms to exec random binaries on the node by-proxy) is something we could explicitly improve on with KNI, and that KNI is basically "CNI 2.0" - it is the proposed improvement to current CNI.


### User Stories

We are constantly adding these user stories, please join the community sync to discuss.
Member

We are constantly adding these user stories [...]

Where?



### Goals

- Design a cool looking t-shirt
- Provide a RPC for the Attachment and Detachment of interface[s] for a Pod
Member

This seems like a significant departure from current expectations around pod networking.

Author

Can you clarify this? We already do something similar with CNI ADD/DEL, just via an execution model. This is leveraging gRPC to communicate with the gRPC server, which would be flannel, calico, cilium.
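
For discussion, a KNI attach/detach service could be shaped roughly like the sketch below. This is only an illustration; the KEP has not defined any proto yet, and every name here is invented:

syntax = "proto3";
package kni.v1alpha1;   // invented package name

service KNI {
  rpc AttachNetwork(AttachNetworkRequest) returns (AttachNetworkResponse);      // analogous to CNI ADD
  rpc DetachNetwork(DetachNetworkRequest) returns (DetachNetworkResponse);      // analogous to CNI DEL
  rpc QueryPodNetwork(QueryPodNetworkRequest) returns (QueryPodNetworkResponse);
}

message AttachNetworkRequest {
  string pod_name = 1;
  string pod_namespace = 2;
  string netns_path = 3;                  // network namespace created by the runtime
  map<string, string> annotations = 4;
}

message AttachNetworkResponse {
  repeated string interfaces = 1;
  repeated string ip_addresses = 2;
}

message DetachNetworkRequest {
  string pod_name = 1;
  string pod_namespace = 2;
  string netns_path = 3;
}

message DetachNetworkResponse {}

message QueryPodNetworkRequest {
  string pod_name = 1;
  string pod_namespace = 2;
}

message QueryPodNetworkResponse {
  repeated string interfaces = 1;
  repeated string ip_addresses = 2;
  repeated string routes = 3;
}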

- Design a cool looking t-shirt
- Provide a RPC for the Attachment and Detachment of interface[s] for a Pod
- Provide a RPC for the Querying of Pod network information (interfaces, network namespace path, ip addresses, routes, ...)
- Provide a RPC to prevent additional scheduling of pods if IPAM is out of IP addresses without evicting running pods
Member

Surely this is just reporting status up through CRI?

Author

This is the last place I'll mention it: the decision around CRI vs a new API is a design detail.

approvers:

see-also:
- "/keps/sig-aaa/1234-we-heard-you-like-keps"
Member

delete these, or update with relevant keps?

same below for replaces

milestone:
alpha: "v1.31"
beta: "v1.32"
stable: "v1.33"
Member

This seems unlikely, without even defining an API yet?

- "@shaneutt"
owning-sig: sig-network
participating-sigs:
- sig-network
Member

Surely at least SIG Node should be participating (there's no way this doesn't affect kubelet, CRI)..?

I would also tag Cluster Lifecycle at least as FYI / advisory since cluster lifecycle folks will know about and have suggestions re: node readiness and cluster configuration.

@BenTheElder
Member

If use cases and design are still being collected, a working group may be more appropriate than a KEP.

@MikeZappa87
Author

If use cases and design are still being collected, a working group may be more appropriate than a KEP.

Let's sync up on Slack and schedule some time to discuss.

@uablrek

uablrek commented Feb 23, 2024

Are there any serious KNI use-cases that don't include multi-networking? KNI doesn't enable multiple interfaces, but I think that something like KNI is a prerequisite for K8s multi-net. But to independently motivate both KNI and K8s-multi-net with multi-networking use-cases is very confusing. I hope Antonio's workshop at KubeCon will sort this out (great initiative! But I can't attend myself unfortunately).

But the comment referred to above must be considered if multiple interfaces are handled via KNI: who cleans up when a POD is deleted?

Obviously, someone who knows what interfaces are in the POD. So if something other than the kubelet calls the KNI to add various interfaces, the kubelet would be unaware of them and there is a problem.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jun 2, 2024