- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Add a new field `spec.internalTrafficPolicy` to Service that allows node-local and topology-aware routing for Service traffic.
Internal traffic routed to a Service has always been randomly distributed to all endpoints. This KEP proposes a new API in Service to address use-cases such as node-local and topology aware routing for internal Service traffic.
- Allow internal Service traffic to be routed to node-local or topology-aware endpoints.
- Default behavior for internal Service traffic should not change.
- Topology aware routing for zone/region topologies -- while this field enables this feature, this KEP only covers node-local routing. See the Topology Aware Hints KEP for more details.
Introduce a new field in Service: `spec.internalTrafficPolicy`. The field will have 2 codified values:
- Cluster (default): route to all cluster-wide endpoints (or use topology aware subsetting if enabled).
- Local: only route to node-local endpoints, drop otherwise.
A feature gate `ServiceInternalTrafficPolicy` will also be introduced for this feature.
The `internalTrafficPolicy` field cannot be set on a Service during the alpha stage unless the feature gate is enabled.
During the Beta stage, the feature gate will be on by default.
The `internalTrafficPolicy` field will not apply to headless Services or Services of type `ExternalName`.
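To make the defaulting and applicability rules above concrete, here is a minimal sketch, assuming a trimmed-down spec type; `defaultInternalTrafficPolicy` and `serviceSpec` are hypothetical names, not the actual apiserver code, and only the `ExternalName` exclusion is shown:

```go
package main

import "fmt"

type ServiceInternalTrafficPolicyType string

const (
	ServiceInternalTrafficPolicyCluster ServiceInternalTrafficPolicyType = "Cluster"
	ServiceInternalTrafficPolicyLocal   ServiceInternalTrafficPolicyType = "Local"
)

// serviceSpec is a trimmed-down stand-in for the real ServiceSpec.
type serviceSpec struct {
	Type                  string // e.g. "ClusterIP" or "ExternalName"
	InternalTrafficPolicy *ServiceInternalTrafficPolicyType
}

// defaultInternalTrafficPolicy (hypothetical) leaves ExternalName Services
// alone, since the field does not apply to them, and defaults everything
// else to "Cluster" so that existing behavior is unchanged.
func defaultInternalTrafficPolicy(spec *serviceSpec) {
	if spec.Type == "ExternalName" {
		return
	}
	if spec.InternalTrafficPolicy == nil {
		cluster := ServiceInternalTrafficPolicyCluster
		spec.InternalTrafficPolicy = &cluster
	}
}

func main() {
	spec := &serviceSpec{Type: "ClusterIP"}
	defaultInternalTrafficPolicy(spec)
	fmt.Println(*spec.InternalTrafficPolicy) // prints "Cluster"
}
```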
As a platform owner, I want to create a Service that always directs traffic to a logging daemon or metrics agent on the same node. Traffic should never bounce to a daemon on another node since the logs would then report an incorrect log source.
- When the `Local` policy is set, it is the user's responsibility to ensure node-local endpoints are ready; otherwise traffic will be dropped.
Proposed addition to core v1 API:
type ServiceInternalTrafficPolicyType string
const (
ServiceInternalTrafficPolicyCluster ServiceInternalTrafficPolicyType = "Cluster"
ServiceInternalTrafficPolicyLocal ServiceInternalTrafficPolicyType = "Local"
)
// ServiceSpec describes the attributes that a user creates on a service.
type ServiceSpec struct {
...
...
// InternalTrafficPolicy specifies if the cluster internal traffic
// should be routed to all endpoints or node-local endpoints only.
// "Cluster" routes internal traffic to a Service to all endpoints.
// "Local" routes traffic to node-local endpoints only, traffic is
// dropped if no node-local endpoints are ready.
// The default value is "Cluster".
// +featureGate=ServiceInternalTrafficPolicy
// +optional
InternalTrafficPolicy *ServiceInternalTrafficPolicyType `json:"internalTrafficPolicy,omitempty" protobuf:"bytes,22,opt,name=internalTrafficPolicy"`
}
This field will be independent of `externalTrafficPolicy`. In other words, `internalTrafficPolicy` only applies to traffic originating from internal sources.
Proposed changes to kube-proxy:
- when `internalTrafficPolicy=Cluster`, default to the existing behavior today.
- when `internalTrafficPolicy=Local`, route to endpoints in EndpointSlice that match the local node's topology, and drop traffic if none exist (sketched below).
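The following is a rough sketch of that endpoint-selection step, assuming kube-proxy already knows its node name and the endpoint list from EndpointSlice; the `endpoint` type and the `filterLocalEndpoints` helper are illustrative, not the actual kube-proxy code:

```go
package main

import "fmt"

// endpoint is a simplified stand-in for an EndpointSlice endpoint.
type endpoint struct {
	ip       string
	nodeName string
}

// filterLocalEndpoints keeps only endpoints on the local node when the
// Service uses internalTrafficPolicy=Local; for Cluster (the default) the
// full endpoint list is used, preserving existing behavior. An empty result
// for Local means internal traffic to the Service is dropped on this node.
func filterLocalEndpoints(policy string, localNode string, eps []endpoint) []endpoint {
	if policy != "Local" {
		return eps
	}
	var local []endpoint
	for _, ep := range eps {
		if ep.nodeName == localNode {
			local = append(local, ep)
		}
	}
	return local
}

func main() {
	eps := []endpoint{
		{ip: "10.244.1.4", nodeName: "kind-worker2"},
		{ip: "10.244.2.3", nodeName: "kind-worker"},
	}
	fmt.Println(filterLocalEndpoints("Local", "kind-worker2", eps)) // only 10.244.1.4
}
```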
Overlap with topology-aware routing:

| ExternalTrafficPolicy | InternalTrafficPolicy | Topology | External Result | Internal Result |
|---|---|---|---|---|
| - | - | Auto | Topology | Topology |
| Local | - | Auto | Local | Topology |
| Local | Local | Auto | Local | Local |
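Read row by row, the table amounts to a simple precedence rule, sketched below under the assumption that "Auto" means topology-aware subsetting is enabled; the function and its names are illustrative only:

```go
package main

import "fmt"

// routeResult expresses the table above: external traffic consults
// externalTrafficPolicy and internal traffic consults internalTrafficPolicy;
// "Local" restricts routing to node-local endpoints, and topology-aware
// subsetting applies only when the relevant policy is left at its default.
func routeResult(externalTraffic bool, externalPolicy, internalPolicy string, topologyAuto bool) string {
	policy := internalPolicy
	if externalTraffic {
		policy = externalPolicy
	}
	switch {
	case policy == "Local":
		return "Local"
	case topologyAuto:
		return "Topology"
	default:
		return "Cluster"
	}
}

func main() {
	// Second row of the table: externalTrafficPolicy=Local, internal policy left at its default.
	fmt.Println(routeResult(true, "Local", "", true))  // Local
	fmt.Println(routeResult(false, "Local", "", true)) // Topology
}
```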
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
- `pkg/registry/core/service`: v1.22 - API strategy tests for feature enablement and upgrade safety (dropping disabled fields)
- `pkg/apis/core/v1`: v1.22 - API defaulting tests
- `pkg/proxy/iptables`: v1.22 - iptables rules tests + feature enablement
- `pkg/proxy/ipvs`: v1.22 - IPVS rules tests + feature enablement
- `pkg/proxy`: v1.22 - generic kube-proxy Service tests
NOTE: search ServiceInternalTrafficPolicy in the Kubernetes repo for references to existing tests.
- `Test_ExternalNameServiceStopsDefaultingInternalTrafficPolicy`: https://github.com/kubernetes/kubernetes/blob/61b983a66b92142e454c655eb2add866c9b327b0/test/integration/service/service_test.go#L34
- `Test_ExternalNameServiceDropsInternalTrafficPolicy`: https://github.com/kubernetes/kubernetes/blob/61b983a66b92142e454c655eb2add866c9b327b0/test/integration/service/service_test.go#L78
- `Test_ConvertingToExternalNameServiceDropsInternalTrafficPolicy`: https://github.com/kubernetes/kubernetes/blob/61b983a66b92142e454c655eb2add866c9b327b0/test/integration/service/service_test.go#L125
- `should respect internalTrafficPolicy=Local Pod to Pod`: https://github.com/kubernetes/kubernetes/blob/4bc1398c0834a63370952702eef24d5e74c736f6/test/e2e/network/service.go#L2520
- `should respect internalTrafficPolicy=Local Pod (hostNetwork: true) to Pod`: https://github.com/kubernetes/kubernetes/blob/4bc1398c0834a63370952702eef24d5e74c736f6/test/e2e/network/service.go#L2598
- `should respect internalTrafficPolicy=Local Pod and Node, to Pod (hostNetwork: true)`: https://github.com/kubernetes/kubernetes/blob/4bc1398c0834a63370952702eef24d5e74c736f6/test/e2e/network/service.go#L2678
Alpha:
- feature gate `ServiceInternalTrafficPolicy` must be enabled for the apiserver to accept values for `spec.internalTrafficPolicy`. Otherwise the field is dropped.
- kube-proxy handles traffic routing for the 2 initial internal traffic policies, `Cluster` and `Local`.
- Unit tests as defined in the "Test Plan" section above. E2E tests are nice to have but not required for Alpha.
Beta:
- integration tests exercising API behavior for the `spec.internalTrafficPolicy` field of Service.
- e2e tests exercising kube-proxy routing when `internalTrafficPolicy` is `Local`.
- feature gate `ServiceInternalTrafficPolicy` is enabled by default.
- consensus on how internalTrafficPolicy overlaps with topology-aware routing.
GA:
- metrics for total number of Services that have no endpoints (kubeproxy/sync_proxy_rules_no_endpoints_total) with additional labels for internal/external and local/cluster policies.
- Fix a bug where internalTrafficPolicy=Local would force externalTrafficPolicy=Local (kubernetes/kubernetes#106497).
- Sufficient integration/e2e tests (many were already added for Beta, but we'll want to revisit tests based on changes that landed during Beta).
- The `internalTrafficPolicy` field will be off by default during the alpha stage, but the apiserver can still handle existing Services that have the field already set. This ensures n-1 apiservers can handle the new field on downgrade.
- On upgrade, if the feature gate is enabled there should be no change in behavior, since the default value for `internalTrafficPolicy` is `Cluster`.
Since this feature will be alpha for at least 1 release, an n-1 kube-proxy should handle enablement of this feature if a new apiserver enabled it.
This section must be completed when targeting alpha to a release.
- How can this feature be enabled / disabled in a live cluster?
  - Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name: `ServiceInternalTrafficPolicy`
    - Components depending on the feature gate: kube-apiserver, kube-proxy
- Does enabling the feature change any default behavior?
No, enabling the feature does not change any default behavior, since the default value of `internalTrafficPolicy` is `Cluster`.
- Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, the feature gate can be disabled, but Service resources that have set the new field will persist that field unless it is unset by the user.
- What happens if we reenable the feature if it was previously rolled back?
New Services should be able to set the `internalTrafficPolicy` field. Existing Services that have the field set will begin to apply the policy again.
- Are there any tests for feature enablement/disablement?
There will be unit tests to verify that the apiserver drops the field when the `ServiceInternalTrafficPolicy` feature gate is disabled (a sketch of this behavior follows the list of tests below).
Tests added so far:
- https://github.com/kubernetes/kubernetes/blob/0038bcfad495a0458372867a77c8ca646f361c40/pkg/registry/core/service/strategy_test.go#L368-L390
- https://github.com/kubernetes/kubernetes/blob/0038bcfad495a0458372867a77c8ca646f361c40/pkg/registry/core/service/storage/storage_test.go#L682-L684
- https://github.com/kubernetes/kubernetes/blob/0038bcfad495a0458372867a77c8ca646f361c40/test/integration/service/service_test.go
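As a rough illustration of the field-dropping behavior these tests cover, here is a minimal sketch of the registry-strategy pattern; `dropDisabledInternalTrafficPolicy` and the trimmed `serviceSpec` type are hypothetical stand-ins, not the actual `pkg/registry/core/service` code:

```go
package main

import "fmt"

// serviceSpec is a trimmed stand-in for the real ServiceSpec.
type serviceSpec struct {
	InternalTrafficPolicy *string
}

// dropDisabledInternalTrafficPolicy clears the field on incoming objects when
// the ServiceInternalTrafficPolicy gate is off, unless the existing object
// already uses it. Preserving an already-set field is what keeps the data
// safe across downgrades and n-1 apiservers, as described earlier in this KEP.
func dropDisabledInternalTrafficPolicy(newSpec, oldSpec *serviceSpec, gateEnabled bool) {
	if gateEnabled {
		return
	}
	if oldSpec != nil && oldSpec.InternalTrafficPolicy != nil {
		return
	}
	newSpec.InternalTrafficPolicy = nil
}

func main() {
	local := "Local"
	// Gate off and the field was not previously in use: the value is dropped.
	newSpec := &serviceSpec{InternalTrafficPolicy: &local}
	dropDisabledInternalTrafficPolicy(newSpec, nil, false)
	fmt.Println(newSpec.InternalTrafficPolicy == nil) // true
}
```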
This section must be completed when targeting beta graduation to a release.
- How can a rollout fail? Can it impact already running workloads?
Rollout should have minimal impact because the default value of `internalTrafficPolicy` is `Cluster`, which is the default behavior today.
- What specific metrics should inform a rollback?
The `sync_proxy_rules_no_local_endpoints_total` metric will inform users whether a Service is dropping local traffic.
- Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
API Rollback was manually validated using the following steps:
Create a kind cluster:
$ kind create cluster
Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:v1.25.0) 🖼
✓ Preparing nodes 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:
kubectl cluster-info --context kind-kind
Thanks for using kind! 😊
Check that Services have `spec.internalTrafficPolicy` set:
$ kubectl -n kube-system get svc kube-dns -o yaml
apiVersion: v1
kind: Service
metadata:
annotations:
prometheus.io/port: "9153"
prometheus.io/scrape: "true"
creationTimestamp: "2022-09-27T14:03:46Z"
labels:
k8s-app: kube-dns
kubernetes.io/cluster-service: "true"
kubernetes.io/name: CoreDNS
name: kube-dns
namespace: kube-system
resourceVersion: "219"
uid: 9b455b45-ae9f-43d1-98ca-f275b805ab95
spec:
clusterIP: 10.96.0.10
clusterIPs:
- 10.96.0.10
internalTrafficPolicy: Cluster
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
ports:
- name: dns
port: 53
protocol: UDP
targetPort: 53
- name: dns-tcp
port: 53
protocol: TCP
targetPort: 53
- name: metrics
port: 9153
protocol: TCP
targetPort: 9153
selector:
k8s-app: kube-dns
sessionAffinity: None
type: ClusterIP
status:
loadBalancer: {}
Turn off the `ServiceInternalTrafficPolicy` feature gate in `kube-apiserver`:
$ docker exec -ti kind-control-plane bash
$ vim /etc/kubernetes/manifests/kube-apiserver.yaml
# append --feature-gates=ServiceInternalTrafficPolicy=false to kube-apiserver flags
Validate that the Service still preserves the field. This is expected for backwards compatibility reasons:
$ kubectl -n kube-system get svc kube-dns -o yaml
apiVersion: v1
kind: Service
metadata:
annotations:
prometheus.io/port: "9153"
prometheus.io/scrape: "true"
creationTimestamp: "2022-09-27T14:03:46Z"
labels:
k8s-app: kube-dns
kubernetes.io/cluster-service: "true"
kubernetes.io/name: CoreDNS
name: kube-dns
namespace: kube-system
resourceVersion: "219"
uid: 9b455b45-ae9f-43d1-98ca-f275b805ab95
spec:
clusterIP: 10.96.0.10
clusterIPs:
- 10.96.0.10
internalTrafficPolicy: Cluster
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
ports:
- name: dns
port: 53
protocol: UDP
targetPort: 53
- name: dns-tcp
port: 53
protocol: TCP
targetPort: 53
- name: metrics
port: 9153
protocol: TCP
targetPort: 9153
selector:
k8s-app: kube-dns
sessionAffinity: None
type: ClusterIP
status:
loadBalancer: {}
Validate that new Services do not have the field anymore:
$ cat service.yaml
---
apiVersion: v1
kind: Service
metadata:
name: nginx
labels:
app: nginx
spec:
selector:
app: nginx
ports:
- port: 80
protocol: TCP
$ kubectl apply -f service.yaml
service/nginx created
$ kubectl get svc nginx -o yaml
apiVersion: v1
kind: Service
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"nginx"},"name":"nginx","namespace":"default"},"spec":{"ports":[{"port":80,"protocol":"TCP"}],"selector":{"app":"nginx"}}}
creationTimestamp: "2022-09-27T14:10:51Z"
labels:
app: nginx
name: nginx
namespace: default
resourceVersion: "867"
uid: e1bf394a-3759-4534-ac44-8cb8e44c1971
spec:
clusterIP: 10.96.55.182
clusterIPs:
- 10.96.55.182
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
ports:
- port: 80
protocol: TCP
targetPort: 80
selector:
app: nginx
sessionAffinity: None
type: ClusterIP
status:
loadBalancer: {}
Rolling back kube-proxy behavior was tested manually with the following steps:
Create a Kind cluster with 2 worker nodes:
$ cat kind.yaml
# three node (two workers) cluster config
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
$ kind create cluster --config kind.yaml
Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:v1.25.0) 🖼
✓ Preparing nodes 📦 📦 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
✓ Joining worker nodes 🚜
Set kubectl context to "kind-kind"
You can now use your cluster with:
kubectl cluster-info --context kind-kind
Create a webserver Deployment and Service with `internalTrafficPolicy=Local`:
$ cat agnhost-webserver.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: agnhost-server
labels:
app: agnhost-server
spec:
replicas: 3
selector:
matchLabels:
app: agnhost-server
template:
metadata:
labels:
app: agnhost-server
spec:
containers:
- name: agnhost
image: registry.k8s.io/e2e-test-images/agnhost:2.40
args:
- serve-hostname
- --port=80
ports:
- containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
name: agnhost-server
labels:
app: agnhost-server
spec:
internalTrafficPolicy: Local
selector:
app: agnhost-server
ports:
- port: 80
protocol: TCP
Exec into one node and verify that kube-proxy only proxies to endpoints on the local node.
$ kubectl get svc agnhost-server
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
agnhost-server ClusterIP 10.96.226.141 <none> 80/TCP 7m56s
$ kubectl get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
agnhost-server-7d4db667fc-h5m5p 1/1 Running 0 6m2s 10.244.2.3 kind-worker <none> <none>
agnhost-server-7d4db667fc-nrdg4 1/1 Running 0 5m59s 10.244.2.4 kind-worker <none> <none>
agnhost-server-7d4db667fc-tpbpk 1/1 Running 0 6m 10.244.1.4 kind-worker2 <none> <none>
$ docker exec -ti kind-worker2 bash
root@kind-worker2:/# iptables-save
...
...
...
-A KUBE-SVL-VPG43MSD43N5Z7KJ ! -s 10.244.0.0/16 -d 10.96.226.141/32 -p tcp -m comment --comment "default/agnhost-server cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SVL-VPG43MSD43N5Z7KJ -m comment --comment "default/agnhost-server -> 10.244.1.4:80" -j KUBE-SEP-MNGWCV7VA4JSYYKU
COMMIT
# Completed on Tue Sep 27 14:50:07 2022
Turn off the `ServiceInternalTrafficPolicy` feature gate in kube-proxy by editing the kube-proxy ConfigMap and restarting kube-proxy:
$ kubectl -n kube-system edit cm kube-proxy
# set featureGates["ServiceInternalTrafficPolicy"]=false in config.conf
configmap/kube-proxy edited
$ kubectl -n kube-system delete po -l k8s-app=kube-proxy
pod "kube-proxy-d6nf2" deleted
pod "kube-proxy-f89g4" deleted
pod "kube-proxy-lrk9l" deleted
Exec into the same node as before and check that kube-proxy now proxies to all endpoints even though the Service has `internalTrafficPolicy` set to `Local`:
$ docker exec -ti kind-worker2 bash
root@kind-worker2:/# iptables-save
...
...
...
-A KUBE-SVC-VPG43MSD43N5Z7KJ ! -s 10.244.0.0/16 -d 10.96.226.141/32 -p tcp -m comment --comment "default/agnhost-server cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SVC-VPG43MSD43N5Z7KJ -m comment --comment "default/agnhost-server -> 10.244.1.4:80" -m statistic --mode random --probability 0.33333333349 -j KUBE-SEP-MNGWCV7VA4JSYYKU
-A KUBE-SVC-VPG43MSD43N5Z7KJ -m comment --comment "default/agnhost-server -> 10.244.2.3:80" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-EMZSNZ2TWKTA6UM4
-A KUBE-SVC-VPG43MSD43N5Z7KJ -m comment --comment "default/agnhost-server -> 10.244.2.4:80" -j KUBE-SEP-KTGIEIH27TIG3JO7
COMMIT
# Completed on Tue Sep 27 14:55:04 2022
- Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
This section must be completed when targeting beta graduation to a release.
- How can an operator determine if the feature is in use by workloads?
  - Check the Service to see if `internalTrafficPolicy` is set to `Local`.
  - A per-node "blackhole" metric will be added to kube-proxy which represents Services whose traffic is intentionally dropped (`internalTrafficPolicy=Local` and no endpoints). The metric will be named `kubeproxy/sync_proxy_rules_no_local_endpoints_total` (subject to rename).
- What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
They can check the `kubeproxy/sync_proxy_rules_no_local_endpoints_total` metric when `internalTrafficPolicy=Local` and there are no endpoints.
- What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
This will depend on Service topology and whether `internalTrafficPolicy=Local` is being used.
- Are there any missing metrics that would be useful to have to improve observability of this feature?
The `sync_proxy_rules_no_local_endpoints_total` metric was added to inform users whether a Service is dropping local traffic.
This section must be completed when targeting beta graduation to a release.
- Does this feature depend on any specific services running in the cluster? Think about both cluster-level services (e.g. metrics-server) as well as node-level agents (e.g. specific version of CRI). Focus on external or optional services that are needed. For example, if this feature depends on a cloud provider API, or upon an external software-defined storage or network control plane.
No.
For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.
- Will enabling / using this feature result in any new API calls?
No, since this is a user-defined field in Service. No extra calls for EndpointSlice will be required either, since topology information is already stored there.
- Will enabling / using this feature result in introducing new API types?
No API types are introduced, only a new field in Service.
- Will enabling / using this feature result in any new calls to the cloud provider?
No
- Will enabling / using this feature result in increasing size or count of the existing API objects?
This feature will (negligibly) increase the size of Service by adding a single field.
- Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? Think about adding additional work or introducing new steps in between (e.g. need to do X to start a container), etc. Please describe the details.
This feature may slightly increase kube-proxy's sync time for iptables/IPVS rules, since node topology must be calculated, but this is likely negligible given we already have many checks like this for `externalTrafficPolicy: Local`.
- Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Any increase in CPU usage by kube-proxy to calculate node-local topology will likely be offset by the reduced number of iptables rules it needs to sync when using the `Local` traffic policy.
The Troubleshooting section currently serves the Playbook
role. We may consider
splitting it into a dedicated Playbook
document (potentially with some monitoring
details). For now, we leave it here.
This section must be completed when targeting beta graduation to a release.
- How does this feature react if the API server and/or etcd is unavailable?
Services will not be able to update their internal traffic policy.
- What are other known failure modes?
A Service's `internalTrafficPolicy` is set to `Local` but there are no node-local endpoints.
- What steps should be taken if SLOs are not being met to determine the problem?
  - check the Service for its internal traffic policy
  - check the EndpointSlice to ensure `nodeName` is set correctly
  - check the iptables/ipvs rules on kube-proxy
- 2020-10-09: KEP approved as implementable in "alpha" stage.
- 2021-03-08: Alpha implementation merged for v1.21.
- 2021-05-12: KEP approved as implementable in "beta" stage.
Added complexity in the Service API and in kube-proxy to address node-local routing. This also pushes some responsibility on application owners to ensure pods are scheduled to work with node-local routing.
EndpointSlice subsetting per node can address the node-local use-case, but this would not be very scalable for large clusters since that would require an EndpointSlice resource per node.
Instead of a `trafficPolicy` field with codified values, a bool field could be used to enable node-local routing. While this is simpler, it is not expressive enough for a `PreferLocal` use-case where traffic should ideally go to a local endpoint, but be routed somewhere else otherwise.