Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP 4770: EndpointSlice Controller Flexibility #4771

Open
wants to merge 8 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
362 changes: 362 additions & 0 deletions keps/sig-network/4770-endpointslice-controller-flexibility/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,362 @@
# KEP-4770: EndpointSlice Controller Flexibility

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Well-Known Label](#well-known-label)
- [User Stories (Optional)](#user-stories-optional)
- [Story 1: Services Over Secondary Networks](#story-1-services-over-secondary-networks)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Well-Known Label](#well-known-label-1)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
- [Unit tests](#unit-tests)
- [Integration tests](#integration-tests)
- [e2e tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [GA](#ga)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
- [Monitoring Requirements](#monitoring-requirements)
- [Dependencies](#dependencies)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Well-known label: Use Annotation as Selector](#well-known-label-use-annotation-as-selector)
- [Well-known label: Use Dummy Selector](#well-known-label-use-dummy-selector)
- [Well-known label: Disable the Kube-Controller-Manager Controllers](#well-known-label-disable-the-kube-controller-manager-controllers)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [ ] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

aojea marked this conversation as resolved.
Show resolved Hide resolved
This proposal adds a new well-known label `service.kubernetes.io/endpoint-controller-name` to Kubernetes Services. This label disables the default Kubernetes EndpointSlice, EndpointSlice Mirroring and Endpoints controllers for the services where this label is applied and delegates the control of EndpointSlices to a custom EndpointSlice controller.

## Motivation

As of now, a service can be delegated to a custom Service-Proxy/Gateway if the label `service.kubernetes.io/service-proxy-name` is set. Introduced in [KEP-2447](https://github.com/kubernetes/enhancements/issues/2447), this allows custom Service-Proxies/Gateways to implement services in different ways to address different purposes / use-cases. However, the EndpointSlices attached to this service will still be reconciled in the same way as any other service. Addressing more purposes / use-cases, for example, different pod IP addresses, is therefore not natively possible.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? What if you define a Service without a selector and then do your own reconciliation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comment, I clarified that it is not natively possible while using the Service Selector field (The KEP is about this, the fact to use the service as close as possible to how a user would define it for the Kubernetes default network).
About this topic, there are details in this conversation: #4771 (comment)


Delegating EndpointSlice control would allow custom controllers to define their own criteria for pod availability, selecting different pod IPs than the pod.status.PodIPs and more. As a reference implementation, and since the EndpointSlice Reconciler has been moved into Staging in [KEP-3685](https://github.com/kubernetes/enhancements/issues/3685), the reconciler logic used by Kubernetes can be reused by custom EndpointSlice controllers.

### Goals

* Provide the ability to disable the Kubernetes EndpointSlice, EndpointSlice Mirroring and Endpoints controllers for particular services.

### Non-Goals
aojea marked this conversation as resolved.
Show resolved Hide resolved

* Change / Replace / Deprecate the existing behavior of the Kubernetes EndpointSlice, EndpointSlice Mirroring and Endpoints controllers.
* Introduce additional supported types of the EndpointSlice controllers/Reconciler as part of Kubernetes.
* Modify the Service / EndpointSlice / Endpoints Specs.
* Changing the behavior of kube-proxy.

## Proposal

#### Well-Known Label

`service.kubernetes.io/endpoint-controller-name` will be added as a well-known label applying on the Service object. When set on a service, no matter the service specs, the Endpoint, EndpointSlice, and EndpointSlice Mirroring controllers for that service will be disabled, thus Endpoints and EndpointSlices for this service will not be created by the Kubernetes Controller Manager. If the label is not set, the Endpoint, EndpointSlice, and EndpointSlice Mirroring controllers will be enabled for that service and the Endpoints and EndpointSlices will be handled as of today.
aojea marked this conversation as resolved.
Show resolved Hide resolved

The EndpointSlice, EndpointSlice Mirroring and Endpoints controllers will obey this label both at object creation and on dynamic addition/removal/updates of this label.

### User Stories (Optional)

#### Story 1: Services Over Secondary Networks

As a Cloud Native Network Function (CNF) vendor, some of my Kubernetes services are handled by custom Service-Proxies/Gateways (using `service.kubernetes.io/service-proxy-name`) over secondary networks provided by, for example, [Multus](https://github.com/k8snetworkplumbingwg/multus-cni). IPs configured in the service and registered by the EndpointSlice controller must be only the secondary IPs provided by the secondary network provider.

Therefore, it must be possible to disable the default Kubernetes Endpoints and EndpointSlice Controller for certain services and use a specialized EndpointSlice reconciler implementation to create a controller for secondary network providers.

### Notes/Constraints/Caveats (Optional)

N/A
Comment on lines +108 to +110
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(if it's "N/A" you can just remove the section, since it's "(Optional)")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove them later, if anyone has any concern that some are in fact applicable.


### Risks and Mitigations

The existing behavior will be kept by default, and the Kubernetes EndpointSlice Controller will not manage the Services with the label. This ensures services without the label to continue to be managed as usual.

This will have no effect on other EndpointSlice controller implementations since they will not be influenced by the presence of this label.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the biggest risk is that we could introduce significant bugs because this is making significant changes to the code behind a critical GA controller.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also another risk, controllers that rely on Services that consume the Pods directly instead of the EndpointSlices (cc @bowei @swetharepakula as the ingress-gce does this for NEG) will need to be modified for these services.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This counts as a point in favor of "use an empty selector". But also, that just seems broken. :)


## Design Details

### Well-Known Label

The kube-controller-manager will pass to the Endpoints, EndpointSlice and EndpointSlice Mirroring Controllers an informer selecting services that are not labeled with `service.kubernetes.io/endpoint-controller-name`. Thus, if the label is added to an existing service (by updating the service), the service with the label will be considered as a deleted service for the controllers, and the Endpoints and EndpointSlices will be deleted. If a Service is created with the label, the controllers will not be informed about it, so the Endpoints and EndpointSlices will not be created. If the label is removed from an existing service (by updating the service), the service with the label will be considered as a newly created service for the controllers, and the Endpoints and EndpointSlices will be created.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robscott @danwinship @thockin The mutability here is a friction point, as this is "the same" as removing the selector on a Service , something we didn't reach a consensus yet and this is defining a behavior that is inconsistent with current state kubernetes/kubernetes#118376.

I'm leaning more to avoid to deal with Updates on Services, specially because I can not understand why one Service with one selector managed by the cluster endpoint slice controller will switch to an external one slice controller, it seems that should be better handled by deleting and recreating the Service.

This will be straight forward to implement, just validating on the Service API that this special label can not be added or mutated after the Service has been created

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the annotation is mutable (which it has to be) then we have to deal with changes to it. Even if kcm tries to ignore changes, what happens if you restart kcm after changing the annotation?

I feel like 118376 was mostly waiting for Rob to bless a solution, and then we all stopped thinking about it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if I miss something. To me, 118376 is different issue than the one describe in this KEP.

Here is how I see the implementation for this KEP: The service informer passed to the EndpointSlice Controller will have a selector to not include the services that are labeled with service.kubernetes.io/endpoint-controller-name. This means the controller will consider adding the label as if the service has been removed (which should already be covered as of now: when you remove a service, the EndpointSlices for it are removed).

If KCM is restarting, then the EndpointSlice controller will see the previous EndpointSlice(s) and enqueue the service associated (service name is in the EndpointSlice annotation). The service will be considered as non-existing (as the informer is not selecting labeled service), so the EndpointSlice(s) will be deleted.

Here is the implementation: kubernetes/kubernetes@master...LionelJouin:kubernetes:custom-endpointslice-controller

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The similarity between this and 118376 is that in both cases someone (potentially) has a Service with endpoints that have been generated in a particular way, and then later they want to change the Service so that the endpoints will be generated in a different way. Antonio was suggesting that if we couldn't figure out a solution for 118376, then we won't be able to come up with a solution for this either (and so we should forbid the case where you change things after the endpoints have already been generated), and I was disputing that we couldn't figure out a solution for 118376.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw this comment: kubernetes/kubernetes#103576 (comment). The behavior that I see with this KEP is different. To me, the first interpretation seems more logical for this KEP, so if we add the label, controllers are disabled for the service, and objects (Endpoints and EndpointSlices) previously created by the KCM are deleted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the annotation is mutable (which it has to be)

I think you mean label instead of annotation ... why it has to be mutable?

if we add the label, controllers are disabled for the service, and objects (Endpoints and EndpointSlices) previously created by the KCM are deleted

Here is the implementation: kubernetes/kubernetes@master...LionelJouin:kubernetes:custom-endpointslice-controller

I see, so this is using this property of the informers so the controller "sees" a Delete event when the label is added, with 118376 the problem is that it "sees" an Update event ... I can not think of any edge cases from the top of my head, I think it may work ...

@robscott WDYT? maybe should we modify the controllers to process the selector removal as a delete event too?


In the Endpoints, EndpointSlice and EndpointSlice Mirroring Controllers, the behavior to create Endpoints/EndpointSlices on service creation and the behavior to delete the Endpoints/EndpointSlices on service deletion is already in place. Only the service informer passed to these controllers must be tweaked for the proposed well-known label (`service.kubernetes.io/endpoint-controller-name`) to work properly.

### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

##### Unit tests

TDB

##### Integration tests

- Usage of `service.kubernetes.io/endpoint-controller-name` on services
* With the feature gate enable, a service is created with the label and the service has then no Endpoints neither EndpointSlices. Then service is updated removing the label and the service has now Endpoints and EndpointSlices.
* With the feature gate enable, a service is created without the label, the service has Endpoints and EndpointSlices. Then service is updated with the label and the service has no longer any Endpoints nor EndpointSlices.
* With the feature gate disabled, a service is created with the label and the service has EndpointSlices as it would have had without the label.

##### e2e tests

- Usage of `service.kubernetes.io/endpoint-controller-name` on services
LionelJouin marked this conversation as resolved.
Show resolved Hide resolved
* A service is created with the label, an EndpointSlice is manually created with an endpoint (simulating an external controller). The test verifies the service is reachable.

### Graduation Criteria

#### Alpha

- Feature implemented behind feature gates (`ExternalEndpointController`). Feature Gates are disabled by default.
- Documentation provided.
- Initial unit, integration and e2e tests completed and enabled.

#### Beta

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There has to be some e2e conformance of these out of tree controllers, since is an external controller adding an endpointslice, how do we validate these out of tree controllers are generating the endpointslices accordingly ? what if these controllers does not respect the selectors and generete the endpointslices anyway?

@robscott I really like to hear your thoughts on this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting thought. I'd assumed that the purpose of this KEP was to build controllers that would intentionally have different behavior than the default in-tree EndpointSlice controller. In some cases that might be to select a different IP for a Pod, but in other cases that might be to select an entirely different type of resource, for example a custom backend type.

This does certainly make it more likely that unexpected variations of EndpointSlices will exist, which is certainly something to watch out for. Since the outset of the EndpointSlice API anyone has been able to create EndpointSlices, and sometimes unexpected variations have led to bugs in controllers consuming those EndpointSlices, but I don't think that's necessarily new with this KEP, just potentially a more common set of edge cases that we'll need to be sure that controller authors are aware of.

For example, nodeName is optional in EndpointSlices, but some controllers likely assume it exists because the EndpointSlice controllers in k/k generally populate it. The same is largely true for zone and hostname (see podToEndpoint for a reference).

I think this risk is unavoidable with this KEP, so I'd like to see it called out + a corresponding commitment to improved documentation around this as a requirement for graduating past alpha. Unfortunately I can't think of a way to write conformance tests that would help us here though.

/cc @swetharepakula @gauravkghildiyal

- Feature Gates are enabled by default.
- No major outstanding bugs.
- Feedback collected from the community (developers and users) with adjustment provided, implemented and tested.

#### GA

- 2 examples of real-world usage.
- Allowing time for feedback from developers and users.

### Upgrade / Downgrade Strategy

N/A

### Version Skew Strategy

N/A

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [x] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: `ExternalEndpointController`
- Components depending on the feature gate: kube-controller-manager
- [ ] Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control
plane? No
- Will enabling / disabling the feature require downtime or reprovisioning
of a node? No

###### Does enabling the feature change any default behavior?

When the feature-gate `ExternalEndpointController` is enabled, the label `service.kubernetes.io/endpoint-controller-name` will work as described in this KEP. Otherwise, no, for the existing services without the `service.kubernetes.io/endpoint-controller-name` label, the EndpointSlice, EndpointSlice Mirroring and Endpoints controllers will continue to generate Endpoints and EndpointSlices for all services.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

If a service is labeled with `service.kubernetes.io/endpoint-controller-name`, and the feature is disabled, then the Kubernetes Controller Manager will start reconciling the Endpoints and EndpointSlices for this service. This could potentially cause traffic disturbance for the service as unexpected IPs (Pod.Status.PodIPs) will be registered to the EndpointSlices/Endpoints.

###### What happens if we reenable the feature if it was previously rolled back?

If a service is labeled with `service.kubernetes.io/endpoint-controller-name`, and the feature is re-enabled, then the Kubernetes Controller Manager (KCM) will stop reconciling the Endpoints and EndpointSlices for this service and will delete the existing KCM managed ones.

###### Are there any tests for feature enablement/disablement?

Enablement/disablement of this feature is tested as part of the integration tests.

### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?

N/A

###### What specific metrics should inform a rollback?

N/A

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

N/A

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

N/A

### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

N/A

###### How can someone using this feature know that it is working for their instance?

- [ ] Events
- Event Reason:
- [x] API .status
- Condition name:
- Other field: When the `service.kubernetes.io/endpoint-controller-name` label is set on a service, no Endpointslice and no Endpoint will be created but the Kubernetes Controller Manager.
- [ ] Other (treat as last resort)
- Details:

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

N/A

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [ ] Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- [ ] Other (treat as last resort)
- Details:

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

No

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

No

### Scalability

###### Will enabling / using this feature result in any new API calls?

No

###### Will enabling / using this feature result in introducing new API types?

No

###### Will enabling / using this feature result in any new calls to the cloud provider?

No

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

No

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

N/A

###### What are other known failure modes?

N/A

###### What steps should be taken if SLOs are not being met to determine the problem?

N/A

## Implementation History

- Initial proposal: 2024-07-19

## Drawbacks

TBD

## Alternatives
LionelJouin marked this conversation as resolved.
Show resolved Hide resolved
LionelJouin marked this conversation as resolved.
Show resolved Hide resolved

### Well-known label: Use Annotation as Selector

Services without selectors will not get any EndpointSlice objects. Therefore, selecting pods can be done in different ways, for example, via annotation. An annotation will be used in the service to select which pods will be used as backend for this service. For example, [nokia/danm](https://github.com/nokia/danm) uses `danm.k8s.io/selector` (e.g. [DANM service declaration](https://github.com/nokia/danm/blob/v4.3.0/example/svcwatcher_demo/services/internal_lb_svc.yaml#L7)), and [projectcalico/vpp-dataplane](https://github.com/projectcalico/vpp-dataplane) uses `extensions.projectcalico.org/selector` (e.g. [Calico-VPP Multinet services](https://github.com/projectcalico/vpp-dataplane/blob/v3.25.1/docs/multinet.md#multinet-services)). To simplify the user experience, a mutating webhook could read the selector, add them to the annotation and clear them from the specs when the type of service is detected.

The custom EndpointSlice Controller will then read the annotation to select the pods targeted by the service.

```yaml
apiVersion: v1
kind: Service
metadata:
name: service
annotations:
selector: "app=a"
spec: {}
```

This alternative potentially leads to confusion among users and inconsistency in how services are managed as each implementation is using its own annotation (see the nokia/danm and projectcalico/vpp-dataplane examples), leading to a fragmented approach.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial reaction to this KEP is that we should not do it. This alternative is better for a bunch of reasons, but mostly because it already exists.

We have a way of expressing "don't generate endpoints for me" - leave the selector blank. Fortunately for you, it means exactly what you want.

The argument of inconsistency across controllers doesn't hold a lot of water for me. If you specify a pod selector, you're going to get pods as endpoints. If you don't specify it, you get something else. IMO, being able to specify that something else in your own schema is an advantage, not "fragmentation". Suppose you need an expression that is more flexible than the legacy selector-map (e.g. modern selectors can use "In" and "NotIn" and other operators).

This is not an accidental side-effect that I am trying to shoe-horn you into, it's literally why this mechanism was designed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The empty selector feature exists to cover use-cases that are not covered by the existing selector, i.e. selecting backends that are not selectable via labels. For example (from the Service documentation), selecting backends in a different namespace/cluster, selecting backends that are running outside of a Kubernetes, or as you mentioned, selecting backends with a more flexible expression covered with modern selectors. In these cases I agree the empty selector must be used and the users must define their own way to select backends, via, for example, annotations.

I realized that the goals and non-goals are not clear enough, so I fixed them in the latest commit. The goal is to disable the Kubernetes EndpointSlice controller and allow the user to re-use the selector field from the Service Spec. Custom EndpointSlice Controller should be able to replicate the usage of the selector, by selecting workloads by labels (labels specified in the selector field) but the difference with the Kubernetes EndpointSlice Controller is the compilation of these EndpointSlices, in this KEP the main use case is to select different IPs than the Pod.Status.PodIPs.

A well-known label would also provide an explicit and standard way to inform which component is managing the EndpointSlices for a particular service.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still struggling with this one. I spoke with Rob a little, since he seemed "pro". The argument that carried the most weight for me was that there are (or might be?) cases where the alternative controller is doing exactly what the "real" controller does but extracting some different information (e.g. the multi-network case). In that case, you have to specify the network(s) somewhere right? Another example, albeit hypothetical, was something doing different topology hinting.

I'm not sure how I feel about this, still. It seems "simple" to implement, but I am worried about the tail of side-effects.

As a thought experiment - what if you could specify a function, e.g. CEL, which took a pod as argument and returned a list of IPs to use -- would that eliminate the need for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct, multi-network is my primary use case, I have personally no plan, at least for now, to do something different such as topology hinting. But, multi-network also means more than just picking the IP for the pods, it also means the network must be attached and up for the selected pods.

Yes, I need to specify which network(s) must be used for the selected pods. Having this label (service.kubernetes.io/endpoint-controller-name) will give me freedom to choose the way to specify those networks. For my case, I am currently thinking of using Services together with Gateway API (I made a PoC about it: https://github.com/LionelJouin/l-3-4-gateway-api-poc). Gateway in Gateway API allows me to specify new Service-Proxies (Secondary Network Service (load-balancing) acts directly in the Gateway instance) and attach networks via the Gateway.Spec.Infrastructure field, this way I can also specify which networks must be selected. Here is an example on how the Service/Gateway could look like: https://github.com/LionelJouin/l-3-4-gateway-api-poc/blob/main/examples/kpng-gateway-api.yaml#L59-L62.

If I understand correctly, CEL would select a specific field based on what the user would specify via a new field in the Service. I think this would be more complex to maintain, more difficult to use, and less flexible. As of today, the most used solution is Multus, the secondary networks are attached and reported via pod annotation in a json format, I guess it would be quite challenging to match the IPs for a particular network. This could have worked with the Multi-Network KEP changes since the Secondary Network IPs would have been reported in Pod.Status. Now the discussion for multi-network is about DRA and ResourceClaim.Status to report the Secondary Network IPs, I am not sure how CEL could fit here.

For reference, at the beginning, another goal for this KEP was to improve the Kubernetes EndpointSlice Reconciler to make it generic enough so it can be re-used for Secondary Network IPs (No matter if Multus, KEP#3698 or DRA is used). This way, the custom EndpointSlice Controllers for secondary networks would be quite easy to implement. I kept it in the commit history (removed in this commit) and I am still planning to propose again these changes somehow in a better way in the future.

Copy link
Member

@aojea aojea Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, at the beginning, another goal for this KEP was to improve the Kubernetes EndpointSlice Reconciler to make it generic enough so it can be re-used for Secondary Network IPs (No matter if Multus, KEP#3698 or DRA is used). This way, the custom EndpointSlice Controllers for secondary networks would be quite easy to implement. I kept it in the commit history (removed in this commit) and I am still planning to propose again these changes somehow in a better way in the future.

This is a big no for me, if someone wants to create EndpointSlices controllers out of tree is ok to share some common types and logic but adding more load to the core maintainers just to remove from the custom endpointslice controllers maintainers is simply unfair.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a big no for me, if someone wants to create EndpointSlices controllers out of tree is ok to share some common types and logic but adding more load to the core maintainers just to remove from the custom endpointslice controllers maintainers is simply unfair.

I agree with this, the idea with this part (generic EndpointSlice Reconciler) of the KEP was to solve what is stated in the readme of EndpointSlice Reconciler: This EndpointSlice reconciler library is not sufficiently generic to be used by the EndpointSlice Mirroring controller.... But also keeping what is written for compatibility: There are NO compatibility guarantees for this repository, yet. It is in direct support of Kubernetes, so branches will track Kubernetes and be compatible with that repo..


### Well-known label: Use Dummy Selector

The set of Pods targeted by a Service is determined by a selector, the labels in the selector must be included as part of the pod labels. If a dummy selector is added to the service, Kubernetes will not select any pod, the endpointslices created by Kubernetes will then be empty. To simplify the user experience, a mutating webhook could add the dummy selector when the type of service is detected.

The custom EndpointSlice Controller could read the service.spec.selector and ignore the dummy label to select pods targeted by the service.

```yaml
apiVersion: v1
kind: Service
metadata:
name: service
spec:
selector:
app: a
dummy-selector: "true"
```

This alternative fails to prevent the placeholder (empty) EndpointSlice(s) to be created by Kube-Controller-Manager. This also potentially causes confusion among users as every implementation could use a different dummy-selector key. Additionally, a miss-configuration with the missing dummy label will lead to unintended EndpointSlices being created with Pod.Status.PodIPs.

### Well-known label: Disable the Kube-Controller-Manager Controllers

The list of controllers to enable in the Kube-Controller-Manager can be set using the `--controllers` flag ([kube-controller-manager documentation](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/)). The EndpointSlice can then be disabled in the Kube-Controller-Manager and implemented as an external one that will support the label feature described in this KEP.

This alternative requires significant changes to the cluster management as the cluster level configuration must be modified and a new EndpointSlice controller for the primary network must be developed and deployed to replace the disabled one in the Kube-Controller-Manager.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should - I pretty much think must - document the "don't specify a selector, then do your own thing" option. It is an alternative and we shouldn't pretend otherwise.

## Infrastructure Needed (Optional)

N/A
Loading