-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi-Cluster Services API #1646
Conversation
Welcome @JeremyOT! |
Hi @JeremyOT. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
#### Terminology | ||
|
||
- **supercluster** - a placeholder name for a group of clusters with a high degree of mutual trust and shared ownership that share services amongst themselves. | ||
- **mcsd-controller** - a controller that syncs services across clusters. There may be multiple implementations, this doc describes expected common behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind elaborating on the meaning of mcsd
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated this explanation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some peanut gallery questions and comments, overall this looks very nice!
|
||
The Multi-Cluster Services API aims to fix these problems. | ||
|
||
### Goals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an explicit goal to extend the topology model to support cross-cluster topology concerns? For example region/zone affinity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No explicit extension there - the hope is that whatever K8s does for service topology within a cluster can extend across clusters. For example, zonal affinity is already something that matters within clusters. The alpha Service Topology API would already seem to just work with multiple clusters, and recent sig-network discussion about simplifying topology to do the right thing with less configuration would apply here as well
--> | ||
- Define specific implementation details beyond general API behavior. | ||
- Change behavior of single cluster services in any way. | ||
- Define what NetworkPolicy means for multi-cluster services. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can't leave this out, we at least have to define some semantics here. Otherwise it is potentially possible to bypass network policy controls by using multi-cluster services.
I think at a minimum there needs to be some discussion on the semantics of existing network policy and multi-cluster services. At least ip-based policies should continue to work right? But namespace-based selectors probably won't...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair point. As you say, ip-based policies should continue to work but selectors won't be able to cross the cluster boundary. I think multi-cluster network policy is critical and needs to be figured out, but I was hoping we can start to define how to extend Service first. Users will definitely want policy, but there are some mechanics that we're missing - e.g. how can we support selectors against backends in another cluster?
As for any possibility of circumventing policy in the meantime, I think that would be a risk the service owner needs to be aware of. To export an existing service you'd need to have access to the service namespace which I hope means you have permission to weigh the risk and make the decision.
Let me think about this and try to fill out some thoughts on what this means for existing policy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cillium ClusterMesh does global services and "supercluster" network policy.
They replicate state to cluster-local etcd instances.
It may be worth reviewing and/or citing their implementation for inspiration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some semantics exploration done months ago over sig-multicluster and sig-network https://docs.google.com/document/d/1_QzuasJPiQ-4t8tUGODoRS2E-Q3KLJajqMyCvLPH2tY/edit
Also one implementation we started in submariner: https://github.com/submariner-io/coastguard but still not published is just a POC implementing what the document describes, and heavily based in kubefed...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the service topology work needs to be also called in/out here.
Another CRD called `ImportedService` will be introduced to store connection | ||
information about the `Services` in each cluster, e.g. topology. This is | ||
analogous to the traditional `Service` type in Kubernetes. Each cluster will | ||
have an `ImportedService` for each `Service` that has been exported within the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe clarify that this is per unique service name (namespace+name), and not per service resource. So if you have the "same" service in 5 different clusters each cluster will only have a single ImportedService representing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch - clarified
|
||
If multiple clusters export a `Service` with the same namespaced name, they will | ||
be recognized as a single combined service. The resulting `ImportedService` will | ||
reference endpoints from both clusters. Properties of the `ImportedService` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
endpoint reference goes the other way right? EndpointSlice refers to the ImportedService. There will be an EndpointSlice for each cluster that has exported the service, all pointing to the same ImportedService.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, reference isn't the right word here, I'm not trying to dig into mechanics yet. I've loosened the language a bit. The method of association between EndpointSlices and ImportedService is left for the design details
|
||
``` | ||
<<[UNRESOLVED]>> | ||
Due to additional constraints that apply to stateful services (e.g. each cluster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A potential solution here is to allow statefully-named supercluster services. Originally I was thinking cluster name but if you want to support migration it should be something else.
This is similar to a feature to allow arbitrary supercluster names which we could add later (as a field on ExportedService for example). Then the controller would just aggregate based on supercluster name, with the default name being name.namespace.svc.supercluster.local if unspecified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would a statefully-named supercluster service look like? When you say it would be a field on the export, are you thinking it would be an alias for the service name - as in exportName: "my-alias"
is exported as my-alias.namespace.svc.supercluster.local
? That seems like it could solve e.g. export DB-A from cluster A and DB-B from cluster B.
What if I want to run something like Cassandra across multiple clusters? Then I may want a single my-cassandra.ns.svc.supercluster.local
in front of the aggregate service. In that instance, it could be solved by requiring that users just use different names (cluster name prefix?) for the stateful set in each cluster, but it would be much better if we could do something that didn't require different configuration in each cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these stateful services, I think we need to get some real use-cases before we can pin it down.
https://github.com/kubernetes/dns/blob/master/docs/specification.md says
"""
A unique, system-assigned identifier for the endpoint. The exact format and source of this identifier is not prescribed by this specification. However, it must be possible to use this to identify a specific endpoint in the context of a Service. This is used in the event no explicit endpoint hostname is defined.
"""
Assuming we stay similar, this means that DNS must provide a unique name to every endpoint of every supercluster service. It does not (yet) prescribe what that name should be. Given that clusters are uncoordinated today in this regard, we should not assume they will become coordinated. The "obvious" solution is to use cluster-name as part of the name.
E.g Service 's' has 2 endpoints in cluster "cla" and 2 endpoints in cluster "clb". The supercluster DNS would have 1 VIP record s.<ns>.<suffix>
and individual 4 endpoints <uniq1>.cla.s.<ns>.<suffix>
, <uniq2>.cla.s.<ns>.<suffix>
, <uniq3>.clb.s.<ns>.<suffix>
, <uniq4>.clb.s.<ns>.<suffix>
. This would allow us to respect a Pod's requested "hostname" (but only within the context of it's cluster name).
Statefulset services COULD follow the same suit, but now I question whether the "within the context of it's cluster" makes any sense. If so, good, problem solved (?). If not, then the whole model needs to be rethought.
Absent real requirements this is hard, so let's keep this as an open item - what are the expectations on supercluster DNS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supercluster DNS needs some definition for sure.
To further this discussion, I've been leaning toward this obvious cluster-scoped model. It seems like <hostname>.<cluster_id>.<svc>.<ns>.*
would provide necessary flexibility. Each pod gets its own unique name supercluster-wide but we still respect the hostname. It does of course mean that you can't connect to a specific pod with only its hostname and service name, but would there be a situation where that's needed and you don't also know the pod's cluster? If you had supercluster-unique hostnames, it would still be possible to look up the pod's FQDN via SRV records
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two thoughts:
-
Careful with "cluster ID" vs "cluster name". There is/was a proposal to formalize ClusterID, and I don't know if that is what you mean. A "group-unique cluster identifier or name" is probably sufficient.
-
Elsewhere I suggested MAYBE
.<svc>.s.<ns>.<suffix>
to leave room for other kinds of names. E.g DNS for Gateway is not fully locked down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
I mean the "group-unique cluster identifier or name", adding cluster_name to the terminology section and use that moving forward so it's clear
-
I think that structure generally makes more sense. My take is that this needs to be driven from Service and MCS can be co-developed to adopt improvements to Service. I'm worried about trying to independently improve DNS for MCS and diverging from Service. ..svc. is not ideal, but it's consistent and there's no room for confusion coming from single cluster services
these will be considered the same service and will be combined at the | ||
supercluster level. | ||
|
||
This requires that within a supercluster, a given namespace is governed by a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should mention somewhere that we can expand this later to support explicit control over mapping, but that the proposal is to start with identity mapping and carefully add more configurability as needed. For example we could support the arbitrary name mapping as I noted above, but that may require some notion of RBAC control over the mapped-into names (otherwise it would be easy for malicious users to inject themselves into others' names). Similarly we may want to offer control over what namespace a particular cluster is allowed to export services from. So cluster A and cluster B are in a supercluster together, and cluster A only trusts namespace Foo from cluster B, and not namespace Bar. So even if cluster B has an ExportedService in namespace Bar cluster A would not import those endpoints.
Anyway too much detail but maybe a little wording here that this proposal allows for those types of things to come in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that may require some notion of RBAC control over the mapped-into names (otherwise it would be easy for malicious users to inject themselves into others' names)
Yeah this is one of the big draws for name-mapped exports. As soon as you allow cross-namespace exporting (either as source or target), we have to start creating custom RBAC. With name-mapping its easy enough to disallow creation of ServiceExport
resources.
we may want to offer control over what namespace a particular cluster is allowed to export services from
This is a good point. The way I've been thinking about this would be namespace level control over whether or not exports are allowed, but there could be a cluster level control over whether or not that namespace is allowed in the cluster. Would there be a case that you'd want to actively use a namespace in all clusters, but only allow exports from certain clusters?
So cluster A and cluster B are in a supercluster together, and cluster A only trusts namespace Foo from cluster B, and not namespace Bar. So even if cluster B has an ExportedService in namespace Bar cluster A would not import those endpoints.
In my model, this would mean that cluster A does not trust namespace Bar and doesn't have anything specifically to do with cluster B. Since a given namespace should have a single owner throughout the supercluster, I take this to mean cluster A doesn't trust ns Bar's owner and as a result nothing in cluster A is allowed to access anything in (or deploy anything to) namespace Bar. This seems reasonable to me, so long as namespace Bar isn't used for a different purpose/by a different team in cluster A.
What if it was specified that so long as namespace sameness is respected, an implementation could offer to exclude subsets of clusters from importing services in a given namespace? I wonder though, in this scenario, does anyone trust Bar across cluster boundaries? If no, then maybe this comes back to just prohibiting Bar from creating exports.
--> | ||
#### Terminology | ||
|
||
- **supercluster** - a placeholder name for a group of clusters with a high |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I saw it in this doc, but we need to define exclusivity semantics. Specifically is a supercluster an exclusive group, or does each cluster decide what other clusters are in its supercluster? Is asymmetry allowed or is symmetry enforced? (if cluster A thinks cluster B is in its supercluster, cluster B thinks cluster A is in its supercluster)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. We've been assuming it's symmetric and transitive. I've made that explicit. If cluster A thinks it's in a supercluster with B and C, then all three clusters agree.
When a `ServiceExport` is created, this will cause a domain name for the | ||
multi-cluster service to become accessible from within the supercluster. The | ||
domain name will be | ||
`<service-export-name>.<service-export-namespace>.svc.supercluster.local`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to determine:
- If the supercluster.local suffix can be overridden, and if so on a per-cluster or supercluster basis.
- What happens if the user happens to have used "supercluster.local" as their suffix for their cluster.
For #2, maybe it should not be .svc. and instead something else? Then there would be no potential for conflict. .ssvc.cluster.local? .super.cluster.local?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is a tough one. I think this needs further discussion. In an ideal world we would be able to reserve a name here, but that is not our world :). First thoughts:
-
I think ideally we'd have some consistent standard name here, and you could create an alias if needed. I don't see why this couldn't be per-cluster as with the VIP - you'd likely want it to be supercluster level, but that's up to the implementation since each cluster consumes a service independently based solely on local information.
-
Maybe something.cluster.local is a solution. It would avoid conflicts, but semantically it feels like a big stretch for cluster.local as it's explicitly not cluster local.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was dreading coming to this.
First, we should not perpetuate .local for this. There ARE some reserved TLDs that we can use: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 calls them "user-assigned" and sets aside x[a-z]
. So we could choose "supercluster.xx" or "mesh.xl" or something more creative.
Nothing we can do, short of using cluster.local, can prevent us picking a name someone has used. So we either go forth cautiously with a name that we think is exceedingly unlikely to be in use (as we did with cluster.local) or we plan to allow this to be configurable. It's already obnoxious to find the suffix of your local cluster. This will be even worse, unless sig-multi-cluster can normalize cluster-ID and group-membership metadata (which seems useful to me).
We don't really need the "svc" literal in there do we? Why not <svc>.<ns>.<suffix>
? Do we really believe we'll have supercluster names for non-services? If so, I'll propose <svc>.s.<ns>.<suffix>
(because I have proposed the same structure for a reboot of in-cluster DNS).
See other comments regarding endpoint names.
We will need to come up with a carefully worded spec for supercluster DNS. Can you please put that as a graduation criteria?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with all points. I'll absolutely add DNS spec as a graduation criteria.
In the meantime, it seems like a few questions have come up repeatedly in discussions around this KEP:
- Should the name be fixed or configurable? fixed makes life easier, but users may want customization.
- Should supercluster services use a new suffix or live under cluster.local (e.g. super.cluster.local)?
- What should the structure of a supercluster address be?
I don't think we need the svc literal in there, but I don't think we can say at this point that there will never be supercluster names for non-services (though I don't know what they would be). The main reason this proposal has kept the literal is for familiarity with cluster.local names, which seems like a good thing to maintain. I'm all for restructuring in-cluster DNS and supercluster DNS should follow suit, but it seems risky to develop supercluster DNS improvements independently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main failing of svc.<suffix>
is that it is used in search paths (which I do not expect here), which makes for 2 user-controlled tokens which can be ambiguous. It's also WRONG from a hierarchy point of view (services are smaller than namespaces), but I think I'd be OK to keep it for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completely agree. Consistency is the main driver, but fixing the hierarchy in general would be great and MCS should adopt any improvements
`<service-export-name>.<service-export-namespace>.svc.supercluster.local`. | ||
Requests to this domain name from within the supercluster will resolve to the | ||
supercluster VIP, which points to the endpoint addresses for pods within the | ||
underlying `Service`(s) across the supercluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that it says that existing Service routing will not change, but can we please add a note here explicitly stating:
- clients will need to use the supercluster-specific DNS zone to recieve multi-cluster routing
- the existing default
svc.cluster.local
zone behavior is unchanged even whenImportedServices
and matchingEndpointSlices
exist
I became concerned about suggesting we exclude default/kubernetes
and kube-system/kube-dns
from being exported, because I incorrectly assumed exporting them would cause undesirable changes to existing behavior until I read the DNS section more carefully.
It may also be beneficial to mention that exporting these special/canonical services is allowed and may have valid use-cases such as dns-aggregation or apiserver-monitoring. (perhaps here or earlier on when discussing use-cases or export/RBAC behavior)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point - I've added a note to make that clear
What is out of scope for this KEP? Listing non-goals helps to focus discussion | ||
and make progress. | ||
--> | ||
- Define specific implementation details beyond general API behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the excellent proposal.
This design helps me understand how we could use the existing in-cluster instances of CoreDNS or a supercluster specific deployment of CoreDNS to fulfill the DNS requirements for MCSD.
This is very feasible from cluster-local CoreDNS instances since the ExportedService
state is imported/replicated throughout the APIServers of the sueprcluster.
Writing a new CoreDNS plugin that modifies the kubernetes / kubernetai plugin behavior in the style of k8s_external and federation should work well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's great to hear! Thank you for your thorough comments and feedback!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lighthouse plugin for CoreDNS currently has a design that is a very close fit to this proposal. Over next few weeks we're planning to come up with a PoC implementation of this proposal in Lighthouse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome!
`EndpointSlice` is associated with its `Service` in a single cluster. Each | ||
imported `EndpointSlice` will also have a | ||
`multicluster.kubernetes.io/source-cluster` label with a registry-scoped unique | ||
identifier for the cluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do any existing core cluster-components such as kube-proxy/CoreDNS/kube-dns need to be modified to ignore these new EndpointSlices
?
What about ingress-controllers, prometheus / kube-prometheus, and CNI related daemons?
Will using EndpointSlices
in this way potentially affect user's scripts, automation, or integrations?
Have we considered also cloning the API into a CRD like ImportedEndpointSlices
in the same way we're doing for Services
?
This would make the UX for kubectl get ep,importedendpoints
match kubectl get svc,serviceexports,importedservices
instead of conflating one but not the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Existing components should not require any modification to ignore these new slices - that was one driver for using ImportedService
instead of trying to extend Service
. One of the nice features of EndpointSlice
compared to Endpoints
is that they are not strictly coupled to a Service, so as long as we don't use the kubernetes.io/service-name
label, our slices should have no impact on existing components that discover them by association with a Service
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modulo scalabiity aspects (which in general we didn't yet solve), because things like dns, kube-proxy etc. watch for all endpoints (so may oom or sth...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - though existing logic shouldn't be impacted by these new unlabelled slices (at least from the Service API's POV), they will still trigger events and could potentially have noticeable side effects at scale.
`Service` owner who controls whether or not a given `Service` is exported. Using | ||
`ObjectReference` instead would also open the possibility of having multiple | ||
exports acting on a single service and would require more effort to determine if | ||
a given service has been exported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth noting somewhere here that allowing users to create ServiceExports
in the default namespace will allow people to export the default kubernetes Service
.
This is more of an administrator action, but it is very common for app teams and non admins to have wide RBAC in the default namespace.
We should encourage people intending to use an MCSD implementation to be wary of this.
One suggested solution could be to not use the default namespace for apps or create specific RBAC to guard against it.
Otherwise, this seems like a set of defaults that could compose into mistakes and security issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - default and kube-system both need to be used cautiously. There may be good reasons to allow exports from either, but it would be a best practice to restrict exports without clear justification. I've added a "Restricting Exports" section to clarify that
|
||
#### Endpoint TTL | ||
|
||
To prevent stale endpoints from persisting in the event that a cluster becomes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another important problem that use to happen in these scenarios (commonly with BGP peering) is unstability or flapping due to problems with the peers, per example because of network hiccups or other connectivity problems. If it connects and reconnects constantly it creates a lot of pressure on the clusters creating and deleting resources. Maybe we should indicate that we should add some protection to avoid this?
I don't think that right now implementations details are necessary but just for reference and completeness, BGP solves it using flap damping https://tools.ietf.org/html/rfc2439, but just whatever mechanism that guarantees certain stability will be valid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've purposefully avoided specifying exactly how the supercluster reconciles services between clusters beyond what's necessary to implement the API, but here's an example of how I see this working:
The controller periodically validates connectivity to each cluster in the supercluster through some mechanism, then in each cluster it manages (in a centralized model it would be all clusters, in a distributed model maybe just the local cluster) it would renew the source-cluster-specific lease for each currently connected cluster. The TTL would be some multiple of the validation interval and allow for some flakiness. This way we'd only cleanup endpoints if a source cluster was unreachable for a sustained period.
For the reasons you suggest, I don't think a supercluster controller would want to explicitly delete endpoints for a cluster because it was unreachable - rather let the lease expire in that case. This way there should be no action if there's a temporary disconnect so long as it recovered before the TTL.
What if this instead read:
To prevent stale endpoints from persisting in the event that a cluster becomes
unreachable to the supercluster controller for a sustained period of time, ...
Does that make more sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I was more concerned about a possible overlook of this problem, but is obvious you have it very clear and well designed 😅 . Current wording works for me too. Thanks for the clarification
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My big concern here is to quickly weed out endpoints that are not alive. absent real health-checking, this is hard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should figure out how to do better than TTL (see DNS caching)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking about this, and maybe TTL doesn't need to be part of this spec and instead made a responsibility of the implementation. Even as described, TTL only handles the worst case scenario should a cluster or supercluster controller become unavailable or otherwise stop functioning. That's clearly necessary, but there are various ways to go about it.
An implementation should strive to do better than this - for example, quickly cleaning up endpoints that have been removed from their source cluster and/or actually perform health checking. I don't think it's necessary (or desirable) to define that behavior as preference may depend on the environment or other to prioritize other goals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absent a kubernetes-level healthcheck spec, implementations are free to do whatever they want, but there's no consistency. Adding an HC block to Services may be the "real" solution, but that's probably out-of-scope for this KEP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. That said, what are thoughts on keep/removing TTL as part of the spec?
- We keep it to bound worst-case endpoint cleanup behavior for all implementations, but with the understanding that implementations should strive to do better. This still means inconsistent behavior but at least guarantees that endpoints wont live forever.
- We remove it and leave cleanup entirely to implementation. There's even less consistency and we open to the possibility of technically compliant implementations that leave dead endpoints forever.
It seems like a useful thing to have right now, but if we expect e.g. an HC block to be added to Services TTL may become redundant quickly as MCS adopts that instead
|
||
We propose a new CRD called `ServiceExport`, used to specify which services | ||
should be exposed across all clusters in the supercluster. `ServiceExports` must | ||
be created in each cluster that the underlying `Service` resides in. Creation of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Service
object is a key piece of this KEP but I didn't find any mention to the Type
of Service
. Is this KEP addressing any Type of service?
- ClusterIP: Headless and No-Headless
- NodePort
- LoadBalancer
- ExternalName
My impression after reading it and for the mentions to the EndpoinSlices and the SuperClusterIP
is that is targeting ClusterIP
services only, is that correct? 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, this needs some more clarification. This is how I see it:
The resulting SuperclusterIP
service that's imported into each cluster is like ClusterIP
(no node port, no external LB IP), but the source services don't have to be. The supercluster endpoints are the union of endpoints attached to services exported from each cluster, regardless of how those endpoints are used in-cluster.
Let's start with exporting ClusterIP
- this is the the straightforward case most of the doc assumes. Each EndpointSlice
associated with the exported service is combined with slices from other clusters to make up the supercluster service. They will be imported to the cluster behind the supercluster IP.
Headless - Today's headless services likely don't want a VIP and may not function properly behind one - there's a note a little further down about stateful sets and complications that come with them. We haven't answered what that could look like. It probably doesn't make sense to export a current headless service to the supercluster, it would work, but likely not the way you want.
NodePort
and LoadBalancer
type services both create ClusterIP
services and that's what would sync. For example If you export a NodePort
service, the resulting cross-cluster service will still be a SuperclusterIP
type. You could use node ports to access the cluster-local service in the source cluster, but not in any other cluster, and it would only route to local endpoints.
ExternalName
really is the outlier here - it doesn't make sense to export an ExternalName
service. They can't be merged with other exports, and it seems like it would only complicate deployments by even attempting to stretch them across clusters. Why not just create regular ExternalName
services in each cluster that needs them? Perhaps we should just disallow exporting them. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not make headless cluster services become headless supercluster services?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would make sense, but mechanically what does that mean? If "headless" is derived from cluster-level services, is a supercluster service headless if any component service is headless? Only if all are? I was thinking along these lines, e.g. we create pod A/AAAA records based on exporting service headlessness, but at which point should we switch the service record from VIP to pods IPs? I don't really see a case for mixing and matching headless/non-headless services so maybe we can just say a supercluster service is headless if all component services are headless
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can define resolution, along the lines of "If any same-name exported cluster service has a VIP, the supercluster service also has a VIP. If all same-name exported cluster services are headless, the supercluster service is headless".
You need to be prepared to allocate a supercluster IP if one member service changes, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and every non-correct ServiceExport needs to get status, and maybe events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes sense to me. Adding to the KEP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's start with exporting
ClusterIP
- this is the the straightforward case most of the doc assumes. EachEndpointSlice
associated with the exported service is combined with slices from other clusters to make up the supercluster service. They will be imported to the cluster behind the supercluster IP.
An alternative, why not have one EndpointSlice
map to one ClusterIP
in one cluster. So if a service that exists in N clusters we will have N EndpointSlice
and we only need to route packets to that cluster's ClusterIP for the service. From there the kube-proxy of target cluster can route it to the pods locally. Client clusters don't need to be aware of POD IPs and will save with scale.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a built-in assumption that all clusters participating in the supercluster have the ability to resolve clusterIPs services (not exposed as NodePort or LB services) and/or clusterIPs of pods between each other? Can member clusters continue to have separate network overlays but allow private NodePort/LB service routing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current assumption is that all participating clusters have mutual connectivity at the Pod level - clusters do not need to resolve each other's clusterIPs, but do need to access each others Pods. The SuperclusterIP introduced here is a VIP, local to the importing cluster, that routes to Pods from exporting clusters.
Connecting directly to a Service LB would start to break down if multiple clusters exported the same service. You'd have multiple LB IPs, one for each cluster, but since there's no concept of weight in kube-proxy, each IP would get an even share of traffic regardless of the number of backing pods. However, I could see configurations where only the specific pods behind an exported service are exposed through a gateway - the work to update Service APIs should help with that
supercluster service DNS implementation must expose not-ready addresses for | ||
slices from that cluster. | ||
|
||
### Test Plan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can emulate these scenarios for testing with KIND
cc: @BenTheElder
I've described an example here https://github.com/kubernetes-sigs/kind/pull/1337/files#diff-7906da116e09ff32d7a6c92aaf1113abR63-R69
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KIND is great :) - I plan to work on this section over the next week
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following up here- I've been playing with this for local testing and it works great. Thanks for the tip!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI that in v0.8.0 (unreleased but soon) nodes will move from the default bridge to a kind
bridge (to get docker user defined network features, namely the embedded DNS which fixes many problems)
shouldn't affect this much hopefully, all nodes / clusters are still peers on the same docker network. kubernetes-sigs/kind#1337 may need some updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, good to know - i'll be following the release
cluster required to make on upgrade in order to make use of the enhancement? | ||
--> | ||
|
||
### Version Skew Strategy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will the version skew strategy here?
A Supercluster x will support individual clusters x-1, x-2, ...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a high level:
This comes down to the controller. The controller should be upgraded first and support current and N (TBD) previous API versions. It will need to understand how to read exports/services/endpoints from source clusters of any supported version and sync an aggregated service to each cluster in a way that remains as consistent and predictable as possible. Clusters that have not been upgraded yet won't be able to support new capabilities but they will still be in sync retaining previous behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just pick a reasonable minimal N and strategy for now. Allowing for greater than minimal required skew post Beta stage seems like asking for trouble.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed
/ok-to-test |
Create a provisional KEP for MC services based on SIG conversations and this original doc: http://bit.ly/k8s-mc-svc-api-proposal
- Adds a note on the use of ObjectReference to Alternatives - Clarifies the meaning of MCSD - Note on intention to use generic conditions once the relevant KEP is implemented. - Unresolved: scalability is still an open question
Importing some discussion from the public doc, adds notes on an unresolved alternative approach using label selectors instead of name mapping.
note ImportedService uniqueness
Rename based on PR feedback
Thanks! rebased |
/unassign @justaugustus |
/assign @pmorie |
// +optional | ||
Reason *string `json:"reason,omitempty"` | ||
// +optional | ||
Message *string `json:"message,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JeremyOT Should Message
contain list of clusters that were successfully synced? Won't be better to have a list of clusters synced and yet to be synced?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The not-yet-synced clusters would make sense to list in the message, as a human operator might use them to help track down a failure. For synced clusters, the implementation/registry have the list of available clusters, and a condition representing a successful sync would mean that all clusters are synced so I'm not sure what added benefit would come from listing them. I'd worry that we could end up having to carry a large number of cluster names on every export status in the happy path.
port: 80 | ||
endpoints: | ||
- addresses: | ||
- "10.1.2.3" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an expectation that the clusters should have non-overlapping Pod CIDR? or this endpoint address could be something assigned by the mcsd-controller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The basic case that we've been working with so far is that this would be used in a flat network with non-overlapping Pod CIDR ranges, but these could be alternative addresses assigned by the controller. The important part is that, at least until kube-proxy has some concept of weighting, traffic will be spread evenly between them so they should represent equivalent backends
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: JeremyOT, pmorie The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This is a great start. Let's merge so we can iterate with the numerous follow-ups that have been discussed. Thanks a lot @JeremyOT /lgtm |
Congratulations @JeremyOT good work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just keeping track of thoughts for later.
--> | ||
- Define specific implementation details beyond general API behavior. | ||
- Change behavior of single cluster services in any way. | ||
- Define what NetworkPolicy means for multi-cluster services. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to describe what happens with existing policies so that namespace admins who use network policy and then export their services are not surprised by the outcome. I think it's also important to describe whether network policy can be applied to imported services.
I see the following questions
- if I export a service with a restrictive network policy, will that policy...
- be honored by remote clusters
- create a deny-by-default behavior (this would appear safe)
- do something else
- if I import a service can I apply network policy to it in my local cluster
Even if changes are not proposed, being able to describe what happens is important to deciding whether or not to enable the feature.
expected common behavior. | ||
|
||
We propose a new CRD called `ServiceExport`, used to specify which services | ||
should be exposed across all clusters in the supercluster. `ServiceExports` must |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please link from here to a description of what happens when this doesn't happen. I see a couple cases
- ServiceExport exists in clusterA, but not clusterB. ClusterB does not have a conflicting service
- ServiceExport exists in clusterA, but not clusterB. ClusterB does have a conflicting service.
Another CRD called `ServiceImport` will be introduced to store information | ||
about the services exported from each cluster, e.g. topology. This is analogous | ||
to the traditional `Service` type in Kubernetes. Each cluster will have a | ||
coresponding `ServiceImport` for each uniquely named `Service` that has been |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for a cluster to import a service exported by another cluster, but not export that same service? I'm imagining a migration case where I've moved a namespace to a different cluster and no longer provide a local endpoint.
--> | ||
#### Terminology | ||
|
||
- **supercluster** - A placeholder name for a group of clusters with a high |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can a single cluster be part of multiple superclusters? For instance, can I have one supercluster for my DB2 admins, a different supercluster for my frontend developers, and a third one for corporate IT?
--> | ||
### Exporting Services | ||
|
||
Services will not be visible to other clusters in the supercluster by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if a single cluster is allowed to be part of multiple superclusters, is there a selection criteria for which superclusters it is a part of? I would expect not and that it would be up to which clusters schose to import it (If I have understood it correctly so far).
#### Restricting Exports #### | ||
|
||
Cluster administrators may use RBAC rules to prevent creation of | ||
`ServiceExports` in select namespaces. While there are no general restrictions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some more detail here would be appreciated. I'm not sure RBAC is fully sufficient, but recommendations and expectations would be ehlpful
|
||
#### SuperclusterIP | ||
|
||
When a `ServiceExport` is created, an IP address is reserved and assigned to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you walk this through with an example? I'm having some trouble working out what this means.
When a `ServiceExport` is created, this will cause `EndpointSlice` objects for | ||
the underlying `Service` to be created in each cluster within the supercluster. | ||
One or more `EndpointSlice` resources will exist for each cluster that exported | ||
the `Service`, with each `EndpointSlice` containing only endpoints from its |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if multiple constituent clusters in the supercluster share IP address space with each other on their pod, service, or node IP ranges?
there is no clear way to determine how a service should be accessed. **If any | ||
global properties have conflicts that can not be resolved, a condition will be | ||
set on the `ServiceExport` with a description of the conflict. The service will | ||
not be synced, and an error will be set on the status of each affected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there are some endpoints that are already exported, will they be forced out why a new and incompatible set of endpoints?
global properties have conflicts that can not be resolved, a condition will be | ||
set on the `ServiceExport` with a description of the conflict. The service will | ||
not be synced, and an error will be set on the status of each affected | ||
`ServiceExport` and any previously-derived `ServiceImports` will be deleted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this bit in particular makes it appear as though a new bad actor could bring down a previously functional service. Is there a way to contain the damage instead?
Create a provisional KEP for MC services based on SIG conversations and
this original doc: http://bit.ly/k8s-mc-svc-api-proposal