
Multi-Cluster Services API #1646

Merged: 10 commits merged into kubernetes:master on May 20, 2020

Conversation

JeremyOT (Member)

Create a provisional KEP for MC services based on SIG conversations and
this original doc: http://bit.ly/k8s-mc-svc-api-proposal

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 30, 2020
@k8s-ci-robot (Contributor)

Welcome @JeremyOT!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

Hi @JeremyOT. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 30, 2020
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Mar 30, 2020
@JeremyOT (Member Author)

/sig multicluster
/cc @pmorie @thockin

Looks like sig-multicluster doesn't have a KEPs folder yet, so maybe that needs to get set up with owners first

@k8s-ci-robot k8s-ci-robot added the sig/multicluster Categorizes an issue or PR as relevant to SIG Multicluster. label Mar 30, 2020
#### Terminology

- **supercluster** - a placeholder name for a group of clusters with a high degree of mutual trust and shared ownership that share services amongst themselves.
- **mcsd-controller** - a controller that syncs services across clusters. There may be multiple implementations; this doc describes expected common behavior.
Contributor

Would you mind elaborating on the meaning of mcsd here?

Member Author

updated this explanation

@JeremyOT JeremyOT requested a review from wojtek-t March 30, 2020 16:54
@smawson left a comment

Some peanut gallery questions and comments; overall this looks very nice!


The Multi-Cluster Services API aims to fix these problems.

### Goals

Is there an explicit goal to extend the topology model to support cross-cluster topology concerns? For example region/zone affinity?

Member Author

No explicit extension there - the hope is that whatever K8s does for service topology within a cluster can extend across clusters. For example, zonal affinity is already something that matters within clusters. The alpha Service Topology API would already seem to just work with multiple clusters, and recent sig-network discussion about simplifying topology to do the right thing with less configuration would apply here as well

-->
- Define specific implementation details beyond general API behavior.
- Change behavior of single cluster services in any way.
- Define what NetworkPolicy means for multi-cluster services.

I think we can't leave this out, we at least have to define some semantics here. Otherwise it is potentially possible to bypass network policy controls by using multi-cluster services.

I think at a minimum there needs to be some discussion on the semantics of existing network policy and multi-cluster services. At least ip-based policies should continue to work right? But namespace-based selectors probably won't...

Member Author

That's a fair point. As you say, ip-based policies should continue to work but selectors won't be able to cross the cluster boundary. I think multi-cluster network policy is critical and needs to be figured out, but I was hoping we can start to define how to extend Service first. Users will definitely want policy, but there are some mechanics that we're missing - e.g. how can we support selectors against backends in another cluster?

As for any possibility of circumventing policy in the meantime, I think that would be a risk the service owner needs to be aware of. To export an existing service you'd need to have access to the service namespace which I hope means you have permission to weigh the risk and make the decision.

Let me think about this and try to fill out some thoughts on what this means for existing policy
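For illustration only (this example is not part of the KEP; the namespace, labels, and CIDR below are hypothetical), this is the distinction being discussed: an ipBlock rule keeps working because it only matches packet source IPs, while a namespaceSelector is evaluated against namespaces in the local cluster and cannot select pods running in another cluster of the supercluster.

```yaml
# Hypothetical policy in the exporting cluster's "backend" namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: db
  policyTypes: ["Ingress"]
  ingress:
  - from:
    # Still effective across clusters: plain IP matching against a peer
    # cluster's pod CIDR (assumes a flat network; the CIDR is a placeholder).
    - ipBlock:
        cidr: 10.32.0.0/16
    # Only matches namespaces in *this* cluster; traffic from pods in other
    # clusters of the supercluster will not be selected by this rule.
    - namespaceSelector:
        matchLabels:
          name: frontend
```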

Member

Cilium ClusterMesh does global services and "supercluster" network policy.
They replicate state to cluster-local etcd instances.

It may be worth reviewing and/or citing their implementation for inspiration.


Some semantics exploration done months ago over sig-multicluster and sig-network https://docs.google.com/document/d/1_QzuasJPiQ-4t8tUGODoRS2E-Q3KLJajqMyCvLPH2tY/edit

Also, one implementation we started in Submariner (https://github.com/submariner-io/coastguard, not yet published) is just a POC implementing what the document describes, heavily based on kubefed...

@khenidak (Contributor, May 15, 2020)

I think the service topology work needs to be also called in/out here.

Another CRD called `ImportedService` will be introduced to store connection
information about the `Services` in each cluster, e.g. topology. This is
analogous to the traditional `Service` type in Kubernetes. Each cluster will
have an `ImportedService` for each `Service` that has been exported within the

Maybe clarify that this is per unique service name (namespace+name), and not per service resource. So if you have the "same" service in 5 different clusters each cluster will only have a single ImportedService representing it.

Member Author

Good catch - clarified


If multiple clusters export a `Service` with the same namespaced name, they will
be recognized as a single combined service. The resulting `ImportedService` will
reference endpoints from both clusters. Properties of the `ImportedService`

endpoint reference goes the other way right? EndpointSlice refers to the ImportedService. There will be an EndpointSlice for each cluster that has exported the service, all pointing to the same ImportedService.

Member Author

That's right, "reference" isn't the right word here; I'm not trying to dig into mechanics yet. I've loosened the language a bit. The method of association between EndpointSlices and ImportedService is left for the design details.
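Purely as an illustration of the shape being discussed (the API group, version, and field names below are placeholders, not the schema this KEP defines): each importing cluster would hold a single ImportedService per namespaced name, and the EndpointSlices synced from each exporting cluster are associated with it by a mechanism left to the design details.

```yaml
# Hypothetical imported service; one object per (namespace, name) in each
# importing cluster, no matter how many clusters export the Service.
apiVersion: multicluster.k8s.io/v1alpha1   # placeholder group/version
kind: ImportedService
metadata:
  name: my-svc       # same name as the exported Service(s)
  namespace: my-ns   # same namespace as the exported Service(s)
spec:
  ports:
  - protocol: TCP
    port: 80
```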


```
<<[UNRESOLVED]>>
Due to additional constraints that apply to stateful services (e.g. each cluster

A potential solution here is to allow statefully-named supercluster services. Originally I was thinking cluster name but if you want to support migration it should be something else.

This is similar to a feature to allow arbitrary supercluster names which we could add later (as a field on ExportedService for example). Then the controller would just aggregate based on supercluster name, with the default name being name.namespace.svc.supercluster.local if unspecified.

Member Author

What would a statefully-named supercluster service look like? When you say it would be a field on the export, are you thinking it would be an alias for the service name - as in exportName: "my-alias" is exported as my-alias.namespace.svc.supercluster.local? That seems like it could solve, e.g., exporting DB-A from cluster A and DB-B from cluster B.

What if I want to run something like Cassandra across multiple clusters? Then I may want a single my-cassandra.ns.svc.supercluster.local in front of the aggregate service. In that instance, it could be solved by requiring that users just use different names (cluster name prefix?) for the stateful set in each cluster, but it would be much better if we could do something that didn't require different configuration in each cluster.

Member

For these stateful services, I think we need to get some real use-cases before we can pin it down.

https://github.com/kubernetes/dns/blob/master/docs/specification.md says

"""
A unique, system-assigned identifier for the endpoint. The exact format and source of this identifier is not prescribed by this specification. However, it must be possible to use this to identify a specific endpoint in the context of a Service. This is used in the event no explicit endpoint hostname is defined.
"""

Assuming we stay similar, this means that DNS must provide a unique name to every endpoint of every supercluster service. It does not (yet) prescribe what that name should be. Given that clusters are uncoordinated today in this regard, we should not assume they will become coordinated. The "obvious" solution is to use cluster-name as part of the name.

E.g. Service 's' has 2 endpoints in cluster "cla" and 2 endpoints in cluster "clb". The supercluster DNS would have 1 VIP record s.<ns>.<suffix> and 4 individual endpoints <uniq1>.cla.s.<ns>.<suffix>, <uniq2>.cla.s.<ns>.<suffix>, <uniq3>.clb.s.<ns>.<suffix>, <uniq4>.clb.s.<ns>.<suffix>. This would allow us to respect a Pod's requested "hostname" (but only within the context of its cluster name).

Statefulset services COULD follow the same suit, but now I question whether the "within the context of its cluster" makes any sense. If so, good, problem solved (?). If not, then the whole model needs to be rethought.

Absent real requirements this is hard, so let's keep this as an open item - what are the expectations on supercluster DNS?

Member Author

Supercluster DNS needs some definition for sure.

To further this discussion, I've been leaning toward this obvious cluster-scoped model. It seems like <hostname>.<cluster_id>.<svc>.<ns>.* would provide necessary flexibility. Each pod gets its own unique name supercluster-wide but we still respect the hostname. It does of course mean that you can't connect to a specific pod with only its hostname and service name, but would there be a situation where that's needed and you don't also know the pod's cluster? If you had supercluster-unique hostnames, it would still be possible to look up the pod's FQDN via SRV records
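Applying that layout to the earlier two-cluster example (service "s", clusters "cla" and "clb"; the hostnames are placeholders), the records might look like the following sketch:

```
s.<ns>.<suffix>              # supercluster VIP record for Service "s"
pod-0.cla.s.<ns>.<suffix>    # endpoint with hostname "pod-0" in cluster "cla"
pod-0.clb.s.<ns>.<suffix>    # the same hostname can safely repeat in "clb"
```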

Member

Two thoughts:

  1. Careful with "cluster ID" vs "cluster name". There is/was a proposal to formalize ClusterID, and I don't know if that is what you mean. A "group-unique cluster identifier or name" is probably sufficient.

  2. Elsewhere I suggested MAYBE .<svc>.s.<ns>.<suffix> to leave room for other kinds of names. E.g., DNS for Gateway is not fully locked down.

Member Author

  1. I mean the "group-unique cluster identifier or name". I'll add cluster_name to the terminology section and use that moving forward so it's clear.

  2. I think that structure generally makes more sense. My take is that this needs to be driven from Service, and MCS can be co-developed to adopt improvements to Service. I'm worried about trying to independently improve DNS for MCS and diverging from Service. `.svc.` is not ideal, but it's consistent and there's no room for confusion coming from single cluster services.

these will be considered the same service and will be combined at the
supercluster level.

This requires that within a supercluster, a given namespace is governed by a

I think we should mention somewhere that we can expand this later to support explicit control over mapping, but that the proposal is to start with identity mapping and carefully add more configurability as needed. For example we could support the arbitrary name mapping as I noted above, but that may require some notion of RBAC control over the mapped-into names (otherwise it would be easy for malicious users to inject themselves into others' names). Similarly we may want to offer control over what namespace a particular cluster is allowed to export services from. So cluster A and cluster B are in a supercluster together, and cluster A only trusts namespace Foo from cluster B, and not namespace Bar. So even if cluster B has an ExportedService in namespace Bar cluster A would not import those endpoints.

Anyway too much detail but maybe a little wording here that this proposal allows for those types of things to come in the future.

Member Author

> that may require some notion of RBAC control over the mapped-into names (otherwise it would be easy for malicious users to inject themselves into others' names)

Yeah this is one of the big draws for name-mapped exports. As soon as you allow cross-namespace exporting (either as source or target), we have to start creating custom RBAC. With name-mapping it's easy enough to disallow creation of ServiceExport resources.

> we may want to offer control over what namespace a particular cluster is allowed to export services from

This is a good point. The way I've been thinking about this would be namespace level control over whether or not exports are allowed, but there could be a cluster level control over whether or not that namespace is allowed in the cluster. Would there be a case that you'd want to actively use a namespace in all clusters, but only allow exports from certain clusters?

> So cluster A and cluster B are in a supercluster together, and cluster A only trusts namespace Foo from cluster B, and not namespace Bar. So even if cluster B has an ExportedService in namespace Bar cluster A would not import those endpoints.

In my model, this would mean that cluster A does not trust namespace Bar and doesn't have anything specifically to do with cluster B. Since a given namespace should have a single owner throughout the supercluster, I take this to mean cluster A doesn't trust ns Bar's owner and as a result nothing in cluster A is allowed to access anything in (or deploy anything to) namespace Bar. This seems reasonable to me, so long as namespace Bar isn't used for a different purpose/by a different team in cluster A.

What if it was specified that so long as namespace sameness is respected, an implementation could offer to exclude subsets of clusters from importing services in a given namespace? I wonder though, in this scenario, does anyone trust Bar across cluster boundaries? If no, then maybe this comes back to just prohibiting Bar from creating exports.

-->
#### Terminology

- **supercluster** - a placeholder name for a group of clusters with a high

Not sure if I saw it in this doc, but we need to define exclusivity semantics. Specifically is a supercluster an exclusive group, or does each cluster decide what other clusters are in its supercluster? Is asymmetry allowed or is symmetry enforced? (if cluster A thinks cluster B is in its supercluster, cluster B thinks cluster A is in its supercluster)

Member Author

Good point. We've been assuming it's symmetric and transitive. I've made that explicit. If cluster A thinks it's in a supercluster with B and C, then all three clusters agree.

When a `ServiceExport` is created, this will cause a domain name for the
multi-cluster service to become accessible from within the supercluster. The
domain name will be
`<service-export-name>.<service-export-namespace>.svc.supercluster.local`.

Need to determine:

  1. If the supercluster.local suffix can be overridden, and if so on a per-cluster or supercluster basis.
  2. What happens if the user happens to have used "supercluster.local" as their suffix for their cluster.

For #2, maybe it should not be .svc. and instead something else? Then there would be no potential for conflict. .ssvc.cluster.local? .super.cluster.local?

Member Author

Yeah this is a tough one. I think this needs further discussion. In an ideal world we would be able to reserve a name here, but that is not our world :). First thoughts:

  1. I think ideally we'd have some consistent standard name here, and you could create an alias if needed. I don't see why this couldn't be per-cluster as with the VIP - you'd likely want it to be supercluster level, but that's up to the implementation since each cluster consumes a service independently based solely on local information.

  2. Maybe something.cluster.local is a solution. It would avoid conflicts, but semantically it feels like a big stretch for cluster.local as it's explicitly not cluster local.

Member

I was dreading coming to this.

First, we should not perpetuate .local for this. There ARE some reserved TLDs that we can use: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 calls them "user-assigned" and sets aside x[a-z]. So we could choose "supercluster.xx" or "mesh.xl" or something more creative.

Nothing we can do, short of using cluster.local, can prevent us picking a name someone has used. So we either go forth cautiously with a name that we think is exceedingly unlikely to be in use (as we did with cluster.local) or we plan to allow this to be configurable. It's already obnoxious to find the suffix of your local cluster. This will be even worse, unless sig-multi-cluster can normalize cluster-ID and group-membership metadata (which seems useful to me).

We don't really need the "svc" literal in there do we? Why not <svc>.<ns>.<suffix> ? Do we really believe we'll have supercluster names for non-services? If so, I'll propose <svc>.s.<ns>.<suffix> (because I have proposed the same structure for a reboot of in-cluster DNS).

See other comments regarding endpoint names.

We will need to come up with a carefully worded spec for supercluster DNS. Can you please put that as a graduation criteria?

Member Author

Agree with all points. I'll absolutely add DNS spec as a graduation criteria.

In the meantime, it seems like a few questions have come up repeatedly in discussions around this KEP:

  • Should the name be fixed or configurable? Fixed makes life easier, but users may want customization.
  • Should supercluster services use a new suffix or live under cluster.local (e.g. super.cluster.local)?
  • What should the structure of a supercluster address be?

I don't think we need the svc literal in there, but I don't think we can say at this point that there will never be supercluster names for non-services (though I don't know what they would be). The main reason this proposal has kept the literal is for familiarity with cluster.local names, which seems like a good thing to maintain. I'm all for restructuring in-cluster DNS and supercluster DNS should follow suit, but it seems risky to develop supercluster DNS improvements independently

Member

The main failing of svc.<suffix> is that it is used in search paths (which I do not expect here), which makes for 2 user-controlled tokens which can be ambiguous. It's also WRONG from a hierarchy point of view (services are smaller than namespaces), but I think I'd be OK to keep it for consistency.

Member Author

Completely agree. Consistency is the main driver, but fixing the hierarchy in general would be great and MCS should adopt any improvements

`<service-export-name>.<service-export-namespace>.svc.supercluster.local`.
Requests to this domain name from within the supercluster will resolve to the
supercluster VIP, which points to the endpoint addresses for pods within the
underlying `Service`(s) across the supercluster.
Member

I know that it says that existing Service routing will not change, but can we please add a note here explicitly stating:

  • clients will need to use the supercluster-specific DNS zone to receive multi-cluster routing
  • the existing default svc.cluster.local zone behavior is unchanged even when ImportedServices and matching EndpointSlices exist

I became concerned about suggesting we exclude default/kubernetes and kube-system/kube-dns from being exported, because I incorrectly assumed exporting them would cause undesirable changes to existing behavior until I read the DNS section more carefully.

It may also be beneficial to mention that exporting these special/canonical services is allowed and may have valid use-cases such as dns-aggregation or apiserver-monitoring. (perhaps here or earlier on when discussing use-cases or export/RBAC behavior)

Member Author

Great point - I've added a note to make that clear
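As a hedged illustration of that note (the service, namespace, and image names are made up): a client opts in to multi-cluster routing only by addressing the supercluster zone, while the ordinary cluster-local name keeps resolving exactly as it does today, even if the Service is exported.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: client
spec:
  containers:
  - name: app
    image: example.com/app:latest   # placeholder image
    env:
    - name: GLOBAL_BACKEND_URL      # routed to endpoints across the supercluster
      value: "http://my-svc.my-ns.svc.supercluster.local"
    - name: LOCAL_BACKEND_URL       # unchanged single-cluster behavior
      value: "http://my-svc.my-ns.svc.cluster.local"
```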

What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->
- Define specific implementation details beyond general API behavior.
Member

Thanks for the excellent proposal.

This design helps me understand how we could use the existing in-cluster instances of CoreDNS or a supercluster specific deployment of CoreDNS to fulfill the DNS requirements for MCSD.

This is very feasible from cluster-local CoreDNS instances since the ExportedService state is imported/replicated throughout the API servers of the supercluster.

Writing a new CoreDNS plugin that modifies the kubernetes / kubernetai plugin behavior in the style of k8s_external and federation should work well.

Member Author

That's great to hear! Thank you for your thorough comments and feedback!


The Lighthouse plugin for CoreDNS currently has a design that is a very close fit to this proposal. Over the next few weeks we're planning to come up with a PoC implementation of this proposal in Lighthouse.

Member Author

awesome!

`EndpointSlice` is associated with its `Service` in a single cluster. Each
imported `EndpointSlice` will also have a
`multicluster.kubernetes.io/source-cluster` label with a registry-scoped unique
identifier for the cluster.
@stealthybox (Member, Apr 5, 2020)

Do any existing core cluster-components such as kube-proxy/CoreDNS/kube-dns need to be modified to ignore these new EndpointSlices?
What about ingress-controllers, prometheus / kube-prometheus, and CNI related daemons?
Will using EndpointSlices in this way potentially affect users' scripts, automation, or integrations?

Have we considered also cloning the API into a CRD like ImportedEndpointSlices in the same way we're doing for Services?

This would make the UX for kubectl get ep,importedendpoints match kubectl get svc,serviceexports,importedservices instead of conflating one but not the other.

Member Author

Existing components should not require any modification to ignore these new slices - that was one driver for using ImportedService instead of trying to extend Service. One of the nice features of EndpointSlice compared to Endpoints is that they are not strictly coupled to a Service, so as long as we don't use the kubernetes.io/service-name label, our slices should have no impact on existing components that discover them by association with a Service.
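A sketch of such an imported slice (the name, cluster identifier, and address are hypothetical; the source-cluster label is the one proposed in this KEP). The key detail is the absence of the kubernetes.io/service-name label, so existing consumers that look up slices by Service association never see it.

```yaml
apiVersion: discovery.k8s.io/v1beta1   # EndpointSlice API version current at the time
kind: EndpointSlice
metadata:
  name: imported-my-svc-cluster-b-1    # hypothetical generated name
  namespace: my-ns
  labels:
    multicluster.kubernetes.io/source-cluster: cluster-b
    # Deliberately no kubernetes.io/service-name label, so kube-proxy,
    # CoreDNS, etc. do not associate this slice with any local Service.
addressType: IPv4
ports:
- name: http
  protocol: TCP
  port: 80
endpoints:
- addresses:
  - "10.1.2.3"   # Pod IP in the exporting cluster (placeholder)
```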

Member

Modulo scalability aspects (which in general we didn't yet solve), because things like DNS, kube-proxy etc. watch all endpoints (so may OOM or sth...)

Member Author

Good point - though existing logic shouldn't be impacted by these new unlabelled slices (at least from the Service API's POV), they will still trigger events and could potentially have noticeable side effects at scale.

`Service` owner who controls whether or not a given `Service` is exported. Using
`ObjectReference` instead would also open the possibility of having multiple
exports acting on a single service and would require more effort to determine if
a given service has been exported.
Member

It may be worth noting somewhere here that allowing users to create ServiceExports in the default namespace will allow people to export the default kubernetes Service.

This is more of an administrator action, but it is very common for app teams and non admins to have wide RBAC in the default namespace.

We should encourage people intending to use an MCSD implementation to be wary of this.
One suggested solution could be to not use the default namespace for apps or create specific RBAC to guard against it.
Otherwise, this seems like a set of defaults that could compose into mistakes and security issues.

Member Author

I agree - default and kube-system both need to be used cautiously. There may be good reasons to allow exports from either, but it would be a best practice to restrict exports without clear justification. I've added a "Restricting Exports" section to clarify that
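A minimal sketch of the RBAC approach mentioned above (the namespace, group name, and ServiceExport API group are placeholders): since RBAC is deny-by-default, an administrator restricts exports simply by granting create on the export resource only in namespaces where exporting is intended, and never in default or kube-system.

```yaml
# Hypothetical: only members of "team-a-admins" may export Services,
# and only from the "team-a" namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: service-exporter
  namespace: team-a
rules:
- apiGroups: ["multicluster.k8s.io"]   # placeholder API group for ServiceExport
  resources: ["serviceexports"]
  verbs: ["create", "get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: service-exporter
  namespace: team-a
subjects:
- kind: Group
  name: team-a-admins                  # hypothetical group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: service-exporter
  apiGroup: rbac.authorization.k8s.io
```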


#### Endpoint TTL

To prevent stale endpoints from persisting in the event that a cluster becomes
Member

Another important problem that tends to happen in these scenarios (commonly with BGP peering) is instability or flapping due to problems with the peers, for example because of network hiccups or other connectivity problems. If it connects and reconnects constantly, it creates a lot of pressure on the clusters, creating and deleting resources. Maybe we should indicate that we should add some protection to avoid this?

I don't think implementation details are necessary right now, but just for reference and completeness, BGP solves this with flap damping (https://tools.ietf.org/html/rfc2439); any mechanism that guarantees a certain stability would be valid.

Member Author

We've purposefully avoided specifying exactly how the supercluster reconciles services between clusters beyond what's necessary to implement the API, but here's an example of how I see this working:

The controller periodically validates connectivity to each cluster in the supercluster through some mechanism, then in each cluster it manages (in a centralized model it would be all clusters, in a distributed model maybe just the local cluster) it would renew the source-cluster-specific lease for each currently connected cluster. The TTL would be some multiple of the validation interval and allow for some flakiness. This way we'd only cleanup endpoints if a source cluster was unreachable for a sustained period.

For the reasons you suggest, I don't think a supercluster controller would want to explicitly delete endpoints for a cluster because it was unreachable - rather let the lease expire in that case. This way there should be no action if there's a temporary disconnect so long as it recovered before the TTL.

What if this instead read:

To prevent stale endpoints from persisting in the event that a cluster becomes
unreachable to the supercluster controller for a sustained period of time, ...

Does that make more sense?
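Purely as a sketch of that lease idea (the KEP does not mandate this mechanism; the names, namespace, and durations below are made up): the controller could keep one coordination.k8s.io Lease per source cluster and treat that cluster's endpoints as stale only when the lease has not been renewed within the TTL.

```yaml
# Hypothetical per-source-cluster lease renewed by the mcsd-controller.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: cluster-b              # one lease per source cluster
  namespace: mcs-system        # placeholder namespace
spec:
  holderIdentity: mcsd-controller
  leaseDurationSeconds: 300    # TTL: a multiple of the connectivity-check interval
  renewTime: "2020-05-01T00:00:00.000000Z"
```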

Member

Yeah, I was more concerned about this problem possibly being overlooked, but it's obvious you have it very clear and well designed 😅. Current wording works for me too. Thanks for the clarification.

Member

My big concern here is to quickly weed out endpoints that are not alive. Absent real health-checking, this is hard.

Member

We should figure out how to do better than TTL (see DNS caching)

Member Author

I've been thinking about this, and maybe TTL doesn't need to be part of this spec and instead made a responsibility of the implementation. Even as described, TTL only handles the worst case scenario should a cluster or supercluster controller become unavailable or otherwise stop functioning. That's clearly necessary, but there are various ways to go about it.

An implementation should strive to do better than this - for example, quickly cleaning up endpoints that have been removed from their source cluster and/or actually performing health checking. I don't think it's necessary (or desirable) to define that behavior, as the preference may depend on the environment or the need to prioritize other goals.

Member

Absent a kubernetes-level healthcheck spec, implementations are free to do whatever they want, but there's no consistency. Adding an HC block to Services may be the "real" solution, but that's probably out-of-scope for this KEP.

Member Author

Agreed. That said, what are thoughts on keeping/removing TTL as part of the spec?

  • We keep it to bound worst-case endpoint cleanup behavior for all implementations, but with the understanding that implementations should strive to do better. This still means inconsistent behavior, but it at least guarantees that endpoints won't live forever.
  • We remove it and leave cleanup entirely to the implementation. There's even less consistency and we open the possibility of technically compliant implementations that leave dead endpoints forever.

It seems like a useful thing to have right now, but if we expect e.g. an HC block to be added to Services, TTL may become redundant quickly as MCS adopts that instead.


We propose a new CRD called `ServiceExport`, used to specify which services
should be exposed across all clusters in the supercluster. `ServiceExports` must
be created in each cluster that the underlying `Service` resides in. Creation of
Member

The Service object is a key piece of this KEP but I didn't find any mention of the Type of Service. Is this KEP addressing every Type of Service?

  • ClusterIP: headless and non-headless
  • NodePort
  • LoadBalancer
  • ExternalName

My impression after reading it, and from the mentions of the EndpointSlices and the SuperClusterIP, is that it is targeting ClusterIP services only - is that correct? 😅

Member Author

You're right, this needs some more clarification. This is how I see it:

The resulting SuperclusterIP service that's imported into each cluster is like ClusterIP (no node port, no external LB IP), but the source services don't have to be. The supercluster endpoints are the union of endpoints attached to services exported from each cluster, regardless of how those endpoints are used in-cluster.

Let's start with exporting ClusterIP - this is the straightforward case most of the doc assumes. Each EndpointSlice associated with the exported service is combined with slices from other clusters to make up the supercluster service. They will be imported to the cluster behind the supercluster IP.

Headless - Today's headless services likely don't want a VIP and may not function properly behind one - there's a note a little further down about stateful sets and the complications that come with them. We haven't answered what that could look like. It probably doesn't make sense to export a current headless service to the supercluster; it would work, but likely not the way you want.

NodePort and LoadBalancer type services both create ClusterIP services and that's what would sync. For example, if you export a NodePort service, the resulting cross-cluster service will still be a SuperclusterIP type. You could use node ports to access the cluster-local service in the source cluster, but not in any other cluster, and it would only route to local endpoints.

ExternalName really is the outlier here - it doesn't make sense to export an ExternalName service. They can't be merged with other exports, and it seems like it would only complicate deployments by even attempting to stretch them across clusters. Why not just create regular ExternalName services in each cluster that needs them? Perhaps we should just disallow exporting them. What do you think?
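To make the mechanics above concrete, a hedged sketch (the Service, namespace, and the ServiceExport API group/version are placeholders): the export is name-mapped, so it lives in the same namespace and carries the same name as the Service it exports, whatever that Service's type; only the cluster-IP aspect is combined across clusters.

```yaml
# Existing Service in the exporting cluster (NodePort in this example).
apiVersion: v1
kind: Service
metadata:
  name: my-svc
  namespace: my-ns
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
---
# Hypothetical ServiceExport; API group/version are placeholders.
apiVersion: multicluster.k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: my-svc       # must match the Service name
  namespace: my-ns   # must match the Service namespace
```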

Member

Why not make headless cluster services become headless supercluster services?

Member Author

That would make sense, but mechanically what does that mean? If "headless" is derived from cluster-level services, is a supercluster service headless if any component service is headless? Only if all are? I was thinking along these lines, e.g. we create pod A/AAAA records based on the exporting service's headlessness, but at what point should we switch the service record from VIP to pod IPs? I don't really see a case for mixing and matching headless/non-headless services, so maybe we can just say a supercluster service is headless if all component services are headless.

Member

I think we can define resolution, along the lines of "If any same-name exported cluster service has a VIP, the supercluster service also has a VIP. If all same-name exported cluster services are headless, the supercluster service is headless".

You need to be prepared to allocate a supercluster IP if one member service changes, though.

Member

and every non-correct ServiceExport needs to get status, and maybe events.

Member Author

Yeah, that makes sense to me. Adding to the KEP


> Let's start with exporting ClusterIP - this is the straightforward case most of the doc assumes. Each EndpointSlice associated with the exported service is combined with slices from other clusters to make up the supercluster service. They will be imported to the cluster behind the supercluster IP.

An alternative: why not have one EndpointSlice map to one ClusterIP in one cluster? So for a service that exists in N clusters we would have N EndpointSlices, and we only need to route packets to that cluster's ClusterIP for the service. From there the kube-proxy of the target cluster can route it to the pods locally. Client clusters don't need to be aware of Pod IPs, which will help with scale.


Is there a built-in assumption that all clusters participating in the supercluster have the ability to resolve ClusterIP services (not exposed as NodePort or LB services) and/or the IPs of pods between each other? Can member clusters continue to have separate network overlays but allow private NodePort/LB service routing?

Member Author

The current assumption is that all participating clusters have mutual connectivity at the Pod level - clusters do not need to resolve each other's clusterIPs, but do need to access each other's Pods. The SuperclusterIP introduced here is a VIP, local to the importing cluster, that routes to Pods from exporting clusters.

Connecting directly to a Service LB would start to break down if multiple clusters exported the same service. You'd have multiple LB IPs, one for each cluster, but since there's no concept of weight in kube-proxy, each IP would get an even share of traffic regardless of the number of backing pods. However, I could see configurations where only the specific pods behind an exported service are exposed through a gateway - the work to update Service APIs should help with that

supercluster service DNS implementation must expose not-ready addresses for
slices from that cluster.

### Test Plan
@aojea (Member, Apr 9, 2020)

You can emulate these scenarios for testing with KIND
cc: @BenTheElder
I've described an example here: https://github.com/kubernetes-sigs/kind/pull/1337/files#diff-7906da116e09ff32d7a6c92aaf1113abR63-R69

Member Author

KIND is great :) - I plan to work on this section over the next week

Member Author

Following up here- I've been playing with this for local testing and it works great. Thanks for the tip!

Member

FYI that in v0.8.0 (unreleased but soon) nodes will move from the default bridge to a kind bridge (to get docker user defined network features, namely the embedded DNS which fixes many problems)

shouldn't affect this much hopefully, all nodes / clusters are still peers on the same docker network. kubernetes-sigs/kind#1337 may need some updates.

Member Author

Thanks, good to know - I'll be following the release.

cluster required to make on upgrade in order to make use of the enhancement?
-->

### Version Skew Strategy
Member

What will the version skew strategy be here?
A Supercluster x will support individual clusters x-1, x-2, ...?

Member Author

At a high level:

This comes down to the controller. The controller should be upgraded first and support current and N (TBD) previous API versions. It will need to understand how to read exports/services/endpoints from source clusters of any supported version and sync an aggregated service to each cluster in a way that remains as consistent and predictable as possible. Clusters that have not been upgraded yet won't be able to support new capabilities but they will still be in sync retaining previous behavior.

Member

I would just pick a reasonable minimal N and strategy for now. Allowing for greater than minimal required skew post Beta stage seems like asking for trouble.

Member Author

agreed

@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label Apr 21, 2020
@bowei (Member)

bowei commented Apr 24, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 24, 2020
Create a provisional KEP for MC services based on SIG conversations and
this original doc: http://bit.ly/k8s-mc-svc-api-proposal
- Adds a note on the use of ObjectReference to Alternatives
- Clarifies the meaning of MCSD
- Note on intention to use generic conditions once the relevant KEP is
implemented.
- Unresolved: scalability is still an open question
Importing some discussion from the public doc, adds notes on an
unresolved alternative approach using label selectors instead of name
mapping.
note ImportedService uniqueness
Rename based on PR feedback
@JeremyOT (Member Author)

Thanks! rebased
/assign @justaugustus

@JeremyOT (Member Author)

/unassign @justaugustus

@JeremyOT (Member Author)

/assign @pmorie

// +optional
Reason *string `json:"reason,omitempty"`
// +optional
Message *string `json:"message,omitempty"`

@JeremyOT Should Message contain the list of clusters that were successfully synced? Wouldn't it be better to have a list of clusters synced and yet to be synced?

Member Author

The not-yet-synced clusters would make sense to list in the message, as a human operator might use them to help track down a failure. For synced clusters, the implementation/registry have the list of available clusters, and a condition representing a successful sync would mean that all clusters are synced so I'm not sure what added benefit would come from listing them. I'd worry that we could end up having to carry a large number of cluster names on every export status in the happy path.
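A hedged sketch of what that could look like on a ServiceExport status (the condition type, reason, and cluster names are hypothetical, and this is only a fragment): a pending or failing sync enumerates the outstanding clusters in the message, while the happy path stays small.

```yaml
# Hypothetical ServiceExport status fragment during a partial sync.
status:
  conditions:
  - type: Exported            # placeholder condition type
    status: "False"
    reason: SyncPending       # placeholder reason
    message: "Waiting for sync to clusters: cluster-c, cluster-d"
```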

port: 80
endpoints:
- addresses:
- "10.1.2.3"


Is there an expectation that the clusters should have non-overlapping Pod CIDRs? Or could this endpoint address be something assigned by the mcsd-controller?

Member Author

The basic case that we've been working with so far is that this would be used in a flat network with non-overlapping Pod CIDR ranges, but these could be alternative addresses assigned by the controller. The important part is that, at least until kube-proxy has some concept of weighting, traffic will be spread evenly between them so they should represent equivalent backends

@pmorie (Member)

pmorie commented May 20, 2020

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JeremyOT, pmorie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 20, 2020
@pmorie (Member)

pmorie commented May 20, 2020

This is a great start. Let's merge so we can iterate with the numerous follow-ups that have been discussed. Thanks a lot @JeremyOT

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 20, 2020
@k8s-ci-robot k8s-ci-robot merged commit a8b2056 into kubernetes:master May 20, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone May 20, 2020
@mangelajo

Congratulations @JeremyOT, good work.

@deads2k (Contributor) left a comment

just keeping track of thoughts for later.

-->
- Define specific implementation details beyond general API behavior.
- Change behavior of single cluster services in any way.
- Define what NetworkPolicy means for multi-cluster services.
Contributor

I think you need to describe what happens with existing policies so that namespace admins who use network policy and then export their services are not surprised by the outcome. I think it's also important to describe whether network policy can be applied to imported services.

I see the following questions

  1. if I export a service with a restrictive network policy, will that policy...
    1. be honored by remote clusters
    2. create a deny-by-default behavior (this would appear safe)
    3. do something else
  2. if I import a service can I apply network policy to it in my local cluster

Even if changes are not proposed, being able to describe what happens is important to deciding whether or not to enable the feature.

expected common behavior.

We propose a new CRD called `ServiceExport`, used to specify which services
should be exposed across all clusters in the supercluster. `ServiceExports` must
Contributor

Please link from here to a description of what happens when this doesn't happen. I see a couple cases

  1. ServiceExport exists in clusterA, but not clusterB. ClusterB does not have a conflicting service
  2. ServiceExport exists in clusterA, but not clusterB. ClusterB does have a conflicting service.

Another CRD called `ServiceImport` will be introduced to store information
about the services exported from each cluster, e.g. topology. This is analogous
to the traditional `Service` type in Kubernetes. Each cluster will have a
corresponding `ServiceImport` for each uniquely named `Service` that has been
Contributor

Is it possible for a cluster to import a service exported by another cluster, but not export that same service? I'm imagining a migration case where I've moved a namespace to a different cluster and no longer provide a local endpoint.

-->
#### Terminology

- **supercluster** - A placeholder name for a group of clusters with a high
Contributor

Can a single cluster be part of multiple superclusters? For instance, can I have one supercluster for my DB2 admins, a different supercluster for my frontend developers, and a third one for corporate IT?

-->
### Exporting Services

Services will not be visible to other clusters in the supercluster by default.
Contributor

if a single cluster is allowed to be part of multiple superclusters, is there a selection criteria for which superclusters it is a part of? I would expect not, and that it would be up to which clusters choose to import it (if I have understood it correctly so far).

#### Restricting Exports ####

Cluster administrators may use RBAC rules to prevent creation of
`ServiceExports` in select namespaces. While there are no general restrictions
Contributor

Some more detail here would be appreciated. I'm not sure RBAC is fully sufficient, but recommendations and expectations would be helpful.


#### SuperclusterIP

When a `ServiceExport` is created, an IP address is reserved and assigned to
Contributor

can you walk this through with an example? I'm having some trouble working out what this means.

When a `ServiceExport` is created, this will cause `EndpointSlice` objects for
the underlying `Service` to be created in each cluster within the supercluster.
One or more `EndpointSlice` resources will exist for each cluster that exported
the `Service`, with each `EndpointSlice` containing only endpoints from its
Contributor

What happens if multiple constituent clusters in the supercluster share IP address space with each other on their pod, service, or node IP ranges?

there is no clear way to determine how a service should be accessed. **If any
global properties have conflicts that can not be resolved, a condition will be
set on the `ServiceExport` with a description of the conflict. The service will
not be synced, and an error will be set on the status of each affected
Contributor

if there are some endpoints that are already exported, will they be forced out by a new and incompatible set of endpoints?

global properties have conflicts that can not be resolved, a condition will be
set on the `ServiceExport` with a description of the conflict. The service will
not be synced, and an error will be set on the status of each affected
`ServiceExport` and any previously-derived `ServiceImports` will be deleted
Contributor

this bit in particular makes it appear as though a new bad actor could bring down a previously functional service. Is there a way to contain the damage instead?

Labels: approved, cncf-cla: yes, kind/kep, lgtm, ok-to-test, sig/architecture, sig/multicluster, size/XL