
Proposal: Multi-Cluster Allocation Policies #597

Closed
jkowalski opened this issue Feb 15, 2019 · 34 comments
Labels
area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/design Proposal discussing new features / fixes and how they should be implemented
Comments

@jkowalski
Contributor

Background

When operating multiple Agones clusters to support a world-wide game launch, it is often necessary to perform multi-cluster allocations (allocations from a set of clusters rather than just one) based on defined policies.

Examples of common policies include:

  • burst to cloud - first allocate from a fleet in a cluster located on premises, but when it runs out of capacity, start allocating from a fleet hosted in the cloud.
  • provider preference - prefer certain cloud providers over others based on factors such as cost or quality of service, with the ability to fall back to other providers when capacity is not available.
  • load spreading - distribute allocations across clusters from different providers to limit the blast radius of a single-cluster outage (e.g. a network fiber cut).

Proposal

This document proposes changing how the GameServerAllocation API works by adding support for forwarding allocation requests to other clusters, based on policies that can be applied on a per-request basis.

We will add a new CRD called GameServerAllocationPolicy that controls how multi-cluster allocations are performed. The policy contains a list of clusters to allocate from, with corresponding priorities and weights. The credentials for accessing those clusters are stored in Secrets.

apiVersion: "stable.agones.dev/v1alpha1"
kind: GameServerAllocationPolicy
metadata:
  name: my-policy
  namespace: default
spec:
  clusters:
  - name: on-prem
    priority: 1
    server: "https://some-endpoint"
    credentials: "some-secret-name"
  - name: another-cloud
    priority: 2
    weight: 100
    server: "https://some-other-endpoint"
    credentials: "some-other-secret-name"
  - name: this-cluster
    priority: 2
    weight: 300
    server: "local"

The name of the policy can be specified when creating a GameServerAllocation.

apiVersion: "stable.agones.dev/v1alpha1"
kind: GameServerAllocation
metadata:
  generateName: my-allocation-
spec:
  policy: my-policy
  required:
    matchLabels:
      game: my-game
  preferred:
    - matchLabels:
        stable.agones.dev/fleet: green-fleet
  scheduling: Packed
  metadata:
    ...

If no policy is specified, the allocation is attempted from the local cluster, as is done today.

When a policy is present on the GameServerAllocation request, the API handler becomes a router that calls the specified clusters in priority order. For clusters with equal priority, it randomly picks a cluster, with the probability of choosing each cluster proportional to its weight. If a cluster is out of capacity, the handler tries the remaining clusters until the allocation succeeds or the list of clusters is exhausted.

In the example above, when an allocation request comes in, we would always try allocating from the on-prem cluster first, because it is the cluster with the highest priority.

If allocation from on-prem fails, we proceed to the next highest priority, which includes two possible clusters: this-cluster (with 75% probability) or another-cloud (with 25% probability).

Deployment Topologies

In a multi-cluster scenario, several allocation topologies are possible, depending on where the GameServerAllocationPolicy objects are placed:

Single Cluster Responsible For Routing

In this mode, a single cluster is selected to serve the multi-cluster allocation APIs, and the matchmaker is pointed at its allocation endpoint. This cluster holds GameServerAllocationPolicy objects that point at all other clusters. This has the benefit of simplicity, but the chosen cluster is a single point of failure.

Pros:

  • simple configuration of policies (one place)
  • simple configuration of secrets (one place)

Cons:

  • single point of failure
  • secrets hosted in the cluster that's running game servers

[Diagram: single routing cluster]

Dedicated Cluster Responsible For Routing

Another option, similar to the single-cluster one, is to create a dedicated cluster that is responsible only for allocations and does not host game servers or fleets. This cluster holds only the routing policies and the secrets needed to talk to other clusters.

Pros:

  • simplified management of policies (one cluster only)
  • simplified management of secrets (one cluster only)
  • no secrets are stored in clusters where game servers run

Cons:

  • still a single point of failure
  • additional Agones cluster to manage

[Diagram: dedicated routing cluster]

All Clusters Responsible For Routing

In this mode, all clusters have the policies and secrets that allow them to route allocation requests to all other clusters when necessary. A global load balancer (a VIP or DNS-based) randomly picks a cluster to allocate from, which then performs an extra "hop" to the target cluster based on the policy.

[Diagram: all clusters routing]

Pros:

  • resilient to cluster outages
  • no additional clusters to manage

Cons:

  • complex configuration of policies
  • complex configuration of secrets

Other Topologies

Other, more complex topologies are possible, including hierarchical ones where routing-only clusters form a hierarchy, routing-only clusters behind load balancers, etc.

@jkowalski jkowalski added kind/design Proposal discussing new features / fixes and how they should be implemented area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc labels Feb 15, 2019
@markmandel
Member

This looks pretty good to me; I'm not seeing any glaring red flags, and it looks to solve the HA problem as well. The only question I have: what if I don't want a probability of going to another cluster? Say I want to allocate against "on-prem" until it fills up, and then move to "cloud-1". Do I set the weight to 1? 0? All the same weights? (Does weight have a default?)

@Kuqd I know you've been working with multiple clusters a lot -- what are your thoughts?

@ilkercelikyilmaz
Contributor

  • The only question I have: what if I don't want a probability of going to another cluster? Say I want to allocate against "on-prem" until it fills up, and then move to "cloud-1". Do I set the weight to 1? 0? All the same weights? (Does weight have a default?)

This is controlled by the priority. If there is only one entry with Priority=1, then there is no probability involved and the allocation will happen only on that cluster (until it is out of capacity). If there are multiple entries with the same priority, then the weight of each entry is used to distribute allocations between the clusters.
If no weight is set, we can assign a default weight.

@markmandel
Member

markmandel commented Feb 18, 2019

Actually, I see this is also covered in @jkowalski's example. on-prem has no weight, so it is simply given top priority, but because No. 2 and No. 3 have weights, allocation between them is randomly distributed based on weight. 👍

That makes sense to me.

@cyriltovena
Collaborator

cyriltovena commented Feb 19, 2019

I think it would be nice to describe what kind of credentials we are supporting; my guess is a service account token.

One con of option 3 is security, but that's the price of having no single point of failure. Would a regional cluster with option 2 be good enough to remove the single point of failure?

Would be nice to be able to target a cluster from the GSA, so matchmaking can make some ping requests and select the right cluster, WDYT?

Basically being able to say I would prefer us-east but if it’s full follow the policy. Seems already possible if you create a bunch of policy so that’s good.

@markmandel
Member

So I've assumed the credentials are Kubernetes credentials (a bearer token)? So essentially a service account + RBAC permissions (although you are right - we should be explicit about this).

Since you can create your own topology with this design - the tradeoffs are up to the user. If you feel that 3 GSA router clusters is enough HA, then that's fine. But if you want more than that, you can add more. In fact -- you can adjust on the fly.

Originally I had thought a director/agent model would be better -- but looking at this, I think this is better because:

  1. Under the agent model, you still have to have some kind of auth (probably certs), and those all still need to be managed. The agents would then need service accounts anyway, so you're doing double the work.
  2. We basically reuse existing tech - Kubernetes credentials, VPCs, etc. - to lock everything down if people want to (especially between on-prem and cloud), so it's less work for us and less security risk.

We will want to have some pretty explicit docs on how to create and manage these tokens, though - or point to some documentation that does this (I haven't seen anything yet on my travels).

Would be nice to be able to target a cluster from the GSA, so matchmaking can make some ping requests and select the right cluster, WDYT?

Basically being able to say I would prefer us-east but if it’s full follow the policy. Seems already possible if you create a bunch of policy so that’s good.

I think this is covered by the policies. I think the idea here is that the game server ops team can decide on the policy set that is in place - and the team working on matchmaking / game logic isn't able to accidentally override that when attempting to get a game server. They can only choose from a pre-approved set.

@markmandel
Member

We should also point people at https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/

to ensure they know to encrypt Secrets at rest. Do we need to do more, security-wise, here?

@pooneh-m
Contributor

pooneh-m commented Mar 6, 2019

What do you think of changing GameServerAllocationPolicy to support time windows, with policies ordered chronologically? For example:

t1<---policy1----->t2<----policy2----...

The CRD for GameServerAllocationPolicy will be something like:

spec:
  policies:
  - name: policy1
    startTime: t1
    clusters:
    - name: on-prem
      ...
    - name: another-cloud
      ...
  - name: policy2
    startTime: t2
    clusters:
    - name: ...

@ilkercelikyilmaz
Contributor

So Agones will select the policy whose start time is closest to (but not after) the current time?

@markmandel
Member

@pooneh-m - I'm wondering if the motivation for this is to be able to declare the start and end times of a policy in advance, so that a user doesn't have to switch policies manually? (Then it becomes a question of how to handle crossover periods.) Is that correct?

@pooneh-m
Contributor

pooneh-m commented Mar 6, 2019

@ilkercelikyilmaz Yes. Agones will pick the policy whose start time has passed, up until the next policy's start time.
@markmandel Yes, I proposed a simplified version of that.

@pooneh-m
Contributor

What do you think of these options for naming the allocation policy CR?

  1. GameServerAllocationPolicy
  2. MultiClusterAllocationPolicy
  3. FleetAllocationPolicy
  4. GameServerMultiClusterAllocationPolicy

@markmandel
Member

I think I'm leaning more towards:
MultiClusterAllocationPolicy or GameServerMultiClusterAllocationPolicy (even though it's a huge name, it is descriptive).

WDYT?

@markmandel
Member

Another potentially fun question - should the AllocationPolicy be under a multicluster group of some kind? (Rather than stable?)

(And maybe stable => core?)

@pooneh-m
Contributor

I am leaning more towards keeping the GameServer prefix for Agones CRs, because there is less risk of having the same CRD name for two CRs in the same cluster. I think either GameServerAllocationPolicy or GameServerMultiClusterAllocationPolicy is fine.

GameServerMultiClusterAllocationPolicy has two votes.

@pooneh-m
Contributor

Another potentially fun question - should the AllocationPolicy be under a multicluster group of some kind? (Rather than stable?)

(And maybe stable => core?)

Let's discuss this in issue #703 that you opened.

@markmandel
Member

Yeah I agree - we should let grouping dictate naming - and 100% agreed on GameServer as a prefix for the reasons described above!

@pooneh-m
Contributor

pooneh-m commented Apr 15, 2019

Based on the group naming suggestion in #703, I am choosing GameServerAllocationPolicy, since the full name <plural>.<group> has multicluster in it
-> gameserverallocationpolicies.multicluster.agones.dev

@pooneh-m
Contributor

pooneh-m commented Apr 17, 2019

I'll be adding a new field to GameServerAllocation to extend it for multicluster allocation.

MultiClusterPolicySelector metav1.LabelSelector

By default, the multicluster policy is not applied to an allocation. If MultiClusterPolicySelector is specified, the multicluster policy is enforced for that request.

There are two benefits to this:

  1. If cluster 1 forwards allocation requests to cluster 2 per the multicluster policy definition, the multicluster policy on cluster 2 will be disabled, avoiding rerouting loops.
  2. We can reuse the existing gameserverallocation API for multicluster allocation, instead of introducing a new API.

We could also make enabling and disabling the multicluster policy explicit by introducing a flag, but I don't think it is necessary:

MultiClusterPolicy {
   Enable bool
   PolicySelector metav1.LabelSelector
}
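A request opting in might then look like the following sketch. The multiClusterPolicySelector field name is an assumption based on this proposal (not a final API), and the stage label is purely illustrative:

```yaml
apiVersion: "stable.agones.dev/v1alpha1"
kind: GameServerAllocation
metadata:
  generateName: my-allocation-
spec:
  # Presence of this (proposed) field opts the request in to
  # multi-cluster routing; omit it for today's local-only behaviour.
  multiClusterPolicySelector:
    matchLabels:
      stage: production
  required:
    matchLabels:
      game: my-game
```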

@markmandel
Member

Just so I'm 100% clear, PolicySelector would then match all the Policies on the cluster and apply all of them? Is there any control over the order, or is it essentially random?

@pooneh-m
Contributor

A list of multicluster policies is selected using the PolicySelector for each incoming allocation request. The policies are then ordered based on their priority and weight, and the first policy in the ordered list is selected. Based on that policy, the allocation request is either handled locally or redirected to another cluster.

@markmandel
Member

Oh neat - so it's more of a merge operation really - all the spec.clusters are merged into a single, sorted list based on weight. Nice! SGTM!

@pooneh-m
Contributor

About cluster-to-cluster connectivity: apparently, service account tokens do not last forever. One way to solve connectivity is to introduce allocation as a service that can call other clusters' allocation services directly, using pre-installed certificates, instead of going through the API servers. WDYT?

@markmandel
Member

Not sure I understand the above, tbh. It sounds like a re-architecting of how the Kubernetes API is authenticated (if I read it correctly)? How does that impact connectivity? If kubectl can be used from outside a cluster, we should be able to do the same thing, no? (It all uses client-go, after all.)

@pooneh-m
Contributor

pooneh-m commented Apr 26, 2019

Yes, we need a slight re-architecture to handle authentication for cluster-to-cluster allocation requests.

For a matchmaking service calling an allocation service in a different cluster, or for the multi-cluster allocation scenario, we cannot store a service account token and assume it lasts forever. Nor can we assume that customers can enable a plugin for authenticating with an identity provider (e.g. OIDC) on their cluster, or that they will accept a client TLS cert.

The solution is to (1) introduce a reverse proxy with an external IP on the cluster that performs authentication of allocation requests and then forwards them to the API server. For better performance, (2) we should move the allocation service logic (controller) into the proxy and call it the allocation service. Then (3) remove the API server extension for allocation, which is a breaking change and should be done before the 1.0 release.

The solution will be similar to this sample.

For talking to GKE, kubectl uses a user account authenticated with the Google identity provider instead of a service account, and the token has an expiry.

@markmandel
Member

I have a strange emotional attachment to GameServerAllocations 🤷‍♂️ I feel like it's so nice to be able to do allocations with kubectl on the command line for testing and development etc. I'd like to have some more user input on that aspect.

I still like the idea of keeping them around, also because it's a nice in-cluster and/or developer experience -- but I totally get the reasoning of potentially removing them. Maybe I'm being overly sentimental? (I can admit to that).

We had a short discussion earlier about completing (1) above, and then seeing how our performance goes? Is that our first step, and then maybe I can live in hope that they may stay around? 😄

But yes - I 100% agree we need to make a decision before 1.0, as it affects the API surface, and we need to lock that down before 1.0.

@markmandel
Member

@pooneh-m - just running a smoke test on the latest RC, I noticed this in the logs:

{"message":"agones.dev/agones/vendor/k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Secret: secrets is forbidden: User \"system:serviceaccount:agones-system:agones-controller\" cannot list secrets at the cluster scope","severity":"error","time":"2019-05-08T18:59:53.210002337Z"}
{"message":"agones.dev/agones/vendor/k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Secret: secrets is forbidden: User \"system:serviceaccount:agones-system:agones-controller\" cannot list secrets at the cluster scope","severity":"error","time":"2019-05-08T18:59:54.212493604Z"}
{"message":"agones.dev/agones/vendor/k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Secret: secrets is forbidden: User \"system:serviceaccount:agones-system:agones-controller\" cannot list secrets at the cluster scope","severity":"error","time":"2019-05-08T18:59:55.214727668Z"}

Looks like we need to add an RBAC permission 😢

Doesn't affect functionality at the moment, but it would be good to get a fix in while we're in feature freeze, I think.
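For reference, a fix along these lines would grant the controller's service account read access to Secrets. This is an illustrative manifest only; the actual Agones install may scope it differently (e.g. with a namespaced Role, or by folding the rule into an existing ClusterRole):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: agones-controller-secrets  # illustrative name
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: agones-controller-secrets  # illustrative name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: agones-controller-secrets
subjects:
- kind: ServiceAccount
  name: agones-controller   # from the log's service account user
  namespace: agones-system
```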

@pooneh-m
Contributor

pooneh-m commented May 8, 2019

Thanks! I am on it.

@pooneh-m
Contributor

pooneh-m commented May 22, 2019

Here are the remaining work items for the allocator service:

  • gRPC API.
  • Load testing, and moving the allocation APIServer extension to the allocator service if it helps with performance.
  • Convert allocation errors to meaningful HTTP statuses (currently they are all 500).
  • Add retry logic for transient errors when calling cross-cluster allocator endpoints.
  • Cache the allocation client.
  • Add E2E tests for multi-cluster scenarios.
  • Remove ClusterName from allocation policies; the allocation endpoint suffices.
  • Add healthz for the service.
  • Revisit whether we need to expose the namespace to the caller of the allocator service.
  • Add samples that send allocation requests to the gameserver-allocator service.
  • Add metrics for multicluster allocation failures.
  • Add documentation.

@markmandel
Member

@pooneh-m just wanted to gently nudge this - where are we up to on this?

We should probably add "documentation" to the above list as well 😃

This isn't on the 1.0 roadmap, but I was just curious.

@pooneh-m
Contributor

pooneh-m commented Aug 2, 2019

Yes, I am going to tackle the list before going to v1.0. I added documentation to the list.

@markmandel
Member

Nice! Very cool!

@Davidnovarro

@pooneh-m Hi! Any update on this?

@pooneh-m
Contributor

Hi @Davidnovarro, I just started working on this again. Hopefully there will be plenty of updates before v1.0. :) I'm planning a refactoring to move the allocation handler into its own standalone library that both the allocator service and the API server extension reference, to help with scale. Then I will introduce the gRPC API, add more testing for cross-cluster calls, and add documentation.

@markmandel
Member

@pooneh-m is this closeable now?

@markmandel markmandel added this to the 1.11.0 milestone Dec 10, 2020