
Multi-cluster Gateway scheduling and HA #3754

Closed
jianjuns opened this issue May 9, 2022 · 17 comments
Labels
area/multi-cluster Issues or PRs related to multi cluster. kind/design Categorizes issue or PR as related to design. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@jianjuns
Contributor

jianjuns commented May 9, 2022

Describe what you are trying to solve
In the current implementation, users need to specify a single worker Node to be the Gateway, and the Gateway will not move to another Node if the specified Node fails.

Eventually, we should allow users to specify a set of Nodes that can potentially host Gateways, along with the desired number of active Gateway instances. Antrea should auto-select Nodes to run the Gateway instances, and if a Node fails, Antrea should try to select another Node (if available) for the Gateway instance.

Describe the solution you have in mind

In each member cluster:

  1. Users can annotate multiple Nodes to run MC Gateways
  2. antrea-agents auto-select (leveraging memberlist) the Nodes for active Gateway instances (see the selection sketch after this list)
  3. An antrea-agent notifies MC Controller about the selected Nodes, and MC Controller creates a Gateway CR for each Node
  4. MC Controller replicates the local Gateways to remote clusters via the leader cluster
  5. When a Gateway fails, antrea-agents will select a new Node. Before the new Gateway CR is created and received by all Nodes, antrea-agents will keep sending cross-cluster traffic to the remaining live Gateways.
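
For step 2 above, one way for all agents to converge on the same set of active Gateway Nodes without extra coordination is to derive the selection deterministically from the shared memberlist view. A minimal sketch, assuming the hashicorp/memberlist library and a hypothetical isGatewayCandidate check against the Node annotation (the package and function names are illustrative):

```go
package gwselect

import (
	"sort"

	"github.com/hashicorp/memberlist"
)

// selectActiveGateways picks the desired number of active Gateway Nodes from the live
// memberlist members. Every agent sorts the same member view, so all agents converge
// on the same selection without talking to each other.
func selectActiveGateways(list *memberlist.Memberlist, desired int, isGatewayCandidate func(nodeName string) bool) []string {
	var candidates []string
	for _, m := range list.Members() {
		if isGatewayCandidate(m.Name) { // hypothetical check for the gateway annotation
			candidates = append(candidates, m.Name)
		}
	}
	sort.Strings(candidates)
	if len(candidates) > desired {
		candidates = candidates[:desired]
	}
	return candidates
}
```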

Between clusters, we might build a health check protocol among peer Gateways, going through the tunnels, so a Gateway can detect failures of remote peers and select live remote Gateways to tunnel cross-cluster traffic. So, local Gateway failover in a member cluster still relies on the apiserver, and Gateway replication in the ClusterSet relies on the leader cluster, but at least there is a mechanism to select live Gateways from the existing instances without depending on the apiserver or the leader cluster.
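
To make the health check idea concrete, here is a minimal sketch assuming a plain UDP heartbeat exchanged between peer Gateways over the existing tunnel; the port number, interval, and timeout are illustrative and not part of the proposal:

```go
package heartbeat

import (
	"net"
	"sync"
	"time"
)

const (
	heartbeatPort     = "10351"         // illustrative UDP port, reachable only through the tunnel
	heartbeatInterval = 2 * time.Second // how often each Gateway probes its peers
	heartbeatTimeout  = 6 * time.Second // a peer is considered down after this much silence
)

type peerTracker struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time // peer Gateway IP -> arrival time of its last heartbeat
}

// listen records the arrival time of heartbeats sent by peer Gateways.
func (t *peerTracker) listen() error {
	conn, err := net.ListenPacket("udp", ":"+heartbeatPort)
	if err != nil {
		return err
	}
	buf := make([]byte, 16)
	for {
		_, addr, err := conn.ReadFrom(buf)
		if err != nil {
			return err
		}
		ip, _, _ := net.SplitHostPort(addr.String())
		t.mu.Lock()
		t.lastSeen[ip] = time.Now()
		t.mu.Unlock()
	}
}

// probe periodically sends a heartbeat to every known peer Gateway through the tunnel.
func (t *peerTracker) probe(peerIPs []string) {
	for range time.Tick(heartbeatInterval) {
		for _, ip := range peerIPs {
			if conn, err := net.Dial("udp", net.JoinHostPort(ip, heartbeatPort)); err == nil {
				conn.Write([]byte("ping"))
				conn.Close()
			}
		}
	}
}

// alive reports whether a peer Gateway's heartbeat was seen within the timeout,
// so cross-cluster traffic is only tunneled to Gateways that are still responding.
func (t *peerTracker) alive(ip string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return time.Since(t.lastSeen[ip]) < heartbeatTimeout
}
```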

@jianjuns jianjuns added kind/design Categorizes issue or PR as related to design. area/multi-cluster Issues or PRs related to multi cluster. labels May 9, 2022
@jianjuns
Contributor Author

jianjuns commented May 9, 2022

@luolanzone @gran-vmv @tnqn: could you review whether the design makes sense?

Several questions:

  1. When a Node fails and agents re-select Gateway Nodes using memberlist, can we keep the previous Gateway Nodes that are still alive and only select a new Node to replace the failed one? @tnqn
  2. I would prefer to let MC Controller (rather than antrea-agent) create Gateway CRs, so there is a single source of truth for Gateway CRs. But then we need a mechanism for MC Controller to know which Nodes are selected by antrea-agents. Any idea for that? A local file?

@gran-vmv
Contributor

LGTM overall.
I think the major issue is how we design the failover logic.

@jianjuns
Contributor Author

jianjuns commented May 15, 2022

After more thought, I now feel it is a little too complicated to leverage memberlist to select Nodes for active Gateway instances. An alternative approach is to let MC Controller select the Gateway Nodes. Then, on each Node, antrea-agent can use heartbeats to detect Gateway failure and always send cross-cluster traffic to the live Gateways. The same mechanism can be shared between "antrea-agent detects local Gateway failure" and "Gateway detects remote cluster Gateway failure".

memberlist may be leveraged as a secondary mechanism to detect active Gateway failure and trigger selection of new active Gateway Nodes, in addition to MC Controller detecting Node failure from the K8s API. In this case, a single elected (via memberlist) antrea-agent will report the failed Gateway (detected by memberlist) to MC Controller, e.g. via K8s Events. Then MC Controller can select another Node as the active Gateway to replace the failed one.
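
As an illustration of how the secondary detection could hook into memberlist, the library exposes join/leave callbacks through its EventDelegate interface (wired in via memberlist.Config.Events); the two helper functions below are hypothetical - the report could, for example, post a K8s Event for MC Controller to consume:

```go
package gwdetect

import "github.com/hashicorp/memberlist"

// gatewayEventDelegate turns memberlist failure detection into Gateway failure reports.
// It would be registered via memberlist.Config.Events on the elected antrea-agent.
type gatewayEventDelegate struct {
	isGatewayNode        func(nodeName string) bool // hypothetical: true if the Node carries the gateway annotation
	reportGatewayFailure func(nodeName string)      // hypothetical: e.g. emit a K8s Event for MC Controller
}

func (d *gatewayEventDelegate) NotifyJoin(n *memberlist.Node)   {}
func (d *gatewayEventDelegate) NotifyUpdate(n *memberlist.Node) {}

// NotifyLeave fires when a member leaves the cluster or is declared dead by the gossip protocol.
func (d *gatewayEventDelegate) NotifyLeave(n *memberlist.Node) {
	if d.isGatewayNode(n.Name) {
		d.reportGatewayFailure(n.Name)
	}
}
```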

@luolanzone
Contributor

luolanzone commented Jun 9, 2022

@jianjuns I am thinking maybe we can run a dummy Deployment across all Gateway Nodes, use the leader election provided by client-go to elect the active Gateway, and provide an API for other member clusters to get the active Gateway?
cc @hjiajing

@jianjuns
Contributor Author

jianjuns commented Jun 9, 2022

What do you mean by "provide an API for other member clusters to get the active Gateway"? Please add details.

@luolanzone
Contributor

luolanzone commented Jun 10, 2022

I am thinking it may be like the API endpoint in the Agent, e.g. an endpoint like http://<mc-gateway-ip>:10350/active-gateway. The Deployment on all local Gateway Nodes will have only one active replica, which runs on the local active Gateway Node. It can get remote Gateway info from ClusterInfoImport and check all the remote Gateways' endpoints; it can return success as long as it gets the active Gateway from any one of the available endpoints, then update the status of ClusterInfoImport to indicate which one is the active Gateway.
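
Purely to illustrate the shape of the suggested endpoint (a hypothetical sketch, not adopted code; the type and values are placeholders taken from the example Gateway later in this thread):

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// activeGateway is a stand-in for whatever the active replica would report,
// e.g. the Gateway it owns after winning the election.
type activeGateway struct {
	Name      string `json:"name"`
	GatewayIP string `json:"gatewayIP"`
}

func main() {
	// Hypothetical values; a real handler would look up the elected Gateway, not constants.
	current := activeGateway{Name: "k8s-node-1", GatewayIP: "10.10.27.10"}

	http.HandleFunc("/active-gateway", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(current)
	})
	log.Fatal(http.ListenAndServe(":10350", nil))
}
```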

@jianjuns
Contributor Author

I am still confused.

  1. Do you have one active Gateway or multiple ones per cluster? Eventually we need to support multiple active Gateways; the design needs to take that into account.
  2. How is client-go leader election used?

Ideally, we should not require another port to be opened across clusters besides the tunnel port. Opening a port requires firewall configuration.

@luolanzone
Contributor

OK, I thought we would support one active Gateway and do HA with failover. Here is a blog about client-go leader election. The core idea is:

It begins with the creation of a lock object, where the leader updates the current timestamp at regular intervals as a way of informing other replicas about its leadership. This lock object, which could be a Lease, ConfigMap or an Endpoint, also holds the identity of the current leader. If the leader fails to update the timestamp within the given interval, it is assumed to have crashed, which is when the inactive replicas race to acquire leadership by updating the lock with their identity. The pod which successfully acquires the lock gets to be the new leader.

It's simple to use; sample code is shown below:

```go
// Requires "k8s.io/client-go/tools/leaderelection" and
// "k8s.io/client-go/tools/leaderelection/resourcelock", plus "context", "time" and "k8s.io/klog/v2".
func runLeaderElection(lock *resourcelock.LeaseLock, ctx context.Context, id string) {
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(c context.Context) {
				doStuff() // placeholder for the work only the active replica performs
			},
			OnStoppedLeading: func() {
				klog.Info("no longer the leader, staying inactive.")
			},
			OnNewLeader: func(current_id string) {
				if current_id == id {
					klog.Info("still the leader!")
					return
				}
				klog.Infof("new leader is %s", current_id)
			},
		},
	})
}
```
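
For completeness, the LeaseLock passed into runLeaderElection above could be built roughly like this (a sketch; the Lease name and namespace are illustrative, clientset is assumed to be a *kubernetes.Clientset built elsewhere, and id could be the Pod name):

```go
// Requires "k8s.io/client-go/tools/leaderelection/resourcelock" and
// metav1 "k8s.io/apimachinery/pkg/apis/meta/v1".
lock := &resourcelock.LeaseLock{
	LeaseMeta: metav1.ObjectMeta{
		Name:      "mc-active-gateway", // illustrative Lease name
		Namespace: "kube-system",
	},
	Client: clientset.CoordinationV1(),
	LockConfig: resourcelock.ResourceLockConfig{
		Identity: id, // e.g. the Pod name, so the current leader is identifiable
	},
}
runLeaderElection(lock, context.Background(), id)
```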

@jianjuns
Contributor Author

jianjuns commented Jun 10, 2022

We can start from one active GW, but we should eventually support active/active. We discussed it multiple times.

I meant: how is leader election needed in your proposal? You said you want to run a single replica; then do you still need a leader? If we need a leader, we can consider client-go, or just use memberlist. The latter makes more sense if we want to leverage memberlist to detect active GW failure quickly (as described in my proposal #3754 (comment)).

@luolanzone
Contributor

OK, I meant a Deployment with multiple replicas across all Gateway Nodes, but only one replica is active (as the leader) if leader election is used. But you are right, memberlist should be faster, considering the K8s API introduces latency in request and response.

@jianjuns
Contributor Author

I now feel it might be simpler to let mc-controller select the active Nodes (rather than a separate Deployment or an election). We may consider election as a backup solution for faster failure detection and failover, as a future enhancement. What do you think?

In the data plane, we can implement a simple heartbeat (over the tunnel, to avoid opening another port) to detect whether an active Gateway is up, without depending on the apiserver or a leader, and only send traffic to the Gateways that are up. The same heartbeat can be used for both local and remote GWs.

@luolanzone
Contributor

Yeah, I think it's OK to let mc-controller select the active Nodes, and we need to define the policy for doing the selection. One remaining question from me: if we let mc-controller do the selection and there are also heartbeats, does that mean we only let mc-controller select the active Gateway Nodes at the very beginning, and let the heartbeat detect which Gateway is preferred in the data plane? I am thinking that if the leader is down, any new change from the Gateway Nodes will not be reflected to other members.

@jianjuns
Contributor Author

jianjuns commented Jun 10, 2022

we need to define the policy for doing the selection

I feel there is nothing complex here - just randomly select the desired number of "ready" Nodes from all Nodes with the mc-gateway annotation. As we discussed earlier, even if we use a Deployment and let K8s decide the placement, it does not really do better in our case. The main question is how to detect whether a Node is ready or not. I think we can start from just checking the Node resource via the K8s API, and if needed we can add a faster mechanism like memberlist later.
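
A minimal sketch of that selection with client-go, assuming the 'multicluster.antrea.io/gateway' annotation key (the one adopted later in this thread) and a deterministic alphabetical pick (a random pick would satisfy the policy equally well); the package and function names are illustrative:

```go
package gwplacement

import (
	"context"
	"sort"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// selectGatewayNodes returns up to `desired` ready Nodes that carry the gateway annotation.
func selectGatewayNodes(ctx context.Context, client kubernetes.Interface, desired int) ([]string, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var ready []string
	for _, node := range nodes.Items {
		if node.Annotations["multicluster.antrea.io/gateway"] != "true" {
			continue // not a Gateway candidate
		}
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
				ready = append(ready, node.Name)
				break
			}
		}
	}
	sort.Strings(ready)
	if len(ready) > desired {
		ready = ready[:desired]
	}
	return ready, nil
}
```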

if we let mc-controller do the selection and there are also heartbeats, does that mean we only let mc-controller select the active Gateway Nodes at the very beginning, and let the heartbeat detect which Gateway is preferred in the data plane?

mc-controller does the initial placement, but the datapath can detect runtime link failures without relying on the mgmt plane. This is the typical model for data plane nodes (e.g. what LB health checks do). When the mgmt plane detects a "long-time" Node failure, it can replace the failed Nodes. The datapath failure detection makes more sense for remote Gateways, but since we are implementing the mechanism anyway, we might apply it to "node -> local gateway" too (we can decide based on the complexity and cost).

I am thinking that if the leader is down, any new change from the Gateway Nodes will not be reflected to other members.

As we discussed, ideally datapath HA should not rely on the mgmt plane. This matters even more when member clusters span multiple zones and even regions - we cannot depend on a central mgmt plane during failures, as the link to the leader cluster can itself be down.

@jianjuns
Contributor Author

jianjuns commented Jun 10, 2022

I am not saying we must implement all of active/active and DP failure detection in one release. I am fine if you'd like to start from a single active instance. I just meant our design should align with the long-term direction and avoid throw-away work.

@luolanzone
Contributor

luolanzone commented Aug 16, 2022

After a few discussions, we'd like to implement active-standby mode of high availability in the first phase.
From the user's perspective, users need to annotate one or more Nodes with the annotation 'multicluster.antrea.io/gateway=true'.
These Nodes become Gateway candidates. The first ready Node with the annotation 'multicluster.antrea.io/gateway=true' becomes the Gateway. A Gateway CR like the one below will be created automatically; its name will be the same as the Node name:

```yaml
apiVersion: multicluster.crd.antrea.io/v1alpha1
kind: Gateway
metadata:
  name: k8s-node-1
  namespace: kube-system
gatewayIP: 10.10.27.10
internalIP: 10.10.27.10
```

  • The Multi-cluster Node Controller will watch Node events to check Node readiness and take the following actions correspondingly:

    1. The controller creates the Gateway CR based on the first ready Node's external and internal IPs.
    2. The controller saves a Node's name in a list of Gateway candidates when a new Node is annotated with
       'multicluster.antrea.io/gateway=true' while the existing Gateway Node is still healthy.
    3. When the Gateway Node becomes not ready, the controller deletes the existing Gateway CR first, then:
       • checks the Gateway candidates, picks one ready Node in alphabetical order, and creates a new Gateway CR;
       • takes no action when there is no Gateway candidate.
  • The Multi-cluster Gateway Controller will take the following actions:

    1. When the Gateway is created, create the ClusterInfo kind of ResourceExport in the leader cluster.
    2. When the Gateway is updated (e.g. external/internal IP), update the ClusterInfo kind of ResourceExport in the leader cluster.
    3. When the Gateway is deleted, delete the ClusterInfo kind of ResourceExport in the leader cluster.

The ClusterInfo kind of ResourceExport will be kept the same as before:

```yaml
apiVersion: multicluster.crd.antrea.io/v1alpha1
kind: ResourceExport
metadata:
  name: test-cluster-east-clusterinfo
  namespace: antrea-multicluster
spec:
  clusterID: test-cluster-east
  clusterinfo:
    clusterID: test-cluster-east
    gatewayInfos:
    - gatewayIP: 10.10.27.10
    serviceCIDR: 10.19.0.0/18
  kind: ClusterInfo
  name: test-cluster-east
  namespace: kube-system
```

  • A new Gateway webhook will check the existing Gateway list and deny creation of a new Gateway if one already exists. This guarantees there will be at most one Gateway in a member cluster.
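
A minimal sketch of such a check, using controller-runtime's admission helpers rather than any specific Antrea code; the handler and package names are illustrative, and the mcv1alpha1 import path is an assumption:

```go
package webhook

import (
	"context"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"

	mcv1alpha1 "antrea.io/antrea/multicluster/apis/multicluster/v1alpha1" // assumed import path
)

// gatewayValidator denies creating a second Gateway in the same Namespace,
// which keeps at most one Gateway per member cluster.
type gatewayValidator struct {
	client client.Client
}

func (v *gatewayValidator) Handle(ctx context.Context, req admission.Request) admission.Response {
	if req.Operation != admissionv1.Create {
		return admission.Allowed("")
	}
	existing := &mcv1alpha1.GatewayList{}
	if err := v.client.List(ctx, existing, client.InNamespace(req.Namespace)); err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}
	if len(existing.Items) > 0 {
		return admission.Denied("a Gateway already exists in this cluster; only one Gateway is allowed")
	}
	return admission.Allowed("")
}
```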

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 15, 2022
@luolanzone
Contributor

Closing this issue since Gateway active-passive mode HA has been supported since v1.9.0.
Will create a new issue if there is any plan to support active-active mode.
