Multi-cluster Gateway scheduling and HA #3754
@luolanzone @gran-vmv @tnqn: could you review whether the design makes sense? Several questions:
LGTM overall.
After more thinking, I now feel it is a little too complicated to leverage memberlist to select Nodes for active Gateway instances. An alternative approach is to let MC Controller select the Gateway Nodes. On each Node, antrea-agent can then use heartbeats to detect Gateway failure and always send cross-cluster traffic to the living Gateways. The same mechanism can be shared by "antrea-agent detecting local Gateway failure" and "Gateway detecting remote cluster Gateway failure". memberlist may be leveraged as a secondary mechanism to detect active Gateway failure and trigger selection of new active Gateway Nodes, in addition to MC Controller detecting Node failure from the K8s API. In this case, a single elected (via memberlist) antrea-agent would report the failed Gateway (detected by memberlist) to MC Controller, e.g. via K8s Events. MC Controller could then select another Node as the active Gateway to replace the failed one.
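To make the "report via K8s Events" idea above more concrete, below is a minimal sketch of how an elected antrea-agent could emit an Event on the failed Gateway's Node using client-go's event recorder. The helper names, component name, and Event reason are made up for illustration and are not part of the proposal itself.

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/kubernetes/scheme"
    typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
    "k8s.io/client-go/tools/record"
)

// newGatewayEventRecorder wires an EventRecorder that writes Events through
// the given clientset (hypothetical helper for this sketch).
func newGatewayEventRecorder(client kubernetes.Interface) record.EventRecorder {
    broadcaster := record.NewBroadcaster()
    broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: client.CoreV1().Events("")})
    return broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "antrea-agent"})
}

// reportGatewayFailure emits a Warning Event on the failed Gateway's Node, so
// that MC Controller (watching Events) can select a replacement Node.
func reportGatewayFailure(recorder record.EventRecorder, node *corev1.Node) {
    recorder.Eventf(node, corev1.EventTypeWarning, "MCGatewayDown",
        "active Gateway on Node %s failed memberlist health check", node.Name)
}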
@jianjuns I am thinking maybe we can run a dummy Deployment on all Gateway Nodes, use the leader election provided by client-go to elect the active Gateway, and provide an API for other member clusters to get the active Gateway?
What do you mean by "provide an API for other member clusters to get the active Gateway"? Please add details.
I am thinking it may be like the API endpoint in the Agent, an endpoint like
I am still confused.
Ideally, we do not want to require opening another port across clusters, other than the tunnel port. Opening a port requires firewall configuration.
OK, I thought we would support one active Gateway and do HA with failover. Here is a blog about client-go leader election. The core idea is:
It's simple to use; sample code is like below (the "k8s.io/client-go/tools/leaderelection" package is required):
import (
    "context"
    "time"

    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
    "k8s.io/klog/v2"
)

func runLeaderElection(lock *resourcelock.LeaseLock, ctx context.Context, id string) {
    leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock:            lock,
        ReleaseOnCancel: true,
        LeaseDuration:   15 * time.Second,
        RenewDeadline:   10 * time.Second,
        RetryPeriod:     2 * time.Second,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(c context.Context) {
                // doStuff() is the caller-provided work for the active (leader) instance.
                doStuff()
            },
            OnStoppedLeading: func() {
                klog.Info("no longer the leader, staying inactive.")
            },
            OnNewLeader: func(currentID string) {
                if currentID == id {
                    klog.Info("still the leader!")
                    return
                }
                klog.Infof("new leader is %s", currentID)
            },
        },
    })
}
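For context, a possible way to construct the lock and invoke the function above is sketched here; the Lease name, Namespace, and the use of the Node name as identity are assumptions for illustration only.

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func startElection(clientset kubernetes.Interface, nodeName string) {
    // One Lease object shared by all candidate Gateway Nodes; the holder of
    // the Lease is the active Gateway.
    lock := &resourcelock.LeaseLock{
        LeaseMeta: metav1.ObjectMeta{
            Name:      "antrea-mc-gateway", // hypothetical Lease name
            Namespace: "kube-system",
        },
        Client: clientset.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{
            Identity: nodeName,
        },
    }
    runLeaderElection(lock, context.Background(), nodeName)
}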
We can start from one active GW, but we should eventually support active/active; we discussed it multiple times. I meant: how is leader election needed in your proposal? You said you want to run a single replica; then do you still need a leader? If we need a leader, we can consider client-go, or just use memberlist. The latter makes more sense if we want to leverage memberlist to detect active GW failure fast (as described in my proposal #3754 (comment)).
OK, I mean a Deployment with multiple replicas on all Gateway Nodes, but only one is active (as the leader) if leader election is used. But you are right, memberlist should be faster, considering the K8s API introduces latency in the request and response.
I now feel it might be simpler to let mc-controller select the active Nodes (rather than a separate Deployment or election). We may consider election as a backup solution for faster failure detection and failover, as a future enhancement. What do you think? In the dataplane, we can implement a simple heartbeat (over the tunnel, to avoid opening another port) to detect whether an active Gateway is up or not, without depending on the apiserver or a leader, and only send traffic to the Gateways that are up. The same heartbeat can be used for both local and remote GWs.
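A rough sketch of that heartbeat idea follows. In a real implementation the probe would be encapsulated in the cross-cluster tunnel; the UDP port, echo protocol, and type names here are purely hypothetical.

import (
    "context"
    "net"
    "sync"
    "time"
)

// heartbeatProber periodically probes each Gateway and records when it last answered.
type heartbeatProber struct {
    mu       sync.Mutex
    lastSeen map[string]time.Time // Gateway IP -> time of last successful probe
    interval time.Duration
    timeout  time.Duration
}

func newHeartbeatProber(interval, timeout time.Duration) *heartbeatProber {
    return &heartbeatProber{lastSeen: map[string]time.Time{}, interval: interval, timeout: timeout}
}

// probe sends one heartbeat to a Gateway and waits briefly for an echo.
func (p *heartbeatProber) probe(gatewayIP string) bool {
    conn, err := net.DialTimeout("udp", net.JoinHostPort(gatewayIP, "10351"), time.Second)
    if err != nil {
        return false
    }
    defer conn.Close()
    conn.SetDeadline(time.Now().Add(time.Second))
    if _, err := conn.Write([]byte("ping")); err != nil {
        return false
    }
    buf := make([]byte, 4)
    _, err = conn.Read(buf) // expect the peer Gateway to echo the heartbeat
    return err == nil
}

// run probes all Gateways on a fixed interval until ctx is cancelled.
func (p *heartbeatProber) run(ctx context.Context, gatewayIPs []string) {
    ticker := time.NewTicker(p.interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            for _, ip := range gatewayIPs {
                if p.probe(ip) {
                    p.mu.Lock()
                    p.lastSeen[ip] = time.Now()
                    p.mu.Unlock()
                }
            }
        }
    }
}

// alive reports whether the Gateway answered a heartbeat within the timeout;
// cross-cluster traffic would only be sent to Gateways for which this is true.
func (p *heartbeatProber) alive(gatewayIP string) bool {
    p.mu.Lock()
    defer p.mu.Unlock()
    last, ok := p.lastSeen[gatewayIP]
    return ok && time.Since(last) < p.timeout
}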
Yeah, I think it's OK to let mc-controller select the active Nodes, and we need to define the policy for the selection. One remaining question from me: if we let mc-controller do the selection and there are also heartbeats, does that mean mc-controller only selects the active Gateway Nodes at the very beginning, and the heartbeat detects which Gateway is preferred in the data plane? I am thinking that if the leader is down, any new change from the Gateway Nodes will not be reflected to the other members.
I feel there is nothing complex here - just randomly select the desired number of "ready" Nodes from all Nodes with the mc-gateway annotation. As we discussed earlier, even if we use a Deployment and let K8s decide the placement, it does not really do better in our case. The main question is how to detect whether a Node is ready or not. I think we can start from just checking the Node resource in the K8s API, and if needed we can add a faster mechanism like memberlist later.
mc-controller does the initial placement, but the datapath can detect runtime link failures without relying on the mgmt plane. This is the typical model for data plane nodes (e.g. what an LB health check does). When the mgmt plane detects a "long-time" Node failure, it can re-place the failed Nodes. The datapath failure detection makes more sense for remote Gateways, but if we are implementing the mechanism anyway, we might apply it to "Node -> local Gateway" too (we can decide based on the complexity and cost).
As we discussed, ideally datapath HA should not rely on the mgmt plane. This makes more sense when you have member clusters across multiple zones and even regions - we cannot depend on a central mgmt plane for failure handling, as the link to the leader cluster can be down.
I am not saying we must implement all of active/active and DP failure detection in one release. I am fine if you would like to start from a single active instance. I just meant our design should align with the long-term direction and avoid throw-away work.
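A rough sketch of the selection policy discussed above (list Nodes carrying the Gateway annotation, keep the Ready ones, randomly pick the desired number). The annotation key and the wiring into mc-controller are assumptions for illustration, not the final implementation.

import (
    "context"
    "math/rand"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

const gatewayAnnotation = "multicluster.antrea.io/gateway" // assumed annotation key

func selectGatewayNodes(ctx context.Context, client kubernetes.Interface, desired int) ([]string, error) {
    nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
    if err != nil {
        return nil, err
    }
    var candidates []string
    for _, node := range nodes.Items {
        if node.Annotations[gatewayAnnotation] != "true" {
            continue
        }
        // Keep only Nodes whose Ready condition is True.
        for _, cond := range node.Status.Conditions {
            if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
                candidates = append(candidates, node.Name)
                break
            }
        }
    }
    // Randomly pick up to the desired number of candidates.
    rand.Shuffle(len(candidates), func(i, j int) {
        candidates[i], candidates[j] = candidates[j], candidates[i]
    })
    if len(candidates) > desired {
        candidates = candidates[:desired]
    }
    return candidates, nil
}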
After a few discussions, we'd like to implement the active-standby mode of high availability in the first phase. An example of the Gateway CR:
apiVersion: multicluster.crd.antrea.io/v1alpha1
kind: Gateway
metadata:
  name: k8s-node-1
  namespace: kube-system
gatewayIP: 10.10.27.10
internalIP: 10.10.27.10
The ClusterInfo kind of ResourceExport will be kept the same as before:
apiVersion: multicluster.crd.antrea.io/v1alpha1
kind: ResourceExport
metadata:
  name: test-cluster-east-clusterinfo
  namespace: antrea-multicluster
spec:
  clusterID: test-cluster-east
  clusterinfo:
    clusterID: test-cluster-east
    gatewayInfos:
    - gatewayIP: 10.10.27.10
    serviceCIDR: 10.19.0.0/18
  kind: ClusterInfo
  name: test-cluster-east
  namespace: kube-system
Closing this issue since Gateway active-passive mode HA is supported from v1.9.0.
Describe what you are trying to solve
In the current implementation, users need to specify a single worker Node to be the Gateway, and the Gateway will not move to another Node if the specified Node fails.
Eventually, we should allow users to specify a set of Nodes that can potentially host the Gateway, and the desired number of active Gateway instances. Antrea should auto-select Nodes to run the Gateway instances, and if a Node fails, Antrea should try to select another Node (if available) for the Gateway instance.
Describe the solution you have in mind
In each member cluster:
Between clusters, we might build a health check protocol among peer Gateways, going through the tunnels, so a Gateway can detect failures of remote peers, and also select living remote Gateways to tunnel cross-cluster traffic to (see the sketch below). So, local Gateway failover in a member cluster still relies on the apiserver, and Gateway replication in the ClusterSet relies on the leader cluster, but at least there is a mechanism to select living Gateways from the existing instances without depending on the apiserver and the leader cluster.
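As a simple illustration of "select living remote Gateways", the snippet below filters peers by heartbeat liveness; the helper and its parameters are hypothetical, with the liveness data assumed to come from the heartbeat protocol described above.

import "time"

// pickLiveGateways returns only the remote peer Gateways that answered a
// heartbeat within the timeout, so cross-cluster traffic is tunnelled only to
// living instances.
func pickLiveGateways(peers []string, lastSeen map[string]time.Time, timeout time.Duration) []string {
    var live []string
    for _, gw := range peers {
        if t, ok := lastSeen[gw]; ok && time.Since(t) < timeout {
            live = append(live, gw)
        }
    }
    return live
}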