
Reduce chance of duelling leaders #86

Open: wants to merge 1 commit into main

Conversation


@Cjen1 Cjen1 commented Jul 20, 2023

Currently if multiple replicas concurrently try to recover from a failed leader, they can collide and must retry.
This is likely to occur in well-behaved networks, since replicas will detect the fault at similar times.
It is also exacerbated as the number of replicas increases.

This PR uses randomness to reduce the chance of collisions by spreading out the candidates over $2^{16}$ terms, and effectively electing the candidate which chooses the highest term.

Since this is equivalent to each candidate being partitioned from the cluster until it hits its chosen term, there should be no interactions with the rest of the consensus protocol.

The downside of this approach is that if a candidate with a higher term starts campaigning after a lower-termed candidate has already been elected, it will prolong the outage.
This case is less likely to occur if PreVote is enabled.

The graph below compares time to recovery when the leader fails, which we define as the maximum interval without any committed requests.

[Graph: time to recovery (TTR), etcd]

@pav-kv
Contributor

pav-kv commented Sep 15, 2023

Nice. Interestingly, this change makes Raft behave more like vanilla Paxos.

Paxos assigns unique terms by construction: all candidates use non-intersecting terms. Correspondingly, if there is contention during phase 1 (equivalent to a Raft election), the highest term wins.

Raft can have multiple candidates requesting votes for the same term, and has to retry in case of split votes. This PR reduces the chance of split votes, and makes the candidate with the latest term win election.


Both policies have pros/cons, we can't swap one for another without consideration. I think election should behave like the old code by default, for backwards compatibility. But we can make it possible to opt in for the new policy.

One way to do that: make the number of random bits at the end of Term configurable. By default it's zero, so there are no random bits, and "next term" just returns +1.

@serathius
Member

Aside from the improvement itself, I would also be pretty interested in the methodology you used to measure the maximum recovery time. Not sure if we already have a tool for this, but it would be great if we could document the performance difference and the methodology for reproducing the measurement.

@erikgrinaker
Contributor

How does this interact with CheckQuorum+PreVote? Leader stability can be important for availability under certain kinds of network partitions -- it looks like this will still respect the leader recency condition when granting prevotes and votes, so I think it should be fine.

With this change, would it make sense to drop the random 1x-2x factor of the election timeout, and simply use a fixed election timeout, relying on the term randomness to break ties? This would speed up elections on average. A delayed candidate can also upset an already elected leader by sending out vote requests with higher terms. That scenario would still be possible due to network latency alone, so we could end up electing multiple leaders in rapid succession until it settles on the candidate with the highest term -- I think that should be fine, as far as I can tell, but it's worth thinking through the implications here.

@Cjen1
Author

Cjen1 commented Sep 15, 2023

@pavelkalinnikov Yep this is exactly where the idea came from (I was doing some comparisons of leader election performance between multipaxos and raft). Also that sounds like a good idea regarding how to merge this without breaking the internet :)

@serathius The measurements were taken using this tool: https://github.com/cjen1/reckon which is reasonably lightweight and very reproducible for these kinds of tests (a single machine with an emulated network). Here are some slides from a talk I gave on this work: https://coseners.net/wp-content/uploads/2023/08/Coseners-MP-vs-Raft-.pdf
The tool is currently usable, but if you have any suggestions on UX I'd appreciate it!

@erikgrinaker My understanding was that etcd doesn't do the random exponential backoff when leaders duel? If that exists currently it can still be useful for leader stability if timeouts are set too low by a user. Otherwise I haven't looked into testing it with PreVote, or intermittent network conditions (see the 2020 cloudflare outage).

@erikgrinaker
Contributor

My understanding was that etcd doesn't do the random exponential backoff when leaders duel?

It doesn't do exponential backoff, but each node's election timeout is randomized by a factor of [1-2) to reduce the chance of ties:

raft/raft.go, lines 413 to 416 in ee0fe9d:

    // randomizedElectionTimeout is a random number between
    // [electiontimeout, 2 * electiontimeout - 1]. It gets reset
    // when raft changes its state to follower or candidate.
    randomizedElectionTimeout int

raft/raft.go, lines 1989 to 1991 in ee0fe9d:

    func (r *raft) resetRandomizedElectionTimeout() {
        r.randomizedElectionTimeout = r.electionTimeout + globalRand.Intn(r.electionTimeout)
    }

@Cjen1
Author

Cjen1 commented Sep 15, 2023

It doesn't do exponential backoff, but each node's election timeout is randomized by a factor of [1-2) to reduce the chance of ties:

Ah! Then randomisation of the terms would work instead of the randomisation of the timeout.

@ahrtr
Member

ahrtr commented Sep 18, 2023

This PR might lead to subtle availability issues. Under the existing election timeout mechanism, each member votes only once per term, so at most one leader can be elected in each term. But with this PR, multiple leaders may be elected in a short period, because each member may vote multiple times (once for each randomised term it sees). From the application's perspective, the leader changes multiple times in a short period in such a situation.

Another minor concern is that this PR also makes the term less human-readable. Users have to extract the high 48 bits to get the real term.

The existing election timeout mechanism is a tradeoff between understandability and efficiency. I am not fully convinced we should proceed with this approach. The bottom line is that we should add a flag to enable or disable the new behavior. But again, that complicates the user API, since we already have too many such flags.

@boringhello

Interestingly, it could be more beneficial if the generation of the next term varied based on the freshness of the log (log term, log index).

6 participants