Fix abnormal node state when leader transfer fails #247

cserwen · 2022-10-28T10:06:34Z

Question

We have three nodes in a DLedger cluster: n0, n1, and n2. n0 is the preferred leader.

Initially, n0 was the leader. The machine hosting n0 then ran into a problem, so n2 was elected as the new leader.
When n0 recovered, n2 tried to transfer leadership back to n0.
However, n0 did not respond to n2's transfer request in time.

2022-10-26 08:21:20 INFO NettyServerPublicExecutor_3 - [n0] [ChangeRoleToCandidate] from term: 56 and currTerm: 55
2022-10-26 08:22:15 INFO StateMaintainer - n0_[INCREASE_TERM] from 55 to 56

n0 received the transfer request at 08:21:20 but did not initiate the election until 08:22:15. By then the transfer request had already failed and n2 had become writable again. n0, however, was left in the candidate state, and a candidate cannot synchronize data from the leader. As a result, n0's lag grew beyond 1000, and n2 no longer initiated a transfer request.
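The stalemate described above can be illustrated with a minimal sketch. All class and method names here are hypothetical and are not DLedger's actual API; the threshold of 1000 is taken from the description above, and the rule that only a follower accepts replicated entries is an assumption of the sketch.

```java
// Hypothetical sketch; names are illustrative, not DLedger's actual API.
final class TransferCheck {
    enum Role { LEADER, FOLLOWER, CANDIDATE }

    // Threshold taken from the issue text ("greater than 1000").
    static final long MAX_TRANSFER_LAG = 1000;

    /** Lag of a peer, measured as how far its log end is behind the leader's. */
    static long lag(long leaderEndIndex, long peerEndIndex) {
        return Math.max(0, leaderEndIndex - peerEndIndex);
    }

    /** The leader only offers leadership to a peer that is nearly caught up. */
    static boolean shouldTransfer(long leaderEndIndex, long peerEndIndex) {
        return lag(leaderEndIndex, peerEndIndex) <= MAX_TRANSFER_LAG;
    }

    /** Assumption of this sketch: only a follower accepts replicated entries. */
    static boolean canReplicate(Role peerRole) {
        return peerRole == Role.FOLLOWER;
    }
}
```

With n0 stuck as a candidate, `canReplicate` stays false, its lag keeps growing, and `shouldTransfer` eventually becomes false as well, so neither side makes progress on its own.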

Solution

There are two possible approaches.

Method 1: n0 actively steps back to follower and rolls back its term.

This is not in line with the Raft paper, because a term may only increase and never decrease. The paper describes currentTerm as:

latest term server has seen (initialized to 0 on first boot, increases monotonically)

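The paper also states that a candidate that receives an append request from the leader should become a follower only if its currentTerm <= the leader's term. Here n0 has already increased its term beyond the leader's, so this rule does not let it step back: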

While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader’s term (included in its RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state
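The rule quoted above can be sketched as follows. This is a simplified illustration with hypothetical names, not code from DLedger; it only models the term comparison and the role change.

```java
// Hypothetical sketch of the Raft rule for a node receiving AppendEntries.
final class CandidateRule {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    static final class Node {
        Role role;
        long currentTerm;

        Node(Role role, long currentTerm) {
            this.role = role;
            this.currentTerm = currentTerm;
        }
    }

    /**
     * Handles an AppendEntries request carrying the sender's term.
     * Returns true if the request is accepted.
     */
    static boolean onAppendEntries(Node node, long leaderTerm) {
        if (leaderTerm < node.currentTerm) {
            // Stale leader: reject and keep the current role and term.
            return false;
        }
        // Leader is legitimate: adopt its term and return to follower state.
        node.currentTerm = leaderTerm;
        node.role = Role.FOLLOWER;
        return true;
    }
}
```

In the scenario from this issue, n0 is a candidate at term 56 and the leader is still at term 55, so the request is rejected and n0 stays a candidate.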

Method 2: The leader increases its term and becomes a candidate to initiate a new election. n0 then takes part in the vote normally and returns to a normal state.

Section 5.1 of the paper states:

if one server’s current term is smaller than the other’s, then it updates its current term to the larger value.

Therefore, we can fix it according to Method 2.
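A minimal sketch of the idea behind Method 2, under stated assumptions: the names are hypothetical, the way the leader learns a peer's term is left out, and this is not the code merged for this issue. It only shows the leader adopting a larger term it has observed and stepping into the candidate role so that a new election can take place.

```java
// Hypothetical sketch of Method 2; not the actual DLedger implementation.
final class LeaderRevote {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    static final class Node {
        Role role;
        long currentTerm;

        Node(Role role, long currentTerm) {
            this.role = role;
            this.currentTerm = currentTerm;
        }
    }

    /**
     * Called on a node when it observes the term of a peer.
     * A leader whose term is not the largest adopts the larger term
     * (terms never decrease) and becomes a candidate to trigger a revote.
     */
    static void onPeerTerm(Node node, long peerTerm) {
        if (node.role != Role.LEADER || peerTerm <= node.currentTerm) {
            return; // nothing to do: not a leader, or the term is already the largest
        }
        node.currentTerm = peerTerm;
        node.role = Role.CANDIDATE;
    }
}
```

With n2 as leader at term 55 and n0 at term 56, n2 moves to term 56 as a candidate, and both nodes then take part in an ordinary election without any term being rolled back.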


Co-authored-by: dengzhiwen1 <[email protected]>

cserwen mentioned this issue Oct 28, 2022

[ISSUE #247]revote when leader's term is not the biggest #248

Merged

RongtongJin closed this as completed in #248 Nov 17, 2022

RongtongJin pushed a commit that referenced this issue Nov 17, 2022

[ISSUE #247]revote when leader's term is not the biggest

4ee007d

Co-authored-by: dengzhiwen1 <[email protected]>

humkum pushed a commit to humkum/dledger that referenced this issue Nov 20, 2023

[ISSUE openmessaging#247]revote when leader's term is not the biggest

d349fea

Co-authored-by: dengzhiwen1 <[email protected]>
