You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
and leader has to send a new snapshot to the slow followers if compacted logs dont contain the entries which the follower is requesting to append the entries.
able to reproduce this bug multiple times.
Bug pattern
in a 4 node cluster, 4th node cannot join the cluster after some time (during which there will be leadership changes and log compactions happened.)
Changes made in the example code to take frequent snapshots
( see the comment for the patch which contains the changes i made for testing #174 (comment))
made defaultSnapshotCount = 1
made snapshotCatchUpEntriesN = 1
2024/03/03 14:21:26 replaying WAL of member 1
2024/03/03 14:21:26 loading WAL at term 0 and index 0
raft2024/03/03 14:21:26 INFO: 1 switched to configuration voters=()
raft2024/03/03 14:21:26 INFO: 1 became follower at term 0
raft2024/03/03 14:21:26 INFO: newRaft 1 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
raft2024/03/03 14:21:26 INFO: 1 became follower at term 1
raft2024/03/03 14:21:26 INFO: 1 switched to configuration voters=(1)
raft2024/03/03 14:21:26 INFO: 1 switched to configuration voters=(1 2)
raft2024/03/03 14:21:26 INFO: 1 switched to configuration voters=(1 2 3)
2024/03/03 14:21:26 ----------------------------------------------------------
2024/03/03 14:21:26 {"ID":1,"term":1,"vote":0,"commit":3,"Lead":0,"RaftState":"StateFollower","Applied":0,"LeadTransferee":0}
2024/03/03 14:21:26 rc.raftStorage.FirstIndex() is 1 the rc.raftStorage.LastIndex() 0
raft2024/03/03 14:21:26 INFO: 1 switched to configuration voters=(1 2 3)
raft2024/03/03 14:21:26 INFO: 1 switched to configuration voters=(1 2 3)
raft2024/03/03 14:21:26 INFO: 1 switched to configuration voters=(1 2 3)
2024/03/03 14:21:26 ----> maybe triggering snapshot..........
2024/03/03 14:21:26 start snapshot [applied index: 3 | last snapshot index: 0]
2024/03/03 14:21:26 compacted log at index 2
raft2024/03/03 14:21:28 INFO: 1 is starting a new election at term 1
raft2024/03/03 14:21:28 INFO: 1 became candidate at term 2
raft2024/03/03 14:21:28 INFO: 1 [logterm: 1, index: 3] sent MsgVote request to 2 at term 2
raft2024/03/03 14:21:28 INFO: 1 [logterm: 1, index: 3] sent MsgVote request to 3 at term 2
raft2024/03/03 14:21:28 INFO: 1 received MsgVoteResp from 1 at term 2
raft2024/03/03 14:21:28 INFO: 1 has received 1 MsgVoteResp votes and 0 vote rejections
2024/03/03 14:21:28 ----------------------------------------------------------
2024/03/03 14:21:28 {"ID":1,"term":2,"vote":1,"commit":3,"Lead":0,"RaftState":"StateCandidate","Applied":3,"LeadTransferee":0}
2024/03/03 14:21:28 rc.raftStorage.FirstIndex() is 3 the rc.raftStorage.LastIndex() 3
raft2024/03/03 14:21:29 INFO: 1 is starting a new election at term 2
raft2024/03/03 14:21:29 INFO: 1 became candidate at term 3
raft2024/03/03 14:21:29 INFO: 1 [logterm: 1, index: 3] sent MsgVote request to 2 at term 3
raft2024/03/03 14:21:29 INFO: 1 [logterm: 1, index: 3] sent MsgVote request to 3 at term 3
raft2024/03/03 14:21:29 INFO: 1 received MsgVoteResp from 1 at term 3
raft2024/03/03 14:21:29 INFO: 1 has received 1 MsgVoteResp votes and 0 vote rejections
2024/03/03 14:21:30 ----------------------------------------------------------
2024/03/03 14:21:30 {"ID":1,"term":3,"vote":1,"commit":3,"Lead":0,"RaftState":"StateCandidate","Applied":3,"LeadTransferee":0}
2024/03/03 14:21:30 rc.raftStorage.FirstIndex() is 3 the rc.raftStorage.LastIndex() 3
raft2024/03/03 14:21:31 INFO: 1 is starting a new election at term 3
raft2024/03/03 14:21:31 INFO: 1 became candidate at term 4
raft2024/03/03 14:21:31 INFO: 1 [logterm: 1, index: 3] sent MsgVote request to 2 at term 4
raft2024/03/03 14:21:31 INFO: 1 [logterm: 1, index: 3] sent MsgVote request to 3 at term 4
raft2024/03/03 14:21:31 INFO: 1 received MsgVoteResp from 1 at term 4
raft2024/03/03 14:21:31 INFO: 1 has received 1 MsgVoteResp votes and 0 vote rejections
raft2024/03/03 14:21:32 INFO: 1 is starting a new election at term 4
raft2024/03/03 14:21:32 INFO: 1 became candidate at term 5
raft2024/03/03 14:21:32 INFO: 1 [logterm: 1, index: 3] sent MsgVote request to 2 at term 5
raft2024/03/03 14:21:32 INFO: 1 [logterm: 1, index: 3] sent MsgVote request to 3 at term 5
raft2024/03/03 14:21:32 INFO: 1 received MsgVoteResp from 1 at term 5
raft2024/03/03 14:21:32 INFO: 1 has received 1 MsgVoteResp votes and 0 vote rejections
2024/03/03 14:21:32 ----------------------------------------------------------
2024/03/03 14:21:32 {"ID":1,"term":5,"vote":1,"commit":3,"Lead":0,"RaftState":"StateCandidate","Applied":3,"LeadTransferee":0}
2024/03/03 14:21:32 rc.raftStorage.FirstIndex() is 3 the rc.raftStorage.LastIndex() 3
raft2024/03/03 14:21:33 INFO: 1 is starting a new election at term 5
raft2024/03/03 14:21:33 INFO: 1 became candidate at term 6
raft2024/03/03 14:21:33 INFO: 1 [logterm: 1, index: 3] sent MsgVote request to 2 at term 6
raft2024/03/03 14:21:33 INFO: 1 [logterm: 1, index: 3] sent MsgVote request to 3 at term 6
raft2024/03/03 14:21:33 INFO: 1 received MsgVoteResp from 1 at term 6
raft2024/03/03 14:21:33 INFO: 1 has received 1 MsgVoteResp votes and 0 vote rejections
raft2024/03/03 14:21:33 INFO: 1 received MsgVoteResp from 2 at term 6
raft2024/03/03 14:21:33 INFO: 1 has received 2 MsgVoteResp votes and 0 vote rejections
raft2024/03/03 14:21:33 INFO: 1 became leader at term 6
raft2024/03/03 14:21:33 INFO: raft.node: 1 elected leader 1 at term 6
2024/03/03 14:21:34 ----------------------------------------------------------
2024/03/03 14:21:34 {"ID":1,"term":6,"vote":1,"commit":4,"Lead":1,"RaftState":"StateLeader","Applied":4,"LeadTransferee":0}
2024/03/03 14:21:34 rc.raftStorage.FirstIndex() is 3 the rc.raftStorage.LastIndex() 4
leader's view of node-id: 1 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 2 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 3 is StateProbe recent snapshot is 0 recent active is false
2024/03/03 14:26:11 ----------------------------------------------------------
2024/03/03 14:26:11 {"ID":1,"term":6,"vote":1,"commit":12,"Lead":1,"RaftState":"StateLeader","Applied":12,"LeadTransferee":0}
2024/03/03 14:26:11 rc.raftStorage.FirstIndex() is 11 the rc.raftStorage.LastIndex() 12
leader's view of node-id: 1 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 2 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 3 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 4 is StateReplicate recent snapshot is 0 recent active is true
2024/03/03 14:26:13 ----------------------------------------------------------
2024/03/03 14:26:13 {"ID":1,"term":6,"vote":1,"commit":12,"Lead":1,"RaftState":"StateLeader","Applied":12,"LeadTransferee":0}
2024/03/03 14:26:13 rc.raftStorage.FirstIndex() is 11 the rc.raftStorage.LastIndex() 12
leader's view of node-id: 4 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 1 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 2 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 3 is StateReplicate recent snapshot is 0 recent active is true
all the nodes applies the commits till index 12.
stop the 4th server
<killed the 4th server>
2024/03/03 14:27:37 ----------------------------------------------------------
2024/03/03 14:27:37 {"ID":4,"term":6,"vote":0,"commit":12,"Lead":1,"RaftState":"StateFollower","Applied":12,"LeadTransferee":0}
2024/03/03 14:27:37 rc.raftStorage.FirstIndex() is 11 the rc.raftStorage.LastIndex() 12
^C
harinath@harinaths-MacBook-Pro raftexample %
trigger multiple leader failures, which leads to leadership changes
current leader: server 1
2024/03/03 14:28:57 ----------------------------------------------------------
2024/03/03 14:28:57 {"ID":1,"term":6,"vote":1,"commit":12,"Lead":1,"RaftState":"StateLeader","Applied":12,"LeadTransferee":0}
2024/03/03 14:28:57 rc.raftStorage.FirstIndex() is 11 the rc.raftStorage.LastIndex() 12
leader's view of node-id: 1 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 2 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 3 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 4 is StateProbe recent snapshot is 0 recent active is false
^C
harinath@harinaths-MacBook-Pro raftexample %
2024/03/03 14:29:24 replaying WAL of member 1
2024/03/03 14:29:24 loading WAL at term 6 and index 11
raft2024/03/03 14:29:24 INFO: 1 switched to configuration voters=(1 2 3 4)
raft2024/03/03 14:29:24 INFO: 1 became follower at term 6
raft2024/03/03 14:29:24 INFO: newRaft 1 [peers: [1,2,3,4], term: 6, commit: 11, applied: 11, lastindex: 12, lastterm: 6]
2024/03/03 14:29:24 loading snapshot at term 6 and index 11
2024/03/03 14:29:24 ----------------------------------------------------------
2024/03/03 14:29:24 {"ID":1,"term":6,"vote":1,"commit":11,"Lead":0,"RaftState":"StateFollower","Applied":11,"LeadTransferee":0}
2024/03/03 14:29:24 rc.raftStorage.FirstIndex() is 12 the rc.raftStorage.LastIndex() 12
raft2024/03/03 14:29:24 INFO: 1 [term: 6] received a MsgVote message with higher term from 2 [term: 27]
raft2024/03/03 14:29:24 INFO: 1 became follower at term 27
raft2024/03/03 14:29:24 INFO: 1 [logterm: 6, index: 12, vote: 0] cast MsgVote for 2 [logterm: 6, index: 12] at term 27
raft2024/03/03 14:29:24 INFO: raft.node: 1 elected leader 2 at term 27
2024/03/03 14:29:24 ----> maybe triggering snapshot..........
2024/03/03 14:29:24 start snapshot [applied index: 13 | last snapshot index: 11]
2024/03/03 14:29:24 compacted log at index 12
2024/03/03 14:29:26 ----------------------------------------------------------
2024/03/03 14:29:26 {"ID":1,"term":27,"vote":2,"commit":13,"Lead":2,"RaftState":"StateFollower","Applied":13,"LeadTransferee":0}
next leader : server 2
logs in server 2
2024/03/03 14:31:33 ----------------------------------------------------------
2024/03/03 14:31:33 {"ID":2,"term":27,"vote":2,"commit":13,"Lead":2,"RaftState":"StateLeader","Applied":13,"LeadTransferee":0}
2024/03/03 14:31:33 rc.raftStorage.FirstIndex() is 13 the rc.raftStorage.LastIndex() 13
leader's view of node-id: 1 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 2 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 3 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 4 is StateProbe recent snapshot is 0 recent active is false
2024/03/03 14:31:35 ----------------------------------------------------------
2024/03/03 14:33:57 ----------------------------------------------------------
2024/03/03 14:33:57 {"ID":2,"term":27,"vote":2,"commit":13,"Lead":2,"RaftState":"StateLeader","Applied":13,"LeadTransferee":0}
2024/03/03 14:33:57 rc.raftStorage.FirstIndex() is 13 the rc.raftStorage.LastIndex() 13
leader's view of node-id: 1 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 2 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 3 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 4 is StateProbe recent snapshot is 0 recent active is false
^C
harinath@harinaths-MacBook-Pro raftexample %
start server 2 back
2024/03/03 14:34:25 replaying WAL of member 2
2024/03/03 14:34:25 loading WAL at term 27 and index 13
raft2024/03/03 14:34:25 INFO: 2 switched to configuration voters=(1 2 3 4)
raft2024/03/03 14:34:25 INFO: 2 became follower at term 27
raft2024/03/03 14:34:25 INFO: newRaft 2 [peers: [1,2,3,4], term: 27, commit: 13, applied: 13, lastindex: 13, lastterm: 27]
2024/03/03 14:34:25 loading snapshot at term 27 and index 13
2024/03/03 14:34:25 ----------------------------------------------------------
2024/03/03 14:34:25 {"ID":2,"term":27,"vote":2,"commit":13,"Lead":0,"RaftState":"StateFollower","Applied":13,"LeadTransferee":0}
2024/03/03 14:34:25 rc.raftStorage.FirstIndex() is 14 the rc.raftStorage.LastIndex() 13
raft2024/03/03 14:34:26 INFO: 2 [term: 27] received a MsgVote message with higher term from 1 [term: 49]
raft2024/03/03 14:34:26 INFO: 2 became follower at term 49
raft2024/03/03 14:34:26 INFO: 2 [logterm: 27, index: 13, vote: 0] cast MsgVote for 1 [logterm: 27, index: 13] at term 49
raft2024/03/03 14:34:26 INFO: raft.node: 2 elected leader 1 at term 49
2024/03/03 14:34:27 ----------------------------------------------------------
2024/03/03 14:34:27 {"ID":2,"term":49,"vote":1,"commit":14,"Lead":1,"RaftState":"StateFollower","Applied":14,"LeadTransferee":0}
2024/03/03 14:34:27 rc.raftStorage.FirstIndex() is 14 the rc.raftStorage.LastIndex() 14
2024/03/03 14:34:29 ----------------------------------------------------------
2024/03/03 14:34:29 {"ID":2,"term":49,"vote":1,"commit":14,"Lead":1,"RaftState":"StateFollower","Applied":14,"LeadTransferee":0}
2024/03/03 14:34:29 rc.raftStorage.FirstIndex() is 14 the rc.raftStorage.LastIndex() 14
next leader: server 1
and restart server 1
2024/03/03 14:35:40 replaying WAL of member 1
2024/03/03 14:35:40 loading WAL at term 27 and index 13
raft2024/03/03 14:35:40 INFO: 1 switched to configuration voters=(1 2 3 4)
raft2024/03/03 14:35:40 INFO: 1 became follower at term 49
raft2024/03/03 14:35:40 INFO: newRaft 1 [peers: [1,2,3,4], term: 49, commit: 13, applied: 13, lastindex: 14, lastterm: 49]
2024/03/03 14:35:40 loading snapshot at term 27 and index 13
2024/03/03 14:35:40 ----------------------------------------------------------
2024/03/03 14:35:40 {"ID":1,"term":49,"vote":1,"commit":13,"Lead":0,"RaftState":"StateFollower","Applied":13,"LeadTransferee":0}
2024/03/03 14:35:40 rc.raftStorage.FirstIndex() is 14 the rc.raftStorage.LastIndex() 14
raft2024/03/03 14:35:41 INFO: 1 [term: 49] received a MsgVote message with higher term from 3 [term: 51]
raft2024/03/03 14:35:41 INFO: 1 became follower at term 51
raft2024/03/03 14:35:41 INFO: 1 [logterm: 49, index: 14, vote: 0] cast MsgVote for 3 [logterm: 49, index: 14] at term 51
raft2024/03/03 14:35:41 INFO: raft.node: 1 elected leader 3 at term 51
2024/03/03 14:35:41 ----> maybe triggering snapshot..........
2024/03/03 14:35:41 start snapshot [applied index: 15 | last snapshot index: 13]
2024/03/03 14:35:41 compacted log at index 14
2024/03/03 14:35:42 ----------------------------------------------------------
2024/03/03 14:35:42 {"ID":1,"term":51,"vote":3,"commit":15,"Lead":3,"RaftState":"StateFollower","Applied":15,"LeadTransferee":0}
next leader: server 3
2024/03/03 14:37:09 ----------------------------------------------------------
2024/03/03 14:37:09 {"ID":3,"term":51,"vote":3,"commit":15,"Lead":3,"RaftState":"StateLeader","Applied":15,"LeadTransferee":0}
2024/03/03 14:37:09 rc.raftStorage.FirstIndex() is 15 the rc.raftStorage.LastIndex() 15
leader's view of node-id: 1 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 2 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 3 is StateReplicate recent snapshot is 0 recent active is true
leader's view of node-id: 4 is StateProbe recent snapshot is 0 recent active is false
2024/03/03 14:37:11 ----------------------------------------------------------
start the 4th server back
leader sends the snapshot now
2024/03/03 14:37:38 replaying WAL of member 4
2024/03/03 14:37:38 loading WAL at term 6 and index 11
raft2024/03/03 14:37:38 INFO: 4 switched to configuration voters=(1 2 3 4)
2024/03/03 14:37:38 loading snapshot at term 6 and index 11
raft2024/03/03 14:37:38 INFO: 4 became follower at term 6
raft2024/03/03 14:37:38 INFO: newRaft 4 [peers: [1,2,3,4], term: 6, commit: 11, applied: 11, lastindex: 12, lastterm: 6]
2024/03/03 14:37:38 ----------------------------------------------------------
2024/03/03 14:37:38 {"ID":4,"term":6,"vote":0,"commit":11,"Lead":0,"RaftState":"StateFollower","Applied":11,"LeadTransferee":0}
2024/03/03 14:37:38 rc.raftStorage.FirstIndex() is 12 the rc.raftStorage.LastIndex() 12
raft2024/03/03 14:37:38 INFO: 4 [term: 6] received a MsgHeartbeat message with higher term from 3 [term: 51]
raft2024/03/03 14:37:38 INFO: 4 became follower at term 51
raft2024/03/03 14:37:38 INFO: raft.node: 4 elected leader 3 at term 51
raft2024/03/03 14:37:38 INFO: log [committed=11, applied=11, applying=11, unstable.offset=13, unstable.offsetInProgress=13, len(unstable.Entries)=0] starts to restore snapshot [index: 15, term: 51]
raft2024/03/03 14:37:38 INFO: 4 switched to configuration voters=(1 2 3 4)
raft2024/03/03 14:37:38 INFO: 4 [commit: 15, lastindex: 15, lastterm: 51] restored snapshot [index: 15, term: 51]
raft2024/03/03 14:37:38 INFO: 4 [commit: 15] restored snapshot [index: 15, term: 51]
SSSSSSSSSSSSSSSSSSSSSSSS> received SNAPSHOT term is 51 index 15
2024/03/03 14:37:38 publishing snapshot at index 11
2024/03/03 14:37:38 finished publishing snapshot at index 11
2024/03/03 14:37:38 loading snapshot at term 51 and index 15
2024/03/03 14:37:40 ----------------------------------------------------------
2024/03/03 14:37:40 {"ID":4,"term":51,"vote":0,"commit":15,"Lead":3,"RaftState":"StateFollower","Applied":15,"LeadTransferee":0}
2024/03/03 14:37:40 rc.raftStorage.FirstIndex() is 16 the rc.raftStorage.LastIndex() 15
2024/03/03 14:37:42 ----------------------------------------------------------
2024/03/03 14:37:42 {"ID":4,"term":51,"vote":0,"commit":15,"Lead":3,"RaftState":"StateFollower","Applied":15,"LeadTransferee":0}
2024/03/03 14:37:42 rc.raftStorage.FirstIndex() is 16 the rc.raftStorage.LastIndex() 15
kill server 4 again
now do multiple transactions to increment the applied index on the active nodes.
kill leader (server 3)
and restart server 3
2024/03/03 14:39:57 replaying WAL of member 3
2024/03/03 14:39:57 loading WAL at term 51 and index 43
raft2024/03/03 14:39:57 INFO: 3 switched to configuration voters=(1 2 3 4)
2024/03/03 14:39:57 loading snapshot at term 51 and index 43
raft2024/03/03 14:39:57 INFO: 3 became follower at term 51
raft2024/03/03 14:39:57 INFO: newRaft 3 [peers: [1,2,3,4], term: 51, commit: 43, applied: 43, lastindex: 43, lastterm: 51]
2024/03/03 14:39:57 ----------------------------------------------------------
2024/03/03 14:39:57 {"ID":3,"term":51,"vote":3,"commit":43,"Lead":0,"RaftState":"StateFollower","Applied":43,"LeadTransferee":0}
2024/03/03 14:39:57 rc.raftStorage.FirstIndex() is 44 the rc.raftStorage.LastIndex() 43
raft2024/03/03 14:39:58 INFO: 3 [term: 51] received a MsgVote message with higher term from 1 [term: 72]
raft2024/03/03 14:39:58 INFO: 3 became follower at term 72
raft2024/03/03 14:39:58 INFO: 3 [logterm: 51, index: 43, vote: 0] cast MsgVote for 1 [logterm: 51, index: 43] at term 72
raft2024/03/03 14:39:58 INFO: raft.node: 3 elected leader 1 at term 72
2024/03/03 14:39:59 ----------------------------------------------------------
2024/03/03 14:39:59 {"ID":3,"term":72,"vote":1,"commit":44,"Lead":1,"RaftState":"StateFollower","Applied":44,"LeadTransferee":0}
2024/03/03 14:39:59 rc.raftStorage.FirstIndex() is 44 the rc.raftStorage.LastIndex() 44
now leader is server-1
now restart 4 expecting a snapshot from server-1
2024/03/03 14:41:26 replaying WAL of member 4
2024/03/03 14:41:26 loading WAL at term 6 and index 11
raft2024/03/03 14:41:26 INFO: 4 switched to configuration voters=(1 2 3 4)
raft2024/03/03 14:41:26 INFO: 4 became follower at term 51
raft2024/03/03 14:41:26 INFO: newRaft 4 [peers: [1,2,3,4], term: 51, commit: 11, applied: 11, lastindex: 12, lastterm: 6]
2024/03/03 14:41:26 loading snapshot at term 51 and index 15
2024/03/03 14:41:26 ----------------------------------------------------------
2024/03/03 14:41:26 {"ID":4,"term":51,"vote":0,"commit":11,"Lead":0,"RaftState":"StateFollower","Applied":11,"LeadTransferee":0}
2024/03/03 14:41:26 rc.raftStorage.FirstIndex() is 12 the rc.raftStorage.LastIndex() 12
raft2024/03/03 14:41:27 INFO: 4 is starting a new election at term 51
raft2024/03/03 14:41:27 INFO: 4 became candidate at term 52
raft2024/03/03 14:41:27 INFO: 4 [logterm: 6, index: 12] sent MsgVote request to 1 at term 52
raft2024/03/03 14:41:27 INFO: 4 [logterm: 6, index: 12] sent MsgVote request to 2 at term 52
raft2024/03/03 14:41:27 INFO: 4 [logterm: 6, index: 12] sent MsgVote request to 3 at term 52
raft2024/03/03 14:41:27 INFO: 4 received MsgVoteResp from 4 at term 52
raft2024/03/03 14:41:27 INFO: 4 has received 1 MsgVoteResp votes and 0 vote rejections
2024/03/03 14:41:28 ----------------------------------------------------------
2024/03/03 14:41:28 {"ID":4,"term":52,"vote":4,"commit":11,"Lead":0,"RaftState":"StateCandidate","Applied":11,"LeadTransferee":0}
2024/03/03 14:41:28 rc.raftStorage.FirstIndex() is 12 the rc.raftStorage.LastIndex() 12
raft2024/03/03 14:41:28 INFO: 4 is starting a new election at term 52
raft2024/03/03 14:41:28 INFO: 4 became candidate at term 53
raft2024/03/03 14:41:28 INFO: 4 [logterm: 6, index: 12] sent MsgVote request to 1 at term 53
raft2024/03/03 14:41:28 INFO: 4 [logterm: 6, index: 12] sent MsgVote request to 2 at term 53
raft2024/03/03 14:41:28 INFO: 4 [logterm: 6, index: 12] sent MsgVote request to 3 at term 53
raft2024/03/03 14:41:28 INFO: 4 received MsgVoteResp from 4 at term 53
raft2024/03/03 14:41:28 INFO: 4 has received 1 MsgVoteResp votes and 0 vote rejections
raft2024/03/03 14:41:29 INFO: 4 is starting a new election at term 53
raft2024/03/03 14:41:29 INFO: 4 became candidate at term 54
raft2024/03/03 14:41:29 INFO: 4 [logterm: 6, index: 12] sent MsgVote request to 1 at term 54
raft2024/03/03 14:41:29 INFO: 4 [logterm: 6, index: 12] sent MsgVote request to 2 at term 54
raft2024/03/03 14:41:29 INFO: 4 [logterm: 6, index: 12] sent MsgVote request to 3 at term 54
raft2024/03/03 14:41:29 INFO: 4 received MsgVoteResp from 4 at term 54
raft2024/03/03 14:41:29 INFO: 4 has received 1 MsgVoteResp votes and 0 vote rejections
at this moment the server 4 dont receive any snapshot from the leader , so it tends to start the election.
even restarting server 4 doesn't help (see below logs.)
kill server 4 and restart it again
2024/03/03 14:42:56 replaying WAL of member 4
2024/03/03 14:42:56 loading WAL at term 6 and index 11
raft2024/03/03 14:42:56 INFO: 4 switched to configuration voters=(1 2 3 4)
raft2024/03/03 14:42:56 INFO: 4 became follower at term 103
raft2024/03/03 14:42:56 INFO: newRaft 4 [peers: [1,2,3,4], term: 103, commit: 11, applied: 11, lastindex: 12, lastterm: 6]
2024/03/03 14:42:56 loading snapshot at term 51 and index 15
2024/03/03 14:42:56 ----------------------------------------------------------
2024/03/03 14:42:56 {"ID":4,"term":103,"vote":4,"commit":11,"Lead":0,"RaftState":"StateFollower","Applied":11,"LeadTransferee":0}
2024/03/03 14:42:56 rc.raftStorage.FirstIndex() is 12 the rc.raftStorage.LastIndex() 12
raft2024/03/03 14:42:57 INFO: 4 is starting a new election at term 103
raft2024/03/03 14:42:57 INFO: 4 became candidate at term 104
raft2024/03/03 14:42:57 INFO: 4 [logterm: 6, index: 12] sent MsgVote request to 1 at term 104
raft2024/03/03 14:42:57 INFO: 4 [logterm: 6, index: 12] sent MsgVote request to 2 at term 104
raft2024/03/03 14:42:57 INFO: 4 [logterm: 6, index: 12] sent MsgVote request to 3 at term 104
raft2024/03/03 14:42:58 INFO: 4 received MsgVoteResp from 4 at term 104
raft2024/03/03 14:42:58 INFO: 4 has received 1 MsgVoteResp votes and 0 vote rejections
2024/03/03 14:42:58 ----------------------------------------------------------
2024/03/03 14:42:58 {"ID":4,"term":104,"vote":4,"commit":11,"Lead":0,"RaftState":"StateCandidate","Applied":11,"LeadTransferee":0}
2024/03/03 14:42:58 rc.raftStorage.FirstIndex() is 12 the rc.raftStorage.LastIndex() 12
raft2024/03/03 14:42:58 INFO: 4 is starting a new election at term 104
raft2024/03/03 14:42:58 INFO: 4 became candidate at term 105
raft2024/03/03 14:42:58 INFO: 4 [logterm: 6, index: 12] sent MsgVote request to 1 at term 105
raft2024/03/03 14:42:58 INFO: 4 [logterm: 6, index: 12] sent MsgVote request to 2 at term 105
raft2024/03/03 14:42:58 INFO: 4 [logterm: 6, index: 12] sent MsgVote request to 3 at term 105
raft2024/03/03 14:42:58 INFO: 4 received MsgVoteResp from 4 at term 105
raft2024/03/03 14:42:58 INFO: 4 has received 1 MsgVoteResp votes and 0 vote rejections
2024/03/03 14:43:00 ----------------------------------------------------------
Observations
now server 4 cannot join the cluster back due to this.
removing the server 4 from the cluster and adding back immediately makes the leader to send another snapshot.
The text was updated successfully, but these errors were encountered:
harinath001
changed the title
Potential serious-bug in raft or in the etcd contrib example.
Potential serious-bug in raft implementation or in the etcd contrib example.
Mar 3, 2024
harinath001
changed the title
Potential serious-bug in raft implementation or in the etcd contrib example.
Potential serious-bug in raft implementation (or) in the etcd contrib example.
Mar 3, 2024
Enabling PreVote should can resolve the issue. Note that raftexample is out of maintenance. We are planning to implement a new example under this repo: #2
Please raise a new issue in etcd if you see any issue on etcd. Thanks.
Context
Bug pattern
in a 4 node cluster, 4th node cannot join the cluster after some time (during which there will be leadership changes and log compactions happened.)
Changes made in the example code to take frequent snapshots
( see the comment for the patch which contains the changes i made for testing #174 (comment))
made defaultSnapshotCount = 1
made snapshotCatchUpEntriesN = 1
Detailed steps for reproducing the bug.
Compile steps
go build -o raftexample
Create a cluster of 3 raft servers
./raftexample --id 1 --cluster http://127.0.0.1:8001,http://127.0.0.1:8002,http://127.0.0.1:8003 --port 9001
./raftexample --id 2 --cluster http://127.0.0.1:8001,http://127.0.0.1:8002,http://127.0.0.1:8003 --port 9002
./raftexample --id 3 --cluster http://127.0.0.1:8001,http://127.0.0.1:8002,http://127.0.0.1:8003 --port 9003
Add 4th server
curl -L http://127.0.0.1:9003/4 -XPOST -d http://127.0.0.1:8004
Starting 4th server
./raftexample --id 4 --cluster http://127.0.0.1:8001,http://127.0.0.1:8002,http://127.0.0.1:8003,http://127.0.0.1:8004 --port 9004 --join
make some transactions which will increase the applied index
curl -L http://127.0.0.1:9001/my-key -XPUT -d bar1
curl -L http://127.0.0.1:9001/my-key -XPUT -d bar1
curl -L http://127.0.0.1:9001/my-key -XPUT -d bar1
curl -L http://127.0.0.1:9001/my-key -XPUT -d bar1
logs from server-1, which is the leader now.
all the nodes applies the commits till index 12.
stop the 4th server
<killed the 4th server>
trigger multiple leader failures, which leads to leadership changes
current leader: server 1
and restart server 1
./raftexample --id 1 --cluster http://127.0.0.1:8001,http://127.0.0.1:8002,http://127.0.0.1:8003 --port 9001
next leader : server 2
logs in server 2
start server 2 back
next leader: server 1
and restart server 1
next leader: server 3
start the 4th server back
leader sends the snapshot now
kill server 4 again
now do multiple transactions to increment the applied index on the active nodes.
kill leader (server 3)
and restart server 3
now leader is server-1
now restart 4 expecting a snapshot from server-1
kill server 4 and restart it again
Observations
The text was updated successfully, but these errors were encountered: