Skip to content
This repository has been archived by the owner on Dec 1, 2022. It is now read-only.

Raft heartbeat process in event base #438

Merged
merged 8 commits into from
Jun 24, 2021

Conversation

critical27
Copy link
Contributor

@critical27 critical27 commented Apr 14, 2021

Our workflow looks like below, as for an write request

leader bgWorker leader ioThreadPool leader threadMananger follower ioThreadPool follower threadManager
1. requestReceived
2. processInThread
(heartbeat) 3. replicate
4. receive appendLog
5. process
6. collectN
7. commit

From the view of a storaged process, there would be many partition, we could be blocked in any phase if it takes longer time to process. Phase 5 and 7 are the main cause of blocking as before. When pressure is big enough, it is possible that all worker thread is busy. This would raise a lot problems, one of the most notorious among them is leader change.

Block reason:

  1. lock
  2. rocksdb write stall (while holding lock)

IMO, there should be only one phase could be blocked, which is leader commit (phase 7). To achieve that, many works need to do:

  1. The second commit don not hold raft lock when commit will relax the restriction of raft lock. (we have another lock replicatingLogs to prevent concurrent replicate)
  2. The third commit follower delay commit if write stall will make follower commit logs not to block (phase 5, follower could commit in async)
  3. The first commit heartbeat refactor make heartbeat process in event base, even if all worker threads are blocked, there won't be unexpected election.

So, the heartbeat logic is:

  1. If a log is on the fly, only send a heartbeat which will only be processed in eventbase.
  2. If no log is on the fly, both a dummy log(previous behavior) and a heartbeat will be sent.

depends on vesoft-inc/nebula-common#497

@critical27 critical27 changed the title Raft heartbeat in process in event base Raft heartbeat process in event base Apr 14, 2021
@critical27 critical27 added depend on common PR: this PR depends on PRs in the common repo ready-for-testing PR: ready for the CI test labels May 31, 2021
@CLAassistant
Copy link

CLAassistant commented Jun 17, 2021

CLA assistant check
All committers have signed the CLA.

@critical27 critical27 force-pushed the raft_hb branch 2 times, most recently from 7719e45 to b89e773 Compare June 22, 2021 02:41
LogID lastLogIdCanCommit = std::min(lastLogId_, req.get_committed_log_id());
CHECK_LE(committedLogId_ + 1, lastLogIdCanCommit);
if (commitLogs(wal_->iterator(committedLogId_ + 1, lastLogIdCanCommit))) {
auto code = commitLogs(wal_->iterator(committedLogId_ + 1, lastLogIdCanCommit), false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question, when wait == false , we will ignore whether the log commits successfully or not.but why do you need to set the committedLogId_ at line 1671?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L1671 only set committedLogId_ in resp, but we didn't update the committedLogId_, when we commit successfully, we update the committedLogId_, and response the new one (Line 1661 1662).

pro.setValue(std::move(t.value()));
}
});
return promise.getFuture();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be moved(not sure about this)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
depend on common PR: this PR depends on PRs in the common repo ready-for-testing PR: ready for the CI test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants