About refining the exception handling logic in method commitTo in RaftLogManager #3833

chengjianyun · 2021-08-25T11:28:21Z

chengjianyun
Aug 25, 2021
Collaborator

Hi, all

The discussion started from issue #3784 and https://issues.apache.org/jira/browse/IOTDB-1583. Here we are not discussing how to fix the concrete BufferOverFlowException (a PR has been proposed, see #3832) but how such an exception should be handled. We have had some discussion offline; let me summarize it here so that more people can join.

Background

According to the issues mentioned above, if an exception is currently thrown partway through commitTo in RaftLogManager.java, the raft group runs into an invalid state in which it can no longer accept write requests. The root cause is that committedEntries, uncommittedEntries and commitIndex are not updated together when an exception happens inside the code region marked with the red rectangle in the original post's screenshot, which leaves these variables inconsistent with each other.
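To make the failure mode concrete, here is a minimal sketch of keeping the three variables in step. It is not the actual RaftLogManager code: the class `SketchLogState`, the `Persister` hook and the `commit` method are hypothetical, and only the field names are taken from the post. The idea shown is simply to run the step that can fail before touching any of the three variables.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the log state discussed above.
// Only the three field names come from the post; the rest is assumed.
class SketchLogState {
  final List<String> committedEntries = new ArrayList<>();
  final List<String> uncommittedEntries = new ArrayList<>();
  long commitIndex = -1;

  // Hypothetical hook for the step that may throw (e.g. serializing entries).
  interface Persister {
    void persist(List<String> entries) throws Exception;
  }

  // Commits the first `count` uncommitted entries.
  // The fallible step runs first, so if it throws, none of the three
  // variables has been modified and they stay consistent with each other.
  void commit(int count, Persister persister) throws Exception {
    List<String> toCommit = new ArrayList<>(uncommittedEntries.subList(0, count));
    persister.persist(toCommit); // may throw; no state has changed yet

    // From here on, only in-memory updates that are not expected to fail.
    committedEntries.addAll(toCommit);
    uncommittedEntries.subList(0, count).clear();
    commitIndex += count;
  }
}
```

If `persist` throws, the caller still has to decide what to do with the failed entries, which is exactly the question discussed below.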

Discussion

I think we have two different opinions here:

1. Tolerate the error in some way and let the data group continue serving.
2. Let the node stop serving the data group.

1. Tolerate the error in some way

Here we could substitute a NO_OP-like action for the failed entries, which lets the raft log keep moving forward and its index keep increasing monotonically. At the same time, we tell the user that these operations failed.

The drawback is obvious: since the entry has already been appended to all other nodes in the data group, we can't guarantee that every node in the group hits the same runtime exception, and therefore we can't guarantee consistency among the nodes in the data group. Of course, for exceptions like the one in IOTDB-1583, all nodes should hit the same exception, so they should eventually stay consistent.

For now, I won't vote for this solution because it can't ensure consistency. We can discuss it further here.
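As a rough illustration of option 1, the sketch below records a NO_OP marker for an entry that fails to apply, so the log position still advances, and returns the failed positions so they can be reported to the user. All names here (`NoOpSketch`, `Applier`, `applyAll`) are hypothetical and not part of IoTDB. Note that it has the weakness described above: a node on which the entry applies cleanly would record the real entry instead of NO_OP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "tolerate the error" option; not IoTDB code.
class NoOpSketch {
  // Marker recorded in place of an entry that could not be applied.
  static final String NO_OP = "NO_OP";

  // Hypothetical hook that applies one entry and may throw.
  interface Applier {
    void apply(String entry) throws Exception;
  }

  // Applies entries in order and appends the outcome to `applied`.
  // A failing entry is recorded as NO_OP so the log still moves forward.
  // Returns the positions of the failed entries for reporting to the user.
  static List<Integer> applyAll(List<String> entries, List<String> applied, Applier applier) {
    List<Integer> failed = new ArrayList<>();
    for (int i = 0; i < entries.size(); i++) {
      String entry = entries.get(i);
      try {
        applier.apply(entry);
        applied.add(entry);
      } catch (Exception e) {
        applied.add(NO_OP);
        failed.add(i);
      }
    }
    return failed;
  }
}
```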

2. Let the node stop serving the data group

The idea here is to let the node that hits the exception quit the data group properly so that the other group members can continue working. This solution tolerates exceptions that happen on only some of the nodes and keeps the service available. But for IOTDB-1583, all nodes would eventually quit the group, bringing the service down.

This is a conservative policy, which is much safer to apply. In etcd, if the storage returns an error, it stops serving write operations, as shown in the comments in the file storage.go.

// Storage is an interface that may be implemented by the application
// to retrieve log entries from storage.
//
// If any Storage method returns an error, the raft instance will
// become inoperable and refuse to participate in elections; the
// application is responsible for cleanup and recovery in this case.
type Storage interface {
...

I prefer this solution for now because it doesn't require a very big change and doesn't cause side effects.
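A minimal sketch of option 2, in the spirit of the etcd comment quoted above: once a commit fails, the member marks itself inoperable and refuses further commits. The names `GroupMemberSketch`, `Committer` and `isInoperable` are hypothetical, and the sketch only models the local refusal; properly leaving the data group and recovering afterwards are not shown.

```java
// Hypothetical sketch of the "stop serving" option; not IoTDB code.
class GroupMemberSketch {
  private volatile boolean inoperable = false;
  private long commitIndex = -1;

  // Hypothetical hook for the commit step that may throw.
  interface Committer {
    void commit(long index) throws Exception;
  }

  boolean isInoperable() {
    return inoperable;
  }

  synchronized long getCommitIndex() {
    return commitIndex;
  }

  // Tries to advance the commit index to `newIndex`.
  // Returns true on success. After the first failure the member stays
  // inoperable and rejects every later call without running the committer.
  synchronized boolean commitTo(long newIndex, Committer committer) {
    if (inoperable) {
      return false;
    }
    try {
      committer.commit(newIndex);
      commitIndex = newIndex;
      return true;
    } catch (Exception e) {
      inoperable = true; // cleanup and recovery are left to the application
      return false;
    }
  }
}
```

A caller could check `isInoperable()` to decide when to trigger the node's exit from the group, while the remaining members keep serving.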

Your ideas and comments are welcome.

Thanks!

jixuan1989 · 2021-08-27T05:53:09Z

jixuan1989
Aug 27, 2021
Collaborator

I moved this discussion to issue #3856.
Discussions are only for user help, not for IoTDB developers.

(This discussion will be deleted one day later.) Let's discuss the idea in the issue (JIRA or GitHub issue).
