dBFT 2.1 (solving 2.0 liveness lock) #792
Comments
It is a nice study. We will study the cases. From a first look at the first steps, from 1 to 17: on step 17 you mention that the state will not change. Why? It is not fully clear to me.
Amazing work guys 👏 👏
@AnnaShaleva just to clarify, when you said "backup", did you mean a non-primary node or a primary backup?
Since this is not 3.0, I assumed it is non-primary.
@vncoelho, thank you for the quick response. Speaking of the first case ("Liveness lock with four non-faulty nodes"): that's true, we've ended up in a state where replica 2 has committed at view 0, replica 3 has committed at view 1, and replicas 0 and 1 have sent their ChangeView messages.
Replica 2 was stuck forever at view 0 at the commit stage. Replicas 0 and 1 will never change their view as well, because there are not enough ChangeView messages for that.
The issue is related strictly to dBFT 2.0; we do not consider the double-speakers model here. Thus, @Liaojinghui, "backup" here means a non-primary node.
Exactly.
Hi @AnnaShaleva, thanks for the reply and attention. Take a look at:

```
public bool NotAcceptingPayloadsDueToViewChanging => ViewChanging && !MoreThanFNodesCommittedOrLost;

// A possible attack can happen if the last node to commit is malicious and either sends change view after his
// commit to stall nodes in a higher view, or if he refuses to send recovery messages. In addition, if a node
// asking change views loses network or crashes and comes back when nodes are committed in more than one higher
// numbered view, it is possible for the node accepting recovery to commit in any of the higher views, thus
// potentially splitting nodes among views and stalling the network.
public bool MoreThanFNodesCommittedOrLost => (CountCommitted + CountFailed) > F;
#endregion
```

In that case, replicas may still commit at view 1.
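For reference, here's a minimal standalone sketch (not code from the Neo repository; written for illustration under the standard dBFT assumptions F = (N - 1) / 3 with integer division and M = N - F) of what these thresholds evaluate to in a four-node network:

```
// Illustrative sketch of the dBFT quorum thresholds; the class and method names are hypothetical.
using System;

static class DbftThresholds
{
    static int F(int n) => (n - 1) / 3;   // maximum number of faulty nodes tolerated
    static int M(int n) => n - F(n);      // number of nodes required to commit or to change view

    static void Main()
    {
        const int n = 4;
        Console.WriteLine($"N = {n}, F = {F(n)}, M = {M(n)}");   // N = 4, F = 1, M = 3
        // MoreThanFNodesCommittedOrLost requires (CountCommitted + CountFailed) > F,
        // i.e. at least F + 1 = 2 nodes counted as committed or lost in a four-node network.
        Console.WriteLine($"CommittedOrLost threshold: at least {F(n) + 1} nodes");
    }
}
```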
Based on the neo-project/neo-modules#792 (comment), the definition of MoreThanFNodesCommitted should be adjusted to match the core algorithm, as it's the key factor of the four-good-nodes liveness lock scenario. There are two changes:

1. We consider a node to be "Committed" if it has sent the Commit message at _any_ view.
2. We should count lost nodes as well. We consider a node to be "Lost" if it hasn't sent any messages in the current round.

Based on this adjustment, the first liveness lock scenario mentioned in neo-project/neo-modules#792 (comment) ("Liveness lock with four non-faulty nodes") is unreachable. However, there's still a liveness lock when one of the nodes is in the RMDead list, i.e. can "die" at any moment. Consider running the base model specification with the following configuration:

```
RM            RMFault  RMDead  MaxView
{0, 1, 2, 3}  {}       {0}     2
```
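For illustration, a minimal sketch of these two adjusted rules (the types and names below are hypothetical, not the actual specification or Neo code):

```
// Hypothetical per-validator view of what this replica has observed.
using System.Linq;

record ValidatorState(int? CommitView, bool AnyMessageThisRound);

static class AdjustedRules
{
    // 1. "Committed" means a Commit was sent at ANY view, not just the current one.
    // 2. "Lost" means no messages at all were seen from that validator in the current round.
    public static bool MoreThanFNodesCommittedOrLost(ValidatorState[] validators, int f) =>
        validators.Count(v => v.CommitView is not null || !v.AnyMessageThisRound) > f;
}
```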
Based on the code-level definition of MoreThanFNodesCommittedOrLost the first liveness lock scenario indeed becomes unreachable. However, consider the following cases:

A. Replica 0 permanently dies at the end of the liveness lock 1 scenario (after step 16). Then replica 1 is still able to send its Commit.

B. Run the adjusted basic specification with the following configuration (one node is able to die):
The TLC Model Checker finds the following liveness lock in this case:
Steps to reproduce the liveness lock
After the described sequence of steps we end up in a situation where replica 2 is stuck in the commit stage.
We will soon read and review the other cases, @AnnaShaleva.
Based on the neo-project/neo-modules#792 (comment), the definition of MoreThanFNodesCommitted should be adjusted to match the core algorithm, as it's the key factor of the four-good-nodes liveness lock scenario. The following change is made:

* We consider a node to be "Committed" if it has sent the Commit message at _any_ view. If a good node has committed, then we know for sure that it won't go further to the next consensus round and we can rely on this information.

The thing that remains the same is that we do not count the "lost" nodes, because this information can't be reliably trusted. See the comment inside the commit. Based on this adjustment, the first liveness lock scenario mentioned in neo-project/neo-modules#792 (comment) ("Liveness lock with four non-faulty nodes") needs to include one more step to enter a deadlock: one of the replicas that is in the "cv" state must die at the end of the scenario. Moreover, there's another liveness lock scenario when one of the nodes is in the RMDead list, i.e. can "die" at any moment. Consider running the base model specification with the following configuration:

```
RM            RMFault  RMDead  MaxView
{0, 1, 2, 3}  {}       {0}     2
```
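For illustration, a minimal sketch of this final adjusted rule (hypothetical names, not the actual TLA+ or C# definitions); unlike the earlier two-rule variant, "lost" nodes are not counted here:

```
// Hypothetical snapshot of what a replica knows about each validator:
// the view at which that validator sent its Commit, or null if no Commit was seen.
using System.Linq;

record ValidatorInfo(int? CommitView);

static class AdjustedPredicate
{
    // A validator counts as "committed" if it has sent a Commit at ANY view:
    // a correct committed node never moves to the next consensus round, so this
    // fact can be relied upon. Lost nodes are deliberately not counted, since
    // their absence cannot be detected reliably.
    public static bool MoreThanFNodesCommitted(ValidatorInfo[] validators, int f) =>
        validators.Count(v => v.CommitView is not null) > f;
}
```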
After discussion with Roman I need to adjust the previous comment:
It should also be mentioned that we do not count "lost" nodes in the TLA+ basic specification; the reason is explained in roman-khimov/dbft#2 (comment) and in the commit message roman-khimov/dbft@0ff31fd. However, this doesn't affect the liveness locks found.
That's amazing work. 👍 Keep improving the formal model; it's even possible to provide a formal proof based on TLA+. To make the TLA+ model clear and verifiable, I suggest following the best practices below and avoiding …
It would also be better if …
Summary or problem description
This issue is triggered by the dBFT 2.0 liveness lock problem mentioned in neo-project/neo#2029 (comment) and other discussions in issues and PRs. We've used the TLA+ formal modelling tool to analyze dBFT 2.0 behaviour. In this issue we present the formal algorithm specification with a set of identified problems and propose ways to fix them in the so-called dBFT 2.1.
dBFT 2.0 formal models
We've created two models of the dBFT 2.0 algorithm in TLA+. Please, consider reading the brief introduction to our modelling approach in the README and take a look at the base model. Below we present several error traces that were found by the TLC Model Checker in the four-node network scenario.
Model checking note
Please, consider reading the model checking note before exploring the error traces below.
1. Liveness lock with four non-faulty nodes
The TLA+ specification configuration assumes participation of four non-faulty replicas precisely following the dBFT 2.0 algorithm, with the maximum reachable view set to 2. Here's the model values configuration used for the TLC Model Checker in the described scenario:
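In the same format as the configurations quoted in the comments above, this presumably corresponds to:

```
RM            RMFault  RMDead  MaxView
{0, 1, 2, 3}  {}       {}      2
```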
The following liveness lock scenario was found by the TLC Model Checker:
Steps to reproduce the liveness lock
1. The primary (replica 0) sends the PrepareRequest message.
2. Replica 0 decides to change its view and sends the ChangeView message.
3. Replica 1 receives the PrepareRequest of view 0 and broadcasts its PrepareResponse.
4. Replica 1 decides to change its view and sends the ChangeView message.
5. Replica 2 receives the PrepareRequest of view 0 and broadcasts its PrepareResponse.
6. Replica 2 collects M prepare messages (from itself and replicas 0, 1) and broadcasts the Commit message for view 0.
7. Replica 3 decides to change its view and sends the ChangeView message.
8. Replica 0 collects M ChangeView messages (from itself and replicas 1, 3) and changes its view to 1.
9. Replica 1 collects M ChangeView messages (from itself and replicas 0, 3) and changes its view to 1.
10. The primary of view 1 (replica 1) sends the PrepareRequest message.
11. Replica 0 receives the PrepareRequest of view 1 and sends the PrepareResponse.
12. Replica 0 decides to change its view and sends the ChangeView message.
13. Replica 1 decides to change its view and sends the ChangeView message.
14. Replica 3 collects M ChangeView messages (from itself and replicas 0, 1) and changes its view to 1.
15. Replica 3 receives the PrepareRequest of view 1 and broadcasts its PrepareResponse.
16. Replica 3 collects M prepare messages and broadcasts the Commit message for view 1.
17. From this point on the state will not change: no replica is able to take any further step.

Here's the TLC error trace attached: base_deadlock1_dl.txt
After the described sequence of steps we end up in the following situation:

- Replica 0: ChangeView sent, in the process of changing view from 1 to 2.
- Replica 1: ChangeView sent, in the process of changing view from 1 to 2.
- Replica 2: Commit sent, waiting for the rest of the nodes to commit at view 0.
- Replica 3: Commit sent, waiting for the rest of the nodes to commit at view 1.

So we have replica 2 stuck at view 0 without the possibility to exit from the commit stage and without the possibility to collect more Commit messages from other replicas. We also have replica 3 stuck at view 1 with the same problem. And finally, we have replicas 0 and 1 that have entered the view changing stage and are able neither to commit (as only F nodes have committed at view 1) nor to change view (as replica 2 can't send its ChangeView from the commit stage). This liveness lock happens because the outcome of the subsequent consensus round (either commit or change view) completely depends on the message receiving order. Moreover, we've faced exactly this kind of deadlock in a real functioning network; the incident was fixed by restarting the consensus nodes.
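To make the counting explicit, here's a small illustrative check of the thresholds in this final state (a sketch assuming the standard N = 4, F = 1, M = 3 values; not TLC output):

```
using System;

static class Scenario1Counts
{
    static void Main()
    {
        const int N = 4;
        const int F = (N - 1) / 3;   // 1
        const int M = N - F;         // 3

        int commitsAtView0 = 1;      // replica 2
        int commitsAtView1 = 1;      // replica 3
        int changeViewsToView2 = 2;  // replicas 0 and 1

        // No view can ever gather M commits, and the change to view 2 can never gather
        // M ChangeView messages, since replicas 2 and 3 are locked in the commit stage
        // and will not send ChangeView.
        Console.WriteLine($"Commits needed: {M}, best view has {Math.Max(commitsAtView0, commitsAtView1)}");
        Console.WriteLine($"ChangeViews needed: {M}, available: {changeViewsToView2}");
    }
}
```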
2. Liveness lock with one "dead" node and three non-faulty nodes
The TLA+ specification configuration assumes participation of three non-faulty nodes precisely following the dBFT 2.0 algorithm and one node which can "die" and stop sending consensus messages or changing its state at any point of the behaviour. The liveness lock can be reproduced both when the first primary node is able to "die" and when a non-primary node "dies" in the middle of the consensus process. The maximum reachable view is set to 2. Here are the two model values configurations used for the TLC Model Checker in the described scenario:
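The first of these (the primary node "dying") matches the configuration quoted in the comments above:

```
RM            RMFault  RMDead  MaxView
{0, 1, 2, 3}  {}       {0}     2
```

The second configuration is analogous, with a non-primary replica index in the RMDead set instead of 0.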
The following liveness lock scenario was found by the TLC Model Checker:
Steps to reproduce the liveness lock (first configuration with primary node "dying" is taken as an example)
1. The primary (replica 0) sends the PrepareRequest message.
2. Replica 0 "dies": it stops sending and processing any further consensus messages.
3. Replica 1 receives the PrepareRequest of view 0 and broadcasts its PrepareResponse.
4. Replica 1 decides to change its view and sends the ChangeView message.
5. Replica 2 receives the PrepareRequest of view 0 and broadcasts its PrepareResponse.
6. Replica 2 collects M prepare messages (from itself and replicas 0, 1) and broadcasts the Commit message for view 0.
7. Replica 3 decides to change its view and sends the ChangeView message.

Here are the TLC error traces attached:
After the described sequence of steps we end up in the following situation:
- Replica 0: dead (PrepareRequest sent for view 0).
- Replica 1: ChangeView sent, in the process of changing view from 0 to 1.
- Replica 2: Commit sent, waiting for the rest of the nodes to commit at view 0.
- Replica 3: ChangeView sent, in the process of changing view from 0 to 1.

So we have replica 0 permanently dead at view 0 without the possibility to affect the consensus process. Replica 2 has its Commit sent and is unsuccessfully waiting for the other replicas to enter the commit stage as well. Finally, replicas 1 and 3 have entered the view changing stage and are able neither to commit (as there are only F nodes that have committed) nor to change view (as replica 2 can't send its ChangeView from the commit stage). It should be noted that the dBFT 2.0 algorithm is expected to guarantee block acceptance with at least M non-faulty ("alive" in this case) nodes, which isn't true in the described situation.

3. Running the TLC model checker with "faulty" nodes
Both models allow specifying the set of malicious node indexes via the RMFault model constant. "Faulty" nodes are allowed to send any valid message at any step of the behaviour. At the same time, weak fairness is required from the next-state action predicate for both models: if it's possible for the model to take any non-stuttering step, this step must eventually be taken. Thus, running the basic dBFT 2.0 model with a single faulty node and three non-faulty nodes doesn't reveal any model deadlock: the malicious node keeps sending messages to escape from the liveness lock. It should also be noted that the presence of faulty nodes slightly increases the state graph size, so it takes more time to evaluate the whole set of possible model behaviours. Nevertheless, it's a thought-provoking experiment to check the model specification behaviour with a non-empty faulty nodes set. We've checked the basic model with the following configurations and didn't find any liveness lock:
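As an illustration of the format (the particular faulty index below is hypothetical), one such configuration with a single faulty node could look like:

```
RM            RMFault  RMDead  MaxView
{0, 1, 2, 3}  {1}      {}      2
```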
dBFT 2.1 proposed models
Based on the liveness issues found by the TLC Model Checker, we've developed a couple of ways to improve the dBFT 2.0 algorithm and completely avoid the mentioned liveness and safety problems. The improved models will further be referred to as dBFT 2.1 models. The model descriptions and specifications are too large to attach to this issue, thus they are kept in a separate repo. Please, consider reading the dBFT 2.1 models description in the README and check the models' TLA+ specifications.
Running the TLC Model Checker on dBFT 2.1 models
We've run the TLC Model Checker to check both proposed dBFT 2.1 models with the same set of configurations as described above for the basic model. Here's the table of configurations and model checking results (they are almost the same for both dBFT 2.1 models):
- PASSED, no liveness lock found
- NOT FINISHED, TLC failed with OOM (too many states, see the model checking note), no liveness lock found
- PASSED, no liveness lock found
- NOT FINISHED, TLC failed with OOM (too many states, see the model checking note), no liveness lock found
- PASSED, no liveness lock found
- NOT FINISHED, TLC failed with OOM (too many states, see the model checking note), no liveness lock found
- FAILED for dbftCV3.tla (see the note below); NOT FINISHED for dbftCentralizedCV.tla in reasonable time, the "faulty" node description probably needs some more care

Note for the FAILED case: assuming that malicious nodes are allowed to send literally any message at any step of the behaviour, a non-empty RMFault set is able to ruin the whole consensus process. However, the current dBFT 2.0 algorithm faces the liveness lock even with "dead" nodes (the proposed dBFT 2.1 models successfully handle this case). Moreover, we believe that in a real functioning network it's more likely to face the first type of attack ("dead" nodes) than malicious clients intentionally sending any type of message at any time. Thus, we believe that the dBFT 2.1 models perform much better than dBFT 2.0 and solve the dBFT 2.0 liveness lock problems.

Other than this, the TLC Model Checker didn't find any liveness property violations for the dBFT 2.1 models, both with the MaxView constraint set to 1 and with the MaxView constraint set to 2 (see the model checking note for clarification).

Conclusion
We believe that the proposed dBFT 2.1 models allow solving the dBFT 2.0 liveness lock problems. Anyone who has thoughts, ideas, questions, suggestions or doubts is welcome to join the discussion. The proposed specifications may have bugs or inaccuracies, thus we accept all kinds of reasoned comments and related feedback. If you have trouble understanding, editing or checking the models, please don't hesitate to write a comment to this issue.
Further work
The whole idea of TLA+ dBFT modelling was to reveal and avoid the liveness lock problems so that we get a normally operating consensus algorithm in our network. There are several directions for further related work in our mind: