It looks like Plenum has a timing-related bug in the view change protocol.
Potential steps to reproduce
create a test pool with 4 nodes
pause 2 nodes, neither of which is a primary. If using a docker environment:
use the docker pause command, so that the nodes are frozen and no explicit disconnection events happen
pause Node3 and Node4 - they are guaranteed not to be primaries initially
wait for 30 minutes; during that time:
the master primary will send a freshness batch (probably a couple of times)
the working nodes will receive and store these batches, but won't be able to order them because of the lack of consensus
after about 10 minutes the working nodes (including the primary) should realize that consensus is lost and start sending votes for a view change (INSTANCE_CHANGE messages), but because of the lack of consensus the view change won't start
after 30 minutes unpause the paused nodes
they will realize that consensus was lost for too long, and will also vote for a view change
the view change will start and a NEW_VIEW message with the previously unordered freshness batches will be created, but ordering will fail, complaining about an incorrect batch time
so the next view change will happen, with the same result
so the pool will enter a perpetual view change cycle even though all nodes are up and healthy
restarting all nodes at once should break the cycle and put the pool back into a healthy state
Actual steps when I caught this were longer, but based on my preliminary analysis these should also suffice.
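The pause and unpause steps above could be scripted roughly as follows. This is only a sketch: the container names `node3` and `node4` are hypothetical placeholders for whatever the test pool's containers are actually called, and the helper names are made up for illustration. It relies only on the standard `docker pause` and `docker unpause` commands.

```python
import subprocess
import time

# Hypothetical container names; substitute those of your own test pool.
PAUSED_NODES = ["node3", "node4"]
PAUSE_SECONDS = 30 * 60  # the 30-minute wait from the steps above


def docker_cmd(action, containers):
    """Build a `docker pause` or `docker unpause` command line."""
    if action not in ("pause", "unpause"):
        raise ValueError(f"unsupported action: {action}")
    return ["docker", action, *containers]


def reproduce(containers=PAUSED_NODES, wait=PAUSE_SECONDS,
              dry_run=True, sleep=time.sleep):
    """Freeze the given nodes, wait, then resume them.

    With dry_run=True the commands are only returned, not executed,
    which makes it easy to check them before touching a real pool.
    """
    commands = [docker_cmd("pause", containers),
                docker_cmd("unpause", containers)]
    if dry_run:
        return commands
    subprocess.run(commands[0], check=True)  # freeze without disconnect events
    sleep(wait)                              # let the working nodes lose consensus
    subprocess.run(commands[1], check=True)  # resume the frozen nodes
    return commands
```

Calling `reproduce(dry_run=False)` would run the real sequence; the default dry run just returns the two command lines.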
Cause and potential fix
there is indeed a safeguard on batch time during normal ordering, so that a malicious primary cannot create batches far in the future or in the past
however, this safeguard also applies to batches that are reordered during a view change; if for whatever reason the view change takes longer than the safeguard window, these batches can no longer be reordered, since their timestamps cannot be altered, and so the view change will never be able to finish
a potential fix should either use different time safeguard logic for the reordering phase or disable the safeguard during reordering (however, before doing that, a thorough analysis of the safety of such a change should be performed)
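To make the suspected failure mode concrete, here is a toy model of the safeguard. It is not Plenum code: the names `Batch`, `is_time_acceptable`, `can_order`, and the 300-second window are all assumptions chosen for illustration, and the real check and its configured window may differ.

```python
from dataclasses import dataclass

# Hypothetical acceptance window in seconds; the real value may differ.
ACCEPTABLE_DEVIATION_SECS = 300


@dataclass(frozen=True)
class Batch:
    seq_no: int
    pp_time: float  # timestamp fixed by the primary at creation, never altered


def is_time_acceptable(batch, now, window=ACCEPTABLE_DEVIATION_SECS):
    """Safeguard as described above: reject batches too far from local time."""
    return abs(now - batch.pp_time) <= window


def can_order(batch, now, reordering, relax_on_reorder):
    """relax_on_reorder=False models the reported behaviour;
    relax_on_reorder=True models one proposed fix (no time check on reordering)."""
    if reordering and relax_on_reorder:
        return True
    return is_time_acceptable(batch, now)


def view_changes_until_ordered(batch, start, view_change_duration,
                               relax_on_reorder, max_attempts=5):
    """Return how many view changes it takes to reorder the batch, or None."""
    now = start
    for attempt in range(1, max_attempts + 1):
        now += view_change_duration
        if can_order(batch, now, reordering=True,
                     relax_on_reorder=relax_on_reorder):
            return attempt
    return None  # every attempt failed: the perpetual view change cycle


# A freshness batch created at t=0; paused nodes come back at t=1800 (30 min).
stale = Batch(seq_no=1, pp_time=0.0)
stuck = view_changes_until_ordered(stale, 1800.0, 60.0, relax_on_reorder=False)
fixed = view_changes_until_ordered(stale, 1800.0, 60.0, relax_on_reorder=True)
```

In this model `stuck` is `None`, because the batch only gets older with each view change, while `fixed` is `1`. Whether skipping the check is actually safe is exactly the open question raised above; the sketch only illustrates why the cycle cannot end on its own.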