This week, we continued our focus on a fully replicated model of the file system. We're still focusing on directories only, driving our work with an integration test that randomly mutates multiple in-memory replicas of a file system tree and tests for convergence.
Mid-week, we hit a pretty major snag that we hadn't anticipated, but seems obvious in retrospect. Say you have two replicas of a tree that contains two subdirectories, `a` and `b`. At one replica, `a` is moved into `b`. Concurrently, on the other replica, `b` is moved into `a`. When we exchange operations, we end up with both directories in an orphaned cycle, with `a` referring to `b` as its parent and `b` referring to `a` as its parent, a state which we can't mirror to the underlying file system of either replica.
| Time | Replica 1 State | Replica 2 State |
|---|---|---|
| 0 | `a/ b/` | `a/ b/` |
| 1 | `a/b/` | `b/a/` |
| 2 | ??? | ??? |
For any set of concurrent moves, it's possible to create a cycle, and you could potentially create multiple different cycles that share directories in certain diabolical cases. Left untreated, these cycles end up disconnecting both directories from the root of the tree. We still have the data in the CRDT, but it can't be accessed via the file system. We need to break them.
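To make the failure mode concrete, here's a minimal sketch of how a replica might detect that applying a remote move would introduce a cycle, assuming the parent relationship is available as a flat map from directory id to parent id. The names and types here are invented for this post, not our actual data structures.

```rust
use std::collections::HashMap;

type DirId = u64;

/// Walk up the parent chain from `new_parent`. If we reach `moved`, applying
/// the move would make `moved` an ancestor of its own parent, i.e. a cycle.
fn move_creates_cycle(parents: &HashMap<DirId, DirId>, moved: DirId, new_parent: DirId) -> bool {
    let mut current = Some(new_parent);
    while let Some(dir) = current {
        if dir == moved {
            return true;
        }
        current = parents.get(&dir).copied();
    }
    false
}

fn main() {
    // Replica 1 applied "move a into b". When replica 2's concurrent
    // "move b into a" arrives, applying it naively would orphan both
    // directories in a cycle, which this check detects.
    let mut parents = HashMap::new();
    let (root, a, b) = (0, 1, 2);
    parents.insert(a, b); // local move: a into b
    parents.insert(b, root);
    assert!(move_creates_cycle(&parents, b, a)); // remote move: b into a
}
```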
We spent the second half of this week thinking about every possible approach to breaking the cycles while also preserving convergence, and we ended up arriving at two major alternatives.
The first approach is to preserve the operations that create the cycle, but find a way to break the cycle when we interpret the operations. The trouble is that cycles are always created by concurrent operations, but because this is a CRDT, it's possible for concurrent operations to arrive in different orders at different replicas. This means a decision to break a cycle is order-dependent, and may need to be reevaluated upon the arrival of a new operation. Our best idea is to create an arbitrary ordering of all operations based on Lamport timestamps and replica ids. When a new operation is inserted anywhere other than the end of this sequence, we integrate it and then reinterpret all subsequent operations based on a state of the tree that accounts for the new operation. It's definitely doable and preserves the purity of the CRDT, but it also seems complex and potentially slow. It also means that we could end up synthetically breaking a cycle only to determine later that we don't need to break the cycle due to the arrival of a concurrent operation. This could cause seemingly unrelated directories to appear out of nowhere upon the arrival of a concurrent operation, which could be pretty confusing depending on the integration delay. We'd like Eon to generalize to async use cases in addition to real-time, and these "phantom directories" seemed like a real negative for usability.
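For a rough feel of what this alternative entails, here's a hypothetical sketch of the ordering key and the re-integration step. None of this is our actual code; it only illustrates the bookkeeping the approach would require.

```rust
/// Hypothetical ordering key: operations are totally ordered by Lamport
/// timestamp, with replica id as an arbitrary tie-breaker.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct OpId {
    lamport: u64,
    replica: u32,
}

#[derive(Clone)]
struct MoveOp {
    id: OpId,
    child: u64,
    new_parent: u64,
}

struct Tree {
    /// All operations, kept sorted by `OpId`.
    ops: Vec<MoveOp>,
}

impl Tree {
    /// Insert a newly arrived operation into the total order. If it lands
    /// anywhere other than the end, every later operation has to be
    /// re-interpreted against a tree state that now includes the new op.
    fn integrate(&mut self, op: MoveOp) {
        let index = self.ops.partition_point(|existing| existing.id < op.id);
        self.ops.insert(index, op);
        self.reinterpret_from(index);
    }

    fn reinterpret_from(&mut self, _index: usize) {
        // Replay ops[_index..] against the updated state, deciding afresh
        // which cycles need breaking. Omitted here; this is the complex and
        // potentially slow part.
    }
}
```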
The second approach, which we've decided to go with, is sort of a principled hack. Whenever we interpret a move at a given replica that introduces a cycle, we look at every move operation that contributed to the cycle and synthesize a new operation that reverts the one with the highest Lamport timestamp. We then broadcast this new operation to the other participants. Depending on the order in which concurrent operations arrive at different replicas, we may end up reverting the same move redundantly, or reverting multiple moves that participate in different variations of the same cycle. We considered this approach within the first hour of discovering the issue, but initially discarded it because it seemed to violate the spirit of CRDTs. It seems weird that integrating an operation should require us to generate a new operation in order to put the tree in a valid state. But after fully envisioning the complexity of the pure alternative, synthesizing operations seemed a lot more appealing. Breaking cycles via operations means that once a replica observes the effects of a given cycle being broken, it will never see the cycle "unbroken" due to the arrival of a concurrent operation. It also completely avoids the need to totally order operations and reevaluate subsequent operations every time a new operation arrives.
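As an illustration of the rule, here's a hypothetical sketch of how the synthesized revert might be chosen. The types and function are invented for this example and aren't taken from our implementation.

```rust
/// Operations ordered by Lamport timestamp, with replica id breaking ties.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct OpId {
    lamport: u64,
    replica: u32,
}

#[derive(Clone)]
struct MoveOp {
    id: OpId,
    child: u64,
    new_parent: u64,
    old_parent: u64,
}

/// Among the moves that participate in a detected cycle, pick the one with
/// the highest Lamport timestamp and synthesize a new move that reverts it,
/// sending the child back to its previous parent. The synthesized operation
/// gets a fresh id and is broadcast like any other operation.
fn synthesize_revert(cycle_ops: &[MoveOp], next_id: OpId) -> Option<MoveOp> {
    let loser = cycle_ops.iter().max_by_key(|op| op.id)?;
    Some(MoveOp {
        id: next_id,
        child: loser.child,
        new_parent: loser.old_parent, // move the child back where it came from
        old_parent: loser.new_parent,
    })
}
```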
One consequence of either approach is that there could be certain combinations of operations that lead to a cycle we never detect and break. That means that certain version vectors might yield tree states containing cycles, which constrains the set of version vectors we should consider valid. This isn't a huge deal, because even without cycles, the constraints of causality already limit us to a subset of all possible version vectors if we want a valid interpretation of the tree. For example, if replica 0 creates a directory at sequence number 50 and replica 1 adds a subdirectory to it at sequence number 10, the version vector `{0: 20, 1: 10}` would contain a directory whose parent doesn't exist. If we limit ourselves to version vectors corresponding to actual states observed on a replica, we will have no problems.
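To illustrate, here's a toy version of that check, with the version vector modeled as a plain map from replica id to the highest included sequence number. Again, these names are invented for the example.

```rust
use std::collections::HashMap;

/// A version vector: for each replica, the highest sequence number included.
type VersionVector = HashMap<u32, u64>;

struct Op {
    replica: u32,
    seq: u64,
}

/// An operation is included in a snapshot if its sequence number is covered
/// by the vector's entry for its replica.
fn includes(vector: &VersionVector, op: &Op) -> bool {
    vector.get(&op.replica).copied().unwrap_or(0) >= op.seq
}

fn main() {
    let create_parent = Op { replica: 0, seq: 50 }; // replica 0 creates the directory
    let add_child = Op { replica: 1, seq: 10 };     // replica 1 adds a subdirectory to it
    let vector: VersionVector = HashMap::from([(0, 20), (1, 10)]);

    // The subdirectory is visible but its parent isn't: this vector doesn't
    // correspond to any state actually observed on a replica.
    assert!(!includes(&vector, &create_parent));
    assert!(includes(&vector, &add_child));
}
```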
As I discussed in the previous update, we currently represent the state of the file tree inside a B-tree with heterogeneous elements. Each tree item is either metadata, a child reference, or a parent reference. Now I'm realizing this is probably wrong. If we separated metadata, parent references, and child references into their own homogeneous trees, we could probably simplify our code, reduce memory usage, and perform far less pattern matching on the various enumeration variants. We plan to try separating the trees this week.
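As a rough sketch of the difference, using standard-library maps purely for illustration (our actual storage is a B-tree, and all of the names below are made up):

```rust
// Roughly what we have today: one tree whose items are a heterogeneous enum,
// so every traversal has to pattern match on the variant it expects.
enum TreeItem {
    Metadata { dir_id: u64, name: String },
    ChildRef { parent_id: u64, child_id: u64 },
    ParentRef { child_id: u64, parent_id: u64 },
}

use std::collections::BTreeMap;

// What we plan to try: three homogeneous trees, one per kind of item, so each
// lookup goes straight to data of a single known shape.
struct FileTree {
    metadata: BTreeMap<u64, String>,      // dir id -> name
    parent_refs: BTreeMap<u64, u64>,      // child id -> parent id
    child_refs: BTreeMap<(u64, u64), ()>, // (parent id, child id)
}
```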
For whoever is reading these updates, thanks for your interest. We're always interested in thoughts and feedback. Feel free to comment on this update's PR if there's anything you'd like to communicate.