This post is about two bug fixes to riak_core's claim code.
The post covers the purpose of claim, its inputs, and the desired properties of its output, then looks at the bugs themselves and how we've fixed them.
First though, a quick look at the ring. Those of you who are confident about the ring, what it is, and what it does, can skip down to Recent Work.
As Martin Sumner recently wrote in an in-depth look at claim in Riak, the ring is something often illustrated, a useful metaphor, and typically a vaguely grasped abstraction. But the ring is not just conceptual, it's a concrete thing in Riak: a data structure that describes a mapping of ownership as a routing table.
Riak uses consistent hashing to decide where to store replicas of data. The integers from 0 to 2^160 - 1 make up the entire key space. The key space is divided into a number of ranges of equal size. The number of ranges is configured as the ring size for a cluster, and the ring size is always a power of two. If the ring size is 64, then there will be 64 equal-sized partitions of the key space (and so on for ring sizes of 8, 16, 32, 128, 512 etc.)
A hash function is used to turn any key into an integer in the key space (i.e. hash({"artist", "REM"}) -> 666603136322215606811169435479313437784925203632). The result of the hash function is an integer that falls inside one of the partitions, or ranges, of the divided key space. The hash function is such that we get an even distribution of values throughout the key space. So far, so abstract.
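As a minimal sketch (not Riak's actual code), the mapping from a key to a partition looks something like this, with SHA-1 standing in for Riak's hash function and partitions numbered 0 to RingSize - 1:

    %% Sketch: hash a {Bucket, Key} pair into the 2^160 key space and work
    %% out which of the RingSize equal partitions it falls in.
    partition_index(BKey, RingSize) ->
        <<HashInt:160/integer>> = crypto:hash(sha, term_to_binary(BKey)),
        %% each partition covers an equal slice of the key space
        PartitionSize = (1 bsl 160) div RingSize,
        HashInt div PartitionSize.

Riak itself names a partition by the integer at the start of its range rather than by 0..RingSize - 1, which is why partition identifiers appear as huge numbers later in this post, but the idea is the same.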
Each of these ranges/divisions/partitions of the ring represents a portion of the keys Riak stores. In order to spread the work evenly throughout a Riak cluster each physical node should own an equal number of partitions.
Concretely the ring is a mapping of partitions of the key space to physical nodes, denoting ownership. For each partition in the key space there is an entry in the ring that points to a riak node name. It says:
"[email protected] owns partition
1415829711164312202009819681693899175291684651008"
For each partition that it owns, a node will start a process. These processes are called vnodes, short for virtual nodes.
Since it maps out ownership, the ring is also a routing table. Any node in a Riak cluster can run the hash function on a key and look up in the ring which node owns the partition that key falls into and route requests to that node. For this to work, every node needs a copy of the ring, and ideally, all copies would agree as to which nodes own which keys. Riak has a gossip protocol for spreading the ring throughout a cluster and ensuring all nodes share a consistent view of the ring. Hopefully a future post can describe gossip in detail.
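To make the routing-table idea concrete, here is a minimal lookup sketch, assuming the ring is held as a list of {PartitionIndex, Node} pairs in partition order and reusing the hypothetical partition_index/2 from the sketch above:

    %% Sketch only: find the node a request for BKey should be routed to.
    owner(Ring, BKey) ->
        RingSize = length(Ring),
        P = partition_index(BKey, RingSize),
        {_Index, Node} = lists:nth(P + 1, Ring),   %% lists:nth is 1-based
        Node.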
Riak is a replicated database. By default it stores three replicas of every key/value pair; the number of replicas is called the n-val, meaning there are n replicas, and by default n=3. As described above, hashing the key gets a partition, a partition is owned by a node, and the node runs a vnode process for that partition. The vnode stores the key/value, but what of the other replicas? In order to place the n-1 remaining replicas Riak uses the ring. Ordering the partitions low-to-high, Riak picks the nodes that own the next two partitions following the one the key hashes to. This gives us a list of pairs [{partition(), node()}] called the Preference List, or preflist for short, which says exactly which processes on which physical nodes should have copies of the data.
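Under the same assumptions as the earlier sketches (the ring as an ordered list of {PartitionIndex, Node} pairs and the hypothetical partition_index/2), primary preflist construction might look like this:

    %% Sketch: take the partition the key hashes to plus the next N-1
    %% partitions, wrapping around the end of the ring back to the start.
    primary_preflist(Ring, BKey, N) ->
        P = partition_index(BKey, length(Ring)),
        {Before, From} = lists:split(P, Ring),
        lists:sublist(From ++ Before, N).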
More specifically the preflist above is called a Primary Preflist. Riak is a fault tolerant database, and if some node in the Primary Preflist is not reachable to write or read a replica, Riak uses the ring again. For any node in the primary preflist that is down Riak will select the node for the next partition on the ring, and so on, until it comes to a node that is up. Any node that stores a replica but is not in the primary preflist is called a Fallback. A Fallback will start a vnode process for the primary partition it temporarily replaces and store a replica until such a time as the primary is contactable.
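Fallback selection could be sketched as follows. This is a simplification of what riak_core really does, reusing the hypothetical helpers above and assuming there are enough reachable nodes further round the ring:

    %% Sketch: for each down primary keep its partition index but substitute
    %% the next reachable node found by continuing round the ring.
    preflist_with_fallbacks(Ring, BKey, N, UpNodes) ->
        {Before, From} = lists:split(partition_index(BKey, length(Ring)), Ring),
        {Primaries, Rest} = lists:split(N, From ++ Before),
        DownIdxs = [I || {I, Node} <- Primaries, not lists:member(Node, UpNodes)],
        UpRest = [Node || {_I, Node} <- Rest, lists:member(Node, UpNodes)],
        Fallbacks = lists:zip(DownIdxs, lists:sublist(UpRest, length(DownIdxs))),
        [E || {_I, Node} = E <- Primaries, lists:member(Node, UpNodes)] ++ Fallbacks.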
Thinking about a key that hashes into one of the last two partitions should clarify why the ring is called a ring. A key that hashes into the last partition would have a preflist made up of that last partition, followed by the first, and then the second partitions, wrapping around the tail of the ring and back to the start.
We want at least n nodes in the ring if we want each replica to be on its own node. If we want to tolerate failures and still have each replica on its own node we need n + f nodes, where f is the number of node failures we need to tolerate. Basho's recommendation was that you want at least 5 nodes in a Riak cluster, which with the default n=3 leaves room for two node failures.
Knowing how Riak constructs preflists from the ring, and how it selects primaries and fallbacks is useful when we consider what properties we might want from an algorithm that assigns partitions to nodes. This algorithm is called Claim.
Claim's job is to make the ring. As inputs it has the ring size (total number of partitions), the list of nodes that make up the cluster, and the current ring.
Thinking about preflists, we want to distribute the partitions in such a way that any preflist selected from the ring will lead to replicas being stored on three distinct nodes. Ideally it should be able to tolerate the failure of any node and still have every preflist contain three distinct nodes. This spacing is called target-n-val and by default it is set to four.
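That spacing requirement can be written down as a simple check. A sketch, assuming Owners is the list of owning nodes in partition order (e.g. [n1, n2, ...]):

    %% Every window of TargetN consecutive partitions, wrapping around the
    %% end of the ring, should be owned by TargetN distinct nodes.
    meets_target_n(Owners, TargetN) ->
        Wrapped = Owners ++ lists:sublist(Owners, TargetN - 1),
        Windows = [lists:sublist(lists:nthtail(I, Wrapped), TargetN)
                   || I <- lists:seq(0, length(Owners) - 1)],
        lists:all(fun(W) -> length(lists:usort(W)) =:= TargetN end, Windows).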
Since the vnode processes are the actors that do the work of reading and writing data, it would be best if the same number of processes ran on each node, meaning that all nodes would perform pretty equally. No node would be much faster than the others, and crucially no node would be much slower. Clearly if Ring Size is not divisible by Node Count we can't get exactly the same number of vnodes per-node, but the difference should be minimal.
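The balance property is just as easy to state. A sketch over the same Owners list:

    %% The busiest node should own at most one more partition than the
    %% least busy node.
    balanced(Owners) ->
        Counts = [length([N || N <- Owners, N =:= Node])
                  || Node <- lists:usort(Owners)],
        lists:max(Counts) - lists:min(Counts) =< 1.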
One issue with Claim as it stands today is that it does not address "tail violations." For example, given a ring size of 16 with 5 nodes added at once, claim generates a ring where there are primary preflists that do not have three distinct nodes. Claim assigns to each node in order: partition one (p1) to node one (n1), p2 to n2, p3 to n3, p4 to n4, p5 to n5, p6 to n1 and so on, finishing with p16 on n1. Both the preflists [{p15, n5}, {p16, n1}, {p1, n1}] and [{p16, n1}, {p1, n1}, {p2, n2}] violate the spacing requirements of the ring. This can lead to data loss if you expect there to be 3 distinct physical replicas of your data but in fact there may only be 2, and as such is an important bug to fix.
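Feeding that round-robin assignment into the meets_target_n sketch from earlier shows the violation:

    %% ring size 16, 5 nodes assigned in order: p1 -> n1, p2 -> n2, ...
    Owners = [lists:nth((I rem 5) + 1, [n1, n2, n3, n4, n5])
              || I <- lists:seq(0, 15)],
    %% Owners = [n1,n2,n3,n4,n5, n1,n2,n3,n4,n5, n1,n2,n3,n4,n5, n1]
    %% the windows that wrap around the tail contain n1 twice, so:
    false = meets_target_n(Owners, 4).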
This bug was found by modifying the quickcheck tests to add multiple nodes to a cluster at once, and fixed by Ramen Sen of NHS Digital. The fix first decides if there are tail violations (if the ring size rem node count is greater than zero but less than target-n-val) and then whether they can be solved given the number of nodes, target-n-val, and ring size.
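The first part of that decision amounts to a small predicate. A sketch, with names of my choosing:

    %% A tail violation is possible when the ring does not divide evenly
    %% among the nodes and the leftover tail is shorter than target-n-val.
    has_tail_violation(RingSize, NodeCount, TargetN) ->
        Tail = RingSize rem NodeCount,
        Tail > 0 andalso Tail < TargetN.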
Think of the ring as a set of sequences, where a sequence is:
[n1, n2, n3, n4, n5]
Then a ring for 16 vnodes with 5 nodes has 3 full sequences, and a tail of n1:
[n1, n2, n3, n4, n5]
[n1, n2, n3, n4, n5]
[n1, n2, n3, n4, n5]
[n1]
In the cases where tail violations can be solved, they are resolved by growing the tail of the ring so that there is no wrap-around tail violation and the tail itself maintains target-n-val:
[n1, n2, n3, n4, n5]
[n1, n2, n3, n4, n5]
[n1, n2, n3, n4, n5]
[n1, n2, n3, n4]
Each node added to the tail of the ring is then removed from one of the complete sequences before it: the lowest added node from the sequence just before the tail, the next lowest from the sequence before that, and so on back.
[n1, n2, n3, n5]
[n1, n2, n4, n5]
[n1, n3, n4, n5]
[n1, n2, n3, n4]
Leaving a ring where all preflists, even with a single node failure, have three distinct nodes. The actual fix is slightly more complex, taking larger chunks of sequences when possible, but the above should serve as an illustration.
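Flattening the fixed sequences above and re-running the meets_target_n sketch from earlier confirms the spacing now holds:

    Fixed = [n1, n2, n3, n5,
             n1, n2, n4, n5,
             n1, n3, n4, n5,
             n1, n2, n3, n4],
    true = meets_target_n(Fixed, 4).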
If we have 4 nodes and a ring of size 32, nicely balanced so that each node has 8 vnodes and there are no tail violations, what happens when we add a 5th node? Claim takes the existing ring and adds n5 to it. It calculates how many vnodes each node wants (6, since we can't have a fraction of a vnode) and how many each existing node can give up (2 each). It then decides what the delta for each node is. Nodes 1 through 4 have a delta of 2, which means they can each give up at most 2 vnodes. Node 5 has a delta of -6, meaning it wants 6 vnodes. Claim then takes vnodes from the existing nodes and assigns them to the new node in such a way that target-n-val is not violated. The bug is that claim takes all the deltas from each node in turn until the wants are satisfied. It takes 2 from node1, 2 from node2, 2 from node3 and then stops. Now node5 has all 6 wants met, and nodes 1 through 3 have 6 vnodes each, all good. Except that node4 still has 8 vnodes, 33.3% more than any other node. Node4 is going to be busy, and therefore slow, and therefore trouble.
The fix for this issue is trivial. Rather than taking all the deltas at once from a node, instead take a delta from each node in round robin until the wants of the new node are met. This results in a ring where two nodes have 7 vnodes and the rest have 6.
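A sketch of that round-robin taking, with names of my choosing, ignoring the target-n-val constraint for brevity, and assuming Deltas is a list of {Node, GiveUpCount} pairs and Wants is the number of vnodes the joining node needs:

    %% Returns the nodes to take one vnode from, one entry per vnode taken.
    round_robin_take(Deltas, Wants) ->
        round_robin_take(Deltas, Wants, []).

    round_robin_take(_Deltas, 0, Taken) ->
        lists:reverse(Taken);
    round_robin_take([], _Wants, Taken) ->
        lists:reverse(Taken);
    round_robin_take([{_Node, 0} | Rest], Wants, Taken) ->
        round_robin_take(Rest, Wants, Taken);
    round_robin_take([{Node, Give} | Rest], Wants, Taken) ->
        %% take one vnode from this node, then move on to the next giver
        round_robin_take(Rest ++ [{Node, Give - 1}], Wants - 1, [Node | Taken]).

For the example above, round_robin_take([{n1,2}, {n2,2}, {n3,2}, {n4,2}], 6) gives [n1, n2, n3, n4, n1, n2]: n1 and n2 give up two vnodes each, n3 and n4 one each, leaving two nodes with 7 vnodes and the rest with 6.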
As well as fixing the two issues above and adding extra quickcheck property tests to verify the fixes, there is work still to be done. The existing tests don't call the same code paths as the operator commands riak-admin cluster join | plan | remove | etc, which means there may well be undetected edge cases and untested code. There is no test for removing a node. To address this we've started work on a more thorough quickcheck statem test that models the full life cycle of the ring over many transitions of node adds and removes. However the ring, claim, and gossip code seems to have grown organically and is spread over a large surface in riak_core, so this work will take some more time.
These changes, though small, make claim safer and lead to better behaved clusters. Though the tail violation issue could be detected if a diligent operator followed best practices and checked the ring before moving to production, there was no way to mitigate it.
These fixes will be released with the next version of Riak we create for NHS Digital, along with the new features from previous posts. If you want to look at the code, or try it for yourself, you can find it here.