Modular data organization

This post is about organizing data in differential dataflow. It is also about organizing organizing data, in that I got tired of writing so many different ways of organizing data and put together a modular framework for building different types of data layout. I think this is pretty cool, and you might think it is cool too. Or (for the more cynical) you could point out the obvious blunders I've made. Something good must come of it.

I'm optimistic!

Although you probably should read this in order for it to make sense, the structure is

  1. The problem definition: In which we describe the interface that our datastructure needs to support.

  2. A bespoke implementation: A description of differential dataflow's current design, which has some issues and which we will want to improve.

  3. Growing pains: A list of issues with the current implementation. There are some performance problems, some expressivity problems, and many maintainability problems.

  4. A modular approach: In which we deconstruct the bespoke approach into generic components, which can be re-assembled as appropriate.

  5. Evaluation: Testing our new implementations in various settings, trying to draw out their up and down sides.

0. Some motivation

Hey is differential dataflow fast or slow? Let's profile it.

Strongly connected components is one computation I run a lot. Because it is cool and complicated, but not because anyone ever asked for continually updated strongly connected components. Not once.

But if you fire up a 10M node, 20M edge computation, you get a profile that looks a bit like this

[profile screenshot: scc]

What it is telling us (because seriously, what the heck) is that we are spending a lot of time (20%) futzing around with memory management (memmove), and lots of time in methods with names like extend and consolidate and done. These are all methods related to building up collections, which makes sense because the scc computation produces a bunch of intermediate state (it is a really cool dataflow) but doesn't do much with that state once it has produced it (one iteration is often enough, for random graphs at least).

So, building up collections is part of what is holding scc back.

What about something else? Let's look at a continually updated breadth first search computation. This is another reference computation; I'm going to do 1M nodes and 10M edges, and continually apply 1,000 edge updates to the computation.

[profile screenshot: bfs]

This time around we see we spend lots of time in advance, which we might not have met yet but will meet soon. There are also some seek_key and tidy_vals and tidy_keys, all related to navigating around traces. This makes sense for bfs because there are relatively few changes to the computation as the edges change, but we do need to run around in our indexed data and check that things really have stayed the same.

So, seeking around in collections looking for keyed data is part of what is holding bfs back.

Bold Claim In order to improve differential dataflow, we need to improve how we manage data. Like, how we efficiently bundle it up and stash it, and how we navigate around in it.

Not shown: a screenshot of Instruments.App beachballing, because apparently it locked up OSX's screen shot taking mechanism too. It's good to know you aren't the only one whose software isn't fully sorted yet.

1. The problem

Recall that differential dataflow tracks lots of four-tuples,

(key, val, time, diff)

where key and val are probably ordered, time is kind of a mess, and diff is some sort of integer (I've put on my grown-up pants and switched it from an i32 to an isize, because big data).

All of this stuff also applies more broadly if your scenario has more interesting relations with more dimensions (or fewer, as we will get to).

Without worrying about the details, I'm going to claim that we want to navigate the data in this order too: we want to sniff around by key, within each key we want to move around val, and finally we want to crank through a bunch of (time, diff) pairs for each value. This may change as the underlying differential dataflow algorithms change, but let's take it as a given for now.

Our tuples arrive in batches of varying sizes. At the beginning of the computation we might get a billion edges (hi twitter_rv!), and as the computation proceeds these batches might wobble around and be as small as single records. In any case, we want to append these tuples to our collection, so that when we sniff around in the future, we see them.

There is some magic about compacting differences for equivalent times which you can read about, but for now let's ignore this.

So, we repeatedly accept batches of tuples, and need to support interactive exploration by keys and vals and .. not really by time but we need access to all the (time, diff) pairs.

2. A bespoke implementation

We'll start with the custom implementation, before we get fancy and generalize it. This should call out the intended structure of the implementation, and then we'll talk about swapping in clever bits more modularly. It is the same implementation as mentioned in the recent status report post, but let's review it anyhow.

Step 1: An immutable collection

Our first building block is going to be an immutable collection. If we are presented with a batch of data and are told "just this", how might we represent it?

There are a few choices, but I happened to choose the following:

struct Batch<K: Ord, V: Ord, T> {
	keys: Vec<(K, usize)>,
	vals: Vec<(V, usize)>,
	times: Vec<(T, isize)>,
}

The idea here is that keys is ordered by the K field, and has as its second field an offset into vals where the values associated with that key end. The values start, of course, at the offset where the previous key's values end (or zero, if this is the first key). Similarly, each interval of vals is sorted by V and has an offset into times bounding the associated range of (T, isize) entries.

This works. Because things are sorted, we could use binary search or something to zip around. It has some performance issues that we will get to, but it is not so bad to navigate through.
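For concreteness, here is a sketch (mine, not code from the library) of how those offsets delimit ranges, using the Batch struct above. The helper name is hypothetical.

// The values for `keys[index]` occupy `vals[lower .. upper]`, where `lower` is the
// previous key's offset (or zero for the first key) and `upper` is this key's offset.
// The same pattern relates each value to its range of `times`.
fn values_for_key<'a, K: Ord, V: Ord, T>(batch: &'a Batch<K, V, T>, index: usize) -> &'a [(V, usize)] {
	let lower = if index == 0 { 0 } else { batch.keys[index - 1].1 };
	let upper = batch.keys[index].1;
	&batch.vals[lower .. upper]
}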

You could walk around in these fields on your own, but I thought it would be good to put together a trait that allowed us to hide the details behind a simple Cursor interface:

pub trait Cursor<K, V, T> {

	// move the cursor around..
	fn step_key(&mut self);
	fn seek_key(&mut self, key: K);
	fn step_val(&mut self);
	fn seek_val(&mut self, val: V);

	// see what the cursor points at..
	fn valid_key(&self) -> bool;
	fn valid_val(&self) -> bool;
	fn key(&self) -> &K;
	fn val(&self) -> &V;

	// do stuff with the associated (time, diff) data.
	fn map_times<F: FnMut(&T, isize)>(&self, func: F);
}

Actually the real trait is a bit more complicated (rewinding, peeking, etc).

Nonetheless, the trait above gives us some structure about what we'll need to be able to do with our data organization. Walk around, jump around, look at things.
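As a hypothetical example of driving this trait (a sketch, not library code), here is how one might total up the diffs for a specific (key, val) pair, assuming seek_key and seek_val land on the first entry not less than their argument:

fn total_diff<K: Ord + Clone, V: Ord + Clone, T, C: Cursor<K, V, T>>(cursor: &mut C, key: &K, val: &V) -> isize {
	let mut sum = 0;
	cursor.seek_key(key.clone());
	if cursor.valid_key() && cursor.key() == key {
		cursor.seek_val(val.clone());
		if cursor.valid_val() && cursor.val() == val {
			// accumulate every (time, diff) pair for this (key, val).
			cursor.map_times(|_time, diff| sum += diff);
		}
	}
	sum
}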

Our ordered lists allow us to use binary search for keys and values, but there is something a bit more clever we can do too. Exponential or "galloping" search starts from an existing cursor and moves forward in exponentially increasing steps until it would pass the value, then takes exponentially decreasing steps until it gets there. This is the version I use:

/// Reports the number of elements to pass until `function` becomes false.
pub fn advance<T, F: Fn(&T)->bool>(slice: &[T], function: F) -> usize {

	// start with no advance
	let mut index = 0;
	if index < slice.len() && function(&slice[index]) {

		// advance in exponentially growing steps.
		let mut step = 1;
		while index + step < slice.len() && function(&slice[index + step]) {
			index += step;
			step = step << 1;
		}

		// advance in exponentially shrinking steps.
		step = step >> 1;
		while step > 0 {
			if index + step < slice.len() && function(&slice[index + step]) {
				index += step;
			}
			step = step >> 1;
		}

		index += 1;
	}	

	index
}

The advantage here is that if you aren't going too far away, it can be pretty quick. The time taken is logarithmic in the distance to the target, and if that is small, zoom. If it isn't small, well shucks.
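For example, a cursor's key seek could be built on advance like so (a sketch over the keys vector from above, not the actual implementation):

// Move `pos` forward past every key strictly less than `target`, and return the new position.
fn seek_key_position<K: Ord>(keys: &[(K, usize)], pos: usize, target: &K) -> usize {
	pos + advance(&keys[pos ..], |&(ref key, _)| key < target)
}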

All this jumping around is fun, but it isn't cheap. Stomping around in large amounts of memory can be pretty painful. One thing that improves life is breaking the keys vector into two vectors:

	keys: Vec<K>,
	offs: Vec<usize>,

so that we can jump around in keys and then look things up in offs. Saving the 8 bytes of the usize entry helps a lot for small keys, as we jump around in a relatively smaller amount of memory while looking for the key.

This is most of what I wanted to say about the immutable collection. We are going to use it, but there isn't all that much to this implementation so far.

Step 2: An append-only collection

We have above an approach for immutable collections, but for some reason we don't get all of our immutable data at once. We keep getting more and more immutable collections handed to us, which probably doesn't seem like the immutability we were promised.

But, there are standard ways of dealing with this. The one I thought was most natural was just to keep the immutable collections in a list as they come in,

struct Spine<K: Ord, V: Ord, T> {
	batches: Vec<Rc<Batch<K, V, T>>>,
}

Side notes: yes, we will merge the immutable collections and create new immutable collections. Also, the Rc there is a reference-counting smart pointer, which we use so we can share the batches without taking ownership from others.


We also need a newer and smarter cursor implementation which would keep its fingers in each of the immutable collections, perhaps by using their cursors. This new and smart cursor will need to present the appearance of one order on keys and values, probably by merging the information from each of its constituent cursors.

What is a good way of merging several cursors? I had a pretty horrible way before, but I came up with something that I find pretty tasteful. It has probably been invented before, so if you know the name, holler!

The rough idea is that we'll keep all of our cursors in a list, which is a smart place to keep things when you have many, but we keep them sorted first by each's current key, and then among cursors with the smallest key we keep them sorted by each's current value. Let's imagine we also write down how many cursors currently have the same key and within them the same value, so we don't have to constantly re-evaluate this.

struct CursorMerge<C: Cursor> {
	cursors: Vec<C>,
	same_keys: usize,
	same_vals: usize,
}

The current key and value can be read out of the first element in the list. To navigate around, we need to manipulate some of the cursors, but we only need to work with the prefix of the cursors whose keys or values, appropriately, are smaller than the target. Often this is a relatively small prefix (Note: this only works as we seek forward, which is going to be how we look for keys).
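Here is a sketch of what seeking could look like, written against the Cursor<K, V, T> trait from above (so not exactly the code in use), treating exhausted cursors as larger than everything. A real implementation would merge the disturbed prefix back into the sorted list and recount same_keys and same_vals, rather than re-sort the whole thing:

use std::cmp::Ordering;

fn seek_key_merged<K: Ord + Clone, V, T, C: Cursor<K, V, T>>(cursors: &mut Vec<C>, key: &K) {
	// only the prefix of cursors currently behind `key` needs to move.
	let behind = advance(&cursors[..], |c| c.valid_key() && c.key() < key);
	for cursor in cursors[.. behind].iter_mut() {
		cursor.seek_key(key.clone());
	}
	// restore the invariant: ordered by current key, exhausted cursors last.
	cursors.sort_by(|a, b| match (a.valid_key(), b.valid_key()) {
		(true, true)   => a.key().cmp(b.key()),
		(true, false)  => Ordering::Less,
		(false, true)  => Ordering::Greater,
		(false, false) => Ordering::Equal,
	});
}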

Step 3: Merging and Compaction

I said I wasn't going to say much about these, and I'll try not to. When we need to, we merge immutable collections into new merged collections. The results are (i) one collection and one cursor, and (ii) less space used. As we do this, we can apply advance_by logic to each of the time fields, which gives us the potential to consolidate diff fields and occasionally cancel tuples out of the collection.

It's still append-only, kinda. Immutable is arguable at this point.

3. Growing pains

The bespoke approach above is fine. It has decent worst-case performance, but isn't going to win any awards (unless there are prizes for being impressed with oneself). There are a variety of ways we might want to tweak this implementation to accommodate structural properties of the data, or to take aim at better performance. Let's go through a few examples:

Collections without values

What if your collection isn't (key, val) pairs, but is just keys?

This is not uncommon in, for example, Datalog computations where the collections are all set-valued. The pervasive distinct operator doesn't know anything about keys and values, nor does half of the semijoin operator (which passes those records whose keys exist in a second set).

In this setting, while we could just take our V type to be (), the zero space unit type, we feel pretty silly keeping around

	vals: Vec<((), usize)>,

when we could have just put all of its offsets into times up in the keys vector.

Instead, it makes sense to have a different datastructure:

struct Batch<K, T> {
	keys: Vec<(K, usize)>,
	times: Vec<(T, isize)>,
}

Of course, we'll need a different cursor now too. Copy/paste?

Static collections

Some collections don't change. Or, maybe they just have few enough times at which they change that we could tolerate an immutable collection for each time.

If we want to do this, it neither makes sense to record the times in the collection, nor to allow vals to describe ranges (as there will only be one diff). Instead, we'd like a struct like

struct Batch<K, V, T> {
	time: T,
	keys: Vec<(K, usize)>,
	vals: Vec<(V, isize)>,
}

This saves us a bunch on the memory footprint (those times aren't small, nor is the usize). It also simplifies some of the cursor logic, which we'll have to rewrite of course. No copy/paste this time.

Hashing vs Searching

Ordered lists are pretty handy for a few reasons.

  1. We can search for elements in logarithmic time.
  2. We can merge lists quite efficiently.
  3. Flat data layout is appealing for copying, serialization.

One potential improvement is to use an order-preserving hash table. My personal favorite is Robin Hood Hashing, which keeps all elements in sorted order of their hash codes, but puts elements no earlier than the index their hash code would suggest. It leaves elements in sorted order, but with some gaps that in essence absorb some of the pain of bucket collisions. It still meets most of the properties up above, at the cost of bloating intervals up to powers of two, but only applies when your type implements Hash in addition to Ord.

The changes are not wildly complicated:

struct Batch<K, V, T> {
	keys: Vec<Option<(K, usize)>>,
	vals: Vec<Option<(V, usize)>>,
	times: Vec<(T, isize)>,
}

The only change here is that we've wrapped each of our entries in an Option type, which allows for the absence of data with its None variant. Of course, there is also a substantially different cursor to write now.
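For a flavor of why this layout still supports look-ups, here is a sketch of what a probe could do, with simplifying assumptions of my own: 64-bit hashes mapped monotonically onto the array, a caller-supplied hash function, and no wrap-around handling at the end of the array. The actual cursor differs in details.

fn lookup<'a, K: Eq, F: Fn(&K) -> u64>(slots: &'a [Option<(K, usize)>], key: &K, hash: F) -> Option<&'a (K, usize)> {
	let target = hash(key);
	// map the hash onto the array monotonically, so hash order matches index order.
	let mut index = ((target as u128 * slots.len() as u128) >> 64) as usize;
	while index < slots.len() {
		match slots[index] {
			// an empty slot: the key cannot have been displaced past it.
			None => return None,
			Some(ref entry) if &entry.0 == key => return Some(entry),
			// an entry that hashes later than us: we would have appeared before it.
			Some(ref entry) if hash(&entry.0) > target => return None,
			// an entry displaced from earlier; keep scanning.
			_ => index += 1,
		}
	}
	None
}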

Hybridizing

These two examples are a good start, but maybe what we actually want is hashing for the keys, and ordered lists for the values, because we jump around in keys mostly and then scan values. Or maybe we want hashing for that neat key-free version that distinct uses. Awesome!

Do I really have to write out the full cross-product of these variants? What about when we invent a new variant that doesn't use isize, but instead has a binary polarity for the whole batch and uses the number of occurrences to indicate the count?

You could write all these things out, and that would be one way to spend several months.

4. A modular approach

The bespoke approach quickly becomes painful to maintain. There are obviously many improvements, but the cost of implementing them (and maintaining them, if there are bugs, lol) makes it not that exciting for me at least.

Let's do something smarter, and make things faster in the process.

Layers on layers

Our various representations above varied mostly in which fields (keys, vals, times) we kept, and how we navigated through each of these layers. In fact, there is a pretty simple contract between the layers: each layer is segmented into distinct regions by usized boundaries known one layer up. We can think about writing generic "layers" that we then plug on top of other layers.

Let me give you a taste of what this looks like:

struct OrdLayer<K: Ord, L: Layer> {
	keys: Vec<(K, usize)>,
	vals: L,
}

Ok, this generic struct has keys like before, but then it just has vals which is some type L: Layer. What does that mean?

The generic struct means that we can plug in our favorite type that implements some Layer trait, and we get the ability to navigate around by key K, and perhaps then dive down into vals using the usize offsets associated with the keys. It's not yet specified what we are diving down into, but that's cool. Something polymorphic, I bet.
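The post doesn't spell out the Layer trait, so treat the following as a guess at its minimal shape: the one thing the layer above really needs is a length, so that its usize offsets have meaning (building and merging machinery comes up later).

pub trait Layer {
	/// The type of entry stored at this layer.
	type Item;
	/// Number of entries; the layer above stores `usize` offsets into `0 .. self.len()`.
	fn len(&self) -> usize;
}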

Let's do another one, with less cleverness:

struct BottomLayer<T: Ord> {
	vals: Vec<(T, isize)>,
}

I've suggestively named the layer and its generic parameter to connect this with the last level of our first Batch<K, V, T> implementation. Speaking of which,

type Batch<K, V, T> = OrdLayer<K, OrdLayer<V, BottomLayer<T>>>;

Ooo. That was pretty easy, wasn't it? Let's do that value-free (ha) version that something like distinct would use:

type Batch<K, V, T> = OrdLayer<K, BottomLayer<T>>;

Oh neat. What about that hashing stuff? We could certainly write out something like

struct HashLayer<K: Ord+Hash, L: Layer> {
	keys: Vec<(K, usize)>,
	vals: L,
}

which lets us choose to use our Robin Hood hashing on a layer-by-layer basis. We could do

type Batch<K, V, T> = HashLayer<K, OrdLayer<V, BottomLayer<T>>>;

or

type Batch<K, V, T> = OrdLayer<K, HashLayer<V, BottomLayer<T>>>;

or

type Batch<K, V, T> = HashLayer<K, BottomLayer<T>>;

or whatever we want, as long as what we want is to type in lots of different batch definitions.

But there's more isn't there? Just writing the structs is the easy part; there is code to write too.

Cursors

Probably the scariest things to worry about are the cursors. That is where all the logic exploiting our cleverness lives. But, cursors really aren't all that hard.

Here is the trait that I'm currently using:

pub trait Cursor {
	type Key: Ord;
	fn step(&mut self);
	fn seek(&mut self, key: &Self::Key);
	fn key(&self) -> &Self::Key;
	fn valid(&self) -> bool;
	fn reposition(&mut self, lower: usize, upper: usize);
}

The trait has an associated type Key which is just the cursor's way of saying "you need to use this type to talk with me". It has step and seek, which do the things you would expect (jump around). It has key and valid methods to know what you are looking at. Finally, there is a reposition method that, let's be honest, you aren't supposed to call. That is for parent cursors to call when they jump around. Hands off.

This moves us around one level, but there is nothing in the trait definition about diving down to lower values and such. How do we do that?

I've just kept it out of the trait definition, though perhaps it could live there too. It seemed easier just to do something like

pub struct OrdCursor<K, L: Cursor> {
	layer: OrdLayer<K>,
	pos: usize,
	bounds: (usize, usize),
	pub vals: L, 
}

This cursor has a link to the layer, a current position, and bounds on the interval it is currently exploring (so you don't wander into the K entries for some other G, or whatever lives above you).

In addition, the cursor has a public field, vals, which is just the cursor for the next level down. You can use that as you see fit. Importantly, none of the vals navigation informs the keys cursor; once the vals cursor runs out of values to look at (or whatever), it just sits in an invalid state. The only protocol between the two is that as we advance in the parent OrdCursor, we need to reposition the child vals to the corresponding range of values (hence reposition. don't touch!).
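To make the reposition protocol concrete, here is a sketch of how OrdCursor might implement the per-layer Cursor trait, assuming layer gives access to the keys: Vec<(K, usize)> from OrdLayer and re-using advance from earlier; the real code differs in details.

impl<K: Ord, L: Cursor> Cursor for OrdCursor<K, L> {
	type Key = K;
	fn step(&mut self) {
		self.pos += 1;
		if self.valid() { self.reposition_child(); }
	}
	fn seek(&mut self, key: &K) {
		let upper = self.bounds.1;
		let steps = advance(&self.layer.keys[self.pos .. upper], |&(ref k, _)| k < key);
		self.pos += steps;
		if self.valid() { self.reposition_child(); }
	}
	fn key(&self) -> &K { &self.layer.keys[self.pos].0 }
	fn valid(&self) -> bool { self.pos < self.bounds.1 }
	fn reposition(&mut self, lower: usize, upper: usize) {
		self.bounds = (lower, upper);
		self.pos = lower;
		if self.valid() { self.reposition_child(); }
	}
}

impl<K: Ord, L: Cursor> OrdCursor<K, L> {
	// point the child cursor at the range of values owned by the current key.
	fn reposition_child(&mut self) {
		let lower = if self.pos == 0 { 0 } else { self.layer.keys[self.pos - 1].1 };
		let upper = self.layer.keys[self.pos].1;
		self.vals.reposition(lower, upper);
	}
}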

The reality is that I'm going to have to implement the Cursor trait from way back above (with step_key and seek_val) for each of these types. No magic there, yet. But that code should be relatively simple, and mostly it just makes associations between the trait methods and those of the weird type that I've built. It exports an understanding of what are keys and what are values.

As long as all of our batches are the same type, their cursors will be the same type and we can just throw them in that CursorMerge struct, which was written to be generic over cursor implementations.

Merging

This part is pretty cool. I think.

We are going to have to merge some of our Batch collections. No sweat, but let's at least try and do a good job here. Batches are sorted by something (either key or hash), and we can walk through them doing a merge.

We all know how to write merge sort, maybe. We start with two cursors and repeatedly look at the next element in each cursor, and take the smaller or, if there is a tie, take them both. Actually, it's a bit more complicated with our batches: if there are tied keys, we need to merge the associated values. If there are tied values we need to merge the associated times. Ok, it's like merge sort but a bit weirder.

Actually, it is like merge sort but awesomer, in some few important ways.

Let me describe a variant of merge sort that works on arrays, and uses our advance method from up above. Remember that one? It skips forward in ordered sequences to find the first element greater than some target, using logarithmic steps.

Here is a merge sort fragment that compares two elements and, based on which one wins, uses advance to zip through the one with lesser values, looking for a number of elements (step) to copy over. It then uses extend_from_slice, which Rust folks will know is one of the few ways to trick LLVM into using memcpy. Depending on the phase of the moon, of course.

// merge sort, kinda.
use std::cmp::Ordering;

fn merge<T: Ord + Clone>(mut source1: &[T], mut source2: &[T], target: &mut Vec<T>) {
	target.reserve(source1.len() + source2.len());
	while source1.len() > 0 && source2.len() > 0 {
		match source1[0].cmp(&source2[0]) {
			Ordering::Less => {
				let step = 1 + advance(&source1[1..], |k| k < &source2[0]);
				target.extend_from_slice(&source1[..step]);
				source1 = &source1[step..];
			}
			Ordering::Equal => {
				target.push(source1[0].clone());
				target.push(source2[0].clone());
				source1 = &source1[1..];
				source2 = &source2[1..];
			}
			Ordering::Greater => {
				let step = 1 + advance(&source2[1..], |k| k < &source1[0]);
				target.extend_from_slice(&source2[..step]);
				source2 = &source2[step..];
			}
		}
	}

	target.extend_from_slice(source1);
	target.extend_from_slice(source2);
}

This works rather well if there are runs of similarly sized values in either input. Maybe these runs aren't very common in sorting benchmarks, but they show up quite a lot in our batches. Wait, really?

In our batches, the analogue of extend_from_slice doesn't just copy over some keys, it needs to move over all of the associated values, and for each of those values the associated times. Even if we only get a few keys in a run, we are likely to want to bulk copy a bunch of values and times.

Fortunately, we can make our Layer implementations expose a copy_range interface, which efficiently pulls over large blocks of data from existing layers. This makes merging a lot less painful, and we only have to write it once for each Layer implementation, rather than for each of our weird cross-product implementations. This should make merging hurt a lot less. I hope.
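The copy_range signature isn't given in the post; one plausible shape (an assumption on my part, not the actual API) is a method on each layer that appends a contiguous range of another instance of itself, recursively copying the ranges those entries point at:

pub trait CopyRange {
	/// Append entries `lower .. upper` of `other` to `self`, along with everything
	/// those entries reference in the layers below.
	fn copy_range(&mut self, other: &Self, lower: usize, upper: usize);
}

In the merge loop sketched above, the Equal arm then becomes "merge the two keys' ranges of values recursively", and the Less and Greater arms become copy_range calls over the run of keys that advance finds.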

5. Evaluation

Let's see how this goes. We are initially going to compare against differential dataflow's default implementation, which was the "not very impressive" implementation whose performance limitations we hinted at back at the beginning. We expect to see improvements here.

Specifically, we'd like to get speedy random access, high-throughput merging, and more efficient memory layout where appropriate. The things we will measure are

  1. Time to build collections from unsorted tuples.
  2. Time to merge collections for varying sizes.
  3. Time to look up query sets in collections, each of varying sizes.

We'll actually do a few flavors of queries, taking random query sets of varying sizes and sorting them to see how performance improves as we get larger batches of queries. All of our implementations are designed to go faster as we get larger batches of queries to look up.

As data, I'm not going to work too hard here and just introduce a bunch of (u32, (usize, isize)) tuples, where the key ranges from 0 up to some limit, and there is just one usize value for each key. Our goal is to stress out the key layer, not the layers below it. I think we could probably do some more interesting measurements, but we would start to conflate a bunch of performance issues and likely not understand the behavior of these layers as easily.

Let's see!

Baseline performance

To start, let's consider using Rust's HashMap to drive a layer. I'm not really sure how this would work modularly, but we can use it for the top layer (keys), and get a sense for how "random access" could look, and what building and merging such a structure looks like (in fairness, it is optimized for neither).

	HashMap<u32, (usize, isize)>

For increasing numbers of keys, we are going to stash (key, (0u64, 0i64)) in the hash map. This is less work than the other layers are going to do, as they have the potential of variable numbers of values to maintain, but since HashMap doesn't do that we won't measure it. Instead, we will build up the hash map, merge the hash map with itself, and seek around in the hash map. In each case we report elapsed times divided by the number of records involved, to tell us about the rate.
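The shape of the measurement is roughly the following sketch (my reconstruction of the methodology described, not the actual harness):

use std::collections::HashMap;
use std::time::Instant;

fn main() {
	let keys: u32 = 1_000_000;
	let start = Instant::now();
	let mut map = HashMap::new();
	for key in 0 .. keys {
		map.insert(key, (0u64, 0i64));
	}
	// report nanoseconds per inserted record.
	let elapsed = start.elapsed();
	let nanos = elapsed.as_secs() * 1_000_000_000 + elapsed.subsec_nanos() as u64;
	println!("build: {}ns per key", nanos / keys as u64);
}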

| keys        | build | merge | seek  |
|-------------|-------|-------|-------|
| 1,000       | 137ns | 82ns  | 19ns  |
| 10,000      | 102ns | 49ns  | 24ns  |
| 100,000     | 153ns | 108ns | 69ns  |
| 1,000,000   | 256ns | 154ns | 77ns  |
| 10,000,000  | 232ns | 177ns | 83ns  |
| 100,000,000 | 283ns | 228ns | 177ns |

Hash maps are meant to have good seek latencies at the expense of building and merging overhead; hash map construction stabs around in memory, and stutters a bit once we get more and more keys. The performance drops off a bit once we get to 100 million keys, a result of stabbing around randomly in quite a bit of memory.

OrdLayer performance

Our first candidate, and roughly equivalent to what we are currently using in differential dataflow, is the ordered layer. It is meant to have better sequential access behavior, meaning good building and merging but maybe worse random lookups (yup). We are going to be using:

	OrdLayer<u32, WeightedLayer<usize>>

Let's start by looking at the time it takes to build up an instance of the layer, starting from randomly ordered data (explicitly shuffled). We'll go through each key size, where we just have a single value associated with each (as we are trying to test this layer, not others). There are two costs to building the layer: sorting the input tuples, and then feeding them into the layer. We will measure them separately, as well as the time to then merge the collection with itself.

| keys        | sort  | build | merge |
|-------------|-------|-------|-------|
| 1,000       | 82ns  | 9ns   | 10ns  |
| 10,000      | 84ns  | 11ns  | 16ns  |
| 100,000     | 96ns  | 24ns  | 17ns  |
| 1,000,000   | 136ns | 31ns  | 23ns  |
| 10,000,000  | 166ns | 40ns  | 20ns  |
| 100,000,000 | 180ns | 43ns  | 24ns  |

The measurements indicate that sorting dominates, and if we could improve this we would see some gains. Or perhaps viewed differently, the building process is relatively efficient, which is good news because merging looks more like building than it looks like sorting. This merge also happens to be a pessimal merge, in that we merge identical traces with each other, meaning we never get a chance to use our fancy memcpy. The performance is still pretty good, and is notably much better than the HashMap implementation. Especially if we sort out sorting.

Up next, let's evaluate the time it takes to issue a bunch of queries. We'll randomly shuffle all the keys, and then issue queries in batches of varying sizes. The implementation can (and will) sort the queries in each batch to use one forward scan through the layer, which is important because our advance method only goes forward. Let's plot the average number of nanoseconds for a seek_key() call across all the keys, batched at different sizes:

| keys \ batch | 1       | 10      | 100     | 1,000   | ALL  |
|--------------|---------|---------|---------|---------|------|
| 1,000        | 61ns    | 51ns    | 56ns    | 44ns    | 44ns |
| 10,000       | 57ns    | 51ns    | 52ns    | 50ns    | 49ns |
| 100,000      | 207ns   | 174ns   | 112ns   | 86ns    | 67ns |
| 1,000,000    | 513ns   | 484ns   | 384ns   | 261ns   | 78ns |
| 10,000,000   | 1,168ns | 1,290ns | 1,004ns | 707ns   | 86ns |
| 100,000,000  | 1,812ns | 2,014ns | 1,646ns | 1,185ns | 96ns |

We see a general trend toward lower latencies with higher batch sizes, though this only really shows up at larger batch sizes. As we get to processing ALL keys, we are essentially performing a sequential scan through the stored keys, which is pretty efficient. The latencies for large key counts at small batch sizes are quite high, and notably much worse than the HashMap implementation (by about 10x).

For informational purposes, let's re-do this measurement where we don't charge the cost of sorting to the query. This makes some sense in cases where the keys are pre-sorted, as when they arrive in a pre-formed layer (which they do, in differential dataflow).

| keys \ batch | 1       | 10      | 100     | 1,000   | ALL |
|--------------|---------|---------|---------|---------|-----|
| 1,000        | 56ns    | 40ns    | 22ns    | 9ns     | 4ns |
| 10,000       | 56ns    | 42ns    | 28ns    | 18ns    | 4ns |
| 100,000      | 209ns   | 119ns   | 87ns    | 34ns    | 5ns |
| 1,000,000    | 494ns   | 501ns   | 364ns   | 215ns   | 5ns |
| 10,000,000   | 1,135ns | 1,257ns | 975ns   | 651ns   | 4ns |
| 100,000,000  | 1,661ns | 1,931ns | 1,538ns | 1,183ns | 4ns |

Here the improvement as batch sizes increase becomes much clearer, even for small numbers of keys. On the other hand, the performance for large numbers of keys stays pretty unbearable.

HashLayer performance

Our next candidate is the Robin Hood hashing layer,

	HashLayer<u32, WeightedLayer<usize>>

This evaluation taught me something about how the FNV hash function works, or rather doesn't work on consecutive integers. It's fast on small inputs, though, so let's use it for now. I'm using a trait Hashed with two methods that reveal a hash value, and how many of the bits are legit. The latter is most commonly 64, but can be less for 32 bit hash values (bear with me on this). Here is the default implementation for things that can be hashed.

use std::hash::Hasher;

impl<T: ::std::hash::Hash> Hashed for T {
	fn hashed(&self) -> u64 {
		// hash with FNV (from the `fnv` crate) and report the result.
		let mut h: fnv::FnvHasher = Default::default();
		self.hash(&mut h);
		h.finish()
	}
	fn bits(&self) -> usize { 64 }
}
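The trait itself isn't written out in the post; reconstructed from the description and the impl above, it presumably looks something like:

pub trait Hashed {
	/// A hash value for this record.
	fn hashed(&self) -> u64;
	/// How many bits of `hashed()` are meaningful (64 by default, 32 for 32-bit hashes).
	fn bits(&self) -> usize;
}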

If we spin up the computation as with OrdLayer, assembling different numbers of distinct keys, we get numbers like:

| keys        | sort  | build | merge |
|-------------|-------|-------|-------|
| 1,000       | 421ns | 165ns | 28ns  |
| 10,000      | 129ns | 69ns  | 27ns  |
| 100,000     | 122ns | 104ns | 43ns  |
| 1,000,000   | 95ns  | 68ns  | 45ns  |
| 10,000,000  | 91ns  | 64ns  | 32ns  |
| 100,000,000 | 98ns  | 74ns  | 37ns  |

The sorting has improved by a factor of 2x over OrdLayer, for non-small key counts, but the build times have increased (which makes sense, as we are trying to be more clever). There are several potential improvements to the hash builder, mainly that it currently first stages keys in a second array to figure out where to place them. These numbers also call out the overhead my radix sorter has for small key sizes (there is a fixed cost proportional to 256). Merging doesn't seem too bad!

Up next, let's evaluate the time it takes to issue a bunch of queries. Again, we'll randomly shuffle all the keys, and then issue queries in batches of varying sizes. The implementation sorts the queries to use one forward scan through the layer. Let's plot the average number of nanoseconds for a seek_key() call across all the keys:

| keys \ batch | 1     | 10    | 100   | 1,000 | ALL   |
|--------------|-------|-------|-------|-------|-------|
| 1,000        | 63ns  | 69ns  | 115ns | 118ns | 143ns |
| 10,000       | 43ns  | 46ns  | 84ns  | 120ns | 151ns |
| 100,000      | 112ns | 89ns  | 113ns | 145ns | 183ns |
| 1,000,000    | 145ns | 90ns  | 120ns | 147ns | 218ns |
| 10,000,000   | 174ns | 158ns | 194ns | 216ns | 248ns |
| 100,000,000  | 279ns | 279ns | 309ns | 346ns | 324ns |

The enormous latencies for large key counts are much reduced. But, we see a different trend here than in OrdLayer. As the batch size gets bigger things get slower. This seems mostly due to the fact that sorting has a cost, and the random look-ups we are doing are already pretty fast (for large key sizes and small batches, almost 10x faster than OrdLayer). The cost of sorting can be decreased with radix sorting, but my implementation isn't especially brilliant on small batches (i.e. those other than ALL). Rust's comparison-based sorting is also probably annoyed because there are two fnv() evaluations in the comparison function. I'm ashamed.

Let's check again on the case where the cost of sorting is not counted against the cost of doing a query, simulating performance when the data arrive pre-arranged.

| keys \ batch | 1     | 10    | 100   | 1,000 | ALL  |
|--------------|-------|-------|-------|-------|------|
| 1,000        | 51ns  | 32ns  | 35ns  | 17ns  | 19ns |
| 10,000       | 40ns  | 21ns  | 15ns  | 16ns  | 12ns |
| 100,000      | 82ns  | 88ns  | 91ns  | 40ns  | 17ns |
| 1,000,000    | 140ns | 65ns  | 54ns  | 55ns  | 17ns |
| 10,000,000   | 163ns | 145ns | 133ns | 133ns | 28ns |
| 100,000,000  | 268ns | 240ns | 235ns | 227ns | 68ns |

These numbers look a lot better. If the queries come in sorted, we will be almost as happy as with OrdLayer, and much happier for the small batch sizes (by almost a factor of 10x for large key spaces).

But, these numbers still aren't as good as we would like. The ALL measurement is noticeably larger than the single-digit nanoseconds of the OrdLayer measurements, and the large key counts take more than a hundred nanoseconds on average; more than we'd expect.


It turns out that the janky performance can be attributed (I think) to poor distribution of hash values. If we instrument the implementation, some keys are displaced 40 positions from their intended location, and the variance of displacement is something like 24. These are much larger than they are supposed to be for Robin Hood Hashing (the variance is meant to be 1-ish).

If you dig into how the FNV hash function works, it treats its arguments as a sequence of bytes, and repeatedly x-ors in a byte and then multiplies by some large prime. Unfortunately, that prime is basically a large power of two plus some small junk; specifically, for 64 bit hashes it is

2^40 + 2^8 + 0xb3

The problem (for us) is that something as simple as consecutive integers differ only in their final byte, and by not very much. The result is that the high-order bits of the hash will differ only by some small multiple of 2^40, which doesn't quite reach the highest-order bits of the hash. If we look primarily at the high-order bits to determine intended locations, we'll have a bunch of collisions when our keys are consecutive integers.

For example, let's take the numbers 0 .. 128, compute the FNV hash, and look at the top byte. There are 256 possible values, think hash buckets, so there should be plenty of space for 128 values. Unfortunately, they collide a lot. There are only 24 distinct top bytes that FNV produces on these numbers, so buckets will get five-ish entries on average. And worse, the occupied buckets form runs; here are the first six occupied buckets, and their counts:

bucket[0x0B]: 4
bucket[0x0C]: 6
bucket[0x0D]: 6
bucket[0x2B]: 4
bucket[0x2C]: 6
bucket[0x2D]: 6
...

So, all of those entries are just going to pile up, aren't they. At least, at this small scale. Apparently similar things happen at the larger scales too.
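If you want to see this yourself, a few lines against the fnv crate (the same hasher used above) reproduce the effect; the exact count depends on how many bytes the integers hash as, so treat this as a sketch rather than the experiment from the post.

use std::collections::HashSet;
use std::hash::{Hash, Hasher};

fn main() {
	let mut top_bytes = HashSet::new();
	for i in 0u32 .. 128 {
		let mut hasher = fnv::FnvHasher::default();
		i.hash(&mut hasher);
		top_bytes.insert(hasher.finish() >> 56);	// keep only the top byte.
	}
	println!("distinct top bytes for 0 .. 128: {}", top_bytes.len());
}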


One fix that seems crazy, but works pretty great, is to ask the nice graph processing people, who thoughtfully mapped all their graph identifiers down to small contiguous integers, to turn the identifiers into distinct random numbers. We can then just use this number as the hash, which is pretty handy (we skip the FNV evaluation, and radix sorting is cheaper because we have half the bytes in the key).

#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Default)]
pub struct Uniform {
	val: u32,
}

impl Hashed for Uniform {
	fn hashed(&self) -> u64 { self.val as u64 }
	fn bits(&self) -> usize { 32 }
}

Our layer is now going to be

	HashLayer<Uniform, WeightedLayer<usize>>

which is roughly the same as our previous HashLayer, just with a different hashed() implementation. If we re-run these measurements with random u32 identifiers rather than 0 .. nodes, we see a maximum displacement of 10 (rather than 40), and a variance of 1 (rather than 24). That feels better.

Aside If you have dense identifiers, you would usually try something like an array rather than a hash map. That makes sense if you have lots of the identifiers, but we need something that works even for smaller sets of keys. If each of our immutable collections carries around an index sized to the number of nodes, we will be blowing a lot of memory.

Let's look at how this version fares. Building and merging shouldn't be much different, as these are largely agnostic to collisions. However, we save on the sorting (because radix sort is smart enough to notice that u32 has fewer bytes than u64), and we might save a bit on the FNV evaluation, which we aren't doing now.

| keys        | sort  | build | merge |
|-------------|-------|-------|-------|
| 1,000       | 310ns | 213ns | 30ns  |
| 10,000      | 68ns  | 57ns  | 26ns  |
| 100,000     | 53ns  | 92ns  | 37ns  |
| 1,000,000   | 44ns  | 63ns  | 36ns  |
| 10,000,000  | 49ns  | 65ns  | 35ns  |
| 100,000,000 | 52ns  | 84ns  | 31ns  |

These sort times are about half what they are for the data when we use u64 hash values to sort. The radix sort is smart and understands how many bytes are in the key, based on its type, and only does that many passes. Building and merging both seem to be about the same amount of time as before, which makes sense (perhaps fewer FNV invocations, but not enough to matter).

When we start to do look-ups, we notice generally improved times.

| keys \ batch | 1     | 10   | 100  | 1,000 | ALL   |
|--------------|-------|------|------|-------|-------|
| 1,000        | 41ns  | 43ns | 65ns | 69ns  | 76ns  |
| 10,000       | 39ns  | 24ns | 41ns | 67ns  | 87ns  |
| 100,000      | 68ns  | 97ns | 60ns | 69ns  | 105ns |
| 1,000,000    | 125ns | 61ns | 70ns | 83ns  | 122ns |
| 10,000,000   | 138ns | 54ns | 61ns | 78ns  | 132ns |
| 100,000,000  | 191ns | 89ns | 96ns | 127ns | 158ns |

As before, a lot of overhead here is due to sorting, but we've also cut into the actual query times. If we check out the times without counting sorting against the queries, we see just how much is sorting:

| keys \ batch | 1     | 10   | 100  | 1,000 | ALL  |
|--------------|-------|------|------|-------|------|
| 1,000        | 26ns  | 19ns | 10ns | 11ns  | 11ns |
| 10,000       | 34ns  | 17ns | 14ns | 12ns  | 8ns  |
| 100,000      | 82ns  | 23ns | 22ns | 37ns  | 10ns |
| 1,000,000    | 65ns  | 42ns | 27ns | 25ns  | 6ns  |
| 10,000,000   | 134ns | 42ns | 34ns | 33ns  | 8ns  |
| 100,000,000  | 189ns | 76ns | 65ns | 64ns  | 9ns  |

These times are now much closer to OrdLayer for large query batches, especially ALL, and are up to an order of magnitude faster than OrdLayer for queries against large key counts. In fact, the apparently rough single-query batches seem to be an artifact of re-initializing the cursor more than anything; if we re-use the same cursor (the HashCursor supports forward and backward navigation) the times become roughly the same as the batch size 10 numbers, which makes some sense because 10 keys out of 100 million don't really have any more locality than one key out of 100 million. What doesn't make sense is why constructing a cursor breaks performance (it does very little).

Our times are also similar to, and occasionally better than the HashMap times. My understanding of HashMap is that it cynically optimizes for "misses", by storing values separately from keys. This means that when we do successful look-ups and want the value, there is another memory dereference to do. We stash the values with the key, so once we've found the key we have the data at hand. That's my best interpretation of the difference between the two. We also don't use siphash, which could make a difference. This isn't meant to be a critique of HashMap, so don't get too excited about the difference.

HashLayer<OrdLayer<WeightedLayer>>

Let's try out what happens when we put a spurious bogus layer in place. We are going to use a collection of type

	HashLayer<Uniform, OrdLayer<(), WeightedLayer<usize>>>

which just adds a bogus () value. This is meant to represent the world in which we just have keys, no values, but have to suffer due to an implementation that assumes we must have values.

| keys        | sort  | build | merge |
|-------------|-------|-------|-------|
| 1,000       | 320ns | 204ns | 28ns  |
| 10,000      | 68ns  | 58ns  | 27ns  |
| 100,000     | 53ns  | 82ns  | 55ns  |
| 1,000,000   | 44ns  | 73ns  | 43ns  |
| 10,000,000  | 49ns  | 75ns  | 41ns  |
| 100,000,000 | 52ns  | 60ns  | 67ns  |

These numbers aren't actually that bad. I was hoping they would be worse, actually. Oh well. The build and merge times seem to creep upwards slightly, except for some heroics building the 100 million element collection. But it isn't as bad as we might have worried.

Looking things up will not behave any differently until we actually try to track down some values, which we haven't done just yet (we've just been calling seek and confirming we found the key). But, it's a pretty easy experiment to do, so let's compare the two. I've done the version where we don't charge sorting to the query, to try and make the distinction the clearest.

First, the previous implementation without the bogus OrdLayer<(),_> gumming up the works:

| keys \ batch | 1     | 10    | 100   | 1,000 | ALL  |
|--------------|-------|-------|-------|-------|------|
| 1,000        | 41ns  | 22ns  | 16ns  | 12ns  | 13ns |
| 10,000       | 42ns  | 19ns  | 16ns  | 14ns  | 9ns  |
| 100,000      | 142ns | 31ns  | 35ns  | 53ns  | 10ns |
| 1,000,000    | 205ns | 63ns  | 46ns  | 43ns  | 9ns  |
| 10,000,000   | 233ns | 68ns  | 74ns  | 54ns  | 9ns  |
| 100,000,000  | 351ns | 146ns | 126ns | 113ns | 11ns |

Notice that these numbers are for sure worse than when we didn't look at any of the associated data. This is due to the additional dereference we need to do to find the data. It becomes less of a problem with ALL keys, because our sequential scan of the keys also turns into a sequential scan of the values.

Now with the implementation with the bogus layer containing () values:

| keys \ batch | 1     | 10    | 100   | 1,000 | ALL  |
|--------------|-------|-------|-------|-------|------|
| 1,000        | 50ns  | 27ns  | 16ns  | 14ns  | 24ns |
| 10,000       | 54ns  | 25ns  | 23ns  | 18ns  | 13ns |
| 100,000      | 206ns | 129ns | 93ns  | 62ns  | 14ns |
| 1,000,000    | 287ns | 103ns | 87ns  | 78ns  | 11ns |
| 10,000,000   | 440ns | 128ns | 103ns | 119ns | 11ns |
| 100,000,000  | 515ns | 232ns | 191ns | 197ns | 13ns |

These numbers are definitely worse. Not catastrophic, but definitely worse. Add to this the fact that we hold on to an extra usize for each key, 800 MB for the 100 million key case, and I'm comfortable with the effort spent making it possible to use the leaner representation.

Fancy merging

What about our fancy memcpy merging? Let's evaluate that by doing some merging on two different pairs of collections. The first pair of collections will have all the even keys in one collection and all the odd keys in the other; the second pair will have one contiguous range of keys in one collection and the next contiguous range in the other. The former pair alternate keys and will be annoying to merge, whereas the latter pair can largely just copy their ranges over.

| keys        | alternating | contiguous |
|-------------|-------------|------------|
| 1,000       | 73ns        | 36ns       |
| 10,000      | 57ns        | 22ns       |
| 100,000     | 59ns        | 23ns       |
| 1,000,000   | 53ns        | 26ns       |
| 10,000,000  | 55ns        | 25ns       |
| 100,000,000 | 67ns        | 31ns       |

This shows off what sorts of gains we might pick up as we copy over larger regions. We have picked the very best possible case to show this off, so it remains to be seen how cool it is in practice.

6. Conclusions, next steps.

It seems like we have a decent toolkit to start building up interesting index structures. My first plan is to try and slot in a

	HashLayer<Key, OrdLayer<Val, WeightedLayer<Time>>>

into differential dataflow and see how that looks. Of course, there is a bunch of glue to write, and perhaps some re-factoring of interfaces if pain points get discovered.

It might also make sense to push this code as a separate crate, then everyone could .. I'm not sure what you actually do with this, really.