Differential dataflow internals; a work in progress

This post is really meant to be a series of posts about how to represent data managed in differential dataflow. As they are topically related and meant to build upon one another, I thought putting them in one place would help record and explain what I was thinking when implementing them. They may also be helpful for folks thinking about building interesting things in Rust!

In the spirit of "append only updates", this post will largely be append-only (modulo errors, edits). I'm also going to try and make the development process largely append-only as well: new implementations will be added to older ones, rather than replacing them, allowing us to keep track of where we were and how far we've come. This will probably break at various points, but that is the intent!

This series is going to start out intentionally pedantic and simple, to frame the problem in the simplest language and implementations. We will try and become smarter as we move along, but let's start with simple, and evaluate things as we go. Here is the planned roadmap:

I'd like to think I know exactly where this is going to end up (I have a plan), but for now let's just pretend I know the general direction, and will tell you each time we take a meaningful step.

Part 0: Trait definitions; defining what we require.

One of the things I like doing most when writing Rust code is to write trait definitions. These are the "interfaces" of Rust, which indicate the methods and types and such required in order to actually implement "a thing". For example, we are going to spend some time defining and implementing collections, and we need to specify what methods and such these collections need to provide to be generally useful.

One nice aspect of Rust is that just writing trait definitions gives you some insight into how your implementations are going to need to work. For example, all methods in Rust clearly indicate the ownership of input arguments and returned output. Ownership distinguishes between variable bindings that own the data, and can do pretty much whatever with it (including de-allocating it), and variable bindings that only reference the data (the sigil & indicates references at work). Just writing down the method signatures gives some clarity about which types own which data, who gets to refer to them, and for how long they will be around. Each of these decisions constrains your implementation, and even at this point it helps you think out which details implementations should commit to, and which they should be able to hide.

An example trait definition

Let's try out a trait definition for the "collection trace" functionality, something that differential dataflow uses. In order to not make you angry with details I'm not going to start with the final version. I hope this ends up clearer than otherwise, and I suspect once you see the result you may agree.

Our collection trace tracks tuples (key, val, time, delta) of type (Key, Val, Time, isize) where the types Key, Val, and Time get to be chosen by the person who is instantiating our type. These generic parameters are indicated in the name of the trait, written as part of its definition as:

pub trait Trace<Key, Val, Time> {

	// methods and stuff goes here

}

In fact our trace is not going to work for any arbitrary types; there are constraints that the types need to satisfy. For example, if we want to find keys we had best be able to test keys for equality, right? Equality testing is defined by a trait, Eq. We can add the constraint that whichever type Key gets chosen implements equality testing by adding a constraint:

pub trait Trace<Key: Eq, Val, Time> {

	// methods and stuff goes here

}

See how we added : Eq after Key? That means that the choice of Key needs to also implement the trait Eq, which is great to know because from this point forward we can rely on it being true, and write things like if key1 == key2 without knowing yet which types of key we will need to use.

Our trace actually has different constraints; the keys and values are going to be ordered, and the times must form a lattice, which is like a not-exactly-ordered set (it is a partially ordered set with some additional structure).

pub trait Trace<Key: Ord, Val: Ord, Time: Lattice> {

	// methods and stuff goes here

}

Even just writing this much is helpful, because we have sorted out (or started to, at least) what needs to be true about the types we will use.
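As an aside, the Lattice trait gets used throughout this post without being spelled out. In differential dataflow it is, roughly, a partially ordered type equipped with least upper bounds and greatest lower bounds; the sketch below is illustrative (it leans on std's PartialOrd to stay self-contained) rather than the exact trait.

pub trait Lattice: PartialOrd {
	/// The least upper bound of `self` and `other`.
	fn join(&self, other: &Self) -> Self;
	/// The greatest lower bound of `self` and `other`.
	fn meet(&self, other: &Self) -> Self;
}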

How about adding some methods?

Yes, let's add some methods. There are two things most collections do: add data and get data. I'm going to write some over-simple versions of these that would be great, but lack some flexibility. Our data happens to be tuples (key,val,time,delta), so we are going to add and retrieve vectors of those.

pub trait Trace<Key: Ord, Val: Ord, Time: Lattice> {

	/// incorporates a new bunch of differences into the trace.
	fn add(&mut self, diffs: Vec<(Key, Val, Time, isize)>);

	/// retrieves differences associated with a specified key.
	fn get(&mut self, key: &Key) -> Vec<(Key, Val, Time, isize)>;

}

You may notice the & sigil hanging around in a few places. We will get to that, but what it is saying is that when we call add or get, we are only temporarily working with self, the trace. Once the method returns, control of self returns too. Similarly, the use of key is only temporary access to the key.

This contrasts with diffs in add: there is no & in front of the type, which means that diffs is actually an owned vector; we get to do whatever we want with it (add things to it, de-allocate it, etc). This ownership is pretty handy, because it is transitive: diffs also owns all of the contents of the vector, meaning all those keys and values and times and stuff. We will want to put those in our collection!
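As a quick aside, here is a minimal sketch (mine, not from the post) of the difference, with some concrete types plugged in:

// `consume` owns its argument and may do anything with it, including moving
// its contents into a collection or dropping it; `inspect` only borrows, and
// the caller keeps ownership.
fn consume(diffs: Vec<(String, String, u64, isize)>) {
	for (key, val, time, delta) in diffs {
		// key, val, time, and delta are owned here, and may be moved elsewhere.
		let _ = (key, val, time, delta);
	}
}

fn inspect(diffs: &Vec<(String, String, u64, isize)>) {
	for &(ref key, ref val, ref time, delta) in diffs {
		// key, val, and time are only references; delta is copied out.
		let _ = (key, val, time, delta);
	}
}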

Looks pretty good! What's not to like? Well, there are a few things.

Type parameters: letting the user choose

Adding and returning vectors is not unreasonable, but putting that detail in the trait definition limits our implementations. Does add really need the results to be materialized in a vector, and does get really need to allocate and return ownership of memory?

Not really. I mean, they could and that might be useful, but we could be more general by indicating that all add really needs is a way to enumerate the items you would like to add, and all get really needs to return is a way for you to enumerate the results. These could be backed by vectors, but they don't have to be. There will be better choices, it will turn out.

We have already seen the techniques that we will use to improve the add method: generic type parameters. The method can introduce a new generic type parameter, Iter, whose only requirement is that it implements the Iterator trait with appropriately typed items, and then accepts diffs as an Iter rather than a vector:

pub trait Trace<Key: Ord, Val: Ord, Time: Lattice> {

	/// incorporates a new bunch of differences into the trace.
	fn add<Iter: Iterator<Item=(Key, Val, Time, isize)>>(&mut self, diffs: Iter);

	/// retrieves differences associated with a specified key.
	fn get(&mut self, key: &Key) -> Vec<(Key, Val, Time, isize)>;

}

All that we have done here is said that when one implements add one needs to be able to respond to arbitrary iterator types, without relying at all on the concrete type (no Vec-specific methods, for example). This has the advantage that when we use the method, we can use whatever iterators we like, including ones that don't stage the results in one vector before inserting them (e.g., timely dataflow delivers messages as small batches, and we may want to insert the results of concatenating all of these, without copying them to another location in memory first).
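For example, a caller could hand add the concatenation of two batches without first building a combined vector. This is my own sketch against the trait as written so far, not code from the post:

// feed two batches to `add` by chaining their iterators; nothing is copied
// into a temporary vector before `add` sees the records.
fn add_two_batches<K: Ord, V: Ord, T: Lattice, Tr: Trace<K, V, T>>(
	trace: &mut Tr,
	batch1: Vec<(K, V, T, isize)>,
	batch2: Vec<(K, V, T, isize)>,
) {
	trace.add(batch1.into_iter().chain(batch2.into_iter()));
}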

Before you get all "lol iterators", these type constraints mean that methods are available statically, and can be compiled down to roughly the sort of for-loops you would write by hand. Sometimes they are even better (sometimes they are worse). But there are real performance differences between an iterator that reads a list of blocks of data, and an implementation that first copies them to a second contiguous allocation.

Associated types: letting the implementation decide

What if the get method wants to return an iterator, but be coy about which specific iterator type it returns? It doesn't really make sense for the user to ask for an iterator type, because it is the implementation that has to determine what gets returned. Rather, it makes sense for the implementation to define the type of the result.

In Rust traits may define "associated types", which are a bit dual to the generic type parameters that users get to specify. These are types that the implementor of the trait determines, and other than constraints on the types nothing else is exposed through the interface of the trait. We can indicate that an implementor of Trace must name a type for the output of get and that this type must implement the Iterator trait with appropriate items, just like so:

pub trait Trace<Key: Ord, Val: Ord, Time: Lattice> {

	/// incorporates a new bunch of differences into the trace.
	fn add<Iter: Iterator<Item=(Key, Val, Time, isize)>>(&mut self, diffs: Iter);

	/// An iterator defined by the implementation to produce results.
	type OutputIterator: Iterator<Item=(Key, Val, Time, isize)>;

	/// retrieves differences associated with a specified key.
	fn get(&mut self, key: &Key) -> Self::OutputIterator;

}

Notice how get now returns Self::OutputIterator, a type totally unknown to us at this point, with only the promise that any implementor of Trace will need to be more specific. That's fine, we can be more specific when we implement things, and Rust can stitch everything together then. The flexibility the implementors get to choose different iterator implementations is great, though.

Returning references: learning about lifetimes

Ok, brace yourselves.

The signatures of the methods have so-far involved owned data. When get returns an iterator whose items are tuples (key, val, time, isize) it yields tuples whose contents are owned by the recipient. That is great for the recipient, who can now do whatever they like with the results, but it can be expensive and restrictive for the implementation.

To make this more concrete, imagine that the key type is u64 and the value type is String. A String is dynamically sized, depending on how much stuff you stuck in there. That means that a String involves some dynamically allocated memory, which is usually a bit annoying to go and grab more of when you don't need to. But, through our signatures we have promised to return actual owned String types, meaning the responsibility for the memory involved is passed from the trace implementation to the recipient. Unless the trace is getting rid of its copy of the string (no) it probably needs to make a copy, because you might decide to call .all_caps() on it.

If the recipient (you) just needed to look at the String, you might feel a bit silly for forcing all these allocations just to subsequently de-allocate them.

This is where Rust's references come in. References are not owned data, but rather "references" to data owned by someone else. When you get a reference, you get to look at it, maybe even mutate it, but once you are done with it (and you do have to be done with it, eventually) the owner gets control back. Rust has a bunch of rules in place to make sure that when you borrow a reference no one else mutates the referenced data, and part of the joy of Rust is learning to interpret the various reasons Rust won't let you do the things you wanted to do with references.

Let's investigate this 'joy' (50-50 sarcastic-serious, so single quotes) by trying to return references to key, value, and time data, rather than owned instances of these types.

pub trait Trace<Key: Ord, Val: Ord, Time: Lattice> {

	/// incorporates a new bunch of differences into the trace.
	fn add<Iter: Iterator<Item=(Key, Val, Time, isize)>>(&mut self, diffs: Iter);

	/// An iterator defined by the implementation to produce results.
	type OutputIterator: Iterator<Item=(&Key, &Val, &Time, isize)>;

	/// retrieves differences associated with a specified key.
	fn get(&mut self, key: &Key) -> Self::OutputIterator;

}

We just put & in front of the results in the OutputIterator constraint. Aside: I didn't put a reference in front of the isize because we know we can always just copy that data.

This doesn't work, and Rust tries to help us out by saying:

src/lib.rs:44:45: 44:49 error: missing lifetime specifier [E0106]
src/lib.rs:44     type OutputIterator: Iterator<Item=(&Key, &Val, &Time, isize)>;
                                                      ^~~~
src/lib.rs:44:45: 44:49 help: run `rustc --explain E0106` to see a detailed explanation

What is a lifetime specifier? You could totally run rustc --explain E0106 to find out, and I was going to copy/paste it here but it is really quite long. Instead, I will try and explain.

Rust needs to be able to determine that references to the same data are not active at the same time, because that is what guarantees that no one is messing with your data while you are looking at it (or looking at your data while you mess with it). This is done through the concept of "lifetimes", an indication of "for how long" the reference is in play. Here "how long" is measured in "parts of your code" rather than something like real time.

For each reference you take, Rust infers the span of code for which the reference is live, and bakes that into the type. The problem here is that we haven't told Rust enough for it to figure out the lifetime for these references. Let's write something slightly less clever, but which shows where the lifetimes would come from; we will temporarily return to the more innocent time where we returned a Vec output rather than an iterator:

	/// retrieves differences associated with a specified key.
	fn get(&mut self, key: &Key) -> Vec<(&Key, &Val, &Time, isize)>;

This almost works. If it weren't for that key argument, Rust would be able to figure out the lifetimes of our outputs.

Why? Because the only things the results could possibly refer to are the references supplied as inputs (or references available through them, transitively). What else could these references possibly refer to that will still be valid after the method returns?

What is going on here, or would be going on without the key argument, is what Rust calls "lifetime elision", some handy rules Rust uses to remove the need to explicitly write lifetimes everywhere. But you can write them explicitly, and it is helpful to see them to understand what Rust is actually doing. Let's fix that method up above with the right generic lifetimes:

	/// retrieves differences associated with a specified key.
	fn get<'a, 'b>(&'a mut self, key: &'b Key) -> Vec<(&'a Key, &'a Val, &'a Time, isize)>;

Whoooooaaaaa! All of the & things have some more noise around them!

The most important thing to notice above is that get has two generic parameters, 'a and 'b, which are lifetime parameters. Their roles are to capture the lifetime information about the references supplied by the caller, and to say that in the vector of results, all those references have the same lifetime as the self reference rather than the key reference.

This makes a lot of sense, right? The references are totally into the trace, rather than at the query key. It's cool, all we need to do is tell Rust that. In exchange, Rust will make sure no one accidentally puts key into the result vector, as its referent might get mutated or invalidated as we spin around a loop.

You wouldn't do that, of course. I trust you, but Rust doesn't.

So, we need to put some 'a things in front of our references to make Rust happy. For "reasons", we need to add it as a generic parameter for the trait, rather than the method. What reasons? Because the OutputIterator type constraint depends on 'a, not just the get method, and Rust doesn't yet allow associated types to have generic parameters (part of "higher kinded types" or something).

pub trait Trace<'a, Key: Ord, Val: Ord, Time: Lattice> {

	/// incorporates a new bunch of differences into the trace.
	fn add<Iter: Iterator<Item=(Key, Val, Time, isize)>>(&mut self, diffs: Iter);

	/// An iterator defined by the implementation to produce results.
	type OutputIterator: Iterator<Item=(&'a Key, &'a Val, &'a Time, isize)>;

	/// retrieves differences associated with a specified key.
	fn get(&'a mut self, key: &Key) -> Self::OutputIterator;

}

You might be a bit stressed about "how can we know which 'a we will need", to which the answer is just "no worries, we will implement the trait for all 'a", which is a great thing that computers can do.

What is the last remaining issue? The type parameters Key, Val, and Time need some more constraints on them now.

src/lib.rs:33:1: 66:2 error: the parameter type `Key` may not live long enough [E0309]
src/lib.rs:33 pub trait Trace<'a, Key: Ord, Val:Ord, Time: Lattice> {
              ^
src/lib.rs:33:1: 66:2 help: run `rustc --explain E0309` to see a detailed explanation
src/lib.rs:33:1: 66:2 help: consider adding an explicit lifetime bound `Key: 'a`...
src/lib.rs:33:1: 66:2 note: ...so that the type `Key` will meet its required lifetime bounds

What Rust is telling us now is that because we are thinking about returning types like &'a Key we have to promise that Key itself is actually valid for all of the lifetime 'a. We could make Key something horrible like &'b String for some other 'b, at which point the reference &'a Key wouldn't actually be to valid data. You probably weren't thinking you would do this, but Rust was. Rust is protecting you from weird creative you, and from your weird creative co-workers, and the weird creative unintended interactions of your code with theirs.

The fix is pretty simple:

pub trait Trace<'a, Key: Ord+'a, Val: Ord+'a, Time: Lattice+'a> {

	/// incorporates a new bunch of differences into the trace.
	fn add<Iter: Iterator<Item=(Key, Val, Time, isize)>>(&mut self, diffs: Iter);

	/// An iterator defined by the implementation to produce results.
	type OutputIterator: Iterator<Item=(&'a Key, &'a Val, &'a Time, isize)>;

	/// retrieves differences associated with a specified key.
	fn get(&'a mut self, key: &Key) -> Self::OutputIterator;

}

The constraint Key: Ord+'a means that Key needs to implement Ord and live at least as long as 'a, which mostly just means it can't contain references that might last less long. Typically Key is going to be some owned data like u64 or String or even Vec<Vec<Option<String>>>, all of which satisfy this constraint.

An observation

This whole section may give some insight into the Rust debugging process. We have done a lot of up-front work defining our trait, and scoping out several potential issues that Rust has already forced us to fix by making our method prototypes more specific. I would say that the majority of my time debugging Rust code is spent like this. I've literally spent zero time in an actual debugger, and once written while not necessarily correct, your code pretty much does what it says.

Part 1: An obvious implementation based on HashMap

Just to start things out, and before trying to be clever at all, let's write an implementation of our trait using Rust's built in HashMap. This implementation will be pretty simple, but wasteful with respect to memory usage, and misses out on some important properties we will want for later on. But, it is a simple example of perhaps the most obvious implementation.

This also gives us a chance to see how to implement traits in Rust, before we start doing harder things.

To implement a trait for a type, here a hashmap from keys of type K to vectors of (V, T, isize) elements, you write something like so:

impl<'a, K, V, T> Trace<'a, K, V, T> for HashMap<K, Vec<(V, T, isize)>> {

	// implementation of methods and stuff goes here

}

We just need to add constraints on the type parameters and implement the methods, and we should be good to go.

Implementing the methods

We have two methods to implement, add and get. We know what they are going to look like because of the trait's signature, so let's write down some empty bodies.

impl<'a, K, V, T> Trace<'a, K, V, T> for HashMap<K, Vec<(V, T, isize)>> {

	fn add<Iter: Iterator<Item=(K, V, T, isize)>>(&mut self, diffs: Iter) {
		unimplemented!()
	}

	fn get(&'a mut self, key: &K) -> Self::OutputIterator {
		unimplemented!()
	}

}

These unimplemented!() things are macros indicating the code isn't written yet. They are totally optional, but allow your code to compile without actually returning the right thing yet (Self::OutputIterator, which is also not written yet).

The add method

Here is what I would do for the add method:

	fn add<Iter: Iterator<Item=(K, V, T, isize)>>(&mut self, diffs: Iter) {
		for (key, val, time, delta) in diffs {
			self.entry(key)
				.or_insert(Vec::new())
				.push((val, time, delta));
		}
	}

There is some syntactic sugar in the form of the for .. in .. construct for iterators; because Iter implements Iterator, we can just use the for loop.

We've also used HashMap's functionality to pretty ergonomically try to find the entry associated with a key, create one if it doesn't exist, and then push the value, time, and delta triple to the end of the list.

This was pretty painless, but we got lucky in a few ways. The entry method requires an owned instance of K because if the entry doesn't exist we need to add it, and the hash map will need to hold on to a copy of the key in that case. However there is only one owned instance of the key for each element in diffs, so we were pretty smart to only stash the (val, time, delta) triple.

If we tried to use the key twice we would get an error from Rust indicating that we are trying to re-use a variable which has been moved somewhere else. We could consider copying or cloning the key, but that functionality lies behind traits we haven't required of the type K yet.

The get method

The get method is going to turn into a pain in the butt, mostly because of the restrictive way Rust's HashMap is implemented. We will fight against it a little bit, explain how it could be done, and then walk away because we don't really want to be using HashMap anyhow.

Here is what we would like to write:

	fn get(&'a mut self, key: &K) -> Self::OutputIterator {
		Self::OutputIterator::new(self.get(key))
	}

This looks up key in self and returns an Option<&'a Vec<(V, T, isize)>>. We can use that reference in our output iterator! In particular, a &'a Vec<(V, T, isize)> gives us access to all the &'a V and &'a T references contained therein.

The problem is we don't have a &'a K anywhere. That key argument is a reference, but not with the right lifetime (remember the cautionary tale from the previous section?). The key obviously exists in the HashMap, but we can't get a reference to it because HashMap's interface is a bit restrictive. There is a way to get a reference to a key once you've used the entry method up above, but that requires an owned key to call in the first place. We can't even stash a second copy of the key in the list with the (V, T, isize) tuples because we can't copy or clone it.

Sucks.

We could consider changing the iterator in the trait definition to be

	type OutputIterator: Iterator<Item=(&'a V, &'a T, isize)>;

because we know that users must have a &K on hand to call the get method. That isn't a horrible idea, but it is limiting; the user has to make sure to match up their keys with the iterators' values, even though we have a reference to the right key in the hash map.

Since we aren't actually going to use this version of the code, let's imagine that the HashMap's get method returned a reference to both the key and the value, rather than just the value. We still get to write:

	fn get(&'a mut self, key: &K) -> Self::OutputIterator {
		Self::OutputIterator::new(self.get(key))
	}

because we are just passing the buck to the OutputIterator type's new method, which we will see next.

Implementing the types

The OutputIterator type isn't going to be wildly complicated. In fact, we could possibly borrow some existing iterator implementations, but we will just roll our own to see how it works.

The first thing to do is to define the struct that holds the data we will need for iteration.

/// iterates over tuples of references
pub struct OutputIter<'a, K:'a, V:'a, T:'a> {
	data: Option<(&'a K, &'a Vec<(V, T, isize)>, usize)>
}

The data field here is, optionally, a triple of: key reference, vector reference, and a usize for remembering where we are in the vector. As we advance through the iterator we will increment this usize, and stop when it reaches the length of the vector.

The field is an Option type, which can either be Some(x) for some data x, or None indicating that there is nothing there. The option is important here because get does need to return a valid instance of OutputIter even when the key is absent and no &'a K even exists. We could change the get method in our trait to return an Option<Self::OutputIterator>, or we can just make our iterator return nothing.

Let's first write the new method we know we need:

impl<'a, K:'a, V:'a, T:'a> OutputIter<'a, K, V, T> {
	pub fn new(input: Option<(&'a K, &'a Vec<(V, T, isize)>)>) -> Self {
		OutputIter {
			data: input.map(|(key, vec)| (key, vec, 0))
		}
	}
}

Here we see for the first time struct field initialization; the data field of an OutputIter. We take in the type produced by HashMap's hypothetical get method, and use map which applies some logic to the internals of an Option type, where in this case we've just put a zero in for the current position. In case input is None, we just end up with a None output and no logic applied.

All we need to do to make OutputIter into an iterator is implement the Iterator trait. It's really not that hard; I've written the skeleton here to give you a sense for all that is needed:

impl<'a, K:'a, V:'a, T:'a> Iterator for OutputIter<'a, K, V, T> {

	type Item = (&'a K, &'a V, &'a T, isize);

	fn next(&mut self) -> Option<Self::Item> {

	}

}

An implementation needs to indicate the type of the Item it will enumerate, and a function that produces the next item or None if the iterator has finished.

impl<'a, K:'a, V:'a, T:'a> Iterator for OutputIter<'a, K, V, T> {

	type Item = (&'a K, &'a V, &'a T, isize);

	fn next(&mut self) -> Option<Self::Item> {
		if let Some((ref key, ref vec, ref mut pos)) = self.data {
			if *pos < vec.len() {
				*pos += 1;
				let vtd = &vec[*pos-1];
				Some((*key, &vtd.0, &vtd.1, vtd.2))
			}
			else {
				None
			}
		}
		else {
			None
		}
	}
}

The if let construction is a neat thing Rust borrowed from Swift (perhaps "copied" would be more accurate, or "cloned" depending on your views on whether ideas have owners). The construct fires if the pattern matches, in this case if self.data is in fact something. At the same time, it binds key, vec, and pos to references to the fields of the something. In the case of pos it is even a mutable reference. The ref keyword might be a bit confusing, but I guess the Rust folks thought it was better than *, which is logically what you want there (bind key, vec, pos to things that dereference to the fields).

This trait implementation is all it takes to be an iterator.

If you use this type in your code, perhaps by calling get on our HashMap implementation of Trace, Rust will go track down this code and use it directly. For example, imagine your value and time types are stack-allocated, like maybe u64s or something, and you write the following code:

let mut dataz = Vec::new();
for (k,v,t,d) in trace.get(&my_key) {
	dataz.push((v.clone(), t.clone(), d));
}

In principle Rust (via LLVM) is able to realize that (i) if the iterator's Option is Some it stays Some and that test can be hoisted out of the loop, (ii) the "increment and test against length" pattern is just walking through an array, and (iii) your value, time, and delta accesses have the same layout as where they are stored. All of these combine to allow LLVM to optimize your loop down to a conditional memcpy, just checking if your key is present in the HashMap and then memcpying the results if so.

That's pretty neat. It's the sort of thing you might implement by hand in other languages in order to have awesome performance. Instead, you can spend your time writing blog posts about how you don't necessarily need to do that any more.

The implementation

In addition to the definition and implementation of OutputIter, here is our implementation of Trace.

impl<'a, K:Ord+'a, V:Ord+'a, T:Lattice+'a> Trace<'a, K, V, T> for HashMap<K, Vec<(V, T, isize)>> {

	fn add<Iter: Iterator<Item=(K, V, T, isize)>>(&mut self, diffs: Iter) {
		for (key, val, time, delta) in diffs {
			self.entry(key)
				.or_insert(Vec::new())
				.push((val, time, delta));
		}
	}

	type OutputIterator = OutputIter<'a, K, V, T>;

	fn get(&'a mut self, key: &K) -> Self::OutputIterator {
		Self::OutputIterator::new(self.get(key))
	}

}

One last error: to use K as a key in a HashMap it must implement ::std::hash::Hash, the trait for hashing. So, we just + that into our constraints on K and we are good to go!
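For concreteness, the impl header with the extra bound reads like so (just the header; the body is unchanged from above, and Hash is std::hash::Hash):

impl<'a, K: Ord+Hash+'a, V: Ord+'a, T: Lattice+'a> Trace<'a, K, V, T> for HashMap<K, Vec<(V, T, isize)>> {

	// add, OutputIterator, and get exactly as written above.

}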

Except that HashMap doesn't actually have the get method we wanted. Oh well.

Wrap-up

I hope this was somewhat informative about how to implement things in Rust. We saw some horrible sticky details in the interface mismatch between HashMap and Trace, where obviously I like my version more but the Rust folks probably have opinions too. My recollection is that they felt the ergonomics of get were important as HashMap would be used a lot, and so while expressivity is good too, overwhelming people in such a commonly used class might suck.

In addition to the limited interface, there are other things not to like about a HashMap implementation. A HashMap uses a lot more memory than we want to use; if a key has just one value, well you can do the math on how much this requires (hint: lots). Also, HashMap maintains its keys in something of a weird order; it makes sense when you reverse it out, but it makes it painful to pre-arrange keys to provide sequential access. The Vec values also result in lots of random access as locality in the hash map means little for the memory they manage.

Part 2: A very naïve version based on sorted lists.

Let's write a relatively naïve implementation of our Trace trait using sorted lists.

How is this going to work? Sorted lists aren't a standard data structure for maintaining indexed data sets. If we had a sorted list of all the data we could look things up, using binary search or something similar, but how do we update a list? If we need to keep it in order that can be very expensive, right?

One standard approach, and I'm not sure who gets credit for it, is to maintain not one but multiple sorted lists of varying sizes. When we add new records we merge them into the shortest of the lists, and as the short lists get bigger we merge them with the larger lists. This has the downside that we need to logically merge the lists when we look at them, and the more concrete downside that the values for a key are not simply a contiguous &[(K,V,T,isize)] slice. At the same time, with updates that probably wasn't going to happen anyhow.

The strategy we are going to use is to merge any lists whose lengths are within a factor of two. This ensures that we have at most a logarithmic number of such lists and that the total work for each element is also logarithmic.

Here is our noble data structure, a list of lists:

pub struct Lists<K, V, T> {
	lists: Vec<Vec<(K, V, T, isize)>>,
}

We will set this up so that the largest lists come first, and smaller lists later on. This makes it easy to push and pop things at the end of the list, which is a bit easier for Rust (whose Vec type supports vacancies at the end of its allocation, but not at the beginning).

A quite naïve implementation

We now have some functions to implement, naïvely of course, for the Trace trait.

Adding elements

We can be a bit lazy for now and make adding elements relatively simple. Let's do that, and we will improve the behavior once we start looking at performance (just a bit further down this section).

fn add<Iter: Iterator<Item=(K,V,T,isize)>>(&mut self, diff: Iter) {

	// collect iterator contents into a temporary buffer.
	let mut buffer: Vec<(K,V,T,isize)> = diff.collect();

	// while the next smallest list is less than twice as large as the buffer, merge it in.
	while self.lists.len() > 0 && self.lists[self.lists.len()-1].len() < 2 * buffer.len() {
		buffer.extend(self.lists.pop().unwrap().into_iter());
	}

	// sort the buffer and push it on the end of our lists.
	buffer.sort();
	self.lists.push(buffer);
}

The main bit of laziness here is that rather than merge sorted lists, which we should be able to do in time linear in the length of the lists, we are just concatenating and sorting the lists. It's worse by a logarithmic factor, asymptotically, but for the moment it is a bit easier to write. Plus if we don't do it now, we can see the difference when we fix the implementation!

Another bit of laziness that we will need to address, eventually, is that of consolidation: if we have two records (k,v,t,1) and (k,v,t,-1) the two should be accumulated and canceled out. More generally, whatever their delta values are, they should be added up and put into a single record. This will be pretty easy with sorted lists, and was set to be a big pain in the bottom with the HashMap implementation from before. As it turns out, consolidation can be a nice performance win too: despite doing more work, it would seem, we can often reduce the amount of data we are working with.
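To make consolidation concrete, here is a sketch of what it might look like applied to one sorted vector; this is my own illustration rather than the code we will eventually use:

/// Sorts a batch and collapses records with identical (key, val, time) fields,
/// dropping any record whose accumulated delta is zero.
fn consolidate<K: Ord, V: Ord, T: Ord>(list: &mut Vec<(K, V, T, isize)>) {
	list.sort_by(|x, y| (&x.0, &x.1, &x.2).cmp(&(&y.0, &y.1, &y.2)));
	let mut cursor = 0;
	for index in 0 .. list.len() {
		if cursor > 0 && list[cursor-1].0 == list[index].0
		              && list[cursor-1].1 == list[index].1
		              && list[cursor-1].2 == list[index].2 {
			// same (key, val, time): accumulate the delta.
			let delta = list[index].3;
			list[cursor-1].3 += delta;
		}
		else {
			// a new (key, val, time): move it up to position `cursor`.
			list.swap(cursor, index);
			cursor += 1;
		}
	}
	list.truncate(cursor);
	// drop records whose deltas accumulated to zero.
	list.retain(|x| x.3 != 0);
}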

Getting elements

The retrieval side of things is a bit more complicated, in part because we have more than one list and in part because we have to define and implement an iterator.

In principle, we don't have to do all that much to define an iterator; for a given key, we should look it up in each list and walk over the elements there. That is fine and dandy, and we will start with that. However, what we actually would like is to iterate over the records in sorted order; we would like to merge the sorted sub-lists, to present the appearance of sorted data. Why? Because we are heading towards a more sophisticated interface where one can navigate rather than iterate through the resulting tuples. More in a bit.

Let's take the simple approach for now. We look up the key in each list, and if it is present we add its range of tuples to a list of such slices. Importantly, we don't actually iterate through all of the values beforehand; if the user doesn't want to use the iterator, the cost to building it shouldn't be large.

fn get(&'a mut self, key: &K) -> Self::OutputIterator {
	// storage for slices
	let mut slices = Vec::new();
	for list in &self.lists {
		let lower = binary_search(list, |x| &x.0 < key);
		let upper = binary_search(list, |x| &x.0 <= key);
		if lower < upper {
			slices.push(&list[lower .. upper]);
		}
	}

	OutputIter {lists: slices }
}

This is what you might imagine, except that if you are familiar with Rust you might be surprised that binary_search is used as a free function, rather than using list.binary_search(key). Unfortunately, Rust's binary search method doesn't find the first matching element, just a matching element. To find the actual range we would need to search backwards and forwards, which goes against our "don't scan the elements unless asked".

Fortunately it is easier to write your own binary search than it is to work around Rust's quirky behavior.

/// Returns the count of the number of elements satisfying `test`.
///
/// Correct behavior relies on true values for `test` coming before
/// false values in `slice`.
fn binary_search<T, F: Fn(&T)->bool>(slice: &[T], test: F) -> usize {

	// bounds on result.
	let mut lower = 0;
	let mut upper = slice.len();

	while lower < upper {
		let mid = (lower + upper) / 2;
		if test(&slice[mid]) {
			lower = mid + 1;
		}
		else {
			upper = mid;
		}
	}

	assert_eq!(lower, upper);
	lower
}

All we need to do now is sort out the OutputIter type and we should be set!

Iterating outputs

Our OutputIter will have a list of sorted slices, and walk through each of them in no particular order. We will improve the order later on.

pub struct OutputIter<'a, K:'a, V:'a, T:'a> {
	lists: Vec<&'a [(K,V,T,isize)]>
}

Now we need to write an iterator implementation. Since we have no particular order required, we should just look at lists and see if it is empty, and if not go and read some element from one of the slices. We then need to advance the slice that we read and clean up any empty slices.

impl<'a, K:'a, V:'a, T:'a> Iterator for OutputIter<'a, K, V, T> {
	type Item = (&'a K, &'a V, &'a T, isize);
	fn next(&mut self) -> Option<Self::Item> {
		if let Some(mut last) = self.lists.pop() {
			let result = (&last[0].0, &last[0].1, &last[0].2, last[0].3);
			last = &last[1..];
			if last.len() > 0 {
				self.lists.push(last);
			}
			Some(result)
		}
		else {
			None
		}
	}
}

This implementation repeatedly pops and pushes things from self.lists, which feels a bit silly and I'm hoping that LLVM will optimize that away. Otherwise it is a bit of a pain to get a mutable reference to the last slice (so it can be advanced) and then mutate the list of lists while still holding on to it. You can, but you have to flatten out the control flow a bit and pass some bools around.

Looking at performance

Despite having a very simple implementation, we might still be interested in how this implementation behaves. In particular, it would be great to know which of the highlighted issues are the main pain points, and maybe which pain points haven't yet been highlighted.

Insertion throughput

Let's see how quickly we can put elements into our trace. Insertion throughput is one of the reasons we are avoiding a HashMap implementation, so we would like this to be reasonably good. We'll start by just trying to understand it.

I've written a profile-lists.rs program that takes a number of keys, number of values, and batch size, and repeatedly introduces batches of records whose keys and values are picked by an increasing counter mod the number of valid possibilities. It prints out after every so-many batches.

A hash map baseline

Before we get started, I ran the computation with a HashMap<K, Vec<(V,T,isize)>> as written up earlier. It doesn't implement the Trace trait, but we can still run it to get a feel. Here is how it looks:

Echidnatron% cargo run --release --bin profile-lists -- 1000 1000 1000 1000
     Running `target/release/profile-lists 1000 1000 1000 1000`
throughput: 16517799.18 elts/sec @ 1000000 elts
throughput: 16582305.02 elts/sec @ 2000000 elts
throughput: 15579539.55 elts/sec @ 3000000 elts
throughput: 16189958.99 elts/sec @ 4000000 elts
throughput: 14807005.80 elts/sec @ 5000000 elts
...
throughput: 8571400.89 elts/sec @ 99000000 elts
throughput: 8587396.41 elts/sec @ 100000000 elts
throughput: 8603516.77 elts/sec @ 101000000 elts
...
throughput: 8138150.61 elts/sec @ 522000000 elts
throughput: 8137500.97 elts/sec @ 523000000 elts
throughput: 8136678.71 elts/sec @ 524000000 elts

At this point the process crossed the 16GB threshold and I stopped the experiment. But, we have a starting baseline for insertion throughput: between 8 and 16 million elements per second. You could probably do better, for example here is a fun read where they are worried about a lot more keys than I used (them: 16M, 256M, 1B; here: 1K).

Just to set the scale, in case you would like to compare to their Figures 2 and 3, which report 20 million inserts per second, here are some HashMap numbers with 16M keys:

Echidnatron% cargo run --release --bin profile-lists -- 16000000 1000 1000 1000
     Running `target/release/profile-lists 16000000 1000 1000 1000`
throughput: 2578612.44 elts/sec @ 1000000 elts
throughput: 2446594.26 elts/sec @ 2000000 elts
throughput: 2782877.84 elts/sec @ 3000000 elts
throughput: 2339393.02 elts/sec @ 4000000 elts
throughput: 2522213.50 elts/sec @ 5000000 elts
...
throughput: 3296886.79 elts/sec @ 99000000 elts
throughput: 3304827.24 elts/sec @ 100000000 elts
throughput: 3312735.90 elts/sec @ 101000000 elts

These numbers start a bit shaky and improve, probably because for the first 16 million elements the hash map is still getting properly sized, but they are substantially less than in the paper above. If we turn things up to 256 million keys, the hash map does this:

Echidnatron% cargo run --release --bin profile-lists -- 256000000 1000 1000 1000
     Running `target/release/profile-lists 256000000 1000 1000 1000`
throughput: 2529228.46 elts/sec @ 1000000 elts
throughput: 2496031.20 elts/sec @ 2000000 elts
throughput: 2821726.46 elts/sec @ 3000000 elts
throughput: 2392483.92 elts/sec @ 4000000 elts
throughput: 2577791.26 elts/sec @ 5000000 elts
...
throughput: 1867181.91 elts/sec @ 120000000 elts
throughput: 1855464.96 elts/sec @ 121000000 elts
throughput: 1839465.73 elts/sec @ 122000000 elts

This falls over quite a bit earlier, because the hash map has a fair amount of per-key state. We could try one billion keys, but you'll notice that we only managed 122 million insertions above, so we didn't even get the opportunity to repeat a key. The same thing should happen with one billion keys.

So, depending on the numbers of keys we see a range of throughputs with the very simple HashMap implementation: from just shy of two million inserts per second, up to 16 million inserts per second. The linked paper shows a pretty consistent 20 million single-threaded inserts per second, even with large numbers of keys, through engineering and possibly some different assumptions.

Our most naïve implementation

Let's slot in our sorted list implementation. Get ready for some worse numbers!

For 1000 keys and 1000 values, batch size 1000, printing every 1000, the numbers look like:

Echidnatron% cargo run --release --bin profile-insert -- 1000 1000 1000 1000 lists
     Running `target/release/profile-insert 1000 1000 1000 1000 lists`
throughput: 2388709.90 elts/sec @ 1000000 elts
throughput: 2111806.18 elts/sec @ 2000000 elts
throughput: 1891017.15 elts/sec @ 3000000 elts
throughput: 1947063.24 elts/sec @ 4000000 elts
throughput: 1706687.56 elts/sec @ 5000000 elts
...
throughput: 1193976.71 elts/sec @ 260000000 elts
throughput: 1196107.05 elts/sec @ 261000000 elts
throughput: 1198720.72 elts/sec @ 262000000 elts

You can see where this is going. Actually, nowhere because it ran out of memory at this point, but the throughput goes down as we run: we are only appending data and never removing anything. Given that our per-record work is "logarithmic", if we keep running forever eventually that is going to increase. The good news is we will run out of memory before it becomes too serious.

These numbers aren't so great. They are about one-third the number for the many-keys HashMap, and one-tenth the numbers for the thousand-keys HashMap. On the positive side, these numbers don't vary at all as we change the numbers of keys, though being consistently slow isn't much of a technical achievement.

One of the reasons the code is slow is that we are re-sorting appended sorted data, rather than merging it. Let's say we try to do merging correctly. I've written the following merging code:

/// Merges two sorted vectors into a new result vector
fn merge<T: Ord>(source1: Vec<T>, source2: Vec<T>) -> Vec<T> {

	// pre-allocate the required amount of space.
	let mut result = Vec::with_capacity(source1.len() + source2.len());

	// convert vectors into peekable iterators.
	let mut iter1 = source1.into_iter().peekable();
	let mut iter2 = source2.into_iter().peekable();

	// while comparisons remain, do them and move elements.
	while iter1.peek().is_some() && iter2.peek().is_some() {
		match iter1.peek().unwrap().cmp(&iter2.peek().unwrap()) {
			::std::cmp::Ordering::Less => {
				result.push(iter1.next().unwrap());
			},
			::std::cmp::Ordering::Equal => {
				result.push(iter1.next().unwrap());
				result.push(iter2.next().unwrap());
			}
			::std::cmp::Ordering::Greater => {
				result.push(iter2.next().unwrap());
			}		
		}
	}

	// drain whatever remains.
	result.extend(iter1);
	result.extend(iter2);

	result
}
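Slotting this into add, as described next, gives something like the following (my sketch of the change, not code lifted from the repository):

fn add<Iter: Iterator<Item=(K,V,T,isize)>>(&mut self, diff: Iter) {

	// collect the iterator contents and sort them once, up front.
	let mut buffer: Vec<(K,V,T,isize)> = diff.collect();
	buffer.sort();

	// while the next smallest list is less than twice as large as the buffer, merge it in.
	while self.lists.len() > 0 && self.lists[self.lists.len()-1].len() < 2 * buffer.len() {
		buffer = merge(buffer, self.lists.pop().unwrap());
	}

	self.lists.push(buffer);
}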

If we replace the buffer.extend(..) line with buffer = merge(buffer, ...) and move the buffer.sort() from the end to just after it is collected, we get some better numbers:

Echidnatron% cargo run --release --bin profile-insert -- 1000 1000 1000 1000 lists
     Running `target/release/profile-insert 1000 1000 1000 1000 lists`
throughput: 6407104.61 elts/sec @ 1000000 elts
throughput: 5731598.78 elts/sec @ 2000000 elts
throughput: 5360619.80 elts/sec @ 3000000 elts
throughput: 5560609.58 elts/sec @ 4000000 elts
throughput: 5005889.36 elts/sec @ 5000000 elts
...
throughput: 4083296.32 elts/sec @ 522000000 elts
throughput: 4087287.71 elts/sec @ 523000000 elts
throughput: 4091650.69 elts/sec @ 524000000 elts

It falls over a bit later on because merge allocates half as much memory as buffer.sort().

We are now doing better than the many-keys HashMap numbers, though not as good as the thousand-keys numbers. This feels pretty good, though if we were to consider lookup times the feeling would pass.

So that we can feel even better about ourselves before we close out our discussion of the very naïve version, let's test a feature that we will eventually want: consolidation of weights for elements with the same (key,val,time) fields. As part of merge, we can compare just the (K, V, T) part of each record, and add weights if they are the same rather than returning both elements.

Let's advance the time every round (one million records) rather than for every record and see what happens:

Echidnatron% cargo run --release --bin profile-insert -- 1000 1000 1000 1000 lists
     Running `target/release/profile-insert 1000 1000 1000 1000 lists`
throughput: 24726955.68 elts/sec @ 1000000 elts
throughput: 24264013.63 elts/sec @ 2000000 elts
throughput: 23780381.90 elts/sec @ 3000000 elts
throughput: 23490919.41 elts/sec @ 4000000 elts
throughput: 23126086.72 elts/sec @ 5000000 elts
...
throughput: 22463200.83 elts/sec @ 998000000 elts
throughput: 22462284.77 elts/sec @ 999000000 elts
throughput: 22462328.48 elts/sec @ 1000000000 elts
...
throughput: 22394475.86 elts/sec @ 9999000000 elts
throughput: 22394548.97 elts/sec @ 10000000000 elts
throughput: 22394751.84 elts/sec @ 10001000000 elts
...

Ok, now we are talking. Twenty million inserts per second. This should actually still degrade, just much more slowly; the process had reached about 304MB at this point, ten billion elements in.

What happened? The data are a bit degenerate: with exactly as many keys as values, there are only one thousand distinct records in each million elements, all with the same time. They get compacted quite well! Each round of one million inserts is compacted down to one thousand records, before being merged with any older elements.

This is perhaps an interesting representation of what a hot set of updates might look like, with periodic transitions to new hot elements. Perhaps not.

We can do a different experiment where we have 16 million keys, one value, and one timestamp, something like the 16 million key example above where we are able to update in place.

Echidnatron% cargo run --release --bin profile-insert -- 16000000 1 1000 1000 lists
   Compiling trace_report v0.1.0 (file:///Users/mcsherry/Projects/trace_report)
     Running `target/release/profile-insert 16000000 1 1000 1000 lists`
throughput: 6919814.42 elts/sec @ 1000000 elts
throughput: 6474400.62 elts/sec @ 2000000 elts
throughput: 6002611.86 elts/sec @ 3000000 elts
throughput: 6110876.73 elts/sec @ 4000000 elts
throughput: 5454596.96 elts/sec @ 5000000 elts
...
throughput: 5055585.25 elts/sec @ 32000000 elts
throughput: 4649927.35 elts/sec @ 33000000 elts
throughput: 4697103.11 elts/sec @ 34000000 elts
...
throughput: 4465313.52 elts/sec @ 999000000 elts
throughput: 4454331.37 elts/sec @ 1000000000 elts
throughput: 4456430.71 elts/sec @ 1001000000 elts
...

In principle this should stabilize once we've seen roughly twice the number of keys (32 million, which is why that line is up there), so that all of the logarithmically many arrays can fill up. It is sitting steady between 600 megabytes and 1GB (up and down a bit, as lists come and go). For 16 million 32 byte records (multiplied: 512MB) this lines up with our expectations perfectly. This is better than our hash table implementation up above, and with a much more compact memory footprint. It is not yet better than the informative linked paper, and may not become so, but I think it is a good start; we have more plans in store!

Look-up throughput

Looking things up is going to be slow. We are going to improve it later on (hint: "index"), but for now it is slow. It is dinner time, so I'm not going to measure things for y'all, but binary search is much slower than looking something up in a hash map.

Part 3: Organizing the data into tries.

Our next stop is organizing our data even further. Sorted lists are pretty organized already, but there is still room to improve the representation and make our lives easier. The main thing that sorted lists don't reveal to us is "where do we find those tuples with an indicated key, or key and value?"

Our new representation of a sorted sequence of (K, V, T, isize) maintains three lists:

  1. A list of (T, isize) pairs, with one entry for each entry in our original list.

    This list is just the list of the T and isize values from our sorted sequence, in the same order.

  2. A list of (V, usize) pairs, with one entry for each distinct (key, val) in our original list.

    This list records the val for each distinct (key, val) pair, and where its associated (T, isize) pairs end. That is, the usize field is the offset in the list above just after the last element starting with (key, val). "Just after" rather than "just at" so that we can use it as an upper bound while iterating.

  3. A list of (K, usize) pairs, with one entry for each distinct key in our original list.

    This list records the distinct keys we see, and where each key's associated (V, usize) pairs end. The usize field is the offset in the (V, usize) list, not the (T, isize) list, just after the last element with the corresponding key.

Normally we think of this as a tree, and write it the other way around, with the keys first, then values, and finally the data bits, because that is the direction we will approach it when we come looking for information.

/// Equivalent to a `Vec<(K, V, T, isize)>`
pub struct Trie<K, V, T> {
	keys: Vec<(K, usize)>,
	vals: Vec<(V, usize)>,
	hist: Vec<(T, isize)>,
}

When we approach this data, we come with a key in mind, and looking only at key-related data (the (K, usize) list) can determine (i) whether the key is present, (ii) how many values there are, (iii) where to find them. If we want to dive deeper, we can start navigating the values, using only value-related data (a slice of the (V, usize) list), and determine the same things for values and times as we learned for keys and values.
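To make the navigation concrete, here is a sketch (mine, reusing the binary_search helper from Part 2) of finding the slice of values for a key. Each stored offset is an upper bound, and the lower bound for an entry is the previous entry's upper bound (or zero for the first entry):

fn values_for<'a, K: Ord, V, T>(trie: &'a Trie<K, V, T>, key: &K) -> &'a [(V, usize)] {
	// count the keys strictly less than `key`; if the next entry matches, its
	// values live between the previous upper bound and its own upper bound.
	let index = binary_search(&trie.keys, |x| &x.0 < key);
	if index < trie.keys.len() && &trie.keys[index].0 == key {
		let lower = if index == 0 { 0 } else { trie.keys[index-1].1 };
		let upper = trie.keys[index].1;
		&trie.vals[lower .. upper]
	}
	else {
		&[]
	}
}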

This isn't exactly a trie the way they are normally presented; each layer is flattened into a hunk of contiguous memory, for example. But the spirit is similar, and if anyone knows a better name I'd love to hear it; it seems like quite a natural representation.

Each Trie<K, V, T> is equivalent to one of our ordered lists Vec<(K, V, T, isize)>, and we can use multiple tries the same way our Lists<K, V, T> structure did with lists, to accommodate writes by collecting and merging different sized lists.

/// Equivalent to `Lists<K, V, T>`.
pub struct Tries<K, V, T> {
	layers: Vec<Trie<K, V, T>>,
}

We now get to re-write a few things, most notably the merge logic for tries.

Merging tries is going to be more painful than merging sorted lists, but some things are simpler (for the computer, not for us). When we see a key in one trie but not another we can just copy its associated contents over. We don't have to pay a per-tuple comparison cost, as we did when we merged sorted lists. If we end up with two of the same keys, we can do the same thing for values: if a value is in one trie but not the other, just copy its (T, isize) history over. A lot of merge logic can be replaced with something that looks more like a memcpy.

Merging tries

Let's write the merge logic for our tries. It is going to be a bit explicit and gross as we go, but at the end we will think a bit about how to make it more general. The good news: there is already a crate for it!

So, we want to write a method whose signature looks like

fn merge<K: Ord, V: Ord, T: Ord>(trie1: Trie<K, V, T>, trie2: Trie<K, V, T>) -> Trie<K, V, T> {

}

Part of what will make this fun and challenging is that we are still using owned data everywhere, and haven't permitted K, V, or T to be easily copied or cloned. That means we have to be careful to make sure we get these fields out of our input tries to place in the output trie.

The first thing we are going to do is take each trie, which contains three vectors, and turn each into an iterator. We are also going to hit keys and vals with some iterator logic that has them emit counts rather than offsets. Here is one example, which repeats identically for the others:

let mut keys1 = trie1.keys.into_iter()
						  .scan(0, |sum, (key, off)| {
						      let cnt = off - *sum;
						      *sum = off;
						      Some((key, cnt))
						  })
						  .peekable();

This snippet takes ownership of trie1.keys and turns it into an iterator of owned K elements. It then wraps this iterator in an adaptor that keeps a running total of the offsets, and subtracts it from each offset to get the count of the number of elements. This will be easier to use, and writing it this way is easier for our logic later on (rather than having some globally named offsets that we have to subtract from everything).
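To see what the adaptor does, here is a tiny self-contained example: offsets [2, 5, 9] become counts [2, 3, 4].

let offsets = vec![("a", 2), ("b", 5), ("c", 9)];
let counts: Vec<_> = offsets.into_iter()
	.scan(0, |sum, (key, off)| {
		let cnt = off - *sum;
		*sum = off;
		Some((key, cnt))
	})
	.collect();
assert_eq!(counts, vec![("a", 2), ("b", 3), ("c", 4)]);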

With these iterators in place, let's write the core loop that moves through the keys:

let mut result = Trie::new();

while keys1.peek().is_some() && keys2.peek().is_some() {
	match keys1.peek().unwrap().0.cmp(&keys2.peek().unwrap().0) {
		Ordering::Less => { unimplemented!() },
		Ordering::Equal => { unimplemented!() },
		Ordering::Greater => { unimplemented!() },		
	}
}

// drain whatever remains in either keys1 or keys2
for (key, key_cnt) in keys1 { unimplemented!() }
for (key, key_cnt) in keys2 { unimplemented!() }

result

I've not put the logic in yet, because the complete thing is pretty large. Instead, let's look at each fragment one piece at a time.

The logic for the cases where the keys are unequal is not horrible. We want to copy the key, values, and histories into result. Here is the code for the Ordering::Less case, where we want to read out of the iterators derived from trie1.

// append in key, vals, and their histories.
let (key, key_cnt) = keys1.next().unwrap();
for (val, val_cnt) in vals1.by_ref().take(key_cnt) {
	result.hist.extend(hist1.by_ref().take(val_cnt));
	result.vals.push((val, result.hist.len()));
}
result.keys.push((key, result.vals.len()));

This isn't so bad, is it? We pop open the key and its count, grab that many entries from the vals1 iterator, and for each of its entries (a value and count) we grab the right number of entries from hist1. The main non-obvious thing, from my point of view, is the use of by_ref(). This creates a new iterator from a mutable reference to an iterator; this means we can use the iterator without consuming it, which something like take would otherwise do.

This same logic, excluding the first line, is what we use to drain keys1 at the end of the loop, too. And it is the same logic, swapping 1 for 2, that we use in the Ordering::Greater case and to drain keys2. The main appealing property of these fragments is that they do no conditional logic, so they should be implementable nearly as copies. They aren't, because Rust isn't quite there yet, but I've worked around that in another implementation we will measure.
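If by_ref() is unfamiliar, here is the pattern in miniature: take would normally consume the iterator, but taking it by mutable reference lets us keep using whatever remains afterwards.

let mut iter = 0 .. 10;
let first: Vec<i32> = iter.by_ref().take(3).collect();
let rest: Vec<i32> = iter.collect();
assert_eq!(first, vec![0, 1, 2]);
assert_eq!(rest, vec![3, 4, 5, 6, 7, 8, 9]);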

The only thing left is to investigate the Ordering::Equal case. The case where the keys are equal means we need to do a merge, and that isn't all that different from what we've just written, with a few important changes and exceptions. Obviously we will be merging values rather than keys, but perhaps less obviously at the end of the merge we may not have added any values, if things cancelled.

Here is what I wrote for the equals case, with comments calling out the less-obvious changes:

// important: track to know if we added any vals.
let vals_len = result.vals.len();

let (key, key1_cnt) = keys1.next().unwrap();
let (___, key2_cnt) = keys2.next().unwrap();

let mut vals1 = vals1.by_ref().take(key1_cnt).peekable();
let mut vals2 = vals2.by_ref().take(key2_cnt).peekable();

while vals1.peek().is_some() && vals2.peek().is_some() {
	match vals1.peek().unwrap().0.cmp(&vals2.peek().unwrap().0) {
		Ordering::Less => { unimplemented!() },
		Ordering::Equal => { unimplemented!() },
		Ordering::Greater => { unimplemented!() },
	}
}

for (val, cnt) in vals1 { unimplemented!() }
for (val, cnt) in vals2 { unimplemented!() }

// important: conditionally write `key`.
if result.vals.len() > vals_len {
	result.keys.push((key, result.vals.len()));
}

The main important detail is that we capture result.vals.len() before we start work, so that once we are done we can see if any values actually landed in result. If not, we don't push the key because there is no corresponding data.

As before, the cases where we have a value from just one trace are relatively simple. Here is my code for the Ordering::Less case:

let (val, cnt) = vals1.next().unwrap();
result.hist.extend(hist1.by_ref().take(cnt));
result.vals.push((val, result.hist.len()));

This is really quite short now, and is essentially the same code from the Ordering::Less case for keys, without the loop over values.

What about the Ordering::Equal case? Are we near the bottom of the rabbit hole? Very nearly; we just need to merge the histories of equal values. The main thing to look for here is in the Ordering::Equal case, where we only conditionally write a time if the sum of deltas is non-zero. This is the root of the problem that makes us conditionally write values and keys. By "problem" I mean "optimization", of course.

let (val, val1_cnt) = vals1.next().unwrap();
let (___, val2_cnt) = vals2.next().unwrap();

let hist_len = result.hist.len();

let mut hist1 = hist1.by_ref().take(val1_cnt).peekable();
let mut hist2 = hist2.by_ref().take(val2_cnt).peekable();

while hist1.peek().is_some() && hist2.peek().is_some() {
	match hist1.peek().unwrap().0.cmp(&hist2.peek().unwrap().0) {
		Ordering::Less => {
			result.hist.push(hist1.next().unwrap());
		}
		Ordering::Equal => {
			let (time, delta1) = hist1.next().unwrap();
			let (____, delta2) = hist2.next().unwrap();
			if delta1 + delta2 != 0 {
				result.hist.push((time, delta1 + delta2));
			}
		}
		Ordering::Greater => {
			result.hist.push(hist2.next().unwrap());
		}
	}
}

result.hist.extend(hist1);
result.hist.extend(hist2);

if result.hist.len() > hist_len {
	result.vals.push((val, result.hist.len()));
}

Ok! I think we are done with that. Well done sticking through with us.

You might have noticed that we wrote the same sort of code a lot, over and over. It's true. Could we possibly re-factor this a bit and make something more general? Yes! In fact, there is already a crate that does this and some other cool things, over at frankmcsherry/trie.

Apparently the code is just faster than what we've written, too. More on that later.

Iterating over data

Recall that our iterator interface for the Trace trait looks like so:

/// retrieves differences associated with a specified key.
fn get(&mut self, key: &Key) -> Self::OutputIterator;

/// An iterator defined by the implementation to produce results.
type OutputIterator: Iterator<Item=(&'a Key, &'a Val, &'a Time, isize)>;

This is part of our interface, so we should probably implement that. Recall that the enumerated items are not required to be in any particular order. With that in mind, we can largely steal the implementation for Lists as long as we write an iterator for each Trie.

A Trie output iterator is just going to have a few cursors into its arrays (three, specifically), and move them forward as appropriate.

pub struct TrieIter<'a, K:'a, V:'a, T:'a> {
	keys: (usize, &'a [(K, usize)]),
	vals: (usize, &'a [(V, usize)]),
	hist: (usize, &'a [(T, isize)]),
}

For each field, the first usize component indicates at which point the slice starts, so that we can compare positions with the absolute offsets recorded in the trie.

The logic is pretty simple, I think: we always advance hist, and once its usize offset hits the offset pointed at by vals we advance vals, and once the vals offset hits the offset pointed at by keys we advance keys, which probably means we are done.

impl<'a, K:'a, V:'a, T:'a> Iterator for TrieIter<'a, K, V, T> {
	type Item = (&'a K, &'a V, &'a T, isize);
	fn next(&mut self) -> Option<Self::Item> {
		if self.hist.1.len() > 0 {
			let result = (
				&self.keys.1[0].0,
				&self.vals.1[0].0,
				&self.hist.1[0].0,
				 self.hist.1[0].1,
			);

			// advance self.hist.
			self.hist.0 += 1;
			self.hist.1 = &self.hist.1[1..];

			// maybe advance self.vals.
			if self.hist.0 == self.vals.1[0].1 {
				self.vals.0 += 1;
				self.vals.1 = &self.vals.1[1..];

				// maybe advance self.keys.
				if self.vals.0 == self.keys.1[0].1 {
					self.keys.0 += 1;
					self.keys.1 = &self.keys.1[1..];
				}
			}

			Some(result)
		}
		else { None }
	}
}

This iterator implementation outputs elements in order, but unlike the Lists iterator it re-uses a lot of the &'a K and &'a V references. This is one of the main reasons our Trace trait has the constraint

type OutputIter: Iterator<Item=(&'a K, &'a V, &'a T, isize)>;

as opposed to a similar but less verbose constraint like

type OutputIter: Iterator<Item=&'a (K, V, T, isize)>;

The latter of these two requires the existence of an actual (K, V, T, isize) tuple somewhere, and we don't have one of those. Phew, close call on that one.

Performance measurements

I assert that look-up times are only going to get better with Tries as opposed to Lists, so let's hold off evaluating look-ups for either of these until we get to the next installment. The short, qualitative version is that we just need to do binary search in the Vec<(K, usize)> list, which can be much, much smaller than the full Vec<(K, V, T, isize)> list of all tuples, and is never larger (because usize is the same size as isize).
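To make that concrete, here is a minimal sketch (not the code we will benchmark) of the binary search against the sorted keys of a single trie, returning the half-open range of positions in vals that belong to the key:

// a sketch: `keys` is the sorted Vec<(K, usize)> of one trie, where each entry records
// the absolute offset in `vals` at which that key's values end.
fn key_range<K: Ord>(keys: &[(K, usize)], key: &K) -> Option<(usize, usize)> {
    match keys.binary_search_by(|&(ref k, _)| k.cmp(key)) {
        Ok(pos) => {
            let upper = keys[pos].1;
            let lower = if pos == 0 { 0 } else { keys[pos - 1].1 };
            Some((lower, upper))
        }
        Err(_) => None,
    }
}

The same pattern repeats from vals into hist, so finding a particular value for a key is another binary search within that range, plus a handful of array reads.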

Instead, for now let's try out insertion throughput.

We have made a few important changes, mostly in streamlining the merge process, but also in potentially moving around less data, because each trie compresses the data a little bit. If we have lots of repeated keys and values, we end up copying each only once, and reduce our data movement from (K, V, T, isize) tuples down to just (T, isize) pairs. This is great news for relatively large-ish keys and values. At the same time, if every key and value are distinct, we've only added a few new usize fields for each record, and a bunch of crazy logic.
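For a rough sense of scale (assuming keys, values, and times are each eight bytes, on a 64-bit machine): a (K, V, T, isize) tuple is 32 bytes, while a (T, isize) pair is only 16 bytes, so when each key-value pair carries a long history the trie moves roughly half as many bytes during a merge. At the other extreme, when every key and value is distinct, each record lands in keys, vals, and hist, and the two extra usize offsets add about 16 bytes on top of the 32-byte tuple.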

We should see better performance in some cases, and worse performance in others!

Let's start with the test we've used so far: 1000 keys and 1000 values. Each million elements produces the same set of keys and values, so we do lots of merging and unfortunately no copying, which kinda sucks for us.

Echidnatron% cargo run --release --bin profile-insert -- 1000 1000 1000 1000 tries
     Running `target/release/profile-insert 1000 1000 1000 1000 tries`
throughput: 6064802.57 elts/sec @ 1000000 elts
throughput: 5650132.30 elts/sec @ 2000000 elts
throughput: 5297179.32 elts/sec @ 3000000 elts
throughput: 5313736.12 elts/sec @ 4000000 elts
throughput: 4959473.91 elts/sec @ 5000000 elts
...
throughput: 4031347.05 elts/sec @ 522000000 elts
throughput: 4033318.14 elts/sec @ 523000000 elts
throughput: 4035769.52 elts/sec @ 524000000 elts
...
throughput: 3819354.47 elts/sec @ 1046000000 elts
throughput: 3820566.10 elts/sec @ 1047000000 elts
throughput: 3821941.38 elts/sec @ 1048000000 elts

These computations always hit the inner merge logic, but fortunately in this case that inner merge logic moves less data around, and we hope to win out a little bit. For reference, here are both the first and final moments of the merging list-based approach with the same arguments, copied from up above:

Echidnatron% cargo run --release --bin profile-insert -- 1000 1000 1000 1000 lists
     Running `target/release/profile-insert 1000 1000 1000 1000 lists`
throughput: 6407104.61 elts/sec @ 1000000 elts
throughput: 5731598.78 elts/sec @ 2000000 elts
throughput: 5360619.80 elts/sec @ 3000000 elts
throughput: 5560609.58 elts/sec @ 4000000 elts
throughput: 5005889.36 elts/sec @ 5000000 elts
...
throughput: 4083296.32 elts/sec @ 522000000 elts
throughput: 4087287.71 elts/sec @ 523000000 elts
throughput: 4091650.69 elts/sec @ 524000000 elts

These arguments, 1000 keys and 1000 values, mean that there really is only one value for each key, equal to itself. Each trie that we merge has the same properties, so we still spend our time doing lots of merges, just on (T, isize) elements rather than full (K, V, T, isize) tuples. I was expecting to see some performance improvement.

As it turns out, we do see improved performance when using the trie crate.

Echidnatron% cargo run  --release --bin profile-insert -- 1000 1000 1000 1000 trie-crate
     Running `target/release/profile-insert 1000 1000 1000 1000 trie-crate`
throughput: 7948465.33 elts/sec @ 1000000 elts
throughput: 7465856.38 elts/sec @ 2000000 elts
throughput: 7140011.66 elts/sec @ 3000000 elts
throughput: 7272882.94 elts/sec @ 4000000 elts
throughput: 6849094.57 elts/sec @ 5000000 elts
...
throughput: 6215628.98 elts/sec @ 522000000 elts
throughput: 6219111.63 elts/sec @ 523000000 elts
throughput: 6223198.90 elts/sec @ 524000000 elts
...
throughput: 6028748.73 elts/sec @ 1046000000 elts
throughput: 6030676.76 elts/sec @ 1047000000 elts
throughput: 6032800.13 elts/sec @ 1048000000 elts

I don't have a great explanation for this, other than that the trie crate is largely based on arrays rather than iterators, so Rust's "zero-cost abstraction" mantra might need its annual check-up. I'd like to look into this more, because (i) it shouldn't be so much of a problem, and (ii) it would be nice to understand better.

The trie-based approach gets us two things in this case: slightly better performance, and (more importantly, I think) twice as much capacity before we hit the self-imposed 16GB limit. Navigation should be much faster too, because we only need to poke around in the 1000 keys rather than the 1048000000 or whatever tuples (32GB-ish using lists).

Let's change this up a bit and use 1001 keys, so that we get one million distinct key-value pairs in each batch. This should make for more value-merging and less time-merging, which isn't great news for tries. It should have no effect on the list based approach, so we will skip re-measuring that.

Here is the Tries implementation,

Echidnatron% cargo run --release --bin profile-insert -- 1000 1001 1000 1000 tries
     Running `target/release/profile-insert 1000 1001 1000 1000 tries`
throughput: 5013128.28 elts/sec @ 1000000 elts
throughput: 4685649.30 elts/sec @ 2000000 elts
throughput: 4216060.65 elts/sec @ 3000000 elts
throughput: 4272969.87 elts/sec @ 4000000 elts
throughput: 3783114.63 elts/sec @ 5000000 elts
...
throughput: 3438601.60 elts/sec @ 522000000 elts
throughput: 3440632.98 elts/sec @ 523000000 elts
throughput: 3443512.43 elts/sec @ 524000000 elts
...
throughput: 3402962.63 elts/sec @ 1046000000 elts
throughput: 3403922.25 elts/sec @ 1047000000 elts
throughput: 3405376.64 elts/sec @ 1048000000 elts

and the trie crate implementation:

Echidnatron% cargo run  --release --bin profile-insert -- 1000 1001 1000 1000 trie-crate
     Running `target/release/profile-insert 1000 1001 1000 1000 trie-crate`
throughput: 5420873.47 elts/sec @ 1000000 elts
throughput: 5137352.90 elts/sec @ 2000000 elts
throughput: 4694924.92 elts/sec @ 3000000 elts
throughput: 4791913.72 elts/sec @ 4000000 elts
throughput: 4376992.36 elts/sec @ 5000000 elts
...
throughput: 4268305.67 elts/sec @ 522000000 elts
throughput: 4270360.60 elts/sec @ 523000000 elts
throughput: 4273183.70 elts/sec @ 524000000 elts
...
throughput: 4017279.31 elts/sec @ 1046000000 elts
throughput: 4018399.22 elts/sec @ 1047000000 elts
throughput: 4019631.23 elts/sec @ 1048000000 elts

These times are less great because we spend relatively little time doing merges of (T, isize) lists; each round of one million records has only a single time for each key-value pair, and most of the merging has relatively short lists of times. This is where the trie approaches would have gained ground, by spending relatively less effort on keys and values than on times. Instead, the trie crate implementation ends up pretty close to the sorted list implementation.

On the other end of the spectrum, we could consider just one key and one value, completely unreasonable but still interesting to check out. The times are all distinct, so no records get coalesced here.

The Tries implementation yields:

Echidnatron% cargo run --release --bin profile-insert -- 1 1 1000 1000 tries
     Running `target/release/profile-insert 1 1 1000 1000 tries`
throughput: 8349560.50 elts/sec @ 1000000 elts
throughput: 7402898.58 elts/sec @ 2000000 elts
throughput: 6880129.36 elts/sec @ 3000000 elts
throughput: 6982623.11 elts/sec @ 4000000 elts
throughput: 6286018.40 elts/sec @ 5000000 elts
...
throughput: 5265066.94 elts/sec @ 522000000 elts
throughput: 5267632.95 elts/sec @ 523000000 elts
throughput: 5271201.16 elts/sec @ 524000000 elts
...
throughput: 4738916.24 elts/sec @ 1046000000 elts
throughput: 4740889.90 elts/sec @ 1047000000 elts
throughput: 4743167.73 elts/sec @ 1048000000 elts

The trie crate implementation gives

Echidnatron% cargo run  --release --bin profile-insert -- 1 1 1000 1000 trie-crate
     Running `target/release/profile-insert 1 1 1000 1000 trie-crate`
throughput: 13197679.49 elts/sec @ 1000000 elts
throughput: 11779801.77 elts/sec @ 2000000 elts
throughput: 11013754.34 elts/sec @ 3000000 elts
throughput: 11152347.94 elts/sec @ 4000000 elts
throughput: 9907595.05 elts/sec @ 5000000 elts
...
throughput: 8295162.96 elts/sec @ 522000000 elts
throughput: 8302350.43 elts/sec @ 523000000 elts
throughput: 8309421.18 elts/sec @ 524000000 elts
...
throughput: 7550401.54 elts/sec @ 1046000000 elts
throughput: 7553511.81 elts/sec @ 1047000000 elts
throughput: 7556793.09 elts/sec @ 1048000000 elts

It looks like Rust's abstractions may have some non-zero cost.

Let's do a slightly different experiment, where instead of having 1000 keys that we rotate through, each round of one million records will have one key, so that we see about 1000 (specifically: 1048) over the course of execution. This should give the trie implementations a chance to shine, as they can avoid doing merges by just copying data.

Let's start with the Tries implementation:

Echidnatron% cargo run --release --bin profile-insert -- 1 1000 1000 1000 tries
     Running `target/release/profile-insert 1 1000 1000 1000 tries`
throughput: 6753392.52 elts/sec @ 1000000 elts
throughput: 6575431.13 elts/sec @ 2000000 elts
throughput: 6127113.48 elts/sec @ 3000000 elts
throughput: 6216579.15 elts/sec @ 4000000 elts
throughput: 5812184.59 elts/sec @ 5000000 elts
...
throughput: 4694255.86 elts/sec @ 1046000000 elts
throughput: 4695949.84 elts/sec @ 1047000000 elts
throughput: 4697697.12 elts/sec @ 1048000000 elts

And then the trie crate implementation:

Echidnatron% cargo run  --release --bin profile-insert -- 1 1000 1000 1000 trie-crate
     Running `target/release/profile-insert 1 1000 1000 1000 trie-crate`
throughput: 9390847.91 elts/sec @ 1000000 elts
throughput: 9049046.31 elts/sec @ 2000000 elts
throughput: 8574939.67 elts/sec @ 3000000 elts
throughput: 8741667.15 elts/sec @ 4000000 elts
throughput: 8081182.77 elts/sec @ 5000000 elts
...
throughput: 6522300.43 elts/sec @ 1046000000 elts
throughput: 6524586.42 elts/sec @ 1047000000 elts
throughput: 6527395.99 elts/sec @ 1048000000 elts

There is some difference here, but it turns out that there is just a fair amount of cost in assembling each round of one million records, which involves lots of merging, even if merging the results is then relatively cheap. We can tweak the experiment further, moving through each round's 1000 values one per batch, rather than seeing every one of them in each of the 1000 batches. Just looking at the trie crate version:

Echidnatron% cargo run  --release --bin profile-inspect -- 1 1000 1000 1000 trie-crate
     Running `target/release/profile-inspect 1 1000 1000 1000 trie-crate`
throughput: 12882371.99 elts/sec @ 1000000 elts
throughput: 12088110.92 elts/sec @ 2000000 elts
throughput: 10501636.28 elts/sec @ 3000000 elts
throughput: 10784657.67 elts/sec @ 4000000 elts
throughput: 9953763.38 elts/sec @ 5000000 elts
...
throughput: 7729524.16 elts/sec @ 1046000000 elts
throughput: 7732909.29 elts/sec @ 1047000000 elts
throughput: 7736427.85 elts/sec @ 1048000000 elts

This is much better, though we are pretty much just inserting sorted data at this point. It doesn't have to be, it was just an easy way to make the values distinct, but there is a silly amount of structure in the input designed to make the numbers as good as they can be. They should probably be better, as we really only need to be copying memory around in this case; I haven't profiled it yet to see what the hold-up is, but even just sorting each batch of 1000 records tops out at about 27 million elements per second (not radix sort).

I should re-iterate that the trie-structuring isn't so much about increasing insertion throughput, though that would be great, as about increasing capacity (check!) and simplifying indexing and navigation, something we will get to next!

Errata (added later)

I made fun of Rust for some slowness that I attributed to crappy iterator abstraction. That may have been wrong.

Specifically, the trie crate gets better performance than iterator-based merging, but it also has an algorithmic improvement I failed to mention. Rather, I remembered it, but I didn't realize that it would be relevant, so I cut the text talking about how clever it was. Apparently it was clever, so points awarded to past me, deducted from present me.

The merge implementation in the trie crate does something potentially clever in the not-equals cases. Let's pick the shortest (and most relevant) example from the code we wrote above, the inner loop where we merge histories.

while hist1.peek().is_some() && hist2.peek().is_some() {
	match hist1.peek().unwrap().0.cmp(&hist2.peek().unwrap().0) {
		Ordering::Less => {
			result.hist.push(hist1.next().unwrap());
		}
		Ordering::Equal => {
			let (time, delta1) = hist1.next().unwrap();
			let (____, delta2) = hist2.next().unwrap();
			if delta1 + delta2 != 0 {
				result.hist.push((time, delta1 + delta2));
			}
		}
		Ordering::Greater => {
			result.hist.push(hist2.next().unwrap());
		}
	}
}

This walks through each element of the iterator, each time performing a comparison. We end up doing some linear number of comparisons, which sounds like the right number, right?

You can do lots fewer comparisons, sometimes.

In each of the not-equals cases, we learn that the head of one list is less than the head of some other list. We could, hypothetically, take this opportunity to read off all elements in the first list that are less than the head of the second list. If you just have an iterator, you kind of need to compare each element in turn. Instead, if you know your iterator is backed by a sorted array, you can do an exponential search to find the first element in the first list that is greater-or-equal to the head of the second list. This performs a number of comparisons logarithmic in the number of elements you can read off. Then you just memcpy them into place.
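Here is a rough sketch of such an exponential search over a sorted slice (this is not the trie crate's code, just an illustration of the idea): it returns the number of leading elements that are strictly less than a supplied bound, using a number of comparisons logarithmic in that count.

fn gallop<T: Ord>(slice: &[T], bound: &T) -> usize {
    // gallop forward in exponentially growing steps while elements stay below `bound`.
    let mut lo = 0;
    let mut step = 1;
    while lo + step <= slice.len() && slice[lo + step - 1] < *bound {
        lo += step;
        step <<= 1;
    }
    // the boundary now lies between `lo` and the overshot position; binary search for it.
    let mut hi = std::cmp::min(lo + step - 1, slice.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if slice[mid] < *bound { lo = mid + 1; } else { hi = mid; }
    }
    lo
}

With that count in hand, the not-equals case of a slice-based merge becomes something like result.hist.extend_from_slice(&remaining[..count]) rather than count comparisons and pushes.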

This is relevant because in all of those examples up above, with many keys and with few keys, the lists of (T, isize) elements associated with each key-value pair have T timestamps that naturally increase as we run. When we merge two tries, one will be older than the other, and have smaller timestamps for all key-value pairs. We will merge keys one by one, and then values one by one, but when we get to timestamps, the trie crate implementation will do a logarithmic number of comparisons and two memcpys.

Even the sorted list implementation can benefit from this optimization: the same older vs newer property holds for each of the lists we merge, and each will have non-trivial runs to memcpy. The exponential search after each merge step is pretty easy to write (or copy), and it just relies on breaking the iterator abstraction boundary which is pretty easy for the sorted lists. I will try this out tomorrow!

So, there are definitely some places where Rust's abstractions hurt (iterators not actually being turned into memcpys, despite claims (mine) that they could), but it isn't clear that the performance discrepancies seen above derive from any of these places.

Addendum

I implemented the version of merge on sorted lists that does exponential search to find how many elements to copy, which can improve performance in the case there are runs in either input that should be copied contiguously. This is the case when we use keys and vals of 1000, for example, because in each round of one million elements there are only 1000 distinct values, each of which has 1000 increasing timestamps. Merging between rounds of one million elements should just copy the 1000 elements of each's histories.

Here are the numbers, first with the old merge (copy/pasted from above):

Echidnatron% cargo run --release --bin profile-insert -- 1000 1000 1000 1000 lists
     Running `target/release/profile-insert 1000 1000 1000 1000 lists`
throughput: 6407104.61 elts/sec @ 1000000 elts
throughput: 5731598.78 elts/sec @ 2000000 elts
throughput: 5360619.80 elts/sec @ 3000000 elts
throughput: 5560609.58 elts/sec @ 4000000 elts
throughput: 5005889.36 elts/sec @ 5000000 elts
...
throughput: 4083296.32 elts/sec @ 522000000 elts
throughput: 4087287.71 elts/sec @ 523000000 elts
throughput: 4091650.69 elts/sec @ 524000000 elts

And next with the new merge:

Echidnatron% cargo run --release --bin profile-insert -- 1000 1000 1000 1000 lists
     Running `target/release/profile-insert 1000 1000 1000 1000 lists`
throughput: 6921776.01 elts/sec @ 1000000 elts
throughput: 6586201.96 elts/sec @ 2000000 elts
throughput: 6091053.70 elts/sec @ 3000000 elts
throughput: 6217983.53 elts/sec @ 4000000 elts
throughput: 5583829.84 elts/sec @ 5000000 elts
...
throughput: 4563902.73 elts/sec @ 522000000 elts
throughput: 4568167.76 elts/sec @ 523000000 elts
throughput: 4572768.42 elts/sec @ 524000000 elts

Finally, the trie crate implementation (copy/pasted from above):

Echidnatron% cargo run  --release --bin profile-insert -- 1000 1000 1000 1000 trie-crate
     Running `target/release/profile-insert 1000 1000 1000 1000 trie-crate`
throughput: 7948465.33 elts/sec @ 1000000 elts
throughput: 7465856.38 elts/sec @ 2000000 elts
throughput: 7140011.66 elts/sec @ 3000000 elts
throughput: 7272882.94 elts/sec @ 4000000 elts
throughput: 6849094.57 elts/sec @ 5000000 elts
...
throughput: 6215628.98 elts/sec @ 522000000 elts
throughput: 6219111.63 elts/sec @ 523000000 elts
throughput: 6223198.90 elts/sec @ 524000000 elts
...
throughput: 6028748.73 elts/sec @ 1046000000 elts
throughput: 6030676.76 elts/sec @ 1047000000 elts
throughput: 6032800.13 elts/sec @ 1048000000 elts

The first two are definitely different, with the newer version faster, but neither catches up to the trie crate implementation. This might be because there are other optimizations I have forgotten about, or it might be because of those great motivating explanations, like that we only end up merging (and copying) (T, isize) pairs rather than full (K, V, T, isize) tuples.

For reference, here are some of the other parameter settings we considered. First, the 1000 key and 1001 value setting, which should have shorter runs of times and less effective copying:

Echidnatron% cargo run --release --bin profile-insert -- 1000 1001 1000 1000 lists
     Running `target/release/profile-insert 1000 1001 1000 1000 lists`
throughput: 6629849.74 elts/sec @ 1000000 elts
throughput: 6379930.74 elts/sec @ 2000000 elts
throughput: 5798108.45 elts/sec @ 3000000 elts
throughput: 6028078.50 elts/sec @ 4000000 elts
throughput: 5335330.10 elts/sec @ 5000000 elts
...
throughput: 3927461.80 elts/sec @ 522000000 elts
throughput: 3931341.55 elts/sec @ 523000000 elts
throughput: 3935778.65 elts/sec @ 524000000 elts

And next the single key, single value case, which should have highly effective copying behavior.

Echidnatron% cargo run --release --bin profile-insert -- 1 1 1000 1000 lists
     Running `target/release/profile-insert 1 1 1000 1000 lists`
throughput: 7618399.66 elts/sec @ 1000000 elts
throughput: 7456742.90 elts/sec @ 2000000 elts
throughput: 6737874.21 elts/sec @ 3000000 elts
throughput: 7021508.62 elts/sec @ 4000000 elts
throughput: 6288380.26 elts/sec @ 5000000 elts
...
throughput: 4951442.39 elts/sec @ 522000000 elts
throughput: 4956449.23 elts/sec @ 523000000 elts
throughput: 4962011.37 elts/sec @ 524000000 elts

So, good news and improvements for our target application, where we have these histories with increasing times at the heart of everything. Still, it seems that the trie crate's performance is better than that of the sorted lists, but surely there are smarter things yet to do. We will look harder once we've worked through the other things we need Trace to do.

Part 4: Adding an index.

We now move on to thinking about getting data back out of a Trace. For the moment, we are just going to think about getting our fingers on the data, not about actually using it. Our look-up experiments will be to fill the collection as we have done up above, rounds of waves of batches of inserts, with varying numbers of keys and values, and after each round do some random look-ups (usually 1000) and report the throughput.

The fastest way to do look-ups is probably a HashMap or something like it. Let's measure this, then.

Echidnatron% cargo run --release --bin profile-lookup -- 1000 1000 1000 1000 hash
     Running `target/release/profile-lookup 1000 1000 1000 1000 hash`
throughput: 40115532.73 elts/sec @ 1000000 elts
throughput: 40160642.57 elts/sec @ 2000000 elts
throughput: 39953653.76 elts/sec @ 3000000 elts
throughput: 40444893.83 elts/sec @ 4000000 elts
throughput: 40180006.43 elts/sec @ 5000000 elts
...
throughput: 34115720.52 elts/sec @ 522000000 elts
throughput: 35321959.66 elts/sec @ 523000000 elts
throughput: 35181536.73 elts/sec @ 524000000 elts

These are great numbers! 35-40 million look-ups per second is wonderful. This is possible because our use of the data is currently just to report the number of elements, which as part of the Vec struct actually lives in the HashMap allocation rather than on the other end of some pointers. With only one thousand keys, the working set for look-ups is really quite small, because we aren't actually looking at most of the data yet.

Here are some numbers with 16 million keys.

Echidnatron% cargo run --release --bin profile-lookup -- 16000000 1000 1000 1000 hash
     Running `target/release/profile-lookup 16000000 1000 1000 1000 hash`
throughput: 9111285.24 elts/sec @ 1000000 elts
throughput: 7974799.63 elts/sec @ 2000000 elts
throughput: 7672948.25 elts/sec @ 3000000 elts
throughput: 6695950.29 elts/sec @ 4000000 elts
throughput: 6978415.76 elts/sec @ 5000000 elts
...
throughput: 3833473.89 elts/sec @ 522000000 elts
throughput: 3611529.45 elts/sec @ 523000000 elts
throughput: 2874240.48 elts/sec @ 524000000 elts

These have dropped by an order of magnitude, as the look-ups now have to hit in a much larger set. They still don't touch the real volume of (V, T, isize) data, staying within a working set of several hundred megabytes.

How about our sorted list implementation? We were going to take each of our lists and do a binary search (or two) to find the range of tuples associated with a key, right? That sounds like a good way to start.

Echidnatron% cargo run --release --bin profile-lookup -- 1000 1000 1000 1000 lists
     Running `target/release/profile-lookup 1000 1000 1000 1000 lists`
throughput: 313602.61 elts/sec @ 1000000 elts
throughput: 274131.52 elts/sec @ 2000000 elts
throughput: 258997.11 elts/sec @ 3000000 elts
throughput: 251680.47 elts/sec @ 4000000 elts
throughput: 281871.57 elts/sec @ 5000000 elts
...
throughput: 67968.46 elts/sec @ 522000000 elts
throughput: 64092.42 elts/sec @ 523000000 elts
throughput: 62579.01 elts/sec @ 524000000 elts

And now with 16 million keys:

Echidnatron% cargo run --release --bin profile-lookup -- 16000000 1000 1000 1000 lists
     Running `target/release/profile-lookup 16000000 1000 1000 1000 lists`
throughput: 3654503.26 elts/sec @ 1000000 elts
throughput: 2756537.13 elts/sec @ 2000000 elts
throughput: 2186418.41 elts/sec @ 3000000 elts
throughput: 1983977.40 elts/sec @ 4000000 elts
throughput: 1806071.29 elts/sec @ 5000000 elts
...
throughput: 85581.92 elts/sec @ 522000000 elts
throughput: 74162.77 elts/sec @ 523000000 elts
throughput: 95654.47 elts/sec @ 524000000 elts

Well, these numbers aren't great. They make sense, though. When we start out, we have not so many lists, each with not so many elements. As we move along, we have a whole bunch more data to look through for each key. Also, you can see that it actually ends up a bit faster with 16 million keys, which might seem counter-intuitive; but when we look up a key we need to find both the start and the end of its range, and with 16 million keys these are much closer together for each key.

How about our trie implementation, which has keys separate from the bulk of the data, like the HashMap but without its random access?

Echidnatron% cargo run --release --bin profile-lookup -- 1000 1000 1000 1000 tries
     Running `target/release/profile-lookup 1000 1000 1000 1000 tries`
throughput: 2640013.09 elts/sec @ 1000000 elts
throughput: 2624299.97 elts/sec @ 2000000 elts
throughput: 2389366.36 elts/sec @ 3000000 elts
throughput: 2645138.76 elts/sec @ 4000000 elts
throughput: 3063387.62 elts/sec @ 5000000 elts
...
throughput: 1504786.73 elts/sec @ 523000000 elts
throughput: 1547709.70 elts/sec @ 524000000 elts
throughput: 3046653.40 elts/sec @ 525000000 elts
...
throughput: 1489857.79 elts/sec @ 1046000000 elts
throughput: 1453179.99 elts/sec @ 1047000000 elts
throughput: 1554332.47 elts/sec @ 1048000000 elts

We are up in the millions of lookups per second again, though nowhere near the 35-40 million per second that HashMap does for the same number of keys. One thing to note, perhaps, is that at 524 million we have a big consolidation of lists; we move from one list at each power-of-two length, to one larger one. This is why the other techniques bail at this point. Here we notice a marked improvement in query latency, as the number of lists to search drops substantially.

Echidnatron% cargo run --release --bin profile-lookup -- 16000000 1000 1000 1000 tries
     Running `target/release/profile-lookup 16000000 1000 1000 1000 tries`
throughput: 6099829.81 elts/sec @ 1000000 elts
throughput: 4348922.77 elts/sec @ 2000000 elts
throughput: 3354511.48 elts/sec @ 3000000 elts
throughput: 2985698.50 elts/sec @ 4000000 elts
throughput: 2724728.82 elts/sec @ 5000000 elts
...
throughput: 191675.05 elts/sec @ 524000000 elts
throughput: 428941.18 elts/sec @ 525000000 elts
throughput: 922347.56 elts/sec @ 526000000 elts
...
throughput: 175878.04 elts/sec @ 1046000000 elts
throughput: 171276.22 elts/sec @ 1047000000 elts
throughput: 169637.21 elts/sec @ 1048000000 elts

This clearly ends up less impressive than when we had fewer keys, despite starting well (due to missing on some keys, I bet), though it is better than the Lists implementation. These numbers are more than 10x slower than the HashMap implementation, which has to poke around in about the same amount of data. The difference is that each binary search hits the data several times, and we need to do several of them.

Batching look-ups

One thing that does improve with Lists and Tries is batched queries, where we have the opportunity to sort the queries within a batch. Let's try that; there are 1000 queries for random keys in each batch, so this should work "great" with 1000 keys:

Echidnatron% cargo run --release --bin profile-lookup -- 1000 1000 1000 1000 lists sort
     Running `target/release/profile-lookup 1000 1000 1000 1000 lists sort`
throughput: 593154.64 elts/sec @ 1000000 elts
throughput: 525547.11 elts/sec @ 2000000 elts
throughput: 513238.21 elts/sec @ 3000000 elts
throughput: 459265.87 elts/sec @ 4000000 elts
throughput: 530989.62 elts/sec @ 5000000 elts
...
throughput: 124565.39 elts/sec @ 522000000 elts
throughput: 132066.36 elts/sec @ 523000000 elts
throughput: 114911.01 elts/sec @ 524000000 elts

About a 2x improvement. And for tries:

Echidnatron% cargo run --release --bin profile-lookup -- 1000 1000 1000 1000 tries sort
     Running `target/release/profile-lookup 1000 1000 1000 1000 tries sort`
throughput: 4169307.23 elts/sec @ 1000000 elts
throughput: 4109003.65 elts/sec @ 2000000 elts
throughput: 3865003.17 elts/sec @ 3000000 elts
throughput: 4189710.07 elts/sec @ 4000000 elts
throughput: 4736911.91 elts/sec @ 5000000 elts
...
throughput: 2220440.94 elts/sec @ 523000000 elts
throughput: 2326403.75 elts/sec @ 524000000 elts
throughput: 4669515.07 elts/sec @ 525000000 elts
...
throughput: 2008790.47 elts/sec @ 1046000000 elts
throughput: 2232048.19 elts/sec @ 1047000000 elts
throughput: 2278376.61 elts/sec @ 1048000000 elts

Also about a 2x improvement by processing things in order.
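For a sense of where that improvement comes from, here is a minimal sketch (not the benchmarked code) of batched look-ups against a single sorted list: because the queries are sorted, each binary search can be restricted to the suffix after the previous result, so later searches touch less data and the hot prefix stays in cache.

// a sketch: answer a batch of queries against one sorted Vec<(K, usize)>,
// restricting each search to the part of the list past the previous hit.
fn batch_lookup<K: Ord>(list: &[(K, usize)], queries: &mut Vec<K>) -> Vec<bool> {
    queries.sort();
    let mut found = Vec::with_capacity(queries.len());
    let mut cursor = 0;
    for query in queries.iter() {
        match list[cursor..].binary_search_by(|&(ref k, _)| k.cmp(query)) {
            Ok(pos)  => { cursor += pos; found.push(true); }
            Err(pos) => { cursor += pos; found.push(false); }
        }
    }
    found
}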

We don't expect quite as great a result for 16 million keys, because each key is sufficiently different that processing them in order doesn't help quite as much.

Echidnatron% cargo run --release --bin profile-lookup -- 16000000 1000 1000 1000 lists sort
     Running `target/release/profile-lookup 16000000 1000 1000 1000 lists sort`
throughput: 3347291.54 elts/sec @ 1000000 elts
throughput: 2645019.82 elts/sec @ 2000000 elts
throughput: 2103053.84 elts/sec @ 3000000 elts
throughput: 1953128.81 elts/sec @ 4000000 elts
throughput: 1989570.67 elts/sec @ 5000000 elts
...
throughput: 149253.91 elts/sec @ 522000000 elts
throughput: 92232.88 elts/sec @ 523000000 elts
throughput: 120980.44 elts/sec @ 524000000 elts

This is better, though perhaps not 2x better. And for tries:

Echidnatron% cargo run --release --bin profile-lookup -- 16000000 1000 1000 1000 tries sort
     Running `target/release/profile-lookup 16000000 1000 1000 1000 tries sort`
throughput: 5969864.13 elts/sec @ 1000000 elts
throughput: 4385387.89 elts/sec @ 2000000 elts
throughput: 3422735.18 elts/sec @ 3000000 elts
throughput: 2712452.87 elts/sec @ 4000000 elts
throughput: 2923292.80 elts/sec @ 5000000 elts
...
throughput: 259774.67 elts/sec @ 524000000 elts
throughput: 64566.67 elts/sec @ 525000000 elts
throughput: 1080639.48 elts/sec @ 526000000 elts
...
throughput: 236002.35 elts/sec @ 1046000000 elts
throughput: 233229.90 elts/sec @ 1047000000 elts
throughput: 233800.82 elts/sec @ 1048000000 elts

These are about a 4/3x improvement. This goes up as the batch size gets larger. For example, cranking the query batch size up to 10,000 elements, we see numbers for Tries like:

Echidnatron% cargo run --release --bin profile-lookup -- 16000000 1000 10000 100 tries sort
     Running `target/release/profile-lookup 16000000 1000 10000 100 tries sort`
throughput: 5230218.53 elts/sec @ 1000000 elts
throughput: 4803007.07 elts/sec @ 2000000 elts
throughput: 3579488.81 elts/sec @ 3000000 elts
throughput: 3313328.56 elts/sec @ 4000000 elts
throughput: 3288689.38 elts/sec @ 5000000 elts
...
throughput: 316652.69 elts/sec @ 523000000 elts
throughput: 319958.04 elts/sec @ 524000000 elts
throughput: 745702.26 elts/sec @ 525000000 elts
throughput: 1317004.58 elts/sec @ 526000000 elts
...
throughput: 269572.39 elts/sec @ 1046000000 elts
throughput: 291515.14 elts/sec @ 1047000000 elts
throughput: 288647.49 elts/sec @ 1048000000 elts

This looks like it might be settling on a 3/2x throughput improvement by batching 10x more queries together.

There are even further optimizations one can do. The trie crate doesn't support get but instead returns a cursor over keys, leaving a finger in each of the geometrically sized tries to remember where it was. For the shorter arrays, the finger is often quite near (or past) the next query, and exponentially increasing search can work much better than binary search. This is an interesting approach to consider, and we will talk about it more when we get to navigating around the per-key state, because we will do exactly this approach.

Sorting doesn't much help a HashMap, in that it intentionally maps nearby keys to totally different locations. We could cheat a bit, and will mention this in just a few sections.

Building an index

Can we combine the best of both worlds: the look-up speed of a HashMap and the density of Tries?

Yes, both in a fairly simple and direct way, and in a more sophisticated way that is only barely highlighted here.

The simple and direct way

The HashMap is useful for quickly locating information about a key, but not such a great story for how to store all of our data. Why not use the Tries to store the data, and a HashMap to help us recall exactly where we can find each key?

More specifically, we could maintain a HashMap<K, Vec<(usize, usize)>> which for each key holds a list of (trie_id, offset) pairs giving the tries and offsets at which the key can be found. Looking up locations of data is now quite simple, as the HashMap has the data for us. Maintaining the index is also not very hard: as we naturally keep information about the youngest tries at the ends of these lists, when merging a key we just pop off the end and then push on the offset in the merged trie.

Perhaps the easiest way to do this is to perform all of the merges we need to perform, record all observed keys (even the ones that were discarded due to cancellation), and then for each of these keys pop from its list all entries that refer to tries that were merged, and if appropriate push one entry for the offset in the newly merged trie.

Here is the logic for Tries::add, before the modification:

/// Incorporates a new difference into the trace.
fn add<Iter: Iterator<Item=(K,V,T,isize)>>(&mut self, diff: Iter) {

    // collect iterator contents into a temporary buffer.
    let mut buffer = Trie::from(diff);

    // while the next smallest list is less than twice as large as the buffer, merge it in.
    while self.tries.len() > 0 && self.tries[self.tries.len()-1].len() < 2 * buffer.len() {
        buffer = merge(buffer, self.tries.pop().unwrap());
    }

    // <- new stuff about to go here!

    self.tries.push(buffer);
}

And here is a version afterwards, where merge also updates a supplied list of "discarded" keys.

/// Incorporates a new difference into the trace.
fn add<Iter: Iterator<Item=(K,V,T,isize)>>(&mut self, diff: Iter) {

    // collect iterator contents into a temporary buffer.
    let mut buffer = Trie::from(diff);
    let mut discard = Vec::new();

    // while the next smallest list is less than twice as large as the buffer, merge it in.
    while self.tries.len() > 0 && self.tries[self.tries.len()-1].len() < 2 * buffer.len() {
        buffer = merge(buffer, self.tries.pop().unwrap(), &mut discard);
    }

    // index entries naming this trie id or greater refer to merged tries and are now stale.
    let valid = self.tries.len();

    // drop index entries for keys whose differences all cancelled in the merge.
    for key in discard.drain(..) {
        if let Occupied(mut entry) = self.index.entry(key) {
            while entry.get().last().map(|x| x.0 >= valid) == Some(true) {
                entry.get_mut().pop();
            }
            if entry.get().is_empty() {
                entry.remove();
            }
        }
    }

    // update entries for keys present in the merged trie: pop stale locations, then
    // record the key's position in the new trie (which will be pushed as trie `valid`).
    for (pos, &(ref key, _)) in buffer.keys.iter().enumerate() {
        let list = self.index.entry(key.clone()).or_insert(Vec::new());
        while list.last().map(|x| x.0 >= valid) == Some(true) {
            list.pop();
        }
        list.push((valid, pos));
    }

    self.tries.push(buffer);
}

This approach gives us the look-up performance of the HashMap at the beginning of the section. The Vec stored as the value has the same size in the two HashMaps, and the same thing happens, again until we actually start looking at the data.
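Concretely, and purely as a hypothetical sketch (the index and tries field names are just the ones used in the code above), resolving a key through the index could look like the following, with no binary searches anywhere:

// hypothetical: turn the index entries for `key` into (trie_id, lower, upper) triples,
// where [lower, upper) is the range of positions in that trie's `vals` owned by the key.
fn value_ranges(&self, key: &K) -> Vec<(usize, usize, usize)> {
    let mut ranges = Vec::new();
    if let Some(list) = self.index.get(key) {
        for &(trie_id, pos) in list.iter() {
            let upper = self.tries[trie_id].keys[pos].1;
            let lower = if pos == 0 { 0 } else { self.tries[trie_id].keys[pos - 1].1 };
            ranges.push((trie_id, lower, upper));
        }
    }
    ranges
}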

In principle, we could remove the Vec<(K, usize)> vectors entirely, and store the usize information directly in the HashMap itself. Before we get too excited about this, let's think about how different a HashMap and a Vec<K,_> really are. It turns out they can be pretty similar.

The more sophisticated way

So far we have used the natural order on keys to keep them sorted, which is an important part of being able to merge them. However, nothing actually binds us to that particular order on keys. We could choose a different order, if it would be helpful. Are there any other orders that are interesting?

There are several hash map implementations (I'm thinking specifically of Robin Hood hashing) that maintain their set of (key, val) pairs in a flat array ordered by (hash(key), key). The pairs are ordered by hash value, dropped in at locations proportional to their hashes, which is how, given just its hash value, we can zip to where a key should be. If there are collisions, we just scan forward until we either find the key or a value with a larger hash or key.
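As a rough sketch of that look-up (details like empty slots, load factors, and resizing omitted; this is an illustration, not the hash map we benchmarked above): jump to the position proportional to the hash, then scan forward.

// look up `key` in a flat array kept sorted by (hash(key), key), where each entry sits
// at or after the position proportional to its hash.
fn rh_find<K: Ord>(slots: &[(u64, K)], hash: u64, key: &K) -> Option<usize> {
    if slots.is_empty() { return None; }
    // proportional starting position, approximated using the top 32 bits of the hash.
    let mut pos = (((hash >> 32) * slots.len() as u64) >> 32) as usize;
    while pos < slots.len() {
        if slots[pos].0 == hash && &slots[pos].1 == key { return Some(pos); }
        if (slots[pos].0, &slots[pos].1) > (hash, key) { return None; }
        pos += 1;
    }
    None
}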

We can choose to order keys by (hash(key), key) throughout, and replace our use of dense Vec<(K, usize)> arrays with slightly sparser Robin Hood hash maps. Each binary search is now replaced by a hash map look up.

Alternately, we can follow the suggestion above and replace all Vec<(K, usize)> entries with a single hash map containing their associated data. Using something like a Robin Hood hash map, we still maintain a relatively well packed vector with a clear order on keys, which is exactly what merging requires.

I think this sounds like a great idea, but my Robin Hood hash map implementation is a bit dusty. I'm traveling a bit in the next few days, so I'm going to take some time to clean up code and see what things look like. No performance numbers combining things just yet.

Part 5: Navigation rather than enumeration.

After we talk about indices, surely.