
Add optional features that can potentially improve performance #7

Merged: 8 commits into jonhoo:master, Jan 24, 2019

Conversation

@novacrazy (Collaborator) commented Jan 16, 2019

This started out as an experiment to see if I could make evmap work better with the libraries I'm already using in my projects, but there really isn't a reason not to provide them as features.

The big ones:

  • hashbrown: a fast and space-efficient hash table replacement.
  • parking_lot: fast and space-efficient synchronization primitives.

These two are obvious, as they provide generally better and potentially safer functionality (for instance, parking_lot's locks are not poisoned when a thread panics while holding them).
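To make the poisoning point concrete, here's a self-contained sketch (assuming parking_lot is available as a dependency; this is illustrative, not code from the PR). A panic while a std::sync::Mutex is locked poisons it, whereas parking_lot::Mutex simply unlocks:

use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::Mutex;

fn main() {
    // A panic while holding a std mutex poisons it...
    let std_mutex = Mutex::new(0u32);
    let _ = catch_unwind(AssertUnwindSafe(|| {
        let _guard = std_mutex.lock().unwrap();
        panic!("boom");
    }));
    assert!(std_mutex.lock().is_err()); // poisoned: lock() now returns Err

    // ...while parking_lot's mutex just unlocks and carries on.
    let pl_mutex = parking_lot::Mutex::new(0u32);
    let _ = catch_unwind(AssertUnwindSafe(|| {
        let _guard = pl_mutex.lock();
        panic!("boom");
    }));
    *pl_mutex.lock() += 1; // no poisoning; this just works
}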

Additionally:

  • smallvec: a vector that stores a few elements inline and only spills to the heap when it grows.

Although the multi-value map is useful when keys hold multiple values, a map that never uses that and only ever replaces a single value with update can cheat by effectively using [T; 1] instead, which saves an extra level of indirection; smallvec still grows and heap-allocates if necessary.
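A minimal sketch of that trick in isolation (plain smallvec usage, not evmap's actual internals):

use smallvec::SmallVec;

fn main() {
    // Inline capacity of 1: the common single-value-per-key case
    // needs no heap allocation beyond the map entry itself.
    let mut values: SmallVec<[u64; 1]> = SmallVec::new();
    values.push(42);
    assert!(!values.spilled()); // still inline, effectively a [u64; 1]
    values.push(7); // a second value spills to the heap
    assert!(values.spilled());
}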

Combined, this seems to net about a 50% performance increase on my system, but I need to do a clean run of benchmarks before this is merged. It's certainly better. Plus hashbrown is supposed to be more space efficient.

[Figure: read-throughput benchmark graph]

I switched to a log10 scale because the other series were getting lost. Also, I have a 16C/32T CPU, so I bumped up the reader counts to compensate, although for new benchmarks I may try changing the CPU affinity.

Write speeds are about the same.

What are your thoughts?

@jonhoo (Owner) commented Jan 16, 2019

This is really cool work! Let me work my way through the diff and comment as I go, and then I'll leave some concluding remarks at the end!

(Review comments on benchmark/src/main.rs, src/inner.rs, Cargo.toml, src/lib.rs, and src/read.rs, all since resolved.)
@jonhoo (Owner) commented Jan 16, 2019

Okay, so overall I'd say:

  • parking_lot is probably not a worthwhile change based on my understanding of evmap's current performance bottlenecks, though a micro-benchmark would be good. Specifically, the one Mutex in evmap is pretty much entirely uncontended, although it is taken on every refresh. I don't know that parking_lot::Mutex improves on that case, but maybe?
  • smallvec is a great change.
  • hashbrown is tricky to add as an optional feature because it changes types that are part of evmap's public type signatures, which in turn means the feature isn't additive. I think we'd probably just want to make hashbrown non-optional if we believe it to be a pretty much universal win.
  • Updates to the benchmark aren't particularly important, since they don't change the performance users will experience when using evmap themselves. We may want evmap to expose a MultiWriterHandle for convenience though, which is essentially Arc<Mutex<WriteHandle>> (sketched below).
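For concreteness, a rough sketch of that convenience type, with W standing in for evmap's WriteHandle; the MultiWriterHandle name and API here are hypothetical, not something evmap exposes today:

use std::sync::{Arc, Mutex};

// A cloneable handle that lets several threads share one writer.
struct MultiWriterHandle<W> {
    inner: Arc<Mutex<W>>,
}

impl<W> Clone for MultiWriterHandle<W> {
    fn clone(&self) -> Self {
        MultiWriterHandle { inner: Arc::clone(&self.inner) }
    }
}

impl<W> MultiWriterHandle<W> {
    fn new(writer: W) -> Self {
        MultiWriterHandle { inner: Arc::new(Mutex::new(writer)) }
    }

    // Run a closure while holding the lock on the shared writer.
    fn with_writer<R>(&self, f: impl FnOnce(&mut W) -> R) -> R {
        let mut guard = self.inner.lock().unwrap();
        f(&mut *guard)
    }
}

Cloning the handle only clones the Arc, so writes from any number of threads are serialized through the single Mutex-guarded writer.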

@novacrazy (Collaborator, Author)

Running some new benchmarks now. I've limited the affinity to 8 physical CPU cores (no SMT) and set the process to high priority to hopefully reduce some noise.

@jonhoo (Owner) commented Jan 16, 2019

I think what you want to do is something similar to the script in benchmarks/README.md and run with #readers + #writers <= #cores. That's where the interesting scalability results lie. Once you go beyond the number of cores, you end up context switching, and that's not particularly interesting from a performance standpoint.
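For instance, a sweep over the interesting configurations might look like this (a hypothetical sketch of the shape of the sweep, not the actual script from benchmarks/README.md):

fn main() {
    let cores = 12; // physical cores dedicated to the benchmark
    for writers in 1..cores {
        for readers in 1..=(cores - writers) {
            // invariant from above: readers + writers <= cores
            println!("run benchmark with {readers} readers, {writers} writers");
        }
    }
}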

@novacrazy (Collaborator, Author)

In that case, I'll bump it to 12 real cores and go overboard on the granularity of the benchmarks. This may take a while, but we should get some pretty graphs!

I've realized by now that these benchmarks are really just for people glancing at the library. Now that I understand how it works, the reader benchmarks are basically just benchmarking the hashmap implementation, plus a few atomic loads.

Speaking of... I just noticed an issue that I will fix after benchmarking.

In ReadHandle::with_handle, there is this snippet:

let rs = unsafe { Box::from_raw(r_handle) };
res = Some(f(&*rs));
mem::forget(rs); // don't free the Box!

However, if f panics (and f runs user code, via get_and and for_each, among others), rs will be dropped and the Box freed, even though the read handle still owns that allocation. Therefore, I will wrap it in ManuallyDrop ASAP, and look for other occurrences of this pattern.
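In sketch form, a simplified stand-in for the real ReadHandle::with_handle with the ManuallyDrop fix applied (the signature is illustrative, not evmap's exact code):

use std::mem::ManuallyDrop;

// ManuallyDrop suppresses the Box's destructor, so even if `f`
// panics and unwinds, we never free memory that the pointer's
// real owner is still responsible for.
fn with_handle<T, R>(r_handle: *mut T, f: impl FnOnce(&T) -> R) -> R {
    let rs = ManuallyDrop::new(unsafe { Box::from_raw(r_handle) });
    f(&**rs)
}

fn main() {
    let raw = Box::into_raw(Box::new(21u32));
    let doubled = with_handle(raw, |v| v * 2);
    assert_eq!(doubled, 42);
    drop(unsafe { Box::from_raw(raw) }); // the real owner frees exactly once
}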

@jonhoo (Owner) commented Jan 16, 2019

Yup, the benchmarker is essentially trying to generate as much contention as it can, because that's where we want to understand the behavior characteristics. If there's no contention, the performance is mostly just the performance of the underlying HashMap + some overhead. That's not all that interesting. Seeing how performance scales as you have more concurrent accesses however, that's where the real dragons lie! It's not about the sheer number of readers or writers, but rather about the access contention those readers and writers generate.

As for the snippet you gave, I think the real solution is to make that code simply:

res = Some(f(unsafe { &*r_handle }));

That way there's no issue with panics. I believe that should work just fine.
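In the same illustrative sketch form as above, the whole helper then collapses to borrowing straight through the raw pointer:

// No Box is ever materialized here, so an unwinding panic in `f`
// has nothing it could erroneously drop.
fn with_handle<T, R>(r_handle: *const T, f: impl FnOnce(&T) -> R) -> R {
    f(unsafe { &*r_handle })
}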

@novacrazy (Collaborator, Author)

I've been experiencing some odd results that don't reflect what I saw when I first added these features, to the point that I'm beginning to think I'm running into memory bandwidth limitations on my system.

When I first benchmarked these, they ran at normal priority (or dynamically lower), and I had Netflix on in the background. Typical stuff. It was a not-insignificant load, but it was consistent. I tried not to do anything drastic, but I didn't want to just stare at a blank screen for 20 minutes. That's when I observed the speedups.

However, when I increased the process priority and basically dedicated my system to the benchmarks, the results across all the new features were identical: no change.

We already agreed that the simple reader tests are basically just benchmarking the hashmap implementations, and I'm not even seeing the boosts that should be expected of hashbrown, given that hashbrown's own benchmarks destroy std's.

So what I think is going on is specific to me. Bad luck or bad planning, I suppose. As stated previously, I have a Threadripper 1950X, plus 128GB of RAM in a quad-channel configuration. Unfortunately, on first-generation Threadripper, quad-channel configurations effectively halve memory speed under certain workloads.

Randomly accessing highly optimized hash maps at very high OS priority appears to hit exactly that memory bandwidth limit.

What must have happened with the casual-use benchmarks is that the new code really is faster: the OS was distributing the work such that the lower per-operation latency of the new features gave them an edge, even though they were still hitting that limit, while the slower original configuration had to share the machine with the background tasks and never got the chance to hit the memory bandwidth ceiling.

This has certainly been an interesting adventure for so few lines of code changed.

What are your thoughts?

@jonhoo (Owner) commented Jan 20, 2019

Yeah, I've observed similar behavior in the past too. It's likely in part because the benchmark is so simple. One good strategy might be to run evmap through the YCSB or something else that's more realistic, and then measure the difference. There's also this page that measures hash tables directly, which we may be able to adapt to a concurrent workload. I also have a larger integration benchmark I could run with noria (cc @ms705), which might help shine some light on the improvements.

Overall, I'd be inclined to merge this regardless; the additions are both sensible and optional. I do think, as discussed in one of the comments, that both features should be on by default though.

I'd also appreciate a fix to the r_handle drop issue you observed in a separate PR :)

@novacrazy (Collaborator, Author)

Perhaps we could use a slow CPU with fast memory for a fair benchmark? Or I may be able to underclock my CPU to compensate. 32 hardware threads at around 1 GHz (vs. 3.71 GHz now) would probably give enough contention.

Anyway, I'll go ahead and make those optional features default and then it should be ready to merge.

@novacrazy (Collaborator, Author)

Everything should be ready.

(Further review comments on benchmark/Cargo.toml, benchmark/src/main.rs, src/read.rs, and src/write.rs.)
@novacrazy (Collaborator, Author)

I can't see whatever review is still pending, but I think I've covered everything.

@jonhoo merged commit d939187 into jonhoo:master on Jan 24, 2019
@jonhoo (Owner) commented Jan 24, 2019

Awesome, thank you so much! And thanks for your patience in landing the PR too :D I'm excited to try this out in noria and see if we notice any speedup!

@novacrazy (Collaborator, Author) commented Jan 25, 2019

Oh... you already released it? I still need to fix those Drop soundness issues. You requested I make a separate PR for that after this one; I was hoping to squeeze them into the same release.

I'll do those tonight anyway and they can be added to a patch version.

@jonhoo (Owner) commented Jan 25, 2019

Yeah, the Drop soundness issue occurs with or without this PR. Arguably you're right that I should have waited, but this doesn't really make anything worse. I'll make a patch release with the Drop fix :)
