Improve siphash performance for longer data #27280

bluss · 2015-07-25T10:25:59Z

Improve siphash performance for longer data

Use ptr::copy_nonoverlapping (aka memcpy) to load a u64 from the
byte stream. This is correct for any alignment, and the compiler will
use the appropriate instruction to load the data.
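
A minimal sketch of such a word load, assuming the bytes are interpreted as little-endian (as SipHash specifies). The function names and the bounds assertion are illustrative and not necessarily what the PR's code looks like; the byte-by-byte variant is included only as a reference to compare against.

```rust
use std::ptr;

/// Load 8 bytes starting at `buf[i]` as a little-endian u64.
/// `buf` needs no particular alignment: the bytes are copied into an
/// aligned local, which the compiler is free to lower to a plain load.
fn load_u64_le(buf: &[u8], i: usize) -> u64 {
    // Bounds check kept here for safety; it is an assumption of this sketch.
    assert!(i <= buf.len() && buf.len() - i >= 8);
    let mut data = 0u64;
    unsafe {
        // SAFETY: the assertion guarantees 8 readable bytes at `buf[i..]`,
        // and `data` is a separate 8-byte local, so the ranges cannot overlap.
        ptr::copy_nonoverlapping(buf.as_ptr().add(i), &mut data as *mut u64 as *mut u8, 8);
    }
    // The copy preserved memory order; convert to the native representation.
    u64::from_le(data)
}

/// Reference version: assemble the same word one byte at a time.
fn load_u64_le_bytewise(buf: &[u8], i: usize) -> u64 {
    let mut out = 0u64;
    for (k, &b) in buf[i..i + 8].iter().enumerate() {
        out |= (b as u64) << (8 * k);
    }
    out
}
```

Both functions return the same value for any offset, including odd ones such as `load_u64_le(&bytes, 1)`; the difference is only in how much work the compiler has to do to recognise a single 8-byte load.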

Also contains small tweaks that should benefit hashing short data too,
both the commit that removes a variable and the autovectorization of
the hash state initialization (in SipHash::reset).

Benchmarks show that hashing longer data benefits from the improved word loading.

Before (using benchmarks from the first commit in the PR):

The before benchmark is a bit noisy.

test hash::sip::bench_bytes_4                              ... bench:          41 ns/iter (+/- 0) = 97 MB/s
test hash::sip::bench_bytes_7                              ... bench:          49 ns/iter (+/- 2) = 142 MB/s
test hash::sip::bench_bytes_8                              ... bench:          42 ns/iter (+/- 4) = 190 MB/s
test hash::sip::bench_bytes_a_16                           ... bench:          57 ns/iter (+/- 14) = 280 MB/s
test hash::sip::bench_bytes_b_32                           ... bench:          85 ns/iter (+/- 74) = 376 MB/s
test hash::sip::bench_bytes_c_128                          ... bench:         278 ns/iter (+/- 33) = 460 MB/s
test hash::sip::bench_long_str                             ... bench:         825 ns/iter (+/- 103)
test hash::sip::bench_str_of_8_bytes                       ... bench:         151 ns/iter (+/- 66)
test hash::sip::bench_str_over_8_bytes                     ... bench:          59 ns/iter (+/- 3)
test hash::sip::bench_str_under_8_bytes                    ... bench:          47 ns/iter (+/- 56)
test hash::sip::bench_u32                                  ... bench:          39 ns/iter (+/- 93) = 205 MB/s
test hash::sip::bench_u32_keyed                            ... bench:          40 ns/iter (+/- 88) = 200 MB/s
test hash::sip::bench_u64                                  ... bench:          54 ns/iter (+/- 96) = 148 MB/s

After:

test hash::sip::bench_bytes_4                              ... bench:          41 ns/iter (+/- 3) = 97 MB/s
test hash::sip::bench_bytes_7                              ... bench:          48 ns/iter (+/- 0) = 145 MB/s
test hash::sip::bench_bytes_8                              ... bench:          35 ns/iter (+/- 1) = 228 MB/s
test hash::sip::bench_bytes_a_16                           ... bench:          45 ns/iter (+/- 1) = 355 MB/s
test hash::sip::bench_bytes_b_32                           ... bench:          60 ns/iter (+/- 0) = 533 MB/s
test hash::sip::bench_bytes_c_128                          ... bench:         161 ns/iter (+/- 5) = 795 MB/s
test hash::sip::bench_long_str                             ... bench:         514 ns/iter (+/- 5)
test hash::sip::bench_str_of_8_bytes                       ... bench:          44 ns/iter (+/- 0)
test hash::sip::bench_str_over_8_bytes                     ... bench:          51 ns/iter (+/- 0)
test hash::sip::bench_str_under_8_bytes                    ... bench:          52 ns/iter (+/- 6)
test hash::sip::bench_u32                                  ... bench:          40 ns/iter (+/- 2) = 200 MB/s
test hash::sip::bench_u32_keyed                            ... bench:          39 ns/iter (+/- 1) = 205 MB/s
test hash::sip::bench_u64                                  ... bench:          36 ns/iter (+/- 1) = 222 MB/s
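
For reading the tables, the MB/s column appears to follow directly from the bytes hashed per iteration and the ns/iter figure. The sketch below reproduces that arithmetic; the truncation to an integer is inferred from the numbers above rather than taken from the benchmark harness's source, and both function names are hypothetical.

```rust
/// Throughput in MB/s for `bytes` hashed per iteration taking `ns` nanoseconds.
/// bytes/ns is GB/s, so multiply by 1000; integer division truncates.
fn throughput_mb_s(bytes: u64, ns: u64) -> u64 {
    bytes * 1000 / ns
}

/// Relative gain of `after` over `before`, in percent.
fn percent_gain(before: f64, after: f64) -> f64 {
    (after / before - 1.0) * 100.0
}
```

For the 128-byte case, 278 ns/iter gives 460 MB/s and 161 ns/iter gives 795 MB/s, a gain of about 73%.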

rust-highfive · 2015-07-25T10:26:05Z

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

Use `ptr::copy_nonoverlapping` (aka memcpy) to load a u64 from the byte stream. This is correct for any alignment, and the compiler will use the appropriate instruction to load the data.

Also use unchecked indexing. Together these changes give a large improvement in throughput (hashed bytes per second) for long data. The maximum improvement benchmarks at a 70% increase in throughput for large values (> 256 bytes), but values of 16 bytes or larger already improve.

Unchecked indexing is introduced to reach the best possible throughput: using `ptr::copy_nonoverlapping` without unchecked indexing would land the improvement some 20-30 percentage points lower. A debug assertion lets the test suite check our use of unchecked indexing.
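
The combination of unchecked indexing and a debug assertion can be sketched as follows. This is an illustration under assumptions, not the PR's code: `sum_whole_words` is a hypothetical stand-in for the hasher's inner loop, and the wrapping sum exists only so the loop has an observable result.

```rust
use std::ptr;

/// Visit every whole 8-byte word of `msg`, loading each one without a
/// bounds check. Trailing bytes (fewer than 8) are ignored in this sketch.
fn sum_whole_words(msg: &[u8]) -> u64 {
    let end = msg.len() - msg.len() % 8; // largest multiple of 8 <= len
    let mut acc = 0u64;
    let mut i = 0;
    while i < end {
        // Compiled out in release builds; in debug builds (for example
        // under the test suite) it catches a wrong index computation.
        debug_assert!(i + 8 <= msg.len());
        let mut word = 0u64;
        unsafe {
            // SAFETY: `i` is a multiple of 8 and `i < end <= msg.len()`,
            // so the 8 bytes starting at `msg[i]` are in bounds.
            ptr::copy_nonoverlapping(
                msg.get_unchecked(i) as *const u8,
                &mut word as *mut u64 as *mut u8,
                8,
            );
        }
        acc = acc.wrapping_add(u64::from_le(word));
        i += 8;
    }
    acc
}
```

The design trade-off is that the release build carries no per-word bounds check, so correctness of the index arithmetic rests on the loop invariant and on the assertion being exercised in debug builds.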


alexcrichton · 2015-07-27T17:22:13Z

@bors: r+ 27c44ce

bluss · 2015-07-27T20:23:54Z

Thank you!

bors · 2015-07-28T05:38:54Z

⌛ Testing commit 27c44ce with merge ff6c6ce...


bors · 2015-07-28T07:14:53Z

☀️ Test successful - auto-linux-32-nopt-t, auto-linux-32-opt, auto-linux-64-nopt-t, auto-linux-64-opt, auto-linux-64-x-android-t, auto-mac-32-opt, auto-mac-64-nopt-t, auto-mac-64-opt, auto-win-gnu-32-nopt-t, auto-win-gnu-32-opt, auto-win-gnu-64-nopt-t, auto-win-gnu-64-opt, auto-win-msvc-32-opt, auto-win-msvc-64-opt

brson · 2015-08-03T21:42:06Z

Nice wins.

arthurprs · 2015-08-04T01:51:47Z

Awesome! Thanks!

rust-highfive assigned aturon Jul 25, 2015

bluss added 4 commits July 25, 2015 12:26

siphash: Add more benchmarks

381d2ed

siphash: Remove one variable

5f6a61e

Without this temporary variable, codegen improves slightly and fewer registers are spilled to the stack in SipHash::write.

siphash: Reorder hash state in the struct

27c44ce

If the state fields are ordered v0, v2, v1, v3, the compiler can find a few SIMD optimizations itself. The one new optimization I could observe on x86-64 was the use of 128-bit registers for the v = key ^ constant operations in new / reset.
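
A sketch of the reordered state and the initialization it affects. `SipState` and `new_with_keys` are hypothetical names covering only the fields relevant here, and `#[repr(C)]` is added as an assumption to pin the declared field order, since current Rust may otherwise reorder struct fields. The four constants are the standard SipHash initialization values. Whether a given compiler actually vectorizes `reset` is not shown by this code.

```rust
/// Minimal stand-in for the hasher state: keys plus the four state words.
#[repr(C)]
#[derive(Debug, Default)]
struct SipState {
    k0: u64,
    k1: u64,
    // v0 and v2 both derive from k0, v1 and v3 both derive from k1.
    // Storing them as v0, v2, v1, v3 places each same-key pair in
    // adjacent memory, so each pair can be written with one 128-bit store.
    v0: u64,
    v2: u64,
    v1: u64,
    v3: u64,
}

impl SipState {
    fn new_with_keys(k0: u64, k1: u64) -> SipState {
        let mut s = SipState { k0, k1, ..Default::default() };
        s.reset();
        s
    }

    /// The `v = key ^ constant` operations mentioned above.
    fn reset(&mut self) {
        self.v0 = self.k0 ^ 0x736f6d6570736575;
        self.v1 = self.k1 ^ 0x646f72616e646f6d;
        self.v2 = self.k0 ^ 0x6c7967656e657261;
        self.v3 = self.k1 ^ 0x7465646279746573;
    }
}
```

The field order changes only the memory layout; the values computed by `reset` are the same as with the natural v0, v1, v2, v3 order.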

bluss force-pushed the siphash-perf branch from 20422f5 to 27c44ce July 25, 2015 10:26

bors merged commit 27c44ce into rust-lang:master Jul 28, 2015

bluss deleted the siphash-perf branch July 28, 2015 15:06

brson added the relnotes label Aug 3, 2015

bluss mentioned this pull request Aug 13, 2015

Mind alignment shepmaster/twox-hash#3

Closed
