Vectorized xoshiro256++ #3

vks · 2019-11-06T21:09:28Z

There is some discussion on how to vectorize xoshiro256++ at JuliaLang/julia#27614. The method relies on interleaving 4 xoshiro256++ generators. I implemented it and the results are impressive (see gen_bytes_fill):

test gen_bytes_chacha12           ... bench:     324,494 ns/iter (+/- 4,236) = 3155 MB/s
test gen_bytes_chacha20           ... bench:     490,442 ns/iter (+/- 16,214) = 2087 MB/s
test gen_bytes_chacha8            ... bench:     243,010 ns/iter (+/- 19,972) = 4213 MB/s
test gen_bytes_fill               ... bench:     105,350 ns/iter (+/- 1,456) = 9719 MB/s
test gen_bytes_pcg64mcg           ... bench:     321,665 ns/iter (+/- 7,854) = 3183 MB/s
test gen_bytes_splitmix64         ... bench:     233,973 ns/iter (+/- 1,859) = 4376 MB/s
test gen_bytes_xoshiro256plusplus ... bench:     343,911 ns/iter (+/- 6,580) = 2977 MB/s

The implementation is 3.3 time faster than the non-vectorized xoshiro256++ generator and more than 2.2 times faster than splitmix64 or chacha8. It is also faster than dSFMT. However, the size of the state is blown up to 128 bytes, which is almost as large as chacha's state (136 bytes).

The text was updated successfully, but these errors were encountered:

dhardy · 2019-11-07T09:11:00Z

Impressive performance taking advantage of how "wide" modern CPUs are — but whether this is useful in practice is another question. For ciphers it does sometimes make sense to focus on "byte fill" performance, but is this useful for general-purpose random number generation, or some specific application?

vks · 2019-11-07T12:28:48Z

I think in practice it replaces dSFMT, which used to be faster. All applications requiring several random numbers at a time benefit. ~~For general-purpose number generation, using this with rand_core::block::BlockRng64 is probably faster than the non-SIMD version as well.~~ It's actually slower by a factor two.

Personally, I think ChaCha8 is a better choice. The performance is similar and the cryptographic strength makes it usable for more general purposes.

dhardy · 2019-11-07T13:05:14Z

It's actually slower by a factor two.

I imagine it's going to depend on the actual application significantly. You're using micro-benchmarks here? I actually think for most applications (even ones heavily using RNGs) the speed of the RNG is not very relevant.

vks · 2019-11-07T13:18:03Z

I was referring to the gen_u64_* benchmarks in this repository. I agree that it is not a very relevant benchmark for most applications.

vigna · 2019-11-08T10:24:22Z

Wow, really nice! Note that (as I mentioned in the Julia discussion) the dSFMT returns only 52 significant digits, thus restricting the values in [0..1) to half of the possible values (IEEE has 53 digits precision as one bit is implied).

I also think that the Julia guys found that an 8-fold unroll is even faster. But of course you're using a lot of space for copies (i.e., you are not getting a longer period, just faster generation).

And yes, in 90% applications the speed of the PRNG is irrelevant.

I'm not particularly akin to block generation, but this is what is done in high-performance scientific computing. The Intel Math Kernel library, for example, provides only block generation. If you try to generate one value at a time, the performance is abysmal, but it is unbeatable (high vectorization) for large amounts.

dhardy mentioned this issue Nov 8, 2019

PCG 128-bit generators are easy to predict rust-random/rand#905

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vectorized xoshiro256++ #3

Vectorized xoshiro256++ #3

vks commented Nov 6, 2019

dhardy commented Nov 7, 2019

vks commented Nov 7, 2019 •

edited

Loading

dhardy commented Nov 7, 2019 •

edited

Loading

vks commented Nov 7, 2019 •

edited

Loading

vigna commented Nov 8, 2019

Vectorized xoshiro256++ #3

Vectorized xoshiro256++ #3

Comments

vks commented Nov 6, 2019

dhardy commented Nov 7, 2019

vks commented Nov 7, 2019 • edited Loading

dhardy commented Nov 7, 2019 • edited Loading

vks commented Nov 7, 2019 • edited Loading

vigna commented Nov 8, 2019

vks commented Nov 7, 2019 •

edited

Loading

dhardy commented Nov 7, 2019 •

edited

Loading

vks commented Nov 7, 2019 •

edited

Loading