
Add next_bool method to RngCore and counter levels to BlockRng #1031

Closed · wanted to merge 3 commits

Conversation

@newpavlov (Member):

This approach significantly simplifies the code, adds efficient bit generation (see #1014), and should have only a minor impact on performance (although I haven't measured it yet).

@newpavlov (Member, Author) commented Aug 30, 2020

Hm, unfortunately I see a measurable performance regression...

PR branch:

test gen_bytes_chacha12      ... bench:     499,128 ns/iter (+/- 79,350) = 2051 MB/s
test gen_bytes_chacha20      ... bench:     719,247 ns/iter (+/- 18,680) = 1423 MB/s
test gen_bytes_chacha8       ... bench:     383,025 ns/iter (+/- 5,268) = 2673 MB/s
test gen_u32_chacha12        ... bench:       4,832 ns/iter (+/- 614) = 827 MB/s
test gen_u32_chacha20        ... bench:       6,142 ns/iter (+/- 47) = 651 MB/s
test gen_u32_chacha8         ... bench:       4,825 ns/iter (+/- 70) = 829 MB/s
test gen_u64_chacha12        ... bench:       6,701 ns/iter (+/- 56) = 1193 MB/s
test gen_u64_chacha20        ... bench:       8,879 ns/iter (+/- 39) = 901 MB/s
test gen_u64_chacha8         ... bench:       6,243 ns/iter (+/- 41) = 1281 MB/s

Master branch:

test gen_bytes_chacha12      ... bench:     448,153 ns/iter (+/- 9,435) = 2284 MB/s
test gen_bytes_chacha20      ... bench:     676,809 ns/iter (+/- 15,727) = 1512 MB/s
test gen_bytes_chacha8       ... bench:     342,120 ns/iter (+/- 4,284) = 2993 MB/s
test gen_u32_chacha12        ... bench:       2,039 ns/iter (+/- 46) = 1961 MB/s
test gen_u32_chacha20        ... bench:       2,930 ns/iter (+/- 43) = 1365 MB/s
test gen_u32_chacha8         ... bench:       1,707 ns/iter (+/- 34) = 2343 MB/s
test gen_u64_chacha12        ... bench:       3,465 ns/iter (+/- 47) = 2308 MB/s
test gen_u64_chacha20        ... bench:       6,291 ns/iter (+/- 101) = 1271 MB/s
test gen_u64_chacha8         ... bench:       3,631 ns/iter (+/- 36) = 2203 MB/s

I will try to look into this, but preliminary results are not great.

@@ -258,8 +254,7 @@ macro_rules! chacha_impl {

 impl PartialEq<$ChaChaXRng> for $ChaChaXRng {
     fn eq(&self, rhs: &$ChaChaXRng) -> bool {
-        self.rng.core.state.stream64_eq(&rhs.rng.core.state)
-            && self.get_word_pos() == rhs.get_word_pos()
+        self.rng.eq(&rhs.rng)
@vks (Collaborator):
I'm not sure this is correct.

Collaborator:

Good catch @vks.

@newpavlov There is a reason BlockRng doesn't derive PartialEq. (This probably deserves a comment in the source, because it's subtle.) Two BlockRngs can be logically equivalent without being bitwise equivalent: if their buffers correspond to different positions in the stream (equivalently, their underlying RNGs have different block counters), and the indexes into their buffers differ in a way that offsets this, the two states produce identical output. This situation happens with seekable RNGs like ChaCha.
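A minimal sketch of that offsetting situation (field and constant names are invented for illustration, not rand_core's actual layout), assuming 16 output words per block as in ChaCha:

```rust
// Two BlockRng-like states that are bitwise different yet logically equal:
// the differences in block counter and buffer index cancel out.
const WORDS_PER_BLOCK: u64 = 16; // e.g. ChaCha yields 16 u32 words per block

struct State {
    blocks_generated: u64, // blocks the underlying core has produced so far
    index: u64,            // next unread word within the current buffer
}

// Absolute position in the word stream of the next value to be returned;
// `index` is relative to the most recently produced block.
fn word_pos(s: &State) -> u64 {
    (s.blocks_generated - 1) * WORDS_PER_BLOCK + s.index
}

fn main() {
    // `a` generated one block and consumed all 16 words; `b` generated two
    // blocks and consumed none of the second. Both will output word 16 next.
    let a = State { blocks_generated: 1, index: 16 };
    let b = State { blocks_generated: 2, index: 0 };
    assert_eq!(word_pos(&a), word_pos(&b));
    // A derived, field-wise PartialEq would nonetheless report `a != b`.
}
```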

@vks (Collaborator) commented Sep 3, 2020

Another option might be to have a bit counter for gen_bool only.

@dhardy (Member) commented Sep 3, 2020

> Another option might be to have a bit counter for gen_bool only.

Then these bits may be consumed out-of-order with other random values — perhaps acceptable, but more difficult to document the rules around reproducibility.

There is some potential advantage here, but I'm not convinced it's a good trade with complexity.

@newpavlov (Member, Author):

> Another option might be to have a bit counter for gen_bool only.

I guess we could preemptively advance the index if at least one bit was used; we would trade additional work in gen_bool for more efficiency in the other methods. Another option is to use an enum to indicate the level of the counter, i.e. bit, byte, u32, or u64. With this approach we would have to recompute the counter whenever it's on a different level, but since users usually don't mix methods, branch prediction should be quite efficient here.
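A rough sketch of that counter-levels idea (all names are hypothetical, not the PR's actual code):

```rust
// The buffer index is interpreted at one granularity and converted on
// demand when a request of a different width arrives. Rounding up ensures
// a partially consumed word is discarded rather than reused.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Level {
    Bit,  // index counts bits
    Byte, // index counts bytes
    U32,  // index counts 32-bit words
    U64,  // index counts 64-bit words
}

fn convert_to_u32(index: usize, level: Level) -> usize {
    match level {
        Level::Bit => (index + 31) / 32,
        Level::Byte => (index + 3) / 4,
        Level::U32 => index, // happy path: no conversion needed
        Level::U64 => index * 2,
    }
}

fn main() {
    // Consuming 3 bits spends the whole first u32 once we switch levels.
    assert_eq!(convert_to_u32(3, Level::Bit), 1);
    // Two consumed u64 words correspond to four u32 words.
    assert_eq!(convert_to_u32(2, Level::U64), 4);
}
```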

> I'm not convinced it's a good trade with complexity.

I think this approach significantly simplifies the code, especially the fill_bytes part. We would also be able to remove the fill_from_chunks functions and merge the u32 and u64 implementations into a single one.

@newpavlov (Member, Author) commented Oct 19, 2020

The index-level approach demonstrates better performance, but the degradation for next_u32 is still quite big, though fill_bytes performance has improved a bit compared to master.

HC128:

// master
test gen_bytes_hc128         ... bench:     444,387 ns/iter (+/- 7,614) = 2304 MB/s
test gen_u32_hc128           ... bench:       1,751 ns/iter (+/- 23) = 2284 MB/s
test gen_u64_hc128           ... bench:       3,668 ns/iter (+/- 633) = 2181 MB/s
// bit_counter
test gen_bytes_hc128         ... bench:     430,221 ns/iter (+/- 61,055) = 2380 MB/s
test gen_u32_hc128           ... bench:       2,913 ns/iter (+/- 67) = 1373 MB/s
test gen_u64_hc128           ... bench:       3,782 ns/iter (+/- 538) = 2115 MB/s

ChaCha8:

// master
test gen_bytes_chacha8       ... bench:     336,167 ns/iter (+/- 74,806) = 3046 MB/s
test gen_u32_chacha8         ... bench:       1,599 ns/iter (+/- 172) = 2501 MB/s
test gen_u64_chacha8         ... bench:       3,619 ns/iter (+/- 62) = 2210 MB/s
// bit_counter
test gen_bytes_chacha8       ... bench:     322,722 ns/iter (+/- 7,060) = 3173 MB/s
test gen_u32_chacha8         ... bench:       3,053 ns/iter (+/- 361) = 1310 MB/s
test gen_u64_chacha8         ... bench:       4,135 ns/iter (+/- 79) = 1934 MB/s

ChaCha20:

// master
test gen_bytes_chacha20      ... bench:     675,914 ns/iter (+/- 12,027) = 1514 MB/s
test gen_u32_chacha20        ... bench:       2,862 ns/iter (+/- 69) = 1397 MB/s
test gen_u64_chacha20        ... bench:       6,196 ns/iter (+/- 104) = 1291 MB/s
// bit_counter
test gen_bytes_chacha20      ... bench:     651,339 ns/iter (+/- 13,116) = 1572 MB/s
test gen_u32_chacha20        ... bench:       4,294 ns/iter (+/- 76) = 931 MB/s
test gen_u64_chacha20        ... bench:       6,769 ns/iter (+/- 211) = 1181 MB/s

Performance can be improved a bit further by using the unstable core::intrinsics::likely intrinsic.
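For reference, a nightly-only sketch of that hint (a simplified index/buffer pair stands in for BlockRng's fields):

```rust
#![feature(core_intrinsics)]

// Hint to the optimizer that the in-buffer fast path is the common case.
// `index`/`buffer` are simplified stand-ins for BlockRng's fields.
fn try_next_u32(index: &mut usize, buffer: &[u32]) -> Option<u32> {
    if core::intrinsics::likely(*index < buffer.len()) {
        let v = buffer[*index];
        *index += 1;
        Some(v)
    } else {
        None // the caller would regenerate the block and retry
    }
}

fn main() {
    let buffer = [7u32; 4];
    let mut index = 0;
    assert_eq!(try_next_u32(&mut index, &buffer), Some(7));
}
```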

I am not sure why we get such degradation for next_u32. In the happy path (i.e. when the level is equal to U32), self.level.convert takes only a few instructions and one table-based jump, which should not result in a 1100-1400 ns delay.

UPD: One bench iteration includes 1000 calls to next_u32, so the overhead of this approach is 1.1-1.4 ns, or 4-5 cycles, per call. It does not look like it will be possible to reduce it significantly, so we should decide whether such overhead is acceptable or not. This cost could be amortized by introducing methods like fill_u32(&mut self, buf: &mut [u32]), but it's unclear whether they would be useful in practice.

@newpavlov changed the title from "Bit-level counter for BlockRng" to "Add next_bool method to RngCore and counter levels to BlockRng" on Oct 20, 2020
@dhardy (Member) commented Oct 20, 2020

> This cost could be amortized by introducing methods like fill_u32(&mut self, buf: &mut [u32])

We already have a Fill trait to fill int arrays via try_fill_bytes, so I don't think this adds anything.

I'm not sure what to say about this approach. IIUC it doesn't affect non-block PRNGs much at all, so for example someone using u32 output of Xoshiro256++ is not going to be much affected, and judging by the above benchmarks f64 generation is also not going to be much affected.

@newpavlov (Member, Author):

> We already have a Fill trait to fill int arrays via try_fill_bytes, so I don't think this adds anything.

fill_u32 could be a tiny bit more efficient because it enforces 4-byte-aligned loads, while an implementation on top of fill_bytes has to work at byte granularity.
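The shape of the proposed method, as a hedged sketch (fill_u32 is hypothetical, not part of RngCore; the naive loop only shows the API, while a BlockRng-based implementation would copy whole words out of its internal results buffer):

```rust
use rand_core::RngCore;

// Hypothetical helper illustrating the proposed API; not part of RngCore.
// A BlockRng-based implementation could copy whole u32 words from its
// internal buffer instead of looping, keeping every load 4-byte aligned.
fn fill_u32<R: RngCore + ?Sized>(rng: &mut R, buf: &mut [u32]) {
    for word in buf.iter_mut() {
        *word = rng.next_u32();
    }
}
```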

> IIUC it doesn't affect non-block PRNGs much at all

Yes, it only changes the behavior of types built on top of BlockRng.

BTW why do we define a newtype wrapper for ChaCha and HC-128 instead of using a simple type alias?

@dhardy (Member) commented Oct 20, 2020

> BTW why do we define a newtype wrapper for ChaCha and HC-128 instead of using a simple type alias?

IIRC it was simply to avoid exposing the implementation and thus avoiding breaking changes if we changed it.

Any idea why u32 perf is hit so much more than u64? I tried removing the U1 level; it didn't make much difference. Removing U8, however, regains a lot of the performance (85-95% of the master branch for u32, and 115-130% for u64).

@newpavlov (Member, Author) commented Oct 20, 2020

> IIRC it was simply to avoid exposing the implementation and thus avoiding breaking changes if we changed it.

But BlockRng is part of our public API, so a breaking change in rand_core would result in breaking releases of the RNG crates as well. I guess the newtype wrapper is useful for additional methods (e.g. the seek methods in ChaCha), but I think seeking functionality should be exposed via a trait, not via inherent methods.

> Any idea why u32 perf is hit so much more than u64?

No idea at the moment. After re-running the benchmarks I get ~1.4 ns overhead for next_u32 and ~0.5 ns for next_u64. It's a strange result, because the happy path for both convert functions is effectively the same. The fact that generate is called twice as often for next_u64 should be irrelevant, since in theory the overhead is a constant per call, so the measured mean time should rise by exactly that constant.

Emulating next_u32 and next_u64 does not change the resulting assembly in any significant way.

Could be quirks of branch prediction?

> Removing U8, however, regains a lot of the performance (85-95% of the master branch for u32, and 115-130% for u64).

I can't reproduce this result. Are you sure you have removed it correctly?

@dhardy (Member) commented Oct 21, 2020

> but I think seeking functionality should be exposed via a trait, not via inherent methods.

If we have a uniform API for it — but we don't (beyond the three ChaCha generators).

> I can't reproduce this result. Are you sure you have removed it correctly?

It was a hack, leaving todo!() in place of try_fill_bytes.

@newpavlov (Member, Author):

> If we have a uniform API for it — but we don't (beyond the three ChaCha generators).

Every stream cipher (and they are often seekable) can be used as an RNG, and we have the SyncStreamCipherSeek trait to describe such a capability. But this is an off-topic discussion, so I may open a separate issue for it later.

> It was a hack, leaving todo!() in place of try_fill_bytes.

Hm, I still can't reproduce it. Maybe you forgot to remove some U8 branches, so U8 got interpreted by the compiler as a variable binding, making it a catch-all branch? It may also depend on the CPU.

One hypothesis for why we see such a difference between next_u32 and next_u64 is that the CPU for some reason usually mispredicts the Some(v)/None branch, so for next_u64, which calls generate twice as often, this mistake results in a smaller relative overhead. But sprinkling in the likely intrinsic does not significantly improve the situation.

If the current overhead is acceptable to you, I will fix the remaining issues (PartialEq and the commented-out ChaCha code) and mark this PR as "ready for review".

@dhardy (Member) commented Oct 21, 2020

The overhead may be acceptable but I'm still undecided on this PR. (In a way it just looks like nicer code.)

@vks @kazcw opinions?

@dhardy (Member) commented Dec 2, 2020

> // master
> test gen_bytes_chacha8       ... bench:     336,167 ns/iter (+/- 74,806) = 3046 MB/s
> test gen_u32_chacha8         ... bench:       1,599 ns/iter (+/- 172) = 2501 MB/s
> test gen_u64_chacha8         ... bench:       3,619 ns/iter (+/- 62) = 2210 MB/s
> // bit_counter
> test gen_bytes_chacha8       ... bench:     322,722 ns/iter (+/- 7,060) = 3173 MB/s
> test gen_u32_chacha8         ... bench:       3,053 ns/iter (+/- 361) = 1310 MB/s
> test gen_u64_chacha8         ... bench:       4,135 ns/iter (+/- 79) = 1934 MB/s

This is IMO quite a big perf hit (u32, but even u64 is affected). If we can't mitigate it, then it's probably already enough to decide against this PR. But before we do that, are there alternatives or other parts of the PR worth keeping?

@vks (Collaborator) commented Dec 2, 2020

An alternative would be to use the bit counter for next_bool only. This puts a higher burden on the user not to mix it with the other methods, because mixing may result in reused bits.

Another alternative is to provide a second BlockRng, so users can opt into the performance trade-off.
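A sketch of the first alternative (BitCache is a hypothetical type, not code from this PR), showing both the mechanism and the hazard: the cached bits are consumed out of order relative to the generator's other output.

```rust
use rand_core::RngCore;

// Hypothetical bit cache used only by next_bool, refilled from next_u32.
// Mixing this with direct next_u32 calls changes which bits each method
// sees, which is the reproducibility burden mentioned above.
#[derive(Default)]
struct BitCache {
    bits: u32,
    len: u32, // number of valid bits remaining in `bits`
}

impl BitCache {
    fn next_bool<R: RngCore + ?Sized>(&mut self, rng: &mut R) -> bool {
        if self.len == 0 {
            self.bits = rng.next_u32();
            self.len = 32;
        }
        let b = (self.bits & 1) == 1;
        self.bits >>= 1;
        self.len -= 1;
        b
    }
}
```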

@dhardy (Member) commented Dec 2, 2020

> An alternative would be to use the bit counter for next_bool only. This puts a higher burden on the user not to mix it with the other methods, because mixing may result in reused bits.

To keep generated bits in-order (rather than a separate "bit buffer") requires checking the bit counter anyway, and I don't like the idea of losing this constraint.

> Another alternative is to provide a second BlockRng, so users can opt into the performance trade-off.

Too much complexity IMO. (I mean if users really want their own optimal block RNG for their use-case, there's no reason they can't do that either way. But that doesn't belong in rand_core or even rand IMO.)

@dhardy (Member) commented Sep 13, 2021

Wow, this PR is now a year old! It also conflicts with several other recent changes, and we never did solve that performance regression. @newpavlov do you still think there is significant merit in pursuing this, or shall we abandon it?

@newpavlov (Member, Author):

I think we can close this PR. I still think this approach is worth exploring, but I guess it will be better to start fresh than to update this PR.

@newpavlov closed this on Sep 13, 2021.
@newpavlov deleted the bit_counter branch on Sep 13, 2021.