
Add next_bool method to RngCore and counter levels to BlockRng #1031

Closed · wanted to merge 3 commits

Conversation

@newpavlov (Member):

This approach significantly simplifies the code, adds efficient bit generation (see #1014), and should have only a minor impact on performance (although I haven't measured it yet).

@newpavlov (Member, Author) commented Aug 30, 2020

Hm, unfortunately I see a measurable performance regression...

PR branch:

test gen_bytes_chacha12      ... bench:     499,128 ns/iter (+/- 79,350) = 2051 MB/s
test gen_bytes_chacha20      ... bench:     719,247 ns/iter (+/- 18,680) = 1423 MB/s
test gen_bytes_chacha8       ... bench:     383,025 ns/iter (+/- 5,268) = 2673 MB/s
test gen_u32_chacha12        ... bench:       4,832 ns/iter (+/- 614) = 827 MB/s
test gen_u32_chacha20        ... bench:       6,142 ns/iter (+/- 47) = 651 MB/s
test gen_u32_chacha8         ... bench:       4,825 ns/iter (+/- 70) = 829 MB/s
test gen_u64_chacha12        ... bench:       6,701 ns/iter (+/- 56) = 1193 MB/s
test gen_u64_chacha20        ... bench:       8,879 ns/iter (+/- 39) = 901 MB/s
test gen_u64_chacha8         ... bench:       6,243 ns/iter (+/- 41) = 1281 MB/s

Master branch:

test gen_bytes_chacha12      ... bench:     448,153 ns/iter (+/- 9,435) = 2284 MB/s
test gen_bytes_chacha20      ... bench:     676,809 ns/iter (+/- 15,727) = 1512 MB/s
test gen_bytes_chacha8       ... bench:     342,120 ns/iter (+/- 4,284) = 2993 MB/s
test gen_u32_chacha12        ... bench:       2,039 ns/iter (+/- 46) = 1961 MB/s
test gen_u32_chacha20        ... bench:       2,930 ns/iter (+/- 43) = 1365 MB/s
test gen_u32_chacha8         ... bench:       1,707 ns/iter (+/- 34) = 2343 MB/s
test gen_u64_chacha12        ... bench:       3,465 ns/iter (+/- 47) = 2308 MB/s
test gen_u64_chacha20        ... bench:       6,291 ns/iter (+/- 101) = 1271 MB/s
test gen_u64_chacha8         ... bench:       3,631 ns/iter (+/- 36) = 2203 MB/s

I will try to look into this, but preliminary results are not great.

@@ -258,8 +254,7 @@ macro_rules! chacha_impl {

 impl PartialEq<$ChaChaXRng> for $ChaChaXRng {
     fn eq(&self, rhs: &$ChaChaXRng) -> bool {
-        self.rng.core.state.stream64_eq(&rhs.rng.core.state)
-            && self.get_word_pos() == rhs.get_word_pos()
+        self.rng.eq(&rhs.rng)
@vks (Collaborator):
I'm not sure this is correct.

Collaborator:

Good catch @vks.

@newpavlov There is a reason BlockRng doesn't derive PartialEq. (This probably deserves a comment in the source, because it's subtle.) Two BlockRngs can be logically equivalent without being bitwise equivalent: if their buffers correspond to different positions in the stream (equivalently, their underlying RNGs have different block counters), and the indexes into their buffers differ in a way that offsets this, the two states produce identical output. This situation happens with seekable RNGs like ChaCha.
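A minimal sketch of that offsetting situation (field and constant names are invented for illustration, not rand_core's actual layout), assuming 16 output words per block as in ChaCha:

```rust
// Two BlockRng-like states that are bitwise different yet logically equal:
// the differences in block counter and buffer index cancel out.
const WORDS_PER_BLOCK: u64 = 16; // e.g. ChaCha yields 16 u32 words per block

struct State {
    blocks_generated: u64, // blocks the underlying core has produced so far
    index: u64,            // next unread word within the current buffer
}

// Absolute position in the word stream of the next value to be returned;
// `index` is relative to the most recently produced block.
fn word_pos(s: &State) -> u64 {
    (s.blocks_generated - 1) * WORDS_PER_BLOCK + s.index
}

fn main() {
    // `a` generated one block and consumed all 16 words; `b` generated two
    // blocks and consumed none of the second. Both will output word 16 next.
    let a = State { blocks_generated: 1, index: 16 };
    let b = State { blocks_generated: 2, index: 0 };
    assert_eq!(word_pos(&a), word_pos(&b));
    // A derived, field-wise PartialEq would nonetheless report `a != b`.
}
```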

@vks (Collaborator) commented Sep 3, 2020

Another option might be to have a bit counter for gen_bool only.

@dhardy (Member) commented Sep 3, 2020

> Another option might be to have a bit counter for gen_bool only.

Then these bits may be consumed out-of-order with other random values — perhaps acceptable, but more difficult to document the rules around reproducibility.

There is some potential advantage here, but I'm not convinced it's a good trade with complexity.

@newpavlov (Member, Author):

> Another option might be to have a bit counter for gen_bool only.

I guess we could preemptively advance the index if at least one bit was used; we would trade additional work in gen_bool for more efficiency in the other methods. Another option is to use an enum to indicate the level of the counter, i.e. bit, byte, u32, or u64. With this approach we would have to recompute the counter whenever it's on a different level, but since users usually don't mix methods, branch prediction should be quite efficient here.
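A rough sketch of that counter-levels idea (all names are hypothetical, not the PR's actual code):

```rust
// The buffer index is interpreted at one granularity and converted on
// demand when a request of a different width arrives. Rounding up ensures
// a partially consumed word is discarded rather than reused.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Level {
    Bit,  // index counts bits
    Byte, // index counts bytes
    U32,  // index counts 32-bit words
    U64,  // index counts 64-bit words
}

fn convert_to_u32(index: usize, level: Level) -> usize {
    match level {
        Level::Bit => (index + 31) / 32,
        Level::Byte => (index + 3) / 4,
        Level::U32 => index, // happy path: no conversion needed
        Level::U64 => index * 2,
    }
}

fn main() {
    // Consuming 3 bits spends the whole first u32 once we switch levels.
    assert_eq!(convert_to_u32(3, Level::Bit), 1);
    // Two consumed u64 words correspond to four u32 words.
    assert_eq!(convert_to_u32(2, Level::U64), 4);
}
```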

> I'm not convinced it's a good trade with complexity.

I think this approach significantly simplifies the code, especially the fill_bytes part. We would also be able to remove the fill_from_chunks functions and merge the u32 and u64 implementations into a single one.

@newpavlov (Member, Author) commented Oct 19, 2020

The index-level approach demonstrates better performance, but the degradation for next_u32 is still quite big, though fill_bytes performance has improved a bit compared to master.

HC128:

// master
test gen_bytes_hc128         ... bench:     444,387 ns/iter (+/- 7,614) = 2304 MB/s
test gen_u32_hc128           ... bench:       1,751 ns/iter (+/- 23) = 2284 MB/s
test gen_u64_hc128           ... bench:       3,668 ns/iter (+/- 633) = 2181 MB/s
// bit_counter
test gen_bytes_hc128         ... bench:     430,221 ns/iter (+/- 61,055) = 2380 MB/s
test gen_u32_hc128           ... bench:       2,913 ns/iter (+/- 67) = 1373 MB/s
test gen_u64_hc128           ... bench:       3,782 ns/iter (+/- 538) = 2115 MB/s

ChaCha8:

// master
test gen_bytes_chacha8       ... bench:     336,167 ns/iter (+/- 74,806) = 3046 MB/s
test gen_u32_chacha8         ... bench:       1,599 ns/iter (+/- 172) = 2501 MB/s
test gen_u64_chacha8         ... bench:       3,619 ns/iter (+/- 62) = 2210 MB/s
// bit_counter
test gen_bytes_chacha8       ... bench:     322,722 ns/iter (+/- 7,060) = 3173 MB/s
test gen_u32_chacha8         ... bench:       3,053 ns/iter (+/- 361) = 1310 MB/s
test gen_u64_chacha8         ... bench:       4,135 ns/iter (+/- 79) = 1934 MB/s

ChaCha20:

// master
test gen_bytes_chacha20      ... bench:     675,914 ns/iter (+/- 12,027) = 1514 MB/s
test gen_u32_chacha20        ... bench:       2,862 ns/iter (+/- 69) = 1397 MB/s
test gen_u64_chacha20        ... bench:       6,196 ns/iter (+/- 104) = 1291 MB/s
// bit_counter
test gen_bytes_chacha20      ... bench:     651,339 ns/iter (+/- 13,116) = 1572 MB/s
test gen_u32_chacha20        ... bench:       4,294 ns/iter (+/- 76) = 931 MB/s
test gen_u64_chacha20        ... bench:       6,769 ns/iter (+/- 211) = 1181 MB/s

Performance can be improved a bit further by using the unstable core::intrinsics::likely intrinsic.
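For reference, a nightly-only sketch of that hint (a simplified index/buffer pair stands in for BlockRng's fields):

```rust
#![feature(core_intrinsics)]

// Hint to the optimizer that the in-buffer fast path is the common case.
// `index`/`buffer` are simplified stand-ins for BlockRng's fields.
fn try_next_u32(index: &mut usize, buffer: &[u32]) -> Option<u32> {
    if core::intrinsics::likely(*index < buffer.len()) {
        let v = buffer[*index];
        *index += 1;
        Some(v)
    } else {
        None // the caller would regenerate the block and retry
    }
}

fn main() {
    let buffer = [7u32; 4];
    let mut index = 0;
    assert_eq!(try_next_u32(&mut index, &buffer), Some(7));
}
```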

I am not sure why we get such degradation for next_u32. In the happy path (i.e. when the level is equal to U32), self.level.convert takes only a few instructions and one table-based jump, which should not result in a 1100-1400 ns delay.

UPD: One bench iteration includes 1000 calls to next_u32, so the overhead of this approach is 1.1-1.4 ns, or 4-5 cycles, per call. It does not look like it will be possible to reduce it significantly, so we should decide whether such overhead is acceptable or not. This cost could be amortized by introducing methods like fill_u32(&mut self, buf: &mut [u32]), but it's unclear whether they would be useful in practice.

@newpavlov changed the title from "Bit-level counter for BlockRng" to "Add next_bool method to RngCore and counter levels to BlockRng" on Oct 20, 2020
@dhardy (Member) commented Oct 20, 2020

> This cost could be amortized by introducing methods like fill_u32(&mut self, buf: &mut [u32])

We already have a Fill trait to fill int arrays via try_fill_bytes, so I don't think this adds anything.

I'm not sure what to say about this approach. IIUC it doesn't affect non-block PRNGs much at all, so for example someone using u32 output of Xoshiro256++ is not going to be much affected, and judging by the above benchmarks f64 generation is also not going to be much affected.

@newpavlov (Member, Author):

> We already have a Fill trait to fill int arrays via try_fill_bytes, so I don't think this adds anything.

fill_u32 could be a tiny bit more efficient because it enforces 4-byte-aligned loads, while an implementation on top of fill_bytes has to work at byte granularity.
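The shape of the proposed method, as a hedged sketch (fill_u32 is hypothetical, not part of RngCore; the naive loop only shows the API, while a BlockRng-based implementation would copy whole words out of its internal results buffer):

```rust
use rand_core::RngCore;

// Hypothetical helper illustrating the proposed API; not part of RngCore.
// A BlockRng-based implementation could copy whole u32 words from its
// internal buffer instead of looping, keeping every load 4-byte aligned.
fn fill_u32<R: RngCore + ?Sized>(rng: &mut R, buf: &mut [u32]) {
    for word in buf.iter_mut() {
        *word = rng.next_u32();
    }
}
```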

> IIUC it doesn't affect non-block PRNGs much at all

Yes, it only changes the behavior of types built on top of BlockRng.

BTW why do we define a newtype wrapper for ChaCha and HC-128 instead of using a simple type alias?

@dhardy (Member) commented Oct 20, 2020

> BTW why do we define a newtype wrapper for ChaCha and HC-128 instead of using a simple type alias?

IIRC it was simply to avoid exposing the implementation and thus avoiding breaking changes if we changed it.

Any idea why u32 perf is hit so much more than u64? I tried removing the U1 level; it didn't make much difference. Removing U8, however, regains a lot of the performance (85-95% of the master branch for u32, and 115-130% for u64).

@newpavlov (Member, Author) commented Oct 20, 2020

> IIRC it was simply to avoid exposing the implementation and thus avoiding breaking changes if we changed it.

But BlockRng is part of our public API, so a breaking change in rand_core would result in breaking releases of the RNG crates as well. I guess the newtype wrapper is useful for additional methods (e.g. the seek methods in ChaCha), but I think seeking functionality should be exposed via a trait, not via inherent methods.

> Any idea why u32 perf is hit so much more than u64?

No idea at the moment. After re-running the benchmarks I get ~1.4 ns overhead for next_u32 and ~0.5 ns for next_u64. It's a strange result, because the happy path for both convert functions is effectively the same. The fact that generate is called twice as often for next_u64 should be irrelevant, since in theory the overhead is a constant per call, so the measured mean time should rise by exactly that constant.

Emulating next_u32 and next_u64 does not change the resulting assembly in any significant way.

Could be quirks of branch prediction?

> Removing U8, however, regains a lot of the performance (85-95% of the master branch for u32, and 115-130% for u64).

I can't reproduce this result. Are you sure you have removed it correctly?

@dhardy (Member) commented Oct 21, 2020

> but I think seeking functionality should be exposed via a trait, not via inherent methods.

If we have a uniform API for it — but we don't (beyond the three ChaCha generators).

> I can't reproduce this result. Are you sure you have removed it correctly?

It was a hack, leaving todo!() in place of try_fill_bytes.

@newpavlov (Member, Author):

> If we have a uniform API for it — but we don't (beyond the three ChaCha generators).

Every stream cipher (and they are often seekable) can be used as an RNG, and we have the SyncStreamCipherSeek trait to describe such a capability. But this is an off-topic discussion, so I may open a separate issue for it later.

> It was a hack, leaving todo!() in place of try_fill_bytes.

Hm, I still can't reproduce it. Maybe you forgot to remove some U8 branches, so U8 got interpreted by the compiler as a variable binding, making it a catch-all branch? It may also depend on the CPU.

One hypothesis for why we see such a difference between next_u32 and next_u64 is that the CPU for some reason usually mispredicts the Some(v)/None branch, so for next_u64, which calls generate twice as often, this mistake results in a smaller relative overhead. But sprinkling in the likely intrinsic does not significantly improve the situation.

If the current overhead is acceptable to you, I will fix the remaining issues (PartialEq and the commented-out ChaCha code) and mark this PR as "ready for review".

@dhardy (Member) commented Oct 21, 2020

The overhead may be acceptable but I'm still undecided on this PR. (In a way it just looks like nicer code.)

@vks @kazcw opinions?

@dhardy (Member) commented Dec 2, 2020

> // master
> test gen_bytes_chacha8       ... bench:     336,167 ns/iter (+/- 74,806) = 3046 MB/s
> test gen_u32_chacha8         ... bench:       1,599 ns/iter (+/- 172) = 2501 MB/s
> test gen_u64_chacha8         ... bench:       3,619 ns/iter (+/- 62) = 2210 MB/s
> // bit_counter
> test gen_bytes_chacha8       ... bench:     322,722 ns/iter (+/- 7,060) = 3173 MB/s
> test gen_u32_chacha8         ... bench:       3,053 ns/iter (+/- 361) = 1310 MB/s
> test gen_u64_chacha8         ... bench:       4,135 ns/iter (+/- 79) = 1934 MB/s

This is IMO quite a big perf hit (u32, but even u64 is affected). If we can't mitigate it, then it's probably already enough to decide against this PR. But before we do that, are there alternatives or other parts of the PR worth keeping?

@vks (Collaborator) commented Dec 2, 2020

An alternative would be to use the bit counter for next_bool only. This puts a higher burden on the user not to mix it with the other methods, because mixing may result in reused bits.

Another alternative is to provide a second BlockRng, so users can opt into the performance trade-off.
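A sketch of the first alternative (BitCache is a hypothetical type, not code from this PR), showing both the mechanism and the hazard: the cached bits are consumed out of order relative to the generator's other output.

```rust
use rand_core::RngCore;

// Hypothetical bit cache used only by next_bool, refilled from next_u32.
// Mixing this with direct next_u32 calls changes which bits each method
// sees, which is the reproducibility burden mentioned above.
#[derive(Default)]
struct BitCache {
    bits: u32,
    len: u32, // number of valid bits remaining in `bits`
}

impl BitCache {
    fn next_bool<R: RngCore + ?Sized>(&mut self, rng: &mut R) -> bool {
        if self.len == 0 {
            self.bits = rng.next_u32();
            self.len = 32;
        }
        let b = (self.bits & 1) == 1;
        self.bits >>= 1;
        self.len -= 1;
        b
    }
}
```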

@dhardy (Member) commented Dec 2, 2020

> An alternative would be to use the bit counter for next_bool only. This puts a higher burden on the user not to mix it with the other methods, because mixing may result in reused bits.

To keep generated bits in-order (rather than a separate "bit buffer") requires checking the bit counter anyway, and I don't like the idea of losing this constraint.

> Another alternative is to provide a second BlockRng, so users can opt into the performance trade-off.

Too much complexity IMO. (I mean if users really want their own optimal block RNG for their use-case, there's no reason they can't do that either way. But that doesn't belong in rand_core or even rand IMO.)

@dhardy (Member) commented Sep 13, 2021

Wow, this PR is now a year old! It also conflicts with several other recent changes, and we never did solve that performance regression. @newpavlov do you still think there is significant merit in pursuing this, or shall we abandon it?

@newpavlov (Member, Author):

I think we can close this PR. I still think this approach is worth exploring, but I guess it will be better to start fresh than to update this PR.

@newpavlov closed this on Sep 13, 2021.
@newpavlov deleted the bit_counter branch on Sep 13, 2021.