Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rand_chacha and reducing dependencies #872

Closed
dhardy opened this issue Aug 28, 2019 · 26 comments · Fixed by #931
Closed

rand_chacha and reducing dependencies #872

dhardy opened this issue Aug 28, 2019 · 26 comments · Fixed by #931
Labels
E-question Participation: opinions wanted
Milestone

Comments

@dhardy
Copy link
Member

dhardy commented Aug 28, 2019

This is basically forked from #850: a request to reduce the number of dependencies. This issue is opened for discussion since previously it did not have much. Also see: #667, #789.

As can be seen, this crate has several dependencies of its own, and is also the only required dependency of rand (besides getrandom, rand_core and on some platforms libc):

    ├── rand_chacha v0.2.1
    │   ├── c2-chacha v0.2.2
    │   │   ├── lazy_static v1.4.0
    │   │   └── ppv-lite86 v0.2.5
    │   └── rand_core v0.5.0
    │       └── getrandom v0.1.11 (*)

Probable resolution: no change

We don't need to change anything.

Possible resolution: subsume dependencies

We could copy some/all dependencies up into rand_chacha. Note that currently c2-chacha has no other dependent crates, while ppv-lite86 has two (both only used by the powhasher miner, as far as I can tell).

CC @kazcw

Other resolutions

We could revert back to HC-128 which has a fast implementation without use of SIMD intrinsics.

We could revert to the old ChaCha implementation or use the chacha20 crate, however this would not be good for performance.

@newpavlov
Copy link
Member

newpavlov commented Aug 29, 2019

I think @tarcieri plans to implement SIMD acceleration for chacha20 crate. After it will be done we could make dependency on stream-cipher optional to reduce number of dependencies when used from rand_chacha. We even could implement BlockRngCore directly by adding an optional dependency on rand_core.

BTW in RustCrypto we were thinking about introducing a generic wrapper à la BlockRng. Maybe it's worth to revisit the suggestion of moving the block stuff outside of rand_core? This wrapper could implement both stream-cipher and rand_core traits behind feature gates and we will not have any code duplication.

This way dependency tree will look like:

├── rand_chacha
│   └── chacha20
│       └── block-stuff-crate
│           ├── rand_core
│           │   └── getrandom
│           └── stream-cipher (optional, disabled by default)

UPD: We also could deprecate rand_chacha and use chacha20 or c2-chacha directly, assuming they will implement rand_core traits.

@tarcieri
Copy link

SIMD support in the chacha20 crate would take me ~1 day. The algorithm is trivial.

I really like the idea of having a "raw" API for the ChaCha20 core and making the rest of the crate dependencies optional (but default-enabled).

@dhardy
Copy link
Member Author

dhardy commented Aug 29, 2019

Having another fast implementation of ChaCha20 does give us options.

Does it not still suffer the problem of language SIMD support being experimental?

@newpavlov
Copy link
Member

newpavlov commented Aug 29, 2019

Well, x86(-64) intrinsics have been stable for a relatively long time (since 1.27), so I wouldn't call them "experimental". One noticeable issue is the absence of is_..._feature_detected! macros in core (see rust-lang/rfcs#2725). But since initially we will support only x86(-64) (since intrinsics for other arches are still only available in Nightly), it shouldn't be a problem. We can use an approach similar to one we use for RDRAND in getrandom, i.e. manual call to CPUID and cashing target feature check result in an atomic variable.

@tarcieri
Copy link

Does it not still suffer the problem of language SIMD support being experimental?

Just to +1 @newpavlov here, while there are a lot of unstable parts of core::arch::x86/x86_64, SIMD ChaCha20 can be implemented entirely in terms of the stable APIs.

This was referenced Sep 13, 2019
@dhardy
Copy link
Member Author

dhardy commented Oct 4, 2019

BTW in RustCrypto we were thinking about introducing a generic wrapper à la BlockRng. Maybe it's worth to revisit the suggestion of moving the block stuff outside of rand_core? This wrapper could implement both stream-cipher and rand_core traits behind feature gates and we will not have any code duplication.

BlockRng is the basis for minimal overhead in ReseedingRng and fork protection. If you propose removing that from rand_core, please also propose how reseeding should be implemented.

@bjorn3
Copy link
Contributor

bjorn3 commented Oct 10, 2019

The c2-chacha crate hasn't been updated for 2 months and has two PRs for 2 cq 5 months. It also doesnt work without at least SSE2 when the simd feature is enabled. (enabled by default) cryptocorrosion/cryptocorrosion#25

@tarcieri
Copy link

I'd be happy to look into adding SIMD support which "just works" and isn't gated on a feature to the chacha20 crate, or if anyone else would like to contribute it, I can review / merge PRs and release it.

It's definitely on my list of things to do (as part of starting to use the chacha20poly1305 crate in earnest and I hope to get around to it soon if no one else does.

@dhardy
Copy link
Member Author

dhardy commented Oct 10, 2019

That would be brilliant @tarcieri.

In the mean-time it does appear that @kazcw and the whole cryptocorrosion project has gone silent, but I am not aware of any significant issues other than the build issue with Cranelift mentioned by @bjorn3 (hopefully something you can work around for now?), thus I suggest we wait for the SIMD work on chacha20. If necessary we can switch the implementation, though in complaince with our portability rules we should stick with ChaCha20 for the Rand 0.7.x series.

@bjorn3
Copy link
Contributor

bjorn3 commented Oct 10, 2019

but I am not aware of any significant issues other than the build issue with Cranelift mentioned by @bjorn3 (hopefully something you can work around for now?)

Patching c2-chacha works.

@tarcieri
Copy link

We just landed SSE support in the chacha20 crate, resulting in a ~25% performance improvement (I measure it as going from 6.7 cycles/byte to 5.3) and I plan on cutting another release in the next day or so:

RustCrypto/stream-ciphers#61

We'll hopefully be adding an AVX2 backend soon as well.

@dhardy
Copy link
Member Author

dhardy commented Oct 20, 2019

I did a little hacking, for testing purposes.

This lets me run comparative benchmarks (std and thread are using the chacha20 crate; chacha benches are using c2-chacha via rand_chacha):

test gen_bytes_std           ... bench:   1,961,243 ns/iter (+/- 55,531) = 522 MB/s
test gen_u32_std             ... bench:       8,207 ns/iter (+/- 256) = 487 MB/s
test gen_u64_std             ... bench:      15,804 ns/iter (+/- 386) = 506 MB/s

test thread_rng_u32          ... bench:       8,667 ns/iter (+/- 752) = 461 MB/s
test thread_rng_u64          ... bench:      16,137 ns/iter (+/- 1,489) = 495 MB/s

test reseeding_chacha20_16k  ... bench:   9,884,385 ns/iter (+/- 183,632) = 1697 MB/s
test reseeding_chacha20_1M   ... bench:   9,223,385 ns/iter (+/- 1,112,501) = 1818 MB/s
test reseeding_chacha20_256k ... bench:   9,207,397 ns/iter (+/- 234,138) = 1822 MB/s
test reseeding_chacha20_32k  ... bench:   9,577,018 ns/iter (+/- 312,612) = 1751 MB/s
test reseeding_chacha20_4k   ... bench:  12,100,123 ns/iter (+/- 365,937) = 1386 MB/s
test reseeding_chacha20_64k  ... bench:   9,437,745 ns/iter (+/- 223,226) = 1777 MB/s

test gen_bytes_chacha12      ... bench:     404,481 ns/iter (+/- 23,979) = 2531 MB/s
test gen_bytes_chacha20      ... bench:     610,738 ns/iter (+/- 42,359) = 1676 MB/s
test gen_bytes_chacha8       ... bench:     302,190 ns/iter (+/- 64,398) = 3388 MB/s
test gen_u32_chacha12        ... bench:       1,949 ns/iter (+/- 66) = 2052 MB/s
test gen_u32_chacha20        ... bench:       2,506 ns/iter (+/- 96) = 1596 MB/s
test gen_u32_chacha8         ... bench:       1,535 ns/iter (+/- 132) = 2605 MB/s
test gen_u64_chacha12        ... bench:       4,129 ns/iter (+/- 131) = 1937 MB/s
test gen_u64_chacha20        ... bench:       5,701 ns/iter (+/- 252) = 1403 MB/s
test gen_u64_chacha8         ... bench:       3,367 ns/iter (+/- 84) = 2376 MB/s

As you can see, there's a long way to go to catch up. One possible part of this is that chacha20 uses [u32; 16] blocks returned via copy while the rand_chacha crate uses a [u32; 64] output block filled via reference.

These benches include 8- and 12-round versions of ChaCha, which in my opinion make for excellent RNGs where security is less important. The code in the chacha20 crate would need some refactoring in order to support these alternatives.

Another thing to think about is the title of this thread: and reducing dependencies (which, as we all know, is not a very useful metric but never-the-less one that many users bring up). This is a cut from the cargo tree output for rand with this hack:

rand v0.7.2 (/home/dhardy/projects/rand/rand)
├── chacha20 v0.2.1 (/home/dhardy/projects/rand/stream-ciphers/chacha20)
│   ├── byteorder v1.3.2
│   ├── rand_core v0.5.1 (/home/dhardy/projects/rand/rand/rand_core)
│   │   └── getrandom v0.1.12
│   │       ├── cfg-if v0.1.9
│   │       └── libc v0.2.62
│   ├── salsa20-core v0.2.2 (/home/dhardy/projects/rand/stream-ciphers/salsa20-core)
│   │   └── stream-cipher v0.3.2
│   │       └── generic-array v0.12.3
│   │           └── typenum v1.11.2
│   │   [dev-dependencies]
│   │   └── stream-cipher v0.3.2 (*)
│   └── stream-cipher v0.3.2 (*)

Thanks for the optimisation work @tarcieri, but I am skeptical about switching to chacha20 over c2-chacha.

@tarcieri
Copy link

Thanks for looking into it. We can probably reverse the relationship between the chacha20 crate and the salsa20-core crate, so the latter isn't needed when the construction isn't using the traits from the stream-cipher crate.

We've also talked about operating on several blocks in parallel as part of adding an AVX2 backend.

@dhardy
Copy link
Member Author

dhardy commented Oct 21, 2019

I noticed several constants being pulled from the salsa20-core crate; if those are copied then the remaining dependency could probably be optional (i.e. opt-in to the RNG, opt-in to the cipher). That may be the simplest solution there?

@tarcieri
Copy link

Yup, we can potentially have both salsa20-core (and vicariously stream-cipher) as well as rand_core as optional dependencies

@tarcieri
Copy link

tarcieri commented Oct 21, 2019

Also re: the reduced round variants i.e. ChaCha8/ChaCha12, yes, we can potentially add those, probably gated under an off-by-default Cargo feature.

@kazcw
Copy link
Collaborator

kazcw commented Oct 21, 2019

I eliminated the lazy_static dependency:

rand_chacha v0.2.1
├── c2-chacha v0.2.3
│   └── ppv-lite86 v0.2.6
└── rand_core v0.5.1

@kazcw
Copy link
Collaborator

kazcw commented Oct 21, 2019

The API surface coupling c2-chacha to ppv-lite86 dwarfs that coupling rand_chacha to c2-chacha (a SIMD library vs a hash function), so moving it between workspaces would lead to maintenance headaches. If package count is more important than logic-abiding modularity, I see a couple things we could do: I could move the core implementation from c2-chacha into ppv-lite86 ("have some chacha with your SIMD"), and we could move the contents of rand_chacha into rand so that rand would depend only on ppv-lite86--as minimal a dependency tree as could sanely be obtained without turning to a crate-bundling script. rand_chacha would depend on rand and pub-export the chacha rngs (which would technically be pub in rand, but could be in a namespace called private_dont_look). If that's worth doing then the cargo system is broken, but I don't see any lower-hanging fruit in the dependency-count-reduction field.

@kazcw
Copy link
Collaborator

kazcw commented Oct 21, 2019

IMO the more reasonable of those two dependency eliminations (if we were to stop at one) would be splaying the rand/rand_chacha dependency--it would be a quirk for rand maintainers that the rand_chacha code actually lives in rand, but forcing rand on rand_chacha users seems like it subverts the crate concept less than forcing a chacha backend on ppv-lite86 users.

@vks
Copy link
Collaborator

vks commented Oct 22, 2019

I eliminated the lazy_static dependency:

rand_chacha v0.2.1
├── c2-chacha v0.2.3
│   └── ppv-lite86 v0.2.6
└── rand_core v0.5.1

I think this is good enough! If anything, we could get rid of rand_chacha and add rand_core support to c2-chacha behind a feature flag.

@dhardy
Copy link
Member Author

dhardy commented Oct 22, 2019

Thanks @kazcw for taking another look at this and removing the lazy_static dependency.

We have in the past duplicated the OsRng code to avoid a dependency, and have at least discussed reversing dependencies (rand_chacha depending on rand), but neither is desirable.

If that's worth doing then the cargo system is broken, but I don't see any lower-hanging fruit in the dependency-count-reduction field.

I agree; actually this is less about "Cargo is broken" than "people's metrics of complexity are broken".

At this point my preference is actually no further changes, except possibly the one @vks mentions: drop rand_chacha and add optional rand_core integration within c2-chacha behind a feature flag (approx 500 extra lines to a 900 line crate; in contrast ppv-lite has over 3500 lines and rand over 9000).

@dhardy
Copy link
Member Author

dhardy commented Jan 13, 2020

drop rand_chacha and add optional rand_core integration within c2-chacha behind a feature flag

In fact I believe this is the wrong approach; rather we should copy the c2-chacha code we need into rand_chacha and depend directly on ppv-lite86. Reasons are two:

  1. rand_chacha depends on only about 300 lines (1/3) of the code in c2-chacha
  2. rand_chacha already has 44 reverse-dependencies; c2-chacha has 2

@tarcieri
Copy link

tarcieri commented Jan 13, 2020

@dhardy while you’re looking into that, you might consider using ChaCha12 over ChaCha20:

https://eprint.iacr.org/2019/1492

ChaCha12 still provides 5 rounds of safety margin over the best known attack, whereas ChaCha20 provides 13 which is excessive when the best attack is 7 (and even then that attack doesn’t result in a catastrophic breakage).

The cited paper suggests ChaCha8 which only provides a single round of safety margin. I think ChaCha12 is at a happy medium of enough rounds to feel confident it’s highly unlikely to be broken by future attacks, and improving the performance through a lower number of rounds.

@dhardy
Copy link
Member Author

dhardy commented Jan 13, 2020

@tarcieri sounds great! But considering that is a security issue, I think it deserves its own thread.
So maybe copy+paste that to a new issue?

Also, it's not "while you're looking into that". rand_chacha already exports all three common variants; this change would be to Rand's main lib.

@tarcieri
Copy link

tarcieri commented Jan 13, 2020

Sure, I can open a separate issue (edit: opened #932)

@tarcieri
Copy link

tarcieri commented Jan 16, 2020

I'm back, with something actually relevant to this thread.

I just released v0.3 of the chacha20 crate. It includes the following changes:

  • All of its dependencies (stream-cipher, rand_core, and zeroize) are now optional, and the rng feature now depends exclusively on the rand_core crate. With chacha20 = { version = "0.3", default-features = false, features = ["rng"] }, the only dependency is the rand_core crate.
  • The buffering logic has been dramatically improved. The RNG no longer performs any buffering whatsoever.
  • ChaCha8 / ChaCha8Rng / ChaCha8RngCore and ChaCha12 / ChaCha12Rng / ChaCha12RngCore are now available.
  • It now supports an AVX2 backend (implemented in terms of core::arch intrinsics) which brings its performance to ~1.4 cycles/byte when using that backend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
E-question Participation: opinions wanted
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants