Sha1 and Sha2 cleanups & optimizations #8097

Closed
wants to merge 9 commits

Conversation


@DaGenix DaGenix commented Jul 28, 2013

UPDATED (8/03): I'm leaving in my cycle count measurements in the description below, but just want to warn that I'm not sure how accurate they are. I would suggest not using them for anything unless someone else verifies them.

UPDATED (7/29): Rebased out the MD5 implementation and updated the Sha1 implementation instead.

I reworked the Sha2 code to be more readable and faster. Part of that work entailed extracting the common buffer-handling code into a new struct, FixedBuffer, in the new cryptoutil.rs file. I then updated the Sha1 implementation to use the new functionality.

The first 3 commits in this request just move code around in the sha2.rs file. None of the changes here should impact functionality, but they hopefully make the code a bit cleaner and easier to work with.

In the 4th commit:

  • I create the FixedBuffer struct. This struct contains the logic to gather up input bytes until a full block is available. Once a full block is available, it is fed into a closure passed in by the caller (for a hash function, the compression function), and the buffer is then marked as empty. The existing Digest implementations tend to process input a byte at a time and to make more copies than necessary; FixedBuffer tries to do the minimum amount of copying possible (see the sketch after this list). My hope is that this struct can be used by any future Digest implementation, at least those that are block based.
  • I added a few unsafe methods to optimize some common cases: writing an integer to a [u8] and reading an array of integers from a [u8]. These methods verify their preconditions explicitly and will assert! if they are used in a manner that would corrupt memory.
  • I updated the Sha2 code to make use of all this new functionality. This resulted in a nice speedup in my benchmarks.
  • I greatly reduced the size of the Sha2 objects. Previously they contained their entire working buffer (80 u64s for Sha512 and 64 u32s for Sha256), even though those values are only needed by the compression function; with this change, the larger buffers are allocated on the stack inside the compression function.
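
To make the buffering behavior concrete, here is a minimal sketch of the FixedBuffer idea, written in modern Rust syntax rather than the 2013-era dialect this PR targets. The names and the closure-based API are illustrative, not the exact code from cryptoutil.rs:

```rust
// Illustrative sketch of a fixed-size input buffer for a 64-byte block hash.
struct FixedBuffer64 {
    buffer: [u8; 64], // one block's worth of pending input
    position: usize,  // how many bytes are currently buffered
}

impl FixedBuffer64 {
    fn new() -> FixedBuffer64 {
        FixedBuffer64 { buffer: [0u8; 64], position: 0 }
    }

    // Feed input; whenever a full 64-byte block is available, hand it to
    // `func` (the compression function) and mark the buffer empty.
    fn input<F: FnMut(&[u8])>(&mut self, mut input: &[u8], mut func: F) {
        // First, top off a partially filled buffer.
        if self.position > 0 {
            let fill = (64 - self.position).min(input.len());
            self.buffer[self.position..self.position + fill]
                .copy_from_slice(&input[..fill]);
            self.position += fill;
            input = &input[fill..];
            if self.position < 64 {
                return; // not enough input for a full block yet
            }
            func(&self.buffer);
            self.position = 0;
        }
        // Process whole blocks straight out of the input, with no copying.
        while input.len() >= 64 {
            func(&input[..64]);
            input = &input[64..];
        }
        // Buffer the leftover bytes for the next call.
        self.buffer[..input.len()].copy_from_slice(input);
        self.position = input.len();
    }
}
```

The key point is the middle loop: once the internal buffer is empty, full blocks are compressed directly out of the caller's slice instead of being copied into the buffer first.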

In the 5th commit, I rewrite the compression functions used by Sha2. Not a major change: I inlined the message schedule calculation into the same loop as the rest of the compression function. This results in a nice speedup, as I believe it lets LLVM optimize the method better.
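
For illustration, here is a sketch of what fusing the message schedule into the round loop looks like for SHA-256, in modern Rust. This is not the PR's exact macro-based code, just the general technique, using a rolling 16-word schedule window; `k` is the standard round-constant table, assumed supplied by the caller:

```rust
// Sketch: SHA-256 rounds with the message schedule computed in the same loop.
fn sha256_rounds(state: &mut [u32; 8], block_words: &[u32; 16], k: &[u32; 64]) {
    let [mut a, mut b, mut c, mut d, mut e, mut f, mut g, mut h] = *state;
    let mut w = *block_words; // only the most recent 16 schedule words are kept

    for t in 0..64 {
        // Compute the schedule word lazily, inside the round loop.
        let wt = if t < 16 {
            w[t]
        } else {
            let s0 = w[(t - 15) % 16].rotate_right(7)
                ^ w[(t - 15) % 16].rotate_right(18)
                ^ (w[(t - 15) % 16] >> 3);
            let s1 = w[(t - 2) % 16].rotate_right(17)
                ^ w[(t - 2) % 16].rotate_right(19)
                ^ (w[(t - 2) % 16] >> 10);
            // w[t % 16] currently holds W[t-16]; overwrite it with W[t].
            let next = w[t % 16]
                .wrapping_add(s0)
                .wrapping_add(w[(t - 7) % 16])
                .wrapping_add(s1);
            w[t % 16] = next;
            next
        };
        let big_s1 = e.rotate_right(6) ^ e.rotate_right(11) ^ e.rotate_right(25);
        let ch = (e & f) ^ (!e & g);
        let t1 = h.wrapping_add(big_s1).wrapping_add(ch)
            .wrapping_add(k[t]).wrapping_add(wt);
        let big_s0 = a.rotate_right(2) ^ a.rotate_right(13) ^ a.rotate_right(22);
        let maj = (a & b) ^ (a & c) ^ (b & c);
        let t2 = big_s0.wrapping_add(maj);
        h = g; g = f; f = e; e = d.wrapping_add(t1);
        d = c; c = b; b = a; a = t1.wrapping_add(t2);
    }

    state[0] = state[0].wrapping_add(a); state[1] = state[1].wrapping_add(b);
    state[2] = state[2].wrapping_add(c); state[3] = state[3].wrapping_add(d);
    state[4] = state[4].wrapping_add(e); state[5] = state[5].wrapping_add(f);
    state[6] = state[6].wrapping_add(g); state[7] = state[7].wrapping_add(h);
}
```

Because the schedule word is produced right where it is consumed, the compiler can keep the 16-word window in registers instead of materializing a full 64-entry array.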

In the 6th commit, I update the Digest trait to use default methods and get rid of the DigestUtil trait now that default methods seem to be working better.
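
A minimal sketch of the idea, in modern Rust syntax (the method names follow the style used elsewhere in this PR, but the exact signatures are illustrative):

```rust
// Convenience methods become default methods on the trait itself,
// so a separate DigestUtil extension trait is no longer needed.
trait Digest {
    fn input(&mut self, data: &[u8]);
    fn result(&mut self, out: &mut [u8]);
    fn reset(&mut self);
    fn output_bits(&self) -> usize;

    // Default methods: every implementor gets these for free.
    fn input_str(&mut self, data: &str) {
        self.input(data.as_bytes());
    }
    fn result_str(&mut self) -> String {
        let mut buf = vec![0u8; (self.output_bits() + 7) / 8];
        self.result(&mut buf);
        buf.iter().map(|b| format!("{:02x}", b)).collect()
    }
}
```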

The 7th commit adds a new function to cryptoutil.rs that tests a Digest by feeding it 1,000,000 'a's in chunks of varying size and then checking the result. This is basically the same test that already exists for Sha1, so I updated that test to use it. I also added tests using this function for Sha2.
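
Here is a sketch of what such a helper can look like, building on the Digest trait sketched above; the name and signature are hypothetical, not the PR's exact function:

```rust
// Feed 1,000,000 'a' bytes into `digest` using a varying chunk size so that
// buffer-boundary handling gets exercised, then check the hex result.
fn test_digest_million_a<D: Digest>(digest: &mut D, blocksize: usize, expected: &str) {
    let total = 1_000_000;
    let data = vec![b'a'; blocksize * 2];
    let mut count = 0;
    let mut chunk = 1;
    while count < total {
        let take = chunk.min(total - count).min(data.len());
        digest.input(&data[..take]);
        count += take;
        // Cycle the chunk size through 1..=blocksize*2.
        chunk = (chunk % (blocksize * 2)) + 1;
    }
    assert_eq!(digest.result_str(), expected);
}
```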

The 8th commit adds new functions to cryptoutil.rs that perform integer additions with overflow checking. These functions are useful for tracking the size of the message being processed by Sha1 and Sha2, since those Digests are only defined for messages under a certain size. I updated the Sha2 code to use these functions.
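
A minimal sketch of the idea, assuming the name is illustrative: the length counter is kept in bits while input arrives in bytes, so the byte count is scaled by 8 before being added, and both steps must fail loudly rather than wrap. Modern Rust's checked arithmetic stands in for the 2013-era fail!() style:

```rust
// Add `bytes` (a byte count) to `bits` (a running bit count), failing on
// overflow rather than silently wrapping. Name is hypothetical.
fn add_bytes_to_bits(bits: u64, bytes: u64) -> u64 {
    let new_bits = bytes.checked_mul(8).expect("message too long");
    bits.checked_add(new_bits).expect("message too long: maximum size exceeded")
}
```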

The 9th commit updates the existing Sha1 code to use the new functionality in cryptoutil.rs. Hopefully this is a good example of how the new functionality is generally applicable to more than just Sha2 and can be used by future block-based Digest functions.

Outside of the #[test]s, I also tested the Sha2 and Sha1 implementations by generating a few thousand random inputs of up to 10MB in size, calculating the digests using both these implementations and Java's, and checking that they agree.

Performance Measurements:

Benchmarks:

Before making any updates, the Sha2 performance was:

```
test sha2::bench::sha256_10 ... bench: 104 ns/iter (+/- 3) = 96 MB/s
test sha2::bench::sha256_1k ... bench: 10476 ns/iter (+/- 918) = 97 MB/s
test sha2::bench::sha256_64k ... bench: 671398 ns/iter (+/- 9934) = 97 MB/s
test sha2::bench::sha512_10 ... bench: 86 ns/iter (+/- 6) = 116 MB/s
test sha2::bench::sha512_1k ... bench: 6114 ns/iter (+/- 50) = 167 MB/s
test sha2::bench::sha512_64k ... bench: 393582 ns/iter (+/- 22549) = 166 MB/s
```

After the changes in change 4, the performance improved to:

```
test sha2::bench::sha256_10 ... bench: 101 ns/iter (+/- 7) = 99 MB/s
test sha2::bench::sha256_1k ... bench: 7736 ns/iter (+/- 177) = 132 MB/s
test sha2::bench::sha256_64k ... bench: 492327 ns/iter (+/- 12650) = 133 MB/s
test sha2::bench::sha512_10 ... bench: 72 ns/iter (+/- 2) = 138 MB/s
test sha2::bench::sha512_1k ... bench: 5004 ns/iter (+/- 935) = 204 MB/s
test sha2::bench::sha512_64k ... bench: 316444 ns/iter (+/- 8554) = 207 MB/s
```

After the rest of the changes (changes after 5 didn't impact performance much, so this is basically the same as the performance after change 5), the performance further improved to:

```
test sha2::bench::sha256_10 ... bench: 91 ns/iter (+/- 1) = 109 MB/s
test sha2::bench::sha256_1k ... bench: 7018 ns/iter (+/- 3036) = 145 MB/s
test sha2::bench::sha256_64k ... bench: 449890 ns/iter (+/- 9512) = 145 MB/s
test sha2::bench::sha512_10 ... bench: 63 ns/iter (+/- 0) = 158 MB/s
test sha2::bench::sha512_1k ... bench: 4052 ns/iter (+/- 1087) = 252 MB/s
test sha2::bench::sha512_64k ... bench: 256686 ns/iter (+/- 1721) = 255 MB/s
```

Before making any updates, the Sha1 performance was:

```
test sha1::bench::sha1_10 ... bench: 85 ns/iter (+/- 6) = 117 MB/s
test sha1::bench::sha1_1k ... bench: 8090 ns/iter (+/- 230) = 126 MB/s
test sha1::bench::sha1_64k ... bench: 523690 ns/iter (+/- 160571) = 125 MB/s
```

After the updates to the Sha1 code, the performance improved to:

```
test sha1::bench::sha1_10 ... bench: 87 ns/iter (+/- 4) = 114 MB/s
test sha1::bench::sha1_1k ... bench: 6221 ns/iter (+/- 61) = 164 MB/s
test sha1::bench::sha1_64k ... bench: 395805 ns/iter (+/- 4292) = 165 MB/s
```

Cycle counts:

Sha512 performance on my machine is 10.95 cycles / byte. Intel [1] has a Sha512 implementation that achieves 9.4 cycles / byte using SSE instructions (the AVX implementation is slightly faster, but only runs on Haswell processors, I think). Assuming that is the best possible performance, there really isn't much more room for optimization without resorting to assembly. I'm not sure what optimized C implementations generally achieve. Prior to the rebase, I was getting about 11.5 cycles / byte. I'm not sure why the performance improved a bit; possibly just variation on my machine, or maybe the new code to keep track of the message length is a bit more efficient.

Sha256 performance on my machine is 18.55 cycles / byte. Intel [2] achieved roughly 13.8 cycles / byte in their single-threaded SSE implementation. Like Sha512, I think this puts the Sha256 implementation pretty close to the limit on how efficient it can be without resorting to assembly, though not as close as Sha512. After the rebase, the performance seems to have decreased by about 1 cycle / byte. The only new change was to how the message size is calculated. I haven't done any testing specifically focused on that at this time, however.

I used the RDTSC instruction to measure the cycle counts. In my measurements, I took into account the entire digest operation, all the way through calculating the output digest. It's a little unclear to me whether Intel's measurements were just of the compression function or if they included the output digest calculation steps as well.
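
For reference, a hedged sketch of this style of measurement, using today's std::arch intrinsic rather than whatever harness was actually used here (the original measurement code is not shown in this PR):

```rust
// Rough cycles-per-byte measurement on x86_64. RDTSC counts reference
// cycles, so this is an approximation, not a precise microbenchmark.
#[cfg(target_arch = "x86_64")]
fn cycles_per_byte<F: FnMut()>(mut digest_op: F, bytes: u64, iters: u64) -> f64 {
    use std::arch::x86_64::_rdtsc;
    let start = unsafe { _rdtsc() };
    for _ in 0..iters {
        digest_op(); // one full digest operation, input through output
    }
    let end = unsafe { _rdtsc() };
    (end - start) as f64 / (iters * bytes) as f64
}
```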

I didn't (and can't) measure performance on non-Intel architectures. However, I don't think any of the changes I made would decrease performance on other architectures. I redid my cycle calculations after the rebase.

[1] - http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/fast-sha512-implementations-ia-processors-paper.html

[2] - http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/sha-256-implementations-paper.html

```
#[test]
fn test_md5() {
    // Examples from wikipedia
    let wikipedia_tests = ~[
```
Contributor

The RFCs for MD4 and MD5 provide some test vectors that you could use too. In my implementation, I found a bug when testing a couple of longer input strings. It might also make sense to use the a_million_letter_a test from SHA1's suite.

https://github.com/alco/rust-digest/blob/master/md5.rs#L289
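
For concreteness, a few of the RFC 1321 vectors the reviewer mentions, in a sketch of a test; the `Md5` type and its `input_str`/`result_str` methods are assumed from the implementation under review rather than shown in this excerpt:

```rust
// Test vectors from RFC 1321, appendix A.5.
#[test]
fn test_md5_rfc1321_vectors() {
    let cases = [
        ("", "d41d8cd98f00b204e9800998ecf8427e"),
        ("a", "0cc175b9c0f1b6a831c399e269772661"),
        ("abc", "900150983cd24fb0d6963f7d28e17f72"),
        ("message digest", "f96b697d7cb7938d525a2f31aaf161d0"),
    ];
    for (input, expected) in cases {
        let mut md5 = Md5::new();
        md5.input_str(input);
        assert_eq!(md5.result_str(), expected);
    }
}
```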

Author

Great point. I updated the pull request with new tests that do the "1 million a" test.

@brson
Contributor

brson commented Jul 30, 2013

Since there are two competing implementations here, is somebody willing to compare with @alco's version, discussed on the list here: https://mail.mozilla.org/pipermail/rust-dev/2013-July/004975.html?

Am I right in thinking that this does not include MD4 while @alco's does? What's the way forward to merge these two efforts?

We probably shouldn't rubber stamp this just because it won the PR race.

@DaGenix
Author

DaGenix commented Jul 30, 2013

@alco, please chime in with your own comparison if I misstate anything. I'm trying to just state the facts and I'm not trying to recommend one implementation over the other here, so if anything I write gives that impression, it is unintentional.

  • @alco has an implementation of MD4; however, it's not integrated into his MD5 implementation (at least not in his public repo). So, essentially, it's the same as mine in that regard.
  • My implementation makes use of the new functionality included in this patch series; his does not. So, merging this request would mean his implementation should be updated, since I don't think it would make sense to merge new functionality that seeks to be the basis for future digests and then merge a digest that doesn't use it. If the functionality in this pull request looks good and is merged, there isn't any reason it couldn't be integrated into his implementation.
  • My implementation completely unrolls the loops in the compression function, while his does not. I have no idea how the implementations compare performance-wise. I have benchmarks for my implementation, but I don't know whether LLVM will generate better code with the loops unrolled (as in mine) or left alone (as in his).

The goal of my MD5 implementation was to demonstrate that the new functionality I was implementing was usable for digests other than Sha2. Anyway, what I'm going to do is update this pull request to rebase out the MD5 implementation and instead integrate the new functionality into the existing Sha1 implementation.

@DaGenix
Author

DaGenix commented Jul 30, 2013

I've rebased the pull request to remove the MD5 implementation, to update the existing Sha1 implementation, and to add some new tests. A result of these changes is that Sha2 now properly checks the size of the message it is processing. Sha1 now gives a (slightly) more descriptive error message as part of fail!() if you try to process a message that is too large. This may address #2346.

@alco
Contributor

alco commented Jul 30, 2013

It might take a while before I learn how to properly rebuild Rust without it taking a month, so it may be some time before I can compare our two implementations of MD5.

@DaGenix If you send MD5 as a separate PR, we can discuss it there. I can still see md5.rs in the current changeset though.

@DaGenix
Author

DaGenix commented Jul 30, 2013

Yikes, that was sloppy of me. I deleted the change that introduced it, but I guess the next change brought it back without me realizing. Sorry! I've rebased again to really get rid of it.

@DaGenix
Author

DaGenix commented Jul 31, 2013

I'm happy to open another pull request for the MD5 changes. However, my MD5 implementation heavily depends on the rest of the changes in this pull request, so I don't think I can really do that until it's decided what to do with this pull request.

@brson
Contributor

brson commented Jul 31, 2013

@DaGenix thanks for rebasing out the md5 impl so we can do this incrementally.

This series seems pretty reasonable to me, though DaGenix@4994417 does still introduce md5.rs only to delete it one commit later. Can you rebase once more to fix that snafu?

Anybody else have opinions?

Palmer Cox added 9 commits July 31, 2013 00:50
The code was arranged so that the core Sha2 code came first, and all of
the various implementations of Digest followed later. The problem is
that the Sha512 compression function code is far away from the Sha512
Digest implementation, so, if you are trying to read over the code, you
need to scroll all around the file for no good reason. The code was
rearranged so that all of the Sha512 code is in one place, all of the
Sha256 code is in another, and all impls for a struct are near the
definition of that struct.
The result_X() methods just calculate an output of a fixed size. They don't
really have much to do with running the actual hash algorithm until the very
last step, the output. It makes much more sense to put all this logic into
the Digest impls for each specific variation on the hash function.
…f it.

There are 2 main pieces of functionality in cryptoutil.rs:
* A set of unsafe functions for efficiently reading and writing u32 and u64
  values. All of these functions are fairly easy to audit to confirm that
  they do what they are supposed to.
* A FixedBuffer struct. This struct keeps track of input data until there
  is enough of it to execute a function on it which expects a fixed
  block of data.

The Sha2 module was rewritten to take advantage of the new functions in
cryptoutil as well as FixedBuffer. The result is that the duplicate code
for maintaining a buffer of input data is removed from the Sha512 and
Sha256 implementations. Additionally, the FixedBuffer code is much more
efficient than the previous code was.
The Sha2 compression functions were re-written to execute the message
scheduling calculations in the same loop as the rest of the compression
function. The compiler is able to generate much better code. Additionally,
the innermost parts of the compression functions were turned into macros to
reduce code duplication and to make the functions more concise.
Create a helper function in cryptoutil.rs which feeds 1,000,000 'a's into
a Digest with varying input sizes and then checks the result. This is
essentially the same as one of Sha1's existing tests, so that test was
re-implemented using this method. New tests were added using this method for
Sha512 and Sha256.
Added functions to cryptoutil.rs that perform an addition after shifting
the 2nd parameter by a specified constant. These functions fail!() if integer
overflow will result. Updated the Sha2 implementation to use these functions.
@DaGenix
Author

DaGenix commented Jul 31, 2013

Ugh. Sorry about that. I've rebased again to get rid of the MD5 stuff. Git is determined to not let me get rid of it.

@DaGenix DaGenix closed this Aug 1, 2013
bors added a commit that referenced this pull request Aug 3, 2013
Same content as #8097, but bors had an issue with that pull request. Opening a new one.
@DaGenix DaGenix mentioned this pull request Aug 3, 2013
bors added a commit that referenced this pull request Aug 17, 2013
An MD5 implementation was originally included in #8097, but, since there are a couple different implementations of that digest algorithm (@alco mentioned his implementation on the mailing list just before I opened that PR), it was suggested that I remove it from that PR and open up a new PR to discuss the different implementations and the best way forward. If anyone wants to discuss a different implementation, feel free to present it here and discuss and compare it to this one. I'll just discuss my implementation and I'll leave it to others to present details of theirs.

This implementation relies on the FixedBuffer struct from cryptoutil.rs for managing the input buffer, just like the Sha1 and Sha2 digest implementations do. I tried manually unrolling the loops in the compression function, but I got slightly worse performance when I did that.

Outside of the #[test]s, I also tested the implementation by generating 1,000 inputs of up to 10MB in size and checking the MD5 digest calculated by this code against the MD5 digest calculated by Java's implementation.

On my computer, I'm getting the following performance:

```
test md5::bench::md5_10 ... bench: 52 ns/iter (+/- 1) = 192 MB/s
test md5::bench::md5_1k ... bench: 2819 ns/iter (+/- 44) = 363 MB/s
test md5::bench::md5_64k ... bench: 178566 ns/iter (+/- 4927) = 367 MB/s
```
flip1995 pushed a commit to flip1995/rust that referenced this pull request Jun 30, 2022
feat(new lint): new lint `manual_retain`

close rust-lang#8097

This PR is a new lint implementation.
This lint checks if the `retain` method is available.

Thank you in advance.

changelog: add new ``[`manual_retain`]`` lint