Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a dedicated length-prefixing method to Hasher #94598

Merged
merged 2 commits into from
May 6, 2022

Conversation

scottmcm
Copy link
Member

@scottmcm scottmcm commented Mar 4, 2022

This accomplishes two main goals:

  • Make it clear who is responsible for prefix-freedom, including how they should do it
  • Make it feasible for a Hasher that doesn't care about Hash-DoS resistance to get better performance by not hashing lengths

This does not change rustc-hash, since that's in an external crate, but that could potentially use it in future.

Fixes #94026

r? rust-lang/libs


The core of this change is the following two new methods on Hasher:

pub trait Hasher {
    /// Writes a length prefix into this hasher, as part of being prefix-free.
    ///
    /// If you're implementing [`Hash`] for a custom collection, call this before
    /// writing its contents to this `Hasher`.  That way
    /// `(collection![1, 2, 3], collection![4, 5])` and
    /// `(collection![1, 2], collection![3, 4, 5])` will provide different
    /// sequences of values to the `Hasher`
    ///
    /// The `impl<T> Hash for [T]` includes a call to this method, so if you're
    /// hashing a slice (or array or vector) via its `Hash::hash` method,
    /// you should **not** call this yourself.
    ///
    /// This method is only for providing domain separation.  If you want to
    /// hash a `usize` that represents part of the *data*, then it's important
    /// that you pass it to [`Hasher::write_usize`] instead of to this method.
    ///
    /// # Examples
    ///
    /// ```
    /// #![feature(hasher_prefixfree_extras)]
    /// # // Stubs to make the `impl` below pass the compiler
    /// # struct MyCollection<T>(Option<T>);
    /// # impl<T> MyCollection<T> {
    /// #     fn len(&self) -> usize { todo!() }
    /// # }
    /// # impl<'a, T> IntoIterator for &'a MyCollection<T> {
    /// #     type Item = T;
    /// #     type IntoIter = std::iter::Empty<T>;
    /// #     fn into_iter(self) -> Self::IntoIter { todo!() }
    /// # }
    ///
    /// use std::hash::{Hash, Hasher};
    /// impl<T: Hash> Hash for MyCollection<T> {
    ///     fn hash<H: Hasher>(&self, state: &mut H) {
    ///         state.write_length_prefix(self.len());
    ///         for elt in self {
    ///             elt.hash(state);
    ///         }
    ///     }
    /// }
    /// ```
    ///
    /// # Note to Implementers
    ///
    /// If you've decided that your `Hasher` is willing to be susceptible to
    /// Hash-DoS attacks, then you might consider skipping hashing some or all
    /// of the `len` provided in the name of increased performance.
    #[inline]
    #[unstable(feature = "hasher_prefixfree_extras", issue = "88888888")]
    fn write_length_prefix(&mut self, len: usize) {
        self.write_usize(len);
    }

    /// Writes a single `str` into this hasher.
    ///
    /// If you're implementing [`Hash`], you generally do not need to call this,
    /// as the `impl Hash for str` does, so you can just use that.
    ///
    /// This includes the domain separator for prefix-freedom, so you should
    /// **not** call `Self::write_length_prefix` before calling this.
    ///
    /// # Note to Implementers
    ///
    /// The default implementation of this method includes a call to
    /// [`Self::write_length_prefix`], so if your implementation of `Hasher`
    /// doesn't care about prefix-freedom and you've thus overridden
    /// that method to do nothing, there's no need to override this one.
    ///
    /// This method is available to be overridden separately from the others
    /// as `str` being UTF-8 means that it never contains `0xFF` bytes, which
    /// can be used to provide prefix-freedom cheaper than hashing a length.
    ///
    /// For example, if your `Hasher` works byte-by-byte (perhaps by accumulating
    /// them into a buffer), then you can hash the bytes of the `str` followed
    /// by a single `0xFF` byte.
    ///
    /// If your `Hasher` works in chunks, you can also do this by being careful
    /// about how you pad partial chunks.  If the chunks are padded with `0x00`
    /// bytes then just hashing an extra `0xFF` byte doesn't necessarily
    /// provide prefix-freedom, as `"ab"` and `"ab\u{0}"` would likely hash
    /// the same sequence of chunks.  But if you pad with `0xFF` bytes instead,
    /// ensuring at least one padding byte, then it can often provide
    /// prefix-freedom cheaper than hashing the length would.
    #[inline]
    #[unstable(feature = "hasher_prefixfree_extras", issue = "88888888")]
    fn write_str(&mut self, s: &str) {
        self.write_length_prefix(s.len());
        self.write(s.as_bytes());
    }
}

With updates to the Hash implementations for slices and containers to call write_length_prefix instead of write_usize.

write_str defaults to using write_length_prefix since, as was pointed out in the issue, the write_u8(0xFF) approach is insufficient for hashers that work in chunks, as those would hash "a\u{0}" and "a" to the same thing. But since SipHash works byte-wise (there's an internal buffer to accumulate bytes until a full chunk is available) it overrides write_str to continue to use the add-non-UTF-8-byte approach.


Compatibility:

Because the default implementation of write_length_prefix calls write_usize, the changed hash implementation for slices will do the same thing the old one did on existing Hashers.

@rustbot rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Mar 4, 2022
@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 4, 2022
@tczajka
Copy link

tczajka commented Mar 4, 2022

  • Make it feasible for a Hasher that doesn't care about Hash-DoS resistance to get better performance by not hashing lengths

This seems like a bad idea. It's getting better performance by skipping over some relevant bits of information. By the same reasoning, you might skip the first element when hashing an (i32, i32) to get better performance. Isn't the whole point of hashing to hash all the data?

It's not just about DoS resistance, data such as ("aaa", "a") and ("a", "aaa") can come from a friendly source.

@scottmcm
Copy link
Member Author

scottmcm commented Mar 4, 2022

Isn't the whole point of hashing to hash all the data?

No, I wouldn't say that. Rather the point is to hash enough of the entropy to get the collision rate acceptably low. It's sometimes fine to skip lengths the same way it's sometimes fine to skip big-but-low-cardinality fields in Hash. I certainly wouldn't suggest it as the default behaviour -- so perhaps I should change the text to encourage still using the length in non-adversarial situations too -- but I can absolutely imagine that in many cases it would be perfectly appropriate. (And I'd much rather that some Hashers skipped it than have Hash implementations start showing up in the ecosystem that skip entirely for all Hashers.)

rustc has found that a lower-quality but faster hash is worth it, and it's entirely possible that the tradeoff in skipping lengths would also be a net win. Something like TerminatorKind::Call definitely doesn't need the length prefix, for example -- there's only one Vec in there, and if it has a different length it will have had a different func too, since rust doesn't have overloading. And just in general, it interning so much stuff makes it that much less likely that the ("ab", "c") vs ("a", "bc") scenarios will come up. So while some parts -- StableHasher, maybe -- likely will keep using the lengths, I wouldn't be surprised if most of the FxHashMaps were net-happier without lengths.

And because of Borrow, there's a bunch of places where a hasher that ignores lengths would be faster with no loss of hash quality. For example, I have a project I made 15% faster by changing from HashMap<[usize; 2], _> to HashMap<(usize, usize), _> -- because the former is forced to pointlessly hash a 2_usize for every key. Even just a HashSet<String> is wasting effort by hashing that extra byte, since it's obviously prefix-free without it. Being able to fix things like that with one Hasher wrapper, instead of needing alternative Hash implementations for all of them, sounds really nice to me.

It can potentially also provide value even if nobody ever omits the length entirely. For example, a byte-oriented hasher might choose to LEB128 the length prefixes to save hash calls in common cases, but not do that for all usizes. (That would make a [u8; 1] hash 2 bytes instead of 9, for example, so could plausibly be worth it, since the extra LEB cost would be on long slices where it's likely negligible compared to the total cost of hashing all those extra elements.)

@tczajka
Copy link

tczajka commented Mar 4, 2022

Some good points, but I want to address a few of them:

rustc has found that a lower-quality but faster hash is worth it

This comparison seems to be about performance of SipHash vs another hash. SipHash is just slow. I agree there are better hashers for many (most) applications, but you can get much faster hashers that don't skip anything.

For example, I have a project I made 15% faster by changing from HashMap<[usize; 2], _> to HashMap<(usize, usize), _> -- because the former is forced to pointlessly hash a 2_usize for every key.

Agreed, but this seems like a fundamental design choice / design issue with hashing arrays and Borrow, and this proposed API works around it in a brittle way which only works for certain hashers, rather than really addressing the fundamental problem with how constant-sized arrays are hashed.

Even just a HashSet<String> is wasting effort by hashing that extra byte, since it's obviously prefix-free without it.

Not really -- Strings are not prefix free. Good hashers can currently assume that the data they are hashing is already prefix free, so they don't need to add a terminator themselves, they can just pad the data with zeroes to a block size without any delimiters. This wouldn't work without the extra byte.

Currently SipHash seems to add a delimeter in finish(), but this is superfluous.

For example, a byte-oriented hasher might choose to LEB128 the length prefixes to save hash calls in common cases, but not do that for all usizes.

Another solution without having to complicate the API would be for Hash for [T] to always do this. I think it makes sense.

@scottmcm
Copy link
Member Author

scottmcm commented Mar 4, 2022

would be for Hash for [T] to always do this. I think it makes sense.

I absolutely want to try it for SipHash as a followup to this PR, if accepted.

But always doing it feels wrong. It's probably a win for anything that works byte-by-byte, but rustc's FxHasher works on usize chunks without a buffer:

https://github.com/rust-lang/rustc-hash/blob/5ff7e6ae50e770381cb7bb393bfd22e016f2b8a6/src/lib.rs#L76-L81

So the extra LEB128 work would be entirely wasteful overhead for it -- even if it's only a 1-byte write, it'll still make a full add_to_hash step:

https://github.com/rust-lang/rustc-hash/blob/5ff7e6ae50e770381cb7bb393bfd22e016f2b8a6/src/lib.rs#L105-L107

Which sounds to me like all the more reason to leave it up to the Hasher, rather than doing it in, say, slice's Hash. (Not to mention that we'd need to copy the logic to VecDeque's Hash and hashbrown's Hash and etc, so even if it were worth doing for every Hasher I think we'd still want a method for it somewhere.)

There are middle-ground possibilities here too. If you're implementing a hasher that uses a 4-byte block, then for platforms where usize is 64-bit you might want to "LEB2147483648" the lengths, to usually only need to do 1 round.

@tczajka
Copy link

tczajka commented Mar 5, 2022

So the extra LEB128 work would be entirely wasteful overhead

The extra work is just one comparison against 128, or a single bitand, with a very predictable branch (the branch can assume it's always small).

Avoiding this comparison (only in some non-standard Hashers) seems like a micro-optimization -- is this really worth complicating the Hasher API and complicating prefix-freedom rules for Hash?

If the length is 128 or more, the overhead for length doesn't matter because hashing 128 things is going to dominate (except for ZSTs).

Alternatively, the following encoding could be even better: one byte for lengths smaller than 255, 9 bytes (255u8, length) for lengths 255 or more.

@scottmcm scottmcm added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Mar 9, 2022
@bors
Copy link
Contributor

bors commented Apr 1, 2022

☔ The latest upstream changes (presumably #95552) made this pull request unmergeable. Please resolve the merge conflicts.

@apiraino
Copy link
Contributor

apiraino commented Apr 7, 2022

I'm removing T-compiler from the teams involved for review since it looks like a T-libs thing. In case feel free to add the label again.

@rustbot label -T-compiler

@rustbot rustbot removed the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 7, 2022
@scottmcm
Copy link
Member Author

scottmcm commented Apr 8, 2022

Rebased.

@joshtriplett friendly 1-month review ping. Let me know if you'd rather I bring it up in a libs-api meeting.

@scottmcm
Copy link
Member Author

cc @Amanieu, who was mentioned in the libs-api meeting today as perhaps interested in commenting on this.

@joshtriplett
Copy link
Member

r? @Amanieu

@Amanieu
Copy link
Member

Amanieu commented May 3, 2022

@bors r+ rollup=never

@bors
Copy link
Contributor

bors commented May 3, 2022

📌 Commit ae6baf0e0ba28a27734dc24786a29083015e9f33 has been approved by Amanieu

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 3, 2022
@bors
Copy link
Contributor

bors commented May 3, 2022

⌛ Testing commit ae6baf0e0ba28a27734dc24786a29083015e9f33 with merge 5036d0075d5da4806f69242f88e0241494512937...

@rust-log-analyzer

This comment has been minimized.

@bors
Copy link
Contributor

bors commented May 3, 2022

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels May 3, 2022
@scottmcm scottmcm force-pushed the prefix-free-hasher-methods branch from ae6baf0 to 24e5585 Compare May 3, 2022 21:11
@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 6, 2022
compiler-errors added a commit to compiler-errors/rust that referenced this pull request May 6, 2022
…errors

Put the incompatible_closure_captures lint messages in alphabetical order

Looks like they were in hash order before, which was causing me trouble in rust-lang#94598, so this PR sorts the errors by trait name.
@scottmcm scottmcm force-pushed the prefix-free-hasher-methods branch from 609977f to 6b386f5 Compare May 6, 2022 06:18
@rust-log-analyzer

This comment has been minimized.

This accomplishes two main goals:
- Make it clear who is responsible for prefix-freedom, including how they should do it
- Make it feasible for a `Hasher` that *doesn't* care about Hash-DoS resistance to get better performance by not hashing lengths

This does not change rustc-hash, since that's in an external crate, but that could potentially use it in future.
We might want to change the default before stabilizing (or maybe even after), but for getting in the new unstable methods, leave it as-is for now.  That way it won't break cargo and such.
@scottmcm scottmcm force-pushed the prefix-free-hasher-methods branch from 6b386f5 to ebdcb08 Compare May 6, 2022 07:15
@scottmcm
Copy link
Member Author

scottmcm commented May 6, 2022

Ok, @Amanieu, I've updated this to change the provided Hasher::write_str to do the same thing that str::hash was doing before, and as part of that I've been able to remove the ui test updates.

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels May 6, 2022
@Amanieu
Copy link
Member

Amanieu commented May 6, 2022

@bors r+

@bors
Copy link
Contributor

bors commented May 6, 2022

📌 Commit ebdcb08 has been approved by Amanieu

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 6, 2022
@bors
Copy link
Contributor

bors commented May 6, 2022

⌛ Testing commit ebdcb08 with merge 8c4fc9d...

@bors
Copy link
Contributor

bors commented May 6, 2022

☀️ Test successful - checks-actions
Approved by: Amanieu
Pushing 8c4fc9d to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label May 6, 2022
@bors bors merged commit 8c4fc9d into rust-lang:master May 6, 2022
@rustbot rustbot added this to the 1.62.0 milestone May 6, 2022
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (8c4fc9d): comparison url.

Summary:

  • Primary benchmarks: no relevant changes found
  • Secondary benchmarks: 😿 relevant regression found
Regressions 😿
(primary)
Regressions 😿
(secondary)
Improvements 🎉
(primary)
Improvements 🎉
(secondary)
All 😿 🎉
(primary)
count1 0 1 0 0 0
mean2 N/A 1.3% N/A N/A N/A
max N/A 1.3% N/A N/A N/A

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

@rustbot label: -perf-regression

Footnotes

  1. number of relevant changes

  2. the arithmetic mean of the percent change

GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request May 6, 2022
…errors

Put the incompatible_closure_captures lint messages in alphabetical order

Looks like they were in hash order before, which was causing me trouble in rust-lang#94598, so this PR sorts the errors by trait name.
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request May 6, 2022
…errors

Put the incompatible_closure_captures lint messages in alphabetical order

Looks like they were in hash order before, which was causing me trouble in rust-lang#94598, so this PR sorts the errors by trait name.
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request May 6, 2022
…errors

Put the incompatible_closure_captures lint messages in alphabetical order

Looks like they were in hash order before, which was causing me trouble in rust-lang#94598, so this PR sorts the errors by trait name.
@scottmcm scottmcm deleted the prefix-free-hasher-methods branch May 7, 2022 21:57
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request May 8, 2022
…anieu

Further elaborate the lack of guarantees from `Hasher`

I realized that I got too excited in rust-lang#94598 by adding new methods, and forgot to do the documentation to really answer the core question in rust-lang#94026.

This PR just has that doc update.

r? `@Amanieu`
/// writing its contents to this `Hasher`. That way
/// `(collection![1, 2, 3], collection![4, 5])` and
/// `(collection![1, 2], collection![3, 4, 5])` will provide different
/// sequences of values to the `Hasher`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lost a full-stop .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hasher::write should clarify its "whole unit" behaviour