Add a dedicated length-prefixing method to `Hasher` #94598

scottmcm · 2022-03-04T08:28:15Z

This accomplishes two main goals:

Make it clear who is responsible for prefix-freedom, including how they should do it
Make it feasible for a Hasher that doesn't care about Hash-DoS resistance to get better performance by not hashing lengths

This does not change rustc-hash, since that's in an external crate, but that could potentially use it in future.

r? rust-lang/libs

The core of this change is the following two new methods on Hasher:

pub trait Hasher {
    /// Writes a length prefix into this hasher, as part of being prefix-free.
    ///
    /// If you're implementing [`Hash`] for a custom collection, call this before
    /// writing its contents to this `Hasher`.  That way
    /// `(collection![1, 2, 3], collection![4, 5])` and
    /// `(collection![1, 2], collection![3, 4, 5])` will provide different
    /// sequences of values to the `Hasher`
    ///
    /// The `impl<T> Hash for [T]` includes a call to this method, so if you're
    /// hashing a slice (or array or vector) via its `Hash::hash` method,
    /// you should **not** call this yourself.
    ///
    /// This method is only for providing domain separation.  If you want to
    /// hash a `usize` that represents part of the *data*, then it's important
    /// that you pass it to [`Hasher::write_usize`] instead of to this method.
    ///
    /// # Examples
    ///
    /// ```
    /// #![feature(hasher_prefixfree_extras)]
    /// # // Stubs to make the `impl` below pass the compiler
    /// # struct MyCollection<T>(Option<T>);
    /// # impl<T> MyCollection<T> {
    /// #     fn len(&self) -> usize { todo!() }
    /// # }
    /// # impl<'a, T> IntoIterator for &'a MyCollection<T> {
    /// #     type Item = T;
    /// #     type IntoIter = std::iter::Empty<T>;
    /// #     fn into_iter(self) -> Self::IntoIter { todo!() }
    /// # }
    ///
    /// use std::hash::{Hash, Hasher};
    /// impl<T: Hash> Hash for MyCollection<T> {
    ///     fn hash<H: Hasher>(&self, state: &mut H) {
    ///         state.write_length_prefix(self.len());
    ///         for elt in self {
    ///             elt.hash(state);
    ///         }
    ///     }
    /// }
    /// ```
    ///
    /// # Note to Implementers
    ///
    /// If you've decided that your `Hasher` is willing to be susceptible to
    /// Hash-DoS attacks, then you might consider skipping hashing some or all
    /// of the `len` provided in the name of increased performance.
    #[inline]
    #[unstable(feature = "hasher_prefixfree_extras", issue = "88888888")]
    fn write_length_prefix(&mut self, len: usize) {
        self.write_usize(len);
    }

    /// Writes a single `str` into this hasher.
    ///
    /// If you're implementing [`Hash`], you generally do not need to call this,
    /// as the `impl Hash for str` does, so you can just use that.
    ///
    /// This includes the domain separator for prefix-freedom, so you should
    /// **not** call `Self::write_length_prefix` before calling this.
    ///
    /// # Note to Implementers
    ///
    /// The default implementation of this method includes a call to
    /// [`Self::write_length_prefix`], so if your implementation of `Hasher`
    /// doesn't care about prefix-freedom and you've thus overridden
    /// that method to do nothing, there's no need to override this one.
    ///
    /// This method is available to be overridden separately from the others
    /// as `str` being UTF-8 means that it never contains `0xFF` bytes, which
    /// can be used to provide prefix-freedom cheaper than hashing a length.
    ///
    /// For example, if your `Hasher` works byte-by-byte (perhaps by accumulating
    /// them into a buffer), then you can hash the bytes of the `str` followed
    /// by a single `0xFF` byte.
    ///
    /// If your `Hasher` works in chunks, you can also do this by being careful
    /// about how you pad partial chunks.  If the chunks are padded with `0x00`
    /// bytes then just hashing an extra `0xFF` byte doesn't necessarily
    /// provide prefix-freedom, as `"ab"` and `"ab\u{0}"` would likely hash
    /// the same sequence of chunks.  But if you pad with `0xFF` bytes instead,
    /// ensuring at least one padding byte, then it can often provide
    /// prefix-freedom cheaper than hashing the length would.
    #[inline]
    #[unstable(feature = "hasher_prefixfree_extras", issue = "88888888")]
    fn write_str(&mut self, s: &str) {
        self.write_length_prefix(s.len());
        self.write(s.as_bytes());
    }
}

With updates to the Hash implementations for slices and containers to call write_length_prefix instead of write_usize.

write_str defaults to using write_length_prefix since, as was pointed out in the issue, the write_u8(0xFF) approach is insufficient for hashers that work in chunks, as those would hash "a\u{0}" and "a" to the same thing. But since SipHash works byte-wise (there's an internal buffer to accumulate bytes until a full chunk is available) it overrides write_str to continue to use the add-non-UTF-8-byte approach.

Compatibility:

Because the default implementation of write_length_prefix calls write_usize, the changed hash implementation for slices will do the same thing the old one did on existing Hashers.

tczajka · 2022-03-04T09:06:26Z

Make it feasible for a Hasher that doesn't care about Hash-DoS resistance to get better performance by not hashing lengths

This seems like a bad idea. It's getting better performance by skipping over some relevant bits of information. By the same reasoning, you might skip the first element when hashing an (i32, i32) to get better performance. Isn't the whole point of hashing to hash all the data?

It's not just about DoS resistance, data such as ("aaa", "a") and ("a", "aaa") can come from a friendly source.

scottmcm · 2022-03-04T10:14:25Z

Isn't the whole point of hashing to hash all the data?

No, I wouldn't say that. Rather the point is to hash enough of the entropy to get the collision rate acceptably low. It's sometimes fine to skip lengths the same way it's sometimes fine to skip big-but-low-cardinality fields in Hash. I certainly wouldn't suggest it as the default behaviour -- so perhaps I should change the text to encourage still using the length in non-adversarial situations too -- but I can absolutely imagine that in many cases it would be perfectly appropriate. (And I'd much rather that some Hashers skipped it than have Hash implementations start showing up in the ecosystem that skip entirely for all Hashers.)

rustc has found that a lower-quality but faster hash is worth it, and it's entirely possible that the tradeoff in skipping lengths would also be a net win. Something like TerminatorKind::Call definitely doesn't need the length prefix, for example -- there's only one Vec in there, and if it has a different length it will have had a different func too, since rust doesn't have overloading. And just in general, it interning so much stuff makes it that much less likely that the ("ab", "c") vs ("a", "bc") scenarios will come up. So while some parts -- StableHasher, maybe -- likely will keep using the lengths, I wouldn't be surprised if most of the FxHashMaps were net-happier without lengths.

And because of Borrow, there's a bunch of places where a hasher that ignores lengths would be faster with no loss of hash quality. For example, I have a project I made 15% faster by changing from HashMap<[usize; 2], _> to HashMap<(usize, usize), _> -- because the former is forced to pointlessly hash a 2_usize for every key. Even just a HashSet<String> is wasting effort by hashing that extra byte, since it's obviously prefix-free without it. Being able to fix things like that with one Hasher wrapper, instead of needing alternative Hash implementations for all of them, sounds really nice to me.

It can potentially also provide value even if nobody ever omits the length entirely. For example, a byte-oriented hasher might choose to LEB128 the length prefixes to save hash calls in common cases, but not do that for all usizes. (That would make a [u8; 1] hash 2 bytes instead of 9, for example, so could plausibly be worth it, since the extra LEB cost would be on long slices where it's likely negligible compared to the total cost of hashing all those extra elements.)

tczajka · 2022-03-04T10:38:45Z

Some good points, but I want to address a few of them:

rustc has found that a lower-quality but faster hash is worth it

This comparison seems to be about performance of SipHash vs another hash. SipHash is just slow. I agree there are better hashers for many (most) applications, but you can get much faster hashers that don't skip anything.

For example, I have a project I made 15% faster by changing from HashMap<[usize; 2], _> to HashMap<(usize, usize), _> -- because the former is forced to pointlessly hash a 2_usize for every key.

Agreed, but this seems like a fundamental design choice / design issue with hashing arrays and Borrow, and this proposed API works around it in a brittle way which only works for certain hashers, rather than really addressing the fundamental problem with how constant-sized arrays are hashed.

Even just a HashSet<String> is wasting effort by hashing that extra byte, since it's obviously prefix-free without it.

Not really -- Strings are not prefix free. Good hashers can currently assume that the data they are hashing is already prefix free, so they don't need to add a terminator themselves, they can just pad the data with zeroes to a block size without any delimiters. This wouldn't work without the extra byte.

Currently SipHash seems to add a delimeter in finish(), but this is superfluous.

For example, a byte-oriented hasher might choose to LEB128 the length prefixes to save hash calls in common cases, but not do that for all usizes.

Another solution without having to complicate the API would be for Hash for [T] to always do this. I think it makes sense.

scottmcm · 2022-03-04T20:14:38Z

would be for Hash for [T] to always do this. I think it makes sense.

I absolutely want to try it for SipHash as a followup to this PR, if accepted.

But always doing it feels wrong. It's probably a win for anything that works byte-by-byte, but rustc's FxHasher works on usize chunks without a buffer:

https://github.com/rust-lang/rustc-hash/blob/5ff7e6ae50e770381cb7bb393bfd22e016f2b8a6/src/lib.rs#L76-L81

So the extra LEB128 work would be entirely wasteful overhead for it -- even if it's only a 1-byte write, it'll still make a full add_to_hash step:

https://github.com/rust-lang/rustc-hash/blob/5ff7e6ae50e770381cb7bb393bfd22e016f2b8a6/src/lib.rs#L105-L107

Which sounds to me like all the more reason to leave it up to the Hasher, rather than doing it in, say, slice's Hash. (Not to mention that we'd need to copy the logic to VecDeque's Hash and hashbrown's Hash and etc, so even if it were worth doing for every Hasher I think we'd still want a method for it somewhere.)

There are middle-ground possibilities here too. If you're implementing a hasher that uses a 4-byte block, then for platforms where usize is 64-bit you might want to "LEB2147483648" the lengths, to usually only need to do 1 round.

tczajka · 2022-03-05T02:26:13Z

So the extra LEB128 work would be entirely wasteful overhead

The extra work is just one comparison against 128, or a single bitand, with a very predictable branch (the branch can assume it's always small).

Avoiding this comparison (only in some non-standard Hashers) seems like a micro-optimization -- is this really worth complicating the Hasher API and complicating prefix-freedom rules for Hash?

If the length is 128 or more, the overhead for length doesn't matter because hashing 128 things is going to dominate (except for ZSTs).

Alternatively, the following encoding could be even better: one byte for lengths smaller than 255, 9 bytes (255u8, length) for lengths 255 or more.

bors · 2022-04-01T20:42:12Z

☔ The latest upstream changes (presumably #95552) made this pull request unmergeable. Please resolve the merge conflicts.

apiraino · 2022-04-07T09:19:33Z

I'm removing T-compiler from the teams involved for review since it looks like a T-libs thing. In case feel free to add the label again.

@rustbot label -T-compiler

scottmcm · 2022-04-08T04:16:34Z

Rebased.

@joshtriplett friendly 1-month review ping. Let me know if you'd rather I bring it up in a libs-api meeting.

scottmcm · 2022-04-20T20:05:27Z

cc @Amanieu, who was mentioned in the libs-api meeting today as perhaps interested in commenting on this.

joshtriplett · 2022-04-27T19:17:22Z

r? @Amanieu

Amanieu · 2022-05-03T06:19:27Z

@bors r+ rollup=never

bors · 2022-05-03T06:19:29Z

📌 Commit ae6baf0e0ba28a27734dc24786a29083015e9f33 has been approved by Amanieu

bors · 2022-05-03T08:36:56Z

⌛ Testing commit ae6baf0e0ba28a27734dc24786a29083015e9f33 with merge 5036d0075d5da4806f69242f88e0241494512937...

bors · 2022-05-03T09:00:59Z

💔 Test failed - checks-actions

…errors Put the incompatible_closure_captures lint messages in alphabetical order Looks like they were in hash order before, which was causing me trouble in rust-lang#94598, so this PR sorts the errors by trait name.

This accomplishes two main goals: - Make it clear who is responsible for prefix-freedom, including how they should do it - Make it feasible for a `Hasher` that *doesn't* care about Hash-DoS resistance to get better performance by not hashing lengths This does not change rustc-hash, since that's in an external crate, but that could potentially use it in future.

We might want to change the default before stabilizing (or maybe even after), but for getting in the new unstable methods, leave it as-is for now. That way it won't break cargo and such.

scottmcm · 2022-05-06T07:55:19Z

Ok, @Amanieu, I've updated this to change the provided Hasher::write_str to do the same thing that str::hash was doing before, and as part of that I've been able to remove the ui test updates.

@rustbot ready

Amanieu · 2022-05-06T08:22:02Z

@bors r+

bors · 2022-05-06T08:22:03Z

📌 Commit ebdcb08 has been approved by Amanieu

bors · 2022-05-06T09:44:01Z

⌛ Testing commit ebdcb08 with merge 8c4fc9d...

bors · 2022-05-06T12:09:39Z

☀️ Test successful - checks-actions
Approved by: Amanieu
Pushing 8c4fc9d to master...

rust-timer · 2022-05-06T13:37:15Z

Finished benchmarking commit (8c4fc9d): comparison url.

Summary:

Primary benchmarks: no relevant changes found
Secondary benchmarks: 😿 relevant regression found

	Regressions 😿 (primary)	Regressions 😿 (secondary)	Improvements 🎉 (primary)	Improvements 🎉 (secondary)	All 😿 🎉 (primary)
count¹	0	1	0	0	0
mean²	N/A	1.3%	N/A	N/A	N/A
max	N/A	1.3%	N/A	N/A	N/A

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

@rustbot label: -perf-regression

number of relevant changes ↩
the arithmetic mean of the percent change ↩

…errors Put the incompatible_closure_captures lint messages in alphabetical order Looks like they were in hash order before, which was causing me trouble in rust-lang#94598, so this PR sorts the errors by trait name.

…anieu Further elaborate the lack of guarantees from `Hasher` I realized that I got too excited in rust-lang#94598 by adding new methods, and forgot to do the documentation to really answer the core question in rust-lang#94026. This PR just has that doc update. r? `@Amanieu`

mzji · 2022-05-13T02:09:56Z

library/core/src/hash/mod.rs

+    /// writing its contents to this `Hasher`.  That way
+    /// `(collection![1, 2, 3], collection![4, 5])` and
+    /// `(collection![1, 2], collection![3, 4, 5])` will provide different
+    /// sequences of values to the `Hasher`


Lost a full-stop .

rust-highfive assigned joshtriplett Mar 4, 2022

rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Mar 4, 2022

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 4, 2022

scottmcm mentioned this pull request Mar 4, 2022

Hasher::write should clarify its "whole unit" behaviour #94026

Closed

scottmcm added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Mar 9, 2022

rustbot removed the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 7, 2022

scottmcm force-pushed the prefix-free-hasher-methods branch from 86aa700 to ae6baf0 Compare April 8, 2022 03:12

rust-highfive assigned Amanieu and unassigned joshtriplett Apr 27, 2022

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 3, 2022

This comment has been minimized.

Sign in to view

bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels May 3, 2022

scottmcm force-pushed the prefix-free-hasher-methods branch from ae6baf0 to 24e5585 Compare May 3, 2022 21:11

bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 6, 2022

scottmcm mentioned this pull request May 6, 2022

Put the incompatible_closure_captures lint messages in alphabetical order #96752

Merged

scottmcm force-pushed the prefix-free-hasher-methods branch from 609977f to 6b386f5 Compare May 6, 2022 06:18

This comment has been minimized.

Sign in to view

scottmcm mentioned this pull request May 6, 2022

Tracking Issue for Hasher::{write_str, write_length_prefix} (feature(hasher_prefixfree_extras)) #96762

Open

3 tasks

scottmcm added 2 commits May 6, 2022 00:03

For now, don't change the details of hashing a str

ebdcb08

We might want to change the default before stabilizing (or maybe even after), but for getting in the new unstable methods, leave it as-is for now. That way it won't break cargo and such.

scottmcm force-pushed the prefix-free-hasher-methods branch from 6b386f5 to ebdcb08 Compare May 6, 2022 07:15

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels May 6, 2022

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 6, 2022

bors added the merged-by-bors This PR was explicitly merged by bors. label May 6, 2022

bors merged commit 8c4fc9d into rust-lang:master May 6, 2022

rustbot added this to the 1.62.0 milestone May 6, 2022

scottmcm deleted the prefix-free-hasher-methods branch May 7, 2022 21:57

scottmcm mentioned this pull request May 8, 2022

Further elaborate the lack of guarantees from Hasher #96828

Merged

mzji reviewed May 13, 2022

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a dedicated length-prefixing method to `Hasher` #94598

Add a dedicated length-prefixing method to `Hasher` #94598

scottmcm commented Mar 4, 2022

tczajka commented Mar 4, 2022 •

edited

Loading

scottmcm commented Mar 4, 2022

tczajka commented Mar 4, 2022 •

edited

Loading

scottmcm commented Mar 4, 2022

tczajka commented Mar 5, 2022 •

edited

Loading

bors commented Apr 1, 2022

apiraino commented Apr 7, 2022 •

edited

Loading

scottmcm commented Apr 8, 2022

scottmcm commented Apr 20, 2022

joshtriplett commented Apr 27, 2022

Amanieu commented May 3, 2022

bors commented May 3, 2022

bors commented May 3, 2022

This comment has been minimized.

bors commented May 3, 2022

This comment has been minimized.

scottmcm commented May 6, 2022

Amanieu commented May 6, 2022

bors commented May 6, 2022

bors commented May 6, 2022

bors commented May 6, 2022

rust-timer commented May 6, 2022

mzji May 13, 2022

Add a dedicated length-prefixing method to Hasher #94598

Add a dedicated length-prefixing method to Hasher #94598

Conversation

scottmcm commented Mar 4, 2022

tczajka commented Mar 4, 2022 • edited Loading

scottmcm commented Mar 4, 2022

tczajka commented Mar 4, 2022 • edited Loading

scottmcm commented Mar 4, 2022

tczajka commented Mar 5, 2022 • edited Loading

bors commented Apr 1, 2022

apiraino commented Apr 7, 2022 • edited Loading

scottmcm commented Apr 8, 2022

scottmcm commented Apr 20, 2022

joshtriplett commented Apr 27, 2022

Amanieu commented May 3, 2022

bors commented May 3, 2022

bors commented May 3, 2022

This comment has been minimized.

bors commented May 3, 2022

This comment has been minimized.

scottmcm commented May 6, 2022

Amanieu commented May 6, 2022

bors commented May 6, 2022

bors commented May 6, 2022

bors commented May 6, 2022

rust-timer commented May 6, 2022

Footnotes

mzji May 13, 2022

Choose a reason for hiding this comment

Add a dedicated length-prefixing method to `Hasher` #94598

Add a dedicated length-prefixing method to `Hasher` #94598

tczajka commented Mar 4, 2022 •

edited

Loading

tczajka commented Mar 4, 2022 •

edited

Loading

tczajka commented Mar 5, 2022 •

edited

Loading

apiraino commented Apr 7, 2022 •

edited

Loading