
Adaptive hashmap implementation #38368

Merged: 1 commit merged into rust-lang:master on Feb 16, 2017

Conversation

@arthurprs (Contributor) commented Dec 14, 2016

All credit to @pczarn, who wrote rust-lang/rfcs#1796 and contain-rs/hashmap2#5.

Background

The Rust standard library hashmap puts a strong emphasis on security. We made some improvements in #37470, but in some very specific cases and for non-default hashers it is still vulnerable (see #36481).

This is a simplified version of the rust-lang/rfcs#1796 proposal, without switching hashers on the fly and other things that require an RFC process and further decisions. I think this part has great potential by itself.

Proposal
This PR adds code that checks for extra-long probe and shift lengths (see the code comments and rust-lang/rfcs#1796 for details). When those are encountered, the hashmap grows early (even if the capacity limit has not been reached yet), greatly attenuating the degenerate performance case.

We need a lower bound on the occupancy that may trigger the early resize; otherwise, in extreme cases, it would be possible to turn the CPU attack into a memory attack. The PR puts that lower bound at half of the maximum occupancy (defined by ResizePolicy). This reduces the protection (the map could potentially still be exploited between 0% and 50% occupancy) but makes it safe against the memory attack.
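
As a rough illustration of how such a check could look (a minimal sketch, not the code in this PR; the 128/512 thresholds are the ones visible in the diff discussed below, while the constant names and the standalone function are invented for the example):

```rust
// Illustrative only: the constant names and the free function are made up;
// 128 and 512 are the draft thresholds visible in the diff reviewed below.
const DISPLACEMENT_THRESHOLD: usize = 128;  // probe length considered degenerate
const FORWARD_SHIFT_THRESHOLD: usize = 512; // forward shift considered degenerate

/// Should an insertion that probed `displacement` slots and shifted
/// `forward_shift` entries trigger an early (adaptive) resize?
fn should_grow_adaptively(
    displacement: usize,
    forward_shift: usize,
    len: usize,
    max_occupancy: usize,
) -> bool {
    let degenerate = displacement >= DISPLACEMENT_THRESHOLD
        || forward_shift >= FORWARD_SHIFT_THRESHOLD;
    // Only grow early above half of the maximum occupancy, so crafted keys
    // cannot turn the CPU attack into a memory attack.
    degenerate && len >= max_occupancy / 2
}
```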

Drawbacks

  • May interact badly with poor hashers: maps using them may not make full use of the desired capacity.
  • It adds 2-3 branches to the common insert path; luckily, those are highly predictable, and there is room to shave some off in future patches.
  • May complicate exposing ResizePolicy in the future, as the constants are a function of the fill factor.

Example

Example code that exploits the exposure of iteration order and a weak hasher.

```
const MERGE: usize = 10_000usize;
#[bench]
fn merge_dos(b: &mut Bencher) {
    let first_map: $hashmap<usize, usize, FnvBuilder> = (0..MERGE).map(|i| (i, i)).collect();
    let second_map: $hashmap<usize, usize, FnvBuilder> = (MERGE..MERGE * 2).map(|i| (i, i)).collect();
    b.iter(|| {
        let mut merged = first_map.clone();
        for (&k, &v) in &second_map {
            merged.insert(k, v);
        }
        ::test::black_box(merged);
    });
}
```

`_91` is the stdlib version and `_ad` is the patched one (the end capacity is the same in both cases):

```
running 2 tests
test _91::merge_dos              ... bench:  47,311,843 ns/iter (+/- 2,040,302)
test _ad::merge_dos              ... bench:     599,099 ns/iter (+/- 83,270)
```

@rust-highfive (Collaborator)

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

@arthurprs (Contributor, Author)

The code is in very rough shape; I wanted to collect feedback on the idea first.

@alexcrichton (Member)

r? @bluss

cc @pczarn, @apasel422

rust-highfive assigned bluss and unassigned alexcrichton on Dec 15, 2016
alexcrichton added the T-libs-api label (Relevant to the library API team, which will review and decide on the PR/issue.) on Dec 15, 2016
```
 let mut old_table = replace(&mut self.table, RawTable::new(new_raw_cap));
 let old_size = old_table.size();

-if old_table.capacity() == 0 || old_table.size() == 0 {
+if old_table.size() == 0 {
```
Member:

why was the capacity conditional removed here?

@arthurprs (Contributor, Author):

This doesn't need to be part of the PR. The capacity check is redundant though, right?

@bluss (Member) commented Dec 15, 2016:

Right, it's the existence of it in the first place that is puzzling: if the capacity is 0, the size is surely already 0.

@arthurprs (Contributor, Author)

I can remove that change. But the capacity check is redundant, right?

```
NoElem(bucket) => bucket.put(self.hash, self.key, value).into_mut_refs().1,
NeqElem(bucket, disp) => {
    let (shift, v_ref) = robin_hood(bucket, disp, self.hash, self.key, value);
    if disp >= 128 || shift >= 512 {
```
@arthurprs (Contributor, Author):

These will, of course, be moved into well-commented constants.

@arthurprs (Contributor, Author):

I wonder if we can get away with only checking the probe length. It's still possible to abuse long shifts without hitting the probe-length limit, but that's a lot harder.

@bluss (Member) commented Dec 15, 2016

Looks remarkably simple for what it does. That's good.

Advantages:

  • Only affects insert (the fast lookup we have is untouched)
  • Simple

Obviously the constants involved need proper names, tuning, and comments. I think we can make the argument that, for example, a displacement of 128 slots from the best position is always a bad case and should never occur in a healthy hash table, no matter its size?
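
For a rough sanity check of that claim, here is an illustrative Monte Carlo sketch (not code from this PR): it fills a Robin Hood table to roughly the 90.9% load factor with uniformly random hashes and reports the maximum displacement seen while inserting. With a good hasher that maximum is a function of the load factor rather than the table size, and it should only very rarely come anywhere near 128.

```rust
// Illustrative Monte Carlo sketch: xorshift64* stands in for a "good" hasher.

fn xorshift64star(state: &mut u64) -> u64 {
    *state ^= *state >> 12;
    *state ^= *state << 25;
    *state ^= *state >> 27;
    state.wrapping_mul(0x2545_F491_4F6C_DD1D)
}

/// Insert `n` random hashes into a table with `capacity` slots using Robin Hood
/// linear probing; return the maximum displacement observed during insertion.
fn max_displacement(n: usize, capacity: usize, seed: u64) -> usize {
    let mut state = seed;
    // Each slot holds (hash, displacement) for the entry stored there.
    let mut slots: Vec<Option<(u64, usize)>> = vec![None; capacity];
    let mut max_disp = 0;
    for _ in 0..n {
        let mut hash = xorshift64star(&mut state);
        let mut idx = (hash as usize) % capacity;
        let mut disp = 0usize;
        loop {
            let slot = slots[idx]; // Option<(u64, usize)> is Copy
            match slot {
                None => {
                    slots[idx] = Some((hash, disp));
                    max_disp = max_disp.max(disp);
                    break;
                }
                Some((other_hash, other_disp)) if other_disp < disp => {
                    // Robin Hood swap: displace the "richer" entry and keep
                    // probing forward on its behalf.
                    slots[idx] = Some((hash, disp));
                    max_disp = max_disp.max(disp);
                    hash = other_hash;
                    disp = other_disp;
                }
                _ => {}
            }
            idx = (idx + 1) % capacity;
            disp += 1;
        }
    }
    max_disp
}

fn main() {
    for &capacity in &[1usize << 12, 1 << 16, 1 << 20] {
        let n = capacity * 909 / 1000; // ~90.9% load factor, as in std at the time
        let worst = (0..10u64)
            .map(|run| max_displacement(n, capacity, 0x9E37_79B9_7F4A_7C15 ^ run))
            .max()
            .unwrap();
        println!("capacity {:>8}: worst displacement over 10 runs = {}", capacity, worst);
    }
}
```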

@arthurprs (Contributor, Author)

The good thing is that the math behind this is independent of the map size; it's only a function of the fill factor and of the hasher being good. The former is fine, as the constants also work for fill factors smaller than the one they were calculated for.

Interacting badly with bad hashers could be problematic in practice, as the hashmap may never reach the maximum fill factor (the half-filled check is useful here, so it doesn't blow up).

@Veedrac (Contributor) commented Dec 16, 2016

> We need a lower bound on the minimum occupancy that may trigger the early resize, otherwise in extreme cases it's possible to turn the CPU attack into a memory attack. The PR code puts that lower bound at half of the max occupancy (ResizePolicy).

This sounds like a good idea, but it means it only counters the n=2 case (i.e. merging two maps, rather than, say, the first nth of n maps). That's definitely an improvement, though.

@arthurprs (Contributor, Author)

@Veedrac what do you mean by n=2 case?

@arthurprs (Contributor, Author)

Putting it in more generic terms: you mean that it can still be abused while between 0% and 50% filled?

@Veedrac (Contributor) commented Dec 17, 2016

Yes, basically. I'll try to cook up some examples later, to give a more concrete demo.

@arthurprs (Contributor, Author) commented Jan 18, 2017

Trying to resume the conversation... I think the obvious open question here is the interaction with less-than-good hashers: hashmaps using them may not make full use of the desired capacity.

brson added the relnotes label (Marks issues that should be documented in the release notes of the next release.) on Feb 13, 2017
@alexcrichton (Member)

The libs team discussed this briefly at triage the other day, and we were wondering if we could perhaps land this ahead of the RFC? The changes to probing here are universally better, even if we don't do the hasher changes yet, right?

If so, perhaps the PR title/description could be cleaned up to reflect the current state and we could look to merge?

@arthurprs (Contributor, Author) commented Feb 14, 2017

I'll update the PR description to provide a clearer picture.

> The changes to probing here are universally better, even if we don't do the hasher changes yet, right?

I wouldn't say universally, but mostly.

@alexcrichton (Member)

Ah ok, thanks for the clarification. Want to ping me when updated and we can look to merge?

@arthurprs (Contributor, Author)

I should have elaborated on that. It's not strictly better, because the interaction with poor hashers isn't great: with those, it's possible that the hashmap resizes early even on non-rogue input.

I'll ping when I update it.

arthurprs changed the title from "[WIP] Adaptive hashmap implementation" to "Adaptive hashmap implementation" on Feb 15, 2017
@arthurprs (Contributor, Author)

PR updated; now there are two constants and lots of comment lines.

@alexcrichton (Member)

Thanks @arthurprs! Out of curiosity, would it be at all possible to add a test for this?

@arthurprs (Contributor, Author)

I think so. It's possible to observe the early resizes from the public API, and it's somewhat easy to trigger them by merging two maps with the same hash seed (like the example in the first post). I'll write something tomorrow.
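
A sketch of the kind of observation such a test could make (illustrative only, not the test that eventually landed; `FnvBuildHasher` is from the external `fnv` crate and stands in for the `FnvBuilder` used in the benchmark above):

```rust
// Merge two maps that share a deterministic hasher and watch `capacity()` as
// elements are inserted. With the adaptive behaviour in this PR, an early
// (adaptive) resize shows up as a capacity jump while `len()` is still well
// below what the plain load-factor policy would require.
use std::collections::HashMap;
use fnv::FnvBuildHasher;

fn main() {
    const N: usize = 10_000;
    let first: HashMap<usize, usize, FnvBuildHasher> = (0..N).map(|i| (i, i)).collect();
    let second: HashMap<usize, usize, FnvBuildHasher> = (N..N * 2).map(|i| (i, i)).collect();

    let mut merged = first.clone();
    let mut last_capacity = merged.capacity();
    for (&k, &v) in &second {
        merged.insert(k, v);
        if merged.capacity() != last_capacity {
            // Every resize is visible here; an early one happens at a lower
            // occupancy than the normal load-factor trigger.
            println!("resized at len = {} -> capacity = {}", merged.len(), merged.capacity());
            last_capacity = merged.capacity();
        }
    }
}
```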

@arthurprs (Contributor, Author)

I rebased and squashed the commits.

@alexcrichton (Member)

@bors: r+

Thanks again, and thanks for being patient, @arthurprs!

@bors (Contributor) commented Feb 16, 2017

📌 Commit 57940d0 has been approved by alexcrichton

@bors (Contributor) commented Feb 16, 2017

⌛ Testing commit 57940d0 with merge 668864d...

bors added a commit that referenced this pull request Feb 16, 2017
Adaptive hashmap implementation

@bors (Contributor) commented Feb 16, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: alexcrichton
Pushing 668864d to master...

bors merged commit 57940d0 into rust-lang:master on Feb 16, 2017
@arthurprs (Contributor, Author)

@istankovic Please make a PR 😃

@istankovic (Contributor)

@arthurprs Nah, it was just something I noticed so I made the comments, but it doesn't bother me enough to make a PR, sorry...

GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request Feb 19, 2017
Fix spelling in hashmap comments

Fixing my bad English from rust-lang#38368

Note to self: triple check spelling/grammar
@arthurprs (Contributor, Author) commented Feb 19, 2017

The shift-length math is broken. It turns out that checking the shift length is complicated. Using simulations, it's possible to see that a value of 2000 only gets the probabilities down to ~1e-7 when the hashmap load factor is 90% (Rust goes up to 90.9% as of today). That's probably not good enough to go into the stdlib with pluggable hashers. See rust-lang/rfcs#1796 (comment) and rust-lang/rfcs#1796 (comment)

I suggest taking that part out and keeping only the displacement check, which is much safer and very useful by itself.

cc @pczarn @bluss @alexcrichton

@pczarn (Contributor) commented Feb 20, 2017

I agree. This issue also indicates that the hashmap load factor may be too high.

Thanks for the help running these simulations.

frewsxcv added a commit to frewsxcv/rust that referenced this pull request Feb 20, 2017
Fix spelling in hashmap comments
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request Feb 22, 2017
Simplify/fix adaptive hashmap

Please see rust-lang#38368 (comment) for context.

The shift-length math is broken. It turns out that checking the shift length is complicated. Using simulations, it's possible to see that a value of 2000 will only get probabilities down to ~1e-7 when the hashmap load factor is 90% (Rust goes up to 90.9% as of today). That's probably not good enough to go into the stdlib with pluggable hashers.

So this PR simplifies the adaptive behavior to only consider displacement, which is much safer and very useful by itself.

There are two comments because one of them is already being tested by bors for merging.
frewsxcv added a commit to frewsxcv/rust that referenced this pull request Feb 23, 2017
Simplify/fix adaptive hashmap
@SimonSapin (Contributor)

Because of alignment(?), this one bool field makes the already-large HashMap grow from 40 bytes to 48. Maybe it’s worth the space, but I thought this should be noted.

(We caught this in Servo because we have unit tests that check std::mem::size_of for various types used in DOM nodes, to catch size regressions. Because there can be so many nodes in a document, even a small size increase can be significant.)
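
The alignment effect is easy to reproduce with plain structs. The sketch below uses illustrative types, not the real HashMap internals, and assumes a 64-bit target; it also shows the style of `std::mem::size_of` check Servo's tests rely on.

```rust
use std::mem::size_of;

// Hypothetical stand-ins: five word-sized fields take 40 bytes; adding a single
// bool pads the struct back up to 8-byte alignment, i.e. 48 bytes on 64-bit.
#[allow(dead_code)]
struct FiveWords { a: usize, b: usize, c: usize, d: usize, e: usize }
#[allow(dead_code)]
struct FiveWordsPlusBool { a: usize, b: usize, c: usize, d: usize, e: usize, flag: bool }

fn main() {
    // These asserts assume usize is 8 bytes (64-bit target).
    assert_eq!(size_of::<FiveWords>(), 40);
    assert_eq!(size_of::<FiveWordsPlusBool>(), 48);
    println!(
        "without bool: {} bytes, with bool: {} bytes",
        size_of::<FiveWords>(),
        size_of::<FiveWordsPlusBool>()
    );
}
```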

@SimonSapin (Contributor)

Could the extra bit be packed in RawTable::size or RawTable::capacity?

@pczarn (Contributor) commented Feb 23, 2017

Yes, of course. The code is going to be messy, though.

If we're able to restrict adaptive hashing to maps with the default hasher, I'd prefer to have the extra bit in RandomSipHasher.

@arthurprs (Contributor, Author) commented Feb 23, 2017

It's just a matter of finding how to use the bit with reasonable code.

I'd argue against making it RandomState-only; the selling point was supporting all hashmaps. Edit: I also think that making it RandomState-only would require even more code.
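
One way the extra bit could be packed, as an illustrative sketch only (the names are invented and this is not necessarily what an actual patch would do): steal the top bit of the size word, since a table can never hold that many entries anyway.

```rust
// Hypothetical packing of the "saw a long probe" flag into the size word.
const DANGER_FLAG: usize = 1 << (usize::BITS - 1);

struct PackedSize(usize);

impl PackedSize {
    fn new(size: usize) -> Self {
        debug_assert_eq!(size & DANGER_FLAG, 0);
        PackedSize(size)
    }
    fn size(&self) -> usize { self.0 & !DANGER_FLAG }
    fn saw_long_probe(&self) -> bool { self.0 & DANGER_FLAG != 0 }
    fn mark_long_probe(&mut self) { self.0 |= DANGER_FLAG; }
    fn set_size(&mut self, size: usize) {
        debug_assert_eq!(size & DANGER_FLAG, 0);
        self.0 = (self.0 & DANGER_FLAG) | size;
    }
}

fn main() {
    let mut s = PackedSize::new(10);
    s.mark_long_probe();
    s.set_size(11);
    assert_eq!(s.size(), 11);
    assert!(s.saw_long_probe());
}
```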

@SimonSapin (Contributor)

Let’s discuss in #40042.

eddyb added a commit to eddyb/rust that referenced this pull request Feb 25, 2017
Simplify/fix adaptive hashmap
eddyb added a commit to eddyb/rust that referenced this pull request Feb 25, 2017
Simplify/fix adaptive hashmap
anatol pushed a commit to anatol/steed that referenced this pull request Mar 31, 2017
Simplify/fix adaptive hashmap