Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized HashMap for size. Added DefaultResizePolicy #14526

Merged
merged 1 commit into from
Jun 5, 2014

Conversation

pczarn
Copy link
Contributor

@pczarn pczarn commented May 29, 2014

An interface that gives a better control over the load factor and the minimum capacity for HashMap.
The size of HashMap<K, V> is now 64 bytes by default on a 64-bit platform (or at least 40 bytes, that is 3 words less, with FNV and without minimum capacity)

Unanswered questions about ResizePolicy

  • should it control the INITIAL_CAPACITY?
  • should it fully control the resizing behavior? Even though the capacity always changes by a factor of 2.
  • is caching grow_at desirable?

special thanks to @eddyb and @pnkfelix

deallocate(self.hashes as *mut u8, size, align);
// Remember how everything was allocated out of one buffer
// during initialization? We only need one call to free here.
if self.hashes.is_not_null() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should zero self.hashes after deallocation etc. just to be safe.

@alexcrichton
Copy link
Member

This change is a fairly large change to a core type in rust, due to the addition of an extra type parameter everywhere, so I would imagine that it likely warrants an RFC. If the resizing policy were an implementation of the detail of the hash map, I would be more comfortable with it.

@brson
Copy link
Contributor

brson commented May 29, 2014

What is ResizePolicy, why is it needed, and what are the downsides if we don't have it?

deallocate(self.hashes as *mut u8, size, align);
// Remember how everything was allocated out of one buffer
// during initialization? We only need one call to free here.
self.hashes = RawPtr::null();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necessary, dropping will always zero the destination.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eddyb: @huonw advised to add this "just to be safe". Dropping should zero it out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eddyb drop zeroing is not to be relied upon. it's going away.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmr and when it does, destructors will be run precisely once. (Meaning this is ok.)

@eddyb
Copy link
Member

eddyb commented May 30, 2014

We were wasting space storing constants and caching relatively-easy-to-perform computations, while HashMap wasn't even exposing a way to change the load factor.

@pczarn
Copy link
Contributor Author

pczarn commented May 30, 2014

There are some changes and optimizations unrelated to ResizePolicy, specifically removing the drop flag.

To begin with, ResizePolicy is a trait that allows the user to define conditions under which the allocated table is grown or shrunk. In other words, it controls how much unused capacity a hashmap can have. It's a (possibly non-trivial) combination of the minimum capacity, the load factor, and shrinking behavior.

An extra type parameter isn't added exactly "everywhere" thanks to default type parameters. ResizePolicy is an ordinary abstraction. It reminds me of fmt's GenericRadix.

The proposed implementation has flaws:

  • it's still impossible to instantiate a HashMap with a custom load factor unless we add a method called with_capacity_hasher_and_resize_policy. Rust's methods rarely have excessively long names.
  • the effects of reserve depend on the ResizePolicy and might be different than expected.

We can keep using what the author calls a "hackish fraction type". The disadvantages of the current approach are marginal for librustc since hashmaps are often kept on the stack, I suppose. But a fine-grained control should be desirable for Servo. According to @cgaebel, two load factors are particularly useful: approximately 92% and 84%. The latter would make the probing read a single cache line most of the time, at the expense of a higher memory usage.

Also, we can make it an internal mechanism without the use of type parameters for now.

@cgaebel
Copy link
Contributor

cgaebel commented May 30, 2014

Overall, I like this refactoring and think it cleans up a few subtle sharp corners in HashMap. I think it is very important that it exposes a configurable load factor resize policy, as that will cover 99% of use cases, and also serves as a proof of the utility of the ResizePolicy abstraction.

Oh, and I find it really cute that a hashmap is now exactly one cache line in size. :)

@alexcrichton
Copy link
Member

Do you have any concrete numbers for where this improves existing applications? The reduction of size of a HashMap to just 64 bytes seems appealing, but it would be useful to see how beneficial this is to applications such as rustc.

I'm not personally worried about consumers of the HashMap type, but I am quite worried about users looking at the documentation of HashMap as well as anyone reading this code. Each impl block already has 5 type parameters (a very high number), and this is adding yet another that needs to be comprehended. The defaults are almost never in place on the impl blocks, so there needs to be a tradeoff in how understandable we want the documentation to be or how rustdoc should change to accommodate use cases such as this.

I'm also curious about @brson's questions, "why is it needed, and what are the downsides if we don't have it?". I'm not sure that the answer for why is this needed that 1% of use cases may use this would justify it. Extending this would add a resize policy type parameter to almost all collections: Vec, PriorityQueue, BTree, LruCache, Trie, etc.

@pczarn
Copy link
Contributor Author

pczarn commented May 30, 2014

Force pushed minor changes.

No, it makes no measurable difference in microbenchmarks*. And I believe that rustc has not too many HashMaps and they usually live on the stack (a difference of 2sec in my numbers is most likely a measurement error). I'll have to look somewhere else for numbers.

By the way, it's very difficult to squeeze out further two more pointers without a significant loss of performance. But it might be possible.

@alexcrichton, this is concerning. The first time I had a closer look at HashMap, I had already found its type parameters confusing and nonintuitive. Although the syntax should better be understandable and the docs are getting really unreadable, it doesn't mean it shouldn't be done this way.

Fortunately, Vec doesn't require anything that is currently a part of ResizePolicy; it has no load factor and no shrinking policy. Therefore only HashMap-based collections are affected: HashSet, LruCache etc.

                  with ResizePolicy         original
find_existing     59616 ns/iter (+/- 2784)   59640 ns/iter (+/- 2959)
find_nonexisting  59971 ns/iter (+/- 1874)   60565 ns/iter (+/- 2385)
find_pop_insert     301 ns/iter (+/- 4)        291 ns/iter (+/- 3)
hashmap_as_queue    177 ns/iter (+/- 3)        171 ns/iter (+/- 4)
insert              357 ns/iter (+/- 38)       318 ns/iter (+/- 14)
new_drop            209 ns/iter (+/- 5)        205 ns/iter (+/- 5)
new_insert_drop     311 ns/iter (+/- 17)       322 ns/iter (+/- 12)

@gereeter
Copy link
Contributor

gereeter commented Jun 1, 2014

Would it be possible to split this PR into two parts? The only controversial part of it seems to be the ResizePolicy trait, and since this PR doesn't even provide a way for the user to specify a custom policy, I don't think it would hurt too much to only have DefaultResizePolicy. Additionally, the padding measurements seem to imply that the space optimization part helps quite a lot.

@pczarn
Copy link
Contributor Author

pczarn commented Jun 1, 2014

For now, I'll make DefaultResizePolicy a private type without a trait.

Also, I realized that all collections will get yet another default type parameter once the "Allocator trait" RFC gets accepted. Which one should come first? I'm convinced this warrants an RFC.

Keyword arguments would make changes like these much less painful. Hopefully the use of type parameters will improve as well, for instance by bringing back impls on typedefs (#6087, #9767).

@pczarn pczarn changed the title Optimized HashMap for size. Added ResizePolicy Optimized HashMap for size. Added DefaultResizePolicy Jun 2, 2014
@alexcrichton
Copy link
Member

This looks like it's just an internal improvement to HashMap, so I think it's definitely fine to merge in that respect (just needs a rebase now).

Refactored the load factor and the minimum capacity out of HashMap.
The size of HashMap<K, V> is now 64 bytes by default on a 64-bit platform
(or 48 bytes, that is 2 words less, with FNV)
Added a documentation in a few places to clarify the behavior.
@alexcrichton
Copy link
Member

Feel free to ping the PR whenever you update it, sadly github doesn't send out any notifications about a force-push :(

bors added a commit that referenced this pull request Jun 5, 2014
An interface that gives a better control over the load factor and the minimum capacity for HashMap.
The size of `HashMap<K, V>` is now 64 bytes by default on a 64-bit platform (or at least 40 bytes, that is 3 words less, with FNV and without minimum capacity)

Unanswered questions about `ResizePolicy`

* should it control the `INITIAL_CAPACITY`?
* should it fully control the resizing behavior? Even though the capacity always changes by a factor of 2.
* is caching `grow_at` desirable?

special thanks to @eddyb and @pnkfelix
@bors bors closed this Jun 5, 2014
@bors bors merged commit 2202b10 into rust-lang:master Jun 5, 2014
bors added a commit to rust-lang-ci/rust that referenced this pull request Jun 5, 2023
Revert "Add bounds for fields in derive macro"

Reverts rust-lang/rust-analyzer#14521 as it introduces too many mismatches
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants