Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HashMap starts the search even if the map is empty #38880

Closed
pmarcelll opened this issue Jan 6, 2017 · 10 comments
Closed

HashMap starts the search even if the map is empty #38880

pmarcelll opened this issue Jan 6, 2017 · 10 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@pmarcelll
Copy link
Contributor

It was mentioned in the latest Ferris Makes Emulators episode that contains showed up in the profiling results, although a lot of the times the map itself was empty.
I think the extra checking is worth it before starting the search and that way the hashing is also skipped.

@Manishearth
Copy link
Member

Manishearth commented Jan 6, 2017

contains() calls contains_key() which calls search() and finally search_hashed()

That function checks for zero capacity the first thing it does. The only work that is done between the contains() call and search_hashed() is computing the hash.

Now, if the map was empty (but had nonzero capacity) we'd have a problem. search_hashed can't bail early in that case because it still needs to produce the entry, but contains doesn't care about that.

In Ferris' example, were hashmaps with nonzero capacity but zero length used? You can make that happen with HashMap::with_capacity(foo), but by default hashmaps are zero capacity (no allocation till you insert things into them). If not, this probably isn't the problem.

Bailing out on size == 0 in contains is an interesting idea. Small negative impact on users of regular hashmaps, large positive impact on users of preallocated hashmaps.

@leonardo-m
Copy link

Small negative impact on users of regular hashmaps, large positive impact on users of preallocated hashmaps.

How much "small" is the negative impact in the regular case needs to be benchmarked.

@Manishearth
Copy link
Member

Yes. It's probably negligible.

@Stebalien
Copy link
Contributor

@Manishearth That really depends on the use case. If you have a large vector of mostly empty HashMap/HashSets, repeatedly re-hashing the key could take a while. However, this probably isn't a common use-case so the programmer could just check the length in cases like this.

Anyways, wasn't there some talk about falling back on linear search/storage for small maps/sets? That would probably be more useful in practice.

@arthurprs
Copy link
Contributor

arthurprs commented Jan 19, 2017

I wanted to move the capacity check from search_hashed up in the call stack, notably to the get* and contain methods (and there it can become a len = 0 check, solving this), as entry and insert reserve at least 1 space upfront anyway (so it's never really capacity 0). This saves 1 branch offsetting the extra ones required by #38368 . Probably not worth the trouble though :)

@steveklabnik steveklabnik added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. and removed A-libs labels Mar 24, 2017
@Mark-Simulacrum Mark-Simulacrum added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Jun 19, 2017
@Mark-Simulacrum Mark-Simulacrum added C-enhancement Category: An issue proposing an enhancement or a PR with one. and removed C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jul 26, 2017
@technicalguy
Copy link
Contributor

So I've had a look into this issue.

The only work that is done between the contains() call and search_hashed() is computing the hash.

Computing the hash can be quite costly, even for integer keys on an empty HashMap it seems contains() is an order of magnitude faster with an empty check before hash computation than with one after.

Bailing out on size == 0 in contains is an interesting idea. Small negative impact on users of regular hashmaps, large positive impact on users of preallocated hashmaps.

I think it should be possible to make an early size == 0 check without requiring a second check afterwards (i.e. no negative impact on regular HashMap users with non-empty maps). Some kind of search_hashed_no_empty_check method could be called by a search_no_empty_check method, given that contains_key has already checked that it is empty.

Anyways, wasn't there some talk about falling back on linear search/storage for small maps/sets? That would probably be more useful in practice.

This works well for small collections (e.g. integer keys up to about 16 items), however the problem is that different types require different amounts of time to compute the hash, and and also different amounts of time to compute equality. This is not necessarily consistent between values (e.g. strings or lists).

I think I can implement this for types with simple equality checks, but it cannot be used in general. (The empty check can be used in general, so I propose we implement both.)

technicalguy added a commit to technicalguy/rust that referenced this issue Feb 6, 2018
kennytm added a commit to kennytm/rust that referenced this issue Feb 14, 2018
…ap-38880, r=arthurprs

Early exit for empty HashMap (issue rust-lang#38880)

Addresses issue rust-lang#38880 by checking if the HashMap is empty before computing the value of the hash.

Before (integer keys)
```
running 4 tests
test empty_once ... bench:          13 ns/iter (+/- 0)
test empty_100  ... bench:       1,367 ns/iter (+/- 35)
test exist_once ... bench:          14 ns/iter (+/- 0)
test exist_100  ... bench:       1,518 ns/iter (+/- 40)
```

After
```
running 4 tests
test empty_once ... bench:           2 ns/iter (+/- 0)
test empty_100  ... bench:         221 ns/iter (+/- 0)
test exist_once ... bench:          15 ns/iter (+/- 0)
test exist_100  ... bench:       1,515 ns/iter (+/- 92)
```

When the HashMap is not empty, the performance remains the same, and when it is empty the performance is significantly improved.
@lucab
Copy link
Contributor

lucab commented Jan 29, 2019

Given that #48035 landed, is there anything else pending from this ticket?

@technicalguy
Copy link
Contributor

For this issue

HashMap starts the search even if the map is empty

there is nothing left to do.

For the linear search

Anyways, wasn't there some talk about falling back on linear search/storage for small maps/sets? That would probably be more useful in practice.

I don't know if it is possible: The time to compute equality and the time to compute a hash varies with different types (these two time determine the threshold at which it becomes faster to hash than to linear search). Not only that, but moreover, it could vary with different architectures. This means that determining the threshold is probably unfeasible in any kind of generality.

(You might be able to say that for a particular processor architecture, when the entire map is in the L1 cache, and your keys are 32-bit integers, then it is faster to do a linear search if the size of your map is up to 27 entries (for example), and after that, it is faster to hash the integer key and do a regular lookup. However, if any of these assumptions change then it might not be possible to guarantee a faster lookup for a map with 27 entries. The one feature that you really want from your HashMap is that the lookup time is bounded (with a certain probability) and it's very hard to do a linear search and guarantee that you will be at least as fast as a hash-lookup.)

The only thing that could certainly be implemented is if the hashmap contains only one element, then just do a single comparison rather than a hash lookup and then the comparison. i.e. hashmaps with size 1 don't need to compute the hash.

@pmarcelll
Copy link
Contributor Author

I didn't realize this issue wasn't closed by #48035, and I also think this should be closed since it would be more appropriate to open a new issue for future optimizations.

@technicalguy
Copy link
Contributor

Yes, I agree. It wasn't closed because I wrote "Addresses issue #38880" in the pull request not "Fixes #38880" which would have caused GitHub to auto close it. (At the time I thought linear search was more feasible and might be implemented as an extension to the empty-hashmap case, and so it seemed like it could make sense for the issue to remain open a little while longer. Clearly that was not to be.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

10 participants