Check hashes first during probing the aggr hash table #11718
Merged
+5
−4
I am somewhat confused about how this makes things faster. I thought that the check for equal hashes was done as part of `self.map.get_mut` (that is, the closure is only called when the hashes are equal, so I would expect this comparison to always be true).
However, if the benchmark results show it is an improvement, wonderful.
I'll run the numbers as well to confirm.
I don't get it either.
I thought the closure (row values equal check) is triggered only if the hash is matched 😕
Here is my guess about why it is faster.
I read the source code of hashbrown, and the `get_mut` procedure works like this:
- Use the hash value to find the first bucket.
- Use the `eq` function we passed in to check whether it is the target, for example, the previous `eq` function.
- If `eq` returns true, it is the target; otherwise we need to probe the next bucket and check again.

In the high-cardinality aggregation scenario, the entry often does not actually exist in the hash table.
And after the hash table grows too large (many buckets are filled), the probe is performed many times and finally finds nothing...
In this situation, checking the hash first can reduce random memory accesses compared to directly checking the group value through the group index.
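The probing steps above can be sketched with a toy table. This is a minimal illustration, not hashbrown's implementation: `ToyTable`, `Entry`, `tag_of`, and `count_row_comparisons` are hypothetical names, probing is linear and per slot rather than by group, and the tag is assumed to be the top 7 bits of the hash. The hashes are hand-built so that every stored entry shares the tag and start slot of the missing key, which is a deliberately bad case. Under those assumptions, it counts how many expensive row comparisons a failed lookup performs with and without the hash check in the closure.

```rust
/// One occupied slot: the full 64-bit hash plus an index into the stored group values.
#[derive(Clone, Copy)]
struct Entry {
    hash: u64,
    group_idx: usize,
}

/// Assumed tag layout: the top 7 bits of the hash.
fn tag_of(hash: u64) -> u8 {
    (hash >> 57) as u8
}

/// Toy open-addressing table with linear probing (hypothetical, for illustration only).
struct ToyTable {
    slots: Vec<Option<(u8, Entry)>>,
    mask: usize,
}

impl ToyTable {
    fn with_capacity_pow2(cap: usize) -> Self {
        assert!(cap.is_power_of_two());
        ToyTable { slots: vec![None; cap], mask: cap - 1 }
    }

    /// Inserts into the first free slot; the caller must keep the table from filling up.
    fn insert(&mut self, entry: Entry) {
        let mut pos = (entry.hash as usize) & self.mask;
        while self.slots[pos].is_some() {
            pos = (pos + 1) & self.mask;
        }
        self.slots[pos] = Some((tag_of(entry.hash), entry));
    }

    /// Probes from the start slot; `eq` runs only for slots whose tag matches.
    fn find(&self, hash: u64, mut eq: impl FnMut(&Entry) -> bool) -> Option<Entry> {
        let mut pos = (hash as usize) & self.mask;
        while let Some((tag, entry)) = self.slots[pos] {
            if tag == tag_of(hash) && eq(&entry) {
                return Some(entry);
            }
            pos = (pos + 1) & self.mask;
        }
        None
    }
}

/// Looks up a missing key and returns how many row comparisons were performed.
fn count_row_comparisons(check_hash_first: bool) -> usize {
    let mut table = ToyTable::with_capacity_pow2(64);
    let mut rows: Vec<String> = Vec::new();
    for i in 0..16u64 {
        // Same tag and same start slot for every entry, but a different full hash.
        let hash = (1u64 << 57) | (i << 8);
        rows.push(format!("group-{i}"));
        table.insert(Entry { hash, group_idx: rows.len() - 1 });
    }

    let target_hash = (1u64 << 57) | (999u64 << 8);
    let target = String::from("not-in-table");
    let mut comparisons = 0usize;
    let found = table.find(target_hash, |e| {
        if check_hash_first && e.hash != target_hash {
            return false; // cheap u64 check, no access to `rows`
        }
        comparisons += 1;
        rows[e.group_idx] == target // stand-in for the expensive row comparison
    });
    assert!(found.is_none());
    comparisons
}
```

With these contrived inputs, `count_row_comparisons(false)` performs 16 row comparisons and `count_row_comparisons(true)` performs none. Real workloads would fall somewhere in between.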
cc @jayzhan211, you can see the guess above.
@Rachelint
My understanding is similar: collisions will happen quite often in `get_mut`, and an equality check of `u64` hashes will be faster than retrieving and comparing rows.
Hash collisions are usually very rare, even for high cardinality, but `RawTable::get_mut` doesn't check for equality itself; it just finds a first match without guaranteeing that the hash values are the same (the equality check should be in the provided equality function). In other implementations we also check for hash values to be equal first.
Good idea! I am trying to find it.
The idea to check hash before expensive arrow row comparison makes sense to me.
This talk (https://www.youtube.com/watch?v=ncHmEUmJZf4) explains the details of this one-byte abbreviation trick. I vaguely remember they said that once this 1-byte check passes, it's very likely that the correct slot has been found.
Looks like it's not working well when the hash table size grows over some threshold?
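To put a rough number on the one-byte check: assuming the tag is 7 bits of a well-mixed 64-bit hash, an unrelated occupied slot should pass the tag filter about 1 time in 128. The sketch below estimates that rate empirically. `splitmix64` is used only as a stand-in hash generator, and `tag_of` and `tag_false_positives` are hypothetical helpers; none of them comes from the PR or from hashbrown.

```rust
/// SplitMix64 step, used here only as a stand-in source of well-mixed 64-bit hashes.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Assumed tag layout: the top 7 bits of the hash.
fn tag_of(hash: u64) -> u8 {
    (hash >> 57) as u8
}

/// Counts how many of `n` generated hashes share the 7-bit tag of `probe`
/// while differing from it in the full 64-bit value.
fn tag_false_positives(n: usize, seed: u64, probe: u64) -> usize {
    let mut state = seed;
    let mut count = 0;
    for _ in 0..n {
        let h = splitmix64(&mut state);
        if h != probe && tag_of(h) == tag_of(probe) {
            count += 1;
        }
    }
    count
}
```

For example, `tag_false_positives(128_000, 42, probe)` should land near 1,000 for any fixed `probe`. Under this assumption, a miss that walks past k occupied slots triggers roughly k/128 closure calls, and without the hash check each of those calls is a row comparison at an effectively random group index.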
That seems to make sense. Maybe we should only add this check when the hash table size is larger than some threshold 🤔.
I was thinking this rationale is very non-obvious, so I proposed adding some comments on the rationale in #11750