
Minor: Add comment explaining rationale for hash check #11750

Merged
merged 1 commit into from
Aug 1, 2024

Conversation

alamb
Contributor

@alamb alamb commented Jul 31, 2024

Which issue does this PR close?

Follow-on to #11718
Closes #.

Rationale for this change

In #11718, @Rachelint found that comparing hash values before group key values improved performance. The reason this could be faster was not obvious at first.
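As a rough illustration of the idea the new comments document (all names below are invented for this sketch, not DataFusion's actual types or API): storing the full hash alongside each group and comparing it first lets almost every non-matching entry be rejected with a single integer comparison, before any expensive multi-column key comparison.

```rust
// Sketch of hash-first probing (illustrative names, not DataFusion's API).

struct Entry {
    hash: u64,        // full hash stored alongside the key
    key: Vec<String>, // multi-column group key (expensive to compare)
}

fn find(entries: &[Entry], target_hash: u64, target_key: &[String]) -> Option<usize> {
    entries.iter().position(|e| {
        // Cheap u64 comparison first: it rejects almost every
        // non-matching entry without touching the (possibly large) key.
        e.hash == target_hash && e.key == target_key
    })
}

fn main() {
    let entries = vec![
        Entry { hash: 1, key: vec!["a".into(), "x".into()] },
        Entry { hash: 2, key: vec!["b".into(), "y".into()] },
    ];
    let key = vec!["b".to_string(), "y".to_string()];
    assert_eq!(find(&entries, 2, &key), Some(1));
    assert_eq!(find(&entries, 9, &key), None);
    println!("ok");
}
```

Because `&&` short-circuits, the key comparison only runs on a hash match, which matters most when group keys span several wide columns.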

What changes are included in this PR?

Add doc comments explaining the rationale and linking to #11718

Are these changes tested?

CI

Are there any user-facing changes?

No, just comments

@Rachelint
Contributor

It seems duckdb has a similar check (but it just uses a u16 prefix); maybe we can mention it?

https://github.com/duckdb/duckdb/blob/f92559f42c118075baaa8daafc437954eb5c85ec/src/execution/aggregate_hashtable.cpp#L379-L386
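For illustration only (a loose sketch of the idea, not duckdb's actual code): the linked code keeps a small prefix of the hash, a "salt", in each slot and compares that cheap u16 value before loading the full hash or the key.

```rust
// Sketch of a duckdb-style "salt" check (illustrative, not duckdb's code).
// Each slot stores the upper 16 bits of an entry's hash; a probe compares
// this cheap prefix before comparing the full hash (and then the key).

fn salt(hash: u64) -> u16 {
    (hash >> 48) as u16 // upper 16 bits of the 64-bit hash
}

struct Slot {
    salt: u16,
    entry_index: usize, // index into the full entry storage
}

fn probe(slots: &[Slot], full_hashes: &[u64], target_hash: u64) -> Option<usize> {
    let target_salt = salt(target_hash);
    slots.iter().find_map(|s| {
        // 16-bit comparison first; only on a salt match do we compare
        // the full 64-bit hash stored out-of-line.
        if s.salt == target_salt && full_hashes[s.entry_index] == target_hash {
            Some(s.entry_index)
        } else {
            None
        }
    })
}

fn main() {
    let full_hashes = [0xAAAA_0000_0000_0001u64, 0xBBBB_0000_0000_0002];
    let slots = [
        Slot { salt: salt(full_hashes[0]), entry_index: 0 },
        Slot { salt: salt(full_hashes[1]), entry_index: 1 },
    ];
    assert_eq!(probe(&slots, &full_hashes, full_hashes[1]), Some(1));
    assert_eq!(probe(&slots, &full_hashes, 0xCCCC_0000_0000_0003), None);
    println!("ok");
}
```

The appeal of a u16 salt is that it can be packed into a pointer or a compact slot array, so the prefix check touches far less memory than dereferencing the full entry.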

@alamb
Contributor Author

alamb commented Jul 31, 2024

It seems duckdb has a similar check (but it just uses a u16 prefix); maybe we can mention it?

https://github.com/duckdb/duckdb/blob/f92559f42c118075baaa8daafc437954eb5c85ec/src/execution/aggregate_hashtable.cpp#L379-L386

The linked PR is interesting: duckdb/duckdb#9575

@alamb
Contributor Author

alamb commented Jul 31, 2024

maybe we can mention it?

I am not sure what we would mention 🤔 Maybe you can propose the wording?

Or maybe we can leave a reference to the duckdb art in one of the comments on #11718?

@Rachelint
Contributor

It seems duckdb has a similar check (but it just uses a u16 prefix); maybe we can mention it?
https://github.com/duckdb/duckdb/blob/f92559f42c118075baaa8daafc437954eb5c85ec/src/execution/aggregate_hashtable.cpp#L379-L386

The linked PR is interesting: duckdb/duckdb#9575

Interesting, it seems the common point with #11718 is that we should consider the increasing collisions as the hash table fills up.

@Rachelint
Contributor

Rachelint commented Jul 31, 2024

maybe we can mention it?

I am not sure what we would mention 🤔 Maybe you can propose the wording?

Or maybe we can leave a reference to the duckdb art in one of the comments on #11718?

Maybe it is better to leave a comment; it actually seems hard to draw a good conclusion from the discussion in #11718.

@alamb
Contributor Author

alamb commented Jul 31, 2024

Interesting, it seems the common point with #11718 is that we should consider the increasing collisions as the hash table fills up.

In theory I think this is the kind of optimization that we are relying on hashbrown to do for us. Maybe there are some tuning knobs we could set, but maybe we should just leave the defaults -- it seems to work reasonably well.

@Rachelint
Contributor

Rachelint commented Jul 31, 2024

Interesting, it seems the common point with #11718 is that we should consider the increasing collisions as the hash table fills up.

In theory I think this is the kind of optimization that we are relying on hashbrown to do for us. Maybe there are some tuning knobs we could set, but maybe we should just leave the defaults -- it seems to work reasonably well.

Yes, as @2010YOUY01 mentioned in #11, hashbrown has done a lot of design work around this.

Actually, I am thinking that maybe there is a threshold (like xx% of the buckets having been filled), and it is only reasonable to check the hash first once we exceed it.
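As a purely hypothetical sketch of that threshold idea (the function name and the "xx%" knob below are made up for illustration, not anything DataFusion exposes):

```rust
// Hypothetical load-factor gate for the hash-first check.

fn should_check_hash_first(len: usize, capacity: usize, threshold: f64) -> bool {
    // Once the table is more than `threshold` full, collisions (and hence
    // expensive key comparisons on mismatched entries) become more likely,
    // so comparing the stored hash first starts to pay off.
    capacity > 0 && (len as f64) / (capacity as f64) > threshold
}

fn main() {
    assert!(!should_check_hash_first(10, 100, 0.5)); // 10% full: skip the extra check
    assert!(should_check_hash_first(80, 100, 0.5)); // 80% full: check the hash first
    println!("ok");
}
```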

@alamb
Contributor Author

alamb commented Jul 31, 2024

Actually, I am thinking that maybe there is a threshold (like xx% of the buckets having been filled), and it is only reasonable to check the hash first once we exceed it.

It might be worth checking -- one thing we have to be careful of is that the overhead of the check itself may end up being large.

@Rachelint
Contributor

Rachelint commented Jul 31, 2024

Actually, I am thinking that maybe there is a threshold (like xx% of the buckets having been filled), and it is only reasonable to check the hash first once we exceed it.

It might be worth checking -- one thing we have to be careful of is that the overhead of the check itself may end up being large.

Makes sense, I will run more experiments on this when I have free time.

Contributor

@comphead comphead left a comment


LGTM, thanks @alamb for adding the tracker

@alamb alamb merged commit 0d98b99 into apache:main Aug 1, 2024
24 checks passed

3 participants