
Minor: Add comment explaining rationale for hash check #11750

Merged
merged 1 commit into from
Aug 1, 2024

Conversation

alamb
Contributor

@alamb alamb commented Jul 31, 2024

Which issue does this PR close?

Follow-on to #11718
Closes #.

Rationale for this change

In #11718, @Rachelint found that comparing hash values before group key values improved performance. The reason this could be faster was not obvious at first.
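As a rough illustration of the idea the new comments document (all names below are invented for this sketch, not DataFusion's actual types or API): storing the full hash alongside each group and comparing it first lets almost every non-matching entry be rejected with a single integer comparison, before any expensive multi-column key comparison.

```rust
// Sketch of hash-first probing (illustrative names, not DataFusion's API).

struct Entry {
    hash: u64,        // full hash stored alongside the key
    key: Vec<String>, // multi-column group key (expensive to compare)
}

fn find(entries: &[Entry], target_hash: u64, target_key: &[String]) -> Option<usize> {
    entries.iter().position(|e| {
        // Cheap u64 comparison first: it rejects almost every
        // non-matching entry without touching the (possibly large) key.
        e.hash == target_hash && e.key == target_key
    })
}

fn main() {
    let entries = vec![
        Entry { hash: 1, key: vec!["a".into(), "x".into()] },
        Entry { hash: 2, key: vec!["b".into(), "y".into()] },
    ];
    let key = vec!["b".to_string(), "y".to_string()];
    assert_eq!(find(&entries, 2, &key), Some(1));
    assert_eq!(find(&entries, 9, &key), None);
    println!("ok");
}
```

Because `&&` short-circuits, the key comparison only runs on a hash match, which matters most when group keys span several wide columns.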

What changes are included in this PR?

Add doc comments explaining the rationale and linking to #11718

Are these changes tested?

CI

Are there any user-facing changes?

No, just comments

@Rachelint
Contributor

It seems duckdb has a similar check (but it just uses a u16 prefix); maybe we can mention it?

https://github.com/duckdb/duckdb/blob/f92559f42c118075baaa8daafc437954eb5c85ec/src/execution/aggregate_hashtable.cpp#L379-L386
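For illustration only (a loose sketch of the idea, not duckdb's actual code): the linked code keeps a small prefix of the hash, a "salt", in each slot and compares that cheap u16 value before loading the full hash or the key.

```rust
// Sketch of a duckdb-style "salt" check (illustrative, not duckdb's code).
// Each slot stores the upper 16 bits of an entry's hash; a probe compares
// this cheap prefix before comparing the full hash (and then the key).

fn salt(hash: u64) -> u16 {
    (hash >> 48) as u16 // upper 16 bits of the 64-bit hash
}

struct Slot {
    salt: u16,
    entry_index: usize, // index into the full entry storage
}

fn probe(slots: &[Slot], full_hashes: &[u64], target_hash: u64) -> Option<usize> {
    let target_salt = salt(target_hash);
    slots.iter().find_map(|s| {
        // 16-bit comparison first; only on a salt match do we compare
        // the full 64-bit hash stored out-of-line.
        if s.salt == target_salt && full_hashes[s.entry_index] == target_hash {
            Some(s.entry_index)
        } else {
            None
        }
    })
}

fn main() {
    let full_hashes = [0xAAAA_0000_0000_0001u64, 0xBBBB_0000_0000_0002];
    let slots = [
        Slot { salt: salt(full_hashes[0]), entry_index: 0 },
        Slot { salt: salt(full_hashes[1]), entry_index: 1 },
    ];
    assert_eq!(probe(&slots, &full_hashes, full_hashes[1]), Some(1));
    assert_eq!(probe(&slots, &full_hashes, 0xCCCC_0000_0000_0003), None);
    println!("ok");
}
```

The appeal of a u16 salt is that it can be packed into a pointer or a compact slot array, so the prefix check touches far less memory than dereferencing the full entry.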

@alamb
Contributor Author

alamb commented Jul 31, 2024

It seems duckdb has a similar check (but it just uses a u16 prefix); maybe we can mention it?

https://github.com/duckdb/duckdb/blob/f92559f42c118075baaa8daafc437954eb5c85ec/src/execution/aggregate_hashtable.cpp#L379-L386

The linked PR is interesting: duckdb/duckdb#9575

@alamb
Contributor Author

alamb commented Jul 31, 2024

maybe we can mention it?

I am not sure what we would mention 🤔 Maybe you can propose the wording?

Or maybe we can leave a reference to the duckdb art in one of the comments on #11718?

@Rachelint
Contributor

It seems duckdb has a similar check (but it just uses a u16 prefix); maybe we can mention it?
https://github.com/duckdb/duckdb/blob/f92559f42c118075baaa8daafc437954eb5c85ec/src/execution/aggregate_hashtable.cpp#L379-L386

The linked PR is interesting: duckdb/duckdb#9575

Interesting, it seems the common point with #11718 is that we should consider the increasing collisions as the hash table fills up.

@Rachelint
Contributor

Rachelint commented Jul 31, 2024

maybe we can mention it?

I am not sure what we would mention 🤔 Maybe you can propose the wording?

Or maybe we can leave a reference to the duckdb art in one of the comments on #11718?

Maybe it is better to leave a comment; it actually seems hard to draw a good conclusion from the discussion in #11718.

@alamb
Contributor Author

alamb commented Jul 31, 2024

Interesting, it seems the common point with #11718 is that we should consider the increasing collisions as the hash table fills up.

In theory I think this is the kind of optimization that we are relying on hashbrown to do for us. Maybe there are some tuning knobs we could set, but maybe we should just leave the defaults -- it seems to work reasonably well.

@Rachelint
Contributor

Rachelint commented Jul 31, 2024

Interesting, it seems the common point with #11718 is that we should consider the increasing collisions as the hash table fills up.

In theory I think this is the kind of optimization that we are relying on hashbrown to do for us. Maybe there are some tuning knobs we could set, but maybe we should just leave the defaults -- it seems to work reasonably well.

Yes, as @2010YOUY01 mentioned in #11, hashbrown has done a lot of design work around this.

Actually, I am thinking that maybe there is a threshold (like xx% of the buckets having been filled), and it is only reasonable to check the hash first once we exceed it.
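As a purely hypothetical sketch of that threshold idea (the function name and the "xx%" knob below are made up for illustration, not anything DataFusion exposes):

```rust
// Hypothetical load-factor gate for the hash-first check.

fn should_check_hash_first(len: usize, capacity: usize, threshold: f64) -> bool {
    // Once the table is more than `threshold` full, collisions (and hence
    // expensive key comparisons on mismatched entries) become more likely,
    // so comparing the stored hash first starts to pay off.
    capacity > 0 && (len as f64) / (capacity as f64) > threshold
}

fn main() {
    assert!(!should_check_hash_first(10, 100, 0.5)); // 10% full: skip the extra check
    assert!(should_check_hash_first(80, 100, 0.5)); // 80% full: check the hash first
    println!("ok");
}
```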

@alamb
Contributor Author

alamb commented Jul 31, 2024

Actually, I am thinking that maybe there is a threshold (like xx% of the buckets having been filled), and it is only reasonable to check the hash first once we exceed it.

It might be worth checking -- one thing we have to be careful of is that the overhead of the check itself may end up being large.

@Rachelint
Contributor

Rachelint commented Jul 31, 2024

Actually, I am thinking that maybe there is a threshold (like xx% of the buckets having been filled), and it is only reasonable to check the hash first once we exceed it.

It might be worth checking -- one thing we have to be careful of is that the overhead of the check itself may end up being large.

Makes sense, I will run more experiments on this when I have free time.

Contributor

@comphead comphead left a comment


LGTM, thanks @alamb for adding the tracker

@alamb alamb merged commit 0d98b99 into apache:main Aug 1, 2024
24 checks passed

3 participants