GH-39803: [C++][Acero] Fix AsOfJoin with differently ordered schemas than the output #39804
@@ -16,6 +16,7 @@
// under the License.

#include <gmock/gmock-matchers.h>
#include <iostream> // nocommit
Is this required? Also, what is "nocommit" for?
whoops - this was for some cout debugging, will remove
Also, can you please rebase from git main?
eaffb69 to 628dd0a
Done and addressed comments. Thanks for the review.
@github-actions crossbow submit -g cpp
Revision: 11436fd Submitted crossbow builds: ursacomputing/crossbow @ actions-4147b43cbc
Thanks a lot for this @JerAguilon!
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 4b74b45. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.
…than the output (#39804)

### Rationale for this change

Issue is described visually in #39803.

The key hasher works by hashing every row of the input tables' key columns. An important step is inspecting the [column metadata](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/asof_join_node.cc#L412) for the asof-join key fields. This returns whether columns are fixed width, among other things.

The issue is we are passing the `output_schema`, rather than the input's schema.

If an input looks like

```
key_string_type,ts_int32_type,val
```

But our expected output schema looks like:

```
ts_int32,key_string_type,...
```

Then the hasher will think that the `key_string_type`'s type is an int32. This completely throws off hashes. Tests currently get away with it since we just use ints across the board.

### What changes are included in this PR?

One line fix and test with string types.

### Are these changes tested?

Yes. Can see the test run before and after changes here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc

Before the change, notice that inputs 0 and 1 have mismatched hashes:

```
AsofjoinNode(0x16cf9e2d8): key hasher 1 got hashes [0, 9784892099856512926, 1050982531982388796, 10763536662319179482, 2029627098739957112, 11814237723602982167, 3080328155728858293, 12792882290360550483, 4058972722486426609, 13771526852823217039]
...
AsofjoinNode(0x16cf9dd18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
```

And after, they do match:

```
AsofjoinNode(0x16f2ea2d8): key hasher 1 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
...
AsofjoinNode(0x16f2e9d18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
```

...which is exactly what you want, since the `key` column for both tables looks like `["0", "1", ..."9"]`

### Are there any user-facing changes?

* Closes: #39803

Lead-authored-by: Jeremy Aguilon <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
Issue is described visually in #39803.
The key hasher works by hashing every row of the input tables' key columns. An important step is inspecting the column metadata for the asof-join key fields. This returns whether columns are fixed width, among other things.
The issue is we are passing the `output_schema`, rather than the input's schema. If an input looks like `key_string_type,ts_int32_type,val` but our expected output schema looks like `ts_int32,key_string_type,...`, then the hasher will think that the `key_string_type`'s type is an int32. This completely throws off hashes. Tests currently get away with it since we just use ints across the board.
What changes are included in this PR?
One line fix and test with string types.
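To make the rationale concrete, here is a minimal sketch of the mismatch using toy types rather than the Arrow API. Everything in it (`ToyType`, `ToyField`, `ToySchema`, `KeyTypeAt`, `IsFixedWidth`, and the two schema builders) is hypothetical and exists only for illustration; the real lookup lives in `asof_join_node.cc` and may differ in detail. The sketch assumes the key column index is resolved against the input and then used to read type metadata from whichever schema is passed in.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-ins for schema fields. These are NOT Arrow types.
enum class ToyType { kInt32, kString, kFloat64 };

struct ToyField {
  std::string name;
  ToyType type;
};

using ToySchema = std::vector<ToyField>;

// Input layout from the PR description: key_string_type,ts_int32_type,val.
// The float64 type for `val` is an assumption made for this sketch.
ToySchema MakeInputSchema() {
  return {{"key", ToyType::kString},
          {"ts", ToyType::kInt32},
          {"val", ToyType::kFloat64}};
}

// Output layout from the PR description: ts_int32,key_string_type,...
ToySchema MakeOutputSchema() {
  return {{"ts", ToyType::kInt32},
          {"key", ToyType::kString},
          {"val", ToyType::kFloat64}};
}

// Hypothetical helper: look up the type of a key column by the index it has
// in the *input*. Passing any schema other than the input's own gives the
// metadata of whatever column happens to sit at that position.
ToyType KeyTypeAt(const ToySchema& schema, std::size_t input_col_index) {
  return schema.at(input_col_index).type;
}

// Strings are variable width; the other toy types are fixed width.
bool IsFixedWidth(ToyType type) { return type != ToyType::kString; }
```

With the key at input index 0, `KeyTypeAt(MakeOutputSchema(), 0)` reports an int32 (fixed width), which mirrors the bug, while `KeyTypeAt(MakeInputSchema(), 0)` reports a string, which mirrors the fix. When every column is an int, as in the earlier tests, both lookups agree and the problem stays hidden.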
Are these changes tested?
Yes. Can see the test run before and after changes here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc
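Before the change, inputs 0 and 1 have mismatched hashes: the first hash from key hasher 1 is `0`, while the first hash from key hasher 0 is `17528465654998409509`.
After the change, they do match: both hashers start with `17528465654998409509`, which is exactly what you want, since the `key` column for both tables looks like `["0", "1", ..."9"]`.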
Are there any user-facing changes?