Replace cudf's concurrent_unordered_map with cuco::static_map in semi/anti joins #9666

Merged: 14 commits into rapidsai:branch-22.02 from feature/semi_anti_join_cuco on Jan 5, 2022

vyasr (Contributor) commented Nov 12, 2021

This PR resolves #9586, replacing the hash table used in semi and anti joins with cuco::static_map. It depends on NVIDIA/cuCollections#118. At present the code is slower than the original version, so we'll probably want to make some optimizations in cuco before merging this.
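For context, a left semi join returns the left rows that have a match in the right table, and a left anti join returns the left rows that have none. The following is a minimal host-side sketch of the build/probe pattern, assuming single integer key columns and using `std::unordered_set` in place of the GPU hash table. The name `left_semi_anti_join_sketch` is hypothetical and is not a libcudf or cuco API.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

enum class join_kind { LEFT_SEMI, LEFT_ANTI };

// Hypothetical helper (not a libcudf API): returns the indices of `left`
// rows that do (semi) or do not (anti) have a match in `right`.
std::vector<std::size_t> left_semi_anti_join_sketch(const std::vector<int>& left,
                                                    const std::vector<int>& right,
                                                    join_kind kind)
{
  // Build phase: insert every right key into the hash table.
  std::unordered_set<int> right_keys(right.begin(), right.end());

  // Probe phase: keep a left index when "found" agrees with the join kind.
  std::vector<std::size_t> result;
  for (std::size_t i = 0; i < left.size(); ++i) {
    bool const found = right_keys.count(left[i]) > 0;
    if (found == (kind == join_kind::LEFT_SEMI)) { result.push_back(i); }
  }
  return result;
}
```

With `left = {1, 2, 3, 4}` and `right = {2, 4, 4, 7}`, the semi join keeps left indices 1 and 3 and the anti join keeps indices 0 and 2. Duplicate right keys do not produce duplicate output rows, which is why a map or set with unique keys is enough here.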

vyasr added the 2 - In Progress, libcudf, improvement, and non-breaking labels Nov 12, 2021
vyasr self-assigned this Nov 12, 2021
vyasr requested a review from a team as a code owner November 12, 2021 00:17
vyasr requested review from PointKernel and jrhemstad and removed request for robertmaynard and codereport November 12, 2021 00:17
vyasr (Contributor Author) commented Nov 12, 2021

Benchmarks

Old code:

Join<int32_t, int32_t>/left_semi_join_32bit/100000/100000/manual_time                                    0.156 ms        0.177 ms         4409
Join<int32_t, int32_t>/left_semi_join_32bit/100000/400000/manual_time                                    0.259 ms        0.278 ms         2699
Join<int32_t, int32_t>/left_semi_join_32bit/100000/1000000/manual_time                                   0.474 ms        0.493 ms         1479
Join<int32_t, int32_t>/left_semi_join_32bit/10000000/10000000/manual_time                                 30.8 ms         30.9 ms           23
Join<int32_t, int32_t>/left_semi_join_32bit/10000000/40000000/manual_time                                 61.3 ms         61.3 ms           11
Join<int32_t, int32_t>/left_semi_join_32bit/10000000/100000000/manual_time                                 122 ms          122 ms            6
Join<int32_t, int32_t>/left_semi_join_32bit/100000000/100000000/manual_time                                313 ms          313 ms            2
Join<int32_t, int32_t>/left_semi_join_32bit/80000000/240000000/manual_time                                 414 ms          414 ms            2
Join<int64_t, int64_t>/left_semi_join_64bit/50000000/50000000/manual_time                                  164 ms          164 ms            4
Join<int64_t, int64_t>/left_semi_join_64bit/40000000/120000000/manual_time                                 225 ms          225 ms            3
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/100000/manual_time                              0.151 ms        0.173 ms         4556
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/400000/manual_time                              0.197 ms        0.218 ms         3482
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/1000000/manual_time                             0.265 ms        0.283 ms         2654
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/10000000/10000000/manual_time                           7.08 ms         7.10 ms           99
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/10000000/40000000/manual_time                           14.6 ms         14.6 ms           48
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/10000000/100000000/manual_time                          32.3 ms         32.3 ms           18
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000000/100000000/manual_time                         76.7 ms         76.7 ms            8
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/80000000/240000000/manual_time                           107 ms          107 ms            5
Join<int64_t, int64_t>/left_semi_join_64bit_nulls/50000000/50000000/manual_time                           43.5 ms         43.6 ms           13
Join<int64_t, int64_t>/left_semi_join_64bit_nulls/40000000/120000000/manual_time                          66.0 ms         66.0 ms            8

New code:

Join<int32_t, int32_t>/left_semi_join_32bit/100000/100000/manual_time                                    0.426 ms        0.446 ms         1613
Join<int32_t, int32_t>/left_semi_join_32bit/100000/400000/manual_time                                    0.840 ms        0.860 ms          833
Join<int32_t, int32_t>/left_semi_join_32bit/100000/1000000/manual_time                                    1.44 ms         1.45 ms          490
Join<int32_t, int32_t>/left_semi_join_32bit/10000000/10000000/manual_time                                 32.2 ms         32.2 ms           22
Join<int32_t, int32_t>/left_semi_join_32bit/10000000/40000000/manual_time                                 77.0 ms         77.0 ms            9
Join<int32_t, int32_t>/left_semi_join_32bit/10000000/100000000/manual_time                                 167 ms          167 ms            4
Join<int32_t, int32_t>/left_semi_join_32bit/100000000/100000000/manual_time                                324 ms          324 ms            2
Join<int32_t, int32_t>/left_semi_join_32bit/80000000/240000000/manual_time                                 505 ms          505 ms            2
Join<int64_t, int64_t>/left_semi_join_64bit/50000000/50000000/manual_time                                  194 ms          194 ms            4
Join<int64_t, int64_t>/left_semi_join_64bit/40000000/120000000/manual_time                                 322 ms          322 ms            2
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/100000/manual_time                              0.191 ms        0.212 ms         3720
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/400000/manual_time                              0.229 ms        0.250 ms         3001
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/1000000/manual_time                             0.359 ms        0.378 ms         2037
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/10000000/10000000/manual_time                           8.39 ms         8.41 ms           86
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/10000000/40000000/manual_time                           16.1 ms         16.1 ms           44
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/10000000/100000000/manual_time                          42.7 ms         42.7 ms           12
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000000/100000000/manual_time                         90.2 ms         90.2 ms            6
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/80000000/240000000/manual_time                           121 ms          121 ms            5
Join<int64_t, int64_t>/left_semi_join_64bit_nulls/50000000/50000000/manual_time                           51.8 ms         51.8 ms           11
Join<int64_t, int64_t>/left_semi_join_64bit_nulls/40000000/120000000/manual_time                          75.4 ms         75.4 ms            7
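To compare the two listings, the per-case slowdown is the new time divided by the old time. The sketch below computes that ratio for a few rows copied from the first timing column of each listing. The struct and function names are illustrative only.

```cpp
#include <cstdio>

struct benchmark_row {
  const char* name;
  double old_ms;  // first timing column, "Old code" listing
  double new_ms;  // first timing column, "New code" listing
};

// A ratio above 1 means the new code is slower.
double slowdown(double old_ms, double new_ms) { return new_ms / old_ms; }

void print_slowdowns()
{
  // Values copied from the listings above; only a subset is shown.
  const benchmark_row rows[] = {
    {"32bit/100000/100000", 0.156, 0.426},
    {"32bit/10000000/10000000", 30.8, 32.2},
    {"32bit/80000000/240000000", 414.0, 505.0},
    {"32bit_nulls/100000/100000", 0.151, 0.191},
  };
  for (const auto& r : rows) {
    std::printf("%-28s %.2fx\n", r.name, slowdown(r.old_ms, r.new_ms));
  }
}
```

On these numbers the largest ratios are in the small non-null cases (roughly 2.7x for 100000/100000), while the 10000000/10000000 case is within about 5% of the old code.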

Comment on lines 135 to 133
// Note: This equality comparator violates symmetry of equality and is
// therefore relying on the implementation detail of the order in which its
// operator is invoked. If cuco makes no promises about the order of
// invocation this seems a bit unsafe.
row_equality equality_probe{*right_rows_d, *left_rows_d, compare_nulls == null_equality::EQUAL};
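The asymmetry described in the code comment can be illustrated on the host. In this sketch the comparator reads its first argument as an index into the right table and its second as an index into the left table, so swapping the arguments compares different rows. `cross_table_equality` is a hypothetical stand-in, not cudf's `row_equality`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a comparator that spans two tables.
struct cross_table_equality {
  const std::vector<int>& right;
  const std::vector<int>& left;

  // The first argument indexes `right`, the second indexes `left`.
  // Calling it with the arguments swapped compares different rows,
  // so eq(a, b) == eq(b, a) does not hold in general.
  bool operator()(std::size_t right_idx, std::size_t left_idx) const
  {
    return right[right_idx] == left[left_idx];
  }
};
```

For example, with `right = {10, 20, 30}` and `left = {30, 10, 20}`, `eq(0, 1)` compares 10 with 10 and is true, while `eq(1, 0)` compares 20 with 30 and is false. A hash map that calls the comparator as (probe key, stored key) instead of (stored key, probe key) would therefore give wrong answers with such a comparator.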
Contributor

Yeah, if you're going to try and do the same optimization as is done in the other joins of using the row hash value as the key and the index as the payload, you're going to need to add the equivalent of pair_contains from the multimap.
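As a rough illustration of the optimization mentioned here, the sketch below stores the row hash as the key and the row index as the payload, then looks up a probe row by matching the hash first and comparing the actual rows through the stored index. It is a host-side analogue built on `std::unordered_multimap`, under the assumption of single integer rows. The names `build_sketch` and `pair_contains_sketch` are hypothetical and do not reflect cuco's interface.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Host-side stand-in: maps row hash -> index of a row in the build table.
using hash_index_map = std::unordered_multimap<std::size_t, std::size_t>;

hash_index_map build_sketch(const std::vector<int>& build_rows)
{
  hash_index_map map;
  for (std::size_t i = 0; i < build_rows.size(); ++i) {
    map.emplace(std::hash<int>{}(build_rows[i]), i);  // key = hash, payload = index
  }
  return map;
}

// Hypothetical analogue of a pair-based lookup: a hash match alone is not
// enough, so the payload index is used to compare the underlying rows.
bool pair_contains_sketch(const hash_index_map& map,
                          const std::vector<int>& build_rows,
                          int probe_row)
{
  auto const range = map.equal_range(std::hash<int>{}(probe_row));
  for (auto it = range.first; it != range.second; ++it) {
    if (build_rows[it->second] == probe_row) { return true; }
  }
  return false;
}
```

The point of the pair-based lookup is that a key-only comparison sees only hashes, so distinct rows that happen to share a hash could not be told apart without the payload.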

Contributor Author

Given our discussions I think we can wait on this as well. The same issue exists in other join types and probably isn't worth addressing until NVIDIA/cuCollections#110.

cpp/src/join/semi_join.cu (resolved)
vyasr marked this pull request as draft November 12, 2021 01:56
vyasr (Contributor Author) commented Nov 12, 2021

I intended this to be a draft PR and have just marked it as such. I pushed this out so that I could get some feedback and we could iron out the gaps with cuco, but this isn't really ready for the big time yet. We'll probably want to wait on further improvements in cuco.

vyasr marked this pull request as ready for review November 19, 2021 20:08
vyasr added the 0 - Blocked label and removed the 2 - In Progress label Nov 19, 2021
vyasr (Contributor Author) commented Nov 19, 2021

This PR requires NVIDIA/cuCollections#118, but once that's merged I think we can move forward with this largely as is. While there are significant improvements that could be made, they are heavily dependent on refactoring cuCollections and I don't think we benefit too much by trying to implement interim stopgap solutions.

cpp/src/join/semi_join.cu (outdated, resolved)
cpp/src/join/semi_join.cu (outdated, resolved)
PointKernel (Member) commented Nov 19, 2021

This PR depends on NVIDIA/cuCollections#113, otherwise the default hash allocator won't work here.

vyasr requested a review from a team as a code owner December 16, 2021 18:35
vyasr added the 3 - Ready for Review label and removed the 0 - Blocked, 5 - Merge After Dependencies, and 5 - DO NOT MERGE labels Dec 16, 2021
PointKernel (Member) left a comment

LGTM

PointKernel (Member) commented

@vyasr Back to the benchmark results, any idea why the new implementation is slower?

vyasr (Contributor Author) commented Dec 21, 2021

> @vyasr Back to the benchmark results, any idea why the new implementation is slower?

I'm reasonably confident that the performance regression is entirely due to the switch from cudf's concurrent unordered map to the cuco static map, which hasn't benefited from the optimizations you worked on for the multimap. @jrhemstad was fine eating the perf hit for now and postponing optimization because we were trying to get the mixed joins in #9917 up and running ASAP.

However, the work in #9917 shows that the new mixed join code is going to have to be a new kernel rather than a direct adaptation of the existing hash join code because of how we deal with shared memory. Therefore, IMO this PR is no longer a prerequisite for getting mixed joins done for semi/anti joins and that work can happen in parallel, i.e. we could start using cuco's static multimap for mixed joins without merging this PR. @jrhemstad in light of that, do you want to hold off on merging this PR until we've had a chance to do the cuco refactoring and optimized cuco::static_map? Then we could avoid a performance degradation in hash semi/anti joins.

vyasr (Contributor Author) commented Dec 22, 2021

rerun tests

PointKernel (Member) commented

Hmm, cudf's concurrent_unordered_map is actually a more naive, less optimized implementation than cuco::static_map. Both use linear probing, and cuco additionally uses the cooperative-group (CG) based algorithm by default. I think the test case has relatively low occupancy (or few collisions), which may explain why the CG-based algorithm is outperformed. We need a follow-up PR dedicated to detailed profiling and performance optimization.
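One way to read the low-occupancy hypothesis: when collisions are rare, a linear-probing insert or lookup inspects about one slot, so there is little probing work left to share across a cooperative group. The host-side sketch below only counts slots inspected per insert under linear probing. It does not model cooperative groups or GPU memory behavior, and `average_probes_sketch` is a hypothetical helper, not part of cudf or cuco.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Inserts distinct `keys` into an open-addressing table with linear probing
// and returns the average number of slots inspected per insert.
double average_probes_sketch(std::size_t capacity,
                             const std::vector<int>& keys,
                             const std::function<std::size_t(int)>& hash)
{
  assert(keys.size() <= capacity);  // otherwise probing would never terminate
  std::vector<bool> occupied(capacity, false);
  std::size_t total_probes = 0;
  for (int key : keys) {
    std::size_t slot   = hash(key) % capacity;
    std::size_t probes = 1;
    while (occupied[slot]) {  // collision: step to the next slot
      slot = (slot + 1) % capacity;
      ++probes;
    }
    occupied[slot] = true;
    total_probes += probes;
  }
  return keys.empty() ? 0.0 : static_cast<double>(total_probes) / keys.size();
}
```

With four keys in a table of eight slots, a collision-free hash gives an average of 1 slot per insert, while a hash that sends every key to the same slot gives an average of 2.5. Real workloads fall somewhere between, depending on load factor and hash quality.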

codereport (Contributor) left a comment

👍

vyasr (Contributor Author) commented Jan 4, 2022

rerun tests

vyasr (Contributor Author) commented Jan 5, 2022

Discussed offline, we're going to get this merged now and deal with perf later.

@gpucibot merge

jlowe (Member) commented Jan 5, 2022

> deal with perf later.

Is there an issue to track this?

vyasr (Contributor Author) commented Jan 5, 2022

Just made one in #9973.

vyasr (Contributor Author) commented Jan 5, 2022

@gpucibot merge

rapids-bot merged commit 2112757 into rapidsai:branch-22.02 Jan 5, 2022
vyasr deleted the feature/semi_anti_join_cuco branch January 14, 2022 18:02
rapids-bot pushed a commit that referenced this pull request Apr 12, 2022
The `concurrent_unordered_multimap` is no longer used in libcudf. It has been replaced by `cuco::static_multimap`. The majority of the refactoring was done in PRs #8934 and #9704. A similar effort is in progress for `concurrent_unordered_map` and `cuco::static_map` in #9666 (and may depend on porting some optimizations from libcudf to cuco -- need to look into this before doing a direct replacement).

This partially resolves issue #10401.

cc: @PointKernel @vyasr

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #10642
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Refactor semi/anti join to use cuCollections
5 participants