Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support duplicate_keep_option in cudf::distinct #11052

Merged
merged 96 commits into from
Jun 22, 2022

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Jun 4, 2022

This adds duplicate_keep_option to cudf::distinct, allowing to specify a keep option for selecting which of the duplicate elements to keep. It paves the way for many drop duplicate applications to achieve O(n) performance.

A KEEP_ANY option is also added to duplicate_keep_option, which was an attempt in #9417 but didn't get in eventually.

Partially addresses #11050 and #11053.


Main implementation: https://github.com/rapidsai/cudf/pull/11052/files#diff-4c2d4268b3c50000ae845ba15a890bb743709c30e5cab4847af7ad633c5a2823R47

Follow up work:

@ttnghia ttnghia added feature request New feature or request 3 - Ready for Review Ready for review by team libcudf blocker libcudf Affects libcudf (C++/CUDA) code. 4 - Needs Review Waiting for reviewer to review or respond Performance Performance related issue Spark Functionality that helps Spark RAPIDS 5 - DO NOT MERGE Hold off on merging; see PR for details improvement Improvement / enhancement to an existing function breaking Breaking change labels Jun 4, 2022
@ttnghia ttnghia self-assigned this Jun 4, 2022
@github-actions github-actions bot added the CMake CMake build issue label Jun 4, 2022
@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 4, 2022

Below is benchmarks on a struct column, 8 children (4 int + 4 string), generated randomly then repeated 4 times (that means the column has less than 25% unique rows).

Comparing cudf::distinct without vs with keep_first option:

Benchmark                                          Time             CPU      Time Old      Time New       CPU Old       CPU New
-------------------------------------------------------------------------------------------------------------------------------
Distinct/BM/1024/manual_time                    -0.3614         -0.3514             1             0             1             0
Distinct/BM/4096/manual_time                    -0.2241         -0.2178             1             0             1             0
Distinct/BM/32768/manual_time                   -0.0262         -0.0228             1             1             1             1
Distinct/BM/262144/manual_time                  +0.2469         +0.2399             1             1             1             1
Distinct/BM/2097152/manual_time                 +0.4191         +0.4152             6             9             6             9
Distinct/BM/16777216/manual_time                +0.2508         +0.2482            56            71            57            71
Distinct/BM/67108864/manual_time                +0.0985         +0.1000           257           282           258           284

Comparing cudf::distinct with keep_first option vs cudf::sort + cudf::unique:

Benchmark                                          Time             CPU      Time Old      Time New       CPU Old       CPU New
-------------------------------------------------------------------------------------------------------------------------------
Distinct/BM/1024/manual_time                    +8.6636         +8.2572             0             4             0             4
Distinct/BM/4096/manual_time                    +9.8727         +9.5028             0             5             0             5
Distinct/BM/32768/manual_time                   +9.8985         +9.5761             1             6             1             6
Distinct/BM/262144/manual_time                  +5.6234         +5.5612             1             9             1             9
Distinct/BM/2097152/manual_time                 +9.1337         +9.1176             9            91             9            91
Distinct/BM/16777216/manual_time               +12.7376        +12.7323            70           967            70           967
Distinct/BM/67108864/manual_time               +16.2584        +16.2026           281          4845           282          4847

From the numbers, we can see that adding the keep option into cudf::distinct may reduce the performance up to 40%, but is still way better than sorting then using cudf::unique (16X faster).

@ttnghia ttnghia marked this pull request as ready for review June 4, 2022 20:46
@ttnghia ttnghia requested a review from a team as a code owner June 4, 2022 20:46
@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 4, 2022

I believe that this brings undeniable benefits. We can apply the same approach to lists::drop_list_duplicates too.

@PointKernel
Copy link
Member

So this work is intended to fill the "hole" when keys are not sorted and keep control is required. Currently, we need to stable sort the key first and then invoke cudf::unique to handle this situation. After this PR, we will have three options to perform distinct operation:

  1. When keys are sorted, always use cudf::unique
  2. When keys are not sorted and keep control is irrelevant: the existing cudf::distinct
  3. When keys are not sorted and keep control is required: new cudf::distinct with keep option

Did I miss something?

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 5, 2022

The first option should not be handled internally by cudf::distinct. Instead, it is the caller choice. So basically we just have options 2 and 3.

@github-actions github-actions bot added the CMake CMake build issue label Jun 20, 2022
@PointKernel
Copy link
Member

PointKernel commented Jun 20, 2022

Based on the offline discussion, @ttnghia is splitting the existing distinct.cu and distinct_reduce.cu files to reduce build time. The current idea is to:

  • Move get_distinct_indices to a new file ( get_distinct_indices.cu maybe? Feel free to suggest a better name)
    - Move reduce_by_key_fn functor to a new file distinct_gpu.cuh (Looks like this won't help)
  • Clean up header includes (include only what's used)

@davidwendt Any suggestions to further reduce compile time are highly appreciated.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake LGTM (did not look at the C++)

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One non-blocking nitpick otherwise looks good to me. 🔥

cpp/src/stream_compaction/distinct_reduce.cuh Outdated Show resolved Hide resolved
cpp/src/stream_compaction/distinct.cu Outdated Show resolved Hide resolved
@ttnghia ttnghia requested a review from bdice June 21, 2022 19:46
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were some significant changes from the last time I reviewed. I'd like to see one more round of review on this to resolve some questions on naming.

cpp/include/cudf/detail/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/src/stream_compaction/distinct.cu Outdated Show resolved Hide resolved
cpp/src/stream_compaction/distinct_reduce.cuh Outdated Show resolved Hide resolved
cpp/src/stream_compaction/distinct_reduce.cuh Outdated Show resolved Hide resolved
cpp/src/stream_compaction/distinct_reduce.cuh Outdated Show resolved Hide resolved
cpp/src/stream_compaction/distinct_reduce.cuh Outdated Show resolved Hide resolved
cpp/src/stream_compaction/distinct_reduce.cuh Outdated Show resolved Hide resolved
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small edits removing mentions of sparsity since we renamed the function (and it seems that it could be performant even if the data is mostly unique). Otherwise LGTM!

cpp/src/stream_compaction/distinct_reduce.cuh Outdated Show resolved Hide resolved
cpp/src/stream_compaction/distinct_reduce.cuh Outdated Show resolved Hide resolved
@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 22, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 31ad35c into rapidsai:branch-22.08 Jun 22, 2022
@ttnghia ttnghia deleted the refactor_stream_compaction branch June 22, 2022 20:30
rapids-bot bot pushed a commit that referenced this pull request Jun 23, 2022
This adds `nan_equality` parameter to `cudf::distinct`, allowing to specify the desired behavior when dealing with floating-point data: `NaN` should be compared equally to other `NaN` or not.

Depends on #11052 (built on top of it).
Closes #11092.
This is a blocker for set-like operations (#11043) and also the last blocker for #11053.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11118
rapids-bot bot pushed a commit that referenced this pull request Jul 26, 2022
This PR adds the following APIs for set operations:
 * `lists::have_overlap`
 * `lists::intersect_distinct`
 * `lists::union_distinct`
 * `lists::difference_distinct`

### Name Convention
Except for the first API (`lists::have_overlap`) that returns a boolean column, the suffix `_distinct` of the rest APIs denotes that their results will be lists columns in which all list rows have been post-processed to remove duplicates. As such, their results are actually "set" columns in which each row is a "set" of distinct elements.

---

Depends on:
 * #10945
 * #11017
 * NVIDIA/cuCollections#175
 * #11052
 * #11118
 * #11100
 * #11149

Closes #10409.

Authors:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11043
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants