Support `duplicate_keep_option` in `cudf::distinct` #11052

ttnghia · 2022-06-04T20:31:19Z

This adds duplicate_keep_option to cudf::distinct, allowing to specify a keep option for selecting which of the duplicate elements to keep. It paves the way for many drop duplicate applications to achieve O(n) performance.

A KEEP_ANY option is also added to duplicate_keep_option, which was an attempt in #9417 but didn't get in eventually.

Partially addresses #11050 and #11053.

Main implementation: https://github.com/rapidsai/cudf/pull/11052/files#diff-4c2d4268b3c50000ae845ba15a890bb743709c30e5cab4847af7ad633c5a2823R47

Follow up work:

ttnghia · 2022-06-04T20:45:40Z

Below is benchmarks on a struct column, 8 children (4 int + 4 string), generated randomly then repeated 4 times (that means the column has less than 25% unique rows).

Comparing `cudf::distinct` without vs with `keep_first` option:

Benchmark                                          Time             CPU      Time Old      Time New       CPU Old       CPU New
-------------------------------------------------------------------------------------------------------------------------------
Distinct/BM/1024/manual_time                    -0.3614         -0.3514             1             0             1             0
Distinct/BM/4096/manual_time                    -0.2241         -0.2178             1             0             1             0
Distinct/BM/32768/manual_time                   -0.0262         -0.0228             1             1             1             1
Distinct/BM/262144/manual_time                  +0.2469         +0.2399             1             1             1             1
Distinct/BM/2097152/manual_time                 +0.4191         +0.4152             6             9             6             9
Distinct/BM/16777216/manual_time                +0.2508         +0.2482            56            71            57            71
Distinct/BM/67108864/manual_time                +0.0985         +0.1000           257           282           258           284

Comparing `cudf::distinct` with `keep_first` option vs `cudf::sort` + `cudf::unique`:

Benchmark                                          Time             CPU      Time Old      Time New       CPU Old       CPU New
-------------------------------------------------------------------------------------------------------------------------------
Distinct/BM/1024/manual_time                    +8.6636         +8.2572             0             4             0             4
Distinct/BM/4096/manual_time                    +9.8727         +9.5028             0             5             0             5
Distinct/BM/32768/manual_time                   +9.8985         +9.5761             1             6             1             6
Distinct/BM/262144/manual_time                  +5.6234         +5.5612             1             9             1             9
Distinct/BM/2097152/manual_time                 +9.1337         +9.1176             9            91             9            91
Distinct/BM/16777216/manual_time               +12.7376        +12.7323            70           967            70           967
Distinct/BM/67108864/manual_time               +16.2584        +16.2026           281          4845           282          4847

From the numbers, we can see that adding the keep option into cudf::distinct may reduce the performance up to 40%, but is still way better than sorting then using cudf::unique (16X faster).

ttnghia · 2022-06-04T20:47:06Z

I believe that this brings undeniable benefits. We can apply the same approach to lists::drop_list_duplicates too.

cpp/src/stream_compaction/distinct.cu

PointKernel · 2022-06-05T15:00:59Z

So this work is intended to fill the "hole" when keys are not sorted and keep control is required. Currently, we need to stable sort the key first and then invoke cudf::unique to handle this situation. After this PR, we will have three options to perform distinct operation:

When keys are sorted, always use cudf::unique
When keys are not sorted and keep control is irrelevant: the existing cudf::distinct
When keys are not sorted and keep control is required: new cudf::distinct with keep option

Did I miss something?

ttnghia · 2022-06-05T20:11:09Z

The first option should not be handled internally by cudf::distinct. Instead, it is the caller choice. So basically we just have options 2 and 3.

cpp/src/stream_compaction/distinct.cu

PointKernel · 2022-06-20T17:11:20Z

Based on the offline discussion, @ttnghia is splitting the existing distinct.cu and distinct_reduce.cu files to reduce build time. The current idea is to:

Move get_distinct_indices to a new file ( get_distinct_indices.cu maybe? Feel free to suggest a better name)
~~- Move reduce_by_key_fn functor to a new file distinct_gpu.cuh~~ (Looks like this won't help)
Clean up header includes (include only what's used)

@davidwendt Any suggestions to further reduce compile time are highly appreciated.

vyasr

CMake LGTM (did not look at the C++)

PointKernel

One non-blocking nitpick otherwise looks good to me. 🔥

cpp/src/stream_compaction/distinct_reduce.cuh

cpp/src/stream_compaction/distinct.cu

bdice

There were some significant changes from the last time I reviewed. I'd like to see one more round of review on this to resolve some questions on naming.

cpp/include/cudf/detail/stream_compaction.hpp

cpp/include/cudf/stream_compaction.hpp

cpp/src/stream_compaction/distinct.cu

cpp/src/stream_compaction/distinct_reduce.cuh

bdice

Two small edits removing mentions of sparsity since we renamed the function (and it seems that it could be performant even if the data is mostly unique). Otherwise LGTM!

cpp/src/stream_compaction/distinct_reduce.cuh

# Conflicts: # cpp/src/stream_compaction/distinct.cu

ttnghia · 2022-06-22T18:50:56Z

@gpucibot merge

This adds `nan_equality` parameter to `cudf::distinct`, allowing to specify the desired behavior when dealing with floating-point data: `NaN` should be compared equally to other `NaN` or not. Depends on #11052 (built on top of it). Closes #11092. This is a blocker for set-like operations (#11043) and also the last blocker for #11053. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) URL: #11118

This PR adds the following APIs for set operations: * `lists::have_overlap` * `lists::intersect_distinct` * `lists::union_distinct` * `lists::difference_distinct` ### Name Convention Except for the first API (`lists::have_overlap`) that returns a boolean column, the suffix `_distinct` of the rest APIs denotes that their results will be lists columns in which all list rows have been post-processed to remove duplicates. As such, their results are actually "set" columns in which each row is a "set" of distinct elements. --- Depends on: * #10945 * #11017 * NVIDIA/cuCollections#175 * #11052 * #11118 * #11100 * #11149 Closes #10409. Authors: - Nghia Truong (https://github.com/ttnghia) - Yunsong Wang (https://github.com/PointKernel) Approvers: - Michael Wang (https://github.com/isVoid) - AJ Schmidt (https://github.com/ajschmidt8) - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) URL: #11043

ttnghia added 5 commits June 4, 2022 12:51

Test passed

c22b11c

Cleanup

5a6602c

Add bench

79fa051

Add comments

a7e3463

Add unique bench

3b7124e

ttnghia requested review from PointKernel and jrhemstad June 4, 2022 20:31

ttnghia self-assigned this Jun 4, 2022

github-actions bot added the CMake CMake build issue label Jun 4, 2022

Rewrite benchs

4f933bf

ttnghia marked this pull request as ready for review June 4, 2022 20:46

ttnghia requested a review from a team as a code owner June 4, 2022 20:46

PointKernel reviewed Jun 5, 2022

View reviewed changes

cpp/src/stream_compaction/distinct.cu Outdated Show resolved Hide resolved

jrhemstad reviewed Jun 6, 2022

View reviewed changes

cpp/src/stream_compaction/distinct.cu Outdated Show resolved Hide resolved

jrhemstad reviewed Jun 6, 2022

View reviewed changes

cpp/src/stream_compaction/distinct.cu Outdated Show resolved Hide resolved

github-actions bot added the CMake CMake build issue label Jun 20, 2022

ttnghia added 3 commits June 20, 2022 09:47

Cleanup

120377b

Change functor name

68133d4

Add doxygen

aefdadf

ttnghia added 4 commits June 20, 2022 13:55

Reorganize code

faf6778

Fix headers

e839323

Fix header

f5646b3

Fix mr usage, and rewrite some comments

538ff08

vyasr approved these changes Jun 21, 2022

View reviewed changes

PointKernel approved these changes Jun 21, 2022

View reviewed changes

cpp/src/stream_compaction/distinct_reduce.cuh Outdated Show resolved Hide resolved

cpp/src/stream_compaction/distinct.cu Outdated Show resolved Hide resolved

Pass std::shared_ptr by value

a755bea

ttnghia requested a review from bdice June 21, 2022 19:46

bdice requested changes Jun 22, 2022

View reviewed changes

Fix doxygen and change function name

f0ee266

bdice approved these changes Jun 22, 2022

View reviewed changes

cpp/src/stream_compaction/distinct_reduce.cuh Outdated Show resolved Hide resolved

cpp/src/stream_compaction/distinct_reduce.cuh Outdated Show resolved Hide resolved

ttnghia added 2 commits June 22, 2022 06:18

Update doxygen

0b35671

Merge branch 'branch-22.08' into refactor_stream_compaction

1ac6501

# Conflicts: # cpp/src/stream_compaction/distinct.cu

rapids-bot bot merged commit 31ad35c into rapidsai:branch-22.08 Jun 22, 2022

ttnghia deleted the refactor_stream_compaction branch June 22, 2022 20:30

GregoryKimball mentioned this pull request Oct 4, 2022

[FEA] Support duplicate_keep_option in places that use unordered map #11050

Closed

ttnghia mentioned this pull request Apr 6, 2023

[FEA] first and last as hash based aggregates #11141

Open

PointKernel mentioned this pull request Apr 17, 2023

[FEA] Improve cudf::distinct with cuco reduction map #13157

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support `duplicate_keep_option` in `cudf::distinct` #11052

Support `duplicate_keep_option` in `cudf::distinct` #11052

ttnghia commented Jun 4, 2022 •

edited

Loading

ttnghia commented Jun 4, 2022

ttnghia commented Jun 4, 2022

PointKernel commented Jun 5, 2022

ttnghia commented Jun 5, 2022

PointKernel commented Jun 20, 2022 •

edited

Loading

vyasr left a comment

PointKernel left a comment

bdice left a comment •

edited

Loading

bdice left a comment

ttnghia commented Jun 22, 2022

Support duplicate_keep_option in cudf::distinct #11052

Support duplicate_keep_option in cudf::distinct #11052

Conversation

ttnghia commented Jun 4, 2022 • edited Loading

ttnghia commented Jun 4, 2022

Comparing cudf::distinct without vs with keep_first option:

Comparing cudf::distinct with keep_first option vs cudf::sort + cudf::unique:

ttnghia commented Jun 4, 2022

PointKernel commented Jun 5, 2022

ttnghia commented Jun 5, 2022

PointKernel commented Jun 20, 2022 • edited Loading

vyasr left a comment

Choose a reason for hiding this comment

PointKernel left a comment

Choose a reason for hiding this comment

bdice left a comment • edited Loading

Choose a reason for hiding this comment

bdice left a comment

Choose a reason for hiding this comment

ttnghia commented Jun 22, 2022

Support `duplicate_keep_option` in `cudf::distinct` #11052

Support `duplicate_keep_option` in `cudf::distinct` #11052

ttnghia commented Jun 4, 2022 •

edited

Loading

Comparing `cudf::distinct` without vs with `keep_first` option:

Comparing `cudf::distinct` with `keep_first` option vs `cudf::sort` + `cudf::unique`:

PointKernel commented Jun 20, 2022 •

edited

Loading

bdice left a comment •

edited

Loading