Add cudf::stable_distinct public API, tests, and benchmarks. #13392
Conversation
This file is almost an exact copy of the `distinct` benchmarks.

Here are the results for `stable_distinct` vs. `distinct`. I just wanted to show that they're similar in execution time. Stability isn't too expensive, if needed by the caller.
Benchmark Results
stable_distinct
[0] NVIDIA RTX A6000
Type | NumRows | Samples | CPU Time | Noise | GPU Time | Noise |
---|---|---|---|---|---|---|
bool | 10000 | 3328x | 154.991 us | 18.17% | 150.732 us | 17.58% |
bool | 100000 | 3152x | 171.956 us | 16.30% | 167.866 us | 16.11% |
bool | 1000000 | 2656x | 391.125 us | 7.81% | 386.949 us | 7.64% |
bool | 10000000 | 1744x | 2.370 ms | 2.38% | 2.366 ms | 2.36% |
I8 | 10000 | 3392x | 152.019 us | 15.78% | 147.786 us | 14.82% |
I8 | 100000 | 3136x | 163.654 us | 15.81% | 159.577 us | 15.60% |
I8 | 1000000 | 1776x | 385.042 us | 10.74% | 380.976 us | 10.68% |
I8 | 10000000 | 2800x | 2.409 ms | 7.88% | 2.405 ms | 7.85% |
I32 | 10000 | 3392x | 152.155 us | 17.05% | 147.984 us | 16.48% |
I32 | 100000 | 3136x | 163.918 us | 17.83% | 159.824 us | 17.64% |
I32 | 1000000 | 1328x | 384.702 us | 11.33% | 379.871 us | 7.39% |
I32 | 10000000 | 1744x | 2.371 ms | 2.50% | 2.367 ms | 2.48% |
I64 | 10000 | 3360x | 153.467 us | 18.28% | 149.399 us | 18.07% |
I64 | 100000 | 3136x | 164.309 us | 14.35% | 160.212 us | 14.08% |
I64 | 1000000 | 1600x | 385.795 us | 11.04% | 381.748 us | 10.99% |
I64 | 10000000 | 2240x | 2.407 ms | 2.99% | 2.403 ms | 2.99% |
F32 | 10000 | 3344x | 154.145 us | 15.34% | 149.986 us | 14.03% |
F32 | 100000 | 3104x | 172.753 us | 14.15% | 168.570 us | 13.02% |
F32 | 1000000 | 1424x | 446.890 us | 9.98% | 442.811 us | 9.94% |
F32 | 10000000 | 1552x | 3.466 ms | 2.23% | 3.461 ms | 2.23% |
cudf::timestamp_ms | 10000 | 3584x | 153.687 us | 14.15% | 149.512 us | 12.79% |
cudf::timestamp_ms | 100000 | 3056x | 168.547 us | 15.12% | 164.391 us | 14.77% |
cudf::timestamp_ms | 1000000 | 2896x | 392.451 us | 8.76% | 388.323 us | 8.48% |
cudf::timestamp_ms | 10000000 | 1712x | 2.440 ms | 2.00% | 2.436 ms | 1.93% |
distinct
[0] NVIDIA RTX A6000
Type | NumRows | Samples | CPU Time | Noise | GPU Time | Noise |
---|---|---|---|---|---|---|
bool | 10000 | 6064x | 86.724 us | 21.76% | 82.512 us | 14.82% |
bool | 100000 | 5008x | 104.379 us | 19.96% | 100.135 us | 16.15% |
bool | 1000000 | 2960x | 284.657 us | 10.09% | 280.587 us | 9.98% |
bool | 10000000 | 1568x | 1.955 ms | 2.44% | 1.951 ms | 2.43% |
I8 | 10000 | 6080x | 86.428 us | 17.91% | 82.315 us | 16.20% |
I8 | 100000 | 5456x | 95.889 us | 16.47% | 91.739 us | 12.97% |
I8 | 1000000 | 1856x | 278.206 us | 12.02% | 274.187 us | 11.93% |
I8 | 10000000 | 1520x | 1.949 ms | 2.50% | 1.945 ms | 2.49% |
I32 | 10000 | 6064x | 86.619 us | 22.41% | 82.525 us | 21.50% |
I32 | 100000 | 5440x | 95.985 us | 22.99% | 91.946 us | 22.57% |
I32 | 1000000 | 1936x | 277.409 us | 11.83% | 273.181 us | 11.49% |
I32 | 10000000 | 2144x | 1.950 ms | 2.70% | 1.946 ms | 2.67% |
I64 | 10000 | 6080x | 86.326 us | 19.96% | 82.278 us | 19.33% |
I64 | 100000 | 5392x | 97.024 us | 24.22% | 92.818 us | 16.11% |
I64 | 1000000 | 1824x | 278.336 us | 12.94% | 274.197 us | 12.76% |
I64 | 10000000 | 1808x | 1.986 ms | 2.46% | 1.982 ms | 2.45% |
F32 | 10000 | 6080x | 86.418 us | 19.24% | 82.328 us | 18.02% |
F32 | 100000 | 4992x | 104.270 us | 16.05% | 100.173 us | 14.24% |
F32 | 1000000 | 3200x | 341.270 us | 8.96% | 337.142 us | 8.57% |
F32 | 10000000 | 1680x | 3.575 ms | 1.75% | 3.571 ms | 1.75% |
cudf::timestamp_ms | 10000 | 6080x | 86.580 us | 22.77% | 82.434 us | 21.40% |
cudf::timestamp_ms | 100000 | 5376x | 97.271 us | 15.64% | 93.221 us | 15.01% |
cudf::timestamp_ms | 1000000 | 1824x | 279.394 us | 12.65% | 275.040 us | 12.13% |
cudf::timestamp_ms | 10000000 | 2080x | 2.014 ms | 2.74% | 2.009 ms | 2.71% |
Next time you can show the time diff for better readability.
@ttnghia Do you know how to do that? I am still learning the nvbench CLI. I also didn't want to make it appear that we were comparing apples-to-apples, since the algorithms are different.
```cpp
  },
  stream,
  mr);
return cudf::detail::apply_boolean_mask(
```
Using `apply_boolean_mask` incurs no noticeable performance penalty, but it speeds up compile time. We already have to compile kernels for `apply_boolean_mask`, so it's good to reuse them rather than add more Thrust kernels.
Under the hood, `apply_boolean_mask` implements exactly the removed code, but it is already compiled separately 😂
This was inspired by the RAFT talk on reducing compile times. I think there are a ton of places like this in cudf that could shrink our kernel footprint.
I vaguely remember that we have a lot of places with a similar compile-time burden, not just with `copy_if` but also `detail::gather`. I believe that we can easily convert calls to `detail::gather` into `apply_boolean_mask` too.
BTW, the name `apply_boolean_mask` is a bit misleading to me. What is the outcome of such "applying"? We are copying out of the input, not applying anything to it (in place). So maybe it should have a better name. Renaming it into another overload of `copy_if` could be an option.
I think that name change could be sensible (I only just discovered `apply_boolean_mask` after searching unsuccessfully for `copy_if` in libcudf), but it is out of scope for this PR.
Filed an issue: #13405
The tests are not terribly interesting. I copied the `cudf::distinct` tests and modified them to behave stably.
Co-authored-by: Nghia Truong <[email protected]>
/merge
Depends on #13392. Closes #11638 Closes #12449 Closes #11230 Closes #5286 This PR re-implements Python's `DataFrame.drop_duplicates` / `Series.drop_duplicates` to use the `stable_distinct` algorithm. This fixed a large number of issues with correctness (ordering the same way as pandas) and also improves performance by eliminating a sorting step. As a consequence of changing the behavior of `drop_duplicates`, a lot of refactoring was needed. The `drop_duplicates` function was used to implement `unique()`, which cascaded into changes for several groupby functions, one-hot encoding, `np.unique` array function dispatches, and more. Those downstream functions relied on the sorting order of `drop_duplicates` and `unique`, which is _not_ promised by pandas. Authors: - https://github.com/brandon-b-miller - Bradley Dice (https://github.com/bdice) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Matthew Roeschke (https://github.com/mroeschke) - Nghia Truong (https://github.com/ttnghia) URL: #11656
Description
This PR supersedes part of #11656.

It adds a public API for `cudf::stable_distinct`, mirroring that of `cudf::distinct` but preserving the order of the input table. The `stable_distinct` implementation was refactored to use `apply_boolean_mask`, which reduces the number of kernels needed. I also added tests/benchmarks for `cudf::stable_distinct`.

I split out the C++ changes from #11656 because that PR's size was getting too large. Also, these C++ changes are non-breaking, but the Python changes are breaking (and depend on these C++ changes), so separating them into a new PR seemed like a good idea.
Checklist