
Add cudf::stable_distinct public API, tests, and benchmarks. #13392

Conversation

@bdice bdice commented May 19, 2023

Description

This PR supersedes part of #11656.

It adds a public API for cudf::stable_distinct, mirroring that of cudf::distinct but preserving the order of the input table. The stable_distinct implementation was refactored to use apply_boolean_mask, which reduces the number of kernels needed. I also added tests/benchmarks for cudf::stable_distinct.

I split the C++ changes out of #11656 because that PR was getting too large. Also, these C++ changes are non-breaking, while the Python changes are breaking (and depend on these C++ changes), so separating them into a new PR seemed like a good idea.
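For readers unfamiliar with the semantics: `stable_distinct` keeps the first occurrence of each key while preserving the input order. A minimal host-side C++ sketch of those semantics (an illustrative analogue, not the libcudf implementation, which operates on device tables):

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Host-side analogue of stable_distinct with keep-first semantics:
// retain the first occurrence of each value, preserving input order.
std::vector<int> stable_distinct(std::vector<int> const& input)
{
  std::unordered_set<int> seen;
  std::vector<int> out;
  for (int v : input) {
    // insert(...).second is true only the first time a value is seen
    if (seen.insert(v).second) { out.push_back(v); }
  }
  return out;
}
```

For example, `stable_distinct({3, 1, 3, 2, 1})` yields `{3, 1, 2}`: duplicates are dropped and the survivors keep their original relative order, whereas `cudf::distinct` makes no ordering guarantee.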

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@bdice bdice requested a review from a team as a code owner May 19, 2023 17:43
@github-actions github-actions bot added the `CMake` (CMake build issue) and `libcudf` (Affects libcudf (C++/CUDA) code) labels May 19, 2023
Contributor Author

This file is almost an exact copy of the distinct benchmarks.

Here are the results for stable_distinct vs. distinct. I just wanted to show that they're similar in execution time. Stability isn't too expensive, if needed by the caller.

Benchmark Results

stable_distinct

[0] NVIDIA RTX A6000

| Type | NumRows | Samples | CPU Time | Noise | GPU Time | Noise |
|------|---------|---------|----------|-------|----------|-------|
| bool | 10000 | 3328x | 154.991 us | 18.17% | 150.732 us | 17.58% |
| bool | 100000 | 3152x | 171.956 us | 16.30% | 167.866 us | 16.11% |
| bool | 1000000 | 2656x | 391.125 us | 7.81% | 386.949 us | 7.64% |
| bool | 10000000 | 1744x | 2.370 ms | 2.38% | 2.366 ms | 2.36% |
| I8 | 10000 | 3392x | 152.019 us | 15.78% | 147.786 us | 14.82% |
| I8 | 100000 | 3136x | 163.654 us | 15.81% | 159.577 us | 15.60% |
| I8 | 1000000 | 1776x | 385.042 us | 10.74% | 380.976 us | 10.68% |
| I8 | 10000000 | 2800x | 2.409 ms | 7.88% | 2.405 ms | 7.85% |
| I32 | 10000 | 3392x | 152.155 us | 17.05% | 147.984 us | 16.48% |
| I32 | 100000 | 3136x | 163.918 us | 17.83% | 159.824 us | 17.64% |
| I32 | 1000000 | 1328x | 384.702 us | 11.33% | 379.871 us | 7.39% |
| I32 | 10000000 | 1744x | 2.371 ms | 2.50% | 2.367 ms | 2.48% |
| I64 | 10000 | 3360x | 153.467 us | 18.28% | 149.399 us | 18.07% |
| I64 | 100000 | 3136x | 164.309 us | 14.35% | 160.212 us | 14.08% |
| I64 | 1000000 | 1600x | 385.795 us | 11.04% | 381.748 us | 10.99% |
| I64 | 10000000 | 2240x | 2.407 ms | 2.99% | 2.403 ms | 2.99% |
| F32 | 10000 | 3344x | 154.145 us | 15.34% | 149.986 us | 14.03% |
| F32 | 100000 | 3104x | 172.753 us | 14.15% | 168.570 us | 13.02% |
| F32 | 1000000 | 1424x | 446.890 us | 9.98% | 442.811 us | 9.94% |
| F32 | 10000000 | 1552x | 3.466 ms | 2.23% | 3.461 ms | 2.23% |
| cudf::timestamp_ms | 10000 | 3584x | 153.687 us | 14.15% | 149.512 us | 12.79% |
| cudf::timestamp_ms | 100000 | 3056x | 168.547 us | 15.12% | 164.391 us | 14.77% |
| cudf::timestamp_ms | 1000000 | 2896x | 392.451 us | 8.76% | 388.323 us | 8.48% |
| cudf::timestamp_ms | 10000000 | 1712x | 2.440 ms | 2.00% | 2.436 ms | 1.93% |

distinct

[0] NVIDIA RTX A6000

| Type | NumRows | Samples | CPU Time | Noise | GPU Time | Noise |
|------|---------|---------|----------|-------|----------|-------|
| bool | 10000 | 6064x | 86.724 us | 21.76% | 82.512 us | 14.82% |
| bool | 100000 | 5008x | 104.379 us | 19.96% | 100.135 us | 16.15% |
| bool | 1000000 | 2960x | 284.657 us | 10.09% | 280.587 us | 9.98% |
| bool | 10000000 | 1568x | 1.955 ms | 2.44% | 1.951 ms | 2.43% |
| I8 | 10000 | 6080x | 86.428 us | 17.91% | 82.315 us | 16.20% |
| I8 | 100000 | 5456x | 95.889 us | 16.47% | 91.739 us | 12.97% |
| I8 | 1000000 | 1856x | 278.206 us | 12.02% | 274.187 us | 11.93% |
| I8 | 10000000 | 1520x | 1.949 ms | 2.50% | 1.945 ms | 2.49% |
| I32 | 10000 | 6064x | 86.619 us | 22.41% | 82.525 us | 21.50% |
| I32 | 100000 | 5440x | 95.985 us | 22.99% | 91.946 us | 22.57% |
| I32 | 1000000 | 1936x | 277.409 us | 11.83% | 273.181 us | 11.49% |
| I32 | 10000000 | 2144x | 1.950 ms | 2.70% | 1.946 ms | 2.67% |
| I64 | 10000 | 6080x | 86.326 us | 19.96% | 82.278 us | 19.33% |
| I64 | 100000 | 5392x | 97.024 us | 24.22% | 92.818 us | 16.11% |
| I64 | 1000000 | 1824x | 278.336 us | 12.94% | 274.197 us | 12.76% |
| I64 | 10000000 | 1808x | 1.986 ms | 2.46% | 1.982 ms | 2.45% |
| F32 | 10000 | 6080x | 86.418 us | 19.24% | 82.328 us | 18.02% |
| F32 | 100000 | 4992x | 104.270 us | 16.05% | 100.173 us | 14.24% |
| F32 | 1000000 | 3200x | 341.270 us | 8.96% | 337.142 us | 8.57% |
| F32 | 10000000 | 1680x | 3.575 ms | 1.75% | 3.571 ms | 1.75% |
| cudf::timestamp_ms | 10000 | 6080x | 86.580 us | 22.77% | 82.434 us | 21.40% |
| cudf::timestamp_ms | 100000 | 5376x | 97.271 us | 15.64% | 93.221 us | 15.01% |
| cudf::timestamp_ms | 1000000 | 1824x | 279.394 us | 12.65% | 275.040 us | 12.13% |
| cudf::timestamp_ms | 10000000 | 2080x | 2.014 ms | 2.74% | 2.009 ms | 2.71% |

Contributor

Next time you could show the time difference for better readability.

Contributor Author

@ttnghia Do you know how to do that? I am still learning the nvbench CLI. I also didn't want to make it appear that we were comparing apples-to-apples, since the algorithms are different.

```cpp
  },
  stream,
  mr);
return cudf::detail::apply_boolean_mask(
```
Contributor Author

Using apply_boolean_mask incurs no noticeable performance penalty, but speeds up compile time. We already have to compile kernels for apply_boolean_mask, so it's good to reuse them rather than add more Thrust kernels.
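The shape of the refactor can be sketched on the host: build a boolean mask marking the rows to keep, then hand it to one generic filter routine instead of instantiating a new copy kernel. The names below are illustrative host-side analogues under that assumption, not libcudf's actual API:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Generic filter, standing in for cudf::detail::apply_boolean_mask:
// copy elements whose mask entry is true, preserving order.
std::vector<int> apply_boolean_mask(std::vector<int> const& input,
                                    std::vector<bool> const& mask)
{
  std::vector<int> out;
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (mask[i]) { out.push_back(input[i]); }
  }
  return out;
}

// stable_distinct expressed as "build a mask, then reuse the filter":
// the mask marks the first occurrence of each value.
std::vector<int> stable_distinct(std::vector<int> const& input)
{
  std::unordered_set<int> seen;
  std::vector<bool> keep(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    keep[i] = seen.insert(input[i]).second;
  }
  return apply_boolean_mask(input, keep);
}
```

Because the filter is the only copy step, any caller that already compiles `apply_boolean_mask`-style kernels pays no extra instantiation cost for the new algorithm.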

Contributor


Under the hood, apply_boolean_mask implements exactly the code removed here, but it is already compiled separately 😂

Contributor Author


This was inspired by the RAFT talk on reducing compile times. I think there are a ton of places like this in cudf that could shrink our kernel footprint.

Contributor


I vaguely remember that we have a lot of places with a similar compilation burden, not just from copy_if but also from detail::gather. I believe we could easily convert many of those detail::gather calls to apply_boolean_mask too.
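The equivalence behind that suggestion can be illustrated on the host: a gather with an ordered index list selects the same rows as a filter with the corresponding boolean mask. Both functions below are simplified host-side analogues, not the actual `detail::gather` / `apply_boolean_mask` signatures in libcudf:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host analogue of a gather: copy elements at the given indices.
std::vector<int> gather(std::vector<int> const& input,
                        std::vector<std::size_t> const& indices)
{
  std::vector<int> out;
  for (std::size_t i : indices) { out.push_back(input[i]); }
  return out;
}

// Host analogue of apply_boolean_mask: copy elements whose mask entry
// is true, preserving order.
std::vector<int> apply_boolean_mask(std::vector<int> const& input,
                                    std::vector<bool> const& mask)
{
  std::vector<int> out;
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (mask[i]) { out.push_back(input[i]); }
  }
  return out;
}
```

For instance, gathering indices `{0, 2, 3}` from a 4-row input produces the same result as masking with `{true, false, true, true}`, so call sites whose gather maps are ordered subsets of the input rows are candidates for this conversion.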

Contributor


BTW, the name apply_boolean_mask is a bit misleading to me. What is the outcome of such "applying"? We are copying out of the input, not applying anything to it in place, so maybe it deserves a better name. Renaming it to another overload of copy_if could be an option.

Contributor Author
@bdice bdice May 19, 2023


I think that name change could be sensible (I only just discovered apply_boolean_mask after searching unsuccessfully for copy_if in libcudf) but it is out of scope for this PR.

Contributor


Filed an issue: #13405

Contributor Author


The tests are not terribly interesting. I copied the cudf::distinct tests and modified them to behave stably.

@bdice bdice requested a review from ttnghia May 19, 2023 17:45
@bdice bdice added the `improvement` (Improvement / enhancement to an existing function) and `non-breaking` (Non-breaking change) labels May 19, 2023
Co-authored-by: Nghia Truong <[email protected]>
@bdice
Contributor Author

bdice commented May 22, 2023

/merge

@rapids-bot rapids-bot bot merged commit c32cb9a into rapidsai:branch-23.06 May 22, 2023
rapids-bot bot pushed a commit that referenced this pull request May 23, 2023
Depends on #13392.

Closes #11638
Closes #12449
Closes #11230
Closes #5286

This PR re-implements Python's `DataFrame.drop_duplicates` / `Series.drop_duplicates` to use the `stable_distinct` algorithm.

This fixes a large number of correctness issues (ordering the same way as pandas does) and also improves performance by eliminating a sorting step.

As a consequence of changing the behavior of `drop_duplicates`, a lot of refactoring was needed. The `drop_duplicates` function was used to implement `unique()`, which cascaded into changes for several groupby functions, one-hot encoding, `np.unique` array function dispatches, and more. Those downstream functions relied on the sorting order of `drop_duplicates` and `unique`, which is _not_ promised by pandas.

Authors:
  - https://github.com/brandon-b-miller
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Matthew Roeschke (https://github.com/mroeschke)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11656