Add cudf::stable_distinct public API, tests, and benchmarks. #13392
Conversation
This file is almost an exact copy of the `distinct` benchmarks.

Here are the results for `stable_distinct` vs. `distinct`. I just wanted to show that they're similar in execution time. Stability isn't too expensive, if needed by the caller.
Benchmark Results
stable_distinct
[0] NVIDIA RTX A6000
Type | NumRows | Samples | CPU Time | Noise | GPU Time | Noise |
---|---|---|---|---|---|---|
bool | 10000 | 3328x | 154.991 us | 18.17% | 150.732 us | 17.58% |
bool | 100000 | 3152x | 171.956 us | 16.30% | 167.866 us | 16.11% |
bool | 1000000 | 2656x | 391.125 us | 7.81% | 386.949 us | 7.64% |
bool | 10000000 | 1744x | 2.370 ms | 2.38% | 2.366 ms | 2.36% |
I8 | 10000 | 3392x | 152.019 us | 15.78% | 147.786 us | 14.82% |
I8 | 100000 | 3136x | 163.654 us | 15.81% | 159.577 us | 15.60% |
I8 | 1000000 | 1776x | 385.042 us | 10.74% | 380.976 us | 10.68% |
I8 | 10000000 | 2800x | 2.409 ms | 7.88% | 2.405 ms | 7.85% |
I32 | 10000 | 3392x | 152.155 us | 17.05% | 147.984 us | 16.48% |
I32 | 100000 | 3136x | 163.918 us | 17.83% | 159.824 us | 17.64% |
I32 | 1000000 | 1328x | 384.702 us | 11.33% | 379.871 us | 7.39% |
I32 | 10000000 | 1744x | 2.371 ms | 2.50% | 2.367 ms | 2.48% |
I64 | 10000 | 3360x | 153.467 us | 18.28% | 149.399 us | 18.07% |
I64 | 100000 | 3136x | 164.309 us | 14.35% | 160.212 us | 14.08% |
I64 | 1000000 | 1600x | 385.795 us | 11.04% | 381.748 us | 10.99% |
I64 | 10000000 | 2240x | 2.407 ms | 2.99% | 2.403 ms | 2.99% |
F32 | 10000 | 3344x | 154.145 us | 15.34% | 149.986 us | 14.03% |
F32 | 100000 | 3104x | 172.753 us | 14.15% | 168.570 us | 13.02% |
F32 | 1000000 | 1424x | 446.890 us | 9.98% | 442.811 us | 9.94% |
F32 | 10000000 | 1552x | 3.466 ms | 2.23% | 3.461 ms | 2.23% |
cudf::timestamp_ms | 10000 | 3584x | 153.687 us | 14.15% | 149.512 us | 12.79% |
cudf::timestamp_ms | 100000 | 3056x | 168.547 us | 15.12% | 164.391 us | 14.77% |
cudf::timestamp_ms | 1000000 | 2896x | 392.451 us | 8.76% | 388.323 us | 8.48% |
cudf::timestamp_ms | 10000000 | 1712x | 2.440 ms | 2.00% | 2.436 ms | 1.93% |
distinct
[0] NVIDIA RTX A6000
Type | NumRows | Samples | CPU Time | Noise | GPU Time | Noise |
---|---|---|---|---|---|---|
bool | 10000 | 6064x | 86.724 us | 21.76% | 82.512 us | 14.82% |
bool | 100000 | 5008x | 104.379 us | 19.96% | 100.135 us | 16.15% |
bool | 1000000 | 2960x | 284.657 us | 10.09% | 280.587 us | 9.98% |
bool | 10000000 | 1568x | 1.955 ms | 2.44% | 1.951 ms | 2.43% |
I8 | 10000 | 6080x | 86.428 us | 17.91% | 82.315 us | 16.20% |
I8 | 100000 | 5456x | 95.889 us | 16.47% | 91.739 us | 12.97% |
I8 | 1000000 | 1856x | 278.206 us | 12.02% | 274.187 us | 11.93% |
I8 | 10000000 | 1520x | 1.949 ms | 2.50% | 1.945 ms | 2.49% |
I32 | 10000 | 6064x | 86.619 us | 22.41% | 82.525 us | 21.50% |
I32 | 100000 | 5440x | 95.985 us | 22.99% | 91.946 us | 22.57% |
I32 | 1000000 | 1936x | 277.409 us | 11.83% | 273.181 us | 11.49% |
I32 | 10000000 | 2144x | 1.950 ms | 2.70% | 1.946 ms | 2.67% |
I64 | 10000 | 6080x | 86.326 us | 19.96% | 82.278 us | 19.33% |
I64 | 100000 | 5392x | 97.024 us | 24.22% | 92.818 us | 16.11% |
I64 | 1000000 | 1824x | 278.336 us | 12.94% | 274.197 us | 12.76% |
I64 | 10000000 | 1808x | 1.986 ms | 2.46% | 1.982 ms | 2.45% |
F32 | 10000 | 6080x | 86.418 us | 19.24% | 82.328 us | 18.02% |
F32 | 100000 | 4992x | 104.270 us | 16.05% | 100.173 us | 14.24% |
F32 | 1000000 | 3200x | 341.270 us | 8.96% | 337.142 us | 8.57% |
F32 | 10000000 | 1680x | 3.575 ms | 1.75% | 3.571 ms | 1.75% |
cudf::timestamp_ms | 10000 | 6080x | 86.580 us | 22.77% | 82.434 us | 21.40% |
cudf::timestamp_ms | 100000 | 5376x | 97.271 us | 15.64% | 93.221 us | 15.01% |
cudf::timestamp_ms | 1000000 | 1824x | 279.394 us | 12.65% | 275.040 us | 12.13% |
cudf::timestamp_ms | 10000000 | 2080x | 2.014 ms | 2.74% | 2.009 ms | 2.71% |
Next time you can show the time diff for better readability.
@ttnghia Do you know how to do that? I am still learning the nvbench CLI. I also didn't want to make it appear that we were comparing apples-to-apples, since the algorithms are different.
```cpp
  },
  stream,
  mr);
return cudf::detail::apply_boolean_mask(
```
Using `apply_boolean_mask` incurs no noticeable performance penalty, but it speeds up compile time. We already have to compile kernels for `apply_boolean_mask`, so it's good to reuse them rather than add more Thrust kernels.
Under the hood, `apply_boolean_mask` implements exactly the removed code, but it is already compiled separately 😂
This was inspired by the RAFT talk on reducing compile times. I think there are a ton of places like this in cudf that could shrink our kernel footprint.
I vaguely remember that we have a lot of places with a similar compile-time burden, not just with `copy_if` but also `detail::gather`. I believe that we can easily convert calls to `detail::gather` into `apply_boolean_mask` too.
BTW, the name `apply_boolean_mask` is a bit misleading to me. What is the outcome of such "applying"? We are copying out of the input, not applying anything to it (in place). So maybe it should have a better name. Renaming it into another overload of `copy_if` could be an option.
I think that name change could be sensible (I only just discovered `apply_boolean_mask` after searching unsuccessfully for `copy_if` in libcudf), but it is out of scope for this PR.
Filed an issue: #13405
The tests are not terribly interesting. I copied the `cudf::distinct` tests and modified them to behave stably.
Co-authored-by: Nghia Truong <[email protected]>
/merge
Depends on #13392. Closes #11638 Closes #12449 Closes #11230 Closes #5286 This PR re-implements Python's `DataFrame.drop_duplicates` / `Series.drop_duplicates` to use the `stable_distinct` algorithm. This fixed a large number of issues with correctness (ordering the same way as pandas) and also improves performance by eliminating a sorting step. As a consequence of changing the behavior of `drop_duplicates`, a lot of refactoring was needed. The `drop_duplicates` function was used to implement `unique()`, which cascaded into changes for several groupby functions, one-hot encoding, `np.unique` array function dispatches, and more. Those downstream functions relied on the sorting order of `drop_duplicates` and `unique`, which is _not_ promised by pandas. Authors: - https://github.com/brandon-b-miller - Bradley Dice (https://github.com/bdice) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Matthew Roeschke (https://github.com/mroeschke) - Nghia Truong (https://github.com/ttnghia) URL: #11656
Description
This PR supersedes part of #11656.

It adds a public API for `cudf::stable_distinct`, mirroring that of `cudf::distinct` but preserving the order of the input table. The `stable_distinct` implementation was refactored to use `apply_boolean_mask`, which reduces the number of kernels needed. I also added tests/benchmarks for `cudf::stable_distinct`.

I split out the C++ changes from #11656 because that PR's size was getting too large. Also, these C++ changes are non-breaking, but the Python changes are breaking (and depend on these C++ changes), so separating them into a new PR seemed like a good idea.
Checklist