Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stream compaction's drop_duplicates API to use stable sort #9417

Merged
merged 2 commits into from
Oct 15, 2021

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Oct 12, 2021

Currently, stream compaction API drop_duplicates uses unstable sort for all of its internal sorting. This is wrong since we may want to have stable sorting results so we can choose to keep the first or the last duplicate element from the repeated sequences.

This PR does two things:

  • Fixes the issue mentioned above by using stable sort if the input option is to keep the first or last duplicate element, and
  • Adds a new keep option into the enum class duplicate_keep_option: KEEP_ANY. This option allows the user to choose to keep one element from the repeated sequence at any position.

Note that the issue did not show up by any failed test because thrust default (unstable) sort, which is called internally in drop_duplicate, still produces the same results as thrust stable sort most of the time (but this is not guaranteed). As such, the current drop_duplicate still produces correct results in its tests.

This PR blocks #9345.

@ttnghia ttnghia added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf blocker libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Oct 12, 2021
@ttnghia ttnghia self-assigned this Oct 12, 2021
@ttnghia ttnghia requested a review from a team as a code owner October 12, 2021 03:28
@davidwendt
Copy link
Contributor

It may not be worth adding the KEEP_ANY and just use stable_sort here because issue #9413 may remove the sort call from this function.
@jrhemstad

@rapidsai rapidsai deleted a comment from codecov bot Oct 12, 2021
@ttnghia
Copy link
Contributor Author

ttnghia commented Oct 12, 2021

It may not be worth adding the KEEP_ANY and just use stable_sort here because issue #9413 may remove the sort call from this function. @jrhemstad

Thanks! I agree, but until that issue is addressed this PR is still relevant to fix sorting to use stable sort.

In addition, KEEP_ANY is necessary for #9345 (which is going to reuse the enum duplicate_keep_option). lists::drop_list_duplicates is doing similar things (calling to segmented_sort then doing unique_copy). Again, KEEP_ANY will allow using the default (unstable) sort with less performance concern.

@bdice
Copy link
Contributor

bdice commented Oct 12, 2021

I agree with @davidwendt that drop_duplicates should always use a stable sort and that the parameter KEEP_ANY doesn't need to be introduced for this specific case of drop_duplicates. How big is the performance difference between stable and unstable sorting?

I looked at #9345 to understand how KEEP_ANY would be used in that PR, but it appears that only KEEP_LAST is needed for the issue #9124 that the PR is resolving. Is there a user need for KEEP_ANY or is it only useful for performance reasons? I would guess that lists::drop_list_duplicates is less affected by the difference between stable/unstable performance than drop_duplicates, but it would be good to have numbers to verify this.

@ttnghia
Copy link
Contributor Author

ttnghia commented Oct 12, 2021

Using stable sort everywhere in drop_duplicates is fine, I can change that if all you want. But using unstable sort everywhere as currently is wrong, and that is one of the reasons I have this PR.

lists::drop_list_duplicates in some cases like collect_set only needs to remove duplicates and does not need to care about which element to preserve. That's why we need KEEP_ANY. Otherwise, stable segmented sort is required, which is not yet implemented and will be more expensive. I'm going to address #9124 by implementing stable segmented sort after this PR.

I believe that KEEP_ANY is also useful in the case of reduction because sometimes we want to make a set from a list and don't care about which element to preserve.

@bdice
Copy link
Contributor

bdice commented Oct 12, 2021

@ttnghia Thanks for the info. I think the scope of this PR should be limited to changing unstable sort to stable sort, since that's clearly a bug.

lists::drop_list_duplicates in some cases like collect_set only needs to remove duplicates and does not need to care about which element to preserve.

[...] stable segmented sort is required, which is not yet implemented and will be more expensive.

I saw this a little differently. It sounds like KEEP_ANY is not yet needed, until #9124 is addressed. I would implement the segmented stable sort and then compare to the performance of the unstable segmented sort. If the stable segmented sort is acceptably fast, I think it would be fine to make the default KEEP_FIRST or KEEP_LAST instead of immediately introducing the option KEEP_ANY. If the performance is significantly slower for the stable sort, then adding KEEP_ANY could be justified.

@ttnghia
Copy link
Contributor Author

ttnghia commented Oct 12, 2021

Stable/unstable segmented sorts all use thrust:: to sort. Here is the benchmark of stable/unstable sorting on my machine:

-----------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------------------
Sort<false>/unstable_no_nulls/1024/1/manual_time          0.442 ms        0.595 ms         1985
Sort<false>/unstable_no_nulls/4096/1/manual_time          0.402 ms        0.541 ms         1841
Sort<false>/unstable_no_nulls/32768/1/manual_time         0.386 ms        0.547 ms         1807
Sort<false>/unstable_no_nulls/262144/1/manual_time        0.397 ms        0.555 ms         1840
Sort<false>/unstable_no_nulls/2097152/1/manual_time       0.738 ms        0.898 ms         1012
Sort<false>/unstable_no_nulls/16777216/1/manual_time       3.48 ms         3.64 ms          201
Sort<false>/unstable_no_nulls/67108864/1/manual_time       13.1 ms         13.2 ms           54
Sort<false>/unstable_no_nulls/1024/8/manual_time           1.23 ms         1.39 ms          581
Sort<false>/unstable_no_nulls/4096/8/manual_time           1.31 ms         1.47 ms          514
Sort<false>/unstable_no_nulls/32768/8/manual_time          1.43 ms         1.59 ms          487
Sort<false>/unstable_no_nulls/262144/8/manual_time         1.61 ms         1.75 ms          432
Sort<false>/unstable_no_nulls/2097152/8/manual_time        24.5 ms         24.7 ms           28
Sort<false>/unstable_no_nulls/16777216/8/manual_time        378 ms          378 ms            2
Sort<false>/unstable_no_nulls/67108864/8/manual_time       2043 ms         2043 ms            1

Sort<true>/stable_no_nulls/1024/1/manual_time             0.469 ms        0.632 ms         1811
Sort<true>/stable_no_nulls/4096/1/manual_time             0.372 ms        0.501 ms         2872
Sort<true>/stable_no_nulls/32768/1/manual_time            0.360 ms        0.507 ms         1811
Sort<true>/stable_no_nulls/262144/1/manual_time           0.359 ms        0.498 ms         1919
Sort<true>/stable_no_nulls/2097152/1/manual_time          0.743 ms        0.906 ms         1066
Sort<true>/stable_no_nulls/16777216/1/manual_time          3.49 ms         3.65 ms          201
Sort<true>/stable_no_nulls/67108864/1/manual_time          13.0 ms         13.2 ms           54
Sort<true>/stable_no_nulls/1024/8/manual_time              1.23 ms         1.39 ms          561
Sort<true>/stable_no_nulls/4096/8/manual_time              1.33 ms         1.49 ms          530
Sort<true>/stable_no_nulls/32768/8/manual_time             1.43 ms         1.59 ms          500
Sort<true>/stable_no_nulls/262144/8/manual_time            1.65 ms         1.80 ms          426
Sort<true>/stable_no_nulls/2097152/8/manual_time           24.5 ms         24.7 ms           28
Sort<true>/stable_no_nulls/16777216/8/manual_time           378 ms          378 ms            2
Sort<true>/stable_no_nulls/67108864/8/manual_time          2044 ms         2044 ms            1
-----------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------------------
Sort<false>/unstable/1024/1/manual_time                   0.312 ms        0.452 ms         2026
Sort<false>/unstable/4096/1/manual_time                   0.380 ms        0.524 ms         1701
Sort<false>/unstable/32768/1/manual_time                  0.469 ms        0.625 ms         1462
Sort<false>/unstable/262144/1/manual_time                 0.575 ms        0.729 ms         1228
Sort<false>/unstable/2097152/1/manual_time                 3.66 ms         3.82 ms          192
Sort<false>/unstable/16777216/1/manual_time                28.6 ms         28.8 ms           25
Sort<false>/unstable/67108864/1/manual_time                 119 ms          119 ms            6
Sort<false>/unstable/1024/8/manual_time                    1.32 ms         1.46 ms          507
Sort<false>/unstable/4096/8/manual_time                    1.55 ms         1.71 ms          460
Sort<false>/unstable/32768/8/manual_time                   1.82 ms         1.98 ms          377
Sort<false>/unstable/262144/8/manual_time                  2.20 ms         2.35 ms          319
Sort<false>/unstable/2097152/8/manual_time                 28.8 ms         28.9 ms           24
Sort<false>/unstable/16777216/8/manual_time                 481 ms          481 ms            2
Sort<false>/unstable/67108864/8/manual_time                2873 ms         2873 ms            1

Sort<true>/stable/1024/1/manual_time                      0.324 ms        0.471 ms         2524
Sort<true>/stable/4096/1/manual_time                      0.370 ms        0.509 ms         1748
Sort<true>/stable/32768/1/manual_time                     0.479 ms        0.640 ms         1554
Sort<true>/stable/262144/1/manual_time                    0.581 ms        0.737 ms         1283
Sort<true>/stable/2097152/1/manual_time                    3.65 ms         3.81 ms          192
Sort<true>/stable/16777216/1/manual_time                   28.6 ms         28.7 ms           25
Sort<true>/stable/67108864/1/manual_time                    119 ms          120 ms            6
Sort<true>/stable/1024/8/manual_time                       1.36 ms         1.51 ms          503
Sort<true>/stable/4096/8/manual_time                       1.54 ms         1.70 ms          444
Sort<true>/stable/32768/8/manual_time                      1.83 ms         1.99 ms          391
Sort<true>/stable/262144/8/manual_time                     2.18 ms         2.33 ms          315
Sort<true>/stable/2097152/8/manual_time                    28.5 ms         28.7 ms           25
Sort<true>/stable/16777216/8/manual_time                    481 ms          481 ms            2
Sort<true>/stable/67108864/8/manual_time                   2870 ms         2870 ms            1

@ttnghia
Copy link
Contributor Author

ttnghia commented Oct 12, 2021

The benchmark shows that the performance difference between stable and unstable sorts (using thrust) is very small, less than 5%. I'm totally fine to switch to use stable sort all the time, but just feel that it is necessary. Need more vote @jrhemstad @davidwendt.

If using stable sort all the time, we can remove the KEEP_ANY option.

@mythrocks
Copy link
Contributor

LGTM, save for the removal of KEEP_ANY.

@rapidsai rapidsai deleted a comment from codecov bot Oct 15, 2021
@ttnghia
Copy link
Contributor Author

ttnghia commented Oct 15, 2021

Removed the KEEP_ANY option. Now just use stable sort all the time.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ttnghia Great, thanks for the update on this. Would it be possible to add a test of some kind to catch this bug? I recognize that the test might also pass on the previous code, based on what you wrote in the PR description, but it would be helpful to ensure that we are meeting our requirement of stability.

@ttnghia
Copy link
Contributor Author

ttnghia commented Oct 15, 2021

@ttnghia Great, thanks for the update on this. Would it be possible to add a test of some kind to catch this bug?

I tried that but couldn't. In order to catch the bug, we have to generate an array (of structs) that thrust::sort produces unstable result. All my generated data was sorted stably by thrust unstable sort.

So this is a potential bug, which does not show up right away and it's difficult to catch.

@bdice
Copy link
Contributor

bdice commented Oct 15, 2021

@ttnghia Great, thanks for the update on this. Would it be possible to add a test of some kind to catch this bug?

I tried that but couldn't. In order to catch the bug, we have to generate an array (of structs) that thrust::sort produces unstable result. All my generated data was sorted stably by thrust unstable sort.

So this is a potential bug, which does not show up right away and it's difficult to catch.

I looked over drop_duplicates_tests.cpp and I agree with your assessment -- it is hard to catch because the implementation was already doing what we needed, just not as a guarantee. Approving.

@codecov
Copy link

codecov bot commented Oct 15, 2021

Codecov Report

Merging #9417 (4b2de5a) into branch-21.12 (ab4bfaa) will decrease coverage by 0.00%.
The diff coverage is 0.00%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.12    #9417      +/-   ##
================================================
- Coverage         10.79%   10.78%   -0.01%     
================================================
  Files               116      117       +1     
  Lines             18869    19445     +576     
================================================
+ Hits               2036     2097      +61     
- Misses            16833    17348     +515     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_lib/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/lists.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/numerical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/string.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/struct.py 0.00% <0.00%> (ø)
... and 78 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e6caaf5...4b2de5a. Read the comment docs.

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, this.

@ttnghia
Copy link
Contributor Author

ttnghia commented Oct 15, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b66f14e into rapidsai:branch-21.12 Oct 15, 2021
@ttnghia ttnghia deleted the drop_duplicates_stable_sort branch October 15, 2021 19:33
rapids-bot bot pushed a commit that referenced this pull request Jun 22, 2022
This adds `duplicate_keep_option` to `cudf::distinct`, allowing to specify a `keep` option for selecting which of the duplicate elements to keep. It paves the way for many drop duplicate applications to achieve `O(n)` performance.

A `KEEP_ANY` option is also added to `duplicate_keep_option`, which was an attempt in #9417 but didn't get in eventually.

Partially addresses #11050 and #11053.

----
Main implementation: https://github.com/rapidsai/cudf/pull/11052/files#diff-4c2d4268b3c50000ae845ba15a890bb743709c30e5cab4847af7ad633c5a2823R47

Follow up work:
 * #11053
 * #11089
 * #11092

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)

URL: #11052
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants