Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support duplicate_keep_option in places that use unordered map #11050

Closed
ttnghia opened this issue Jun 3, 2022 · 5 comments
Closed

[FEA] Support duplicate_keep_option in places that use unordered map #11050

ttnghia opened this issue Jun 3, 2022 · 5 comments
Assignees
Labels
feature request New feature or request

Comments

@ttnghia
Copy link
Contributor

ttnghia commented Jun 3, 2022

In several places, we need to support duplicate_keep_option to drop duplicates except for the first or last element in a duplicate sequence. Such an option requires to stable-sort the input column so we can identify the first or last duplicate element.

Hash table/unordered_map for drop duplicates can avoid sorting and enhance performance from O(nlogn) to O(n) but the output elements are unordered thus it cannot support duplicate_keep_option. I believe that such limitation can be overcome somehow. I'll think about it and will prototype a solution if found any idea.

@ttnghia ttnghia added feature request New feature or request Needs Triage Need team to review and classify labels Jun 3, 2022
@ttnghia ttnghia self-assigned this Jun 3, 2022
@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 3, 2022

@PointKernel I have some ideas for this. So I have a question: Can we access a value associated with a key in a kernel? Is that using the __device__ static_map::device_view::find() function? I'm not sure if calling it in a provided kernel will be fast or not, and should I use the overload of find that has CG g parameter or not?

@PointKernel
Copy link
Member

Is that using the device static_map::device_view::find() function?

Yes

I'm not sure if calling it in a provided kernel will be fast or not

Whether it's a cuco kernel or a custom kernel doesn't matter. Those APIs are designed to be invoked on the device side.

should I use the overload of find that has CG g parameter or not?

There is no short answer for this. Using CG or not and the optimal size of CG vary on a use-case base. But in general, if we stick with 50% occupancy (the default occupancy for all use cases of static_map in cudf), The non-CG version is slightly more efficient since hash collisions are rare with such low occupancy.

rapids-bot bot pushed a commit that referenced this issue Jun 22, 2022
This adds `duplicate_keep_option` to `cudf::distinct`, allowing to specify a `keep` option for selecting which of the duplicate elements to keep. It paves the way for many drop duplicate applications to achieve `O(n)` performance.

A `KEEP_ANY` option is also added to `duplicate_keep_option`, which was an attempt in #9417 but didn't get in eventually.

Partially addresses #11050 and #11053.

----
Main implementation: https://github.com/rapidsai/cudf/pull/11052/files#diff-4c2d4268b3c50000ae845ba15a890bb743709c30e5cab4847af7ad633c5a2823R47

Follow up work:
 * #11053
 * #11089
 * #11092

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)

URL: #11052
@github-actions
Copy link

github-actions bot commented Jul 4, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions
Copy link

github-actions bot commented Oct 2, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@GregoryKimball
Copy link
Contributor

Closed by #11052

@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants