[FEA] Expose stable_distinct to python #11638

Closed
brandon-b-miller opened this issue Sep 1, 2022 · 1 comment · Fixed by #11656
Labels
libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Comments

@brandon-b-miller
Contributor

Is your feature request related to a problem? Please describe.
Sometimes cuDF needs to deduplicate data internally in a way that preserves the original order of the elements. This usually comes up when replicating pandas semantics exactly is important. In #11597, a small function was proposed to do this in Python, but libcudf already has an API for it that we should prefer.
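
To make the distinction concrete, here is a minimal pure-Python sketch of the two behaviors being contrasted. It is not the libcudf or cuDF implementation; the function names `sort_based_distinct` and `stable_distinct_sketch` are hypothetical and used only for illustration.

```python
def sort_based_distinct(values):
    """Deduplicate via a set and a sort: the result is sorted, so the original order is lost."""
    return sorted(set(values))


def stable_distinct_sketch(values):
    """Deduplicate while keeping each value at the position of its first appearance."""
    # dict preserves insertion order (Python 3.7+), so the keys come out
    # in the order in which each value was first seen.
    return list(dict.fromkeys(values))


data = [3, 1, 3, 2, 1]
print(sort_based_distinct(data))     # [1, 2, 3]
print(stable_distinct_sketch(data))  # [3, 1, 2]
```

The second result is the order-preserving one, which is the ordering this issue is asking for.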

Describe the solution you'd like
Move this function from detail to cudf and create Cython bindings, then plumb them through into the same places where #11597 implements the workaround. Create tests and benchmarks.

Describe alternatives you've considered
The linked PR above implements a workaround.

Additional context
We could use this to let users opt in to exact pandas drop_duplicates() semantics via a kwarg or configuration setting. It may also be necessary to make cuDF work with dask in situations where dask calls drop_duplicates() and expects the pandas result.
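
As a rough illustration of what "pandas drop_duplicates() semantics" means for ordering, the sketch below mimics the `keep="first"`, `keep="last"`, and `keep=False` options of pandas on a plain list of row tuples. It is a pure-Python model under that assumption, not cuDF code; `drop_duplicates_sketch` is a hypothetical name.

```python
from collections import Counter


def drop_duplicates_sketch(rows, keep="first"):
    """Order-preserving deduplication of hashable rows, modeled on pandas' `keep` option."""
    if keep is False:
        # Drop every row that occurs more than once.
        counts = Counter(rows)
        return [r for r in rows if counts[r] == 1]
    if keep == "first":
        seen = set()
        out = []
        for r in rows:
            if r not in seen:
                seen.add(r)
                out.append(r)
        return out
    if keep == "last":
        # Record the index of the last occurrence of each row, then keep
        # only the rows sitting at those indices, in their original order.
        last_index = {r: i for i, r in enumerate(rows)}
        return [r for i, r in enumerate(rows) if last_index[r] == i]
    raise ValueError(f"unsupported keep value: {keep!r}")


rows = [("b", 1), ("a", 2), ("b", 1), ("c", 3), ("a", 2)]
print(drop_duplicates_sketch(rows, keep="first"))  # [('b', 1), ('a', 2), ('c', 3)]
print(drop_duplicates_sketch(rows, keep="last"))   # [('b', 1), ('c', 3), ('a', 2)]
print(drop_duplicates_sketch(rows, keep=False))    # [('c', 3)]
```

In every case the surviving rows stay in their original relative order, which a sort-based deduplication cannot guarantee.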

@github-actions

github-actions bot commented Oct 1, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue May 23, 2023
Depends on #13392.

Closes #11638
Closes #12449
Closes #11230
Closes #5286

This PR re-implements Python's `DataFrame.drop_duplicates` / `Series.drop_duplicates` to use the `stable_distinct` algorithm.

This fixes a large number of correctness issues (results are now ordered the same way as in pandas) and also improves performance by eliminating a sorting step.

As a consequence of changing the behavior of `drop_duplicates`, a lot of refactoring was needed. The `drop_duplicates` function was used to implement `unique()`, which cascaded into changes for several groupby functions, one-hot encoding, `np.unique` array function dispatches, and more. Those downstream functions relied on the sorting order of `drop_duplicates` and `unique`, which is _not_ promised by pandas.
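
The kind of downstream breakage described above can be sketched in plain Python. The example assumes a one-hot encoder that derives its columns from the unique values. The helper names `unique_in_order` and `one_hot` are hypothetical and do not correspond to cuDF functions.

```python
def unique_in_order(values):
    """Unique values in order of first appearance (no sorting)."""
    return list(dict.fromkeys(values))


def one_hot(values, categories):
    """One row per value, one 0/1 column per category, in the given category order."""
    return [[int(v == c) for c in categories] for v in values]


colors = ["red", "blue", "red", "green"]

# Column order now follows first appearance rather than sorted order.
appearance_order = unique_in_order(colors)
print(appearance_order)  # ['red', 'blue', 'green']

# Code that needs sorted columns has to request the sort explicitly
# instead of relying on the deduplication step to provide it.
sorted_order = sorted(appearance_order)
print(sorted_order)                      # ['blue', 'green', 'red']
print(one_hot(colors, sorted_order)[0])  # [0, 0, 1]
```

Making the sort explicit where it is actually required keeps such callers correct regardless of how the deduplication step orders its output.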

Authors:
  - https://github.com/brandon-b-miller
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Matthew Roeschke (https://github.com/mroeschke)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11656