[FEA] Expose `stable_distinct` to python #11638

brandon-b-miller · 2022-09-01T13:22:02Z

Is your feature request related to a problem? Please describe.
Sometimes, cuDF needs to deduplicate data internally in a way that preserves the original order of the elements. This is often the case when replicating pandas semantics exactly is important. In #11597, a small function was proposed to do the same thing in Python, but there is an existing libcudf API that we should prefer.
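For readers unfamiliar with the term, here is a minimal pure-Python sketch of what "stable" (order-preserving) deduplication means, contrasted with a sort-based approach. The function names are hypothetical and chosen for illustration only; this is not the libcudf implementation or the workaround from #11597.

```python
def stable_distinct_sketch(values):
    """Return unique values in order of first occurrence (illustrative only)."""
    # dict preserves insertion order (Python 3.7+), so the first occurrence
    # of each value fixes its position in the result.
    return list(dict.fromkeys(values))


def sorted_distinct_sketch(values):
    """Return unique values in sorted order, losing the original order."""
    return sorted(set(values))


data = ["b", "a", "b", "c", "a"]
stable = stable_distinct_sketch(data)
by_sort = sorted_distinct_sketch(data)
```

With this input, the stable version keeps `"b"` ahead of `"a"` because `"b"` appears first, while the sort-based version reorders the values.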

Describe the solution you'd like
Move this function from `detail` to the public `cudf` namespace, create Cython bindings, and plumb them through to the same places where #11597 implements the workaround. Create tests and benchmarks.

Describe alternatives you've considered
The linked PR above implements a workaround.

Additional context
We could use this to allow users to opt in to pandas drop_duplicates() semantics exactly via a kwarg or configuration setting. It also may be necessary to make cuDF work with dask in some situations where dask calls drop_duplicates() and expects the pandas result.
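As a rough illustration of the pandas semantics in question, the sketch below mimics the `keep` options of pandas `drop_duplicates` (`"first"`, `"last"`, `False`) on a plain Python list, preserving the original order of the surviving elements. The function name is hypothetical, and this says nothing about how cuDF would expose such an option.

```python
from collections import Counter


def drop_duplicates_sketch(values, keep="first"):
    """Order-preserving dedup on a list, mimicking pandas' `keep` options."""
    values = list(values)
    if keep == "first":
        seen = set()
        out = []
        for v in values:
            if v not in seen:
                seen.add(v)
                out.append(v)
        return out
    if keep == "last":
        # Keep-first on the reversed list, then restore the original direction.
        return drop_duplicates_sketch(values[::-1], keep="first")[::-1]
    if keep is False:
        # Drop every value that occurs more than once.
        counts = Counter(values)
        return [v for v in values if counts[v] == 1]
    raise ValueError("keep must be 'first', 'last', or False")


data = ["b", "a", "b", "c", "a"]
first = drop_duplicates_sketch(data, keep="first")
last = drop_duplicates_sketch(data, keep="last")
none_kept = drop_duplicates_sketch(data, keep=False)
```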


github-actions · 2022-10-01T14:05:34Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Depends on #13392. Closes #11638. Closes #12449. Closes #11230. Closes #5286.

This PR re-implements the Python `DataFrame.drop_duplicates` / `Series.drop_duplicates` methods to use the `stable_distinct` algorithm. This fixes a large number of correctness issues (results are now ordered the same way as in pandas) and also improves performance by eliminating a sorting step.

As a consequence of changing the behavior of `drop_duplicates`, a lot of refactoring was needed. The `drop_duplicates` function was used to implement `unique()`, which cascaded into changes for several groupby functions, one-hot encoding, `np.unique` array function dispatches, and more. Those downstream functions relied on the sorted order of `drop_duplicates` and `unique`, which is _not_ promised by pandas.

Authors:
- https://github.com/brandon-b-miller
- Bradley Dice (https://github.com/bdice)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Matthew Roeschke (https://github.com/mroeschke)
- Nghia Truong (https://github.com/ttnghia)

URL: #11656
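To illustrate the kind of downstream adjustment the PR description refers to, here is a hedged pure-Python sketch (not cuDF code; the function name is hypothetical): a caller that needs sorted categories has to sort explicitly once the underlying unique step returns values in first-occurrence order.

```python
def one_hot_sketch(values):
    """Build one-hot rows with columns in sorted category order."""
    # Stable unique: categories come back in first-occurrence order.
    categories = list(dict.fromkeys(values))
    # Sort explicitly rather than relying on the unique step to sort.
    categories.sort()
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


categories, rows = one_hot_sketch(["b", "a", "b"])
```

The explicit `sort()` makes the column order independent of the order in which values happen to appear in the input.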

brandon-b-miller added the libcudf, Python, and Cython labels Sep 1, 2022

brandon-b-miller self-assigned this Sep 1, 2022

This was referenced Sep 1, 2022

Preserve order if necessary when deduping categoricals internally #11597

Merged

Implement Python drop_duplicates with cudf::stable_distinct. #11656

Merged

github-actions bot added the inactive-30d label Oct 1, 2022

GregoryKimball removed the inactive-30d label Apr 2, 2023

rapids-bot bot closed this as completed in #11656 May 23, 2023
