Is your feature request related to a problem? Please describe.
Sometimes cuDF needs to deduplicate data internally in a way that preserves the original order of the elements. This often comes up where replicating pandas semantics exactly is important. In #11597, a small function was proposed to do this in Python, but a libcudf API already exists that we should prefer.
Describe the solution you'd like
Move this function from `detail` to `cudf`, create Cython bindings, and plumb them through to the same places where #11597 implements the workaround. Create tests and benchmarks.
Describe alternatives you've considered
The linked PR above implements a workaround.
Additional context
We could use this to allow users to opt in to pandas drop_duplicates() semantics exactly via a kwarg or configuration setting. It also may be necessary to make cuDF work with dask in some situations where dask calls drop_duplicates() and expects the pandas result.
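For reference, the order-preserving behavior being asked for can be sketched in plain Python. This is only an illustration of the semantics, not the libcudf or cuDF implementation: the function name `stable_drop_duplicates` is hypothetical, and the `keep` values mirror the `"first"`, `"last"`, and `False` options of pandas `drop_duplicates()`.

```python
from collections import Counter


def stable_drop_duplicates(values, keep="first"):
    """Hypothetical helper: drop duplicates while preserving input order."""
    if keep == "first":
        seen = set()
        out = []
        for v in values:
            # Emit a value only the first time it is encountered.
            if v not in seen:
                seen.add(v)
                out.append(v)
        return out
    if keep == "last":
        # Record the index of the final occurrence of each value,
        # then emit the surviving elements in their original order.
        last_index = {v: i for i, v in enumerate(values)}
        return [v for i, v in enumerate(values) if last_index[v] == i]
    if keep is False:
        # Drop every value that occurs more than once.
        counts = Counter(values)
        return [v for v in values if counts[v] == 1]
    raise ValueError("keep must be 'first', 'last', or False")


data = [3, 1, 3, 2, 1]
first_kept = stable_drop_duplicates(data, keep="first")
last_kept = stable_drop_duplicates(data, keep="last")
none_kept = stable_drop_duplicates(data, keep=False)
```

In all three modes the surviving elements keep their relative input order, which is the property a sort-based deduplication does not provide.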
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
Depends on #13392.
Closes #11638
Closes #12449
Closes #11230
Closes #5286
This PR re-implements Python's `DataFrame.drop_duplicates` / `Series.drop_duplicates` to use the `stable_distinct` algorithm.
This fixes a large number of correctness issues (results are now ordered the same way as pandas) and also improves performance by eliminating a sorting step.
As a consequence of changing the behavior of `drop_duplicates`, a lot of refactoring was needed. The `drop_duplicates` function was used to implement `unique()`, which cascaded into changes for several groupby functions, one-hot encoding, `np.unique` array function dispatches, and more. Those downstream functions relied on the sorting order of `drop_duplicates` and `unique`, which is _not_ promised by pandas.
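The kind of downstream breakage described here can be illustrated with a small pure-Python sketch. It is an assumption-laden analogy, not cuDF code: `sort_based_unique` and `order_preserving_unique` are hypothetical stand-ins for the old and new behavior, and the "caller" is a made-up encoding step that wants sorted categories.

```python
def sort_based_unique(values):
    """Hypothetical stand-in for the old behavior: sort, then drop adjacent repeats."""
    out = []
    for v in sorted(values):
        if not out or out[-1] != v:
            out.append(v)
    return out


def order_preserving_unique(values):
    """Hypothetical stand-in for the new behavior: keep first occurrences in input order."""
    # dict preserves insertion order, so this keeps the first occurrence of each value.
    return list(dict.fromkeys(values))


codes = ["b", "a", "b", "c", "a"]

sorted_result = sort_based_unique(codes)
stable_result = order_preserving_unique(codes)

# A caller that silently relied on sorted output must now sort explicitly.
categories = sorted(stable_result)
```

Code that needs sorted categories has to request that ordering itself instead of inheriting it as a side effect of deduplication.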
Authors:
- https://github.com/brandon-b-miller
- Bradley Dice (https://github.com/bdice)
Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Matthew Roeschke (https://github.com/mroeschke)
- Nghia Truong (https://github.com/ttnghia)
URL: #11656