[QST] Why is df.groupby 100 times faster than df.drop_duplicates? #12449
Comments
I found it. It's possible that |
@infzo Thanks for reporting the performance issue. As you pointed out, cudf/python/cudf/cudf/_lib/stream_compaction.pyx Lines 172 to 179 in 95a348b
@ttnghia Correct me if I was wrong, but I recall at one point you were trying to update the python bindings to use |
Yes, I was doing so in #11230, but @brandon-b-miller later started similar work (#11656) that should cover it, so I stopped. @brandon-b-miller, how is PR #11656 going?
Depends on #13392.

Closes #11638
Closes #12449
Closes #11230
Closes #5286

This PR re-implements Python's `DataFrame.drop_duplicates` / `Series.drop_duplicates` to use the `stable_distinct` algorithm. This fixes a large number of correctness issues (results are now ordered the same way as in pandas) and also improves performance by eliminating a sorting step.

As a consequence of changing the behavior of `drop_duplicates`, a lot of refactoring was needed. The `drop_duplicates` function was used to implement `unique()`, which cascaded into changes for several groupby functions, one-hot encoding, `np.unique` array function dispatches, and more. Those downstream functions relied on the sorting order of `drop_duplicates` and `unique`, which is _not_ promised by pandas.

Authors:
- https://github.com/brandon-b-miller
- Bradley Dice (https://github.com/bdice)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Matthew Roeschke (https://github.com/mroeschke)
- Nghia Truong (https://github.com/ttnghia)

URL: #11656
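To illustrate the behavior change described in the PR, here is a minimal pure-Python sketch contrasting an order-preserving ("stable") distinct with a sort-based one. It is not cuDF's implementation: the function names `stable_distinct_sketch` and `sort_based_distinct_sketch` and the sample rows are hypothetical, and the sketch only assumes that rows are hashable and comparable.

```python
def stable_distinct_sketch(rows, keep="first"):
    """Hypothetical helper: drop duplicate rows while preserving input order."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    if keep == "last":
        # Keeping the last occurrence equals keeping the first occurrence
        # of the reversed input, then restoring the original direction.
        return stable_distinct_sketch(rows[::-1], keep="first")[::-1]
    seen = set()
    out = []
    for row in rows:
        if row not in seen:  # hash lookup, no sorting involved
            seen.add(row)
            out.append(row)
    return out


def sort_based_distinct_sketch(rows):
    """Hypothetical helper: de-duplicate by sorting, so output is in sorted order."""
    out = []
    for row in sorted(rows):
        if not out or out[-1] != row:
            out.append(row)
    return out


# Made-up sample data for illustration only
rows = [(3, "c"), (1, "a"), (3, "c"), (2, "b"), (1, "a")]

stable_first = stable_distinct_sketch(rows)               # input order, first occurrences
stable_last = stable_distinct_sketch(rows, keep="last")   # input order, last occurrences
sorted_unique = sort_based_distinct_sketch(rows)          # sorted order
```

Both approaches return the same set of rows, but only the sort-based one returns them sorted. Downstream code that needs sorted output after an order-preserving distinct has to sort explicitly, which is the kind of refactoring the PR describes.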
Following up on this issue because of its long discussion. The |
What is your question?
Why is `df.groupby` 100 times faster than `df.drop_duplicates`?