
[QST] Why is df.groupby 100 times faster than df.drop_duplicates? #12449

Closed
infzo opened this issue Dec 29, 2022 · 4 comments · Fixed by #11656
Labels
question Further information is requested

Comments

infzo commented Dec 29, 2022

What is your question?

Why is df.groupby 100 times faster than df.drop_duplicates?


>>> df
                                             col_0                                     col_1
0         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000004394  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000487
1         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000051  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000005893
2         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009403  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000002170
3         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000002504  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000004009
4         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000002857  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000001875
...                                            ...                                       ...
24999995  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000007188  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000004901
24999996  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000421  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000169
24999997  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000006764  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000007317
24999998  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000002301  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000001170
24999999  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009843  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000001078

[25000000 rows x 2 columns]
>>>
>>>
>>>
>>> time_list = []
>>> for i in range(10):
...     srt = time.time()
...     result = df.groupby(['col_0', 'col_1']).count().reset_index()
...     time_list.append(time.time() - srt)
...
>>> print(f'mean cost(str-80) {np.mean(time_list)} s, bandwidth: {len(df) / np.mean(time_list) / 10000} WRows/s')
mean cost(str-80) 0.10929956436157226 s, bandwidth: 22872.918246314206 WRows/s
>>> df.groupby(['col_0', 'col_1']).count().reset_index()
                                             col_0                                     col_1
0         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000008783  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000003094
1         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000008242  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000002540
2         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000006452  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000007991
3         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000003325  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000002243
4         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000055  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000001391
...                                            ...                                       ...
22121178  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000001029  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000007222
22121179  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000026  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000002086
22121180  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000001611  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000261
22121181  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000005565  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000004872
22121182  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009194  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000008534

[22121183 rows x 2 columns]
>>>
>>>
>>>
>>> time_list = []
>>> for i in range(10):
...     srt = time.time()
...     result = df.drop_duplicates(['col_0', 'col_1'])
...     time_list.append(time.time() - srt)
...
>>> print(f'mean cost(str-80) {np.mean(time_list)} s, bandwidth: {len(df) / np.mean(time_list) / 10000} WRows/s')
mean cost(str-80) 15.08334140777588 s, bandwidth: 165.7457676262092 WRows/s
>>> df.drop_duplicates(['col_0', 'col_1'])
                                             col_0                                     col_1
10598445  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000000  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000004
22017212  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000000  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000005
10658475  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000000  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000008
7030325   xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000000  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000030
1202416   xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000000  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000033
...                                            ...                                       ...
4969238   xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009999  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009979
23187814  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009999  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009981
10266634  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009999  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009989
7145974   xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009999  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009992
2216169   xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009999  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009995

[22121183 rows x 2 columns]
>>>
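
For reference, here is a self-contained version of the benchmark above: a minimal sketch, assuming cudf and CuPy are installed and a GPU is available. The data generation is a guess at how the original table was built (40-character keys with roughly 10,000 distinct suffixes per column), and the reported unit "WRows/s" appears to be 万 (ten-thousand) rows per second, hence the division by 10,000 in the transcript.

import time

import cupy as cp
import cudf

n_rows = 25_000_000
prefix = "xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_"  # 30-char filler, as in the report

def make_col():
    # ~10,000 distinct 40-char strings, mimicking the shape of the report's data
    codes = cudf.Series(cp.random.randint(0, 10_000, size=n_rows))
    return prefix + codes.astype("str").str.zfill(10)

df = cudf.DataFrame({"col_0": make_col(), "col_1": make_col()})

def bench(fn, repeats=10):
    # average wall-clock time over several runs
    total = 0.0
    for _ in range(repeats):
        start = time.time()
        fn()
        total += time.time() - start
    return total / repeats

print("groupby        :", bench(lambda: df.groupby(["col_0", "col_1"]).count().reset_index()), "s")
print("drop_duplicates:", bench(lambda: df.drop_duplicates(["col_0", "col_1"])), "s")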

infzo added the Needs Triage (Need team to review and classify) and question (Further information is requested) labels on Dec 29, 2022

infzo commented Dec 29, 2022

I found it. It looks like df.drop_duplicates is implemented via sorting.


PointKernel commented Dec 29, 2022

@infzo Thanks for reporting the performance issue. As you pointed out, drop_duplicates internally sorts the input first:

sorted_source_table = move(
    cpp_stable_sort_by_key(
        source_table_view,
        keys_view,
        column_order,
        null_precedence
    )
)

and then drops the duplicates, so that's probably where the performance degradation comes from.

@ttnghia Correct me if I'm wrong, but I recall that at one point you were trying to update the Python bindings to use distinct instead of the current stable sort + unique?
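
To make the algorithmic difference concrete, here is a small CPU-side analogy in plain Python; this is illustrative only, not the libcudf code (those are C++ GPU kernels). A sort-based dedup pays O(n log n) comparisons that each touch long string keys, while a hash-based distinct is a single O(n) pass:

def dedup_by_sort(rows):
    # old approach: stable sort, then keep the first element of each run.
    # O(n log n), and every comparison may scan long string keys.
    rows = sorted(rows)
    out = rows[:1]
    for row in rows[1:]:
        if row != out[-1]:
            out.append(row)
    return out  # output is in sorted order, like the old cudf result

def dedup_by_hash(rows):
    # new approach: hash-based distinct. One O(n) pass that preserves
    # first-occurrence order, matching pandas semantics.
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

rows = [("b", "y"), ("a", "x"), ("b", "y")]
assert dedup_by_sort(rows) == [("a", "x"), ("b", "y")]
assert dedup_by_hash(rows) == [("b", "y"), ("a", "x")]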


ttnghia commented Dec 29, 2022

Yes, I was doing so in #11230, but @brandon-b-miller later started similar work (#11656) that should cover it, so I stopped.

@brandon-b-miller How is the #11656 PR going?

rapids-bot bot pushed a commit that referenced this issue May 23, 2023
Depends on #13392.

Closes #11638
Closes #12449
Closes #11230
Closes #5286

This PR re-implements Python's `DataFrame.drop_duplicates` / `Series.drop_duplicates` to use the `stable_distinct` algorithm.

This fixes a large number of correctness issues (ordering the same way as pandas) and also improves performance by eliminating a sorting step.

As a consequence of changing the behavior of `drop_duplicates`, a lot of refactoring was needed. The `drop_duplicates` function was used to implement `unique()`, which cascaded into changes for several groupby functions, one-hot encoding, `np.unique` array function dispatches, and more. Those downstream functions relied on the sorting order of `drop_duplicates` and `unique`, which is _not_ promised by pandas.

Authors:
  - https://github.com/brandon-b-miller
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Matthew Roeschke (https://github.com/mroeschke)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11656
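
The ordering contract mentioned above is visible in the transcript: the old implementation returned rows sorted by key, whereas pandas keeps the first occurrence of each duplicate in the original row order. A minimal pandas illustration of the behavior the new stable_distinct-based implementation matches:

import pandas as pd

df = pd.DataFrame({"col_0": ["b", "a", "b"], "col_1": ["y", "x", "y"]})
print(df.drop_duplicates(["col_0", "col_1"]))
# first occurrences, original order preserved:
#   col_0 col_1
# 0     b     y
# 1     a     x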

bdice commented May 23, 2023

Following up on this issue because of its long discussion. The drop_duplicates algorithm was re-implemented in #11656 using libcudf's stable_distinct, which is a faster, hash-based, stably-ordered solution. The behavior should now match pandas and is about 3x faster than the previous implementation in my benchmarks for large tables.

bdice removed the Needs Triage (Need team to review and classify) label on Mar 4, 2024