
[QST] Why is df.groupby 100 times faster than df.drop_duplicates? #12449

Closed
infzo opened this issue Dec 29, 2022 · 4 comments · Fixed by #11656
Labels
question Further information is requested

Comments

infzo commented Dec 29, 2022

What is your question?

Why is df.groupby 100 times faster than df.drop_duplicates?


>>> df
                                             col_0                                     col_1
0         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000004394  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000487
1         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000051  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000005893
2         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009403  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000002170
3         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000002504  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000004009
4         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000002857  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000001875
...                                            ...                                       ...
24999995  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000007188  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000004901
24999996  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000421  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000169
24999997  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000006764  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000007317
24999998  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000002301  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000001170
24999999  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009843  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000001078

[25000000 rows x 2 columns]
>>>
>>>
>>>
>>> time_list = []
>>> for i in range(10):
...     srt = time.time()
...     result = df.groupby(['col_0', 'col_1']).count().reset_index()
...     time_list.append(time.time() - srt)
...
>>> print(f'mean cost(str-80) {np.mean(time_list)} s, bandwidth: {len(df) / np.mean(time_list) / 10000} WRows/s')
mean cost(str-80) 0.10929956436157226 s, bandwidth: 22872.918246314206 WRows/s
>>> df.groupby(['col_0', 'col_1']).count().reset_index()
                                             col_0                                     col_1
0         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000008783  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000003094
1         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000008242  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000002540
2         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000006452  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000007991
3         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000003325  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000002243
4         xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000055  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000001391
...                                            ...                                       ...
22121178  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000001029  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000007222
22121179  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000026  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000002086
22121180  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000001611  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000261
22121181  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000005565  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000004872
22121182  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009194  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000008534

[22121183 rows x 2 columns]
>>>
>>>
>>>
>>> time_list = []
>>> for i in range(10):
...     srt = time.time()
...     result = df.drop_duplicates(['col_0', 'col_1'])
...     time_list.append(time.time() - srt)
...
>>> print(f'mean cost(str-80) {np.mean(time_list)} s, bandwidth: {len(df) / np.mean(time_list) / 10000} WRows/s')
mean cost(str-80) 15.08334140777588 s, bandwidth: 165.7457676262092 WRows/s
>>> df.drop_duplicates(['col_0', 'col_1'])
                                             col_0                                     col_1
10598445  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000000  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000004
22017212  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000000  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000005
10658475  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000000  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000008
7030325   xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000000  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000030
1202416   xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000000  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000000033
...                                            ...                                       ...
4969238   xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009999  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009979
23187814  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009999  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009981
10266634  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009999  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009989
7145974   xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009999  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009992
2216169   xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009999  xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_0000009995

[22121183 rows x 2 columns]
>>>
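
For reference, here is a self-contained version of the benchmark above: a minimal sketch, assuming cudf and CuPy are installed and a GPU is available. The data generation is a guess at how the original table was built (40-character keys with roughly 10,000 distinct suffixes per column), and the reported unit "WRows/s" appears to be 万 (ten-thousand) rows per second, hence the division by 10,000 in the transcript.

import time

import cupy as cp
import cudf

n_rows = 25_000_000
prefix = "xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_"  # 30-char filler, as in the report

def make_col():
    # ~10,000 distinct 40-char strings, mimicking the shape of the report's data
    codes = cudf.Series(cp.random.randint(0, 10_000, size=n_rows))
    return prefix + codes.astype("str").str.zfill(10)

df = cudf.DataFrame({"col_0": make_col(), "col_1": make_col()})

def bench(fn, repeats=10):
    # average wall-clock time over several runs
    total = 0.0
    for _ in range(repeats):
        start = time.time()
        fn()
        total += time.time() - start
    return total / repeats

print("groupby        :", bench(lambda: df.groupby(["col_0", "col_1"]).count().reset_index()), "s")
print("drop_duplicates:", bench(lambda: df.drop_duplicates(["col_0", "col_1"])), "s")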

infzo added the Needs Triage (Need team to review and classify) and question (Further information is requested) labels on Dec 29, 2022

infzo commented Dec 29, 2022

I found it. It looks like df.drop_duplicates is implemented via sorting.


PointKernel commented Dec 29, 2022

@infzo Thanks for reporting the performance issue. As you pointed out, drop_duplicates internally sorts the input first:

sorted_source_table = move(
    cpp_stable_sort_by_key(
        source_table_view,
        keys_view,
        column_order,
        null_precedence
    )
)

and then drops the duplicates, so that's probably where the performance degradation comes from.

@ttnghia Correct me if I'm wrong, but I recall that at one point you were trying to update the Python bindings to use distinct instead of the current stable sort + unique?
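
To make the algorithmic difference concrete, here is a small CPU-side analogy in plain Python; this is illustrative only, not the libcudf code (those are C++ GPU kernels). A sort-based dedup pays O(n log n) comparisons that each touch long string keys, while a hash-based distinct is a single O(n) pass:

def dedup_by_sort(rows):
    # old approach: stable sort, then keep the first element of each run.
    # O(n log n), and every comparison may scan long string keys.
    rows = sorted(rows)
    out = rows[:1]
    for row in rows[1:]:
        if row != out[-1]:
            out.append(row)
    return out  # output is in sorted order, like the old cudf result

def dedup_by_hash(rows):
    # new approach: hash-based distinct. One O(n) pass that preserves
    # first-occurrence order, matching pandas semantics.
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

rows = [("b", "y"), ("a", "x"), ("b", "y")]
assert dedup_by_sort(rows) == [("a", "x"), ("b", "y")]
assert dedup_by_hash(rows) == [("b", "y"), ("a", "x")]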


ttnghia commented Dec 29, 2022

Yes, I was doing so in #11230, but @brandon-b-miller later started similar work (#11656) that should cover it, so I stopped.

@brandon-b-miller How is the #11656 PR going?

rapids-bot bot pushed a commit that referenced this issue May 23, 2023
Depends on #13392.

Closes #11638
Closes #12449
Closes #11230
Closes #5286

This PR re-implements Python's `DataFrame.drop_duplicates` / `Series.drop_duplicates` to use the `stable_distinct` algorithm.

This fixes a large number of correctness issues (ordering the same way as pandas) and also improves performance by eliminating a sorting step.

As a consequence of changing the behavior of `drop_duplicates`, a lot of refactoring was needed. The `drop_duplicates` function was used to implement `unique()`, which cascaded into changes for several groupby functions, one-hot encoding, `np.unique` array function dispatches, and more. Those downstream functions relied on the sorting order of `drop_duplicates` and `unique`, which is _not_ promised by pandas.

Authors:
  - https://github.com/brandon-b-miller
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Matthew Roeschke (https://github.com/mroeschke)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11656
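
The ordering contract mentioned above is visible in the transcript: the old implementation returned rows sorted by key, whereas pandas keeps the first occurrence of each duplicate in the original row order. A minimal pandas illustration of the behavior the new stable_distinct-based implementation matches:

import pandas as pd

df = pd.DataFrame({"col_0": ["b", "a", "b"], "col_1": ["y", "x", "y"]})
print(df.drop_duplicates(["col_0", "col_1"]))
# first occurrences, original order preserved:
#   col_0 col_1
# 0     b     y
# 1     a     x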

bdice commented May 23, 2023

Following up on this issue because of its long discussion. The drop_duplicates algorithm was re-implemented in #11656 using libcudf's stable_distinct, which is a faster, hash-based, stably-ordered solution. The behavior should now match pandas and is about 3x faster than the previous implementation in my benchmarks for large tables.

bdice removed the Needs Triage (Need team to review and classify) label on Mar 4, 2024