
Performance regression: uint64 groupby().min() in 0.25.1 vs. 0.24.2 #28122

Closed
deeplycloudy opened this issue Aug 23, 2019 · 10 comments · Fixed by #41859

deeplycloudy commented Aug 23, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
dft = pd.read_csv('pandas_gb_min_test.csv.gz', parse_dates=['split_lutevent_time_offset'])

# An unsigned dtype is slow (23 s) on 0.25.1, but fast on 0.24.2.
# Commenting out this line on 0.25.1 restores performance
dft['split_event_parent_event_id'] = dft.split_event_parent_event_id.astype('uint64')

for mdvn in dft.columns: print(mdvn, dft[mdvn].dtype, type(dft[mdvn]))
mindata = dft.groupby('xyidx').min()

Problem description

In pandas 0.25.1, the performance of df.groupby().min() is 30x slower than in 0.24.2 when one of the columns has a dtype of uint64.

I saved the exact dataframe min_data_in from my original code with min_data_in.to_csv('pandas_gb_min_test.csv'). The columns in the original code had the following data types:

>>> for mdvn in min_data_in.columns: print(mdvn, min_data_in[mdvn].dtype, type(min_data_in[mdvn]))

split_event_parent_event_id uint64 <class 'pandas.core.series.Series'>
split_lutevent_time_offset datetime64[ns] <class 'pandas.core.series.Series'>
split_event_mesh_x_idx int64 <class 'pandas.core.series.Series'>
split_event_mesh_y_idx int64 <class 'pandas.core.series.Series'>
split_lutevent_min_flash_area float64 <class 'pandas.core.series.Series'>
xyidx int64 <class 'pandas.core.series.Series'>

After loading with read_csv, the dtype of split_event_parent_event_id came back as a signed integer.

If I restore the uint64 dtype, v0.25.1 runtime is 23 s. Runtime for v0.24.2 is 0.7 s.

A signed dtype in v0.25.1 has the same performance as 0.24.2.
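Based on that observation, a possible workaround (just a sketch with synthetic data and a hypothetical column name, and assuming the IDs fit in the signed 64-bit range) is to cast the offending column to int64 before the groupby and back afterwards:

import numpy as np
import pandas as pd

# synthetic stand-in for the real data; 'parent_id' is a hypothetical column
n = 1_000_000
df = pd.DataFrame({
    'parent_id': np.random.randint(0, 1000, size=n).astype('uint64'),
    'xyidx': np.random.randint(0, 100000, size=n).astype('int64'),
})

# cast uint64 -> int64 so the groupby takes the fast path on 0.25.x ...
df['parent_id'] = df['parent_id'].astype('int64')
mindata = df.groupby('xyidx').min()

# ... and cast back if the unsigned dtype is needed downstream
mindata['parent_id'] = mindata['parent_id'].astype('uint64')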

The data file, cProfile capture, and snakeviz output are attached.

Output of pd.show_versions(): tested with 0.24.2 and 0.25.1.

pd0242.pdf
pd0251.pdf
pd0251.prof.zip
pd0242.prof.zip
pandas_gb_min_test.csv.gz

jbrockmendel (Member):

The data file, cProfile capture, and snakeviz output are attached.

This is better than nothing, but can you provide something we can copy/paste directly?

deeplycloudy (Author) commented Aug 24, 2019

The revised code, below, is sufficient to demonstrate the delta.
0.24.2 = 2 seconds
0.25.1 = 14 seconds

import numpy as np
import pandas as pd

n = 10000000
x = np.random.randint(0, 100000, size=n).astype('int64')
a = np.random.randint(0, 1000, size=n).astype('uint64')
df = pd.DataFrame({'a': a, 'x': x})
df.groupby('x').min()

The delta is even worse on the original dataset above, suggesting a dependence on the number of groups in the groupby.

The complete profile dump and visualization are attached.

pd0251synthetic.pdf
pd0242synthetic.pdf
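For reference, a profile like the attached ones can be captured with cProfile; a minimal sketch using the synthetic reproducer above (the output file name is arbitrary):

import cProfile
import pstats

import numpy as np
import pandas as pd

n = 10000000
x = np.random.randint(0, 100000, size=n).astype('int64')
a = np.random.randint(0, 1000, size=n).astype('uint64')
df = pd.DataFrame({'a': a, 'x': x})

# dump stats to a file that pstats (or snakeviz) can read
cProfile.run("df.groupby('x').min()", 'pd_groupby_min.prof')
pstats.Stats('pd_groupby_min.prof').sort_stats('cumulative').print_stats(15)

Running snakeviz on the resulting .prof file gives the visualization.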

jbrockmendel (Member):

Thanks.

It looks like in groupby.generic there is a cyfunc = self._get_cython_func(func_or_funcs) that is returning None. I'll have to look closer to be sure, but I expect it should be finding a cythonized implementation.

jbrockmendel (Member):

Yep, in groupby.groupby around L1372 there is a try/except with self._cython_agg_general(alias, alt=npfunc, **kwargs) that raises on master and succeeds in 0.24.2.

PR would be welcome.
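To illustrate the shape of that code path, here is a deliberately simplified, hypothetical sketch of the try/except dispatch (not the actual pandas source); the slowdown comes from landing in the fallback branch:

import numpy as np
import pandas as pd

def _toy_cython_min(grouped):
    # stand-in for the cythonized aggregation: pretend, as in 0.25.x,
    # that there is no specialization for uint64 columns
    if (grouped.obj.dtypes == np.uint64).any():
        raise NotImplementedError('no uint64 specialization')
    return grouped.min()

def grouped_min(df, key):
    grouped = df.groupby(key)
    try:
        return _toy_cython_min(grouped)           # fast path
    except NotImplementedError:
        # slow path: per-group reduction in Python
        return grouped.apply(lambda g: g.min())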

jbrockmendel (Member):

In _libs.groupby_helper, groupby_t needs to be amended to include uint64_t.

WillAyd added the Groupby, Performance, and Regression labels on Aug 24, 2019
WillAyd added this to the Contributions Welcome milestone on Aug 24, 2019

davidkimble commented May 1, 2020

I am noticing this issue too, but it is not exclusive to uint64: I'm also seeing the performance regression on bool and float32 column types, roughly 25x slower in 0.25.1+ than in 0.24.2.
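A rough way to check this across the dtypes mentioned in the thread (a sketch; timings are machine-dependent) is to time the same groupby with the value column cast to each dtype:

import timeit

import numpy as np
import pandas as pd

n = 1000000
x = np.random.randint(0, 100000, size=n).astype('int64')

for dtype in ['int64', 'uint64', 'bool', 'float32']:
    a = np.random.randint(0, 2, size=n).astype(dtype)
    df = pd.DataFrame({'a': a, 'x': x})
    t = timeit.timeit(lambda: df.groupby('x').min(), number=3)
    print(dtype, round(t / 3, 3), 's per groupby().min()')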

oscar6echo:

I have run into this problem too: any groupby on more than 7 or 8 columns takes forever. This performance regression is a wall. In practice, I stick to version 0.24.2 in scripts that rely on groupby operations.

smithto1 (Member) commented Jun 4, 2021

This issue is fixed in the latest release.

@WillAyd @jbrockmendel should we add an ASV check to monitor this?

0.24.2
806 ms ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

1.2.4
727 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using the minimal example below:

import numpy as np
import pandas as pd

n=10000000
x = np.random.randint(0, 100000, size=n).astype('int64')
a = np.random.randint(0, 1000, size=n).astype('uint64')
df = pd.DataFrame({'a':a, 'x':x})

print(pd.__version__)

%timeit df.groupby('x').min() 
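Regarding the ASV question above, a benchmark along these lines would catch a reoccurrence (a sketch following the class/setup/time_* conventions of pandas' asv_bench suite; the class and method names here are hypothetical):

import numpy as np
import pandas as pd

class GroupByMinUInt64:
    # hypothetical ASV benchmark: asv calls setup() before timing
    # each time_* method
    def setup(self):
        n = 1000000
        x = np.random.randint(0, 100000, size=n).astype('int64')
        a = np.random.randint(0, 1000, size=n).astype('uint64')
        self.df = pd.DataFrame({'a': a, 'x': x})

    def time_groupby_min_uint64(self):
        self.df.groupby('x').min()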

jbrockmendel (Member):

should we add an ASV check to monitor this?

go for it

smithto1 (Member) commented Jun 8, 2021

take

jreback modified the milestones: Contributions Welcome, 1.3 on Jun 8, 2021