
Performance regression: uint64 groupby().min() in 0.25.1 vs. 0.24.2 #28122

Closed
deeplycloudy opened this issue Aug 23, 2019 · 10 comments · Fixed by #41859

deeplycloudy commented Aug 23, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
dft = pd.read_csv('pandas_gb_min_test.csv.gz', parse_dates=['split_lutevent_time_offset'])

# An unsigned dtype is slow (23 s) on 0.25.1, but fast on 0.24.2.
# Commenting out this line on 0.25.1 restores performance
dft['split_event_parent_event_id'] = dft.split_event_parent_event_id.astype('uint64')

for mdvn in dft.columns: print(mdvn, dft[mdvn].dtype, type(dft[mdvn]))
mindata = dft.groupby('xyidx').min()

Problem description

In pandas 0.25.1, the performance of df.groupby().min() is 30x slower than in 0.24.2 when one of the columns has a dtype of uint64.

I saved the exact dataframe min_data_in from my original code with min_data_in.to_csv('pandas_gb_min_test.csv'). The columns in the original code had the following data types:

>>> for mdvn in min_data_in.columns: print(mdvn, min_data_in[mdvn].dtype, type(min_data_in[mdvn]))

split_event_parent_event_id uint64 <class 'pandas.core.series.Series'>
split_lutevent_time_offset datetime64[ns] <class 'pandas.core.series.Series'>
split_event_mesh_x_idx int64 <class 'pandas.core.series.Series'>
split_event_mesh_y_idx int64 <class 'pandas.core.series.Series'>
split_lutevent_min_flash_area float64 <class 'pandas.core.series.Series'>
xyidx int64 <class 'pandas.core.series.Series'>

After loading with read_csv, the dtype of split_event_parent_event_id came back as a signed integer.

If I restore the uint64 dtype, v0.25.1 runtime is 23 s. Runtime for v0.24.2 is 0.7 s.

A signed dtype in v0.25.1 has the same performance as 0.24.2.
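Based on that observation, a possible workaround (just a sketch with synthetic data and a hypothetical column name, and assuming the IDs fit in the signed 64-bit range) is to cast the offending column to int64 before the groupby and back afterwards:

import numpy as np
import pandas as pd

# synthetic stand-in for the real data; 'parent_id' is a hypothetical column
n = 1_000_000
df = pd.DataFrame({
    'parent_id': np.random.randint(0, 1000, size=n).astype('uint64'),
    'xyidx': np.random.randint(0, 100000, size=n).astype('int64'),
})

# cast uint64 -> int64 so the groupby takes the fast path on 0.25.x ...
df['parent_id'] = df['parent_id'].astype('int64')
mindata = df.groupby('xyidx').min()

# ... and cast back if the unsigned dtype is needed downstream
mindata['parent_id'] = mindata['parent_id'].astype('uint64')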

The data file, cProfile capture, and snakeviz output are attached.

Output of pd.show_versions(): tested with 0.24.2 and 0.25.1.

pd0242.pdf
pd0251.pdf
pd0251.prof.zip
pd0242.prof.zip
pandas_gb_min_test.csv.gz

jbrockmendel (Member):

The data file, cProfile capture, and snakeviz output are attached.

This is better than nothing, but can you provide something we can copy/paste directly?

deeplycloudy (Author) commented Aug 24, 2019

The revised code, below, is sufficient to demonstrate the delta.
0.24.2 = 2 seconds
0.25.1 = 14 seconds

import numpy as np
import pandas as pd

n = 10000000
x = np.random.randint(0, 100000, size=n).astype('int64')
a = np.random.randint(0, 1000, size=n).astype('uint64')
df = pd.DataFrame({'a': a, 'x': x})
df.groupby('x').min()

The delta is even worse on the original dataset above, suggesting a dependence on the number of groups in the groupby.

The complete profile dump and visualization are attached.

pd0251synthetic.pdf
pd0242synthetic.pdf
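For reference, a profile like the attached ones can be captured with cProfile; a minimal sketch using the synthetic reproducer above (the output file name is arbitrary):

import cProfile
import pstats

import numpy as np
import pandas as pd

n = 10000000
x = np.random.randint(0, 100000, size=n).astype('int64')
a = np.random.randint(0, 1000, size=n).astype('uint64')
df = pd.DataFrame({'a': a, 'x': x})

# dump stats to a file that pstats (or snakeviz) can read
cProfile.run("df.groupby('x').min()", 'pd_groupby_min.prof')
pstats.Stats('pd_groupby_min.prof').sort_stats('cumulative').print_stats(15)

Running snakeviz on the resulting .prof file gives the visualization.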

jbrockmendel (Member):

Thanks.

It looks like in groupby.generic there is a cyfunc = self._get_cython_func(func_or_funcs) that is returning None. I'll have to look closer to be sure, but I expect it should be finding a cythonized implementation.

jbrockmendel (Member):

Yep, in groupby.groupby around L1372 there is a try/except with self._cython_agg_general(alias, alt=npfunc, **kwargs) that raises on master and succeeds in 0.24.2.

PR would be welcome.
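To illustrate the shape of that code path, here is a deliberately simplified, hypothetical sketch of the try/except dispatch (not the actual pandas source); the slowdown comes from landing in the fallback branch:

import numpy as np
import pandas as pd

def _toy_cython_min(grouped):
    # stand-in for the cythonized aggregation: pretend, as in 0.25.x,
    # that there is no specialization for uint64 columns
    if (grouped.obj.dtypes == np.uint64).any():
        raise NotImplementedError('no uint64 specialization')
    return grouped.min()

def grouped_min(df, key):
    grouped = df.groupby(key)
    try:
        return _toy_cython_min(grouped)           # fast path
    except NotImplementedError:
        # slow path: per-group reduction in Python
        return grouped.apply(lambda g: g.min())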

jbrockmendel (Member):

In _libs.groupby_helper, groupby_t needs to be amended to include uint64_t.

WillAyd added the Groupby, Performance, and Regression labels on Aug 24, 2019
WillAyd added this to the Contributions Welcome milestone on Aug 24, 2019

davidkimble commented May 1, 2020

I am noticing this issue too, but it is not exclusive to uint64: I'm also seeing the performance regression on bool and float32 column types, roughly 25x slower in 0.25.1+ than in 0.24.2.
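A rough way to check this across the dtypes mentioned in the thread (a sketch; timings are machine-dependent) is to time the same groupby with the value column cast to each dtype:

import timeit

import numpy as np
import pandas as pd

n = 1000000
x = np.random.randint(0, 100000, size=n).astype('int64')

for dtype in ['int64', 'uint64', 'bool', 'float32']:
    a = np.random.randint(0, 2, size=n).astype(dtype)
    df = pd.DataFrame({'a': a, 'x': x})
    t = timeit.timeit(lambda: df.groupby('x').min(), number=3)
    print(dtype, round(t / 3, 3), 's per groupby().min()')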

oscar6echo:

I have run into this problem too: any groupby on more than 7 or 8 columns takes forever. This performance regression is a wall. In practice, I stick to version 0.24.2 in scripts that rely on groupby operations.

smithto1 (Member) commented Jun 4, 2021

This issue is fixed in the latest release.

@WillAyd @jbrockmendel should we add an ASV check to monitor this?

0.24.2
806 ms ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

1.2.4
727 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using the minimal example below:

import numpy as np
import pandas as pd

n=10000000
x = np.random.randint(0, 100000, size=n).astype('int64')
a = np.random.randint(0, 1000, size=n).astype('uint64')
df = pd.DataFrame({'a':a, 'x':x})

print(pd.__version__)

%timeit df.groupby('x').min() 
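Regarding the ASV question above, a benchmark along these lines would catch a reoccurrence (a sketch following the class/setup/time_* conventions of pandas' asv_bench suite; the class and method names here are hypothetical):

import numpy as np
import pandas as pd

class GroupByMinUInt64:
    # hypothetical ASV benchmark: asv calls setup() before timing
    # each time_* method
    def setup(self):
        n = 1000000
        x = np.random.randint(0, 100000, size=n).astype('int64')
        a = np.random.randint(0, 1000, size=n).astype('uint64')
        self.df = pd.DataFrame({'a': a, 'x': x})

    def time_groupby_min_uint64(self):
        self.df.groupby('x').min()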

jbrockmendel (Member):

should we add an ASV check to monitor this?

go for it

smithto1 (Member) commented Jun 8, 2021

take

jreback modified the milestones: Contributions Welcome, 1.3 on Jun 8, 2021