Performance regression uint64 groupby().min() in 0.25.1 vs. 0.24.2 #28122
Comments
This is better than nothing, but can you provide something we can copy/paste directly?
The revised code, below, is sufficient to demonstrate the delta.
The delta is even worse on the original dataset above, suggesting a dependence on the number of groups in the groupby. The complete profile dump and visualization are attached.
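One way to probe the suspected dependence on the number of groups is to time the same reduction at several group counts for a signed and an unsigned column. The sketch below is not the code from this thread; the helper name `time_groupby_min`, the column names, and the row and group counts are all assumptions chosen for illustration, and the timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def time_groupby_min(n_groups, dtype, n_rows=200_000, seed=0):
    """Time df.groupby("key").min() on synthetic data (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(
        {
            # Group labels drawn uniformly from [0, n_groups).
            "key": rng.integers(0, n_groups, size=n_rows),
            # Small non-negative values so every dtype under test can hold them.
            "value": rng.integers(0, 1_000_000, size=n_rows).astype(dtype),
        }
    )
    start = time.perf_counter()
    df.groupby("key").min()
    return time.perf_counter() - start


# Assumed grid of group counts and dtypes; adjust to match the real dataset.
timings = {
    (n_groups, dtype): time_groupby_min(n_groups, dtype)
    for n_groups in (10, 1_000, 10_000)
    for dtype in ("int64", "uint64")
}

for (n_groups, dtype), seconds in sorted(timings.items()):
    print(f"groups={n_groups:>6} dtype={dtype:<6} {seconds:.4f} s")
```

If the slow path is taken per group, the uint64 timings should grow with the group count much faster than the int64 timings on an affected version.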
Thanks. It looks like in groupby.generic there is a
Yep, groupby.group L 1372 there is a try/except. PR would be welcome.
In _libs.groupby_helper
I too am noticing this issue, but it is not exclusive to uint64. I'm also seeing the performance regression on bool and float32 column types: ~25x slower in 0.25.1+ than in 0.24.2.
I have run into this problem too: any groupby on more than 7 or 8 columns takes forever. This performance regression is a wall.
This issue is fixed in the latest release. @WillAyd @jbrockmendel should we add an ASV check to monitor this? Versions compared: 0.24.2 and 1.2.4, using the minimal example below:
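As a rough idea of what such an ASV check could look like, here is a minimal sketch of a parametrized benchmark class in the usual airspeed velocity style (`params`, `param_names`, `setup`, and a `time_*` method). It is not the commenter's example or the benchmark that was eventually added; the class name `GroupByMinDtypes`, the data sizes, and the choice of dtypes are assumptions.

```python
import numpy as np
import pandas as pd


class GroupByMinDtypes:
    """Hypothetical ASV-style benchmark: groupby().min() across value dtypes."""

    # ASV runs setup and each time_* method once per parameter value.
    params = ["int64", "uint64", "float32"]
    param_names = ["dtype"]

    def setup(self, dtype):
        rng = np.random.default_rng(0)
        n_rows, n_groups = 100_000, 1_000  # assumed sizes, not from the issue
        self.df = pd.DataFrame(
            {
                "key": rng.integers(0, n_groups, size=n_rows),
                "value": rng.integers(0, 1_000, size=n_rows).astype(dtype),
            }
        )

    def time_min(self, dtype):
        # Only the reduction is timed; data construction lives in setup.
        self.df.groupby("key").min()
```

Parametrizing over dtype means a regression confined to one dtype, as reported here for uint64, would show up as a single outlier instead of being averaged away.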
go for it
take
Code Sample, a copy-pastable example if possible
Problem description
In pandas 0.25.1, the performance of df.groupby().min() is 30x slower than in 0.24.2 when one of the columns has a dtype of uint64.
I saved the exact dataframe min_data_in from my original code with min_data_in.to_csv('pandas_gb_min_test.csv'). The columns in the original code had the following data types:
After loading with read_csv, the dtype of split_event_parent_event_id was signed.
If I restore the uint64 dtype, v0.25.1 runtime is 23 s. Runtime for v0.24.2 is 0.7 s.
A signed dtype in v0.25.1 has the same performance as 0.24.2.
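The dtype change on reload can be illustrated with a small round trip through CSV. This is a sketch on made-up data, not the attached file: the `group` column and the values are assumptions, and only the column name split_event_parent_event_id is taken from the report. It shows that read_csv infers a signed integer dtype for values that fit in int64, and that passing `dtype=` restores uint64.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for the original frame; values and the "group" key are invented.
original = pd.DataFrame(
    {
        "group": [0, 0, 1, 1],
        "split_event_parent_event_id": np.array([10, 20, 30, 40], dtype="uint64"),
    }
)

csv_text = original.to_csv(index=False)

# Default inference: small non-negative integers come back as int64.
inferred = pd.read_csv(io.StringIO(csv_text))

# Restoring the unsigned dtype explicitly at load time.
restored = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"split_event_parent_event_id": "uint64"},
)

print(inferred["split_event_parent_event_id"].dtype)
print(restored["split_event_parent_event_id"].dtype)

# The reduction gives the same minima either way; only the dtype differs.
min_signed = inferred.groupby("group").min()
min_unsigned = restored.groupby("group").min()
```

Timing the two groupby calls on the real dataset under each pandas version is what separates the fast signed path from the slow unsigned one described above.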
The data file, cProfile capture, and snakeviz output are attached.
Output of pd.show_versions()
0.24.2 and 0.25.1
pd0242.pdf
pd0251.pdf
pd0251.prof.zip
pd0242.prof.zip
pandas_gb_min_test.csv.gz