PERF: hash_pandas_object performance regression between 2.1.0 and 2.1.1 #55245
Comments
@jorisvandenbossche I think you and I are the only ones to have touched this code recently, but IIRC only EAs would have been affected?
@snazzer thanks for the report!
    import numpy as np
    import pandas as pd
    size = 10_000
    df = pd.DataFrame(np.random.normal(size=(50, size)))
    # %timeit is an IPython magic; run this in IPython or Jupyter
    %timeit pd.util.hash_pandas_object(df.reset_index())
But, from a quick profile, the slowdown does not seem to come from the actual hashing code itself, but from getting each row:
It also seems to be related to some caching issue. With the
But when removing that part of the reproducer, we only see the slowdown the first time when called, and so when timing multiple times, we don't see it anymore:
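To make the "slow only on the first call" observation easier to check, here is a minimal sketch that times the first call separately from later repeats. The helper name `time_first_and_repeat` and the `demo` function are hypothetical, not part of pandas. The `demo` function assumes the part of the reproducer in question is the `reset_index()` call, which is a guess because the comment above is cut off.

```python
import time


def time_first_and_repeat(func, repeats=5):
    """Return (first_call_seconds, best_repeat_seconds) for a zero-argument callable."""
    start = time.perf_counter()
    func()
    first = time.perf_counter() - start

    # Best time over the remaining calls, to show whether later calls get cheaper.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return first, best


def demo(size=10_000):
    # Imports are local so the helper above works without pandas installed.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.normal(size=(50, size)))

    # Variant 1: build a new frame on every call (assumed to match the reproducer).
    fresh = time_first_and_repeat(
        lambda: pd.util.hash_pandas_object(df.reset_index())
    )
    # Variant 2: hash the same frame object repeatedly.
    same = time_first_and_repeat(lambda: pd.util.hash_pandas_object(df))

    print("new frame each call (first, best repeat):", fresh)
    print("same frame each call (first, best repeat):", same)
```

Calling `demo()` under pandas 2.1.0 and 2.1.1 should show whether the gap is confined to the first call; the actual numbers will depend on the machine and the pandas build.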
@gupta-paras you're certainly welcome to help, but it might require some more investigation first.
So this is another manifestation of #55008 (comment), I think.
@jorisvandenbossche, I have done some initial investigation and found that the section below is particularly slow. It affects the iget:
Indeed, see #55008 (comment) for the context (that's the PR that caused this). There is already a PR open to address this in the meantime: #55518
Exactly, disabling "_clear_dead_references" made the program run faster. Thanks @jorisvandenbossche
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Given the following example:
The runtimes between version 2.1.0 and 2.1.1 are drastically different, with 2.1.1 showing polynomial/exponential runtime.
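One way to put a number on "polynomial/exponential" is to time the call at several frame widths and fit the slope of log(time) against log(size). The sketch below is only an illustration: `estimate_growth_exponent` and `measure` are hypothetical helpers, and `measure` uses the reproducer posted in the comments, not the original example from this report.

```python
import math


def estimate_growth_exponent(sizes, seconds):
    """Least-squares slope of log(seconds) against log(sizes).

    A slope near 1 suggests linear scaling, near 2 quadratic scaling.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in seconds]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def measure(sizes=(1_000, 2_000, 4_000, 8_000)):
    # Imports are local so the fitting helper works without pandas installed.
    import time

    import numpy as np
    import pandas as pd

    timings = []
    for size in sizes:
        # Frame shape taken from the reproducer in the comments (50 rows, `size` columns).
        df = pd.DataFrame(np.random.normal(size=(50, size))).reset_index()
        start = time.perf_counter()
        pd.util.hash_pandas_object(df)
        timings.append(time.perf_counter() - start)
    return estimate_growth_exponent(sizes, timings)
```

Running `measure()` under each pandas version gives a rough exponent to compare. A single timing per size is noisy, so treat the result as indicative only.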
Installed Versions
For 2.1.0
For 2.1.1
Prior Performance
Same example as above. 2.1.0 is fine. 2.1.1 is not.