[Python] Massive performance deterioration with pandas 2.1.1 vs. 1.5.3 when calling pa.Table.from_pandas() #38260
Comments
Hi, could you share the version of pyarrow used and the Python version?
I am able to reproduce it locally. The behavior changes only when updating python/pandas. I'm not sure what the root cause is though.
Here's the modified version of the above script I used:
Here are two flamegraphs produced with py-spy. It looks like the major difference between the two is the proportion of time spent in column lookup.
One thing I noticed:
This is the massively slower part of the code, depending on the pandas version:
The column lookup is the slow part. From this investigation, it seems like the bug is on pandas' end. I will check whether there is an existing open issue.
I was going to comment yesterday that this is quite likely an issue on the pandas side, which has in the meantime been confirmed by the comments above. Coincidentally, I was just looking at a perf regression report in pandas (pandas-dev/pandas#55245) that shows the same culprit as the py-spy image from @amoeba above. So it is indeed repeated column lookup in wide dataframes that has become significantly slower. It's a regression from 2.1.0 to 2.1.1, caused by pandas-dev/pandas#55008 (comment).
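The regression is visible without pyarrow at all. A hypothetical micro-benchmark (the column count is my own choice) that performs one lookup per column, which is essentially what Table.from_pandas does internally:

```python
import time

import numpy as np
import pandas as pd

# Hypothetical micro-benchmark: one Series extraction per column,
# mirroring the per-column lookup done during conversion.
num_cols = 5000
df = pd.DataFrame(np.zeros((5, num_cols)),
                  columns=[f"c{i}" for i in range(num_cols)])

start = time.perf_counter()
for name in df.columns:
    series = df[name]  # repeated column lookup: the suspected hot path
elapsed = time.perf_counter() - start
print(f"{num_cols} lookups took {elapsed:.3f}s on pandas {pd.__version__}")
```

Comparing this loop's runtime between pandas 2.1.0 and 2.1.1 should show the same slowdown as the full conversion.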
Thanks for the thorough investigation everyone, great work! If this is a confirmed pandas issue, should we close this?
Describe the bug, including details regarding any error messages, version, and platform.
We experience a massive drop in performance with pandas 2.1.1 vs. pandas 1.5.3 when invoking pa.Table.from_pandas().
In this example, the conversion time increased from roughly 2.9 seconds to 16.2 seconds. In our data application the problem is even more dramatic since the dataframe is larger; the runtime seems very sensitive to the number of columns. Doubling the number of columns yields roughly 4x the compute time (num_cols=20000 vs. num_cols=40000). With pandas 1.5.3 the compute time is closer to linear in the number of columns. Not sure if this should also be raised with pandas.
Component(s)
Python