
[Python] Massive performance deterioration with pandas 2.1.1 vs. 1.5.3 when calling pa.Table.from_pandas() #38260

Closed
MMCMA opened this issue Oct 13, 2023 · 8 comments



MMCMA commented Oct 13, 2023

Describe the bug, including details regarding any error messages, version, and platform.

We experience a massive drop in performance with pandas 2.1.1 vs. pandas 1.5.3 when invoking pa.Table.from_pandas().
In this example, the conversion time increased from roughly 2.9 seconds to 16.2 seconds. In our data application the problem is even more dramatic since the dataframe is larger. The slowdown seems very sensitive to the number of columns: doubling the column count yields roughly 4x the compute time (num_cols=20000 vs. num_cols=40000), i.e. quadratic rather than linear scaling. With pandas 1.5.3 the compute time grows roughly linearly with the number of columns. Not sure if this should also be raised with pandas.

import pyarrow as pa
import pandas as pd
import numpy as np
import timeit

num_cols = 20000
num_dates = 8800
dates = pd.date_range(start='19900101', freq='b', periods=num_dates)
data = np.random.randint(low=0, high=10, size=(num_dates, num_cols))
df = pd.DataFrame(data, index=dates)

tic = timeit.default_timer()
pa.Table.from_pandas(df, preserve_index=True)
total_time = timeit.default_timer() - tic
print(f'Conversion from pandas to pyarrow took {total_time} seconds')

Component(s)

Python

assignUser added the label Priority: Blocker (marks a blocker for the release) on Oct 13, 2023
raulcd (Member) commented Oct 13, 2023

Hi, could you share the version of pyarrow used and the Python version?

danepitkin (Member) commented

I am able to reproduce it locally. The behavior changes only when updating Python and pandas; I'm not sure what the root cause is, though.

$ python arrow-38260.py 20000
python version:  3.9.18
pyarrow version: 13.0.0
pandas version:  1.5.3
numpy version:   1.26.0
Conversion from pandas to pyarrow took 1.114816792 seconds for 20000 columns
$ python arrow-38260.py 40000
python version:  3.9.18
pyarrow version: 13.0.0
pandas version:  1.5.3
numpy version:   1.26.0
Conversion from pandas to pyarrow took 2.4374076250000005 seconds for 40000 columns

$ python arrow-38260.py 20000
python version:  3.12.0
pyarrow version: 13.0.0
pandas version:  2.1.1
numpy version:   1.26.0
Conversion from pandas to pyarrow took 5.036314583034255 seconds for 20000 columns
$ python arrow-38260.py 40000
python version:  3.12.0
pyarrow version: 13.0.0
pandas version:  2.1.1
numpy version:   1.26.0
Conversion from pandas to pyarrow took 19.435286541993264 seconds for 40000 columns

Here's the modified version of the above script I used:

import argparse
import platform
import timeit

import numpy as np
import pandas as pd
import pyarrow as pa

parser = argparse.ArgumentParser()
parser.add_argument("num_cols", type=int)
args = parser.parse_args()

num_cols = args.num_cols
num_dates = 8800
dates = pd.date_range(start='19900101', freq='b', periods=num_dates)
data = np.random.randint(low=0, high=10, size=(num_dates, num_cols))
df = pd.DataFrame(data, index=dates)

tic = timeit.default_timer()
pa.Table.from_pandas(df, preserve_index=True)
total_time = timeit.default_timer() - tic
print(f'python version:  {platform.python_version()}')
print(f'pyarrow version: {pa.__version__}')
print(f'pandas version:  {pd.__version__}')
print(f'numpy version:   {np.__version__}')
print(f'Conversion from pandas to pyarrow took {total_time} seconds for {num_cols} columns')

amoeba (Member) commented Oct 13, 2023

Here are two flamegraphs produced from py-spy. It looks like the major difference between the two is in the proportion of time spent in dataframe_to_arrays.
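For anyone who wants to reproduce these, a py-spy invocation along the following lines should record an equivalent flamegraph (a sketch, exact flags may vary; it assumes the arrow-38260.py reproduction script from the previous comment):

py-spy record -o flamegraph.svg -- python arrow-38260.py 20000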

python version:  3.11.6
pyarrow version: 11.0.0
pandas version:  1.5.3
numpy version:   1.26.0
Conversion from pandas to pyarrow took 1.1433022079290822 seconds for 20000 columns

[flamegraph image: pandas 1.5.3, 20000 columns]

python version:  3.11.6
pyarrow version: 13.0.0
pandas version:  2.1.1
numpy version:   1.26.0
Conversion from pandas to pyarrow took 3.7711586660007015 seconds for 20000 columns

[flamegraph image: pandas 2.1.1, 20000 columns]

anjakefala (Collaborator) commented

One thing I noticed: _get_columns_to_convert (pandas_compat.py:573) is significantly faster in 1.5.3. Doing more digging!

anjakefala (Collaborator) commented

This is the part of the code that is massively slower, depending on the pandas version:

for name in columns:
    tic = timeit.default_timer()
    col = df[name]
    name = _column_name_to_strings(name)

    if _pandas_api.is_sparse(col):
        raise TypeError(
            "Sparse pandas data (column {}) not supported.".format(name))

    columns_to_convert.append(col)
    convert_fields.append(None)
    column_names.append(name)

anjakefala (Collaborator) commented Oct 13, 2023

The column lookup (col = df[name]) is the source of the performance drop. It is roughly three orders of magnitude slower. We are talking about a difference of maybe 100 microseconds per lookup, which adds up to seconds when you have tens of thousands of columns.

From this investigation, it seems like the bug is on pandas' side. I will check whether there is an existing open issue.
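To see the lookup cost in isolation, a minimal timing sketch along these lines (the frame sizes here are illustrative, not taken from the thread) shows the difference between the two pandas versions:

import timeit

import numpy as np
import pandas as pd

# A wide frame, similar in shape to the reproduction above.
df = pd.DataFrame(np.random.randint(0, 10, size=(100, 20000)))

# One lookup per column; under pandas 2.1.1 this loop is dramatically
# slower than under 1.5.3, even though each individual lookup is cheap.
elapsed = timeit.timeit(lambda: [df[name] for name in df.columns], number=1)
print(f'{elapsed:.3f} s for {df.shape[1]} column lookups')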

jorisvandenbossche (Member) commented

I was going to comment yesterday that this is quite likely an issue on the pandas side, which has in the meantime been confirmed by the comments above. Coincidentally, I was just looking at a perf regression report in pandas (pandas-dev/pandas#55245) that shows the same culprit as the py-spy image from @amoeba above: Manager.iget, which is what is used under the hood to access a column.

So it is indeed repeated column lookup in wide dataframes that has become significantly slower. It is a regression from 2.1.0 to 2.1.1, caused by pandas-dev/pandas#55008 (comment).
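Until the pandas regression is fixed, one possible interim workaround (a sketch, not something proposed in this thread) is to bypass the per-column df[name] lookups and build the table straight from the underlying NumPy array, which works here because the example frame is a single homogeneous integer block:

import numpy as np
import pandas as pd
import pyarrow as pa

num_cols, num_dates = 20000, 8800
dates = pd.date_range(start='19900101', freq='b', periods=num_dates)
data = np.random.randint(low=0, high=10, size=(num_dates, num_cols))
df = pd.DataFrame(data, index=dates)

# Slice columns out of the NumPy block directly instead of going through
# the slow DataFrame column lookup, then append the index as a column.
arrays = [pa.array(data[:, i]) for i in range(num_cols)]
arrays.append(pa.array(df.index))
names = [str(c) for c in df.columns] + ['index']
table = pa.Table.from_arrays(arrays, names=names)

Note this drops the pandas schema metadata that from_pandas attaches, so it only fits cases where the raw columns plus the index are enough.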

assignUser (Member) commented

Thanks for the thorough investigation, everyone; great work! If this is a confirmed pandas issue, should we close this?
