
[Python] pyarrow.parquet.read_* should use pre_buffer=True #28218

Closed

asfimport opened this issue Apr 16, 2021 · 7 comments

If the user is synchronously reading a single file, we should try to read it as fast as possible. The one sticking point might be whether it's beneficial to enable this regardless of the filesystem, or whether we should only enable it on high-latency filesystems.
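
To make the proposal concrete, here is a minimal sketch of what opting in looks like explicitly, assuming a PyArrow build whose pyarrow.parquet.read_table exposes the pre_buffer flag (the S3 path is the public dataset used in the benchmarks below):

import pyarrow.parquet as pq

# Explicitly enable pre-buffering (coalesced column-chunk reads), which this
# issue proposes turning on by default in the read_* convenience functions.
table = pq.read_table(
    "s3://ursa-labs-taxi-data/2012/01/data.parquet",
    pre_buffer=True,
)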

Reporter: David Li / @lidavidm
Assignee: David Li / @lidavidm

PRs and other links:

Note: This issue was originally created as ARROW-12428. Please see the migration documentation for further details.

David Li / @lidavidm:
Here's a quick comparison between Pandas/S3FS and PyArrow with a pre_buffer option implemented:


Python: 3.9.2
Pandas: 1.2.3
PyArrow: 5.0.0 master (9c1e5bd19347635ea9f373bcf93f2cea0231d50a)

Pandas/S3FS: 107.31099020410329 seconds
Pandas/S3FS (no readahead): 676.9701101030223 seconds
PyArrow: 213.81073790509254 seconds
PyArrow (pre-buffer): 29.330630503827706 seconds
Pandas/S3FS (pre-buffer): 54.61801828909665 seconds
Pandas/S3FS (pre-buffer, no readahead): 46.7531590978615 seconds

# Benchmark script. pre_buffer relies on the patched PyArrow build noted above;
# pandas forwards extra read_parquet keyword arguments to the pyarrow engine.
import time
import pandas as pd
import pyarrow.parquet as pq

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet")
duration = time.monotonic() - start
print("Pandas/S3FS:", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={
    'default_block_size': 1,  # 0 is ignored
    'default_fill_cache': False,
})
duration = time.monotonic() - start
print("Pandas/S3FS (no readahead):", duration, "seconds")

start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet")
duration = time.monotonic() - start
print("PyArrow:", duration, "seconds")

start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=True)
duration = time.monotonic() - start
print("PyArrow (pre-buffer):", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer):", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={
    'default_block_size': 1,  # 0 is ignored
    'default_fill_cache': False,
}, pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer, no readahead):", duration, "seconds")

David Li / @lidavidm:
And for local files, to confirm that pre_buffer isn't a negative:


Pandas: 14.584974920144305 seconds
PyArrow: 6.650648137088865 seconds
PyArrow (pre-buffer): 6.587288308190182 seconds

This is on a system with NVMe storage, so results may differ on spinning disks or SATA SSDs.

(Results updated: the file is now read once, untimed, before the measurement, in case the disk cache is a factor.)
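
For reference, a minimal sketch of that warm-up-then-measure pattern, assuming a local copy of the file at ./data.parquet (hypothetical path) and a PyArrow build that accepts pre_buffer:

import time
import pyarrow.parquet as pq

path = "./data.parquet"  # hypothetical local copy of the benchmark file

# Warm-up read so the OS page cache is in a comparable state for both runs.
pq.read_pandas(path).to_pandas()

for pre_buffer in (False, True):
    start = time.monotonic()
    pq.read_pandas(path, pre_buffer=pre_buffer).to_pandas()
    duration = time.monotonic() - start
    print(f"PyArrow (pre_buffer={pre_buffer}):", duration, "seconds")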

David Li / @lidavidm:
Finally, if we perform column selection, fsspec's readahead is actually extremely detrimental:


Pandas/S3FS (no pre-buffer): 88.26093492098153 seconds
Pandas/S3FS (pre-buffer): 107.76374901900999 seconds
PyArrow (no pre-buffer): 55.75352717819624 seconds
PyArrow (pre-buffer): 9.941459016874433 seconds

columns = ['vendor_id', 'pickup_latitude', 'pickup_longitude', 'extra']

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=False)
duration = time.monotonic() - start
print("Pandas/S3FS (no pre-buffer):", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer):", duration, "seconds")

start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=False)
duration = time.monotonic() - start
print("PyArrow (no pre-buffer):", duration, "seconds")

start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=True)
duration = time.monotonic() - start
print("PyArrow (pre-buffer):", duration, "seconds")

Joris Van den Bossche / @jorisvandenbossche:
@lidavidm small comment on the benchmark code: for the pyarrow cases, you need to add a .to_pandas() call for it to be equivalent to the pandas pd.read_parquet version (although I would expect this not to be that significant compared to reading from S3).
(read_pandas is a somewhat confusing name, but it still reads into a pyarrow.Table; by default it only uses the pandas metadata, e.g. to ensure the pandas index column is read as well.)
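
A minimal illustration of that point, assuming pandas is using the pyarrow engine:

import pandas as pd
import pyarrow.parquet as pq

path = "s3://ursa-labs-taxi-data/2012/01/data.parquet"

# read_pandas returns a pyarrow.Table; the conversion to a DataFrame happens
# in to_pandas(), so the fair comparison against pd.read_parquet is:
df_arrow = pq.read_pandas(path).to_pandas()

# which is roughly what pandas does internally with the pyarrow engine:
df_pandas = pd.read_parquet(path, engine="pyarrow")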

David Li / @lidavidm:
D'oh, and you already explained this in the Stack Overflow question :) I'll re-run the benchmarks to make sure they're fair.

David Li / @lidavidm:


Whole file:
Pandas/S3FS (no pre-buffer, no readahead): 692.2505334559828 seconds
Pandas/S3FS (no pre-buffer, readahead): 99.55904859001748 seconds
Pandas/S3FS (pre-buffer, no readahead): 39.282157234149054 seconds
Pandas/S3FS (pre-buffer, readahead): 41.564441804075614 seconds
PyArrow (no pre-buffer): 242.97687190794386 seconds
PyArrow (pre-buffer): 39.5321765630506 seconds
===
Column selection:
Pandas/S3FS (no pre-buffer, no readahead): 153.64498204295523 seconds
Pandas/S3FS (no pre-buffer, readahead): 82.44589220592752 seconds
Pandas/S3FS (pre-buffer, no readahead): 114.55768134980462 seconds
Pandas/S3FS (pre-buffer, readahead): 133.1232347697951 seconds
PyArrow (no pre-buffer): 54.11452938010916 seconds
PyArrow (pre-buffer): 12.865494727157056 seconds

import time
import pandas as pd
import pyarrow.fs
import pyarrow.parquet as pq

columns = ['vendor_id', 'pickup_latitude', 'pickup_longitude', 'extra']

print("Whole file:")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={
    'default_block_size': 1,  # 0 is ignored
    'default_fill_cache': False,
}, pre_buffer=False)
duration = time.monotonic() - start
print("Pandas/S3FS (no pre-buffer, no readahead):", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=False)
duration = time.monotonic() - start
print("Pandas/S3FS (no pre-buffer, readahead):", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={
    'default_block_size': 1,  # 0 is ignored
    'default_fill_cache': False,
}, pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer, no readahead):", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer, readahead):", duration, "seconds")

start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=False).to_pandas()
duration = time.monotonic() - start
print("PyArrow (no pre-buffer):", duration, "seconds")

start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=True).to_pandas()
duration = time.monotonic() - start
print("PyArrow (pre-buffer):", duration, "seconds")

print("===")
print("Column selection:")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={
    'default_block_size': 1,  # 0 is ignored
    'default_fill_cache': False,
}, columns=columns, pre_buffer=False)
duration = time.monotonic() - start
print("Pandas/S3FS (no pre-buffer, no readahead):", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=False)
duration = time.monotonic() - start
print("Pandas/S3FS (no pre-buffer, readahead):", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={
    'default_block_size': 1,  # 0 is ignored
    'default_fill_cache': False,
}, columns=columns, pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer, no readahead):", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer, readahead):", duration, "seconds")

start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=False).to_pandas()
duration = time.monotonic() - start
print("PyArrow (no pre-buffer):", duration, "seconds")

start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=True).to_pandas()
duration = time.monotonic() - start
print("PyArrow (pre-buffer):", duration, "seconds")

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request #10074
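
With that change merged, a plain read should get the pre-buffered path by default; a minimal sketch, assuming a PyArrow release that includes this change:

import pyarrow.parquet as pq

# pre_buffer now defaults to True, so no explicit flag should be needed to get
# the coalesced-read behavior benchmarked above when reading from S3.
table = pq.read_table("s3://ursa-labs-taxi-data/2012/01/data.parquet")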
