[Python] pyarrow.parquet.read_* should use pre_buffer=True #28218
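For context, pre_buffer is today an explicit opt-in on the pyarrow.parquet readers; this issue proposes making it the default. A minimal illustration of the current opt-in (the path is just an example):

import pyarrow.parquet as pq

# pre_buffer coalesces the byte ranges needed for the requested columns
# and prefetches them in large reads, which avoids many small round
# trips on high-latency filesystems such as S3.
table = pq.read_table("s3://ursa-labs-taxi-data/2012/01/data.parquet",
                      pre_buffer=True)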
Comments
David Li / @lidavidm:
import time
import pandas as pd
import pyarrow.parquet as pq
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet")
duration = time.monotonic() - start
print("Pandas/S3FS:", duration, "seconds")
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={
'default_block_size': 1, # 0 is ignored
'default_fill_cache': False,
})
duration = time.monotonic() - start
print("Pandas/S3FS (no readahead):", duration, "seconds")
start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet")
duration = time.monotonic() - start
print("PyArrow:", duration, "seconds")
start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=True)
duration = time.monotonic() - start
print("PyArrow (pre-buffer):", duration, "seconds")
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer):", duration, "seconds")
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={
'default_block_size': 1, # 0 is ignored
'default_fill_cache': False,
}, pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer, no readahead):", duration, "seconds") |
David Li / @lidavidm:
This is on a system with NVMe storage, so results may vary on spinning disks or SATA SSDs. (I updated the results to read the file once, unmeasured, before taking the measurement, in case the disk cache is a factor.)
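The warm-up pattern described above looks roughly like this (a sketch, assuming a local copy of the file; the path is hypothetical):

import time
import pandas as pd

path = "/data/nyc-taxi/2012-01.parquet"  # hypothetical local copy

# Unmeasured warm-up read, so the timed run below is not skewed by a
# cold OS/disk cache.
pd.read_parquet(path)

start = time.monotonic()
df = pd.read_parquet(path)
print("warm read:", time.monotonic() - start, "seconds")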
David Li / @lidavidm:
import time
import pandas as pd
import pyarrow.parquet as pq
columns = ['vendor_id', 'pickup_latitude', 'pickup_longitude', 'extra']
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=False)
duration = time.monotonic() - start
print("Pandas/S3FS (no pre-buffer):", duration, "seconds")
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer):", duration, "seconds")
start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=False)
duration = time.monotonic() - start
print("PyArrow (no pre-buffer):", duration, "seconds")
start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=True)
duration = time.monotonic() - start
print("PyArrow (pre-buffer):", duration, "seconds") |
Joris Van den Bossche / @jorisvandenbossche:
import time
import pandas as pd
import pyarrow.fs
import pyarrow.parquet as pq
columns = ['vendor_id', 'pickup_latitude', 'pickup_longitude', 'extra']
print("Whole file:")
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={
'default_block_size': 1, # 0 is ignored
'default_fill_cache': False,
}, pre_buffer=False)
duration = time.monotonic() - start
print("Pandas/S3FS (no pre-buffer, no readahead):", duration, "seconds")
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=False)
duration = time.monotonic() - start
print("Pandas/S3FS (no pre-buffer, readahead):", duration, "seconds")
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={
'default_block_size': 1, # 0 is ignored
'default_fill_cache': False,
}, pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer, no readahead):", duration, "seconds")
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer, readahead):", duration, "seconds")
start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=False).to_pandas()
duration = time.monotonic() - start
print("PyArrow (no pre-buffer):", duration, "seconds")
start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=True).to_pandas()
duration = time.monotonic() - start
print("PyArrow (pre-buffer):", duration, "seconds")
print("===")
print("Column selection:")
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={
'default_block_size': 1, # 0 is ignored
'default_fill_cache': False,
}, columns=columns, pre_buffer=False)
duration = time.monotonic() - start
print("Pandas/S3FS (no pre-buffer, no readahead):", duration, "seconds")
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=False)
duration = time.monotonic() - start
print("Pandas/S3FS (no pre-buffer, readahead):", duration, "seconds")
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={
'default_block_size': 1, # 0 is ignored
'default_fill_cache': False,
}, columns=columns, pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer, no readahead):", duration, "seconds")
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer, readahead):", duration, "seconds")
start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=False).to_pandas()
duration = time.monotonic() - start
print("PyArrow (no pre-buffer):", duration, "seconds")
start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=True).to_pandas()
duration = time.monotonic() - start
print("PyArrow (pre-buffer):", duration, "seconds") |
Joris Van den Bossche / @jorisvandenbossche:
If the user is synchronously reading a single file, we should try to read it as fast as possible. The one sticking point might be whether it's beneficial to enable this regardless of the filesystem, or whether we should enable it only on high-latency filesystems.
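One way to scope such a default, sketched purely as an illustration (the isinstance heuristic is an assumption, not pyarrow's actual behavior or a proposal from this thread):

import pyarrow.fs
import pyarrow.parquet as pq

def read_table_auto_prebuffer(path, filesystem, **kwargs):
    # Hypothetical heuristic: treat object stores as high-latency and
    # enable pre-buffering there; leave it off for local disks.
    high_latency = isinstance(filesystem, pyarrow.fs.S3FileSystem)
    return pq.read_table(path, filesystem=filesystem,
                         pre_buffer=high_latency, **kwargs)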
Reporter: David Li / @lidavidm
Assignee: David Li / @lidavidm
PRs and other links:
Note: This issue was originally created as ARROW-12428. Please see the migration documentation for further details.