[Python][Dataset][Parquet] Enable pre-buffering by default for Parquet S3 datasets #36765
Comments
Nice catch! I guess this is the relevant code path in `Dataset.to_table`:

```python
return self.scanner(
    schema=schema,
    columns=columns,
    filter=filter,
    batch_size=batch_size,
    batch_readahead=batch_readahead,
    fragment_readahead=fragment_readahead,
    fragment_scan_options=fragment_scan_options,
    use_threads=use_threads,
    memory_pool=memory_pool
).to_table()
```

It calls the scanner. @westonpace, would you mind taking a look or giving some advice?
I guess it will not buffer when reading from a single Parquet dataset, so it might cause a performance loss. And this will by default call …
You can require …
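Presumably this refers to requiring pre-buffering through the Parquet scan options. A minimal sketch of how a user can opt in today, assuming a hypothetical partitioned S3 dataset path:

```python
import pyarrow.dataset as ds

# Enable pre-buffering for every fragment scanned from this dataset.
parquet_format = ds.ParquetFileFormat(
    default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
)

dataset = ds.dataset(
    "s3://my-bucket/my-table/",  # hypothetical path
    format=parquet_format,
    partitioning="hive",
)
table = dataset.to_table()
```

The same options object can also be passed per call via `to_table(fragment_scan_options=...)`.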
I did some testing a few weeks ago and I also found that … However, if …
@westonpace Since the Acero scan node prefetches some fragments by default, would it cause worse memory usage than the previous …?
Possibly. I think there are two concerns users generally have. Either they want "max speed" (use as little memory as possible but, if there is a speed/memory tradeoff, prefer speed) or they want "least memory" (if there is a speed/memory tradeoff, prefer memory). I tried to simplify things in #35889 (this is why I was running experiments a few weeks ago) and came up with:

- `parquet::ReaderProperties`
- `parquet::ArrowReaderProperties`

I'm pretty sure that the … I am very surprised to see a 10x difference due to pre-buffering. I don't think any of our experiments (S3 or not) ever showed a difference that was that drastic.
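For context, the Python layer exposes knobs that roughly map onto those two property classes: `buffer_size` (buffered-stream reads, from `parquet::ReaderProperties`) and `pre_buffer` (from `parquet::ArrowReaderProperties`). A sketch of the two leanings at the single-file level; the file path is hypothetical:

```python
import pyarrow.parquet as pq

# "Least memory" leaning: stream column chunks through a bounded read buffer.
low_mem = pq.ParquetFile("data/part-0.parquet", buffer_size=64 * 1024, pre_buffer=False)

# "Max speed" leaning: coalesce and prefetch the required byte ranges up front.
fast = pq.ParquetFile("data/part-0.parquet", pre_buffer=True)

table = fast.read()  # reads all row groups of the file
```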
Ah, I misunderstood this question. Yes, the scan node / datasets will always use more memory than …
@westonpace With the limited understanding that I have, if memory consumption on a local filesystem is an issue, could we check the filesystem on …?
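A hypothetical sketch of the kind of filesystem check being suggested, written at the user level rather than inside the dataset implementation (the helper name and paths are made up):

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs


def make_dataset(path, filesystem):
    """Hypothetical helper: only pre-buffer on high-latency object stores."""
    remote = isinstance(filesystem, (fs.S3FileSystem, fs.GcsFileSystem))
    fmt = ds.ParquetFileFormat(
        default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=remote)
    )
    return ds.dataset(path, format=fmt, filesystem=filesystem)


dataset = make_dataset("my-bucket/my-table", fs.S3FileSystem(region="us-east-1"))
```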
@westonpace I was surprised as well, but I've received similar speedup results from a colleague facing the same issue. On that note, are you able to replicate the issue with the snippet?
Yes, I am. I agree that it is pretty important that we make this the default, at least for S3, if not the general default (which I think would probably be OK).
I was not aware of this toggle before, but I had massive performance improvements when I turned on `pre_buffer`: from 249 seconds to 2.9 seconds for 1.8 GB of data (2M+ rows) when querying for one month of data. The data is partitioned monthly and goes back to 2018. This was with the adlfs filesystem querying files on Azure Blob Storage.
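For reference, a sketch of roughly what that setup looks like with adlfs (the account, container, and partition column names are made up), enabling pre-buffering through the Parquet scan options:

```python
import adlfs
import pyarrow.dataset as ds

# Made-up Azure account/container names.
abfs = adlfs.AzureBlobFileSystem(account_name="myaccount")

fmt = ds.ParquetFileFormat(
    default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
)
dataset = ds.dataset(
    "mycontainer/mytable",
    format=fmt,
    filesystem=abfs,
    partitioning="hive",
)
# Hypothetical monthly partition column.
table = dataset.to_table(filter=ds.field("month") == "2023-08")
```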
I was chatting about this issue with some people at PyData Amsterdam, and was planning to make a PR to just switch the default once back, so here it is: #37854. That's only changing the default for Python (…). I noticed that the R PR was also setting the … Another useful reference for the above discussion is #28218 (https://issues.apache.org/jira/browse/ARROW-12428), where @lidavidm did some benchmarks with `pre_buffer` enabled/disabled, and which was the reason for exposing the `pre_buffer` option in …
Here, `pre_buffer` will try to buffer the required row groups if necessary, and the memory will not be released until the read is finished. It's different from buffered-stream mode (actually, buffered mode might decrease memory usage, lol). Even when the cache policy is lazy, the reader might not get faster if the row group is large enough, and the memory will not be released before the read is finished. So I wonder if this is OK.
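To make the distinction concrete, a sketch of the two modes being contrasted, at the dataset scan-options level (the path is hypothetical): buffered-stream mode decodes through a fixed-size buffer, while pre-buffering caches the byte ranges of the required row groups up front and holds them until the read finishes.

```python
import pyarrow.dataset as ds

# Buffered-stream mode: bounded read buffer per column chunk, lower peak memory.
buffered = ds.ParquetFragmentScanOptions(use_buffered_stream=True, buffer_size=1 << 20)

# Pre-buffer mode: coalesce and fetch the required byte ranges before decoding;
# the cached ranges stay resident until the fragment read completes.
prebuffered = ds.ParquetFragmentScanOptions(pre_buffer=True)

dataset = ds.dataset("s3://my-bucket/my-table/", format="parquet")  # hypothetical path
table = dataset.to_table(fragment_scan_options=prebuffered)
```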
Yeah, admittedly pre-buffer was a bit of a hack to minimize the changes to the Parquet reader. Ideally you want the Parquet reader to batch its I/O calls (as pre-buffer does) without necessarily caching them. But from what I remember, the reader is not designed that way (selecting columns eventually leads to a lot of disparate I/O calls far down the stack and you'd have to do a bunch of work to untangle that, hence caching was the easiest; that's also why the cache doesn't release memory when things are done - it's hard from this level to tell when that is).
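Not the actual reader code, but a toy illustration of the coalescing idea described above: merge nearby byte ranges so that many small column-chunk reads become a few batched reads, which are then cached and served to the decoders.

```python
def coalesce_ranges(ranges, hole_size_limit=8192):
    """Toy read coalescing: merge (offset, length) ranges that sit close together."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= hole_size_limit:
            prev_offset, _ = merged[-1]
            merged[-1] = (prev_offset, offset + length - prev_offset)
        else:
            merged.append((offset, length))
    return merged


# Four scattered column-chunk reads collapse into two ranged requests.
print(coalesce_ranges([(0, 100), (150, 100), (1_000_000, 500), (1_000_600, 400)]))
# -> [(0, 250), (1000000, 1000)]
```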
…reading Parquet files (#37854)

### Rationale for this change

Enabling `pre_buffer` can give a significant speed-up on filesystems like S3, while it doesn't give a noticeable slowdown on local filesystems, based on benchmarks in the issue. Therefore, simply enabling it by default seems the best default. The option was already enabled by default in the `pyarrow.parquet.read_table` interface; this PR aligns the defaults when using `pyarrow.dataset` directly.

* Closes: #36765

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Describe the enhancement requested
Summary
I'm seeing a discrepancy in read times between the `parquet` and `dataset` modules when reading partitioned Parquet tables (more noticeable on fairly large ones). In a nutshell, `pyarrow.dataset.to_table()` seems to be considerably slower than `pyarrow.parquet.read_table()` in most cases I have tested so far. The difference in read times for some synthetic data I tested on (see Appendix) was approximately 12x.

Details
Python version: `3.8.17 (default, Jul 5 2023, 21:04:15) [GCC 11.2.0]`
PyArrow version: `12.0.1`

The test was done on a machine similar in specs to an `m5.4xlarge` (nCPU=16 | Memory=64 GiB) EC2 instance.

From doing some digging, I found that the root cause of the issue seems to be `pre_buffer`, which is not enabled for `pyarrow.dataset` by default. Once I enable pre-buffering via `ParquetFileFormat` & `ParquetFragmentScanOptions`, the issue gets resolved (see the sketch below).

The issue seems to be exacerbated for Parquet datasets with partitions (maybe more file parts in general?). Doing the read test on a single Parquet file was still worse, but not as bad as the partitioned one.
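A minimal sketch of the comparison described above, assuming a hypothetical partitioned dataset on S3 (the path is made up); `read_table` already defaults to `pre_buffer=True`, while the plain dataset scan does not:

```python
import time

import pyarrow.dataset as ds
import pyarrow.parquet as pq

PATH = "s3://my-bucket/my-table/"  # hypothetical partitioned Parquet dataset


def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.1f}s")


timed("pyarrow.parquet.read_table", lambda: pq.read_table(PATH))
timed("pyarrow.dataset (default)", lambda: ds.dataset(PATH, format="parquet").to_table())
timed(
    "pyarrow.dataset (pre_buffer)",
    lambda: ds.dataset(PATH, format="parquet").to_table(
        fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
    ),
)
```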
Do we know the reasoning for this? As far as I understand, for S3-based Parquet datasets this could be turned on by default. Otherwise, maybe we can add it as an argument for the user to enable in the `dataset()` call?

Appendix
Code to generate test data
Component(s)
Parquet, Python