[Datasets] Allow specifying batch_size when reading Parquet file #31165
Conversation
Signed-off-by: Cheng Su <[email protected]>
@@ -374,13 +374,14 @@ def _read_pieces(
     logger.debug(f"Reading {len(pieces)} parquet pieces")
     use_threads = reader_args.pop("use_threads", False)
+    batch_size = reader_args.pop("batch_size", PARQUET_READER_ROW_BATCH_SIZE)
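For context, here is a minimal sketch of how a popped batch_size could be forwarded to PyArrow's fragment scanner, which does accept that argument. This is not the actual Ray source; the helper name and surrounding plumbing are hypothetical.

```python
import pyarrow.dataset as ds

def _scan_piece(piece, schema, use_threads, batch_size, **reader_args):
    # `piece` is assumed to be a pyarrow.dataset.ParquetFileFragment.
    # Scanner.from_fragment accepts batch_size, unlike pyarrow.parquet.read_table.
    scanner = ds.Scanner.from_fragment(
        piece,
        schema=schema,
        use_threads=use_threads,
        batch_size=batch_size,
        **reader_args,
    )
    # Stream RecordBatches of at most `batch_size` rows.
    yield from scanner.to_batches()
```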
The kwargs passed through read_parquet() are for https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html, which doesn't have a batch_size argument.
Our documentation needs to be fixed. It's actually https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment, not https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html. See #30915 (comment) for details.
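To make the distinction concrete, here is an illustrative contrast between the two PyArrow APIs (the file path and batch size are placeholders): read_table() exposes no batch_size, while the dataset Scanner does.

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Whole-file read; read_table() has no batch_size argument.
table = pq.read_table("example.parquet")

# Fragment-based scan; Scanner.from_fragment() does take batch_size.
dataset = ds.dataset("example.parquet", format="parquet")
fragment = next(iter(dataset.get_fragments()))
scanner = ds.Scanner.from_fragment(fragment, batch_size=10_000)
for batch in scanner.to_batches():
    ...  # process RecordBatches of at most ~10k rows
```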
The question is whether it makes sense to pass args from read_parquet() to anything other than the read_table() API.
I think it makes sense. We are actually passing kwargs to pyarrow.parquet.ParquetDataset and Scanner.from_fragment, and these APIs do not have the same arguments as read_table().
That seems undesirable. If Dataset is a distributed Arrow, then it may make sense to pass args through to read_table(), the single-node version of the read, but not to other APIs.
@jianoaix The ray.data.read_parquet() API is more of a distributed analog for pyarrow.parquet.ParquetDataset, where we expose certain features that the underlying ParquetDataset provides (e.g. reading path-based partition columns into the data, supporting zero-read filter pushdown on partition columns, etc.). We do actually have a distributed analog for pyarrow.parquet.read_table(), and that's ray.data.read_parquet_bulk(), which doesn't use pyarrow.parquet.ParquetDataset and instead directly uses pyarrow.parquet.read_table().

For this API, I think that directing users in the docs to pyarrow.dataset.Scanner.from_fragment() for **arrow_parquet_args and to pyarrow.parquet.ParquetDataset for dataset_kwargs would be best, and we should look at turning these passthrough arguments into top-level arguments that we define, with the passthrough being an implementation detail. Going forward, if we continue to build out our own format-agnostic partitioning machinery, we should eventually consider switching to pyarrow.parquet.read_table() if/when we achieve feature parity.
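As a rough usage illustration of the two entry points described above (the paths are placeholders):

```python
import ray

# Dataset-style read: built on pyarrow.parquet.ParquetDataset + Scanner.from_fragment,
# so it understands path-based partition columns and partition-column filter pushdown.
ds_partitioned = ray.data.read_parquet("s3://bucket/partitioned_table/")

# Bulk read: roughly one pyarrow.parquet.read_table() call per file, no dataset machinery.
ds_bulk = ray.data.read_parquet_bulk(
    ["s3://bucket/files/a.parquet", "s3://bucket/files/b.parquet"]
)
```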
Good to know. IIUC, the difference between read_table vs. ParquetDataset/Scanner is whether we stream-read a single file. So in read_parquet_bulk(), when reading a single file, it's not streamed, whereas in read_parquet() it is streamed. It looks to me like converging on ParquetDataset/Scanner for streaming a single file is a good option for all of these read APIs.

For this PR itself, since it's leveraging existing arg passing, LG to move forward. We may discuss in a follow-up how to make the APIs / arg passing better.
Sounds good!

Small note: the key difference is actually whether or not we're using Arrow's dataset machinery, which gives us a bunch of partitioning features. The streaming vs. full read was just an implementation detail for getting the performance out of read_parquet_bulk() that Amazon needed, since this was before we fixed the buffering for the streaming Parquet read. We could probably move read_parquet_bulk() to a streaming read with a buffer size set and get the same performance as the full read.
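A sketch of what such a buffered streaming read could look like with plain PyArrow's dataset scanner; the buffer and batch sizes here are illustrative, not values Ray actually uses.

```python
import pyarrow.dataset as ds

# Enable a buffered input stream for the Parquet scan.
scan_options = ds.ParquetFragmentScanOptions(
    use_buffered_stream=True,
    buffer_size=8 * 1024 * 1024,  # 8 MiB read buffer (illustrative)
)
fmt = ds.ParquetFileFormat(default_fragment_scan_options=scan_options)

single_file = ds.dataset("one_file.parquet", format=fmt)
for batch in single_file.to_batches(batch_size=50_000):
    ...  # process each RecordBatch without materializing the whole file
```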
LGTM!
Can you fix the docstring of read_parquet() as well?
Signed-off-by: Cheng Su <[email protected]>
@jianoaix - yeah, updated the link to https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment. Will make a separate PR to add the docstring.
Signed-off-by: Cheng Su [email protected]
Why are these changes needed?
This PR is to allow users to specify batch_size when reading Parquet files. Currently we have a hardcoded value of batch_size, which may read too much data into one Arrow block if the table has wide rows. This does not work well with dynamic block splitting, as we never split one Arrow batch into smaller ones. Here we allow users to specify batch_size so they can tune it accordingly. We already allow users to pass most of the PyArrow Parquet reader arguments, so this shouldn't add much configuration-tuning overhead for them.
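A minimal usage sketch of the new behavior (the path and batch size are placeholders):

```python
import ray

# A smaller batch_size keeps each Arrow block small when rows are wide,
# which plays better with dynamic block splitting.
ds = ray.data.read_parquet("s3://bucket/wide_table/", batch_size=1024)
```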
Related issue number
Closes #30860
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.