[Datasets] Allow specifying batch_size when reading Parquet files #31165
Merged
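For context, a minimal usage sketch of what this change enables, assuming batch_size is forwarded through **arrow_parquet_args to the underlying Arrow scanner (the path is hypothetical):

```python
import ray

# batch_size is assumed to flow through **arrow_parquet_args to the Arrow
# scanner, controlling how many rows end up in each record batch produced
# while reading the Parquet files.
ds = ray.data.read_parquet(
    "s3://my-bucket/my-data/",  # hypothetical path
    batch_size=64 * 1024,
)
print(ds.schema())
```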
The kwargs passed through read_parquet() are for https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html, which doesn't have a batch_size argument.
Our documentation needs to be fixed: the kwargs actually go to https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment, not https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html. See #30915 (comment) for details.
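For illustration, a small pyarrow-level sketch of where batch_size actually lands (file path hypothetical): pyarrow.parquet.read_table() has no batch_size parameter, while Scanner.from_fragment() does.

```python
import pyarrow.dataset as ds

dataset = ds.dataset("data.parquet", format="parquet")  # hypothetical path
fragment = next(dataset.get_fragments())

# batch_size is a Scanner argument, not a read_table() argument.
scanner = ds.Scanner.from_fragment(fragment, batch_size=10_000)
for batch in scanner.to_batches():
    ...  # each RecordBatch holds at most 10_000 rows
```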
The question is whether it makes sense to pass args from read_parquet() to anything other than the read_table() API.
I think it makes sense. We are actually passing kwargs to pyarrow.parquet.ParquetDataset and Scanner.from_fragment, and those APIs do not have the same arguments as read_table().
That doesn't seem desirable. Since a Dataset is essentially a distributed Arrow table, it may make sense to pass args through to read_table, the single-node version of the read, but not to other APIs.
@jianoaix The ray.data.read_parquet() API is more of a distributed analog for pyarrow.parquet.ParquetDataset, where we expose certain features that the underlying ParquetDataset provides (e.g. reading path-based partition columns into the data, supporting zero-read filter pushdown on partition columns, etc.). We do actually have a distributed analog for pyarrow.parquet.read_table(), and that's ray.data.read_parquet_bulk(), which doesn't use pyarrow.parquet.ParquetDataset and instead uses pyarrow.parquet.read_table() directly.

For this API, I think directing users in the docs to pyarrow.dataset.Scanner.from_fragment() for **arrow_parquet_args and to pyarrow.parquet.ParquetDataset for dataset_kwargs would be best, and we should look at turning these passthrough arguments into top-level arguments that we define, with the passthrough being an implementation detail. Going forward, if we continue to build out our own format-agnostic partitioning machinery, we should eventually consider switching to pyarrow.parquet.read_table() if/when we achieve feature parity.
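To make the analogy concrete, a hedged sketch of the two read paths (bucket paths are hypothetical):

```python
import ray

# Distributed analog of pyarrow.parquet.ParquetDataset: understands
# path-based (hive-style) partitioning of the dataset directory.
partitioned = ray.data.read_parquet("s3://my-bucket/partitioned/")

# Distributed analog of pyarrow.parquet.read_table(): reads each file
# directly, with no ParquetDataset partitioning machinery in between.
flat = ray.data.read_parquet_bulk(
    ["s3://my-bucket/flat/a.parquet", "s3://my-bucket/flat/b.parquet"]
)
```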
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know. IIUC, the difference between read_table vs. ParquetDataset/Scanner is whether we stream-read a single file. So in read_parquet_bulk(), when reading a single file, it's not streamed, whereas in read_parquet() it is. It looks to me like converging on ParquetDataset/Scanner for streaming a single file is a good option for all of these read APIs.

For this PR itself, since it leverages the existing arg passing, LG to move forward. We can discuss in a follow-up how to make the APIs / arg passing better.
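As a rough illustration of the streamed vs. full read distinction at the Arrow level (path hypothetical):

```python
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Full, single-shot read of one file into memory (the read_table path).
table = pq.read_table("data.parquet")  # hypothetical path

# Streamed read of the same file via the dataset Scanner
# (the ParquetDataset/Scanner path used by read_parquet()).
scanner = ds.dataset("data.parquet", format="parquet").scanner(batch_size=10_000)
for batch in scanner.to_batches():
    ...  # process one RecordBatch at a time instead of the whole table
```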
Sounds good!

Small note: the key difference is actually whether or not we're using Arrow's dataset machinery, which gives us a bunch of partitioning support. The streaming vs. full read was just an implementation detail for getting the performance out of read_parquet_bulk() that Amazon needed, since this predated our fix to the buffering for the streaming Parquet read. We could probably move read_parquet_bulk() to a streaming read with a buffer size set and get the same performance as the full read.