
[Datasets] Allow specify batch_size when reading Parquet file #31165

Merged (2 commits into ray-project:master) on Dec 21, 2022

Conversation

@c21 (Contributor) commented Dec 16, 2022:

Signed-off-by: Cheng Su [email protected]

Why are these changes needed?

This PR allows users to specify batch_size when reading Parquet files. Currently batch_size is hardcoded, which can read too much data into a single Arrow block when the table has wide rows. That interacts poorly with dynamic block splitting, because we never split one Arrow batch into smaller ones. With this change, users can pass batch_size and tune it to their data. Since we already pass through most PyArrow Parquet reader arguments, this shouldn't add much configuration-tuning overhead. A rough usage sketch is shown below.
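As a usage sketch only (the dataset path is a placeholder and 1024 is an arbitrary example value), the new argument rides along with the other PyArrow reader kwargs:

import ray

# Placeholder path; wide-row Parquet data is the intended use case.
# batch_size is popped from the reader kwargs and forwarded to the PyArrow scan,
# capping how many rows land in each Arrow record batch (and hence each block).
ds = ray.data.read_parquet(
    "s3://my-bucket/wide_table/",  # hypothetical location
    batch_size=1024,               # tune down for wide rows
)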

Related issue number

Closes #30860

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -374,13 +374,14 @@ def _read_pieces(

logger.debug(f"Reading {len(pieces)} parquet pieces")
use_threads = reader_args.pop("use_threads", False)
batch_size = reader_args.pop("batch_size", PARQUET_READER_ROW_BATCH_SIZE)
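
For context, a minimal standalone sketch of what this passthrough amounts to at the PyArrow level (not Ray's exact code; the file name is a placeholder):

import pyarrow.dataset as pads

# Build a dataset over a single (hypothetical) Parquet file and take its fragment.
dataset = pads.dataset("wide_table.parquet", format="parquet")
fragment = next(iter(dataset.get_fragments()))

# batch_size caps the rows per record batch produced by the scanner,
# which is what the popped reader_args value controls above.
scanner = pads.Scanner.from_fragment(
    fragment,
    batch_size=1024,
    use_threads=False,
)
for batch in scanner.to_batches():
    ...  # each RecordBatch holds at most 1024 rows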
Contributor:
The kwargs passed through read_parquet() are for https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html, which doesn't have a "batch_size".

Contributor:
The question is whether it would make sense to pass args from read_parquet() to things other than the read_table() API.

Contributor Author (c21):
I think it makes sense. We are actually passing kwargs to pyarrow.parquet.ParquetDataset and Scanner.from_fragment. These APIs do not have the same arguments as read_table().

Contributor:
That seems undesirable. Since a Dataset is a distributed Arrow table, it may make sense to pass args through to read_table, the single-node version of the read, but not to the other APIs.

@clarkzinzow (Contributor) commented Dec 20, 2022:

@jianoaix The ray.data.read_parquet() API is more of a distributed analog for pyarrow.parquet.ParquetDataset, where we expose certain features that the underlying ParquetDataset provides (e.g. reading path-based partition columns into the data, supporting zero-read filter pushdown on partition columns, etc.). We do actually have a distributed analog for pyarrow.parquet.read_table(), and that's ray.data.read_parquet_bulk(), which doesn't use pyarrow.parquet.ParquetDataset and instead directly uses pyarrow.parquet.read_table().

For this API, I think that directing users in the docs to pyarrow.dataset.Scanner.from_fragment() for **arrow_parquet_args and to pyarrow.parquet.ParquetDataset for dataset_kwargs would be best, and we should look at turning these passthrough arguments into top-level arguments that we define, with the passthrough being an implementation detail. Going forward, if we continue to build out our own format-agnostic partitioning machinery, we should eventually consider switching to pyarrow.parquet.read_table() if/when we achieve feature parity.
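
To make that distinction concrete, a hedged usage sketch (the bucket paths are placeholders): read_parquet() layers on ParquetDataset and the dataset scanner, so scanner kwargs such as batch_size pass through, while read_parquet_bulk() maps to per-file read_table() calls.

import ray

# Placeholder paths throughout.
# ParquetDataset-backed read: partition handling, filter pushdown, and
# scanner kwargs (e.g. batch_size) apply here.
ds = ray.data.read_parquet("s3://bucket/partitioned_table/", batch_size=4096)

# read_table()-backed bulk read: one pyarrow.parquet.read_table() call per file,
# so passthrough kwargs follow read_table()'s signature instead.
ds_bulk = ray.data.read_parquet_bulk(
    [
        "s3://bucket/flat/file-000.parquet",
        "s3://bucket/flat/file-001.parquet",
    ]
)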

Contributor:

Good to know that. IIUC, the difference between read_table vs. ParquetDataset/Scanner is whether we stream-read a single file. So in read_parquet_bulk(), reading a single file is not streamed, whereas in read_parquet() it is streamed. It looks to me like converging on ParquetDataset/Scanner for streaming a single file is a good option for all of those read APIs.

For this PR itself, since it's leveraging the existing arg passing, it looks good to move forward. We can discuss in a follow-up how to make the APIs / arg passing better.

@clarkzinzow (Contributor) commented Dec 20, 2022:

Sounds good!

Small note: the key difference is actually whether or not we're using Arrow's Dataset API, which gives us a bunch of partitioning machinery. The streaming vs. full read was just an implementation detail for getting the performance that Amazon needed out of read_parquet_bulk(), since this was before we fixed the buffering for the streaming Parquet read. We could probably move read_parquet_bulk() to a streaming read with a buffer size set and get the same performance as the full read.

@clarkzinzow (Contributor) left a comment:
LGTM!

@jianoaix (Contributor) left a comment:
Can you fix the docstring of read_parquet() as well?

@c21 (Contributor Author) commented Dec 21, 2022:

Can you fix the docstring of read_parquet() as well?

@jianoaix - yeah, updated the link to https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment. Will make a separate PR to add docstring for dataset_kwargs, which is for a separate issue - #30915 (comment) .

@clarkzinzow clarkzinzow merged commit c8443c0 into ray-project:master Dec 21, 2022
@c21 c21 deleted the parquet-batch-size branch December 21, 2022 18:02
AmeerHajAli pushed a commit that referenced this pull request on Jan 12, 2023:
This PR is to allow users to specify batch_size when reading Parquet file.

tamohannes pushed a commit to ju2ez/ray that referenced this pull request on Jan 25, 2023 (ray-project#31165):
This PR is to allow users to specify batch_size when reading Parquet file.
Signed-off-by: tmynn <[email protected]>
Successfully merging this pull request may close these issues.

[Datasets] Support dynamic block splitting by row batch size in DatasetContext