
[Datasets] Allow specify batch_size when reading Parquet file #31165

Merged (2 commits into ray-project:master) on Dec 21, 2022

Conversation

@c21 (Contributor) commented Dec 16, 2022:

Signed-off-by: Cheng Su [email protected]

Why are these changes needed?

This PR allows users to specify batch_size when reading Parquet files. Currently batch_size is hardcoded, which can read too much data into a single Arrow block when the table has wide rows. That interacts poorly with dynamic block splitting, because we never split one Arrow batch into smaller ones. With this change, users can pass batch_size and tune it to their data. Since we already pass through most PyArrow Parquet reader arguments, this shouldn't add much configuration-tuning overhead. A rough usage sketch is shown below.
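As a usage sketch only (the dataset path is a placeholder and 1024 is an arbitrary example value), the new argument rides along with the other PyArrow reader kwargs:

import ray

# Placeholder path; wide-row Parquet data is the intended use case.
# batch_size is popped from the reader kwargs and forwarded to the PyArrow scan,
# capping how many rows land in each Arrow record batch (and hence each block).
ds = ray.data.read_parquet(
    "s3://my-bucket/wide_table/",  # hypothetical location
    batch_size=1024,               # tune down for wide rows
)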

Related issue number

Closes #30860

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -374,13 +374,14 @@ def _read_pieces(

logger.debug(f"Reading {len(pieces)} parquet pieces")
use_threads = reader_args.pop("use_threads", False)
batch_size = reader_args.pop("batch_size", PARQUET_READER_ROW_BATCH_SIZE)
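
For context, a minimal standalone sketch of what this passthrough amounts to at the PyArrow level (not Ray's exact code; the file name is a placeholder):

import pyarrow.dataset as pads

# Build a dataset over a single (hypothetical) Parquet file and take its fragment.
dataset = pads.dataset("wide_table.parquet", format="parquet")
fragment = next(iter(dataset.get_fragments()))

# batch_size caps the rows per record batch produced by the scanner,
# which is what the popped reader_args value controls above.
scanner = pads.Scanner.from_fragment(
    fragment,
    batch_size=1024,
    use_threads=False,
)
for batch in scanner.to_batches():
    ...  # each RecordBatch holds at most 1024 rows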
Contributor:
The kwargs passed through read_parquet() are for https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html, which doesn't have a "batch_size".

Contributor:
The question is whether it would make sense to pass args from read_parquet() to things other than the read_table() API.

Contributor Author (c21):
I think it makes sense. We are actually passing kwargs to pyarrow.parquet.ParquetDataset and Scanner.from_fragment. These APIs do not have the same arguments as read_table().

Contributor:
That seems undesirable. Since a Dataset is a distributed Arrow table, it may make sense to pass args through to read_table, the single-node version of the read, but not to the other APIs.

@clarkzinzow (Contributor) commented Dec 20, 2022:

@jianoaix The ray.data.read_parquet() API is more of a distributed analog for pyarrow.parquet.ParquetDataset, where we expose certain features that the underlying ParquetDataset provides (e.g. reading path-based partition columns into the data, supporting zero-read filter pushdown on partition columns, etc.). We do actually have a distributed analog for pyarrow.parquet.read_table(), and that's ray.data.read_parquet_bulk(), which doesn't use pyarrow.parquet.ParquetDataset and instead directly uses pyarrow.parquet.read_table().

For this API, I think that directing users in the docs to pyarrow.dataset.Scanner.from_fragment() for **arrow_parquet_args and to pyarrow.parquet.ParquetDataset for dataset_kwargs would be best, and we should look at turning these passthrough arguments into top-level arguments that we define, with the passthrough being an implementation detail. Going forward, if we continue to build out our own format-agnostic partitioning machinery, we should eventually consider switching to pyarrow.parquet.read_table() if/when we achieve feature parity.
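
To make that distinction concrete, a hedged usage sketch (the bucket paths are placeholders): read_parquet() layers on ParquetDataset and the dataset scanner, so scanner kwargs such as batch_size pass through, while read_parquet_bulk() maps to per-file read_table() calls.

import ray

# Placeholder paths throughout.
# ParquetDataset-backed read: partition handling, filter pushdown, and
# scanner kwargs (e.g. batch_size) apply here.
ds = ray.data.read_parquet("s3://bucket/partitioned_table/", batch_size=4096)

# read_table()-backed bulk read: one pyarrow.parquet.read_table() call per file,
# so passthrough kwargs follow read_table()'s signature instead.
ds_bulk = ray.data.read_parquet_bulk(
    [
        "s3://bucket/flat/file-000.parquet",
        "s3://bucket/flat/file-001.parquet",
    ]
)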

Contributor:

Good to know that. IIUC, the difference between read_table vs. ParquetDataset/Scanner is whether we stream-read a single file. So in read_parquet_bulk(), reading a single file is not streamed, whereas in read_parquet() it is streamed. It looks to me like converging on ParquetDataset/Scanner for streaming a single file is a good option for all of those read APIs.

For this PR itself, since it's leveraging the existing arg passing, it looks good to move forward. We can discuss in a follow-up how to make the APIs / arg passing better.

@clarkzinzow (Contributor) commented Dec 20, 2022:

Sounds good!

Small note: the key difference is actually whether or not we're using Arrow's Dataset API, which gives us a bunch of partitioning machinery. The streaming vs. full read was just an implementation detail for getting the performance that Amazon needed out of read_parquet_bulk(), since this was before we fixed the buffering for the streaming Parquet read. We could probably move read_parquet_bulk() to a streaming read with a buffer size set and get the same performance as the full read.

@clarkzinzow (Contributor) left a comment:
LGTM!

@jianoaix (Contributor) left a comment:
Can you fix the docstring of read_parquet() as well?

@c21 (Contributor Author) commented Dec 21, 2022:

Can you fix the docstring of read_parquet() as well?

@jianoaix - yeah, updated the link to https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment. Will make a separate PR to add docstring for dataset_kwargs, which is for a separate issue - #30915 (comment) .

@clarkzinzow clarkzinzow merged commit c8443c0 into ray-project:master Dec 21, 2022
@c21 c21 deleted the parquet-batch-size branch December 21, 2022 18:02
AmeerHajAli pushed a commit that referenced this pull request on Jan 12, 2023:
This PR is to allow users to specify batch_size when reading Parquet file.

tamohannes pushed a commit to ju2ez/ray that referenced this pull request on Jan 25, 2023 (ray-project#31165):
This PR is to allow users to specify batch_size when reading Parquet file.
Signed-off-by: tmynn <[email protected]>
Successfully merging this pull request may close these issues.

[Datasets] Support dynamic block splitting by row batch size in DatasetContext