[Data] Add `partitioning` parameter to `read_parquet` #47553

bveeramani · 2024-09-07T09:40:22Z

Why are these changes needed?

To extract path partition information with read_parquet, you pass a PyArrow partitioning object to dataset_kwargs. For example:

import pyarrow as pa
import pyarrow.dataset
import ray
schema = pa.schema([("one", pa.int32()), ("two", pa.string())])
partitioning = pa.dataset.partitioning(schema, flavor="hive")
ds = ray.data.read_parquet(..., dataset_kwargs=dict(partitioning=partitioning))

This is problematic for two reasons:

1. It tightly couples the interface to the implementation; partitioning only works if the implementation uses pyarrow.Dataset in a specific way.
2. It's inconsistent with the other file-based APIs. All of them expose a top-level partitioning parameter (rather than dataset_kwargs) where you pass a Ray Data Partitioning object (rather than a PyArrow partitioning object).
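
For readers unfamiliar with the term, here is a minimal sketch of what "path partition information" means for a hive-style layout, where directory names encode `key=value` pairs. It is a plain-Python illustration only, not Ray Data's or PyArrow's implementation; the function name `parse_hive_partitions` and the example path are hypothetical.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(path: str) -> dict[str, str]:
    """Collect key=value directory segments from a hive-style path.

    Hypothetical helper for illustration; not part of Ray Data or PyArrow.
    """
    partitions = {}
    # Only directory components carry partition values; skip the file name.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# Example path is made up; it mirrors the "one" and "two" fields above.
fields = parse_hive_partitions("data/one=1/two=a/part-0.parquet")
print(fields)
```

The values come back as strings. A schema like the one in the example above is what tells a reader to interpret `one` as an int32 and `two` as a string.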

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani added 2 commits September 7, 2024 02:39

Add parameter

321f964

Signed-off-by: Balaji Veeramani <[email protected]>

Fix typo

b9b4e27

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani requested review from ericl, scv119, c21, amogkam, scottjlee, raulchen, stephanie-wang and omatthew98 as code owners September 7, 2024 09:40

bveeramani assigned raulchen and alexeykudinkin Sep 9, 2024

raulchen approved these changes Sep 10, 2024

alexeykudinkin reviewed Sep 11, 2024

python/ray/data/_internal/datasource/parquet_datasource.py (outdated, resolved)

python/ray/data/_internal/datasource/parquet_datasource.py (outdated, resolved)

bveeramani added 3 commits September 13, 2024 21:34

Address review comments and fix test

4f15d25

Signed-off-by: Balaji Veeramani <[email protected]>

Appease lint

60ee52b

Signed-off-by: Balaji Veeramani <[email protected]>

Update docstring

d3585b0

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani enabled auto-merge (squash) September 14, 2024 04:39

github-actions bot added the `go` label (add ONLY when ready to merge, run all tests) Sep 14, 2024

bveeramani disabled auto-merge September 14, 2024 04:40

bveeramani merged commit 1c80db5 into master Sep 16, 2024
7 checks passed

bveeramani deleted the parquet-partitioning-arg branch September 16, 2024 05:15
