[Data] Make Parquet tests more robust and expose Parquet logic #46944

bveeramani · 2024-08-02T23:18:04Z

Why are these changes needed?

This PR makes the following changes:

- Rewrites test_dataset_stats_read_parquet as test_dataset_stats_range to make the test independent of the Parquet implementation (the intent of the test isn't Parquet-specific)
- Makes test_parquet_read_spread robust to different Parquet implementations (the test makes assumptions about how tasks are launched)
- Makes test_fsspec_filesystem not depend on the number of files written by Ray Data (the test assumes write_parquet writes exactly two files)
- Exposes Parquet-specific logic to other modules
  - Removes underscores from SerializedFragment and check_for_legacy_tensor_type
  - Refactors ParquetDatasource._sample_fragments as sample_fragments
  - Introduces get_parquet_dataset
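
The idea behind the test_fsspec_filesystem change (assert on the data that round-trips, not on how many files the writer happened to produce) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: write_rows, read_all_rows, and check_roundtrip are hypothetical helpers, and CSV files stand in for Parquet so the sketch runs with only the standard library.

```python
import csv
import tempfile
from pathlib import Path


def write_rows(rows, out_dir, rows_per_file):
    """Hypothetical stand-in for a writer whose file count is an implementation detail."""
    out_dir = Path(out_dir)
    for start in range(0, len(rows), rows_per_file):
        path = out_dir / f"part-{start // rows_per_file:04d}.csv"
        with path.open("w", newline="") as f:
            csv.writer(f).writerows(rows[start:start + rows_per_file])


def read_all_rows(out_dir):
    """Read every part file back, however many were written."""
    rows = []
    for path in sorted(Path(out_dir).glob("*.csv")):
        with path.open(newline="") as f:
            rows.extend([int(a), int(b)] for a, b in csv.reader(f))
    return rows


def check_roundtrip(rows_per_file):
    """Return True if the data survives a write/read cycle for this file layout."""
    rows = [[i, i * 2] for i in range(10)]
    with tempfile.TemporaryDirectory() as tmp:
        write_rows(rows, tmp, rows_per_file)
        # Compare the contents only; never assert on the number of files.
        return sorted(read_all_rows(tmp)) == rows
```

Calling check_roundtrip with different rows_per_file values exercises layouts of ten files, two files, and one file with the same assertion, which is the property a test needs in order to survive a change in the writer's implementation.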

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Balaji Veeramani <[email protected]>

Initial commit

f7b7426

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani requested review from ericl, scv119, c21, amogkam, scottjlee, raulchen, stephanie-wang and omatthew98 as code owners August 2, 2024 23:18

bveeramani changed the title ~~[Data]~~ [Data] Make Parquet tests more robust and expose Parquet logic Aug 2, 2024

bveeramani assigned raulchen Aug 3, 2024

bveeramani enabled auto-merge (squash) August 3, 2024 20:03

Merge branch 'master' into parquet-changes

1558393

github-actions bot added the go label (add ONLY when ready to merge, run all tests) Aug 3, 2024

github-actions bot disabled auto-merge August 3, 2024 20:03

omatthew98 approved these changes Aug 5, 2024

raulchen approved these changes Aug 5, 2024

bveeramani merged commit 84ce0e6 into ray-project:master Aug 5, 2024
5 checks passed

bveeramani deleted the parquet-changes branch August 5, 2024 18:32