
[Datasets] Support different number of blocks/rows per block in zip(). #32795

Merged
merged 3 commits into from
Feb 24, 2023

Conversation

clarkzinzow
Contributor

This PR adds support for a different number of blocks/rows per block in ds1.zip(ds2), by aligning the blocks in ds2 to ds1 with a lightweight repartition/block splitting.

Design

We heavily utilize the block splitting machinery that's used for ds.split() and ds.split_at_indices() to avoid an overly expensive repartition. Namely, for ds1.zip(ds2), we:

  1. Calculate the block sizes for ds1 in order to get split indices.
  2. Apply _split_at_indices() to ds2 in order to get a list of ds2 block chunks for every block in ds1, such that self_block.num_rows() == sum(other_block.num_rows() for other_block in other_split_blocks) for every self_block in ds1.
  3. Zip together each block in ds1 with the one or more blocks from ds2 that constitute the block-aligned split for that ds1 block.
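The three steps above can be sketched in plain Python, with lists of rows standing in for blocks. This is an illustrative simplification, not the actual Ray Datasets internals: here `split_at_indices` returns each split as a single chunk, whereas the real `_split_at_indices()` returns a list of block chunks per split.

```python
from itertools import accumulate

def split_at_indices(blocks, indices):
    """Split a list of row-blocks at the given global row offsets.

    Simplified stand-in for _split_at_indices(): returns one chunk per
    output partition instead of a list of block chunks.
    """
    rows = [row for block in blocks for row in block]  # flatten all blocks
    bounds = [0] + list(indices) + [len(rows)]
    return [rows[a:b] for a, b in zip(bounds[:-1], bounds[1:])]

def zip_datasets(ds1_blocks, ds2_blocks):
    # 1. Block sizes of ds1 give the split indices.
    sizes = [len(b) for b in ds1_blocks]
    indices = list(accumulate(sizes))[:-1]
    # 2. Split ds2 so each chunk lines up with one ds1 block.
    ds2_aligned = split_at_indices(ds2_blocks, indices)
    # 3. Zip each ds1 block with its aligned ds2 chunk, row by row.
    return [list(zip(b1, b2)) for b1, b2 in zip(ds1_blocks, ds2_aligned)]

# Different block layouts: ds1 has blocks of 2 and 3 rows, ds2 has 1/1/3.
ds1 = [[1, 2], [3, 4, 5]]
ds2 = [["a"], ["b"], ["c", "d", "e"]]
print(zip_datasets(ds1, ds2))
# [[(1, 'a'), (2, 'b')], [(3, 'c'), (4, 'd'), (5, 'e')]]
```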

Related issue number

Closes #32375

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@ericl ericl left a comment


LGTM at high level

@clarkzinzow
Contributor Author

Tests look ok, @c21 @jianoaix could one of y'all take a look?

)
return blocks, {}

super().__init__("zip", None, do_zip_all)


def _do_zip(block: Block, *other_blocks: Block) -> (Block, BlockMetadata):
Contributor
Annotate other_blocks as List[Block] for readability?

Contributor Author

Variadic args should have the type annotation of a single one of the arguments, not the collection. https://peps.python.org/pep-0484/#arbitrary-argument-lists-and-default-argument-values
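A tiny example of the PEP 484 rule being cited: the annotation on a `*args` parameter names the type of each individual argument, while inside the function body the parameter is a tuple of that type.

```python
def variadic_sum(*nums: int) -> int:
    # Each positional arg is annotated as int; inside the body, `nums`
    # is a Tuple[int, ...], per PEP 484.
    return sum(nums)

print(variadic_sum(1, 2, 3))  # 6
```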

Contributor
It's List[ObjectRef[Block]] from the splitting output; any reason for this to be a variadic arg?

Contributor Author

@clarkzinzow clarkzinzow Feb 24, 2023

This is used as a Ray task function, and Ray only resolves object refs that are top-level arguments. We therefore want each of these data blocks as a top-level argument so that we (1) get automatic materialization, (2) ensure the task isn't scheduled until all blocks are resolved, and (3) take advantage of locality-aware scheduling of the task. We wouldn't get any of those three things if we did a ray.get() in the task function, if that's what you're recommending.

We use this same pattern elsewhere, whenever we send a variable number of data blocks to a Ray task, we destructure it into a variadic arg so all of the above happens.

Contributor

ok, makes sense!

python/ray/data/_internal/stage_impl.py (conversation resolved)
@c21
Contributor

c21 commented Feb 24, 2023

Btw, it would be great if we can create a benchmark for zip() from what we learned.
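A minimal timing harness one might start such a benchmark from. This is an illustrative sketch, not Ray's release-test framework: plain Python lists stand in for blocks, and the zipped workload is a simple row-wise zip over pre-aligned layouts.

```python
import time

def bench(fn, *args, repeat=5):
    """Return the best-of-`repeat` wall time for fn(*args)."""
    times = []
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return min(times)

# Two datasets with identical row layouts, as stand-in blocks.
ds1 = [list(range(1000)) for _ in range(100)]
ds2 = [list(range(1000)) for _ in range(100)]
rowwise_zip = lambda a, b: [list(zip(x, y)) for x, y in zip(a, b)]
print(f"best of 5: {bench(rowwise_zip, ds1, ds2):.4f}s")
```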

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 24, 2023
python/ray/data/dataset.py (outdated; conversation resolved)
@ericl ericl merged commit e3f875c into ray-project:master Feb 24, 2023
clarkzinzow added a commit to clarkzinzow/ray that referenced this pull request Mar 3, 2023
zcin pushed a commit that referenced this pull request Mar 3, 2023 (#32998)
amogkam added a commit that referenced this pull request Mar 7, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Development

Successfully merging this pull request may close these issues.

[Datasets] Improve user experience of zip()
4 participants