[Data][Split optimization] don't generate empty blocks #26768
Conversation
Signed-off-by: scv119 <[email protected]>
prev_index = index
return (block_id, split_result)

def _drop_empty_block_split(block_split_indices: List[int], num_rows: int) -> List[int]:
This seems to have implications for the public API Dataset.split_at_indices? It currently can generate empty blocks, but this will change that.
How about dropping only the ones at the start/end of a block?
I think I'm fine with completely dropping empty splits (unless they have a utility that I'm not aware of). In that case, we can just go to Dataset.split_at_indices and revamp the semantics there (we just need to de-dup the global indices and document that the API will do so).
Alternative: we do this internally as in this PR, but when returning from Dataset.split_at_indices we create empty splits to maintain the existing semantics?
@jianoaix ah, this is an internal change and does not change the split_at_indices public API. So what happens is: previously we would have datasets containing empty blocks (blocks with 0 rows); now we remove those blocks, but the number of (split) datasets is still the same.
Yeah, this PR does not change split_at_indices semantics.
Wondering whether it would be better to just embed this function's logic inside _generate_per_block_split_indices() (L70-72)? We could avoid adding these useless indices in the first place.
Seems fine, but could we add a unit test?
Thanks @scv119 for the optimization!
return _generate_global_split_results(all_blocks_split_results)

# first calculate the size for each split.
helper = [0] + valid_indices + [sum(block_rows)]
valid_indices are the user-provided indices without deduplication. The final blocks are generated based on valid_indices, so we will still generate empty blocks as before if the user provides duplicated indices. FYI @jianoaix and @matthewdeng.
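A toy reconstruction of the size computation quoted above (plain Python, not the actual Ray code) shows why duplicated user indices still yield zero-row splits:

```python
from typing import List


def split_sizes(valid_indices: List[int], total_rows: int) -> List[int]:
    # Pad the sorted user indices with 0 and the total row count, then take
    # pairwise differences: each difference is the size of one output split.
    helper = [0] + list(valid_indices) + [total_rows]
    return [helper[i + 1] - helper[i] for i in range(len(helper) - 1)]


# A duplicated index (3, 3) produces a zero-row (empty) split:
print(split_sizes([3, 3, 7], 10))  # -> [3, 0, 4, 3]
```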
verify_splits(splits, [[], []])

def test_private_split_at_indices(ray_start_regular_shared):
Note: we already have a very comprehensive test in test_split_at_indices; this one only covers edge cases and block distributions.
Signed-off-by: scv119 [email protected]
Why are these changes needed?
The current split_at_indices might generate empty blocks and also trigger unnecessary split tasks. Empty blocks occur when there are duplicate split indices, or when a split index falls on a block boundary. Unnecessary split tasks are triggered when a split index falls on a block boundary.
This PR fixes that by checking whether a split index is duplicated or falls on a block boundary; in those cases we can safely ignore the index.
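A minimal sketch of the idea (hypothetical helper name and logic, not the actual Ray internals): when mapping sorted global split indices onto per-block local offsets, indices that duplicate a previous one or land exactly on a block boundary can be skipped, so no empty block is created and no split task is scheduled for that boundary.

```python
from typing import List


def per_block_split_indices(block_rows: List[int], global_indices: List[int]) -> List[List[int]]:
    # Hypothetical sketch: for each block, collect the local split offsets,
    # skipping duplicates and offsets that fall on a block boundary
    # (local offset 0 or the block's row count), which would only
    # create empty pieces.
    result: List[List[int]] = []
    offset = 0
    i = 0
    for rows in block_rows:
        local: List[int] = []
        while i < len(global_indices) and global_indices[i] < offset + rows:
            local_index = global_indices[i] - offset
            i += 1
            if local_index <= 0:
                continue  # falls on the left boundary of this block
            if local and local[-1] == local_index:
                continue  # duplicated index
            local.append(local_index)
        result.append(local)
        offset += rows
    return result


# Global index 3 sits exactly on the boundary between the two blocks,
# so neither block needs a split task for it; only index 5 splits block 2:
print(per_block_split_indices([3, 4], [3, 5]))  # -> [[], [2]]
```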
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.