Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data][Split optimization] don't generate empty blocks #26768

Merged
merged 4 commits into from
Jul 21, 2022

Conversation

scv119
Copy link
Contributor

@scv119 scv119 commented Jul 20, 2022

Signed-off-by: scv119 [email protected]

Why are these changes needed?

The current split_at_index might generate empty blocks and also trigger unnecessary split task. The empty blocks happens when there are duplicate split indices, or the split index falls at the block boundaries. The unnecessary split tasks are triggered when the split index falls at the block boundaries.

This PR fix that by checking if the split index is duplicated or falls at the boundaries of blocks. in that case, we could safely ignore those indices.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@scv119 scv119 changed the title [Data][Split optimize-1] don't split empty blocks [Data][Split optimization] don't split empty blocks Jul 20, 2022
Signed-off-by: scv119 <[email protected]>
@scv119 scv119 marked this pull request as ready for review July 20, 2022 10:23
@scv119 scv119 changed the title [Data][Split optimization] don't split empty blocks [Data][Split optimization] don't generate empty blocks Jul 20, 2022
prev_index = index
return (block_id, split_result)


def _drop_empty_block_split(block_split_indices: List[int], num_rows: int) -> List[int]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to have implication on public API Dataset.split_at_indices? It currently can generate empty blocks but this will change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about dropping only the ones for start/end of block?

Copy link
Contributor

@jianoaix jianoaix Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm fine with completely dropping empty splits (unless they have a utility that I'm not aware of). In that case, we can just go to Dataset.split_at_indices and revamp the semantics there (we just need to de-dup the global indices and document the API that it will do so).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative: we do this internally as this PR, but when returning Dataset.split_at_indices we create empty splits to maintain the existing semantics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jianoaix ah this is internal change but not change the split_at_indices public api. So what happens is previously we will have dataset who contains empty block (block with 0 rows), now we remove the block but the number of (split) dataset is still the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this PR does not change split_at_indices semantics.

Wondering would it be better just embed this function's logic inside _generate_per_block_split_indices() L70-72 ? We can just avoid adding these unuseful indices in the first place.

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine, but could we add a unit test?

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 20, 2022
Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @scv119 for the optimization!

return _generate_global_split_results(all_blocks_split_results)

# first calculate the size for each split.
helper = [0] + valid_indices + [sum(block_rows)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valid_indices are the user-provided indices without deduplication. The final blocks are generated based on valid_indices, so we will still generate empty blocks as before if user provides duplicated indices, FYI @jianoaix and @matthewdeng.

prev_index = index
return (block_id, split_result)


def _drop_empty_block_split(block_split_indices: List[int], num_rows: int) -> List[int]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this PR does not change split_at_indices semantics.

Wondering would it be better just embed this function's logic inside _generate_per_block_split_indices() L70-72 ? We can just avoid adding these unuseful indices in the first place.

python/ray/data/_internal/split.py Outdated Show resolved Hide resolved
python/ray/data/_internal/split.py Outdated Show resolved Hide resolved
python/ray/data/_internal/split.py Outdated Show resolved Hide resolved
Signed-off-by: scv119 <[email protected]>
Signed-off-by: scv119 <[email protected]>
@scv119 scv119 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 21, 2022
verify_splits(splits, [[], []])


def test_private_split_at_indices(ray_start_regular_shared):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note we have very comprehensive test for test_split_at_indices. this is only for edge cases and block distributions.

Signed-off-by: scv119 <[email protected]>
@scv119 scv119 merged commit de0d1fa into ray-project:master Jul 21, 2022
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
…6768)

The current split_at_index might generate empty blocks and also trigger unnecessary split task. The empty blocks happens when there are duplicate split indices, or the split index falls at the block boundaries. The unnecessary split tasks are triggered when the split index falls at the block boundaries.

This PR fix that by checking if the split index is duplicated or falls at the boundaries of blocks. in that case, we could safely ignore those indices.

Signed-off-by: Rohan138 <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…6768)

The current split_at_index might generate empty blocks and also trigger unnecessary split task. The empty blocks happens when there are duplicate split indices, or the split index falls at the block boundaries. The unnecessary split tasks are triggered when the split index falls at the block boundaries.

This PR fix that by checking if the split index is duplicated or falls at the boundaries of blocks. in that case, we could safely ignore those indices.

Signed-off-by: Stefan van der Kleij <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants