[Data][Split optimization] don't generate empty blocks #26768
Conversation
Signed-off-by: scv119 <[email protected]>
prev_index = index
return (block_id, split_result)

def _drop_empty_block_split(block_split_indices: List[int], num_rows: int) -> List[int]:
This seems to have implications for the public API Dataset.split_at_indices? It currently can generate empty blocks, but this will change that.
How about dropping only the ones at the start/end of a block?
I think I'm fine with completely dropping empty splits (unless they have a utility that I'm not aware of). In that case, we can just go to Dataset.split_at_indices and revamp the semantics there (we just need to de-dup the global indices and document that the API will do so).
Alternative: we do this internally as in this PR, but when returning from Dataset.split_at_indices we create empty splits to maintain the existing semantics?
@jianoaix ah, this is an internal change and does not change the split_at_indices public API. So what happens is: previously we would have datasets containing empty blocks (blocks with 0 rows); now we remove those blocks, but the number of (split) datasets is still the same.
Yeah, this PR does not change split_at_indices semantics.
Wondering whether it would be better to just embed this function's logic inside _generate_per_block_split_indices() (L70-72)? We could avoid adding these useless indices in the first place.
Seems fine, but could we add a unit test?
Thanks @scv119 for the optimization!
return _generate_global_split_results(all_blocks_split_results)

# first calculate the size for each split.
helper = [0] + valid_indices + [sum(block_rows)]
valid_indices are the user-provided indices without deduplication. The final blocks are generated based on valid_indices, so we will still generate empty blocks as before if the user provides duplicated indices. FYI @jianoaix and @matthewdeng.
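A toy reconstruction of the size computation quoted above (plain Python, not the actual Ray code) shows why duplicated user indices still yield zero-row splits:

```python
from typing import List


def split_sizes(valid_indices: List[int], total_rows: int) -> List[int]:
    # Pad the sorted user indices with 0 and the total row count, then take
    # pairwise differences: each difference is the size of one output split.
    helper = [0] + list(valid_indices) + [total_rows]
    return [helper[i + 1] - helper[i] for i in range(len(helper) - 1)]


# A duplicated index (3, 3) produces a zero-row (empty) split:
print(split_sizes([3, 3, 7], 10))  # -> [3, 0, 4, 3]
```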
verify_splits(splits, [[], []])

def test_private_split_at_indices(ray_start_regular_shared):
Note: we already have a very comprehensive test in test_split_at_indices; this one only covers edge cases and block distributions.
Signed-off-by: scv119 [email protected]
Why are these changes needed?
The current split_at_indices might generate empty blocks and also trigger unnecessary split tasks. Empty blocks occur when there are duplicate split indices, or when a split index falls on a block boundary. Unnecessary split tasks are triggered when a split index falls on a block boundary.
This PR fixes that by checking whether a split index is duplicated or falls on a block boundary; in those cases we can safely ignore the index.
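A minimal sketch of the idea (hypothetical helper name and logic, not the actual Ray internals): when mapping sorted global split indices onto per-block local offsets, indices that duplicate a previous one or land exactly on a block boundary can be skipped, so no empty block is created and no split task is scheduled for that boundary.

```python
from typing import List


def per_block_split_indices(block_rows: List[int], global_indices: List[int]) -> List[List[int]]:
    # Hypothetical sketch: for each block, collect the local split offsets,
    # skipping duplicates and offsets that fall on a block boundary
    # (local offset 0 or the block's row count), which would only
    # create empty pieces.
    result: List[List[int]] = []
    offset = 0
    i = 0
    for rows in block_rows:
        local: List[int] = []
        while i < len(global_indices) and global_indices[i] < offset + rows:
            local_index = global_indices[i] - offset
            i += 1
            if local_index <= 0:
                continue  # falls on the left boundary of this block
            if local and local[-1] == local_index:
                continue  # duplicated index
            local.append(local_index)
        result.append(local)
        offset += rows
    return result


# Global index 3 sits exactly on the boundary between the two blocks,
# so neither block needs a split task for it; only index 5 splits block 2:
print(per_block_split_indices([3, 4], [3, 5]))  # -> [[], [2]]
```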
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.