[data] randomize_block_order() not compatible with stage fusion #26057
Comments
I'm not sure about the best way to fix this. One approach would be to define a new kind of stage that both OneToOne and AllToAll stages could support fusing with. Fusing with this stage would just require the stage to call cc @clarkzinzow @jianoaix for any thoughts.
If the goal of using randomize_block_order() is to approximate a row-level shuffle at lower cost, we may switch the order of the two in optimization: ds.randomize_block_order().map_batches() is equivalent to ds.map_batches().randomize_block_order() (the operations commute), which enables the OneToOne ops to fuse. If the goal is to randomize the placement of blocks onto nodes for some operation (e.g. read), it seems it has to come before that operation. But in this case:
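The commutativity claim in the comment above can be checked on a toy model. This is only a sketch over plain Python lists, not Ray code: the functions below borrow the names randomize_block_order and map_batches for readability, and the explicit seed parameter is an assumption made so that both orderings use the same permutation.

```python
import random

def randomize_block_order(blocks, seed):
    # Toy stand-in: permute whole blocks, never the rows inside a block.
    order = list(range(len(blocks)))
    random.Random(seed).shuffle(order)
    return [blocks[i] for i in order]

def map_batches(blocks, fn):
    # Toy one-to-one transform: each output block depends on one input block.
    return [fn(block) for block in blocks]

def double(block):
    return [x * 2 for x in block]

blocks = [[0, 1], [2, 3], [4, 5], [6, 7]]

shuffle_then_map = map_batches(randomize_block_order(blocks, seed=0), double)
map_then_shuffle = randomize_block_order(map_batches(blocks, double), seed=0)

# The permutation depends only on the block count and the seed,
# so the two orderings produce the same result.
assert shuffle_then_map == map_then_shuffle
```

The equality only holds because the transform is one-to-one per block and the permutation is fixed. It says nothing about where blocks are placed or when they are read, which is the concern raised for the hotspot and pipelining cases.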
It's relevant for pipelining too, so ideally you push the reorder as early as possible (all the way into the read stage if possible). Otherwise, the pipeline stages won't access blocks in random order. I don't think it's efficient/correct to move it to a later stage.
Another option is to mutate the dataset's Blocklist by reordering the blocks in place. That would be the simplest, but it breaks the model a little.
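A rough sketch of what the in-place option could look like, and why it bends the model. ToyBlockList and its randomize_block_order method are hypothetical names for illustration, not the actual Ray classes.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ToyBlockList:
    # Hypothetical stand-in for a dataset's list of blocks.
    blocks: List[list] = field(default_factory=list)

    def randomize_block_order(self, seed: Optional[int] = None) -> None:
        # Reorder in place: no new stage is created, so nothing
        # sits between the read and the following one-to-one stages.
        random.Random(seed).shuffle(self.blocks)

original = ToyBlockList([[0, 1], [2, 3], [4, 5], [6, 7]])
alias = original  # another holder of the same block list

before = [list(b) for b in original.blocks]
original.randomize_block_order(seed=0)

# Same blocks, possibly in a different order...
assert sorted(original.blocks) == sorted(before)
# ...but every holder of the block list observes the mutation.
assert alias.blocks is original.blocks
```

The second assertion is the drawback: anything that shares the block list sees the new order, which is at odds with treating each transformation as producing a new dataset.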
Agreed with @jianoaix in general.
@ericl If using the API for an efficient block-level random shuffle (not for avoiding hotspots), then pushing the For the hotspot avoidance use case, I still think that pushing the |
It makes sense. I just want to make sure it also works for |
A few options come to mind:
Hmm, shouldn't that happen automatically? The source blocks for the windowed pipeline will either be materialized or be only a read stage, so
ray/python/ray/data/dataset.py, lines 3146 to 3153 in 95fe327
Right, for now window()/repeat() will materialize the dataset, so any op (except read) will not move across the border between the dataset transform and the pipeline transform.
That makes sense, but it's more that window does not materialize the dataset: that would defeat the purpose. Window can operate over lazy block lists.
What happened + What you expected to happen
Why are these changes needed?
randomize_block_order() breaks stage fusion, per #25870.
Versions / Dependencies
master
Reproduction script
ray.data.range(10).randomize_block_order().map(fn)
In the above script, you'd expect
read->map_batches
to be fused, but the randomize stage breaks this.
Issue Severity
No response
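The breakage in the reproduction script can be modeled with a toy fusion pass. This is a simplified sketch, not Ray's planner: the Stage class, the "reorder" kind, and the reorder_fusable flag are all assumptions for illustration, and the read is treated as a one-to-one stage. With reorder_fusable=False the randomize stage acts as a barrier; with reorder_fusable=True it behaves like the proposed stage kind that neighbors can fuse with.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Stage:
    name: str
    kind: str  # "one_to_one", "all_to_all", or "reorder" (hypothetical)

def fuse(stages: List[Stage], reorder_fusable: bool) -> List[str]:
    """Greedily merge adjacent stages and return the fused stage names."""
    fused: List[Stage] = []
    for stage in stages:
        prev = fused[-1] if fused else None
        if prev is not None and prev.kind == stage.kind == "one_to_one":
            merged_kind = "one_to_one"
        elif prev is not None and reorder_fusable and stage.kind == "reorder":
            merged_kind = prev.kind   # reorder is absorbed by the previous stage
        elif prev is not None and reorder_fusable and prev.kind == "reorder":
            merged_kind = stage.kind  # reorder is absorbed by the next stage
        else:
            fused.append(stage)       # nothing to fuse with: start a new stage
            continue
        fused[-1] = Stage(prev.name + "->" + stage.name, merged_kind)
    return [s.name for s in fused]

plan = [
    Stage("read", "one_to_one"),
    Stage("randomize_block_order", "reorder"),
    Stage("map", "one_to_one"),
]

broken = fuse(plan, reorder_fusable=False)
fixed = fuse(plan, reorder_fusable=True)

assert broken == ["read", "randomize_block_order", "map"]
assert fixed == ["read->randomize_block_order->map"]
```

In the first case the plan stays at three separate stages, which is the behavior reported here. In the second, the whole plan collapses into one fused stage.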