Lay the groundwork for lazy dataset optimization (no behavior changes) #22233
Conversation
Two high-level comments:
We can debate it, but having lazy as an optimization only may be more usable. cc @clarkzinzow on thoughts.
It's not really related. GroupBy only specifies half of an operation; the second half is the aggregation.
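For concreteness, a small illustration of that point (exact API details may vary by Ray version):

```python
import pandas as pd
import ray

ray.init()
df = pd.DataFrame({"key": [1, 1, 2, 2, 3], "value": [10, 20, 30, 40, 50]})
ds = ray.data.from_pandas(df)

grouped = ds.groupby("key")    # only specifies the grouping; nothing runs yet
result = grouped.sum("value")  # the aggregation completes the operation
```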
A few high-level things:
What are your thoughts on these, especially (1)?
That's right, it currently is limited, but there's nothing stopping us from optimizing this within the shuffle operation itself. We'd have to go and optimize these separately.
For OneToOneOp, we can implement this in the common compute layer. For AllToAllOp, we'd need to implement the moving within the specific op implementations (like do_shuffle()). Each op can assume it can take ownership of its input.
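As a rough sketch of that split of responsibilities (the class and method bodies here are illustrative placeholders, not the actual internals):

```python
class OneToOneOp:
    """Block -> block transform; the common compute layer handles the move."""

    def __call__(self, block):
        # The compute layer can pass `block` in and drop its own reference
        # afterwards, so the input block becomes reclaimable as soon as the
        # output block is produced.
        ...


class AllToAllOp:
    """Fan-in transform (e.g. shuffle); the op owns its whole input."""

    def do_shuffle(self, input_blocks):
        # Because the op takes ownership of `input_blocks`, it is free to
        # drop each input block once it has been consumed, rather than
        # holding everything until the whole operation finishes.
        ...
```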
You can do it in ExecutionPlan.execute(), for example, fusing OneToOneOps if they have compatible compute parameters. It's also possible to fuse a OneToOneOp with an AllToAllOp if we extend the AllToAllOp API a bit.
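A hedged sketch of that second kind of fusion, i.e. pushing a one-to-one map into a following all-to-all stage; the `MapStage`/`ShuffleStage`/`pre_map` names are assumptions for illustration, not the real API:

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass
class MapStage:                 # stands in for a OneToOneOp
    block_fn: Callable


@dataclass
class ShuffleStage:             # stands in for an AllToAllOp
    pre_map: Optional[Callable] = None


def fuse_map_into_shuffle(map_stage: MapStage, shuffle_stage: ShuffleStage):
    # Run the map function inside the shuffle's map phase, so the two stages
    # execute as a single set of tasks instead of two.
    if shuffle_stage.pre_map is None:
        return replace(shuffle_stage, pre_map=map_stage.block_fn)
    return None  # already has a pre-map; leave the stages unfused
```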
Actually, the split is still lazy if it's done immediately on the base data. As an optimization, we could also keep it lazy if the plan only contains OneToOneOps.
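Purely as an illustration of a split taken on the base data (the path and shard count are placeholders):

```python
import ray

ray.init()
ds = ray.data.read_parquet("s3://example-bucket/data")  # placeholder path

# Splitting right after the read operates on the base blocks directly, so in
# the lazy mode discussed here no transform has to run before the split point.
shards = ds.split(4)  # e.g. one shard per trainer
```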
You can call ensure_executed() to materialize the intermediate copy, though yeah it would be a little bit of a manual operation.
So we'd do this manually for those ops, got it. 👍
Doing this manually for
Yep that makes sense. 👍
Ah, so this would be done at the block_list --> block_list level: walking the op graph, eagerly fusing op chains that have compatible compute parameters AND that don't change the number of blocks into a single new op, then applying that fused op. The common use case that I run through is
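A rough sketch of that eager fusion walk, using a hypothetical `Stage` type rather than the actual internals:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    block_fn: Callable              # block -> block transform
    compute: str                    # e.g. "tasks" or "actors"
    preserves_block_count: bool = True

    def fuse_with(self, other: "Stage") -> "Stage":
        fn1, fn2 = self.block_fn, other.block_fn
        # Compose the two block functions; the fused stage still preserves
        # the block count because both inputs did.
        return Stage(lambda block: fn2(fn1(block)), self.compute)


def fuse_stages(stages: List[Stage]) -> List[Stage]:
    """Fuse adjacent one-to-one stages with compatible compute parameters
    that don't change the number of blocks into single stages."""
    fused: List[Stage] = []
    for stage in stages:
        prev = fused[-1] if fused else None
        if (
            prev is not None
            and prev.compute == stage.compute
            and prev.preserves_block_count
            and stage.preserves_block_count
        ):
            fused[-1] = prev.fuse_with(stage)
        else:
            fused.append(stage)
    return fused
```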
I'm assuming that by "base data" you mean right after reading. If we make splitting lazy and splits are passed to multiple trainers, we'll need an actor to coordinate the execution of the dataset up to the split, right?
This can be done automatically if the executed dataset is cached and somehow attached to the stage, but the big issue there is that you're then indefinitely holding on to that materialized intermediate copy. For both branching and splitting, if we want to allow execution to still be lazy, I think that we'll need an actor to coordinate execution up to the branch/split in case the branches/splits are passed to Ray tasks/actors and attempt to trigger execution. We can force materialization for each for now, but I think that might result in poor UX.
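A minimal sketch of what such a coordinator actor could look like (hypothetical, not the actual implementation):

```python
import ray


@ray.remote
class SplitCoordinator:
    """Coordinates execution up to the split point: whichever consumer asks
    first triggers execution exactly once, and everyone shares the result."""

    def __init__(self, dataset, n: int):
        self._dataset = dataset
        self._n = n
        self._splits = None

    def get_split(self, i: int):
        if self._splits is None:
            # Execute the upstream plan on first request and cache the splits.
            self._splits = self._dataset.split(self._n)
        return self._splits[i]


# Possible usage from each trainer (rank and num_trainers are placeholders):
#   coordinator = SplitCoordinator.remote(ds, num_trainers)
#   shard = ray.get(coordinator.get_split.remote(rank))
```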
Yep this one is tricky. Basically I think the AllToAll API would look like
You'll need an actor if you want to avoid calling .cache() explicitly prior to the split point (to do the coordination at runtime instead of ahead of time). Though this does materialize it. Anyway, it's solved with DatasetPipeline, which does have a coordinator actor, so maybe this isn't too critical?
So to summarize, the general paradigm here is to support automatic block lifecycle optimization, move semantics, and task fusion for
Sure, as long as you think there's a sensible path forward if this does become an issue, we can consider the primary splitting use case, pipelined ML ingest, to be covered by DatasetPipeline.
Great work and I look forward to seeing it merged! Just one remaining comment :)
Should we just do a regular #api-changes review?
@zhe-thoughts that's only for core APIs
@clarkzinzow this is ready for review (almost all tests are passing now) |
LGTM overall, a few nits, and a few questions around supporting lazy fan-in operations in follow-up work.
Why are these changes needed?
This PR refactors Dataset execution to enable lazy mode in the future, which can reduce memory usage in large-scale ingest pipelines. There should be no behavior changes in this PR. Many of the optimizations are also punted for future work.
Future work: