[Datasets] [WIP] Prototype wrapping operation-based lazy compute model. #22268
This PR contains a rough prototype of a lazy compute model for Datasets, implemented by wrapping the existing eagerly computed Datasets abstractions with lazy proxies. After switching to lazy mode (via a `ds.lazy()` call or a `lazy=True` arg to any dataset-creating API), all future operations are lazy, building up an operation graph that isn't executed until `.compute()` or a consuming method (e.g. `.show()`, `.iter_batches()`, `.to_torch()`, etc.) is called.

I'm pushing this PR up for reference; we will almost surely go with the operation-based lazy compute model in #22233, since it provides a clearer path to implementing (1) move semantics for blocks within operations, and (2) optimizations such as task fusion.
The big pros of this PR are:

- `Dataset`, `DatasetPipeline`, and `GroupedDataset` are essentially untouched; the only change to those abstractions is the addition of the `.lazy()` API. The fact that this prototype doesn't touch the core Datasets abstractions makes it very easy to add and later remove.
- `.split()` and `.split_at_indices()` are still lazy, and support automatic caching of the materialized pre-split dataset (pipeline) and triggering of execution by any of the consumers. This is done by parking the pre-split dataset (pipeline) in an actor, where all downstream operations of the split reference a split fetch to the actor; the first such fetch triggers execution of the dataset on the actor, after which the dataset is cached (see the sketch after this list). The lifetime of this actor is tied to the lifetime of the underlying dataset, so the cached dataset should be released once it's no longer needed by any of the splits.
- A `ds.cache()` method, which can be used before a branching computation, even if the branches are then passed to Ray tasks.

Hopefully we can pull a subset of these pros that makes sense into the other prototype.
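To make the actor-based split caching concrete, here is a rough, hypothetical sketch of the idea; `_SplitCoordinator`, `fetch_split`, and the surrounding wiring are invented for illustration and are not the names or code used in this PR.

```python
import ray

# Hypothetical sketch of the actor-based split caching described above.
# `_SplitCoordinator` is an invented name, not the prototype's actual class.
@ray.remote
class _SplitCoordinator:
    def __init__(self, build_dataset_fn, n_splits):
        self._build = build_dataset_fn  # lazy "recipe" for the pre-split dataset
        self._n_splits = n_splits
        self._splits = None             # cache of the materialized splits

    def fetch_split(self, i):
        # The first fetch triggers execution and caches the splits;
        # later fetches return the cached result.
        if self._splits is None:
            ds = self._build()
            self._splits = ds.split(self._n_splits)
        return self._splits[i]


# Downstream consumers reference split fetches against the actor; whichever
# consumer runs first triggers materialization of the pre-split dataset.
coordinator = _SplitCoordinator.remote(lambda: ray.data.range(100), 2)
left = ray.get(coordinator.fetch_split.remote(0))
right = ray.get(coordinator.fetch_split.remote(1))
print(left.count(), right.count())
```

Tying the actor's lifetime to the dataset (so the cache is released once no split needs it) is handled by the prototype itself and is not shown in this sketch.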