[Data] Don't drop first dataset when peeking DatasetPipeline #31513
Conversation
Signed-off-by: amogkam <[email protected]>
The fix looks good to me. Just note that the first Dataset will get executed twice if users call peek() and then iter_datasets(), but that's probably not a big efficiency loss.
Thanks @jianoaix! Regarding the point above: I updated the PR to add this back to retain the old behavior. The cached peeked dataset will be used whenever possible, unless new transformations are applied.
LGTM!
# We re-use the saved _first_dataset and _remaining_dataset_iter
if self._first_dataset is not None:
    ...

class _IterableWrapper(Iterable):
Is this wrapping needed, since an iterator's __iter__ already returns itself?
Need to tweak this PR a bit... we cannot create new Pipelines, since the stats are not carried over to the current pipeline.
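On the wrapping question: an iterator is itself iterable, but a wrapper class can present the same source as an Iterable object. A minimal sketch of the idea (the names here are illustrative, not Ray's actual _IterableWrapper implementation):

```python
from collections.abc import Iterable, Iterator


class IterableWrapper(Iterable):
    """Wraps an iterator so it can be passed around as an Iterable.

    Note: iteration state is shared with the wrapped iterator, so the
    underlying iterator is still consumed only once overall.
    """

    def __init__(self, it: Iterator):
        self._it = it

    def __iter__(self) -> Iterator:
        # Delegate to the wrapped iterator rather than creating a fresh one.
        return self._it


it = iter([1, 2, 3])
wrapped = IterableWrapper(it)
print(next(iter(wrapped)))  # 1
print(list(wrapped))        # [2, 3] -- state is shared with the base iterator
```

This shows why the wrapper alone doesn't change consumption semantics: whether it is needed depends on whether callers require an Iterable type rather than a bare Iterator.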
…-schema-map-batches
The failing test is also failing on master... going to merge.
Signed-off-by: amogkam [email protected]

Closes #31505.

When peeking a DatasetPipeline via .schema(), for example, the first dataset in the base iterator is consumed. Then, when chaining new operations on the pipeline, such as a map_batches, the dataset that was peeked is lost. In this PR, we change the implementation of peek to not consume the base iterable, but rather create a new iterable consisting of just the first dataset.
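The caching behavior described above can be sketched as follows. This is a simplified stand-in (class and attribute names are illustrative, not Ray's actual DatasetPipeline code): peek() consumes the first item of the base iterator exactly once and caches it, and iteration re-chains the cached item in front of the remainder so nothing is dropped.

```python
import itertools
from collections.abc import Iterator


class PeekableIterator:
    """Sketch of the fix: peek() caches the first item instead of
    dropping it from the underlying stream."""

    def __init__(self, base: Iterator):
        self._base = base
        self._first = None
        self._peeked = False

    def peek(self):
        # Consume the first item from the base iterator only once; reuse
        # the cached value on subsequent peeks.
        if not self._peeked:
            self._first = next(self._base)
            self._peeked = True
        return self._first

    def __iter__(self) -> Iterator:
        # Re-chain the cached first item so downstream consumers still
        # see the full sequence.
        if self._peeked:
            return itertools.chain([self._first], self._base)
        return self._base


pipeline = PeekableIterator(iter([10, 20, 30]))
print(pipeline.peek())  # 10
print(list(pipeline))   # [10, 20, 30] -- the peeked item is not dropped
```

As noted in review, the peeked item is still produced twice if the real pipeline re-executes the first dataset, which is the minor efficiency cost discussed above.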
Why are these changes needed?

Related issue number

Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.