Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Don't drop first dataset when peeking DatasetPipeline #31513

Merged
merged 16 commits into from
Jan 18, 2023

Conversation

amogkam
Copy link
Contributor

@amogkam amogkam commented Jan 7, 2023

Signed-off-by: amogkam [email protected]

Closes #31505.

When peeking a DatasetPipeline via .schema() for example, the first dataset in the base iterator is consumed. Then when chaining new operations on the pipeline, such as a map_batches, the dataset that was peeked is lost.

In this PR, we change the implementation of peek to not consume the base iterable, but rather create a new iterable consisting of just the first dataset.

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: amogkam <[email protected]>
Copy link
Contributor

@jianoaix jianoaix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks good to me, just the first Dataset will get executed twice if users peek() and then iter_datasets(), but probably not a big efficiency loss.

python/ray/data/dataset_pipeline.py Outdated Show resolved Hide resolved
python/ray/data/dataset_pipeline.py Show resolved Hide resolved
Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
@amogkam
Copy link
Contributor Author

amogkam commented Jan 9, 2023

Thanks @jianoaix!

Regarding

the first Dataset will get executed twice if users peek() and then iter_datasets()

I updated the PR to add this back to retain old behavior. The cached peeked dataset will be used whenever possible, unless new transformations are applied.

@amogkam amogkam requested a review from jianoaix January 9, 2023 21:54
Signed-off-by: amogkam <[email protected]>
Copy link
Contributor

@jianoaix jianoaix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

# We re-use the saved _first_dataset and _remaining_dataset_iter
if self._first_dataset is not None:

class _IterableWrapper(Iterable):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this wrapping needed since iterator itself has iter to return itself?

@amogkam
Copy link
Contributor Author

amogkam commented Jan 10, 2023

Need to tweak this PR a bit...cannot create new Pipelines since the stats are not carried over to the current pipeline.

Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
@jianoaix jianoaix added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 11, 2023
Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
@amogkam
Copy link
Contributor Author

amogkam commented Jan 18, 2023

Failing test is also failing on master...going to merge.

@amogkam amogkam removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 18, 2023
@amogkam amogkam merged commit 6053d93 into ray-project:master Jan 18, 2023
@amogkam amogkam deleted the fix-pipeline-schema-map-batches branch January 18, 2023 01:29
andreapiso pushed a commit to andreapiso/ray that referenced this pull request Jan 22, 2023
…roject#31513)

Signed-off-by: amogkam [email protected]

Closes ray-project#31505.

When peeking a DatasetPipeline via .schema() for example, the first dataset in the base iterator is consumed. Then when chaining new operations on the pipeline, such as a map_batches, the dataset that was peeked is lost.

In this PR, we change the implementation of peek to not consume the base iterable, but rather create a new iterable consisting of just the first dataset.

Signed-off-by: Andrea Pisoni <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Datasets] Iterating through DatasetPipeline fails with ZeroDivisionError
3 participants