[Data] Don't drop first dataset when peeking `DatasetPipeline` #31513

amogkam · 2023-01-07T03:39:19Z

Signed-off-by: amogkam [email protected]

Why are these changes needed?

When a DatasetPipeline is peeked, for example via .schema(), the first dataset in the base iterator is consumed. If new operations such as map_batches are then chained onto the pipeline, the dataset that was peeked is lost.

In this PR, we change the implementation of peek so that it does not consume the base iterable, but instead creates a new iterable consisting of just the first dataset.
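
The problem and the fix described above can be illustrated with a toy model in plain Python. This is a minimal sketch, not Ray's implementation: ToyPipeline, its methods, and the list-of-ints "datasets" are hypothetical stand-ins, and the sketch assumes the base iterable can be iterated more than once.

```python
from typing import Callable, Iterable, Iterator, List

Dataset = List[int]  # hypothetical stand-in for a real dataset


class ToyPipeline:
    """Toy model only; assumes `base` can be iterated more than once."""

    def __init__(self, base: Iterable[Dataset]):
        self._base = base

    def peek_consuming(self) -> Dataset:
        # Old behavior: advance the base iterator itself, so the first
        # dataset is no longer available to later operations.
        self._base = iter(self._base)
        return next(self._base)

    def peek(self) -> Dataset:
        # New behavior: read from a fresh iterator and leave the base alone.
        return next(iter(self._base))

    def map(self, fn: Callable[[int], int]) -> "ToyPipeline":
        # Chain a new operation on top of whatever is left in the base.
        return ToyPipeline([fn(x) for x in ds] for ds in self._base)

    def iter_datasets(self) -> Iterator[Dataset]:
        return iter(self._base)


old = ToyPipeline([[1, 2], [3, 4]])
old.peek_consuming()
lost = list(old.map(lambda x: x * 10).iter_datasets())  # [[30, 40]]

new = ToyPipeline([[1, 2], [3, 4]])
new.peek()
kept = list(new.map(lambda x: x * 10).iter_datasets())  # [[10, 20], [30, 40]]
```

With the consuming peek, the chained map never sees the first dataset; with the non-consuming peek, it sees all of them.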

Related issue number

Closes #31505.

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

jianoaix

The fix looks good to me. Just note that the first Dataset will get executed twice if users peek() and then iter_datasets(), but that is probably not a big efficiency loss.

amogkam · 2023-01-09T21:54:39Z

Thanks @jianoaix!

Regarding

> the first Dataset will get executed twice if users peek() and then iter_datasets()

I updated the PR to add the caching back and retain the old behavior. The cached peeked dataset is reused whenever possible, unless new transformations are applied.
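
As a rough illustration of that caching, here is a hedged sketch in plain Python. It is not the code in this PR: CachingPipeline, make_datasets, and the `executed` log are hypothetical, and the two attribute names only mirror those in the review snippet on dataset_pipeline.py. The sketch assumes the source can hand out a fresh iterator on every call.

```python
import itertools
from typing import Callable, Iterator, List, Optional

Dataset = List[int]  # hypothetical stand-in for a real dataset


class CachingPipeline:
    def __init__(self, make_datasets: Callable[[], Iterator[Dataset]]):
        # Assumption: each call to make_datasets() returns a fresh iterator.
        self._make_datasets = make_datasets
        self._first_dataset: Optional[Dataset] = None
        self._remaining_dataset_iter: Optional[Iterator[Dataset]] = None

    def peek(self) -> Dataset:
        # Compute the first dataset once and remember the rest of the iterator.
        if self._first_dataset is None:
            it = self._make_datasets()
            self._first_dataset = next(it)
            self._remaining_dataset_iter = it
        return self._first_dataset

    def iter_datasets(self) -> Iterator[Dataset]:
        # Reuse the cached first dataset if there is one, so it runs only once.
        if self._first_dataset is not None:
            first, rest = self._first_dataset, self._remaining_dataset_iter
            self._first_dataset = None
            self._remaining_dataset_iter = None
            return itertools.chain([first], rest)
        return self._make_datasets()

    def map(self, fn: Callable[[int], int]) -> "CachingPipeline":
        # A new transformation invalidates the cache: the new pipeline starts
        # with an empty cache and re-reads from the source.
        source = self._make_datasets
        return CachingPipeline(lambda: ([fn(x) for x in ds] for ds in source()))


executed: List[int] = []  # records every time a dataset is computed


def source() -> Iterator[Dataset]:
    for i in range(3):
        executed.append(i)
        yield [i, i + 1]


pipe = CachingPipeline(source)
pipe.peek()
all_datasets = list(pipe.iter_datasets())
runs_without_transform = list(executed)  # [0, 1, 2]: dataset 0 ran once
```

In this sketch, peeking and then calling map on the pipeline would compute dataset 0 a second time, which matches the "unless new transformations are applied" caveat.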

jianoaix

LGTM!

jianoaix · 2023-01-09T22:55:25Z

python/ray/data/dataset_pipeline.py

+        # We re-use the saved _first_dataset and _remaining_dataset_iter
+        if self._first_dataset is not None:
+
+            class _IterableWrapper(Iterable):


Is this wrapping needed, since an iterator itself has __iter__ that returns itself?
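
The question refers to Python's iterator protocol: an iterator's __iter__ returns the iterator itself, so an iterator can already be used wherever an iterable is expected. A small standalone check (not taken from the PR) shows this, along with the caveat that such an object can be traversed only once:

```python
from collections.abc import Iterable, Iterator

it = iter([1, 2, 3])

# An iterator is its own iterable: __iter__ returns the same object.
assert iter(it) is it
assert isinstance(it, Iterator) and isinstance(it, Iterable)

first_pass = list(it)   # [1, 2, 3]
second_pass = list(it)  # [] because the iterator is already exhausted
```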

amogkam · 2023-01-10T23:49:22Z

I need to tweak this PR a bit: we cannot create new pipelines, since the stats are not carried over to the current pipeline.

amogkam · 2023-01-18T01:29:10Z

The failing test is also failing on master, so I'm going to merge.

…roject#31513)

Signed-off-by: amogkam [email protected]

Closes ray-project#31505.

When peeking a DatasetPipeline via .schema() for example, the first dataset in the base iterator is consumed. Then when chaining new operations on the pipeline, such as a map_batches, the dataset that was peeked is lost.

In this PR, we change the implementation of peek to not consume the base iterable, but rather create a new iterable consisting of just the first dataset.

Signed-off-by: Andrea Pisoni <[email protected]>

fix

fd3616f

Signed-off-by: amogkam <[email protected]>

amogkam requested review from ericl, scv119, clarkzinzow, jjyao, jianoaix and c21 as code owners January 7, 2023 03:39

amogkam assigned jianoaix Jan 7, 2023

amogkam assigned clarkzinzow Jan 7, 2023

jianoaix reviewed Jan 9, 2023

amogkam added 2 commits January 9, 2023 13:49

update

94a7681

Signed-off-by: amogkam <[email protected]>

add comments

1b42df2

Signed-off-by: amogkam <[email protected]>

amogkam requested a review from jianoaix January 9, 2023 21:54

fix

b6ed7f5

Signed-off-by: amogkam <[email protected]>

jianoaix approved these changes Jan 9, 2023

amogkam added 2 commits January 10, 2023 17:08

handle stats

3f2ec86

Signed-off-by: amogkam <[email protected]>

update name

6ebbcc6

Signed-off-by: amogkam <[email protected]>

jianoaix added the @author-action-required label Jan 11, 2023

amogkam added 7 commits January 10, 2023 18:33

infinite recursion

ab9200d

Signed-off-by: amogkam <[email protected]>

fix

f341b24

Signed-off-by: amogkam <[email protected]>

fix

baf1008

Signed-off-by: amogkam <[email protected]>

fix

26a405a

Signed-off-by: amogkam <[email protected]>

update test

ce8cd1f

Signed-off-by: amogkam <[email protected]>

update

2a9111f

Signed-off-by: amogkam <[email protected]>

comment

9c0161d

Signed-off-by: amogkam <[email protected]>

jianoaix reviewed Jan 12, 2023

comment

8a648ef

Signed-off-by: amogkam <[email protected]>

jianoaix approved these changes Jan 13, 2023

amogkam added 2 commits January 13, 2023 14:44

Merge branch 'master' of github.com:ray-project/ray into fix-pipeline…

e240069

…-schema-map-batches

Merge branch 'master' of github.com:ray-project/ray into fix-pipeline…

b77e054

…-schema-map-batches

amogkam removed the @author-action-required label Jan 18, 2023

amogkam merged commit 6053d93 into ray-project:master Jan 18, 2023

amogkam deleted the fix-pipeline-schema-map-batches branch January 18, 2023 01:29
