
[Datasets] Iterating through DatasetPipeline fails with ZeroDivisionError #31505

Closed
amogkam opened this issue Jan 6, 2023 · 2 comments · Fixed by #31513
Labels: bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P1 (Issue that should be fixed within a few weeks), triage (Needs triage)

Comments

amogkam (Contributor) commented Jan 6, 2023

What happened + What you expected to happen

I expect the code snippet below not to fail.

Instead, it fails with the following error:

self = <ray.data.dataset_pipeline.DatasetPipeline.repeat.<locals>.RepeatIterator object at 0x16a801a30>

    def __next__(self) -> Dataset[T]:
        # Still going through the original pipeline.
        if self._original_iter:
            try:
                make_ds = next(self._original_iter)
                self._results.append(make_ds)

                def gen():
                    res = make_ds()
                    res._set_epoch(0)
                    return res

                return gen
            except StopIteration:
                self._original_iter = None
                # Calculate the cursor limit.
                if times:
                    self._max_i = len(self._results) * (times - 1)
                else:
                    self._max_i = float("inf")
        # Going through a repeat of the pipeline.
        if self._i < self._max_i:
>           make_ds = self._results[self._i % len(self._results)]
E           ZeroDivisionError: integer division or modulo by zero

Versions / Dependencies

master

Reproduction script

import ray

pipe = ray.data.range(6, parallelism=6).window(blocks_per_window=2).repeat()
assert pipe.schema() == int
pipe = pipe.map_batches(lambda x: x)
next(pipe.iter_epochs())

Issue Severity

High: It blocks me from completing my task.

@amogkam amogkam added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 6, 2023
@amogkam amogkam added the data Ray Data-related issues label Jan 6, 2023
@jianoaix jianoaix added the P1 Issue that should be fixed within a few weeks label Jan 7, 2023
@jianoaix jianoaix changed the title [Data] Iterating through DatasetPipeline fails with ZeroDivisionError [Datasets] Iterating through DatasetPipeline fails with ZeroDivisionError Jan 7, 2023
jianoaix (Contributor) commented Jan 7, 2023

A mitigation is to switch the order of pipe.schema() and pipe.map_batches():

import ray

pipe = ray.data.range(6, parallelism=6).window(blocks_per_window=2).repeat()
pipe = pipe.map_batches(lambda x: x)
assert pipe.schema() == int
next(pipe.iter_epochs())

The issue seems to be that peek() (used by pipe.schema()) advances the base_iterable of the pipeline, which is then reused to create another pipeline (via pipe.map_batches()). The derived pipeline therefore starts from an already-consumed iterator.
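The failure mode can be shown with a minimal pure-Python sketch (not Ray code; the names here are illustrative stand-ins for the pipeline's base_iterable of dataset factories):

```python
# A naive peek calls next() on the shared iterator, which permanently
# consumes its first element. Any pipeline later built from the same
# iterator starts out one dataset short.

def make_base_iterable():
    # Stands in for DatasetPipeline's base_iterable of dataset factories.
    return iter([lambda: "window-0", lambda: "window-1", lambda: "window-2"])

base = make_base_iterable()

# Peek at the first element to inspect it (analogous to schema()).
first = next(base)          # the shared iterator has now been advanced

# A derived pipeline (analogous to map_batches) built from the same
# iterator no longer sees the peeked element:
remaining = list(base)
print(len(remaining))       # 2, not 3 -- the first window is lost
```

In the repeat() case the loss is compounded: once the original iterator is exhausted early, self._results can end up empty, and the modulo on len(self._results) raises the ZeroDivisionError shown above.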

amogkam (Contributor, author) commented Jan 7, 2023

In this case, we're using the schema to determine which batch format to use for the map_batches call, so I won't be able to switch the order.

amogkam added a commit that referenced this issue Jan 18, 2023
Signed-off-by: amogkam [email protected]

Closes #31505.

When peeking a DatasetPipeline via .schema() for example, the first dataset in the base iterator is consumed. Then when chaining new operations on the pipeline, such as a map_batches, the dataset that was peeked is lost.

In this PR, we change the implementation of peek to not consume the base iterable, but rather create a new iterable consisting of just the first dataset.
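One general way to implement such a non-consuming peek in plain Python is to re-chain the peeked element in front of the remaining iterator. This is only a sketch of the technique, not the actual code from the PR:

```python
import itertools

def peek(iterable):
    """Return (first_element, equivalent_iterable) without losing anything."""
    it = iter(iterable)
    first = next(it)
    # Put the peeked element back in front of the rest, so consumers of
    # the returned iterable still see the full sequence.
    return first, itertools.chain([first], it)

base = iter(["window-0", "window-1", "window-2"])
first, base = peek(base)
print(first)            # window-0
print(list(base))       # ['window-0', 'window-1', 'window-2'] -- nothing lost
```

The caller must adopt the rebuilt iterable returned by peek(); the original iterator has still been advanced, so the key design point is that peek() hands back a replacement rather than mutating state silently.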
andreapiso pushed a commit to andreapiso/ray that referenced this issue Jan 22, 2023