[Datasets] Change map_batches to fetch input blocks on-demand #29289

Conversation
It's kind of hard to add a unit test for this. @stephanie-wang and I tested together in the WIP release test for the AIR benchmark. After the nightly test gets merged, this will be covered there.
I think we can add a unit test like this:
This would be ideal, so we don't need to rely on nightly to catch such issues (nightly is too heavyweight). Seems like a good time to start adding some block splitting-specific tests anyway!
@stephanie-wang - makes sense, let me look into creating a unit test.
This is certainly a good fix for the read + map_batches fusion case! However, I think doing something like the following, with an outer iteration over the input blocks, would be a cleaner implementation:
def transform(
    blocks: Iterable[Block],
    batch_fn: BatchUDF,
    *fn_args,
    **fn_kwargs,
) -> Iterable[Block]:
    DatasetContext._set_current(context)
    output_buffer = BlockOutputBuffer(None, context.target_max_block_size)
    # Ensure that zero-copy batch views are copied so mutating UDFs don't error.
    batcher = Batcher(batch_size, ensure_copy=batch_size is not None)

    def process_next_batch() -> Iterator[Block]:
        batch = batcher.next_batch()
        # Convert to batch format.
        batch = BlockAccessor.for_block(batch).to_batch_format(batch_format)
        # Apply UDF.
        batch = batch_fn(batch, *fn_args, **fn_kwargs)
        if not (
            isinstance(batch, list)
            or isinstance(batch, pa.Table)
            or isinstance(batch, np.ndarray)
            or (
                isinstance(batch, dict)
                and all(isinstance(col, np.ndarray) for col in batch.values())
            )
            or isinstance(batch, pd.core.frame.DataFrame)
        ):
            raise ValueError(
                "The map batches UDF returned the value "
                f"{batch} of type {type(batch)}, "
                "which is not allowed. "
                f"The return type must be one of: {BatchType}"
            )
        # Add output batch to output buffer.
        output_buffer.add_batch(batch)
        if output_buffer.has_next():
            yield output_buffer.next()

    # Process batches for each block.
    for block in blocks:
        batcher.add(block)
        while batcher.has_batch():
            yield from process_next_batch()

    # Process any partial/remainder batches.
    batcher.done_adding()
    if batcher.has_any():
        yield from process_next_batch()

    # Yield partial/remainder blocks from finalized output buffer.
    output_buffer.finalize()
    if output_buffer.has_next():
        yield output_buffer.next()
This PR is ready for review, thanks @clarkzinzow, @stephanie-wang and @jianoaix.
Looks like a nice improvement!
python/ray/data/dataset.py (Outdated)

    batcher.done_adding()
    while batcher.has_any():

    def process_next_batch() -> Iterator[Block]:
For readability, it'd be clearer to pass batch in as an arg, so it reads like process_next_batch(batch) at the call site.
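For illustration, a minimal sketch of that shape, reusing the names from the snippet above (only the changed lines are shown; the actual update may differ):

def process_next_batch(batch: Block) -> Iterator[Block]:
    # The batch is now passed in explicitly instead of being pulled
    # from the enclosing batcher inside the function.
    batch = BlockAccessor.for_block(batch).to_batch_format(batch_format)
    ...

# Call sites then read as:
for block in blocks:
    batcher.add(block)
    while batcher.has_batch():
        yield from process_next_batch(batcher.next_batch())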
@jianoaix - no strong opinion given it's a short internal function. But updated.
# Data source generates multiple 1G random bytes data
class LargeBytesDatasource(Datasource):
    def prepare_read(self, parallelism):
This method is deprecated, prefer using create_reader().
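For reference, a rough sketch of what a create_reader()-based version of this test datasource could look like. The block count, block size, and the random-bytes read function here are illustrative assumptions, not the exact code in this PR:

from typing import List, Optional

import numpy as np

from ray.data.block import BlockMetadata
from ray.data.datasource import Datasource, Reader, ReadTask


class LargeBytesReader(Reader):
    """Emits one read task per block; each task produces ~1 GiB of random bytes."""

    def __init__(self, num_blocks: int = 20, block_size: int = 1024**3):
        self._num_blocks = num_blocks
        self._block_size = block_size

    def estimate_inmemory_data_size(self) -> Optional[int]:
        return self._num_blocks * self._block_size

    def get_read_tasks(self, parallelism: int) -> List[ReadTask]:
        metadata = BlockMetadata(
            num_rows=1,
            size_bytes=self._block_size,
            schema=None,
            input_files=None,
            exec_stats=None,
        )
        # Each read task lazily materializes a single block containing one
        # large random-bytes row, so data is only generated when the task runs.
        return [
            ReadTask(
                lambda size=self._block_size: [[np.random.bytes(size)]],
                metadata,
            )
            for _ in range(self._num_blocks)
        ]


class LargeBytesDatasource(Datasource):
    def create_reader(self, **read_args) -> Reader:
        return LargeBytesReader(**read_args)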
@jianoaix - yeah, forgot about it, updated.
The PR is ready for review again. Let me know if there are any more comments, otherwise it's ready for merge. Thanks! @stephanie-wang, @jianoaix and @clarkzinzow.
    parallelism=1,
)

ds = ds.map_batches(foo, batch_size=None)
Actually do you need to set a memory limit for the cluster? How does this make sure it will OOM without block splitting?
After talking to the Core folks, there is no easy way to set a memory limit for the cluster in a CI unit test right now. As you can see here, we already process 20G of data in one task. We plan to enable dynamic block splitting by default in 2.2, so I am not sure it's in our best interest to chase down getting an OOM without block splitting.
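Putting the fragments above together, the shape of the test is roughly the following. This is a sketch under the assumptions that the datasource emits 20 blocks of ~1 GiB each and that foo is an identity UDF; the test and fixture names are illustrative, not the exact code in this PR:

import ray


def test_map_batches_block_fetching(ray_start_regular_shared):
    # With parallelism=1, all ~20 GiB of input blocks flow through a single
    # map_batches task. Without on-demand fetching, every input block would
    # be buffered before the first batch is produced.
    ds = ray.data.read_datasource(
        LargeBytesDatasource(),
        parallelism=1,
    )

    def foo(batch):
        # Identity UDF; the test exercises block iteration, not the transform.
        return batch

    ds = ds.map_batches(foo, batch_size=None)
    assert ds.count() == 20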
LGTM!
Why are these changes needed?
This fixes an issue we found during the AIR benchmark. When map_batches has multiple input blocks (which can happen when dynamic block splitting is enabled by default, or when multiple input blocks are coalesced together), we previously always fetched and buffered all input blocks before producing the first batch. This is especially bad for dynamic block splitting, because it essentially buffers all of the split blocks in memory again. So in this PR, map_batches is changed to fetch and buffer input blocks on-demand, i.e. only fetch blocks when needed to construct the next required batch.

Related issue number
Checks

I've signed off every commit (by using git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.