
[Data] Async batch fetching for map_batches #31576

Merged
merged 23 commits into ray-project:master from the async-batch-fetching branch on Jan 21, 2023

Conversation

@amogkam amogkam commented Jan 10, 2023

Signed-off-by: Amog Kamsetty [email protected]

Implements batch fetching in a separate thread for GPU UDFs in map_batches. This allows CPU-based batch fetching to be overlapped with the UDF computation.

prefetch_batches is added as an argument to map_batches. By default, it is set to 0.

We do not add it to DatasetContext because this functionality needs to be configured for each map_batches call independently, not globally for the entire dataset: a Dataset workflow might contain some transformations that run on GPU and others that run on CPU.
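As a rough illustration, a call might look like the following. This is a hedged sketch, not code from the PR: the UDF, dataset, batch size, and actor pool bounds are all made up, and only the prefetch_batches argument is what this PR adds.

import ray
from ray.data import ActorPoolStrategy

class Predictor:
    # Hypothetical GPU UDF; in practice this would run a model on the batch.
    def __call__(self, batch):
        return batch

ds = ray.data.range(10_000)
predictions = ds.map_batches(
    Predictor,
    batch_size=256,
    compute=ActorPoolStrategy(2, 4),
    num_gpus=1,
    # Fetch the next batch on a background thread while the UDF runs on the
    # current one (0, the default, disables prefetching).
    prefetch_batches=1,
)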

We see GPU prediction throughput increase from ~260 images/sec to ~300 images/sec:

No prefetching:

Total images 16232
Times for each stage:  {'read': 15.336565732955933, 'preprocess': 6.303653955459595, 'predict': 62.256098985672}
Throughput for each stage:  {'read': '1058.3855787948612 (img/sec)', 'preprocess': '2575.0144463341717 (img/sec)', 'predict': '260.72947493442746 (img/sec)'}
Total time:  83.89631867408752
Throughput 193.47690407080358 (img/sec)

With prefetching:

Total images 16232
Times for each stage:  {'read': 16.441548347473145, 'preprocess': 5.674700975418091, 'predict': 54.01595449447632}
Throughput for each stage:  {'read': '987.2549505043818 (img/sec)', 'preprocess': '2860.415036900528 (img/sec)', 'predict': '300.5038076603809 (img/sec)'}
Total time:  76.13220381736755
Throughput 213.20806683776962 (img/sec)

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
block_bundles = [((b,), (m,)) for b, m in blocks_in]
block_bundles: List[
Tuple[Tuple[ObjectRef[Block]], Tuple[BlockMetadata]]
] = [((b,), (m,)) for b, m in blocks_in]
amogkam (PR author) commented:

This change is necessary to get the performance improvements for batch prediction.

Before, we would only bundle blocks up to batch size and submit each bundle as a separate actor task. This means we cannot do prefetching when batch size is greater than block size since each bundle is a separate task.

Instead, if the max actor pool size is set, then we bundle up to min(batch size, max actor pool size).

amogkam (PR author) commented:

Hopefully, once we switch to a fully iterator-based implementation, these types of special cases will no longer be necessary.

Signed-off-by: amogkam <[email protected]>
Resolved review threads on python/ray/data/_internal/block_batching.py.
# always be less than this max_size.
# Otherwise, it leads to inefficiencies with creating extra actor tasks and
# prevents the actor task from doing optimizations such as batch or block prefetching.
if self.max_size and len(block_bundles) > self.max_size:
A contributor commented:

This code will become deprecated with new executor backend, cc @clarkzinzow.

@@ -121,6 +123,96 @@ def test_format_batches(batch_format):
assert isinstance(batch["foo"], np.ndarray)


def test_async_batch_fetching():
A contributor commented:
shall we add a test for map_batches as well?

amogkam (PR author) replied:
I tried it but there's too much time variance for a deterministic small-scale map_batches CI test. I'll confirm the performance improvements via running the batch inference release tests.

Resolved review thread on python/ray/train/batch_predictor.py.
Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
@amogkam amogkam requested a review from c21 January 11, 2023 21:21
c21 commented Jan 11, 2023

LGTM except one comment: #31576 (comment). cc @clarkzinzow.

@clarkzinzow clarkzinzow (Contributor) left a review:

LGTM overall, one big thing that we need to resolve is that the actor pool rebundling will break block ordering, which I don't think we'll want to do.

Resolved review threads on python/ray/data/_internal/block_batching.py and python/ray/data/_internal/compute.py.
if self.max_size and len(block_bundles) > self.max_size:

def chunkify(bundles: List, num_chunks: int):
    # Stratified slicing: chunk i takes every num_chunks-th bundle, starting at i.
    return [bundles[i::num_chunks] for i in range(num_chunks)]
@clarkzinzow clarkzinzow commented on Jan 13, 2023:
Just to make sure that I understand the motivation: this is giving us stratified chunking, where a given chunk consists of an equal number of blocks from each of the original bundles (modulo the number of chunks), right? Might be worth leaving a comment to that effect for those who are less familiar with this pattern.

Two potential issues with this chunking scheme:

  1. This breaks block and therefore row ordering; the previous block bundling and actor compute strategy made sure to preserve it. This doesn't matter for batch prediction workloads but may matter for other workloads that use the actor compute strategy.
  2. There are pathological cases of skewed blocks/bundles that could pop up. E.g. suppose we had bundles = [[1, 2, 3], [4, 5, 6], [7, 8, 9]] (pretend the numbers are block IDs) and num_chunks = 2, and suppose that blocks with odd IDs are much larger than blocks with even IDs; this chunking would produce bundles [[1, 3, 5, 7, 9], [2, 4, 6, 8]], where the first bundle is way, way larger than the second bundle (see the small illustration after this list).
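As a small, self-contained illustration of the interleaving described in (2), treating the nine block IDs as individual single-block bundles (the numbers are made up):

def chunkify(bundles, num_chunks):
    return [bundles[i::num_chunks] for i in range(num_chunks)]

print(chunkify([1, 2, 3, 4, 5, 6, 7, 8, 9], 2))
# -> [[1, 3, 5, 7, 9], [2, 4, 6, 8]]  -- block order is interleaved, not preserved.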

Could solve (1) by changing the rechunking to merge adjacent chunks without breaking ordering, but (2) would require rebundling while taking the block sizes into account. I think that (1) is probably a blocker but (2) is not, cc @matthewdeng @c21 for more opinions.

If we are only wanting to solve (1) for now, we could do the simple thing of merging adjacent bundles until we either (1) are at the specified number of chunks (max pool size), or (2) all would-be merged bundles exceed the max target block size threshold (currently 512 MiB by default).

Could do something like the following progressive merging of adjacent bundles, which should preserve block/row order:

from ray.data.context import DatasetContext

def rebundle_to_size(bundles: list, num_bundles: int):
    if len(bundles) <= num_bundles:
        # Already done.
        return bundles
    max_bundle_size = DatasetContext.get_current().target_max_block_size
    # Carry out multiple rounds of merging adjacent bundles, until we have scaled down
    # to num_bundles bundles, or we've stopped making merging progress.
    while len(bundles) > num_bundles:
        new_bundles = []
        num_merges = 0
        for i in range(len(bundles) // 2):
            left, right = bundles[2 * i], bundles[2 * i + 1]
            left_size = sum(meta.size_bytes for _, meta in left)
            right_size = sum(meta.size_bytes for _, meta in right)
            if left_size + right_size <= max_bundle_size:
                # Merging these bundles stays under the max bundle size, so we merge them.
                new_bundles.append(left + right)
                num_merges += 1
                if len(bundles) - num_merges == num_bundles:
                    # This merging round has already brought us to the requisite number
                    # of bundles, so we short-circuit.
                    break
            else:
                new_bundles.extend([left, right])
        if num_merges == 0:
            break
        # Add leftover bundles (due to an odd number of bundles or short-circuiting).
        for j in range(2 * i + 2, len(bundles)):
            new_bundles.append(bundles[j])
        bundles = new_bundles
    return bundles
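A quick way to sanity-check the sketch above, using stand-in blocks and a minimal metadata object carrying only size_bytes rather than real ObjectRef[Block] / BlockMetadata values (the names and sizes are made up):

from collections import namedtuple

Meta = namedtuple("Meta", ["size_bytes"])
# Nine single-block bundles, each a list of (block, metadata) pairs.
bundles = [[(f"block_{i}", Meta(size_bytes=1))] for i in range(9)]
merged = rebundle_to_size(bundles, num_bundles=2)
# Adjacent bundles are merged round by round, so the blocks come out in their
# original order: block_0, block_1, ..., block_8.
assert [b for bundle in merged for b, _ in bundle] == [f"block_{i}" for i in range(9)]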

Resolved review thread on python/ray/data/_internal/block_batching.py.
# always be less than this max_size.
# Otherwise, it leads to inefficiencies with creating extra actor tasks and
# prevents the actor task from doing optimizations
# such as batch or block prefetching.
@clarkzinzow clarkzinzow commented on Jan 14, 2023:

A more orthogonal, future-looking thought: a target bundle size that might serve better than the user-provided batch_size is probably something like the following:

target_size = min(
    max((prefetch_batches + 1) * batch_size_in_bytes, ctx.target_min_block_size),
    ctx.target_max_block_size,
)

I.e., we bundle up to at least ctx.target_min_block_size (default is 1 MiB), since that's what we consider to be the smallest "reasonable" block to make the task overhead worth it; if the data needed for the desired number of concurrent batches is larger than this (e.g. the batch size is larger than this and/or aggressive prefetching is specified), then we use that as the bundling target. And all of this is capped by ctx.target_max_block_size.
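A hedged worked example of the formula above, with made-up inputs (only the two context defaults, 1 MiB and 512 MiB, come from the comments themselves):

target_min_block_size = 1 * 1024**2      # 1 MiB (stated default)
target_max_block_size = 512 * 1024**2    # 512 MiB (stated default)
prefetch_batches = 1                     # hypothetical
batch_size_in_bytes = 4 * 1024**2        # hypothetical: 4 MiB per batch

target_size = min(
    max((prefetch_batches + 1) * batch_size_in_bytes, target_min_block_size),
    target_max_block_size,
)
# -> 8 MiB: enough data for two concurrent batches, above the 1 MiB floor and
#    well under the 512 MiB cap.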

We'd probably still have the max actor pool size serve as a cap on the number of block bundles as well, but I'd imagine that the initial actor pool size (i.e. actors started at the beginning of execution) and the scale-up rate would be influenced by the number of block bundles.

We should experiment with a few of these hints/policies in the new execution model, and try to ensure good performance with the default configuration. cc @ericl @c21

Signed-off-by: amogkam <[email protected]>
c21 commented Jan 19, 2023

@amogkam - can you rebase to latest master? It should fix the CI failures.

Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
@@ -480,6 +481,9 @@ def map_batches(
``pandas.DataFrame``, "pyarrow" to select ``pyarrow.Table``, or
``"numpy"`` to select ``numpy.ndarray`` for tensor datasets and
``Dict[str, numpy.ndarray]`` for tabular datasets. Default is "default".
prefetch_batches: The number of batches to fetch ahead of the current batch
A contributor commented:

When porting this to the new executor, we should try to consolidate prefetch_batches and prefetch_blocks into a single prefetch_batches argument, where we always prefetch enough blocks to satisfy prefetch_batches, which should be simple enough to implement since we have the size for each to-be-fetched block on hand.
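A rough sketch of how a batch-level prefetch budget could be translated into a block-level one when per-block row counts are known. This is not the PR's implementation; the function name, signature, and numbers are hypothetical.

from typing import List

def num_blocks_to_prefetch(block_num_rows: List[int], batch_size: int, prefetch_batches: int) -> int:
    # Walk the upcoming blocks until they cover prefetch_batches worth of rows.
    rows_needed = batch_size * prefetch_batches
    rows_covered = 0
    for i, num_rows in enumerate(block_num_rows):
        if rows_covered >= rows_needed:
            return i
        rows_covered += num_rows
    return len(block_num_rows)

# e.g. with five blocks of 100 rows each, batch_size=256, prefetch_batches=1:
print(num_blocks_to_prefetch([100] * 5, 256, 1))  # -> 3 blocks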

amogkam (PR author) replied:

yep +1!

Resolved review thread on python/ray/data/_internal/block_batching.py.
block_bundles = _bundle_blocks_up_to_size(
blocks_in, target_block_size, name
)
total_size = sum(metadata.num_rows for _, metadata in blocks_in)
A contributor commented:

metadata.num_rows could technically be None, but shouldn't happen in practice.

amogkam (PR author) replied:

updated to handle None the same way as _bundle_blocks_up_to_size

Resolved review threads on python/ray/data/_internal/compute.py.
Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: amogkam <[email protected]>
@clarkzinzow clarkzinzow (Contributor) left a comment:

LGTM! Nice work! 👏

Signed-off-by: amogkam <[email protected]>
@amogkam amogkam merged commit 789232e into ray-project:master Jan 21, 2023
@amogkam amogkam deleted the async-batch-fetching branch January 21, 2023 00:40
andreapiso pushed a commit to andreapiso/ray that referenced this pull request Jan 22, 2023
Signed-off-by: Amog Kamsetty [email protected]

Implements batch fetching in a separate thread for GPU UDFs in map_batches. This allows CPU-based batch fetching to be overlapped with the UDF computation.

prefetch_batches is added as an argument to map_batches. By default, it is set to 0.

We do not add it to DatasetContext because this functionality needs to be configured for each map_batches call independently, not globally for the entire dataset: a Dataset workflow might contain some transformations that run on GPU and others that run on CPU.

We see GPU prediction throughput increase from ~260 images/sec to ~300 images/sec.

Signed-off-by: Andrea Pisoni <[email protected]>