[data] The iter_batch default batch size should be block size #32004
Conversation
Signed-off-by: Eric Liang <[email protected]>
Unlike with map_batches(), there is no advantage in setting the batch size of iter_batches() to 256 by default. It only causes extra buffering and performance overhead.
Hmm, if the batch is copied into the worker heap (as is common for certain data types and batch formats) and the consumer is bottlenecked by worker heap memory, this could definitely matter, and it is why we have a default batch size. Could you expand on why this isn't a concern?
@@ -2786,7 +2786,7 @@ def iter_batches(
         self,
         *,
         prefetch_blocks: int = 0,
-        batch_size: Optional[int] = 256,
+        batch_size: Optional[int] = None,
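For context, a minimal sketch of what the new default means from a caller's point of view (assuming the Ray Data convention that `batch_size=None` yields batches matching the underlying block size; the dataset and sizes here are made up):

```python
import ray

ds = ray.data.range(10_000)  # small illustrative dataset

# New default (batch_size=None): batches follow the underlying block size,
# so no extra re-batching buffer is needed.
for batch in ds.iter_batches():
    pass  # consume the block-sized batch

# Callers that depend on a fixed batch size can still request one explicitly.
for batch in ds.iter_batches(batch_size=256):
    pass  # consume a 256-row batch
```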
I think the batch_size used at iteration affects the model, so block size may not be a good default (folks with more ML context can correct this).
The other thing is that we tie the shuffle size to batch_size, so it can impact performance as well.
Isn't the entire block copied into the heap already? In that case, converting into batches of non-block size can add unexpected delays and conversion overheads. Btw, this is for the raw iter_batches() API only. The ML-specific iterators / DatasetIterator still specify a fixed batch size.
No, typically we have zero-copy access to the block's data buffers in the object store, then we perform a zero-copy slice to get the batches (data buffers are still in the object store), and only then do we do the format conversion on the (potentially) much smaller batch. E.g. for creating Pandas DataFrame batches off of Arrow blocks, which can involve a 10x+ inflation due to the format conversion, converting a large block vs. a small batch can be the difference between OOMing and not OOMing.
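To illustrate the point above, here is a sketch using pyarrow directly (not Ray internals; the table contents are made up): slicing an Arrow table is zero-copy, so only the small batch pays the format-conversion cost.

```python
import pyarrow as pa

# Stand-in for a block: an Arrow table whose buffers could live in the object store.
block = pa.table({"x": list(range(1_000_000))})

batch = block.slice(0, 256)  # zero-copy view of the first 256 rows
df = batch.to_pandas()       # the (potentially inflating) conversion touches only the small batch
```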
That eliminates the risk for users of those APIs, but what about users that are using the raw iter_batches() API?
I don't really buy this, since the majority of memory usage comes from map_batches(), which has a much larger default batch size of 4096. The driver OOMing sounds a bit far-fetched given you typically have only one of these, and it will be fetching one block at a time. In other words, the driver bottleneck is more likely to be CPU than memory, since there is just one driver compared to many map workers.
Isn't memory stability mostly about the map workers rather than the driver process?
Discussed offline: let's instead prioritize the async/thread-pool based batch conversion here: #31911. That should give us the best of both worlds: a predictable batch size and high-performance iteration by default.
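For readers unfamiliar with the idea, a minimal sketch of thread-pool based batch conversion (an illustration of the general technique, not the design in #31911): the next batch is converted on a background thread while the consumer processes the current one, hiding the conversion cost.

```python
from concurrent.futures import ThreadPoolExecutor

def iter_converted(batches, convert):
    """Yield convert(batch) for each batch, converting one batch ahead of the consumer."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = None
        for batch in batches:
            next_future = pool.submit(convert, batch)  # start converting the next batch
            if future is not None:
                yield future.result()                  # hand the previous batch to the consumer
            future = next_future
        if future is not None:
            yield future.result()                      # flush the last in-flight batch
```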