
[data] The iter_batch default batch size should be block size #32004

Closed
wants to merge 1 commit

Conversation

@ericl (Contributor) commented Jan 27, 2023

Signed-off-by: Eric Liang [email protected]

Why are these changes needed?

Unlike with map_batches(), there is no advantage in setting the batch size of iter_batches() to 256 by default. It only causes extra buffering and performance overhead by default.
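To make the claimed difference concrete, here is a minimal usage sketch (the dataset and sizes are made up for illustration; it relies on Ray Data's documented behavior that batch_size=None yields each block as a single batch):

```python
import ray

# Hypothetical dataset: 4 blocks of 1000 rows each (sizes chosen for illustration).
ds = ray.data.range(4000).repartition(4)

# Old default (batch_size=256): blocks are re-sliced and buffered into
# fixed-size batches, which adds slicing/buffering work in the consumer.
for batch in ds.iter_batches(batch_size=256):
    pass  # ~16 batches of 256 rows

# New default (batch_size=None): each block is yielded as one batch,
# so no re-batching or extra buffering happens on the consuming process.
for batch in ds.iter_batches(batch_size=None):
    pass  # 4 batches, one per 1000-row block
```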

@ericl added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Jan 28, 2023
@clarkzinzow (Contributor) left a comment

> Unlike with map_batches(), there is no advantage in setting the batch size of iter_batches() to 256 by default. It only causes extra buffering and performance overhead by default.

Hmm if the batch is copied into the worker heap (as is common for certain data types and batch formats) and the consumer is bottlenecked by worker heap memory, this could definitely matter and is why we have a default batch size. Could you expand more on why this isn't a concern?

```diff
@@ -2786,7 +2786,7 @@ def iter_batches(
         self,
         *,
         prefetch_blocks: int = 0,
-        batch_size: Optional[int] = 256,
+        batch_size: Optional[int] = None,
```
A contributor left an inline comment on this change:

I think the batch_size at iteration affects the model, and block size may not be a good default (folks with more context in ML can correct this).
The other thing is that we tie the shuffle size to batch_size, so it can impact performance as well.

@ericl (Contributor, Author) commented Jan 30, 2023

> Hmm if the batch is copied into the worker heap (as is common for certain data types and batch formats) and the consumer is bottlenecked by worker heap memory, this could definitely matter and is why we have a default batch size. Could you expand more on why this isn't a concern?

Isn't the entire block copied into the heap already? In this case, converting into batches of non-block size can add unexpected delays and conversion overheads.

Btw, this is for the raw iter_batches() API only. The ML specific iterators / DatasetIterator still specify a fixed batch size.

@clarkzinzow (Contributor) commented Jan 30, 2023

> Isn't the entire block copied into the heap already? In this case, converting into batches of non-block size can add unexpected delays and conversion overheads.

No, typically we have zero-copy access to the block's data buffers in the object store, then we perform a zero-copy slice to get the batches (data buffers are still in the object store), and only then do we do the format conversion on the (potentially) much smaller batch. E.g. for creating Pandas DataFrame batches off of Arrow blocks, which can involve a 10x+ inflation due to the format conversion, converting a large block vs. a small batch can be the difference between OOMing and not OOMing.
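To illustrate that point with plain pyarrow and pandas (a rough sketch with made-up sizes, not Ray's internal block code): slicing an Arrow table is zero-copy, so only the slice being converted to pandas pays the format-conversion cost in heap memory.

```python
import pyarrow as pa

# Stand-in for an Arrow block (made-up size; real blocks can be far larger).
block = pa.table({"id": list(range(1_000_000))})

def iter_pandas_batches(table: pa.Table, batch_size: int):
    """Yield pandas batches by zero-copy slicing the Arrow table first,
    then converting only the small slice to pandas."""
    for start in range(0, table.num_rows, batch_size):
        batch = table.slice(start, batch_size)  # zero-copy view of Arrow buffers
        yield batch.to_pandas()  # conversion cost is proportional to the slice

# Converting batch-by-batch keeps the peak pandas footprint at ~batch_size rows,
# whereas block.to_pandas() would materialize the entire block in heap at once.
for df in iter_pandas_batches(block, batch_size=256):
    pass
```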

> Btw, this is for the raw iter_batches() API only. The ML specific iterators / DatasetIterator still specify a fixed batch size.

That eliminates the risk for users of those APIs, but what about users that are using the raw .iter_batches() API? This still seems like a memory stability issue for those users, and we made an explicit issue a while back to emphasize memory stability out-of-the-box rather than performance.

@ericl (Contributor, Author) commented Jan 30, 2023

> No, typically we have zero-copy access to the block's data buffers in the object store, then we perform a zero-copy slice to get the batches (data buffers are still in the object store), and only then do we do the format conversion on the (potentially) much smaller batch. E.g. for creating Pandas DataFrame batches off of Arrow blocks, which can involve a 10x+ inflation due to the format conversion, converting a large block vs. a small batch can be the difference between OOMing and not OOMing.

I don't really buy this, since the majority of memory usage is from map_batches(), which has a much larger default batch size of 4096. The driver OOMing sounds a bit far-fetched given you typically have only one of these, and it will be fetching one block at a time.

In other words, the driver bottleneck is more likely to be CPU than memory, since there is just one driver compared to many map workers.

> That eliminates the risk for users of those APIs, but what about users that are using the raw .iter_batches() API? This still seems like a memory stability issue for those users, and we made an explicit issue a while back to emphasize memory stability out-of-the-box rather than performance.

Isn't memory stability mostly about the map workers and not the driver process?

@ericl (Contributor, Author) commented Jan 30, 2023

Discussed offline--- let's instead prioritize the async/thread-pool based batch conversion here: #31911

That should give us the best of both worlds: a predictable batch size, and high-performance iteration by default.
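For context, the idea referenced in #31911 is to overlap batch conversion with consumption. Below is a minimal sketch of the general technique only, not the actual Ray implementation; the function names, arguments, and queue size are made up:

```python
import queue
import threading

def iter_batches_async(block_iter, convert, max_buffered: int = 2):
    """Convert batches on a background thread so the consumer overlaps
    conversion with its own work. block_iter and convert are hypothetical
    stand-ins for a block iterator and a format-conversion step."""
    buf: queue.Queue = queue.Queue(maxsize=max_buffered)  # bounds heap usage
    sentinel = object()

    def producer():
        for block in block_iter:
            buf.put(convert(block))  # conversion happens off the consumer thread
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while (item := buf.get()) is not sentinel:
        yield item
```

Bounding the buffer keeps memory use predictable, while the conversion cost is hidden behind the consumer's own per-batch work.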

@ericl closed this Jan 30, 2023