[data] The iter_batch default batch size should be block size #32004
Conversation
Signed-off-by: Eric Liang <[email protected]>
Unlike with map_batches(), there is no advantage in setting the batch size of iter_batches() to 256 by default. It only causes extra buffering and performance overhead.
Hmm, if the batch is copied into the worker heap (as is common for certain data types and batch formats) and the consumer is bottlenecked by worker heap memory, this could definitely matter, and it is why we have a default batch size. Could you expand on why this isn't a concern?
@@ -2786,7 +2786,7 @@ def iter_batches(
         self,
         *,
         prefetch_blocks: int = 0,
-        batch_size: Optional[int] = 256,
+        batch_size: Optional[int] = None,
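For context, a minimal sketch of what the new default means from a caller's point of view (assuming the Ray Data convention that `batch_size=None` yields batches matching the underlying block size; the dataset and sizes here are made up):

```python
import ray

ds = ray.data.range(10_000)  # small illustrative dataset

# New default (batch_size=None): batches follow the underlying block size,
# so no extra re-batching buffer is needed.
for batch in ds.iter_batches():
    pass  # consume the block-sized batch

# Callers that depend on a fixed batch size can still request one explicitly.
for batch in ds.iter_batches(batch_size=256):
    pass  # consume a 256-row batch
```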
I think the batch_size used at iteration affects the model, so block size may not be a good default (folks with more ML context can correct this).
The other thing is that we tie the shuffle size to batch_size, so it can impact performance as well.
Isn't the entire block copied into the heap already? In that case, converting into batches of non-block size can add unexpected delays and conversion overheads. Btw, this is for the raw iter_batches() API only. The ML-specific iterators / DatasetIterator still specify a fixed batch size.
No, typically we have zero-copy access to the block's data buffers in the object store, then we perform a zero-copy slice to get the batches (data buffers are still in the object store), and only then do we do the format conversion on the (potentially) much smaller batch. E.g. for creating Pandas DataFrame batches off of Arrow blocks, which can involve a 10x+ inflation due to the format conversion, converting a large block vs. a small batch can be the difference between OOMing and not OOMing.
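To illustrate the point above, here is a sketch using pyarrow directly (not Ray internals; the table contents are made up): slicing an Arrow table is zero-copy, so only the small batch pays the format-conversion cost.

```python
import pyarrow as pa

# Stand-in for a block: an Arrow table whose buffers could live in the object store.
block = pa.table({"x": list(range(1_000_000))})

batch = block.slice(0, 256)  # zero-copy view of the first 256 rows
df = batch.to_pandas()       # the (potentially inflating) conversion touches only the small batch
```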
That eliminates the risk for users of those APIs, but what about users that are using the raw iter_batches() API?
I don't really buy this, since the majority of memory usage comes from map_batches(), which has a much larger default batch size of 4096. The driver OOMing sounds a bit far-fetched given you typically have only one of these, and it will be fetching one block at a time. In other words, the driver bottleneck is more likely to be CPU than memory, since there is just one driver compared to many map workers.
Isn't memory stability mostly about the map workers rather than the driver process?
Discussed offline: let's instead prioritize the async/thread-pool based batch conversion here: #31911. That should give us the best of both worlds: a predictable batch size and high-performance iteration by default.
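For readers unfamiliar with the idea, a minimal sketch of thread-pool based batch conversion (an illustration of the general technique, not the design in #31911): the next batch is converted on a background thread while the consumer processes the current one, hiding the conversion cost.

```python
from concurrent.futures import ThreadPoolExecutor

def iter_converted(batches, convert):
    """Yield convert(batch) for each batch, converting one batch ahead of the consumer."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = None
        for batch in batches:
            next_future = pool.submit(convert, batch)  # start converting the next batch
            if future is not None:
                yield future.result()                  # hand the previous batch to the consumer
            future = next_future
        if future is not None:
            yield future.result()                      # flush the last in-flight batch
```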