[data] Always convert arrow batches to pandas batches when user specifies batch_format="native" #21566
Will this break a lot of existing workflows? Shouldn't we set the default format to arrow?
@ericl I think that I'm more in favor of keeping the existing
The points in favor of a
Any other points in favor for the
I'm not concerned about breaking things, as this is still a beta product. Btw, the pandas-block optimization will break workflows if native is specified. @clarkzinzow, the dealbreaker for "native" is that the user gets an entirely unpredictable type to their
Ah, so that perspective makes sense if we're thinking that the dataset format will eventually be an implementation detail, and that the dataset format won't be a first-class part of the read API. I didn't think that we were going there yet. Anyways, that sounds good to me. 👍
Question about .iter_rows() API
python/ray/data/dataset.py
Outdated
@@ -1829,7 +1829,7 @@ def iter_rows(self, *, prefetch_blocks: int = 0) -> Iterator[T]:
            A local iterator over the entire dataset.
        """
        for batch in self.iter_batches(
            prefetch_blocks=prefetch_blocks, batch_format="native"):
Hmm, if a user was trying to keep everything in Arrow, this hard-coded Pandas batch would break that. Could we expose this in the .iter_rows() API as row_format, with a default of "pandas"?
Good point!
…lementation (partial) (ray-project#20988) (ray-project#21661)" This reverts commit fa5c167.
Blocked on #20988
@ericl PR resubmitted.
Why are these changes needed?
With the addition of #20988, the "native" format becomes ambiguous. This PR proposes to auto-promote Arrow blocks to pandas blocks when the user specifies the "native" format, to avoid uncertainty.
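
The promotion rule described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: resolve_native_batch is a hypothetical helper name, and the sketch assumes Arrow tables can be recognized by duck typing on to_pandas(), a method that pyarrow.Table provides and that pandas DataFrames and Python lists do not.

```python
def resolve_native_batch(block, batch_format="native"):
    """Hypothetical helper, not part of Ray's API.

    Sketches the rule proposed in this PR: when the caller asks for
    "native" batches and the underlying block is an Arrow table, hand
    back a pandas DataFrame instead of the Arrow table.
    """
    if batch_format != "native":
        # Explicit formats ("pandas", "pyarrow") are out of scope here.
        raise ValueError(
            f"unsupported batch_format in this sketch: {batch_format!r}"
        )

    # Assumption: Arrow tables are detected by duck typing. pyarrow.Table
    # exposes to_pandas(); pandas DataFrames and Python lists do not.
    to_pandas = getattr(block, "to_pandas", None)
    if callable(to_pandas):
        return to_pandas()

    # Lists and pandas DataFrames pass through unchanged.
    return block
```

Under this rule a caller who asks for "native" never receives an Arrow table: a pyarrow.Table passed as block would come back as a pandas DataFrame, while a list or DataFrame block would be returned as-is.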