Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[data] Always convert arrow batches to pandas batches when user specifies batch_format="native" #21566

Merged
merged 11 commits into from
Feb 2, 2022

Conversation

ericl
Copy link
Contributor

@ericl ericl commented Jan 12, 2022

Why are these changes needed?

With the addition of #20988, the native format becomes ambiguous. This PR proposes to auto-promote arrow to pandas blocks when the user specifies "native" format, to avoid uncertainty.

@kfstorm
Copy link
Member

kfstorm commented Jan 14, 2022

Will this break a lot of existing workflows? Shouldn't we set the default format to arrow?

@clarkzinzow
Copy link
Contributor

clarkzinzow commented Jan 14, 2022

@ericl I think that I'm more in favor of keeping the existing "native" format default for the following reasons:

  1. Batch format is implied by upstream dataset format. I'm more in favor of having users decide on the dataset format at read time and have that imply the default format of all downstream batch operations, since this gives you a well-defined, unambiguous dataset format at the beginning of the dataset lifetime + a guarantee of no hidden downstream conversions, which may be inefficient (see next point).
  2. Less likely to surprise users with data conversion costs. By default, batches will be mapped/iterated over in their native format, so the default is the most efficient option. By requiring them to explicitly convert an Arrow dataset to a Pandas dataset when mapping/iterating, it should make them think about whether they should instead, for the sake of efficiency/performance, map/iterate on the native Arrow format or change their upstream dataset to a Pandas dataset. If we implicitly convert batches to Pandas DataFrames by default, they might never become aware of this cost.
  3. Backwards compatible. Changing the default to a non-native format will break existing datasets uses that rely on the native default.

The points in favor of a "pandas" default:

  1. It's the most commonly used batch format. With the "native" default, users will often try to use .map_batches()/.iter_batches() for a tabular dataset and be surprised when the batch is an Arrow Table instead of a Pandas DataFrame, which adds a bit of friction. I think that this also has some benefits, such as getting the user to think about the best end-to-end underlying format for their data (see point (2) above).
  2. Less ambiguous than inferring batch format from upstream dataset format.

Any other points in favor for the "pandas" default?

@ericl
Copy link
Contributor Author

ericl commented Jan 14, 2022

I'm not concerned with breaking things, as this is still a beta product. Btw, the pandas-block optimization will break workflows if native is specified.

@clarkzinzow , the dealbreaker for "native" is that the user gets an entirely unpredictable type to their map_batches operation based on what happened before, or even internal optimizations / changes to Datasets code.

@clarkzinzow
Copy link
Contributor

the dealbreaker for "native" is that the user gets an entirely unpredictable type to their map_batches operation based on what happened before, or even internal optimizations / changes to Datasets code.

Ah, so that perspective makes sense if we're thinking that the dataset format will eventually be an implementation detail, and that the dataset format won't be a first-class part of the read API. I didn't think that we were going there yet. Anyways, that sounds good to me. 👍

Copy link
Contributor

@clarkzinzow clarkzinzow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question about .iter_rows() API

@@ -1829,7 +1829,7 @@ def iter_rows(self, *, prefetch_blocks: int = 0) -> Iterator[T]:
A local iterator over the entire dataset.
"""
for batch in self.iter_batches(
prefetch_blocks=prefetch_blocks, batch_format="native"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm if a user was trying to keep everything in Arrow, this hard-coded Pandas batch would break that. Could we expose this in the .iter_rows() API as row_format, with a default to "pandas"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 18, 2022
@ericl
Copy link
Contributor Author

ericl commented Jan 26, 2022

Blocked on #20988

@kfstorm
Copy link
Member

kfstorm commented Jan 26, 2022

@ericl PR resubmitted.

@bveeramani
Copy link
Member

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

  1. Install Black
pip install -I black==21.12b0
  1. Format changed files with Black
curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
chmod +x ./format-changed.sh
./format-changed.sh
rm format-changed.sh
  1. Commit your changes.
git add --all
git commit -m "Format Python code with Black"
  1. Merge master into your branch.
git pull upstream master
  1. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

@ericl ericl changed the title [rfc] Fix the dataset batch iteration format to "pandas" by default [data] Fix the dataset batch iteration format to "pandas" by default Feb 2, 2022
@ericl ericl removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 2, 2022
@ericl ericl changed the title [data] Fix the dataset batch iteration format to "pandas" by default [data] Always convert arrow batches to pandas batches when user specifies batch_format="native" Feb 2, 2022
@ericl ericl merged commit 54fe2f8 into ray-project:master Feb 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants