[data] Always convert arrow batches to pandas batches when user specifies batch_format="native" #21566

ericl · 2022-01-12T23:28:02Z

Why are these changes needed?

With the addition of #20988, the native format becomes ambiguous. This PR proposes to auto-promote arrow to pandas blocks when the user specifies "native" format, to avoid uncertainty.

kfstorm · 2022-01-14T11:44:01Z

Will this break a lot of existing workflows? Shouldn't we set the default format to arrow?

clarkzinzow · 2022-01-14T20:49:08Z

@ericl I think that I'm more in favor of keeping the existing "native" format default for the following reasons:

1. Batch format is implied by upstream dataset format. I'm more in favor of having users decide on the dataset format at read time and having that imply the default format of all downstream batch operations, since this gives you a well-defined, unambiguous dataset format at the beginning of the dataset lifetime plus a guarantee of no hidden downstream conversions, which may be inefficient (see next point).
2. Less likely to surprise users with data conversion costs. By default, batches will be mapped/iterated over in their native format, so the default is the most efficient option. Requiring users to explicitly convert an Arrow dataset to a Pandas dataset when mapping/iterating should make them think about whether they should instead, for the sake of efficiency/performance, map/iterate on the native Arrow format or change their upstream dataset to a Pandas dataset. If we implicitly convert batches to Pandas DataFrames by default, they might never become aware of this cost.
3. Backwards compatible. Changing the default to a non-native format will break existing Datasets uses that rely on the native default.

The points in favor of a "pandas" default:

1. It's the most commonly used batch format. With the "native" default, users will often try to use .map_batches()/.iter_batches() on a tabular dataset and be surprised when the batch is an Arrow Table instead of a Pandas DataFrame, which adds a bit of friction. I think that this friction also has some benefits, though, such as getting the user to think about the best end-to-end underlying format for their data (see point (2) above).
2. Less ambiguous than inferring batch format from upstream dataset format.

Any other points in favor of the "pandas" default?

ericl · 2022-01-14T20:58:27Z

I'm not concerned with breaking things, as this is still a beta product. Btw, the pandas-block optimization will break workflows if native is specified.

@clarkzinzow, the dealbreaker for "native" is that the user gets an entirely unpredictable type to their map_batches operation based on what happened before, or even internal optimizations / changes to Datasets code.

clarkzinzow · 2022-01-14T21:29:26Z

> the dealbreaker for "native" is that the user gets an entirely unpredictable type to their map_batches operation based on what happened before, or even internal optimizations / changes to Datasets code.

Ah, so that perspective makes sense if we're thinking that the dataset format will eventually be an implementation detail, and that the dataset format won't be a first-class part of the read API. I didn't think that we were going there yet. Anyways, that sounds good to me. 👍

clarkzinzow

Question about .iter_rows() API

clarkzinzow · 2022-01-14T21:32:13Z

python/ray/data/dataset.py

@@ -1829,7 +1829,7 @@ def iter_rows(self, *, prefetch_blocks: int = 0) -> Iterator[T]:
            A local iterator over the entire dataset.
        """
        for batch in self.iter_batches(
-                prefetch_blocks=prefetch_blocks, batch_format="native"):


Hmm if a user was trying to keep everything in Arrow, this hard-coded Pandas batch would break that. Could we expose this in the .iter_rows() API as row_format, with a default to "pandas"?

ericl · 2022-01-14

Good point!

ericl · 2022-01-26T03:40:52Z

Blocked on #20988

kfstorm · 2022-01-26T12:16:08Z

@ericl PR resubmitted.

bveeramani · 2022-01-30T05:11:59Z

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

1. Install Black:

    pip install -I black==21.12b0

2. Format changed files with Black:

    curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
    chmod +x ./format-changed.sh
    ./format-changed.sh
    rm format-changed.sh

3. Commit your changes:

    git add --all
    git commit -m "Format Python code with Black"

4. Merge master into your branch:

    git pull upstream master

5. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

update

2a5b858

ericl requested a review from scv119 as a code owner January 12, 2022 23:28

ericl assigned kfstorm and clarkzinzow Jan 12, 2022

clarkzinzow reviewed Jan 14, 2022

ericl added the @author-action-required label Jan 18, 2022

ericl added 2 commits January 25, 2022 15:51

Merge remote-tracking branch 'upstream/master' into pandas-default

95da74e

wip

eb619f5

ericl mentioned this pull request Jan 26, 2022

[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) #20988

Merged

Revert "Revert "[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) (ray-project#20988) (ray-project#21661)"

77e4326

This reverts commit fa5c167.

ericl added 6 commits January 31, 2022 14:23

update

e3796f2

Merge remote-tracking branch 'upstream/master' into pandas-default

968197c

fix

e39d141

fix

c9b8872

Merge remote-tracking branch 'upstream/master' into pandas-default

9cba2f8

fix

99e1b89

ericl changed the title from '[rfc] Fix the dataset batch iteration format to "pandas" by default' to '[data] Fix the dataset batch iteration format to "pandas" by default' Feb 2, 2022

Merge remote-tracking branch 'upstream/master' into pandas-default

f528455

ericl removed the @author-action-required label Feb 2, 2022

ericl changed the title from '[data] Fix the dataset batch iteration format to "pandas" by default' to '[data] Always convert arrow batches to pandas batches when user specifies batch_format="native"' Feb 2, 2022

scv119 approved these changes Feb 2, 2022

ericl merged commit 54fe2f8 into ray-project:master Feb 2, 2022

clarkzinzow mentioned this pull request Feb 10, 2022

Revert "[Datasets] Support ignoring NaNs in aggregations. (#20787)" #22258

Merged
