[Datasets] Add `Dataset.default_batch_format` #28434

bveeramani · 2022-09-11T09:07:35Z

Signed-off-by: Balaji Veeramani [email protected]

Depends on:

[Datasets] Deprecate "native" batch format in favor of "default" #28489

Why are these changes needed?

Participants in the PyTorch UX study couldn't understand how the "native" batch format works. This PR introduces a method Dataset.native_batch_format that tells users exactly what the native batch format is, so users don't have to guess.

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

clarkzinzow

We also might want to emphasize this more in the feature guides and the key concepts page. I.e. that the "native" batch format is Pandas DataFrames for tabular data, NumPy ndarrays for tensor data, and Python lists otherwise.

python/ray/data/dataset.py

jianoaix · 2022-09-12T20:09:17Z

Is this worth a public API on Dataset? Can we have clearer documentation or naming (is "native" confusing?) to address the feedback?

bveeramani · 2022-09-12T20:28:08Z

Is this worth a public API on Dataset?

If our "native" logic was simpler, I don't think it'd be worth it, but given that our logic is non-trivial, it might make sense to add.

Can we have clearer documentation or naming (is "native" confusing?) to address the feedback?

Even if we clearly document the behavior and rename "native", I still feel like it's easier to call ds.native_batch_format. Like, you could either do

>>> ds = ray.data.read_images(...)
>>> ds.native_batch_format()
<class 'list'>

Or read something like

The “native” batch format presents data as follows for each Dataset type:

Tabular Datasets: Each batch will be a pandas.DataFrame. This may incur a conversion cost if the underlying Dataset block is not zero-copy convertible from an Arrow table.

Tensor Datasets (single-column): Each batch will be a single numpy.ndarray containing the single tensor column for this batch.

Simple Datasets: Each batch will be a Python list.
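The three cases quoted above can be restated as a small lookup. This is only an illustrative sketch: `DatasetKind`, `DEFAULT_BATCH_TYPE`, and `describe_default_batch` are hypothetical names, not part of Ray, and the batch types are written as strings so the snippet runs without pandas or NumPy installed.

```python
from enum import Enum


class DatasetKind(Enum):
    """Hypothetical labels for the three dataset types named in the docs."""

    TABULAR = "tabular"
    TENSOR = "tensor"
    SIMPLE = "simple"


# Restates the documented rule; this is not Ray's actual implementation.
DEFAULT_BATCH_TYPE = {
    DatasetKind.TABULAR: "pandas.DataFrame",
    DatasetKind.TENSOR: "numpy.ndarray",
    DatasetKind.SIMPLE: "list",
}


def describe_default_batch(kind: DatasetKind) -> str:
    """Return the name of the batch type a dataset of this kind would yield."""
    return DEFAULT_BATCH_TYPE[kind]
```

A user who has to keep this table in their head is the case the proposed method is meant to avoid: the dataset reports its own batch type instead.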

jianoaix · 2022-09-12T20:55:05Z

Regarding naming, would it be better to call it "default"? To me, "native" feels like it's indicating something underlying.

Regarding this API, what's the use case? Is it just a convenient shortcut to answer "what is the native format for the dataset"?

bveeramani · 2022-09-12T21:19:44Z

Regarding naming, would it be better to call it "default"? To me, "native" feels like it's indicating something underlying.

Yeah, I think "default" makes a lot more sense.

Regarding this API, what's the use case? Is it just a convenient shortcut to answer "what is the native format for the dataset"?

Yeah, pretty much. Otherwise, you'd need to do something like this:

def map_fn(batch):
    # Print the batch to see which type the dataset passes to the function.
    print(batch)
    return batch

ds.map_batches(map_fn)

clarkzinzow · 2022-09-13T17:45:24Z

Agreed that the "native" batch format naming isn't great, and agreed that something like "default" would be better. @ericl What do you think about that API change?

ericl · 2022-09-13T19:27:05Z

Sure, "default" sounds good. We can do it without breaking anything with a backwards-compatibility alias.
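One way such a backwards-compatibility alias could look is sketched below. The helper name `resolve_batch_format` and the warning text are assumptions made for illustration; the actual change lives in the rename PR and may differ.

```python
import warnings


def resolve_batch_format(batch_format: str) -> str:
    """Hypothetical helper: map the legacy "native" name onto "default"."""
    if batch_format == "native":
        # Keep old call sites working, but tell the caller about the new name.
        warnings.warn(
            'The "native" batch format is deprecated; use "default" instead.',
            DeprecationWarning,
            stacklevel=2,
        )
        return "default"
    # Any other format name passes through unchanged.
    return batch_format
```

With this shape, existing code that passes "native" keeps running and only sees a warning, while new code can use "default" directly.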

bveeramani · 2022-09-13T23:23:04Z

Sure, "default" sounds good. We can do it without breaking anything with a backwards-compatibility alias.

Opened new PR for rename: #28489

clarkzinzow

Some soft recommendations for improving the default format docs for tabular and tensor data, since this difference is a common source of confusion for users.

python/ray/data/dataset.py

Add Dataset.native_batch_format

d88840f

bveeramani requested review from ericl, scv119, clarkzinzow, jjyao, jianoaix, maxpumperla, c21 and a team as code owners September 11, 2022 09:07

bveeramani assigned ericl and clarkzinzow Sep 11, 2022

bveeramani mentioned this pull request Sep 11, 2022

[Datasets] [Docs] Update map_batches documentation #28435

Merged

8 tasks

clarkzinzow reviewed Sep 12, 2022

ericl added the @author-action-required label Sep 13, 2022

bveeramani added 4 commits September 13, 2022 15:36

Change name and implementation

fe95ffa

Rename "native" as "default"

07b3a85

Update docs

35c12a3

Add warnings

56df5f5

bveeramani mentioned this pull request Sep 13, 2022

[Datasets] Deprecate "native" batch format in favor of "default" #28489

Merged

7 tasks

Merge branch 'bveeramani/native-to-default' into bveeramani/native-ba…

a596e4d

…tch-format

bveeramani changed the title ~~[Datasets] Add Dataset.native_batch_format~~ [Datasets] Add Dataset.default_batch_format Sep 13, 2022

bveeramani removed the @author-action-required label Sep 13, 2022

Update dataset.rst

472c00f

bveeramani added 2 commits September 14, 2022 17:39

Merge remote-tracking branch 'upstream/master' into bveeramani/native…

d9eb36a

…-batch-format

Update dataset.py

a1986a5

clarkzinzow reviewed Sep 15, 2022

Update dataset.py

aeaee9b

clarkzinzow approved these changes Sep 19, 2022

Address review comments

8317a21

clarkzinzow merged commit 206e847 into ray-project:master Sep 19, 2022

bveeramani deleted the bveeramani/native-batch-format branch September 19, 2022 23:20
