Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Add Dataset.default_batch_format #28434

Merged

Conversation

bveeramani
Copy link
Member

@bveeramani bveeramani commented Sep 11, 2022

Signed-off-by: Balaji Veeramani [email protected]

Depends on:

Why are these changes needed?

Participants in the PyTorch UX study couldn't understand how the "native" batch format works. This PR introduces a method Dataset.native_batch_format that tells users exactly what the native batch format is, so users don't have to guess.

image image

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Contributor

@clarkzinzow clarkzinzow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also might want to emphasize this more in the feature guides and the key concepts page. I.e. that the "native" batch format is Pandas DataFrames for tabular data, NumPy ndarrays for tensor data, and Python lists otherwise.

python/ray/data/dataset.py Outdated Show resolved Hide resolved
@jianoaix
Copy link
Contributor

Does this worth a public API to Dataset? Can we have more clear documentation or naming (is "native" confusing) to address the feedback?

@bveeramani
Copy link
Member Author

bveeramani commented Sep 12, 2022

Does this worth a public API to Dataset?

If our "native" logic was simpler, I don't think it'd be worth it, but given that our logic is non-trivial, it might make sense to add.

Can we have more clear documentation or naming (is "native" confusing) to address the feedback?

Even if we clearly document the behavior and rename "native", I still feel like it's easier to call ds.native_batch_format. Like, you could either do

>>> ds ray.data.read_images(...)
>>> ds.native_batch_format()
<class 'list'>

Or read something like

The “native” batch format presents data as follows for each Dataset type:

Tabular Datasets: Each batch will be a pandas.DataFrame. This may incur a conversion cost if the underlying Dataset block is not zero-copy convertible from an Arrow table.

Tensor Datasets (single-column): Each batch will be a single numpy.ndarray containing the single tensor column for this batch.

Simple Datasets: Each batch will be a Python list.

@jianoaix
Copy link
Contributor

Regarding naming, would it better to call it "default"? To me the "native" feels like it's indicating something underlying.

Regarding this API, what's the use case? Is it just a convenient shortcut to answer "what is the native format for the dataset"?

@bveeramani
Copy link
Member Author

Regarding naming, would it better to call it "default"? To me the "native" feels like it's indicating something underlying.

Yeah, I think "default" makes a lot more sense.

Regarding this API, what's the use case? Is it just a convenient shortcut to answer "what is the native format for the dataset"?

Yeah, pretty much. Otherwise, you'd need to do something like this

def map_fn(batch):
    print(batch)
    return batch

@clarkzinzow
Copy link
Contributor

Agreed that the "native" batch format naming isn't great, and agreed that something like "default" would be better. @ericl What do you think about that API change?

@ericl
Copy link
Contributor

ericl commented Sep 13, 2022

Sure, "default" sounds good. We can do it without breaking anything with a backwards-compatibility alias.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 13, 2022
@bveeramani bveeramani changed the title [Datasets] Add Dataset.native_batch_format [Datasets] Add Dataset.default_batch_format Sep 13, 2022
@bveeramani
Copy link
Member Author

Sure, "default" sounds good. We can do it without breaking anything with a backwards-compatibility alias.

Opened new PR for rename: #28489

@bveeramani bveeramani removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 13, 2022
Copy link
Contributor

@clarkzinzow clarkzinzow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some soft recommendations for improving the default format docs for tabular and tensor data, since this difference is a common confuser for users.

python/ray/data/dataset.py Outdated Show resolved Hide resolved
python/ray/data/dataset.py Outdated Show resolved Hide resolved
@clarkzinzow clarkzinzow merged commit 206e847 into ray-project:master Sep 19, 2022
@bveeramani bveeramani deleted the bveeramani/native-batch-format branch September 19, 2022 23:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants