[Datasets] Add Dataset.default_batch_format
#28434
We also might want to emphasize this more in the feature guides and the key concepts page. I.e. that the "native" batch format is Pandas DataFrames for tabular data, NumPy ndarrays for tensor data, and Python lists otherwise.
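As a rough illustration of the rule stated above, here is a minimal Python sketch of the three-way mapping. The function name `sketch_default_batch_format` and the string labels are hypothetical; this is not Ray's implementation, and the real method may return something other than strings.

```python
def sketch_default_batch_format(data_kind: str) -> str:
    """Hypothetical helper: map a kind of data to its default batch format.

    Mirrors the rule from the comment above:
    tabular data -> Pandas DataFrames, tensor data -> NumPy ndarrays,
    anything else -> Python lists.
    """
    if data_kind == "tabular":
        return "pandas"   # stands in for pandas.DataFrame
    if data_kind == "tensor":
        return "numpy"    # stands in for numpy.ndarray
    return "list"         # fallback for all other data


if __name__ == "__main__":
    for kind in ("tabular", "tensor", "simple"):
        print(kind, "->", sketch_default_batch_format(kind))
```

The point of the sketch is only that the rule has three branches, which is why a single accessor on the dataset may be easier for users than re-deriving the answer themselves.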
Is this worth a public API on Dataset? Could clearer documentation or naming address the feedback instead (is "native" the confusing part)?
If our "native" logic was simpler, I don't think it'd be worth it, but given that our logic is non-trivial, it might make sense to add.
Even if we clearly document the behavior and rename "native", I still feel like it's easier to call
Or read something like
Regarding naming, would it be better to call it "default"? To me, "native" feels like it refers to something underlying. Regarding this API, what's the use case? Is it just a convenient shortcut to answer "what is the native format for the dataset"?
Yeah, I think "default" makes a lot more sense.
Yeah, pretty much. Otherwise, you'd need to do something like this
Agreed that the |
Sure, "default" sounds good. We can do it without breaking anything by adding a backwards-compatibility alias.
Dataset.native_batch_format
Dataset.default_batch_format
Opened new PR for rename: #28489
Some soft recommendations for improving the default format docs for tabular and tensor data, since this difference is a common source of confusion for users.
Signed-off-by: Balaji Veeramani [email protected]
Depends on: "native" batch format in favor of "default" #28489
Why are these changes needed?
Participants in the PyTorch UX study couldn't understand how the "native" batch format works. This PR introduces a method, Dataset.native_batch_format, that tells users exactly what the native batch format is, so users don't have to guess.
Related issue number
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.