[Datasets] Add a new "zero_copy" batch format #32662

Closed
Tracked by #28346
amogkam opened this issue Feb 17, 2023 · 3 comments
Labels: data (Ray Data-related issues), P1 (Issue that should be fixed within a few weeks), Ray 2.5

Comments


amogkam commented Feb 17, 2023

Currently, predictors and preprocessors manually call dataset.dataset_format() to determine which batch format results in zero copy, so that they can delegate to format-specific, optimized implementations of their transformation functions.

However, calling dataset_format() triggers execution of the dataset. Instead, we should introduce a "zero_copy" batch format in Datasets that picks the best batch format at runtime, once the underlying block type is known.

This will also allow us to deprecate dataset_format.
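
A minimal sketch of the idea, not the actual Ray implementation: the requested format is resolved per block at the moment the transform runs, so nothing has to execute the dataset up front. Every name here (`ArrowBlock`, `PandasBlock`, `resolve_batch_format`, `apply_transform`) is hypothetical, and the block classes are plain stand-ins rather than real Arrow or pandas objects.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


# Hypothetical stand-ins for the underlying block types; these are NOT Ray classes.
@dataclass
class ArrowBlock:
    rows: List[dict]


@dataclass
class PandasBlock:
    rows: List[dict]


Block = Union[ArrowBlock, PandasBlock]

# The format each block type can be handed out in without a conversion.
_NATIVE_FORMAT = {ArrowBlock: "pyarrow", PandasBlock: "pandas"}


def resolve_batch_format(requested: str, block: Block) -> str:
    """Resolve "zero_copy" against the block's own type; pass other formats through."""
    if requested == "zero_copy":
        return _NATIVE_FORMAT[type(block)]
    return requested


def apply_transform(
    block: Block,
    impls: Dict[str, Callable[[Block], Block]],
    batch_format: str = "zero_copy",
) -> Tuple[str, Block]:
    """Pick the format-specific implementation at runtime and apply it to the block."""
    fmt = resolve_batch_format(batch_format, block)
    return fmt, impls[fmt](block)


# Usage: a preprocessor registers one implementation per format and never
# needs to inspect the dataset before execution.
impls = {
    "pyarrow": lambda b: ArrowBlock([{**r, "via": "arrow"} for r in b.rows]),
    "pandas": lambda b: PandasBlock([{**r, "via": "pandas"} for r in b.rows]),
}
fmt, out = apply_transform(PandasBlock([{"x": 1}]), impls)
```

Because the decision is made per block, a dataset that mixes block types would still route each block to its matching implementation under this sketch.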

@amogkam amogkam changed the title [P1] Add a new "zero_copy" batch format that allows preprocessors/predictors to delegate to format-specific optimized implementations of their transformation functions Add a new "zero_copy" batch format that allows preprocessors/predictors to delegate to format-specific optimized implementations of their transformation functions Feb 17, 2023
@amogkam amogkam changed the title Add a new "zero_copy" batch format that allows preprocessors/predictors to delegate to format-specific optimized implementations of their transformation functions Add a new "zero_copy" batch format Feb 17, 2023

amogkam commented Feb 17, 2023

cc @clarkzinzow

@amogkam amogkam added the data Ray Data-related issues label Feb 17, 2023
@amogkam amogkam added this to the Dataset Streaming Execution milestone Feb 17, 2023
@clarkzinzow clarkzinzow changed the title Add a new "zero_copy" batch format [Datasets] Add a new "zero_copy" batch format Feb 17, 2023
@c21 c21 added P1 Issue that should be fixed within a few weeks Ray 2.4 labels Mar 16, 2023

c21 commented Mar 16, 2023

Let's also deprecate Dataset.dataset_format.

@ericl ericl added Ray 2.5 and removed Ray 2.4 labels Mar 20, 2023
@ericl ericl closed this as completed Mar 25, 2023

ericl commented Mar 25, 2023

Closed in #33562
