[datasets] Add select_columns API to allow users to select a subset of columns #29081
Signed-off-by: Michael Mui <[email protected]>
Force-pushed: a86e109 to 4cce85f
hey @jianoaix can you help take a look? I have addressed the comments in my draft PR. One thing I may still add is a schema validation check to see if users are selecting columns outside of the dataset schema.
Thanks @heyitsmui for the contribution!
Thanks for the contribution! :)
The implementation looks good!
Can you add this API to python/ray/data/dataset_pipeline.py as well?
Also we need to update the documentation:
- The Basic Transformations section in https://docs.ray.io/en/master/data/api/dataset.html
- Similarly in https://docs.ray.io/en/master/data/api/dataset_pipeline.html
Generally looks good to me, except @jianoaix's comment.
…taset_pipeline, and update docs Signed-off-by: Michael Mui <[email protected]>
@jianoaix I just rebased and fixed the merge issues in test_dataset_pipeline.
@c21 Yes, the underlying pandas call will raise a KeyError, e.g. KeyError: "['dummy_col'] not in index", which is intuitive enough for the user to figure out which columns to remove.
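To illustrate the behavior described in the comment above, here is a minimal sketch using plain pandas rather than Ray itself. The `select` helper and the column names are hypothetical and exist only for this illustration; the point is that indexing a DataFrame with a list containing an unknown column raises a KeyError whose message names the missing column.

```python
import pandas as pd

# Hypothetical batch; the column names are illustrative only.
df = pd.DataFrame({"sepal_length": [5.1, 4.9], "sepal_width": [3.5, 3.0]})


def select(batch: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Hypothetical helper: keep only the requested columns of a batch."""
    return batch[cols]


message = None
try:
    # "dummy_col" does not exist in the batch, so pandas raises KeyError.
    select(df, ["sepal_length", "dummy_col"])
except KeyError as err:
    message = str(err)
    print("KeyError:", message)

# Selecting only existing columns works as expected.
subset = select(df, ["sepal_width"])
print(list(subset.columns))
```

The exact wording of the error message can vary between pandas versions, so code that depends on it should check for the column name rather than match the full string.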
thanks @heyitsmui !
…f columns (ray-project#29081)
Provide a new select_columns API to allow users to flexibly select a subset of existing columns.
Added a new select_columns API within Dataset that uses map_batches to invoke the select method of each block.
Added a new select_columns API within DatasetPipeline that applies Dataset.select_columns to each dataset/window in this pipeline.
Signed-off-by: Weichen Xu <[email protected]>
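The layering described in this commit message can be sketched in plain Python. This is not Ray's implementation: `Block`, `map_batches`, `select_columns`, and `pipeline_select_columns` below are hypothetical stand-ins that only mirror the described structure, in which the dataset-level call maps each block's select method over the blocks and the pipeline-level call applies the dataset-level call to each window.

```python
from typing import Callable, Dict, List


class Block:
    """Hypothetical stand-in for a columnar block (not Ray's block type)."""

    def __init__(self, columns: Dict[str, list]):
        self.columns = columns

    def select(self, cols: List[str]) -> "Block":
        # Raises KeyError if a requested column is missing.
        return Block({name: self.columns[name] for name in cols})


def map_batches(blocks: List[Block], fn: Callable[[Block], Block]) -> List[Block]:
    """Hypothetical stand-in: apply fn to every block."""
    return [fn(block) for block in blocks]


def select_columns(blocks: List[Block], cols: List[str]) -> List[Block]:
    """Dataset-level sketch: invoke each block's select via map_batches."""
    return map_batches(blocks, lambda block: block.select(cols))


def pipeline_select_columns(
    windows: List[List[Block]], cols: List[str]
) -> List[List[Block]]:
    """Pipeline-level sketch: apply select_columns to each window."""
    return [select_columns(window, cols) for window in windows]


blocks = [
    Block({"a": [1, 2], "b": [3, 4], "c": [5, 6]}),
    Block({"a": [7], "b": [8], "c": [9]}),
]
selected = select_columns(blocks, ["a", "c"])
print([block.columns for block in selected])

windows = [blocks[:1], blocks[1:]]
piped = pipeline_select_columns(windows, ["b"])
print([[block.columns for block in window] for window in piped])
```

Keeping the pipeline-level function as a thin wrapper means any behavior of the dataset-level selection, including the error raised for a missing column, carries over to every window unchanged.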
Why are these changes needed?
Provide a new select_columns API to allow users to flexibly select a subset of existing columns.
Added a new select_columns API within Dataset that uses map_batches to invoke the select method of each block. Added a new select_columns API within DatasetPipeline that applies Dataset.select_columns to each dataset/window in this pipeline.
Related issue number
Closes #27667
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.