Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[datasets] Add select_columns API to allow users to select a subset of columns #29081

Merged
merged 17 commits into from
Oct 26, 2022

Conversation

heyitsmui
Copy link
Contributor

@heyitsmui heyitsmui commented Oct 5, 2022

Why are these changes needed?

Provide a new select_columns API to allow users to flexibly select a subset of existing columns.

Added a new select_columns API within Dataset that uses map_batches to invoke the select method of each block. Added a new select_columns API within DatasetPipeline that apply Dataset.select_columns to each dataset/window in this pipeline.

Related issue number

Closes #27667

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@heyitsmui
Copy link
Contributor Author

hey @jianoaix can you help take a look? have addressed the comments in my draft PR - one thing i may still add is some schema validation check to see if users a selecting columns outside of the dataset schema

@jianoaix jianoaix self-assigned this Oct 5, 2022
Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @heyitsmui for the contribution!

python/ray/data/dataset.py Outdated Show resolved Hide resolved
python/ray/data/tests/test_dataset.py Show resolved Hide resolved
@c21 c21 self-assigned this Oct 5, 2022
Copy link
Contributor

@jianoaix jianoaix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contribution! :)

python/ray/data/dataset.py Outdated Show resolved Hide resolved
python/ray/data/dataset.py Outdated Show resolved Hide resolved
python/ray/data/tests/test_dataset.py Outdated Show resolved Hide resolved
python/ray/data/tests/test_dataset.py Show resolved Hide resolved
@jianoaix jianoaix added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 11, 2022
Copy link
Contributor

@jianoaix jianoaix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good!

Can you add this API to python/ray/data/dataset_pipeline.py as well?
Also we need to update the documentation:

Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me, except @jianoaix's comment.

python/ray/data/dataset.py Outdated Show resolved Hide resolved
…taset_pipeline, and update docs

Signed-off-by: Michael Mui <[email protected]>
@heyitsmui heyitsmui requested review from maxpumperla and a team as code owners October 14, 2022 01:00
@heyitsmui heyitsmui requested review from c21 and jianoaix and removed request for c21 and jianoaix October 14, 2022 08:33
@heyitsmui
Copy link
Contributor Author

@jianoaix i just rebased and fixed the merging issues in test_dataset_pipeline

@heyitsmui
Copy link
Contributor Author

@c21 Yes, underlying pandas will raise a KeyError e.g. KeyError: "['dummy_col'] not in index" which is intuitive enough for user to figure out which columns to remove.

@c21
Copy link
Contributor

c21 commented Oct 26, 2022

thanks @heyitsmui !

@clarkzinzow clarkzinzow merged commit 290033f into ray-project:master Oct 26, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…f columns (ray-project#29081)

Provide a new select_columns API to allow users to flexibly select a subset of existing columns.

Added a new select_columns API within Dataset that uses map_batches to invoke the select method of each block. Added a new select_columns API within DatasetPipeline that apply Dataset.select_columns to each dataset/window in this pipeline.

Signed-off-by: Weichen Xu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Datasets] Support select_columns to select subset of columns
4 participants