[data] Add dataset.random_sample() API #24449

Closed
ericl opened this issue May 3, 2022 · 5 comments
Labels
data (Ray Data-related issues), enhancement (Request for new feature and/or capability), good first issue (Great starter issue for someone just starting to contribute to Ray), P2 (Important issue, but not time-critical)

Comments

@ericl
Contributor

ericl commented May 3, 2022

Description

Per https://discuss.ray.io/t/how-do-i-sample-from-a-ray-datasets/5308, we should add a random_sample(N) API that returns N records from a Dataset. This can be implemented via a map_batches() followed by a take().
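
To make the map_batches() followed by take() idea concrete, here is a minimal plain-Python sketch of the pattern. It does not call Ray at all: the dataset is modeled as a list of batches, and `sample_batch`, `random_sample_sketch`, and the `fraction` and `seed` parameters are hypothetical names chosen for illustration, not the API that was eventually added.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def sample_batch(batch: Sequence[T], fraction: float, rng: random.Random) -> List[T]:
    """Per-batch step (the map_batches() part): keep each row with probability `fraction`."""
    return [row for row in batch if rng.random() < fraction]


def random_sample_sketch(
    batches: Iterable[Sequence[T]], n: int, fraction: float, seed: int = 0
) -> List[T]:
    """Filter every batch at random, then keep at most `n` surviving rows (the take() part)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible

    def sampled_rows() -> Iterator[T]:
        for batch in batches:
            yield from sample_batch(batch, fraction, rng)

    # islice stops pulling batches as soon as n rows have been collected.
    return list(islice(sampled_rows(), n))


# Hypothetical data: 10 batches of 10 integers each.
batches = [list(range(i, i + 10)) for i in range(0, 100, 10)]
sample = random_sample_sketch(batches, n=5, fraction=0.2, seed=42)
print(sample)
```

One caveat with this pattern: if the per-row probability is high relative to N, the take step is satisfied by the first few batches, so the sample skews toward the start of the dataset. It can also return fewer than N rows when too few rows survive the filter.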

cc @simon-mo @clarkzinzow

Use case

Random sampling is useful in a variety of scenarios, including creating training batches and downsampling the dataset for faster analysis and testing.

ericl added the enhancement and P2 labels May 3, 2022
ericl added this to the Datasets GA milestone May 3, 2022
simon-mo added the good first issue label May 4, 2022
clarkzinzow added the data label May 4, 2022
clarkzinzow removed this from the Datasets GA milestone May 4, 2022
@bushshrub
Contributor

bushshrub commented May 5, 2022

I'd like to give this a shot! I'm having a bit of trouble setting up my development environment on macOS though. Should I just go ahead and switch to an Ubuntu machine?

EDIT: Turns out the wheel link given in the documentation is not universal. Anyway, I currently have it working on my other computer.

bushshrub added a commit to bushshrub/ray that referenced this issue May 5, 2022
@bushshrub
Contributor

Would adding a feature to sample a fraction of the dataset (as referenced in the forum thread) be useful too?

@simon-mo
Contributor

simon-mo commented May 5, 2022

@xiurobert thanks for picking it up. Out of curiosity, how did you find this issue?

@bushshrub
Contributor

> @xiurobert thanks for picking it up. Out of curiosity, how did you find this issue?

I was randomly browsing GitHub out of boredom and this repo was recommended to me.

@bushshrub
Contributor

Now that the random_sample API is added, can this issue be closed?
