Allow downloading just some columns of a dataset #4114

Open
osanseviero opened this issue Apr 6, 2022 · 9 comments
Labels
enhancement New feature or request

Comments

@osanseviero
Contributor

Is your feature request related to a problem? Please describe.
Some people are interested in doing label analysis of a CV dataset without downloading all the images. Downloading the whole dataset does not always make sense for this kind of use case.

Describe the solution you'd like
Be able to just download some columns of a dataset, such as doing

load_dataset("huggan/wikiart", columns=["artist", "genre"])

Although this might make things a bit complicated in terms of local caching of datasets.

osanseviero added the enhancement (New feature or request) label Apr 6, 2022
@lhoestq
Member

lhoestq commented Apr 6, 2022

In the general case you can't always reduce the quantity of data to download, since you can't parse CSV or JSON data without downloading the whole files, right? ^^ However, we could explore this case by case, I guess.

@osanseviero
Contributor Author

Actually, for CSV, pandas has usecols, which allows loading a subset of columns in a more efficient way afaik, but yes, you're right, this might be more complex than I thought.
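
For readers unfamiliar with the option mentioned above, here is a minimal sketch of `usecols` in pandas. The CSV content and column names are made up for illustration and do not come from any real dataset. Note that `usecols` only limits which columns are parsed and kept in memory; the file itself still has to be read in full, which matches the caveat raised earlier in the thread.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a dataset's metadata file.
csv_text = (
    "artist,genre,image_path\n"
    "monet,landscape,img/0001.jpg\n"
    "degas,portrait,img/0002.jpg\n"
)

# usecols restricts parsing to the listed columns.
# The underlying CSV text is still read from start to end.
labels = pd.read_csv(io.StringIO(csv_text), usecols=["artist", "genre"])

print(list(labels.columns))
```

The `image_path` column never appears in the resulting DataFrame, so memory use and parse time go down, but nothing about the download size changes.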

@lukasugar

Bumping the visibility of this :) Is there a recommended way of doing this?

@lhoestq
Member

lhoestq commented Feb 21, 2024

Passing columns=[...] to load_dataset() in streaming mode does work if the dataset is in Parquet format, but for other formats it's either not possible or not implemented

@oza75

oza75 commented Apr 7, 2024

I tried using columns=['bambara'] on the dataset oza75/bambara-tts, which is in Parquet, but it does not work. This feature is really useful because sometimes you don't want to download the whole dataset, just a few columns.

@Ravi2712

It doesn't work for a dataset in Parquet format. Are we missing something?

@lhoestq
Member

lhoestq commented May 17, 2024

It only works for streaming=True. When not streaming, it does download the full files locally before reading the data.
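
Putting the two maintainer comments together, the call would look roughly like the sketch below. The helper name `stream_columns` is hypothetical and exists only for this example; the `columns` and `streaming` arguments are the ones described in the thread. It assumes the target repository is stored as Parquet, since other formats are reported above as unsupported.

```python
def stream_columns(repo_id, columns, split="train"):
    """Stream only the requested columns of a Parquet dataset from the Hub.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    # Imported lazily so that defining the helper does not require the library.
    from datasets import load_dataset

    # streaming=True avoids downloading the full files before reading.
    # columns=[...] is the option described in the thread for Parquet data.
    return load_dataset(repo_id, split=split, streaming=True, columns=columns)


# Example usage (needs network access; assumes the repo is stored as Parquet):
#   ds = stream_columns("huggan/wikiart", ["artist", "genre"])
#   first_row = next(iter(ds))
```

Without `streaming=True`, the same call downloads the full files first, as noted above, so the column selection only affects what is read afterwards.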

@kdcyberdude

Hi @lhoestq, I have a 250GB audio dataset on the Hugging Face Hub in Parquet format. I only wanted to load the text column, but it is taking a lot of time. It seems to be downloading the audio as well, even in streaming mode.

@trojblue

bump on this


7 participants