Add support for parquet files for storing the chunks #191

Open
tchaton opened this issue Jun 27, 2024 · 3 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@tchaton
Collaborator

tchaton commented Jun 27, 2024

This would let users who already store their dataset as parquet folders skip the conversion step. We would still need to run an indexing pass, but that isn't too painful.

@tchaton tchaton added bug Something isn't working help wanted Extra attention is needed labels Jun 27, 2024
@deependujha
Collaborator

deependujha commented Aug 31, 2024

My preliminary approach:


In OptimizeDataset:

Only for parquet files: use pyarrow's read_table function to load each parquet file one at a time. The writer then records only the number of files and the column types in the index.json file; no chunk files are created.
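As a rough sketch of this indexing step (not LitData's actual implementation): the snippet below walks a folder of parquet files one at a time and writes an index.json next to them. Note one deliberate swap: instead of read_table, it uses pyarrow's read_metadata and read_schema, which read only the file footer, so no column data is loaded. The index layout, the per-file num_rows field, and the function names describe_parquet_file, build_index and write_index are all assumptions made for illustration.

```python
import json
from pathlib import Path


def describe_parquet_file(path):
    """Return the row count and column types of one parquet file."""
    # Imported lazily so the pure-Python helpers work without pyarrow.
    import pyarrow.parquet as pq

    metadata = pq.read_metadata(str(path))  # footer only, no column data
    schema = pq.read_schema(str(path))
    return {
        "filename": Path(path).name,
        "num_rows": metadata.num_rows,
        "columns": {
            name: str(dtype) for name, dtype in zip(schema.names, schema.types)
        },
    }


def build_index(file_entries):
    """Assemble the index payload (hypothetical layout, not LitData's)."""
    entries = list(file_entries)
    return {"num_files": len(entries), "files": entries}


def write_index(parquet_dir):
    """Index every *.parquet file in a folder, sequentially."""
    parquet_dir = Path(parquet_dir)
    entries = [
        describe_parquet_file(path)
        for path in sorted(parquet_dir.glob("*.parquet"))
    ]
    index = build_index(entries)
    (parquet_dir / "index.json").write_text(json.dumps(index, indent=2))
    return index
```

Storing the row count per file is what would later let a reader translate a dataset-wide index into a file and a row inside it.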

In the reader:

All indices will remain as usual; only the way the item at index i is read will change:

df.slice(7, 1).to_pandas().to_dict() # row at index 7 of the parquet file
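To make the "indices remain as usual" part concrete, here is a minimal sketch of how a dataset-wide index could be translated into a file position and a row offset, which would then feed a slice call like the one above. It assumes the per-file row counts are known (for example from the index); build_offsets and locate are hypothetical names, not existing LitData functions.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(rows_per_file):
    """Cumulative row counts: offsets[k] is the total rows in files 0..k."""
    return list(accumulate(rows_per_file))


def locate(global_index, offsets):
    """Map a dataset-wide index to (file_position, row_within_file)."""
    if not offsets or not 0 <= global_index < offsets[-1]:
        raise IndexError(f"index {global_index} out of range")
    # First file whose cumulative count is strictly greater than the index.
    file_position = bisect_right(offsets, global_index)
    rows_before = offsets[file_position - 1] if file_position else 0
    return file_position, global_index - rows_before


offsets = build_offsets([10, 5, 20])  # three files with 10, 5 and 20 rows
file_position, row = locate(12, offsets)
print(file_position, row)  # 1 2
```

The lookup is a binary search over the cumulative counts, so it stays cheap even with many files, and files with zero rows are skipped naturally.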

If a parquet dataset has no index.json file, we can still call the helper function to generate index.json on the fly, after which StreamingDataset takes over.


Why no multithreading or multiprocessing while creating the index.json file:

  • Once loaded into memory, parquet files are decompressed and may exceed the memory limit.

Alternatively, we could take care of that in another PR.


What do you think, @tchaton?

@tchaton
Collaborator Author

tchaton commented Sep 2, 2024

Yes, that's what I had in mind. The main challenge will be making the slicing and reading as fast as possible. It might be worth using polars: https://github.com/pola-rs/polars

@tchaton
Collaborator Author

tchaton commented Sep 3, 2024

The goal is to enable reading pyarrow HF datasets with LitData.

@Borda Borda added enhancement New feature or request and removed bug Something isn't working labels Sep 4, 2024