Add support for parquet files for storing the chunks #191

tchaton · 2024-06-27T15:17:26Z

This would let users skip the conversion step when their dataset is already stored as folders of parquet files. We would still need to run an indexing pass, but that isn't too painful.

deependujha · 2024-08-31T19:04:07Z

My preliminary approach:

In OptimizeDataset:

Only for parquet inputs: use pyarrow's read_table function to load each parquet file one by one. The writer then records only the number of files and the column types in the index.json file; no chunk files are created.

In the reader:

All indices remain as usual; only the read at index i changes:

df.slice(7, 1).to_pandas().to_dict()  # the row at index 7 of the parquet file
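To make the lookup concrete, here is a rough sketch of how a dataset-wide index could be translated into a file and a row offset before calling slice. It assumes the index stores one row count per parquet file; the helper name `locate` is hypothetical and only the standard library is used.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(global_index: int, rows_per_file: list[int]) -> tuple[int, int]:
    """Map a dataset-wide index to (file position, row offset inside that file)."""
    total = sum(rows_per_file)
    if not 0 <= global_index < total:
        raise IndexError(f"index {global_index} out of range for {total} rows")
    # cumulative[k] is the number of rows contained in files 0..k inclusive.
    cumulative = list(accumulate(rows_per_file))
    # The first file whose cumulative count exceeds the index holds the row.
    file_pos = bisect_right(cumulative, global_index)
    start = cumulative[file_pos - 1] if file_pos > 0 else 0
    return file_pos, global_index - start
```

With row counts of 3, 2 and 4, index 4 resolves to file 1 at offset 1, so the read becomes a slice(1, 1) on the table loaded from that file.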

If a parquet dataset has no index.json file, we can still call the helper function to generate index.json on the fly and then StreamingDataset takes control.

Why no multithreading or multiprocessing while creating the index.json file:

Parquet files are decompressed once loaded into memory, so loading several of them at the same time may exceed the memory limit.

Or, we might take care of it in another PR.

What do you think, @tchaton?

tchaton · 2024-09-02T06:40:43Z

Yes, that's what I had in mind. The main challenge will be to make the slicing and reading as fast as possible. It might be worth using https://github.com/pola-rs/polars

tchaton · 2024-09-03T18:06:37Z

The goal is to enable reading pyarrow HF datasets with LitData.

tchaton added the "bug" and "help wanted" labels on Jun 27, 2024

Borda added the "enhancement" label and removed the "bug" label on Sep 4, 2024
