Support streaming datasets with pyarrow.parquet.read_table #6251
Conversation
This function reads the entire Arrow table in one go, which is not ideal memory-wise, so I don't think we should encourage using it, given that we want to keep RAM usage as low as possible in streaming mode. (Note that Parquet files are compressed, so the loaded table can be significantly larger than its on-disk size.) Instead, we should suggest that authors use:

```python
import pyarrow as pa
import pyarrow.parquet as pq

with open(doc_path, "rb") as f:
    parquet_file = pq.ParquetFile(f)
    for batch in parquet_file.iter_batches():
        pa_table = pa.Table.from_batches([batch])
        yield idx, pa_table
        idx += 1
```
@mariosasko I think the potential problem you raise is independent of whether or not we support streaming mode. In fact, what we should suggest instead is to follow the scriptless approach. In summary, let me clarify the goal and the scope of this PR:
Yes, the no-script approach with metadata configs makes the most sense.
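As a sketch of that no-script approach: a dataset repo can declare its Parquet data files directly in the README's YAML metadata instead of shipping a loading script, so the library streams the files itself. The config name and path pattern below are hypothetical:

```yaml
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
```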
Some of the Parquet files in that repo are larger than 1 GB... Also, I'd wait for more instances of people using the
@mariosasko, yes, this solution is not specifically for the "uonlp/CulturaX" dataset, but for other use cases as I explained above: indeed, they finally removed the use of
Do you know how many datasets are currently using
Zero (based on the script that checks the contents of the public Hub datasets' loading scripts).
I see... Thanks! 🤗
@mariosasko thanks for pointing out the script! 🤗 However, I have found some Hub datasets that are using
I'm merging this PR as discussed in private.
Support streaming datasets with pyarrow.parquet.read_table.

See: https://huggingface.co/datasets/uonlp/CulturaX/discussions/2
CC: @AndreaFrancis