[Datasets] Change sampling to use same API as read Parquet (#28258)
Found a sampling OOM issue in #28230; after debugging, I found it is due to the `batch_size` passed when reading Parquet. Previously we set `batch_size=5`, but that causes too much overhead when reading the files in #28230 (where the on-disk file size is 2GB). So here I change the code to set `batch_size` to a larger number, 1024. In the meantime, the number of rows to sample is restricted to no more than the first row group, as suggested in https://lists.apache.org/thread/dq6g7yyt6jl8r6pcpgokl13cfyg6vdml. Tested on the nightly test (with 400GB of files in total), and [the nightly test finished successfully before the timeout](https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_DQgxh91xNpBJQGbH2zcnTXpW?command-history-section=command_history&drivers-section=deployments.). Sampling 2 files, each 2GB on disk, now takes roughly 14 seconds. This looks reasonable to me, so I think it's better for sampling and reading to share the same behavior, to avoid future surprises, even though one batch is now large.

```
Parquet Files Sample: 100%|██████████| 2/2 [00:14<00:00, 7.23s/it]
```
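For context, here is a minimal sketch of the sampling idea with PyArrow (not the actual Ray Datasets implementation; the function name and constant are illustrative): read with a larger `batch_size` and cap the number of sampled rows at the size of the first row group.

```python
import pyarrow.parquet as pq

# Illustrative constant: a larger batch size to reduce per-batch read overhead.
SAMPLE_BATCH_SIZE = 1024

def sample_parquet_file(path: str):
    """Sample at most one row group's worth of rows from a Parquet file."""
    pf = pq.ParquetFile(path)
    # Cap the sample at the first row group, per the Arrow mailing-list advice.
    num_sample_rows = min(SAMPLE_BATCH_SIZE, pf.metadata.row_group(0).num_rows)
    # Read with the larger batch size and keep only the sampled rows.
    first_batch = next(pf.iter_batches(batch_size=SAMPLE_BATCH_SIZE))
    return first_batch.slice(0, num_sample_rows)
```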