[Datasets] Change sampling to use same API as read Parquet #28258
Conversation
I'm a bit worried that we're going to be reading 20,000x more data with this new sampling.
Do we know why the small batch size was causing a lot of extra overhead in Arrow? Is it a bug in the `head()` implementation?
batches = piece.to_batches(
    columns=columns,
    schema=schema,
    batch_size=PARQUET_READER_ROW_BATCH_SIZE,
I'm a bit concerned that we're going to be reading a lot more data now, going from 5 rows to 100k rows, which could be a lot slower/heavier for very wide tables.
Yeah, I shared the same concern. The baseline we have now is that reading 2GB takes roughly 14 seconds, which is not too bad. Let me explore more and see if I can find anything better.
Unfortunately we don't know yet, and I am going to ask on the Arrow mailing list. It's not a bug in […], and it's also running slow and OOMing. So I guess there might be some exponential overhead associated with […].
@clarkzinzow - based on the discussion in https://lists.apache.org/thread/dq6g7yyt6jl8r6pcpgokl13cfyg6vdml, the Arrow Parquet reader has a readahead feature beyond the batch size, and an extremely small batch size incurs a lot of readahead overhead, as we saw here. So I think the best option for us now:
WDYT?
Batch readahead should be disabled if […]
It looks like we weren't doing this for the file sampling, resulting in concurrent readaheads. If you have a quick benchmarking script handy, could you try it out with […]?
Signed-off-by: Cheng Su <[email protected]>
Discussed with @clarkzinzow offline:
So here we change the […]
LGTM, thanks for trying this out!
…ct#28258) Found a sampling OOM issue in ray-project#28230; after debugging I found the issue is due to the `batch_size` passed when reading Parquet. Previously we set `batch_size=5`, but it caused too much overhead when reading the files in ray-project#28230 (where the on-disk file size is 2GB). So here I change the code to set `batch_size` to a larger number - 1024. In the meantime, the number of rows to sample is restricted to no more than the first row group, as suggested in https://lists.apache.org/thread/dq6g7yyt6jl8r6pcpgokl13cfyg6vdml . Tested on the nightly test (with 400GB of files in total), and [the nightly test finished successfully before the timeout](https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_DQgxh91xNpBJQGbH2zcnTXpW?command-history-section=command_history&drivers-section=deployments.). Sampling 2 files, each 2GB on disk, now takes roughly 14 seconds. This time looks reasonable to me, so I think it's better to have the same behavior between sampling and reading, to avoid any future surprise, even though one batch is large now. ``` Parquet Files Sample: 100%|██████████| 2/2 [00:14<00:00, 7.23s/it] ``` Signed-off-by: ilee300a <[email protected]>
Signed-off-by: Cheng Su [email protected]
Why are these changes needed?
Found a sampling OOM issue in #28230; after debugging I found the issue is due to the `batch_size` passed when reading Parquet. Previously we set `batch_size=5`, but it caused too much overhead when reading the files in #28230 (where the on-disk file size is 2GB). So here I change the code to set `batch_size` to a larger number - 1024. In the meantime, the number of rows to sample is restricted to no more than the first row group, as suggested in https://lists.apache.org/thread/dq6g7yyt6jl8r6pcpgokl13cfyg6vdml .
Tested on the nightly test (with 400GB of files in total), and the nightly test finished successfully before the timeout. Sampling 2 files, each 2GB on disk, now takes roughly 14 seconds.
This time looks reasonable to me, so I think it's better to have the same behavior between sampling and reading, to avoid any future surprise, even though one batch is large now.
Related issue number
#28230
Checks
I've signed off every commit (`git commit -s`) in this PR.
I've run `scripts/format.sh` to lint the changes in this PR.