
[Datasets] Change sampling to use same API as read Parquet #28258

Merged
merged 5 commits into ray-project:master from c21:sample-fix
Sep 8, 2022

Conversation

@c21 (Contributor) commented Sep 2, 2022

Signed-off-by: Cheng Su [email protected]

Why are these changes needed?

Found a sampling OOM issue in #28230; after debugging, I found the issue is due to the batch_size passed when reading Parquet. Previously we set batch_size=5, but it causes too much overhead when reading the files in #28230 (where the on-disk file size is 2GB). So here I change the code to set batch_size to a larger number, 1024. In the meantime, restrict the number of rows to sample to no more than the first row group, as suggested in https://lists.apache.org/thread/dq6g7yyt6jl8r6pcpgokl13cfyg6vdml .

Tested on the nightly test (with 400GB of files in total), and the nightly test finished successfully before the timeout. Sampling 2 files, each 2GB on disk, roughly takes 14 seconds now.

This time looks reasonable to me, so I think it's better to have the same behavior between sampling and reading, to avoid any future surprise, even though one batch is large now.

```
Parquet Files Sample: 100%|██████████| 2/2 [00:14<00:00,  7.23s/it]
```
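For context, here is a minimal sketch of the sampling approach described above: read with the same to_batches() API as the full read path, with batch_size=1024, and cap the sample at the first row group. `piece` is assumed to be a pyarrow.dataset.ParquetFileFragment; the helper and constant names are illustrative, not Ray's exact code.

```
import pyarrow as pa

PARQUET_READER_ROW_BATCH_SIZE = 1024  # same batch size as the full read path

def sample_piece(piece, columns=None, schema=None, **reader_args):
    # Never sample past the first row group, per the Arrow mailing-list advice.
    remaining = piece.metadata.row_group(0).num_rows
    sampled = []
    for batch in piece.to_batches(
        columns=columns,
        schema=schema,
        batch_size=PARQUET_READER_ROW_BATCH_SIZE,
        **reader_args,
    ):
        if remaining <= 0:
            break
        sampled.append(batch.slice(0, min(remaining, batch.num_rows)))
        remaining -= batch.num_rows
    # Assumes at least one batch was read; pass an explicit schema otherwise.
    return pa.Table.from_batches(sampled, schema=schema)
```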

Related issue number

#28230

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@clarkzinzow (Contributor) left a comment:

I'm a bit worried that we're going to be reading 20,000x more data with this new sampling.

Do we know why the small batch size was causing a lot of extra overhead in Arrow? Is it a bug in the head() implementation?

python/ray/data/datasource/parquet_datasource.py

```
batches = piece.to_batches(
    columns=columns,
    schema=schema,
    batch_size=PARQUET_READER_ROW_BATCH_SIZE,
    **reader_args,
)
```

Contributor:

I'm a bit concerned that we're going to be reading a lot more data now, going from 5 rows to 100k rows, which could be a lot slower/heavier for very wide tables.

@c21 (Contributor Author):

Yeah, I share the same concern. Reading 2GB in roughly 14 seconds is the baseline we have now, which is not too bad. Let me explore more to see if I can find anything better.

@c21 (Contributor Author) commented Sep 2, 2022

> Do we know why the small batch size was causing a lot of extra overhead in Arrow? Is it a bug in the head() implementation?

Unfortunately we don't know yet, and I'm going to ask on the Arrow mailing list. It's not a bug in the head() implementation; I tried with to_batches():

```
batches = piece.to_batches(
    columns=columns,
    schema=schema,
    batch_size=5,
    **reader_args,
)
```

and it also ran slowly and hit OOM. So I guess there might be some exponential overhead associated with batch_size when its value is small.

@c21 (Contributor Author) commented Sep 6, 2022

@clarkzinzow - based on the discussion in https://lists.apache.org/thread/dq6g7yyt6jl8r6pcpgokl13cfyg6vdml, the Arrow Parquet reader has a readahead feature beyond batch size, and an extremely small batch size incurs a lot of readahead overhead, as we saw here. So I think the best options for us now are:

  • For now, sample only the first row group, as the Arrow folks suggested (and restrict batch_size to 100000, the same as read, to avoid any inconsistent surprise when a row group is large).
  • After the Arrow 10.0.0 release, explore the new readahead option (right now it's not exposed); see the sketch after this list.
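For the second option, a hedged sketch of what tuning readahead could look like once it's exposed; the batch_readahead/fragment_readahead parameter names are the ones pyarrow 10.0 later added to the dataset scanner, and the file path and values here are illustrative only.

```
import pyarrow.dataset as ds

dataset = ds.dataset("example.parquet", format="parquet")  # hypothetical file
scanner = dataset.scanner(
    batch_size=5,
    batch_readahead=1,     # keep batch readahead to a minimum
    fragment_readahead=1,  # likewise for fragment readahead
)
first_batch = next(iter(scanner.to_batches()))
```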

WDYT?

@clarkzinzow (Contributor):

Batch readahead should be disabled if use_threads is set to False, as we already do for the actual file reading, so we should be able to make readahead a non-factor.

It looks like we weren't doing this for the file sampling, resulting in concurrent readaheads. If you have a quick benchmarking script handy, could you try it out with use_threads=False and a small batch size?
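A hedged sketch of such a quick benchmark (the file path is hypothetical, and the harness is a stand-in for the actual nightly-test setup):

```
import time
import pyarrow.dataset as ds

dataset = ds.dataset("example.parquet", format="parquet")
fragment = next(iter(dataset.get_fragments()))

start = time.perf_counter()
# Single-threaded read with the original small batch size.
first_batch = next(iter(fragment.to_batches(batch_size=5, use_threads=False)))
print(f"first batch ({first_batch.num_rows} rows) in {time.perf_counter() - start:.2f}s")
```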

@c21 (Contributor Author) commented Sep 7, 2022

Discussed with @clarkzinzow offline:

  • I tried setting batch_size=5 and use_threads=False; the file read becomes pretty slow, taking more than 5 minutes for a single file, so it doesn't work. So we guess there is probably still some other overhead going on even when we disable use_threads.
  • I also tried batch_size=1024; it works well, with a similar time to batch_size=100000 (see the comparison sketch after this comment), and a 1024-row batch is much more acceptable than a 100000-row one. Spark and Arrow Rust also use 1024 rows as the batch size when reading Parquet files.

So here we change batch_size to 1024. The nightly test also ran successfully: https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_DQgxh91xNpBJQGbH2zcnTXpW?command-history-section=command_history&drivers-section=deployments
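For reference, a minimal sketch of the batch-size comparison described above (hypothetical file path; not the exact benchmark that was run):

```
import time
import pyarrow.dataset as ds

fragment = next(iter(ds.dataset("example.parquet", format="parquet").get_fragments()))

for batch_size in (1024, 100_000):
    start = time.perf_counter()
    next(iter(fragment.to_batches(batch_size=batch_size)))
    print(f"batch_size={batch_size}: first batch in {time.perf_counter() - start:.2f}s")
```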

@c21 c21 changed the title [Datasets] Change sampling to use batch_size as read Parquet [Datasets] Change sampling to use same API as read Parquet Sep 7, 2022
@clarkzinzow (Contributor) left a comment:

LGTM, thanks for trying this out!

@ericl ericl merged commit c2be475 into ray-project:master Sep 8, 2022
@c21 c21 deleted the sample-fix branch September 8, 2022 20:32
ilee300a pushed a commit to ilee300a/ray that referenced this pull request Sep 12, 2022
justinvyu pushed a commit to justinvyu/ray that referenced this pull request Sep 14, 2022