
[Data] combine_chunks before chunking pyarrow.Table block into batches #34352

Merged: 6 commits merged into ray-project:master from jjyao/slice on Apr 14, 2023

Conversation

@jjyao (Collaborator) commented Apr 13, 2023

Why are these changes needed?

pyarrow.Table.slice is slow when the table has many chunks, which makes batching a pyarrow block slow. The fix is to combine the chunks into a single chunk so that slice is faster, at the cost of an extra copy.
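To make the tradeoff concrete, here is a minimal sketch of the idea in plain pyarrow (the PR itself routes this through Ray's transform_pyarrow helper; the data below is only illustrative):

    import pyarrow as pa

    # Build a table that has accumulated many small chunks.
    table = pa.concat_tables(
        [pa.table({"x": list(range(i * 10, (i + 1) * 10))}) for i in range(1000)]
    )
    assert table.column(0).num_chunks == 1000

    # slice() must walk the chunk list, so it is slow on heavily chunked
    # tables. combine_chunks() pays one extra copy to produce a single
    # contiguous chunk, after which every slice() is cheap.
    if table.column(0).num_chunks > 1:
        table = table.combine_chunks()

    batch_size = 256
    batches = [
        table.slice(offset, batch_size)
        for offset in range(0, table.num_rows, batch_size)
    ]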

Related issue number

Closes #31108

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

# pyarrow.Table.slice is slow when the table has many chunks
# so we combine chunks into a single one to make slice faster
# with the cost of an extra copy.
# See https://github.com/ray-project/ray/issues/31108 for more details.
@ericl (Contributor) commented Apr 13, 2023

Can we file an upstream bug with Arrow here? It would be nice not to have to do this in the future, since presumably this significantly increases our peak memory usage.

@jjyao (Collaborator, Author) replied:

I reached out to them via the mailing list. I can also file a GH issue.

@jjyao (Collaborator, Author) commented Apr 13, 2023

map_batches_benchmark_single_node: https://buildkite.com/ray-project/release-tests-pr/builds/34832#01877910-398e-4751-ba0c-f94fbacd1306

| benchmark                       | time (s) |
|---------------------------------|----------|
| map-batches-pandas-1024-2-eager | 39.351   |
| map-batches-pandas-1024-2-lazy  | 54.940   |
| map-batches-pandas-2048-2-eager | 23.207   |
| map-batches-pandas-2048-2-lazy  | 31.502   |
| map-batches-pandas-4096-2-eager | 15.630   |
| map-batches-pandas-4096-2-lazy  | 20.375   |
| map-batches-pandas-None-2-eager | 11.814   |
| map-batches-pandas-None-2-lazy  | 7.915    |
| map-batches-numpy-1024-2-eager  | 28.091   |
| map-batches-numpy-1024-2-lazy   | 21.001   |
| map-batches-numpy-2048-2-eager  | 15.758   |
| map-batches-numpy-2048-2-lazy   | 12.072   |
| map-batches-numpy-4096-2-eager  | 10.054   |
| map-batches-numpy-4096-2-lazy   | 7.912    |
| map-batches-numpy-None-2-eager  | 3.282    |
| map-batches-numpy-None-2-lazy   | 2.601    |

        and block.num_columns > 0
        and block.column(0).num_chunks > 1
    ):
        block = transform_pyarrow.combine_chunks(block)
Contributor commented:

Can we structure the code to only call combine_chunks when necessary?

  • In next_batch() below, we call slice() to get one batch from the block. Can we call combine_chunks there instead? We don't need to call combine_chunks if the block is never sliced (see the sketch after this list).

  • How many num_chunks did we see in the benchmark? Can we have a minimal threshold? num_chunks > 1 feels a bit too aggressive.
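What the first alternative might look like, as a purely hypothetical sketch (the method and attribute names below are illustrative, not the actual Batcher code):

    def next_batch(self):
        block = self._buffer
        # Defer the extra copy until we actually need to slice, so whole,
        # unsliced blocks never pay for combine_chunks.
        if block.num_columns > 0 and block.column(0).num_chunks > 1:
            block = block.combine_chunks()
            self._buffer = block
        batch = block.slice(self._offset, self._batch_size)
        self._offset += self._batch_size
        return batch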

@jjyao (Collaborator, Author) replied:

In the benchmark we had 8k+ chunks. But in my test, even with only 10 big chunks, calling combine_chunks first is still faster.
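A rough way to reproduce that comparison (a sketch under assumed sizes, not the actual test script):

    import time

    import pyarrow as pa

    # 10 large chunks vs. the same data combined into one chunk.
    chunked = pa.concat_tables(
        [pa.table({"x": list(range(1_000_000))}) for _ in range(10)]
    )
    combined = chunked.combine_chunks()

    def time_batching(table: pa.Table, batch_size: int = 4096) -> float:
        start = time.perf_counter()
        for offset in range(0, table.num_rows, batch_size):
            table.slice(offset, batch_size)
        return time.perf_counter() - start

    print("chunked: ", time_batching(chunked))
    print("combined:", time_batching(combined))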

Contributor replied:

What's the size of the big chunks? Should we decide whether to combine chunks based on:

  • the number of chunks
  • the size of each chunk

Contributor replied:

Discussed offline with @jjyao: we decided to go with the current approach and expose a constant, MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS, so we can change the threshold easily when debugging the issue later.
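What exposing that constant might look like, as an illustrative sketch (only the constant name comes from this discussion; the surrounding code is not the actual PR diff):

    import pyarrow as pa

    MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS = 2

    def maybe_combine_chunks(block: pa.Table) -> pa.Table:
        # Combine only when the chunk count crosses the tunable threshold.
        if (
            block.num_columns > 0
            and block.column(0).num_chunks
            >= MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS
        ):
            block = block.combine_chunks()
        return block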

@c21 (Contributor) commented Apr 13, 2023

Thanks @jjyao! Can we also make the same change inside ShufflingBatcher? It's the batcher used for local shuffling. By the way, it would be good to add a unit test as well.

@ericl (Contributor) left a review comment:

Cool, let's make sure to add a TODO with the linked Arrow issue.

@ericl (Contributor) commented Apr 13, 2023

> It's the batcher with local shuffling. Thanks.
> btw it would be good to add a unit test as well.

One way to do this is to add asserts on the number of chunks on all the main consumption paths.
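For instance (a hypothetical sketch of such an assert; the consumption-path wiring is illustrative):

    import pyarrow as pa

    def check_batch(batch: pa.Table) -> pa.Table:
        # Every batch handed to the consumer should already be single-chunk.
        for column in batch.columns:
            assert column.num_chunks <= 1, (
                f"expected a combined batch, got {column.num_chunks} chunks"
            )
        return batch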

@jjyao jjyao requested a review from c21 April 14, 2023 15:22
@c21 (Contributor) left a review comment:

thanks @jjyao!

@jjyao jjyao merged commit 0100e64 into ray-project:master Apr 14, 2023
@jjyao jjyao deleted the jjyao/slice branch April 14, 2023 17:15
vitsai pushed a commit to vitsai/ray that referenced this pull request on Apr 17, 2023.
elliottower pushed a commit to elliottower/ray that referenced this pull request on Apr 22, 2023.
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request on May 4, 2023.
Successfully merging this pull request may close these issues.

[Datasets] NumPy batch is significantly slower than Pandas batch in map_batches benchmark