
[Data] combine_chunks before chunking pyarrow.Table block into batches #34352

Merged: 6 commits merged into ray-project:master from jjyao/slice on Apr 14, 2023

Conversation

@jjyao (Collaborator) commented Apr 13, 2023

Why are these changes needed?

pyarrow.Table.slice is slow when the table has many chunks, which makes batching a pyarrow block slow. The fix is to combine the chunks into a single chunk so that slice is faster, at the cost of an extra copy.
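To make the tradeoff concrete, here is a minimal sketch of the idea in plain pyarrow (the PR itself routes this through Ray's transform_pyarrow helper; the data below is only illustrative):

    import pyarrow as pa

    # Build a table that has accumulated many small chunks.
    table = pa.concat_tables(
        [pa.table({"x": list(range(i * 10, (i + 1) * 10))}) for i in range(1000)]
    )
    assert table.column(0).num_chunks == 1000

    # slice() must walk the chunk list, so it is slow on heavily chunked
    # tables. combine_chunks() pays one extra copy to produce a single
    # contiguous chunk, after which every slice() is cheap.
    if table.column(0).num_chunks > 1:
        table = table.combine_chunks()

    batch_size = 256
    batches = [
        table.slice(offset, batch_size)
        for offset in range(0, table.num_rows, batch_size)
    ]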

Related issue number

Closes #31108

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

# pyarrow.Table.slice is slow when the table has many chunks
# so we combine chunks into a single one to make slice faster
# with the cost of an extra copy.
# See https://github.com/ray-project/ray/issues/31108 for more details.
@ericl (Contributor) commented Apr 13, 2023

Can we file an upstream bug with Arrow here? It would be nice not to have to do this in the future, since presumably this significantly increases our peak memory usage.

@jjyao (Collaborator, Author) replied:

I reached out to them via the mailing list. I can also file a GH issue.

@jjyao (Collaborator, Author) commented Apr 13, 2023

map_batches_benchmark_single_node: https://buildkite.com/ray-project/release-tests-pr/builds/34832#01877910-398e-4751-ba0c-f94fbacd1306

| benchmark                       | time (s) |
|---------------------------------|----------|
| map-batches-pandas-1024-2-eager | 39.351   |
| map-batches-pandas-1024-2-lazy  | 54.940   |
| map-batches-pandas-2048-2-eager | 23.207   |
| map-batches-pandas-2048-2-lazy  | 31.502   |
| map-batches-pandas-4096-2-eager | 15.630   |
| map-batches-pandas-4096-2-lazy  | 20.375   |
| map-batches-pandas-None-2-eager | 11.814   |
| map-batches-pandas-None-2-lazy  | 7.915    |
| map-batches-numpy-1024-2-eager  | 28.091   |
| map-batches-numpy-1024-2-lazy   | 21.001   |
| map-batches-numpy-2048-2-eager  | 15.758   |
| map-batches-numpy-2048-2-lazy   | 12.072   |
| map-batches-numpy-4096-2-eager  | 10.054   |
| map-batches-numpy-4096-2-lazy   | 7.912    |
| map-batches-numpy-None-2-eager  | 3.282    |
| map-batches-numpy-None-2-lazy   | 2.601    |

        and block.num_columns > 0
        and block.column(0).num_chunks > 1
    ):
        block = transform_pyarrow.combine_chunks(block)
Contributor commented:

Can we structure the code to only call combine_chunks when necessary?

  • In next_batch() below, we call slice() to get one batch from the block. Can we call combine_chunks there instead? We don't need to call combine_chunks if the block is never sliced (see the sketch after this list).

  • How many num_chunks did we see in the benchmark? Can we have a minimal threshold? num_chunks > 1 feels a bit too aggressive.
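What the first alternative might look like, as a purely hypothetical sketch (the method and attribute names below are illustrative, not the actual Batcher code):

    def next_batch(self):
        block = self._buffer
        # Defer the extra copy until we actually need to slice, so whole,
        # unsliced blocks never pay for combine_chunks.
        if block.num_columns > 0 and block.column(0).num_chunks > 1:
            block = block.combine_chunks()
            self._buffer = block
        batch = block.slice(self._offset, self._batch_size)
        self._offset += self._batch_size
        return batch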

@jjyao (Collaborator, Author) replied:

In the benchmark we had 8k+ chunks. But in my test, even with only 10 big chunks, calling combine_chunks first is still faster.
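A rough way to reproduce that comparison (a sketch under assumed sizes, not the actual test script):

    import time

    import pyarrow as pa

    # 10 large chunks vs. the same data combined into one chunk.
    chunked = pa.concat_tables(
        [pa.table({"x": list(range(1_000_000))}) for _ in range(10)]
    )
    combined = chunked.combine_chunks()

    def time_batching(table: pa.Table, batch_size: int = 4096) -> float:
        start = time.perf_counter()
        for offset in range(0, table.num_rows, batch_size):
            table.slice(offset, batch_size)
        return time.perf_counter() - start

    print("chunked: ", time_batching(chunked))
    print("combined:", time_batching(combined))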

Contributor replied:

What's the size of the big chunks? Should we decide whether to combine chunks based on:

  • the number of chunks
  • the size of each chunk

Contributor replied:

Discussed offline with @jjyao: we decided to go with the current approach and expose a constant, MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS, so we can change the threshold easily when debugging the issue later.
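What exposing that constant might look like, as an illustrative sketch (only the constant name comes from this discussion; the surrounding code is not the actual PR diff):

    import pyarrow as pa

    MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS = 2

    def maybe_combine_chunks(block: pa.Table) -> pa.Table:
        # Combine only when the chunk count crosses the tunable threshold.
        if (
            block.num_columns > 0
            and block.column(0).num_chunks
            >= MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS
        ):
            block = block.combine_chunks()
        return block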

@c21 (Contributor) commented Apr 13, 2023

Thanks @jjyao! Can we also make the same change inside ShufflingBatcher? It's the batcher used for local shuffling. By the way, it would be good to add a unit test as well.

@ericl (Contributor) left a review comment:

Cool, let's make sure to add a TODO with the linked Arrow issue.

@ericl (Contributor) commented Apr 13, 2023

> It's the batcher with local shuffling. Thanks.
> btw it would be good to add a unit test as well.

One way to do this is to add asserts on the number of chunks on all the main consumption paths.
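For instance (a hypothetical sketch of such an assert; the consumption-path wiring is illustrative):

    import pyarrow as pa

    def check_batch(batch: pa.Table) -> pa.Table:
        # Every batch handed to the consumer should already be single-chunk.
        for column in batch.columns:
            assert column.num_chunks <= 1, (
                f"expected a combined batch, got {column.num_chunks} chunks"
            )
        return batch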

@jjyao jjyao requested a review from c21 April 14, 2023 15:22
@c21 (Contributor) left a review comment:

thanks @jjyao!

@jjyao jjyao merged commit 0100e64 into ray-project:master Apr 14, 2023
@jjyao jjyao deleted the jjyao/slice branch April 14, 2023 17:15
vitsai pushed a commit to vitsai/ray that referenced this pull request on Apr 17, 2023.
elliottower pushed a commit to elliottower/ray that referenced this pull request on Apr 22, 2023.
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request on May 4, 2023.
Successfully merging this pull request may close these issues.

[Datasets] NumPy batch is significantly slower than Pandas batch in map_batches benchmark