[Data] should set num_returns in ray.wait inside ray data progress bar #46692

tespent · 2024-07-18T06:48:58Z

Why are these changes needed?

Fixes #46674

Related issue number

Closes #46674

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Wu Yufei <[email protected]>

raulchen · 2024-07-22T21:04:40Z

I agree that the default num_returns=1 is too small and can add significant overhead. On the other hand, I'm concerned that simply waiting for all remaining objects could regress small-scale, latency-sensitive workloads.
A better middle-ground solution would be to make num_returns proportional to the number of remaining objects. But I'm not sure what the best value is. We need to run the release tests with different percentages. cc @scottjlee
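
One way to read the proportional idea, as a rough sketch only: the helper name `proportional_num_returns` and its `fraction` parameter are hypothetical, and the default fraction is a placeholder rather than a tested value.

```python
import math


def proportional_num_returns(num_remaining: int, fraction: float = 0.25) -> int:
    """Hypothetical helper: choose num_returns as a fraction of the remaining refs.

    The result is clamped to [1, num_remaining], since ray.wait requires
    0 < num_returns <= len(refs). The default fraction is an assumption.
    """
    if num_remaining <= 0:
        raise ValueError("num_remaining must be positive")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return max(1, min(num_remaining, math.ceil(fraction * num_remaining)))
```

A caller would then pass `num_returns=proportional_num_returns(len(remaining))` to `ray.wait`. With `fraction=1.0` this reduces to `len(remaining)`, and for very short lists it falls back to 1.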

scottjlee · 2024-07-24T17:57:43Z

I ran release tests comparing master (BK) vs. the changes in this PR (BK run). Here are the results for some of the major batch inference / training ingest benchmarks:

| Test Name                                      | Runtime (Master) | Runtime (PR) | % Difference |
|------------------------------------------------|------------------|--------------|--------------|
| torch_batch_inference_1_gpu_10gb_parquet       | 65.91            | 73.49        | 11.49%       |
| stable_diffusion_benchmark                     | 1166.5           | 1308.18      | 12.15%       |
| read_parquet_train_16_gpu                      | 125.4            | 116.5        | -7.10%       |
| read_images_train_1_gpu_5_cpu                  | 900.3            | 899.7        | -0.07%       |
| iter_tensor_batches_benchmark_multi_node       | 76.4             | 83.4         | 9.16%        |
| dataset_shuffle_random_shuffle_1tb             | 560.8            | 450.5        | -19.67%      |

so it looks like larger workloads are negatively impacted. @raulchen
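
For reading the table: the % Difference column is consistent with (PR - master) / master, so positive values mean the PR run was slower. The comment does not state the formula, so treat this as an assumption. A small sketch that recomputes a few rows:

```python
def pct_difference(master_runtime: float, pr_runtime: float) -> float:
    """Relative runtime change of the PR vs. master, in percent (positive = slower)."""
    return (pr_runtime - master_runtime) / master_runtime * 100.0


# Runtimes copied from the table above as (master, PR).
rows = {
    "stable_diffusion_benchmark": (1166.5, 1308.18),
    "read_parquet_train_16_gpu": (125.4, 116.5),
    "dataset_shuffle_random_shuffle_1tb": (560.8, 450.5),
}
for name, (master, pr) in rows.items():
    print(f"{name}: {pct_difference(master, pr):+.2f}%")
```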

tespent · 2024-07-25T02:52:27Z

I think tasks like inference on a 10 GB dataset may be too small to expose the issue behind this change, which would explain the performance degradation. In our jobs, a repartition from ~5,000 to ~20,000 blocks on a 360-node cluster takes about 1 hour without this change but only about 10 minutes with it. A larger value of num_returns is desirable in such cases.

Although it looks ugly and takes a lot of effort to choose a good magic number, perhaps we can use a piecewise function to mitigate the impact on smaller datasets? For example (50 and 4000 blocks are the turning points for the function below):

num_returns=int(max(1, 0.5*len(remaining)-100, 0.8*len(remaining)-1300))
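
To see how this expression behaves, here is a small sketch that wraps it in a function (the name `piecewise_num_returns` is hypothetical) and prints it for a few block counts. Evaluated as written, it stays at 1 until the number of remaining blocks exceeds 202, and the 0.8 slope takes over past 4000.

```python
def piecewise_num_returns(num_remaining: int) -> int:
    """The expression from the comment above, with len(remaining) passed in as a count."""
    return int(max(1, 0.5 * num_remaining - 100, 0.8 * num_remaining - 1300))


# Print the value at a few sample sizes, including the points where the max() switches terms.
for n in (50, 202, 1000, 4000, 20000):
    print(n, piecewise_num_returns(n))
```

For any positive count the result stays between 1 and `len(remaining)`, which is the range `ray.wait` accepts for `num_returns`.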

scottjlee

Discussed with @raulchen offline; we concluded that it's fine to use len(remaining) as num_returns here, since it is used in the context of AllToAllOperators and not the overall streaming executor. The streaming executor's main usage in process_completed_tasks() has a separate loop and timeout: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/execution/streaming_executor_state.py#L402-L407
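
The pattern agreed on here can be sketched roughly as follows. This is an illustration under assumptions, not the actual Ray Data code: the function name, the `on_progress` callback, and the 0.1 s timeout are hypothetical, and the wait function is injected so the loop can be exercised without a cluster.

```python
from typing import Callable, Sequence, Tuple

# Any callable with ray.wait's shape: (refs, num_returns=..., timeout=...) -> (done, remaining)
WaitFn = Callable[..., Tuple[list, list]]


def block_until_complete(
    remaining: Sequence,
    wait_fn: WaitFn,
    on_progress: Callable[[int], None],
    timeout_s: float = 0.1,
) -> None:
    """Hypothetical sketch: ask for all remaining refs on every iteration.

    With a timeout, each call returns whatever finished in that window, so
    progress is reported in batches instead of one ref per call.
    """
    remaining = list(remaining)
    while remaining:
        done, remaining = wait_fn(
            remaining, num_returns=len(remaining), timeout=timeout_s
        )
        on_progress(len(done))
```

With Ray installed, `ray.wait` can be passed as `wait_fn`, since it accepts `num_returns` and `timeout` as keyword arguments and returns the ready and not-ready lists.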

[Data] should set num_returns in ray.wait inside ray data progress bar

cc9f895

Signed-off-by: Wu Yufei <[email protected]>

tespent requested review from ericl, scv119, c21, amogkam, scottjlee, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners July 18, 2024 06:48

scottjlee approved these changes Jul 29, 2024

raulchen approved these changes Jul 30, 2024

scottjlee added the go label (add ONLY when ready to merge, run all tests) Jul 30, 2024

raulchen enabled auto-merge (squash) July 30, 2024 19:56

Merge branch 'master' into fix/data-progress-wait

cbb4889

github-actions bot disabled auto-merge July 30, 2024 19:57

raulchen merged commit 9f8b8be into ray-project:master Jul 30, 2024
5 checks passed
