[Data] should set num_returns in ray.wait inside ray data progress bar #46692

Merged · 2 commits merged into ray-project:master on Jul 30, 2024

Conversation

@tespent (Contributor) commented Jul 18, 2024

Why are these changes needed?

Fixes #46674

Related issue number

Closes #46674

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@raulchen (Contributor) commented Jul 22, 2024

I agree that the default num_returns=1 is too small and can introduce significant overhead. On the other hand, I'm concerned that simply waiting for all remaining objects could regress small-scale, latency-sensitive workloads.
A better middle ground would be to make num_returns proportional to the number of remaining objects. But I'm not sure what the best value is; we need to run some tests with different percentages on the release tests. cc @scottjlee
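As a rough illustration (not code from this PR), a proportional policy could look something like the sketch below; the 50% fraction and the timeout/fetch_local arguments are placeholder assumptions that would need tuning against the release tests.

```python
import ray


def wait_proportional(remaining, fraction=0.5, timeout=0.1):
    """Illustrative only: wait for a fraction of the remaining refs
    instead of the default num_returns=1. The fraction is a placeholder
    value, not one validated by the release tests."""
    num_returns = max(1, int(len(remaining) * fraction))
    done, remaining = ray.wait(
        remaining,
        num_returns=num_returns,
        timeout=timeout,
        fetch_local=True,
    )
    return done, remaining
```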

@scottjlee (Contributor) commented Jul 24, 2024

I ran release tests comparing master (BK) vs. the changes in this PR (BK run). Here are the results for some of the major batch inference / training ingest benchmarks:

| Test Name                                      | Runtime (Master) | Runtime (PR) | % Difference |
|------------------------------------------------|------------------|--------------|--------------|
| torch_batch_inference_1_gpu_10gb_parquet       | 65.91            | 73.49        | 11.49%       |
| stable_diffusion_benchmark                     | 1166.5           | 1308.18      | 12.15%       |
| read_parquet_train_16_gpu                      | 125.4            | 116.5        | -7.10%       |
| read_images_train_1_gpu_5_cpu                  | 900.3            | 899.7        | -0.07%       |
| iter_tensor_batches_benchmark_multi_node       | 76.4             | 83.4         | 9.16%        |
| dataset_shuffle_random_shuffle_1tb             | 560.8            | 450.5        | -19.67%      |

so it looks like the smaller workloads (e.g. the 10 GB batch inference and stable diffusion benchmarks) regress, while the larger workloads such as the 1 TB shuffle improve. @raulchen

@tespent (Contributor, Author) commented Jul 25, 2024

I think workloads like inference on a 10 GB dataset might be too small to expose the issue this change addresses, which is why they show a performance degradation. In our jobs, a repartition from ~5,000 to ~20,000 blocks on a 360-node cluster takes about 1 hour without this change, but only about 10 minutes with it. A larger num_returns is desirable for such cases.

Although it looks ugly and takes some effort to choose a good magic number, perhaps we could use a piecewise function to mitigate the impact on smaller datasets? For example (50 and 4000 blocks are the turning points of the function below):

num_returns=int(max(1, 0.5*len(remaining)-100, 0.8*len(remaining)-1300))
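For readability, the same expression could be wrapped in a small helper (a sketch only; the helper name is made up here, and the constants are the ones from the expression above):

```python
def piecewise_num_returns(num_remaining: int) -> int:
    # Same piecewise-linear expression as above: roughly, wait for one
    # object at a time on small datasets, ~50% of the remaining objects
    # on mid-sized ones, and ~80% on large ones.
    return int(max(1, 0.5 * num_remaining - 100, 0.8 * num_remaining - 1300))


# Example values:
#   piecewise_num_returns(100)    -> 1
#   piecewise_num_returns(1_000)  -> 400
#   piecewise_num_returns(20_000) -> 14_700
```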

@scottjlee (Contributor) left a review comment

Discussed with @raulchen offline; we concluded that it's fine to use len(remaining) as num_returns here, since this code path is used in the context of AllToAllOperators and not by the overall streaming executor. The streaming executor's main ray.wait usage in process_completed_tasks() has a separate loop and timeout: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/execution/streaming_executor_state.py#L402-L407
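For context, the direction agreed on here amounts to a wait loop along the following lines (a sketch only, not the exact merged diff; the function name and the update_progress callback are stand-ins for the progress bar's own update logic):

```python
import ray


def fetch_until_complete_sketch(refs, update_progress):
    # Sketch of the progress-bar wait loop for an AllToAll operator:
    # wait for *all* remaining refs each iteration, with a short timeout
    # so the bar can still refresh while tasks are running.
    remaining = list(refs)
    results = []
    while remaining:
        done, remaining = ray.wait(
            remaining,
            num_returns=len(remaining),
            timeout=0.1,
        )
        results.extend(ray.get(done))
        update_progress(len(done))
    return results
```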

@scottjlee added the `go` label (add ONLY when ready to merge, run all tests) on Jul 30, 2024
@raulchen enabled auto-merge (squash) on July 30, 2024 at 19:56
@github-actions bot disabled auto-merge on July 30, 2024 at 19:57
@raulchen merged commit 9f8b8be into ray-project:master on Jul 30, 2024
5 checks passed
Labels
go add ONLY when ready to merge, run all tests
Development

Successfully merging this pull request may close these issues.

[Data] Execution of repartition stuck at ray.wait (#46674)