-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] should set num_returns in ray.wait inside ray data progress bar #46692
Conversation
Signed-off-by: Wu Yufei <[email protected]>
I agree that the default |
I ran release tests comparing master (BK) vs. the changes in this PR (BK run). Here are the results for some of the major batch inference / training ingest benchmarks:
so it looks like larger workloads are negatively impacted. @raulchen |
I think tasks like inference with a 10gb dataset might be too small to expose the issue behind this change, thus causing performance degradation. In our jobs, an repartition from ~5,000 to ~20,000 blocks on a 360 nodes cluster takes about 1 hour without this change but only about 10min after that. A larger value of num_returns is desirable for such case. Although it looks ugly and requires many efforts to choose a better magic number, perhaps we can use piecewise function to mitigate impact on smaller datasets? for example: (50 and 4000 blocks are the turning points for the below function)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed with @raulchen offline, we concluded that it's fine to use len(remaining)
as num_returns here, since it is used in the context of AllToAllOperators and not the overall streaming executor. The streaming executor main usage in process_completed_tasks()
has a separate loop and timeout: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/execution/streaming_executor_state.py#L402-L407
Why are these changes needed?
Fixes #46674
Related issue number
Closes #46674
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.