
[tune/execution] Update staged resources in a fixed counter for faster lookup #32087

Merged (6 commits) Jan 31, 2023

Conversation

@krfricke (Contributor) commented Jan 31, 2023

Signed-off-by: Kai Fricke [email protected]

Why are these changes needed?

In #30016 we migrated Ray Tune to use a new resource management interface. In the same PR, we simplified the resource consolidation logic. This led to a performance regression first identified in #31337.

After manual profiling, the regression appears to come from `RayTrialExecutor._count_staged_resources`. With 1000 staged trials, this function is called on every step and performs a linear scan through all staged trials.
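To illustrate the bottleneck, here is a minimal sketch of the pre-fix behavior. The class and attribute names (`_staged_trials`, `placement_group_factory`) are modeled on the PR description, not copied from the actual Ray source; the point is that the counter is rebuilt from scratch on every call:

```python
from collections import Counter


class TrialExecutorSketch:
    """Hypothetical simplification of RayTrialExecutor (names assumed)."""

    def __init__(self):
        self._staged_trials = set()

    def _count_staged_resources(self):
        # Rebuilt on every call: O(len(_staged_trials)) work per step,
        # even though the staged set rarely changes between steps.
        counter = Counter()
        for trial in self._staged_trials:
            counter[trial.placement_group_factory] += 1
        return counter
```

With 1000 staged trials and a call on every step, this scan dominates the scheduling loop.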

This PR fixes this performance bottleneck by maintaining the resource counter as persistent state instead of dynamically recreating it on every call. This is straightforward: we add/subtract the resources whenever a trial is added to or removed from the `RayTrialExecutor._staged_trials` set.
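A minimal sketch of the fix's idea, under the same assumed names as above (not the actual Ray implementation): keep a `Counter` in sync with the staged set at add/remove time, so the lookup itself becomes O(1):

```python
from collections import Counter


class TrialExecutorFixedSketch:
    """Hypothetical simplification of the fixed executor (names assumed)."""

    def __init__(self):
        self._staged_trials = set()
        self._staged_resources = Counter()  # kept in sync incrementally

    def stage_trial(self, trial):
        if trial not in self._staged_trials:
            self._staged_trials.add(trial)
            self._staged_resources[trial.placement_group_factory] += 1

    def unstage_trial(self, trial):
        if trial in self._staged_trials:
            self._staged_trials.remove(trial)
            self._staged_resources[trial.placement_group_factory] -= 1

    def _count_staged_resources(self):
        # No scan: the counter already reflects the staged set.
        return self._staged_resources
```

The trade-off is that every mutation of `_staged_trials` must go through the stage/unstage methods so the counter cannot drift out of sync with the set.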

Manual testing confirmed this improves the runtime of tune_scalability_result_throughput_cluster from ~132 seconds to ~122 seconds, bringing it back to the same level as before the refactor.

Related issue number

Closes #32077

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 2 commits January 30, 2023 16:03
Kai Fricke added 2 commits January 30, 2023 16:17
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
@krfricke
Contributor Author

Signed-off-by: Kai Fricke <[email protected]>
@krfricke krfricke merged commit 10d52f7 into ray-project:master Jan 31, 2023
@krfricke krfricke deleted the tune/cache-staged-resources branch January 31, 2023 21:16
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…r lookup (ray-project#32087)

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Development

Successfully merging this pull request may close these issues.

[tune] Possible performance regression in test_result_throughput_cluster
4 participants