
[tune/execution] Update staged resources in a fixed counter for faster lookup #32087

Merged (6 commits) Jan 31, 2023

Conversation

@krfricke (Contributor) commented Jan 31, 2023

Signed-off-by: Kai Fricke [email protected]

Why are these changes needed?

In #30016 we migrated Ray Tune to use a new resource management interface. In the same PR, we simplified the resource consolidation logic. This led to a performance regression first identified in #31337.

After manual profiling, the regression appears to come from `RayTrialExecutor._count_staged_resources`. With 1000 staged trials, this function is called on every step and performs a linear scan through all staged trials.
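To illustrate the bottleneck, here is a minimal sketch of the pre-fix behavior. The class and attribute names (`_staged_trials`, `placement_group_factory`) are modeled on the PR description, not copied from the actual Ray source; the point is that the counter is rebuilt from scratch on every call:

```python
from collections import Counter


class TrialExecutorSketch:
    """Hypothetical simplification of RayTrialExecutor (names assumed)."""

    def __init__(self):
        self._staged_trials = set()

    def _count_staged_resources(self):
        # Rebuilt on every call: O(len(_staged_trials)) work per step,
        # even though the staged set rarely changes between steps.
        counter = Counter()
        for trial in self._staged_trials:
            counter[trial.placement_group_factory] += 1
        return counter
```

With 1000 staged trials and a call on every step, this scan dominates the scheduling loop.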

This PR fixes this performance bottleneck by maintaining the resource counter as persistent state instead of dynamically recreating it on every call. This is straightforward: we add/subtract the resources whenever a trial is added to or removed from the `RayTrialExecutor._staged_trials` set.
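A minimal sketch of the fix's idea, under the same assumed names as above (not the actual Ray implementation): keep a `Counter` in sync with the staged set at add/remove time, so the lookup itself becomes O(1):

```python
from collections import Counter


class TrialExecutorFixedSketch:
    """Hypothetical simplification of the fixed executor (names assumed)."""

    def __init__(self):
        self._staged_trials = set()
        self._staged_resources = Counter()  # kept in sync incrementally

    def stage_trial(self, trial):
        if trial not in self._staged_trials:
            self._staged_trials.add(trial)
            self._staged_resources[trial.placement_group_factory] += 1

    def unstage_trial(self, trial):
        if trial in self._staged_trials:
            self._staged_trials.remove(trial)
            self._staged_resources[trial.placement_group_factory] -= 1

    def _count_staged_resources(self):
        # No scan: the counter already reflects the staged set.
        return self._staged_resources
```

The trade-off is that every mutation of `_staged_trials` must go through the stage/unstage methods so the counter cannot drift out of sync with the set.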

Manual testing confirmed this improves the runtime of tune_scalability_result_throughput_cluster from ~132 seconds to ~122 seconds, bringing it back to the same level as before the refactor.

Related issue number

Closes #32077

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 2 commits January 30, 2023 16:03
Kai Fricke added 2 commits January 30, 2023 16:17
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
@krfricke
Contributor Author

Signed-off-by: Kai Fricke <[email protected]>
@krfricke krfricke merged commit 10d52f7 into ray-project:master Jan 31, 2023
@krfricke krfricke deleted the tune/cache-staged-resources branch January 31, 2023 21:16
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…r lookup (ray-project#32087)

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Development

Successfully merging this pull request may close these issues.

[tune] Possible performance regression in test_result_throughput_cluster
4 participants