
[tune] Never block for results #18391

Merged — 4 commits merged into ray-project:master on Sep 9, 2021
Conversation

@krfricke (Contributor) commented Sep 7, 2021

Why are these changes needed?

If at least one trial is running, Tune currently blocks until a result is received before continuing with the tuning loop.

This is an artifact of the legacy implementation, where Tune was aware of resource availability through its own resource management system.

However, since we changed to placement groups, which can become ready at any time, blocking for results is no longer a good option.

In practice, we encountered the following problems:

  1. Blocking for the results of one running trial prevents us from starting other trials when they become ready. Previously this was addressed with the staging grace period, but with long trial and resource startup times this becomes infeasible to tune.
  2. Blocking for results even when all trials are running prevents us from printing periodic updates (e.g. the status table). During that time it looks like the process is hanging, even though it isn't.

This PR introduces an environment variable controlling the maximum wait time, defaulting to one second.
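Conceptually, the fix amounts to passing a bounded timeout to the wait call instead of blocking indefinitely. Below is a minimal sketch of that pattern; the environment variable name, default, and helper are illustrative only, not necessarily what this PR adds:

```python
import os
import ray

# Hypothetical variable name for illustration; see the PR diff for the actual one.
MAX_WAIT_S = float(os.environ.get("TUNE_LOOP_MAX_WAIT_S", "1.0"))

def fetch_next_result(result_futures):
    # Wait at most MAX_WAIT_S seconds instead of blocking until a result arrives.
    ready, _ = ray.wait(list(result_futures), num_returns=1, timeout=MAX_WAIT_S)
    # With a timeout, `ready` may be empty; the loop can then start trials whose
    # placement groups became ready, or print a status update, before waiting again.
    return ready[0] if ready else None
```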

cc @AmeerHajAli

Related issue number

Addresses part of #18003

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@xwjiang2010 (Contributor) commented Sep 7, 2021

Nooby questions:

  1. Should we start removing the legacy code and leave just the PG code path?
  2. What is the staging grace period? Why do we handle it separately? Is it a legacy concept?
  3. At a high level, the event loop wants to make progress whenever either of the following becomes true:
  • There is a result available for a running trial, in which case we want to process it.
  • Some PG is ready for a new trial to start running, in which case we want to kick off the trial.
    Is there a way we can wait on both of them at the same time, i.e.
    wait([FutureA, FutureB]) rather than
    wait(FutureA, timeout=timeout)  # plus a bunch of tweaks to this behavior
    wait(FutureB)

Thanks!

@krfricke (Contributor, Author) commented Sep 8, 2021

Hi @xwjiang2010, those are great questions!

  1. We should! Let's do this in separate PRs though, and let's include it in Q4 planning.
  2. The original purpose was to ensure that we don't block for results when placement groups become ready, to avoid exactly what happened in the issue we observed here. We could get rid of it, but I think it's still worthwhile to have lower timeouts in the beginning (during the grace period) than later. We should definitely review this behavior, though.
  3. Yes, that's a good suggestion, and it was brought up when we originally introduced placement groups. Back then we decided against it, as it would have required (even more) refactoring of the tuning loop. But yes, generally we should wait for training futures and placement group futures at the same time and process whatever gets ready first (roughly as sketched below).
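For reference, here is a rough sketch of that combined-wait idea; this is not the actual Tune loop, and the helper name and return labels are made up for illustration:

```python
import ray

def wait_for_next_event(result_futures, pg_ready_futures, timeout=1.0):
    # `pg_ready_futures` would hold the ObjectRefs returned by
    # PlacementGroup.ready(); `result_futures` the in-flight training calls.
    all_futures = list(result_futures) + list(pg_ready_futures)
    ready, _ = ray.wait(all_futures, num_returns=1, timeout=timeout)
    if not ready:
        return "timeout", None        # no event; print status, loop again
    future = ready[0]
    if future in pg_ready_futures:
        return "pg_ready", future     # start the trial waiting on this PG
    return "result", future           # process the training result
```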

@@ -225,7 +230,8 @@ def testMultiTrialReuse(self):
     config={
         "message": tune.grid_search(
             ["First", "Second", "Third", "Fourth"]),
-        "id": -1
+        "id": -1,
+        "sleep": 1,
     },
Review comment (Contributor):

Maybe add a comment on why ActorReuseMultiTest needs the sleep to work (vs. ActorReuseTest, which does not)?

I don't have a good answer yet as to how, but do you think that when we restructure the Tune code it would also be a good time to revisit testability, especially how to emulate timing in tests in a consistent and readable way, preferably with less sleeping so that tests can run faster?

@xwjiang2010 (Contributor) commented Sep 8, 2021

> 1. We should! Let's do this in separate PRs though, and let's include it in Q4 planning.

Sounds good. It's already captured in the planning doc.

For 2 & 3, I understand the PG work introduces some tech debt. +1 to reviewing the behavior. Personally I feel we're at the point where the tuning loop should be refactored to make it more robust, efficient, and future-proof (less ad-hoc logic, easier to follow). Would like to know your thoughts.

@xwjiang2010 (Contributor)

LGTM. Some minor questions on test setup.

@richardliaw merged commit 395976c into ray-project:master on Sep 9, 2021.
@krfricke deleted the tune/pg-ready branch on September 22, 2023.