[tune] Never block for results #18391
Nooby questions:
Thanks!
Hi @xwjiang2010, those are great questions!
@@ -225,7 +230,8 @@ def testMultiTrialReuse(self):
             config={
                 "message": tune.grid_search(
                     ["First", "Second", "Third", "Fourth"]),
-                "id": -1
+                "id": -1,
+                "sleep": 1,
             },
Maybe add a comment explaining why ActorReuseMultiTest needs the sleep to work (whereas ActorReuseTest does not)?
I don't have a good answer yet as to how, but when we restructure the Tune code, do you think it would also be a good time to revisit testability? In particular, how to emulate timing in tests in a consistent and readable way, preferably with fewer sleeps so that the tests run faster.
For 2 & 3, I understand that the placement group (PG) work introduced some tech debt. +1 to reviewing the behavior. Personally, I feel we are at the point where we should refactor the tuning loop to make it more robust, efficient, and future-proof (less ad hoc logic, easier to follow). I would like to know your thoughts.
LGTM. Some minor questions on test setup.
Why are these changes needed?
If at least one trial is running, Tune currently blocks until a result is received before continuing with the tuning loop.
This is an artifact of the legacy implementation, where Tune was aware of resource availability through its own resource management system.
However, since we switched to placement groups, which can become ready at any time, blocking for results is no longer a good option.
In practice, we encountered the following problems:
This PR introduces an environment variable controlling the maximum wait time, defaulting to one second.
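To illustrate the idea (this is not the actual Tune implementation), here is a minimal sketch of a loop that waits for results with an upper bound instead of blocking indefinitely. It uses only the standard library; the variable name `EXAMPLE_RESULT_WAIT_TIME_S` and the helpers `max_wait_seconds`, `poll_results`, and `run_loop` are hypothetical and chosen for this example only.

```python
import os
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Hypothetical variable name, for illustration only; the PR text does not name it.
WAIT_ENV_VAR = "EXAMPLE_RESULT_WAIT_TIME_S"


def max_wait_seconds(default=1.0):
    """Read the maximum wait time from the environment (defaults to one second)."""
    return float(os.environ.get(WAIT_ENV_VAR, default))


def _fake_trial(duration):
    """Stand-in for a running trial that reports one result after `duration` seconds."""
    time.sleep(duration)
    return duration


def poll_results(pending, timeout_s):
    """Wait at most `timeout_s` for any result; return (done, still_pending)."""
    done, not_done = wait(pending, timeout=timeout_s, return_when=FIRST_COMPLETED)
    return done, not_done


def run_loop(trial_durations, timeout_s):
    """Drive a toy tuning loop and count the steps that returned without a result."""
    results, idle_steps = [], 0
    with ThreadPoolExecutor(max_workers=len(trial_durations)) as pool:
        pending = {pool.submit(_fake_trial, d) for d in trial_durations}
        while pending:
            done, pending = poll_results(pending, timeout_s)
            if not done:
                # The wait timed out: a real loop could use this step to start
                # trials whose resources became ready in the meantime.
                idle_steps += 1
            results.extend(f.result() for f in done)
    return sorted(results), idle_steps
```

With a bounded wait, the loop regains control at least once per timeout interval, even while a slow trial has not reported yet. For example, `run_loop([0.05, 0.3], timeout_s=max_wait_seconds())` would return both results, and a smaller timeout would yield more idle steps in which other work could be scheduled.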
cc @AmeerHajAli
Related issue number
Addresses part of #18003
Checks
I've run scripts/format.sh to lint the changes in this PR.