[tune] Add retry logic for restoring trials. #29086
Conversation
@krfricke actually do you know why we don't physically delete those checkpoint_0000x folders when Tune decides to restore from scratch? I think this is a potential bug source.
I think we should move this whole logic into the trial to avoid strong coupling between the trial runner and the trial. Basically, we want to avoid having the trial runner tell the trial how to update its internal state.
Instead, we should make the information available to the trial so that it can act on it. For this case, we would want to know which part of the trial failed.
I can think of a few ways to do that:
- We introduce `TuneRestoreError`, `TuneTrainingError`, and `TuneActorError`, and wrap `exc` in one of them before passing it to `write_error_log`.
- We pass something like an `ErrorType` enum.
I think I prefer option 1 (a rough sketch follows below), but open to suggestions.
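For illustration, a minimal sketch of what option 1 could look like. The class names come from the list above; the common base class and constructor shapes are my assumptions, not the final API:

```python
# Sketch only: wrapper exceptions the trial runner could use to tag the failure
# source before handing it to the trial. Exact signatures are assumptions.


class TuneError(Exception):
    """Hypothetical common base class for errors surfaced to the trial."""


class TuneRestoreError(TuneError):
    """Wraps an exception raised while restoring a trial from its checkpoint."""

    def __init__(self, exc: Exception):
        self.exc = exc
        super().__init__(f"Trial restore failed: {exc}")


class TuneTrainingError(TuneError):
    """Wraps an exception raised during a training step."""

    def __init__(self, exc: Exception):
        self.exc = exc
        super().__init__(f"Trial training failed: {exc}")


class TuneActorError(TuneError):
    """Wraps an exception caused by the trial's actor dying."""

    def __init__(self, exc: Exception):
        self.exc = exc
        super().__init__(f"Trial actor failed: {exc}")
```

The trial runner would then wrap the raw exception at the failure site, e.g. pass `TuneRestoreError(exc)` instead of the bare `exc`, so the trial can tell where the failure came from.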
The trial can then decide which counters to increase and how to write the error log. For instance, it would be helpful to log restore errors as `Restore failure # {}` to distinguish them from training errors in the log file.
As a first step to land this before branch cut, we can just distinguish between restore and non-restore errors, if that's easier.
As a minor refactor note, I think we should rename `write_error_log` to `handle_error` and only call it when an error is actually set (i.e. move the `if exc` check into the trial runner and remove the `Optional` annotation in the method), as sketched below.
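A rough sketch of that trial-side `handle_error`, reusing the wrapper classes from the sketch above. Attribute and counter names here are illustrative, not the actual `Trial` fields:

```python
# Illustrative only: how the trial could dispatch on the wrapped error type.


class Trial:
    def __init__(self):
        self.num_failures = 0          # counts against max_failures
        self.num_restore_failures = 0  # tracked separately

    def handle_error(self, exc: Exception):
        # The trial runner only calls this when an error is actually set,
        # so `exc` is never None here.
        if isinstance(exc, TuneRestoreError):
            self.num_restore_failures += 1
            self._log_error(f"Restore failure # {self.num_restore_failures}: {exc.exc}")
        else:
            self.num_failures += 1
            self._log_error(f"Failure # {self.num_failures}: {exc}")

    def _log_error(self, msg: str) -> None:
        # Stand-in for appending to the trial's error file.
        print(msg)
```

On the runner side the call site would then read `if exc: trial.handle_error(exc)`, so the `Optional` annotation on the method can go away.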
What do you think?
Yeah this makes sense.
Force-pushed from ea886ea to dde19af.
Ok updated. PTAL!
LGTM, minor nit
approve for docs
@@ -437,6 +440,63 @@ def test_tuner_restore_latest_available_checkpoint(
    assert result.metrics["iterations_since_restore"] == 5


@pytest.mark.parametrize("retry_num", [0, 1])
def test_retore_retry(retry_num):
Suggested change:
- def test_retore_retry(retry_num):
+ def test_retore_retry(ray_start_4_cpus, retry_num):
This should shut down Ray gracefully after the test; see the fixture sketch below.
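For context, a minimal sketch of what such a fixture typically looks like. The fixture name comes from the suggestion above; the exact body in Ray's test suite may differ:

```python
import pytest
import ray


@pytest.fixture
def ray_start_4_cpus():
    # Start a local Ray cluster with a fixed resource budget for the test...
    address_info = ray.init(num_cpus=4)
    yield address_info
    # ...and shut it down gracefully afterwards, even if the test fails.
    ray.shutdown()
```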
Signed-off-by: xwjiang2010 [email protected]
Why are these changes needed?
This is an advanced setting. Consider the following scenario: due to scheduling glitches, a restoring trial may sometimes be scheduled onto a dying node. By setting this env var to a positive number, the trial can be restored several times, and hopefully one of those attempts will not land on a dying node. This retry behavior does not increment the per-trial failure number, which is compared against `max_failures`.
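A hedged usage sketch: the environment variable name and the surrounding Tuner setup below are assumptions based on this description, not quoted from the PR.

```python
import os

# Assumption: the env var name below is the one this PR introduces; it is not
# spelled out in the description, so treat it as a placeholder.
os.environ["TUNE_RESTORE_RETRY_NUM"] = "3"  # allow up to 3 restore retries

from ray import air, tune
from ray.air import session


def trainable(config):
    for i in range(5):
        session.report({"iterations": i})


tuner = tune.Tuner(
    trainable,
    # Restore retries triggered by the env var do not count toward max_failures.
    run_config=air.RunConfig(failure_config=air.FailureConfig(max_failures=2)),
)
tuner.fit()
```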
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.