
[tune] Add retry logic for restoring trials. #29086

Merged (11 commits) on Oct 7, 2022

Conversation

@xwjiang2010 (Contributor) commented Oct 5, 2022

Signed-off-by: xwjiang2010 [email protected]

Why are these changes needed?

This is an advanced setting. Consider the following scenario: due to scheduling glitches, a restoring
trial may occasionally be scheduled onto a dying node. Setting this env var to a positive number lets the trial
be restored several times, so that one of the attempts will hopefully land on a healthy node. These restore retries
do not increment the per-trial failure count, which is compared against max_failures.
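
As an illustration only, the sketch below shows how such an env-var-driven retry budget could work. The variable name TUNE_RESTORE_RETRY_NUM and the TrialFailureTracker class are assumptions for this sketch, not the PR's actual code.

import os

# Assumed env var name for this sketch; the PR defines the real setting.
MAX_RESTORE_RETRIES = int(os.environ.get("TUNE_RESTORE_RETRY_NUM", "0"))


class TrialFailureTracker:
    """Simplified stand-in for a trial's failure bookkeeping."""

    def __init__(self, max_failures: int):
        self.max_failures = max_failures
        self.num_failures = 0          # compared against max_failures
        self.num_restore_failures = 0  # restore retries, tracked separately

    def on_restore_failure(self) -> bool:
        """Return True if the restore should be retried without penalty."""
        if self.num_restore_failures < MAX_RESTORE_RETRIES:
            self.num_restore_failures += 1
            return True  # retry; with luck the trial lands on a healthy node
        # Retry budget exhausted: count this against max_failures as usual.
        self.num_failures += 1
        return False

For example, with TUNE_RESTORE_RETRY_NUM=3 a trial that keeps failing during restore would get up to three extra restore attempts before the regular max_failures accounting takes over.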

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@xwjiang2010 (Contributor, Author)

@krfricke actually, do you know why we don't physically delete those checkpoint_0000x folders when Tune decides to restore from scratch? I think this is a potential source of bugs.

@krfricke (Contributor) left a comment

I think we should move this whole logic into the trial to avoid strong coupling between the trial runner and the trial. Basically, we want to avoid having the trial runner tell the trial how to update its internal state.

Instead, we should make the information available to the trial so that it can act on it. For this case, we would want to know which part of the trial failed.

I can think of a few ways to do that:

  1. We introduce TuneRestoreError, TuneTrainingError, and TuneActorError, and wrap exc in the appropriate one before passing it to write_error_log.
  2. We pass something like an ErrorType enum.

I think I prefer option 1, but I'm open to suggestions.

The trial can then decide which counters to increase and how to write the error log. For instance, it would be helpful to log restore errors as Restore failure # {} to distinguish them from training errors in the log file.

As a first step to land this before the branch cut, we can just distinguish between restore and non-restore errors, if that's easier.

As a minor refactor note, I think we should rename write_error_log to handle_error and only call it when an error is actually set (i.e., move the if exc check into the trial runner and remove the Optional annotation from the method).

What do you think?
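
To make this concrete, here is a rough sketch of option 1 using the error class names from the comment (TuneActorError omitted for brevity). The Trial fields and the handle_error signature below are assumptions for illustration, not the code that eventually landed.

class TuneError(Exception):
    """Assumed base class for the wrapped errors."""


class TuneRestoreError(TuneError):
    """Wraps an exception raised while restoring a trial from a checkpoint."""


class TuneTrainingError(TuneError):
    """Wraps an exception raised during a training step."""


class Trial:
    def __init__(self, max_failures: int):
        self.max_failures = max_failures
        self.num_failures = 0          # compared against max_failures
        self.num_restore_failures = 0  # tracked separately, not penalized

    def handle_error(self, exc: Exception) -> None:
        # Only called when an error is actually set, per the refactor note above.
        if isinstance(exc, TuneRestoreError):
            self.num_restore_failures += 1
            # Restore errors are logged distinctly so they stand out in the log file.
            self._log(f"Restore failure # {self.num_restore_failures}: {exc}")
        else:
            self.num_failures += 1
            self._log(f"Failure # {self.num_failures}: {exc}")

    def _log(self, msg: str) -> None:
        print(msg)  # stand-in for writing to the trial's error log

The idea is that the trial runner only wraps and forwards the exception, while the trial itself decides which counter to bump and how to log it.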

doc/source/tune/api_docs/env.rst (review thread: outdated, resolved)
python/ray/tune/execution/ray_trial_executor.py (review thread: outdated, resolved)
@xwjiang2010 (Contributor, Author)

Yeah, this makes sense.
I was prioritizing making the branch cut, but let me see how extensive the change would be.

@xwjiang2010 (Contributor, Author)

Ok updated. PTAL!

@krfricke (Contributor) left a comment

LGTM, minor nit

python/ray/tune/experiment/trial.py (review thread: outdated, resolved)
@richardliaw (Contributor) left a comment

approve for docs

@@ -437,6 +440,63 @@ def test_tuner_restore_latest_available_checkpoint(
assert result.metrics["iterations_since_restore"] == 5


@pytest.mark.parametrize("retry_num", [0, 1])
def test_retore_retry(retry_num):
A reviewer (Contributor) commented:

Suggested change:
- def test_retore_retry(retry_num):
+ def test_retore_retry(ray_start_4_cpus, retry_num):

This should shut down Ray gracefully (see the fixture sketch below).
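
For context, fixtures like this typically wrap ray.init and ray.shutdown. The sketch below is a generic version under that assumption, not necessarily the exact ray_start_4_cpus fixture defined in Ray's test suite.

import pytest

import ray


@pytest.fixture
def ray_start_4_cpus():
    # Start a local Ray instance with a fixed CPU budget for the test.
    address_info = ray.init(num_cpus=4)
    yield address_info
    # Shut Ray down after the test so state does not leak between tests.
    ray.shutdown()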

@krfricke krfricke merged commit f1882f9 into ray-project:master Oct 7, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
@xwjiang2010 xwjiang2010 deleted the retry_restoring branch July 26, 2023 19:53
Labels: None yet
Projects: None yet
4 participants