
[tune] Add retry logic for restoring trials. #29086

Merged (11 commits) on Oct 7, 2022

Conversation

@xwjiang2010 (Contributor) commented Oct 5, 2022

Signed-off-by: xwjiang2010 [email protected]

Why are these changes needed?

This is an advanced setting. Consider the following scenario: due to scheduling glitches, a restoring
trial may occasionally be scheduled onto a dying node. Setting this env var to a positive number lets the trial
be restored several times, so that one of the attempts will hopefully land on a healthy node. These restore retries
do not increment the per-trial failure count, which is compared against max_failures.
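
As an illustration only, the sketch below shows how such an env-var-driven retry budget could work. The variable name TUNE_RESTORE_RETRY_NUM and the TrialFailureTracker class are assumptions for this sketch, not the PR's actual code.

import os

# Assumed env var name for this sketch; the PR defines the real setting.
MAX_RESTORE_RETRIES = int(os.environ.get("TUNE_RESTORE_RETRY_NUM", "0"))


class TrialFailureTracker:
    """Simplified stand-in for a trial's failure bookkeeping."""

    def __init__(self, max_failures: int):
        self.max_failures = max_failures
        self.num_failures = 0          # compared against max_failures
        self.num_restore_failures = 0  # restore retries, tracked separately

    def on_restore_failure(self) -> bool:
        """Return True if the restore should be retried without penalty."""
        if self.num_restore_failures < MAX_RESTORE_RETRIES:
            self.num_restore_failures += 1
            return True  # retry; with luck the trial lands on a healthy node
        # Retry budget exhausted: count this against max_failures as usual.
        self.num_failures += 1
        return False

For example, with TUNE_RESTORE_RETRY_NUM=3 a trial that keeps failing during restore would get up to three extra restore attempts before the regular max_failures accounting takes over.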

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@xwjiang2010 (Contributor, Author)

@krfricke actually, do you know why we don't physically delete those checkpoint_0000x folders when Tune decides to restore from scratch? I think this is a potential source of bugs.

@krfricke (Contributor) left a comment

I think we should move this whole logic into the trial to avoid strong coupling between the trial runner and the trial. Basically, we want to avoid having the trial runner tell the trial how to update its internal state.

Instead, we should make the information available to the trial so that it can act on it. For this case, we would want to know which part of the trial failed.

I can think of a few ways to do that:

  1. We introduce TuneRestoreError, TuneTrainingError, and TuneActorError, and wrap exc in the appropriate one before passing it to write_error_log.
  2. We pass something like an ErrorType enum.

I think I prefer option 1, but I'm open to suggestions.

The trial can then decide which counters to increase and how to write the error log. For instance, it would be helpful to log restore errors as Restore failure # {} to distinguish them from training errors in the log file.

As a first step to land this before the branch cut, we can just distinguish between restore and non-restore errors, if that's easier.

As a minor refactor note, I think we should rename write_error_log to handle_error and only call it when an error is actually set (i.e., move the if exc check into the trial runner and remove the Optional annotation from the method).

What do you think?
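
To make this concrete, here is a rough sketch of option 1 using the error class names from the comment (TuneActorError omitted for brevity). The Trial fields and the handle_error signature below are assumptions for illustration, not the code that eventually landed.

class TuneError(Exception):
    """Assumed base class for the wrapped errors."""


class TuneRestoreError(TuneError):
    """Wraps an exception raised while restoring a trial from a checkpoint."""


class TuneTrainingError(TuneError):
    """Wraps an exception raised during a training step."""


class Trial:
    def __init__(self, max_failures: int):
        self.max_failures = max_failures
        self.num_failures = 0          # compared against max_failures
        self.num_restore_failures = 0  # tracked separately, not penalized

    def handle_error(self, exc: Exception) -> None:
        # Only called when an error is actually set, per the refactor note above.
        if isinstance(exc, TuneRestoreError):
            self.num_restore_failures += 1
            # Restore errors are logged distinctly so they stand out in the log file.
            self._log(f"Restore failure # {self.num_restore_failures}: {exc}")
        else:
            self.num_failures += 1
            self._log(f"Failure # {self.num_failures}: {exc}")

    def _log(self, msg: str) -> None:
        print(msg)  # stand-in for writing to the trial's error log

The idea is that the trial runner only wraps and forwards the exception, while the trial itself decides which counter to bump and how to log it.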

doc/source/tune/api_docs/env.rst (review thread: outdated, resolved)
python/ray/tune/execution/ray_trial_executor.py (review thread: outdated, resolved)
@xwjiang2010 (Contributor, Author)

Yeah, this makes sense.
I was prioritizing making the branch cut, but let me see how extensive the change would be.

@xwjiang2010 (Contributor, Author)

Ok updated. PTAL!

@krfricke (Contributor) left a comment

LGTM, minor nit

python/ray/tune/experiment/trial.py (review thread: outdated, resolved)
@richardliaw (Contributor) left a comment

approve for docs

@@ -437,6 +440,63 @@ def test_tuner_restore_latest_available_checkpoint(
assert result.metrics["iterations_since_restore"] == 5


@pytest.mark.parametrize("retry_num", [0, 1])
def test_retore_retry(retry_num):
A reviewer (Contributor) commented:

Suggested change:
- def test_retore_retry(retry_num):
+ def test_retore_retry(ray_start_4_cpus, retry_num):

This should shut down Ray gracefully (see the fixture sketch below).
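
For context, fixtures like this typically wrap ray.init and ray.shutdown. The sketch below is a generic version under that assumption, not necessarily the exact ray_start_4_cpus fixture defined in Ray's test suite.

import pytest

import ray


@pytest.fixture
def ray_start_4_cpus():
    # Start a local Ray instance with a fixed CPU budget for the test.
    address_info = ray.init(num_cpus=4)
    yield address_info
    # Shut Ray down after the test so state does not leak between tests.
    ray.shutdown()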

@krfricke krfricke merged commit f1882f9 into ray-project:master Oct 7, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
@xwjiang2010 xwjiang2010 deleted the retry_restoring branch July 26, 2023 19:53
Labels: None yet
Projects: None yet
4 participants