[tune] Optional forcible trial cleanup, return default autofilled metrics even if Trainable doesn't report at least once #19144

Yard1 · 2021-10-06T16:56:53Z

Why are these changes needed?

This PR adds two features:

- An option to forcibly terminate trials scheduled for cleanup after a grace period of 60 seconds. It is disabled by default, which preserves the old behavior, and is controlled by a new environment variable, TUNE_FORCE_TRIAL_CLEANUP.
- A guarantee that auto-filled metrics are returned even if the Trainable never reports a result (for example, because of an error or early termination).
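
A minimal sketch of how the opt-in flag might be set and read, assuming it is parsed as a simple boolean. The helper `force_trial_cleanup_enabled` is hypothetical and only illustrates the idea; the parsing in Ray Tune itself, and the final name and semantics of the variable in the merged code, may differ.

```python
import os
from typing import Mapping, Optional


def force_trial_cleanup_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: interpret TUNE_FORCE_TRIAL_CLEANUP as a boolean.

    An unset variable means "off", matching the default described above.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("TUNE_FORCE_TRIAL_CLEANUP", "0")
    return raw.strip().lower() in ("1", "true", "yes")


# Opt in from the driver script, before Tune starts scheduling trials.
os.environ["TUNE_FORCE_TRIAL_CLEANUP"] = "1"
print(force_trial_cleanup_enabled())
```

Setting the variable in the shell (`TUNE_FORCE_TRIAL_CLEANUP=1 python train.py`) would have the same effect under this assumption.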

Related issue number

Closes #18745

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

python/ray/tune/trial.py

python/ray/tune/trial_runner.py

python/ray/tune/ray_trial_executor.py

xwjiang2010 · 2021-10-06T19:32:09Z

Thanks for picking this up. Can you add a screenshot of what the autofilled metrics would look like?
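
As a rough illustration of the idea (not a capture of real output), the sketch below pre-fills a result dict with values that are known without any report from the Trainable. The function names and the exact set of keys are assumptions made for this example; the PR's commit history mentions returning "a smaller subset of metrics", and that subset is not listed in this conversation.

```python
import os
import socket
import time
from datetime import datetime


def default_result(trial_id: str, experiment_id: str) -> dict:
    """Hypothetical: metrics that can be filled in before any report arrives."""
    now = time.time()
    return {
        "trial_id": trial_id,
        "experiment_id": experiment_id,
        "date": datetime.fromtimestamp(now).strftime("%Y-%m-%d_%H-%M-%S"),
        "timestamp": int(now),
        "pid": os.getpid(),
        "hostname": socket.gethostname(),
    }


def last_result_or_default(last_result: dict, trial_id: str, experiment_id: str) -> dict:
    """Return the reported result if there is one, else the pre-filled defaults."""
    if last_result:
        return last_result
    return default_result(trial_id, experiment_id)


# A trial that errored before its first report still yields a non-empty result.
result = last_result_or_default({}, "trial_0001", "exp_demo")
print(sorted(result))
```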

krfricke

Looks good so far!

doc/source/tune/user-guide.rst

python/ray/tune/ray_trial_executor.py

xwjiang2010 · 2021-10-07T17:32:33Z

python/ray/tune/ray_trial_executor.py

@@ -123,15 +130,27 @@ def cleanup(self, partial: bool = True):
        If partial=False, all futures are expected to return. If a future
        does not return within the timeout period, the cleanup terminates.
        """
+        # At this point, self._cleanup_map holds the last references


@krfricke Do you know why we call return_pg before trainable.stop()?
If the placement group is already gone, does it still make sense to stop the actor?

Also, if we can rely on GC to always clean up the remote actor, why do we bother with self._cleanup_map at all?

Yard1 · 2021-10-07 (edited)

> Also if we can rely on GC to always clean up remote actor, why do we bother with self._cleanup_map anyways?

We want a way to keep a reference to the actor during graceful termination so that its cleanup method can finish.
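
The reasoning in this thread can be sketched in plain Python, without Ray. The `CleanupTracker` class below is a hypothetical stand-in for the executor's cleanup map: it holds an entry per stopping trial so the actor stays reachable while it shuts down, drops the entry once the stop call has finished, and, only when forcing is enabled, kills the actor after the grace period. The callbacks `is_done` and `kill` are assumptions standing in for checking the stop future and terminating the actor.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class CleanupTracker:
    """Hypothetical stand-in for the executor's cleanup map (not Ray's code)."""

    def __init__(self, grace_period_s: float = 60.0, force: bool = False):
        self.grace_period_s = grace_period_s
        self.force = force
        # trial_id -> (deadline, is_done, kill). Holding the entry keeps the
        # actor reachable until its stop() call has finished.
        self._pending: Dict[
            str, Tuple[float, Callable[[], bool], Callable[[], None]]
        ] = {}

    def schedule(self, trial_id: str, is_done: Callable[[], bool],
                 kill: Callable[[], None], now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._pending[trial_id] = (now + self.grace_period_s, is_done, kill)

    def cleanup(self, now: Optional[float] = None) -> List[str]:
        """Drop finished entries; force-kill expired ones if forcing is on."""
        now = time.monotonic() if now is None else now
        forced = []
        for trial_id, (deadline, is_done, kill) in list(self._pending.items()):
            if is_done():
                del self._pending[trial_id]  # graceful stop completed
            elif self.force and now >= deadline:
                kill()  # grace period expired
                del self._pending[trial_id]
                forced.append(trial_id)
        return forced

    def pending(self) -> List[str]:
        return list(self._pending)


killed = []
tracker = CleanupTracker(grace_period_s=60.0, force=True)
tracker.schedule("trial_a", is_done=lambda: False,
                 kill=lambda: killed.append("trial_a"), now=0.0)
print(tracker.cleanup(now=10.0))  # [] (still inside the grace period)
print(tracker.cleanup(now=61.0))  # ['trial_a'] (forcibly terminated)
```

With `force=False`, a trial that never finishes stopping simply stays in the map, which corresponds to the old behavior described in the PR summary.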

xwjiang2010 · 2021-10-07T17:33:04Z

LGTM. Just some questions for my own understanding.

krfricke

Awesome, thanks!

Yard1 added 11 commits September 21, 2021 20:37

Force trial cleanup

7ae7664

Add on_experiment_stop callback

54cffcb

Pre-fill results dict

0bf7c7a

Trial id

89ad0be

Merge branch 'master' into force_trial_cleanup

57a3f72

Merge branch 'master' into force_trial_cleanup

510f1b2

Force trial cleanup

6e80815

Remove callback

bc0bd7a

Return some metrics even if trial doesn't report

b1a144e

Improve test

94f1388

Improve test

fb48883

Yard1 requested review from richardliaw, krfricke and xwjiang2010 October 6, 2021 16:56

Yard1 assigned krfricke and xwjiang2010 Oct 6, 2021

Yard1 added 2 commits October 6, 2021 16:59

Clean default result future

7a3abbd

Fix

8ce1bb8

xwjiang2010 reviewed Oct 6, 2021

python/ray/tune/trial.py (outdated, resolved)

xwjiang2010 reviewed Oct 6, 2021

python/ray/tune/trial_runner.py (outdated, resolved)

xwjiang2010 reviewed Oct 6, 2021

python/ray/tune/ray_trial_executor.py (resolved)

Yard1 added 4 commits October 6, 2021 23:07

Fixes

1e1cbac

Return a smaller subset of metrics

dd49bbf

Revert change

ca41e6a

Fix

4bcad7a

Yard1 requested a review from xwjiang2010 October 6, 2021 23:48

Yard1 added 3 commits October 6, 2021 23:51

Improve docstring

6f0e7c0

Test fixes

5866f1e

Debug

907fde9

Yard1 added 4 commits October 7, 2021 12:04

Set location

2990720

Merge branch 'master' into force_trial_cleanup

3254de2

Skip test for local mode

7520066

Merge branch 'master' into force_trial_cleanup

0625086

krfricke reviewed Oct 7, 2021

doc/source/tune/user-guide.rst (outdated, resolved)

python/ray/tune/ray_trial_executor.py (resolved)

Yard1 added 3 commits October 7, 2021 15:14

Fix progress reporter test

c74dcc5

Implement feedback

21a13d2

Fix

a71650f

xwjiang2010 reviewed Oct 7, 2021

xwjiang2010 approved these changes Oct 7, 2021

krfricke approved these changes Oct 8, 2021

krfricke merged commit c7d6f83 into ray-project:master Oct 8, 2021

Yard1 deleted the force_trial_cleanup branch October 8, 2021 19:48
