
[tune] [wip] Application-level Fault Tolerance #3165

Closed · richardliaw wants to merge 19 commits

Conversation

@richardliaw (Contributor)

Adds Trial metadata to disk-based trial checkpoints, so that a trial can be recovered exactly as if nothing had happened (a rough sketch of the idea follows the TODO list below).

The only oddity is logging, since we don't roll back the result logging; perhaps we can fix that later.

TODO:

  • Tests
  • Clean-up
  • Expose functionality via some API like
    tune.resume(LOGDIR)?
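
A minimal sketch of the idea, with hypothetical helper and attribute names (save_trial_metadata, load_trial_metadata); this is not the actual tune implementation:

```python
import os
import pickle


def save_trial_metadata(trial, checkpoint_dir):
    # Serialize the trial's bookkeeping state next to the on-disk
    # checkpoint so the trial can later be reconstructed exactly.
    metadata = {
        "config": trial.config,
        "status": trial.status,
        "last_result": trial.last_result,
    }
    with open(os.path.join(checkpoint_dir, "trial_metadata.pkl"), "wb") as f:
        pickle.dump(metadata, f)


def load_trial_metadata(checkpoint_dir):
    # Read the metadata back; a runner could use it to recreate the Trial
    # and then restore the trainable from the checkpoint as usual.
    with open(os.path.join(checkpoint_dir, "trial_metadata.pkl"), "rb") as f:
        return pickle.load(f)
```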

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8974/

@richardliaw changed the title from [wip] Application-level Fault Tolerance to [tune] [wip] Application-level Fault Tolerance on Oct 31, 2018
@ericl (Contributor) left a comment

Overall I think this change is too invasive. Why modify the trial execution state at all?

It would also be good to have a design doc. For example, what needs to be in the state of the scheduler checkpoint? I can think of:

  • Hash of the experiment configuration.
  • List of all generated trials and their experiment tags.
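
A rough sketch of what such a scheduler checkpoint could hold, using hypothetical names (SchedulerCheckpoint, hash_experiment_config) that are not part of tune:

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict


def hash_experiment_config(config: dict) -> str:
    # Stable hash of the JSON-serialized experiment configuration, used to
    # detect a mismatched spec when resuming.
    return hashlib.sha256(
        json.dumps(config, sort_keys=True).encode("utf-8")
    ).hexdigest()


@dataclass
class SchedulerCheckpoint:
    # Hash of the experiment configuration.
    config_hash: str
    # Experiment tag of every trial generated so far, keyed by trial ID.
    trial_tags: Dict[str, str] = field(default_factory=dict)
```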

logger.debug("progress.csv found; appending to this file.")
except FileNotFoundError:
logger.debug("progress.csv not found.")
labels = None
Contributor

You don't need any of this, right? Just open the file for append.
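
A minimal sketch of that suggestion (open_progress_csv is a hypothetical helper, not the tune logger API): open in append mode and only write the CSV header when the file is new.

```python
import csv
import os


def open_progress_csv(logdir, fieldnames):
    # Open progress.csv in append mode; write the header only if the file
    # does not exist yet (or is empty), so a resumed run keeps appending.
    path = os.path.join(logdir, "progress.csv")
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    f = open(path, "a", newline="")
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    if write_header:
        writer.writeheader()
    return f, writer
```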

# Resuming a trial after a pause should not call trial.train.remote()
# again, so no new remote object ID is generated. We use self._paused
# to store paused trials here.
self._paused = {}
Contributor

Do we need to change this? We can just restore these from the last checkpoint, right?
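
A rough sketch of the alternative being suggested, with hypothetical executor method names (restore, continue_training) standing in for whatever the real API is:

```python
def resume_paused_trial(trial_executor, trial):
    # Instead of carrying the in-flight result across a pause, restore the
    # trial from its most recent on-disk checkpoint and let it continue
    # training from there; no paused-result bookkeeping is needed.
    trial_executor.restore(trial, checkpoint=trial.last_checkpoint)
    trial_executor.continue_training(trial)
```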

logger.info("Restoring result from in-flight trial.")
# If Trial was in flight when paused, we restore result.
self._running[ray.put(trial.next_result)] = trial
trial.next_result = None
Contributor

I don't think we need to make changes to the restore of the trial.

Contributor Author

Some of these things are orthogonal to this change; I'll separate them out into a subsequent PR.

@richardliaw (Contributor Author)

I don't know what you're referring to by "trial execution state"; if you're talking about changing the trial executor logic, sure, those changes are somewhat separate and can be done in a subsequent PR.

The current trial checkpointing works without scheduler or search algorithm checkpointing; scheduler and search algorithm checkpoints can be added separately from this PR.

@richardliaw (Contributor Author)

Closing this PR and will reopen when revised (soon).

@richardliaw closed this Nov 5, 2018
@richardliaw mentioned this pull request Nov 12, 2018