
[tune] Cluster Fault Tolerance #3309

Merged 213 commits on Dec 29, 2018

Conversation

@richardliaw (Contributor) commented Nov 12, 2018

What do these changes do?

A redo of #3165 with extraneous cleanup changes removed.

This currently does not use the same restore code path as #3238, but that can change later once component fault tolerance is implemented (i.e., this does not notify components when trials go RUNNING -> PENDING).

This adds the following functionality:

  • pickleable Trials and TrialRunner
  • checkpointing/restoring functionality for TrialRunner
  • user endpoints for experiment checkpointing

Example:

import time
import ray
from ray import tune

ray.init()

kwargs = dict(
    run="__fake",
    stop=dict(training_iteration=5),
    checkpoint_freq=1,
    max_failures=1)

# This will save the experiment state to disk on each step
tune.run_experiments(
    dict(experiment1=kwargs),
    raise_on_failed_trial=False)
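
For reference, below is a minimal sketch of how the checkpointed experiment might be resumed after a cluster failure. It assumes the experiment-level checkpoint written above can be restored by passing a resume flag to tune.run_experiments; the exact name and default of that user endpoint may differ from what this PR finally exposes.

import ray
from ray import tune

ray.init()

kwargs = dict(
    run="__fake",
    stop=dict(training_iteration=5),
    checkpoint_freq=1,
    max_failures=1)

# Assumption: resume=True restores the experiment state saved to disk by the
# previous run instead of starting the trials over from scratch.
tune.run_experiments(
    dict(experiment1=kwargs),
    raise_on_failed_trial=False,
    resume=True)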

TODO:

  • User endpoints implemented.
  • Node FT: add a test for scheduler notification when nodes die and trials go RUNNING -> PENDING.

NOTE: this should be a lot easier to review after #3414 is merged.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10432/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10433/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10435/

@ericl (Contributor) left a comment

Works great now! I also patched rllib's train.py to only resume on --resume.

It might be a bit aggressive to enable this by default, but we can always default it to false later, and I think this will help a lot for visibility.

except Exception:
    logger.exception("Error checkpointing trial metadata.")

def get_checkpoints(self):

@old-bear any comments on the changes to the trial_executor interface?
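
For context, here is a rough sketch of the kind of checkpoint bookkeeping the snippet above suggests; the class name, attributes, and method bodies below are illustrative assumptions, not the actual trial_executor interface added in this PR.

import logging

logger = logging.getLogger(__name__)


class ExampleTrialExecutor(object):
    """Illustrative stand-in for a trial executor (hypothetical names)."""

    def __init__(self):
        # Maps trial id -> latest serialized trial metadata.
        self._cached_trial_state = {}

    def try_checkpoint_metadata(self, trial):
        # Best-effort metadata checkpoint; failures are logged, not raised.
        try:
            self._cached_trial_state[trial.trial_id] = trial.__getstate__()
        except Exception:
            logger.exception("Error checkpointing trial metadata.")

    def get_checkpoints(self):
        # Return a copy so callers cannot mutate the executor's cache.
        return self._cached_trial_state.copy()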

@ericl (Contributor) commented Dec 27, 2018

Also nit: "This will ignore any new changes to specification" isn't grammatically correct.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10452/

ericl and others added 2 commits December 27, 2018 20:58
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10456/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10471/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10473/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10476/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10479/

@richardliaw (Contributor, Author)

jenkins retest this please

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10481/

@richardliaw (Contributor, Author)

jenkins retest this please

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10487/

@richardliaw (Contributor, Author)

Merging this since:

  • Travis tests pass.
  • On Jenkins, tests hang on SGD (which does not appear to be related to this PR), but the rest of the tests pass, and the only tests after the SGD portion are either Ray multi-node or PyTorch, so this should be unrelated.

@ericl, thanks for the multiple rounds of extensive review!

@hartikainen (Contributor)

Nice, this helps a lot! Thanks guys!
