
[tune] Cluster Fault Tolerance #3309

Merged 213 commits on Dec 29, 2018

Conversation

@richardliaw (Contributor) commented Nov 12, 2018

What do these changes do?

A redo of #3165 with extraneous cleanup changes removed.

This currently does not use the same restore code path as #3238, but that can change later once component fault tolerance is implemented (i.e., this does not notify components when trials go RUNNING -> PENDING).

This adds the following functionality:

  • pickleable Trials and TrialRunner
  • checkpointing/restoring functionality for TrialRunner
  • user endpoints for experiment checkpointing

Example:

import time
import ray
from ray import tune

ray.init()

kwargs = dict(
    run="__fake",
    stop=dict(training_iteration=5),
    checkpoint_freq=1,
    max_failures=1)

# This will save the experiment state to disk on each step
tune.run_experiments(
    dict(experiment1=kwargs),
    raise_on_failed_trial=False)
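
For reference, below is a minimal sketch of how the checkpointed experiment might be resumed after a cluster failure. It assumes the experiment-level checkpoint written above can be restored by passing a resume flag to tune.run_experiments; the exact name and default of that user endpoint may differ from what this PR finally exposes.

import ray
from ray import tune

ray.init()

kwargs = dict(
    run="__fake",
    stop=dict(training_iteration=5),
    checkpoint_freq=1,
    max_failures=1)

# Assumption: resume=True restores the experiment state saved to disk by the
# previous run instead of starting the trials over from scratch.
tune.run_experiments(
    dict(experiment1=kwargs),
    raise_on_failed_trial=False,
    resume=True)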

TODO:

  • User endpoints implemented.
  • Node FT: add a test for scheduler notification when nodes die and trials go RUNNING -> PENDING.

NOTE: this should be a lot easier to review after #3414 is merged.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10432/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10433/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10435/

@ericl (Contributor) left a comment

Works great now! I also patched rllib's train.py to only resume on --resume.

It might be a bit aggressive to enable this by default, but we can always default it to false later, and I think this will help a lot for visibility.

except Exception:
    logger.exception("Error checkpointing trial metadata.")

def get_checkpoints(self):

@old-bear any comments on the changes to the trial_executor interface?
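
For context, here is a rough sketch of the kind of checkpoint bookkeeping the snippet above suggests; the class name, attributes, and method bodies below are illustrative assumptions, not the actual trial_executor interface added in this PR.

import logging

logger = logging.getLogger(__name__)


class ExampleTrialExecutor(object):
    """Illustrative stand-in for a trial executor (hypothetical names)."""

    def __init__(self):
        # Maps trial id -> latest serialized trial metadata.
        self._cached_trial_state = {}

    def try_checkpoint_metadata(self, trial):
        # Best-effort metadata checkpoint; failures are logged, not raised.
        try:
            self._cached_trial_state[trial.trial_id] = trial.__getstate__()
        except Exception:
            logger.exception("Error checkpointing trial metadata.")

    def get_checkpoints(self):
        # Return a copy so callers cannot mutate the executor's cache.
        return self._cached_trial_state.copy()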

@ericl (Contributor) commented Dec 27, 2018

Also nit: "This will ignore any new changes to specification" isn't grammatically correct.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10452/

ericl and others added 2 commits December 27, 2018 20:58
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10456/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10471/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10473/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10476/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10479/

@richardliaw (Contributor, Author)

jenkins retest this please

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10481/

@richardliaw (Contributor, Author)

jenkins retest this please

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10487/

@richardliaw (Contributor, Author)

Merging this since:

  • Travis tests pass.
  • On Jenkins, tests hang on SGD (which does not appear to be related to this PR), but the rest of the tests pass, and the only tests after the SGD portion are either Ray multi-node or PyTorch, so this should be unrelated.

@ericl, thanks for the multiple rounds of extensive review!

@hartikainen (Contributor)

Nice, this helps a lot! Thanks guys!
