[tune] Cluster Fault Tolerance #3309
Test FAILed.
Works great now! I also patched rllib's train.py to only resume on --resume.
It might be a bit aggressive to enable this by default, but we can always default it to false later, and I think this will help a lot for visibility.
        except Exception:
            logger.exception("Error checkpointing trial metadata.")

    def get_checkpoints(self):
@old-bear any comments on the changes to the trial_executor interface?
Also nit: "This will ignore any new changes to specification" isn't grammatically correct.
Co-Authored-By: richardliaw
Test PASSed.
Test FAILed.
jenkins retest this please
Test FAILed.
jenkins retest this please
Test FAILed.
Merging this since:
@ericl, thanks for the multiple rounds of extensive review!
Nice, this helps a lot! Thanks guys!
What do these changes do?
A redo of #3165 with extraneous cleanup changes removed.
This does not currently use the same restore code path as #3238, but that can change later once component fault tolerance (FT) is implemented. In particular, this does not notify components when trials go RUNNING -> PENDING.
This adds the following functionality:
Example:
TODO:
NOTE: this should be a lot easier to review after #3414 is merged.
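For readers skimming the description, here is a minimal sketch of the general idea discussed above: periodically checkpoint trial metadata, log (rather than raise) checkpointing errors, and on restore requeue any trial that was RUNNING as PENDING. This is not the code in this PR. The function names `save_metadata` and `restore_metadata`, the JSON file layout, and the dict-based trial representation are all assumptions made for illustration.

```python
import json
import logging
import os
import tempfile

logger = logging.getLogger(__name__)

# Hypothetical status constants, chosen for this sketch only.
PENDING, RUNNING, TERMINATED = "PENDING", "RUNNING", "TERMINATED"


def save_metadata(trials, path):
    """Write trial metadata atomically; log and continue on failure."""
    try:
        directory = os.path.dirname(os.path.abspath(path))
        # Write to a temp file in the same directory, then rename over the
        # target, so a crash mid-write never leaves a truncated checkpoint.
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(trials, f)
        os.replace(tmp_path, path)
        return True
    except Exception:
        # A failed checkpoint should not take down the experiment.
        logger.exception("Error checkpointing trial metadata.")
        return False


def restore_metadata(path):
    """Reload trials; anything that was RUNNING is requeued as PENDING."""
    with open(path) as f:
        trials = json.load(f)
    for trial in trials:
        if trial["status"] == RUNNING:
            # The process that ran this trial is gone, so reschedule it.
            trial["status"] = PENDING
    return trials
```

In this sketch, resuming is opt-in: a caller would invoke `restore_metadata` only when the user explicitly asks to resume, and otherwise start from a fresh trial list.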