-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tune] [wip] Application-level Fault Tolerance #3165
Conversation
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I think this change is too invasive. Why modify the trial execution state at all?
It would also be good to have a design doc. For example, what needs to be in the state of the scheduler checkpoint? I can think of
- Hash of the experiment configuration.
- List of all generated trials and their experiment tag.
logger.debug("progress.csv found; appending to this file.") | ||
except FileNotFoundError: | ||
logger.debug("progress.csv not found.") | ||
labels = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need any of this right? Just open the file for append.
# Since trial resume after paused should not run | ||
# trial.train.remote(), thus no more new remote object id generated. | ||
# We use self._paused to store paused trials here. | ||
self._paused = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to change this? We can just restore these from last checkpoint right?
logger.info("Restoring result from in-flight trial.") | ||
# If Trial was in flight when paused, we restore result. | ||
self._running[ray.put(trial.next_result)] = trial | ||
trial.next_result = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to do make changes to restore of trial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some of these things are orthogonal to this change; I'll separate it into a subsequent PR
I don't know what you're referring to by "trial execution state"; if you're talking about changing the trial executor logic, sure, those changes are sort of separate and can be done in a subsequent PR The current trial checkpointing can work without scheduler and search algorithm checkpointing. Scheduler checkpoint and search algorithm checkpoints can be done separately from this PR. |
Closing this PR and will reopen when revised (soon). |
Adds Trial meta-data to disk-based trial checkpoints. This allows trials
to be recovered exactly as if nothing has happened.
The only oddity is logging since we don't do any roll-back of the result
logging, but perhaps we can fix later.
TODO:
tune.resume(LOGDIR)
?