
[tune] [wip] Application-level Fault Tolerance #3165

Closed · richardliaw wants to merge 19 commits

Conversation

@richardliaw (Contributor)

Adds Trial metadata to disk-based trial checkpoints, so that a trial can be recovered exactly as if nothing had happened (a rough sketch of the idea follows the TODO list below).

The only oddity is logging, since we don't roll back the result logging; perhaps we can fix that later.

TODO:

  • Tests
  • Clean-up
  • Expose functionality via some API like
    tune.resume(LOGDIR)?
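
A minimal sketch of the idea, with hypothetical helper and attribute names (save_trial_metadata, load_trial_metadata); this is not the actual tune implementation:

```python
import os
import pickle


def save_trial_metadata(trial, checkpoint_dir):
    # Serialize the trial's bookkeeping state next to the on-disk
    # checkpoint so the trial can later be reconstructed exactly.
    metadata = {
        "config": trial.config,
        "status": trial.status,
        "last_result": trial.last_result,
    }
    with open(os.path.join(checkpoint_dir, "trial_metadata.pkl"), "wb") as f:
        pickle.dump(metadata, f)


def load_trial_metadata(checkpoint_dir):
    # Read the metadata back; a runner could use it to recreate the Trial
    # and then restore the trainable from the checkpoint as usual.
    with open(os.path.join(checkpoint_dir, "trial_metadata.pkl"), "rb") as f:
        return pickle.load(f)
```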

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8974/

@richardliaw changed the title from [wip] Application-level Fault Tolerance to [tune] [wip] Application-level Fault Tolerance on Oct 31, 2018
@ericl (Contributor) left a comment

Overall I think this change is too invasive. Why modify the trial execution state at all?

It would also be good to have a design doc. For example, what needs to be in the state of the scheduler checkpoint? I can think of:

  • Hash of the experiment configuration.
  • List of all generated trials and their experiment tags.
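
A rough sketch of what such a scheduler checkpoint could hold, using hypothetical names (SchedulerCheckpoint, hash_experiment_config) that are not part of tune:

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict


def hash_experiment_config(config: dict) -> str:
    # Stable hash of the JSON-serialized experiment configuration, used to
    # detect a mismatched spec when resuming.
    return hashlib.sha256(
        json.dumps(config, sort_keys=True).encode("utf-8")
    ).hexdigest()


@dataclass
class SchedulerCheckpoint:
    # Hash of the experiment configuration.
    config_hash: str
    # Experiment tag of every trial generated so far, keyed by trial ID.
    trial_tags: Dict[str, str] = field(default_factory=dict)
```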

logger.debug("progress.csv found; appending to this file.")
except FileNotFoundError:
logger.debug("progress.csv not found.")
labels = None
Contributor

You don't need any of this, right? Just open the file for append.
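
A minimal sketch of that suggestion (open_progress_csv is a hypothetical helper, not the tune logger API): open in append mode and only write the CSV header when the file is new.

```python
import csv
import os


def open_progress_csv(logdir, fieldnames):
    # Open progress.csv in append mode; write the header only if the file
    # does not exist yet (or is empty), so a resumed run keeps appending.
    path = os.path.join(logdir, "progress.csv")
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    f = open(path, "a", newline="")
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    if write_header:
        writer.writeheader()
    return f, writer
```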

# Resuming a trial after a pause should not call trial.train.remote()
# again, so no new remote object ID is generated. We use self._paused
# to store paused trials here.
self._paused = {}
Contributor

Do we need to change this? We can just restore these from the last checkpoint, right?
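
A rough sketch of the alternative being suggested, with hypothetical executor method names (restore, continue_training) standing in for whatever the real API is:

```python
def resume_paused_trial(trial_executor, trial):
    # Instead of carrying the in-flight result across a pause, restore the
    # trial from its most recent on-disk checkpoint and let it continue
    # training from there; no paused-result bookkeeping is needed.
    trial_executor.restore(trial, checkpoint=trial.last_checkpoint)
    trial_executor.continue_training(trial)
```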

logger.info("Restoring result from in-flight trial.")
# If Trial was in flight when paused, we restore result.
self._running[ray.put(trial.next_result)] = trial
trial.next_result = None
Contributor

I don't think we need to make changes to the restore of the trial.

Contributor Author

Some of these things are orthogonal to this change; I'll separate them out into a subsequent PR.

@richardliaw (Contributor Author)

I don't know what you're referring to by "trial execution state"; if you're talking about changing the trial executor logic, sure, those changes are somewhat separate and can be done in a subsequent PR.

The current trial checkpointing works without scheduler or search algorithm checkpointing; scheduler and search algorithm checkpoints can be added separately from this PR.

@richardliaw (Contributor Author)

Closing this PR and will reopen when revised (soon).

@richardliaw closed this Nov 5, 2018
@richardliaw mentioned this pull request Nov 12, 2018