[Tune] [Doc] Tune checkpointing and Tuner restore docfix #29411

Merged
merged 15 commits into from
Oct 27, 2022


justinvyu
Contributor

Why are these changes needed?

  1. The docs suggest that a newly created Tuner will try to resume from the last checkpoint if an experiment of the same name is found at <local_dir>/<exp_name>. This was the case in the old tune.run API example since we specified resume="AUTO", but the new Tuner API always starts a new experiment. See attached issue for more details.
  2. It was very hard to find how to actually save and load checkpoints. I added references to the Trainable documentation in the Tune checkpointing user guide. Previously, the checkpointing user guide didn't actually tell the user how to checkpoint: it only gave examples of how to configure CheckpointConfig and SyncConfig.
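The behavior described in item 1 can be sketched as follows. This is a minimal sketch, not code from this PR: `resume_or_start` is a hypothetical helper, and checking for the experiment directory is a simplifying assumption about how to detect an earlier run. It relies only on a `Tuner(...)` constructor and a `Tuner.restore(path)` classmethod. The Tuner class is passed in as an argument so the logic can be read and exercised without Ray installed.

```python
import os


def resume_or_start(tuner_cls, trainable, experiment_path, **tuner_kwargs):
    """Return a restored tuner if `experiment_path` exists, else a fresh one.

    Hypothetical helper: `tuner_cls` is expected to behave like
    `ray.tune.Tuner`, i.e. offer a `restore(path)` classmethod and a
    constructor that takes the trainable as its first argument.
    """
    if os.path.isdir(experiment_path):
        # Resuming is an explicit opt-in: constructing a new Tuner does not
        # pick up earlier results on its own.
        return tuner_cls.restore(experiment_path)
    # No earlier experiment directory was found, so start from scratch.
    return tuner_cls(trainable, **tuner_kwargs)
```

With Ray installed, this would be called as `resume_or_start(tune.Tuner, my_trainable, os.path.join(local_dir, exp_name), ...)`, where `my_trainable`, `local_dir`, and `exp_name` are placeholders. The arguments accepted by `Tuner.restore` may differ between Ray releases, so check the version in use.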

Related issue number

Closes #28722

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@maxpumperla maxpumperla left a comment

Couple of nits, but this looks really good!

doc/source/tune/tutorials/tune-checkpoints.rst (outdated, resolved)
python/ray/tune/trainable/trainable.py (resolved)
python/ray/tune/trainable/trainable.py (resolved)
@xwjiang2010
Contributor

Really love this revision. The modularized sections look good to me.
Looks like there are some merge conflicts and linting errors to resolve?


Tune also may copy or move checkpoints during the course of tuning. For this purpose,
it is important not to depend on absolute paths in the implementation of ``save``.
.. include:: checkpointing/function-checkpointing.rst
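
To illustrate the note above about absolute paths: since checkpoints may be copied or moved, a save/load pair should address files only relative to the directory it is handed. The following is a minimal sketch under stated assumptions. `CounterTrainable` is a hypothetical plain-Python stand-in that only mirrors the `save_checkpoint(checkpoint_dir)` / `load_checkpoint(checkpoint_dir)` hook names of a class-based Trainable. It does not subclass any Ray class, and the file name `state.json` is made up for the example.

```python
import json
import os
import shutil
import tempfile


class CounterTrainable:
    """Hypothetical stand-in mirroring the Trainable checkpoint hooks."""

    def __init__(self):
        self.iteration = 0

    def step(self):
        self.iteration += 1

    def save_checkpoint(self, checkpoint_dir):
        # Write only inside the directory we were given, and never store
        # that directory's absolute path in the saved state.
        with open(os.path.join(checkpoint_dir, "state.json"), "w") as f:
            json.dump({"iteration": self.iteration}, f)
        return checkpoint_dir

    def load_checkpoint(self, checkpoint_dir):
        # Resolve the file relative to wherever the checkpoint lives now.
        with open(os.path.join(checkpoint_dir, "state.json")) as f:
            self.iteration = json.load(f)["iteration"]


def roundtrip_after_move():
    """Save a checkpoint, move it elsewhere, and restore from the new location."""
    src = tempfile.mkdtemp()
    dst_parent = tempfile.mkdtemp()
    trainable = CounterTrainable()
    for _ in range(3):
        trainable.step()
    trainable.save_checkpoint(src)
    # Simulate the checkpoint directory being moved after it was written.
    moved = shutil.move(src, os.path.join(dst_parent, "checkpoint_000003"))
    restored = CounterTrainable()
    restored.load_checkpoint(moved)
    shutil.rmtree(dst_parent)
    return restored.iteration
```

Because nothing in the saved state refers to the original directory, `roundtrip_after_move()` restores the counter even though the checkpoint was relocated between saving and loading.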
Contributor

Why break this out into a separate file?

Contributor Author

I reuse this section in the Tune "working with checkpoints" user guide, since that's where I would intuitively look for an example of how to actually checkpoint. I've commented where that is below.

Contributor

Can you link to it instead of re-rendering it in two places?

Otherwise for example, the search results are going to get cluttered.

Contributor

So concretely: don't break it out into a separate file. Put it in the "working with checkpoints" part, and use a relative reference here that links to that "working with checkpoints" section.

See below for examples of saving and loading trial-level checkpoints.


How do I save and load trial checkpoints?
Contributor Author

Here are the examples again within the Tune user guide.

@richardliaw richardliaw merged commit cd031a0 into ray-project:master Oct 27, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Development

Successfully merging this pull request may close these issues.

[Tune] [Docs] Checkpointing example resume functionality is misleading
5 participants