Added example to user guide for cloud checkpointing #20045
Conversation
doc/source/tune/user-guide.rst
Outdated
# when checkpoints happen, all the trained parameters are saved alongside,
# not just your model's hyperparameters! useful if you want your checkpoints
# to be "inference ready" checkpoints of your models
tune.with_parameters(my_pytorch_training_func),
- tune.with_parameters(my_pytorch_training_func),
+ tune.with_parameters(tune.durable(my_pytorch_training_func)),
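For context, a rough sketch of how the suggested tune.durable wrapper plugs into tune.run (the experiment name and bucket path are placeholders, and this assumes the Ray 1.x Tune API this PR targets):

from ray import tune

# tune.durable() wraps the trainable so that trial checkpoints are
# persisted to the cloud storage configured via SyncConfig, instead
# of living only on node-local disk.
tune.run(
    tune.durable(my_pytorch_training_func),  # trainable from the diff above
    name="my-tune-exp",  # placeholder experiment name
    sync_config=tune.SyncConfig(
        upload_dir="s3://my-checkpoints-bucket/path/",
    ),
)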
doc/source/tune/user-guide.rst
Outdated
# cluster when your instances are terminated
sync_config = tune.SyncConfig(
    upload_dir="s3://my-checkpoints-bucket/path/",  # requires AWS credentials
    sync_to_cloud=True  # this can also be a func for custom logic!
sync_to_driver=False
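Combining the reviewer's suggestion with the snippet above, the config would read roughly as follows (a sketch; sync_to_driver is the SyncConfig field of the Tune version this PR targets):

sync_config = tune.SyncConfig(
    upload_dir="s3://my-checkpoints-bucket/path/",  # requires AWS credentials
    sync_to_driver=False,  # don't rsync checkpoints back to the driver node
)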
doc/source/tune/user-guide.rst
Outdated
tune.run(train_func)

* Locally: ``local_dir/my-tune-exp/<trial_name>/checkpoint_<step>``
If sync_to_driver=False AND running locally from laptop → cluster, then this is NOT stored locally.
If sync_to_driver=False AND run with ray submit, then only trials run on the head node will be stored on the head node.
# our same config as above!
sync_config=sync_config,
)
.. _tune-distributed-checkpointing:
doc/source/tune/user-guide.rst
Outdated
2. **Custom training function**

* All this means is that your function has to expose a ``checkpoint_dir`` argument in the function signature and call ``tune.checkpoint_dir``. See this example; it's quite simple to do.
I want to link here to custom_func_checkpointing.rst, but not sure how to do that.
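For illustration, a minimal function trainable of the kind the bullet describes might look like this (a hedged sketch, not the custom_func_checkpointing example itself; the "checkpoint" file name and loop bounds are made up):

import os
from ray import tune

def my_training_func(config, checkpoint_dir=None):
    start = 0
    # Tune passes checkpoint_dir when restoring a trial; load state from it.
    if checkpoint_dir:
        with open(os.path.join(checkpoint_dir, "checkpoint")) as f:
            start = int(f.read())
    for step in range(start, 100):
        # Ask Tune for a checkpoint directory; writing into it is what
        # triggers checkpoint syncing.
        if step % 10 == 0:
            with tune.checkpoint_dir(step=step) as ckpt_dir:
                with open(os.path.join(ckpt_dir, "checkpoint"), "w") as f:
                    f.write(str(step))
        tune.report(mean_loss=1.0 / (step + 1))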
A simple local/rsync checkpointing example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Local or rsync checkpointing can be a good option if:
Double check me on this list!
This looks fine. I guess you can also add "if you are not interested in persistence"
doc/source/tune/user-guide.rst
Outdated
@@ -346,7 +454,7 @@ Distributed Checkpointing

On a multinode cluster, Tune automatically creates a copy of all trial checkpoints on the head node. This requires the Ray cluster to be started with the :ref:`cluster launcher <cluster-cloud>` and also requires rsync to be installed.

- Note that you must use the ``tune.checkpoint_dir`` API to trigger syncing.
+ Note that you must use the ``tune.checkpoint_dir`` API to trigger syncing (or use a Ray built-in model type described here).
Also want to link to that example custom_func_checkpointing.rst, or a better one from GitHub.
@Yard1 made a bunch of fixes & adds here. Would love your review, esp among some of the examples I linked. We might have better ones to link to for the "ray models" bullet list.
I have pushed a commit that improved the way links are handled, and added links to custom_func_checkpointing as you wanted - let me know if anything is unclear.
Co-authored-by: Antoni Baum <[email protected]>
doc/source/tune/user-guide.rst
Outdated
A simple (cloud) checkpointing example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cloud storage-backed Tune checkpointing is the recommended best practice for both performance and reliability reasons. rsync-based checkpointing is also incompatible with Ray on Kubernetes. If you'd rather checkpoint locally or use rsync-based checkpointing, see here.
Link missing ("here")
Also, please add a sentence of context here (explain SSH-based rsync checkpointing) - it may not be clear to users that this is the default way.
I'd also rather frame this positively ("it also enables checkpointing if using Ray on Kubernetes, which does not work out of the box" or so).
How do I link to an anchor? "here" would ideally link to the "A simple local/rsync checkpointing example" section below.
Adjusted the text, let me know if phrased better @krfricke
Co-authored-by: Antoni Baum <[email protected]>
doc/source/tune/user-guide.rst
Outdated
In this example, checkpoints will be saved by training iteration to ``local_dir/exp_name/trial_name/checkpoint_<step>``.

* **Locally**: not saved! we set ``sync_to_driver=False``, so nothing will be sync'd to the driver (your laptop)
* **S3**: ``s3://my-checkpoints-bucket/path/my-tune-exp/<trial_name>/checkpoint_<step>``
* **On head node**: ``~/ray-results/my-tune-exp/<trial_name>/checkpoint_<step>``
- * **On head node**: ``~/ray-results/my-tune-exp/<trial_name>/checkpoint_<step>``
+ * **On head node**: ``~/ray-results/my-tune-exp/<trial_name>/checkpoint_<step>`` (will only have trials run on the head node)
@Yard1 added suggestion
@krfricke all look good here? ready to merge unless you have additional comments
two small changes
doc/source/tune/user-guide.rst
Outdated
A simple (cloud) checkpointing example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cloud storage-backed Tune checkpointing is the recommended best practice for both performance and reliability reasons. It also enables checkpointing when using Ray on Kubernetes, where rsync-based sync (which relies on SSH) does not work out of the box. If you'd rather checkpoint locally or use rsync-based checkpointing, see here.
"here" - link missing?
doc/source/tune/user-guide.rst
Outdated
# with_parameters() ensures when we checkpoint, weights are saved (not just hyperparameters)
durable_trainer_with_params = tune.with_parameters(my_trainable)
Ah, this is not correct - let's not use with_parameters here at all. Generally, with_parameters is used to e.g. re-use data across trainables. It saves parameters in the object store to be re-used, similarly to functools.partial - we don't use this here, so let's just directly pass the trainable below.
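To illustrate the reviewer's point: with_parameters injects large objects into the trainable through the object store, much like functools.partial, so the same data is shared across trials rather than copied into each config. A hypothetical sketch:

from ray import tune

def train_func(config, data=None):
    # `data` is fetched from the Ray object store and shared across
    # trials, instead of being copied into every trial's config.
    tune.report(mean_loss=sum(data) / len(data))

large_dataset = list(range(1_000_000))  # stand-in for real training data
tune.run(tune.with_parameters(train_func, data=large_dataset))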
I adjusted the example to match the latest API changes and moved the docs around a bit
Added a few starter code examples just to get the ball rolling in documenting cloud sync and trainables.