
Added example to user guide for cloud checkpointing #20045

Merged: 19 commits merged into master from tune-checkpointing on Nov 15, 2021
Conversation

@worldveil (Contributor) commented on Nov 4, 2021:

Added a few starter code examples just to get the ball rolling in documenting cloud sync and trainables.

@Yard1 Yard1 self-assigned this Nov 4, 2021
@Yard1 Yard1 self-requested a review November 4, 2021 16:16
# when checkpoints happen, all the trained parameters are saved alongside,
# not just your model's hyperparameters! useful if you want your checkpoints
# to be "inference ready" checkpoints of your models
tune.with_parameters(my_pytorch_training_func),
@worldveil (Contributor, Author) commented on Nov 4, 2021:
Suggested change:
- tune.with_parameters(my_pytorch_training_func),
+ tune.with_parameters(tune.durable(my_pytorch_training_func)),
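For context, a minimal sketch of what the suggested wrapper amounts to in the Ray 1.x API this PR documents; the training function body, config, and bucket path below are illustrative placeholders, not part of the diff:

from ray import tune

def my_pytorch_training_func(config, checkpoint_dir=None):
    # placeholder training loop; a real one would train a model and
    # write checkpoint files via tune.checkpoint_dir()
    tune.report(loss=config["lr"])

# tune.durable() marks the trainable as durable, so its checkpoints are
# persisted to the configured cloud upload_dir rather than kept node-local
tune.run(
    tune.durable(my_pytorch_training_func),
    config={"lr": 0.01},
    sync_config=tune.SyncConfig(upload_dir="s3://my-checkpoints-bucket/path/"),
)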

# cluster when your instances are terminated
sync_config = tune.SyncConfig(
    upload_dir="s3://my-checkpoints-bucket/path/",  # requires AWS credentials
    sync_to_cloud=True  # this can also be a func for custom logic!
@worldveil (Contributor, Author):

sync_to_driver=False
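Putting the quoted snippet and this suggestion together, the sync configuration under review would look roughly like this (the bucket path is the one from the diff; treat the rest as a sketch of the SyncConfig options as used at the time):

from ray import tune

sync_config = tune.SyncConfig(
    upload_dir="s3://my-checkpoints-bucket/path/",  # requires AWS credentials on the nodes
    sync_to_cloud=True,     # can also be a callable implementing custom upload logic
    sync_to_driver=False,   # skip rsync-ing trial folders back to the driver machine
)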


tune.run(train_func)
* Locally: ``local_dir/my-tune-exp/<trial_name>/checkpoint_<step>``
@worldveil (Contributor, Author):

if sync_to_driver=False AND you run locally from a laptop against a cluster, then this is NOT stored locally

if sync_to_driver=False AND you run with ray submit, then only trials run on the head node will have checkpoints on the head node
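A sketch of how the run-level arguments map onto the checkpoint path quoted above, assuming the sync_config and train_func from the earlier snippets; the experiment name and local_dir are illustrative:

tune.run(
    train_func,
    name="my-tune-exp",         # becomes the experiment folder name in the path
    local_dir="~/ray_results",  # checkpoints land in <local_dir>/my-tune-exp/<trial_name>/checkpoint_<step>
    sync_config=sync_config,    # the SyncConfig defined earlier
)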


# our same config as above!
sync_config=sync_config,
)
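If the intent of this fragment is to resume the experiment with the same sync configuration, a sketch using the Ray 1.x resume flag could look like this (the resume=True flag and experiment name are my assumption, not part of the diff):

tune.run(
    train_func,
    name="my-tune-exp",
    resume=True,                # pick up the experiment where it left off
    sync_config=sync_config,    # our same config as above
)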

.. _tune-distributed-checkpointing:



2. **Custom training function**

* All this means is that your function has to expose a ``checkpoint_dir`` argument in the function signature and call ``tune.checkpoint_dir``. See this example; it's quite simple to do.
@worldveil (Contributor, Author):

I want to link here to custom_func_checkpointing.rst, but not sure how to do that
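While the right cross-reference is sorted out, here is a minimal, self-contained sketch of the pattern that bullet describes; the state.json file name and the dummy loss are placeholders:

import json
import os
from ray import tune

def train_func(config, checkpoint_dir=None):
    start = 0
    # when Tune restores the trial, it passes the directory of the last checkpoint
    if checkpoint_dir:
        with open(os.path.join(checkpoint_dir, "state.json")) as f:
            start = json.load(f)["step"] + 1

    for step in range(start, 10):
        loss = 1.0 / (step + 1)  # stand-in for a real training step
        # tune.checkpoint_dir() hands Tune a directory to persist (and sync) for this step
        with tune.checkpoint_dir(step=step) as ckpt_dir:
            with open(os.path.join(ckpt_dir, "state.json"), "w") as f:
                json.dump({"step": step}, f)
        tune.report(loss=loss)

tune.run(train_func)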

A simple local/rsync checkpointing example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Local or rsync checkpointing can be a good option if:
@worldveil (Contributor, Author):

Double check me on this list!

Reviewer (Member):

This looks fine. I guess you can also add "if you are not interested in persistence"
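For comparison with the cloud setup, a sketch of the default local/rsync behaviour discussed here, assuming no upload_dir is configured (the path and the trivial trainable are illustrative):

from ray import tune

def train_func(config):
    tune.report(loss=1.0)  # trivial stand-in trainable

# with no upload_dir configured, Tune keeps checkpoints under local_dir on each node
# and by default rsyncs trial folders from the worker nodes to the driver over SSH
tune.run(
    train_func,
    local_dir="~/ray_results",
    sync_config=tune.SyncConfig(),  # defaults: no cloud upload, sync to driver enabled
)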

@@ -346,7 +454,7 @@ Distributed Checkpointing

On a multinode cluster, Tune automatically creates a copy of all trial checkpoints on the head node. This requires the Ray cluster to be started with the :ref:`cluster launcher <cluster-cloud>` and also requires rsync to be installed.

- Note that you must use the ``tune.checkpoint_dir`` API to trigger syncing.
+ Note that you must use the ``tune.checkpoint_dir`` API to trigger syncing (or use a Ray built-in model type described here).
@worldveil (Contributor, Author):

Also want to link to that example custom_func_checkpointing.rst or a better one from github.

@worldveil (Contributor, Author):

@Yard1 made a bunch of fixes & adds here.

Would love your review, especially of some of the examples I linked. We might have better ones to link to for the "ray models" bullet list.

@Yard1 (Member) left a comment:

I have pushed a commit that improved the way links are handled, and added links to custom_func_checkpointing as you wanted - let me know if anything is unclear

A simple (cloud) checkpointing example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cloud storage-backed Tune checkpointing is the recommended best practice for both performance and reliability reasons. rsync-based checkpointing is also incompatible with Ray on Kubernetes. If you'd rather checkpoint locally or use rsync based checkpointing, see here.
@krfricke (Contributor):

Link missing ("here")

also, please add a sentence of context here (explain ssh based rsync checkpointing) - it may not be clear for users that this is the default way.

I'd also rather frame this positively ("it also enables checkpointing if using Ray on Kubernetes, which does not work out of the box" or so).

@worldveil (Contributor, Author):

How do I link to an anchor? "here" would ideally link to the A simple local/rsync checkpointing example section below

@worldveil (Contributor, Author):

Adjusted the text, let me know if phrased better @krfricke

@worldveil (Contributor, Author):

@krfricke @Yard1 hopefully should have addressed all your comments

@ray-project ray-project deleted a comment from Yard1 Nov 10, 2021
In this example, checkpoints will be saved by training iteration to ``local_dir/exp_name/trial_name/checkpoint_<step>``.
* **Locally**: not saved! we set ``sync_to_driver=False``, so nothing will be sync'd to the driver (your laptop)
* **S3**: ``s3://my-checkpoints-bucket/path/my-tune-exp/<trial_name>/checkpoint_<step>``
* **On head node**: ``~/ray-results/my-tune-exp/<trial_name>/checkpoint_<step>``
@worldveil (Contributor, Author):

Suggested change:
- * **On head node**: ``~/ray-results/my-tune-exp/<trial_name>/checkpoint_<step>``
+ * **On head node**: ``~/ray-results/my-tune-exp/<trial_name>/checkpoint_<step>`` (will only have trials run on the head node)

@worldveil (Contributor, Author):

@Yard1 added suggestion

@worldveil (Contributor, Author):

@krfricke all look good here? ready to merge unless you have additional comments

@krfricke krfricke self-assigned this Nov 11, 2021
@krfricke (Contributor) left a comment:

two small changes

A simple (cloud) checkpointing example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cloud storage-backed Tune checkpointing is the recommended best practice for both performance and reliability reasons. It also enables checkpointing when using Ray on Kubernetes, which does not work out of the box with the default SSH-based rsync syncing. If you'd rather checkpoint locally or use rsync-based checkpointing, see here.
@krfricke (Contributor):

"here" - link missing?

Comment on lines 412 to 413
# with_parameters() ensures when we checkpoint, weights are saved (not just hyperparameters)
durable_trainer_with_params = tune.with_parameters(my_trainable)
@krfricke (Contributor):

Ah this is not correct - let's not use with_parameters here at all

Generally, with_parameters is used to e.g. re-use data across trainables. It saves parameters in the object store to be re-used, similarly to functools.partial - we don't use this here, so let's just directly pass the trainable below
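A hedged sketch of the intended use of with_parameters that the comment describes: shipping a large object to every trial via the object store, rather than enabling checkpointing (the function and dataset below are illustrative):

from ray import tune

def train_func(config, data=None):
    # "data" arrives through with_parameters, not through the config dict
    tune.report(mean=sum(data) / len(data))

large_dataset = list(range(1_000_000))  # placeholder for something expensive to ship per trial

# with_parameters puts large_dataset into the Ray object store once and injects it
# into each trial, similar in spirit to functools.partial
tune.run(tune.with_parameters(train_func, data=large_dataset), num_samples=4)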

@worldveil worldveil changed the title [WIP] Added example to user guide for cloud checkpointing Added example to user guide for cloud checkpointing Nov 12, 2021
@krfricke (Contributor) left a comment:

I adjusted the example to match the latest API changes and moved the docs around a bit

@krfricke krfricke merged commit fa878e2 into master Nov 15, 2021
@krfricke krfricke deleted the tune-checkpointing branch November 15, 2021 15:43