[Train] implement CheckpointStrategy #19111

matthewdeng · 2021-10-05T20:11:57Z

Implements CheckpointStrategy to delete checkpoints from disk.

Why are these changes needed?

For a training function with many epochs, persisted checkpoints may lead to a high disk usage. The user should be able to define in their training run to clean up checkpoints. Specifically, they may want to keep N checkpoints, removing checkpoints based on:

Timestamp - remove old and outdated checkpoints.
A user-defined metric - remove "bad" or low-scoring checkpoints.

Example

This example is also included in the docs.

from ray import train
from ray.train import CheckpointStrategy, Trainer


def train_func():
    # first checkpoint
    train.save_checkpoint(loss=2)
    # second checkpoint
    train.save_checkpoint(loss=4)
    # third checkpoint
    train.save_checkpoint(loss=1)
    # fourth checkpoint
    train.save_checkpoint(loss=3)

# Keep the 2 checkpoints with the smallest "loss" value.
checkpoint_strategy = CheckpointStrategy(num_to_keep=2,
                                         checkpoint_score_attribute="loss",
                                         checkpoint_score_order="min")

trainer = Trainer(backend="torch", num_workers=2)
trainer.start()
trainer.run(train_func, checkpoint_strategy=checkpoint_strategy)
print(trainer.best_checkpoint_path)
# /home/ray_results/train_2021-09-01_12-00-00/run_001/checkpoints/checkpoint_000003
print(trainer.latest_checkpoint_dir)
# /home/ray_results/train_2021-09-01_12-00-00/run_001/checkpoints
print([checkpoint_path for checkpoint_path in trainer.latest_checkpoint_dir.iterdir()])
# [PosixPath('/home/ray_results/train_2021-09-01_12-00-00/run_001/checkpoints/checkpoint_000003'),
# PosixPath('/home/ray_results/train_2021-09-01_12-00-00/run_001/checkpoints/checkpoint_000001')]
trainer.shutdown()

Implementation

The implementation loosely follows that of Tune's checkpoint_manager.

Add a _top_persisted_checkpoints priority queue in CheckpointManager to keep track of the "best" checkpoints. Checkpoint paths are wrapped in a PersistedCheckpoint object.
The user can customize checkpoint retention logic in CheckpointStrategy. This will determine when and which checkpoints to delete. Note: If two checkpoints have the same priority, the original one will be kept and the latest one will never be persisted.
Store a single _best_persisted_checkpoint, which is exposed in Trainer.best_checkpoint_path. This replaces Trainer.last_checkpoint_path, as the most chronologically recent persistent checkpoint path may be already deleted based on the CheckpointStrategy. The default strategy uses timestamp, which would allow this field to effectively remain the same.
A default _timestamp attribute is added to checkpoints for the purpose of supporting timestamp based retention.

Related issue number

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

amogkam

Nice! Is there docs for this?

python/ray/util/sgd/v2/backends/backend.py

amogkam · 2021-10-05T21:33:17Z

python/ray/util/sgd/v2/backends/backend.py

@@ -157,6 +245,14 @@ def latest_checkpoint_path(self) -> Optional[Path]:
        else:


Can we consolidate the implementation for latest_checkpoint_path and best_checkpoint_path? Currently best_checkpoint_path uses the PersistedCheckpoint object to get the path, but latest_checkpoint_path does path manipulation to get the path given the directory and file name. I think we can have both of these use the path stored in the PersistedCheckpoint object.

Actually do we even need latest_checkpoint_path anymore?

Great point, latest_checkpoint_path is actually just used for writing while best_checkpoint_path is used for reads. I renamed latest_checkpoint_path to next_checkpoint_path (with a minor change to increment the ID).

python/ray/util/sgd/v2/tests/test_trainer.py

amogkam · 2021-10-05T21:51:06Z

python/ray/util/sgd/v2/trainer.py

@@ -361,14 +361,17 @@ def latest_checkpoint_dir(self) -> Optional[Path]:
        return self._executor.latest_checkpoint_dir

    @property
-    def latest_checkpoint_path(self) -> Optional[Path]:
-        """Path to the latest persisted checkpoint from the latest run.
+    def best_checkpoint_path(self) -> Optional[Path]:


Do we also want to expose a best_checkpoint object the same way we have for latest_checkpoint?

Hmmm, I didn't think this was necessary right now and thought it might convolute the API. Is there a use-case where we'd need to expose this?

…checkpoint-strategy

amogkam

LGTM! Just left some minor documentation points

doc/source/train/user_guide.rst

Co-authored-by: Amog Kamsetty <[email protected]>

[SGD] implement CheckpointStrategy

d7565df

matthewdeng requested a review from amogkam October 5, 2021 20:11

amogkam self-assigned this Oct 5, 2021

amogkam reviewed Oct 5, 2021

View reviewed changes

amogkam added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 9, 2021

matthewdeng added 3 commits October 25, 2021 21:44

Merge branch 'master' of https://github.com/ray-project/ray into sgd-…

a245b23

…checkpoint-strategy

address comments

6a3f223

update docs

1025f87

matthewdeng changed the title ~~[SGD] implement CheckpointStrategy~~ [Train] implement CheckpointStrategy Oct 26, 2021

matthewdeng removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 26, 2021

matthewdeng requested a review from amogkam October 26, 2021 18:21

amogkam approved these changes Oct 27, 2021

View reviewed changes

doc/source/train/user_guide.rst Outdated Show resolved Hide resolved

doc/source/train/user_guide.rst Show resolved Hide resolved

matthewdeng and others added 2 commits October 26, 2021 17:54

Update doc/source/train/user_guide.rst

4736ca8

Co-authored-by: Amog Kamsetty <[email protected]>

best checkpoint

c63e09c

amogkam merged commit aa5499e into ray-project:master Oct 27, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Train] implement CheckpointStrategy #19111

[Train] implement CheckpointStrategy #19111

matthewdeng commented Oct 5, 2021 •

edited

Loading

amogkam left a comment

amogkam Oct 5, 2021

amogkam Oct 5, 2021

matthewdeng Oct 26, 2021

amogkam Oct 5, 2021

matthewdeng Oct 26, 2021

amogkam left a comment

		@@ -157,6 +245,14 @@ def latest_checkpoint_path(self) -> Optional[Path]:
		else:

[Train] implement CheckpointStrategy #19111

[Train] implement CheckpointStrategy #19111

Conversation

matthewdeng commented Oct 5, 2021 • edited Loading

Why are these changes needed?

Example

Implementation

Related issue number

Checks

amogkam left a comment

Choose a reason for hiding this comment

amogkam Oct 5, 2021

Choose a reason for hiding this comment

amogkam Oct 5, 2021

Choose a reason for hiding this comment

matthewdeng Oct 26, 2021

Choose a reason for hiding this comment

amogkam Oct 5, 2021

Choose a reason for hiding this comment

matthewdeng Oct 26, 2021

Choose a reason for hiding this comment

amogkam left a comment

Choose a reason for hiding this comment

matthewdeng commented Oct 5, 2021 •

edited

Loading