Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Train] implement CheckpointStrategy #19111

Merged
merged 6 commits into from
Oct 27, 2021

Conversation

matthewdeng
Copy link
Contributor

@matthewdeng matthewdeng commented Oct 5, 2021

Implements CheckpointStrategy to delete checkpoints from disk.

Why are these changes needed?

For a training function with many epochs, persisted checkpoints may lead to a high disk usage. The user should be able to define in their training run to clean up checkpoints. Specifically, they may want to keep N checkpoints, removing checkpoints based on:

  1. Timestamp - remove old and outdated checkpoints.
  2. A user-defined metric - remove "bad" or low-scoring checkpoints.

Example

This example is also included in the docs.

from ray import train
from ray.train import CheckpointStrategy, Trainer


def train_func():
    # first checkpoint
    train.save_checkpoint(loss=2)
    # second checkpoint
    train.save_checkpoint(loss=4)
    # third checkpoint
    train.save_checkpoint(loss=1)
    # fourth checkpoint
    train.save_checkpoint(loss=3)

# Keep the 2 checkpoints with the smallest "loss" value.
checkpoint_strategy = CheckpointStrategy(num_to_keep=2,
                                         checkpoint_score_attribute="loss",
                                         checkpoint_score_order="min")

trainer = Trainer(backend="torch", num_workers=2)
trainer.start()
trainer.run(train_func, checkpoint_strategy=checkpoint_strategy)
print(trainer.best_checkpoint_path)
# /home/ray_results/train_2021-09-01_12-00-00/run_001/checkpoints/checkpoint_000003
print(trainer.latest_checkpoint_dir)
# /home/ray_results/train_2021-09-01_12-00-00/run_001/checkpoints
print([checkpoint_path for checkpoint_path in trainer.latest_checkpoint_dir.iterdir()])
# [PosixPath('/home/ray_results/train_2021-09-01_12-00-00/run_001/checkpoints/checkpoint_000003'),
# PosixPath('/home/ray_results/train_2021-09-01_12-00-00/run_001/checkpoints/checkpoint_000001')]
trainer.shutdown()

Implementation

The implementation loosely follows that of Tune's checkpoint_manager.

  1. Add a _top_persisted_checkpoints priority queue in CheckpointManager to keep track of the "best" checkpoints. Checkpoint paths are wrapped in a PersistedCheckpoint object.
  2. The user can customize checkpoint retention logic in CheckpointStrategy. This will determine when and which checkpoints to delete. Note: If two checkpoints have the same priority, the original one will be kept and the latest one will never be persisted.
  3. Store a single _best_persisted_checkpoint, which is exposed in Trainer.best_checkpoint_path. This replaces Trainer.last_checkpoint_path, as the most chronologically recent persistent checkpoint path may be already deleted based on the CheckpointStrategy. The default strategy uses timestamp, which would allow this field to effectively remain the same.
  4. A default _timestamp attribute is added to checkpoints for the purpose of supporting timestamp based retention.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@amogkam amogkam self-assigned this Oct 5, 2021
Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Is there docs for this?

python/ray/util/sgd/v2/backends/backend.py Outdated Show resolved Hide resolved
python/ray/util/sgd/v2/backends/backend.py Outdated Show resolved Hide resolved
python/ray/util/sgd/v2/backends/backend.py Outdated Show resolved Hide resolved
@@ -157,6 +245,14 @@ def latest_checkpoint_path(self) -> Optional[Path]:
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we consolidate the implementation for latest_checkpoint_path and best_checkpoint_path? Currently best_checkpoint_path uses the PersistedCheckpoint object to get the path, but latest_checkpoint_path does path manipulation to get the path given the directory and file name. I think we can have both of these use the path stored in the PersistedCheckpoint object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually do we even need latest_checkpoint_path anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point, latest_checkpoint_path is actually just used for writing while best_checkpoint_path is used for reads. I renamed latest_checkpoint_path to next_checkpoint_path (with a minor change to increment the ID).

python/ray/util/sgd/v2/tests/test_trainer.py Outdated Show resolved Hide resolved
@@ -361,14 +361,17 @@ def latest_checkpoint_dir(self) -> Optional[Path]:
return self._executor.latest_checkpoint_dir

@property
def latest_checkpoint_path(self) -> Optional[Path]:
"""Path to the latest persisted checkpoint from the latest run.
def best_checkpoint_path(self) -> Optional[Path]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also want to expose a best_checkpoint object the same way we have for latest_checkpoint?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I didn't think this was necessary right now and thought it might convolute the API. Is there a use-case where we'd need to expose this?

@amogkam amogkam added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 9, 2021
@matthewdeng matthewdeng changed the title [SGD] implement CheckpointStrategy [Train] implement CheckpointStrategy Oct 26, 2021
@matthewdeng matthewdeng removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 26, 2021
Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just left some minor documentation points

doc/source/train/user_guide.rst Outdated Show resolved Hide resolved
doc/source/train/user_guide.rst Show resolved Hide resolved
@amogkam amogkam merged commit aa5499e into ray-project:master Oct 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants