-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Train] implement CheckpointStrategy #19111
[Train] implement CheckpointStrategy #19111
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Is there docs for this?
@@ -157,6 +245,14 @@ def latest_checkpoint_path(self) -> Optional[Path]: | |||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we consolidate the implementation for latest_checkpoint_path
and best_checkpoint_path
? Currently best_checkpoint_path uses the PersistedCheckpoint object to get the path, but latest_checkpoint_path does path manipulation to get the path given the directory and file name. I think we can have both of these use the path stored in the PersistedCheckpoint
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually do we even need latest_checkpoint_path
anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point, latest_checkpoint_path
is actually just used for writing while best_checkpoint_path
is used for reads. I renamed latest_checkpoint_path
to next_checkpoint_path
(with a minor change to increment the ID).
python/ray/util/sgd/v2/trainer.py
Outdated
@@ -361,14 +361,17 @@ def latest_checkpoint_dir(self) -> Optional[Path]: | |||
return self._executor.latest_checkpoint_dir | |||
|
|||
@property | |||
def latest_checkpoint_path(self) -> Optional[Path]: | |||
"""Path to the latest persisted checkpoint from the latest run. | |||
def best_checkpoint_path(self) -> Optional[Path]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also want to expose a best_checkpoint
object the same way we have for latest_checkpoint
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, I didn't think this was necessary right now and thought it might convolute the API. Is there a use-case where we'd need to expose this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just left some minor documentation points
Co-authored-by: Amog Kamsetty <[email protected]>
Implements
CheckpointStrategy
to delete checkpoints from disk.Why are these changes needed?
For a training function with many epochs, persisted checkpoints may lead to a high disk usage. The user should be able to define in their training run to clean up checkpoints. Specifically, they may want to keep N checkpoints, removing checkpoints based on:
Example
This example is also included in the docs.
Implementation
The implementation loosely follows that of Tune's checkpoint_manager.
_top_persisted_checkpoints
priority queue inCheckpointManager
to keep track of the "best" checkpoints. Checkpoint paths are wrapped in aPersistedCheckpoint
object.CheckpointStrategy
. This will determine when and which checkpoints to delete. Note: If two checkpoints have the same priority, the original one will be kept and the latest one will never be persisted._best_persisted_checkpoint
, which is exposed inTrainer.best_checkpoint_path
. This replacesTrainer.last_checkpoint_path
, as the most chronologically recent persistent checkpoint path may be already deleted based on theCheckpointStrategy
. The default strategy uses timestamp, which would allow this field to effectively remain the same._timestamp
attribute is added to checkpoints for the purpose of supporting timestamp based retention.Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.