[NeMo-UX] Support save_last="link"
#10548
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Anna Shors <[email protected]>
            logging.info(f'Scheduled async checkpoint save for {filepath}')
        else:
            finalize_fn()

    def _save_last_checkpoint(self, trainer: "pl.Trainer", monitor_candidates: Dict[str, torch.Tensor]) -> None:
I'm wondering if there is some way to avoid overriding the whole method. This is always risky, since we lose touch with the upstream.
How is our flow different from the one in PTL, such that we need to add the saved_current_step logic and cannot rely on self.last_model_path?
Is it because PTL links to any available last checkpoint (not necessarily from the last iteration)?
Yeah, I made two changes:
- Made sure to add a symlink only when the current step was actually saved. As you suggested, PTL always links to the last checkpoint saved, which might not correspond to the latest step.
- Added these lines, which fix the last_model_path saved to the *-last checkpoint state dict when using symlinks.
I'll think about whether we can make these fixes without overwriting the entire _save_last_checkpoint method.
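The first change above (link only when the current step was actually saved) can be illustrated with a minimal standard-library sketch. This is not the code from this PR: the function name `update_last_link`, its parameters, and the link name `last` are hypothetical, and what happens when nothing was saved at the current step is left to the caller.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def update_last_link(
    ckpt_dir: Path,
    saved_ckpt: Optional[Path],
    saved_step: Optional[int],
    current_step: int,
) -> Optional[Path]:
    """Point `<ckpt_dir>/last` at `saved_ckpt` only if it was written at `current_step`.

    Returns the link path when the link was (re)created, otherwise None.
    All names here are hypothetical and chosen for illustration.
    """
    if saved_ckpt is None or saved_step != current_step:
        # No checkpoint was written at this step, so linking would make
        # "last" point at a stale step. Leave any existing link untouched.
        return None

    link = ckpt_dir / "last"
    if link.is_symlink():
        link.unlink()
    # Use a relative target so the checkpoint directory can be moved as a whole.
    os.symlink(saved_ckpt.name, link)
    return link


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        ckpt = root / "step=10.ckpt"
        ckpt.write_text("weights")
        print(update_last_link(root, ckpt, saved_step=10, current_step=10))
        # Step 20 saved nothing new, so the link is not touched.
        print(update_last_link(root, ckpt, saved_step=10, current_step=20))
```

The guard on `saved_step != current_step` is the whole point of the sketch: a link is created only for a checkpoint written at the current step.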
Thanks. Maybe overwriting _save_last_checkpoint is inevitable, in which case the current version is OK.
Running some final tests now, but I think I was able to avoid overwriting _save_last_checkpoint. Please let me know if you have any concerns with the current approach.
Great, thanks, this is great.
Do you know how last_model_path is used during restart? I'm wondering if the loaded state dict will be valid if, e.g., a failure happens between the regular and "last" checkpoint saves.
I think last_model_path is only used when removing the previous -last model to ensure we only retain a single -last checkpoint. If a failure happens between the regular and last checkpoint saves, I don't think the state dict will be valid, but I also don't think this is a concern, because we'd end up restoring from the previously saved -last checkpoint, which does have the correct state dict.
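The failure scenario discussed above can be simulated with a small standard-library sketch: a newer checkpoint exists on disk, but the crash happened before the link was updated, so a restart still resolves to the previous checkpoint. The helper `resolve_last` and the directory layout are hypothetical, not NeMo's restore logic.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def resolve_last(ckpt_dir: Path, link_name: str = "last") -> Optional[Path]:
    """Return the checkpoint that `<ckpt_dir>/<link_name>` points to.

    Returns None when the link is missing or dangling. Hypothetical helper.
    """
    link = ckpt_dir / link_name
    if not link.is_symlink():
        return None
    target = link.resolve()
    return target if target.exists() else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()

        # Step 10: regular checkpoint saved, then "last" linked to it.
        (root / "step=10").mkdir()
        os.symlink("step=10", root / "last")

        # Step 20: regular checkpoint saved, but the job dies before relinking.
        (root / "step=20").mkdir()

        # On restart, "last" still resolves to the step-10 checkpoint.
        print(resolve_last(root))
```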
LGTM, thank you @ashors1
[🤖]: Hi @ashors1 👋, I just wanted to let you know that, you know, a CICD pipeline for this PR just finished successfully ✨ So it might be time to merge this PR or like to get some approvals 🚀 But I'm just a 🤖 so I'll leave it to you what to do next. Have a great day! //cc @ko3n1g
* provide support for save_last='link'
* fix symlinks when top_k checkpoint not saved
* support symlinks with async checkpointing
* only unlink on rank 0
* fix race condition
* force linked checkpoint to correspond to last finalized checkpoint
* fix last_model_path after restore
* move symlink removal to strategy
* remove unneeded lines
* add some more documentation
* Apply isort and black reformatting
* address some comments
* fix syntax
* avoid overwriting _save_last_checkpoint
* fix base call
* small fix
* Apply isort and black reformatting
* add test for save_last=link
* Apply isort and black reformatting
* clean up test
* use megatroncheckpointio in test
* add async test and clean up
* fix remaining merge conflicts
* Apply isort and black reformatting
* check number of saved checkpoints
* remove unused import
* run test on gpu only
* fix a small bug and add a resume test
* Apply isort and black reformatting
* remove old comment
---------
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Anna Shors <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ashors1 <[email protected]>
What does this PR do?
Adds support for creating a symlink for -last checkpoints. The implementation is compatible with both synchronous and asynchronous checkpointing.
Collection: llm
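To illustrate how one linking step can serve both synchronous and asynchronous saves, here is a minimal standard-library sketch in which the link is created only in a finalize step that runs after the write completes. It is a sketch under stated assumptions, not the NeMo implementation: `save_checkpoint`, `link_last`, `save_and_link`, and the thread pool standing in for an async checkpoint writer are all hypothetical.

```python
import os
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, payload: bytes) -> None:
    """Stand-in for the real (possibly slow) checkpoint write."""
    path.write_bytes(payload)


def link_last(path: Path) -> None:
    """Finalize step: point `last` at a checkpoint that is fully written."""
    link = path.parent / "last"
    if link.is_symlink():
        link.unlink()
    os.symlink(path.name, link)


def save_and_link(
    path: Path, payload: bytes, executor: Optional[ThreadPoolExecutor] = None
) -> Optional[Future]:
    """Save a checkpoint and link `last` to it once the save has finished.

    Without an executor the save is synchronous. With an executor, the write
    and the finalize step run in the background, in that order, so the link
    never points at a checkpoint that is still being written.
    """

    def job() -> None:
        save_checkpoint(path, payload)
        link_last(path)

    if executor is None:
        job()
        return None
    return executor.submit(job)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        save_and_link(root / "step=1.ckpt", b"sync")
        with ThreadPoolExecutor(max_workers=1) as pool:
            future = save_and_link(root / "step=2.ckpt", b"async", executor=pool)
            future.result()  # wait for the write and the finalize step
        print(os.readlink(root / "last"))
```

Running the write and the finalize step as one background job, instead of attaching a done-callback to the future, means that waiting on the future also waits for the link update.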
Changelog
Usage
# Add a code snippet demonstrating how to use this