[NeMo-UX] Support save_last="link" #10548

Merged
merged 35 commits into from
Sep 30, 2024

@ashors1 commented Sep 20, 2024

What does this PR do?

Adds support for creating a symlink for -last checkpoints. The implementation is compatible with both synchronous and asynchronous checkpointing.
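
As a rough illustration of the idea (not the PR's code), the sketch below points a `-last` symlink at an already-saved checkpoint directory instead of writing a second copy. It uses only the Python standard library and assumes a POSIX filesystem; `link_last_checkpoint` and the checkpoint names are hypothetical.

```python
import os
import tempfile


def link_last_checkpoint(ckpt_path: str, last_path: str) -> None:
    """Point `last_path` at `ckpt_path` with a relative symlink (hypothetical helper)."""
    tmp_link = last_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted run
    target = os.path.relpath(ckpt_path, start=os.path.dirname(last_path))
    os.symlink(target, tmp_link)
    # Renaming over the old link swaps it in one step on POSIX filesystems.
    os.replace(tmp_link, last_path)


def demo() -> str:
    with tempfile.TemporaryDirectory() as root:
        last = os.path.join(root, "model-last")
        for step in (100, 200):
            ckpt = os.path.join(root, f"model-step={step}")
            os.makedirs(ckpt)
            link_last_checkpoint(ckpt, last)
        # The link target is stored relative to the link's own directory.
        return os.readlink(last)
```

Building the link under a temporary name and renaming it is one possible design choice; it avoids a moment in which no `-last` entry exists.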

Collection: llm

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
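
A hedged sketch of how link creation might be deferred until a save is finalized, in both synchronous and asynchronous modes. It does not call NeMo or Lightning: `write_checkpoint` and `save_checkpoint` are hypothetical names, a thread pool stands in for the real async machinery, and the link is assumed to live in the same directory as the checkpoints. The PR title suggests the real option is spelled `save_last="link"` on the checkpoint callback.

```python
import os
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Optional, Tuple


def write_checkpoint(path: str) -> None:
    """Stand-in for the real save: create a directory with one file."""
    os.makedirs(path)
    with open(os.path.join(path, "weights.bin"), "wb") as f:
        f.write(b"\x00" * 16)


def save_checkpoint(
    path: str, last_path: str, executor: Optional[ThreadPoolExecutor] = None
) -> Optional[Future]:
    def finalize_fn() -> None:
        # Runs only once the checkpoint at `path` is fully written.
        if os.path.lexists(last_path):
            os.unlink(last_path)
        os.symlink(os.path.basename(path), last_path)

    def write_then_finalize() -> None:
        write_checkpoint(path)
        finalize_fn()

    if executor is not None:
        # Async mode: schedule the write; the link follows in the same task.
        return executor.submit(write_then_finalize)
    # Sync mode: write and finalize immediately.
    write_then_finalize()
    return None


def demo() -> Tuple[str, str]:
    with tempfile.TemporaryDirectory() as root:
        last = os.path.join(root, "model-last")
        save_checkpoint(os.path.join(root, "step=100"), last)
        sync_target = os.readlink(last)
        with ThreadPoolExecutor(max_workers=1) as pool:
            future = save_checkpoint(os.path.join(root, "step=200"), last, executor=pool)
            future.result()  # wait until the save has been finalized
        return sync_target, os.readlink(last)
```

Running the write and the link update in one task means the link never points at a checkpoint that is still being written, and a failed write leaves the previous link in place.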

nemo/lightning/pytorch/callbacks/model_checkpoint.py (outdated, resolved)

            logging.info(f'Scheduled async checkpoint save for {filepath}')
        else:
            finalize_fn()

    def _save_last_checkpoint(self, trainer: "pl.Trainer", monitor_candidates: Dict[str, torch.Tensor]) -> None:
Collaborator

I'm wondering if there is some way to avoid overriding the whole method. This is always risky, since we lose touch with the upstream.

How does our flow differ from the one in PTL, such that we need to add the saved_current_step logic and cannot rely on self.last_model_path?
Is it because PTL links to any available last checkpoint (not necessarily one from the last iteration)?

Collaborator Author

Yeah, I made two changes:

  1. Made sure to add a symlink only when the current step was actually saved. As you suggested, PTL always links to the last checkpoint saved, which might not correspond to the latest step.
  2. Added these lines, which fix the last_model_path saved to the *-last checkpoint state dict when using symlinks.

I'll think about whether we can make these fixes without overriding the entire _save_last_checkpoint method.
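
The first change can be sketched roughly as a guard. This is an illustration under assumptions, not the PR's code: `CheckpointState` and `plan_last_checkpoint` are hypothetical names, and falling back to a full save when no checkpoint exists for the current step is an assumed behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointState:
    """Hypothetical bookkeeping for the sketch; not a NeMo or PTL class."""

    last_saved_path: Optional[str] = None  # most recent regular checkpoint
    last_saved_step: int = -1  # step at which it was written


def plan_last_checkpoint(state: CheckpointState, current_step: int) -> str:
    """Return "link" only if a regular checkpoint exists for `current_step`."""
    if state.last_saved_path is not None and state.last_saved_step == current_step:
        return "link"
    # Linking here would make `-last` point at an older step, so the caller
    # writes a full checkpoint instead (assumed fallback).
    return "save"
```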

Collaborator

Thanks. Maybe overriding _save_last_checkpoint is inevitable, in which case the current version is OK.

Collaborator Author

Running some final tests now, but I think I was able to avoid overriding _save_last_checkpoint. Please let me know if you have any concerns with the current approach.

Collaborator

@mikolajblaz commented Sep 24, 2024

Great, thanks, this is great.

Do you know how last_model_path is used during restart? I'm wondering whether the loaded state dict will be valid if, e.g., a failure happens between the regular and "last" checkpoint saves.

Collaborator Author

I think last_model_path is only used when removing the previous -last model, to ensure we retain only a single -last checkpoint. If a failure happens between the regular and -last checkpoint saves, I don't think the state dict will be valid, but I also don't think this is a concern, because we'd end up restoring from the previously saved -last checkpoint, which does have the correct state dict.
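
The failure window described here can be illustrated with a small standard-library sketch. It is not the PR's code: `update_last_link` and `last_entries` are hypothetical helpers, checkpoints are modeled as empty directories, and the `-last` name is assumed to be derived from the checkpoint name.

```python
import os
import tempfile
from typing import List, Optional


def update_last_link(ckpt_path: str, prev_last: Optional[str]) -> str:
    """Create `<ckpt_path>-last` and drop the previous `-last` link (hypothetical helper)."""
    last_path = ckpt_path + "-last"
    os.symlink(os.path.basename(ckpt_path), last_path)
    # The remembered path is used only for cleanup, so one `-last` entry remains.
    if prev_last is not None and prev_last != last_path and os.path.islink(prev_last):
        os.unlink(prev_last)
    return last_path


def last_entries(root: str) -> List[str]:
    return sorted(name for name in os.listdir(root) if name.endswith("-last"))


def demo():
    with tempfile.TemporaryDirectory() as root:
        first = os.path.join(root, "step=100")
        os.makedirs(first)
        last_model_path = update_last_link(first, None)

        second = os.path.join(root, "step=200")
        os.makedirs(second)
        # A failure at this point leaves only the older link, which still resolves.
        during = [(n, os.readlink(os.path.join(root, n))) for n in last_entries(root)]

        update_last_link(second, last_model_path)
        after = last_entries(root)
        return during, after
```

In this sketch the new link is created before the old one is removed, so an interruption between the two steps leaves at least one usable `-last` entry.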

mikolajblaz previously approved these changes Sep 26, 2024

@athitten left a comment

LGTM, thank you @ashors1

Contributor

[🤖]: Hi @ashors1 👋,

I just wanted to let you know that, you know, a CICD pipeline for this PR just finished successfully ✨

So it might be time to merge this PR or like to get some approvals 🚀

But I'm just a 🤖 so I'll leave it to you what to do next.

Have a great day!

//cc @ko3n1g

@athitten merged commit d664b74 into main Sep 30, 2024
153 of 160 checks passed
@athitten deleted the ashors/symlink-last-ckpt branch September 30, 2024 17:33
maanug-nv pushed a commit that referenced this pull request Oct 2, 2024
* provide support for save_last='link'
* fix symlinks when top_k checkpoint not saved
* support symlinks with async checkpointing
* only unlink on rank 0
* fix race condition
* force linked checkpoint to correspond to last finalized checkpoint
* fix last_model_path after restore
* move symlink removal to strategy
* remove unneeded lines
* add some more documentation
* Apply isort and black reformatting
* address some comments
* fix syntax
* avoid overwriting _save_last_checkpoint
* fix base call
* small fix
* Apply isort and black reformatting
* add test for save_last=link
* Apply isort and black reformatting
* clean up test
* use megatroncheckpointio in test
* add async test and clean up
* fix remaining merge conflicts
* Apply isort and black reformatting
* check number of saved checkpoints
* remove unused import
* run test on gpu only
* fix a small bug and add a resume test
* Apply isort and black reformatting
* remove old comment

---------

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Anna Shors <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ashors1 <[email protected]>
monica-sekoyan pushed a commit that referenced this pull request Oct 14, 2024
tomlifu pushed a commit to tomlifu/NeMo that referenced this pull request Oct 25, 2024
Signed-off-by: Lifu Zhang <[email protected]>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 5, 2024
Signed-off-by: Hainan Xu <[email protected]>
3 participants