Fix some zero-overhead checkpointing bugs #602

Merged 1 commit on Oct 7, 2024

Conversation

@fegin (Contributor) commented Oct 7, 2024

Stack from ghstack (oldest at bottom):

Summary:

  1. The original code does not use `share_memory=True`, which may
    cause incorrect results or a slowdown.
  2. The original code does not pass the correct CPU-offloaded state_dict,
    which can cause a further slowdown or an incorrect save.
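
As an illustration of the two points above, here is a minimal sketch of the general pattern, not the code changed in this PR. It assumes plain PyTorch tensors and uses only the public `Tensor.share_memory_()` and `Tensor.copy_()` calls. The helper names `make_shared_cpu_buffers` and `offload_to_cpu` and the toy `state_dict` are hypothetical. The idea is that CPU staging buffers are allocated once in shared memory, so a separate saving process could read them without another copy, and that the offloaded copy (not the original state_dict) is what gets handed to the saver.

```python
import torch


def make_shared_cpu_buffers(state_dict):
    """Allocate one shared-memory CPU tensor per entry (done once, then reused)."""
    buffers = {}
    for name, tensor in state_dict.items():
        buf = torch.empty(tensor.shape, dtype=tensor.dtype, device="cpu")
        buf.share_memory_()  # moves the storage into shared memory, in place
        buffers[name] = buf
    return buffers


def offload_to_cpu(state_dict, cpu_buffers):
    """Copy current values into the preallocated buffers and return those buffers."""
    for name, tensor in state_dict.items():
        cpu_buffers[name].copy_(tensor.detach())
    return cpu_buffers


# Hypothetical stand-in for a model state_dict.
state_dict = {"weight": torch.arange(6, dtype=torch.float32).reshape(2, 3)}

cpu_buffers = make_shared_cpu_buffers(state_dict)

# Point 2 of the summary: the object passed on for saving should be the
# CPU-offloaded copy, not the original state_dict.
to_save = offload_to_cpu(state_dict, cpu_buffers)
```

In a real training loop the source tensors would typically live on an accelerator and the buffers would be reused on every checkpoint step; whether pinned memory or non-blocking copies are also used is outside what this sketch assumes.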

[ghstack-poisoned]
fegin added a commit that referenced this pull request Oct 7, 2024
Summary:
1. The original code does not use `share_memory=True`, which may
   cause incorrect results or a slowdown.
2. The original code does not pass the correct CPU-offloaded state_dict,
   which can cause a further slowdown or an incorrect save.

ghstack-source-id: c04c634af9f377d860a021875bf65017b152a5c9
Pull Request resolved: #602
@facebook-github-bot added the CLA Signed label Oct 7, 2024
@fegin requested a review from wz337 October 7, 2024 05:56
@wz337 (Contributor) left a comment

LGTM!

@fegin merged commit 853ce3f into gh/fegin/6/base Oct 7, 2024
5 checks passed
@fegin deleted the gh/fegin/6/head branch October 7, 2024 16:32