fix 2D parallel crash caused by all-reduce on 2D world_mesh #105

Merged on Mar 2, 2024 (2 commits)


tianyu-l (Contributor) commented Mar 2, 2024

Stack from ghstack (oldest at bottom):

In the 2D case (FSDP + SP), the loss metric should be computed with an all-reduce over the DP submesh only. Previously the all-reduce ran over the whole 2D world mesh, which caused the crash; this PR fixes it.
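
To illustrate which ranks should take part in the loss all-reduce, here is a minimal pure-Python sketch. It does not call PyTorch; it only models the rank layout. It assumes a row-major (dp, sp) mesh with DP as dimension 0, and the helper names `dp_group_ranks` and `mean_loss_over` are hypothetical, not part of this repository.

```python
def dp_group_ranks(rank, dp_degree, sp_degree):
    """Return the global ranks in `rank`'s DP group.

    Assumes a row-major (dp, sp) mesh, so global rank = dp_index * sp_degree + sp_index.
    Ranks in the same DP group share the same sp_index.
    """
    sp_index = rank % sp_degree
    return [dp_index * sp_degree + sp_index for dp_index in range(dp_degree)]


def mean_loss_over(ranks, per_rank_loss):
    """Average the loss over the given ranks, standing in for an averaging all-reduce."""
    return sum(per_rank_loss[r] for r in ranks) / len(ranks)


# Hypothetical 2 x 2 mesh: ranks 0 and 1 form one SP group (DP index 0),
# ranks 2 and 3 form the other (DP index 1).
dp_degree, sp_degree = 2, 2
per_rank_loss = {0: 1.0, 1: 1.0, 2: 3.0, 3: 3.0}

# Rank 1 reduces only with its DP peers, not with every rank in the world mesh.
group = dp_group_ranks(1, dp_degree, sp_degree)
loss = mean_loss_over(group, per_rank_loss)
```

In this layout, rank 1 reduces with rank 3 only, which is the group that a collective restricted to the DP submesh would use.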

tianyu-l added a commit that referenced this pull request Mar 2, 2024
ghstack-source-id: 3f4046538ef80e36cfeb7d95cf92b986276b9b3c
Pull Request resolved: #105
@facebook-github-bot facebook-github-bot added the CLA Signed label Mar 2, 2024
tianyu-l added a commit that referenced this pull request Mar 2, 2024
ghstack-source-id: 1c5bf790d7473f6a24124051fcfa1fd2585a56f9
Pull Request resolved: #105
@tianyu-l tianyu-l requested a review from gnadathur March 2, 2024 01:29
@tianyu-l tianyu-l merged commit b05fad1 into gh/tianyu-l/2/base Mar 2, 2024
3 of 4 checks passed
@tianyu-l tianyu-l deleted the gh/tianyu-l/2/head branch March 2, 2024 01:32
- dp_degree = world_mesh.size(0)
- dp_rank = world_mesh.get_local_rank(0)
+ dp_mesh = world_mesh["dp"]
+ dp_degree = dp_mesh.size()

There is also a config called data_parallel_degree. Should we use that instead?

lessw2020 pushed a commit that referenced this pull request Apr 18, 2024
ghstack-source-id: 1c5bf790d7473f6a24124051fcfa1fd2585a56f9
Pull Request resolved: #105
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
ghstack-source-id: 1c5bf790d7473f6a24124051fcfa1fd2585a56f9
Pull Request resolved: pytorch#105