Wezuo/zero stage2 prototype #4746

Closed
wants to merge 92 commits into from

wezuo commented Aug 10, 2020

Zero stage-2 Part 1

Description:
Enables ZeRO stage 2 when accumulation_step = 1. For now, only the synchronized NCCL reduce is enabled, on the same stage. Partitioning follows weight-gradient boundaries.
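
To illustrate what partitioning at weight-gradient boundaries can look like, here is a minimal sketch in plain Python. It is not the code in this PR: the function name, the gradient names, and the greedy balancing rule are all assumptions made for illustration. The only property it is meant to show is that each gradient tensor is assigned whole to one rank and never split across ranks.

```python
def partition_by_tensor_boundary(sizes, world_size):
    """Assign whole gradient tensors to ranks, never splitting a tensor.

    sizes: dict mapping a (hypothetical) gradient name to its element
           count, in graph order.
    Returns a list with one list of gradient names per rank.
    """
    total = sum(sizes.values())
    target = total / world_size  # ideal number of elements per rank
    partitions = [[] for _ in range(world_size)]
    rank, filled = 0, 0
    for name, size in sizes.items():
        # Advance to the next rank once the running total has reached
        # this rank's cumulative share (assumed greedy rule).
        if filled >= target * (rank + 1) and rank < world_size - 1:
            rank += 1
        partitions[rank].append(name)
        filled += size
    return partitions
```

With tensors of unequal size the partitions will not be perfectly balanced, which is the trade-off of cutting only at tensor boundaries.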

In this PR:

  1. Build the graph to enable ZeRO stage 2.
  2. Enforce the correct execution order.
  3. Implement the NcclReduce kernel.
  4. Modify the NcclAllGather kernel for ZeRO stage 2.
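
The general data flow behind a reduce followed by an all-gather can be sketched as a single-process simulation. This is only an illustration under stated assumptions, not the kernels or graph changes from this PR: `zero_stage2_step`, the plain SGD update, the gradient averaging, and the `owner` mapping are hypothetical, and no NCCL calls are made.

```python
def zero_stage2_step(weights, local_grads, owner, lr=0.1):
    """Simulate one training step across ranks inside one process.

    weights:     dict name -> float, replicated on every rank.
    local_grads: list with one dict (name -> float) per rank.
    owner:       dict name -> rank that owns that weight's gradient
                 (hypothetical mapping, one whole tensor per owner).
    """
    world_size = len(local_grads)

    # Reduce: each gradient is summed onto its owner rank only,
    # instead of being all-reduced onto every rank.
    reduced = [{} for _ in range(world_size)]
    for name, rank in owner.items():
        reduced[rank][name] = sum(g[name] for g in local_grads)

    # Update: each rank updates only the weights it owns
    # (assumed: plain SGD on the averaged gradient).
    updated = [{} for _ in range(world_size)]
    for rank in range(world_size):
        for name, grad in reduced[rank].items():
            updated[rank][name] = weights[name] - lr * grad / world_size

    # All-gather: every rank ends up with all updated shards.
    gathered = {}
    for shard in updated:
        gathered.update(shard)
    return reduced, gathered
```

After the reduce, each rank holds only the gradients it owns, which is where the memory saving of stage 2 comes from in this simplified picture.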

snnn commented Aug 20, 2021

I will close this stale PR.

Labels: training (issues related to ONNX Runtime training; typically submitted using template)
6 participants