This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add graph pass and backward option #20837

Open
wants to merge 26 commits into
base: zero_sharding

xinyual
Contributor

@xinyual xinyual commented Jan 22, 2022

Description

This branch adds zero sharding (in particular, gradient partitioning). The changes can be divided into the following parts:

  1. In C++, I add an operator called the reduce operation. It does nothing in the forward pass but reduces the gradient in the backward pass.
  2. I then expose an API called the backward option, which deletes the output of certain graphs.
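
The behaviour described for the reduce operation (identity in the forward pass, gradient reduction in the backward pass) can be illustrated with a small pure-Python sketch. This is not the operator from this PR: the class name `ReduceOp`, the `rank` and `world_size` parameters, and the reduce-scatter style partitioning are all assumptions made for illustration, and the workers are simulated as plain lists rather than real devices.

```python
from typing import List

Vector = List[float]


class ReduceOp:
    """Hypothetical stand-in for the reduce operation (illustrative names only)."""

    def __init__(self, rank: int, world_size: int) -> None:
        self.rank = rank
        self.world_size = world_size

    def forward(self, x: Vector) -> Vector:
        # Forward pass: identity, the input is returned unchanged.
        return list(x)

    def backward(self, grads_from_all_ranks: List[Vector]) -> Vector:
        # Backward pass: sum the gradients from every simulated worker element-wise.
        summed = [sum(values) for values in zip(*grads_from_all_ranks)]
        # Assumption: the gradient length is divisible by world_size, and each
        # rank keeps only its own contiguous shard (gradient partitioning).
        shard = len(summed) // self.world_size
        start = self.rank * shard
        return summed[start:start + shard]


# Two simulated workers, each holding a full local gradient.
grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
ops = [ReduceOp(rank=r, world_size=2) for r in range(2)]
shards = [op.backward(grads) for op in ops]
```

After the backward step each simulated rank holds only its slice of the summed gradient, which is where the memory saving of gradient partitioning is expected to come from.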

Current problem

  1. The reduce operation causes a deadlock (this does not happen with NaiveEngine).
  2. The reduction in memory consumption has not been verified yet.
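
Since the deadlock is reported not to occur with NaiveEngine, one way to compare the two behaviours is to switch engines through the `MXNET_ENGINE_TYPE` environment variable. The following is a minimal sketch, assuming the variable is set before MXNet is imported so that the engine picks it up. The commented import is a placeholder for a real training script.

```python
import os

# Assumption: this runs before MXNet is imported, so the engine sees the setting.
# NaiveEngine executes operations synchronously, which helps isolate
# scheduling-related hangs from bugs in the operator itself.
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"

# import mxnet as mx  # uncomment in an environment with MXNet installed
```

If the hang disappears under NaiveEngine and returns under the default threaded engine, that points toward a dependency or scheduling issue rather than the reduction logic.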

@mxnet-bot

Hey @xinyual, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [unix-gpu, centos-cpu, website, unix-cpu, windows-gpu, windows-cpu, centos-gpu, miscellaneous, clang, sanity, edge]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@mseth10 added the pr-awaiting-testing and pr-work-in-progress labels and removed the pr-awaiting-testing label on Jan 22, 2022.
@leezu
Contributor

leezu commented Jan 25, 2022

Thank you @xinyual! Could you please make sure the CI tests (ci/jenkins/mxnet-validation/sanity) pass? Then we can merge this code into the zero_sharding feature branch for future development.

@xinyual
Contributor Author

xinyual commented Jan 28, 2022

> Thank you @xinyual! Could you please make sure the CI tests (ci/jenkins/mxnet-validation/sanity) pass? Then we can merge this code into the zero_sharding feature branch for future development.

Hi Leo, I don't know where the problem is, since the error output only shows the command that failed. Is this related to the third-party oneDNN? It seems I haven't updated it to the latest version.

@leezu
Contributor

leezu commented Feb 4, 2022

Thank you @xinyual. Please rebase on top of the apache/incubator-mxnet master branch and force push to your xinyual/incubator-mxnet add_graph_pass_and_backward_option branch

Labels
pr-work-in-progress PR is still work in progress

4 participants