Use GPTModel from mcore #7093

Merged
78 commits merged into main on Aug 14, 2023

Conversation


@ericharper ericharper commented Jul 21, 2023

What does this PR do ?

This PR adds a path to use GPTModel from Megatron Core.

Requirements:
Make sure that megatron core is installed from a recent commit:

git clone https://github.com/NVIDIA/Megatron-LM.git && \
    cd Megatron-LM && \
    git checkout 3316e811cc5335ee24c2d203416d864edcf2f7a8 && \
    pip install -e .

Collection: NLP

Changelog

  • Add an mcore_gpt config option that switches MegatronGPTModel to GPTModel from megatron.core
  • Build the megatron core parallel/transformer configs (ModelParallelConfig, TransformerConfig) from the NeMo model config
  • Add GQA config options to the GPT config file (mcore path only)

Usage

Set mcore_gpt=True in the MegatronGPTModel config.
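
A minimal sketch of the toggle, following the config layout quoted in the diff below (only the mcore_gpt field is confirmed by this PR; surrounding keys follow the existing GPT config file):

```yaml
model:
  # use GPTModel from megatron.core instead of the legacy NeMo GPT implementation
  mcore_gpt: True
```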

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?
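
For reference, the optional-import guard pattern asked about above can be sketched as follows. This is an illustrative minimal version, not the actual NeMo code; the flag name HAVE_MEGATRON_CORE and the build_model helper are hypothetical:

```python
# Minimal sketch of an import guard for an optional dependency.
# HAVE_MEGATRON_CORE and build_model are illustrative names, not
# the actual NeMo implementation.
try:
    import megatron.core  # optional dependency
    HAVE_MEGATRON_CORE = True
except (ImportError, ModuleNotFoundError):
    HAVE_MEGATRON_CORE = False


def build_model(mcore_gpt: bool) -> str:
    """Fail fast with a clear error if the optional path is requested
    but the dependency is missing."""
    if mcore_gpt and not HAVE_MEGATRON_CORE:
        raise ImportError(
            "mcore_gpt=True requires megatron-core; install it from the "
            "pinned commit listed above."
        )
    return "mcore-gpt" if mcore_gpt else "legacy-gpt"
```

The guard lets the rest of the package import cleanly when the optional library is absent, deferring the error to the point where the feature is actually requested.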

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added core Changes to NeMo Core NLP CI labels Jul 21, 2023
@github-advanced-security github-advanced-security bot left a comment

CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

@ericharper ericharper marked this pull request as ready for review July 25, 2023 18:16
@ericharper ericharper changed the base branch from main to mcore_gpt_path July 25, 2023 18:16
@github-actions github-actions bot removed core Changes to NeMo Core CI labels Jul 25, 2023
ericharper and others added 21 commits August 8, 2023 10:46
Signed-off-by: ericharper <[email protected]>
ericharper and others added 4 commits August 10, 2023 11:29
Signed-off-by: ericharper <[email protected]>
* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: ericharper <[email protected]>
Base automatically changed from mcore_gpt_path to main August 14, 2023 04:55
Signed-off-by: eharper <[email protected]>
@github-actions github-actions bot added the CI label Aug 14, 2023
@@ -44,6 +44,9 @@ exp_manager:
model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}

model:
# use GPTModel from megatron.core
mcore_gpt: False
Collaborator
I think we should add a CI test for mcore_gpt=True

@aklife97 aklife97 left a comment

LGTM!

@ericharper ericharper merged commit e64076e into main Aug 14, 2023
10 of 12 checks passed
@ericharper ericharper deleted the mcore_gpt_model branch August 14, 2023 21:37
dorotat-nv pushed a commit to dorotat-nv/NeMo that referenced this pull request Aug 24, 2023
* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* update module args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* remove optimizer_idx

Signed-off-by: eharper <[email protected]>

* prefetch num microbatches

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (NVIDIA#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update for dist adam

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use get_gpt_module_list

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update megatron core commit

Signed-off-by: eharper <[email protected]>

* revert change

Signed-off-by: eharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: ericharper <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jason Wang <[email protected]>
Signed-off-by: dorotat <[email protected]>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
3 participants