Adding RETRO tests to Action Tests (cicd-main.yml) (#8942) · NVIDIA/NeMo@0351363

Adding RETRO tests to Action Tests (cicd-main.yml) (#8942)

* update branch

Signed-off-by: eharper <[email protected]>

* Add dist ckpt support for regular optimizers (#7749)

* Add dist ckpt support for regular optimizers

Signed-off-by: Mikołaj Błaż <[email protected]>

* [tutorial] fixed missing RIR scripts file. (#8257)

Signed-off-by: Xuesong Yang <[email protected]>

* fix imports

Signed-off-by: dimapihtar <[email protected]>

* imports fix

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci imports fix

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert asr notebook

Signed-off-by: dimapihtar <[email protected]>

* revert asr notebook

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Pin lhotse=1.19.2 in r1.23.0 (#8303)

Signed-off-by: Piotr Żelasko <[email protected]>

* Cache Aware Streaming tutorial notebook (#8296)

* add notebook

Signed-off-by: Elena Rastorgueva <[email protected]>

* rename old notebook to Buffered_Streaming

Signed-off-by: Elena Rastorgueva <[email protected]>

* call setup_streaming_params in set_default_att_context_size method

Signed-off-by: Elena Rastorgueva <[email protected]>

* update links in docs

Signed-off-by: Elena Rastorgueva <[email protected]>

* update links to tutorials in docs

Signed-off-by: Elena Rastorgueva <[email protected]>

* remove hard-coding

Signed-off-by: Elena Rastorgueva <[email protected]>

* rename var

Signed-off-by: Elena Rastorgueva <[email protected]>

---------

Signed-off-by: Elena Rastorgueva <[email protected]>

* fix path location and branch (#8304)

* fix path location and branch

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* change to a floating point number

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Somshubra Majumdar <[email protected]>

* add deallocate pipeline output optimization (#8279)

* add deallocate pipeline output optimization

Signed-off-by: Jimmy Zhang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix memory leak caused by context parallelism hanging references by omegaconf (#8299)

* save cp_size to self

Signed-off-by: Jimmy Zhang <[email protected]>

* use parallel_state instead of self

Signed-off-by: Jimmy Zhang <[email protected]>

---------

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* remove assertion (#8302)

Signed-off-by: dimapihtar <[email protected]>

* Update PEFT Doc (#8262)

* update peft doc

Signed-off-by: Chen Cui <[email protected]>

* remove old prompt learning doc and notebook

Signed-off-by: Chen Cui <[email protected]>

* fix table

Signed-off-by: Chen Cui <[email protected]>

* fix table

Signed-off-by: Chen Cui <[email protected]>

* fix table

Signed-off-by: Chen Cui <[email protected]>

* Merge branch 'r1.23.0' into chcui/update_peft_doc

Signed-off-by: Chen Cui <[email protected]>

* revert accidental changes

Signed-off-by: Chen Cui <[email protected]>

* revert accidental changes

Signed-off-by: Chen Cui <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>

* Attention encoder-decoder models for multiple speech-to-text tasks  (#8242) (#8324)

* Rebasing canary changes at current main

Signed-off-by: Piotr Żelasko <[email protected]>

* Move the changes from asr transformer to nlp transformer as originally intended

Signed-off-by: Piotr Żelasko <[email protected]>

* update eval to strip spaces before punctuations

Signed-off-by: stevehuang52 <[email protected]>

* update pc strip

Signed-off-by: stevehuang52 <[email protected]>

* [canary] Refactor: `PromptedAudioToTextLhotseDataset` and `EncDecMultiTaskModel` (#8247)

* Create a separate CanaryDataset and use it inside `transformer_bpe_models.py`. Ditches `token_sequence_format`.

Signed-off-by: Piotr Żelasko <[email protected]>

* [canary] Refactor: move changes in transformer_bpe_models.py to Canar… (#8252)

* [canary] Refactor: move changes in transformer_bpe_models.py to CanaryModel

Signed-off-by: Piotr Żelasko <[email protected]>

* Rename `CanaryModel` to `EncDecMultiTaskModel` and remove inheritance from `EncDecTransfModelBPE`; add a separate config for this model

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* Rename `CanaryDataset` to `PromptedAudioToTextLhotseDataset`; add `prompt_format_fn` argument; clean-up the `_canary_prompt_format` function a bit

Signed-off-by: Piotr Żelasko <[email protected]>

* Move tokenization into `prompt_format_fn`, fix usage, add docs

Signed-off-by: Piotr Żelasko <[email protected]>

* Backward-compatible utterance validation

Signed-off-by: Piotr Żelasko <[email protected]>

* Improve type annotations

Signed-off-by: Piotr Żelasko <[email protected]>

* config and prompt_fn registration changes from review

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* fix transcribe config

Signed-off-by: stevehuang52 <[email protected]>

* Refactor Canary to follow schema of remaining ASR models (#8260)

* Initial draft of multi task beam decoding strategy

Signed-off-by: smajumdar <[email protected]>

* Stabilize inference

Signed-off-by: smajumdar <[email protected]>

* Update AED Multi Task model to mostly conform to Archetype-Type format. Update config

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add change decoding strategy

Signed-off-by: smajumdar <[email protected]>

* Remove redundant imports

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Cleanup

Signed-off-by: smajumdar <[email protected]>

* Cleanup

Signed-off-by: smajumdar <[email protected]>

* remove asr transformer dependency on nlp

Signed-off-by: stevehuang52 <[email protected]>

* clean up

Signed-off-by: stevehuang52 <[email protected]>

* copy token_classifier from nlp to asr

Signed-off-by: stevehuang52 <[email protected]>

* Address comments

Signed-off-by: smajumdar <[email protected]>

* Add typing to beam decoding

Signed-off-by: smajumdar <[email protected]>

* Make prompt format configurable

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* drop asr dependency on nlp

Signed-off-by: stevehuang52 <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: stevehuang52 <[email protected]>

* fix transcribe, update asr evaluator

Signed-off-by: stevehuang52 <[email protected]>

* Extend the docs for the canary prompt_fn

Signed-off-by: Piotr Żelasko <[email protected]>

* Incorporate changes from Nithin's code review

Signed-off-by: Piotr Żelasko <[email protected]>

* training bug fix and adding launch script for speech_multitask (#8270)

* bug fix and adding launch script for speech_multitask

Signed-off-by: Krishna Puvvada <[email protected]>

* update launch script example in speech_to_text_aed.py

Signed-off-by: Krishna Puvvada <[email protected]>

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>

* Fix: drop_last must be true in validation/test otherwise the training will hang

Signed-off-by: Piotr Żelasko <[email protected]>

* revert to current transcribe API

Signed-off-by: stevehuang52 <[email protected]>

* revert changes to NLP, update docs

Signed-off-by: stevehuang52 <[email protected]>

* update eval utils

Signed-off-by: stevehuang52 <[email protected]>

* update docs

Signed-off-by: stevehuang52 <[email protected]>

* Remove DALI; rename compute_audio_loss to compute_loss

Signed-off-by: Piotr Żelasko <[email protected]>

* set default use_model_transcribe=False

Signed-off-by: stevehuang52 <[email protected]>

* change os.path.dirname to pathlib

Signed-off-by: stevehuang52 <[email protected]>

* [canary] Test for CanaryTokenizer + refactoring (#8285)

* Test for CanaryTokenizer

Signed-off-by: Piotr Żelasko <[email protected]>

* Attempt at refactor...

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* Update config for AED models (#8294)

Signed-off-by: smajumdar <[email protected]>

* set default calculate_wer=False in transcribe_speech.py

Signed-off-by: stevehuang52 <[email protected]>

* Attention encoder-decoder models for multiple speech-to-text tasks

Signed-off-by: Piotr Żelasko <[email protected]>

* Apply suggestions from code review, part 1

Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>

* Apply suggestions from code review, part 2

Signed-off-by: Piotr Żelasko <[email protected]>

* Document compute_loss

Signed-off-by: Piotr Żelasko <[email protected]>

* update transcribe_speech.py

Signed-off-by: stevehuang52 <[email protected]>

* add docstring

Signed-off-by: stevehuang52 <[email protected]>

* Attention encoder-decoder models for multiple speech-to-text tasks

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Krishna Puvvada <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Co-authored-by: stevehuang52 <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: He Huang (Steve) <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
(cherry picked from commit d10726d)

Co-authored-by: Piotr Żelasko <[email protected]>

* add code for calling mcore_retro in NeMo

* add code for calling mcore_retro in NeMo

* runnable, training curve match retro mcore and nemo

* working on retro inference

* working on megatron_retro_eval.py and megatron_retro_inference.yaml

* refactoring text_generation_utils code and retro inference relevant files

* clean PR

* resolving quick hacks (reading number of train/valid samples from workdir, discrepancy in total samples and samples with neighbors retrieved, tokenizers)

* clean repository

* revert changes to inference/eval code to original in main

* clean code

* runnable training code, with already implemented eval code

* [tutorial] fixed missing RIR scripts file. (#8257)

Signed-off-by: Xuesong Yang <[email protected]>

* add values to en tts dict (#7879)

Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>

* Add Bert HF checkpoint converter (#8088)

* Add Bert HF checkpoint converter

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Reformat

Signed-off-by: yaoyu-33 <[email protected]>

* Add BERT ONNX export

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add NeMo BERT to HF BERT script

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Clean code

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update argument names

Signed-off-by: yaoyu-33 <[email protected]>

* Update build_transformer_config in Bert

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Bobby Chen <[email protected]>

* revert to original eval code files

* revert to original eval code files 2

* revert to original eval code files 3

* revert to original eval code files 4

* clean code

* clean code

* update my code to support changes from latest main

* commit before rebase r1.23.0

* Multimodal r1.23.0 bug fix  (#8315)

* Rename quick-gelu

Signed-off-by: yaoyu-33 <[email protected]>

* ddpm config guard

Signed-off-by: yaoyu-33 <[email protected]>

* Fix ddpm edit api

Signed-off-by: yaoyu-33 <[email protected]>

* Fix insert_image_token cfg issue

Signed-off-by: yaoyu-33 <[email protected]>

* neva updates

Signed-off-by: yaoyu-33 <[email protected]>

* reformat

Signed-off-by: yaoyu-33 <[email protected]>

* Add back jenkins

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix jenkins

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bugs

Signed-off-by: yaoyu-33 <[email protected]>

* Update default neva template

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* copy paste files from r1.23.0

* clean PR

* Fixes for MoE parameter passing & use of AutoTokenizer/Model for mistral. (#8272)

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Keep max_seqlen and cu_seqlens_argmin for later micro-batches when PP>1 (#8334)

Signed-off-by: Sangkug Lym <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Remove asr webapp (#8347)

Signed-off-by: smajumdar <[email protected]>

* remove _target_ at model level in aed config (#8351)

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>

* revert changes for tts and asr

* Add change_vocabulary and save_tokenizers() support to Multitask ASR models (#8357)

* Add change_vocabulary and save_tokenizers() support

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update nemo/collections/asr/models/aed_multitask_models.py

Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Piotr Żelasko <[email protected]>

* Change default (#8371)

Signed-off-by: smajumdar <[email protected]>

* implement retro's own fwd_bwd_step() and validation_step() to not have argument first_val_step, which the MLM commit doesn't support

* adding megatron compile_helpers(), in future can be fixed with correct MLM commit

* bug fix in fast-conformer-aed.yaml and adding jenkins test for speech_to_text_aed model (#8368)

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* Enable megatron core loggers for GPT pretraining (#8354)

* Logging changes tested for gpt_pretraining

Signed-off-by: Aishwarya Bhandare <[email protected]>

* Additional args

Signed-off-by: Aishwarya Bhandare <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* mcore ds fix (#8283)

* [tutorial] fixed missing RIR scripts file. (#8257)

Signed-off-by: Xuesong Yang <[email protected]>

* add values to en tts dict (#7879)

Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>

* mcore ds fix

Signed-off-by: Dmytro Pykhtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update mcore

Signed-off-by: dimapihtar <[email protected]>

* revert asr files

Signed-off-by: dimapihtar <[email protected]>

* add comments

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add support for mcore mock dataset

Signed-off-by: dimapihtar <[email protected]>

* update mcore version

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update gpt cfg

Signed-off-by: dimapihtar <[email protected]>

* update mcore commit

Signed-off-by: dimapihtar <[email protected]>

* fix Bert unit tests

Signed-off-by: dimapihtar <[email protected]>

* update bert tests

Signed-off-by: dimapihtar <[email protected]>

* fix bert mcore test

Signed-off-by: dimapihtar <[email protected]>

* fix gpt jenkins tests

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update apex & TE commits

Signed-off-by: dimapihtar <[email protected]>

* revert apex installation

Signed-off-by: dimapihtar <[email protected]>

* turn off the fusion for jenkins

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pablo Garay <[email protected]>

* addressing Eric's reviews

* adding existing implementation RETRO files

* adding existing implementation RETRO files

* Add Finetuning tutorial with HF Datasets (#8356)

* Add Finetuning tutorial with HF Datasets

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* update on Som comments

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao Koluguri <nithinraok>

* release updates (#8378)

* [tutorial] fixed missing RIR scripts file. (#8257)

Signed-off-by: Xuesong Yang <[email protected]>

* add values to en tts dict (#7879)

Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>

* mcore ds fix

Signed-off-by: Dmytro Pykhtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update mcore

Signed-off-by: dimapihtar <[email protected]>

* revert asr files

Signed-off-by: dimapihtar <[email protected]>

* add comments

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add support for mcore mock dataset

Signed-off-by: dimapihtar <[email protected]>

* update mcore version

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update gpt cfg

Signed-off-by: dimapihtar <[email protected]>

* update mcore commit

Signed-off-by: dimapihtar <[email protected]>

* fix Bert unit tests

Signed-off-by: dimapihtar <[email protected]>

* update bert tests

Signed-off-by: dimapihtar <[email protected]>

* fix bert mcore test

Signed-off-by: dimapihtar <[email protected]>

* fix gpt jenkins tests

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add support for dict data input type

Signed-off-by: dimapihtar <[email protected]>

* add mock ds test

Signed-off-by: dimapihtar <[email protected]>

* add test for dict data input type

Signed-off-by: dimapihtar <[email protected]>

* mcore ds fix

Signed-off-by: dimapihtar <[email protected]>

* data input fix

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pablo Garay <[email protected]>

* MCore dataset compatibility for tokenizers (#8390)

* Add unique_identifiers for all tokenizers and eod for SentencePieceTokenizer

Signed-off-by: Valerie Sarge <[email protected]>

* Add generalized token aliases to TokenizerSpec to conform with MegatronTokenizer's interface. Remove now-redundant individual fixes from AutoTokenizer and SentencePieceTokenizer.

Signed-off-by: Valerie Sarge <[email protected]>

---------

Signed-off-by: Valerie Sarge <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>

* Mcore customization doc (#8298)

* [tutorial] fixed missing RIR scripts file. (#8257)

Signed-off-by: Xuesong Yang <[email protected]>

* add values to en tts dict (#7879)

Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>

* Add Bert HF checkpoint converter (#8088)

* Add Bert HF checkpoint converter

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Reformat

Signed-off-by: yaoyu-33 <[email protected]>

* Add BERT ONNX export

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add NeMo BERT to HF BERT script

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Clean code

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update argument names

Signed-off-by: yaoyu-33 <[email protected]>

* Update build_transformer_config in Bert

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Bobby Chen <[email protected]>

* initial placeholder

Signed-off-by: Huiying Li <[email protected]>

* add to intro/index.rst

Signed-off-by: Huiying Li <[email protected]>

* initial content update

Signed-off-by: Huiying Li <[email protected]>

* add diff images

Signed-off-by: Huiying Li <[email protected]>

size

Signed-off-by: Huiying Li <[email protected]>

* minor fixes

* minor language change

Signed-off-by: Chen Cui <[email protected]>

* clean changes

---------

Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Bobby Chen <[email protected]>
Co-authored-by: Huiying Li <[email protected]>
Co-authored-by: Chen Cui <[email protected]>

* wer fix (#8404)

Signed-off-by: Travis Bartley <[email protected]>

* updated link to pubmed (#8402)

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao Koluguri <nithinraok>

* Update NFA video download link (#8406)

* update nfa nasa video link

Signed-off-by: Elena Rastorgueva <[email protected]>

* update link in markdown

Signed-off-by: Elena Rastorgueva <[email protected]>

---------

Signed-off-by: Elena Rastorgueva <[email protected]>

* revert changes (#8410)

Signed-off-by: Chen Cui <[email protected]>

* Fix dreambooth data sampler issue (#8400)

* Turn on drop last

Signed-off-by: yaoyu-33 <[email protected]>

* Some neva fixes

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fixed errors in the CTM gen functions (#8416)

Signed-off-by: Taejin Park <[email protected]>

* add ensemble decoding fix (#8427)

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao Koluguri <nithinraok>

* SDE bugfix log (#8430)

Signed-off-by: George <[email protected]>

* mcore customization doc minor fix (#8421)

Signed-off-by: Huiying Li <[email protected]>

* NeMo-Mistral to HF converter bugfix. (#8353)

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Fixing mcore bert for TP, PP and SP (#8336)

* Fixing mcore bert for TP, PP and SP

* Fixing mcore bert for TP, PP and SP

* Fixing mcore version

* Fixing mcore version

* Update Jenkinsfile

Signed-off-by: Shanmugam Ramasamy <[email protected]>

* Update Jenkinsfile

Signed-off-by: Shanmugam Ramasamy <[email protected]>

* Update Jenkinsfile

Signed-off-by: Shanmugam Ramasamy <[email protected]>

---------

Signed-off-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Add settings to suppress bf16 compile errors in CI on V100 (#8481)

* Add settings to suppress bf16 compile errors in CI on V100

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* MoE parameter passing (#8255)

* MoE parameter passing

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Pass EP/MoE params in consumer scripts.

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* PR fixes

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Use latest commit of mcore-0.5

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* CI fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update k2 version (#8478) (#8492)

Signed-off-by: Vladimir Bataev <[email protected]>

* Add fp8 support for SD/Update notebook paths (#8489)

* Add fp8 support for SD/Update notebook paths

Signed-off-by: Mingyuan Ma <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Mingyuan Ma <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* pin to 0.5.0 (#8465)

Signed-off-by: eharper <[email protected]>

* Update NeMo Multimodal Requirements (#8515)

* Update requirements_multimodal.txt

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update github raw content link (#8517)

Signed-off-by: Chen Cui <[email protected]>

* Add dep notice for notebooks (#8522)

* add dep notice

Signed-off-by: eharper <[email protected]>

* revert

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>

* Revert FP8 integration (#8520)

* Revert FP8 integration

Signed-off-by: Mingyuan Ma <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Mingyuan Ma <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update data prep notebook (#8532)

Signed-off-by: Mingyuan Ma <[email protected]>

* before update branch with latest r1.23.0

* update to run with MLM ae2817b3dde4efb1515061a5311d01d8f85bd99c (runnable training and saving checkpoint)

* remove compile_helpers

* reverse changes from main branch to r1.23.0

* adding *_legacy files

* update MLM commit in Jenkinsfile to latest

* debugging Jenkinstest: test different mcore import in retro_dataset

* update Jenkinsfile edit megatron_retro_mutransfer_pretrain_legacy.py

* removing all mcore RETRO to pass the Jenkinstest

* fixing import legacy problem for tests/collections/nlp/test_indexed_retrieval_dataset.py

* update Jenkinsfile file to use TE v0.7

* update NeMo to work with latest mcore RETRO (solving TE problems)

* update TE commit Jenkinsfile to be the same with r1.23.0's Jenkinsfile

* update commit for MLM

* jenkinstest debugging

* temporary fix RETRO's __init__ for jenkinstest

* edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster

* edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster

* edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster

* edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster

* add model.data.dataloader_type=cyclic to jenkinsfile

* update code to work with latest megatron-lm main 81dab6067

* update M-LM commit in Jenkinsfile to latest main M-LM 81dab6067

* fix to bypass CI test bf16 problem (following this PR https://github.com/NVIDIA/NeMo/pull/8481/files)

* isort and black

* adjusting model.micro_batch_size to 1

* fix BRANCH = 'r1.23.0'

* replace tutorials dir from main branch to huvu/mcore_retro

* fix minor merge conflicts

* update Jenkinsfile

* runnable with a temporary fix from Jacek (unfound -unfinished problem)

* runnable with a temporary fix from Jacek (unfound -unfinished problem)

* modified nlp_overrides.py back to original

* fix checkpoint from Jacek Bieniusiewicz

* config Jenkinsfile test

* set RETRO Jenkins MBS to 1

* black fix

* isort fix

* update TE commit

* update to latest Jenkinsfile with latest container and commits

* remove new RETRO jenkinstest

* merge latest main

* put RETRO Jenkinstest to the right place

* update code for megatron_retro_pretraining_legacy.py

* untrack ipa_cmudict-0.7b_nv23.01.txt

* untrack ipa_cmudict-0.7b_nv23.01.txt

* set config in megatron_retro_pretraining_legacy.py to megatron_retro_config_legacy

* update new RETRO jenkinstest to run faster

* merging latest main, and edit Jenkinstest

* update Jenkinstest for new RETRO to run faster

* fix isort

* adding RETRO tests to cicd-main.yml action tests

* update ipa_cmudict-0.7b_nv23.01.txt

* remove quotes for model.data for legacy RETRO action tests

---------

Signed-off-by: eharper <[email protected]>
Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Elena Rastorgueva <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: Jimmy Zhang <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Sangkug Lym <[email protected]>
Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Krishna Puvvada <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Signed-off-by: Aishwarya Bhandare <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: Valerie Sarge <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Travis Bartley <[email protected]>
Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: George <[email protected]>
Signed-off-by: Shanmugam Ramasamy <[email protected]>
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Mingyuan Ma <[email protected]>
Co-authored-by: eharper <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Piotr Żelasko <[email protected]>
Co-authored-by: Elena Rastorgueva <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: JimmyZhang12 <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Co-authored-by: Huy Vu2 <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Bobby Chen <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: ashbhandare <[email protected]>
Co-authored-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Valerie Sarge <[email protected]>
Co-authored-by: Huiying <[email protected]>
Co-authored-by: Huiying Li <[email protected]>
Co-authored-by: tbartley94 <[email protected]>
Co-authored-by: Taejin Park <[email protected]>
Co-authored-by: George <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Vladimir Bataev <[email protected]>
Co-authored-by: Ming <[email protected]>
Co-authored-by: Huy Vu2 <[email protected]>

40 people authored Apr 17, 2024

1 parent 468d5b6 commit 0351363

.github/workflows/cicd-main.yml

@@ -3690,6 +3690,75 @@ jobs:
              uses: actions/checkout@v2
            - run: |
                python examples/nlp/language_modeling/megatron_retro_pretraining.py \
+               trainer.num_nodes=1 \
+               trainer.devices=2 \
+               trainer.precision=bf16 \
+               trainer.accelerator=gpu \
+               model.data.data_prefix=['none'] \
+               exp_manager.exp_dir=examples/nlp/language_modeling/mcore_retro_results \
+               model.mcore_gpt=True \
+               model.tensor_model_parallel_size=1 \
+               model.pipeline_model_parallel_size=1 \
+               model.optim.name=distributed_fused_adam \
+               model.retro.retro_project_dir=/home/TestData/nlp/megatron_retro/mcore_retro/micro-wiki-core \
+               model.data.num_workers=4 \
+               model.micro_batch_size=1 \
+               model.data.shuffle_documents=False \
+               trainer.val_check_interval=30 \
+               +trainer.num_sanity_val_steps=0 \
+               model.init_method_std=0.023 \
+               model.optim.lr=6.0e-4 \
+               model.megatron_amp_O2=True \
+               model.data.splits_string=\'\"98,2,0\"\' \
+               model.data.dataloader_type=cyclic \
+               trainer.max_steps=10
+               python examples/nlp/language_modeling/megatron_retro_pretraining.py \
+               trainer.num_nodes=1 \
+               trainer.devices=2 \
+               trainer.precision=bf16 \
+               trainer.accelerator=gpu \
+               model.data.data_prefix=['none'] \
+               exp_manager.exp_dir=examples/nlp/language_modeling/mcore_retro_results \
+               model.mcore_gpt=True \
+               model.tensor_model_parallel_size=1 \
+               model.pipeline_model_parallel_size=1 \
+               model.optim.name=distributed_fused_adam \
+               model.retro.retro_project_dir=/home/TestData/nlp/megatron_retro/mcore_retro/micro-wiki-core \
+               model.data.num_workers=4 \
+               model.micro_batch_size=1 \
+               model.data.shuffle_documents=False \
+               trainer.val_check_interval=30 \
+               +trainer.num_sanity_val_steps=0 \
+               model.init_method_std=0.023 \
+               model.optim.lr=6.0e-4 \
+               model.megatron_amp_O2=True \
+               model.data.splits_string=\'\"98,2,0\"\' \
+               model.data.dataloader_type=cyclic \
+               trainer.max_steps=20
+               rm -rf examples/nlp/language_modeling/mcore_retro_results
+           - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
+             if: "failure()"
+     L2_Legacy_Megatron_RETRO_Pretraining_and_Resume_Training:
+       needs: [cicd-test-container-setup]
+       runs-on: self-hosted-azure
+       container:
+         image: nemoci.azurecr.io/nemo_container_${{ github.run_id }}
+         options:
+           # --user 0:128
+           --device=/dev/nvidia0
+           --gpus all
+           --shm-size=8g
+           --env TRANSFORMERS_OFFLINE=0
+           --env HYDRA_FULL_ERROR=1
+           --volume /mnt/datadrive/TestData:/home/TestData
+       steps:
+           - name: Checkout repository
+             uses: actions/checkout@v2
+           - run: |
+               python examples/nlp/language_modeling/megatron_retro_pretraining_legacy.py \
                trainer.devices=2 \
                trainer.num_nodes=1 \
                trainer.accelerator=gpu \
@@ -3700,7 +3769,7 @@ jobs:
                trainer.precision=16 \
                trainer.gradient_clip_val=1.0 \
                trainer.val_check_interval=10 \
-               exp_manager.exp_dir=examples/nlp/language_modeling/retro_results \
+               exp_manager.exp_dir=examples/nlp/language_modeling/retro_legacy_results \
                model.data.data_prefix= \
                model.data.knn_index= \
                model.data.retrieval_prefix= \
@@ -3720,7 +3789,7 @@ jobs:
                model.dec_cross_attention=[1] \
                +model.data.mock=True
-               python examples/nlp/language_modeling/megatron_retro_pretraining.py \
+               python examples/nlp/language_modeling/megatron_retro_pretraining_legacy.py \
                trainer.devices=2 \
                trainer.num_nodes=1 \
                trainer.accelerator=gpu \
@@ -3731,7 +3800,7 @@ jobs:
                trainer.precision=16 \
                trainer.gradient_clip_val=1.0 \
                trainer.val_check_interval=10 \
-               exp_manager.exp_dir=examples/nlp/language_modeling/retro_results \
+               exp_manager.exp_dir=examples/nlp/language_modeling/retro_legacy_results \
                model.data.data_prefix= \
                model.data.knn_index= \
                model.data.retrieval_prefix= \
@@ -3751,7 +3820,7 @@ jobs:
                model.dec_cross_attention=[1] \
                +model.data.mock=True
-               rm -rf examples/nlp/language_modeling/retro_results
+               rm -rf examples/nlp/language_modeling/retro_legacy_results
            - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
              if: "failure()"
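
Note on the `model.data.splits_string=\'\"98,2,0\"\'` override used in the new mcore RETRO job: the backslashes are consumed by the shell, so the Python script receives the value wrapped in literal single and double quotes. The snippet below is a minimal sketch of that shell behaviour only, assuming the `run:` step is executed by a POSIX-style shell such as bash; the variable name `arg` is illustrative and not part of the workflow, and what the script's configuration layer does with the value afterwards is not shown here.

```shell
#!/bin/sh
# Load the override into the positional parameters, exactly as the shell
# would hand it to the Python script as a single argument.
set -- model.data.splits_string=\'\"98,2,0\"\'

# 'arg' is a hypothetical name used only for this illustration.
arg="$1"

# Show the argument after the shell has removed the backslashes.
printf '%s\n' "$arg"
# prints: model.data.splits_string='"98,2,0"'
```

Because the commas sit inside the quotes, the whole `98,2,0` string reaches the script as one value rather than being split by the shell.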

0 comments on commit `0351363`