
Adds Tiktoken tokenizer for Nemotron-Mistral 12B #9797

Merged
merged 153 commits into main from tkonuk/tiktoken
Jul 22, 2024
Conversation

ertkonuk
Collaborator

What does this PR do ?

Adds a Tiktoken tokenizer for the Nemotron-Mistral 12B model.

Collection: Common

Changelog

  • Adds a Tiktoken-based tokenizer under nemo.collections.common.tokenizers that implements the TokenizerSpec interface, including a token_to_id() method and fixed decoding of special tokens.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
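A minimal, illustrative sketch of how such a tokenizer could be exercised. The class, vocabulary, and greedy longest-match loop below are toy stand-ins for demonstration only: they mirror the text_to_ids/ids_to_text surface that NeMo's TokenizerSpec expects, not the merged implementation (real tiktoken uses byte-pair encoding).

```python
# Hypothetical sketch: a stand-in mimicking the text_to_ids / ids_to_text
# surface of a TokenizerSpec-style tokenizer. All names here are
# illustrative assumptions, not the actual NeMo implementation.

class SimpleTiktokenLikeTokenizer:
    """Toy tokenizer over a fixed vocabulary (token -> id)."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.inv_vocab = {i: t for t, i in vocab.items()}

    def text_to_ids(self, text):
        # Greedy longest-match over the toy vocab; real tiktoken uses BPE.
        ids, i = [], 0
        while i < len(text):
            for j in range(len(text), i, -1):
                piece = text[i:j]
                if piece in self.vocab:
                    ids.append(self.vocab[piece])
                    i = j
                    break
            else:
                i += 1  # skip a character not covered by the vocab
        return ids

    def ids_to_text(self, ids):
        return "".join(self.inv_vocab[i] for i in ids)


vocab = {"hel": 0, "lo": 1, " ": 2, "world": 3}
tok = SimpleTiktokenLikeTokenizer(vocab)
ids = tok.text_to_ids("hello world")
print(ids)                   # [0, 1, 2, 3]
print(tok.ids_to_text(ids))  # hello world
```

The round trip (encode then decode recovers the input) is the property the PR's decoding fix for special tokens is about.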

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?
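For optional dependencies such as tiktoken, the usual guard pattern lets the module import cleanly even when the library is absent. A hedged sketch follows; the flag name, helper, and error message are assumptions for illustration, not NeMo's actual guard helper:

```python
# Illustrative import-guard pattern for an optional dependency.
# HAVE_TIKTOKEN, require_tiktoken(), and the error text are assumptions
# for this sketch; NeMo has its own conventions for optional imports.
try:
    import tiktoken  # optional dependency

    HAVE_TIKTOKEN = True
except ImportError:
    tiktoken = None
    HAVE_TIKTOKEN = False


def require_tiktoken():
    """Return the tiktoken module, or raise a clear error if it is missing."""
    if not HAVE_TIKTOKEN:
        raise ImportError(
            "tiktoken is required for this tokenizer; install it with "
            "`pip install tiktoken`."
        )
    return tiktoken
```

This way the import failure surfaces only when the feature is actually used, with an actionable message, rather than at module import time.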

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

akoumpa
akoumpa previously approved these changes Jul 18, 2024
Member

@akoumpa akoumpa left a comment

LGTM, thanks!

import json
import os
from pathlib import Path
from typing import Dict, List, Optional, Union

import numpy as np
import tiktoken

from nemo.collections.common.parts.utils import if_exist
from nemo.collections.common.tokenizers.tokenizer_spec import TokenizerSpec
from nemo.utils import logging

Code scanning / CodeQL notices: the imports of 'Union', 'np', 'if_exist', and 'logging' are unused.
marcromeyn and others added 21 commits July 19, 2024 11:22
* Add CICD test for Stable Diffusion

Signed-off-by: Michal Futrega <[email protected]>

* Update cicd-main.yml

Signed-off-by: Michal Futrega <[email protected]>

* Use single gpu runner

Signed-off-by: Michal Futrega <[email protected]>

---------

Signed-off-by: Michal Futrega <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
* use default collate if dataset does not have one

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mixtral config

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add convert_state

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix StateDictTransform for 2D layers, e.g. MoE

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pass num_moe_experts to specs

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update MixtralModel

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mini docstring

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
* update mcoreddp call

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update mcore commits

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
* add llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add llama3

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fix typo

Signed-off-by: Chen Cui <[email protected]>

* enable importers with multiple models

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add gemma

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* checks

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
* minor exp_manager bug fixes

* remove print statement

* fix docstring

* fix AppState defaults

---------

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
Signed-off-by: ertkonuk <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
* Update NeMo script

Signed-off-by: yaoyu-33 <[email protected]>

* Fix example scripts

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update convert_llava_nemo_to_hf.py

Signed-off-by: yaoyu-33 <[email protected]>

* address comments

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
* Export implementation for vLLM 0.4.3.

Supports LLAMA2, Mistral, Mixtral (unverified), Gemma and StarCoder2 models.

The nemo.export.tensorrt_llm alias was removed to avoid initializing TRT-LLM when importing anything from nemo.export.

Signed-off-by: Alexey Panteleev <[email protected]>

* Fixed some CodeQL warnings.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Removed empty files.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Updated the integration for vLLM 0.5.0.

Signed-off-by: Alexey Panteleev <[email protected]>

* Updated the vLLM deployment interface to use max_output_len instead of max_output_token.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Moved the Exporter class to nemo/export and renamed its file to vllm_exporter.py, to be more similar to TRT-LLM.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Implemented vLLM support in the export tests, added functional testing, implemented forward evaluation on vLLM without Triton.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Moved the vLLM deployment functionality to the common deploy_triton.py script.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Fixed the CodeQL discovered issues.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Fixed one more return of a wrong dimensionality...

Signed-off-by: Alexey Panteleev <[email protected]>

* More wrong dimensionality returns.

Signed-off-by: Alexey Panteleev <[email protected]>

---------

Signed-off-by: Alexey Panteleev <[email protected]>
Signed-off-by: apanteleev <[email protected]>
Co-authored-by: apanteleev <[email protected]>
Co-authored-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
…uilder (#9535)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
* Improve IOMixin.io_transform_args to handle dataclasses better

* Dump task json + img inside NeMoLogger

* Adding store_io to train task

* Update opt.connect to also propagate to __io__

* Rename opt to optim for consistency

* Moving to using safe serialization using fiddle, only use cloudpickle when needed

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Using Config from fiddle instead of sdk for now

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move enable_nemo_ckpt_io from MegatronStrategy to ModelCheckpoint

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move nemo-ckpt to _get_finalize_save_checkpoint_callback

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Update TrainerContext & io.load_ckpt

* Use renamed TrainerContext inside ModelCheckpoint

* Remove double io saving

* Rename lightning.pytorch.opt -> optim

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove store_io from train-task

* Adding fiddle-extension for torch

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move fdl_torch import

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding dtype to serialization

* Some fixes

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Make TransformerConfig inherit from IOMixin to fix serialization error

* Make TransformerConfig inherit from IOMixin to fix serialization error

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Add support for BuiltinFunctionType

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Add missing import

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix dataclass fields

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
…9544)

* Add test_cpp_runtime flag

Signed-off-by: Jan Lasek <[email protected]>

* Apply isort and black reformatting

Signed-off-by: janekl <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Co-authored-by: janekl <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
* Fix lhotse tests for v1.24.0

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix RIR test

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
@akoumpa akoumpa added Run CICD and removed Run CICD labels Jul 19, 2024
@akoumpa akoumpa merged commit 425d5dd into main Jul 22, 2024
210 checks passed
@akoumpa akoumpa deleted the tkonuk/tiktoken branch July 22, 2024 16:16
tonyjie pushed a commit to tonyjie/NeMo that referenced this pull request Jul 24, 2024
* Adding context- & expert-parallelism to MegatronStrategy (#9525)

Signed-off-by: Tugrul Konuk <[email protected]>

* Add CICD test for Stable Diffusion (#9464)

* Akoumparouli/nemo ux mixtral (#9446)

* update mcoreddp call (#9345)

* [NeMo-UX] Llama and Gemma (#9528)

* [NeMo-UX] minor logging bug fixes (#9529)

* mcore distOpt restore fix (#9421)

* Custom Tiktoken tokenizer.

Signed-off-by: Tugrul Konuk <[email protected]>

* Fixed the tokenizer decoding on special tokens.

Signed-off-by: Tugrul Konuk <[email protected]>

* Apply isort and black reformatting

Signed-off-by: ertkonuk <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Added token_to_id() method.

Signed-off-by: Tugrul Konuk <[email protected]>

* Update neva conversion script from and to HF (#9296)

* vLLM Export Support (#9381)

* PL: Delete precision if using plugin. TODO switch to MegatronTrainerBuilder (#9535)

* Add page context fmha (#9526)

Signed-off-by: Tugrul Konuk <[email protected]>

* extend get_gpt_layer_modelopt_spec to support MoE (#9532)

* fix mock data generation for legacy dataset (#9530)

* [Nemo-UX] IO fixes (#9512)

* Test C++ runtime on demand in nemo_export.py to avoid possible OOMs (#9544)

* Fix lhotse tests for v1.24.2 (#9546)

* gpu_unitTests_notOptional (#9551)

Signed-off-by: Tugrul Konuk <[email protected]>

* add reset learning rate functionality (#9372)

* add reset_lr functionality

Signed-off-by: dimapihtar <[email protected]>

* fix reset_lr logic

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* move reset_lr from optim section

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr value to config

Signed-off-by: dimapihtar <[email protected]>

* set reset_lr False by default

Signed-off-by: dimapihtar <[email protected]>

* remove extra line

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr test

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr test

Signed-off-by: dimapihtar <[email protected]>

* remove extra quote

Signed-off-by: dimapihtar <[email protected]>

* add ability to reset schedule's max_steps and decay_steps

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* change scheduler's first step logic when using reset_lr

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* fix reset_lr logic

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* update reset_lr comments

Signed-off-by: dimapihtar <[email protected]>

* add use cases for reset_lr feature

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add Python AIStore SDK to container and bump min Lhotse version (#9537)

* Add Python AIStore SDK to requirements and bump min Lhotse version

Signed-off-by: Piotr Żelasko <[email protected]>

* Move AIStore Python SDK to Dockerfile, remove matplotlib/ipywidgets deps

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Adding 'use_dynamo' option for export to use onnx.dynamo_export() instead of onnx.export() (#9147)

* Initial WARs to implement dynamo option for export

Signed-off-by: Boris Fomitchev <[email protected]>

* including weights in .onnx

Signed-off-by: Boris Fomitchev <[email protected]>

* dynamo_export works for many small models

Signed-off-by: Boris Fomitchev <[email protected]>

* External weights behaviour fixed

Signed-off-by: Boris Fomitchev <[email protected]>

* Cleanup

Signed-off-by: Boris Fomitchev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: borisfom <[email protected]>

* print cleaned up

Signed-off-by: Boris Fomitchev <[email protected]>

* Added overloadable dynamic_shapes_for_export

Signed-off-by: Boris Fomitchev <[email protected]>

* Addressing code review

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing CI issues

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing CI test failure

Signed-off-by: Boris Fomitchev <[email protected]>

* Eliminated test cross-contamination

Signed-off-by: Boris Fomitchev <[email protected]>

---------

Signed-off-by: Boris Fomitchev <[email protected]>
Signed-off-by: borisfom <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Fix tokenizer IO (#9555)

* Adding tokenizer to io-test + making it pass

* Handling tokenizer correctly inside dump_io

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Removing not used import

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo UX] Move mistral_7b.py to mistral.py (#9545)

* Move mistral_7b.py to mistral.py

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rename MixtralConfig to MixtralConfig8x7B

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mistral rename: mistralconfig7b & mistralmodel

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Use closed-formula to round by multiple (#9307)

* Use closed-formula to round by multiple

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* ci: Do not attempt to send slack on fork (#9556)

* ci: Do not attempt to send slack on fork

Signed-off-by: Oliver Koenig <[email protected]>

* test

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix nemo export test (#9547)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* fix export test

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix SDXL incorrect name in docs (#9534)

Signed-off-by: Tugrul Konuk <[email protected]>

* GPU unit tests: Mark flaky tests to be fixed (#9559)

Signed-off-by: Tugrul Konuk <[email protected]>

* Bump PTL version (#9557)

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Resiliency] Straggler detection (#9473)

* Initial straggler det impl

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed CI code checks

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Removed unused import

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* remove submodule

Signed-off-by: Maanu Grover <[email protected]>

* Updated documentation; Updated callback params; Cosmetic changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed straggler det config; Added basic test

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixes in test_straggler_det.py

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Updated straggler callback API

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* stop_if_detected=False by default

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* switch to torch_dist as default dist checkpointing backend (#9541)

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Checkpointing bug fixes (#9562)

* fix checkpoint loading

* fix

* fixes

* another fix

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add tps and pps params to the export script (#9558)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* fix export test

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* remove n_gpus param

Signed-off-by: Onur Yilmaz <[email protected]>

* add and fix parameters

Signed-off-by: Onur Yilmaz <[email protected]>

* fix deploy script

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* rename tps and pps params

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: oyilmaz-nvidia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Consolidate gpt continue training script into pretraining script (#9413)

* Consolidate gpt continue training with pretraining

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix default config

Signed-off-by: yaoyu-33 <[email protected]>

* Add github action cicd

Signed-off-by: yaoyu-33 <[email protected]>

* extract _integrate_original_checkpoint_data as a method

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix getattr

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "Add github action cicd"

This reverts commit a453f16ba2be6413db932623009da893208acdd5.

* Update comments in nlp_overrides.py

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add support to change Multi task model prompt (#9542)

* Add support to change Multi task model prompt

Signed-off-by: smajumdar <[email protected]>

* Add support to change Multi task model prompt

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Update nemo/collections/common/prompts/formatter.py

Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>

* Address comments

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Address comments

Signed-off-by: smajumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: titu1994 <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add Multimodal Exporter (#9256)

* Add video-neva TRT export

* Add TRT inference

* Change config

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change export params

* Remove unused import

* Add neva export

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change unpack nemo

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Add trt infer config

* Fix neva trt inference

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Add exporter

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix infer

* Add PyTriton

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix deploy wrong dim

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change to pass PIL Image

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix video neva deploy

* Change query

* Change deploy

* Remove unused import

* Change ptuning

* Change to mm exporter

* Add script

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix script

---------

Signed-off-by: meatybobby <[email protected]>
Co-authored-by: meatybobby <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Enable encoder adapters for Canary and MultiTaskAED models (#9409)

* Fix assertions for adapter types

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Cleanup

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Finalize support for decoder adapters

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* fix the freeze/unfreeze problem by replacing as_frozen with torch.inference_mode

* Apply isort and black reformatting

Signed-off-by: weiqingw4ng <[email protected]>

* Update tests to new generic way of module update

Signed-off-by: smajumdar <[email protected]>

* Finalize code for update module

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Fix variable name

Signed-off-by: smajumdar <[email protected]>

* Finalize projection support for transformer mha adapters

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Correct implementation of freeze restore

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Corrects the implementation of replace_adapter_modules to limit to just the top level modules

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Remove registration of Transformer MHA

Signed-off-by: smajumdar <[email protected]>

* Remove registration of Transformer MHA

Signed-off-by: smajumdar <[email protected]>

* Address reviewer comments

Signed-off-by: smajumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: titu1994 <[email protected]>
Signed-off-by: weiqingw4ng <[email protected]>
Co-authored-by: Weiqing Wang <[email protected]>
Co-authored-by: weiqingw4ng <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* pass option through (#9570)

Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* PTQ refinements (#9574)

* Rename megatron_gpt_quantization -> megatron_gpt_ptq

Signed-off-by: Jan Lasek <[email protected]>

* Configure export.save_path as dir or tarball

Signed-off-by: Jan Lasek <[email protected]>

* PTQ docs update

Signed-off-by: Jan Lasek <[email protected]>

* Make model_type optional in case of quantized checkpoints

Signed-off-by: Jan Lasek <[email protected]>

* Drop unused save_nemo_model_config argument

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Audio model collection (#9263)

* Audio model collection

Signed-off-by: Ante Jukić <[email protected]>

* Apply isort and black reformatting

Signed-off-by: anteju <[email protected]>

* Fix imports

Signed-off-by: Ante Jukić <[email protected]>

* Addressed PR comments

Signed-off-by: Ante Jukić <[email protected]>

* Apply isort and black reformatting

Signed-off-by: anteju <[email protected]>

---------

Signed-off-by: Ante Jukić <[email protected]>
Signed-off-by: anteju <[email protected]>
Co-authored-by: anteju <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Fix Trainer serialization (#9571)

* Fix Trainer serialization

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Update click version requirement (#9580)

Signed-off-by: Dong Hyuk Chang <[email protected]>
Co-authored-by: Dong Hyuk Chang <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Fault tolerance] Heartbeat detection (#9352)

* Fault tolerance related changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Cosmetic changes in documentation

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Doc update round2

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Co-authored-by: Jacek Bieniusiewicz <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add ModelOpt QAT example for Llama2 SFT model (#9326)

* add INT4 QAT example for Llama2 SFT model

Signed-off-by: Keval Morabia <[email protected]>

* Add config parameter to control kv cache quantization

Signed-off-by: Keval Morabia <[email protected]>

* Fix typo in cicd-main.yml for QAT test

Signed-off-by: Keval Morabia <[email protected]>

* fix nlp_overrides.py

Signed-off-by: Keval Morabia <[email protected]>

* address reviewer feedback

Signed-off-by: Keval Morabia <[email protected]>

* quantize unwrapped model

Signed-off-by: Keval Morabia <[email protected]>

* add compress export argument for qat config

Signed-off-by: Keval Morabia <[email protected]>

---------

Signed-off-by: Keval Morabia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Set TE flag in legacy -> mcore conversion script (#9585)

* set TE flag

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Add fabric-API for manual forward-pass (#9577)

* First pass over fabric-API

* Adding Trainer -> Fabric conversion

* Some small fixes to get a forward-pass in Fabric working

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding doc-string to Fabric.import_model

* Adding track_io to io_init of Fabric

* Fix Fabric.load_model + add doc-string

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove unused import

* Some small fixes

* Fix failing test

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Add SDK-factories to llm-collection (#9589)

* Adding sdk-factories to llm-collection

* Removing _model from mistral + mixtral

* Expose lr_scheduler inside lightning

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Multimodal projection layer adapter fix for PP>1 (#9445)

* enabling multimodal adapters to load in PP>1

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* parameterizing validate_access_integrity, set to false when PP>1

Signed-off-by: paul-gibbons <[email protected]>

formatting fix

Signed-off-by: paul-gibbons <[email protected]>

Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* update nlp_model.py

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* update modelPT with validate_access_integrity

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* updating save_restore_connector w/ validate_access_integrity

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* addressing comment

Signed-off-by: paul-gibbons <[email protected]>

* adding validate_access_integrity to super().load_config_and_state_dict()

Signed-off-by: paul-gibbons <[email protected]>

* testing reorder of validate_access_integrity for CI failures

Signed-off-by: paul-gibbons <[email protected]>

---------

Signed-off-by: paul-gibbons <[email protected]>
Signed-off-by: paul-gibbons <[email protected]>
Co-authored-by: paul-gibbons <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add offline quantization script for QLoRA deployment (#9455)

* add qlora offline quantization script

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* clean

Signed-off-by: Chen Cui <[email protected]>

* docstring

Signed-off-by: Chen Cui <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* qlora support more models (#9488)

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Some improvements to NeMoLogger (#9591)

Signed-off-by: Tugrul Konuk <[email protected]>

* Set n_gpu to None in nemo export (#9593)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* set ngpus to None

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Inflight nemo model export support (#9527)

* online model conversion and refit

Signed-off-by: Jimmy Zhang <[email protected]>

* clean code

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup

Signed-off-by: Jimmy Zhang <[email protected]>

* add refit, cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* combine weight conversion functions

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <[email protected]>

* remove debug print

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* fix single gpu and cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <[email protected]>

---------

Signed-off-by: JimmyZhang12 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* vLLM Export Improvements (#9596)

* Separated the vLLM export functionality from the common deployment script into deploy_vllm_triton.py.

Signed-off-by: Alexey Panteleev <[email protected]>

* Fixed vocab_size for LLAMA3.

Signed-off-by: Alexey Panteleev <[email protected]>

* Export test: fixed deployment testing w/o Megatron, made functional tests optional, added --gpu_memory_utilization.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Addressing review and CodeQL comments.

Signed-off-by: Alexey Panteleev <[email protected]>

---------

Signed-off-by: Alexey Panteleev <[email protected]>
Signed-off-by: apanteleev <[email protected]>
Co-authored-by: apanteleev <[email protected]>
Co-authored-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Set finalize_model_grads_func in on_fit_start instead to make sure it's being called (#9599)

Signed-off-by: Tugrul Konuk <[email protected]>

* Set no_sync_func & grad_sync_func (#9601)

* Set no_sync_func & grad_sync_func

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* set overlap_param_sync

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* small nemo logger bug fix (#9607)

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix the dict format returned by scheduler method (#9609)

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Dataloading enhancements and bug fixes (#9595)

* fix dataloading + checkpoint restore

* clean up data sampler

* fix typo

* support passing multiple paths to data module

* fix validation dataloader

* fix dataloader len when using gradient accumulation

* fix progress bar

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* fix step count in loggers

* fix blended dataset

* address comments

* address comment

* move step logging into strategy

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix serialization of AutoResume (#9616)

* fix serialization of autoresume

* update undefined variables

Signed-off-by: Tugrul Konuk <[email protected]>

* Chat template support for megatron_gpt_eval.py (#9354)

* Bump PTL version (#9557)

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* [Resiliency] Straggler detection (#9473)

* Initial straggler det impl

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed CI code checks

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Removed unused import

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* remove submodule

Signed-off-by: Maanu Grover <[email protected]>

* Updated documentation; Updated callback params; Cosmetic changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed straggler det config; Added basic test

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixes in test_straggler_det.py

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Updated straggler callback API

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* stop_if_detected=False by default

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move model loading to separate function; call toContainer once; pad using closed formula

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* read prompts from file

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* If input prompt contains dict, apply model.tokenizer.chat_template

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* apply @Gal Leibovich's patch

Taken from: https://github.com/NVIDIA/NeMo/commit/17572905344db4692583e72799d55801a8860f35
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rename prompts_file to prompts_jsonl

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat_template param

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add ChatTemplateMixin to SentencePieceTokenizer

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat-template to text-gen-strat

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move load prompts to separate file

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove chat-template from text-gen-utils

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* make chat-template more generic

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add assert message

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* small refactor for chat_template_mixin

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo ckpt conv changes

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move rounding to function

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Jsonl support (#9611)

* Adding support to preprocess .jsonl and .jsonl.gz files in input directory

Signed-off-by: adityavavre <[email protected]>

* Adding support to preprocess .jsonl and .jsonl.gz files in input directory

Signed-off-by: adityavavre <[email protected]>

* Apply isort and black reformatting

Signed-off-by: adityavavre <[email protected]>

---------

Signed-off-by: adityavavre <[email protected]>
Signed-off-by: adityavavre <[email protected]>
Co-authored-by: adityavavre <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Add PEFT (#9490)

* initial commit for PEFT in nemo2

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* address comments

Signed-off-by: Chen Cui <[email protected]>

* make import easier

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* address comments

Signed-off-by: Chen Cui <[email protected]>

* Update nemo/collections/llm/peft/lora.py

Signed-off-by: Marc Romeyn <[email protected]>

* Some small fixes + adding more doc-strings

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding ModelTransform callback

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fixing type-hint for model_transform

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* fix import

Signed-off-by: Chen Cui <[email protected]>

* model transform for gemma llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fix model transform

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* change lora target default to all linear modules

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* Small fix in mixtral

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Integrating PEFT to the public-API + some fixes

* Big refactor to allow to load adapter-states

* Some fixes to support adapter_path

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Disabling ckpt reloading when adapter_path is passed

* Fix CLI

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove commented-out code

* Remove commented-out code

* Remove un-used import

* Fix callback imports

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fixing llm.pretrain

* Some small fixes

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix missing import + type-hint in finetune

* Adding PreemptionCallback + some more tests

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Clean up imports & clean up llm.api

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Trying to fix failing tests

* Remove __init__.py 2

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix failing test

* Trying to fix last failing test

---------

Signed-off-by: cuichenx <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Akoumparouli/mistral import instruct chat template fix (#9567)

* use bf16 by default mistral conv

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat template

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use capitalized role names

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Remove .cuda calls, use device instead (#9602)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix converter default args (#9565)

* fix converter default args

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* mixtral export (#9603)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix: remove non_blocking from PTL's .cuda call (#9618)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Alit/mamba tmp (#9612)

* adding mamba support

* fix import mixins

* rm convert jamba

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* more cleanups

* use GPT text gen

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* fixing gbs in TP convertor

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add reqs

* add tutorial

* minor fix to tutorial

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add mamba_tmp

* remove mamba import

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

---------

Signed-off-by: JRD971000 <[email protected]>
Signed-off-by: arendu <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: JRD971000 <[email protected]>
Co-authored-by: arendu <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* TitaNet Batch Verify Speaker (#9337)

* add batch_inference for verify_speakers method

Signed-off-by: [email protected] <[email protected]>

* remove not used package

Signed-off-by: [email protected] <[email protected]>

* change batch inference logic

Signed-off-by: [email protected] <[email protected]>

* fixup

Signed-off-by: [email protected] <[email protected]>

* requested changes

Signed-off-by: [email protected] <[email protected]>

* add verify_speakers_batch to docs

Signed-off-by: [email protected] <[email protected]>

* handle None durations in manifest

Signed-off-by: [email protected] <[email protected]>

* change logging text

Signed-off-by: [email protected] <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* check duration presence

Signed-off-by: [email protected] <[email protected]>

* add channel_selector to dataset configs

Signed-off-by: [email protected] <[email protected]>

---------

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Enable MCore checkpointing optimizations (#9505)

* Expose num processes in PyT Dist

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add parallel save/load optimizations from MCore

Signed-off-by: Mikołaj Błaż <[email protected]>

* Remove async utils from MCore

Signed-off-by: Mikołaj Błaż <[email protected]>

* Enable DistOpt paralell R/W

Signed-off-by: Mikołaj Błaż <[email protected]>

* Enable PyT Dist caching

Signed-off-by: Mikołaj Błaż <[email protected]>

* Small fixes

Signed-off-by: Mikołaj Błaż <[email protected]>

* Make sure DistCkptIO is instantiated from config

Signed-off-by: Mikołaj Błaż <[email protected]>

* Bump MCore version to v0.7

Signed-off-by: Mikołaj Błaż <[email protected]>

* Print load strategy

Signed-off-by: Mikołaj Błaż <[email protected]>

* Forward MCore to model space DistOpt

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add separate flag to control DistOpt paralell R/W

Signed-off-by: Mikołaj Błaż <[email protected]>

* Turn off parallel save by default

Signed-off-by: Mikołaj Błaż <[email protected]>

---------

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Change mixtral moe key name for trt-llm (#9620)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* change moe key values

Signed-off-by: Onur Yilmaz <[email protected]>

* add weight to the key

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix ckpt load bug (#9621)

* fix ckpt load bug

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* NeVA Minor Fixes (#9608)

* fix neva resume with empty param loaded for some pp stage

Signed-off-by: yaoyu-33 <[email protected]>

* fix crop size check

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix pretraining data sizes and weights (#9627)

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Alit/mamba (#9575)

* adding mamba support

* fix import mixins

* rm convert jamba

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* more cleanups

* use GPT text gen

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* fixing gbs in TP converter

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add reqs

* add tutorial

* minor fix to tutorial

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* add mamba dependencies

* add mcore tag

* modify dockerfile ci

* modify dockerfile ci

---------

Signed-off-by: JRD971000 <[email protected]>
Signed-off-by: arendu <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: JRD971000 <[email protected]>
Co-authored-by: arendu <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] async checkpointing support (#9466)

* add async checkpointing support

* fixes

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* add parallel read/write support and other optimizations

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* address comments, make dist checkpointing args configurable

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* fix small typo

Signed-off-by: ashors1 <[email protected]>

* Update default sharding type

Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Anna Shors <[email protected]>

* Update default sharding type

Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Anna Shors <[email protected]>

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Anna Shors <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix the arguments of forward_for_export function in msdd_models (#9624)

* Fix the arguments of forward_for_export function

Signed-off-by: Taejin Park <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

---------

Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: tango4j <[email protected]>
Co-authored-by: tango4j <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Change default parallel_save to False (#9632)

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Unwrap ckpt_io for model opt (async save) (#9622)

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* MCore T5 support for NeMo - Training (#9432)

* huvu/mcore_t5 first commit from local

* removing DEBUGGING prints

* cleaning megatron_lm_encoder_decoder_model.py code

* cleaning code

* adding Github action test

* only run mcore T5 test

* only run mcore T5 test

* only run mcore T5 test

* only run mcore T5 test

* reset .github/workflows/cicd-main.yml

* reset .github/workflows/cicd-main.yml

* adding condition self.mcore_t5 when running self.build_transformer_config()

* refactor megatron_lm_encoder_decoder_model.py to not use self.model

* only run T5-related tests

* remove all self.model

* reset cicd file

* reset cicd file

* updating code: remove duplicate if/else; adding mcore/transformer_engine to config file

* adjust +model.mcore_t5=True

* Apply isort and black reformatting

Signed-off-by: huvunvidia <[email protected]>

---------

Signed-off-by: huvunvidia <[email protected]>
Co-authored-by: Huy Vu2 <[email protected]>
Co-authored-by: huvunvidia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Expose transformer_layer_spec inside GPTConfig (#9592)

* Expose transformer_layer_spec inside GPTConfig

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Expose layer-specs

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Update NeMo Clip to Use MCore Modules (#9594)

* update clip model and config file

Signed-off-by: yaoyu-33 <[email protected]>

* update clip for mcore

Signed-off-by: yaoyu-33 <[email protected]>

* MCore CLIP Fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix no mask

Signed-off-by: yaoyu-33 <[email protected]>

* few neva fixes

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip module

Signed-off-by: yaoyu-33 <[email protected]>

* add siglip loss

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix collate fn

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip conversion script

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip convert

Signed-off-by: yaoyu-33 <[email protected]>

* clip fixes

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* clean up script

Signed-off-by: yaoyu-33 <[email protected]>

* clip fixe…

akoumpa added a commit that referenced this pull request Jul 25, 2024
* Adding context- & expert-parallism to MegatronStrategy (#9525)

Signed-off-by: Tugrul Konuk <[email protected]>

* Add CICD test for Stable Diffusion (#9464)

* Add CICD test for Stable Diffusion

Signed-off-by: Michal Futrega <[email protected]>

* Update cicd-main.yml

Signed-off-by: Michal Futrega <[email protected]>

* Use single gpu runner

Signed-off-by: Michal Futrega <[email protected]>

---------

Signed-off-by: Michal Futrega <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Akoumparouli/nemo ux mixtral (#9446)

* use default collate if dataset does not have one

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mixtral config

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add convert_state

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix StateDictTransform for 2D layers, e.g. MoE

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pass num_moe_experts to specs

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update MixtralModel

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mini docstring

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* update mcoreddp call (#9345)

* update mcoreddp call

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update mcore commits

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Llama and Gemma (#9528)

* add llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add llama3

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fix typo

Signed-off-by: Chen Cui <[email protected]>

* enable importers with multiple models

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add gemma

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* checks

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] minor logging bug fixes (#9529)

* minor exp_manager bug fixes

* remove print statement

* fix docstring

* fix AppState defaults

---------

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* mcore distOpt restore fix (#9421)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Custom Tiktoken tokenizer.

Signed-off-by: Tugrul Konuk <[email protected]>

* Fixed the tokenizer decoding on special tokens.

Signed-off-by: Tugrul Konuk <[email protected]>

* Apply isort and black reformatting

Signed-off-by: ertkonuk <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Added token_to_id() method.

Signed-off-by: Tugrul Konuk <[email protected]>
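The tokenizer commits above add a token_to_id() lookup to the custom Tiktoken tokenizer. A minimal sketch of what such a lookup API can look like — the class and vocabulary here are hypothetical illustrations, not the actual NeMo implementation:

```python
class ToyTokenizer:
    """Hypothetical sketch of a token_to_id()/id_to_token() pair,
    illustrating the lookup API added in this PR. Not NeMo code."""

    def __init__(self, vocab: dict):
        self._token_to_id = dict(vocab)                       # token -> id
        self._id_to_token = {i: t for t, i in vocab.items()}  # id -> token

    def token_to_id(self, token: str) -> int:
        return self._token_to_id[token]

    def id_to_token(self, token_id: int) -> str:
        return self._id_to_token[token_id]


tok = ToyTokenizer({"<s>": 0, "hello": 1, "world": 2})
print(tok.token_to_id("hello"))  # 1
print(tok.id_to_token(2))        # world
```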

* Update neva conversion script from and to HF (#9296)

* Update NeMo script

Signed-off-by: yaoyu-33 <[email protected]>

* Fix example scripts

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update convert_llava_nemo_to_hf.py

Signed-off-by: yaoyu-33 <[email protected]>

* address comments

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* vLLM Export Support (#9381)

* Export implementation for vLLM 0.4.3.

Supports LLAMA2, Mistral, Mixtral (unverified), Gemma and StarCoder2 models.

The nemo.export.tensorrt_llm alias was removed to avoid initializing TRT-LLM when importing anything from nemo.export.

Signed-off-by: Alexey Panteleev <[email protected]>

* Fixed some CodeQL warnings.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Removed empty files.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Updated the integration for vLLM 0.5.0.

Signed-off-by: Alexey Panteleev <[email protected]>

* Updated the vLLM deployment interface to use max_output_len instead of max_output_token.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Moved the Exporter class to nemo/export and renamed its file to vllm_exporter.py, to be more similar to TRT-LLM.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Implemented vLLM support in the export tests, added functional testing, implemented forward evaluation on vLLM without Triton.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Moved the vLLM deployment functionality to the common deploy_triton.py script.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Fixed the CodeQL discovered issues.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Fixed one more return of a wrong dimensionality...

Signed-off-by: Alexey Panteleev <[email protected]>

* More wrong dimensionality returns.

Signed-off-by: Alexey Panteleev <[email protected]>

---------

Signed-off-by: Alexey Panteleev <[email protected]>
Signed-off-by: apanteleev <[email protected]>
Co-authored-by: apanteleev <[email protected]>
Co-authored-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* PL: Delete precision if using plugin. TODO switch to MegatronTrainerBuilder (#9535)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add page context fmha (#9526)

Signed-off-by: Tugrul Konuk <[email protected]>

* extend get_gpt_layer_modelopt_spec to support MoE (#9532)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix mock data generation for legacy dataset (#9530)

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] IO fixes (#9512)

* Improve IOMixin.io_transform_args to handle dataclasses better

* Dump task json + img inside NeMoLogger

* Adding store_io to train task

* Update opt.connect to also propagate to __io__

* Rename opt to optim for consistency

* Moving to using safe serialization using fiddle, only use cloudpickle when needed

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Using Config from fiddle instead of sdk for now

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move enable_nemo_ckpt_io from MegatronStrategy to ModelCheckpoint

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move nemo-ckpt to _get_finalize_save_checkpoint_callback

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Update TrainerContext & io.load_ckpt

* Use renamed TrainerContext inside ModelCheckpoint

* Remove double io saving

* Rename lightning.pytorch.opt -> optim

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove store_io from train-task

* Adding fiddle-extension for torch

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move fdl_torch import

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding dtype to serialization

* Some fixes

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Make TransformerConfig inherit from IOMixin to fix serialization error

* Make TransformerConfig inherit from IOMixin to fix serialization error

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Add support for BuiltinFunctionType

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Add missing import

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix dataclass fields

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Test C++ runtime on demand in nemo_export.py to avoid possible OOMs (#9544)

* Add test_cpp_runtime flag

Signed-off-by: Jan Lasek <[email protected]>

* Apply isort and black reformatting

Signed-off-by: janekl <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Co-authored-by: janekl <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix lhotse tests for v1.24.2 (#9546)

* Fix lhotse tests for v1.24.0

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix RIR test

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* gpu_unitTests_notOptional (#9551)

Signed-off-by: Tugrul Konuk <[email protected]>

* add reset learning rate functionality (#9372)

* add reset_lr functionality

Signed-off-by: dimapihtar <[email protected]>

* fix reset_lr logic

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* move reset_lr from optim section

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr value to config

Signed-off-by: dimapihtar <[email protected]>

* set reset_lr False by default

Signed-off-by: dimapihtar <[email protected]>

* remove extra line

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr test

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr test

Signed-off-by: dimapihtar <[email protected]>

* remove extra quote

Signed-off-by: dimapihtar <[email protected]>

* add ability to reset schedule's max_steps and decay_steps

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* change scheduler's first step logic when using reset_lr

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* fix reset_lr logic

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* update reset_lr comments

Signed-off-by: dimapihtar <[email protected]>

* add use cases for reset_lr feature

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
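The reset_lr commits above restart the learning-rate schedule mid-run and add the ability to reset the schedule's max_steps and decay_steps. One plausible sketch of that bookkeeping — the function name and the offset logic are assumptions for illustration, not the actual NeMo implementation:

```python
def reset_schedule_steps(completed_steps: int, max_steps: int, decay_steps: int):
    """Hypothetical sketch: when a reset_lr-style flag restarts the schedule,
    the remaining max_steps/decay_steps can be offset by the steps already
    completed so the fresh schedule effectively begins at step zero.
    Assumed logic, not NeMo's."""
    return (max(max_steps - completed_steps, 0),
            max(decay_steps - completed_steps, 0))


print(reset_schedule_steps(100, 1000, 900))  # (900, 800)
```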

* Add Python AIStore SDK to container and bump min Lhotse version (#9537)

* Add Python AIStore SDK to requirements and bump min Lhotse version

Signed-off-by: Piotr Żelasko <[email protected]>

* Move AIStore Python SDK to Dockerfile, remove matplotlib/ipywidgets deps

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Adding 'use_dynamo' option for export to use onnx.dynamo_export() instead of onnx.export() (#9147)

* Initial WARs to implement dynamo option for export

Signed-off-by: Boris Fomitchev <[email protected]>

* including weights in .onnx

Signed-off-by: Boris Fomitchev <[email protected]>

* dynamo_export works for many small models

Signed-off-by: Boris Fomitchev <[email protected]>

* External weights behaviour fixed

Signed-off-by: Boris Fomitchev <[email protected]>

* Cleanup

Signed-off-by: Boris Fomitchev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: borisfom <[email protected]>

* print cleaned up

Signed-off-by: Boris Fomitchev <[email protected]>

* Added overloadable dynamic_shapes_for_export

Signed-off-by: Boris Fomitchev <[email protected]>

* Addressing code review

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing CI issues

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing CI test failure

Signed-off-by: Boris Fomitchev <[email protected]>

* Eliminated test cross-contamination

Signed-off-by: Boris Fomitchev <[email protected]>

---------

Signed-off-by: Boris Fomitchev <[email protected]>
Signed-off-by: borisfom <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Fix tokenizer IO (#9555)

* Adding tokenizer to io-test + making it pass

* Handling tokenizer correctly inside dump_io

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Removing not used import

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo UX] Move mistral_7b.py to mistral.py (#9545)

* Move mistral_7b.py to mistral.py

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rename MixtralConfig to MixtralConfig8x7B

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mistral rename: mistralconfig7b & mistralmodel

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Use closed-formula to round by multiple (#9307)

* Use closed-formula to round by multiple

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
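The "round by multiple" commit above replaces iterative rounding with a closed formula. A sketch of the standard closed-form round-up — the function name is illustrative, not necessarily the one used in the PR:

```python
def round_up_to_multiple(value: int, multiple: int) -> int:
    # Closed-form round-up: adding (multiple - 1) before the integer
    # division makes the floor land on the next multiple, so no loop
    # or branch is needed.
    return ((value + multiple - 1) // multiple) * multiple


print(round_up_to_multiple(7, 4))   # 8
print(round_up_to_multiple(8, 4))   # 8
print(round_up_to_multiple(0, 4))   # 0
```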

* ci: Do not attempt to send slack on fork (#9556)

* ci: Do not attempt to send slack on fork

Signed-off-by: Oliver Koenig <[email protected]>

* test

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix nemo export test (#9547)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* fix export test

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix SDXL incorrect name in docs (#9534)

Signed-off-by: Tugrul Konuk <[email protected]>

* GPU unit tests: Mark flaky tests to be fixed (#9559)

Signed-off-by: Tugrul Konuk <[email protected]>

* Bump PTL version (#9557)

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Resiliency] Straggler detection (#9473)

* Initial straggler det impl

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed CI code checks

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Removed unused import

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* remove submodule

Signed-off-by: Maanu Grover <[email protected]>

* Updated documentation; Updated callback params; Cosmetic changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed straggler det config; Added basic test

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixes in test_straggler_det.py

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Updated straggler callback API

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* stop_if_detected=False by default

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* switch to torch_dist as default dist checkpointing backend (#9541)

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Checkpointing bug fixes (#9562)

* fix checkpoint loading

* fix

* fixes

* another fix

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add tps and pps params to the export script (#9558)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* fix export test

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* remove n_gpus param

Signed-off-by: Onur Yilmaz <[email protected]>

* add and fix parameters

Signed-off-by: Onur Yilmaz <[email protected]>

* fix deploy script

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* rename tps and pps params

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: oyilmaz-nvidia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Consolidate gpt continue training script into pretraining script (#9413)

* Consolidate gpt continue training with pretraining

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix default config

Signed-off-by: yaoyu-33 <[email protected]>

* Add github action cicd

Signed-off-by: yaoyu-33 <[email protected]>

* extract _integrate_original_checkpoint_data as a method

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix getattr

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "Add github action cicd"

This reverts commit a453f16ba2be6413db932623009da893208acdd5.

* Update comments in nlp_overrides.py

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add support to change Multi task model prompt (#9542)

* Add support to change Multi task model prompt

Signed-off-by: smajumdar <[email protected]>

* Add support to change Multi task model prompt

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Update nemo/collections/common/prompts/formatter.py

Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>

* Address comments

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Address comments

Signed-off-by: smajumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: titu1994 <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add Multimodal Exporter (#9256)

* Add video-neva TRT export

* Add TRT inference

* Change config

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change export params

* Remove unused import

* Add neva export

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change unpack nemo

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Add trt infer config

* Fix neva trt inference

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Add exporter

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix infer

* Add PyTriton

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix deploy wrong dim

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change to pass PIL Image

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix video neva deploy

* Change query

* Change deploy

* Remove unused import

* Change ptuning

* Change to mm exporter

* Add script

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix script

---------

Signed-off-by: meatybobby <[email protected]>
Co-authored-by: meatybobby <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Enable encoder adapters for Canary and MultiTaskAED models (#9409)

* Fix assertions for adapter types

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Cleanup

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Finalize support for decoder adapters

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* fix the freeze/unfreeze problem by replacing as_frozen with torch.inference_mode

* Apply isort and black reformatting

Signed-off-by: weiqingw4ng <[email protected]>

* Update tests to new generic way of module update

Signed-off-by: smajumdar <[email protected]>

* Finalize code for update module

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Fix variable name

Signed-off-by: smajumdar <[email protected]>

* Finalize projection support for transformer mha adapters

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Correct implementation of freeze restore

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Corrects the implementation of replace_adapter_modules to limit to just the top level modules

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Remove registration of Transformer MHA

Signed-off-by: smajumdar <[email protected]>

* Remove registration of Transformer MHA

Signed-off-by: smajumdar <[email protected]>

* Address reviewer comments

Signed-off-by: smajumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: titu1994 <[email protected]>
Signed-off-by: weiqingw4ng <[email protected]>
Co-authored-by: Weiqing Wang <[email protected]>
Co-authored-by: weiqingw4ng <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* pass option through (#9570)

Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* PTQ refinements (#9574)

* Rename megatron_gpt_quantization -> megatron_gpt_ptq

Signed-off-by: Jan Lasek <[email protected]>

* Configure export.save_path as dir or tarball

Signed-off-by: Jan Lasek <[email protected]>

* PTQ docs update

Signed-off-by: Jan Lasek <[email protected]>

* Make model_type optional in case of quantized checkpoints

Signed-off-by: Jan Lasek <[email protected]>

* Drop unused save_nemo_model_config argument

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
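The "Configure export.save_path as dir or tarball" commit above lets one destination argument select between two output layouts. A minimal stdlib sketch of that dispatch, assuming a suffix check on the path (function name `save_export` and the `.tar` convention are illustrative, not the actual NeMo export API):

```python
import os
import tarfile
import tempfile


def save_export(files: dict, save_path: str) -> None:
    """Write `files` ({relative name: bytes}) into a directory, or into a
    tarball when `save_path` ends with '.tar'. Illustrative sketch only."""
    if save_path.endswith(".tar"):
        with tarfile.open(save_path, "w") as tar:
            for name, data in files.items():
                tmp = os.path.join(tempfile.mkdtemp(), os.path.basename(name))
                with open(tmp, "wb") as f:
                    f.write(data)
                tar.add(tmp, arcname=name)  # store under the relative name
    else:
        os.makedirs(save_path, exist_ok=True)
        for name, data in files.items():
            with open(os.path.join(save_path, name), "wb") as f:
                f.write(data)


# Demo: save one file as a tarball and list its members.
with tempfile.TemporaryDirectory() as d:
    tar_path = os.path.join(d, "export.tar")
    save_export({"config.json": b"{}"}, tar_path)
    with tarfile.open(tar_path) as tar:
        names = tar.getnames()
```

The same call site then works unchanged whether the operator points `save_path` at a directory or an archive.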

* Audio model collection (#9263)

* Audio model collection

Signed-off-by: Ante Jukić <[email protected]>

* Apply isort and black reformatting

Signed-off-by: anteju <[email protected]>

* Fix imports

Signed-off-by: Ante Jukić <[email protected]>

* Addressed PR comments

Signed-off-by: Ante Jukić <[email protected]>

* Apply isort and black reformatting

Signed-off-by: anteju <[email protected]>

---------

Signed-off-by: Ante Jukić <[email protected]>
Signed-off-by: anteju <[email protected]>
Co-authored-by: anteju <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Fix Trainer serialization (#9571)

* Fix Trainer serialization

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Update click version requirement (#9580)

Signed-off-by: Dong Hyuk Chang <[email protected]>
Co-authored-by: Dong Hyuk Chang <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Fault tolerance] Heartbeat detection (#9352)

* Fault tolerance related changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Cosmetic changes in documentation

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Doc update round2

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Co-authored-by: Jacek Bieniusiewicz <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add ModelOpt QAT example for Llama2 SFT model (#9326)

* add INT4 QAT example for Llama2 SFT model

Signed-off-by: Keval Morabia <[email protected]>

* Add config parameter to control kv cache quantization

Signed-off-by: Keval Morabia <[email protected]>

* Fix typo in cicd-main.yml for QAT test

Signed-off-by: Keval Morabia <[email protected]>

* fix nlp_overrides.py

Signed-off-by: Keval Morabia <[email protected]>

* address reviewer feedback

Signed-off-by: Keval Morabia <[email protected]>

* quantize unwrapped model

Signed-off-by: Keval Morabia <[email protected]>

* add compress export argument for qat config

Signed-off-by: Keval Morabia <[email protected]>

---------

Signed-off-by: Keval Morabia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Set TE flag in legacy -> mcore conversion script (#9585)

* set TE flag

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Add fabric-API for manual forward-pass (#9577)

* First pass over fabric-API

* Adding Trainer -> Fabric conversion

* Some small fixes to get a forward-pass in Fabric working

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding doc-string to Fabric.import_model

* Adding track_io to io_init of Fabric

* Fix Fabric.load_model + add doc-string

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove unused import

* Some small fixes

* Fix failing test

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Add SDK-factories to llm-collection (#9589)

* Adding sdk-factories to llm-collection

* Removing _model from mistral + mixtral

* Expose lr_scheduler inside lightning

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Multimodal projection layer adapter fix for PP>1 (#9445)

* enabling multimodal adapters to load in PP>1

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* parameterizing validate_access_integrity, set to false when PP>1

Signed-off-by: paul-gibbons <[email protected]>

formatting fix

Signed-off-by: paul-gibbons <[email protected]>

Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* update nlp_model.py

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* update modelPT with validate_access_integrity

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* updating save_restore_connector w/ validate_access_integrity

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* addressing comment

Signed-off-by: paul-gibbons <[email protected]>

* adding validate_access_integrity to super().load_config_and_state_dict()

Signed-off-by: paul-gibbons <[email protected]>

* testing reorder of validate_access_integrity for CI failures

Signed-off-by: paul-gibbons <[email protected]>

---------

Signed-off-by: paul-gibbons <[email protected]>
Signed-off-by: paul-gibbons <[email protected]>
Co-authored-by: paul-gibbons <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add offline quantization script for QLoRA deployment (#9455)

* add qlora offline quantization script

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* clean

Signed-off-by: Chen Cui <[email protected]>

* docstring

Signed-off-by: Chen Cui <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* qlora support more models (#9488)

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Some improvements to NeMoLogger (#9591)

Signed-off-by: Tugrul Konuk <[email protected]>

* Set n_gpu to None in nemo export (#9593)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* set ngpus to None

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Inflight nemo model export support (#9527)

* online model conversion and refit

Signed-off-by: Jimmy Zhang <[email protected]>

* clean code

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup

Signed-off-by: Jimmy Zhang <[email protected]>

* add refit, cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* combine weight conversion functions

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <[email protected]>

* remove debug print

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* fix single gpu and cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <[email protected]>

---------

Signed-off-by: JimmyZhang12 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* vLLM Export Improvements (#9596)

* Separated the vLLM export functionality from the common deployment script into deploy_vllm_triton.py.

Signed-off-by: Alexey Panteleev <[email protected]>

* Fixed vocab_size for LLAMA3.

Signed-off-by: Alexey Panteleev <[email protected]>

* Export test: fixed deployment testing w/o Megatron, made functional tests optional, added --gpu_memory_utilization.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Addressing review and CodeQL comments.

Signed-off-by: Alexey Panteleev <[email protected]>

---------

Signed-off-by: Alexey Panteleev <[email protected]>
Signed-off-by: apanteleev <[email protected]>
Co-authored-by: apanteleev <[email protected]>
Co-authored-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Set finalize_model_grads_func in on_fit_start instead to make sure it's being called (#9599)

Signed-off-by: Tugrul Konuk <[email protected]>

* Set no_sync_func & grad_sync_func (#9601)

* Set no_sync_func & grad_sync_func

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* set overlap_param_sync

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
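The `no_sync_func` wiring above controls when gradient all-reduce fires during gradient accumulation: sync only on the last micro-batch of the window. A toy pure-Python sketch of that pattern (the class and counter are illustrative; Megatron passes real `no_sync_func`/`grad_sync_func` callables into its training loop):

```python
class NoSyncToy:
    """Toy model of DDP's no_sync behavior: skip the gradient all-reduce
    on every micro-batch except the last one in an accumulation window."""

    def __init__(self):
        self.syncs = 0

    def backward(self, sync: bool) -> None:
        if sync:
            self.syncs += 1  # stands in for the gradient all-reduce


def run_accumulation(model: NoSyncToy, num_microbatches: int) -> None:
    for i in range(num_microbatches):
        # Only the final micro-batch triggers communication.
        model.backward(sync=(i == num_microbatches - 1))


model = NoSyncToy()
run_accumulation(model, num_microbatches=4)
```

Deferring the sync this way avoids paying communication cost once per micro-batch when only the accumulated gradient matters.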

* small nemo logger bug fix (#9607)

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix the dict format returned by scheduler method (#9609)

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
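The scheduler-dict fix above concerns the shape PyTorch Lightning expects back from `configure_optimizers`: an `"optimizer"` key, plus an optional `"lr_scheduler"` sub-dict holding the scheduler and its interval. A small illustrative validator for that documented shape (the validator itself is a sketch, not NeMo code, and uses placeholder strings in place of real optimizer/scheduler objects):

```python
def validate_scheduler_dict(cfg: dict) -> bool:
    """Check a Lightning-style `configure_optimizers` return dict."""
    if "optimizer" not in cfg:
        return False
    sched = cfg.get("lr_scheduler")
    if sched is None:
        return True  # a scheduler is optional
    if not isinstance(sched, dict) or "scheduler" not in sched:
        return False
    # Lightning steps the scheduler per "step" or per "epoch".
    return sched.get("interval", "epoch") in ("step", "epoch")


good = {"optimizer": "opt", "lr_scheduler": {"scheduler": "sched", "interval": "step"}}
bad = {"optimizer": "opt", "lr_scheduler": "sched"}  # scheduler not wrapped in a dict
```

Returning the bare scheduler instead of the wrapped dict is exactly the kind of mismatch the commit above repairs.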

* [NeMo-UX] Dataloading enhancements and bug fixes (#9595)

* fix dataloading + checkpoint restore

* clean up data sampler

* fix typo

* support passing multiple paths to data module

* fix validation dataloader

* fix dataloader len when using gradient accumulation

* fix progress bar

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* fix step count in loggers

* fix blended dataset

* address comments

* address comment

* move step logging into strategy

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix serialization of AutoResume (#9616)

* fix serialization of autoresume

* update undefined variables

Signed-off-by: Tugrul Konuk <[email protected]>

* Chat template support for megatron_gpt_eval.py (#9354)

* Bump PTL version (#9557)

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* [Resiliency] Straggler detection (#9473)

* Initial straggler det impl

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed CI code checks

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Removed unused import

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* remove submodule

Signed-off-by: Maanu Grover <[email protected]>

* Updated documentation; Updated callback params; Cosmetic changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed straggler det config; Added basic test

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixes in test_straggler_det.py

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Updated straggler callback API

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* stop_if_detected=False by default

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move model loading to separate function; call toContainer once; pad using closed formula

Signed-off-by: Alexandros Koumparoulis <[email protected]>
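The "pad using closed formula" change above replaces loop-based padding with constant-time arithmetic. A minimal sketch of the formula (the helper name `pad_to_multiple` is illustrative, not the actual NeMo function):

```python
def pad_to_multiple(length: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= `length`, computed in O(1)."""
    return ((length + multiple - 1) // multiple) * multiple


# Example: align token-sequence lengths to a bucket size of 8.
lengths = [1, 7, 8, 9, 17]
padded = [pad_to_multiple(n, 8) for n in lengths]
```

An exact multiple maps to itself, so no extra padding is added when none is needed.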

* read prompts from file

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* If input prompt contains dict, apply model.tokenizer.chat_template

Signed-off-by: Alexandros Koumparoulis <[email protected]>
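The commit above routes dict-shaped prompts through `model.tokenizer.chat_template`. A toy stand-in showing the mechanics of rendering a list of role/content dicts into one prompt string (the template string and function name are illustrative; real tokenizers ship their own template formats):

```python
def apply_chat_template(messages, template=None) -> str:
    """Render a list of {'role', 'content'} dicts into a single prompt string.

    Toy stand-in for `model.tokenizer.chat_template`; purely illustrative.
    """
    if template is None:
        template = "<|{role}|>\n{content}\n"
    return "".join(template.format(**m) for m in messages)


prompt = apply_chat_template(
    [
        {"role": "User", "content": "Hello"},
        {"role": "Assistant", "content": "Hi there"},
    ]
)
```

Plain string prompts can bypass this path untouched, which is why the eval script only applies the template when the input parses to a dict.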

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* apply @Gal Leibovich's patch

Taken from: https://github.com/NVIDIA/NeMo/commit/17572905344db4692583e72799d55801a8860f35
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rename prompts_file to prompts_jsonl

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat_template param

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add ChatTemplateMixin to SentencePieceTokenizer

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat-template to text-gen-strat

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move load prompts to separate file

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove chat-template from text-gen-utils

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* make chat-template more generic

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add assert message

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* small refactor for chat_template_mixin

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo ckpt conv changes

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move rounding to function

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Jsonl support (#9611)

* Adding support to preprocess .jsonl and .jsonl.gz files in input directory

Signed-off-by: adityavavre <[email protected]>

* Adding support to preprocess .jsonl and .jsonl.gz files in input directory

Signed-off-by: adityavavre <[email protected]>

* Apply isort and black reformatting

Signed-off-by: adityavavre <[email protected]>

---------

Signed-off-by: adityavavre <[email protected]>
Signed-off-by: adityavavre <[email protected]>
Co-authored-by: adityavavre <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
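The Jsonl-support PR above adds preprocessing of both `.jsonl` and `.jsonl.gz` inputs. A self-contained stdlib sketch of transparent reading for either extension (the reader name `read_jsonl` is illustrative, not the actual NeMo preprocessing entry point):

```python
import gzip
import json
import os
import tempfile


def read_jsonl(path: str):
    """Yield one parsed record per non-empty line of a .jsonl or .jsonl.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Round-trip demo: one plain file, one gzipped file.
with tempfile.TemporaryDirectory() as d:
    plain = os.path.join(d, "a.jsonl")
    packed = os.path.join(d, "b.jsonl.gz")
    with open(plain, "w", encoding="utf-8") as f:
        f.write('{"text": "hello"}\n')
    with gzip.open(packed, "wt", encoding="utf-8") as f:
        f.write('{"text": "world"}\n')
    records = list(read_jsonl(plain)) + list(read_jsonl(packed))
```

Dispatching on the filename suffix keeps the per-line parsing identical for compressed and uncompressed shards.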

* [NeMo-UX] Add PEFT (#9490)

* initial commit for PEFT in nemo2

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* address comments

Signed-off-by: Chen Cui <[email protected]>

* make import easier

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* address comments

Signed-off-by: Chen Cui <[email protected]>

* Update nemo/collections/llm/peft/lora.py

Signed-off-by: Marc Romeyn <[email protected]>

* Some small fixes + adding more doc-strings

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding ModelTransform callback

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fixing type-hint for model_transform

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* fix import

Signed-off-by: Chen Cui <[email protected]>

* model transform for gemma llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fix model transform

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* change lora target default to all linear modules

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* Small fix in mixtral

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Integrating PEFT to the public-API + some fixes

* Big refactor to allow to load adapter-states

* Some fixes to support adapter_path

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Disabling ckpt reloading when adapter_path is passed

* Fix CLI

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove commented-out code

* Remove commented-out code

* Remove un-used import

* Fix callback imports

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fixing llm.pretrain

* Some small fixes

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix missing import + type-hint in finetune

* Adding PreemptionCallback + some more tests

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Clean up imports & clean up llm.api

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Trying to fix failing tests

* Remove __init__.py 2

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix failing test

* Trying to fix last failing test

---------

Signed-off-by: cuichenx <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Akoumparouli/mistral import instruct chat template fix (#9567)

* use bf16 by default for mistral conv

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat template

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use capitalized role names

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Remove .cuda calls, use device instead (#9602)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix converter default args (#9565)

* fix converter default args

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* mixtral export (#9603)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix: remove non_blocking from PTL's .cuda call (#9618)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Alit/mamba tmp (#9612)

* adding mamba support

* fix import mixins

* rm convert jamba

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* more cleanups

* use GPT text gen

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* fixing gbs in TP converter

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add reqs

* add tutorial

* minor fix to tutorial

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add mamba_tmp

* remove mamba import

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

---------

Signed-off-by: JRD971000 <[email protected]>
Signed-off-by: arendu <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: JRD971000 <[email protected]>
Co-authored-by: arendu <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* TitaNet Batch Verify Speaker (#9337)

* add batch_inference for verify_speakers method

Signed-off-by: [email protected] <[email protected]>

* remove not used package

Signed-off-by: [email protected] <[email protected]>

* change batch inference logic

Signed-off-by: [email protected] <[email protected]>

* fixup

Signed-off-by: [email protected] <[email protected]>

* requested changes

Signed-off-by: [email protected] <[email protected]>

* add verify_speakers_batch to docs

Signed-off-by: [email protected] <[email protected]>

* handle None durations in manifest

Signed-off-by: [email protected] <[email protected]>

* change logging text

Signed-off-by: [email protected] <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* check duration presence

Signed-off-by: [email protected] <[email protected]>

* add channel_selector to dataset configs

Signed-off-by: [email protected] <[email protected]>

---------

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Enable MCore checkpointing optimizations (#9505)

* Expose num processes in PyT Dist

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add parallel save/load optimizations from MCore

Signed-off-by: Mikołaj Błaż <[email protected]>

* Remove async utils from MCore

Signed-off-by: Mikołaj Błaż <[email protected]>

* Enable DistOpt parallel R/W

Signed-off-by: Mikołaj Błaż <[email protected]>

* Enable PyT Dist caching

Signed-off-by: Mikołaj Błaż <[email protected]>

* Small fixes

Signed-off-by: Mikołaj Błaż <[email protected]>

* Make sure DistCkptIO is instantiated from config

Signed-off-by: Mikołaj Błaż <[email protected]>

* Bump MCore version to v0.7

Signed-off-by: Mikołaj Błaż <[email protected]>

* Print load strategy

Signed-off-by: Mikołaj Błaż <[email protected]>

* Forward MCore to model space DistOpt

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add separate flag to control DistOpt parallel R/W

Signed-off-by: Mikołaj Błaż <[email protected]>

* Turn off parallel save by default

Signed-off-by: Mikołaj Błaż <[email protected]>

---------

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Change mixtral moe key name for trt-llm (#9620)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* change moe key values

Signed-off-by: Onur Yilmaz <[email protected]>

* add weight to the key

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix ckpt load bug (#9621)

* fix ckpt load bug

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* NeVA Minor Fixes (#9608)

* fix neva resume with empty param loaded for some pp stage

Signed-off-by: yaoyu-33 <[email protected]>

* fix crop size check

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix pretraining data sizes and weights (#9627)

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Alit/mamba (#9575)

* adding mamba support

* fix import mixins

* rm convert jamba

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* more cleanups

* use GPT text gen

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* fixing gbs in TP converter

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add reqs

* add tutorial

* minor fix to tutorial

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* add mamba dependencies

* add mcore tag

* modify dockerfile ci

* modify dockerfile ci

---------

Signed-off-by: JRD971000 <[email protected]>
Signed-off-by: arendu <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: JRD971000 <[email protected]>
Co-authored-by: arendu <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] async checkpointing support (#9466)

* add async checkpointing support

* fixes

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* add parallel read/write support and other optimizations

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* address comments, make dist checkpointing args configurable

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* fix small typo

Signed-off-by: ashors1 <[email protected]>

* Update default sharding type

Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Anna Shors <[email protected]>

* Update default sharding type

Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Anna Shors <[email protected]>

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Anna Shors <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix the arguments  of forward_for_export function in msdd_models (#9624)

* Fix the arguments  of forward_for_export function

Signed-off-by: Taejin Park <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

---------

Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: tango4j <[email protected]>
Co-authored-by: tango4j <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Change default parallel_save to False (#9632)

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Unwrap ckpt_io for model opt (async save) (#9622)

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* MCore T5 support for NeMo - Training (#9432)

* huvu/mcore_t5 first commit from local

* removing DEBUGGING prints

* cleaning megatron_lm_encoder_decoder_model.py code

* cleaning code

* adding Github action test

* only run mcore T5 test

* only run mcore T5 test

* only run mcore T5 test

* only run mcore T5 test

* reset .github/workflows/cicd-main.yml

* reset .github/workflows/cicd-main.yml

* adding condition self.mcore_t5 when running self.build_transformer_config()

* refactor megatron_lm_encoder_decoder_model.py to not use self.model

* only run T5-related tests

* remove all self.model

* reset cicd file

* reset cicd file

* updating code: remove duplicate if/else; adding mcore/transformer_engine to config file

* adjust +model.mcore_t5=True

* Apply isort and black reformatting

Signed-off-by: huvunvidia <[email protected]>

---------

Signed-off-by: huvunvidia <[email protected]>
Co-authored-by: Huy Vu2 <[email protected]>
Co-authored-by: huvunvidia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Expose transformer_layer_spec inside GPTConfig (#9592)

* Expose transformer_layer_spec inside GPTConfig

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Expose layer-specs

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Update NeMo Clip to Use MCore Modules (#9594)

* update clip model and config file

Signed-off-by: yaoyu-33 <[email protected]>

* update clip for mcore

Signed-off-by: yaoyu-33 <[email protected]>

* MCore CLIP Fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix no mask

Signed-off-by: yaoyu-33 <[email protected]>

* few neva fixes

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip module

Signed-off-by: yaoyu-33 <[email protected]>

* add siglip loss

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix collate fn

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip conversion script

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip convert

Signed-off-by: yaoyu-33 <[email protected]>

* clip fixes

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* clean up script

Signed-off-by: yaoyu-33 <[email protected]>

* clip fixes

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix code styles

Signed-off-by: yaoyu-33 <[email protected]>

* Update siglip_loss.py

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add REST API to deploy module (#9539)

* Add REST API and FastAPI to deploy module

Signed-off-by: Abhishree <[email protected]>

* Add NemoQuery and requirements

Signed-off-by: Abhishree <[email protected]>

* Edit path for config.json

Signed-off-by: Abhishree <[email protected]>

* Add modifications for REST API for the correct functionality

Move service dir under deploy
Use NeMoQueryLLM instead of NemoQuery

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Apply isort and black reformatting

Signed-off-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com>

* Change default port for REST Service

Change the default port for the REST service, as the Triton server also uses the same port by default.

Signed-off-by: Abhishree Thittenamane <[email protected]>

* Apply isort and black reformatting

Signed-off-by: athitten <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Signed-off-by: athitten <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: athitten <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Mistral + Mixtral Support for NeVa (#9459)

* mistral template support

Signed-off-by: paul-gibbons <[email protected]>

* get_specs neva fix

Signed-off-by: paul-gibbons <[email protected]>

* mistral update

Signed-off-by: paul-gibbons <[email protected]>

* fixed mistral tokenization

Signed-off-by: paul-gibbons <[email protected]>

* t…
BoxiangW pushed a commit to BoxiangW/NeMo that referenced this pull request Jul 30, 2024
* Adding context- & expert-parallism to MegatronStrategy (#9525)

Signed-off-by: Tugrul Konuk <[email protected]>

* Add CICD test for Stable Diffusion (#9464)

* Add CICD test for Stable Diffusion

Signed-off-by: Michal Futrega <[email protected]>

* Update cicd-main.yml

Signed-off-by: Michal Futrega <[email protected]>

* Use single gpu runner

Signed-off-by: Michal Futrega <[email protected]>

---------

Signed-off-by: Michal Futrega <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Akoumparouli/nemo ux mixtral (#9446)

* use default collate if dataset does not have one

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mixtral config

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add convert_state

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix StateDictTransform for 2D layers, e.g. MoE

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pass num_moe_experts to specs

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update MixtralModel

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mini docstring

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* update mcoreddp call (#9345)

* update mcoreddp call

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update mcore commits

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Llama and Gemma (#9528)

* add llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add llama3

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fix typo

Signed-off-by: Chen Cui <[email protected]>

* enable importers with multiple models

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add gemma

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* checks

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] minor logging bug fixes (#9529)

* minor exp_manager bug fixes

* remove print statement

* fix docstring

* fix AppState defaults

---------

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* mcore distOpt restore fix (#9421)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Custom Tiktoken tokenizer.

Signed-off-by: Tugrul Konuk <[email protected]>

* Fixed the tokenizer decoding on special tokens.

Signed-off-by: Tugrul Konuk <[email protected]>

* Apply isort and black reformatting

Signed-off-by: ertkonuk <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Added token_to_id() method.

Signed-off-by: Tugrul Konuk <[email protected]>
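
The three tokenizer commits above (custom Tiktoken tokenizer, the special-token decoding fix, and the new token_to_id() method) can be illustrated with a minimal stand-in. Everything below is a hypothetical sketch over a toy vocab dict, not NeMo's actual TiktokenTokenizer:

```python
# Minimal sketch of a Tiktoken-style tokenizer wrapper. All names here
# (SimpleTiktokenizer, the toy vocab) are illustrative assumptions.

class SimpleTiktokenizer:
    def __init__(self, vocab, special_tokens):
        # vocab maps token string -> id; special tokens get ids after the vocab.
        self.vocab = dict(vocab)
        self.special_tokens = {t: len(vocab) + i for i, t in enumerate(special_tokens)}
        self.id_to_token = {i: t for t, i in {**self.vocab, **self.special_tokens}.items()}

    def token_to_id(self, token):
        # The kind of lookup the new token_to_id() method provides:
        # special tokens are resolved first, then the regular vocab.
        if token in self.special_tokens:
            return self.special_tokens[token]
        return self.vocab[token]

    def ids_to_text(self, ids):
        # Decoding skips special tokens instead of emitting them verbatim,
        # which is the behavior the "decoding on special tokens" fix targets.
        special_ids = set(self.special_tokens.values())
        return "".join(self.id_to_token[i] for i in ids if i not in special_ids)
```

For example, with `SimpleTiktokenizer({"he": 0, "llo": 1}, ["<s>", "</s>"])`, `token_to_id("</s>")` resolves to 3 and `ids_to_text([2, 0, 1, 3])` drops both specials before joining the pieces.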

* Update neva conversion script from and to HF (#9296)

* Update NeMo script

Signed-off-by: yaoyu-33 <[email protected]>

* Fix example scripts

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update convert_llava_nemo_to_hf.py

Signed-off-by: yaoyu-33 <[email protected]>

* address comments

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* vLLM Export Support (#9381)

* Export implementation for vLLM 0.4.3.

Supports LLAMA2, Mistral, Mixtral (unverified), Gemma and StarCoder2 models.

The nemo.export.tensorrt_llm alias was removed to avoid initializing TRT-LLM when importing anything from nemo.export.

Signed-off-by: Alexey Panteleev <[email protected]>

* Fixed some CodeQL warnings.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Removed empty files.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Updated the integration for vLLM 0.5.0.

Signed-off-by: Alexey Panteleev <[email protected]>

* Updated the vLLM deployment interface to use max_output_len instead of max_output_token.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Moved the Exporter class to nemo/export and renamed its file to vllm_exporter.py, to be more similar to TRT-LLM.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Implemented vLLM support in the export tests, added functional testing, implemented forward evaluation on vLLM without Triton.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Moved the vLLM deployment functionality to the common deploy_triton.py script.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Fixed the CodeQL discovered issues.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Fixed one more return of a wrong dimensionality...

Signed-off-by: Alexey Panteleev <[email protected]>

* More wrong dimensionality returns.

Signed-off-by: Alexey Panteleev <[email protected]>

---------

Signed-off-by: Alexey Panteleev <[email protected]>
Signed-off-by: apanteleev <[email protected]>
Co-authored-by: apanteleev <[email protected]>
Co-authored-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* PL: Delete precision if using plugin. TODO switch to MegatronTrainerBuilder (#9535)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add page context fmha (#9526)

Signed-off-by: Tugrul Konuk <[email protected]>

* extend get_gpt_layer_modelopt_spec to support MoE (#9532)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix mock data generation for legacy dataset (#9530)

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] IO fixes (#9512)

* Improve IOMixin.io_transform_args to handle dataclasses better

* Dump task json + img inside NeMoLogger

* Adding store_io to train task

* Update opt.connect to also propagate to __io__

* Rename opt to optim for consistency

* Moving to using safe serialization using fiddle, only use cloudpickle when needed

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Using Config from fiddle instead of sdk for now

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move enable_nemo_ckpt_io from MegatronStrategy to ModelCheckpoint

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move nemo-ckpt to _get_finalize_save_checkpoint_callback

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Update TrainerContext & io.load_ckpt

* Use renamed TrainerContext inside ModelCheckpoint

* Remove double io saving

* Rename lightning.pytorch.opt -> optim

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove store_io from train-task

* Adding fiddle-extension for torch

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move fdl_torch import

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding dtype to serialization

* Some fixes

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Make TransformerConfig inherit from IOMixin to fix serialization error

* Make TransformerConfig inherit from IOMixin to fix serialization error

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Add support for BuiltinFunctionType

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Add missing import

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix dataclass fields

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Test C++ runtime on demand in nemo_export.py to avoid possible OOMs (#9544)

* Add test_cpp_runtime flag

Signed-off-by: Jan Lasek <[email protected]>

* Apply isort and black reformatting

Signed-off-by: janekl <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Co-authored-by: janekl <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix lhotse tests for v1.24.2 (#9546)

* Fix lhotse tests for v1.24.0

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix RIR test

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* gpu_unitTests_notOptional (#9551)

Signed-off-by: Tugrul Konuk <[email protected]>

* add reset learning rate functionality (#9372)

* add reset_lr functionality

Signed-off-by: dimapihtar <[email protected]>

* fix reset_lr logic

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* move reset_lr from optim section

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr value to config

Signed-off-by: dimapihtar <[email protected]>

* set reset_lr False by default

Signed-off-by: dimapihtar <[email protected]>

* remove extra line

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr test

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr test

Signed-off-by: dimapihtar <[email protected]>

* remove extra quote

Signed-off-by: dimapihtar <[email protected]>

* add ability to reset schedule's max_steps and decay_steps

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* change scheduler's first step logic when using reset_lr

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* fix reset_lr logic

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* update reset_lr comments

Signed-off-by: dimapihtar <[email protected]>

* add use cases for reset_lr feature

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
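
The reset_lr commits above let a resumed run restart its learning-rate schedule instead of continuing the old one. A toy sketch of the idea, with all function names and the warmup shape being my own illustrative assumptions rather than NeMo's implementation:

```python
def effective_step(global_step: int, reset_step: int, reset_lr: bool) -> int:
    # With reset_lr, the scheduler sees steps counted from the resume point,
    # so warmup/decay restart; without it, the old schedule continues.
    return global_step - reset_step if reset_lr else global_step

def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    # Simple linear warmup, used here only to make the effect visible.
    if step >= warmup_steps:
        return base_lr
    return base_lr * step / warmup_steps
```

With `reset_lr=True` and a resume at step 1000, the scheduler restarts at step 0 and re-enters warmup; with `reset_lr=False` it stays at step 1000 and keeps the decayed rate.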

* Add Python AIStore SDK to container and bump min Lhotse version (#9537)

* Add Python AIStore SDK to requirements and bump min Lhotse version

Signed-off-by: Piotr Żelasko <[email protected]>

* Move AIStore Python SDK to Dockerfile, remove matplotlib/ipywidgets deps

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Adding 'use_dynamo' option for export to use onnx.dynamo_export() instead of onnx.export() (#9147)

* Initial WARs to implement dynamo option for export

Signed-off-by: Boris Fomitchev <[email protected]>

* including weights in .onnx

Signed-off-by: Boris Fomitchev <[email protected]>

* dynamo_export works for many small models

Signed-off-by: Boris Fomitchev <[email protected]>

* External weights behaviour fixed

Signed-off-by: Boris Fomitchev <[email protected]>

* Cleanup

Signed-off-by: Boris Fomitchev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: borisfom <[email protected]>

* print cleaned up

Signed-off-by: Boris Fomitchev <[email protected]>

* Added overloadable dynamic_shapes_for_export

Signed-off-by: Boris Fomitchev <[email protected]>

* Addressing code review

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing CI issues

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing CI test failure

Signed-off-by: Boris Fomitchev <[email protected]>

* Eliminated test cross-contamination

Signed-off-by: Boris Fomitchev <[email protected]>

---------

Signed-off-by: Boris Fomitchev <[email protected]>
Signed-off-by: borisfom <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Fix tokenizer IO (#9555)

* Adding tokenizer to io-test + making it pass

* Handling tokenizer correctly inside dump_io

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Removing not used import

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo UX] Move mistral_7b.py to mistral.py (#9545)

* Move mistral_7b.py to mistral.py

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rename MixtralConfig to MixtralConfig8x7B

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mistral rename: mistralconfig7b & mistralmodel

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Use closed-formula to round by multiple (#9307)

* Use closed-formula to round by multiple

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
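
The commit above replaces loop-based rounding with a closed formula. A sketch of the standard integer idiom (my illustration, not the exact NeMo code):

```python
def round_up_to_multiple(x: int, multiple: int) -> int:
    # Closed-form round-up to the nearest multiple: a single integer
    # expression, no loop. Assumes x >= 0 and multiple > 0.
    return ((x + multiple - 1) // multiple) * multiple
```

For instance, rounding 7 up to a multiple of 4 gives 8, while 8 is already a multiple and is returned unchanged.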

* ci: Do not attempt to send slack on fork (#9556)

* ci: Do not attempt to send slack on fork

Signed-off-by: Oliver Koenig <[email protected]>

* test

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix nemo export test (#9547)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* fix export test

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix SDXL incorrect name in docs (#9534)

Signed-off-by: Tugrul Konuk <[email protected]>

* GPU unit tests: Mark flaky tests to be fixed (#9559)

Signed-off-by: Tugrul Konuk <[email protected]>

* Bump PTL version (#9557)

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Resiliency] Straggler detection (#9473)

* Initial straggler det impl

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed CI code checks

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Removed unused import

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* remove submodule

Signed-off-by: Maanu Grover <[email protected]>

* Updated documentation; Updated callback params; Cosmetic changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed straggler det config; Added basic test

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixes in test_straggler_det.py

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Updated straggler callback API

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* stop_if_detected=False by default

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* switch to torch_dist as default dist checkpointing backend (#9541)

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Checkpointing bug fixes (#9562)

* fix checkpoint loading

* fix

* fixes

* another fix

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add tps and pps params to the export script (#9558)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* fix export test

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* remove n_gpus param

Signed-off-by: Onur Yilmaz <[email protected]>

* add and fix parameters

Signed-off-by: Onur Yilmaz <[email protected]>

* fix deploy script

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* rename tps and pps params

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: oyilmaz-nvidia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Consolidate gpt continue training script into pretraining script (#9413)

* Consolidate gpt continue training with pretraining

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix default config

Signed-off-by: yaoyu-33 <[email protected]>

* Add github action cicd

Signed-off-by: yaoyu-33 <[email protected]>

* extract _integrate_original_checkpoint_data as a method

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix getattr

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "Add github action cicd"

This reverts commit a453f16ba2be6413db932623009da893208acdd5.

* Update comments in nlp_overrides.py

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add support to change Multi task model prompt (#9542)

* Add support to change Multi task model prompt

Signed-off-by: smajumdar <[email protected]>

* Add support to change Multi task model prompt

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Update nemo/collections/common/prompts/formatter.py

Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>

* Address comments

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Address comments

Signed-off-by: smajumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: titu1994 <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add Multimodal Exporter (#9256)

* Add video-neva TRT export

* Add TRT inference

* Change config

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change export params

* Remove unused import

* Add neva export

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change unpack nemo

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Add trt infer config

* Fix neva trt inference

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Add exporter

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix infer

* Add PyTriton

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix deploy wrong dim

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change to pass PIL Image

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix video neva deploy

* Change query

* Change deploy

* Remove unused import

* Change ptuning

* Change to mm exporter

* Add script

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix script

---------

Signed-off-by: meatybobby <[email protected]>
Co-authored-by: meatybobby <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Enable encoder adapters for Canary and MultiTaskAED models (#9409)

* Fix assertions for adapter types

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Cleanup

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Finalize support for decoder adapters

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* fix the freeze/unfreeze problem by replacing as_frozen with torch.inference_mode

* Apply isort and black reformatting

Signed-off-by: weiqingw4ng <[email protected]>

* Update tests to new generic way of module update

Signed-off-by: smajumdar <[email protected]>

* Finalize code for update module

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Fix variable name

Signed-off-by: smajumdar <[email protected]>

* Finalize projection support for transformer mha adapters

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Correct implementation of freeze restore

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Corrects the implementation of replace_adapter_modules to limit to just the top level modules

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Remove registration of Transformer MHA

Signed-off-by: smajumdar <[email protected]>

* Remove registration of Transformer MHA

Signed-off-by: smajumdar <[email protected]>

* Address reviewer comments

Signed-off-by: smajumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: titu1994 <[email protected]>
Signed-off-by: weiqingw4ng <[email protected]>
Co-authored-by: Weiqing Wang <[email protected]>
Co-authored-by: weiqingw4ng <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* pass option through (#9570)

Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* PTQ refinements (#9574)

* Rename megatron_gpt_quantization -> megatron_gpt_ptq

Signed-off-by: Jan Lasek <[email protected]>

* Configure export.save_path as dir or tarball

Signed-off-by: Jan Lasek <[email protected]>

* PTQ docs update

Signed-off-by: Jan Lasek <[email protected]>

* Make model_type optional in case of quantized checkpoints

Signed-off-by: Jan Lasek <[email protected]>

* Drop unused save_nemo_model_config argument

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Audio model collection (#9263)

* Audio model collection

Signed-off-by: Ante Jukić <[email protected]>

* Apply isort and black reformatting

Signed-off-by: anteju <[email protected]>

* Fix imports

Signed-off-by: Ante Jukić <[email protected]>

* Addressed PR comments

Signed-off-by: Ante Jukić <[email protected]>

* Apply isort and black reformatting

Signed-off-by: anteju <[email protected]>

---------

Signed-off-by: Ante Jukić <[email protected]>
Signed-off-by: anteju <[email protected]>
Co-authored-by: anteju <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Fix Trainer serialization (#9571)

* Fix Trainer serialization

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Update click version requirement (#9580)

Signed-off-by: Dong Hyuk Chang <[email protected]>
Co-authored-by: Dong Hyuk Chang <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Fault tolerance] Heartbeat detection (#9352)

* Fault tolerance related changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Cosmetic changes in documentation

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Doc update round2

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Co-authored-by: Jacek Bieniusiewicz <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add ModelOpt QAT example for Llama2 SFT model (#9326)

* add INT4 QAT example for Llama2 SFT model

Signed-off-by: Keval Morabia <[email protected]>

* Add config parameter to control kv cache quantization

Signed-off-by: Keval Morabia <[email protected]>

* Fix typo in cicd-main.yml for QAT test

Signed-off-by: Keval Morabia <[email protected]>

* fix nlp_overrides.py

Signed-off-by: Keval Morabia <[email protected]>

* address reviewer feedback

Signed-off-by: Keval Morabia <[email protected]>

* quantize unwrapped model

Signed-off-by: Keval Morabia <[email protected]>

* add compress export argument for qat config

Signed-off-by: Keval Morabia <[email protected]>

---------

Signed-off-by: Keval Morabia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Set TE flag in legacy -> mcore conversion script (#9585)

* set TE flag

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Add fabric-API for manual forward-pass (#9577)

* First pass over fabric-API

* Adding Trainer -> Fabric conversion

* Some small fixes to get a forward-pass in Fabric working

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding doc-string to Fabric.import_model

* Adding track_io to io_init of Fabric

* Fix Fabric.load_model + add doc-string

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove unused import

* Some small fixes

* Fix failing test

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Add SDK-factories to llm-collection (#9589)

* Adding sdk-factories to llm-collection

* Removing _model from mistral + mixtral

* Expose lr_scheduler inside lightning

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Multimodal projection layer adapter fix for PP>1 (#9445)

* enabling multimodal adapters to load in PP>1

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* parameterizing validate_access_integrity, set to false when PP>1

Signed-off-by: paul-gibbons <[email protected]>

formatting fix

Signed-off-by: paul-gibbons <[email protected]>

Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* update nlp_model.py

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* update modelPT with validate_access_integrity

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* updating save_restore_connector w/ validate_access_integrity

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* addressing comment

Signed-off-by: paul-gibbons <[email protected]>

* adding validate_access_integrity to super().load_config_and_state_dict()

Signed-off-by: paul-gibbons <[email protected]>

* testing reorder of validate_access_integrity for CI failures

Signed-off-by: paul-gibbons <[email protected]>

---------

Signed-off-by: paul-gibbons <[email protected]>
Signed-off-by: paul-gibbons <[email protected]>
Co-authored-by: paul-gibbons <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add offline quantization script for QLoRA deployment (#9455)

* add qlora offline quantization script

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* clean

Signed-off-by: Chen Cui <[email protected]>

* docstring

Signed-off-by: Chen Cui <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* qlora support more models (#9488)

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Some improvements to NeMoLogger (#9591)

Signed-off-by: Tugrul Konuk <[email protected]>

* Set n_gpu to None in nemo export (#9593)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* set ngpus to None

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Inflight nemo model export support (#9527)

* online model conversion and refit

Signed-off-by: Jimmy Zhang <[email protected]>

* clean code

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup

Signed-off-by: Jimmy Zhang <[email protected]>

* add refit, cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* combine weight conversion functions

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <[email protected]>

* remove debug print

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* fix single gpu and cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <[email protected]>

---------

Signed-off-by: JimmyZhang12 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* vLLM Export Improvements (#9596)

* Separated the vLLM export functionality from the common deployment script into deploy_vllm_triton.py.

Signed-off-by: Alexey Panteleev <[email protected]>

* Fixed vocab_size for LLAMA3.

Signed-off-by: Alexey Panteleev <[email protected]>

* Export test: fixed deployment testing w/o Megatron, made functional tests optional, added --gpu_memory_utilization.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Addressing review and CodeQL comments.

Signed-off-by: Alexey Panteleev <[email protected]>

---------

Signed-off-by: Alexey Panteleev <[email protected]>
Signed-off-by: apanteleev <[email protected]>
Co-authored-by: apanteleev <[email protected]>
Co-authored-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Set finalize_model_grads_func in on_fit_start instead to make sure it's being called (#9599)

Signed-off-by: Tugrul Konuk <[email protected]>

* Set no_sync_func & grad_sync_func (#9601)

* Set no_sync_func & grad_sync_func

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* set overlap_param_sync

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* small nemo logger bug fix (#9607)

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix the dict format returned by scheduler method (#9609)

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Dataloading enhancements and bug fixes (#9595)

* fix dataloading + checkpoint restore

* clean up data sampler

* fix typo

* support passing multiple paths to data module

* fix validation dataloader

* fix dataloader len when using gradient accumulation

* fix progress bar

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* fix step count in loggers

* fix blended dataset

* address comments

* address comment

* move step logging into strategy

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix serialization of AutoResume (#9616)

* fix serialization of autoresume

* update undefined variables

Signed-off-by: Tugrul Konuk <[email protected]>

* Chat template support for megatron_gpt_eval.py (#9354)

* Bump PTL version (#9557)

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* [Resiliency] Straggler detection (#9473)

* Initial straggler det impl

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed CI code checks

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Removed unused import

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* remove submodule

Signed-off-by: Maanu Grover <[email protected]>

* Updated documentation; Updated callback params; Cosmetic changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed straggler det config; Added basic test

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixes in test_straggler_det.py

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Updated straggler callback API

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* stop_if_detected=False by default

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move model loading to a separate function; call toContainer once; pad using a closed formula

Signed-off-by: Alexandros Koumparoulis <[email protected]>
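
The commit above mentions padding "using a closed formula". As a minimal sketch of what such closed-form padding typically looks like (rounding a length up to the nearest multiple without a loop or conditional) — this is illustrative, not the actual NeMo implementation:

```python
def pad_to_multiple(length: int, multiple: int) -> int:
    """Round length up to the nearest multiple using a closed formula."""
    return (length + multiple - 1) // multiple * multiple

# Example: pad a token sequence length to a multiple of 8.
print(pad_to_multiple(13, 8))  # 16
```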

* read prompts from file

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* If input prompt contains dict, apply model.tokenizer.chat_template

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* apply @Gal Leibovich's patch

Taken from: https://github.com/NVIDIA/NeMo/commit/17572905344db4692583e72799d55801a8860f35
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rename prompts_file to prompts_jsonl

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat_template param

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add ChatTemplateMixin to SentencePieceTokenizer

Signed-off-by: Alexandros Koumparoulis <[email protected]>
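
As a hedged sketch of what a chat-template mixin can look like — a class that renders a list of `{role, content}` dicts into a single prompt string. The class name, template format, and role markers below are purely illustrative, not NeMo's actual `ChatTemplateMixin` API:

```python
class ChatTemplateMixin:
    """Illustrative mixin: renders role/content message dicts into one
    prompt string using a simple per-role template."""

    chat_template = "<extra_id_0>{role}\n{content}\n"

    def apply_chat_template(self, messages):
        # Concatenate one rendered block per message, in order.
        return "".join(
            self.chat_template.format(role=m["role"], content=m["content"])
            for m in messages
        )


class ToyTokenizer(ChatTemplateMixin):
    """Stand-in for a tokenizer class that gains the mixin."""


prompt = ToyTokenizer().apply_chat_template(
    [{"role": "User", "content": "Hi"}, {"role": "Assistant", "content": "Hello"}]
)
print(prompt)
```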

* add chat-template to text-gen-strat

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move load prompts to separate file

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove chat-template from text-gen-utils

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* make chat-template more generic

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add assert message

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* small refactor for chat_template_mixin

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo ckpt conv changes

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move rounding to function

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Jsonl support (#9611)

* Adding support to preprocess .jsonl and .jsonl.gz files in input directory

Signed-off-by: adityavavre <[email protected]>

* Adding support to preprocess .jsonl and .jsonl.gz files in input directory

Signed-off-by: adityavavre <[email protected]>
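
A minimal sketch of transparently reading both `.jsonl` and `.jsonl.gz` inputs with the standard library, as the commits above describe; the function name and file names are illustrative, not the actual preprocessing code:

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl or .jsonl.gz file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo: the same reader handles plain and gzip-compressed files.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.jsonl").write_text('{"text": "hello"}\n{"text": "world"}\n', encoding="utf-8")
with gzip.open(tmp / "b.jsonl.gz", "wt", encoding="utf-8") as f:
    f.write('{"text": "zipped"}\n')

records = list(read_jsonl(tmp / "a.jsonl")) + list(read_jsonl(tmp / "b.jsonl.gz"))
print([r["text"] for r in records])  # ['hello', 'world', 'zipped']
```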

* Apply isort and black reformatting

Signed-off-by: adityavavre <[email protected]>

---------

Signed-off-by: adityavavre <[email protected]>
Signed-off-by: adityavavre <[email protected]>
Co-authored-by: adityavavre <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Add PEFT (#9490)

* initial commit for PEFT in nemo2

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* address comments

Signed-off-by: Chen Cui <[email protected]>

* make import easier

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* address comments

Signed-off-by: Chen Cui <[email protected]>

* Update nemo/collections/llm/peft/lora.py

Signed-off-by: Marc Romeyn <[email protected]>

* Some small fixes + adding more doc-strings

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding ModelTransform callback

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fixing type-hint for model_transform

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* fix import

Signed-off-by: Chen Cui <[email protected]>

* model transform for gemma llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fix model transform

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* change lora target default to all linear modules

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* Small fix in mixtral

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Integrating PEFT to the public-API + some fixes

* Big refactor to allow to load adapter-states

* Some fixes to support adapter_path

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Disabling ckpt reloading when adapter_path is passed

* Fix CLI

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove commented-out code

* Remove commented-out code

* Remove un-used import

* Fix callback imports

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fixing llm.pretrain

* Some small fixes

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix missing import + type-hint in finetune

* Adding PreemptionCallback + some more tests

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Clean up imports & clean up llm.api

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Trying to fix failing tests

* Remove __init__.py 2

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix failing test

* Trying to fix last failing test

---------

Signed-off-by: cuichenx <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Akoumparouli/mistral import instruct chat template fix (#9567)

* use bf16 by default for mistral conversion

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat template

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use capitalized role names

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Remove .cuda calls, use device instead (#9602)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix converter default args (#9565)

* fix converter default args

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* mixtral export (#9603)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix: remove non_blocking from PTL's .cuda call (#9618)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Alit/mamba tmp (#9612)

* adding mamba support

* fix import mixins

* rm convert jamba

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* more cleanups

* use GPT text gen

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* fixing gbs in TP converter

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add reqs

* add tutorial

* minor fix to tutorial

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add mamba_tmp

* remove mamba import

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

---------

Signed-off-by: JRD971000 <[email protected]>
Signed-off-by: arendu <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: JRD971000 <[email protected]>
Co-authored-by: arendu <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* TitaNet Batch Verify Speaker (#9337)

* add batch_inference for verify_speakers method

Signed-off-by: [email protected] <[email protected]>

* remove not used package

Signed-off-by: [email protected] <[email protected]>

* change batch inference logic

Signed-off-by: [email protected] <[email protected]>

* fixup

Signed-off-by: [email protected] <[email protected]>

* requested changes

Signed-off-by: [email protected] <[email protected]>

* add verify_speakers_batch to docs

Signed-off-by: [email protected] <[email protected]>

* handle None durations in manifest

Signed-off-by: [email protected] <[email protected]>

* change logging text

Signed-off-by: [email protected] <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* check duration presence

Signed-off-by: [email protected] <[email protected]>

* add channel_selector to dataset configs

Signed-off-by: [email protected] <[email protected]>

---------

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Enable MCore checkpointing optimizations (#9505)

* Expose num processes in PyT Dist

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add parallel save/load optimizations from MCore

Signed-off-by: Mikołaj Błaż <[email protected]>

* Remove async utils from MCore

Signed-off-by: Mikołaj Błaż <[email protected]>

* Enable DistOpt parallel R/W

Signed-off-by: Mikołaj Błaż <[email protected]>

* Enable PyT Dist caching

Signed-off-by: Mikołaj Błaż <[email protected]>

* Small fixes

Signed-off-by: Mikołaj Błaż <[email protected]>

* Make sure DistCkptIO is instantiated from config

Signed-off-by: Mikołaj Błaż <[email protected]>

* Bump MCore version to v0.7

Signed-off-by: Mikołaj Błaż <[email protected]>

* Print load strategy

Signed-off-by: Mikołaj Błaż <[email protected]>

* Forward MCore to model space DistOpt

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add separate flag to control DistOpt parallel R/W

Signed-off-by: Mikołaj Błaż <[email protected]>

* Turn off parallel save by default

Signed-off-by: Mikołaj Błaż <[email protected]>

---------

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Change mixtral moe key name for trt-llm (#9620)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* change moe key values

Signed-off-by: Onur Yilmaz <[email protected]>

* add weight to the key

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix ckpt load bug (#9621)

* fix ckpt load bug

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* NeVA Minor Fixes (#9608)

* fix neva resume with empty param loaded for some pp stage

Signed-off-by: yaoyu-33 <[email protected]>

* fix crop size check

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix pretraining data sizes and weights (#9627)

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Alit/mamba (#9575)

* adding mamba support

* fix import mixins

* rm convert jamba

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* more cleanups

* use GPT text gen

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* fixing gbs in TP converter

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add reqs

* add tutorial

* minor fix to tutorial

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* add mamba dependencies

* add mcore tag

* modify dockerfile ci

* modify dockerfile ci

---------

Signed-off-by: JRD971000 <[email protected]>
Signed-off-by: arendu <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: JRD971000 <[email protected]>
Co-authored-by: arendu <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] async checkpointing support (#9466)

* add async checkpointing support

* fixes

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* add parallel read/write support and other optimizations

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* address comments, make dist checkpointing args configurable

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* fix small typo

Signed-off-by: ashors1 <[email protected]>

* Update default sharding type

Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Anna Shors <[email protected]>

* Update default sharding type

Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Anna Shors <[email protected]>

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Anna Shors <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix the arguments of forward_for_export function in msdd_models (#9624)

* Fix the arguments of forward_for_export function

Signed-off-by: Taejin Park <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

---------

Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: tango4j <[email protected]>
Co-authored-by: tango4j <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Change default parallel_save to False (#9632)

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Unwrap ckpt_io for model opt (async save) (#9622)

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* MCore T5 support for NeMo - Training (#9432)

* huvu/mcore_t5 first commit from local

* removing DEBUGGING prints

* cleaning megatron_lm_encoder_decoder_model.py code

* cleaning code

* adding GitHub Actions test

* only run mcore T5 test

* only run mcore T5 test

* only run mcore T5 test

* only run mcore T5 test

* reset .github/workflows/cicd-main.yml

* reset .github/workflows/cicd-main.yml

* adding condition self.mcore_t5 when running self.build_transformer_config()

* refactor megatron_lm_encoder_decoder_model.py to not use self.model

* only run T5-related tests

* remove all self.model

* reset cicd file

* reset cicd file

* update code: remove duplicate if/else; add mcore/transformer_engine to config file

* adjust +model.mcore_t5=True

* Apply isort and black reformatting

Signed-off-by: huvunvidia <[email protected]>

---------

Signed-off-by: huvunvidia <[email protected]>
Co-authored-by: Huy Vu2 <[email protected]>
Co-authored-by: huvunvidia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Expose transformer_layer_spec inside GPTConfig (#9592)

* Expose transformer_layer_spec inside GPTConfig

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Expose layer-specs

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Update NeMo Clip to Use MCore Modules (#9594)

* update clip model and config file

Signed-off-by: yaoyu-33 <[email protected]>

* update clip for mcore

Signed-off-by: yaoyu-33 <[email protected]>

* MCore CLIP Fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix no mask

Signed-off-by: yaoyu-33 <[email protected]>

* few neva fixes

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip module

Signed-off-by: yaoyu-33 <[email protected]>

* add siglip loss

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix collate fn

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip conversion script

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip convert

Signed-off-by: yaoyu-33 <[email protected]>

* clip fixes

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* clean up script

Signed-off-by: yaoyu-33 <[email protected]>

* clip fixes

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix code styles

Signed-off-by: yaoyu-33 <[email protected]>

* Update siglip_loss.py

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add REST API to deploy module (#9539)

* Add REST API and FastAPI to deploy module

Signed-off-by: Abhishree <[email protected]>

* Add NemoQuery and requirements

Signed-off-by: Abhishree <[email protected]>

* Edit path for config.json

Signed-off-by: Abhishree <[email protected]>

* Add modifications for REST API for the correct functionality

Move service dir under deploy
Use NeMoQueryLLM instead of NemoQuery

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Apply isort and black reformatting

Signed-off-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com>

* Change default port for REST Service

Change the default port for the REST service, because the Triton server uses the same port by default.

Signed-off-by: Abhishree Thittenamane <[email protected]>

* Apply isort and black reformatting

Signed-off-by: athitten <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Signed-off-by: athitten <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: athitten <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Mistral + Mixtral Support for NeVa (#9459)

* mistral template support

Signed-off-by: paul-gibbons <[email protected]>

* get_specs neva fix

Signed-off-by: paul-gibbons <[email protected]>

* mistral update

Signed-off-by: paul-gibbons <[email protected]>

* fixed mistral tokenization

Signed-off-by: paul-gibbons <[email protected]>

* t…
gshennvm pushed a commit that referenced this pull request Oct 2, 2024
* Adding context- & expert-parallism to MegatronStrategy (#9525)

Signed-off-by: Tugrul Konuk <[email protected]>

* Add CICD test for Stable Diffusion (#9464)

* Add CICD test for Stable Diffusion

Signed-off-by: Michal Futrega <[email protected]>

* Update cicd-main.yml

Signed-off-by: Michal Futrega <[email protected]>

* Use single gpu runner

Signed-off-by: Michal Futrega <[email protected]>

---------

Signed-off-by: Michal Futrega <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Akoumparouli/nemo ux mixtral (#9446)

* use default collate if dataset does not have one

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mixtral config

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add convert_state

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix StateDictTransform for 2D layers, e.g. MoE

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pass num_moe_experts to specs

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update MixtralModel

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mini docstring

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* update mcoreddp call (#9345)

* update mcoreddp call

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update mcore commits

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Llama and Gemma (#9528)

* add llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add llama3

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fix typo

Signed-off-by: Chen Cui <[email protected]>

* enable importers with multiple models

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add gemma

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* checks

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] minor logging bug fixes (#9529)

* minor exp_manager bug fixes

* remove print statement

* fix docstring

* fix AppState defaults

---------

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* mcore distOpt restore fix (#9421)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Custom Tiktoken tokenizer.

Signed-off-by: Tugrul Konuk <[email protected]>

* Fixed the tokenizer decoding on special tokens.

Signed-off-by: Tugrul Konuk <[email protected]>

* Apply isort and black reformatting

Signed-off-by: ertkonuk <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Added token_to_id() method.

Signed-off-by: Tugrul Konuk <[email protected]>
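
For context, `token_to_id()` on a Tiktoken-style tokenizer is typically a vocabulary lookup from a token string to its integer id. The sketch below is a hypothetical, simplified illustration of that shape; the class and attribute names are assumptions, not NeMo's actual implementation:

```python
class SimpleTiktokenLikeTokenizer:
    """Minimal sketch of a tokenizer exposing token_to_id()/id_to_token()."""

    def __init__(self, vocab: dict):
        # vocab maps token strings to integer ids
        self._vocab = vocab
        # inverse mapping for decoding single ids back to tokens
        self._inv = {v: k for k, v in vocab.items()}

    def token_to_id(self, token: str) -> int:
        # Direct vocabulary lookup; raises KeyError for unknown tokens
        return self._vocab[token]

    def id_to_token(self, token_id: int) -> str:
        return self._inv[token_id]
```

Usage: `SimpleTiktokenLikeTokenizer({"<s>": 0, "hello": 1}).token_to_id("hello")` returns `1`.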

* Update neva conversion script from and to HF (#9296)

* Update NeMo script

Signed-off-by: yaoyu-33 <[email protected]>

* Fix example scripts

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update convert_llava_nemo_to_hf.py

Signed-off-by: yaoyu-33 <[email protected]>

* address comments

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* vLLM Export Support (#9381)

* Export implementation for vLLM 0.4.3.

Supports LLAMA2, Mistral, Mixtral (unverified), Gemma and StarCoder2 models.

The nemo.export.tensorrt_llm alias was removed to avoid initializing TRT-LLM when importing anything from nemo.export.

Signed-off-by: Alexey Panteleev <[email protected]>

* Fixed some CodeQL warnings.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Removed empty files.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Updated the integration for vLLM 0.5.0.

Signed-off-by: Alexey Panteleev <[email protected]>

* Updated the vLLM deployment interface to use max_output_len instead of max_output_token.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Moved the Exporter class to nemo/export and renamed its file to vllm_exporter.py, to be more similar to TRT-LLM.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Implemented vLLM support in the export tests, added functional testing, implemented forward evaluation on vLLM without Triton.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Moved the vLLM deployment functionality to the common deploy_triton.py script.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Fixed the CodeQL discovered issues.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Fixed one more return of a wrong dimensionality...

Signed-off-by: Alexey Panteleev <[email protected]>

* More wrong dimensionality returns.

Signed-off-by: Alexey Panteleev <[email protected]>

---------

Signed-off-by: Alexey Panteleev <[email protected]>
Signed-off-by: apanteleev <[email protected]>
Co-authored-by: apanteleev <[email protected]>
Co-authored-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* PL: Delete precision if using plugin. TODO switch to MegatronTrainerBuilder (#9535)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add page context fmha (#9526)

Signed-off-by: Tugrul Konuk <[email protected]>

* extend get_gpt_layer_modelopt_spec to support MoE (#9532)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix mock data generation for legacy dataset (#9530)

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] IO fixes (#9512)

* Improve IOMixin.io_transform_args to handle dataclasses better

* Dump task json + img inside NeMoLogger

* Adding store_io to train task

* Update opt.connect to also propagate to __io__

* Rename opt to optim for consistency

* Move to safe serialization with fiddle; only use cloudpickle when needed

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Using Config from fiddle instead of sdk for now

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move enable_nemo_ckpt_io from MegatronStrategy to ModelCheckpoint

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move nemo-ckpt to _get_finalize_save_checkpoint_callback

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Update TrainerContext & io.load_ckpt

* Use renamed TrainerContext inside ModelCheckpoint

* Remove double io saving

* Rename lightning.pytorch.opt -> optim

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove store_io from train-task

* Adding fiddle-extension for torch

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move fdl_torch import

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding dtype to serialization

* Some fixes

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Make TransformerConfig inherit from IOMixin to fix serialization error

* Make TransformerConfig inherit from IOMixin to fix serialization error

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Add support for BuiltinFunctionType

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Add missing import

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix dataclass fields

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Test C++ runtime on demand in nemo_export.py to avoid possible OOMs (#9544)

* Add test_cpp_runtime flag

Signed-off-by: Jan Lasek <[email protected]>

* Apply isort and black reformatting

Signed-off-by: janekl <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Co-authored-by: janekl <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix lhotse tests for v1.24.2 (#9546)

* Fix lhotse tests for v1.24.0

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix RIR test

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* gpu_unitTests_notOptional (#9551)

Signed-off-by: Tugrul Konuk <[email protected]>

* add reset learning rate functionality (#9372)

* add reset_lr functionality

Signed-off-by: dimapihtar <[email protected]>

* fix reset_lr logic

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* move reset_lr from optim section

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr value to config

Signed-off-by: dimapihtar <[email protected]>

* set reset_lr False by default

Signed-off-by: dimapihtar <[email protected]>

* remove extra line

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr test

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr test

Signed-off-by: dimapihtar <[email protected]>

* remove extra quote

Signed-off-by: dimapihtar <[email protected]>

* add ability to reset schedule's max_steps and decay_steps

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* change scheduler's first step logic when using reset_lr

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* fix reset_lr logic

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* update reset_lr comments

Signed-off-by: dimapihtar <[email protected]>

* add use cases for reset_lr feature

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
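
The `reset_lr` behavior above can be illustrated with a toy schedule: when resuming continued pretraining with reset on, the scheduler is driven by steps counted from the resume point instead of the global step, so warmup and decay restart. This is a sketch under assumed names; NeMo's actual schedule and config keys differ:

```python
def learning_rate(step, *, max_lr=3e-4, warmup=10, decay_steps=100):
    """Toy warmup + linear-decay schedule."""
    if step < warmup:
        return max_lr * step / warmup
    frac = min(1.0, (step - warmup) / max(1, decay_steps - warmup))
    return max_lr * (1.0 - frac)

def scheduler_step(global_step, resume_step, reset_lr):
    # With reset_lr, count steps from the resume point so the
    # schedule (warmup included) starts over; otherwise keep the
    # original global step, which may already be fully decayed.
    effective = global_step - resume_step if reset_lr else global_step
    return learning_rate(effective)
```

Resuming at global step 500, `reset_lr=True` puts the model back in warmup, while `reset_lr=False` continues the decayed schedule.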

* Add Python AIStore SDK to container and bump min Lhotse version (#9537)

* Add Python AIStore SDK to requirements and bump min Lhotse version

Signed-off-by: Piotr Żelasko <[email protected]>

* Move AIStore Python SDK to Dockerfile, remove matplotlib/ipywidgets deps

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Adding 'use_dynamo' option for export to use onnx.dynamo_export() instead of onnx.export() (#9147)

* Initial WARs to implement dynamo option for export

Signed-off-by: Boris Fomitchev <[email protected]>

* including weights in .onnx

Signed-off-by: Boris Fomitchev <[email protected]>

* dynamo_export works for many small models

Signed-off-by: Boris Fomitchev <[email protected]>

* External weights behaviour fixed

Signed-off-by: Boris Fomitchev <[email protected]>

* Cleanup

Signed-off-by: Boris Fomitchev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: borisfom <[email protected]>

* print cleaned up

Signed-off-by: Boris Fomitchev <[email protected]>

* Added overloadable dynamic_shapes_for_export

Signed-off-by: Boris Fomitchev <[email protected]>

* Addressing code review

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing CI issues

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing CI test failure

Signed-off-by: Boris Fomitchev <[email protected]>

* Eliminated test cross-contamination

Signed-off-by: Boris Fomitchev <[email protected]>

---------

Signed-off-by: Boris Fomitchev <[email protected]>
Signed-off-by: borisfom <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Fix tokenizer IO (#9555)

* Adding tokenizer to io-test + making it pass

* Handling tokenizer correctly inside dump_io

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Removing not used import

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo UX] Move mistral_7b.py to mistral.py (#9545)

* Move mistral_7b.py to mistral.py

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rename MixtralConfig to MixtralConfig8x7B

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mistral rename: mistralconfig7b & mistralmodel

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Use closed-formula to round by multiple (#9307)

* Use closed-formula to round by multiple

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
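
The closed formula for rounding up to a multiple is commonly this integer idiom (a minimal sketch; the helper name is illustrative, not the one used in the PR):

```python
def round_up(x: int, multiple: int) -> int:
    # Add (multiple - 1) so any nonzero remainder spills into the next
    # bucket, then floor-divide and multiply back; no loop or branch.
    return (x + multiple - 1) // multiple * multiple
```

For example, `round_up(7, 4)` gives `8`, and values already on the boundary are unchanged.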

* ci: Do not attempt to send slack on fork (#9556)

* ci: Do not attempt to send slack on fork

Signed-off-by: Oliver Koenig <[email protected]>

* test

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix nemo export test (#9547)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* fix export test

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix SDXL incorrect name in docs (#9534)

Signed-off-by: Tugrul Konuk <[email protected]>

* GPU unit tests: Mark flaky tests to be fixed (#9559)

Signed-off-by: Tugrul Konuk <[email protected]>

* Bump PTL version (#9557)

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Resiliency] Straggler detection (#9473)

* Initial straggler det impl

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed CI code checks

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Removed unused import

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* remove submodule

Signed-off-by: Maanu Grover <[email protected]>

* Updated documentation; Updated callback params; Cosmetic changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed straggler det config; Added basic test

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixes in test_straggler_det.py

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Updated straggler callback API

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* stop_if_detected=False by default

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* switch to torch_dist as default dist checkpointing backend (#9541)

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Checkpointing bug fixes (#9562)

* fix checkpoint loading

* fix

* fixes

* another fix

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add tps and pps params to the export script (#9558)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* fix export test

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* remove n_gpus param

Signed-off-by: Onur Yilmaz <[email protected]>

* add and fix parameters

Signed-off-by: Onur Yilmaz <[email protected]>

* fix deploy script

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* rename tps and pps params

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: oyilmaz-nvidia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Consolidate gpt continue training script into pretraining script (#9413)

* Consolidate gpt continue training with pretraining

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix default config

Signed-off-by: yaoyu-33 <[email protected]>

* Add github action cicd

Signed-off-by: yaoyu-33 <[email protected]>

* extract _integrate_original_checkpoint_data as a method

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix getattr

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "Add github action cicd"

This reverts commit a453f16ba2be6413db932623009da893208acdd5.

* Update comments in nlp_overrides.py

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add support to change Multi task model prompt (#9542)

* Add support to change Multi task model prompt

Signed-off-by: smajumdar <[email protected]>

* Add support to change Multi task model prompt

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Update nemo/collections/common/prompts/formatter.py

Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>

* Address comments

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Address comments

Signed-off-by: smajumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: titu1994 <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add Multimodal Exporter (#9256)

* Add video-neva TRT export

* Add TRT inference

* Change config

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change export params

* Remove unused import

* Add neva export

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change unpack nemo

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Add trt infer config

* Fix neva trt inference

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Add exporter

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix infer

* Add PyTriton

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix deploy wrong dim

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change to pass PIL Image

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix video neva deploy

* Change query

* Change deploy

* Remove unused import

* Change ptuning

* Change to mm exporter

* Add script

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix script

---------

Signed-off-by: meatybobby <[email protected]>
Co-authored-by: meatybobby <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Enable encoder adapters for Canary and MultiTaskAED models (#9409)

* Fix assertions for adapter types

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Cleanup

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Finalize support for decoder adapters

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* fix the freeze/unfreeze problem by replacing as_frozen with torch.inference_mode

* Apply isort and black reformatting

Signed-off-by: weiqingw4ng <[email protected]>

* Update tests to new generic way of module update

Signed-off-by: smajumdar <[email protected]>

* Finalize code for update module

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Fix variable name

Signed-off-by: smajumdar <[email protected]>

* Finalize projection support for transformer mha adapters

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Correct implementation of freeze restore

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Corrects the implementation of replace_adapter_modules to limit to just the top level modules

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Remove registration of Transformer MHA

Signed-off-by: smajumdar <[email protected]>

* Remove registration of Transformer MHA

Signed-off-by: smajumdar <[email protected]>

* Address reviewer comments

Signed-off-by: smajumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: titu1994 <[email protected]>
Signed-off-by: weiqingw4ng <[email protected]>
Co-authored-by: Weiqing Wang <[email protected]>
Co-authored-by: weiqingw4ng <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* pass option through (#9570)

Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* PTQ refinements (#9574)

* Rename megatron_gpt_quantization -> megatron_gpt_ptq

Signed-off-by: Jan Lasek <[email protected]>

* Configure export.save_path as dir or tarball

Signed-off-by: Jan Lasek <[email protected]>

* PTQ docs update

Signed-off-by: Jan Lasek <[email protected]>

* Make model_type optional in case of quantized checkpoints

Signed-off-by: Jan Lasek <[email protected]>

* Drop unused save_nemo_model_config argument

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Audio model collection (#9263)

* Audio model collection

Signed-off-by: Ante Jukić <[email protected]>

* Apply isort and black reformatting

Signed-off-by: anteju <[email protected]>

* Fix imports

Signed-off-by: Ante Jukić <[email protected]>

* Addressed PR comments

Signed-off-by: Ante Jukić <[email protected]>

* Apply isort and black reformatting

Signed-off-by: anteju <[email protected]>

---------

Signed-off-by: Ante Jukić <[email protected]>
Signed-off-by: anteju <[email protected]>
Co-authored-by: anteju <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Fix Trainer serialization (#9571)

* Fix Trainer serialization

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Update click version requirement (#9580)

Signed-off-by: Dong Hyuk Chang <[email protected]>
Co-authored-by: Dong Hyuk Chang <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Fault tolerance] Heartbeat detection (#9352)

* Fault tolerance related changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Cosmetic changes in documentation

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Doc update round2

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Co-authored-by: Jacek Bieniusiewicz <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add ModelOpt QAT example for Llama2 SFT model (#9326)

* add INT4 QAT example for Llama2 SFT model

Signed-off-by: Keval Morabia <[email protected]>

* Add config parameter to control kv cache quantization

Signed-off-by: Keval Morabia <[email protected]>

* Fix typo in cicd-main.yml for QAT test

Signed-off-by: Keval Morabia <[email protected]>

* fix nlp_overrides.py

Signed-off-by: Keval Morabia <[email protected]>

* address reviewer feedback

Signed-off-by: Keval Morabia <[email protected]>

* quantize unwrapped model

Signed-off-by: Keval Morabia <[email protected]>

* add compress export argument for qat config

Signed-off-by: Keval Morabia <[email protected]>

---------

Signed-off-by: Keval Morabia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Set TE flag in legacy -> mcore conversion script (#9585)

* set TE flag

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Add fabric-API for manual forward-pass (#9577)

* First pass over fabric-API

* Adding Trainer -> Fabric conversion

* Some small fixes to get a forward-pass in Fabric working

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding doc-string to Fabric.import_model

* Adding track_io to io_init of Fabric

* Fix Fabric.load_model + add doc-string

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove unused import

* Some small fixes

* Fix failing test

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Add SDK-factories to llm-collection (#9589)

* Adding sdk-factories to llm-collection

* Removing _model from mistral + mixtral

* Expose lr_scheduler inside lightning

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Multimodal projection layer adapter fix for PP>1 (#9445)

* enabling multimodal adapters to load in PP>1

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* parameterizing validate_access_integrity, set to false when PP>1

Signed-off-by: paul-gibbons <[email protected]>

* formatting fix

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* update nlp_model.py

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* update modelPT with validate_access_integrity

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* updating save_restore_connector w/ validate_access_integrity

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* addressing comment

Signed-off-by: paul-gibbons <[email protected]>

* adding validate_access_integrity to super().load_config_and_state_dict()

Signed-off-by: paul-gibbons <[email protected]>

* testing reorder of validate_access_integrity for CI failures

Signed-off-by: paul-gibbons <[email protected]>

---------

Signed-off-by: paul-gibbons <[email protected]>
Signed-off-by: paul-gibbons <[email protected]>
Co-authored-by: paul-gibbons <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add offline quantization script for QLoRA deployment (#9455)

* add qlora offline quantization script

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* clean

Signed-off-by: Chen Cui <[email protected]>

* docstring

Signed-off-by: Chen Cui <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* qlora support more models (#9488)

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Some improvements to NeMoLogger (#9591)

Signed-off-by: Tugrul Konuk <[email protected]>

* Set n_gpu to None in nemo export (#9593)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* set ngpus to None

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Inflight nemo model export support (#9527)

* online model conversion and refit

Signed-off-by: Jimmy Zhang <[email protected]>

* clean code

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup

Signed-off-by: Jimmy Zhang <[email protected]>

* add refit, cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* combine weight conversion functions

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <[email protected]>

* remove debug print

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* fix single gpu and cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <[email protected]>

---------

Signed-off-by: JimmyZhang12 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* vLLM Export Improvements (#9596)

* Separated the vLLM export functionality from the common deployment script into deploy_vllm_triton.py.

Signed-off-by: Alexey Panteleev <[email protected]>

* Fixed vocab_size for LLAMA3.

Signed-off-by: Alexey Panteleev <[email protected]>

* Export test: fixed deployment testing w/o Megatron, made functional tests optional, added --gpu_memory_utilization.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Addressing review and CodeQL comments.

Signed-off-by: Alexey Panteleev <[email protected]>

---------

Signed-off-by: Alexey Panteleev <[email protected]>
Signed-off-by: apanteleev <[email protected]>
Co-authored-by: apanteleev <[email protected]>
Co-authored-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Set finalize_model_grads_func in on_fit_start instead to make sure it's being called (#9599)

Signed-off-by: Tugrul Konuk <[email protected]>

* Set no_sync_func & grad_sync_func (#9601)

* Set no_sync_func & grad_sync_func

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* set overlap_param_sync

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* small nemo logger bug fix (#9607)

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix the dict format returned by scheduler method (#9609)

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Dataloading enhancements and bug fixes (#9595)

* fix dataloading + checkpoint restore

* clean up data sampler

* fix typo

* support passing multiple paths to data module

* fix validation dataloader

* fix dataloader len when using gradient accumulation

* fix progress bar

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* fix step count in loggers

* fix blended dataset

* address comments

* address comment

* move step logging into strategy

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix serialization of AutoResume (#9616)

* fix serialization of autoresume

* update undefined variables

Signed-off-by: Tugrul Konuk <[email protected]>

* Chat template support for megatron_gpt_eval.py (#9354)

* Bump PTL version (#9557)

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* [Resiliency] Straggler detection (#9473)

* Initial straggler det impl

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed CI code checks

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Removed unused import

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* remove submodule

Signed-off-by: Maanu Grover <[email protected]>

* Updated documentation; Updated callback params; Cosmetic changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed straggler det config; Added basic test

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixes in test_straggler_det.py

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Updated straggler callback API

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* stop_if_detected=False by default

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move model loading to separate function; call toContainer once; pad using closed formula

Signed-off-by: Alexandros Koumparoulis <[email protected]>
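The "closed formula" padding mentioned above is presumably the standard loop-free round-up-to-a-multiple idiom; a minimal sketch (the function name is illustrative, not the PR's actual helper):

```python
def round_up(n: int, multiple: int) -> int:
    # Round n up to the nearest multiple using ceiling division,
    # avoiding an explicit while-loop over pad increments.
    return -(-n // multiple) * multiple
```

For example, padding a batch of length 10 to a multiple of 8 yields 16, while a length that is already a multiple is left unchanged.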

* read prompts from file

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* If input prompt contains dict, apply model.tokenizer.chat_template

Signed-off-by: Alexandros Koumparoulis <[email protected]>
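The dispatch described in this commit can be sketched as follows. This is illustrative only, not NeMo's implementation: real chat templates are typically Jinja strings applied by the tokenizer, and `render_prompt` and its default template are hypothetical names for this example.

```python
def render_prompt(prompt, chat_template="{role}: {content}\n"):
    # Plain-string prompts pass through untouched; structured prompts
    # (a list of {"role", "content"} turns) are rendered via the template.
    if isinstance(prompt, str):
        return prompt
    return "".join(chat_template.format(**turn) for turn in prompt)
```

A JSONL prompts file can then mix raw strings and message lists, with each line routed through the same entry point.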

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* apply @Gal Leibovich's patch

Taken from: https://github.com/NVIDIA/NeMo/commit/17572905344db4692583e72799d55801a8860f35
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rename prompts_file to prompts_jsonl

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat_template param

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add ChatTemplateMixin to SentencePieceTokenizer

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat-template to text-gen-strat

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move load prompts to separate file

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove chat-template from text-gen-utils

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* make chat-template more generic

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add assert message

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* small refactor for chat_template_mixin

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo ckpt conv changes

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move rounding to function

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Jsonl support (#9611)

* Adding support to preprocess .jsonl and .jsonl.gz files in input directory

Signed-off-by: adityavavre <[email protected]>

* Adding support to preprocess .jsonl and .jsonl.gz files in input directory

Signed-off-by: adityavavre <[email protected]>

* Apply isort and black reformatting

Signed-off-by: adityavavre <[email protected]>

---------

Signed-off-by: adityavavre <[email protected]>
Signed-off-by: adityavavre <[email protected]>
Co-authored-by: adityavavre <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Add PEFT (#9490)

* initial commit for PEFT in nemo2

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* address comments

Signed-off-by: Chen Cui <[email protected]>

* make import easier

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* address comments

Signed-off-by: Chen Cui <[email protected]>

* Update nemo/collections/llm/peft/lora.py

Signed-off-by: Marc Romeyn <[email protected]>

* Some small fixes + adding more doc-strings

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding ModelTransform callback

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fixing type-hint for model_transform

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* fix import

Signed-off-by: Chen Cui <[email protected]>

* model transform for gemma llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fix model transform

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* change lora target default to all linear modules

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* Small fix in mixtral

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Integrating PEFT to the public-API + some fixes

* Big refactor to allow to load adapter-states

* Some fixes to support adapter_path

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Disabling ckpt reloading when adapter_path is passed

* Fix CLI

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove commented-out code

* Remove commented-out code

* Remove un-used import

* Fix callback imports

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fixing llm.pretrain

* Some small fixes

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix missing import + type-hint in finetune

* Adding PreemptionCallback + some more tests

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Clean up imports & clean up llm.api

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Trying to fix failing tests

* Remove __init__.py 2

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix failing test

* Trying to fix last failing test

---------

Signed-off-by: cuichenx <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Akoumparouli/mistral import instruct chat template fix (#9567)

* use bf16 by default for mistral conv

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat template

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use capitalized role names

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Remove .cuda calls, use device instead (#9602)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix converter default args (#9565)

* fix converter default args

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* mixtral export (#9603)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix: remove non_blocking from PTL's .cuda call (#9618)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Alit/mamba tmp (#9612)

* adding mamba support

* fix import mixins

* rm convert jamba

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* more cleanups

* use GPT text gen

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* fixing gbs in TP converter

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add reqs

* add tutorial

* minor fix to tutorial

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add mamba_tmp

* remove mamba import

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

---------

Signed-off-by: JRD971000 <[email protected]>
Signed-off-by: arendu <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: JRD971000 <[email protected]>
Co-authored-by: arendu <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* TitaNet Batch Verify Speaker (#9337)

* add batch_inference for verify_speakers method

Signed-off-by: [email protected] <[email protected]>

* remove not used package

Signed-off-by: [email protected] <[email protected]>

* change batch inference logic

Signed-off-by: [email protected] <[email protected]>

* fixup

Signed-off-by: [email protected] <[email protected]>

* requested changes

Signed-off-by: [email protected] <[email protected]>

* add verify_speakers_batch to docs

Signed-off-by: [email protected] <[email protected]>

* handle None durations in manifest

Signed-off-by: [email protected] <[email protected]>

* change logging text

Signed-off-by: [email protected] <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* check duration presence

Signed-off-by: [email protected] <[email protected]>

* add channel_selector to dataset configs

Signed-off-by: [email protected] <[email protected]>

---------

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Enable MCore checkpointing optimizations (#9505)

* Expose num processes in PyT Dist

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add parallel save/load optimizations from MCore

Signed-off-by: Mikołaj Błaż <[email protected]>

* Remove async utils from MCore

Signed-off-by: Mikołaj Błaż <[email protected]>

* Enable DistOpt parallel R/W

Signed-off-by: Mikołaj Błaż <[email protected]>

* Enable PyT Dist caching

Signed-off-by: Mikołaj Błaż <[email protected]>

* Small fixes

Signed-off-by: Mikołaj Błaż <[email protected]>

* Make sure DistCkptIO is instantiated from config

Signed-off-by: Mikołaj Błaż <[email protected]>

* Bump MCore version to v0.7

Signed-off-by: Mikołaj Błaż <[email protected]>

* Print load strategy

Signed-off-by: Mikołaj Błaż <[email protected]>

* Forward MCore to model space DistOpt

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add separate flag to control DistOpt parallel R/W

Signed-off-by: Mikołaj Błaż <[email protected]>

* Turn off parallel save by default

Signed-off-by: Mikołaj Błaż <[email protected]>

---------

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Change mixtral moe key name for trt-llm (#9620)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* change moe key values

Signed-off-by: Onur Yilmaz <[email protected]>

* add weight to the key

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix ckpt load bug (#9621)

* fix ckpt load bug

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* NeVA Minor Fixes (#9608)

* fix neva resume with empty param loaded for some pp stage

Signed-off-by: yaoyu-33 <[email protected]>

* fix crop size check

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix pretraining data sizes and weights (#9627)

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Alit/mamba (#9575)

* adding mamba support

* fix import mixins

* rm convert jamba

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* more cleanups

* use GPT text gen

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* fixing gbs in TP converter

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add reqs

* add tutorial

* minor fix to tutorial

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* add mamba dependencies

* add mcore tag

* modify dockerfile ci

* modify dockerfile ci

---------

Signed-off-by: JRD971000 <[email protected]>
Signed-off-by: arendu <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: JRD971000 <[email protected]>
Co-authored-by: arendu <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] async checkpointing support (#9466)

* add async checkpointing support

* fixes

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* add parallel read/write support and other optimizations

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* address comments, make dist checkpointing args configurable

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* fix small typo

Signed-off-by: ashors1 <[email protected]>

* Update default sharding type

Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Anna Shors <[email protected]>

* Update default sharding type

Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Anna Shors <[email protected]>

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Anna Shors <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix the arguments of forward_for_export function in msdd_models (#9624)

* Fix the arguments of forward_for_export function

Signed-off-by: Taejin Park <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

---------

Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: tango4j <[email protected]>
Co-authored-by: tango4j <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Change default parallel_save to False (#9632)

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Unwrap ckpt_io for model opt (async save) (#9622)

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* MCore T5 support for NeMo - Training (#9432)

* huvu/mcore_t5 first commit from local

* removing DEBUGGING prints

* cleaning megatron_lm_encoder_decoder_model.py code

* cleaning code

* adding Github action test

* only run mcore T5 test

* only run mcore T5 test

* only run mcore T5 test

* only run mcore T5 test

* reset .github/workflows/cicd-main.yml

* reset .github/workflows/cicd-main.yml

* adding condition self.mcore_t5 when running self.build_transformer_config()

* refactor megatron_lm_encoder_decoder_model.py to not use self.model

* only run T5-related tests

* remove all self.model

* reset cicd file

* reset cicd file

* updating code: remove duplicate if/else; add mcore/transformer_engine to config file

* adjust +model.mcore_t5=True

* Apply isort and black reformatting

Signed-off-by: huvunvidia <[email protected]>

---------

Signed-off-by: huvunvidia <[email protected]>
Co-authored-by: Huy Vu2 <[email protected]>
Co-authored-by: huvunvidia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Expose transformer_layer_spec inside GPTConfig (#9592)

* Expose transformer_layer_spec inside GPTConfig

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Expose layer-specs

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Update NeMo Clip to Use MCore Modules (#9594)

* update clip model and config file

Signed-off-by: yaoyu-33 <[email protected]>

* update clip for mcore

Signed-off-by: yaoyu-33 <[email protected]>

* MCore CLIP Fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix no mask

Signed-off-by: yaoyu-33 <[email protected]>

* few neva fixes

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip module

Signed-off-by: yaoyu-33 <[email protected]>

* add siglip loss

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix collate fn

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip conversion script

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip convert

Signed-off-by: yaoyu-33 <[email protected]>

* clip fixes

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* clean up script

Signed-off-by: yaoyu-33 <[email protected]>

* clip fixes

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix code styles

Signed-off-by: yaoyu-33 <[email protected]>

* Update siglip_loss.py

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add REST API to deploy module (#9539)

* Add REST API and FastAPI to deploy module

Signed-off-by: Abhishree <[email protected]>

* Add NemoQuery and requirements

Signed-off-by: Abhishree <[email protected]>

* Edit path for config.json

Signed-off-by: Abhishree <[email protected]>

* Add modifications for REST API for the correct functionality

Move service dir under deploy
Use NeMoQueryLLM instead of NemoQuery

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Apply isort and black reformatting

Signed-off-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com>

* Change default port for REST Service

Change the default port for the REST service, as the Triton server also uses that port by default.

Signed-off-by: Abhishree Thittenamane <[email protected]>

* Apply isort and black reformatting

Signed-off-by: athitten <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Signed-off-by: athitten <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: athitten <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Mistral + Mixtral Support for NeVa (#9459)

* mistral template support

Signed-off-by: paul-gibbons <[email protected]>

* get_specs neva fix

Signed-off-by: paul-gibbons <[email protected]>

* mistral update

Signed-off-by: paul-gibbons <[email protected]>

* fixed mistral tokenization

Signed-off-by: paul-gibbons <[email protected]>

* t…
* Adding context- & expert-parallelism to MegatronStrategy (#9525)

Signed-off-by: Tugrul Konuk <[email protected]>

* Add CICD test for Stable Diffusion (#9464)

* Add CICD test for Stable Diffusion

Signed-off-by: Michal Futrega <[email protected]>

* Update cicd-main.yml

Signed-off-by: Michal Futrega <[email protected]>

* Use single gpu runner

Signed-off-by: Michal Futrega <[email protected]>

---------

Signed-off-by: Michal Futrega <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Akoumparouli/nemo ux mixtral (#9446)

* use default collate if dataset does not have one

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mixtral config

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add convert_state

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix StateDictTransform for 2D layers, e.g. MoE

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pass num_moe_experts to specs

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update MixtralModel

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mini docstring

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* update mcoreddp call (#9345)

* update mcoreddp call

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update mcore commits

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Llama and Gemma (#9528)

* add llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add llama3

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fix typo

Signed-off-by: Chen Cui <[email protected]>

* enable importers with multiple models

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add gemma

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* checks

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] minor logging bug fixes (#9529)

* minor exp_manager bug fixes

* remove print statement

* fix docstring

* fix AppState defaults

---------

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* mcore distOpt restore fix (#9421)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Custom Tiktoken tokenizer.

Signed-off-by: Tugrul Konuk <[email protected]>

* Fixed the tokenizer decoding on special tokens.

Signed-off-by: Tugrul Konuk <[email protected]>

* Apply isort and black reformatting

Signed-off-by: ertkonuk <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Added token_to_id() method.

Signed-off-by: Tugrul Konuk <[email protected]>
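The tokenizer behavior these commits describe (byte-level encoding with special tokens decoded verbatim, plus a `token_to_id()` lookup) can be sketched in pure Python. This is a self-contained illustration only, not NeMo's `TiktokenTokenizer` (which wraps the `tiktoken` library and a trained BPE vocabulary); the class name and single-byte vocabulary are assumptions for the example.

```python
class ByteLevelTokenizer:
    """Toy tiktoken-style tokenizer: ids 0..255 are raw bytes,
    special tokens occupy the ids directly after them."""

    def __init__(self, special_tokens):
        self.special_tokens = {tok: 256 + i for i, tok in enumerate(special_tokens)}
        self.id_to_special = {i: tok for tok, i in self.special_tokens.items()}

    def text_to_ids(self, text):
        # One id per UTF-8 byte in this toy vocabulary.
        return list(text.encode("utf-8"))

    def ids_to_text(self, ids):
        # Decode byte runs, emitting special tokens verbatim between them.
        out, buf = [], bytearray()
        for i in ids:
            if i in self.id_to_special:
                out.append(buf.decode("utf-8", errors="replace"))
                buf = bytearray()
                out.append(self.id_to_special[i])
            else:
                buf.append(i)
        out.append(buf.decode("utf-8", errors="replace"))
        return "".join(out)

    def token_to_id(self, token):
        # Special tokens resolve through their own table; otherwise the
        # token must be a single byte in this simplified vocabulary.
        if token in self.special_tokens:
            return self.special_tokens[token]
        (byte_id,) = token.encode("utf-8")
        return byte_id
```

With `tok = ByteLevelTokenizer(["<s>", "</s>"])`, `tok.token_to_id("<s>")` resolves through the special-token table while `tok.ids_to_text` round-trips mixed special/byte id sequences, mirroring the decode-on-special-tokens fix above.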

* Update neva conversion script from and to HF (#9296)

* Update NeMo script

Signed-off-by: yaoyu-33 <[email protected]>

* Fix example scripts

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update convert_llava_nemo_to_hf.py

Signed-off-by: yaoyu-33 <[email protected]>

* address comments

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* vLLM Export Support (#9381)

* Export implementation for vLLM 0.4.3.

Supports LLAMA2, Mistral, Mixtral (unverified), Gemma and StarCoder2 models.

The nemo.export.tensorrt_llm alias was removed to avoid initializing TRT-LLM when importing anything from nemo.export.

Signed-off-by: Alexey Panteleev <[email protected]>

* Fixed some CodeQL warnings.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Removed empty files.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Updated the integration for vLLM 0.5.0.

Signed-off-by: Alexey Panteleev <[email protected]>

* Updated the vLLM deployment interface to use max_output_len instead of max_output_token.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Moved the Exporter class to nemo/export and renamed its file to vllm_exporter.py, to be more similar to TRT-LLM.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Implemented vLLM support in the export tests, added functional testing, implemented forward evaluation on vLLM without Triton.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Moved the vLLM deployment functionality to the common deploy_triton.py script.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Fixed the CodeQL discovered issues.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Fixed one more return of a wrong dimensionality...

Signed-off-by: Alexey Panteleev <[email protected]>

* More wrong dimensionality returns.

Signed-off-by: Alexey Panteleev <[email protected]>

---------

Signed-off-by: Alexey Panteleev <[email protected]>
Signed-off-by: apanteleev <[email protected]>
Co-authored-by: apanteleev <[email protected]>
Co-authored-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* PL: Delete precision if using plugin. TODO switch to MegatronTrainerBuilder (#9535)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add page context fmha (#9526)

Signed-off-by: Tugrul Konuk <[email protected]>

* extend get_gpt_layer_modelopt_spec to support MoE (#9532)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix mock data generation for legacy dataset (#9530)

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] IO fixes (#9512)

* Improve IOMixin.io_transform_args to handle dataclasses better

* Dump task json + img inside NeMoLogger

* Adding store_io to train task

* Update opt.connect to also propagate to __io__

* Rename opt to optim for consistency

* Moving to using safe serialization using fiddle, only use cloudpickle when needed

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Using Config from fiddle instead of sdk for now

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move enable_nemo_ckpt_io from MegatronStrategy to ModelCheckpoint

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move nemo-ckpt to _get_finalize_save_checkpoint_callback

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Update TrainerContext & io.load_ckpt

* Use renamed TrainerContext inside ModelCheckpoint

* Remove double io saving

* Rename lightning.pytorch.opt -> optim

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove store_io from train-task

* Adding fiddle-extension for torch

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move fdl_torch import

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding dtype to serialization

* Some fixes

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Make TransformerConfig inherit from IOMixin to fix serialization error

* Make TransformerConfig inherit from IOMixin to fix serialization error

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Add support for BuiltinFunctionType

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Add missing import

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix dataclass fields

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Test C++ runtime on demand in nemo_export.py to avoid possible OOMs (#9544)

* Add test_cpp_runtime flag

Signed-off-by: Jan Lasek <[email protected]>

* Apply isort and black reformatting

Signed-off-by: janekl <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Co-authored-by: janekl <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix lhotse tests for v1.24.2 (#9546)

* Fix lhotse tests for v1.24.0

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix RIR test

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* gpu_unitTests_notOptional (#9551)

Signed-off-by: Tugrul Konuk <[email protected]>

* add reset learning rate functionality (#9372)

* add reset_lr functionality

Signed-off-by: dimapihtar <[email protected]>

* fix reset_lr logic

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* move reset_lr from optim section

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr value to config

Signed-off-by: dimapihtar <[email protected]>

* set reset_lr False by default

Signed-off-by: dimapihtar <[email protected]>

* remove extra line

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr test

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr test

Signed-off-by: dimapihtar <[email protected]>

* remove extra quote

Signed-off-by: dimapihtar <[email protected]>

* add ability to reset schedule's max_steps and decay_steps

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* change scheduler's first step logic when using reset_lr

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* fix reset_lr logic

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* update reset_lr comments

Signed-off-by: dimapihtar <[email protected]>

* add use cases for reset_lr feature

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
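The reset_lr entry above also adds the ability to reset the schedule's max_steps and decay_steps when resuming. A minimal sketch of that idea (an illustrative assumption about how the remaining horizon might be computed; `shift_schedule` is a hypothetical helper, not the actual NeMo implementation):

```python
def shift_schedule(max_steps: int, decay_steps: int, resume_step: int):
    # Hypothetical sketch: when reset_lr restarts the LR schedule at the
    # resume point, the remaining warmup/decay horizon shrinks by the
    # number of steps already consumed.
    return max_steps - resume_step, decay_steps - resume_step

# Resuming at step 40k of a 100k-step run with 90k decay steps:
print(shift_schedule(100_000, 90_000, 40_000))  # -> (60000, 50000)
```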

* Add Python AIStore SDK to container and bump min Lhotse version (#9537)

* Add Python AIStore SDK to requirements and bump min Lhotse version

Signed-off-by: Piotr Żelasko <[email protected]>

* Move AIStore Python SDK to Dockerfile, remove matplotlib/ipywidgets deps

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Adding 'use_dynamo' option for export to use onnx.dynamo_export() instead of onnx.export() (#9147)

* Initial WARs to implement dynamo option for export

Signed-off-by: Boris Fomitchev <[email protected]>

* including weights in .onnx

Signed-off-by: Boris Fomitchev <[email protected]>

* dynamo_export works for many small models

Signed-off-by: Boris Fomitchev <[email protected]>

* External weights behaviour fixed

Signed-off-by: Boris Fomitchev <[email protected]>

* Cleanup

Signed-off-by: Boris Fomitchev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: borisfom <[email protected]>

* print cleaned up

Signed-off-by: Boris Fomitchev <[email protected]>

* Added overloadable dynamic_shapes_for_export

Signed-off-by: Boris Fomitchev <[email protected]>

* Addressing code review

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing CI issues

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing CI test failure

Signed-off-by: Boris Fomitchev <[email protected]>

* Eliminated test cross-contamination

Signed-off-by: Boris Fomitchev <[email protected]>

---------

Signed-off-by: Boris Fomitchev <[email protected]>
Signed-off-by: borisfom <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Fix tokenizer IO (#9555)

* Adding tokenizer to io-test + making it pass

* Handling tokenizer correctly inside dump_io

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Removing not used import

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo UX] Move mistral_7b.py to mistral.py (#9545)

* Move mistral_7b.py to mistral.py

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rename MixtralConfig to MixtralConfig8x7B

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mistral rename: mistralconfig7b & mistralmodel

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Use closed-formula to round by multiple (#9307)

* Use closed-formula to round by multiple

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
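The closed-formula rounding referenced above can be sketched as follows (a minimal illustration of the technique; `round_up_to_multiple` is an illustrative name, not the exact NeMo helper):

```python
def round_up_to_multiple(value: int, multiple: int) -> int:
    # Closed formula: rounds up without a loop or conditional.
    return ((value + multiple - 1) // multiple) * multiple

# e.g. padding a vocab size up to a multiple of 64:
print(round_up_to_multiple(32000, 64))  # -> 32000 (already a multiple)
print(round_up_to_multiple(32017, 64))  # -> 32064
```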

* ci: Do not attempt to send slack on fork (#9556)

* ci: Do not attempt to send slack on fork

Signed-off-by: Oliver Koenig <[email protected]>

* test

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix nemo export test (#9547)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* fix export test

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix SDXL incorrect name in docs (#9534)

Signed-off-by: Tugrul Konuk <[email protected]>

* GPU unit tests: Mark flaky tests to be fixed (#9559)

Signed-off-by: Tugrul Konuk <[email protected]>

* Bump PTL version (#9557)

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Resiliency] Straggler detection (#9473)

* Initial straggler det impl

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed CI code checks

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Removed unused import

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* remove submodule

Signed-off-by: Maanu Grover <[email protected]>

* Updated documentation; Updated callback params; Cosmetic changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed straggler det config; Added basic test

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixes in test_straggler_det.py

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Updated straggler callback API

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* stop_if_detected=False by default

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* switch to torch_dist as default dist checkpointing backend (#9541)

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Checkpointing bug fixes (#9562)

* fix checkpoint loading

* fix

* fixes

* another fix

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add tps and pps params to the export script (#9558)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* fix export test

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* remove n_gpus param

Signed-off-by: Onur Yilmaz <[email protected]>

* add and fix parameters

Signed-off-by: Onur Yilmaz <[email protected]>

* fix deploy script

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* rename tps and pps params

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: oyilmaz-nvidia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Consolidate gpt continue training script into pretraining script (#9413)

* Consolidate gpt continue training with pretraining

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix default config

Signed-off-by: yaoyu-33 <[email protected]>

* Add github action cicd

Signed-off-by: yaoyu-33 <[email protected]>

* extract _integrate_original_checkpoint_data as a method

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix getattr

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "Add github action cicd"

This reverts commit a453f16ba2be6413db932623009da893208acdd5.

* Update comments in nlp_overrides.py

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add support to change Multi task model prompt (#9542)

* Add support to change Multi task model prompt

Signed-off-by: smajumdar <[email protected]>

* Add support to change Multi task model prompt

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Update nemo/collections/common/prompts/formatter.py

Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>

* Address comments

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Address comments

Signed-off-by: smajumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: titu1994 <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add Multimodal Exporter (#9256)

* Add video-neva TRT export

* Add TRT inference

* Change config

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change export params

* Remove unused import

* Add neva export

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change unpack nemo

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Add trt infer config

* Fix neva trt inference

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Add exporter

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix infer

* Add PyTriton

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix deploy wrong dim

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change to pass PIL Image

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix video neva deploy

* Change query

* Change deploy

* Remove unused import

* Change ptuning

* Change to mm exporter

* Add script

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix script

---------

Signed-off-by: meatybobby <[email protected]>
Co-authored-by: meatybobby <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Enable encoder adapters for Canary and MultiTaskAED models (#9409)

* Fix assertions for adapter types

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Cleanup

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Finalize support for decoder adapters

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* fix the freeze/unfreeze problem by replacing as_frozen with torch.inference_mode

* Apply isort and black reformatting

Signed-off-by: weiqingw4ng <[email protected]>

* Update tests to new generic way of module update

Signed-off-by: smajumdar <[email protected]>

* Finalize code for update module

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Fix variable name

Signed-off-by: smajumdar <[email protected]>

* Finalize projection support for transformer mha adapters

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Correct implementation of freeze restore

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Corrects the implementation of replace_adapter_modules to limit to just the top level modules

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Remove registration of Transformer MHA

Signed-off-by: smajumdar <[email protected]>

* Remove registration of Transformer MHA

Signed-off-by: smajumdar <[email protected]>

* Address reviewer comments

Signed-off-by: smajumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: titu1994 <[email protected]>
Signed-off-by: weiqingw4ng <[email protected]>
Co-authored-by: Weiqing Wang <[email protected]>
Co-authored-by: weiqingw4ng <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* pass option through (#9570)

Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* PTQ refinements (#9574)

* Rename megatron_gpt_quantization -> megatron_gpt_ptq

Signed-off-by: Jan Lasek <[email protected]>

* Configure export.save_path as dir or tarball

Signed-off-by: Jan Lasek <[email protected]>

* PTQ docs update

Signed-off-by: Jan Lasek <[email protected]>

* Make model_type optional in case of quantized checkpoints

Signed-off-by: Jan Lasek <[email protected]>

* Drop unused save_nemo_model_config argument

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Audio model collection (#9263)

* Audio model collection

Signed-off-by: Ante Jukić <[email protected]>

* Apply isort and black reformatting

Signed-off-by: anteju <[email protected]>

* Fix imports

Signed-off-by: Ante Jukić <[email protected]>

* Addressed PR comments

Signed-off-by: Ante Jukić <[email protected]>

* Apply isort and black reformatting

Signed-off-by: anteju <[email protected]>

---------

Signed-off-by: Ante Jukić <[email protected]>
Signed-off-by: anteju <[email protected]>
Co-authored-by: anteju <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Fix Trainer serialization (#9571)

* Fix Trainer serialization

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Update click version requirement (#9580)

Signed-off-by: Dong Hyuk Chang <[email protected]>
Co-authored-by: Dong Hyuk Chang <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Fault tolerance] Heartbeat detection (#9352)

* Fault tolerance related changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Cosmetic changes in documentation

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Doc update round2

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Co-authored-by: Jacek Bieniusiewicz <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add ModelOpt QAT example for Llama2 SFT model (#9326)

* add INT4 QAT example for Llama2 SFT model

Signed-off-by: Keval Morabia <[email protected]>

* Add config parameter to control kv cache quantization

Signed-off-by: Keval Morabia <[email protected]>

* Fix typo in cicd-main.yml for QAT test

Signed-off-by: Keval Morabia <[email protected]>

* fix nlp_overrides.py

Signed-off-by: Keval Morabia <[email protected]>

* address reviewer feedback

Signed-off-by: Keval Morabia <[email protected]>

* quantize unwrapped model

Signed-off-by: Keval Morabia <[email protected]>

* add compress export argument for qat config

Signed-off-by: Keval Morabia <[email protected]>

---------

Signed-off-by: Keval Morabia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Set TE flag in legacy -> mcore conversion script (#9585)

* set TE flag

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Add fabric-API for manual forward-pass (#9577)

* First pass over fabric-API

* Adding Trainer -> Fabric conversion

* Some small fixes to get a forward-pass in Fabric working

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding doc-string to Fabric.import_model

* Adding track_io to io_init of Fabric

* Fix Fabric.load_model + add doc-string

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove unused import

* Some small fixes

* Fix failing test

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Add SDK-factories to llm-collection (#9589)

* Adding sdk-factories to llm-collection

* Removing _model from mistral + mixtral

* Expose lr_scheduler inside lightning

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Multimodal projection layer adapter fix for PP>1 (#9445)

* enabling multimodal adapters to load in PP>1

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* parameterizing validate_access_integrity, set to false when PP>1

Signed-off-by: paul-gibbons <[email protected]>

* formatting fix

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* update nlp_model.py

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* update modelPT with validate_access_integrity

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* updating save_restore_connector w/ validate_access_integrity

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* addressing comment

Signed-off-by: paul-gibbons <[email protected]>

* adding validate_access_integrity to super().load_config_and_state_dict()

Signed-off-by: paul-gibbons <[email protected]>

* testing reorder of validate_access_integrity for CI failures

Signed-off-by: paul-gibbons <[email protected]>

---------

Signed-off-by: paul-gibbons <[email protected]>
Signed-off-by: paul-gibbons <[email protected]>
Co-authored-by: paul-gibbons <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add offline quantization script for QLoRA deployment (#9455)

* add qlora offline quantization script

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* clean

Signed-off-by: Chen Cui <[email protected]>

* docstring

Signed-off-by: Chen Cui <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* qlora support more models (#9488)

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Some improvements to NeMoLogger (#9591)

Signed-off-by: Tugrul Konuk <[email protected]>

* Set n_gpu to None in nemo export (#9593)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* set ngpus to None

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Inflight nemo model export support (#9527)

* online model conversion and refit

Signed-off-by: Jimmy Zhang <[email protected]>

* clean code

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup

Signed-off-by: Jimmy Zhang <[email protected]>

* add refit, cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* combine weight conversion functions

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <[email protected]>

* remove debug print

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* fix single gpu and cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <[email protected]>

---------

Signed-off-by: JimmyZhang12 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* vLLM Export Improvements (#9596)

* Separated the vLLM export functionality from the common deployment script into deploy_vllm_triton.py.

Signed-off-by: Alexey Panteleev <[email protected]>

* Fixed vocab_size for LLAMA3.

Signed-off-by: Alexey Panteleev <[email protected]>

* Export test: fixed deployment testing w/o Megatron, made functional tests optional, added --gpu_memory_utilization.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Addressing review and CodeQL comments.

Signed-off-by: Alexey Panteleev <[email protected]>

---------

Signed-off-by: Alexey Panteleev <[email protected]>
Signed-off-by: apanteleev <[email protected]>
Co-authored-by: apanteleev <[email protected]>
Co-authored-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Set finalize_model_grads_func in on_fit_start instead to make sure it's being called (#9599)

Signed-off-by: Tugrul Konuk <[email protected]>

* Set no_sync_func & grad_sync_func (#9601)

* Set no_sync_func & grad_sync_func

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* set overlap_param_sync

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* small nemo logger bug fix (#9607)

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix the dict format returned by scheduler method (#9609)

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Dataloading enhancements and bug fixes (#9595)

* fix dataloading + checkpoint restore

* clean up data sampler

* fix typo

* support passing multiple paths to data module

* fix validation dataloader

* fix dataloader len when using gradient accumulation

* fix progress bar

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* fix step count in loggers

* fix blended dataset

* address comments

* address comment

* move step logging into strategy

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix serialization of AutoResume (#9616)

* fix serialization of autoresume

* update undefined variables

Signed-off-by: Tugrul Konuk <[email protected]>

* Chat template support for megatron_gpt_eval.py (#9354)

* Bump PTL version (#9557)

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* [Resiliency] Straggler detection (#9473)

* Initial straggler det impl

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed CI code checks

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Removed unused import

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* remove submodule

Signed-off-by: Maanu Grover <[email protected]>

* Updated documentation; Updated callback params; Cosmetic changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed straggler det config; Added basic test

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixes in test_straggler_det.py

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Updated straggler callback API

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* stop_if_detected=False by default

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move model loading to separate function; call toContainer once; pad using closed formula

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* read prompts from file

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* If input prompt contains dict, apply model.tokenizer.chat_template

Signed-off-by: Alexandros Koumparoulis <[email protected]>
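
The dispatch described in the commit above — plain strings pass through, structured prompts get a chat template applied — can be sketched as follows. The helper name and the simple format-string template are illustrative assumptions, not NeMo's actual `chat_template` (which is typically a richer, Jinja-style template on the tokenizer):

```python
def render_prompt(prompt, chat_template="{role}: {content}\n"):
    """Apply a chat template when the prompt is structured, else pass it through.

    `chat_template` here is a plain format string for illustration only;
    real tokenizers usually carry a Jinja template with special tokens.
    """
    if isinstance(prompt, str):
        return prompt
    # A structured prompt is a list of {"role": ..., "content": ...} dicts.
    return "".join(chat_template.format(**turn) for turn in prompt)
```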

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* apply @Gal Leibovich's patch

Taken from: https://github.com/NVIDIA/NeMo/commit/17572905344db4692583e72799d55801a8860f35
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rename prompts_file to prompts_jsonl

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat_template param

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add ChatTemplateMixin to SentencePieceTokenizer

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat-template to text-gen-strat

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move load prompts to separate file

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove chat-template from text-gen-utils

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* make chat-template more generic

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add assert message

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* small refactor for chat_template_mixin

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo ckpt conv changes

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move rounding to function

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Jsonl support (#9611)

* Adding support to preprocess .jsonl and .jsonl.gz files in input directory

Signed-off-by: adityavavre <[email protected]>
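
Handling `.jsonl` and `.jsonl.gz` uniformly comes down to picking the right opener by extension. A minimal stdlib sketch (the helper name and layout are illustrative, not the NeMo preprocessing code):

```python
import gzip
import json
from pathlib import Path

def iter_jsonl_records(path):
    """Yield parsed JSON objects from a .jsonl or .jsonl.gz file."""
    # gzip.open handles the compressed case; builtin open handles plain text.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)
```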

* Adding support to preprocess .jsonl and .jsonl.gz files in input directory

Signed-off-by: adityavavre <[email protected]>

* Apply isort and black reformatting

Signed-off-by: adityavavre <[email protected]>

---------

Signed-off-by: adityavavre <[email protected]>
Signed-off-by: adityavavre <[email protected]>
Co-authored-by: adityavavre <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Add PEFT (#9490)

* initial commit for PEFT in nemo2

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* address comments

Signed-off-by: Chen Cui <[email protected]>

* make import easier

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* address comments

Signed-off-by: Chen Cui <[email protected]>

* Update nemo/collections/llm/peft/lora.py

Signed-off-by: Marc Romeyn <[email protected]>

* Some small fixes + adding more doc-strings

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding ModelTransform callback

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fixing type-hint for model_transform

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* fix import

Signed-off-by: Chen Cui <[email protected]>

* model transform for gemma llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fix model transform

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* change lora target default to all linear modules

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* Small fix in mixtral

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Integrating PEFT to the public-API + some fixes

* Big refactor to allow to load adapter-states

* Some fixes to support adapter_path

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Disabling ckpt reloading when adapter_path is passed

* Fix CLI

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove commented-out code

* Remove commented-out code

* Remove un-used import

* Fix callback imports

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fixing llm.pretrain

* Some small fixes

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix missing import + type-hint in finetune

* Adding PreemptionCallback + some more tests

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Clean up imports & clean up llm.api

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Trying to fix failing tests

* Remove __init__.py 2

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix failing test

* Trying to fix last failing test

---------

Signed-off-by: cuichenx <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Akoumparouli/mistral import instruct chat template fix (#9567)

* use bf16 by default for mistral conversion

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat template

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use capitalized role names

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Remove .cuda calls, use device instead (#9602)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix converter default args (#9565)

* fix converter default args

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* mixtral export (#9603)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix: remove non_blocking from PTL's .cuda call (#9618)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Alit/mamba tmp (#9612)

* adding mamba support

* fix import mixins

* rm convert jamba

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* more cleanups

* use GPT text gen

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* fixing gbs in TP converter

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add reqs

* add tutorial

* minor fix to tutorial

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add mamba_tmp

* remove mamba import

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

---------

Signed-off-by: JRD971000 <[email protected]>
Signed-off-by: arendu <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: JRD971000 <[email protected]>
Co-authored-by: arendu <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* TitaNet Batch Verify Speaker (#9337)

* add batch_inference for verify_speakers method

Signed-off-by: [email protected] <[email protected]>

* remove not used package

Signed-off-by: [email protected] <[email protected]>

* change batch inference logic

Signed-off-by: [email protected] <[email protected]>

* fixup

Signed-off-by: [email protected] <[email protected]>

* requested changes

Signed-off-by: [email protected] <[email protected]>

* add verify_speakers_batch to docs

Signed-off-by: [email protected] <[email protected]>

* handle None durations in manifest

Signed-off-by: [email protected] <[email protected]>

* change logging text

Signed-off-by: [email protected] <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* check duration presence

Signed-off-by: [email protected] <[email protected]>

* add channel_selector to dataset configs

Signed-off-by: [email protected] <[email protected]>

---------

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Enable MCore checkpointing optimizations (#9505)

* Expose num processes in PyT Dist

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add parallel save/load optimizations from MCore

Signed-off-by: Mikołaj Błaż <[email protected]>

* Remove async utils from MCore

Signed-off-by: Mikołaj Błaż <[email protected]>

* Enable DistOpt parallel R/W

Signed-off-by: Mikołaj Błaż <[email protected]>

* Enable PyT Dist caching

Signed-off-by: Mikołaj Błaż <[email protected]>

* Small fixes

Signed-off-by: Mikołaj Błaż <[email protected]>

* Make sure DistCkptIO is instantiated from config

Signed-off-by: Mikołaj Błaż <[email protected]>

* Bump MCore version to v0.7

Signed-off-by: Mikołaj Błaż <[email protected]>

* Print load strategy

Signed-off-by: Mikołaj Błaż <[email protected]>

* Forward MCore to model space DistOpt

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add separate flag to control DistOpt parallel R/W

Signed-off-by: Mikołaj Błaż <[email protected]>

* Turn off parallel save by default

Signed-off-by: Mikołaj Błaż <[email protected]>

---------

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Change mixtral moe key name for trt-llm (#9620)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* change moe key values

Signed-off-by: Onur Yilmaz <[email protected]>

* add weight to the key

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix ckpt load bug (#9621)

* fix ckpt load bug

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* NeVA Minor Fixes (#9608)

* fix neva resume with empty param loaded for some pp stage

Signed-off-by: yaoyu-33 <[email protected]>

* fix crop size check

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix pretraining data sizes and weights (#9627)

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Alit/mamba (#9575)

* adding mamba support

* fix import mixins

* rm convert jamba

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* more cleanups

* use GPT text gen

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* fixing gbs in TP converter

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add reqs

* add tutorial

* minor fix to tutorial

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* add mamba dependencies

* add mcore tag

* modify dockerfile ci

* modify dockerfile ci

---------

Signed-off-by: JRD971000 <[email protected]>
Signed-off-by: arendu <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: JRD971000 <[email protected]>
Co-authored-by: arendu <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] async checkpointing support (#9466)

* add async checkpointing support

* fixes

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* add parallel read/write support and other optimizations

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* address comments, make dist checkpointing args configurable

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* fix small typo

Signed-off-by: ashors1 <[email protected]>

* Update default sharding type

Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Anna Shors <[email protected]>

* Update default sharding type

Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Anna Shors <[email protected]>

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Anna Shors <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix the arguments  of forward_for_export function in msdd_models (#9624)

* Fix the arguments  of forward_for_export function

Signed-off-by: Taejin Park <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

---------

Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: tango4j <[email protected]>
Co-authored-by: tango4j <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Change default parallel_save to False (#9632)

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Unwrap ckpt_io for model opt (async save) (#9622)

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* MCore T5 support for NeMo - Training (#9432)

* huvu/mcore_t5 first commit from local

* removing DEBUGGING prints

* cleaning megatron_lm_encoder_decoder_model.py code

* cleaning code

* adding Github action test

* only run mcore T5 test

* only run mcore T5 test

* only run mcore T5 test

* only run mcore T5 test

* reset .github/workflows/cicd-main.yml

* reset .github/workflows/cicd-main.yml

* adding condition self.mcore_t5 when running self.build_transformer_config()

* refactor megatron_lm_encoder_decoder_model.py to not use self.model

* only run T5-related tests

* remove all self.model

* reset cicd file

* reset cicd file

* updating code: remove duplicate if/else; adding mcore/transformer_engine to config file

* adjust +model.mcore_t5=True

* Apply isort and black reformatting

Signed-off-by: huvunvidia <[email protected]>

---------

Signed-off-by: huvunvidia <[email protected]>
Co-authored-by: Huy Vu2 <[email protected]>
Co-authored-by: huvunvidia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Expose transformer_layer_spec inside GPTConfig (#9592)

* Expose transformer_layer_spec inside GPTConfig

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Expose layer-specs

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Update NeMo Clip to Use MCore Modules (#9594)

* update clip model and config file

Signed-off-by: yaoyu-33 <[email protected]>

* update clip for mcore

Signed-off-by: yaoyu-33 <[email protected]>

* MCore CLIP Fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix no mask

Signed-off-by: yaoyu-33 <[email protected]>

* few neva fixes

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip module

Signed-off-by: yaoyu-33 <[email protected]>

* add siglip loss

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix collate fn

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip conversion script

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip convert

Signed-off-by: yaoyu-33 <[email protected]>

* clip fixes

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* clean up script

Signed-off-by: yaoyu-33 <[email protected]>

* clip fixe…
* Adding context- & expert-parallelism to MegatronStrategy (#9525)

Signed-off-by: Tugrul Konuk <[email protected]>

* Add CICD test for Stable Diffusion (#9464)

* Add CICD test for Stable Diffusion

Signed-off-by: Michal Futrega <[email protected]>

* Update cicd-main.yml

Signed-off-by: Michal Futrega <[email protected]>

* Use single gpu runner

Signed-off-by: Michal Futrega <[email protected]>

---------

Signed-off-by: Michal Futrega <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Akoumparouli/nemo ux mixtral (#9446)

* use default collate if dataset does not have one

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mixtral config

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add convert_state

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix StateDictTransform for 2D layers, e.g. MoE

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pass num_moe_experts to specs

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update MixtralModel

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mini docstring

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* update mcoreddp call (#9345)

* update mcoreddp call

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update mcore commits

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Llama and Gemma (#9528)

* add llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add llama3

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fix typo

Signed-off-by: Chen Cui <[email protected]>

* enable importers with multiple models

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add gemma

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* checks

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] minor logging bug fixes (#9529)

* minor exp_manager bug fixes

* remove print statement

* fix docstring

* fix AppState defaults

---------

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* mcore distOpt restore fix (#9421)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Custom Tiktoken tokenizer.

Signed-off-by: Tugrul Konuk <[email protected]>

* Fixed the tokenizer decoding on special tokens.

Signed-off-by: Tugrul Konuk <[email protected]>

* Apply isort and black reformatting

Signed-off-by: ertkonuk <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Added token_to_id() method.

Signed-off-by: Tugrul Konuk <[email protected]>
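
For a tiktoken-style tokenizer, `token_to_id()` is essentially a reverse-vocabulary lookup. A minimal stand-in showing the shape of the API (the class and toy vocabulary below are illustrative assumptions, not the NeMo Tiktoken tokenizer itself):

```python
class TinyTokenizer:
    """Toy tokenizer holding a token-to-id vocabulary, mirroring the new method."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)                       # token string -> int id
        self.inv_vocab = {i: t for t, i in self.vocab.items()}

    def token_to_id(self, token):
        # Raise on unknown tokens rather than silently guessing an id.
        if token not in self.vocab:
            raise KeyError(f"unknown token: {token!r}")
        return self.vocab[token]

    def id_to_token(self, token_id):
        return self.inv_vocab[token_id]
```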

* Update neva conversion script from and to HF (#9296)

* Update NeMo script

Signed-off-by: yaoyu-33 <[email protected]>

* Fix example scripts

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update convert_llava_nemo_to_hf.py

Signed-off-by: yaoyu-33 <[email protected]>

* address comments

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* vLLM Export Support (#9381)

* Export implementation for vLLM 0.4.3.

Supports LLAMA2, Mistral, Mixtral (unverified), Gemma and StarCoder2 models.

The nemo.export.tensorrt_llm alias was removed to avoid initializing TRT-LLM when importing anything from nemo.export.

Signed-off-by: Alexey Panteleev <[email protected]>

* Fixed some CodeQL warnings.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Removed empty files.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Updated the integration for vLLM 0.5.0.

Signed-off-by: Alexey Panteleev <[email protected]>

* Updated the vLLM deployment interface to use max_output_len instead of max_output_token.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Moved the Exporter class to nemo/export and renamed its file to vllm_exporter.py, to be more similar to TRT-LLM.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Implemented vLLM support in the export tests, added functional testing, implemented forward evaluation on vLLM without Triton.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Moved the vLLM deployment functionality to the common deploy_triton.py script.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Fixed the CodeQL discovered issues.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Fixed one more return of a wrong dimensionality...

Signed-off-by: Alexey Panteleev <[email protected]>

* More wrong dimensionality returns.

Signed-off-by: Alexey Panteleev <[email protected]>

---------

Signed-off-by: Alexey Panteleev <[email protected]>
Signed-off-by: apanteleev <[email protected]>
Co-authored-by: apanteleev <[email protected]>
Co-authored-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* PL: Delete precision if using plugin. TODO switch to MegatronTrainerBuilder (#9535)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add page context fmha (#9526)

Signed-off-by: Tugrul Konuk <[email protected]>

* extend get_gpt_layer_modelopt_spec to support MoE (#9532)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix mock data generation for legacy dataset (#9530)

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] IO fixes (#9512)

* Improve IOMixin.io_transform_args to handle dataclasses better

* Dump task json + img inside NeMoLogger

* Adding store_io to train task

* Update opt.connect to also propagate to __io__

* Rename opt to optim for consistency

* Moving to using safe serialization using fiddle, only use cloudpickle when needed

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Using Config from fiddle instead of sdk for now

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move enable_nemo_ckpt_io from MegatronStrategy to ModelCheckpoint

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move nemo-ckpt to _get_finalize_save_checkpoint_callback

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Update TrainerContext & io.load_ckpt

* Use renamed TrainerContext inside ModelCheckpoint

* Remove double io saving

* Rename lightning.pytorch.opt -> optim

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove store_io from train-task

* Adding fiddle-extension for torch

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Move fdl_torch import

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding dtype to serialization

* Some fixes

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Make TransformerConfig inherit from IOMixin to fix serialization error

* Make TransformerConfig inherit from IOMixin to fix serialization error

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Add support for BuiltinFunctionType

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Add missing import

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix dataclass fields

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Test C++ runtime on demand in nemo_export.py to avoid possible OOMs (#9544)

* Add test_cpp_runtime flag

Signed-off-by: Jan Lasek <[email protected]>

* Apply isort and black reformatting

Signed-off-by: janekl <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Co-authored-by: janekl <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix lhotse tests for v1.24.2 (#9546)

* Fix lhotse tests for v1.24.0

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix RIR test

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* gpu_unitTests_notOptional (#9551)

Signed-off-by: Tugrul Konuk <[email protected]>

* add reset learning rate functionality (#9372)

* add reset_lr functionality

Signed-off-by: dimapihtar <[email protected]>

* fix reset_lr logic

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* move reset_lr from optim section

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr value to config

Signed-off-by: dimapihtar <[email protected]>

* set reset_lr False by default

Signed-off-by: dimapihtar <[email protected]>

* remove extra line

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr test

Signed-off-by: dimapihtar <[email protected]>

* add reset_lr test

Signed-off-by: dimapihtar <[email protected]>

* remove extra quote

Signed-off-by: dimapihtar <[email protected]>

* add ability to reset schedule's max_steps and decay_steps

Signed-off-by: dimapihtar <[email protected]>
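
Conceptually, resetting the schedule means restarting the decay clock from the resume step rather than continuing a partly-consumed schedule. A hedged sketch of a linear decay with an optional reset point (function and parameter names are illustrative, not the NeMo `reset_lr` implementation):

```python
def linear_decay_lr(step, base_lr, decay_steps, reset_step=0):
    """Linearly decay base_lr to 0 over decay_steps.

    If reset_step is set, the schedule restarts from that step, as if
    training had just begun -- the essence of a reset_lr-style feature.
    """
    effective = max(step - reset_step, 0)
    frac = min(effective / decay_steps, 1.0)
    return base_lr * (1.0 - frac)
```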

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* change scheduler's first step logic when using reset_lr

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* fix reset_lr logic

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* revert config

Signed-off-by: dimapihtar <[email protected]>

* update reset_lr comments

Signed-off-by: dimapihtar <[email protected]>

* add use cases for reset_lr feature

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add Python AIStore SDK to container and bump min Lhotse version (#9537)

* Add Python AIStore SDK to requirements and bump min Lhotse version

Signed-off-by: Piotr Żelasko <[email protected]>

* Move AIStore Python SDK to Dockerfile, remove matplotlib/ipywidgets deps

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Adding 'use_dynamo' option for export to use onnx.dynamo_export() instead of onnx.export() (#9147)

* Initial WARs to implement dynamo option for export

Signed-off-by: Boris Fomitchev <[email protected]>

* including weights in .onnx

Signed-off-by: Boris Fomitchev <[email protected]>

* dynamo_export works for many small models

Signed-off-by: Boris Fomitchev <[email protected]>

* External weights behaviour fixed

Signed-off-by: Boris Fomitchev <[email protected]>

* Cleanup

Signed-off-by: Boris Fomitchev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: borisfom <[email protected]>

* print cleaned up

Signed-off-by: Boris Fomitchev <[email protected]>

* Added overloadable dynamic_shapes_for_export

Signed-off-by: Boris Fomitchev <[email protected]>

* Addressing code review

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing CI issues

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing CI test failure

Signed-off-by: Boris Fomitchev <[email protected]>

* Eliminated test cross-contamination

Signed-off-by: Boris Fomitchev <[email protected]>

---------

Signed-off-by: Boris Fomitchev <[email protected]>
Signed-off-by: borisfom <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Fix tokenizer IO (#9555)

* Adding tokenizer to io-test + making it pass

* Handling tokenizer correctly inside dump_io

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Removing not used import

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo UX] Move mistral_7b.py to mistral.py (#9545)

* Move mistral_7b.py to mistral.py

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rename MixtralConfig to MixtralConfig8x7B

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mistral rename: mistralconfig7b & mistralmodel

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Use closed-formula to round by multiple (#9307)

* Use closed-formula to round by multiple

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
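
The closed-formula rounding referenced in the commit above can be sketched in a few lines (the helper name `round_up_to_multiple` is illustrative, not necessarily the function used in the PR):

```python
def round_up_to_multiple(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple using a closed formula (no loop)."""
    return ((x + multiple - 1) // multiple) * multiple


# Typical use: pad a sequence length of 13 up to a multiple of 8 -> 16.
padded = round_up_to_multiple(13, 8)
```

Integer division makes the formula exact for non-negative inputs, avoiding the loop-and-increment pattern it replaces.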

* ci: Do not attempt to send slack on fork (#9556)

* ci: Do not attempt to send slack on fork

Signed-off-by: Oliver Koenig <[email protected]>

* test

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix nemo export test (#9547)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* fix export test

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix SDXL incorrect name in docs (#9534)

Signed-off-by: Tugrul Konuk <[email protected]>

* GPU unit tests: Mark flaky tests to be fixed (#9559)

Signed-off-by: Tugrul Konuk <[email protected]>

* Bump PTL version (#9557)

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Resiliency] Straggler detection (#9473)

* Initial straggler det impl

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed CI code checks

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Removed unused import

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* remove submodule

Signed-off-by: Maanu Grover <[email protected]>

* Updated documentation; Updated callback params; Cosmetic changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed straggler det config; Added basic test

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixes in test_straggler_det.py

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Updated straggler callback API

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* stop_if_detected=False by default

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* switch to torch_dist as default dist checkpointing backend (#9541)

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Checkpointing bug fixes (#9562)

* fix checkpoint loading

* fix

* fixes

* another fix

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add tps and pps params to the export script (#9558)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* fix export test

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* remove n_gpus param

Signed-off-by: Onur Yilmaz <[email protected]>

* add and fix parameters

Signed-off-by: Onur Yilmaz <[email protected]>

* fix deploy script

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* rename tps and pps params

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: oyilmaz-nvidia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Consolidate gpt continue training script into pretraining script (#9413)

* Consolidate gpt continue training with pretraining

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix default config

Signed-off-by: yaoyu-33 <[email protected]>

* Add github action cicd

Signed-off-by: yaoyu-33 <[email protected]>

* extract _integrate_original_checkpoint_data as a method

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix getattr

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "Add github action cicd"

This reverts commit a453f16ba2be6413db932623009da893208acdd5.

* Update comments in nlp_overrides.py

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add support to change Multi task model prompt (#9542)

* Add support to change Multi task model prompt

Signed-off-by: smajumdar <[email protected]>

* Add support to change Multi task model prompt

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Update nemo/collections/common/prompts/formatter.py

Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>

* Address comments

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Address comments

Signed-off-by: smajumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: titu1994 <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add Multimodal Exporter (#9256)

* Add video-neva TRT export

* Add TRT inference

* Change config

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change export params

* Remove unused import

* Add neva export

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change unpack nemo

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Add trt infer config

* Fix neva trt inference

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Add exporter

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix infer

* Add PyTriton

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix deploy wrong dim

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Change to pass PIL Image

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix video neva deploy

* Change query

* Change deploy

* Remove unused import

* Change ptuning

* Change to mm exporter

* Add script

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Fix script

---------

Signed-off-by: meatybobby <[email protected]>
Co-authored-by: meatybobby <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Enable encoder adapters for Canary and MultiTaskAED models (#9409)

* Fix assertions for adapter types

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Cleanup

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Finalize support for decoder adapters

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* fix the freeze/unfreeze problem by replacing as_frozen with torch.inference_mode

* Apply isort and black reformatting

Signed-off-by: weiqingw4ng <[email protected]>

* Update tests to new generic way of module update

Signed-off-by: smajumdar <[email protected]>

* Finalize code for update module

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Fix variable name

Signed-off-by: smajumdar <[email protected]>

* Finalize projection support for transformer mha adapters

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Correct implementation of freeze restore

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Corrects the implementation of replace_adapter_modules to limit to just the top level modules

Signed-off-by: smajumdar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: titu1994 <[email protected]>

* Remove registration of Transformer MHA

Signed-off-by: smajumdar <[email protected]>

* Remove registration of Transformer MHA

Signed-off-by: smajumdar <[email protected]>

* Address reviewer comments

Signed-off-by: smajumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: titu1994 <[email protected]>
Signed-off-by: weiqingw4ng <[email protected]>
Co-authored-by: Weiqing Wang <[email protected]>
Co-authored-by: weiqingw4ng <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* pass option through (#9570)

Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* PTQ refinements (#9574)

* Rename megatron_gpt_quantization -> megatron_gpt_ptq

Signed-off-by: Jan Lasek <[email protected]>

* Configure export.save_path as dir or tarball

Signed-off-by: Jan Lasek <[email protected]>

* PTQ docs update

Signed-off-by: Jan Lasek <[email protected]>

* Make model_type optional in case of quantized checkpoints

Signed-off-by: Jan Lasek <[email protected]>

* Drop unused save_nemo_model_config argument

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Audio model collection (#9263)

* Audio model collection

Signed-off-by: Ante Jukić <[email protected]>

* Apply isort and black reformatting

Signed-off-by: anteju <[email protected]>

* Fix imports

Signed-off-by: Ante Jukić <[email protected]>

* Addressed PR comments

Signed-off-by: Ante Jukić <[email protected]>

* Apply isort and black reformatting

Signed-off-by: anteju <[email protected]>

---------

Signed-off-by: Ante Jukić <[email protected]>
Signed-off-by: anteju <[email protected]>
Co-authored-by: anteju <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Fix Trainer serialization (#9571)

* Fix Trainer serialization

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Update click version requirement (#9580)

Signed-off-by: Dong Hyuk Chang <[email protected]>
Co-authored-by: Dong Hyuk Chang <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Fault tolerance] Heartbeat detection (#9352)

* Fault tolerance related changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Cosmetic changes in documentation

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Doc update round2

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Co-authored-by: Jacek Bieniusiewicz <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add ModelOpt QAT example for Llama2 SFT model (#9326)

* add INT4 QAT example for Llama2 SFT model

Signed-off-by: Keval Morabia <[email protected]>

* Add config parameter to control kv cache quantization

Signed-off-by: Keval Morabia <[email protected]>

* Fix typo in cicd-main.yml for QAT test

Signed-off-by: Keval Morabia <[email protected]>

* fix nlp_overrides.py

Signed-off-by: Keval Morabia <[email protected]>

* address reviewer feedback

Signed-off-by: Keval Morabia <[email protected]>

* quantize unwrapped model

Signed-off-by: Keval Morabia <[email protected]>

* add compress export argument for qat config

Signed-off-by: Keval Morabia <[email protected]>

---------

Signed-off-by: Keval Morabia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Set TE flag in legacy -> mcore conversion script (#9585)

* set TE flag

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Add fabric-API for manual forward-pass (#9577)

* First pass over fabric-API

* Adding Trainer -> Fabric conversion

* Some small fixes to get a forward-pass in Fabric working

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding doc-string to Fabric.import_model

* Adding track_io to io_init of Fabric

* Fix Fabric.load_model + add doc-string

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove unused import

* Some small fixes

* Fix failing test

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Add SDK-factories to llm-collection (#9589)

* Adding sdk-factories to llm-collection

* Removing _model from mistral + mixtral

* Expose lr_scheduler inside lightning

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Multimodal projection layer adapter fix for PP>1 (#9445)

* enabling multimodal adapters to load in PP>1

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* parameterizing validate_access_integrity, set to false when PP>1

Signed-off-by: paul-gibbons <[email protected]>

formatting fix

Signed-off-by: paul-gibbons <[email protected]>

Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* update nlp_model.py

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* update modelPT with validate_access_integrity

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* updating save_restore_connector w/ validate_access_integrity

Signed-off-by: paul-gibbons <[email protected]>

* Apply isort and black reformatting

Signed-off-by: paul-gibbons <[email protected]>

* addressing comment

Signed-off-by: paul-gibbons <[email protected]>

* adding validate_access_integrity to super().load_config_and_state_dict()

Signed-off-by: paul-gibbons <[email protected]>

* testing reorder of validate_access_integrity for CI failures

Signed-off-by: paul-gibbons <[email protected]>

---------

Signed-off-by: paul-gibbons <[email protected]>
Signed-off-by: paul-gibbons <[email protected]>
Co-authored-by: paul-gibbons <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Add offline quantization script for QLoRA deployment (#9455)

* add qlora offline quantization script

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* clean

Signed-off-by: Chen Cui <[email protected]>

* docstring

Signed-off-by: Chen Cui <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* qlora support more models (#9488)

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Some improvements to NeMoLogger (#9591)

Signed-off-by: Tugrul Konuk <[email protected]>

* Set n_gpu to None in nemo export (#9593)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* set ngpus to None

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Inflight nemo model export support (#9527)

* online model conversion and refit

Signed-off-by: Jimmy Zhang <[email protected]>

* clean code

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup

Signed-off-by: Jimmy Zhang <[email protected]>

* add refit, cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* combine weight conversion functions

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <[email protected]>

* remove debug print

Signed-off-by: Jimmy Zhang <[email protected]>

* cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* fix single gpu and cleanup code

Signed-off-by: Jimmy Zhang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <[email protected]>

---------

Signed-off-by: JimmyZhang12 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* vLLM Export Improvements (#9596)

* Separated the vLLM export functionality from the common deployment script into deploy_vllm_triton.py.

Signed-off-by: Alexey Panteleev <[email protected]>

* Fixed vocab_size for LLAMA3.

Signed-off-by: Alexey Panteleev <[email protected]>

* Export test: fixed deployment testing w/o Megatron, made functional tests optional, added --gpu_memory_utilization.

Signed-off-by: Alexey Panteleev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: apanteleev <[email protected]>

* Addressing review and CodeQL comments.

Signed-off-by: Alexey Panteleev <[email protected]>

---------

Signed-off-by: Alexey Panteleev <[email protected]>
Signed-off-by: apanteleev <[email protected]>
Co-authored-by: apanteleev <[email protected]>
Co-authored-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Set finalize_model_grads_func in on_fit_start instead to make sure it's being called (#9599)

Signed-off-by: Tugrul Konuk <[email protected]>

* Set no_sync_func & grad_sync_func (#9601)

* Set no_sync_func & grad_sync_func

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* set overlap_param_sync

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* small nemo logger bug fix (#9607)

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix the dict format returned by scheduler method (#9609)

Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] Dataloading enhancements and bug fixes (#9595)

* fix dataloading + checkpoint restore

* clean up data sampler

* fix typo

* support passing multiple paths to data module

* fix validation dataloader

* fix dataloader len when using gradient accumulation

* fix progress bar

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* fix step count in loggers

* fix blended dataset

* address comments

* address comment

* move step logging into strategy

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Fix serialization of AutoResume (#9616)

* fix serialization of autoresume

* update undefined variables

Signed-off-by: Tugrul Konuk <[email protected]>

* Chat template support for megatron_gpt_eval.py (#9354)

* Bump PTL version (#9557)

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* [Resiliency] Straggler detection (#9473)

* Initial straggler det impl

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed CI code checks

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Removed unused import

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* remove submodule

Signed-off-by: Maanu Grover <[email protected]>

* Updated documentation; Updated callback params; Cosmetic changes

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixed straggler det config; Added basic test

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* Fixes in test_straggler_det.py

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Updated straggler callback API

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: jbieniusiewi <[email protected]>

* stop_if_detected=False by default

Signed-off-by: Jacek Bieniusiewicz <[email protected]>

---------

Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move model loading to separate function; call toContainer once; pad using closed formula

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* read prompts from file

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* If input prompt contains dict, apply model.tokenizer.chat_template

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* apply @Gal Leibovich's patch

Taken from: https://github.com/NVIDIA/NeMo/commit/17572905344db4692583e72799d55801a8860f35
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rename prompts_file to prompts_jsonl

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat_template param

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add ChatTemplateMixin to SentencePieceTokenizer

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat-template to text-gen-strat

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move load prompts to separate file

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove chat-template from text-gen-utils

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* make chat-template more generic

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add assert message

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* small refactor for chat_template_mixin

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo ckpt conv changes

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move rounding to function

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
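
The commit above applies `model.tokenizer.chat_template` when an input prompt is a list of role/content messages. A rough sketch of the rendering step (the delimiter format below is an assumption for illustration, not the tokenizer's actual template):

```python
def apply_chat_template(messages):
    """Render role/content messages into a single prompt string.

    Illustrative only: NeMo uses the tokenizer's own chat_template; this
    fallback format is invented for the example.
    """
    rendered = [f"<|{msg['role']}|>\n{msg['content']}" for msg in messages]
    # Leave the assistant turn open so the model generates the reply.
    return "\n".join(rendered) + "\n<|assistant|>\n"


prompt = apply_chat_template([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
```

This is why the script checks whether a prompt entry is a dict/list rather than a plain string before templating.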

* Jsonl support (#9611)

* Adding support to preprocess .jsonl and .jsonl.gz files in input directory

Signed-off-by: adityavavre <[email protected]>

* Adding support to preprocess .jsonl and .jsonl.gz files in input directory

Signed-off-by: adityavavre <[email protected]>

* Apply isort and black reformatting

Signed-off-by: adityavavre <[email protected]>

---------

Signed-off-by: adityavavre <[email protected]>
Signed-off-by: adityavavre <[email protected]>
Co-authored-by: adityavavre <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
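
Transparently preprocessing both `.jsonl` and `.jsonl.gz` inputs, as the commit above adds, can be sketched with the standard library (the helper name `iter_jsonl` is illustrative):

```python
import gzip
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one JSON object per line from a .jsonl or .jsonl.gz file."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

A directory scan such as `Path(input_dir).glob("*.jsonl*")` then picks up both variants with one code path.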

* [NeMo-UX] Add PEFT (#9490)

* initial commit for PEFT in nemo2

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* address comments

Signed-off-by: Chen Cui <[email protected]>

* make import easier

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* address comments

Signed-off-by: Chen Cui <[email protected]>

* Update nemo/collections/llm/peft/lora.py

Signed-off-by: Marc Romeyn <[email protected]>

* Some small fixes + adding more doc-strings

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Adding ModelTransform callback

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fixing type-hint for model_transform

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* fix import

Signed-off-by: Chen Cui <[email protected]>

* model transform for gemma llama

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fix model transform

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* change lora target default to all linear modules

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* Small fix in mixtral

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Integrating PEFT to the public-API + some fixes

* Big refactor to allow to load adapter-states

* Some fixes to support adapter_path

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Disabling ckpt reloading when adapter_path is passed

* Fix CLI

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Remove commented-out code

* Remove commented-out code

* Remove un-used import

* Fix callback imports

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fixing llm.pretrain

* Some small fixes

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix missing import + type-hint in finetune

* Adding PreemptionCallback + some more tests

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Clean up imports & clean up llm.api

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Trying to fix failing tests

* Remove __init__.py 2

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Fix failing test

* Trying to fix last failing test

---------

Signed-off-by: cuichenx <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Akoumparouli/mistral import instruct chat template fix (#9567)

* use bf16 by default in mistral conversion

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add chat template

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use capitalized role names

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
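
The capitalized-role chat template change above can be illustrated with a small, self-contained sketch. This is not the actual NeMo/Mistral template; the `<|Role|>` delimiters and the `render_chat` helper are hypothetical stand-ins for the idea of capitalizing role names ("user" -> "User") when rendering a conversation into a prompt string.

```python
def render_chat(messages):
    """Render a list of {"role", "content"} dicts into one prompt string.

    Role names are capitalized ("user" -> "User"), mirroring the
    capitalized-role-names change; the delimiters are illustrative only.
    """
    parts = []
    for msg in messages:
        role = msg["role"].capitalize()  # "assistant" -> "Assistant"
        parts.append(f"<|{role}|>: {msg['content']}")
    return "\n".join(parts)


prompt = render_chat([
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi, how can I help?"},
])
print(prompt)
```

A real template would also handle system prompts and special BOS/EOS tokens; this only shows the role-name convention.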

* Remove .cuda calls, use device instead (#9602)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix converter default args (#9565)

* fix converter default args

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* mixtral export (#9603)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix: remove non_blocking from PTL's .cuda call (#9618)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Alit/mamba tmp (#9612)

* adding mamba support

* fix import mixins

* rm convert jamba

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* more cleanups

* use GPT text gen

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* fixing gbs in TP converter

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add reqs

* add tutorial

* minor fix to tutorial

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add mamba_tmp

* remove mamba import

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

---------

Signed-off-by: JRD971000 <[email protected]>
Signed-off-by: arendu <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: JRD971000 <[email protected]>
Co-authored-by: arendu <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* TitaNet Batch Verify Speaker (#9337)

* add batch_inference for verify_speakers method

Signed-off-by: [email protected] <[email protected]>

* remove not used package

Signed-off-by: [email protected] <[email protected]>

* change batch inference logic

Signed-off-by: [email protected] <[email protected]>

* fixup

Signed-off-by: [email protected] <[email protected]>

* requested changes

Signed-off-by: [email protected] <[email protected]>

* add verify_speakers_batch to docs

Signed-off-by: [email protected] <[email protected]>

* handle None durations in manifest

Signed-off-by: [email protected] <[email protected]>

* change logging text

Signed-off-by: [email protected] <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* check duration presence

Signed-off-by: [email protected] <[email protected]>

* add channel_selector to dataset configs

Signed-off-by: [email protected] <[email protected]>

---------

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
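
The batched-verification idea behind a `verify_speakers_batch`-style API can be sketched as: group many (enroll, test) audio pairs into fixed-size batches, embed each batch in one forward pass, then score each pair by cosine similarity against a threshold. Everything below is a stand-in, not the TitaNet API; `embed_batch` replaces the real model call and the threshold is arbitrary.

```python
import math


def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def verify_pairs(pairs, embed_batch, threshold=0.7, batch_size=32):
    """Return a same/different decision for each (enroll, test) pair."""
    decisions = []
    for batch in chunked(pairs, batch_size):
        # Flatten pairs so the whole batch is embedded in one call.
        embs = embed_batch([x for pair in batch for x in pair])
        for j in range(len(batch)):
            decisions.append(cosine(embs[2 * j], embs[2 * j + 1]) >= threshold)
    return decisions
```

Batching amortizes model overhead across many pairs, which is the point of adding a batch variant next to the per-pair `verify_speakers` method.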

* Enable MCore checkpointing optimizations (#9505)

* Expose num processes in PyT Dist

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add parallel save/load optimizations from MCore

Signed-off-by: Mikołaj Błaż <[email protected]>

* Remove async utils from MCore

Signed-off-by: Mikołaj Błaż <[email protected]>

* Enable DistOpt parallel R/W

Signed-off-by: Mikołaj Błaż <[email protected]>

* Enable PyT Dist caching

Signed-off-by: Mikołaj Błaż <[email protected]>

* Small fixes

Signed-off-by: Mikołaj Błaż <[email protected]>

* Make sure DistCkptIO is instantiated from config

Signed-off-by: Mikołaj Błaż <[email protected]>

* Bump MCore version to v0.7

Signed-off-by: Mikołaj Błaż <[email protected]>

* Print load strategy

Signed-off-by: Mikołaj Błaż <[email protected]>

* Forward MCore to model space DistOpt

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add separate flag to control DistOpt parallel R/W

Signed-off-by: Mikołaj Błaż <[email protected]>

* Turn off parallel save by default

Signed-off-by: Mikołaj Błaż <[email protected]>

---------

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Change mixtral moe key name for trt-llm (#9620)

* fix minor import bug

Signed-off-by: Onur Yilmaz <[email protected]>

* change moe key values

Signed-off-by: Onur Yilmaz <[email protected]>

* add weight to the key

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
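
The MoE key-name change above amounts to a state-dict remap during export. The sketch below shows the general shape of such a remap; the old/new key patterns (`mlp.experts` -> `moe.experts`) and the appended `.weight` suffix are hypothetical illustrations of "change moe key values" and "add weight to the key", not the exact TRT-LLM naming.

```python
def remap_moe_keys(state_dict):
    """Rename expert-layer keys and ensure an explicit '.weight' suffix.

    Both the rename pattern and the suffix rule are illustrative.
    """
    remapped = {}
    for key, value in state_dict.items():
        new_key = key.replace("mlp.experts", "moe.experts")  # hypothetical rename
        if "experts" in new_key and not new_key.endswith(".weight"):
            new_key += ".weight"  # "add weight to the key"
        remapped[new_key] = value
    return remapped


out = remap_moe_keys({
    "layers.0.mlp.experts.3.w1": "tensor_a",
    "layers.0.attn.qkv.weight": "tensor_b",  # non-expert keys pass through
})
print(out)
```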

* fix ckpt load bug (#9621)

* fix ckpt load bug

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* NeVA Minor Fixes (#9608)

* fix neva resume with empty param loaded for some pp stage

Signed-off-by: yaoyu-33 <[email protected]>

* fix crop size check

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* fix pretraining data sizes and weights (#9627)

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Alit/mamba (#9575)

* adding mamba support

* fix import mixins

* rm convert jamba

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* more cleanups

* use GPT text gen

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* fixing gbs in TP converter

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* add reqs

* add tutorial

* minor fix to tutorial

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* moving finetuning files

Signed-off-by: arendu <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* Apply isort and black reformatting

Signed-off-by: JRD971000 <[email protected]>

* address comments

* add mamba dependencies

* add mcore tag

* modify dockerfile ci

* modify dockerfile ci

---------

Signed-off-by: JRD971000 <[email protected]>
Signed-off-by: arendu <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: JRD971000 <[email protected]>
Co-authored-by: arendu <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [NeMo-UX] async checkpointing support (#9466)

* add async checkpointing support

* fixes

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* add parallel read/write support and other optimizations

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* address comments, make dist checkpointing args configurable

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

* fix small typo

Signed-off-by: ashors1 <[email protected]>

* Update default sharding type

Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Anna Shors <[email protected]>

* Update default sharding type

Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Anna Shors <[email protected]>

* Apply isort and black reformatting

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Anna Shors <[email protected]>
Co-authored-by: ashors1 <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>
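
The core idea of async checkpointing is that the training loop hands an already-materialized state to a background writer so the loop is not blocked on disk I/O. The toy sketch below shows only that idea with the standard library; real NeMo/MCore async checkpointing is far more involved (sharded distributed tensors, finalization callbacks, configurable parallel read/write), and the file format and helper names here are made up.

```python
import json
import os
import tempfile
import threading


def async_save(state, path):
    """Write `state` to `path` on a background thread; return the thread."""
    def _write():
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)  # atomic publish: readers never see a partial file

    t = threading.Thread(target=_write)
    t.start()
    return t  # caller joins before the next save (or at shutdown)


ckpt_path = os.path.join(tempfile.gettempdir(), "step_100.json")
handle = async_save({"step": 100, "loss": 1.23}, ckpt_path)
# ... training continues here while the write is in flight ...
handle.join()  # finalize before the next checkpoint overwrites state
```

Joining before the next save mirrors why a finalization step exists: two unfinished writes to the same checkpoint must never overlap.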

* Fix the arguments of forward_for_export function in msdd_models (#9624)

* Fix the arguments of forward_for_export function

Signed-off-by: Taejin Park <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

---------

Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: tango4j <[email protected]>
Co-authored-by: tango4j <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Change default parallel_save to False (#9632)

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Unwrap ckpt_io for model opt (async save) (#9622)

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* MCore T5 support for NeMo - Training (#9432)

* huvu/mcore_t5 first commit from local

* removing DEBUGGING prints

* cleaning megatron_lm_encoder_decoder_model.py code

* cleaning code

* adding Github action test

* only run mcore T5 test

* only run mcore T5 test

* only run mcore T5 test

* only run mcore T5 test

* reset .github/workflows/cicd-main.yml

* reset .github/workflows/cicd-main.yml

* adding condition self.mcore_t5 when running self.build_transformer_config()

* refactor megatron_lm_encoder_decoder_model.py to not use self.model

* only run T5-related tests

* remove all self.model

* reset cicd file

* reset cicd file

* updating code: remove duplicate if/else; adding mcore/transformer_engine to config file

* adjust +model.mcore_t5=True

* Apply isort and black reformatting

Signed-off-by: huvunvidia <[email protected]>

---------

Signed-off-by: huvunvidia <[email protected]>
Co-authored-by: Huy Vu2 <[email protected]>
Co-authored-by: huvunvidia <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* [Nemo-UX] Expose transformer_layer_spec inside GPTConfig (#9592)

* Expose transformer_layer_spec inside GPTConfig

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

* Expose layer-specs

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Update NeMo Clip to Use MCore Modules (#9594)

* update clip model and config file

Signed-off-by: yaoyu-33 <[email protected]>

* update clip for mcore

Signed-off-by: yaoyu-33 <[email protected]>

* MCore CLIP Fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix no mask

Signed-off-by: yaoyu-33 <[email protected]>

* few neva fixes

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip module

Signed-off-by: yaoyu-33 <[email protected]>

* add siglip loss

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix collate fn

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip conversion script

Signed-off-by: yaoyu-33 <[email protected]>

* update siglip convert

Signed-off-by: yaoyu-33 <[email protected]>

* clip fixes

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* clean up script

Signed-off-by: yaoyu-33 <[email protected]>

* clip fixe…