Use closed-formula to round by multiple #9307

akoumpa · 2024-05-24T08:21:39Z

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this
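The description above was left as the unfilled template, so the following is only a minimal sketch of what the PR title suggests: replacing an increment-until-divisible loop with a closed-form expression when rounding a value up to a multiple. The function names and the example numbers are hypothetical and are not taken from the NeMo code base.

```python
def round_up_loop(value: int, multiple: int) -> int:
    """Baseline (hypothetical): step upward until the value is divisible."""
    if multiple <= 0:
        raise ValueError("multiple must be a positive integer")
    result = value
    while result % multiple != 0:
        result += 1
    return result


def round_up_closed_form(value: int, multiple: int) -> int:
    """Closed form (hypothetical name): ceil(value / multiple) * multiple.

    Uses integer floor division only, so there is no loop and no
    floating-point rounding error.
    """
    if multiple <= 0:
        raise ValueError("multiple must be a positive integer")
    return ((value + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    # Illustrative numbers only: pad a vocabulary size to a multiple of 128.
    print(round_up_closed_form(50257, 128))
```

Both functions return the same result for any integer value and positive multiple; the closed form does it in constant time instead of up to `multiple - 1` iterations.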

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

github-actions · 2024-06-13T01:47:39Z

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

github-actions · 2024-06-21T01:47:22Z

This PR was closed because it has been inactive for 7 days since being marked as stale.

Signed-off-by: Alexandros Koumparoulis <[email protected]>

Signed-off-by: akoumpa <[email protected]>

* Use closed-formula to round by multiple
  Signed-off-by: Alexandros Koumparoulis <[email protected]>
* Apply isort and black reformatting
  Signed-off-by: akoumpa <[email protected]>
---------
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: ashors1 <[email protected]>

* Use closed-formula to round by multiple
  Signed-off-by: Alexandros Koumparoulis <[email protected]>
* Apply isort and black reformatting
  Signed-off-by: akoumpa <[email protected]>
---------
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Alex Cui <[email protected]>

* Use closed-formula to round by multiple
  Signed-off-by: Alexandros Koumparoulis <[email protected]>
* Apply isort and black reformatting
  Signed-off-by: akoumpa <[email protected]>
---------
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Tugrul Konuk <[email protected]>

* Adding context- & expert-parallism to MegatronStrategy (#9525) Signed-off-by: Tugrul Konuk <[email protected]> * Add CICD test for Stable Diffusion (#9464) * Add CICD test for Stable Diffusion Signed-off-by: Michal Futrega <[email protected]> * Update cicd-main.yml Signed-off-by: Michal Futrega <[email protected]> * Use single gpu runner Signed-off-by: Michal Futrega <[email protected]> --------- Signed-off-by: Michal Futrega <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Akoumparouli/nemo ux mixtral (#9446) * use default collate if dataset does not have one Signed-off-by: Alexandros Koumparoulis <[email protected]> * mixtral config Signed-off-by: Alexandros Koumparoulis <[email protected]> * add convert_state Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix StateDictTransform for 2D layers, e.g. MoE Signed-off-by: Alexandros Koumparoulis <[email protected]> * pass num_moe_experts to specs Signed-off-by: Alexandros Koumparoulis <[email protected]> * udpate MixtralModel Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * mini docstring Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * update mcoreddp call (#9345) * update mcoreddp call Signed-off-by: Alexandros Koumparoulis <[email protected]> * update mcore commits Signed-off-by: Alexandros Koumparoulis <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: Pablo Garay <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Llama and Gemma (#9528) * add llama 
Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * add llama Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * add llama3 Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * fix typo Signed-off-by: Chen Cui <[email protected]> * enable importers with multiple models Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * add gemma Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * checks Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> --------- Signed-off-by: Chen Cui <[email protected]> Signed-off-by: cuichenx <[email protected]> Co-authored-by: cuichenx <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] minor logging bug fixes (#9529) * minor exp_manager bug fixes * remove print statement * fix docstring * fix AppState defaults --------- Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * mcore distOpt restore fix (#9421) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Custom Tiktoken tokenizer. Signed-off-by: Tugrul Konuk <[email protected]> * Fixed the tokenizer decoding on special tokens. Signed-off-by: Tugrul Konuk <[email protected]> * Apply isort and black reformatting Signed-off-by: ertkonuk <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Added token_to_id() method. 
Signed-off-by: Tugrul Konuk <[email protected]> * Update neva conversion script from and to HF (#9296) * Update NeMo script Signed-off-by: yaoyu-33 <[email protected]> * Fix example scripts Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * Update convert_llava_nemo_to_hf.py Signed-off-by: yaoyu-33 <[email protected]> * address comments Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * vLLM Export Support (#9381) * Export implementation for vLLM 0.4.3. Supports LLAMA2, Mistral, Mixtral (unverified), Gemma and StarCoder2 models. The nemo.export.tensorrt_llm alias was removed to avoid initializing TRT-LLM when importing anything from nemo.export. Signed-off-by: Alexey Panteleev <[email protected]> * Fixed some CodeQL warnings. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Removed empty files. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Updated the integration for vLLM 0.5.0. Signed-off-by: Alexey Panteleev <[email protected]> * Updated the vLLM deployment interface to use max_output_len instead of max_output_token. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Moved the Exporter class to nemo/export and renamed its file to vllm_exporter.py, to be more similar to TRT-LLM. 
Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Implemented vLLM support in the export tests, added functional testing, implemented forward evaluation on vLLM without Triton. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Moved the vLLM deployment functionality to the common deploy_triton.py script. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Fixed the CodeQL discovered issues. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Fixed one more return of a wrong dimensionality... Signed-off-by: Alexey Panteleev <[email protected]> * More wrong dimensionality returns. Signed-off-by: Alexey Panteleev <[email protected]> --------- Signed-off-by: Alexey Panteleev <[email protected]> Signed-off-by: apanteleev <[email protected]> Co-authored-by: apanteleev <[email protected]> Co-authored-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * PL: Delete precision if using plugin. 
TODO switch to MegatronTrainerBuilder (#9535) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add page context fmha (#9526) Signed-off-by: Tugrul Konuk <[email protected]> * extend get_gpt_layer_modelopt_spec to support MoE (#9532) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix mock data generation for legacy dataset (#9530) Signed-off-by: dimapihtar <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] IO fixes (#9512) * Improve IOMixin.io_transform_args to handle dataclasses better * Dump task json + img inside NeMoLogger * Adding store_io to train task * Update opt.connect to also propagate to __io__ * Rename opt to optim for consistency * Moving to using safe serialization using fiddle, only use cloudpickle when needed * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Using Config from fiddle instead of sdk for now * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Move enable_nemo_ckpt_io from MegatronStrategy to ModelCheckpoint * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Move nemo-ckpt to _get_finalize_save_checkpoint_callback * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Update TrainerContext & io.load_ckpt * Use renamed TrainerContext inside ModelCheckpoint * Remove double io saving * Rename lightning.pytorch.opt -> optim * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Remove store_io from train-task * Adding fiddle-extension for torch * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Move fdl_torch import * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Adding dtype to serialization * Some fixes * Apply isort and black reformatting Signed-off-by: 
marcromeyn <[email protected]> * Make TransformerConfig inherit from IOMixin to fix serialization error * Make TransformerConfig inherit from IOMixin to fix serialization error * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Add support for BuiltinFunctionType * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Add missing import * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fix dataclass fields --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Test C++ runtime on demand in nemo_export.py to avoid possible OOMs (#9544) * Add test_cpp_runtime flag Signed-off-by: Jan Lasek <[email protected]> * Apply isort and black reformatting Signed-off-by: janekl <[email protected]> --------- Signed-off-by: Jan Lasek <[email protected]> Signed-off-by: janekl <[email protected]> Co-authored-by: janekl <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix lhotse tests for v1.24.2 (#9546) * Fix lhotse tests for v1.24.0 Signed-off-by: Piotr Żelasko <[email protected]> * Fix RIR test Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * gpu_unitTests_notOptional (#9551) Signed-off-by: Tugrul Konuk <[email protected]> * add reset learning rate functionality (#9372) * add reset_lr functionality Signed-off-by: dimapihtar <[email protected]> * fix reset_lr logic Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * move reset_lr from optim section Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * add reset_lr value to config Signed-off-by: dimapihtar <[email protected]> * set reset_lr False by default 
Signed-off-by: dimapihtar <[email protected]> * remove extra line Signed-off-by: dimapihtar <[email protected]> * add reset_lr test Signed-off-by: dimapihtar <[email protected]> * add reset_lr test Signed-off-by: dimapihtar <[email protected]> * remove extra quote Signed-off-by: dimapihtar <[email protected]> * add ability to reset schedule's max_steps and decay_steps Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * change scheduler's first step logic when using reset_lr Signed-off-by: dimapihtar <[email protected]> * revert config Signed-off-by: dimapihtar <[email protected]> * fix reset_lr logic Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * revert config Signed-off-by: dimapihtar <[email protected]> * revert config Signed-off-by: dimapihtar <[email protected]> * update reset_lr comments Signed-off-by: dimapihtar <[email protected]> * add use cases for reset_lr feature Signed-off-by: dimapihtar <[email protected]> --------- Signed-off-by: dimapihtar <[email protected]> Signed-off-by: dimapihtar <[email protected]> Co-authored-by: dimapihtar <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add Python AIStore SDK to container and bump min Lhotse version (#9537) * Add Python AIStore SDK to requirements and bump min Lhotse version Signed-off-by: Piotr Żelasko <[email protected]> * Move AIStore Python SDK to Dockerfile, remove matplotlib/ipywidgets deps Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Adding 'use_dynamo' option for export to use onnx.dynamo_export() instead of onnx.export() (#9147) * Ininial WARs to implement dynamo option for export Signed-off-by: Boris Fomitchev <[email protected]> * including weights in .onnx Signed-off-by: Boris Fomitchev <[email 
protected]> * dynamo_export works for many small models Signed-off-by: Boris Fomitchev <[email protected]> * External weights behaviour fixed Signed-off-by: Boris Fomitchev <[email protected]> * Cleanup Signed-off-by: Boris Fomitchev <[email protected]> * Apply isort and black reformatting Signed-off-by: borisfom <[email protected]> * print cleaned up Signed-off-by: Boris Fomitchev <[email protected]> * Added overloadable dynamic_shapes_for_export Signed-off-by: Boris Fomitchev <[email protected]> * Addressing code review Signed-off-by: Boris Fomitchev <[email protected]> * Fixing CI issues Signed-off-by: Boris Fomitchev <[email protected]> * Fixing CI test failure Signed-off-by: Boris Fomitchev <[email protected]> * Eliminated test cross-contamination Signed-off-by: Boris Fomitchev <[email protected]> --------- Signed-off-by: Boris Fomitchev <[email protected]> Signed-off-by: borisfom <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: Somshubra Majumdar <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Fix tokenizer IO (#9555) * Adding tokenizer to io-test + making it pass * Handling tokenizer correctly inside dump_io * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Removing not used import --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo UX] Move mistral_7b.py to mistral.py (#9545) * Move mistral_7b.py to mistral.py Signed-off-by: Alexandros Koumparoulis <[email protected]> * rename MixtralConfig to MixtralConfig8x7B Signed-off-by: Alexandros Koumparoulis <[email protected]> * mistral rename: mistralconfig7b & mistralmodel Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email 
protected]> * Use closed-formula to round by multiple (#9307) * Use closed-formula to round by multiple Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Co-authored-by: Pablo Garay <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * ci: Do not attempt to send slack on fork (#9556) * ci: Do not attempt to send slack on fork Signed-off-by: Oliver Koenig <[email protected]> * test Signed-off-by: Oliver Koenig <[email protected]> --------- Signed-off-by: Oliver Koenig <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix nemo export test (#9547) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * fix export test Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: oyilmaz-nvidia <[email protected]> Co-authored-by: oyilmaz-nvidia <[email protected]> Co-authored-by: Pablo Garay <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix SDXL incorrect name in docs (#9534) Signed-off-by: Tugrul Konuk <[email protected]> * GPU unit tests: Mark flaky tests to be fixed (#9559) Signed-off-by: Tugrul Konuk <[email protected]> * Bump PTL version (#9557) Signed-off-by: Abhishree <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Resiliency] Straggler detection (#9473) * Initial straggler det impl Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed CI code checks Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * 
Removed unused import Signed-off-by: Jacek Bieniusiewicz <[email protected]> * remove submodule Signed-off-by: Maanu Grover <[email protected]> * Updated documentation; Updated callback params; Cosmetic changes Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed straggler det config; Added basic test Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixes in test_straggler_det.py Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Updated straggler callback API Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * stop_if_detected=False by default Signed-off-by: Jacek Bieniusiewicz <[email protected]> --------- Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * switch to torch_dist as default dist checkpointing backend (#9541) Signed-off-by: ashors1 <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Checkpointing bug fixes (#9562) * fix checkpoint loading * fix * fixes * another fix * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> --------- Signed-off-by: ashors1 <[email protected]> Co-authored-by: ashors1 <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add tps and pps params to the export script (#9558) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * fix export test Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black 
reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * remove n_gpus param Signed-off-by: Onur Yilmaz <[email protected]> * add and fix parameters Signed-off-by: Onur Yilmaz <[email protected]> * fix deploy script Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * rename tps and pps params Signed-off-by: Onur Yilmaz <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: oyilmaz-nvidia <[email protected]> Co-authored-by: oyilmaz-nvidia <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Consolidate gpt continue training script into pretraining script (#9413) * Consolidate gpt continue training with pretraining Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix default config Signed-off-by: yaoyu-33 <[email protected]> * Add github action cicd Signed-off-by: yaoyu-33 <[email protected]> * extract _integrate_original_checkpoint_data as a method Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix getattr Signed-off-by: yaoyu-33 <[email protected]> * Revert "Add github action cicd" This reverts commit a453f16ba2be6413db932623009da893208acdd5. 
* Update comments in nlp_overrides.py Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add support to change Multi task model prompt (#9542) * Add support to change Multi task model prompt Signed-off-by: smajumdar <[email protected]> * Add support to change Multi task model prompt Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Update nemo/collections/common/prompts/formatter.py Co-authored-by: Piotr Żelasko <[email protected]> Signed-off-by: Somshubra Majumdar <[email protected]> * Address comments Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Address comments Signed-off-by: smajumdar <[email protected]> --------- Signed-off-by: smajumdar <[email protected]> Signed-off-by: titu1994 <[email protected]> Signed-off-by: Somshubra Majumdar <[email protected]> Co-authored-by: Piotr Żelasko <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add Multimodal Exporter (#9256) * Add video-neva TRT export * Add TRT inference * Change config * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Change export params * Remove unused import * Add neva export * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Change unpack nemo * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Add trt infer config * Fix neva trt inference * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Add exporter * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Fix infer * Add PyTriton * Apply isort and black reformatting Signed-off-by: 
meatybobby <[email protected]> * Fix deploy wrong dim * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Change to pass PIL Image * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Fix video neva deploy * Change query * Change deploy * Remove unused import * Change ptuning * Change to mm exporter * Add script * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Fix script --------- Signed-off-by: meatybobby <[email protected]> Co-authored-by: meatybobby <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Enable encoder adapters for Canary and MultiTaskAED models (#9409) * Fix assertions for adapter types Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Cleanup Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Finalize support for decoder adapters Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * fix the freeze/unfreeze problem by replacing as_frozen with torch.inference_mode * Apply isort and black reformatting Signed-off-by: weiqingw4ng <[email protected]> * Update tests to new generic way of module update Signed-off-by: smajumdar <[email protected]> * Finalize code for update module Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Fix variable name Signed-off-by: smajumdar <[email protected]> * Finalize projection support for transformer mha adapters Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Correct implementation of freeze restore Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> 
* Corrects the implementation of replace_adapter_modules to limit to just the top level modules Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Remove registration of Transformer MHA Signed-off-by: smajumdar <[email protected]> * Remove registration of Transformer MHA Signed-off-by: smajumdar <[email protected]> * Address reviewer comments Signed-off-by: smajumdar <[email protected]> --------- Signed-off-by: smajumdar <[email protected]> Signed-off-by: titu1994 <[email protected]> Signed-off-by: weiqingw4ng <[email protected]> Co-authored-by: Weiqing Wang <[email protected]> Co-authored-by: weiqingw4ng <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * pass option through (#9570) Signed-off-by: Maanu Grover <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * PTQ refinements (#9574) * Rename megatron_gpt_quantization -> megatron_gpt_ptq Signed-off-by: Jan Lasek <[email protected]> * Configure export.save_path as dir or tarball Signed-off-by: Jan Lasek <[email protected]> * PTQ docs update Signed-off-by: Jan Lasek <[email protected]> * Make model_type optional in case of quantized checkpoints Signed-off-by: Jan Lasek <[email protected]> * Drop unused save_nemo_model_config argument Signed-off-by: Jan Lasek <[email protected]> --------- Signed-off-by: Jan Lasek <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Audio model collection (#9263) * Audio model collection Signed-off-by: Ante Jukić <[email protected]> * Apply isort and black reformatting Signed-off-by: anteju <[email protected]> * Fix imports Signed-off-by: Ante Jukić <[email protected]> * Addressed PR comments Signed-off-by: Ante Jukić <[email protected]> * Apply isort and black reformatting Signed-off-by: anteju <[email protected]> --------- Signed-off-by: Ante Jukić <[email protected]> Signed-off-by: anteju <[email protected]> Co-authored-by: anteju <[email 
protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Fix Trainer serialization (#9571) * Fix Trainer serialization * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Update click version requirement (#9580) Signed-off-by: Dong Hyuk Chang <[email protected]> Co-authored-by: Dong Hyuk Chang <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Fault tolerance] Heartbeat detection (#9352) * Fault tolerance related changes Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Cosmetic changes in documentation Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Doc update round2 Signed-off-by: Jacek Bieniusiewicz <[email protected]> --------- Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Co-authored-by: Jacek Bieniusiewicz <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add ModelOpt QAT example for Llama2 SFT model (#9326) * add INT4 QAT example for Llama2 SFT model Signed-off-by: Keval Morabia <[email protected]> * Add config parameter to control kv cache quantization Signed-off-by: Keval Morabia <[email protected]> * Fix typo in cicd-main.yml for QAT test Signed-off-by: Keval Morabia <[email protected]> * fix nlp_overrides.py Signed-off-by: Keval Morabia <[email protected]> * address reviewer feedback Signed-off-by: Keval Morabia <[email protected]> * quantize unwrapped model Signed-off-by: Keval Morabia <[email protected]> * add compress export argument for qat config Signed-off-by: Keval Morabia <[email protected]> --------- Signed-off-by: Keval Morabia <[email protected]> 
Signed-off-by: Tugrul Konuk <[email protected]> * Set TE flag in legacy -> mcore conversion script (#9585) * set TE flag Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> --------- Signed-off-by: Chen Cui <[email protected]> Signed-off-by: cuichenx <[email protected]> Co-authored-by: cuichenx <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] Add fabric-API for manual forward-pass (#9577) * First pass over fabric-API * Adding Trainer -> Fabric conversion * Some small fixes to get a forward-pass in Fabric working * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Adding doc-string to Fabric.import_model * Adding track_io to io_init of Fabric * Fix Fabric.load_model + add doc-string * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Remove unused import * Some small fixes * Fix failing test --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] Add SDK-factories to llm-collection (#9589) * Adding sdk-factories to llm-collection * Removing _model from mistral + mixtral * Expose lr_scheduler inside lightning * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Multimodal projection layer adapter fix for PP>1 (#9445) * enabling multimodal adapters to load in PP>1 Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * parameterizing validate_access_integrity, set to false when PP>1 Signed-off-by: paul-gibbons <[email protected]> formatting fix Signed-off-by: paul-gibbons <[email protected]> Apply isort and black reformatting 
Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * update nlp_model.py Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * update modelPT with validate_access_integrity Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * updating save_restore_connector w/ validate_access_integrity Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * addressing comment Signed-off-by: paul-gibbons <[email protected]> * adding validate_access_integrity to super().load_config_and_state_dict() Signed-off-by: paul-gibbons <[email protected]> * testing reorder of validate_access_integrity for CI failures Signed-off-by: paul-gibbons <[email protected]> --------- Signed-off-by: paul-gibbons <[email protected]> Signed-off-by: paul-gibbons <[email protected]> Co-authored-by: paul-gibbons <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add offline quantization script for QLoRA deployment (#9455) * add qlora offline quantization script Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * clean Signed-off-by: Chen Cui <[email protected]> * docstring Signed-off-by: Chen Cui <[email protected]> --------- Signed-off-by: Chen Cui <[email protected]> Signed-off-by: cuichenx <[email protected]> Co-authored-by: cuichenx <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * qlora support more models (#9488) Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Some improvements to NeMoLogger (#9591) Signed-off-by: Tugrul Konuk <[email protected]> * Set n_gpu to None in 
nemo export (#9593) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * set ngpus to None Signed-off-by: Onur Yilmaz <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Inflight nemo model export support (#9527) * online model conversion and refit Signed-off-by: Jimmy Zhang <[email protected]> * clean code Signed-off-by: Jimmy Zhang <[email protected]> * cleanup Signed-off-by: Jimmy Zhang <[email protected]> * add refit, cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * combine weight conversion functions Signed-off-by: Jimmy Zhang <[email protected]> * cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <[email protected]> * remove debug print Signed-off-by: Jimmy Zhang <[email protected]> * cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * fix single gpu and cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <[email protected]> --------- Signed-off-by: JimmyZhang12 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * vLLM Export Improvements (#9596) * Separated the vLLM export functionality from the common deployment script into deploy_vllm_triton.py. Signed-off-by: Alexey Panteleev <[email protected]> * Fixed vocab_size for LLAMA3. Signed-off-by: Alexey Panteleev <[email protected]> * Export test: fixed deployment testing w/o Megatron, made functional tests optional, added --gpu_memory_utilization. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Addressing review and CodeQL comments. 
Signed-off-by: Alexey Panteleev <[email protected]> --------- Signed-off-by: Alexey Panteleev <[email protected]> Signed-off-by: apanteleev <[email protected]> Co-authored-by: apanteleev <[email protected]> Co-authored-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Set finalize_model_grads_func in on_fit_start instead to make sure it's being called (#9599) Signed-off-by: Tugrul Konuk <[email protected]> * Set no_sync_func & grad_sync_fucn (#9601) * Set no_sync_func & grad_sync_fucn Signed-off-by: Alexandros Koumparoulis <[email protected]> * set overlap_param_sync Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * small nemo logger bug fix (#9607) Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix the dict format returned by scheduler method (#9609) Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Dataloading enhancements and bug fixes (#9595) * fix dataloading + checkpoint restore * clean up data sampler * fix typo * support passing multiple paths to data module * fix validation dataloader * fix dataloader len when using gradient accumulation * fix progress bar * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * fix step count in loggers * fix blended dataset * address comments * address comment * move step logging into strategy * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> --------- Signed-off-by: ashors1 <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Co-authored-by: ashors1 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * 
Fix serialization of AutoResume (#9616) * fix serialization of autoresume * update undefined variables Signed-off-by: Tugrul Konuk <[email protected]> * Chat template support for megatron_gpt_eval.py (#9354) * Bump PTL version (#9557) Signed-off-by: Abhishree <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * [Resiliency] Straggler detection (#9473) * Initial straggler det impl Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed CI code checks Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Removed unused import Signed-off-by: Jacek Bieniusiewicz <[email protected]> * remove submodule Signed-off-by: Maanu Grover <[email protected]> * Updated documentation; Updated callback params; Cosmetic changes Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed straggler det config; Added basic test Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixes in test_straggler_det.py Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Updated straggler callback API Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * stop_if_detected=False by default Signed-off-by: Jacek Bieniusiewicz <[email protected]> --------- Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * move model loading to separate function; call toContainer once; pad using 
closed formula Signed-off-by: Alexandros Koumparoulis <[email protected]> * read prompts from file Signed-off-by: Alexandros Koumparoulis <[email protected]> * If input prompt contains dict, apply model.tokenizer.chat_template Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * apply @Gal Leibovich's patch Taken from: https://github.com/NVIDIA/NeMo/commit/17572905344db4692583e72799d55801a8860f35 Signed-off-by: Alexandros Koumparoulis <[email protected]> * rename prompts_file to prompts_jsonl Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * add chat_template param Signed-off-by: Alexandros Koumparoulis <[email protected]> * Add ChatTemplateMixin to SentencePieceTokenizer Signed-off-by: Alexandros Koumparoulis <[email protected]> * add chat-template to text-gen-strat Signed-off-by: Alexandros Koumparoulis <[email protected]> * move load prompts to separate file Signed-off-by: Alexandros Koumparoulis <[email protected]> * remove chat-template from text-gen-utils Signed-off-by: Alexandros Koumparoulis <[email protected]> * make chat-template more generic Signed-off-by: Alexandros Koumparoulis <[email protected]> * add assert message Signed-off-by: Alexandros Koumparoulis <[email protected]> * small refactor for chat_template_mixin Signed-off-by: Alexandros Koumparoulis <[email protected]> * undo ckpt conv changes Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * move rounding to function Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and 
black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Abhishree <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: Abhishree Thittenamane <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Jsonl support (#9611) * Adding support to preprocess .jsonl and .jsonl.gz files in input directory Signed-off-by: adityavavre <[email protected]> * Adding support to preprocess .jsonl and .jsonl.gz files in input directory Signed-off-by: adityavavre <[email protected]> * Apply isort and black reformatting Signed-off-by: adityavavre <[email protected]> --------- Signed-off-by: adityavavre <[email protected]> Signed-off-by: adityavavre <[email protected]> Co-authored-by: adityavavre <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Add PEFT (#9490) * initial commit for PEFT in nemo2 * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * address comments Signed-off-by: Chen Cui <[email protected]> * make import easier Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * address comments Signed-off-by: Chen Cui <[email protected]> * Update nemo/collections/llm/peft/lora.py Signed-off-by: Marc Romeyn <[email protected]> * Some small fixes + adding more doc-strings * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Adding ModelTransform callback * Apply isort and black reformatting 
Signed-off-by: marcromeyn <[email protected]> * Fixing type-hint for model_transform * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * fix import Signed-off-by: Chen Cui <[email protected]> * model transform for gemma llama Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * fix model transform Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * change lora target default to all linear modules Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * Small fix in mixtral * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Integrating PEFT to the public-API + some fixes * Big refactor to allow to load adapter-states * Some fixes to support adapter_path * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Disabling ckpt reloading when adapter_path is passed * Fix CLI * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Remove commented-out code * Remove commented-out code * Remove un-used import * Fix callback imports * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fixing llm.pretrain * Some small fixes * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fix missing import + type-hint in finetune * Adding PreemptionCallback + some more tests * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Clean up imports & clean up llm.api * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Trying to fix failing tests * Remove __init__.py 2 * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fix failing test * Trying to fix last failing test --------- Signed-off-by: 
cuichenx <[email protected]> Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> Signed-off-by: marcromeyn <[email protected]> Co-authored-by: cuichenx <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Akoumparouli/mistral import instruct chat template fix (#9567) * use bf16 by defualt mistral conv Signed-off-by: Alexandros Koumparoulis <[email protected]> * add chat template Signed-off-by: Alexandros Koumparoulis <[email protected]> * use capitalized role names Signed-off-by: Alexandros Koumparoulis <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Remove .cuda calls, use device isntead (#9602) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix converter defautl args (#9565) * fix converter defautl args Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * mixtral export (#9603) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix: remove non_blocking from PTL's .cuda call (#9618) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Alit/mamba tmp (#9612) * adding mamba support * fix import mixins * rm convert jamba * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * more cleanups * use GPT text gen * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * 
fixing gbs in TP convetor * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * add reqs * add tutorial * minor fix to tutorial * moving finetuning files Signed-off-by: arendu <[email protected]> * moving finetuning files Signed-off-by: arendu <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * add mamba_tmp * remove mamba import * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> --------- Signed-off-by: JRD971000 <[email protected]> Signed-off-by: arendu <[email protected]> Co-authored-by: Ali Taghibakhshi <[email protected]> Co-authored-by: JRD971000 <[email protected]> Co-authored-by: arendu <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * TitaNet Batch Verify Speaker (#9337) * add batch_inference for verify_speakers method Signed-off-by: [email protected] <[email protected]> * remove not used package Signed-off-by: [email protected] <[email protected]> * change batch inference logic Signed-off-by: [email protected] <[email protected]> * fixup Signed-off-by: [email protected] <[email protected]> * requested changes Signed-off-by: [email protected] <[email protected]> * add verify_speakers_batch to docs Signed-off-by: [email protected] <[email protected]> * handle None durations in manifest Signed-off-by: [email protected] <[email protected]> * change logging text Signed-off-by: [email protected] <[email protected]> * Apply isort and black reformatting Signed-off-by: monica-sekoyan <[email protected]> * check duration presence Signed-off-by: [email protected] <[email protected]> * add channel_selector to dataset configs Signed-off-by: [email protected] <[email protected]> --------- Signed-off-by: [email protected] <[email protected]> Signed-off-by: monica-sekoyan <[email protected]> Co-authored-by: monica-sekoyan 
<[email protected]> Co-authored-by: Nithin Rao <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Enable MCore checkpointing optimizations (#9505) * Expose num processes in PyT Dist Signed-off-by: Mikołaj Błaż <[email protected]> * Add parallel save/load optimizations from MCore Signed-off-by: Mikołaj Błaż <[email protected]> * Remove async utils from MCore Signed-off-by: Mikołaj Błaż <[email protected]> * Enable DistOpt paralell R/W Signed-off-by: Mikołaj Błaż <[email protected]> * Enable PyT Dist caching Signed-off-by: Mikołaj Błaż <[email protected]> * Small fixes Signed-off-by: Mikołaj Błaż <[email protected]> * Make sure DistCkptIO is instantiated from config Signed-off-by: Mikołaj Błaż <[email protected]> * Bump MCore version to v0.7 Signed-off-by: Mikołaj Błaż <[email protected]> * Print load strategy Signed-off-by: Mikołaj Błaż <[email protected]> * Forward MCore to model space DistOpt Signed-off-by: Mikołaj Błaż <[email protected]> * Add separate flag to control DistOpt paralell R/W Signed-off-by: Mikołaj Błaż <[email protected]> * Turn off parallel save by default Signed-off-by: Mikołaj Błaż <[email protected]> --------- Signed-off-by: Mikołaj Błaż <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Change mixtral moe key name for trt-llm (#9620) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * change moe key values Signed-off-by: Onur Yilmaz <[email protected]> * add weight to the key Signed-off-by: Onur Yilmaz <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix ckpt load bug (#9621) * fix ckpt load bug Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> --------- Signed-off-by: dimapihtar <[email protected]> Signed-off-by: dimapihtar <[email protected]> Co-authored-by: dimapihtar <[email protected]> Signed-off-by: Tugrul Konuk <[email 
protected]> * NeVA Minor Fixes (#9608) * fix neva resume with empty param loaded for some pp stage Signed-off-by: yaoyu-33 <[email protected]> * fix crop size check Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix pretrianing data sizes and weights (#9627) Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Alit/mamba (#9575) * adding mamba support * fix import mixins * rm convert jamba * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * more cleanups * use GPT text gen * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * fixing gbs in TP convetor * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * add reqs * add tutorial * minor fix to tutorial * moving finetuning files Signed-off-by: arendu <[email protected]> * moving finetuning files Signed-off-by: arendu <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * address comments * add mamba dependancies * add mcore tag * modify dockerfile ci * modify dockerfile ci --------- Signed-off-by: JRD971000 <[email protected]> Signed-off-by: arendu <[email protected]> Co-authored-by: Ali Taghibakhshi <[email protected]> Co-authored-by: JRD971000 <[email protected]> Co-authored-by: arendu <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] async checkpointing support (#9466) * add async checkpointing support * fixes * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * add parallel read/write support 
and other optimizations * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * address comments, make dist checkpointing args configurable * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * fix small typo Signed-off-by: ashors1 <[email protected]> * Update default sharding type Co-authored-by: mikolajblaz <[email protected]> Signed-off-by: Anna Shors <[email protected]> * Update default sharding type Co-authored-by: mikolajblaz <[email protected]> Signed-off-by: Anna Shors <[email protected]> * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> --------- Signed-off-by: ashors1 <[email protected]> Signed-off-by: ashors1 <[email protected]> Signed-off-by: Anna Shors <[email protected]> Co-authored-by: ashors1 <[email protected]> Co-authored-by: mikolajblaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix the arguments of forward_for_export function in msdd_models (#9624) * Fix the arguments of forward_for_export function Signed-off-by: Taejin Park <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> --------- Signed-off-by: Taejin Park <[email protected]> Signed-off-by: tango4j <[email protected]> Co-authored-by: tango4j <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Change default parallel_save to False (#9632) Signed-off-by: Mikołaj Błaż <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Unwrap ckpt_io for model opt (async save) (#9622) Signed-off-by: Mikołaj Błaż <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * MCore T5 support for NeMo - Training (#9432) * huvu/mcore_t5 first commit from local * removing DEBUGGING prints * cleaning megatron_lm_encoder_decoder_model.py code * cleaning code * adding Github action test * only run mcore T5 test * only run mcore T5 test * only run mcore T5 test * only run mcore T5 test * reset 
.github/workflows/cicd-main.yml * reset .github/workflows/cicd-main.yml * adding condition self.mcore_t5 when running self.build_transformer_config() * refractor megatron_lm_encoder_decoder_model.py to not use self.model * only run T5-related tests * remove all self.model * reset cicd file * reset cicd file * updating codes remove duplicate if/else; adding mcore/transformer_engine to config file * adjust +model.mcore_t5=True * Apply isort and black reformatting Signed-off-by: huvunvidia <[email protected]> --------- Signed-off-by: huvunvidia <[email protected]> Co-authored-by: Huy Vu2 <[email protected]> Co-authored-by: huvunvidia <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] Expose transformer_layer_spec inside GPTConfig (#9592) * Expose transformer_layer_spec inside GPTConfig * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Expose layer-specs * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Update NeMo Clip to Use MCore Modules (#9594) * update clip model and config file Signed-off-by: yaoyu-33 <[email protected]> * update clip for mcore Signed-off-by: yaoyu-33 <[email protected]> * MCore CLIP Fix Signed-off-by: yaoyu-33 <[email protected]> * fix no mask Signed-off-by: yaoyu-33 <[email protected]> * few neva fixes Signed-off-by: yaoyu-33 <[email protected]> * update siglip module Signed-off-by: yaoyu-33 <[email protected]> * add siglip loss Signed-off-by: yaoyu-33 <[email protected]> * fix Signed-off-by: yaoyu-33 <[email protected]> * fix collate fn Signed-off-by: yaoyu-33 <[email protected]> * update siglip conversion script Signed-off-by: yaoyu-33 <[email protected]> * update siglip convert Signed-off-by: yaoyu-33 <[email protected]> * clip fixes Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and 
black reformatting Signed-off-by: yaoyu-33 <[email protected]> * clean up script Signed-off-by: yaoyu-33 <[email protected]> * clip fixe…

* Adding context- & expert-parallism to MegatronStrategy (#9525) Signed-off-by: Tugrul Konuk <[email protected]> * Add CICD test for Stable Diffusion (#9464) * Add CICD test for Stable Diffusion Signed-off-by: Michal Futrega <[email protected]> * Update cicd-main.yml Signed-off-by: Michal Futrega <[email protected]> * Use single gpu runner Signed-off-by: Michal Futrega <[email protected]> --------- Signed-off-by: Michal Futrega <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Akoumparouli/nemo ux mixtral (#9446) * use default collate if dataset does not have one Signed-off-by: Alexandros Koumparoulis <[email protected]> * mixtral config Signed-off-by: Alexandros Koumparoulis <[email protected]> * add convert_state Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix StateDictTransform for 2D layers, e.g. MoE Signed-off-by: Alexandros Koumparoulis <[email protected]> * pass num_moe_experts to specs Signed-off-by: Alexandros Koumparoulis <[email protected]> * udpate MixtralModel Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * mini docstring Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * update mcoreddp call (#9345) * update mcoreddp call Signed-off-by: Alexandros Koumparoulis <[email protected]> * update mcore commits Signed-off-by: Alexandros Koumparoulis <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: Pablo Garay <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Llama and Gemma (#9528) * add llama 
Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * add llama Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * add llama3 Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * fix typo Signed-off-by: Chen Cui <[email protected]> * enable importers with multiple models Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * add gemma Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * checks Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> --------- Signed-off-by: Chen Cui <[email protected]> Signed-off-by: cuichenx <[email protected]> Co-authored-by: cuichenx <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] minor logging bug fixes (#9529) * minor exp_manager bug fixes * remove print statement * fix docstring * fix AppState defaults --------- Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * mcore distOpt restore fix (#9421) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Custom Tiktoken tokenizer. Signed-off-by: Tugrul Konuk <[email protected]> * Fixed the tokenizer decoding on special tokens. Signed-off-by: Tugrul Konuk <[email protected]> * Apply isort and black reformatting Signed-off-by: ertkonuk <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Added token_to_id() method. 
Signed-off-by: Tugrul Konuk <[email protected]> * Update neva conversion script from and to HF (#9296) * Update NeMo script Signed-off-by: yaoyu-33 <[email protected]> * Fix example scripts Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * Update convert_llava_nemo_to_hf.py Signed-off-by: yaoyu-33 <[email protected]> * address comments Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * vLLM Export Support (#9381) * Export implementation for vLLM 0.4.3. Supports LLAMA2, Mistral, Mixtral (unverified), Gemma and StarCoder2 models. The nemo.export.tensorrt_llm alias was removed to avoid initializing TRT-LLM when importing anything from nemo.export. Signed-off-by: Alexey Panteleev <[email protected]> * Fixed some CodeQL warnings. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Removed empty files. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Updated the integration for vLLM 0.5.0. Signed-off-by: Alexey Panteleev <[email protected]> * Updated the vLLM deployment interface to use max_output_len instead of max_output_token. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Moved the Exporter class to nemo/export and renamed its file to vllm_exporter.py, to be more similar to TRT-LLM. 
Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Implemented vLLM support in the export tests, added functional testing, implemented forward evaluation on vLLM without Triton. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Moved the vLLM deployment functionality to the common deploy_triton.py script. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Fixed the CodeQL discovered issues. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Fixed one more return of a wrong dimensionality... Signed-off-by: Alexey Panteleev <[email protected]> * More wrong dimensionality returns. Signed-off-by: Alexey Panteleev <[email protected]> --------- Signed-off-by: Alexey Panteleev <[email protected]> Signed-off-by: apanteleev <[email protected]> Co-authored-by: apanteleev <[email protected]> Co-authored-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * PL: Delete precision if using plugin. 
TODO switch to MegatronTrainerBuilder (#9535) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add page context fmha (#9526) Signed-off-by: Tugrul Konuk <[email protected]> * extend get_gpt_layer_modelopt_spec to support MoE (#9532) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix mock data generation for legacy dataset (#9530) Signed-off-by: dimapihtar <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] IO fixes (#9512) * Improve IOMixin.io_transform_args to handle dataclasses better * Dump task json + img inside NeMoLogger * Adding store_io to train task * Update opt.connect to also propagate to __io__ * Rename opt to optim for consistency * Moving to using safe serialization using fiddle, only use cloudpickle when needed * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Using Config from fiddle instead of sdk for now * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Move enable_nemo_ckpt_io from MegatronStrategy to ModelCheckpoint * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Move nemo-ckpt to _get_finalize_save_checkpoint_callback * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Update TrainerContext & io.load_ckpt * Use renamed TrainerContext inside ModelCheckpoint * Remove double io saving * Rename lightning.pytorch.opt -> optim * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Remove store_io from train-task * Adding fiddle-extension for torch * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Move fdl_torch import * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Adding dtype to serialization * Some fixes * Apply isort and black reformatting Signed-off-by: 
marcromeyn <[email protected]> * Make TransformerConfig inherit from IOMixin to fix serialization error * Make TransformerConfig inherit from IOMixin to fix serialization error * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Add support for BuiltinFunctionType * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Add missing import * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fix dataclass fields --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Test C++ runtime on demand in nemo_export.py to avoid possible OOMs (#9544) * Add test_cpp_runtime flag Signed-off-by: Jan Lasek <[email protected]> * Apply isort and black reformatting Signed-off-by: janekl <[email protected]> --------- Signed-off-by: Jan Lasek <[email protected]> Signed-off-by: janekl <[email protected]> Co-authored-by: janekl <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix lhotse tests for v1.24.2 (#9546) * Fix lhotse tests for v1.24.0 Signed-off-by: Piotr Żelasko <[email protected]> * Fix RIR test Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * gpu_unitTests_notOptional (#9551) Signed-off-by: Tugrul Konuk <[email protected]> * add reset learning rate functionality (#9372) * add reset_lr functionality Signed-off-by: dimapihtar <[email protected]> * fix reset_lr logic Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * move reset_lr from optim section Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * add reset_lr value to config Signed-off-by: dimapihtar <[email protected]> * set reset_lr False by default 
Signed-off-by: dimapihtar <[email protected]> * remove extra line Signed-off-by: dimapihtar <[email protected]> * add reset_lr test Signed-off-by: dimapihtar <[email protected]> * add reset_lr test Signed-off-by: dimapihtar <[email protected]> * remove extra quote Signed-off-by: dimapihtar <[email protected]> * add ability to reset schedule's max_steps and decay_steps Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * change scheduler's first step logic when using reset_lr Signed-off-by: dimapihtar <[email protected]> * revert config Signed-off-by: dimapihtar <[email protected]> * fix reset_lr logic Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * revert config Signed-off-by: dimapihtar <[email protected]> * revert config Signed-off-by: dimapihtar <[email protected]> * update reset_lr comments Signed-off-by: dimapihtar <[email protected]> * add use cases for reset_lr feature Signed-off-by: dimapihtar <[email protected]> --------- Signed-off-by: dimapihtar <[email protected]> Signed-off-by: dimapihtar <[email protected]> Co-authored-by: dimapihtar <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add Python AIStore SDK to container and bump min Lhotse version (#9537) * Add Python AIStore SDK to requirements and bump min Lhotse version Signed-off-by: Piotr Żelasko <[email protected]> * Move AIStore Python SDK to Dockerfile, remove matplotlib/ipywidgets deps Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Adding 'use_dynamo' option for export to use onnx.dynamo_export() instead of onnx.export() (#9147) * Initial WARs to implement dynamo option for export Signed-off-by: Boris Fomitchev <[email protected]> * including weights in .onnx Signed-off-by: Boris Fomitchev <[email 
protected]> * dynamo_export works for many small models Signed-off-by: Boris Fomitchev <[email protected]> * External weights behaviour fixed Signed-off-by: Boris Fomitchev <[email protected]> * Cleanup Signed-off-by: Boris Fomitchev <[email protected]> * Apply isort and black reformatting Signed-off-by: borisfom <[email protected]> * print cleaned up Signed-off-by: Boris Fomitchev <[email protected]> * Added overloadable dynamic_shapes_for_export Signed-off-by: Boris Fomitchev <[email protected]> * Addressing code review Signed-off-by: Boris Fomitchev <[email protected]> * Fixing CI issues Signed-off-by: Boris Fomitchev <[email protected]> * Fixing CI test failure Signed-off-by: Boris Fomitchev <[email protected]> * Eliminated test cross-contamination Signed-off-by: Boris Fomitchev <[email protected]> --------- Signed-off-by: Boris Fomitchev <[email protected]> Signed-off-by: borisfom <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: Somshubra Majumdar <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Fix tokenizer IO (#9555) * Adding tokenizer to io-test + making it pass * Handling tokenizer correctly inside dump_io * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Removing not used import --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo UX] Move mistral_7b.py to mistral.py (#9545) * Move mistral_7b.py to mistral.py Signed-off-by: Alexandros Koumparoulis <[email protected]> * rename MixtralConfig to MixtralConfig8x7B Signed-off-by: Alexandros Koumparoulis <[email protected]> * mistral rename: mistralconfig7b & mistralmodel Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email 
protected]> * Use closed-formula to round by multiple (#9307) * Use closed-formula to round by multiple Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Co-authored-by: Pablo Garay <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * ci: Do not attempt to send slack on fork (#9556) * ci: Do not attempt to send slack on fork Signed-off-by: Oliver Koenig <[email protected]> * test Signed-off-by: Oliver Koenig <[email protected]> --------- Signed-off-by: Oliver Koenig <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix nemo export test (#9547) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * fix export test Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: oyilmaz-nvidia <[email protected]> Co-authored-by: oyilmaz-nvidia <[email protected]> Co-authored-by: Pablo Garay <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix SDXL incorrect name in docs (#9534) Signed-off-by: Tugrul Konuk <[email protected]> * GPU unit tests: Mark flaky tests to be fixed (#9559) Signed-off-by: Tugrul Konuk <[email protected]> * Bump PTL version (#9557) Signed-off-by: Abhishree <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Resiliency] Straggler detection (#9473) * Initial straggler det impl Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed CI code checks Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * 
Removed unused import Signed-off-by: Jacek Bieniusiewicz <[email protected]> * remove submodule Signed-off-by: Maanu Grover <[email protected]> * Updated documentation; Updated callback params; Cosmetic changes Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed straggler det config; Added basic test Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixes in test_straggler_det.py Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Updated straggler callback API Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * stop_if_detected=False by default Signed-off-by: Jacek Bieniusiewicz <[email protected]> --------- Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * switch to torch_dist as default dist checkpointing backend (#9541) Signed-off-by: ashors1 <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Checkpointing bug fixes (#9562) * fix checkpoint loading * fix * fixes * another fix * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> --------- Signed-off-by: ashors1 <[email protected]> Co-authored-by: ashors1 <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add tps and pps params to the export script (#9558) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * fix export test Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black 
reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * remove n_gpus param Signed-off-by: Onur Yilmaz <[email protected]> * add and fix parameters Signed-off-by: Onur Yilmaz <[email protected]> * fix deploy script Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * rename tps and pps params Signed-off-by: Onur Yilmaz <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: oyilmaz-nvidia <[email protected]> Co-authored-by: oyilmaz-nvidia <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Consolidate gpt continue training script into pretraining script (#9413) * Consolidate gpt continue training with pretraining Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix default config Signed-off-by: yaoyu-33 <[email protected]> * Add github action cicd Signed-off-by: yaoyu-33 <[email protected]> * extract _integrate_original_checkpoint_data as a method Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix getattr Signed-off-by: yaoyu-33 <[email protected]> * Revert "Add github action cicd" This reverts commit a453f16ba2be6413db932623009da893208acdd5. 
* Update comments in nlp_overrides.py Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add support to change Multi task model prompt (#9542) * Add support to change Multi task model prompt Signed-off-by: smajumdar <[email protected]> * Add support to change Multi task model prompt Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Update nemo/collections/common/prompts/formatter.py Co-authored-by: Piotr Żelasko <[email protected]> Signed-off-by: Somshubra Majumdar <[email protected]> * Address comments Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Address comments Signed-off-by: smajumdar <[email protected]> --------- Signed-off-by: smajumdar <[email protected]> Signed-off-by: titu1994 <[email protected]> Signed-off-by: Somshubra Majumdar <[email protected]> Co-authored-by: Piotr Żelasko <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add Multimodal Exporter (#9256) * Add video-neva TRT export * Add TRT inference * Change config * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Change export params * Remove unused import * Add neva export * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Change unpack nemo * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Add trt infer config * Fix neva trt inference * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Add exporter * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Fix infer * Add PyTriton * Apply isort and black reformatting Signed-off-by: 
meatybobby <[email protected]> * Fix deploy wrong dim * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Change to pass PIL Image * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Fix video neva deploy * Change query * Change deploy * Remove unused import * Change ptuning * Change to mm exporter * Add script * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Fix script --------- Signed-off-by: meatybobby <[email protected]> Co-authored-by: meatybobby <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Enable encoder adapters for Canary and MultiTaskAED models (#9409) * Fix assertions for adapter types Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Cleanup Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Finalize support for decoder adapters Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * fix the freeze/unfreeze problem by replacing as_frozen with torch.inference_mode * Apply isort and black reformatting Signed-off-by: weiqingw4ng <[email protected]> * Update tests to new generic way of module update Signed-off-by: smajumdar <[email protected]> * Finalize code for update module Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Fix variable name Signed-off-by: smajumdar <[email protected]> * Finalize projection support for transformer mha adapters Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Correct implementation of freeze restore Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> 
* Corrects the implementation of replace_adapter_modules to limit to just the top level modules Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Remove registration of Transformer MHA Signed-off-by: smajumdar <[email protected]> * Remove registration of Transformer MHA Signed-off-by: smajumdar <[email protected]> * Address reviewer comments Signed-off-by: smajumdar <[email protected]> --------- Signed-off-by: smajumdar <[email protected]> Signed-off-by: titu1994 <[email protected]> Signed-off-by: weiqingw4ng <[email protected]> Co-authored-by: Weiqing Wang <[email protected]> Co-authored-by: weiqingw4ng <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * pass option through (#9570) Signed-off-by: Maanu Grover <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * PTQ refinements (#9574) * Rename megatron_gpt_quantization -> megatron_gpt_ptq Signed-off-by: Jan Lasek <[email protected]> * Configure export.save_path as dir or tarball Signed-off-by: Jan Lasek <[email protected]> * PTQ docs update Signed-off-by: Jan Lasek <[email protected]> * Make model_type optional in case of quantized checkpoints Signed-off-by: Jan Lasek <[email protected]> * Drop unused save_nemo_model_config argument Signed-off-by: Jan Lasek <[email protected]> --------- Signed-off-by: Jan Lasek <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Audio model collection (#9263) * Audio model collection Signed-off-by: Ante Jukić <[email protected]> * Apply isort and black reformatting Signed-off-by: anteju <[email protected]> * Fix imports Signed-off-by: Ante Jukić <[email protected]> * Addressed PR comments Signed-off-by: Ante Jukić <[email protected]> * Apply isort and black reformatting Signed-off-by: anteju <[email protected]> --------- Signed-off-by: Ante Jukić <[email protected]> Signed-off-by: anteju <[email protected]> Co-authored-by: anteju <[email 
protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Fix Trainer serialization (#9571) * Fix Trainer serialization * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Update click version requirement (#9580) Signed-off-by: Dong Hyuk Chang <[email protected]> Co-authored-by: Dong Hyuk Chang <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Fault tolerance] Heartbeat detection (#9352) * Fault tolerance related changes Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Cosmetic changes in documentation Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Doc update round2 Signed-off-by: Jacek Bieniusiewicz <[email protected]> --------- Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Co-authored-by: Jacek Bieniusiewicz <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add ModelOpt QAT example for Llama2 SFT model (#9326) * add INT4 QAT example for Llama2 SFT model Signed-off-by: Keval Morabia <[email protected]> * Add config parameter to control kv cache quantization Signed-off-by: Keval Morabia <[email protected]> * Fix typo in cicd-main.yml for QAT test Signed-off-by: Keval Morabia <[email protected]> * fix nlp_overrides.py Signed-off-by: Keval Morabia <[email protected]> * address reviewer feedback Signed-off-by: Keval Morabia <[email protected]> * quantize unwrapped model Signed-off-by: Keval Morabia <[email protected]> * add compress export argument for qat config Signed-off-by: Keval Morabia <[email protected]> --------- Signed-off-by: Keval Morabia <[email protected]> 
Signed-off-by: Tugrul Konuk <[email protected]> * Set TE flag in legacy -> mcore conversion script (#9585) * set TE flag Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> --------- Signed-off-by: Chen Cui <[email protected]> Signed-off-by: cuichenx <[email protected]> Co-authored-by: cuichenx <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] Add fabric-API for manual forward-pass (#9577) * First pass over fabric-API * Adding Trainer -> Fabric conversion * Some small fixes to get a forward-pass in Fabric working * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Adding doc-string to Fabric.import_model * Adding track_io to io_init of Fabric * Fix Fabric.load_model + add doc-string * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Remove unused import * Some small fixes * Fix failing test --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] Add SDK-factories to llm-collection (#9589) * Adding sdk-factories to llm-collection * Removing _model from mistral + mixtral * Expose lr_scheduler inside lightning * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Multimodal projection layer adapter fix for PP>1 (#9445) * enabling multimodal adapters to load in PP>1 Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * parameterizing validate_access_integrity, set to false when PP>1 Signed-off-by: paul-gibbons <[email protected]> formatting fix Signed-off-by: paul-gibbons <[email protected]> Apply isort and black reformatting 
Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * update nlp_model.py Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * update modelPT with validate_access_integrity Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * updating save_restore_connector w/ validate_access_integrity Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * addressing comment Signed-off-by: paul-gibbons <[email protected]> * adding validate_access_integrity to super().load_config_and_state_dict() Signed-off-by: paul-gibbons <[email protected]> * testing reorder of validate_access_integrity for CI failures Signed-off-by: paul-gibbons <[email protected]> --------- Signed-off-by: paul-gibbons <[email protected]> Signed-off-by: paul-gibbons <[email protected]> Co-authored-by: paul-gibbons <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add offline quantization script for QLoRA deployment (#9455) * add qlora offline quantization script Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * clean Signed-off-by: Chen Cui <[email protected]> * docstring Signed-off-by: Chen Cui <[email protected]> --------- Signed-off-by: Chen Cui <[email protected]> Signed-off-by: cuichenx <[email protected]> Co-authored-by: cuichenx <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * qlora support more models (#9488) Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Some improvements to NeMoLogger (#9591) Signed-off-by: Tugrul Konuk <[email protected]> * Set n_gpu to None in 
nemo export (#9593) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * set ngpus to None Signed-off-by: Onur Yilmaz <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Inflight nemo model export support (#9527) * online model conversion and refit Signed-off-by: Jimmy Zhang <[email protected]> * clean code Signed-off-by: Jimmy Zhang <[email protected]> * cleanup Signed-off-by: Jimmy Zhang <[email protected]> * add refit, cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * combine weight conversion functions Signed-off-by: Jimmy Zhang <[email protected]> * cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <[email protected]> * remove debug print Signed-off-by: Jimmy Zhang <[email protected]> * cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * fix single gpu and cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <[email protected]> --------- Signed-off-by: JimmyZhang12 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * vLLM Export Improvements (#9596) * Separated the vLLM export functionality from the common deployment script into deploy_vllm_triton.py. Signed-off-by: Alexey Panteleev <[email protected]> * Fixed vocab_size for LLAMA3. Signed-off-by: Alexey Panteleev <[email protected]> * Export test: fixed deployment testing w/o Megatron, made functional tests optional, added --gpu_memory_utilization. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Addressing review and CodeQL comments. 
Signed-off-by: Alexey Panteleev <[email protected]> --------- Signed-off-by: Alexey Panteleev <[email protected]> Signed-off-by: apanteleev <[email protected]> Co-authored-by: apanteleev <[email protected]> Co-authored-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Set finalize_model_grads_func in on_fit_start instead to make sure it's being called (#9599) Signed-off-by: Tugrul Konuk <[email protected]> * Set no_sync_func & grad_sync_fucn (#9601) * Set no_sync_func & grad_sync_fucn Signed-off-by: Alexandros Koumparoulis <[email protected]> * set overlap_param_sync Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * small nemo logger bug fix (#9607) Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix the dict format returned by scheduler method (#9609) Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Dataloading enhancements and bug fixes (#9595) * fix dataloading + checkpoint restore * clean up data sampler * fix typo * support passing multiple paths to data module * fix validation dataloader * fix dataloader len when using gradient accumulation * fix progress bar * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * fix step count in loggers * fix blended dataset * address comments * address comment * move step logging into strategy * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> --------- Signed-off-by: ashors1 <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Co-authored-by: ashors1 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * 
Fix serialization of AutoResume (#9616) * fix serialization of autoresume * update undefined variables Signed-off-by: Tugrul Konuk <[email protected]> * Chat template support for megatron_gpt_eval.py (#9354) * Bump PTL version (#9557) Signed-off-by: Abhishree <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * [Resiliency] Straggler detection (#9473) * Initial straggler det impl Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed CI code checks Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Removed unused import Signed-off-by: Jacek Bieniusiewicz <[email protected]> * remove submodule Signed-off-by: Maanu Grover <[email protected]> * Updated documentation; Updated callback params; Cosmetic changes Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed straggler det config; Added basic test Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixes in test_straggler_det.py Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Updated straggler callback API Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * stop_if_detected=False by default Signed-off-by: Jacek Bieniusiewicz <[email protected]> --------- Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * move model loading to separate function; call toContainer once; pad using 
closed formula Signed-off-by: Alexandros Koumparoulis <[email protected]> * read prompts from file Signed-off-by: Alexandros Koumparoulis <[email protected]> * If input prompt contains dict, apply model.tokenizer.chat_template Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * apply @Gal Leibovich's patch Taken from: https://github.com/NVIDIA/NeMo/commit/17572905344db4692583e72799d55801a8860f35 Signed-off-by: Alexandros Koumparoulis <[email protected]> * rename prompts_file to prompts_jsonl Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * add chat_template param Signed-off-by: Alexandros Koumparoulis <[email protected]> * Add ChatTemplateMixin to SentencePieceTokenizer Signed-off-by: Alexandros Koumparoulis <[email protected]> * add chat-template to text-gen-strat Signed-off-by: Alexandros Koumparoulis <[email protected]> * move load prompts to separate file Signed-off-by: Alexandros Koumparoulis <[email protected]> * remove chat-template from text-gen-utils Signed-off-by: Alexandros Koumparoulis <[email protected]> * make chat-template more generic Signed-off-by: Alexandros Koumparoulis <[email protected]> * add assert message Signed-off-by: Alexandros Koumparoulis <[email protected]> * small refactor for chat_template_mixin Signed-off-by: Alexandros Koumparoulis <[email protected]> * undo ckpt conv changes Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * move rounding to function Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and 
black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Abhishree <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: Abhishree Thittenamane <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Jsonl support (#9611) * Adding support to preprocess .jsonl and .jsonl.gz files in input directory Signed-off-by: adityavavre <[email protected]> * Adding support to preprocess .jsonl and .jsonl.gz files in input directory Signed-off-by: adityavavre <[email protected]> * Apply isort and black reformatting Signed-off-by: adityavavre <[email protected]> --------- Signed-off-by: adityavavre <[email protected]> Signed-off-by: adityavavre <[email protected]> Co-authored-by: adityavavre <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Add PEFT (#9490) * initial commit for PEFT in nemo2 * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * address comments Signed-off-by: Chen Cui <[email protected]> * make import easier Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * address comments Signed-off-by: Chen Cui <[email protected]> * Update nemo/collections/llm/peft/lora.py Signed-off-by: Marc Romeyn <[email protected]> * Some small fixes + adding more doc-strings * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Adding ModelTransform callback * Apply isort and black reformatting 
Signed-off-by: marcromeyn <[email protected]> * Fixing type-hint for model_transform * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * fix import Signed-off-by: Chen Cui <[email protected]> * model transform for gemma llama Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * fix model transform Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * change lora target default to all linear modules Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * Small fix in mixtral * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Integrating PEFT to the public-API + some fixes * Big refactor to allow to load adapter-states * Some fixes to support adapter_path * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Disabling ckpt reloading when adapter_path is passed * Fix CLI * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Remove commented-out code * Remove commented-out code * Remove un-used import * Fix callback imports * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fixing llm.pretrain * Some small fixes * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fix missing import + type-hint in finetune * Adding PreemptionCallback + some more tests * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Clean up imports & clean up llm.api * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Trying to fix failing tests * Remove __init__.py 2 * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fix failing test * Trying to fix last failing test --------- Signed-off-by: 
cuichenx <[email protected]> Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> Signed-off-by: marcromeyn <[email protected]> Co-authored-by: cuichenx <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Akoumparouli/mistral import instruct chat template fix (#9567) * use bf16 by defualt mistral conv Signed-off-by: Alexandros Koumparoulis <[email protected]> * add chat template Signed-off-by: Alexandros Koumparoulis <[email protected]> * use capitalized role names Signed-off-by: Alexandros Koumparoulis <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Remove .cuda calls, use device isntead (#9602) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix converter defautl args (#9565) * fix converter defautl args Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * mixtral export (#9603) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix: remove non_blocking from PTL's .cuda call (#9618) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Alit/mamba tmp (#9612) * adding mamba support * fix import mixins * rm convert jamba * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * more cleanups * use GPT text gen * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * 
fixing gbs in TP convetor * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * add reqs * add tutorial * minor fix to tutorial * moving finetuning files Signed-off-by: arendu <[email protected]> * moving finetuning files Signed-off-by: arendu <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * add mamba_tmp * remove mamba import * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> --------- Signed-off-by: JRD971000 <[email protected]> Signed-off-by: arendu <[email protected]> Co-authored-by: Ali Taghibakhshi <[email protected]> Co-authored-by: JRD971000 <[email protected]> Co-authored-by: arendu <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * TitaNet Batch Verify Speaker (#9337) * add batch_inference for verify_speakers method Signed-off-by: [email protected] <[email protected]> * remove not used package Signed-off-by: [email protected] <[email protected]> * change batch inference logic Signed-off-by: [email protected] <[email protected]> * fixup Signed-off-by: [email protected] <[email protected]> * requested changes Signed-off-by: [email protected] <[email protected]> * add verify_speakers_batch to docs Signed-off-by: [email protected] <[email protected]> * handle None durations in manifest Signed-off-by: [email protected] <[email protected]> * change logging text Signed-off-by: [email protected] <[email protected]> * Apply isort and black reformatting Signed-off-by: monica-sekoyan <[email protected]> * check duration presence Signed-off-by: [email protected] <[email protected]> * add channel_selector to dataset configs Signed-off-by: [email protected] <[email protected]> --------- Signed-off-by: [email protected] <[email protected]> Signed-off-by: monica-sekoyan <[email protected]> Co-authored-by: monica-sekoyan 
<[email protected]> Co-authored-by: Nithin Rao <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Enable MCore checkpointing optimizations (#9505) * Expose num processes in PyT Dist Signed-off-by: Mikołaj Błaż <[email protected]> * Add parallel save/load optimizations from MCore Signed-off-by: Mikołaj Błaż <[email protected]> * Remove async utils from MCore Signed-off-by: Mikołaj Błaż <[email protected]> * Enable DistOpt paralell R/W Signed-off-by: Mikołaj Błaż <[email protected]> * Enable PyT Dist caching Signed-off-by: Mikołaj Błaż <[email protected]> * Small fixes Signed-off-by: Mikołaj Błaż <[email protected]> * Make sure DistCkptIO is instantiated from config Signed-off-by: Mikołaj Błaż <[email protected]> * Bump MCore version to v0.7 Signed-off-by: Mikołaj Błaż <[email protected]> * Print load strategy Signed-off-by: Mikołaj Błaż <[email protected]> * Forward MCore to model space DistOpt Signed-off-by: Mikołaj Błaż <[email protected]> * Add separate flag to control DistOpt paralell R/W Signed-off-by: Mikołaj Błaż <[email protected]> * Turn off parallel save by default Signed-off-by: Mikołaj Błaż <[email protected]> --------- Signed-off-by: Mikołaj Błaż <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Change mixtral moe key name for trt-llm (#9620) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * change moe key values Signed-off-by: Onur Yilmaz <[email protected]> * add weight to the key Signed-off-by: Onur Yilmaz <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix ckpt load bug (#9621) * fix ckpt load bug Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> --------- Signed-off-by: dimapihtar <[email protected]> Signed-off-by: dimapihtar <[email protected]> Co-authored-by: dimapihtar <[email protected]> Signed-off-by: Tugrul Konuk <[email 
protected]> * NeVA Minor Fixes (#9608) * fix neva resume with empty param loaded for some pp stage Signed-off-by: yaoyu-33 <[email protected]> * fix crop size check Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix pretrianing data sizes and weights (#9627) Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Alit/mamba (#9575) * adding mamba support * fix import mixins * rm convert jamba * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * more cleanups * use GPT text gen * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * fixing gbs in TP convetor * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * add reqs * add tutorial * minor fix to tutorial * moving finetuning files Signed-off-by: arendu <[email protected]> * moving finetuning files Signed-off-by: arendu <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * address comments * add mamba dependancies * add mcore tag * modify dockerfile ci * modify dockerfile ci --------- Signed-off-by: JRD971000 <[email protected]> Signed-off-by: arendu <[email protected]> Co-authored-by: Ali Taghibakhshi <[email protected]> Co-authored-by: JRD971000 <[email protected]> Co-authored-by: arendu <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] async checkpointing support (#9466) * add async checkpointing support * fixes * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * add parallel read/write support 
and other optimizations * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * address comments, make dist checkpointing args configurable * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * fix small typo Signed-off-by: ashors1 <[email protected]> * Update default sharding type Co-authored-by: mikolajblaz <[email protected]> Signed-off-by: Anna Shors <[email protected]> * Update default sharding type Co-authored-by: mikolajblaz <[email protected]> Signed-off-by: Anna Shors <[email protected]> * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> --------- Signed-off-by: ashors1 <[email protected]> Signed-off-by: ashors1 <[email protected]> Signed-off-by: Anna Shors <[email protected]> Co-authored-by: ashors1 <[email protected]> Co-authored-by: mikolajblaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix the arguments of forward_for_export function in msdd_models (#9624) * Fix the arguments of forward_for_export function Signed-off-by: Taejin Park <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> --------- Signed-off-by: Taejin Park <[email protected]> Signed-off-by: tango4j <[email protected]> Co-authored-by: tango4j <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Change default parallel_save to False (#9632) Signed-off-by: Mikołaj Błaż <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Unwrap ckpt_io for model opt (async save) (#9622) Signed-off-by: Mikołaj Błaż <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * MCore T5 support for NeMo - Training (#9432) * huvu/mcore_t5 first commit from local * removing DEBUGGING prints * cleaning megatron_lm_encoder_decoder_model.py code * cleaning code * adding Github action test * only run mcore T5 test * only run mcore T5 test * only run mcore T5 test * only run mcore T5 test * reset 
.github/workflows/cicd-main.yml * reset .github/workflows/cicd-main.yml * adding condition self.mcore_t5 when running self.build_transformer_config() * refractor megatron_lm_encoder_decoder_model.py to not use self.model * only run T5-related tests * remove all self.model * reset cicd file * reset cicd file * updating codes remove duplicate if/else; adding mcore/transformer_engine to config file * adjust +model.mcore_t5=True * Apply isort and black reformatting Signed-off-by: huvunvidia <[email protected]> --------- Signed-off-by: huvunvidia <[email protected]> Co-authored-by: Huy Vu2 <[email protected]> Co-authored-by: huvunvidia <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] Expose transformer_layer_spec inside GPTConfig (#9592) * Expose transformer_layer_spec inside GPTConfig * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Expose layer-specs * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Update NeMo Clip to Use MCore Modules (#9594) * update clip model and config file Signed-off-by: yaoyu-33 <[email protected]> * update clip for mcore Signed-off-by: yaoyu-33 <[email protected]> * MCore CLIP Fix Signed-off-by: yaoyu-33 <[email protected]> * fix no mask Signed-off-by: yaoyu-33 <[email protected]> * few neva fixes Signed-off-by: yaoyu-33 <[email protected]> * update siglip module Signed-off-by: yaoyu-33 <[email protected]> * add siglip loss Signed-off-by: yaoyu-33 <[email protected]> * fix Signed-off-by: yaoyu-33 <[email protected]> * fix collate fn Signed-off-by: yaoyu-33 <[email protected]> * update siglip conversion script Signed-off-by: yaoyu-33 <[email protected]> * update siglip convert Signed-off-by: yaoyu-33 <[email protected]> * clip fixes Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and 
black reformatting Signed-off-by: yaoyu-33 <[email protected]> * clean up script Signed-off-by: yaoyu-33 <[email protected]> * clip fixes Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix code styles Signed-off-by: yaoyu-33 <[email protected]> * Update siglip_loss.py Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add REST API to deploy module (#9539) * Add REST API and FastAPI to deploy module Signed-off-by: Abhishree <[email protected]> * Add NemoQuery and requirements Signed-off-by: Abhishree <[email protected]> * Edit path for config.json Signed-off-by: Abhishree <[email protected]> * Add modifications for REST API for the correct functionality Move service dir under deploy Use NeMoQueryLLM instead of NemoQuery Signed-off-by: Abhishree <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Apply isort and black reformatting Signed-off-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com> * Change default port for REST Service Change default port for REST service as Triton server also used the same port as default. 
Signed-off-by: Abhishree Thittenamane <[email protected]> * Apply isort and black reformatting Signed-off-by: athitten <[email protected]> --------- Signed-off-by: Abhishree <[email protected]> Signed-off-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Abhishree Thittenamane <[email protected]> Signed-off-by: athitten <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: athitten <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Mistral + Mixtral Support for NeVa (#9459) * mistral template support Signed-off-by: paul-gibbons <[email protected]> * get_specs neva fix Signed-off-by: paul-gibbons <[email protected]> * mistral update Signed-off-by: paul-gibbons <[email protected]> * fixed mistral tokenization Signed-off-by: paul-gibbons <[email protected]> * t…

* Use closed-formula to round by multiple Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Co-authored-by: Pablo Garay <[email protected]> Signed-off-by: tonyjie <[email protected]>
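For readers unfamiliar with the phrase, "closed-formula rounding by multiple" usually means computing the next multiple directly with integer arithmetic instead of incrementing a value until it becomes divisible. The sketch below illustrates that idea only; it is not the code from this PR, and the function names `round_up_loop` and `round_up_closed` are hypothetical.

```python
def round_up_loop(value: int, multiple: int) -> int:
    """Iterative approach: increment value until it is divisible by multiple."""
    if multiple <= 0:
        raise ValueError("multiple must be a positive integer")
    while value % multiple != 0:
        value += 1
    return value


def round_up_closed(value: int, multiple: int) -> int:
    """Closed-form approach: ceil(value / multiple) * multiple, integers only."""
    if multiple <= 0:
        raise ValueError("multiple must be a positive integer")
    # Adding (multiple - 1) before floor division yields the ceiling quotient.
    return (value + multiple - 1) // multiple * multiple


if __name__ == "__main__":
    # Both versions agree; the closed form takes constant time per call.
    for n in range(0, 50):
        assert round_up_loop(n, 8) == round_up_closed(n, 8)
    print(round_up_closed(10, 8))
```

Using integer floor division rather than `math.ceil(value / multiple)` avoids floating-point rounding for very large values, which is one reason this form is often preferred when padding sequence lengths or tensor dimensions.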

* Adding context- & expert-parallism to MegatronStrategy (#9525) Signed-off-by: Tugrul Konuk <[email protected]> * Add CICD test for Stable Diffusion (#9464) * Add CICD test for Stable Diffusion Signed-off-by: Michal Futrega <[email protected]> * Update cicd-main.yml Signed-off-by: Michal Futrega <[email protected]> * Use single gpu runner Signed-off-by: Michal Futrega <[email protected]> --------- Signed-off-by: Michal Futrega <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Akoumparouli/nemo ux mixtral (#9446) * use default collate if dataset does not have one Signed-off-by: Alexandros Koumparoulis <[email protected]> * mixtral config Signed-off-by: Alexandros Koumparoulis <[email protected]> * add convert_state Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix StateDictTransform for 2D layers, e.g. MoE Signed-off-by: Alexandros Koumparoulis <[email protected]> * pass num_moe_experts to specs Signed-off-by: Alexandros Koumparoulis <[email protected]> * udpate MixtralModel Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * mini docstring Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * update mcoreddp call (#9345) * update mcoreddp call Signed-off-by: Alexandros Koumparoulis <[email protected]> * update mcore commits Signed-off-by: Alexandros Koumparoulis <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: Pablo Garay <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Llama and Gemma (#9528) * add llama 
Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * add llama Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * add llama3 Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * fix typo Signed-off-by: Chen Cui <[email protected]> * enable importers with multiple models Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * add gemma Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * checks Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> --------- Signed-off-by: Chen Cui <[email protected]> Signed-off-by: cuichenx <[email protected]> Co-authored-by: cuichenx <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] minor logging bug fixes (#9529) * minor exp_manager bug fixes * remove print statement * fix docstring * fix AppState defaults --------- Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * mcore distOpt restore fix (#9421) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Custom Tiktoken tokenizer. Signed-off-by: Tugrul Konuk <[email protected]> * Fixed the tokenizer decoding on special tokens. Signed-off-by: Tugrul Konuk <[email protected]> * Apply isort and black reformatting Signed-off-by: ertkonuk <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Added token_to_id() method. 
Signed-off-by: Tugrul Konuk <[email protected]> * Update neva conversion script from and to HF (#9296) * Update NeMo script Signed-off-by: yaoyu-33 <[email protected]> * Fix example scripts Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * Update convert_llava_nemo_to_hf.py Signed-off-by: yaoyu-33 <[email protected]> * address comments Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * vLLM Export Support (#9381) * Export implementation for vLLM 0.4.3. Supports LLAMA2, Mistral, Mixtral (unverified), Gemma and StarCoder2 models. The nemo.export.tensorrt_llm alias was removed to avoid initializing TRT-LLM when importing anything from nemo.export. Signed-off-by: Alexey Panteleev <[email protected]> * Fixed some CodeQL warnings. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Removed empty files. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Updated the integration for vLLM 0.5.0. Signed-off-by: Alexey Panteleev <[email protected]> * Updated the vLLM deployment interface to use max_output_len instead of max_output_token. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Moved the Exporter class to nemo/export and renamed its file to vllm_exporter.py, to be more similar to TRT-LLM. 
Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Implemented vLLM support in the export tests, added functional testing, implemented forward evaluation on vLLM without Triton. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Moved the vLLM deployment functionality to the common deploy_triton.py script. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Fixed the CodeQL discovered issues. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Fixed one more return of a wrong dimensionality... Signed-off-by: Alexey Panteleev <[email protected]> * More wrong dimensionality returns. Signed-off-by: Alexey Panteleev <[email protected]> --------- Signed-off-by: Alexey Panteleev <[email protected]> Signed-off-by: apanteleev <[email protected]> Co-authored-by: apanteleev <[email protected]> Co-authored-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * PL: Delete precision if using plugin. 
TODO switch to MegatronTrainerBuilder (#9535) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add page context fmha (#9526) Signed-off-by: Tugrul Konuk <[email protected]> * extend get_gpt_layer_modelopt_spec to support MoE (#9532) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix mock data generation for legacy dataset (#9530) Signed-off-by: dimapihtar <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] IO fixes (#9512) * Improve IOMixin.io_transform_args to handle dataclasses better * Dump task json + img inside NeMoLogger * Adding store_io to train task * Update opt.connect to also propagate to __io__ * Rename opt to optim for consistency * Moving to using safe serialization using fiddle, only use cloudpickle when needed * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Using Config from fiddle instead of sdk for now * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Move enable_nemo_ckpt_io from MegatronStrategy to ModelCheckpoint * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Move nemo-ckpt to _get_finalize_save_checkpoint_callback * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Update TrainerContext & io.load_ckpt * Use renamed TrainerContext inside ModelCheckpoint * Remove double io saving * Rename lightning.pytorch.opt -> optim * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Remove store_io from train-task * Adding fiddle-extension for torch * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Move fdl_torch import * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Adding dtype to serialization * Some fixes * Apply isort and black reformatting Signed-off-by: 
marcromeyn <[email protected]> * Make TransformerConfig inherit from IOMixin to fix serialization error * Make TransformerConfig inherit from IOMixin to fix serialization error * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Add support for BuiltinFunctionType * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Add missing import * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fix dataclass fields --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Test C++ runtime on demand in nemo_export.py to avoid possible OOMs (#9544) * Add test_cpp_runtime flag Signed-off-by: Jan Lasek <[email protected]> * Apply isort and black reformatting Signed-off-by: janekl <[email protected]> --------- Signed-off-by: Jan Lasek <[email protected]> Signed-off-by: janekl <[email protected]> Co-authored-by: janekl <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix lhotse tests for v1.24.2 (#9546) * Fix lhotse tests for v1.24.0 Signed-off-by: Piotr Żelasko <[email protected]> * Fix RIR test Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * gpu_unitTests_notOptional (#9551) Signed-off-by: Tugrul Konuk <[email protected]> * add reset learning rate functionality (#9372) * add reset_lr functionality Signed-off-by: dimapihtar <[email protected]> * fix reset_lr logic Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * move reset_lr from optim section Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * add reset_lr value to config Signed-off-by: dimapihtar <[email protected]> * set reset_lr False by default 
Signed-off-by: dimapihtar <[email protected]> * remove extra line Signed-off-by: dimapihtar <[email protected]> * add reset_lr test Signed-off-by: dimapihtar <[email protected]> * add reset_lr test Signed-off-by: dimapihtar <[email protected]> * remove extra quote Signed-off-by: dimapihtar <[email protected]> * add ability to reset schedule's max_steps and decay_steps Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * change scheduler's first step logic when using reset_lr Signed-off-by: dimapihtar <[email protected]> * revert config Signed-off-by: dimapihtar <[email protected]> * fix reset_lr logic Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * revert config Signed-off-by: dimapihtar <[email protected]> * revert config Signed-off-by: dimapihtar <[email protected]> * update reset_lr comments Signed-off-by: dimapihtar <[email protected]> * add use cases for reset_lr feature Signed-off-by: dimapihtar <[email protected]> --------- Signed-off-by: dimapihtar <[email protected]> Signed-off-by: dimapihtar <[email protected]> Co-authored-by: dimapihtar <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add Python AIStore SDK to container and bump min Lhotse version (#9537) * Add Python AIStore SDK to requirements and bump min Lhotse version Signed-off-by: Piotr Żelasko <[email protected]> * Move AIStore Python SDK to Dockerfile, remove matplotlib/ipywidgets deps Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Adding 'use_dynamo' option for export to use onnx.dynamo_export() instead of onnx.export() (#9147) * Ininial WARs to implement dynamo option for export Signed-off-by: Boris Fomitchev <[email protected]> * including weights in .onnx Signed-off-by: Boris Fomitchev <[email 
protected]> * dynamo_export works for many small models Signed-off-by: Boris Fomitchev <[email protected]> * External weights behaviour fixed Signed-off-by: Boris Fomitchev <[email protected]> * Cleanup Signed-off-by: Boris Fomitchev <[email protected]> * Apply isort and black reformatting Signed-off-by: borisfom <[email protected]> * print cleaned up Signed-off-by: Boris Fomitchev <[email protected]> * Added overloadable dynamic_shapes_for_export Signed-off-by: Boris Fomitchev <[email protected]> * Addressing code review Signed-off-by: Boris Fomitchev <[email protected]> * Fixing CI issues Signed-off-by: Boris Fomitchev <[email protected]> * Fixing CI test failure Signed-off-by: Boris Fomitchev <[email protected]> * Eliminated test cross-contamination Signed-off-by: Boris Fomitchev <[email protected]> --------- Signed-off-by: Boris Fomitchev <[email protected]> Signed-off-by: borisfom <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: Somshubra Majumdar <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Fix tokenizer IO (#9555) * Adding tokenizer to io-test + making it pass * Handling tokenizer correctly inside dump_io * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Removing not used import --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo UX] Move mistral_7b.py to mistral.py (#9545) * Move mistral_7b.py to mistral.py Signed-off-by: Alexandros Koumparoulis <[email protected]> * rename MixtralConfig to MixtralConfig8x7B Signed-off-by: Alexandros Koumparoulis <[email protected]> * mistral rename: mistralconfig7b & mistralmodel Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email 
protected]> * Use closed-formula to round by multiple (#9307) * Use closed-formula to round by multiple Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Co-authored-by: Pablo Garay <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * ci: Do not attempt to send slack on fork (#9556) * ci: Do not attempt to send slack on fork Signed-off-by: Oliver Koenig <[email protected]> * test Signed-off-by: Oliver Koenig <[email protected]> --------- Signed-off-by: Oliver Koenig <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix nemo export test (#9547) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * fix export test Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: oyilmaz-nvidia <[email protected]> Co-authored-by: oyilmaz-nvidia <[email protected]> Co-authored-by: Pablo Garay <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix SDXL incorrect name in docs (#9534) Signed-off-by: Tugrul Konuk <[email protected]> * GPU unit tests: Mark flaky tests to be fixed (#9559) Signed-off-by: Tugrul Konuk <[email protected]> * Bump PTL version (#9557) Signed-off-by: Abhishree <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Resiliency] Straggler detection (#9473) * Initial straggler det impl Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed CI code checks Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * 
Removed unused import Signed-off-by: Jacek Bieniusiewicz <[email protected]> * remove submodule Signed-off-by: Maanu Grover <[email protected]> * Updated documentation; Updated callback params; Cosmetic changes Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed straggler det config; Added basic test Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixes in test_straggler_det.py Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Updated straggler callback API Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * stop_if_detected=False by default Signed-off-by: Jacek Bieniusiewicz <[email protected]> --------- Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * switch to torch_dist as default dist checkpointing backend (#9541) Signed-off-by: ashors1 <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Checkpointing bug fixes (#9562) * fix checkpoint loading * fix * fixes * another fix * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> --------- Signed-off-by: ashors1 <[email protected]> Co-authored-by: ashors1 <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add tps and pps params to the export script (#9558) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * fix export test Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black 
reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * remove n_gpus param Signed-off-by: Onur Yilmaz <[email protected]> * add and fix parameters Signed-off-by: Onur Yilmaz <[email protected]> * fix deploy script Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * rename tps and pps params Signed-off-by: Onur Yilmaz <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: oyilmaz-nvidia <[email protected]> Co-authored-by: oyilmaz-nvidia <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Consolidate gpt continue training script into pretraining script (#9413) * Consolidate gpt continue training with pretraining Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix default config Signed-off-by: yaoyu-33 <[email protected]> * Add github action cicd Signed-off-by: yaoyu-33 <[email protected]> * extract _integrate_original_checkpoint_data as a method Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix getattr Signed-off-by: yaoyu-33 <[email protected]> * Revert "Add github action cicd" This reverts commit a453f16ba2be6413db932623009da893208acdd5. 
* Update comments in nlp_overrides.py Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add support to change Multi task model prompt (#9542) * Add support to change Multi task model prompt Signed-off-by: smajumdar <[email protected]> * Add support to change Multi task model prompt Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Update nemo/collections/common/prompts/formatter.py Co-authored-by: Piotr Żelasko <[email protected]> Signed-off-by: Somshubra Majumdar <[email protected]> * Address comments Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Address comments Signed-off-by: smajumdar <[email protected]> --------- Signed-off-by: smajumdar <[email protected]> Signed-off-by: titu1994 <[email protected]> Signed-off-by: Somshubra Majumdar <[email protected]> Co-authored-by: Piotr Żelasko <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add Multimodal Exporter (#9256) * Add video-neva TRT export * Add TRT inference * Change config * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Change export params * Remove unused import * Add neva export * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Change unpack nemo * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Add trt infer config * Fix neva trt inference * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Add exporter * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Fix infer * Add PyTriton * Apply isort and black reformatting Signed-off-by: 
meatybobby <[email protected]> * Fix deploy wrong dim * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Change to pass PIL Image * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Fix video neva deploy * Change query * Change deploy * Remove unused import * Change ptuning * Change to mm exporter * Add script * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Fix script --------- Signed-off-by: meatybobby <[email protected]> Co-authored-by: meatybobby <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Enable encoder adapters for Canary and MultiTaskAED models (#9409) * Fix assertions for adapter types Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Cleanup Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Finalize support for decoder adapters Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * fix the freeze/unfreeze problem by replacing as_frozen with torch.inference_mode * Apply isort and black reformatting Signed-off-by: weiqingw4ng <[email protected]> * Update tests to new generic way of module update Signed-off-by: smajumdar <[email protected]> * Finalize code for update module Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Fix variable name Signed-off-by: smajumdar <[email protected]> * Finalize projection support for transformer mha adapters Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Correct implementation of freeze restore Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> 
* Corrects the implementation of replace_adapter_modules to limit to just the top level modules Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Remove registration of Transformer MHA Signed-off-by: smajumdar <[email protected]> * Remove registration of Transformer MHA Signed-off-by: smajumdar <[email protected]> * Address reviewer comments Signed-off-by: smajumdar <[email protected]> --------- Signed-off-by: smajumdar <[email protected]> Signed-off-by: titu1994 <[email protected]> Signed-off-by: weiqingw4ng <[email protected]> Co-authored-by: Weiqing Wang <[email protected]> Co-authored-by: weiqingw4ng <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * pass option through (#9570) Signed-off-by: Maanu Grover <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * PTQ refinements (#9574) * Rename megatron_gpt_quantization -> megatron_gpt_ptq Signed-off-by: Jan Lasek <[email protected]> * Configure export.save_path as dir or tarball Signed-off-by: Jan Lasek <[email protected]> * PTQ docs update Signed-off-by: Jan Lasek <[email protected]> * Make model_type optional in case of quantized checkpoints Signed-off-by: Jan Lasek <[email protected]> * Drop unused save_nemo_model_config argument Signed-off-by: Jan Lasek <[email protected]> --------- Signed-off-by: Jan Lasek <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Audio model collection (#9263) * Audio model collection Signed-off-by: Ante Jukić <[email protected]> * Apply isort and black reformatting Signed-off-by: anteju <[email protected]> * Fix imports Signed-off-by: Ante Jukić <[email protected]> * Addressed PR comments Signed-off-by: Ante Jukić <[email protected]> * Apply isort and black reformatting Signed-off-by: anteju <[email protected]> --------- Signed-off-by: Ante Jukić <[email protected]> Signed-off-by: anteju <[email protected]> Co-authored-by: anteju <[email 
protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Fix Trainer serialization (#9571) * Fix Trainer serialization * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Update click version requirement (#9580) Signed-off-by: Dong Hyuk Chang <[email protected]> Co-authored-by: Dong Hyuk Chang <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Fault tolerance] Heartbeat detection (#9352) * Fault tolerance related changes Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Cosmetic changes in documentation Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Doc update round2 Signed-off-by: Jacek Bieniusiewicz <[email protected]> --------- Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Co-authored-by: Jacek Bieniusiewicz <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add ModelOpt QAT example for Llama2 SFT model (#9326) * add INT4 QAT example for Llama2 SFT model Signed-off-by: Keval Morabia <[email protected]> * Add config parameter to control kv cache quantization Signed-off-by: Keval Morabia <[email protected]> * Fix typo in cicd-main.yml for QAT test Signed-off-by: Keval Morabia <[email protected]> * fix nlp_overrides.py Signed-off-by: Keval Morabia <[email protected]> * address reviewer feedback Signed-off-by: Keval Morabia <[email protected]> * quantize unwrapped model Signed-off-by: Keval Morabia <[email protected]> * add compress export argument for qat config Signed-off-by: Keval Morabia <[email protected]> --------- Signed-off-by: Keval Morabia <[email protected]> 
Signed-off-by: Tugrul Konuk <[email protected]> * Set TE flag in legacy -> mcore conversion script (#9585) * set TE flag Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> --------- Signed-off-by: Chen Cui <[email protected]> Signed-off-by: cuichenx <[email protected]> Co-authored-by: cuichenx <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] Add fabric-API for manual forward-pass (#9577) * First pass over fabric-API * Adding Trainer -> Fabric conversion * Some small fixes to get a forward-pass in Fabric working * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Adding doc-string to Fabric.import_model * Adding track_io to io_init of Fabric * Fix Fabric.load_model + add doc-string * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Remove unused import * Some small fixes * Fix failing test --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] Add SDK-factories to llm-collection (#9589) * Adding sdk-factories to llm-collection * Removing _model from mistral + mixtral * Expose lr_scheduler inside lightning * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Multimodal projection layer adapter fix for PP>1 (#9445) * enabling multimodal adapters to load in PP>1 Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * parameterizing validate_access_integrity, set to false when PP>1 Signed-off-by: paul-gibbons <[email protected]> formatting fix Signed-off-by: paul-gibbons <[email protected]> Apply isort and black reformatting 
Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * update nlp_model.py Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * update modelPT with validate_access_integrity Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * updating save_restore_connector w/ validate_access_integrity Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * addressing comment Signed-off-by: paul-gibbons <[email protected]> * adding validate_access_integrity to super().load_config_and_state_dict() Signed-off-by: paul-gibbons <[email protected]> * testing reorder of validate_access_integrity for CI failures Signed-off-by: paul-gibbons <[email protected]> --------- Signed-off-by: paul-gibbons <[email protected]> Signed-off-by: paul-gibbons <[email protected]> Co-authored-by: paul-gibbons <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add offline quantization script for QLoRA deployment (#9455) * add qlora offline quantization script Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * clean Signed-off-by: Chen Cui <[email protected]> * docstring Signed-off-by: Chen Cui <[email protected]> --------- Signed-off-by: Chen Cui <[email protected]> Signed-off-by: cuichenx <[email protected]> Co-authored-by: cuichenx <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * qlora support more models (#9488) Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Some improvements to NeMoLogger (#9591) Signed-off-by: Tugrul Konuk <[email protected]> * Set n_gpu to None in 
nemo export (#9593) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * set ngpus to None Signed-off-by: Onur Yilmaz <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Inflight nemo model export support (#9527) * online model conversion and refit Signed-off-by: Jimmy Zhang <[email protected]> * clean code Signed-off-by: Jimmy Zhang <[email protected]> * cleanup Signed-off-by: Jimmy Zhang <[email protected]> * add refit, cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * combine weight conversion functions Signed-off-by: Jimmy Zhang <[email protected]> * cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <[email protected]> * remove debug print Signed-off-by: Jimmy Zhang <[email protected]> * cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * fix single gpu and cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <[email protected]> --------- Signed-off-by: JimmyZhang12 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * vLLM Export Improvements (#9596) * Separated the vLLM export functionality from the common deployment script into deploy_vllm_triton.py. Signed-off-by: Alexey Panteleev <[email protected]> * Fixed vocab_size for LLAMA3. Signed-off-by: Alexey Panteleev <[email protected]> * Export test: fixed deployment testing w/o Megatron, made functional tests optional, added --gpu_memory_utilization. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Addressing review and CodeQL comments. 
Signed-off-by: Alexey Panteleev <[email protected]> --------- Signed-off-by: Alexey Panteleev <[email protected]> Signed-off-by: apanteleev <[email protected]> Co-authored-by: apanteleev <[email protected]> Co-authored-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Set finalize_model_grads_func in on_fit_start instead to make sure it's being called (#9599) Signed-off-by: Tugrul Konuk <[email protected]> * Set no_sync_func & grad_sync_func (#9601) * Set no_sync_func & grad_sync_func Signed-off-by: Alexandros Koumparoulis <[email protected]> * set overlap_param_sync Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * small nemo logger bug fix (#9607) Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix the dict format returned by scheduler method (#9609) Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Dataloading enhancements and bug fixes (#9595) * fix dataloading + checkpoint restore * clean up data sampler * fix typo * support passing multiple paths to data module * fix validation dataloader * fix dataloader len when using gradient accumulation * fix progress bar * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * fix step count in loggers * fix blended dataset * address comments * address comment * move step logging into strategy * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> --------- Signed-off-by: ashors1 <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Co-authored-by: ashors1 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * 
Fix serialization of AutoResume (#9616) * fix serialization of autoresume * update undefined variables Signed-off-by: Tugrul Konuk <[email protected]> * Chat template support for megatron_gpt_eval.py (#9354) * Bump PTL version (#9557) Signed-off-by: Abhishree <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * [Resiliency] Straggler detection (#9473) * Initial straggler det impl Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed CI code checks Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Removed unused import Signed-off-by: Jacek Bieniusiewicz <[email protected]> * remove submodule Signed-off-by: Maanu Grover <[email protected]> * Updated documentation; Updated callback params; Cosmetic changes Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed straggler det config; Added basic test Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixes in test_straggler_det.py Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Updated straggler callback API Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * stop_if_detected=False by default Signed-off-by: Jacek Bieniusiewicz <[email protected]> --------- Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * move model loading to separate function; call toContainer once; pad using 
closed formula Signed-off-by: Alexandros Koumparoulis <[email protected]> * read prompts from file Signed-off-by: Alexandros Koumparoulis <[email protected]> * If input prompt contains dict, apply model.tokenizer.chat_template Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * apply @Gal Leibovich's patch Taken from: https://github.com/NVIDIA/NeMo/commit/17572905344db4692583e72799d55801a8860f35 Signed-off-by: Alexandros Koumparoulis <[email protected]> * rename prompts_file to prompts_jsonl Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * add chat_template param Signed-off-by: Alexandros Koumparoulis <[email protected]> * Add ChatTemplateMixin to SentencePieceTokenizer Signed-off-by: Alexandros Koumparoulis <[email protected]> * add chat-template to text-gen-strat Signed-off-by: Alexandros Koumparoulis <[email protected]> * move load prompts to separate file Signed-off-by: Alexandros Koumparoulis <[email protected]> * remove chat-template from text-gen-utils Signed-off-by: Alexandros Koumparoulis <[email protected]> * make chat-template more generic Signed-off-by: Alexandros Koumparoulis <[email protected]> * add assert message Signed-off-by: Alexandros Koumparoulis <[email protected]> * small refactor for chat_template_mixin Signed-off-by: Alexandros Koumparoulis <[email protected]> * undo ckpt conv changes Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * move rounding to function Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and 
black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Abhishree <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: Abhishree Thittenamane <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Jsonl support (#9611) * Adding support to preprocess .jsonl and .jsonl.gz files in input directory Signed-off-by: adityavavre <[email protected]> * Adding support to preprocess .jsonl and .jsonl.gz files in input directory Signed-off-by: adityavavre <[email protected]> * Apply isort and black reformatting Signed-off-by: adityavavre <[email protected]> --------- Signed-off-by: adityavavre <[email protected]> Signed-off-by: adityavavre <[email protected]> Co-authored-by: adityavavre <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Add PEFT (#9490) * initial commit for PEFT in nemo2 * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * address comments Signed-off-by: Chen Cui <[email protected]> * make import easier Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * address comments Signed-off-by: Chen Cui <[email protected]> * Update nemo/collections/llm/peft/lora.py Signed-off-by: Marc Romeyn <[email protected]> * Some small fixes + adding more doc-strings * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Adding ModelTransform callback * Apply isort and black reformatting 
Signed-off-by: marcromeyn <[email protected]> * Fixing type-hint for model_transform * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * fix import Signed-off-by: Chen Cui <[email protected]> * model transform for gemma llama Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * fix model transform Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * change lora target default to all linear modules Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * Small fix in mixtral * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Integrating PEFT to the public-API + some fixes * Big refactor to allow to load adapter-states * Some fixes to support adapter_path * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Disabling ckpt reloading when adapter_path is passed * Fix CLI * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Remove commented-out code * Remove commented-out code * Remove un-used import * Fix callback imports * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fixing llm.pretrain * Some small fixes * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fix missing import + type-hint in finetune * Adding PreemptionCallback + some more tests * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Clean up imports & clean up llm.api * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Trying to fix failing tests * Remove __init__.py 2 * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fix failing test * Trying to fix last failing test --------- Signed-off-by: 
cuichenx <[email protected]> Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> Signed-off-by: marcromeyn <[email protected]> Co-authored-by: cuichenx <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Akoumparouli/mistral import instruct chat template fix (#9567) * use bf16 by default mistral conv Signed-off-by: Alexandros Koumparoulis <[email protected]> * add chat template Signed-off-by: Alexandros Koumparoulis <[email protected]> * use capitalized role names Signed-off-by: Alexandros Koumparoulis <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Remove .cuda calls, use device instead (#9602) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix converter default args (#9565) * fix converter default args Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * mixtral export (#9603) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix: remove non_blocking from PTL's .cuda call (#9618) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Alit/mamba tmp (#9612) * adding mamba support * fix import mixins * rm convert jamba * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * more cleanups * use GPT text gen * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * 
fixing gbs in TP converter * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * add reqs * add tutorial * minor fix to tutorial * moving finetuning files Signed-off-by: arendu <[email protected]> * moving finetuning files Signed-off-by: arendu <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * add mamba_tmp * remove mamba import * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> --------- Signed-off-by: JRD971000 <[email protected]> Signed-off-by: arendu <[email protected]> Co-authored-by: Ali Taghibakhshi <[email protected]> Co-authored-by: JRD971000 <[email protected]> Co-authored-by: arendu <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * TitaNet Batch Verify Speaker (#9337) * add batch_inference for verify_speakers method Signed-off-by: [email protected] <[email protected]> * remove not used package Signed-off-by: [email protected] <[email protected]> * change batch inference logic Signed-off-by: [email protected] <[email protected]> * fixup Signed-off-by: [email protected] <[email protected]> * requested changes Signed-off-by: [email protected] <[email protected]> * add verify_speakers_batch to docs Signed-off-by: [email protected] <[email protected]> * handle None durations in manifest Signed-off-by: [email protected] <[email protected]> * change logging text Signed-off-by: [email protected] <[email protected]> * Apply isort and black reformatting Signed-off-by: monica-sekoyan <[email protected]> * check duration presence Signed-off-by: [email protected] <[email protected]> * add channel_selector to dataset configs Signed-off-by: [email protected] <[email protected]> --------- Signed-off-by: [email protected] <[email protected]> Signed-off-by: monica-sekoyan <[email protected]> Co-authored-by: monica-sekoyan 
<[email protected]> Co-authored-by: Nithin Rao <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Enable MCore checkpointing optimizations (#9505) * Expose num processes in PyT Dist Signed-off-by: Mikołaj Błaż <[email protected]> * Add parallel save/load optimizations from MCore Signed-off-by: Mikołaj Błaż <[email protected]> * Remove async utils from MCore Signed-off-by: Mikołaj Błaż <[email protected]> * Enable DistOpt parallel R/W Signed-off-by: Mikołaj Błaż <[email protected]> * Enable PyT Dist caching Signed-off-by: Mikołaj Błaż <[email protected]> * Small fixes Signed-off-by: Mikołaj Błaż <[email protected]> * Make sure DistCkptIO is instantiated from config Signed-off-by: Mikołaj Błaż <[email protected]> * Bump MCore version to v0.7 Signed-off-by: Mikołaj Błaż <[email protected]> * Print load strategy Signed-off-by: Mikołaj Błaż <[email protected]> * Forward MCore to model space DistOpt Signed-off-by: Mikołaj Błaż <[email protected]> * Add separate flag to control DistOpt parallel R/W Signed-off-by: Mikołaj Błaż <[email protected]> * Turn off parallel save by default Signed-off-by: Mikołaj Błaż <[email protected]> --------- Signed-off-by: Mikołaj Błaż <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Change mixtral moe key name for trt-llm (#9620) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * change moe key values Signed-off-by: Onur Yilmaz <[email protected]> * add weight to the key Signed-off-by: Onur Yilmaz <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix ckpt load bug (#9621) * fix ckpt load bug Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> --------- Signed-off-by: dimapihtar <[email protected]> Signed-off-by: dimapihtar <[email protected]> Co-authored-by: dimapihtar <[email protected]> Signed-off-by: Tugrul Konuk <[email 
protected]> * NeVA Minor Fixes (#9608) * fix neva resume with empty param loaded for some pp stage Signed-off-by: yaoyu-33 <[email protected]> * fix crop size check Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix pretraining data sizes and weights (#9627) Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Alit/mamba (#9575) * adding mamba support * fix import mixins * rm convert jamba * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * more cleanups * use GPT text gen * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * fixing gbs in TP converter * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * add reqs * add tutorial * minor fix to tutorial * moving finetuning files Signed-off-by: arendu <[email protected]> * moving finetuning files Signed-off-by: arendu <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * address comments * add mamba dependencies * add mcore tag * modify dockerfile ci * modify dockerfile ci --------- Signed-off-by: JRD971000 <[email protected]> Signed-off-by: arendu <[email protected]> Co-authored-by: Ali Taghibakhshi <[email protected]> Co-authored-by: JRD971000 <[email protected]> Co-authored-by: arendu <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] async checkpointing support (#9466) * add async checkpointing support * fixes * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * add parallel read/write support 
and other optimizations * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * address comments, make dist checkpointing args configurable * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * fix small typo Signed-off-by: ashors1 <[email protected]> * Update default sharding type Co-authored-by: mikolajblaz <[email protected]> Signed-off-by: Anna Shors <[email protected]> * Update default sharding type Co-authored-by: mikolajblaz <[email protected]> Signed-off-by: Anna Shors <[email protected]> * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> --------- Signed-off-by: ashors1 <[email protected]> Signed-off-by: ashors1 <[email protected]> Signed-off-by: Anna Shors <[email protected]> Co-authored-by: ashors1 <[email protected]> Co-authored-by: mikolajblaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix the arguments of forward_for_export function in msdd_models (#9624) * Fix the arguments of forward_for_export function Signed-off-by: Taejin Park <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> --------- Signed-off-by: Taejin Park <[email protected]> Signed-off-by: tango4j <[email protected]> Co-authored-by: tango4j <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Change default parallel_save to False (#9632) Signed-off-by: Mikołaj Błaż <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Unwrap ckpt_io for model opt (async save) (#9622) Signed-off-by: Mikołaj Błaż <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * MCore T5 support for NeMo - Training (#9432) * huvu/mcore_t5 first commit from local * removing DEBUGGING prints * cleaning megatron_lm_encoder_decoder_model.py code * cleaning code * adding Github action test * only run mcore T5 test * only run mcore T5 test * only run mcore T5 test * only run mcore T5 test * reset 
.github/workflows/cicd-main.yml * reset .github/workflows/cicd-main.yml * adding condition self.mcore_t5 when running self.build_transformer_config() * refactor megatron_lm_encoder_decoder_model.py to not use self.model * only run T5-related tests * remove all self.model * reset cicd file * reset cicd file * updating codes remove duplicate if/else; adding mcore/transformer_engine to config file * adjust +model.mcore_t5=True * Apply isort and black reformatting Signed-off-by: huvunvidia <[email protected]> --------- Signed-off-by: huvunvidia <[email protected]> Co-authored-by: Huy Vu2 <[email protected]> Co-authored-by: huvunvidia <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] Expose transformer_layer_spec inside GPTConfig (#9592) * Expose transformer_layer_spec inside GPTConfig * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Expose layer-specs * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Update NeMo Clip to Use MCore Modules (#9594) * update clip model and config file Signed-off-by: yaoyu-33 <[email protected]> * update clip for mcore Signed-off-by: yaoyu-33 <[email protected]> * MCore CLIP Fix Signed-off-by: yaoyu-33 <[email protected]> * fix no mask Signed-off-by: yaoyu-33 <[email protected]> * few neva fixes Signed-off-by: yaoyu-33 <[email protected]> * update siglip module Signed-off-by: yaoyu-33 <[email protected]> * add siglip loss Signed-off-by: yaoyu-33 <[email protected]> * fix Signed-off-by: yaoyu-33 <[email protected]> * fix collate fn Signed-off-by: yaoyu-33 <[email protected]> * update siglip conversion script Signed-off-by: yaoyu-33 <[email protected]> * update siglip convert Signed-off-by: yaoyu-33 <[email protected]> * clip fixes Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and 
black reformatting Signed-off-by: yaoyu-33 <[email protected]> * clean up script Signed-off-by: yaoyu-33 <[email protected]> * clip fixes Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix code styles Signed-off-by: yaoyu-33 <[email protected]> * Update siglip_loss.py Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add REST API to deploy module (#9539) * Add REST API and FastAPI to deploy module Signed-off-by: Abhishree <[email protected]> * Add NemoQuery and requirements Signed-off-by: Abhishree <[email protected]> * Edit path for config.json Signed-off-by: Abhishree <[email protected]> * Add modifications for REST API for the correct functionality Move service dir under deploy Use NeMoQueryLLM instead of NemoQuery Signed-off-by: Abhishree <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Apply isort and black reformatting Signed-off-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com> * Change default port for REST Service Change default port for REST service as Triton server also used the same port as default. 
Signed-off-by: Abhishree Thittenamane <[email protected]> * Apply isort and black reformatting Signed-off-by: athitten <[email protected]> --------- Signed-off-by: Abhishree <[email protected]> Signed-off-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Abhishree Thittenamane <[email protected]> Signed-off-by: athitten <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: athitten <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Mistral + Mixtral Support for NeVa (#9459) * mistral template support Signed-off-by: paul-gibbons <[email protected]> * get_specs neva fix Signed-off-by: paul-gibbons <[email protected]> * mistral update Signed-off-by: paul-gibbons <[email protected]> * fixed mistral tokenization Signed-off-by: paul-gibbons <[email protected]> * t…

* Adding context- & expert-parallelism to MegatronStrategy (#9525) Signed-off-by: Tugrul Konuk <[email protected]> * Add CICD test for Stable Diffusion (#9464) * Add CICD test for Stable Diffusion Signed-off-by: Michal Futrega <[email protected]> * Update cicd-main.yml Signed-off-by: Michal Futrega <[email protected]> * Use single gpu runner Signed-off-by: Michal Futrega <[email protected]> --------- Signed-off-by: Michal Futrega <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Akoumparouli/nemo ux mixtral (#9446) * use default collate if dataset does not have one Signed-off-by: Alexandros Koumparoulis <[email protected]> * mixtral config Signed-off-by: Alexandros Koumparoulis <[email protected]> * add convert_state Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix StateDictTransform for 2D layers, e.g. MoE Signed-off-by: Alexandros Koumparoulis <[email protected]> * pass num_moe_experts to specs Signed-off-by: Alexandros Koumparoulis <[email protected]> * update MixtralModel Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * mini docstring Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * update mcoreddp call (#9345) * update mcoreddp call Signed-off-by: Alexandros Koumparoulis <[email protected]> * update mcore commits Signed-off-by: Alexandros Koumparoulis <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: Pablo Garay <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Llama and Gemma (#9528) * add llama 
Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * add llama Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * add llama3 Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * fix typo Signed-off-by: Chen Cui <[email protected]> * enable importers with multiple models Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * add gemma Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * checks Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> --------- Signed-off-by: Chen Cui <[email protected]> Signed-off-by: cuichenx <[email protected]> Co-authored-by: cuichenx <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] minor logging bug fixes (#9529) * minor exp_manager bug fixes * remove print statement * fix docstring * fix AppState defaults --------- Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * mcore distOpt restore fix (#9421) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Custom Tiktoken tokenizer. Signed-off-by: Tugrul Konuk <[email protected]> * Fixed the tokenizer decoding on special tokens. Signed-off-by: Tugrul Konuk <[email protected]> * Apply isort and black reformatting Signed-off-by: ertkonuk <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Added token_to_id() method. 
Signed-off-by: Tugrul Konuk <[email protected]> * Update neva conversion script from and to HF (#9296) * Update NeMo script Signed-off-by: yaoyu-33 <[email protected]> * Fix example scripts Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * Update convert_llava_nemo_to_hf.py Signed-off-by: yaoyu-33 <[email protected]> * address comments Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * vLLM Export Support (#9381) * Export implementation for vLLM 0.4.3. Supports LLAMA2, Mistral, Mixtral (unverified), Gemma and StarCoder2 models. The nemo.export.tensorrt_llm alias was removed to avoid initializing TRT-LLM when importing anything from nemo.export. Signed-off-by: Alexey Panteleev <[email protected]> * Fixed some CodeQL warnings. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Removed empty files. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Updated the integration for vLLM 0.5.0. Signed-off-by: Alexey Panteleev <[email protected]> * Updated the vLLM deployment interface to use max_output_len instead of max_output_token. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Moved the Exporter class to nemo/export and renamed its file to vllm_exporter.py, to be more similar to TRT-LLM. 
Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Implemented vLLM support in the export tests, added functional testing, implemented forward evaluation on vLLM without Triton. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Moved the vLLM deployment functionality to the common deploy_triton.py script. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Fixed the CodeQL discovered issues. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Fixed one more return of a wrong dimensionality... Signed-off-by: Alexey Panteleev <[email protected]> * More wrong dimensionality returns. Signed-off-by: Alexey Panteleev <[email protected]> --------- Signed-off-by: Alexey Panteleev <[email protected]> Signed-off-by: apanteleev <[email protected]> Co-authored-by: apanteleev <[email protected]> Co-authored-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * PL: Delete precision if using plugin. 
TODO switch to MegatronTrainerBuilder (#9535) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add page context fmha (#9526) Signed-off-by: Tugrul Konuk <[email protected]> * extend get_gpt_layer_modelopt_spec to support MoE (#9532) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix mock data generation for legacy dataset (#9530) Signed-off-by: dimapihtar <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] IO fixes (#9512) * Improve IOMixin.io_transform_args to handle dataclasses better * Dump task json + img inside NeMoLogger * Adding store_io to train task * Update opt.connect to also propagate to __io__ * Rename opt to optim for consistency * Moving to using safe serialization using fiddle, only use cloudpickle when needed * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Using Config from fiddle instead of sdk for now * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Move enable_nemo_ckpt_io from MegatronStrategy to ModelCheckpoint * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Move nemo-ckpt to _get_finalize_save_checkpoint_callback * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Update TrainerContext & io.load_ckpt * Use renamed TrainerContext inside ModelCheckpoint * Remove double io saving * Rename lightning.pytorch.opt -> optim * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Remove store_io from train-task * Adding fiddle-extension for torch * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Move fdl_torch import * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Adding dtype to serialization * Some fixes * Apply isort and black reformatting Signed-off-by: 
marcromeyn <[email protected]> * Make TransformerConfig inherit from IOMixin to fix serialization error * Make TransformerConfig inherit from IOMixin to fix serialization error * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Add support for BuiltinFunctionType * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Add missing import * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fix dataclass fields --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Test C++ runtime on demand in nemo_export.py to avoid possible OOMs (#9544) * Add test_cpp_runtime flag Signed-off-by: Jan Lasek <[email protected]> * Apply isort and black reformatting Signed-off-by: janekl <[email protected]> --------- Signed-off-by: Jan Lasek <[email protected]> Signed-off-by: janekl <[email protected]> Co-authored-by: janekl <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix lhotse tests for v1.24.2 (#9546) * Fix lhotse tests for v1.24.0 Signed-off-by: Piotr Żelasko <[email protected]> * Fix RIR test Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * gpu_unitTests_notOptional (#9551) Signed-off-by: Tugrul Konuk <[email protected]> * add reset learning rate functionality (#9372) * add reset_lr functionality Signed-off-by: dimapihtar <[email protected]> * fix reset_lr logic Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * move reset_lr from optim section Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * add reset_lr value to config Signed-off-by: dimapihtar <[email protected]> * set reset_lr False by default 
Signed-off-by: dimapihtar <[email protected]> * remove extra line Signed-off-by: dimapihtar <[email protected]> * add reset_lr test Signed-off-by: dimapihtar <[email protected]> * add reset_lr test Signed-off-by: dimapihtar <[email protected]> * remove extra quote Signed-off-by: dimapihtar <[email protected]> * add ability to reset schedule's max_steps and decay_steps Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * change scheduler's first step logic when using reset_lr Signed-off-by: dimapihtar <[email protected]> * revert config Signed-off-by: dimapihtar <[email protected]> * fix reset_lr logic Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * revert config Signed-off-by: dimapihtar <[email protected]> * revert config Signed-off-by: dimapihtar <[email protected]> * update reset_lr comments Signed-off-by: dimapihtar <[email protected]> * add use cases for reset_lr feature Signed-off-by: dimapihtar <[email protected]> --------- Signed-off-by: dimapihtar <[email protected]> Signed-off-by: dimapihtar <[email protected]> Co-authored-by: dimapihtar <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add Python AIStore SDK to container and bump min Lhotse version (#9537) * Add Python AIStore SDK to requirements and bump min Lhotse version Signed-off-by: Piotr Żelasko <[email protected]> * Move AIStore Python SDK to Dockerfile, remove matplotlib/ipywidgets deps Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Adding 'use_dynamo' option for export to use onnx.dynamo_export() instead of onnx.export() (#9147) * Initial WARs to implement dynamo option for export Signed-off-by: Boris Fomitchev <[email protected]> * including weights in .onnx Signed-off-by: Boris Fomitchev <[email 
protected]> * dynamo_export works for many small models Signed-off-by: Boris Fomitchev <[email protected]> * External weights behaviour fixed Signed-off-by: Boris Fomitchev <[email protected]> * Cleanup Signed-off-by: Boris Fomitchev <[email protected]> * Apply isort and black reformatting Signed-off-by: borisfom <[email protected]> * print cleaned up Signed-off-by: Boris Fomitchev <[email protected]> * Added overloadable dynamic_shapes_for_export Signed-off-by: Boris Fomitchev <[email protected]> * Addressing code review Signed-off-by: Boris Fomitchev <[email protected]> * Fixing CI issues Signed-off-by: Boris Fomitchev <[email protected]> * Fixing CI test failure Signed-off-by: Boris Fomitchev <[email protected]> * Eliminated test cross-contamination Signed-off-by: Boris Fomitchev <[email protected]> --------- Signed-off-by: Boris Fomitchev <[email protected]> Signed-off-by: borisfom <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: Somshubra Majumdar <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Fix tokenizer IO (#9555) * Adding tokenizer to io-test + making it pass * Handling tokenizer correctly inside dump_io * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Removing not used import --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo UX] Move mistral_7b.py to mistral.py (#9545) * Move mistral_7b.py to mistral.py Signed-off-by: Alexandros Koumparoulis <[email protected]> * rename MixtralConfig to MixtralConfig8x7B Signed-off-by: Alexandros Koumparoulis <[email protected]> * mistral rename: mistralconfig7b & mistralmodel Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email 
protected]> * Use closed-formula to round by multiple (#9307) * Use closed-formula to round by multiple Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Co-authored-by: Pablo Garay <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * ci: Do not attempt to send slack on fork (#9556) * ci: Do not attempt to send slack on fork Signed-off-by: Oliver Koenig <[email protected]> * test Signed-off-by: Oliver Koenig <[email protected]> --------- Signed-off-by: Oliver Koenig <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix nemo export test (#9547) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * fix export test Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: oyilmaz-nvidia <[email protected]> Co-authored-by: oyilmaz-nvidia <[email protected]> Co-authored-by: Pablo Garay <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix SDXL incorrect name in docs (#9534) Signed-off-by: Tugrul Konuk <[email protected]> * GPU unit tests: Mark flaky tests to be fixed (#9559) Signed-off-by: Tugrul Konuk <[email protected]> * Bump PTL version (#9557) Signed-off-by: Abhishree <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Resiliency] Straggler detection (#9473) * Initial straggler det impl Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed CI code checks Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * 
Removed unused import Signed-off-by: Jacek Bieniusiewicz <[email protected]> * remove submodule Signed-off-by: Maanu Grover <[email protected]> * Updated documentation; Updated callback params; Cosmetic changes Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed straggler det config; Added basic test Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixes in test_straggler_det.py Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Updated straggler callback API Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * stop_if_detected=False by default Signed-off-by: Jacek Bieniusiewicz <[email protected]> --------- Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * switch to torch_dist as default dist checkpointing backend (#9541) Signed-off-by: ashors1 <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Checkpointing bug fixes (#9562) * fix checkpoint loading * fix * fixes * another fix * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> --------- Signed-off-by: ashors1 <[email protected]> Co-authored-by: ashors1 <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add tps and pps params to the export script (#9558) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * fix export test Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black 
reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * remove n_gpus param Signed-off-by: Onur Yilmaz <[email protected]> * add and fix parameters Signed-off-by: Onur Yilmaz <[email protected]> * fix deploy script Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * rename tps and pps params Signed-off-by: Onur Yilmaz <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: oyilmaz-nvidia <[email protected]> Co-authored-by: oyilmaz-nvidia <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Consolidate gpt continue training script into pretraining script (#9413) * Consolidate gpt continue training with pretraining Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix default config Signed-off-by: yaoyu-33 <[email protected]> * Add github action cicd Signed-off-by: yaoyu-33 <[email protected]> * extract _integrate_original_checkpoint_data as a method Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix getattr Signed-off-by: yaoyu-33 <[email protected]> * Revert "Add github action cicd" This reverts commit a453f16ba2be6413db932623009da893208acdd5. 
* Update comments in nlp_overrides.py Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add support to change Multi task model prompt (#9542) * Add support to change Multi task model prompt Signed-off-by: smajumdar <[email protected]> * Add support to change Multi task model prompt Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Update nemo/collections/common/prompts/formatter.py Co-authored-by: Piotr Żelasko <[email protected]> Signed-off-by: Somshubra Majumdar <[email protected]> * Address comments Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Address comments Signed-off-by: smajumdar <[email protected]> --------- Signed-off-by: smajumdar <[email protected]> Signed-off-by: titu1994 <[email protected]> Signed-off-by: Somshubra Majumdar <[email protected]> Co-authored-by: Piotr Żelasko <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add Multimodal Exporter (#9256) * Add video-neva TRT export * Add TRT inference * Change config * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Change export params * Remove unused import * Add neva export * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Change unpack nemo * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Add trt infer config * Fix neva trt inference * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Add exporter * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Fix infer * Add PyTriton * Apply isort and black reformatting Signed-off-by: 
meatybobby <[email protected]> * Fix deploy wrong dim * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Change to pass PIL Image * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Fix video neva deploy * Change query * Change deploy * Remove unused import * Change ptuning * Change to mm exporter * Add script * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Fix script --------- Signed-off-by: meatybobby <[email protected]> Co-authored-by: meatybobby <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Enable encoder adapters for Canary and MultiTaskAED models (#9409) * Fix assertions for adapter types Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Cleanup Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Finalize support for decoder adapters Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * fix the freeze/unfreeze problem by replacing as_frozen with torch.inference_mode * Apply isort and black reformatting Signed-off-by: weiqingw4ng <[email protected]> * Update tests to new generic way of module update Signed-off-by: smajumdar <[email protected]> * Finalize code for update module Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Fix variable name Signed-off-by: smajumdar <[email protected]> * Finalize projection support for transformer mha adapters Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Correct implementation of freeze restore Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> 
* Corrects the implementation of replace_adapter_modules to limit to just the top level modules Signed-off-by: smajumdar <[email protected]> * Apply isort and black reformatting Signed-off-by: titu1994 <[email protected]> * Remove registration of Transformer MHA Signed-off-by: smajumdar <[email protected]> * Remove registration of Transformer MHA Signed-off-by: smajumdar <[email protected]> * Address reviewer comments Signed-off-by: smajumdar <[email protected]> --------- Signed-off-by: smajumdar <[email protected]> Signed-off-by: titu1994 <[email protected]> Signed-off-by: weiqingw4ng <[email protected]> Co-authored-by: Weiqing Wang <[email protected]> Co-authored-by: weiqingw4ng <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * pass option through (#9570) Signed-off-by: Maanu Grover <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * PTQ refinements (#9574) * Rename megatron_gpt_quantization -> megatron_gpt_ptq Signed-off-by: Jan Lasek <[email protected]> * Configure export.save_path as dir or tarball Signed-off-by: Jan Lasek <[email protected]> * PTQ docs update Signed-off-by: Jan Lasek <[email protected]> * Make model_type optional in case of quantized checkpoints Signed-off-by: Jan Lasek <[email protected]> * Drop unused save_nemo_model_config argument Signed-off-by: Jan Lasek <[email protected]> --------- Signed-off-by: Jan Lasek <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Audio model collection (#9263) * Audio model collection Signed-off-by: Ante Jukić <[email protected]> * Apply isort and black reformatting Signed-off-by: anteju <[email protected]> * Fix imports Signed-off-by: Ante Jukić <[email protected]> * Addressed PR comments Signed-off-by: Ante Jukić <[email protected]> * Apply isort and black reformatting Signed-off-by: anteju <[email protected]> --------- Signed-off-by: Ante Jukić <[email protected]> Signed-off-by: anteju <[email protected]> Co-authored-by: anteju <[email 
protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Fix Trainer serialization (#9571) * Fix Trainer serialization * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Update click version requirement (#9580) Signed-off-by: Dong Hyuk Chang <[email protected]> Co-authored-by: Dong Hyuk Chang <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Fault tolerance] Heartbeat detection (#9352) * Fault tolerance related changes Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Cosmetic changes in documentation Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Doc update round2 Signed-off-by: Jacek Bieniusiewicz <[email protected]> --------- Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Co-authored-by: Jacek Bieniusiewicz <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add ModelOpt QAT example for Llama2 SFT model (#9326) * add INT4 QAT example for Llama2 SFT model Signed-off-by: Keval Morabia <[email protected]> * Add config parameter to control kv cache quantization Signed-off-by: Keval Morabia <[email protected]> * Fix typo in cicd-main.yml for QAT test Signed-off-by: Keval Morabia <[email protected]> * fix nlp_overrides.py Signed-off-by: Keval Morabia <[email protected]> * address reviewer feedback Signed-off-by: Keval Morabia <[email protected]> * quantize unwrapped model Signed-off-by: Keval Morabia <[email protected]> * add compress export argument for qat config Signed-off-by: Keval Morabia <[email protected]> --------- Signed-off-by: Keval Morabia <[email protected]> 
Signed-off-by: Tugrul Konuk <[email protected]> * Set TE flag in legacy -> mcore conversion script (#9585) * set TE flag Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> --------- Signed-off-by: Chen Cui <[email protected]> Signed-off-by: cuichenx <[email protected]> Co-authored-by: cuichenx <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] Add fabric-API for manual forward-pass (#9577) * First pass over fabric-API * Adding Trainer -> Fabric conversion * Some small fixes to get a forward-pass in Fabric working * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Adding doc-string to Fabric.import_model * Adding track_io to io_init of Fabric * Fix Fabric.load_model + add doc-string * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Remove unused import * Some small fixes * Fix failing test --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] Add SDK-factories to llm-collection (#9589) * Adding sdk-factories to llm-collection * Removing _model from mistral + mixtral * Expose lr_scheduler inside lightning * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Multimodal projection layer adapter fix for PP>1 (#9445) * enabling multimodal adapters to load in PP>1 Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * parameterizing validate_access_integrity, set to false when PP>1 Signed-off-by: paul-gibbons <[email protected]> formatting fix Signed-off-by: paul-gibbons <[email protected]> Apply isort and black reformatting 
Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * update nlp_model.py Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * update modelPT with validate_access_integrity Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * updating save_restore_connector w/ validate_access_integrity Signed-off-by: paul-gibbons <[email protected]> * Apply isort and black reformatting Signed-off-by: paul-gibbons <[email protected]> * addressing comment Signed-off-by: paul-gibbons <[email protected]> * adding validate_access_integrity to super().load_config_and_state_dict() Signed-off-by: paul-gibbons <[email protected]> * testing reorder of validate_access_integrity for CI failures Signed-off-by: paul-gibbons <[email protected]> --------- Signed-off-by: paul-gibbons <[email protected]> Signed-off-by: paul-gibbons <[email protected]> Co-authored-by: paul-gibbons <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Add offline quantization script for QLoRA deployment (#9455) * add qlora offline quantization script Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * clean Signed-off-by: Chen Cui <[email protected]> * docstring Signed-off-by: Chen Cui <[email protected]> --------- Signed-off-by: Chen Cui <[email protected]> Signed-off-by: cuichenx <[email protected]> Co-authored-by: cuichenx <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * qlora support more models (#9488) Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Some improvements to NeMoLogger (#9591) Signed-off-by: Tugrul Konuk <[email protected]> * Set n_gpu to None in 
nemo export (#9593) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * set ngpus to None Signed-off-by: Onur Yilmaz <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Inflight nemo model export support (#9527) * online model conversion and refit Signed-off-by: Jimmy Zhang <[email protected]> * clean code Signed-off-by: Jimmy Zhang <[email protected]> * cleanup Signed-off-by: Jimmy Zhang <[email protected]> * add refit, cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * combine weight conversion functions Signed-off-by: Jimmy Zhang <[email protected]> * cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <[email protected]> * remove debug print Signed-off-by: Jimmy Zhang <[email protected]> * cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * fix single gpu and cleanup code Signed-off-by: Jimmy Zhang <[email protected]> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <[email protected]> --------- Signed-off-by: JimmyZhang12 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * vLLM Export Improvements (#9596) * Separated the vLLM export functionality from the common deployment script into deploy_vllm_triton.py. Signed-off-by: Alexey Panteleev <[email protected]> * Fixed vocab_size for LLAMA3. Signed-off-by: Alexey Panteleev <[email protected]> * Export test: fixed deployment testing w/o Megatron, made functional tests optional, added --gpu_memory_utilization. Signed-off-by: Alexey Panteleev <[email protected]> * Apply isort and black reformatting Signed-off-by: apanteleev <[email protected]> * Addressing review and CodeQL comments. 
Signed-off-by: Alexey Panteleev <[email protected]> --------- Signed-off-by: Alexey Panteleev <[email protected]> Signed-off-by: apanteleev <[email protected]> Co-authored-by: apanteleev <[email protected]> Co-authored-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Set finalize_model_grads_func in on_fit_start instead to make sure it's being called (#9599) Signed-off-by: Tugrul Konuk <[email protected]> * Set no_sync_func & grad_sync_fucn (#9601) * Set no_sync_func & grad_sync_fucn Signed-off-by: Alexandros Koumparoulis <[email protected]> * set overlap_param_sync Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * small nemo logger bug fix (#9607) Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix the dict format returned by scheduler method (#9609) Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Dataloading enhancements and bug fixes (#9595) * fix dataloading + checkpoint restore * clean up data sampler * fix typo * support passing multiple paths to data module * fix validation dataloader * fix dataloader len when using gradient accumulation * fix progress bar * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * fix step count in loggers * fix blended dataset * address comments * address comment * move step logging into strategy * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> --------- Signed-off-by: ashors1 <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Co-authored-by: ashors1 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * 
Fix serialization of AutoResume (#9616) * fix serialization of autoresume * update undefined variables Signed-off-by: Tugrul Konuk <[email protected]> * Chat template support for megatron_gpt_eval.py (#9354) * Bump PTL version (#9557) Signed-off-by: Abhishree <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * [Resiliency] Straggler detection (#9473) * Initial straggler det impl Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed CI code checks Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Removed unused import Signed-off-by: Jacek Bieniusiewicz <[email protected]> * remove submodule Signed-off-by: Maanu Grover <[email protected]> * Updated documentation; Updated callback params; Cosmetic changes Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixed straggler det config; Added basic test Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * Fixes in test_straggler_det.py Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Updated straggler callback API Signed-off-by: Jacek Bieniusiewicz <[email protected]> * Apply isort and black reformatting Signed-off-by: jbieniusiewi <[email protected]> * stop_if_detected=False by default Signed-off-by: Jacek Bieniusiewicz <[email protected]> --------- Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * move model loading to separate function; call toContainer once; pad using 
closed formula Signed-off-by: Alexandros Koumparoulis <[email protected]> * read prompts from file Signed-off-by: Alexandros Koumparoulis <[email protected]> * If input prompt contains dict, apply model.tokenizer.chat_template Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * apply @Gal Leibovich's patch Taken from: https://github.com/NVIDIA/NeMo/commit/17572905344db4692583e72799d55801a8860f35 Signed-off-by: Alexandros Koumparoulis <[email protected]> * rename prompts_file to prompts_jsonl Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * add chat_template param Signed-off-by: Alexandros Koumparoulis <[email protected]> * Add ChatTemplateMixin to SentencePieceTokenizer Signed-off-by: Alexandros Koumparoulis <[email protected]> * add chat-template to text-gen-strat Signed-off-by: Alexandros Koumparoulis <[email protected]> * move load prompts to separate file Signed-off-by: Alexandros Koumparoulis <[email protected]> * remove chat-template from text-gen-utils Signed-off-by: Alexandros Koumparoulis <[email protected]> * make chat-template more generic Signed-off-by: Alexandros Koumparoulis <[email protected]> * add assert message Signed-off-by: Alexandros Koumparoulis <[email protected]> * small refactor for chat_template_mixin Signed-off-by: Alexandros Koumparoulis <[email protected]> * undo ckpt conv changes Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> * move rounding to function Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and 
black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Abhishree <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Jacek Bieniusiewicz <[email protected]> Signed-off-by: jbieniusiewi <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: Abhishree Thittenamane <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: jbieniusiewi <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Jsonl support (#9611) * Adding support to preprocess .jsonl and .jsonl.gz files in input directory Signed-off-by: adityavavre <[email protected]> * Adding support to preprocess .jsonl and .jsonl.gz files in input directory Signed-off-by: adityavavre <[email protected]> * Apply isort and black reformatting Signed-off-by: adityavavre <[email protected]> --------- Signed-off-by: adityavavre <[email protected]> Signed-off-by: adityavavre <[email protected]> Co-authored-by: adityavavre <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] Add PEFT (#9490) * initial commit for PEFT in nemo2 * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * address comments Signed-off-by: Chen Cui <[email protected]> * make import easier Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * address comments Signed-off-by: Chen Cui <[email protected]> * Update nemo/collections/llm/peft/lora.py Signed-off-by: Marc Romeyn <[email protected]> * Some small fixes + adding more doc-strings * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Adding ModelTransform callback * Apply isort and black reformatting 
Signed-off-by: marcromeyn <[email protected]> * Fixing type-hint for model_transform * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * fix import Signed-off-by: Chen Cui <[email protected]> * model transform for gemma llama Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * fix model transform Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * change lora target default to all linear modules Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * Small fix in mixtral * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Integrating PEFT to the public-API + some fixes * Big refactor to allow to load adapter-states * Some fixes to support adapter_path * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Disabling ckpt reloading when adapter_path is passed * Fix CLI * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Remove commented-out code * Remove commented-out code * Remove un-used import * Fix callback imports * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fixing llm.pretrain * Some small fixes * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fix missing import + type-hint in finetune * Adding PreemptionCallback + some more tests * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Clean up imports & clean up llm.api * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Trying to fix failing tests * Remove __init__.py 2 * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Fix failing test * Trying to fix last failing test --------- Signed-off-by: 
cuichenx <[email protected]> Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> Signed-off-by: marcromeyn <[email protected]> Co-authored-by: cuichenx <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Akoumparouli/mistral import instruct chat template fix (#9567) * use bf16 by defualt mistral conv Signed-off-by: Alexandros Koumparoulis <[email protected]> * add chat template Signed-off-by: Alexandros Koumparoulis <[email protected]> * use capitalized role names Signed-off-by: Alexandros Koumparoulis <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Remove .cuda calls, use device isntead (#9602) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix converter defautl args (#9565) * fix converter defautl args Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * mixtral export (#9603) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix: remove non_blocking from PTL's .cuda call (#9618) Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Alit/mamba tmp (#9612) * adding mamba support * fix import mixins * rm convert jamba * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * more cleanups * use GPT text gen * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * 
fixing gbs in TP convetor * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * add reqs * add tutorial * minor fix to tutorial * moving finetuning files Signed-off-by: arendu <[email protected]> * moving finetuning files Signed-off-by: arendu <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * add mamba_tmp * remove mamba import * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> --------- Signed-off-by: JRD971000 <[email protected]> Signed-off-by: arendu <[email protected]> Co-authored-by: Ali Taghibakhshi <[email protected]> Co-authored-by: JRD971000 <[email protected]> Co-authored-by: arendu <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * TitaNet Batch Verify Speaker (#9337) * add batch_inference for verify_speakers method Signed-off-by: [email protected] <[email protected]> * remove not used package Signed-off-by: [email protected] <[email protected]> * change batch inference logic Signed-off-by: [email protected] <[email protected]> * fixup Signed-off-by: [email protected] <[email protected]> * requested changes Signed-off-by: [email protected] <[email protected]> * add verify_speakers_batch to docs Signed-off-by: [email protected] <[email protected]> * handle None durations in manifest Signed-off-by: [email protected] <[email protected]> * change logging text Signed-off-by: [email protected] <[email protected]> * Apply isort and black reformatting Signed-off-by: monica-sekoyan <[email protected]> * check duration presence Signed-off-by: [email protected] <[email protected]> * add channel_selector to dataset configs Signed-off-by: [email protected] <[email protected]> --------- Signed-off-by: [email protected] <[email protected]> Signed-off-by: monica-sekoyan <[email protected]> Co-authored-by: monica-sekoyan 
<[email protected]> Co-authored-by: Nithin Rao <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Enable MCore checkpointing optimizations (#9505) * Expose num processes in PyT Dist Signed-off-by: Mikołaj Błaż <[email protected]> * Add parallel save/load optimizations from MCore Signed-off-by: Mikołaj Błaż <[email protected]> * Remove async utils from MCore Signed-off-by: Mikołaj Błaż <[email protected]> * Enable DistOpt paralell R/W Signed-off-by: Mikołaj Błaż <[email protected]> * Enable PyT Dist caching Signed-off-by: Mikołaj Błaż <[email protected]> * Small fixes Signed-off-by: Mikołaj Błaż <[email protected]> * Make sure DistCkptIO is instantiated from config Signed-off-by: Mikołaj Błaż <[email protected]> * Bump MCore version to v0.7 Signed-off-by: Mikołaj Błaż <[email protected]> * Print load strategy Signed-off-by: Mikołaj Błaż <[email protected]> * Forward MCore to model space DistOpt Signed-off-by: Mikołaj Błaż <[email protected]> * Add separate flag to control DistOpt paralell R/W Signed-off-by: Mikołaj Błaż <[email protected]> * Turn off parallel save by default Signed-off-by: Mikołaj Błaż <[email protected]> --------- Signed-off-by: Mikołaj Błaż <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Change mixtral moe key name for trt-llm (#9620) * fix minor import bug Signed-off-by: Onur Yilmaz <[email protected]> * change moe key values Signed-off-by: Onur Yilmaz <[email protected]> * add weight to the key Signed-off-by: Onur Yilmaz <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix ckpt load bug (#9621) * fix ckpt load bug Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> --------- Signed-off-by: dimapihtar <[email protected]> Signed-off-by: dimapihtar <[email protected]> Co-authored-by: dimapihtar <[email protected]> Signed-off-by: Tugrul Konuk <[email 
protected]> * NeVA Minor Fixes (#9608) * fix neva resume with empty param loaded for some pp stage Signed-off-by: yaoyu-33 <[email protected]> * fix crop size check Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * fix pretrianing data sizes and weights (#9627) Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Alit/mamba (#9575) * adding mamba support * fix import mixins * rm convert jamba * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * more cleanups * use GPT text gen * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * fixing gbs in TP convetor * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * add reqs * add tutorial * minor fix to tutorial * moving finetuning files Signed-off-by: arendu <[email protected]> * moving finetuning files Signed-off-by: arendu <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * address comments * Apply isort and black reformatting Signed-off-by: JRD971000 <[email protected]> * address comments * add mamba dependancies * add mcore tag * modify dockerfile ci * modify dockerfile ci --------- Signed-off-by: JRD971000 <[email protected]> Signed-off-by: arendu <[email protected]> Co-authored-by: Ali Taghibakhshi <[email protected]> Co-authored-by: JRD971000 <[email protected]> Co-authored-by: arendu <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [NeMo-UX] async checkpointing support (#9466) * add async checkpointing support * fixes * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * add parallel read/write support 
and other optimizations * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * address comments, make dist checkpointing args configurable * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> * fix small typo Signed-off-by: ashors1 <[email protected]> * Update default sharding type Co-authored-by: mikolajblaz <[email protected]> Signed-off-by: Anna Shors <[email protected]> * Update default sharding type Co-authored-by: mikolajblaz <[email protected]> Signed-off-by: Anna Shors <[email protected]> * Apply isort and black reformatting Signed-off-by: ashors1 <[email protected]> --------- Signed-off-by: ashors1 <[email protected]> Signed-off-by: ashors1 <[email protected]> Signed-off-by: Anna Shors <[email protected]> Co-authored-by: ashors1 <[email protected]> Co-authored-by: mikolajblaz <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Fix the arguments of forward_for_export function in msdd_models (#9624) * Fix the arguments of forward_for_export function Signed-off-by: Taejin Park <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> --------- Signed-off-by: Taejin Park <[email protected]> Signed-off-by: tango4j <[email protected]> Co-authored-by: tango4j <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Change default parallel_save to False (#9632) Signed-off-by: Mikołaj Błaż <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Unwrap ckpt_io for model opt (async save) (#9622) Signed-off-by: Mikołaj Błaż <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * MCore T5 support for NeMo - Training (#9432) * huvu/mcore_t5 first commit from local * removing DEBUGGING prints * cleaning megatron_lm_encoder_decoder_model.py code * cleaning code * adding Github action test * only run mcore T5 test * only run mcore T5 test * only run mcore T5 test * only run mcore T5 test * reset 
.github/workflows/cicd-main.yml * reset .github/workflows/cicd-main.yml * adding condition self.mcore_t5 when running self.build_transformer_config() * refractor megatron_lm_encoder_decoder_model.py to not use self.model * only run T5-related tests * remove all self.model * reset cicd file * reset cicd file * updating codes remove duplicate if/else; adding mcore/transformer_engine to config file * adjust +model.mcore_t5=True * Apply isort and black reformatting Signed-off-by: huvunvidia <[email protected]> --------- Signed-off-by: huvunvidia <[email protected]> Co-authored-by: Huy Vu2 <[email protected]> Co-authored-by: huvunvidia <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * [Nemo-UX] Expose transformer_layer_spec inside GPTConfig (#9592) * Expose transformer_layer_spec inside GPTConfig * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> * Expose layer-specs * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> --------- Signed-off-by: marcromeyn <[email protected]> Co-authored-by: marcromeyn <[email protected]> Signed-off-by: Tugrul Konuk <[email protected]> * Update NeMo Clip to Use MCore Modules (#9594) * update clip model and config file Signed-off-by: yaoyu-33 <[email protected]> * update clip for mcore Signed-off-by: yaoyu-33 <[email protected]> * MCore CLIP Fix Signed-off-by: yaoyu-33 <[email protected]> * fix no mask Signed-off-by: yaoyu-33 <[email protected]> * few neva fixes Signed-off-by: yaoyu-33 <[email protected]> * update siglip module Signed-off-by: yaoyu-33 <[email protected]> * add siglip loss Signed-off-by: yaoyu-33 <[email protected]> * fix Signed-off-by: yaoyu-33 <[email protected]> * fix collate fn Signed-off-by: yaoyu-33 <[email protected]> * update siglip conversion script Signed-off-by: yaoyu-33 <[email protected]> * update siglip convert Signed-off-by: yaoyu-33 <[email protected]> * clip fixes Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and 
black reformatting Signed-off-by: yaoyu-33 <[email protected]> * clean up script Signed-off-by: yaoyu-33 <[email protected]> * clip fixe…

akoumpa self-assigned this May 24, 2024

github-actions bot added NLP Multi Modal labels May 24, 2024

akoumpa added cleanup Run CICD and removed NLP Multi Modal labels May 24, 2024

github-actions bot added NLP Multi Modal labels May 24, 2024

akoumpa requested a review from marcromeyn May 29, 2024 07:06

marcromeyn approved these changes May 29, 2024

github-actions bot added the stale label Jun 13, 2024

github-actions bot closed this Jun 21, 2024

akoumpa reopened this Jun 25, 2024

akoumpa and others added 2 commits June 26, 2024 03:32

Use closed-formula to round by multiple

72e7775

Signed-off-by: Alexandros Koumparoulis <[email protected]>

Apply isort and black reformatting

f658854

Signed-off-by: akoumpa <[email protected]>
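
The diff for commit 72e7775 is not shown in this conversation, so the following is only a minimal sketch of what "closed-formula rounding by multiple" usually means: replacing an increment-until-divisible loop with one integer expression. The function names `round_up_to_multiple` and `round_up_by_loop` are hypothetical and are not taken from the NeMo code.

```python
def round_up_to_multiple(value: int, multiple: int) -> int:
    """Round `value` up to the nearest multiple of `multiple` in O(1).

    Hypothetical helper for illustration; not the NeMo implementation.
    """
    if multiple <= 0:
        raise ValueError("multiple must be a positive integer")
    # Adding (multiple - 1) before the floor division turns it into a
    # ceiling division; multiplying back gives the padded value.
    return ((value + multiple - 1) // multiple) * multiple


def round_up_by_loop(value: int, multiple: int) -> int:
    """Loop-based reference: increment until divisible by `multiple`."""
    while value % multiple != 0:
        value += 1
    return value


if __name__ == "__main__":
    # Both approaches agree; the closed formula avoids up to
    # (multiple - 1) loop iterations.
    for v in (0, 1, 127, 128, 129, 50257):
        assert round_up_to_multiple(v, 128) == round_up_by_loop(v, 128)
    print(round_up_to_multiple(50257, 128))
```

A typical use is padding a size (for example a vocabulary size) so that it is divisible by some alignment factor. The closed form returns the input unchanged when it is already a multiple.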

akoumpa force-pushed the akoumparouli/closed_formula_rounding branch from 1f14398 to f658854 Compare June 26, 2024 10:33

akoumpa added Run CICD and removed Run CICD labels Jun 26, 2024

Merge branch 'main' into akoumparouli/closed_formula_rounding

5003313

pablo-garay added Run CICD and removed Run CICD labels Jun 26, 2024

akoumpa added Run CICD and removed Run CICD labels Jun 27, 2024

akoumpa merged commit 265e680 into main Jun 27, 2024
177 of 214 checks passed

akoumpa deleted the akoumparouli/closed_formula_rounding branch June 27, 2024 17:36

ko3n1g mentioned this pull request Jul 18, 2024: Release 2.0.0rc1 #9786 (Closed, 2 tasks)
