diff --git a/README.rst b/README.rst
index 44e5df6b74885..3135bdbfabdd1 100644
--- a/README.rst
+++ b/README.rst
@@ -35,7 +35,7 @@
 .. _main-readme:
 
-**NVIDIA NeMo**
-===============
+**NVIDIA NeMo Framework**
+=========================
 
 Latest News
 
@@ -57,92 +57,66 @@ such as FSDP, Mixture-of-Experts, and RLHF with TensorRT-LLM to provide speedups
 Introduction
 ------------
 
-NVIDIA NeMo is a conversational AI toolkit built for researchers working on automatic speech recognition (ASR),
-text-to-speech synthesis (TTS), large language models (LLMs), and
-natural language processing (NLP).
-The primary objective of NeMo is to help researchers from industry and academia to reuse prior work (code and pretrained models)
-and make it easier to create new `conversational AI models `_.
+NVIDIA NeMo Framework is a generative AI framework built for researchers and PyTorch developers
+working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR),
+and text-to-speech synthesis (TTS).
+The primary objective of NeMo is to provide a scalable framework for researchers and developers from industry and academia
+to design and implement new generative AI models more easily by leveraging existing code and pretrained models.
+
+For technical documentation, please see the `NeMo Framework User Guide `_.
 
 All NeMo models are trained with `Lightning `_ and
 training is automatically scalable to 1000s of GPUs.
-Additionally, NeMo Megatron LLM models can be trained up to 1 trillion parameters using tensor and pipeline model parallelism.
-NeMo models can be optimized for inference and deployed for production use-cases with `NVIDIA Riva `_.
+
+When applicable, NeMo models take advantage of the latest distributed training techniques,
+including parallelism strategies such as
+
+* data parallelism
+* tensor parallelism
+* pipeline model parallelism
+* fully sharded data parallelism (FSDP)
+* sequence parallelism
+* context parallelism
+* mixture-of-experts (MoE)
+
+and mixed-precision training recipes with bfloat16 and FP8.
+
+NeMo's Transformer-based LLM and Multimodal models leverage `NVIDIA Transformer Engine `_ for FP8 training on NVIDIA Hopper GPUs
+and `NVIDIA Megatron Core `_ for scaling Transformer model training.
+
+NeMo LLMs can be aligned with state-of-the-art methods such as SteerLM, DPO, and Reinforcement Learning from Human Feedback (RLHF);
+see `NVIDIA NeMo Aligner `_ for more details.
+
+NeMo LLM and Multimodal models can be deployed and optimized with `NVIDIA Inference Microservices (Early Access) `_.
+
+NeMo ASR and TTS models can be optimized for inference and deployed for production use-cases with `NVIDIA Riva `_.
+
+For scaling NeMo LLM and Multimodal training on Slurm clusters or public clouds, please see the `NVIDIA NeMo Framework Launcher `_.
+The NeMo Framework Launcher has extensive recipes, scripts, utilities, and documentation for training NeMo LLMs and Multimodal models, and also has an `Autoconfigurator `_
+which can be used to find the optimal model parallel configuration for training on a specific cluster.
+To get started quickly with the NeMo Framework Launcher, please see the `NeMo Framework Playbooks `_.
+The NeMo Framework Launcher does not currently support ASR and TTS training, but it will soon.
 
 Getting started with NeMo is simple.
 State of the Art pretrained NeMo models are freely available on `HuggingFace Hub `_ and
 `NVIDIA NGC `_.
-These models can be used to transcribe audio, synthesize speech, or translate text in just a few lines of code.
+These models can be used to generate text or images, transcribe audio, and synthesize speech in just a few lines of code.
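+For example, the following sketch downloads a pretrained ASR model and transcribes an audio file
+(the checkpoint name and audio path are illustrative — any ASR checkpoint from NGC or Hugging Face Hub can be substituted):
+
+.. code-block:: python
+
+    import nemo.collections.asr as nemo_asr
+
+    # Download a pretrained checkpoint from NGC and run offline transcription.
+    asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_large")
+    print(asr_model.transcribe(["sample.wav"]))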
 
 We have extensive `tutorials `_ that
-can be run on `Google Colab `_.
+can be run on `Google Colab `_ or with our `NGC NeMo Framework Container `_,
+and we have `playbooks `_ for users who want to train NeMo models with the NeMo Framework Launcher.
 
 For advanced users that want to train NeMo models from scratch or finetune existing NeMo models
 we have a full suite of `example scripts `_ that support multi-GPU/multi-node training.
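+
+For example, a GPT-style pretraining run with tensor and pipeline model parallelism and bfloat16 mixed
+precision might be launched along the following lines (a sketch only — the script path and config keys
+are illustrative and can differ between releases):
+
+.. code-block:: bash
+
+    python examples/nlp/language_modeling/megatron_gpt_pretraining.py \
+        trainer.devices=8 \
+        trainer.num_nodes=2 \
+        trainer.precision=bf16 \
+        model.tensor_model_parallel_size=4 \
+        model.pipeline_model_parallel_size=2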
 
-For scaling NeMo LLM training on Slurm clusters or public clouds, please see the `NVIDIA NeMo Megatron Launcher `_.
-The NM launcher has extensive recipes, scripts, utilities, and documentation for training NeMo LLMs and also has an `Autoconfigurator `_
-which can be used to find the optimal model parallel configuration for training on a specific cluster.
-
 Key Features
 ------------
 
-* Speech processing
-    * `HuggingFace Space for Audio Transcription (File, Microphone and YouTube) `_
-    * `Pretrained models `_ available in 14+ languages
-    * `Automatic Speech Recognition (ASR) `_
-        * Supported ASR `models `_:
-            * Jasper, QuartzNet, CitriNet, ContextNet
-            * Conformer-CTC, Conformer-Transducer, FastConformer-CTC, FastConformer-Transducer
-            * Squeezeformer-CTC and Squeezeformer-Transducer
-            * LSTM-Transducer (RNNT) and LSTM-CTC
-        * Supports the following decoders/losses:
-            * CTC
-            * Transducer/RNNT
-            * Hybrid Transducer/CTC
-            * NeMo Original `Multi-blank Transducers `_ and `Token-and-Duration Transducers (TDT) `_
-        * Streaming/Buffered ASR (CTC/Transducer) - `Chunked Inference Examples `_
-        * `Cache-aware Streaming Conformer `_ with multiple lookaheads (including microphone streaming `tutorial `_).
-        * Beam Search decoding
-        * `Language Modelling for ASR (CTC and RNNT) `_: N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer
-        * `Support of long audios for Conformer with memory efficient local attention `_
-    * `Speech Classification, Speech Command Recognition and Language Identification `_: MatchboxNet (Command Recognition), AmberNet (LangID)
-    * `Voice activity Detection (VAD) `_: MarbleNet
-        * ASR with VAD Inference - `Example `_
-    * `Speaker Recognition `_: TitaNet, ECAPA_TDNN, SpeakerNet
-    * `Speaker Diarization `_
-        * Clustering Diarizer: TitaNet, ECAPA_TDNN, SpeakerNet
-        * Neural Diarizer: MSDD (Multi-scale Diarization Decoder)
-    * `Speech Intent Detection and Slot Filling `_: Conformer-Transformer
-* Natural Language Processing
-    * `NeMo Megatron pre-training of Large Language Models `_
-    * `Neural Machine Translation (NMT) `_
-    * `Punctuation and Capitalization `_
-    * `Token classification (named entity recognition) `_
-    * `Text classification `_
-    * `Joint Intent and Slot Classification `_
-    * `Question answering `_
-    * `GLUE benchmark `_
-    * `Information retrieval `_
-    * `Entity Linking `_
-    * `Dialogue State Tracking `_
-    * `Prompt Learning `_
-    * `NGC collection of pre-trained NLP models. `_
-    * `Synthetic Tabular Data Generation `_
-* Text-to-Speech Synthesis (TTS):
-    * `Documentation `_
-    * Mel-Spectrogram generators: FastPitch, SSL FastPitch, Mixer-TTS/Mixer-TTS-X, RAD-TTS, Tacotron2
-    * Vocoders: HiFiGAN, UnivNet, WaveGlow
-    * End-to-End Models: VITS
-    * `Pre-trained Model Checkpoints in NVIDIA GPU Cloud (NGC) `_
-* `Tools `_
-    * `Text Processing (text normalization and inverse text normalization) `_
-    * `NeMo Forced Aligner `_
-    * `CTC-Segmentation tool `_
-    * `Speech Data Explorer `_: a dash-based tool for interactive exploration of ASR/TTS datasets
-    * `Speech Data Processor `_
-
-
-Built for speed, NeMo can utilize NVIDIA's Tensor Cores and scale out training to multiple GPUs and multiple nodes.
+* `Large Language Models `_
+* `Multimodal `_
+* `Automatic Speech Recognition `_
+* `Text to Speech `_
+* `Computer Vision `_
 
 Requirements
 ------------
@@ -151,8 +125,8 @@ Requirements
 2) Pytorch 1.13.1 or above
 3) NVIDIA GPU, if you intend to do model training
 
-Documentation
--------------
+Developer Documentation
+-----------------------
 
 .. |main| image:: https://readthedocs.com/projects/nvidia-nemo/badge/?version=main
   :alt: Documentation Status
@@ -172,18 +146,6 @@ Documentation
 | Stable  | |stable|    | `Documentation of the stable (i.e. most recent release) branch. `_ |
 +---------+-------------+------------------------------------------------------------------------------------------------------------------------------------------+
 
-Tutorials
----------
-A great way to start with NeMo is by checking `one of our tutorials `_.
-
-You can also get a high-level overview of NeMo by watching the talk *NVIDIA NeMo: Toolkit for Conversational AI*, presented at PyData Yerevan 2022:
-
-|pydata|
-
-.. |pydata| image:: https://img.youtube.com/vi/J-P6Sczmas8/maxres3.jpg
-    :target: https://www.youtube.com/embed/J-P6Sczmas8?mute=0&start=14&autoplay=0
-    :width: 600
-    :alt: NeMo presentation at PyData@Yerevan 2022
-
 
 Getting help with NeMo
 ----------------------
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 7407886eefc88..9d66d693000ed 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -1,5 +1,5 @@
-NVIDIA NeMo User Guide
-======================
+NVIDIA NeMo Framework Developer Docs
+====================================
 
 .. toctree::
    :maxdepth: 2
@@ -12,18 +12,28 @@ NVIDIA NeMo User Guide
    starthere/migration-guide
 
 .. toctree::
-   :maxdepth: 2
-   :caption: NeMo Core
-   :name: core
+   :maxdepth: 3
+   :caption: Multimodal (MM)
+   :name: Multimodal
 
-   core/core
-   core/exp_manager
-   core/neural_types
-   core/export
-   core/adapters/intro
-   core/api
+   multimodal/mllm/intro
+   multimodal/vlm/intro
+   multimodal/text2img/intro
+   multimodal/nerf/intro
+   multimodal/api
 
+.. toctree::
+   :maxdepth: 3
+   :caption: Large Language Models (LLMs)
+   :name: Large Language Models
+
+   nlp/nemo_megatron/intro
+   nlp/models
+   nlp/machine_translation/machine_translation
+   nlp/megatron_onnx_export
+   nlp/api
+
 .. toctree::
    :maxdepth: 2
    :caption: Speech Processing
@@ -36,19 +46,6 @@ NVIDIA NeMo User Guide
    asr/ssl/intro
    asr/speech_intent_slot/intro
 
-.. toctree::
-   :maxdepth: 3
-   :caption: Natural Language Processing
-   :name: Natural Language Processing
-
-   nlp/nemo_megatron/intro
-   nlp/machine_translation/machine_translation
-   nlp/text_normalization/intro
-   nlp/api
-   nlp/megatron_onnx_export
-   nlp/models
-
-
 .. toctree::
    :maxdepth: 1
    :caption: Text To Speech (TTS)
 
    tts/intro
 
+.. toctree::
+   :maxdepth: 2
+   :caption: Vision
+   :name: vision
+
+   vision/intro
+
+
+.. toctree::
+   :maxdepth: 2
+   :caption: NeMo Core
+   :name: core
+
+   core/core
+   core/exp_manager
+   core/neural_types
+   core/export
+   core/adapters/intro
+   core/api
+
 .. toctree::
    :maxdepth: 2
    :caption: Common
@@ -71,27 +88,10 @@ NVIDIA NeMo User Guide
    text_processing/g2p/g2p
    common/intro
 
-.. toctree::
-   :maxdepth: 3
-   :caption: Multimodal (MM)
-   :name: Multimodal
-
-   multimodal/mllm/intro
-   multimodal/vlm/intro
-   multimodal/text2img/intro
-   multimodal/nerf/intro
-   multimodal/api
-
-.. toctree::
-   :maxdepth: 2
-   :caption: Vision
-   :name: vision
-
-   vision/intro
 
 .. toctree::
    :maxdepth: 3
-   :caption: Tools
-   :name: Tools
+   :caption: Speech Tools
+   :name: Speech Tools
 
    tools/intro
 
diff --git a/docs/source/multimodal/api.rst b/docs/source/multimodal/api.rst
index 63ce477273b3e..d6f96e6c6ea44 100644
--- a/docs/source/multimodal/api.rst
+++ b/docs/source/multimodal/api.rst
@@ -1,4 +1,4 @@
-NeMo Megatron API
+Multimodal API
 =======================
 
 Model Classes
diff --git a/docs/source/nlp/api.rst b/docs/source/nlp/api.rst
index 33709bd05a193..b9b4d529ba464 100755
--- a/docs/source/nlp/api.rst
+++ b/docs/source/nlp/api.rst
@@ -1,5 +1,5 @@
-NeMo Megatron API
-=======================
+Large Language Model API
+========================
 
 Pretraining Model Classes
 -------------------------
diff --git a/docs/source/nlp/information_retrieval.rst b/docs/source/nlp/information_retrieval.rst
index 5cf87143848c3..b40caeee8a3be 100644
--- a/docs/source/nlp/information_retrieval.rst
+++ b/docs/source/nlp/information_retrieval.rst
@@ -8,7 +8,7 @@ The model architecture and pre-training process are detailed in the `Sentence-BE
 Sentence-BERT utilizes a BERT-based architecture, but it is trained using a siamese and triplet network structure to derive fixed-sized sentence embeddings that capture semantic information.
 Sentence-BERT is commonly used to generate high-quality sentence embeddings for various downstream natural language processing tasks, such as semantic textual similarity, clustering, and information retrieval
 
-Data Input for the Senntence-BERT model
+Data Input for the Sentence-BERT model
 ---------------------------------------
 
 The fine-tuning data for the Sentence-BERT (SBERT) model should consist of data instances,
diff --git a/docs/source/nlp/nemo_megatron/intro.rst b/docs/source/nlp/nemo_megatron/intro.rst
index 80b30a267b182..faf315a40c044 100644
--- a/docs/source/nlp/nemo_megatron/intro.rst
+++ b/docs/source/nlp/nemo_megatron/intro.rst
@@ -1,8 +1,7 @@
-NeMo Megatron
-=============
+Large Language Models
+=====================
 
-Megatron :cite:`nlp-megatron-shoeybi2019megatron` is a large, powerful transformer developed by the Applied Deep Learning Research
-team at NVIDIA. NeMo Megatron supports several types of models:
+To learn more about using NeMo to train Large Language Models at scale, please refer to the `NeMo Framework User Guide `_.
+NeMo supports the following types of models:
 
 * GPT-style models (decoder only)
 * T5/BART/UL2-style models (encoder-decoder)
@@ -10,11 +9,6 @@ team at NVIDIA. NeMo Megatron supports several types of models:
 
 * RETRO model (decoder only)
 
-
-.. note::
-    NeMo Megatron has an Enterprise edition which contains tools for data preprocessing, hyperparameter tuning, container, scripts for various clouds and more. With Enterprise edition you also get deployment tools. Apply for `early access here `_ .
-
-
 .. toctree::
    :maxdepth: 1
 
diff --git a/docs/source/starthere/intro.rst b/docs/source/starthere/intro.rst
index e6a59b0832ab9..185350bad3ab8 100644
--- a/docs/source/starthere/intro.rst
+++ b/docs/source/starthere/intro.rst
@@ -8,14 +8,17 @@ Introduction
 
 .. _dummy_header:
 
-`NVIDIA NeMo `_, part of the NVIDIA AI platform, is a toolkit for building new state-of-the-art
-conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR),
-Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of
+NVIDIA NeMo Framework is an end-to-end, cloud-native framework to build, customize, and deploy generative AI models anywhere.
+To learn more about using NeMo in generative AI workflows, please refer to the `NeMo Framework User Guide `_.
+
+`NVIDIA NeMo Framework `_ has separate collections for Large Language Models (LLMs),
+Multimodal (MM), Computer Vision (CV), Automatic Speech Recognition (ASR),
+and Text-to-Speech (TTS) models. Each collection consists of
 prebuilt modules that include everything needed to train on your data.
-Every module can easily be customized, extended, and composed to create new conversational AI
+Every module can easily be customized, extended, and composed to create new generative AI
 model architectures.
 
-Conversational AI architectures are typically large and require a lot of data and compute
+Generative AI architectures are typically large and require a lot of data and compute
 for training. NeMo uses `PyTorch Lightning `_ for easy and performant multi-GPU/multi-node
 mixed-precision training.
 
@@ -38,7 +41,7 @@ Before you begin using NeMo, it's assumed you meet the following prerequisites.
 Quick Start Guide
 -----------------
 
-You can try out NeMo's ASR, NLP and TTS functionality with the example below, which is based on the `Audio Translation `_ tutorial.
+You can try out NeMo's ASR, LLM, and TTS functionality with the example below, which is based on the `Audio Translation `_ tutorial.
 Once you have :ref:`installed NeMo `, then you can run the code below:
 
diff --git a/nemo/collections/asr/README.md b/nemo/collections/asr/README.md
new file mode 100644
index 0000000000000..9a1b947f2d184
--- /dev/null
+++ b/nemo/collections/asr/README.md
@@ -0,0 +1,37 @@
+# Automatic Speech Recognition (ASR)
+
+## Key Features
+
+* [HuggingFace Space for Audio Transcription (File, Microphone and YouTube)](https://huggingface.co/spaces/smajumdar/nemo_multilingual_language_id)
+* [Pretrained models](https://ngc.nvidia.com/catalog/collections/nvidia:nemo_asr) available in 14+ languages
+* [Automatic Speech Recognition (ASR)](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/intro.html)
+  * Supported ASR [models](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html):
+    * Jasper, QuartzNet, CitriNet, ContextNet
+    * Conformer-CTC, Conformer-Transducer, FastConformer-CTC, FastConformer-Transducer
+    * Squeezeformer-CTC and Squeezeformer-Transducer
+    * LSTM-Transducer (RNNT) and LSTM-CTC
+  * Supports the following decoders/losses:
+    * CTC
+    * Transducer/RNNT
+    * Hybrid Transducer/CTC
+    * NeMo Original [Multi-blank Transducers](https://arxiv.org/abs/2211.03541) and [Token-and-Duration Transducers (TDT)](https://arxiv.org/abs/2304.06795)
+  * Streaming/Buffered ASR (CTC/Transducer) - [Chunked Inference Examples](https://github.com/NVIDIA/NeMo/tree/stable/examples/asr/asr_chunked_inference)
+  * [Cache-aware Streaming Conformer](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#cache-aware-streaming-conformer) with multiple lookaheads (including a microphone streaming [tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb)).
+  * Beam Search decoding
+  * [Language Modelling for ASR (CTC and RNNT)](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html): N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer
+  * [Support of long audios for Conformer with memory efficient local attention](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/results.html#inference-on-long-audio)
+* [Speech Classification, Speech Command Recognition and Language Identification](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/intro.html): MatchboxNet (Command Recognition), AmberNet (LangID)
+* [Voice activity Detection (VAD)](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speech_classification/models.html#marblenet-vad): MarbleNet
+  * ASR with VAD Inference - [Example](https://github.com/NVIDIA/NeMo/tree/stable/examples/asr/asr_vad)
+* [Speaker Recognition](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/intro.html): TitaNet, ECAPA_TDNN, SpeakerNet
+* [Speaker Diarization](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_diarization/intro.html)
+  * Clustering Diarizer: TitaNet, ECAPA_TDNN, SpeakerNet
+  * Neural Diarizer: MSDD (Multi-scale Diarization Decoder)
+* [Speech Intent Detection and Slot Filling](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_intent_slot/intro.html): Conformer-Transformer
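+
+## Quick Start
+
+A minimal sketch for getting started (the checkpoint name and audio path below are placeholders;
+`list_available_models()` prints the checkpoints that can be downloaded automatically, and the exact
+API surface can vary between NeMo releases):
+
+```python
+import nemo.collections.asr as nemo_asr
+
+# See which pretrained CTC checkpoints can be downloaded from NGC / Hugging Face Hub.
+print(nemo_asr.models.EncDecCTCModel.list_available_models())
+
+# Download a pretrained model and run offline transcription on a local WAV file.
+asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_large")
+print(asr_model.transcribe(["sample.wav"]))
+```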
+
+You can also get a high-level overview of NeMo ASR by watching the talk *NVIDIA NeMo: Toolkit for Conversational AI*, presented at PyData Yerevan 2022:
+
+[![NVIDIA NeMo: Toolkit for Conversational AI](https://img.youtube.com/vi/J-P6Sczmas8/maxres3.jpg)](https://www.youtube.com/embed/J-P6Sczmas8?mute=0&start=14&autoplay=0 "NeMo presentation at PyData@Yerevan 2022")
diff --git a/nemo/collections/multimodal/README.md b/nemo/collections/multimodal/README.md
new file mode 100644
index 0000000000000..c160ac89569d2
--- /dev/null
+++ b/nemo/collections/multimodal/README.md
@@ -0,0 +1,27 @@
+NeMo Multimodal Collection
+============================
+
+The NeMo Multimodal Collection supports a diverse range of multimodal models tailored for various tasks, including text-to-image generation, text-to-NeRF synthesis, multimodal language models, and foundational vision and language models. Wherever feasible, the collection reuses existing modules from other NeMo collections, such as LLM and Vision, avoiding redundant implementations. Here's a detailed list of the models currently supported within the multimodal collection:
+
+- **Foundation Vision-Language Models:**
+  - CLIP
+
+- **Foundation Text-to-Image Generation:**
+  - Stable Diffusion
+  - Imagen
+
+- **Customizable Text-to-Image Models:**
+  - SD-LoRA
+  - SD-ControlNet
+  - SD-Instruct pix2pix
+
+- **Multimodal Language Models:**
+  - NeVA
+  - LLaVA
+
+- **Text-to-NeRF Synthesis:**
+  - DreamFusion++
+
+- **NSFW Detection Support**
+
+Our [documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/index.html) offers comprehensive insights into each supported model, facilitating seamless integration and utilization within your projects.
diff --git a/nemo/collections/nlp/README.md b/nemo/collections/nlp/README.md
new file mode 100644
index 0000000000000..fc6644d28293f
--- /dev/null
+++ b/nemo/collections/nlp/README.md
@@ -0,0 +1,13 @@
+NeMo NLP/LLM Collection
+========================
+
+The NeMo NLP/LLM Collection provides comprehensive support for popular community large language models as well as NVIDIA's top LLM offerings. By harnessing the cutting-edge Megatron Core, the LLM collection is highly optimized, empowering NeMo users to train foundation models across thousands of GPUs while facilitating fine-tuning of LLMs with techniques such as SFT and PEFT. Leveraging the Transformer Engine library, the collection supports FP8 workloads on Hopper H100 GPUs. Additionally, the released models support export to TensorRT-LLM, which can accelerate inference by 2-3x depending on the model size. Here's a detailed list of the models currently supported within the LLM collection:
+
+- **BERT**
+- **GPT-style models**
+- **Falcon**
+- **CodeLlama 7B**
+- **Mistral**
+- **Mixtral**
+
+Our [documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/index.html) offers comprehensive insights into each supported model, facilitating seamless integration and utilization within your projects.
diff --git a/nemo/collections/tts/README.md b/nemo/collections/tts/README.md
new file mode 100644
index 0000000000000..44b2b1b7a25c0
--- /dev/null
+++ b/nemo/collections/tts/README.md
@@ -0,0 +1,7 @@
+# Text-to-Speech Synthesis (TTS)
+
+* [Documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tts/intro.html#)
+* Mel-Spectrogram generators: FastPitch, SSL FastPitch, Mixer-TTS/Mixer-TTS-X, RAD-TTS, Tacotron2
+* Vocoders: HiFiGAN, UnivNet, WaveGlow
+* End-to-End Models: VITS
+* [Pre-trained Model Checkpoints in NVIDIA GPU Cloud (NGC)](https://ngc.nvidia.com/catalog/collections/nvidia:nemo_tts)
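+
+## Quick Start
+
+A minimal synthesis sketch pairing a spectrogram generator with a vocoder (the checkpoint names,
+sample rate, and output path below are illustrative):
+
+```python
+import soundfile as sf
+
+from nemo.collections.tts.models import FastPitchModel, HifiGanModel
+
+# Download pretrained FastPitch (mel-spectrogram generator) and HiFi-GAN (vocoder) checkpoints.
+spec_generator = FastPitchModel.from_pretrained(model_name="tts_en_fastpitch")
+vocoder = HifiGanModel.from_pretrained(model_name="tts_hifigan")
+
+# Text -> tokens -> mel spectrogram -> waveform.
+tokens = spec_generator.parse("Hello, this is NeMo text-to-speech.")
+spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
+audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
+
+# The English FastPitch checkpoint is trained on 22.05 kHz audio.
+sf.write("speech.wav", audio.detach().cpu().numpy()[0], samplerate=22050)
+```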
diff --git a/nemo/collections/vision/README.md b/nemo/collections/vision/README.md
new file mode 100644
index 0000000000000..057f5b3a4719a
--- /dev/null
+++ b/nemo/collections/vision/README.md
@@ -0,0 +1,6 @@
+NeMo Vision Collection
+========================
+
+The NeMo Vision Collection is designed to support the multimodal collection, particularly for models like LLaVA that require a vision encoder. At present, the vision collection supports ViT, a customized version of the Transformer model from Megatron Core.
+
+Our [documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/index.html) offers comprehensive insights into each supported model, facilitating seamless integration and utilization within your projects.
\ No newline at end of file