rST Fixes for Developer Docs (#8535)
* Update checkpoint.rst
* Update configs.rst
* Update intro.rst
* Update intro.rst
* Update neva.rst
* Update checkpoint.rst
* Update configs.rst
* Update controlnet.rst
* Update datasets.rst
* Update dreambooth.rst
* Update imagen.rst
* Update insp2p.rst
* Update sd.rst
* Update checkpoint.rst
* Update clip.rst
* Update mcore_customization.rst
* Update retro_model.rst
* Update migration-guide.rst
* Update nemo_forced_aligner.rst
* Update checkpoints.rst
* Update datasets.rst
* Update g2p.rst
* Update checkpoint.rst
* Update configs.rst
* Update datasets.rst
* Update vit.rst
* Update core.rst
* Update export.rst

---------

Signed-off-by: Andrew Schilling <[email protected]>
aschilling-nv authored Feb 28, 2024
1 parent 0796199 commit 5f95f50
Showing 27 changed files with 173 additions and 153 deletions.
3 changes: 1 addition & 2 deletions docs/source/core/core.rst
@@ -201,8 +201,7 @@ First, instantiate the model and trainer, then call ``.fit``:
# Or we can run the test loop on test data by calling
trainer.test(model=model)
-All `trainer flags <https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#trainer-flags>`_ can be set from from the
-NeMo configuration.
+All `trainer flags <https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#trainer-flags>`_ can be set from from the NeMo configuration.


Configuration
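As context for the hunk above: the trainer flags it mentions can be driven straight from a NeMo-style config. A minimal sketch, not part of this commit; the flag values are illustrative assumptions:

.. code-block:: python

    # Sketch: feeding Lightning trainer flags from an OmegaConf config.
    # The flag values below are illustrative, not recommendations.
    import pytorch_lightning as pl
    from omegaconf import OmegaConf

    cfg = OmegaConf.create(
        {
            "trainer": {
                "devices": 1,            # any documented trainer flag can live here
                "max_epochs": 10,
                "log_every_n_steps": 50,
            }
        }
    )

    trainer = pl.Trainer(**cfg.trainer)  # flags flow straight from the config
    # trainer.fit(model)                 # train, as in the docs above ...
    # trainer.test(model=model)          # ... or run the test loop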
3 changes: 3 additions & 0 deletions docs/source/core/export.rst
@@ -197,6 +197,7 @@ To facilitate that, the hooks below are provided. To export, for example, 'encod
Some nertworks may be exported differently according to user-settable options (like ragged batch support for TTS or cache support for ASR). To facilitate that - `set_export_config()` method is provided by Exportable to set key/value pairs to predefined model.export_config dictionary, to be used during the export:

.. code-block:: Python
+
    def set_export_config(self, args):
        """
        Sets/updates export_config dictionary
@@ -207,6 +208,7 @@ An example can be found in ``<NeMo_git_root>/nemo/collections/asr/models/rnnt_mo
Here is example on now `set_export_config()` call is being tied to command line arguments in ``<NeMo_git_root>/scripts/export.py`` :

.. code-block:: Python
+
    python scripts/export.py hybrid_conformer.nemo hybrid_conformer.onnx --export-config decoder_type=ctc
Exportable Model Code
@@ -217,6 +219,7 @@ Most importantly, the actual Torch code in your model should be ONNX or TorchScr
#. Create your model ``Exportable`` and add an export unit test, to catch any operation/construct not supported in ONNX/TorchScript, immediately.

For more information, refer to the PyTorch documentation:
+
- `List of supported operators <https://pytorch.org/docs/stable/onnx.html#supported-operators>`_
- `Tracing vs. scripting <https://pytorch.org/docs/stable/onnx.html#tracing-vs-scripting>`_
- `AlexNet example <https://pytorch.org/docs/stable/onnx.html#example-end-to-end-alexnet-from-pytorch-to-onnx>`_
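To make the ``set_export_config()`` flow above concrete: a hedged sketch tying the documented ``decoder_type=ctc`` option to the Python API. The checkpoint name comes from the example command line; restoring via ``ASRModel`` is an assumption:

.. code-block:: python

    # Sketch of the export-config flow described in export.rst.
    # Only "decoder_type=ctc" comes from the docs; the model class is assumed.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.ASRModel.restore_from("hybrid_conformer.nemo")
    model.set_export_config({"decoder_type": "ctc"})  # updates model.export_config
    model.export("hybrid_conformer.onnx")             # export consults the config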
2 changes: 1 addition & 1 deletion docs/source/multimodal/mllm/checkpoint.rst
@@ -92,7 +92,7 @@ For conversion:
Model Parallelism Adjustment
--------------------------
+----------------------------

NeVA Checkpoints
^^^^^^^^^^^^^^^^
26 changes: 13 additions & 13 deletions docs/source/multimodal/mllm/configs.rst
@@ -123,19 +123,19 @@ Each configuration file should detail the model architecture used for the experi

The parameters commonly shared across most multimodal language models include:

-+---------------------------+--------------+---------------------------------------------------------------------------------------+
-| **Parameter** | **Datatype** | **Description** |
-+===========================+==============+=======================================================================================+
-| :code:`micro_batch_size` | int | micro batch size that fits on each GPU |
-+---------------------------+--------------+---------------------------------------------------------------------------------------+
-| :code:`global_batch_size` | int | global batch size that takes consideration of gradient accumulation, data parallelism |
-+---------------------------+--------------+---------------------------------------------------------------------------------------+
-| :code:`tensor_model_parallel_size` | int | intra-layer model parallelism |
-+---------------------------+--------------+---------------------------------------------------------------------------------------+
-| :code:`pipeline_model_parallel_size` | int | inter-layer model parallelism |
-+---------------------------+--------------+---------------------------------------------------------------------------------------+
-| :code:`seed` | int | seed used in training |
-+---------------------------+--------------+---------------------------------------------------------------------------------------+
++------------------------------------------+--------------+---------------------------------------------------------------------------------------+
+| **Parameter** | **Datatype** | **Description** |
++==========================================+==============+=======================================================================================+
+| :code:`micro_batch_size` | int | micro batch size that fits on each GPU |
++------------------------------------------+--------------+---------------------------------------------------------------------------------------+
+| :code:`global_batch_size` | int | global batch size that takes consideration of gradient accumulation, data parallelism |
++------------------------------------------+--------------+---------------------------------------------------------------------------------------+
+| :code:`tensor_model_parallel_size` | int | intra-layer model parallelism |
++------------------------------------------+--------------+---------------------------------------------------------------------------------------+
+| :code:`pipeline_model_parallel_size` | int | inter-layer model parallelism |
++------------------------------------------+--------------+---------------------------------------------------------------------------------------+
+| :code:`seed` | int | seed used in training |
++------------------------------------------+--------------+---------------------------------------------------------------------------------------+

NeVA
~~~~~~~~
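A companion sketch for the shared-parameter table above (not from the commit; every value is a placeholder):

.. code-block:: python

    # Sketch: the shared multimodal LM parameters from the table, as a config.
    from omegaconf import OmegaConf

    model_cfg = OmegaConf.create(
        {
            "micro_batch_size": 4,              # micro batch size that fits on each GPU
            "global_batch_size": 64,            # folds in grad accumulation and data parallelism
            "tensor_model_parallel_size": 2,    # intra-layer model parallelism
            "pipeline_model_parallel_size": 1,  # inter-layer model parallelism
            "seed": 1234,                       # seed used in training
        }
    )

    # Consistency implied by the table: the global batch is a multiple of
    # micro batch x data-parallel size (x gradient-accumulation steps).
    data_parallel_size = 8  # hypothetical: world_size / (TP x PP)
    assert model_cfg.global_batch_size % (
        model_cfg.micro_batch_size * data_parallel_size
    ) == 0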
6 changes: 3 additions & 3 deletions docs/source/multimodal/mllm/intro.rst
@@ -12,7 +12,7 @@ NeMo Multimodal currently supports the following models:
+-----------------------------------+----------+-------------+------+-------------------------+------------------+
| Model | Training | Fine-Tuning | PEFT | Evaluation | Inference |
+===================================+==========+=============+======+=========================+==================+
-| `NeVA (LLaVA) <./neva.html>`_ | | | - | - | |
+| `NeVA (LLaVA) <./neva.html>`_ | Yes | Yes | - | - | Yes |
+-----------------------------------+----------+-------------+------+-------------------------+------------------+
| Kosmos-2 | WIP | WIP | - | - | WIP |
+-----------------------------------+----------+-------------+------+-------------------------+------------------+
@@ -47,7 +47,7 @@ Flamingo :cite:`mm-models-flamingo` addresses inconsistent visual feature map si
- Dataset: Utilizes data from various datasets like M3W, ALIGN, LTIP, and VTP emphasizing multimodal in-context learning.

Kosmos-1: Language Is Not All You Need: Aligning Perception with Language Models
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Kosmos-1 :cite:`mm-models-kosmos1` by Microsoft is a Multimodal Large Language Model (MLLM) aimed at melding language, perception, action, and world modeling.

@@ -108,4 +108,4 @@ References
:style: plain
:filter: docname in docnames
:labelprefix: MM-MODELS
-:keyprefix: mm-models-
\ No newline at end of file
+:keyprefix: mm-models-
11 changes: 6 additions & 5 deletions docs/source/multimodal/mllm/neva.rst
@@ -15,15 +15,15 @@ Building upon LLaVA's foundational principles, NeVA amplifies its training effic


Main Language Model
-^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^

The original LLaVA model incorporates the LLaMA architecture, renowned for its prowess in open-source, language-only instruction-tuning endeavors. LLaMA refines textual input through a process of tokenization and embedding. To these token embeddings, positional embeddings are integrated, and the combined representation is channeled through multiple transformer layers. The output from the concluding transformer layer, associated with the primary token, is designated as the text representation.

In NeMo, the text encoder is anchored in the :class:`~nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel` class. This class is versatile, supporting not only NVGPT models but also LLaMA, LLaMA-2 and other community models, complete with a checkpoint conversion script. Concurrently, the vision model and projection layers enhance the primary language model's word embedding component. For a comprehensive understanding of the implementation, one can refer to the :class:`~nemo.collections.multimodal.models.multimodal_llm.neva.neva_model.MegatronNevaModel` class.


Vision Model
-^^^^^^^^^^
+^^^^^^^^^^^

For visual interpretation, NeVA harnesses the power of the pre-trained CLIP visual encoder, ViT-L/14, recognized for its visual comprehension acumen. Images are first partitioned into standardized patches, for instance, 16x16 pixels. These patches are linearly embedded, forming a flattened vector that subsequently feeds into the transformer. The culmination of the transformer's processing is a unified image representation. In the NeMo framework, the NeVA vision model, anchored on the CLIP visual encoder ViT-L/14, can either be instantiated via the :class:`~nemo.collections.multimodal.models.multimodal_llm.clip.megatron_clip_models.CLIPVisionTransformer` class or initiated through the `transformers` package from Hugging Face.

@@ -44,7 +44,7 @@ Architecture Table
+------------------+---------------+------------+--------------------+-----------------+------------+----------------+--------------------------+

Model Configuration
-------------------
+-------------------

Multimodal Configuration
^^^^^^^^^^^^^^^^^^^^^^^^
@@ -140,7 +140,8 @@ Optimizations
| BF16 O2 | Enables O2-level automatic mixed precision, optimizing Bfloat16 precision for better performance. | ``model.megatron_amp_O2=True`` |
+------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Flash Attention V2 | FlashAttention is a fast and memory-efficient algorithm to compute exact attention. It speeds up model training and reduces memory requirement by being IO-aware. This approach is particularly useful for large-scale models and is detailed further in the repository linked. [Reference](https://github.com/Dao-AILab/flash-attention) | ``model.use_flash_attention=True`` |
-+----------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
++------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+

NeVA Training
--------------
@@ -157,4 +158,4 @@ References
:style: plain
:filter: docname in docnames
:labelprefix: MM-MODELS
-:keyprefix: mm-models-
\ No newline at end of file
+:keyprefix: mm-models-
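The two optimization flags shown in the table above can be set together on a config; a minimal sketch (the YAML file names are hypothetical, only the flag paths come from the table):

.. code-block:: python

    # Sketch: enabling the documented NeVA optimizations on a loaded config.
    # "neva_config.yaml" is a hypothetical file name.
    from omegaconf import OmegaConf

    cfg = OmegaConf.load("neva_config.yaml")
    cfg.model.megatron_amp_O2 = True      # BF16 O2-level mixed precision
    cfg.model.use_flash_attention = True  # Flash Attention V2
    OmegaConf.save(cfg, "neva_config_tuned.yaml")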
8 changes: 4 additions & 4 deletions docs/source/multimodal/text2img/checkpoint.rst
@@ -12,7 +12,7 @@ Refer to the following sections for instructions and examples for each.
Note that these instructions are for loading fully trained checkpoints for evaluation or fine-tuning.

Loading ``.nemo`` Checkpoints
--------------------------
+-----------------------------

NeMo automatically saves checkpoints of a model that is trained in a ``.nemo`` format. Alternatively, to manually save the model at any
point, issue :code:`model.save_to(<checkpoint_path>.nemo)`.
@@ -27,7 +27,7 @@ If there is a local ``.nemo`` checkpoint that you'd like to load, use the :code:
Where the model base class is the MM model class of the original checkpoint.

Converting Intermediate Checkpoints
----------------------------
+-----------------------------------
To evaluate a partially trained checkpoint, you may need to convert it to ``.nemo`` format.
`script to convert the checkpoint <ADD convert_ckpt_to_nemo.py PATH>`.

@@ -43,7 +43,7 @@ To evaluate a partially trained checkpoint, you may need to convert it to ``.nem
Converting HuggingFace Checkpoints
----------------------------------
+----------------------------------

To fully utilize the optimized training pipeline and framework/TRT inference pipeline
of NeMo, we provide scripts to convert popular checkpoints on HuggingFace into NeMo format.
@@ -77,4 +77,4 @@ Imagen

We will provide conversion script if Imagen research team releases their checkpoint
in the future. Conversion script for DeepFloyd IF models will be provided in the
-next release.
\ No newline at end of file
+next release.
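A minimal sketch of the ``.nemo`` save/restore round trip this file describes; ``MyMMModel`` is a hypothetical stand-in for the concrete MM model class of the checkpoint, as the docs instruct:

.. code-block:: python

    # Sketch of the save_to()/restore_from() APIs named in checkpoint.rst.
    # "MyMMModel" is a hypothetical placeholder; substitute the MM model
    # class that produced your checkpoint.
    from nemo.collections.multimodal.models import MyMMModel  # hypothetical import

    model = MyMMModel.restore_from(restore_path="model.nemo")  # load a local .nemo file
    # ... evaluate or fine-tune ...
    model.save_to("model_updated.nemo")  # manually save at any point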
4 changes: 2 additions & 2 deletions docs/source/multimodal/text2img/configs.rst
@@ -90,7 +90,7 @@ for all possible arguments


Experiment Manager Configurations
----------------------------
+---------------------------------

NeMo Experiment Manager provides convenient way to configure logging, saving, resuming options and more.

@@ -145,7 +145,7 @@ By default we use ``fused_adam`` as the optimizer, refer to NeMo user guide for
Learning rate scheduler can be specified in ``optim.sched`` section.

Model Architecture Configurations
-------------------------
+---------------------------------

Each configuration file should describe the model architecture being used for the experiment.

4 changes: 2 additions & 2 deletions docs/source/multimodal/text2img/controlnet.rst
@@ -13,7 +13,7 @@ NeMo Multimodal provides a training pipeline and example implementation for gene


ControlNet Dataset
-____________________
+^^^^^^^^^^^^^^^^^^^^

ControlNet employs the WebDataset format for data ingestion. (See :doc:`Datasets<./datasets>`) Beyond the essential image-text pairs saved in tarfiles with matching names but distinct extensions (like 000001.jpg and 000001.txt), ControlNet also requires control input within the tarfiles, identifiable by their specific extension. By default, the control input should be stored as 000001.png for correct loading and identification in NeMo's implementation.

@@ -103,4 +103,4 @@ Reference
:style: plain
:filter: docname in docnames
:labelprefix: MM-MODELS
-:keyprefix: mm-models-
\ No newline at end of file
+:keyprefix: mm-models-
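The tarfile layout described in the ControlNet hunk (matching basenames, ``.png`` carrying the control input) can be produced with the ``webdataset`` package; a sketch with assumed input paths:

.. code-block:: python

    # Sketch: packing one ControlNet-style sample into a WebDataset shard.
    # The local paths are assumptions for illustration.
    import webdataset as wds

    with wds.TarWriter("controlnet-shard-000000.tar") as sink:
        sink.write(
            {
                "__key__": "000001",                          # shared basename
                "jpg": open("data/000001.jpg", "rb").read(),  # target image
                "txt": "an example caption",                  # text prompt
                "png": open("data/000001.png", "rb").read(),  # control input
            }
        )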
4 changes: 2 additions & 2 deletions docs/source/multimodal/text2img/datasets.rst
@@ -2,7 +2,7 @@ Datasets
========

Data pipeline overview
------------------
+----------------------

.. note:: It is the responsibility of each user to check the content of the dataset, review the applicable licenses, and determine if it is suitable for their intended use. Users should review any applicable links associated with the dataset before placing the data on their machine.

@@ -34,7 +34,7 @@ Instruction for configuring each sub-stage is provided as a comment next to each


Examples of Preparing a Dataset for Training Text2Img Model
------------------------
+-----------------------------------------------------------

Refer to the `Dataset Tutorial <http://TODOURL>`_` for details on how to prepare the training dataset for Training Text2Img models.

