
Sync vfm branch with main branch #11288

Merged: 24 commits, Nov 14, 2024

Changes from all commits (24 commits)
1cfecc9
Timestamps to transcribe (#10950)
nithinraok Nov 10, 2024
5c5b023
[🤠]: Howdy folks, let's bump `Dockerfile.ci` to 1b8fce7 ! (#11247)
ko3n1g Nov 11, 2024
66766b1
[🤠]: Howdy folks, let's bump `Dockerfile.ci` to 47ff44e ! (#11254)
ko3n1g Nov 12, 2024
d32c664
Handling tokenizer in PTQ for Nemo 2.0 (#11237)
janekl Nov 12, 2024
34c3032
Fix finetuning datamodule resume (#11187)
cuichenx Nov 12, 2024
d363e5d
ci: Move `bump mcore` to templates (#11229)
ko3n1g Nov 12, 2024
77c8e91
fix: Update baseline (#11205)
ko3n1g Nov 12, 2024
b26c220
Remove deprecated builder_opt param from build command (#11259)
janekl Nov 12, 2024
098aa18
chore(beep boop 🤖): Bump `MCORE_TAG=aded519...` (2024-11-12) (#11260)
ko3n1g Nov 12, 2024
5670706
[Doc fixes] update file names, installation instructions, bad links (…
erastorgueva-nv Nov 12, 2024
2d4f495
fix(export): GPT models w/ bias=False convert properly (#11255)
terrykong Nov 12, 2024
24e2871
ci: Run secrets detector on `pull_request_target` (#11263)
ko3n1g Nov 12, 2024
085e957
fix(export): update API for disabling device reassignment in TRTLLM f…
terrykong Nov 12, 2024
6e8e974
new vfm training features (#11246)
zpx01 Nov 13, 2024
f311b2e
Update pruning and distillation tutorial notebooks (#11091)
gvenkatakris Nov 13, 2024
a2572a7
Beam search algorithm implementation for TDT models (#10903)
lilithgrigoryan Nov 13, 2024
a9a959c
update nemo1->2 conversion according to changes in main (#11253)
HuiyingLi Nov 13, 2024
3625d78
Add llama 3.1 recipes (#11273)
cuichenx Nov 13, 2024
071f8bc
Fix Finetune Recipe (#11267)
suiyoubi Nov 13, 2024
02f0932
Configure no restart validation loop in nl.Trainer (#11029)
hemildesai Nov 13, 2024
af91d28
Handle _io_unflatten_object when _thread_local.output_dir is not avai…
hemildesai Nov 14, 2024
8b0c311
change default ckpt name (#11277)
maanug-nv Nov 14, 2024
bf7cc64
Use MegatronDataSampler in HfDatasetDataModule (#11274)
akoumpa Nov 14, 2024
0625327
Remove opencc upperbound (#10909)
thomasdhc Nov 14, 2024
63 changes: 12 additions & 51 deletions .github/workflows/mcore-tag-bump-bot.yml
@@ -6,54 +6,15 @@ on:
- cron: 0 0 * * *

jobs:
main:
runs-on: ubuntu-latest
environment: main
steps:
- name: Checkout NVIDIA/Megatron-LM
uses: actions/checkout@v4
with:
repository: NVIDIA/Megatron-LM
ref: main
path: ${{ github.run_id }}

- name: Get latest mcore commit
id: ref
run: |
cd ${{ github.run_id }}
sha=$(git rev-parse origin/main)
echo "sha=${sha}" >> "$GITHUB_OUTPUT"
echo "short_sha=${sha:0:7}" >> "$GITHUB_OUTPUT"
echo "date=$(date +%F)" >> "$GITHUB_OUTPUT"

- name: Checkout ${{ github.repository }}
uses: actions/checkout@v4
with:
path: ${{ github.run_id }}
token: ${{ secrets.PAT }}

- name: Bump MCORE_TAG
run: |
cd ${{ github.run_id }}
sed -i 's/^ARG MCORE_TAG=.*$/ARG MCORE_TAG=${{ steps.ref.outputs.sha }}/' Dockerfile.ci

- name: Create Bump PR
uses: peter-evans/create-pull-request@v6
id: create-pull-request
with:
path: ${{ github.run_id }}
branch: bump-ci-container-${{ steps.ref.outputs.date }}
base: main
title: 'Bump `Dockerfile.ci` (${{ steps.ref.outputs.date }})'
token: ${{ secrets.PAT }}
body: |
🚀 PR to Bump `Dockerfile.ci`.

📝 Please remember the following to-do's before merge:
- [ ] Verify the presubmit CI

🙏 Please merge this PR only if the CI workflow completed successfully.
commit-message: "[🤠]: Howdy folks, let's bump `Dockerfile.ci` to ${{ steps.ref.outputs.short_sha }} !"
signoff: true
reviewers: 'pablo-garay'
labels: 'Run CICD'
mcore:
uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/[email protected]
with:
source-repository: NVIDIA/Megatron-LM
source-ref: main
build-arg: MCORE_TAG
dockerfile: Dockerfile.ci
base-branch: main
cicd-label: Run CICD
pr-reviewers: 'pablo-garay'
secrets:
PAT: ${{ secrets.PAT }}
19 changes: 15 additions & 4 deletions .github/workflows/secrets-detector.yml
@@ -14,7 +14,7 @@
name: Secrets detector

on:
pull_request:
pull_request_target:
branches:
- 'main'

@@ -25,13 +25,24 @@ jobs:
- name: Checkout repository
uses: actions/checkout@v4
with:
path: ${{ github.run_id }}
# setup repository and ref for PRs, see
# https://github.com/EndBug/add-and-commit?tab=readme-ov-file#working-with-prs
repository: ${{ github.event.pull_request.head.repo.full_name }}
ref: ${{ github.event.pull_request.head.ref }}
# custom token is required to trigger actions after reformatting + pushing
fetch-depth: 0
token: ${{ secrets.NEMO_REFORMAT_TOKEN }}

- name: Install secrets detector
run: pip install detect-secrets

- name: Run on change-set
run: |
cd ${{ github.run_id }}
git diff --name-only --diff-filter=d --merge-base origin/main -z | xargs -0 detect-secrets-hook --baseline .secrets.baseline
git diff --name-only --diff-filter=d --merge-base origin/main -z | xargs -0 detect-secrets-hook --baseline .secrets.baseline

- uses: EndBug/add-and-commit@v9
# Commit changes; nothing is committed if there are no changes.
if: always()
with:
message: Update baseline
commit: --signoff
2 changes: 1 addition & 1 deletion Dockerfile.ci
@@ -54,7 +54,7 @@ RUN pip install nemo_run@git+https://github.com/NVIDIA/NeMo-Run.git@${NEMO_RUN_T
# Install NeMo requirements
ARG TE_TAG=7d576ed25266a17a7b651f2c12e8498f67e0baea
ARG MODELOPT_VERSION=0.19.0
ARG MCORE_TAG=bc8c4f356240ea4ccadce426251171e6e430c9d3
ARG MCORE_TAG=aded519cfb1de2abf96f36ca059f992294b7876f

ARG APEX_TAG=810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c
RUN \
15 changes: 15 additions & 0 deletions docs/source/asr/api.rst
@@ -276,6 +276,21 @@ RNNT Decoding
:show-inheritance:
:members:

TDT Decoding
~~~~~~~~~~~~~

.. autoclass:: nemo.collections.asr.parts.submodules.rnnt_greedy_decoding.GreedyTDTInfer
:show-inheritance:
:members:

.. autoclass:: nemo.collections.asr.parts.submodules.rnnt_greedy_decoding.GreedyBatchedTDTInfer
:show-inheritance:
:members:

.. autoclass:: nemo.collections.asr.parts.submodules.tdt_beam_decoding.BeamTDTInfer
:show-inheritance:
:members:

Hypotheses
~~~~~~~~~~

34 changes: 17 additions & 17 deletions docs/source/asr/asr_language_modeling_and_customization.rst
@@ -99,15 +99,15 @@ Evaluate by Beam Search Decoding and N-gram LM

NeMo's beam search decoders can use KenLM N-gram models to find the best candidates.
The script to evaluate an ASR model with beam search decoding and N-gram models can be found at
`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py>`__.
`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py>`__.

This script has a large number of possible argument overrides; therefore, it is recommended that you use ``python eval_beamsearch_ngram.py --help`` to see the full list of arguments.
This script has a large number of possible argument overrides; therefore, it is recommended that you use ``python eval_beamsearch_ngram_ctc.py --help`` to see the full list of arguments.

You can evaluate an ASR model using the following:

.. code-block::

python eval_beamsearch_ngram.py nemo_model_file=<path to the .nemo file of the model> \
python eval_beamsearch_ngram_ctc.py nemo_model_file=<path to the .nemo file of the model> \
input_manifest=<path to the evaluation JSON manifest file> \
kenlm_model_file=<path to the binary KenLM model> \
beam_width=[<list of the beam widths, separated with commas>] \
@@ -118,18 +118,18 @@ You can evaluate an ASR model using the following:
decoding_mode=beamsearch_ngram \
decoding_strategy="<Beam library such as beam, pyctcdecode or flashlight>"

It can evaluate a model in the following three modes by setting the argument `--decoding_mode`:
It can evaluate a model in the following three modes by setting the argument ``--decoding_mode``:

* greedy: Only greedy decoding is performed; no beam search decoding is done.
* beamsearch: Beam search decoding is performed, but without the N-gram language model. Final results are equivalent to setting the weight of the LM (beam_beta) to zero.
* beamsearch_ngram: Beam search decoding is performed with the N-gram LM.

In `beamsearch` mode, the evaluation is performed using beam search decoding without any language model. The performance is reported in terms of Word Error Rate (WER) and Character Error Rate (CER). Moreover, when the best candidate is selected among the candidates, it is also reported as the best WER/CER. This can serve as an indicator of the quality of the predicted candidates.
In ``beamsearch`` mode, the evaluation is performed using beam search decoding without any language model. The performance is reported in terms of Word Error Rate (WER) and Character Error Rate (CER). Moreover, when the best candidate is selected among the candidates, it is also reported as the best WER/CER. This can serve as an indicator of the quality of the predicted candidates.
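
As a rough illustration of these metrics, the following is a minimal sketch of WER and of the best (oracle) WER over beam candidates; the helper functions are ours, not part of the evaluation script (CER is the same computation over characters instead of words):

.. code-block:: python

def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (free if tokens match)
            )
    return dp[-1]

def wer(reference, hypothesis):
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / max(len(ref), 1)

# The "best WER" is an oracle: the candidate closest to the reference, which
# indicates how good the beam candidates could be with a perfect selector.
reference = "the cat sat on the mat"
candidates = ["the cat sat on a mat", "a cat sat on the mat"]
print(min(wer(reference, c) for c in candidates))  # 1/6 for both candidates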


The script initially loads the ASR model and predicts the outputs of the model's encoder as log probabilities. This part is computed in batches on a device specified by --device, which can be either a CPU (`--device=cpu`) or a single GPU (`--device=cuda:0`).
The batch size for this part is specified by `--acoustic_batch_size`. Using the largest feasible batch size can speed up the calculation of log probabilities. Additionally, you can use `--use_amp` to accelerate the calculation and allow for larger --acoustic_batch_size values.
Currently, multi-GPU support is not available for calculating log probabilities. However, using `--probs_cache_file` can help. This option stores the log probabilities produced by the model’s encoder in a pickle file, allowing you to skip the first step in future runs.
The batch size for this part is specified by ``--acoustic_batch_size``. Using the largest feasible batch size can speed up the calculation of log probabilities. Additionally, you can use `--use_amp` to accelerate the calculation and allow for larger --acoustic_batch_size values.
Currently, multi-GPU support is not available for calculating log probabilities. However, using ``--probs_cache_file`` can help. This option stores the log probabilities produced by the model’s encoder in a pickle file, allowing you to skip the first step in future runs.
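
The caching behaviour behind ``--probs_cache_file`` boils down to the following pattern; ``compute_fn`` and the function name are hypothetical stand-ins for the batched encoder forward pass, not functions from the script:

.. code-block:: python

import os
import pickle

def encoder_log_probs(cache_file, audio_paths, compute_fn):
    """Compute encoder log probabilities once and reuse them on later runs."""
    if cache_file and os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f)  # skip the expensive forward pass entirely
    log_probs = [compute_fn(path) for path in audio_paths]
    if cache_file:
        with open(cache_file, "wb") as f:
            pickle.dump(log_probs, f)
    return log_probs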

The following is the list of the important arguments for the evaluation script:

@@ -167,7 +167,7 @@ The following is the list of the important arguments for the evaluation script:
| decoding_strategy | str | beam | String argument for type of decoding strategy for the model. |
+--------------------------------------+----------+------------------+-------------------------------------------------------------------------+
| decoding | Dict | BeamCTC | Subdict of beam search configs. Values found via |
| | Config | InferConfig | python eval_beamsearch_ngram.py --help |
| | Config | InferConfig | python eval_beamsearch_ngram_ctc.py --help |
+--------------------------------------+----------+------------------+-------------------------------------------------------------------------+
| text_processing.do_lowercase | bool | ``False`` | Whether to make the training text all lower case. |
+--------------------------------------+----------+------------------+-------------------------------------------------------------------------+
@@ -178,11 +178,11 @@ The following is the list of the important arguments for the evaluation script:
| text_processing.separate_punctuation | bool | ``True`` | Whether to separate punctuation with the previous word by space. |
+--------------------------------------+----------+------------------+-------------------------------------------------------------------------+

The width of the beam search (`--beam_width`) specifies the number of top candidates or predictions the beam search decoder will consider. Larger beam widths result in more accurate but slower predictions.
The width of the beam search (``--beam_width``) specifies the number of top candidates or predictions the beam search decoder will consider. Larger beam widths result in more accurate but slower predictions.
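
To make that trade-off concrete, here is a generic top-k beam search sketch (illustrative only, not NeMo's decoder): every step keeps the ``beam_width`` highest-scoring partial hypotheses, so the cost grows roughly linearly with the width while more of the search space is retained.

.. code-block:: python

import math

def beam_search(step_log_probs, beam_width):
    """step_log_probs: list of per-step dicts mapping token -> log probability."""
    beams = [((), 0.0)]  # (token sequence, cumulative log prob)
    for dist in step_log_probs:
        candidates = [
            (seq + (tok,), score + lp)
            for seq, score in beams
            for tok, lp in dist.items()
        ]
        # keep only the top `beam_width` hypotheses: larger widths explore more
        # candidates (more accurate) at proportionally higher cost
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

steps = [{"a": math.log(0.6), "b": math.log(0.4)},
         {"a": math.log(0.1), "b": math.log(0.9)}]
print(beam_search(steps, beam_width=2))  # (('a', 'b'), log(0.54))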

.. note::

The ``eval_beamsearch_ngram.py`` script contains the entire subconfig used for CTC Beam Decoding.
The ``eval_beamsearch_ngram_ctc.py`` script contains the entire subconfig used for CTC Beam Decoding.
Therefore it is possible to forward arguments for various beam search libraries such as ``flashlight``
and ``pyctcdecode`` via the ``decoding`` subconfig.

@@ -223,14 +223,14 @@ It supports several advanced features, such as lexicon-based decoding, lexicon-f
.. code-block::

# Lexicon-based decoding
python eval_beamsearch_ngram.py ... \
python eval_beamsearch_ngram_ctc.py ... \
decoding_strategy="flashlight" \
decoding.beam.flashlight_cfg.lexicon_path='/path/to/lexicon.lexicon' \
decoding.beam.flashlight_cfg.beam_size_token=32 \
decoding.beam.flashlight_cfg.beam_threshold=25.0

# Lexicon-free decoding
python eval_beamsearch_ngram.py ... \
python eval_beamsearch_ngram_ctc.py ... \
decoding_strategy="flashlight" \
decoding.beam.flashlight_cfg.beam_size_token=32 \
decoding.beam.flashlight_cfg.beam_threshold=25.0
@@ -256,7 +256,7 @@ It has advanced features, such as word boosting, which can be useful for transcr
.. code-block::

# PyCTCDecoding
python eval_beamsearch_ngram.py ... \
python eval_beamsearch_ngram_ctc.py ... \
decoding_strategy="pyctcdecode" \
decoding.beam.pyctcdecode_cfg.beam_prune_logp=-10.0 \
decoding.beam.pyctcdecode_cfg.token_min_logp=-5.0 \
@@ -273,7 +273,7 @@ For example, the following set of parameters would result in 2*1*2=4 beam search decodings:

.. code-block::

python eval_beamsearch_ngram.py ... \
python eval_beamsearch_ngram_ctc.py ... \
beam_width=[64,128] \
beam_alpha=[1.0] \
beam_beta=[1.0,0.5]
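
The expansion of these lists into individual decodings can be pictured as a simple grid (a sketch; the script performs the equivalent loop internally):

.. code-block:: python

from itertools import product

beam_widths, beam_alphas, beam_betas = [64, 128], [1.0], [1.0, 0.5]

# every combination is decoded once: 2 * 1 * 2 = 4 runs
for width, alpha, beta in product(beam_widths, beam_alphas, beam_betas):
    print(f"beam_width={width} beam_alpha={alpha} beam_beta={beta}")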
@@ -330,7 +330,7 @@ Given a trained TransformerLMModel `.nemo` file or a pretrained HF model, the script
can be used to re-score beams obtained with an ASR model. To use this script, you need a `.tsv` file containing the candidates
produced by the acoustic model and beam search decoding. The candidates can be the result of beam
search decoding alone or of fusion with an N-gram LM. You can generate this file by specifying `--preds_output_folder` for
`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py>`__.
`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py>`__.

The neural rescorer rescores the beams/candidates using two parameters, `rescorer_alpha` and `rescorer_beta`, as follows:
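
A minimal sketch of that combination, assuming the common linear form (the authoritative expression is in ``eval_neural_rescorer.py``):

.. code-block:: python

def rescore(beam_score, lm_score, seq_length, rescorer_alpha=1.0, rescorer_beta=0.0):
    # Assumed form: interpolate the beam-search score with the neural LM score
    # and add a length term weighted by rescorer_beta.
    return beam_score + rescorer_alpha * lm_score + rescorer_beta * seq_length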

@@ -345,7 +345,7 @@ Use the following steps to evaluate a neural LM:
#. Obtain a `.tsv` file with beams and their corresponding scores. Scores can come from a regular beam search decoder or
from fusion with N-gram LM scores. For a given beam size `beam_size` and number of evaluation examples
`num_eval_examples`, it should contain (`num_eval_examples` x `beam_size`) lines of
form `beam_candidate_text \t score`. This file can be generated by `scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py>`__
form `beam_candidate_text \t score`. This file can be generated by `scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py>`__

#. Rescore the candidates by `scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py>`__.

@@ -439,7 +439,7 @@ You can then pass this file to your Flashlight config object during decoding:
.. code-block::

# Lexicon-based decoding
python eval_beamsearch_ngram.py ... \
python eval_beamsearch_ngram_ctc.py ... \
decoding_strategy="flashlight" \
decoding.beam.flashlight_cfg.lexicon_path='/path/to/lexicon.lexicon' \
decoding.beam.flashlight_cfg.boost_path='/path/to/my_boost_file.boost' \
39 changes: 34 additions & 5 deletions docs/source/asr/intro.rst
@@ -16,10 +16,39 @@ After :ref:`installing NeMo<installation>`, you can transcribe an audio file as
asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")
transcript = asr_model.transcribe(["path/to/audio_file.wav"])

Obtain word/segment timestamps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Obtain timestamps
^^^^^^^^^^^^^^^^^

You can also obtain timestamps for each word or segment in the transcription as follows:
Obtaining char (token), word, or segment timestamps is also possible with NeMo ASR models.

Currently, timestamps are available for Parakeet models with all decoder types (CTC/RNNT/TDT). Support for AED models will be added soon.

There are two ways to obtain timestamps:

1. Use the ``timestamps=True`` flag in the ``transcribe`` method.
2. For more control, update the decoding config to specify the type of timestamps (char, word, segment) and the segment separators or word separator for segment- and word-level timestamps.

With the ``timestamps=True`` flag, you can obtain char-, word-, and segment-level timestamps as follows:

.. code-block:: python

# import nemo_asr and instantiate asr_model as above
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt_ctc-110m")

# specify flag `timestamps=True`
hypotheses = asr_model.transcribe(["path/to/audio_file.wav"], timestamps=True)

# by default, timestamps are enabled for char, word and segment level
word_timestamps = hypotheses[0][0].timestep['word'] # word level timestamps for first sample
segment_timestamps = hypotheses[0][0].timestep['segment'] # segment level timestamps
char_timestamps = hypotheses[0][0].timestep['char'] # char level timestamps

for stamp in segment_timestamps:
print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")

# segment-level timestamps (if the model supports punctuation and capitalization, segments are split on punctuation; otherwise the complete transcription is treated as a single segment)

For more control over the timestamps, you can update the decoding config to specify the type of timestamps (char, word, segment) as well as the segment separators or word separator for segment- and word-level timestamps, as follows:

.. code-block:: python
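
# A hedged sketch of the second approach. `compute_timestamps` and
# `change_decoding_strategy` are real NeMo APIs; the separator field names
# below are assumptions for illustration.
import nemo.collections.asr as nemo_asr
from omegaconf import open_dict

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt_ctc-110m")

decoding_cfg = asr_model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.compute_timestamps = True              # char/word/segment timestamps
    decoding_cfg.segment_seperators = [".", "?", "!"]   # assumed field name
    decoding_cfg.word_seperator = " "                   # assumed field name
asr_model.change_decoding_strategy(decoding_cfg)

hypotheses = asr_model.transcribe(["path/to/audio_file.wav"], return_hypotheses=True)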

@@ -98,8 +127,8 @@ You can get a good improvement in transcription accuracy even using a simple N-g

After :ref:`training <train-ngram-lm>` an N-gram LM, you can use it for transcribing audio as follows:

1. Install the OpenSeq2Seq beam search decoding and KenLM libraries using the `install_beamsearch_decoders script <scripts/asr_language_modeling/ngram_lm/install_beamsearch_decoders.sh>`_.
2. Perform transcription using the `eval_beamsearch_ngram script <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py>`_:
1. Install the OpenSeq2Seq beam search decoding and KenLM libraries using the `install_beamsearch_decoders script <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/install_beamsearch_decoders.sh>`_.
2. Perform transcription using the `eval_beamsearch_ngram script <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py>`_:

.. code-block:: bash
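
# Hedged sketch: paths are placeholders and the arguments mirror the example
# shown earlier on this page; see `python eval_beamsearch_ngram_ctc.py --help`.
python eval_beamsearch_ngram_ctc.py nemo_model_file=<path to the .nemo file of the model> \
    input_manifest=<path to the evaluation JSON manifest file> \
    kenlm_model_file=<path to the binary KenLM model> \
    beam_width=[128] beam_alpha=[1.0] beam_beta=[1.0] \
    decoding_mode=beamsearch_ngram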

4 changes: 2 additions & 2 deletions docs/source/core/core.rst
@@ -294,8 +294,8 @@ CLI
With NeMo and Hydra, every aspect of model training can be modified from the command-line. This is extremely helpful for running lots
of experiments on compute clusters or for quickly testing parameters during development.

All NeMo `examples <https://github.com/NVIDIA/NeMo/tree/v1.0.2/examples>`_ come with instructions on how to
run the training/inference script from the command-line (see `here <https://github.com/NVIDIA/NeMo/blob/4e9da75f021fe23c9f49404cd2e7da4597cb5879/examples/asr/asr_ctc/speech_to_text_ctc.py#L24>`__
All NeMo `examples <https://github.com/NVIDIA/NeMo/tree/stable/examples>`_ come with instructions on how to
run the training/inference script from the command-line (e.g. see `here <https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_ctc/speech_to_text_ctc.py>`__
for an example).

With Hydra, arguments are set using the ``=`` operator:
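
A minimal sketch of such an override (the script path and parameter values are illustrative assumptions):

.. code-block:: bash

python examples/asr/asr_ctc/speech_to_text_ctc.py \
    trainer.devices=2 \
    model.train_ds.batch_size=32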