Timestamps to transcribe #10950

nithinraok · 2024-10-18T21:30:25Z

What does this PR do ?

Adds support for extracting timestamps through the .transcribe() method.

Collection: ASR

Changelog

Adds timestamps=None/True/False to the .transcribe() method in the mixin
- None (default): does nothing and keeps the state set outside with change_decoding_strategy
- True: enables timestamps by turning on return_hypothesis and compute_timestamps in the decoding strategy
- False: disables timestamps by turning off return_hypothesis and compute_timestamps in the decoding strategy
Adds corresponding support in
- ctc_models.py
- rnnt_models.py
- hybrid_rnnt_ctc_models.py
- Raises a NotImplementedError for AED-based models (Canary)
Adds support to transcribe_speech.py
- merges two variables into one (compute_timestamps, preserve_alignments -> timestamps), since the two depend on each other
- cleans up much of the code
Adds an optional verbose option to the change_decoding_strategy method (default: True)
Moves some of the model loading to conftest.py to improve setup time for each module
Adds unit tests for the timestamps option for CTC and hybrid models
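
The three-way behaviour of the timestamps argument described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: DecodingFlags and resolve_timestamps are hypothetical names, and the two fields simply mirror the settings named in the changelog.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DecodingFlags:
    """Hypothetical stand-in for the two decoding settings named in the changelog."""

    return_hypothesis: bool = False
    compute_timestamps: bool = False


def resolve_timestamps(timestamps: Optional[bool], current: DecodingFlags) -> DecodingFlags:
    """Return the flags to decode with for a given `timestamps` argument."""
    if timestamps is None:
        # None: leave whatever was configured outside transcribe() untouched.
        return current
    # True enables both settings together; False disables both together.
    return DecodingFlags(return_hypothesis=timestamps, compute_timestamps=timestamps)
```

Treating None as "do not touch" is what lets a strategy configured beforehand survive a plain transcribe() call.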

Usage

From command-line

With the transcribe_speech.py script:

python transcribe_speech.py pretrained_name="nvidia/parakeet-ctc-1.1b" \
    dataset_manifest=<manifest_path> \
    output_filename=<output_filename> \
    timestamps=True

From Python Env

For CTC based models

from nemo.collections.asr.models import ASRModel
ctc_model = ASRModel.from_pretrained('nvidia/parakeet-ctc-1.1b')
output = ctc_model.transcribe(['<file_path>'], timestamps=True)  # or pass a manifest instead of individual file paths
# By default you get char-, word-, and segment-level timestamps. Segment boundaries
# depend on whether the model natively supports punctuation and capitalization.
# Word-level timestamps: print the first 10 entries. The *_offset fields are frame
# numbers; start and end are in seconds.
print(output[0].timestep['word'][:10])
# Segment-level timestamps
print(output[0].timestep['segment'][:10])
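
As a rough illustration of consuming the word-level entries, the sketch below formats them as "start-end word" lines. The entry layout is an assumption: the *_offset, start and end fields follow the comment above, while the 'word' key and the sample values are made up for the example and are not real model output.

```python
from typing import Dict, List, Union

WordStamp = Dict[str, Union[str, int, float]]


def format_word_timestamps(words: List[WordStamp]) -> List[str]:
    """Render one 'start-end  word' line per entry, using the second-based fields."""
    lines = []
    for entry in words:
        lines.append(f"{entry['start']:.2f}-{entry['end']:.2f}  {entry['word']}")
    return lines


# Made-up sample data for illustration only; the 'word' key is an assumption.
sample = [
    {"word": "hello", "start_offset": 3, "end_offset": 7, "start": 0.24, "end": 0.56},
    {"word": "world", "start_offset": 9, "end_offset": 14, "start": 0.72, "end": 1.12},
]
for line in format_word_timestamps(sample):
    print(line)
```

With real output, the list passed in would be something like output[0].timestep['word'].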

For RNNT/TDT based models

(Currently the only difference from the CTC case is the output type; this will be made consistent in an upcoming PR.)

from nemo.collections.asr.models import ASRModel
transducer_model = ASRModel.from_pretrained('nvidia/parakeet-rnnt-1.1b')
output = transducer_model.transcribe(['<file_path>'], timestamps=True)
# Word-level timestamps
print(output[0][0].timestep['word'][:10])
# Segment-level timestamps
print(output[0][0].timestep['segment'][:10])
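
Until the output types are unified, calling code that handles both model families may want a small shim. The helper below is a hypothetical sketch (first_hypothesis is not part of NeMo) that relies only on the indexing shown in the examples: output[0] for CTC and output[0][0] for RNNT/TDT.

```python
from typing import Any


def first_hypothesis(output: Any) -> Any:
    """Return the hypothesis for the first audio file from either output shape."""
    first = output[0]
    # Transducer-style output: output[0] is itself a sequence of hypotheses.
    if isinstance(first, (list, tuple)):
        return first[0]
    # CTC-style output: output[0] is already the hypothesis.
    return first
```

Usage would then be first_hypothesis(output).timestep['word'] for either model type, assuming the hypothesis object itself is not a list or tuple.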

For Hybrid RNNT/TDT-CTC models

Same as above. By default, decoding uses the transducer (RNNT/TDT) decoder. To use the CTC decoder instead, change the decoding strategy before calling transcribe(), like:

from nemo.collections.asr.models import ASRModel
from nemo.collections.asr.parts.submodules.ctc_decoding import CTCDecodingConfig
hybrid_model = ASRModel.from_pretrained('nvidia/parakeet-tdt_ctc-110m')
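ctc_cfg = CTCDecodingConfig()
ctc_cfg.strategy = "greedy_batch"  # the strategy field selects the CTC decoding algorithm
# Switch the hybrid model from its default transducer decoder to the CTC decoder
hybrid_model.change_decoding_strategy(decoding_cfg=ctc_cfg, decoder_type="ctc")
output = hybrid_model.transcribe(['<file_path>'], timestamps=True)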
# for word-level timestamps
print(output[0].timestep['word'][:10])
# for segment-level timestamps
print(output[0].timestep['segment'][:10])

For AED Models

For AED models like Canary, timestamp support will be added soon.
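
In the meantime, code that must work across model types could guard the call. This is a sketch under the assumption, taken from the changelog, that requesting timestamps on an AED model raises NotImplementedError from transcribe(); the helper name is hypothetical.

```python
def transcribe_with_optional_timestamps(model, files):
    """Try timestamps first; fall back to plain transcription if unsupported.

    Returns a (output, has_timestamps) pair.
    """
    try:
        return model.transcribe(files, timestamps=True), True
    except NotImplementedError:
        # Assumed behaviour for AED models, per the changelog in this PR.
        return model.transcribe(files), False
```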

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

nemo/collections/asr/models/hybrid_rnnt_ctc_models.py

nemo/collections/asr/models/rnnt_models.py

nemo/collections/asr/parts/mixins/transcription.py
