Releases: shashikg/WhisperS2T
v1.3.1
New Features
- Transcript Exporter: Can be used to save predicted transcripts to `vtt`, `srt`, `json`, `tsv`, or `txt` files. (Doc: https://github.com/shashikg/WhisperS2T/blob/main/docs.md#write-transcripts-to-a-file)
- Prebuilt docker images: Released ready-to-use prebuilt docker images. (Doc: https://github.com/shashikg/WhisperS2T?tab=readme-ov-file#from-docker-container)
- Option to use a single `lang_code` or `task` instead of a list when all the audio files belong to the same language/task. #27 (A usage sketch follows this list.)
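A minimal sketch combining both additions. The exporter entry point (`write_outputs` here) and the string-valued `lang_codes`/`tasks` arguments are taken from the linked docs and #27; treat the exact names as assumptions and verify against the documentation:

```python
import whisper_s2t

model = whisper_s2t.load_model(model_identifier="large-v2", backend='CTranslate2')

files = ['audio_a.wav', 'audio_b.wav']  # placeholder file names

# Single strings instead of one-entry-per-file lists (new in this release),
# since every file here is English speech to be transcribed.
out = model.transcribe_with_vad(files,
                                lang_codes='en',
                                tasks='transcribe',
                                batch_size=16)

# Export the predicted transcripts; format can be vtt/srt/json/tsv/txt.
whisper_s2t.write_outputs(out, format='srt', ip_files=files, save_dir='./transcripts')
```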
Bug Fixes
- Fix the without-VAD `transcribe` function by @shashikg in #15 (Doc: https://github.com/shashikg/WhisperS2T/blob/main/docs.md#run-without-vad-model; a usage sketch follows this list)
- Fix issue with silent files by @shashikg in #12
- Fixed missing dependency and tensorrt-llm failures by @shashikg in #32
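For reference, a sketch of the VAD-free path fixed in #15, assuming `model.transcribe` mirrors the `transcribe_with_vad` signature shown in the linked docs:

```python
import whisper_s2t

model = whisper_s2t.load_model(model_identifier="large-v2", backend='CTranslate2')

# No VAD-based segmentation: audio is chunked into fixed windows
# rather than detected speech segments.
out = model.transcribe(['audio.wav'],
                       lang_codes=['en'],
                       tasks=['transcribe'],
                       initial_prompts=[None],
                       batch_size=16)
print(out[0][0]['text'])
```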
Full Changelog: v1.3.0...v1.3.1
v1.3.0
Release Notes
- Support for TensorRT-LLM Backend
- Inclusion of Example Notebooks
TensorRT-LLM Backend
WhisperS2T now supports NVIDIA's TensorRT-LLM backend (https://github.com/NVIDIA/TensorRT-LLM), delivering a further twofold improvement in inference time over the CTranslate2 backend. The current optimal configuration on an A30 GPU transcribes a 1-hour file in approximately 18 seconds.
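Switching backends is a one-line change in `load_model`; a sketch assuming the backend is selected with the string 'TensorRT-LLM' (check the README for the exact accepted aliases):

```python
import whisper_s2t

# The first call builds/loads the TensorRT engine, so expect a one-time warm-up cost.
model = whisper_s2t.load_model(model_identifier="large-v2", backend='TensorRT-LLM')

out = model.transcribe_with_vad(['audio.wav'],
                                lang_codes=['en'],
                                tasks=['transcribe'],
                                initial_prompts=[None],
                                batch_size=24)
```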
v1.2.0
Release Notes
- Fixed `ffmpeg` resampling issue by adding an option to use the `swr` resampler in case `soxr` is not available.
- Added word timestamp feature.
Word Timestamp Benchmarks
| Model Name | Acc. Overlapped | Acc. Within Collar (0.1s) | Acc. Within Collar (0.2s) | Acc. Within Collar (0.5s) | Acc. Within Collar (1.0s) | Total Word Hits | Inference Time |
|---|---|---|---|---|---|---|---|
| WhisperS2T (ASR: whisper-large-v2, Aligner: whisper-tiny) | 66.21 | 38.67 | 60.8 | 76.06 | 85.82 | 64350 | 2.6x |
| WhisperS2T (ASR: whisper-large-v2, Aligner: whisper-large-v2) | 66.72 | 48.95 | 58.54 | 73.44 | 84.0 | 64350 | 1.6x |
| WhisperX (ASR: whisper-large-v2, Aligner: wav2vec) | 55.65 | 50.66 | 55.84 | 66.18 | 75.57 | 64307 | 1x |
We used the Whisper model itself for alignment. We observed that Whisper-based alignment and phoneme-level alignment (as in WhisperX) yield similar performance. However, using Whisper provides several advantages, including out-of-the-box support for all languages; phoneme-level alignment requires a separate model for every new language, which somewhat diminishes the advantage of using the Whisper model in the first place. Moreover, using the `whisper-tiny` model for word alignment incurs very little latency overhead without affecting alignment accuracy. We used the AMI-MIX-Headset-Test dataset for benchmarking.
There's no properly defined metric for estimating word alignment accuracy. Hence, we introduce a new metric to accurately estimate the performance of word alignment. Check this function: Word Alignment Metric Function.
The proposed metric performs the following steps:
- Initially, it identifies the words detected in the predicted transcript when compared against the reference transcript. This step is crucial because words that are missed or inserted in the predicted transcript should not be considered when evaluating word alignment accuracy.
- After identifying the detected words, we calculate two values: `overlapped_words` and `words_within_collar`. Finally, we divide both values by the total number of detected words. (A sketch of the computation follows.)
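A minimal sketch of the metric under those definitions (a hypothetical re-implementation, not the repository's function; it assumes detected words have already been matched to reference words, e.g. via edit-distance alignment):

```python
def word_alignment_accuracy(matched_pairs, collar=0.5):
    """matched_pairs: list of (ref, pred) dicts for *detected* words only,
    each with 'start' and 'end' times in seconds. Missed or inserted words
    are excluded before calling this, as described above."""
    overlapped = 0
    within_collar = 0
    for ref, pred in matched_pairs:
        # Overlapped: the predicted interval intersects the reference interval.
        if pred['start'] < ref['end'] and pred['end'] > ref['start']:
            overlapped += 1
        # Within collar: both boundaries lie within `collar` seconds of the reference.
        if (abs(pred['start'] - ref['start']) <= collar
                and abs(pred['end'] - ref['end']) <= collar):
            within_collar += 1
    n = len(matched_pairs)
    return overlapped / n, within_collar / n
```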
v1.1.0
v1.0.0
Initial Release
WhisperS2T ⚡
WhisperS2T is an optimized lightning-fast speech-to-text pipeline tailored for the whisper model! It's designed to be exceptionally fast, boasting a 1.5X speed improvement over WhisperX and a 2X speed boost compared to HuggingFace Pipeline with FlashAttention 2 (Insanely Fast Whisper). Moreover, it includes several heuristics to enhance transcription accuracy.
Whisper is a general-purpose speech recognition model developed by OpenAI. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
Benchmark and Technical Report
Stay tuned for a technical report comparing WhisperS2T against other whisper pipelines. Meanwhile, check some quick benchmarks on an A30 GPU.
Features
- 🔄 Multi-Backend Support: Support for various Whisper model backends including Original OpenAI Model, HuggingFace Model with FlashAttention2, and CTranslate2 Model.
- 🎙️ Easy Integration of Custom VAD Models: Seamlessly add custom Voice Activity Detection (VAD) models to enhance control and accuracy in speech recognition.
- 🎧 Effortless Handling of Small or Large Audio Files: Intelligently batch smaller speech segments from various files, ensuring optimal performance.
- ⏳ Streamlined Processing for Large Audio Files: Asynchronously loads large audio files in the background while transcribing segmented batches, notably reducing loading times.
- 🌐 Batching Support with Multiple Language/Task Decoding: Decode multiple languages or perform both transcription and translation in a single batch, improving versatility and reducing transcription time (see the sketch after this list).
- 🧠 Reduction in Hallucination: Optimized parameters and heuristics to decrease repeated text output or hallucinations.
- ⏱️ Dynamic Time Length Support (Experimental): Process variable-length inputs within a batch instead of fixed 30-second windows, providing flexibility and saving computation time during transcription.
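As an example of the batching feature, one batch can mix languages and tasks; this reuses the `transcribe_with_vad` call documented in the Usage section below (file names are placeholders):

```python
import whisper_s2t

model = whisper_s2t.load_model(model_identifier="large-v2", backend='CTranslate2')

# One batch, two files in different languages with different tasks:
# transcribe the English file, translate the Hindi file to English.
out = model.transcribe_with_vad(['english_audio.wav', 'hindi_audio.wav'],
                                lang_codes=['en', 'hi'],
                                tasks=['transcribe', 'translate'],
                                initial_prompts=[None, None],
                                batch_size=32)
```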
Getting Started
Installation
Install audio packages required for resampling and loading audio files.
```sh
apt-get install -y libsndfile1 ffmpeg
```
To install or update to the latest released version of WhisperS2T use the following command:
```sh
pip install -U whisper-s2t
```
Or to install from latest commit in this repo:
```sh
pip install -U git+https://github.com/shashikg/WhisperS2T.git
```
Usage
```python
import whisper_s2t

model = whisper_s2t.load_model(model_identifier="large-v2", backend='CTranslate2')

files = ['data/KINCAID46/audio/1.wav']
lang_codes = ['en']
tasks = ['transcribe']
initial_prompts = [None]

out = model.transcribe_with_vad(files,
                                lang_codes=lang_codes,
                                tasks=tasks,
                                initial_prompts=initial_prompts,
                                batch_size=32)

print(out[0][0])
"""
[Console Output]

{'text': "Let's bring in Phil Mackie who is there at the palace. We're looking at Teresa and Philip May. Philip, can you see how he's being transferred from the helicopters? It looks like, as you said, the beast. It's got its headlights on because the sun is beginning to set now, certainly sinking behind some clouds. It's about a quarter of a mile away down the Grand Drive",
 'avg_logprob': -0.25426941679184695,
 'no_speech_prob': 8.147954940795898e-05,
 'start_time': 0.0,
 'end_time': 24.8}
"""
```
Check this Documentation for more details.
License
This project is licensed under the MIT License - see the LICENSE file for details.