# nf-whisper

Automatic Speech Recognition (ASR) Nextflow pipeline using OpenAI Whisper

`nf-whisper` is a simple Nextflow pipeline that leverages OpenAI's Whisper pre-trained models to generate transcriptions and translations from YouTube videos and audio files. Key features include:
- Automatic transcription and translation of audio content
- YouTube video downloading and audio extraction
- Support for various Whisper pre-trained models
- Flexible input options: YouTube URLs or local audio files
- Optional timestamp generation for transcriptions
This pipeline streamlines the process of converting speech to text, making it easier for researchers, content creators, and developers to work with audio data.
- Install Nextflow: If you don't have Nextflow installed, visit nextflow.io for installation instructions (a typical install one-liner is shown right after this list).
- Install Docker: This pipeline uses Docker to manage dependencies. Install Docker from docker.com.
- Build the Docker image (or use Wave): From the root directory of this repository, run:

  ```bash
  docker build . -t whisper
  ```

  Alternatively, you can use Wave to build the container image remotely and on-the-fly: just run the Nextflow commands below with `-with-wave` instead of `-with-docker whisper` (see the sketch after this list).
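If you need Nextflow, the standard installer from nextflow.io can be fetched with a one-liner (this assumes a recent Java runtime is already available, as described in the Nextflow documentation):

```bash
# Download the Nextflow launcher into the current directory
curl -s https://get.nextflow.io | bash

# Make it executable and move it onto your PATH
chmod +x nextflow
sudo mv nextflow /usr/local/bin/
```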
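With Wave, the first quickstart command below would look something like this (a sketch; it assumes your Nextflow version includes Wave support, so the container is built remotely from this repository's Dockerfile instead of with a local `docker build`):

```bash
nextflow run main.nf \
    --youtube_url https://www.youtube.com/watch\?v\=UVzLd304keA \
    --model small.en \
    -with-wave
```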
- Install Nextflow and Docker (if not already installed).
- Run the pipeline by providing a YouTube URL with the `--youtube_url` parameter:

  ```bash
  nextflow run main.nf --youtube_url https://www.youtube.com/watch\?v\=UVzLd304keA --model small.en -with-docker whisper
  ```

- For local audio files, use the `--file` parameter:

  ```bash
  nextflow run main.nf --file audio_sample.wav --model small.en -with-docker whisper
  ```
Now that you have the basic setup working, let's explore more advanced features.
- Generate transcriptions with timestamps using the `--timestamp` parameter:

  ```bash
  nextflow run main.nf --youtube_url https://www.youtube.com/watch\?v\=UVzLd304keA --model small.en --timestamp -with-docker whisper
  ```

- Use a different model with the `--model` parameter:

  ```bash
  nextflow run main.nf --youtube_url https://www.youtube.com/watch\?v\=UVzLd304keA --model tiny -with-docker whisper
  ```

- Provide a local model file, also via the `--model` parameter:

  ```bash
  nextflow run main.nf --youtube_url https://www.youtube.com/watch\?v\=UVzLd304keA --model /path/to/model.pt -with-docker whisper
  ```

- Check out the help with:

  ```bash
  nextflow run main.nf --help
  ```
There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. The table below shows the available models and their approximate memory requirements and relative speed.
| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|--------|------------|--------------------|--------------------|---------------|----------------|
| tiny   | 39 M       | `tiny.en`          | `tiny`             | ~1 GB         | ~32x           |
| base   | 74 M       | `base.en`          | `base`             | ~1 GB         | ~16x           |
| small  | 244 M      | `small.en`         | `small`            | ~2 GB         | ~6x            |
| medium | 769 M      | `medium.en`        | `medium`           | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | `large`            | ~10 GB        | 1x             |
For English-only applications, the `.en` models tend to perform better, especially for the `tiny.en` and `base.en` models. The performance difference becomes less significant for the `small.en` and `medium.en` models.
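The VRAM figures above only matter when Whisper runs on a GPU. As a minimal, hypothetical sketch (not part of this repository; it assumes an NVIDIA GPU, the NVIDIA Container Toolkit, and the `whisper` image built earlier), the host GPU could be exposed to the container via `nextflow.config`:

```groovy
// nextflow.config -- hypothetical sketch, not shipped with this pipeline.
// Enable Docker for all processes and pass the host GPU through so the
// medium/large models get the VRAM listed in the table above.
docker {
    enabled    = true
    runOptions = '--gpus all'  // requires the NVIDIA Container Toolkit
}

process {
    container = 'whisper'      // the image built with `docker build . -t whisper`
}
```

With such a configuration in place, the `-with-docker whisper` flag could be dropped from the commands above, since Docker and the container image are already set in the config.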
The section above was adapted from the README of Matthias Zepper's amazing work on dockerizing Whisper with support for GPUs! This Nextflow pipeline was heavily influenced by Matthias' work, the official OpenAI Whisper GitHub repository, and some other blog posts I read, mostly this and this.