Use Silero VAD in Batched Mode #936

MahmoudAshraf97 · 2024-07-28T18:36:02Z

This PR tries to close the gap between the batched and sequential versions.
Summary of changes:

Reimplemented the Silero model; inference is now 3x faster.
The batched pipeline now uses Silero instead of pyannote VAD, which reduces the amount of code in the repo, since it no longer has to handle two VAD models.
Added a script to evaluate WER on the YouTube Commons ASR dataset (useful for long-form and batched evaluation).
Unified the batched and sequential transcribe functions as much as I could.

WER Comparisons

Batched (without_timestamps=True, vad_filter=True, chunk_length=25) on Youtube Commons using distil-large-v3:
Before: WER: 13.910
After: WER: 13.712
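
For readers unfamiliar with the metric: WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The following is a minimal sketch of that definition, not the evaluation script added in this PR (which may normalize text differently); the names `edit_distance` and `wer` are chosen for illustration only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER in percent: total word errors / total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.lower().split()
        hyp_tokens = hyp.lower().split()
        errors += edit_distance(ref_tokens, hyp_tokens)
        words += len(ref_tokens)
    if words == 0:
        raise ValueError("reference corpus contains no words")
    return 100.0 * errors / words
```

Summing errors and reference words over the whole corpus before dividing weights long files more heavily than averaging per-file WER does, which is a common choice for long-form evaluation.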

The VAD parameters are not fully tuned, and I don't have the resources to evaluate on multilingual datasets; contributions are welcome.

hoonlight · 2024-07-29T02:54:16Z

When I used the batch version, I got better transcription results compared to the sequential version. I'm not sure if this is due to pyannote VAD or if there is an additional process in the batch version that improves WER. Have you ever compared Silero VAD with pyannote VAD?

By the way, thank you for your contribution to improving faster-whisper. Even though it was a well-discussed and approved PR, anyone is entitled to have their opinion about it, but no one has the right to be rude.

Jiltseb · 2024-07-29T07:59:26Z

Batched mode can indeed give better results for long-form transcription. No context is passed between batches, which prevents ambiguous text in the previous context from carrying over into the next set of frames.

Thanks for your kind words regarding the batched PR.

@MahmoudAshraf97 I would suggest adding the numbers with pyannote VAD and silero VAD (WER and the speed-up) for completeness.

zh-plus · 2024-07-29T15:37:04Z

Is it better to let users choose the VAD model from pyannote VAD or Silero VAD?

I get better VAD segments for Chinese & Japanese audios with pyannote than Silero, even though I try hard to tune the VAD-related parameters for Silero.

Other users have also encountered this kinda issue: #925, #934.

Jiltseb · 2024-07-29T15:54:25Z

The pyannote model could be a superior VAD, but the extra dependency on pyannote and torch is a concern at the moment.

MahmoudAshraf97 · 2024-07-29T15:55:19Z

@zh-plus it can be an option of course, but keeping pyannote would force us to keep PyTorch in the requirements, which we are trying to remove based on user feedback. I'm trying to think of a structure that makes the whole batching feature optional, with optional dependencies for those who want it.

MahmoudAshraf97 · 2024-07-30T18:30:57Z

Performance numbers added. Tests pass locally but fail on CI because torchaudio can't find a backend to use: the backends are no longer installed after the removal of pyannote (along with 78 other packages, so I guess it's a win).
This PR should not be merged until we do one of the following:

Include soundfile or sox in the requirements as a backend for torchaudio.
Revert to PyAV and use manual garbage collection to avoid the resampler memory leak if needed; this would bring us one step closer to removing torch completely.

ozancaglayan · 2024-08-13T16:21:42Z

Thanks for the PR!

Could you add your script that exports the Silero V5 model to encoder and decoder ONNX files? Also, why does it help to separate the model into two ONNX sessions for the performance?

ozancaglayan · 2024-08-13T16:26:15Z

faster_whisper/vad.py

-    min_speech_duration_ms: int = 250
+    onset: float = 0.5
+    offset: float = onset - 0.15
+    min_speech_duration_ms: int = 0


Can you maybe leave these options (threshold, onset, offset) as they were, e.g. not rename them as it would break signature & parameter passing APIs?

Why are you changing min_speech_duration_ms to 0? I think 250ms is a sane default otherwise you may end up with segments that are very small for having speech inside, maybe even empty ones?

MahmoudAshraf97 · 2024-08-13

It's best to give users the freedom to tune the parameters as they wish. Previously offset was fixed to threshold - 0.15, but now users can tune it without having to touch the code internals. It might not be backwards compatible, but adapting to it is a very minimal change.

As for min_speech_duration_ms, benchmarks (YT Commons and LibriSpeech) showed that dropping it from 250 to 0 had a minimal positive effect or no effect on sequential inference, but a very positive impact on batched inference, as it combines segments differently than the sequential version.
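
To make the role of two separate thresholds concrete: with hysteresis, speech starts when the probability reaches `onset` and ends only once it falls below `offset`. The sketch below is a simplified illustration of that idea, not the code in faster_whisper/vad.py; the function name `hysteresis_segments` and the frame-index output are assumptions, and padding, minimum silence, and minimum speech duration handling are left out.

```python
def hysteresis_segments(probs, onset=0.5, offset=0.35):
    """Return (start, end) frame index pairs, end exclusive.

    A segment opens when a probability reaches `onset` and closes at the
    first later frame whose probability drops below `offset`.
    """
    segments = []
    start = None
    for i, p in enumerate(probs):
        if start is None and p >= onset:
            start = i                      # speech begins
        elif start is not None and p < offset:
            segments.append((start, i))    # speech ends
            start = None
    if start is not None:                  # audio ended while still in speech
        segments.append((start, len(probs)))
    return segments
```

With `offset` lower than `onset`, a brief dip in probability does not split a segment; setting `offset` equal to `onset` reduces this to a single-threshold detector.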

MahmoudAshraf97 · 2024-08-13T17:15:32Z

V4
V5

As for the reason: Silero models in general need the output of the previous sample to produce a correct output for the next one, but that dependency exists only in the decoder stage, which accounts for a small share of the total computation cost. By splitting the model into an encoder and a decoder and batching the input to the encoder only, we gain a 3x speedup while still generating identical outputs.
For more details, check this discussion in the original repo.
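
The idea can be sketched with toy stand-ins. In the example below, `encode` and `decode` are hypothetical placeholders for the expensive stateless stage and the cheap stateful stage; they are not the Silero architecture. The point is only that running the encoder over all frames first and then stepping the decoder sequentially gives the same outputs as running both stages frame by frame.

```python
def encode(frame):
    """Stateless stand-in for the expensive encoder: depends only on its own frame."""
    return sum(x * x for x in frame) / len(frame)


def decode(feature, state):
    """Stateful stand-in for the cheap decoder: needs the previous state."""
    new_state = 0.9 * state + 0.1 * feature
    return new_state, new_state  # (output, next state)


def run_sequential(frames):
    """Encoder and decoder both run one frame at a time."""
    state, outputs = 0.0, []
    for frame in frames:
        out, state = decode(encode(frame), state)
        outputs.append(out)
    return outputs


def run_split(frames):
    """Encoder runs over all frames up front; only the decoder is sequential."""
    # No dependency between frames here, so a real model could process
    # them in a single batched forward pass.
    features = [encode(frame) for frame in frames]
    state, outputs = 0.0, []
    for feature in features:
        out, state = decode(feature, state)
        outputs.append(out)
    return outputs
```

Because the decoder sees the same features in the same order in both versions, the outputs match exactly; the gain comes entirely from batching the stateless stage.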

ozancaglayan · 2024-08-14T10:24:18Z

Thanks. Out of curiosity did you find those reference implementations elsewhere or did you rewrite them based on JIT'ted or is there a way to automatically generate from JIT'ted models?

PS: OK I think you can get the compiled graph from .code variables but that one does not seem to be a pure python implementation.

MahmoudAshraf97 · 2024-08-14T10:42:56Z

I reimplemented it from scratch based on what I could understand from the JITed code and mapped the weights manually using the dictionary. Both implementations are within 1e-5 tolerance of the original implementation.

MahmoudAshraf97 · 2024-08-14T15:12:59Z

Reverted to PyAV in #961. Once that is merged, and then this one, we can get rid of the torch dependency.

Jiltseb · 2024-08-14T15:39:37Z

Nice. I have also reimplemented a numpy version to get rid of the torch dependency, but I will stick with this approach of removing torch in two steps. I will test for the memory leak and report in #961.

kenho211 · 2024-08-26T07:06:59Z

Encountered another error for audio without speech, not the same one as in #973:
File "/home/ubuntu/.local/lib/python3.10/site-packages/faster_whisper/transcribe.py", line 362, in transcribe
    clip_timestamps = merge_segments(active_segments, vad_parameters)
File "/home/ubuntu/.local/lib/python3.10/site-packages/faster_whisper/vad.py", line 315, in merge_segments
    curr_start = segments_list[0]["start"]
IndexError: list index out of range

Can we just return an empty list in merge_segments if segments_list is empty?
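
A guard of the kind suggested here could look roughly like the sketch below. `merge_segments_sketch` is a simplified, hypothetical stand-in with its own signature (`max_len` instead of the real VAD options), shown only to illustrate the early return; it is not the function in faster_whisper/vad.py.

```python
def merge_segments_sketch(segments_list, max_len):
    """Greedily group speech segments into chunks spanning at most max_len."""
    if not segments_list:
        # No speech detected: nothing to merge, and indexing
        # segments_list[0] below would raise IndexError.
        return []
    merged = []
    curr_start = segments_list[0]["start"]
    curr_end = segments_list[0]["end"]
    for seg in segments_list[1:]:
        if seg["end"] - curr_start > max_len:
            # Adding this segment would exceed the limit: close the chunk.
            merged.append({"start": curr_start, "end": curr_end})
            curr_start = seg["start"]
        curr_end = seg["end"]
    merged.append({"start": curr_start, "end": curr_end})
    return merged
```

Callers then receive an empty list for silent audio and can skip transcription instead of handling an exception.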

MahmoudAshraf97 · 2024-08-26T13:00:25Z

should be fixed now

hobodrifterdavid · 2024-09-08T16:43:43Z

Hi. I'm running a lot of audio through the batch transcribe function on this PR, getting a couple of exceptions on some files:

Appreciate the work guys.

MahmoudAshraf97 · 2024-09-08T16:45:29Z

@hobodrifterdavid can you upload audios that reproduce the two exceptions?

hobodrifterdavid · 2024-09-08T18:44:48Z

I don't have the clips on hand. I just added a check to make sure the audio clips I am sending are at least 5s long (it's possible I was requesting transcription of some zero-length files), and I will improve the logging to record what is processing when an error occurs, will let you know if I see the error again.

If the passed audio data has zero length, it might be wise to throw a specific error up-front 'Passed audio is zero samples long' etc., if you don't already.
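
An up-front check along these lines might look like the following sketch. The helper name `validate_audio` and its `min_duration_s` parameter are hypothetical and not part of faster-whisper; the sketch only assumes the audio is a sequence of samples at a known sampling rate.

```python
def validate_audio(samples, sampling_rate=16000, min_duration_s=0.0):
    """Raise a specific error early instead of failing deep in the pipeline.

    Returns the audio duration in seconds when the input is acceptable.
    """
    if len(samples) == 0:
        raise ValueError("Passed audio is zero samples long")
    duration = len(samples) / sampling_rate
    if duration < min_duration_s:
        raise ValueError(
            f"Passed audio is {duration:.2f}s long, "
            f"shorter than the required {min_duration_s:.2f}s"
        )
    return duration
```

Calling this before VAD and feature extraction turns an obscure downstream exception into a message that names the actual problem.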

Jiltseb · 2024-10-23

Added some minor comments. I have tested Silero on batched version and got similar WER, but the speed is 60% slower compared to previous VAD. This is on a test set of 9 youtube videos with various audio types and a length from 3-13 minutes. With Silero, it is still at least 2x faster than sequential version. With pyannote VAD it was 3.8x faster.

Have you seen this speed difference?

Jiltseb · 2024-10-23T15:12:39Z

faster_whisper/vad.py

 def merge_segments(segments_list, vad_options: VadOptions):
    curr_end = 0
    seg_idxs = []
    merged_segments = []
-    edge_padding = vad_options.speech_pad_ms / 1000
-    chunk_length = vad_options.max_speech_duration_s
+    sampling_rate = 16000


Use sampling_rate as a function argument which defaults to 16000. Avoid hard coding for sampling rate and such audio related variables.

400ms edge padding can contain multiple syllables if the start and previous end times are closer (let's say 100ms). Any reason for keeping it 400ms instead of 100ms?

MahmoudAshraf97 · 2024-10-23

If the distance between two segments is less than 2 * edge_padding, they are merged together, so it's guaranteed that no audio is included twice. I found that increasing or decreasing the padding value didn't make much difference, so I left it as is to allow for a higher error margin.
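
The non-overlap guarantee can be illustrated with a small sketch. `pad_and_merge` is a hypothetical helper, not the PR's merge_segments; it assumes segments are sorted, non-overlapping (start, end) pairs in samples, and it takes the sampling rate as an argument rather than hard-coding it. Clipping the last segment to the audio length is left out.

```python
def pad_and_merge(segments, speech_pad_ms=400, sampling_rate=16000):
    """Merge segments closer than two pads apart, then pad each side."""
    pad = speech_pad_ms * sampling_rate // 1000  # padding in samples
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < 2 * pad:
            # Gap is too small to fit one pad from each neighbour: merge.
            merged[-1][1] = end
        else:
            merged.append([start, end])
    # Remaining gaps are at least 2 * pad wide, so padded segments
    # cannot overlap and no audio is included twice.
    return [(max(0, s - pad), e + pad) for s, e in merged]
```

With 400 ms padding at 16 kHz, two segments 0.5 s apart are merged, while segments 2 s apart stay separate and still have 1.2 s of untouched audio between their padded edges.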

As for the speedups, I found both implementations to be almost identical, or within measurement error. My specs are:
i7 12700k
RTX 3070 Ti
32GB RAM

Even if the Silero implementation is slightly slower, it's worth it because of the simpler requirements and the increased code reuse.

Jiltseb · 2024-10-23

Makes sense for the edge_padding, and I agree that Silero makes the codebase lean and easy to maintain. Do you have the audio file you tested?

MahmoudAshraf97 · 2024-10-24

I tested on the YT Commons dataset.
pyannote VAD:
Evaluating...: 94it [25:32, 16.31s/it]
WER: 13.976

Silero VAD:
Evaluating...: 94it [26:22, 16.83s/it]
WER: 13.756

Jiltseb · 2024-10-23

Add sampling_rate as an argument in merge_segments function as well and remove hard coded sampling rate (L318)

toanhuynhnguyen · 2024-11-04T01:19:19Z

After installing:

pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/refs/heads/master.tar.gz"

I run the code:

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

I get this error:

Unable to load any of {libcudnn_ops.so.9.1.0, libcudnn_ops.so.9.1, libcudnn_ops.so.9, libcudnn_ops.so}
Invalid handle. Cannot load symbol cudnnCreateTensorDescriptor
Aborted (core dumped)

Can anyone help me with this error, thank you so much.

@MahmoudAshraf97

MahmoudAshraf97 mentioned this pull request Jul 29, 2024

Batching inference commit should be reverted and applied part-by-part for community adaptation !!!! #937

Open

MahmoudAshraf97 marked this pull request as draft July 29, 2024 11:33

MahmoudAshraf97 marked this pull request as ready for review July 30, 2024 18:31

This was referenced Aug 8, 2024

Poor output quality and speed when using batched inference #954

Closed

1.0.3 VAD v5 is much worse than 1.0.2 VAD v4 #934

Open

Purfview mentioned this pull request Aug 13, 2024

Revert the batched process updates for the future of the faster-whisper #940

Closed

ozancaglayan reviewed Aug 13, 2024

ben256 mentioned this pull request Aug 14, 2024

added faster whisper pipeline kaiconversations/whisperX#4

Merged

MahmoudAshraf97 mentioned this pull request Aug 14, 2024

revert back to using PyAV instead of torch audio #961

Merged

MahmoudAshraf97 changed the title ~~Use Silero VAD in Batched Mode, Other Vad refactors in Sequential mode~~ Use Silero VAD in Batched Mode Aug 20, 2024

This was referenced Aug 22, 2024

Return empty generator when no active speech in transcribe #973

Closed

OAI Whisper transcribes correctly but whisperx returns No active speech found in audio m-bain/whisperX#844

Open

MahmoudAshraf97 mentioned this pull request Sep 11, 2024

Bug - "No active speech found in audio results" #997

Closed

MahmoudAshraf97 added 18 commits October 23, 2024 15:27

add faster silero model

fb10067

fix pytube client

d8a54b2

cleanup

eb6e60e

remove redundant for loop from get_active_regions

ecbc169

fix the input dimensions of scores

9b523f9

enable zero silence threshold

2290687

switch to async execution to use data parallelism on multiple gpus

d290904

fix faulty silence threshold

a68b7d3

unify the apis of batched and sequential transcribe functions

a3c443e

fix tests

f9edf8d

revert vad defaults

d529e48

update batched vad default arguments

8c9d0e9

fix silero encoder, remove min cut operation and use original functions

1a91f6e

reducing diff

1893664

* reuse collect_chunks for batched inference

c280437

* add onnx files to manifest * change `merge_segments` to use audio indixes directly

update readme

5128e17

additional fixes when no speech is found

02f3c0a

skip transcription if no speech is found

8011470

MahmoudAshraf97 force-pushed the same_vad branch from 64852b5 to 8011470 Compare October 23, 2024 12:27

Jiltseb reviewed Oct 23, 2024

ensure correct dims on short audio

fef15d9

MahmoudAshraf97 mentioned this pull request Oct 23, 2024

Update transcribe.py #1014

Closed

Jiltseb reviewed Oct 23, 2024

add sampling rate as an argument

ff6e8b5

Jiltseb reviewed Oct 23, 2024

improve documentation for VadOptions

480cb3f

MahmoudAshraf97 merged commit 2dbca5e into SYSTRAN:master Oct 24, 2024
3 checks passed

MahmoudAshraf97 deleted the same_vad branch October 24, 2024 09:07

MahmoudAshraf97 mentioned this pull request Nov 14, 2024

VAD is relatively slow #364

Closed
