1.0.3 VAD v5 is much worse than 1.0.2 VAD v4 #934

zx3777 · 2024-07-26T02:57:24Z

silero-vad

Large portions of the speech are missing.

For some files, version 1.0.2 produces a subtitle file of about 10 KB, while version 1.0.3 produces one of less than 1 KB.

This video file
https://www.youtube.com/watch?v=tVLOBfzbJV8
resulted in 320 lines of subtitles using version 1.0.2, but only 218 lines using version 1.0.3. Many conversations were not recognized in version 1.0.3.

I have only compared Korean; other languages have not been tested yet.

zx3777 · 2024-07-26T03:02:18Z

This is the audio file for the video above.
https://mega.nz/file/QacS2LCJ#x_Gq9GgV8aPk2qRVskfzNBuyM9XAI-Pv2SBIwxfomnk

x86Gr · 2024-07-26T06:21:37Z

I agree. I also see worse performance, just not as much; however, the overall WER for non-English speech is going down. Go back to silero or at least let us choose the VAD model.

zx3777 · 2024-07-27T00:50:17Z

> I agree. I also see worse performance, just not as much; however, the overall WER for non-English speech is going down. Go back to silero or at least let us choose the VAD model.

The 1.0.3 release still uses silero, just an upgraded version of it.
WER may be going down because the VAD only identifies sufficiently clear speech.

MahmoudAshraf97 · 2024-07-27T07:35:38Z

@zx3777 That would cause a higher WER, not a lower one: a missing word still counts as an error.
You should try adjusting the VAD settings and see how much difference it makes. The model was changed, but the parameters are still tuned for the previous one.
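To illustrate the point about missing words, here is a minimal sketch of word-level WER computed as edit distance divided by reference length. The function name `word_error_rate` and the example sentences are made up for illustration; this is not the scoring code used by anyone in this thread.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level WER: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from the hypothesis
                curr[j - 1] + 1,     # insertion: extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example sentences, not taken from the audio discussed here.
reference = "the cat sat on the mat"
one_mistake = "the cat sat on a mat"   # one substituted word
dropped_tail = "the cat sat"           # second half never reached the recognizer

print(word_error_rate(reference, one_mistake))
print(word_error_rate(reference, dropped_tail))
```

Every word in a segment that the VAD drops shows up as a deletion, so a transcript with large gaps scores worse than one with a few wrong words.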

zx3777 · 2024-07-27T12:08:44Z

> @zx3777 that will cause higher WER, a missing word is still an error to count
> You should try playing with the vad settings and see how it makes a difference, the model was changed but the parameters are still tuned for the previous one

Useless

I tried --vad_threshold values of 0.4, 0.3, and 0.2 in 1.0.3. There was a slight improvement, but far fewer subtitles are recognized than in 1.0.2.
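A threshold comparison like this can also be scripted against the library directly. The sketch below is one possible way to do it, assuming the `VadOptions(threshold=...)` field and the `get_speech_timestamps(audio, vad_options)` call from `faster_whisper.vad`, and assuming audio decoded at 16 kHz. The helper names `summarize`, `sweep_thresholds`, and `make_detector` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

SAMPLING_RATE = 16000  # assumption: decode_audio resamples to 16 kHz by default

Chunk = Dict[str, int]  # {"start": sample_index, "end": sample_index}


def summarize(chunks: List[Chunk], sampling_rate: int = SAMPLING_RATE) -> Tuple[int, float]:
    """Return (number of speech chunks, total detected speech in seconds)."""
    total_samples = sum(c["end"] - c["start"] for c in chunks)
    return len(chunks), total_samples / sampling_rate


def sweep_thresholds(
    detect: Callable[[float], List[Chunk]],
    thresholds: Iterable[float],
) -> Dict[float, Tuple[int, float]]:
    """Run the detector once per threshold and summarize each result."""
    return {t: summarize(detect(t)) for t in thresholds}


def make_detector(path: str) -> Callable[[float], List[Chunk]]:
    """Build a detector for one audio file (hypothetical helper)."""
    # Imported lazily so the helpers above work without faster-whisper installed.
    from faster_whisper.audio import decode_audio
    from faster_whisper.vad import VadOptions, get_speech_timestamps

    audio = decode_audio(path)
    return lambda threshold: get_speech_timestamps(audio, VadOptions(threshold=threshold))


# Usage (requires faster-whisper and a local audio file):
# for t, (count, seconds) in sweep_thresholds(make_detector("audio.wav"), [0.5, 0.4, 0.3, 0.2]).items():
#     print(f"threshold={t}: {count} chunks, {seconds:.1f} s of speech")
```

Running the same sweep under 1.0.2 and 1.0.3 would show whether any threshold brings the amount of detected speech back to the earlier level.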

hoonlight · 2024-07-28T02:37:25Z

Hi, could you try again with the master branch and let me know the results?

x86Gr · 2024-07-29T10:19:30Z

I will run the tests on our audio corpora with different parameters, but it won't be quick.

zx3777 · 2024-08-12T01:27:46Z

> Hi, could you try again with the master branch and let me know the results?

I tested the master branch as it was before the [New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements] upgrade, and the results were the same.

As I understand it, after the new PR only the batched version uses a different VAD implementation. The normal version still uses the VAD from 1.0.3, so the results should be the same.

hoonlight · 2024-08-13T06:08:58Z

Thanks for the test, @zx3777. I suspect this is an issue with the model itself.
There hasn't been enough quantitative evaluation of silero-vad v5, but we can at least let users choose silero-vad v4 instead of v5 based on their needs.

I'll open a PR once the issues related to this discussion are settled.

MahmoudAshraf97 · 2024-08-13T07:24:20Z

> Thanks for the test @zx3777 , I suspect this is a issue with the model itself. There hasn't been enough quantitative evaluation of the silero-vad v5, but at least we can make it possible for users to choose silero-vad v4 instead of silero-vad v5 based on their needs.
>
> I'll open a PR after the issues related to this discussion are well finalized.

I already wrote the code, but I'm waiting for #936 to be merged so we can discuss whether to support both versions or just revert to V4.

George0828Zhang · 2024-10-26T04:51:49Z

Just chiming in to add a case where the old version (not sure whether it's v3 or v4) outperforms v5:
https://drive.google.com/file/d/1NPvEybP0VU1dFmd6neH6JJRW_Qm2MXdk/view?usp=sharing

code:

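from pprint import pprint
from faster_whisper.audio import decode_audio
from faster_whisper.vad import VadOptions, get_speech_timestamps

# Run the VAD with default options on the decoded audio.
# The returned 'start' and 'end' values are sample indices, not seconds.
speech_chunks = get_speech_timestamps(decode_audio('ja_example.wav'))
pprint(speech_chunks)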

old:

[{'end': 40192, 'start': 12032},
 {'end': 179456, 'start': 76544},
 {'end': 379136, 'start': 273152},
 {'end': 457984, 'start': 422656},
 {'end': 630016, 'start': 576256},
 {'end': 669952, 'start': 653056},
 {'end': 863488, 'start': 695040},
 {'end': 950528, 'start': 896768}]

v5:

[{'end': 30464, 'start': 12032}]

Apparently cartoony voices are ignored in v5.
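To put a number on the gap, the two outputs above can be converted from sample indices to seconds. This is a small sketch that assumes the audio was decoded at 16 kHz (the default in `decode_audio`). The helper name `speech_seconds` is made up here.

```python
SAMPLING_RATE = 16000  # assumption: decode_audio's default sampling rate

# Timestamps copied from the two outputs posted above.
old_chunks = [
    {"start": 12032, "end": 40192},
    {"start": 76544, "end": 179456},
    {"start": 273152, "end": 379136},
    {"start": 422656, "end": 457984},
    {"start": 576256, "end": 630016},
    {"start": 653056, "end": 669952},
    {"start": 695040, "end": 863488},
    {"start": 896768, "end": 950528},
]
v5_chunks = [{"start": 12032, "end": 30464}]


def speech_seconds(chunks, sampling_rate=SAMPLING_RATE):
    """Total duration of the detected speech chunks, in seconds."""
    return sum(c["end"] - c["start"] for c in chunks) / sampling_rate


old_total = speech_seconds(old_chunks)
v5_total = speech_seconds(v5_chunks)
print(f"old: {old_total:.3f} s, v5: {v5_total:.3f} s, v5/old: {v5_total / old_total:.1%}")
```

Under that assumption the old model marks about 35.3 s as speech and v5 about 1.2 s, so v5 passes roughly 3% of the audio that the old model passed to the recognizer.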

George0828Zhang · 2024-11-13T15:34:32Z

> > Thanks for the test @zx3777 , I suspect this is a issue with the model itself. There hasn't been enough quantitative evaluation of the silero-vad v5, but at least we can make it possible for users to choose silero-vad v4 instead of silero-vad v5 based on their needs.
> > I'll open a PR after the issues related to this discussion are well finalized.
>
> I already wrote the code, but waiting for #936 to be merged so we can discuss having both or just reverting to V4

Hi @MahmoudAshraf97, now that the PR is merged, is it time to have this discussion?

MahmoudAshraf97 · 2024-11-13T15:47:04Z

Since I'm the maintainer now, I think we should stick with V5, although it might introduce some edge cases. Unless there are solid benchmarks showing how the different silero versions affect WER, I would vote for including V5 only; users can still revert to V4 by modifying the code manually.

zx3777 changed the title ~~1.0.3 VAD v5 is much worse than 1.0.2 VAD v4 in korean~~ 1.0.3 VAD v5 is much worse than 1.0.2 VAD v4 Jul 26, 2024

This was referenced Jul 29, 2024

Use Silero VAD in Batched Mode #936

Merged

IMPORTANT: 1.0.3 VAD v5 is much worse than 1.0.2 or 1.0.1 VAD v4 for some certain audio data. WHY? #944

Open

George0828Zhang mentioned this issue Oct 26, 2024

[EDGE CASE] Cartoon voice worse performance on v5 than older version snakers4/silero-vad#563

Open
