1.0.3 VAD v5 is much worse than 1.0.2 VAD v4 #934
Comments
This is the audio file for the video above. |
I agree. I also see worse performance, just not as much; however, the overall WER for non-English speech is going down. Please go back to the previous Silero version, or at least let us choose the VAD model. |
Version 1.0.3 release still uses silero, but with an upgraded version. |
@zx3777 that will cause a higher WER; a missing word still counts as an error. |
Useless, I tried. |
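To make the point about missing words concrete, here is a minimal sketch of a word-level WER computation. It is not the scoring code used by anyone in this thread; the function name `word_error_rate` and the example sentences are hypothetical, and it assumes plain whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of four reference words is dropped.
print(word_error_rate("the cat sat down", "the sat down"))
```

A word that the VAD filters out before transcription shows up here as a deletion, so it raises WER just like a substitution would.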
Hi, could you try again with the master branch and let me know the results? |
I will run the tests on our audio corpora with different parameters, but it won't be quick. |
I tested the master branch version from before the upgrade to [New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements], and the results were the same. In my opinion, after the new PR, only the batched version uses a different VAD implementation. The normal version still uses the VAD from 1.0.3, so the results should be the same. |
Thanks for the test, @zx3777. I suspect this is an issue with the model itself. I'll open a PR once the issues related to this discussion are finalized. |
I already wrote the code, but waiting for #936 to be merged so we can discuss having both or just reverting to V4 |
Just chiming in and adding a case where the old version (not sure if it's v3 or v4) outperforms v5:
code:
from pprint import pprint
from faster_whisper.audio import decode_audio
from faster_whisper.vad import VadOptions, get_speech_timestamps
speech_chunks = get_speech_timestamps(decode_audio('ja_example.wav'))
pprint(speech_chunks)
old:
v5:
Apparently cartoony voices are ignored in v5. |
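One way to put a number on a comparison like the one above is to sum the detected speech duration for each VAD version. The sketch below assumes each chunk is a dict with `start` and `end` keys measured in samples at 16 kHz, which is how `get_speech_timestamps` output is commonly described; the helper name `speech_seconds` and the chunk values are hypothetical placeholders, not results from this thread.

```python
SAMPLING_RATE = 16000  # assumed sampling rate of the decoded audio


def speech_seconds(chunks, sampling_rate=SAMPLING_RATE):
    """Total duration, in seconds, covered by the detected speech chunks."""
    total_samples = sum(chunk["end"] - chunk["start"] for chunk in chunks)
    return total_samples / sampling_rate


# Placeholder data for illustration only; substitute the real output of each version.
old_chunks = [{"start": 0, "end": 32000}, {"start": 48000, "end": 80000}]
v5_chunks = [{"start": 0, "end": 32000}]

old_total = speech_seconds(old_chunks)
v5_total = speech_seconds(v5_chunks)
print(f"old: {old_total:.1f}s, v5: {v5_total:.1f}s, ratio: {v5_total / old_total:.2f}")
```

A ratio well below 1 on the same file would suggest that the newer model is discarding audio the older one kept, though it does not by itself say which segmentation is correct.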
Hi @MahmoudAshraf97 , since the PR is merged, is it time to have this discussion? |
Since I'm the maintainer now, I guess we should stick with V5, although it might introduce some edge cases. Unless there are solid benchmarks on how different Silero versions affect WER, I would vote for including only V5; users have the option to revert to V4 by modifying the code manually. |
silero-vad
Large portions of the speech are missing.
For some files, the subtitle file is 10 KB with version 1.0.2 but less than 1 KB with version 1.0.3.
This video file
https://www.youtube.com/watch?v=tVLOBfzbJV8
resulted in 320 lines of subtitles using version 1.0.2, but only 218 lines using version 1.0.3 (about 32% fewer). Many conversations were not recognized in version 1.0.3.
I only compared Korean; other languages have not been tested yet.