turbo model release #2363
19 comments · 45 replies
-
It is amazing, but I'd really like to see lower WER rather than a speedup :( I hope that arrives ASAP.
-
Thank you @jongwook for releasing this. For fine-tuning, what learning rate would you recommend? We've been using 1e-5 with large-v2 and large-v3 and it's been working well. Should we change it for turbo?
-
Does the turbo model still support the translate task? I tried it, but it seems to output the source language instead of English. Has anyone else run into this?
-
Hi all, I'm VB from the Open Source audio team at Hugging Face. Thanks a ton @jongwook and team for the release! For reference, the checkpoints are available on the Hugging Face Hub as `openai/whisper-large-v3-turbo`. You can reference them via a simple snippet:

```python
import torch
from transformers import pipeline

model_id = "openai/whisper-large-v3-turbo"

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch.float16,
    device="cuda",  # replace with `mps` for Mac
)

result = pipe("<file_name.mp3>")  # pass `return_timestamps=True` for timestamps
print(result["text"])
```

You can also fine-tune it for your own use case: https://huggingface.co/blog/fine-tune-whisper

Looking forward to the next updates to the series 🤗
-
Amazing work! Thank you. Is there any ongoing work on Whisper's hallucinations? For us, this seems to be the biggest downside of using it. Sometimes it also gets stuck in a loop and keeps repeating things (this happens more often in v3, so we use v2, which seems better, at least for French).
-
Any chance this model will be available over the official OpenAI API anytime soon?
-
Where can I find the WER for Urdu (turbo model)?
-
Thanks. The biggest drawback of Whisper right now is that the timestamps are inaccurate: subtitles generated by the large series appear much earlier than they should. Tested in Korean.
-
Could you please find a way to solve the long-standing issues of hallucinations and incorrect timestamps in Whisper?
-
To speed up decoding without hurting performance too much, we want a fat, shallow decoder that takes advantage of parallel processing: fewer decoder layers, as they did, but more attention heads per layer. The heads can be computed in parallel, but the layers cannot (we can only compute a layer once we have its input, which is the output of the previous one). By keeping the encoder deep, you ensure that the input is thoroughly processed and rich in features; a fat, shallow decoder then speeds up the auto-regressive decoding loop, which calls the decoder block many times. A rough timing sketch is below.
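To illustrate the point, here is a minimal PyTorch sketch (hypothetical sizes, not Whisper's actual dimensions): at a fixed model width, all heads in a layer are computed in one batched matmul, so extra heads add no sequential steps, while every extra layer does. A 4-layer/16-head stack should decode markedly faster per token than a 32-layer/8-head one:

```python
import time
import torch
import torch.nn as nn

def decoder_stack(n_layers, n_heads, d_model=512):
    # a plain stack of standard transformer decoder layers
    layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
    return nn.TransformerDecoder(layer, num_layers=n_layers)

@torch.no_grad()
def time_decode(dec, steps=64, d_model=512):
    memory = torch.randn(1, 1500, d_model)  # stand-in for the encoder output
    tgt = torch.randn(1, 1, d_model)        # decoded-so-far sequence
    start = time.perf_counter()
    for _ in range(steps):                  # the auto-regressive loop
        out = dec(tgt, memory)
        tgt = torch.cat([tgt, out[:, -1:, :]], dim=1)
    return time.perf_counter() - start

deep = decoder_stack(n_layers=32, n_heads=8).eval()     # deep, fewer heads
shallow = decoder_stack(n_layers=4, n_heads=16).eval()  # shallow, more heads
print("deep:   ", time_decode(deep))
print("shallow:", time_decode(shallow))
```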
-
I've noticed an issue with the `no_speech_prob` variable when using the turbo model. While it works correctly with the large model, it doesn't seem to function as expected with turbo. Please look into this. Thank you!
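For context, a minimal sketch of where this value surfaces when using the `openai-whisper` package (the audio file name is a placeholder):

```python
import whisper

model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")

# each segment reports the probability that its window contains no speech
for seg in result["segments"]:
    print(seg["start"], seg["end"], seg["no_speech_prob"])
```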
-
@jongwook, any idea how to fix this badly broken turbo model? On my newest video, Whisper turbo performed really, really badly; it was so broken that it couldn't be fixed manually, so I had to use large-v3 and fix that instead. Here is the video with fully manually fixed subtitles (you can download and test): https://youtu.be/URnOHbmuKWs This is why we need a better-WER model, not a faster one; it is unusable without huge effort. Compare the turbo transcription with the manually fixed subtitles and you won't believe the difference; sadly, it is just huge. In the second example below, the first is turbo and the second is manually fixed.
-
Hey guys, if I want to use the translate task in the Python usage, what library should I call? Thanks a lot! I tried it and found a solution for the Python usage; the code is roughly the sketch below. But the current turbo model doesn't seem to support the translate function.
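Presumably the snippet in question looked something like this minimal sketch using the `openai-whisper` package (the audio file name is a placeholder):

```python
import whisper

# load the turbo checkpoint (downloaded on first use)
model = whisper.load_model("turbo")

# request translation into English; per the release notes, turbo was not
# trained on translation data, so this may just return the source language
result = model.transcribe("audio.mp3", task="translate")
print(result["text"])
```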
-
I've been using the turbo model and I'm getting moderately better quality in proper names and acronyms compared to V3. With a moderately noisy audio file I get things like this:

Large V3: "We're coming back from pruxelles were we held a CDA to fix the mess of the other day. we did it but we're late like jankees. you know what I mean"

Turbo, compared to V3, behaves more like a Large V2 with clean audio.
-
When the model is downloaded on first usage, the size of the zip file is roughly 1.5 GB, while I expected
-
Personally, when transcribing Japanese audio, I still see random, unrelated English words being inserted in between Japanese characters. This issue seems to have persisted since v3; v2 doesn't have this problem. A shame, because it's really fast, and outside of that issue it's really good overall.
-
When inputting an 11-hour MP3 file, the error rate increased significantly, and several lines were missed in the transcription of the first 20 minutes. However, when I extracted the first 20 minutes of audio and transcribed it on its own, everything worked fine.
-
Hmm, what I don't really understand is that there seem to be more problems with v3 regarding hallucinations, and perhaps missing sentences, than with v2. Even OpenAI uses v2 rather than v3 for their API (see here), so they might be aware of this as well. So why not use large-v2 as the basis for turbo instead of large-v3? I mean, isn't robustness more important than a slightly lower benchmark number?
-
We're releasing a new Whisper model named `large-v3-turbo`, or `turbo` for short. It is an optimized version of Whisper `large-v3` and has only 4 decoder layers, just like the `tiny` model, down from the 32 in the `large` series.

This work is inspired by Distil-Whisper [1], where the authors observed that using a smaller decoder can greatly improve transcription speed while causing minimal degradation in accuracy. Unlike Distil-Whisper, which used distillation to train a smaller model, Whisper `turbo` was fine-tuned for two more epochs over the same amount of multilingual transcription data used for training `large-v3`, i.e. excluding translation data, on which we don't expect `turbo` to perform well.
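If you want to confirm the decoder depth yourself, here is a quick sketch using the `openai-whisper` package's `decoder.blocks` attribute (note that this downloads each checkpoint):

```python
import whisper

# tiny and turbo should both report 4 decoder layers; large-v3 reports 32
for name in ("tiny", "turbo", "large-v3"):
    model = whisper.load_model(name)
    print(name, "decoder layers:", len(model.decoder.blocks))
```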
Across languages, the `turbo` model performs similarly to `large-v2`, though it shows larger degradation on some languages like Thai and Cantonese. Whisper `turbo` performs better on FLEURS, which consists of cleaner recordings than Common Voice. The figure below shows the `turbo` model's performance on the subset of languages in the Common Voice 15 and FLEURS datasets where `large-v3` scored a 20% error rate or lower.

Combined with a recent patch (#2359) to use `F.scaled_dot_product_attention` when available, the ASR speed of `turbo` is faster than what `tiny` used to be:
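For reference, the fused attention API that the patch uses, in a toy call with hypothetical shapes (available in PyTorch 2.0+):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim)
q = torch.randn(1, 8, 10, 64)
k = torch.randn(1, 8, 10, 64)
v = torch.randn(1, 8, 10, 64)

# dispatches to a fused kernel (e.g. FlashAttention) when one is available,
# instead of materializing the full attention matrix in eager ops
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 10, 64])
```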
This puts `turbo` at the "best of both worlds" when comparing speed and accuracy:

You can update the Python package to version `20240930` or later to use the new model:
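Assuming the standard PyPI package name, the upgrade looks like:

```
pip install -U openai-whisper
```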
With this upgrade, the `whisper` CLI command now defaults to the `turbo` model:

```
whisper audio.wav  # will use `--model turbo` by default
```
References

[1] Gandhi, S., von Platen, P., & Rush, A. M. (2023). Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling. arXiv:2311.00430.