turbo model release #2363
19 comments · 45 replies
-
It is amazing, but I'd really like to see lower WER rather than a speedup :( I hope that arrives ASAP.
-
Thank you @jongwook for releasing this. For fine-tuning, what learning rate would you recommend? We've been using 1e-5 with large-v2 and large-v3 and it's been working well. Should we change it for turbo?
-
Does the turbo model still support the translate task? I tried it, but it seems to output the source language instead of English. Has anyone else run into this?
-
Hi all, I'm VB from the Open Source audio team at Hugging Face. Thanks a ton @jongwook and team for the release! For reference, the checkpoints are available on the Hugging Face Hub as `openai/whisper-large-v3-turbo`. You can reference them via a simple snippet:

```python
import torch
from transformers import pipeline

model_id = "openai/whisper-large-v3-turbo"

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch.float16,
    device="cuda",  # replace with `mps` for Mac
)

result = pipe("<file_name.mp3>")  # pass `return_timestamps=True` for timestamps
print(result["text"])
```

You can also fine-tune it for your own use case: https://huggingface.co/blog/fine-tune-whisper

Looking forward to the next updates to the series 🤗
-
Amazing work! Thank you. Is there any ongoing work on Whisper's hallucinations? For us, this seems to be the biggest downside of using it. Sometimes it also gets stuck in a loop and keeps repeating things (this happens more often in v3, so we use v2, which seems better, at least for French).
-
Any chance this model will be available over the official OpenAI API anytime soon?
-
Where can I find the WER for Urdu (turbo model)?
-
Thanks. The biggest drawback of Whisper right now is that the timestamps are inaccurate: subtitles generated by the large series appear much earlier than they should. Tested in Korean.
-
Could you please find a way to solve the long-standing issues of hallucinations and incorrect timestamps in Whisper?
-
To speed up decoding without hurting performance too much, we want a fat, shallow decoder that takes advantage of parallel processing: fewer decoder layers, as they did, but more attention heads per layer. The heads can be computed in parallel, but the layers cannot (we can only compute a layer once we have its input, which is the output of the previous one). By keeping the encoder deep, you ensure that the input is thoroughly processed and rich in features; a fat, shallow decoder then speeds up the auto-regressive decoding loop, which calls the decoder block many times. A rough timing sketch is below.
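To illustrate the point, here is a minimal PyTorch sketch (hypothetical sizes, not Whisper's actual dimensions): at a fixed model width, all heads in a layer are computed in one batched matmul, so extra heads add no sequential steps, while every extra layer does. A 4-layer/16-head stack should decode markedly faster per token than a 32-layer/8-head one:

```python
import time
import torch
import torch.nn as nn

def decoder_stack(n_layers, n_heads, d_model=512):
    # a plain stack of standard transformer decoder layers
    layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
    return nn.TransformerDecoder(layer, num_layers=n_layers)

@torch.no_grad()
def time_decode(dec, steps=64, d_model=512):
    memory = torch.randn(1, 1500, d_model)  # stand-in for the encoder output
    tgt = torch.randn(1, 1, d_model)        # decoded-so-far sequence
    start = time.perf_counter()
    for _ in range(steps):                  # the auto-regressive loop
        out = dec(tgt, memory)
        tgt = torch.cat([tgt, out[:, -1:, :]], dim=1)
    return time.perf_counter() - start

deep = decoder_stack(n_layers=32, n_heads=8).eval()     # deep, fewer heads
shallow = decoder_stack(n_layers=4, n_heads=16).eval()  # shallow, more heads
print("deep:   ", time_decode(deep))
print("shallow:", time_decode(shallow))
```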
-
I've noticed an issue with the `no_speech_prob` variable when using the turbo model. While it works correctly with the large model, it doesn't seem to function as expected with turbo. Please look into this. Thank you!
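For context, a minimal sketch of where this value surfaces when using the `openai-whisper` package (the audio file name is a placeholder):

```python
import whisper

model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")

# each segment reports the probability that its window contains no speech
for seg in result["segments"]:
    print(seg["start"], seg["end"], seg["no_speech_prob"])
```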
-
@jongwook, any idea how to fix this badly broken turbo model? On my newest video, Whisper turbo performed really, really badly; it was so broken that it couldn't be fixed manually, so I had to use large-v3 and fix that instead. Here is the video with fully manually fixed subtitles (you can download and test): https://youtu.be/URnOHbmuKWs This is why we need a better-WER model, not a faster one; it is unusable without huge effort. Compare the turbo transcription with the manually fixed subtitles and you won't believe the difference; sadly, it is just huge. In the second example below, the first is turbo and the second is manually fixed.
-
Hey guys, if I want to use the translate task in the Python usage, what library should I call? Thanks a lot! I tried it and found a solution for the Python usage; the code is roughly the sketch below. But the current turbo model doesn't seem to support the translate function.
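Presumably the snippet in question looked something like this minimal sketch using the `openai-whisper` package (the audio file name is a placeholder):

```python
import whisper

# load the turbo checkpoint (downloaded on first use)
model = whisper.load_model("turbo")

# request translation into English; per the release notes, turbo was not
# trained on translation data, so this may just return the source language
result = model.transcribe("audio.mp3", task="translate")
print(result["text"])
```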
-
I've been using the turbo model and I'm getting moderately better quality in proper names and acronyms compared to V3. With a moderately noisy audio file I get things like this:

Large V3: "We're coming back from pruxelles were we held a CDA to fix the mess of the other day. we did it but we're late like jankees. you know what I mean"

Turbo, compared to V3, behaves more like a Large V2 with clean audio.
-
When the model is downloaded on first usage, the size of the zip file is roughly 1.5 GB, while I expected
-
Personally, when transcribing Japanese audio, I still see random, unrelated English words being inserted in between Japanese characters. This issue seems to have persisted since v3; v2 doesn't have this problem. A shame, because it's really fast, and outside of that issue it's really good overall.
-
When inputting an 11-hour MP3 file, the error rate increased significantly, and several lines were missed in the transcription of the first 20 minutes. However, when I extracted the first 20 minutes of audio and transcribed it on its own, everything worked fine.
-
Hmm, what I don't really understand is that there seem to be more problems with v3 regarding hallucinations, and perhaps missing sentences, than with v2. Even OpenAI uses v2 rather than v3 for their API (see here), so they might be aware of this as well. So why not use large-v2 as the basis for turbo instead of large-v3? I mean, isn't robustness more important than a slightly lower benchmark number?
-
We're releasing a new Whisper model named `large-v3-turbo`, or `turbo` for short. It is an optimized version of Whisper `large-v3` and has only 4 decoder layers, just like the `tiny` model, down from the 32 in the `large` series.

This work is inspired by Distil-Whisper [1], where the authors observed that using a smaller decoder can greatly improve transcription speed while causing minimal degradation in accuracy. Unlike Distil-Whisper, which used distillation to train a smaller model, Whisper `turbo` was fine-tuned for two more epochs over the same amount of multilingual transcription data used for training `large-v3`, i.e. excluding translation data, on which we don't expect `turbo` to perform well.
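If you want to confirm the decoder depth yourself, here is a quick sketch using the `openai-whisper` package's `decoder.blocks` attribute (note that this downloads each checkpoint):

```python
import whisper

# tiny and turbo should both report 4 decoder layers; large-v3 reports 32
for name in ("tiny", "turbo", "large-v3"):
    model = whisper.load_model(name)
    print(name, "decoder layers:", len(model.decoder.blocks))
```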
Across languages, the `turbo` model performs similarly to `large-v2`, though it shows larger degradation on some languages like Thai and Cantonese. Whisper `turbo` performs better on FLEURS, which consists of cleaner recordings than Common Voice. The figure below shows the `turbo` model's performance on the subset of languages in the Common Voice 15 and FLEURS datasets where `large-v3` scored a 20% error rate or lower.

Combined with a recent patch (#2359) to use `F.scaled_dot_product_attention` when available, the ASR speed of `turbo` is faster than what `tiny` used to be:
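For reference, the fused attention API that the patch uses, in a toy call with hypothetical shapes (available in PyTorch 2.0+):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim)
q = torch.randn(1, 8, 10, 64)
k = torch.randn(1, 8, 10, 64)
v = torch.randn(1, 8, 10, 64)

# dispatches to a fused kernel (e.g. FlashAttention) when one is available,
# instead of materializing the full attention matrix in eager ops
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 10, 64])
```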
This puts `turbo` at the "best of both worlds" when comparing speed and accuracy:

You can update the Python package to version `20240930` or later to use the new model:
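Assuming the standard PyPI package name, the upgrade looks like:

```
pip install -U openai-whisper
```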
With this upgrade, the `whisper` CLI command now defaults to the `turbo` model:

```
whisper audio.wav  # will use `--model turbo` by default
```
References

[1] Gandhi, S., von Platen, P., & Rush, A. M. (2023). Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling. arXiv:2311.00430.