float16 does not appear to work on CPU with fp16 capabilities #65
Comments
Currently we rely on third-party libraries to run the matrix multiplications, but none of them support FP16 computation on CPU. (We integrate Intel MKL, oneDNN, OpenBLAS, Ruy, and Apple Accelerate, which are selected depending on the platform.) In the whisper.cpp issue you linked there are indeed gains when using the FP16 model and enabling the relevant FP16 compilation flags. Do you know how it compares to running the FP32 model with OpenBLAS on this CPU? In faster-whisper, you could try using 8-bit quantization instead, with compute_type="int8".
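For context, a minimal sketch of what selecting 8-bit quantization looks like when loading the model in faster-whisper (the model size here is illustrative):

```python
from faster_whisper import WhisperModel

# compute_type="int8" quantizes the weights to 8-bit integers at load time,
# which sidesteps the missing FP16 kernels on CPU.
model = WhisperModel("large-v2", device="cpu", compute_type="int8")
```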
I don't know how to explicitly select OpenBLAS and am just using the defaults. Beam size = 1 for all tests.
It does look like the lack of FP16 support hurts on this particular model of CPU.
Could you enable the verbose mode when running faster-whisper and post the output here?
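If it helps, CTranslate2 (the compute backend of faster-whisper) reads the CT2_VERBOSE environment variable for its log level; a sketch, assuming it is set before the library is imported:

```python
import os

# 1 = info level: CTranslate2 then reports the selected backend and CPU ISA
# when the model is loaded. Set it before importing faster_whisper so the
# native library picks it up.
os.environ["CT2_VERBOSE"] = "1"

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cpu", compute_type="int8")
```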
This is how I've run faster-whisper and whisper.cpp.
Environment: spin up an always-free Oracle Cloud instance.
Faster-whisper
WhisperCpp
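The exact commands aren't shown above; as a rough sketch, timing the faster-whisper side of such a run could look like this (model size, compute type, and audio path are placeholders), with whisper.cpp timed via its own reported timings:

```python
import time

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cpu", compute_type="float32")

start = time.perf_counter()
segments, info = model.transcribe("sample.wav", beam_size=1)
# transcribe() returns a lazy generator, so the segments must be consumed
# for the decoding work (and therefore the timing) to actually happen.
text = " ".join(segment.text for segment in segments)
print(f"Transcription took {time.perf_counter() - start:.1f}s")
```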
Thank you for all the information! Everything looks correct to me. So it seems this CPU benefits a lot from FP16 and native compilation. Can you share what compilation flags are enabled when building whisper.cpp on this instance?
Here are the flags:
Also displaying the GCC version in case it helps.
And here is the CPU:
Perhaps we need to use https://github.com/ARM-software/ComputeLibrary?
I registered for an Oracle Cloud account and tested on the same instance type that you used. I did not reproduce your results on a 2 min audio file using the large-v2 model:
The time for whisper.cpp is consistent with your results, but the times for faster-whisper are not. My guess is that your audio file triggers the "temperature fallback", while the whisper.cpp commit you used (ggerganov/whisper.cpp@8e361d9) had just disabled this mode by default. So you should also disable it in faster-whisper for the comparison: model.transcribe(..., temperature=0)

For reference, here are the reported compilation commands for whisper.cpp, which include the native FP16 flags.
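In code, the faster-whisper side of that comparison would then look roughly like this (audio path is a placeholder):

```python
# temperature=0 keeps a single decoding pass and disables the fallback that
# re-decodes low-confidence segments at increasing temperatures.
segments, info = model.transcribe("sample.wav", beam_size=1, temperature=0)
```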
You are correct. The original file is 2m28s long.
The result for int8 with temperature 0 is fantastic. Looking at the translations for int8 and fp32, int8 is very slightly inferior to fp32, especially in terms of punctuation. fp16 would still be nice to have, because I would expect it to take roughly half the fp32 time, which would make it almost real-time too.
Thanks for the confirmation! Based on the whisper.cpp results, there is indeed a possible 2x speedup with FP16 on this CPU (using the large-v2 model on a 2 min audio file).
I don't think there's anything else we can do here. Are you OK if I create an enhancement request in CTranslate2 to support FP16 for Arm CPUs and close this off?
Yes, please do that. Thanks! |
Closing since the enhancement request was created in OpenNMT/CTranslate2#1153.
Convert model
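The original conversion command isn't preserved here; a sketch of one way to do it with the CTranslate2 Transformers converter (output directory and quantization choice are illustrative):

```python
from ctranslate2.converters import TransformersConverter

# Convert the Hugging Face Whisper checkpoint into the CTranslate2 format that
# faster-whisper loads; quantization="float16" stores the weights in FP16.
converter = TransformersConverter("openai/whisper-large-v2")
converter.convert("whisper-large-v2-ct2", quantization="float16")
```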
Run using sample
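Likewise a minimal sketch, assuming the converted model directory from the step above and a placeholder audio file:

```python
from faster_whisper import WhisperModel

# compute_type="float16" is the setting under discussion: per the thread above,
# the CPU backends currently fall back to FP32 computation.
model = WhisperModel("whisper-large-v2-ct2", device="cpu", compute_type="float16")

segments, info = model.transcribe("sample.wav", beam_size=1)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```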
This is done on Oracle Cloud's free tier, which has 4x Ampere A1 CPUs and 24 GB of RAM.
The Ampere A1 CPU has native support for FP16.
In whisper.cpp (ggerganov/whisper.cpp#532), I was able to get FP16 working well by adding the necessary compile flags.
Is there anything similar that we can do here?
FP16 would hopefully improve performance significantly.