float16 does not appear to work on CPU with fp16 capabilities #65
Comments
Currently we rely on third-party libraries to run the matrix multiplications, but none of them support FP16 computation on CPU. (We integrate Intel MKL, oneDNN, OpenBLAS, Ruy, and Apple Accelerate, which are selected depending on the platform.) In the whisper.cpp issue you linked there are indeed gains when using the FP16 model and enabling the relevant FP16 compilation flags. Do you know how it compares to running the FP32 model with OpenBLAS on this CPU? In faster-whisper, you could try using 8-bit quantization instead, with compute_type="int8".
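For context, a minimal sketch of what selecting 8-bit quantization looks like when loading the model in faster-whisper (the model size here is illustrative):

```python
from faster_whisper import WhisperModel

# compute_type="int8" quantizes the weights to 8-bit integers at load time,
# which sidesteps the missing FP16 kernels on CPU.
model = WhisperModel("large-v2", device="cpu", compute_type="int8")
```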
I don't know how to explicitly select OpenBLAS and am just using the defaults. Beam size = 1 for all tests.
It does look like the lack of FP16 support hurts on this particular model of CPU.
Could you enable the verbose mode when running faster-whisper and post the output here?
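If it helps, CTranslate2 (the compute backend of faster-whisper) reads the CT2_VERBOSE environment variable for its log level; a sketch, assuming it is set before the library is imported:

```python
import os

# 1 = info level: CTranslate2 then reports the selected backend and CPU ISA
# when the model is loaded. Set it before importing faster_whisper so the
# native library picks it up.
os.environ["CT2_VERBOSE"] = "1"

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cpu", compute_type="int8")
```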
This is how I've run faster-whisper and whisper.cpp.
Environment: spin up an always-free Oracle Cloud instance.
Faster-whisper
WhisperCpp
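The exact commands aren't shown above; as a rough sketch, timing the faster-whisper side of such a run could look like this (model size, compute type, and audio path are placeholders), with whisper.cpp timed via its own reported timings:

```python
import time

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cpu", compute_type="float32")

start = time.perf_counter()
segments, info = model.transcribe("sample.wav", beam_size=1)
# transcribe() returns a lazy generator, so the segments must be consumed
# for the decoding work (and therefore the timing) to actually happen.
text = " ".join(segment.text for segment in segments)
print(f"Transcription took {time.perf_counter() - start:.1f}s")
```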
Thank you for all the information! Everything looks correct to me. So it seems this CPU benefits a lot from FP16 and native compilation. Can you share what compilation flags are enabled when building whisper.cpp on this instance?
Here are the flags:
Also displaying the GCC version in case it helps.
And here is the CPU:
Perhaps we need to use https://github.com/ARM-software/ComputeLibrary?
I registered for an Oracle Cloud account and tested on the same instance type that you used. I did not reproduce your results on a 2 min audio file using the large-v2 model:
The time for whisper.cpp is consistent with your results, but the times for faster-whisper are not. My guess is that your audio file triggers the "temperature fallback", while the whisper.cpp commit you used (ggerganov/whisper.cpp@8e361d9) had just disabled this mode by default. So you should also disable it in faster-whisper for the comparison: model.transcribe(..., temperature=0)

For reference, here are the reported compilation commands for whisper.cpp, which include the native FP16 flags.
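In code, the faster-whisper side of that comparison would then look roughly like this (audio path is a placeholder):

```python
# temperature=0 keeps a single decoding pass and disables the fallback that
# re-decodes low-confidence segments at increasing temperatures.
segments, info = model.transcribe("sample.wav", beam_size=1, temperature=0)
```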
You are correct. The original file is 2m28s long.
The result for int8 with temperature 0 is fantastic. Looking at the translations for int8 and fp32, int8 is very slightly inferior to fp32, especially in terms of punctuation. fp16 would still be nice to have, because I would expect it to take roughly half the fp32 time, which would make it almost real-time too.
Thanks for the confirmation! Based on the whisper.cpp results, there is indeed a possible 2x speedup with FP16 on this CPU (using the large-v2 model on a 2 min audio file).
I don't think there's anything else we can do here. Are you OK if I create an enhancement request in CTranslate2 to support FP16 for Arm CPUs and close this off?
Yes, please do that. Thanks! |
Closing since the enhancement request was created in OpenNMT/CTranslate2#1153.
Convert model
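The original conversion command isn't preserved here; a sketch of one way to do it with the CTranslate2 Transformers converter (output directory and quantization choice are illustrative):

```python
from ctranslate2.converters import TransformersConverter

# Convert the Hugging Face Whisper checkpoint into the CTranslate2 format that
# faster-whisper loads; quantization="float16" stores the weights in FP16.
converter = TransformersConverter("openai/whisper-large-v2")
converter.convert("whisper-large-v2-ct2", quantization="float16")
```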
Run using sample
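Likewise a minimal sketch, assuming the converted model directory from the step above and a placeholder audio file:

```python
from faster_whisper import WhisperModel

# compute_type="float16" is the setting under discussion: per the thread above,
# the CPU backends currently fall back to FP32 computation.
model = WhisperModel("whisper-large-v2-ct2", device="cpu", compute_type="float16")

segments, info = model.transcribe("sample.wav", beam_size=1)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```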
This is done on Oracle Cloud's free tier, which has 4x Ampere A1 CPUs and 24 GB of RAM.
The Ampere A1 CPU has native support for FP16.
In whisper.cpp (ggerganov/whisper.cpp#532), I was able to get FP16 working well by adding the necessary compile flags.
Is there anything similar that we can do here?
FP16 would hopefully improve performance significantly.