
No cuBLAS performance gain for F16 #1249

Closed
ggerganov opened this issue Apr 30, 2023 · 3 comments
Labels: question (Further information is requested)

ggerganov (Owner) commented Apr 30, 2023

I noticed that using cuBLAS with the F16 model does not give any benefit compared to non-BLAS CPU-only mode:

# with cuBLAS
$ make clean && LLAMA_CUBLAS=1 make -j && time ./perplexity -m ./models/7B/ggml-model-f16.bin -f ./build/wiki.test.raw --no-mmap -t 12
....
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
14.33 seconds per pass - ETA 2 hours 36 minutes
[1]4.2336,^C^C

# without BLAS
$ make clean && make -j && time ./perplexity -m ./models/7B/ggml-model-f16.bin -f ./build/wiki.test.raw --no-mmap -t 12
...
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
13.75 seconds per pass - ETA 2 hours 30 minutes
[1]4.2335,^C

System:

  • GeForce GTX 1660
  • AMD Ryzen 9 5950X

In contrast, when using a quantized model, the cuBLAS run is significantly faster.

Is this expected?
I was hoping for some performance improvement for F16 as well.
Maybe the data transfer for F16 is so slow that it defeats the purpose of offloading to the GPU?
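
As a rough way to check that hypothesis, here is a minimal stand-alone CUDA sketch (not part of llama.cpp; the 4096x4096 F16 matrix size is just an illustrative assumption) that times a single host-to-device copy of one weight-sized buffer, to estimate how much of a pass PCIe transfers alone could account for:

#include <cstdio>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

int main() {
    // One F16 weight matrix of an assumed 4096x4096 size (~32 MiB).
    const size_t rows = 4096, cols = 4096;
    const size_t bytes = rows * cols * sizeof(uint16_t);   // each element is a 16-bit half

    std::vector<uint16_t> host(rows * cols);                // pageable host buffer
    void *dev = nullptr;
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("H2D copy of %.1f MiB took %.2f ms (%.1f GB/s)\n",
                bytes / (1024.0 * 1024.0), ms, (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return 0;
}

Multiplying the measured copy time by the number of weight matrices uploaded per batch gives a rough lower bound on the per-pass transfer overhead.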

I noticed this after porting the latest ggml to whisper.cpp, where we use F16 precision, and was surprised that cuBLAS does not bring any improvement.

For example, some time ago I tried using NVBLAS in whisper.cpp and it did bring some decent improvements: ggerganov/whisper.cpp#220 (comment)

The NVBLAS code change was trivial: ggerganov/whisper.cpp#239
What could NVBLAS be doing better in this case?

ggerganov added the question label on Apr 30, 2023
slaren (Collaborator) commented Apr 30, 2023

With an RTX 3080:

F16 used to be the fastest before dequantization on the GPU was implemented: #1044

With the current master, it is still faster than it was originally, so I don't think that there has been a regression:
3.50 seconds per pass - ETA 38 minutes

I don't know why this isn't the case with your GTX 1660. From what I could find, it is a Turing chip that can do FP16.

slaren (Collaborator) commented Apr 30, 2023

I have been experimenting with doing the f16xf32 mat muls in f32 (instead of f16, as is currently done) in https://github.com/slaren/llama.cpp/commits/cuda-f16f32

For me this is faster with quantized models but slower with F16 models; maybe the results are different with your GPU.
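
For reference, a hedged sketch of the two GEMM paths being compared (this is not the actual code in the cuda-f16f32 branch; the helper names and the column-major layout are illustrative assumptions):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Variant A: keep the F16 inputs and let cuBLAS accumulate in F16.
// C = A * B, all matrices column-major on the device.
static cublasStatus_t gemm_f16(cublasHandle_t handle,
                               const __half *dA, const __half *dB, __half *dC,
                               int m, int n, int k) {
    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);
    return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha,
                        dA, CUDA_R_16F, m,
                        dB, CUDA_R_16F, k,
                        &beta,
                        dC, CUDA_R_16F, m,
                        CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
}

// Variant B: convert the F16 operand to F32 beforehand and run a plain F32 GEMM.
static cublasStatus_t gemm_f32(cublasHandle_t handle,
                               const float *dA, const float *dB, float *dC,
                               int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    return cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                       &alpha, dA, m, dB, k, &beta, dC, m);
}

Which of the two wins would then mostly depend on the relative FP16/FP32 throughput of the particular GPU.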

ggerganov (Owner, Author) commented

Thanks - it seems the problem is somehow specific to the GeForce GTX 1660.
I ran the same test on a GeForce RTX 4080 and there is a significant improvement.
Also, whisper.cpp is much faster with cuBLAS.

I think the NVBLAS test that I did before was on a GeForce RTX 2060.
