
No cuBLAS performance gain for F16 #1249

Closed
ggerganov opened this issue Apr 30, 2023 · 3 comments
Labels: question (Further information is requested)

ggerganov (Owner) commented Apr 30, 2023

I noticed that using cuBLAS with the F16 model does not give any benefit compared to non-BLAS CPU-only mode:

# with cuBLAS
$ make clean && LLAMA_CUBLAS=1 make -j && time ./perplexity -m ./models/7B/ggml-model-f16.bin -f ./build/wiki.test.raw --no-mmap -t 12
....
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
14.33 seconds per pass - ETA 2 hours 36 minutes
[1]4.2336,^C^C

# without BLAS
$ make clean && make -j && time ./perplexity -m ./models/7B/ggml-model-f16.bin -f ./build/wiki.test.raw --no-mmap -t 12
...
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
13.75 seconds per pass - ETA 2 hours 30 minutes
[1]4.2335,^C

System:

  • GeForce GTX 1660
  • AMD Ryzen 9 5950X

In contrast, when using a quantized model, the cuBLAS run is significantly faster.

Is this expected?
I was hoping for some performance improvement for F16 as well.
Maybe the data transfer for F16 is so slow that it defeats the purpose of offloading to the GPU?
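
As a rough way to check that hypothesis, here is a minimal stand-alone CUDA sketch (not part of llama.cpp; the 4096x4096 F16 matrix size is just an illustrative assumption) that times a single host-to-device copy of one weight-sized buffer, to estimate how much of a pass PCIe transfers alone could account for:

#include <cstdio>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

int main() {
    // One F16 weight matrix of an assumed 4096x4096 size (~32 MiB).
    const size_t rows = 4096, cols = 4096;
    const size_t bytes = rows * cols * sizeof(uint16_t);   // each element is a 16-bit half

    std::vector<uint16_t> host(rows * cols);                // pageable host buffer
    void *dev = nullptr;
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("H2D copy of %.1f MiB took %.2f ms (%.1f GB/s)\n",
                bytes / (1024.0 * 1024.0), ms, (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return 0;
}

Multiplying the measured copy time by the number of weight matrices uploaded per batch gives a rough lower bound on the per-pass transfer overhead.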

I noticed this after porting the latest ggml to whisper.cpp, where we use F16 precision, and was surprised that cuBLAS does not bring any improvement.

For example, some time ago I tried using NVBLAS in whisper.cpp and it did bring some decent improvements: ggerganov/whisper.cpp#220 (comment)

The NVBLAS code change was trivial: ggerganov/whisper.cpp#239
What could NVBLAS be doing better in this case?

ggerganov added the question label on Apr 30, 2023
slaren (Collaborator) commented Apr 30, 2023

With an RTX 3080:

F16 used to be the fastest before dequantization on the GPU was implemented: #1044

With the current master, it is still faster than it was originally, so I don't think that there has been a regression:
3.50 seconds per pass - ETA 38 minutes

I don't know why this isn't the case with your GTX 1660. From what I could find, it is a Turing chip that can do FP16.

slaren (Collaborator) commented Apr 30, 2023

I have been experimenting with doing the f16xf32 mat muls in f32 (instead of f16, as is currently done) in https://github.com/slaren/llama.cpp/commits/cuda-f16f32

For me this is faster with quantized models but slower with F16 models; maybe the results are different with your GPU.
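
For reference, a hedged sketch of the two GEMM paths being compared (this is not the actual code in the cuda-f16f32 branch; the helper names and the column-major layout are illustrative assumptions):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Variant A: keep the F16 inputs and let cuBLAS accumulate in F16.
// C = A * B, all matrices column-major on the device.
static cublasStatus_t gemm_f16(cublasHandle_t handle,
                               const __half *dA, const __half *dB, __half *dC,
                               int m, int n, int k) {
    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);
    return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha,
                        dA, CUDA_R_16F, m,
                        dB, CUDA_R_16F, k,
                        &beta,
                        dC, CUDA_R_16F, m,
                        CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
}

// Variant B: convert the F16 operand to F32 beforehand and run a plain F32 GEMM.
static cublasStatus_t gemm_f32(cublasHandle_t handle,
                               const float *dA, const float *dB, float *dC,
                               int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    return cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                       &alpha, dA, m, dB, k, &beta, dC, m);
}

Which of the two wins would then mostly depend on the relative FP16/FP32 throughput of the particular GPU.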

ggerganov (Owner, Author) commented

Thanks - it seems the problem is somehow specific to the GeForce GTX 1660.
I ran the same test on a GeForce RTX 4080 and there is a significant improvement.
Also, whisper.cpp is much faster with cuBLAS.

I think the NVBLAS test that I did before was on a GeForce RTX 2060.
