Improve cuBLAS performance by dequantizing on the GPU #1065

Merged · 4 commits · Apr 20, 2023

Conversation

@slaren (Collaborator) commented Apr 19, 2023

For me this makes cuBLAS about twice as fast with quantized models.

Perplexity (seconds per pass):

| Model | PR   | Master |
|-------|------|--------|
| q4_0  | 5.05 | 8.62   |
| q4_1  | 5.37 | 8.59   |
| q4_2  | 4.99 | 10.76  |

Prompt eval time with 7B q4_0 (bs=512)

cuBLAS (PR):     prompt eval time =  7840.48 ms /   631 tokens (   12.43 ms per token)
cuBLAS (Master): prompt eval time = 15457.33 ms /   631 tokens (   24.50 ms per token)
OpenBLAS:        prompt eval time = 34856.06 ms /   631 tokens (   55.24 ms per token)
No BLAS:         prompt eval time = 43549.67 ms /   631 tokens (   69.02 ms per token)

13B q4_0

cuBLAS (PR):     prompt eval time = 13826.48 ms /   631 tokens (   21.91 ms per token)
cuBLAS (Master): prompt eval time = 27987.82 ms /   631 tokens (   44.35 ms per token)
OpenBLAS:        prompt eval time = 61476.58 ms /   631 tokens (   97.43 ms per token)
No BLAS:         prompt eval time = 81645.43 ms /   631 tokens (  129.39 ms per token)
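The core change, in rough terms: instead of dequantizing to FP32 on the CPU and uploading the full FP32 matrix, the much smaller quantized blocks are uploaded and dequantized on the device right before the cuBLAS GEMM. A minimal sketch of the per-matrix flow (the function name, signatures, and leading dimensions are illustrative assumptions, not the PR's exact code):

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Device-side dequantizer added by this PR; the declaration is assumed from
// the call site in the snippet reviewed further down.
void dequantize_row_q_cuda(const void * q, float * dst, int k, cudaStream_t stream);

// Sketch of the new path for one matrix multiplication (error checking and
// buffer management omitted). q_src0 is the quantized weight matrix
// (ne00 x ne01); d_Y holds the FP32 activations (ne10 x ne11).
void mul_mat_q_cublas_sketch(cublasHandle_t handle, cudaStream_t stream,
                             const void * q_src0, size_t q_size,
                             void * d_Q, float * d_X, const float * d_Y, float * d_D,
                             int ne00, int ne01, int ne10, int ne11) {
    const float alpha = 1.0f, beta = 0.0f;

    // 1. upload the quantized blocks (several times smaller than the FP32 matrix)
    cudaMemcpyAsync(d_Q, q_src0, q_size, cudaMemcpyHostToDevice, stream);

    // 2. expand them to FP32 directly on the device
    dequantize_row_q_cuda(d_Q, d_X, ne01 * ne00, stream);

    // 3. run the GEMM on the device-resident FP32 copies
    cublasSetStream(handle, stream);
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                ne01, ne11, ne10,
                &alpha, d_X, ne00,
                        d_Y, ne10,
                &beta,  d_D, ne01);
}
```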

@@ -150,6 +150,10 @@ if (LLAMA_CUBLAS)
    if (CUDAToolkit_FOUND)
        message(STATUS "cuBLAS found")

        enable_language(CUDA)

        set(GGML_CUDA_SOURCES ggml-cuda.cu ggml-cuda.h)
A collaborator commented on the diff:

There was a discussion somewhere recently about splitting out the accelerator-specific code into dedicated .c files. What was the state of that?

@slaren (Collaborator, Author) replied:

I tried to keep all the CUDA code in ggml-cuda.cu to avoid having to compile ggml with nvcc, but otherwise nothing changed.
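For context, the usual way to keep the C core away from nvcc is to expose the .cu file through a plain C header with extern "C" linkage, so ggml.c only sees ordinary function declarations and only ggml-cuda.cu (the GGML_CUDA_SOURCES entry in the diff above) is compiled by nvcc. A hedged sketch of that boundary; ggml_cuda_dequantize is a placeholder name, not the PR's actual API:

```c
// ggml-cuda.h -- plain C interface, safe to include from ggml.c
#pragma once
#include <cuda_runtime.h>   // the CUDA runtime API headers are C-compatible

#ifdef __cplusplus
extern "C" {
#endif

// placeholder declaration: dequantize n quantized values into FP32 on the device
void ggml_cuda_dequantize(const void * q, float * dst, int n, cudaStream_t stream);

#ifdef __cplusplus
}
#endif
```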

@ggerganov (Owner) left a comment:

Good stuff.

Just keep in mind that my long-term plan for GPU support is different, and these changes will likely be dropped at some point.

Can you also provide sample times for prompt ingestion with and without cuBLAS? Maybe with one of the chat examples that we have in the repo.

@slaren (Collaborator, Author) commented Apr 19, 2023

I just added some prompt eval times for 7B q4_0.

@glinscott (Collaborator) commented:

Wow, this is a game changer! Interestingly, 16 threads and 8 threads seem to be the same speed now. It only uses ~600 MB of GPU RAM (RTX 3080), and GPU utilization is around 65%. Amazing work!

All tests run with: $ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -c 512 -t <N>

w/ cuBLAS (this PR):

1 thread  - 5.12 seconds per pass - ETA 0.93 hours
2 threads - 3.82 seconds per pass - ETA 0.70 hours
4 threads - 3.22 seconds per pass - ETA 0.59 hours
8 threads - 2.93 seconds per pass - ETA 0.53 hours
16 threads - 2.82 seconds per pass - ETA 0.51 hours

Without CUDA:

8 threads  - 16.57 seconds per pass - ETA 3.01 hours
16 threads - 11.46 seconds per pass - ETA 2.08 hours

@glinscott (Collaborator) commented:

Even more incredible, this allows me to run the full 65B model on a machine with 32GB of RAM quite quickly!

7B - 2.93 seconds per pass - ETA 0.53 hours
13B - 4.86 seconds per pass - ETA 0.88 hours
30B - 10.99 seconds per pass - ETA 2.00 hours
65B - 37.98 seconds per pass - ETA 6.91 hours

Btw, for comparison, from last month 7B was at 24.58 seconds per pass - ETA 4.47 hours!

@slaren (Collaborator, Author) commented Apr 19, 2023

Nice! Don't use this just yet to run perplexity computations, though; I found a synchronization issue that may cause inaccurate results. It should be fixed in the last commit. I am running a full perplexity test, and if it looks good this will be ready to merge.

@Green-Sky (Collaborator) commented:

> Even more incredible, this allows me to run the full 65B model on a machine with 32GB of RAM quite quickly!

Wait... how does that work? Aren't you supposed to need ~60 GB of RAM for 65B?

@glinscott (Collaborator) commented Apr 19, 2023

> Even more incredible, this allows me to run the full 65B model on a machine with 32GB of RAM quite quickly!

> Wait... how does that work? Aren't you supposed to need ~60 GB of RAM for 65B?

I'm using mmap mode, so it has to go to disk to read parts of the model as it goes. That was brutally slow previously, but overlapping the reads with the GPU work seems to make it feasible now.

cuBLAS - 37.98 seconds per pass - ETA 6.91 hours
CPU - 109.73 seconds per pass - ETA 19.96 hours

Actually, even on CPU it's much better than it used to be. Everyone doing amazing work here :).
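For context on why this fits in 32 GB at all: with mmap the weights are mapped read-only and paged in from disk on demand, so resident memory can stay well below the file size. A minimal POSIX sketch of the idea, not llama.cpp's actual loader:

```c
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a model file read-only; pages are faulted in from disk as they are
// touched, so only the parts of the model currently in use stay resident.
void * map_model(const char * path, size_t * size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }

    void * addr = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return NULL;

    *size_out = (size_t) st.st_size;
    return addr;
}
```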

@slaren (Collaborator, Author) commented Apr 19, 2023

We could probably get another 10% or so speedup by pre-allocating the CUDA memory, but I am not sure how to do that without littering the ggml code with more CUDA-specific stuff.
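For illustration, one common approach (not what this PR does; cuda_buffer_alloc is a hypothetical helper) is a grow-only scratch buffer that is allocated once and reused across mat-muls, keeping cudaMalloc/cudaFree out of the per-layer hot path:

```c
#include <cuda_runtime.h>

// Hypothetical grow-only device scratch buffer, reused across calls.
static void * g_scratch      = NULL;
static size_t g_scratch_size = 0;

static void * cuda_buffer_alloc(size_t size) {
    if (size > g_scratch_size) {
        if (g_scratch != NULL) {
            cudaFree(g_scratch);
        }
        cudaMalloc(&g_scratch, size);   // reallocate only when a larger buffer is needed
        g_scratch_size = size;
    }
    return g_scratch;
}
```

The catch, as noted above, is that this kind of state has to live somewhere without leaking CUDA details into the generic ggml code.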

@slaren (Collaborator, Author) commented Apr 19, 2023

On a side note, should we increase the default batch size when ggml is built with BLAS support? It would make it easier to use.

@avada-z commented Apr 19, 2023

A problem while building for Windows using Visual Studio:

FAILED: CMakeFiles/ggml.dir/ggml-cuda.cu.obj
nvcc.exe -forward-unknown-to-host-compiler -DGGML_USE_CUBLAS -D_CRT_SECURE_NO_WARNINGS -I..\..\..\. -isystem="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3\include" -D_WINDOWS -Xcompiler=" /GR /EHsc" -Xcompiler="-MD -Zi -O2 -Ob1" -DNDEBUG /arch:AVX2 -MD -MT CMakeFiles\ggml.dir\ggml-cuda.cu.obj -MF CMakeFiles\ggml.dir\ggml-cuda.cu.obj.d -x cu -c ..\..\..\ggml-cuda.cu -o CMakeFiles\ggml.dir\ggml-cuda.cu.obj -Xcompiler=-FdCMakeFiles\ggml.dir\,-FS
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified

@ghost commented Apr 19, 2023

> On a side note, should we increase the default batch size when ggml is built with BLAS support? It would make it easier to use.

I believe ggml doesn't even use BLAS if the batch size isn't large enough, despite system_info reporting BLAS=1. You need a larger batch size to cover the overhead of using the library. Personally, I haven't seen any performance difference between BLAS runs with, say, a batch size of 512 vs 2048.

@slaren (Collaborator, Author) commented Apr 19, 2023

> I believe ggml doesn't even use BLAS if the batch size isn't large enough, despite system_info reporting BLAS=1.

That's right: the default batch size is 8, but the minimum to use BLAS is 32.

> Personally, I haven't seen any performance difference between BLAS runs with, say, a batch size of 512 vs 2048.

Currently the maximum batch size is 512; if you try to use a larger one it will be clamped to 512.
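For reference, the check that gates the BLAS path in ggml looks roughly like this (a simplified sketch; the exact form in ggml.c should be treated as an assumption):

```c
#include <stdbool.h>
#include "ggml.h"

// BLAS is only worth it when the operands are contiguous and the matrices are
// large enough (in particular a batch of at least 32 tokens) to amortize the
// dequantization and library-call overhead.
static bool mul_mat_use_blas(const struct ggml_tensor * src0,
                             const struct ggml_tensor * src1,
                             const struct ggml_tensor * dst) {
    const int ne10 = src1->ne[0];
    const int ne0  = dst->ne[0];
    const int ne1  = dst->ne[1];   // batch dimension: number of tokens being processed

    return ggml_is_contiguous(src0) && ggml_is_contiguous(src1) &&
           ne0 >= 32 && ne1 >= 32 && ne10 >= 32;
}
```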

@slaren (Collaborator, Author) commented Apr 19, 2023

@avada-z I think it should be fixed now.

@SlyEcho (Sponsor, Collaborator) commented Apr 20, 2023

I ported this to HIP and hipBLAS.

The dequant on the device is nothing earth-shattering; it is still dominated by the memcpy. But I have an older card and only PCIe 3.0.


@slaren merged commit 02d6988 into ggerganov:master on Apr 20, 2023.
@slaren deleted the cuda-dq branch on Apr 20, 2023 at 01:14.
@Dampfinchen commented:

Hello there. I'm trying to build this using the make LLAMA_CUBLAS=1 command on Windows with WIN64DevKit. However, even though I have the CUDA Toolkit installed and changed the paths for -L and -I in the Makefile accordingly, it still reports the following libraries as missing:
-lcublas_static
-lculibos
-lcublasLt_static
-ldl

Where can I get them? I would appreciate some help getting this to work. Thank you!

// copy the quantized blocks to the device, then dequantize there before the GEMM
CUDA_CHECK(
    cudaMemcpyAsync(d_Q, (char *) src0->data + i03*nb03 + i02*nb02,
                    GGML_TYPE_SIZE[type] * x_ne / GGML_BLCK_SIZE[type], cudaMemcpyHostToDevice, cudaStream));

dequantize_row_q_cuda(d_Q, d_X, ne01 * ne00, cudaStream);
@jon-chuang (Contributor) commented on this snippet, Apr 26, 2023:

I believe we might be able to get better perf via CUDA graphs by stitching together dequantize, sgemm, and quantize. Thoughts? See #1192
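For illustration, a hedged sketch of that idea: capture the dequantize kernel and the SGEMM into a CUDA graph with stream capture, then replay the whole chain with a single launch per layer. Kernel, function, and buffer names are placeholders, not llama.cpp code:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Placeholder for the real dequantize kernels in ggml-cuda.cu.
__global__ void dequantize_blocks(const float * q, float * x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = q[i];   // stand-in: the real kernels expand 4-bit blocks to FP32
}

// Record dequantize + SGEMM into a graph once, then launch the graph instead
// of re-issuing the individual calls.
void run_dequant_gemm_graph(cublasHandle_t handle, cudaStream_t stream,
                            const float * d_Q, float * d_X, const float * d_Y, float * d_D,
                            int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaGraph_t     graph;
    cudaGraphExec_t graph_exec;

    cublasSetStream(handle, stream);

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    dequantize_blocks<<<(m * k + 255) / 256, 256, 0, stream>>>(d_Q, d_X, m * k);
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                &alpha, d_X, k, d_Y, k, &beta, d_D, m);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graph_exec, graph, NULL, NULL, 0);
    cudaGraphLaunch(graph_exec, stream);
}
```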

@jon-chuang (Contributor) commented:

> Still dominated by the memcpy

If the weights are stored in the device HBM/DRAM, I suspect we can get much better perf than copying the weights each time.
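A hedged sketch of that idea (get_device_weights is hypothetical, not part of this PR): cache each tensor's quantized data on the device the first time it is used, so later tokens skip the host-to-device copy entirely:

```c
#include <cuda_runtime.h>

// Hypothetical per-tensor cache: upload the quantized weights once and return
// the device-resident copy on every later call.
struct weight_cache_entry {
    const void * host_ptr;   // key: the tensor's data pointer on the host
    void *       dev_ptr;    // device copy of the quantized blocks
};

#define MAX_CACHED_WEIGHTS 1024
static struct weight_cache_entry g_cache[MAX_CACHED_WEIGHTS];
static int g_cache_count = 0;

static void * get_device_weights(const void * host_ptr, size_t size, cudaStream_t stream) {
    for (int i = 0; i < g_cache_count; ++i) {
        if (g_cache[i].host_ptr == host_ptr) {
            return g_cache[i].dev_ptr;   // already resident: no copy needed
        }
    }
    void * dev_ptr = NULL;
    cudaMalloc(&dev_ptr, size);
    cudaMemcpyAsync(dev_ptr, host_ptr, size, cudaMemcpyHostToDevice, stream);
    g_cache[g_cache_count].host_ptr = host_ptr;
    g_cache[g_cache_count].dev_ptr  = dev_ptr;
    g_cache_count++;
    return dev_ptr;
}
```

The trade-off is VRAM: keeping every quantized tensor resident costs roughly the quantized model size on the GPU.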
