Improve cuBLAS performance by dequantizing on the GPU #1065

Merged · 4 commits · Apr 20, 2023

Conversation

@slaren (Collaborator) commented Apr 19, 2023

For me this makes cuBLAS about twice as fast with quantized models.

Perplexity (seconds per pass):

| Model | PR   | Master |
|-------|------|--------|
| q4_0  | 5.05 | 8.62   |
| q4_1  | 5.37 | 8.59   |
| q4_2  | 4.99 | 10.76  |

Prompt eval time with 7B q4_0 (bs=512)

cuBLAS (PR):     prompt eval time =  7840.48 ms /   631 tokens (   12.43 ms per token)
cuBLAS (Master): prompt eval time = 15457.33 ms /   631 tokens (   24.50 ms per token)
OpenBLAS:        prompt eval time = 34856.06 ms /   631 tokens (   55.24 ms per token)
No BLAS:         prompt eval time = 43549.67 ms /   631 tokens (   69.02 ms per token)

13B q4_0

cuBLAS (PR):     prompt eval time = 13826.48 ms /   631 tokens (   21.91 ms per token)
cuBLAS (Master): prompt eval time = 27987.82 ms /   631 tokens (   44.35 ms per token)
OpenBLAS:        prompt eval time = 61476.58 ms /   631 tokens (   97.43 ms per token)
No BLAS:         prompt eval time = 81645.43 ms /   631 tokens (  129.39 ms per token)
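The core change, in rough terms: instead of dequantizing to FP32 on the CPU and uploading the full FP32 matrix, the much smaller quantized blocks are uploaded and dequantized on the device right before the cuBLAS GEMM. A minimal sketch of the per-matrix flow (the function name, signatures, and leading dimensions are illustrative assumptions, not the PR's exact code):

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Device-side dequantizer added by this PR; the declaration is assumed from
// the call site in the snippet reviewed further down.
void dequantize_row_q_cuda(const void * q, float * dst, int k, cudaStream_t stream);

// Sketch of the new path for one matrix multiplication (error checking and
// buffer management omitted). q_src0 is the quantized weight matrix
// (ne00 x ne01); d_Y holds the FP32 activations (ne10 x ne11).
void mul_mat_q_cublas_sketch(cublasHandle_t handle, cudaStream_t stream,
                             const void * q_src0, size_t q_size,
                             void * d_Q, float * d_X, const float * d_Y, float * d_D,
                             int ne00, int ne01, int ne10, int ne11) {
    const float alpha = 1.0f, beta = 0.0f;

    // 1. upload the quantized blocks (several times smaller than the FP32 matrix)
    cudaMemcpyAsync(d_Q, q_src0, q_size, cudaMemcpyHostToDevice, stream);

    // 2. expand them to FP32 directly on the device
    dequantize_row_q_cuda(d_Q, d_X, ne01 * ne00, stream);

    // 3. run the GEMM on the device-resident FP32 copies
    cublasSetStream(handle, stream);
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                ne01, ne11, ne10,
                &alpha, d_X, ne00,
                        d_Y, ne10,
                &beta,  d_D, ne01);
}
```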

@@ -150,6 +150,10 @@ if (LLAMA_CUBLAS)
    if (CUDAToolkit_FOUND)
        message(STATUS "cuBLAS found")

        enable_language(CUDA)

        set(GGML_CUDA_SOURCES ggml-cuda.cu ggml-cuda.h)
A collaborator commented on the diff:

There was a discussion somewhere recently about splitting out the accelerator-specific code into dedicated .c files. What was the state of that?

@slaren (Collaborator, Author) replied:

I tried to keep all the CUDA code in ggml-cuda.cu to avoid having to compile ggml with nvcc, but otherwise nothing changed.
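For context, the usual way to keep the C core away from nvcc is to expose the .cu file through a plain C header with extern "C" linkage, so ggml.c only sees ordinary function declarations and only ggml-cuda.cu (the GGML_CUDA_SOURCES entry in the diff above) is compiled by nvcc. A hedged sketch of that boundary; ggml_cuda_dequantize is a placeholder name, not the PR's actual API:

```c
// ggml-cuda.h -- plain C interface, safe to include from ggml.c
#pragma once
#include <cuda_runtime.h>   // the CUDA runtime API headers are C-compatible

#ifdef __cplusplus
extern "C" {
#endif

// placeholder declaration: dequantize n quantized values into FP32 on the device
void ggml_cuda_dequantize(const void * q, float * dst, int n, cudaStream_t stream);

#ifdef __cplusplus
}
#endif
```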

@ggerganov (Owner) left a comment:

Good stuff.

Just keep in mind that my long-term plan for GPU support is different, and these changes will likely be dropped at some point.

Can you also provide sample times for prompt ingestion with and without cuBLAS? Maybe with one of the chat examples that we have in the repo.

@slaren (Collaborator, Author) commented Apr 19, 2023

I just added some prompt eval times for 7B q4_0.

@glinscott (Collaborator) commented:

Wow, this is a game changer! Interestingly, 16 threads and 8 threads seem to be the same speed now. It only uses ~600 MB of GPU RAM (RTX 3080), and GPU utilization is around 65%. Amazing work!

All tests run with: $ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -c 512 -t <N>

w/ cuBLAS (this PR):

1 thread  - 5.12 seconds per pass - ETA 0.93 hours
2 threads - 3.82 seconds per pass - ETA 0.70 hours
4 threads - 3.22 seconds per pass - ETA 0.59 hours
8 threads - 2.93 seconds per pass - ETA 0.53 hours
16 threads - 2.82 seconds per pass - ETA 0.51 hours

Without CUDA:

8 threads  - 16.57 seconds per pass - ETA 3.01 hours
16 threads - 11.46 seconds per pass - ETA 2.08 hours

@glinscott (Collaborator) commented:

Even more incredible, this allows me to run the full 65B model on a machine with 32GB of RAM quite quickly!

7B - 2.93 seconds per pass - ETA 0.53 hours
13B - 4.86 seconds per pass - ETA 0.88 hours
30B - 10.99 seconds per pass - ETA 2.00 hours
65B - 37.98 seconds per pass - ETA 6.91 hours

Btw, for comparison, from last month 7B was at 24.58 seconds per pass - ETA 4.47 hours!

@slaren (Collaborator, Author) commented Apr 19, 2023

Nice! Don't use this just yet to run perplexity computations, though; I found a synchronization issue that may cause inaccurate results. It should be fixed in the last commit. I am running a full perplexity test, and if it looks good this will be ready to merge.

@Green-Sky (Collaborator) commented:

> Even more incredible, this allows me to run the full 65B model on a machine with 32GB of RAM quite quickly!

Wait... how does that work? Aren't you supposed to need ~60 GB of RAM for 65B?

@glinscott (Collaborator) commented Apr 19, 2023

> Even more incredible, this allows me to run the full 65B model on a machine with 32GB of RAM quite quickly!

> Wait... how does that work? Aren't you supposed to need ~60 GB of RAM for 65B?

I'm using mmap mode, so it has to go to disk to read parts of the model as it goes. That was brutally slow previously, but overlapping the reads with the GPU work seems to make it feasible now.

cuBLAS - 37.98 seconds per pass - ETA 6.91 hours
CPU - 109.73 seconds per pass - ETA 19.96 hours

Actually, even on CPU it's much better than it used to be. Everyone doing amazing work here :).
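For context on why this fits in 32 GB at all: with mmap the weights are mapped read-only and paged in from disk on demand, so resident memory can stay well below the file size. A minimal POSIX sketch of the idea, not llama.cpp's actual loader:

```c
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a model file read-only; pages are faulted in from disk as they are
// touched, so only the parts of the model currently in use stay resident.
void * map_model(const char * path, size_t * size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }

    void * addr = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return NULL;

    *size_out = (size_t) st.st_size;
    return addr;
}
```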

@slaren (Collaborator, Author) commented Apr 19, 2023

We could probably get another 10% or so speedup by pre-allocating the CUDA memory, but I am not sure how to do that without littering the ggml code with more CUDA-specific stuff.
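For illustration, one common approach (not what this PR does; cuda_buffer_alloc is a hypothetical helper) is a grow-only scratch buffer that is allocated once and reused across mat-muls, keeping cudaMalloc/cudaFree out of the per-layer hot path:

```c
#include <cuda_runtime.h>

// Hypothetical grow-only device scratch buffer, reused across calls.
static void * g_scratch      = NULL;
static size_t g_scratch_size = 0;

static void * cuda_buffer_alloc(size_t size) {
    if (size > g_scratch_size) {
        if (g_scratch != NULL) {
            cudaFree(g_scratch);
        }
        cudaMalloc(&g_scratch, size);   // reallocate only when a larger buffer is needed
        g_scratch_size = size;
    }
    return g_scratch;
}
```

The catch, as noted above, is that this kind of state has to live somewhere without leaking CUDA details into the generic ggml code.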

@slaren (Collaborator, Author) commented Apr 19, 2023

On a side note, should we increase the default batch size when ggml is built with BLAS support? It would make it easier to use.

@avada-z commented Apr 19, 2023

A problem while building for Windows using Visual Studio:

FAILED: CMakeFiles/ggml.dir/ggml-cuda.cu.obj
nvcc.exe -forward-unknown-to-host-compiler -DGGML_USE_CUBLAS -D_CRT_SECURE_NO_WARNINGS -I..\..\..\. -isystem="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3\include" -D_WINDOWS -Xcompiler=" /GR /EHsc" -Xcompiler="-MD -Zi -O2 -Ob1" -DNDEBUG /arch:AVX2 -MD -MT CMakeFiles\ggml.dir\ggml-cuda.cu.obj -MF CMakeFiles\ggml.dir\ggml-cuda.cu.obj.d -x cu -c ..\..\..\ggml-cuda.cu -o CMakeFiles\ggml.dir\ggml-cuda.cu.obj -Xcompiler=-FdCMakeFiles\ggml.dir\,-FS
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified

@ghost commented Apr 19, 2023

> On a side note, should we increase the default batch size when ggml is built with BLAS support? It would make it easier to use.

I believe ggml doesn't even use BLAS if the batch size isn't large enough, despite system_info reporting BLAS=1. You need a larger batch size to cover the overhead of using the library. Personally, I haven't seen any performance difference between BLAS runs with, say, a batch size of 512 vs 2048.

@slaren (Collaborator, Author) commented Apr 19, 2023

> I believe ggml doesn't even use BLAS if the batch size isn't large enough, despite system_info reporting BLAS=1.

That's right: the default batch size is 8, but the minimum to use BLAS is 32.

> Personally, I haven't seen any performance difference between BLAS runs with, say, a batch size of 512 vs 2048.

Currently the maximum batch size is 512; if you try to use a larger one it will be clamped to 512.
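For reference, the check that gates the BLAS path in ggml looks roughly like this (a simplified sketch; the exact form in ggml.c should be treated as an assumption):

```c
#include <stdbool.h>
#include "ggml.h"

// BLAS is only worth it when the operands are contiguous and the matrices are
// large enough (in particular a batch of at least 32 tokens) to amortize the
// dequantization and library-call overhead.
static bool mul_mat_use_blas(const struct ggml_tensor * src0,
                             const struct ggml_tensor * src1,
                             const struct ggml_tensor * dst) {
    const int ne10 = src1->ne[0];
    const int ne0  = dst->ne[0];
    const int ne1  = dst->ne[1];   // batch dimension: number of tokens being processed

    return ggml_is_contiguous(src0) && ggml_is_contiguous(src1) &&
           ne0 >= 32 && ne1 >= 32 && ne10 >= 32;
}
```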

@slaren (Collaborator, Author) commented Apr 19, 2023

@avada-z I think it should be fixed now.

@SlyEcho (Sponsor, Collaborator) commented Apr 20, 2023

I ported this to HIP and hipBLAS.

The dequant on the device is nothing earth-shattering; it is still dominated by the memcpy. But I have an older card and only PCIe 3.0.


@slaren merged commit 02d6988 into ggerganov:master on Apr 20, 2023.
@slaren deleted the cuda-dq branch on Apr 20, 2023 at 01:14.
@Dampfinchen commented:

Hello there. I'm trying to build this using the make LLAMA_CUBLAS=1 command on Windows with WIN64DevKit. However, even though I have the CUDA Toolkit installed and changed the paths for -L and -I in the Makefile accordingly, it still reports the following libraries as missing:
-lcublas_static
-lculibos
-lcublasLt_static
-ldl

Where can I get them? I would appreciate some help getting this to work. Thank you!

// copy the quantized blocks to the device, then dequantize there before the GEMM
CUDA_CHECK(
    cudaMemcpyAsync(d_Q, (char *) src0->data + i03*nb03 + i02*nb02,
                    GGML_TYPE_SIZE[type] * x_ne / GGML_BLCK_SIZE[type], cudaMemcpyHostToDevice, cudaStream));

dequantize_row_q_cuda(d_Q, d_X, ne01 * ne00, cudaStream);
@jon-chuang (Contributor) commented on this snippet, Apr 26, 2023:

I believe we might be able to get better perf via CUDA graphs by stitching together dequantize, sgemm, and quantize. Thoughts? See #1192
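For illustration, a hedged sketch of that idea: capture the dequantize kernel and the SGEMM into a CUDA graph with stream capture, then replay the whole chain with a single launch per layer. Kernel, function, and buffer names are placeholders, not llama.cpp code:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Placeholder for the real dequantize kernels in ggml-cuda.cu.
__global__ void dequantize_blocks(const float * q, float * x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = q[i];   // stand-in: the real kernels expand 4-bit blocks to FP32
}

// Record dequantize + SGEMM into a graph once, then launch the graph instead
// of re-issuing the individual calls.
void run_dequant_gemm_graph(cublasHandle_t handle, cudaStream_t stream,
                            const float * d_Q, float * d_X, const float * d_Y, float * d_D,
                            int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaGraph_t     graph;
    cudaGraphExec_t graph_exec;

    cublasSetStream(handle, stream);

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    dequantize_blocks<<<(m * k + 255) / 256, 256, 0, stream>>>(d_Q, d_X, m * k);
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                &alpha, d_X, k, d_Y, k, &beta, d_D, m);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graph_exec, graph, NULL, NULL, 0);
    cudaGraphLaunch(graph_exec, stream);
}
```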

@jon-chuang (Contributor) commented:

> Still dominated by the memcpy

If the weights are stored in the device HBM/DRAM, I suspect we can get much better perf than copying the weights each time.
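A hedged sketch of that idea (get_device_weights is hypothetical, not part of this PR): cache each tensor's quantized data on the device the first time it is used, so later tokens skip the host-to-device copy entirely:

```c
#include <cuda_runtime.h>

// Hypothetical per-tensor cache: upload the quantized weights once and return
// the device-resident copy on every later call.
struct weight_cache_entry {
    const void * host_ptr;   // key: the tensor's data pointer on the host
    void *       dev_ptr;    // device copy of the quantized blocks
};

#define MAX_CACHED_WEIGHTS 1024
static struct weight_cache_entry g_cache[MAX_CACHED_WEIGHTS];
static int g_cache_count = 0;

static void * get_device_weights(const void * host_ptr, size_t size, cudaStream_t stream) {
    for (int i = 0; i < g_cache_count; ++i) {
        if (g_cache[i].host_ptr == host_ptr) {
            return g_cache[i].dev_ptr;   // already resident: no copy needed
        }
    }
    void * dev_ptr = NULL;
    cudaMalloc(&dev_ptr, size);
    cudaMemcpyAsync(dev_ptr, host_ptr, size, cudaMemcpyHostToDevice, stream);
    g_cache[g_cache_count].host_ptr = host_ptr;
    g_cache[g_cache_count].dev_ptr  = dev_ptr;
    g_cache_count++;
    return dev_ptr;
}
```

The trade-off is VRAM: keeping every quantized tensor resident costs roughly the quantized model size on the GPU.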
