Improve cuBLAS performance by using a memory pool #1094
Conversation
Very nice! On 30B …
In general, would you be interested in an implementation that uses cuda{Malloc,Free}Async() with a cudaMemPool_t (see this section of the CUDA programming guide) instead of a custom memory pool? I can try to come up with and benchmark an implementation based on that.

Pros:
- harder to get wrong, smaller code diff
- allows setting a memory limit based on the number of bytes, not the number of allocations
- potentially faster

Cons:
- CUDA-specific, harder to port to other GPU-aware BLAS libraries
- requires CUDA 11.2
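For reference, a minimal sketch of what the stream-ordered allocator variant might look like (this is not the PR's code; it reuses the repo's CUDA_CHECK macro, and the 1 GiB release threshold is an arbitrary example):

// Sketch only: stream-ordered allocations from the device's default memory pool
// (requires CUDA 11.2+). Freed blocks are cached by the pool and reused by later
// cudaMallocAsync() calls, so cudaMalloc()/cudaFree() are not hit on every mat mul.
static void example_cuda_pool_init(int device) {
    cudaMemPool_t pool;
    CUDA_CHECK(cudaDeviceGetDefaultMemPool(&pool, device));
    // keep up to 1 GiB of freed memory cached instead of returning it to the OS
    uint64_t release_threshold = 1024ull * 1024ull * 1024ull;
    CUDA_CHECK(cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &release_threshold));
}

static void * example_alloc(size_t size, cudaStream_t stream) {
    void * ptr = nullptr;
    CUDA_CHECK(cudaMallocAsync(&ptr, size, stream));   // ordered with respect to `stream`
    return ptr;
}

static void example_free(void * ptr, cudaStream_t stream) {
    CUDA_CHECK(cudaFreeAsync(ptr, stream));            // memory returns to the pool, not the OS
}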
ggml-cuda.cu
Outdated
if (std::atomic_compare_exchange_strong(&b->ptr, &p, (uintptr_t) ptr)) {
    b->size = size;
I think this introduces a race condition: another thread can observe a non-nullptr pointer with a wrong size. E.g., consider this execution history of two threads T0 and T1:

- T0 calls ggml_cuda_pool_malloc(LARGE_SIZE). The pool is empty, so T0 calls cudaMalloc(LARGE_SIZE) and gets the resulting pointer pLarge.
- T1 calls ggml_cuda_pool_malloc(SMALL_SIZE). The pool is again empty, so T1 calls cudaMalloc(SMALL_SIZE) and gets the resulting pointer pSmall.
- T0 calls ggml_cuda_pool_free(pLarge). cuda_buffer_pool[0]->ptr is NULL, so T0 makes an update: cuda_buffer_pool[0] = {.ptr = pLarge, .size = LARGE_SIZE}.
- T0 calls ggml_cuda_pool_malloc(LARGE_SIZE). cuda_buffer_pool[0]->size >= LARGE_SIZE && cuda_buffer_pool[0]->ptr != nullptr, so T0 makes an update (cuda_buffer_pool[0] = {.ptr = nullptr, .size = LARGE_SIZE}) and gets pLarge.
- HERE BE DRAGONS
- T1 calls ggml_cuda_pool_free(pSmall). cuda_buffer_pool[0]->ptr is nullptr, so T1 tries to make an update. It makes a successful CAS on line 188 (so that cuda_buffer_pool[0] = {.ptr = pSmall, .size = LARGE_SIZE}), but then gets preempted by the OS scheduler before it can set b->size = SMALL_SIZE.
- T0 calls ggml_cuda_pool_malloc(LARGE_SIZE). cuda_buffer_pool[0]->size >= LARGE_SIZE && cuda_buffer_pool[0]->ptr != nullptr, so it makes an (irrelevant) update and gets pSmall.

So now T0 thinks it has LARGE_SIZE bytes, while in fact it only has SMALL_SIZE bytes.
ggml-cuda.cu
Outdated
if (b->size >= size) {
    uintptr_t ptr = atomic_load(&b->ptr);
    if (ptr) {
        if (std::atomic_compare_exchange_strong(&b->ptr, &ptr, 0)) {
Doesn't this have an ABA problem? The scenario I'm thinking of goes like this:

- we get preempted just before the CAS and lose the race for ptr to another thread
- this other thread eventually frees ptr with cudaFree()
- some other thread calls cudaMalloc() with a smaller size and gets the same pointer as ptr (i.e., it is equal to it as an integer), then frees it into the same pool slot we are trying to use
- we wake up, do the CAS (which succeeds because the new pointer is equal to the old one as an integer) and start using the pointer with the wrong size

This is highly unlikely to happen in practice, but I think it is technically possible, unless CUDART never returns the same pointer twice from cudaMalloc().
@dfyz I think it is very unlikely that a general purpose allocator like that will be faster than this, but if you want to give it a try, please do. On the other hand, if it isn't compatible with HIP and #1087 is merged, we would have to write a custom allocator anyway.

Good points with the synchronization issues, this is always harder than it looks. In my tests llama doesn't make any concurrent mat muls with BLAS, so my thought here was mostly about making it future proof, but if that is too complicated we may as well replace it with a spin lock with zero performance impact right now.
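A spin lock around the pool could be as simple as the following sketch (not necessarily the exact code that ends up in the PR; the lock guards both the ptr and size fields so they are always updated together):

#include <atomic>

// Sketch of a scoped spin lock protecting the CUDA buffer pool. An uncontended
// lock/unlock is a single atomic RMW, so with no concurrent BLAS mat muls the
// overhead is negligible.
static std::atomic_flag g_cuda_pool_lock = ATOMIC_FLAG_INIT;

struct scoped_spin_lock {
    std::atomic_flag & lock;
    scoped_spin_lock(std::atomic_flag & lock) : lock(lock) {
        while (lock.test_and_set(std::memory_order_acquire)) {
            ; // spin until the holder releases the lock
        }
    }
    ~scoped_spin_lock() {
        lock.clear(std::memory_order_release);
    }
};

// usage inside ggml_cuda_pool_malloc / ggml_cuda_pool_free:
//     scoped_spin_lock lock(g_cuda_pool_lock);
//     ... read/update the {ptr, size} slots without any per-field atomics ...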
HIP supports the same async malloc/free operations and mempool stuff. Technically, it should be checked from the device's feature flags, but that would make our code more complicated. When I was first getting hipBLAS working, I also tried hipMallocAsync and it really didn't make any difference. When I look at the profile, hipMalloc/Free is a tiny part of the matmul operation, which is still dominated by hipMemcpy. Maybe if I try the changes in this PR it will reveal something different. I think the only real impact would be to keep the big weight matrices on the device permanently, as much as can fit.
Yeah, I agree a general purpose allocator would have no chance here. My hope here is just that the …

I don't know if it will work in practice, but it's a fun direction to explore (independently of this PR, of course). Could you please clarify which commands (and which hardware) you used for benchmarking? I guess that the output with …

I think a spinlock is a great idea here!

I'd need to take a look at the CUDA trace to confirm this, but I think that …

This makes sense, and I think the custom memory pool from this PR can be extended to handle this. Another fun direction to explore.
If you manage to write an efficient …

Anyway - great work as always!

Regarding the race condition: …

On that note - would there be any sort of benefit theoretically from parallel GPU BLAS calls? I've tried on CPU and it doesn't help.

Can we add some parameter that specifies how much VRAM you want to be used at maximum?
There could be a lot of benefit from parallel GPU BLAS calls. For example, we could use the GemmStridedBatched functions to compute the multiplication over the entire tensor instead of for-looping over the outer dimensions (but I think llama.cpp doesn't need it anyway?). But they also mention this: …
Currently, in the computation (on AMD at least; it could be that Nvidia is better at this) there is just one row of operations for one matmul calculation. @slaren, do you think it would be worth having multiple cudaStreams, so that some operations could happen at the same time as others?

(In OpenCL (e.g., with clBLAS), every operation gives you event objects that other operations can wait on, so the entire ggml compute graph could be posted to the GPU command stream, letting it sort out the dependencies and compute in parallel. But this is just my guess as to whether it actually works like that.)
Yes, I will replace the whole synchronization of the memory pool with a spin lock for now, just to make sure that it doesn't cause issues in the future.
I think so. Currently the GPU sits idle while waiting for the data, and the PCIe bus is unused while computing the gemm; with more threads it would be possible to continue copying data for the next mat mul while a gemm is being computed in a different thread. It will probably require some changes to the CUDA code, since we would need to use a different stream per thread, but if it is not very hard to change ggml to allow this, I think it is worth giving it a try.
This PR does increase VRAM usage. I didn't test with 65B, but with 30B it is still very low, below what any discrete GPU would have. If we implement multi-threaded BLAS mat muls that could change, though.
Yes, that's what I was thinking.
I think maybe the simplest thing is to have two streams, for …
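For illustration, overlapping the host-to-device copy for the next mat mul with the current gemm could look roughly like this (a sketch only; it assumes one copy stream and one compute stream, uses the repo's CUDA_CHECK/CUBLAS_CHECK macros, and none of the variable names are from the PR):

// Sketch: copy on one stream while computing on another, with an event so the
// compute stream waits only for the data it actually needs.
static void sgemm_overlapped(cublasHandle_t handle,
                             cudaStream_t copy_stream, cudaStream_t compute_stream,
                             cudaEvent_t copy_done,
                             const float * h_X, float * d_X, size_t x_bytes,
                             const float * d_Y, float * d_D,
                             int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    // upload the inputs of this mat mul on the copy stream
    CUDA_CHECK(cudaMemcpyAsync(d_X, h_X, x_bytes, cudaMemcpyHostToDevice, copy_stream));
    CUDA_CHECK(cudaEventRecord(copy_done, copy_stream));
    // the gemm waits only on its own copy, so the copy for the next mat mul can
    // already be enqueued on copy_stream while this gemm runs on compute_stream
    CUDA_CHECK(cudaStreamWaitEvent(compute_stream, copy_done, 0));
    CUBLAS_CHECK(cublasSetStream(handle, compute_stream));
    CUBLAS_CHECK(cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                             &alpha, d_X, k, d_Y, k, &beta, d_D, m));
}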
Here is an interesting article about how that could be done: https://siboehm.com/articles/22/CUDA-MMM
Yes, the perplexity times are just from running … The prompt eval times are just from running … Edit: also use …
General code cleanup
The synchronization issues should be resolved, and with a 30B q4_0 model I see a VRAM usage of ~2 GB, so I don't expect that it will be a problem even with 65B models on very low VRAM GPUs. We could do better if we were able to predict the allocations; currently a single pass will make these allocations: …
In principle, we could allocate only the memory for the (32000 x 6656) x (512 x 6656) mat mul and reuse these buffers for the smaller mat muls. For now this isn't an issue, but that may change if we implement multi-threaded mat muls.
@slaren I wonder if tensor cores could provide another huge speedup when optimizing for them. Judging by how GPU architectures with similar raw performance (one with and one without tensor cores) perform in llama's cuBLAS implementation right now, tensor cores are either not in use at all or used inefficiently.

I've found the following documents about that matter, maybe you can look into them: https://forums.developer.nvidia.com/t/turing-arch-int4-ops-with-tensor-cores/66656

Thanks for your great work. I'm looking forward to your next innovation for cuBLAS!
cuBLAS already uses tensor cores, see https://docs.nvidia.com/cuda/cublas/index.html#tensor-core-usage

Unfortunately, it seems that NVIDIA intends to deprecate INT4 support in tensor cores, and it has already been removed in sm_90 (Hopper / H100), but we may still benefit from it with RTX 20 to 40 series cards if we write our own mat mul kernel in the future.
As far as I can see (I'm running your changes from #1207), we only use cuBLAS for SGEMMs. TF32 tensor cores on Ampere and later (which can accelerate single-precision computations) are not used by default, since they result in reduced precision. You can test this by, for example, applying this patch:

diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 0c01863..d8f2d6e 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -360,6 +360,8 @@ void ggml_init_cublas() {
CUDA_CHECK(cudaStreamCreateWithFlags(&g_cudaStream2, cudaStreamNonBlocking));
CUDA_CHECK(cudaEventCreateWithFlags(&g_cudaEvent, cudaEventDisableTiming));
+ CUBLAS_CHECK(cublasSetMathMode(g_cublasH, CUBLAS_TF32_TENSOR_OP_MATH));
+
// configure logging to stdout
// CUBLAS_CHECK(cublasLoggerConfigure(1, 1, 0, NULL));
}

Here's what the first matmul in the transformer layer looks like without the patch applied (i.e., without tensor cores): [profiler screenshot]

And here's what it looks like with the patch applied (i.e., with tensor cores): [profiler screenshot]

Without tensor cores, the SGEMM takes much longer than the dequantize kernel. With tensor cores, it's the other way round.

I don't know if we should enable TF32 tensor cores, though. I might be measuring it wrong, but we are actually not bottlenecked by the SGEMM speed, so while tensor cores give an impressive speed-up, the overall prompt processing time stays largely the same.
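If the handle-wide cublasSetMathMode switch turns out to be too blunt, an alternative (not something this PR does) would be opting in per call with cublasGemmEx and the CUBLAS_COMPUTE_32F_FAST_TF32 compute type, so only the large SGEMMs use TF32 tensor cores; the names below are illustrative only:

// Sketch: per-call TF32 opt-in instead of a handle-wide math mode.
// A, B, C are FP32 device buffers; handle is the existing cuBLAS handle.
static cublasStatus_t sgemm_tf32(cublasHandle_t handle,
                                 int m, int n, int k,
                                 const float * A, const float * B, float * C) {
    const float alpha = 1.0f, beta = 0.0f;
    return cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                        &alpha,
                        A, CUDA_R_32F, k,
                        B, CUDA_R_32F, k,
                        &beta,
                        C, CUDA_R_32F, m,
                        CUBLAS_COMPUTE_32F_FAST_TF32,   // TF32 tensor cores for this call only
                        CUBLAS_GEMM_DEFAULT);
}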
Nice! I see a 5% overall speedup enabling tensor cores in my current testing branch. It's not much, but it may be worth running a perplexity test to see if the loss of precision is not too bad.
I guess the documentation is wrong here, or maybe the heuristic is failing to choose the fastest algorithm.
I found a partial solution to the non-contiguous matrices. If they are contiguous in 2D, they can be computed by Gemm (on all platforms) because we are looping over the higher dimensions. But this creates a lot of tiny multiplications, which I wonder could be solved by that StridedBatchedGemm.
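For reference, collapsing that loop over the higher dimensions into a single call could look roughly like the sketch below (not code from this PR; it assumes the 2D slices are laid out at a constant stride in device memory):

// Sketch: one strided-batched SGEMM instead of ne02*ne03 tiny SGEMMs.
// Each batch entry multiplies one (k x m) slice of A by one (k x n) slice of B.
static cublasStatus_t sgemm_batched_over_dims(cublasHandle_t handle,
                                              int m, int n, int k, int batch_count,
                                              const float * d_A, long long stride_A,
                                              const float * d_B, long long stride_B,
                                              float * d_C, long long stride_C) {
    const float alpha = 1.0f, beta = 0.0f;
    return cublasSgemmStridedBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                                     m, n, k,
                                     &alpha,
                                     d_A, k, stride_A,
                                     d_B, k, stride_B,
                                     &beta,
                                     d_C, m, stride_C,
                                     batch_count);
}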
I'm a little confused about what's happening with … (I added some NVTX annotations with node labels to make it clear where the parallel matmuls are supposed to appear on the GPU.) I'm running this command with the latest changes from #1207 to process a relatively large prompt from the repo on an A100: …

What am I missing?
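The NVTX annotations mentioned above are not part of this PR; a minimal sketch of how such labels can be emitted (so ranges show up by name in the Nsight Systems timeline) looks like this:

#include <nvToolsExt.h>   // link with -lnvToolsExt

// Sketch: wrap a region of GPU work in a named NVTX range so it is labeled in
// the profiler timeline. The label would typically be the ggml node/op name.
static void annotated_region(const char * label) {
    nvtxRangePushA(label);  // open a named range
    // ... enqueue dequantize kernel, cudaMemcpyAsync, cublasSgemm, etc. here ...
    nvtxRangePop();         // close the range
}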
This is what two layers look like now (https://github.com/slaren/llama.cpp/tree/cuda-f16f32): [profiler screenshot]

I suspect that most of the time between layers is the missing non-contiguous mat mul. So there is probably a lot to gain there.

@dfyz in this screenshot it is the noisy lines that appear at the start of the layer (actually it is the end of the previous layer). Maybe your prompt is not big enough? I am using the perplexity tool here.
Ah yes, I can see those now when I run the perplexity tool. I guess I would still try to handle both matmuls (the missing non-contiguous one, and the one you currently throw multiple streams at) with …
My intuition is that this is better than batched because it allows us to keep copying memory while doing the compute, but I may be wrong. It is worth a try for sure.
I did an experiment where I tried to copy over even non-contiguous tensors using cublasSetVector(Async), looping over the outer dimensions and matrix rows; there is no matrix version of this, which is a shame. You can see that at least rocBLAS is using device kernels to accelerate the non-contiguous path, and I assume cuBLAS is similar, but it's not open source, so I can't tell.

EDIT: oh, the results were inconclusive, maybe I did something wrong.
We could probably use …
@slaren I made this kind of crap, it seems to speed up from 59 minutes to 47:

static cudaError_t ggml_cuda_copy_tensor_compatible_2D(void * dst, const struct ggml_tensor * src, uint64_t i3, uint64_t i2, cudaStream_t stream) {
    const uint64_t ne0 = src->ne[0];
    const uint64_t ne1 = src->ne[1];
    const uint64_t nb0 = src->nb[0];
    const uint64_t nb1 = src->nb[1];
    const uint64_t nb2 = src->nb[2];
    const uint64_t nb3 = src->nb[3];
    const enum ggml_type type = src->type;
    const size_t ts = GGML_TYPE_SIZE[type];
    const size_t bs = GGML_BLCK_SIZE[type];

    const void * x = (const void *) ((const char *) src->data + i2*nb2 + i3*nb3);
    if (ggml_is_contiguous(src)) {
        return cudaMemcpyAsync(dst, x, ts*ne0*ne1/bs, cudaMemcpyHostToDevice, stream);
    } else {
        GGML_ASSERT(nb0 == ts); // don't think about it now yet :(
        //fprintf(stderr, "cudaMemcpy2DAsync(dst, %ld*%ld/%ld, x, %ld, ts*ne0/bs, %ld, cudaMemcpyHostToDevice, stream);\n",
        //    ts, ne0, bs, nb1, ts, ne0, bs, ne1);
        return cudaMemcpy2DAsync(dst, ts*ne0/bs, x, nb1, ts*ne0/bs, ne1, cudaMemcpyHostToDevice, stream);
    }
}

Then just use that everywhere instead of …

EDIT: perplexity done, Q4_0 --memory_f32: …
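A hypothetical call site (not from the comment above; d_X, src0, ne02/ne03 and g_cudaStream are illustrative placeholders) would replace the existing per-slice copy with the helper:

// Sketch: inside the loop over the outer dimensions of the mat mul, copy the
// (possibly non-contiguous) 2D slice of src0 to the device in one call.
for (uint64_t i03 = 0; i03 < ne03; i03++) {
    for (uint64_t i02 = 0; i02 < ne02; i02++) {
        CUDA_CHECK(ggml_cuda_copy_tensor_compatible_2D(d_X, src0, i03, i02, g_cudaStream));
        // ... cublasSgemm on d_X as before ...
    }
}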
Previously, the cuda memory was allocated and freed as needed for each mat mul operation, which is very inefficient.
By using a memory pool, this is about 30-50% faster on my machine.
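The core idea is roughly the following (a simplified sketch, not the exact code in this PR; synchronization is omitted, see the spin lock discussion above, and the pool size is an arbitrary example):

// Sketch: a tiny fixed-size pool of device buffers. The malloc path first tries
// to reuse a cached buffer that is large enough and only falls back to
// cudaMalloc() on a miss; the free path returns buffers to the cache instead of
// calling cudaFree(), so steady-state mat muls do no cudaMalloc()/cudaFree().
#define EXAMPLE_MAX_CUDA_BUFFERS 16

struct example_cuda_buffer {
    void * ptr = nullptr;
    size_t size = 0;
};

static example_cuda_buffer g_buffer_pool[EXAMPLE_MAX_CUDA_BUFFERS];

static void * example_pool_malloc(size_t size) {
    for (int i = 0; i < EXAMPLE_MAX_CUDA_BUFFERS; ++i) {
        example_cuda_buffer & b = g_buffer_pool[i];
        if (b.ptr != nullptr && b.size >= size) {
            void * ptr = b.ptr;
            b.ptr = nullptr;
            return ptr;              // reuse a cached buffer
        }
    }
    void * ptr = nullptr;
    CUDA_CHECK(cudaMalloc(&ptr, size));
    return ptr;                      // pool miss: allocate a new buffer
}

static void example_pool_free(void * ptr, size_t size) {
    for (int i = 0; i < EXAMPLE_MAX_CUDA_BUFFERS; ++i) {
        example_cuda_buffer & b = g_buffer_pool[i];
        if (b.ptr == nullptr) {
            b.ptr  = ptr;
            b.size = size;
            return;                  // cache the buffer for reuse
        }
    }
    CUDA_CHECK(cudaFree(ptr));       // pool full: actually release the memory
}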
PR: …

Master: …