cuBLAS: use host pinned memory and dequantize while copying #1207
Conversation
I think these changes look great. You said elsewhere that this stuff might cause "some friction", but I think it turns out to be very non-intrusive. The CUDA stuff is still relatively self-contained and is separated from the ggml core. Of course, @ggerganov might have a different opinion, but I think this should be merged as is.
@SlyEcho are you sure that this is with this branch and not
@slaren, you are quite right, this is. But it does have the same changes included? Anyway, perplexity on Q4_0 was [655]6.2838
Yes, that branch is built on top of this one, with additional changes to the f16 x f32 mat mul.
Copying memory to the GPU from pageable memory is slow because it forces CUDA to copy the buffer to non-pageable memory before it can DMA it to the GPU. This also means that `cudaMemcpyAsync` is actually synchronous. By storing the ggml context in non-pageable, pinned memory, this additional copy is avoided, and `cudaMemcpyAsync` is truly asynchronous. This also makes it possible to dequantize while copying the data for the other matrix.

To observe most of the benefits, this has to be used with `--no-mmap`, otherwise the weights will be stored in pageable, memory-mapped memory. With mmap enabled, there is still some benefit from the non-weight matrices. In the future, this will be solved by caching the weights in GPU memory, avoiding the copy entirely.

To avoid adding a CUDA-only function to the ggml interface, llama.cpp has been modified to include `ggml-cuda.h` when cuBLAS is enabled.

For me, this represents a ~30% speedup in perplexity times with cuBLAS.
PR:
Master: