sched : support async weight copy #7315
base: master
Conversation
The code changes make sense to me, but my understanding of the ggml backend code is not very good.

```c
sched->copy_streams[cur_backend_id][split->w_copy_stream_id].max_size = MAX(
    sched->copy_streams[cur_backend_id][split->w_copy_stream_id].max_size,
    ggml_backend_buft_get_alloc_size(sched->bufts[cur_backend_id], src));
```
minor : `ggml_backend_buft_get_alloc_size()` will be evaluated twice due to the `MAX` macro
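For reference, the usual fix is to hoist the call into a local so it is evaluated exactly once; a sketch, assuming `MAX` is the classic `((a) > (b) ? (a) : (b))` macro:

```c
// Evaluate ggml_backend_buft_get_alloc_size() once, outside the comparison,
// so the MAX macro's double expansion of its arguments cannot call it twice.
const size_t alloc_size = ggml_backend_buft_get_alloc_size(sched->bufts[cur_backend_id], src);
if (alloc_size > sched->copy_streams[cur_backend_id][split->w_copy_stream_id].max_size) {
    sched->copy_streams[cur_backend_id][split->w_copy_stream_id].max_size = alloc_size;
}
```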
It appears this increases VRAM usage? In that case, I believe it's important to make this PR an option should it get merged, since the slowdown in text generation might outweigh the speedup in prompt processing. For example, here's a benchmark by someone in the KoboldAI Discord server with a 2060, using a custom build of Koboldcpp that includes this PR: [benchmark results not shown]

As we can see, that hardware was only able to offload 25 layers instead of 30 on a 10.7B model. So while the speedup in prompt processing is impressive, text generation is noticeably slower, because the higher VRAM usage means fewer layers can be offloaded. The result is that the new PR is slower overall (2.42 tokens/s vs 2.53 tokens/s).

As for my own tests, I was no longer able to offload 5 layers on Mixtral with this PR, so I have little reason to doubt these findings. But if you wish, I can run my own tests in due time with a more apples-to-apples comparison between master and this PR.

Otherwise, great work on this PR, as always. I think it's a great option for people who prefer prompt processing speed over text generation speed, but since it has this drawback, I suggest handling it as a command-line option. I'm interested to hear your thoughts!
It will always use more memory, since it requires reserving enough VRAM for multiple weights at the same time instead of only one. The number can be configured with …
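As a rough illustration of that tradeoff (hypothetical numbers, not measurements from this PR):

```c
// Hypothetical numbers: staging VRAM grows linearly with the number of
// in-flight weight copies, each sized for the largest offloaded weight.
size_t max_weight_size = 300u * 1024 * 1024;  // largest weight, ~300 MiB (assumed)
int    n_copies        = 2;                   // configurable number of copies
size_t staging_vram    = (size_t) n_copies * max_weight_size;
// ~600 MiB reserved instead of ~300 MiB with a single buffer -- VRAM that
// can no longer hold additional offloaded layers.
```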
Adds support for copying the weights asynchronously with partial offload, so that the next weight can be uploaded while the current one is being used.
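A minimal double-buffering sketch of the idea; `upload_async`, `record_event`, `wait_event`, and `compute_split` are hypothetical stand-ins, not the actual ggml-backend API: while compute works on weight i, the copy stream uploads weight i+1 into the other staging buffer, and an event keeps compute from running ahead of the upload.

```c
// Hypothetical double-buffering sketch -- not the PR's actual code.
// upload_async() enqueues a host-to-device copy on a copy stream;
// record_event()/wait_event() order that copy against compute.
typedef struct {
    const void * weight_data;
    size_t       size;
    int          uploaded_event; // hypothetical event handle
} split_t;

void run_splits(split_t * splits, int n_splits, void * staging[2]) {
    int cur = 0;

    // prefetch the first weight before any compute starts
    upload_async(staging[cur], splits[0].weight_data, splits[0].size);
    record_event(splits[0].uploaded_event);

    for (int i = 0; i < n_splits; i++) {
        if (i + 1 < n_splits) {
            // overlap: upload the next weight into the other staging buffer
            // (a real implementation must also make the copy stream wait
            // until compute has finished reading the buffer it overwrites)
            upload_async(staging[1 - cur], splits[i + 1].weight_data, splits[i + 1].size);
            record_event(splits[i + 1].uploaded_event);
        }
        wait_event(splits[i].uploaded_event); // compute waits only for its own weight
        compute_split(&splits[i], staging[cur]);
        cur = 1 - cur;                        // swap staging buffers
    }
}
```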
With `-ngl 0 -fa 1`: [benchmark results not shown]
While this improves performance significantly, it is still far below what should be possible, because the KV cache is still copied synchronously. That causes a stall in every layer, which pretty much destroys the performance. Fixing it is going to be more complicated.