Replies: 1 comment 1 reply
- I was wondering the same. Opinions?
- While the GPU makers would have us believe that the main crunch point is a shortage of GPU compute, the real issue with self-hosted LLMs is a lack of memory, especially when running inference at large context windows (which is where the magic starts to happen).
At the moment llama.cpp loads all of the model's layers and does a very good job of fitting as much as possible into the GPU before spilling the rest over to the CPU. But the CPU is super slow.
A better way to handle large models might be:
Load the entire model into RAM and set aside the KV cache and other per-context storage for large contexts in the GPU's VRAM.
Work out how much VRAM is left and translate that into how many layers could be loaded into VRAM. Let's call it n layers.
Load in the first n layers; then, when inference needs to move on to layer n+1, replace the current n layers in VRAM with the next n layers, and so on until the last layer is reached.
Some people have implemented single-layer paging, but by paging as many layers as possible at once there should, in theory at least, be some efficiency gains. Similarly, keeping the layers in RAM rather than reading them off disc gives the best access and transfer speeds.
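To make the idea concrete, here is a minimal Python sketch of the group-wise paging loop. All of the sizes and the helpers (upload_group, run_group, forward) are hypothetical stand-ins, not llama.cpp's API; the point is only the scheduling: reserve VRAM for the KV cache, derive n from what is left, then walk the model n layers at a time.

```python
# Minimal sketch of group-wise layer paging (hypothetical sizes and names,
# not llama.cpp's actual API). The whole model lives in RAM; VRAM holds the
# KV cache plus a sliding window of n layers that is swapped as a block.

VRAM_BYTES     = 24 * 1024**3        # assumed: 24 GB card
KV_CACHE_BYTES = 8 * 1024**3         # assumed: reserved for a large context
LAYER_BYTES    = 1.6 * 1024**3       # assumed: per-layer weight size
NUM_LAYERS     = 80                  # assumed: a 70B-class model

# 1. Work out how many layers fit after the KV cache is set aside.
budget = VRAM_BYTES - KV_CACHE_BYTES
n = max(1, int(budget // LAYER_BYTES))   # the "n layers" from the proposal

def upload_group(first, last):
    """Stand-in for copying layers [first, last) from RAM into VRAM."""
    print(f"  upload layers {first}..{last - 1} to VRAM")

def run_group(first, last, hidden_state):
    """Stand-in for running those layers on the GPU."""
    return hidden_state  # real code would apply each layer here

# 2. Page through the model n layers at a time for one forward pass.
def forward(hidden_state):
    for first in range(0, NUM_LAYERS, n):
        last = min(first + n, NUM_LAYERS)
        upload_group(first, last)        # replaces the previous n layers
        hidden_state = run_group(first, last, hidden_state)
    return hidden_state

forward(hidden_state=object())
print(f"VRAM budget fits n = {n} layers per group")
```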
Where more than one GPU is in the machine this might allow an even more efficient algorithm, as the next layers can be loaded into the dormant GPU while processing continues on the active GPU.
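A similarly hedged sketch of the two-GPU variant: a background thread stands in for the copy that would preload the next group of layers onto the idle card while the active card computes. The GPU names, group size and timings are made up for illustration, and a real implementation would also have to hand the hidden state from one card to the other.

```python
# Sketch of double-buffered layer loading across two GPUs. While "gpu0" runs
# group k, a background thread preloads group k+1 onto "gpu1", then the roles
# swap. All names and timings are illustrative stand-ins, not real CUDA or
# llama.cpp code.

import threading
import time

NUM_LAYERS, GROUP = 80, 20           # assumed model size and layers per group
GPUS = ["gpu0", "gpu1"]

def load_group(gpu, first, last):
    time.sleep(0.05)                 # pretend RAM -> VRAM copy
    print(f"  loaded layers {first}..{last - 1} on {gpu}")

def compute_group(gpu, first, last):
    time.sleep(0.10)                 # pretend the GPU runs those layers
    print(f"  computed layers {first}..{last - 1} on {gpu}")

groups = [(f, min(f + GROUP, NUM_LAYERS)) for f in range(0, NUM_LAYERS, GROUP)]

# Preload the first group, then overlap compute on one GPU with the load on the other.
load_group(GPUS[0], *groups[0])
for i, (first, last) in enumerate(groups):
    active = GPUS[i % 2]
    idle = GPUS[(i + 1) % 2]
    prefetch = None
    if i + 1 < len(groups):
        prefetch = threading.Thread(target=load_group, args=(idle, *groups[i + 1]))
        prefetch.start()             # hide the copy behind the compute
    compute_group(active, first, last)
    if prefetch:
        prefetch.join()
```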
Is this possible?