Replies: 1 comment 1 reply
- I was wondering the same. Opinions?
- While the GPU makers would have us believe that the main crunch point is a shortage of GPU compute, the real issue with self-hosted LLMs is a lack of memory, especially when running inference at large context windows (which is where the magic starts to happen).
At the moment llama.cpp loads all of the model's layers and does a very good job of fitting as much as possible into the GPU before spilling the rest over to the CPU. But the CPU is super slow.
A better way to handle large models might be:
Load the entire model into RAM and set aside the KV cache and other per-context storage for large contexts in the GPU's VRAM.
Work out how much VRAM is left and translate that into how many layers could be loaded into VRAM. Let's call it n layers.
Load in the first n layers; then, when inference needs to move on to layer n+1, replace the current n layers in VRAM with the next n layers, and so on until the last layer is reached.
Some people have implemented single-layer paging, but by paging as many layers as possible at once there should, in theory at least, be some efficiency gains. Similarly, keeping the layers in RAM rather than reading them off disc gives the best access and transfer speeds.
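To make the idea concrete, here is a minimal Python sketch of the group-wise paging loop. All of the sizes and the helpers (upload_group, run_group, forward) are hypothetical stand-ins, not llama.cpp's API; the point is only the scheduling: reserve VRAM for the KV cache, derive n from what is left, then walk the model n layers at a time.

```python
# Minimal sketch of group-wise layer paging (hypothetical sizes and names,
# not llama.cpp's actual API). The whole model lives in RAM; VRAM holds the
# KV cache plus a sliding window of n layers that is swapped as a block.

VRAM_BYTES     = 24 * 1024**3        # assumed: 24 GB card
KV_CACHE_BYTES = 8 * 1024**3         # assumed: reserved for a large context
LAYER_BYTES    = 1.6 * 1024**3       # assumed: per-layer weight size
NUM_LAYERS     = 80                  # assumed: a 70B-class model

# 1. Work out how many layers fit after the KV cache is set aside.
budget = VRAM_BYTES - KV_CACHE_BYTES
n = max(1, int(budget // LAYER_BYTES))   # the "n layers" from the proposal

def upload_group(first, last):
    """Stand-in for copying layers [first, last) from RAM into VRAM."""
    print(f"  upload layers {first}..{last - 1} to VRAM")

def run_group(first, last, hidden_state):
    """Stand-in for running those layers on the GPU."""
    return hidden_state  # real code would apply each layer here

# 2. Page through the model n layers at a time for one forward pass.
def forward(hidden_state):
    for first in range(0, NUM_LAYERS, n):
        last = min(first + n, NUM_LAYERS)
        upload_group(first, last)        # replaces the previous n layers
        hidden_state = run_group(first, last, hidden_state)
    return hidden_state

forward(hidden_state=object())
print(f"VRAM budget fits n = {n} layers per group")
```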
Where more than one GPU is in the machine this might allow an even more efficient algorithm, as the next layers can be loaded into the dormant GPU while processing continues on the active GPU.
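A similarly hedged sketch of the two-GPU variant: a background thread stands in for the copy that would preload the next group of layers onto the idle card while the active card computes. The GPU names, group size and timings are made up for illustration, and a real implementation would also have to hand the hidden state from one card to the other.

```python
# Sketch of double-buffered layer loading across two GPUs. While "gpu0" runs
# group k, a background thread preloads group k+1 onto "gpu1", then the roles
# swap. All names and timings are illustrative stand-ins, not real CUDA or
# llama.cpp code.

import threading
import time

NUM_LAYERS, GROUP = 80, 20           # assumed model size and layers per group
GPUS = ["gpu0", "gpu1"]

def load_group(gpu, first, last):
    time.sleep(0.05)                 # pretend RAM -> VRAM copy
    print(f"  loaded layers {first}..{last - 1} on {gpu}")

def compute_group(gpu, first, last):
    time.sleep(0.10)                 # pretend the GPU runs those layers
    print(f"  computed layers {first}..{last - 1} on {gpu}")

groups = [(f, min(f + GROUP, NUM_LAYERS)) for f in range(0, NUM_LAYERS, GROUP)]

# Preload the first group, then overlap compute on one GPU with the load on the other.
load_group(GPUS[0], *groups[0])
for i, (first, last) in enumerate(groups):
    active = GPUS[i % 2]
    idle = GPUS[(i + 1) % 2]
    prefetch = None
    if i + 1 < len(groups):
        prefetch = threading.Thread(target=load_group, args=(idle, *groups[i + 1]))
        prefetch.start()             # hide the copy behind the compute
    compute_group(active, first, last)
    if prefetch:
        prefetch.join()
```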
Is this possible?