Layer-wise inference from RAM possible? #9854
AncientMystic started this conversation in Ideas
I am using the Ollama wrapper for llama.cpp and I was wondering: would it be possible to run llama.cpp with layer-wise inference from RAM instead of disk, and possibly specify the number of layers loaded onto the GPU simultaneously?

Loading layers from disk seems a little slow, but if you have enough RAM to hold the whole model, the layers could simply be streamed to the GPU for processing. That would avoid the performance loss that comes from offloading too much to the CPU and having to process some of the layers there.

I was also thinking this could improve performance for large models, increase capacity for loading models in parallel, etc.
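Roughly what I have in mind, sketched here in PyTorch rather than llama.cpp's actual C++ (the layer shapes, sizes, and names are made up; it is only meant to illustrate the streaming idea, not how llama.cpp would implement it):

```python
# Illustration only: weights live in pinned host RAM and each layer is copied
# to the GPU just for the moment it is computed, so VRAM only ever holds one
# layer's weights plus the activations.
import torch
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_layers, dim = 32, 4096  # made-up sizes, stand-ins for real transformer blocks

# Keep all weights in pinned host RAM; pinned pages allow fast async copies to VRAM.
cpu_weights = [torch.randn(dim, dim).pin_memory() for _ in range(n_layers)]

def forward_streaming(x):
    x = x.to(device)
    for w_cpu in cpu_weights:
        w_gpu = w_cpu.to(device, non_blocking=True)  # stream one layer RAM -> VRAM
        x = F.linear(x, w_gpu)                       # run that layer on the GPU
        del w_gpu                                    # drop the copy so the next layer reuses the VRAM
    return x

out = forward_streaming(torch.randn(1, dim))
```

The question is basically whether llama.cpp could do something like this internally, instead of running the layers that don't fit in VRAM on the CPU.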
I am averaging 15-20 t/s when the model mostly or completely fits in VRAM (up to 24 t/s usually).

Once too much is on the CPU (usually about 35-40% of the model), performance drops dramatically, down to 0.5-6 t/s, and sometimes below 0.5 t/s if it is a very large model with most of it on the CPU.
Hardware:
I am using an Intel i7-7820X with AVX-512 (I compiled Ollama specifically to enable AVX-512 + CUDA)
Using an NVIDIA P4 8GB for Ollama/llama.cpp
(the system also has an Intel Arc A310 4GB)
96GB DDR4 RAM
OS: Proxmox 8.1.4 with Windows 10 on top
I am looking for ways to improve performance however possible, and I was hoping something like this might be both possible and helpful.