Layer-wise inference from RAM possible? #9854
AncientMystic started this conversation in Ideas
I am using the Ollama wrapper for llama.cpp and I was wondering: would it be possible to run llama.cpp with layer-wise inference from RAM instead of disk, and possibly specify the number of layers loaded onto the GPU simultaneously?

Loading layers from disk seems a little slow, but if you have enough RAM to hold the whole model, the layers could simply be streamed to the GPU for processing. That would avoid the performance loss that comes from offloading too much to the CPU and having to process some of the layers there.

I was also thinking this could improve performance for large models, increase capacity for loading models in parallel, etc.
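Roughly what I have in mind, sketched here in PyTorch rather than llama.cpp's actual C++ (the layer shapes, sizes, and names are made up; it is only meant to illustrate the streaming idea, not how llama.cpp would implement it):

```python
# Illustration only: weights live in pinned host RAM and each layer is copied
# to the GPU just for the moment it is computed, so VRAM only ever holds one
# layer's weights plus the activations.
import torch
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_layers, dim = 32, 4096  # made-up sizes, stand-ins for real transformer blocks

# Keep all weights in pinned host RAM; pinned pages allow fast async copies to VRAM.
cpu_weights = [torch.randn(dim, dim).pin_memory() for _ in range(n_layers)]

def forward_streaming(x):
    x = x.to(device)
    for w_cpu in cpu_weights:
        w_gpu = w_cpu.to(device, non_blocking=True)  # stream one layer RAM -> VRAM
        x = F.linear(x, w_gpu)                       # run that layer on the GPU
        del w_gpu                                    # drop the copy so the next layer reuses the VRAM
    return x

out = forward_streaming(torch.randn(1, dim))
```

The question is basically whether llama.cpp could do something like this internally, instead of running the layers that don't fit in VRAM on the CPU.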
I am averaging 15-20 t/s when the model mostly or completely fits in VRAM (up to 24 t/s usually).

Once too much is on the CPU (usually about 35-40% of the model), performance drops dramatically, down to 0.5-6 t/s, and sometimes below 0.5 t/s if it is a very large model with most of it on the CPU.
Hardware:
I am using an Intel i7-7820X with AVX-512 (I compiled Ollama specifically to enable AVX-512 + CUDA)
Using an NVIDIA P4 8GB for Ollama/llama.cpp
(the system also has an Intel Arc A310 4GB)
96GB DDR4 RAM
OS: Proxmox 8.1.4 with Windows 10 on top
I am looking for ways to improve performance however possible, and I was hoping something like this might be both possible and helpful.