Replies: 4 comments 1 reply
-
For me personally my solution works fine and also offers some other features, but besides that, yes, full ack. At the moment, every P40 worldwide running llama.cpp burns somebody's money. A fix directly in llama.cpp would be great.
-
How about handling unwanted but possibly needed dependencies via environment variables at compile time, and then later also providing external binaries via an environment variable?
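A minimal sketch of that idea, assuming hypothetical names (the macro `LLAMA_PSTATE_BUILTIN`, the env variable `LLAMA_PSTATE_BIN`, and the helper `llama_pstate_set` are illustrations, not existing llama.cpp options):

```cpp
// Hypothetical sketch only: compile-time switch for a built-in dependency,
// runtime env variable for an external binary. None of these names exist
// in llama.cpp today.
#include <cstdlib>
#include <string>

#ifdef LLAMA_PSTATE_BUILTIN
// forward declaration of a hypothetical built-in implementation (e.g. NvAPI)
void llama_pstate_set(int pstate);
#endif

static void set_gpu_pstate(int pstate) {
#ifdef LLAMA_PSTATE_BUILTIN
    llama_pstate_set(pstate);                      // dependency compiled in
#else
    if (const char * bin = std::getenv("LLAMA_PSTATE_BIN")) {
        // defer to an external tool, e.g. a pstate CLI, if the user provides one
        std::string cmd = std::string(bin) + " " + std::to_string(pstate);
        std::system(cmd.c_str());                  // error handling omitted
    }
#endif
}
```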
-
Note that the patch in this form would have an impact on running multiple llama.cpp instances that share one or more P40/P100 GPUs. You would have to implement a shared or synchronized semaphore. In gppm this was easy to implement. Improvements to the logic are coming.
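To make the "shared or synchronized semaphore" idea concrete, here is a rough sketch (POSIX-only; the semaphore name and the pstate helpers are hypothetical placeholders, not gppm or llama.cpp code) where the semaphore value counts busy instances so only the last one to go idle lowers the performance state:

```cpp
// Rough sketch of cross-process coordination with a named POSIX semaphore.
// Not race-free (check-then-act between sem_wait and sem_getvalue), so a
// real implementation would need more care.
#include <semaphore.h>
#include <fcntl.h>

// hypothetical placeholders for the actual pstate-switching calls
void gpu_set_high_pstate();
void gpu_set_low_pstate();

static sem_t * busy_count = nullptr;

void pstate_init() {
    // shared across all instances on the machine, initial value 0
    busy_count = sem_open("/llamacpp_busy", O_CREAT, 0666, 0);
}

void on_inference_start() {
    sem_post(busy_count);          // one more busy instance
    gpu_set_high_pstate();         // leave the low-power state before compute
}

void on_inference_end() {
    sem_wait(busy_count);          // one less busy instance
    int value = 0;
    sem_getvalue(busy_count, &value);
    if (value == 0) {
        gpu_set_low_pstate();      // only the last instance to go idle drops to P8
    }
}
```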
-
This can manage multiple GPUs and multiple llama.cpp instances, regardless of how CUDA_VISIBLE_DEVICES is set for the respective instances: https://github.com/crashr/gppm/blob/multiple-llamacpp/gppmd.py It is not ready yet. At the moment the llama.cpp instances need to be launched like this, but this will become more convenient soon: https://github.com/crashr/gppm/blob/multiple-llamacpp/run_instance_1.sh
-
I'm wondering if it makes sense to have nvidia-pstate directly in llama.cpp (enabled only for specific GPUs, e.g. P40/P100)?

nvidia-pstate reduces the idle power consumption (and, as a result, the temperature) of server Pascal GPUs. An undocumented NvAPI function is called for this purpose. This approach works on both Linux and Windows. Theoretically, this works for other NVIDIA GPUs as well, but on those the driver already does a good job of managing performance states. That said, putting a GPU into performance state 8 (without switching back) is a really useful option for power/temperature-constrained setups, since in P8, at least on the P40, it will never overheat even when running completely fanless. So the performance state switcher could also become a new example and provide functionality similar to my CLI tool.
A function call needs to be placed before inference starts and after it finishes. My patch simply calls a "function" before a slot starts processing and after all slots on the server are idle, but I assume there is a common function executed before/after inference, so that this can easily be extended to all examples and applications that use llama.cpp as a library.
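As a rough illustration of where those hooks would sit (a self-contained sketch with placeholder types and function names, not the actual patch or the real server code): raise the performance state when the first slot becomes busy, lower it again once every slot is idle.

```cpp
#include <algorithm>
#include <vector>

struct slot { bool processing = false; };          // stand-in for a server slot

void pstate_before_inference();                    // hypothetical hook
void pstate_after_inference();                     // hypothetical hook

void update_power_state(const std::vector<slot> & slots) {
    static bool low_power = true;                  // GPUs start in the idle state

    const bool busy = std::any_of(slots.begin(), slots.end(),
                                  [](const slot & s) { return s.processing; });

    if (busy && low_power) {
        pstate_before_inference();                 // e.g. leave P8 before compute
        low_power = false;
    } else if (!busy && !low_power) {
        pstate_after_inference();                  // e.g. force P8 again while idle
        low_power = true;
    }
}
```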
I can make a PR that integrates it (without Python or external dependencies; the code can easily be rewritten in C++). But will it be merged?
Related: #8063 (@crashr, you might want to comment on that too.)