Replies: 4 comments 1 reply
-
For me personally my solution works fine and also offers some other features, but besides that, yes, full ack. At the moment, every P40 worldwide running llama.cpp burns somebody's money. A fix directly in llama.cpp would be great.
-
How about handling unwanted but possibly needed dependencies via environment variables at compile time, and then later also providing external binaries via an environment variable?
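A minimal sketch of that idea, assuming hypothetical names (the macro `LLAMA_PSTATE_BUILTIN`, the env variable `LLAMA_PSTATE_BIN`, and the helper `llama_pstate_set` are illustrations, not existing llama.cpp options):

```cpp
// Hypothetical sketch only: compile-time switch for a built-in dependency,
// runtime env variable for an external binary. None of these names exist
// in llama.cpp today.
#include <cstdlib>
#include <string>

#ifdef LLAMA_PSTATE_BUILTIN
// forward declaration of a hypothetical built-in implementation (e.g. NvAPI)
void llama_pstate_set(int pstate);
#endif

static void set_gpu_pstate(int pstate) {
#ifdef LLAMA_PSTATE_BUILTIN
    llama_pstate_set(pstate);                      // dependency compiled in
#else
    if (const char * bin = std::getenv("LLAMA_PSTATE_BIN")) {
        // defer to an external tool, e.g. a pstate CLI, if the user provides one
        std::string cmd = std::string(bin) + " " + std::to_string(pstate);
        std::system(cmd.c_str());                  // error handling omitted
    }
#endif
}
```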
-
Note that the patch in this form would have an impact on running multiple llama.cpp instances that share one or more P40/P100 GPUs. You would have to implement a shared or synchronized semaphore. In gppm this was easy to implement. Improvements to the logic are coming.
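To make the "shared or synchronized semaphore" idea concrete, here is a rough sketch (POSIX-only; the semaphore name and the pstate helpers are hypothetical placeholders, not gppm or llama.cpp code) where the semaphore value counts busy instances so only the last one to go idle lowers the performance state:

```cpp
// Rough sketch of cross-process coordination with a named POSIX semaphore.
// Not race-free (check-then-act between sem_wait and sem_getvalue), so a
// real implementation would need more care.
#include <semaphore.h>
#include <fcntl.h>

// hypothetical placeholders for the actual pstate-switching calls
void gpu_set_high_pstate();
void gpu_set_low_pstate();

static sem_t * busy_count = nullptr;

void pstate_init() {
    // shared across all instances on the machine, initial value 0
    busy_count = sem_open("/llamacpp_busy", O_CREAT, 0666, 0);
}

void on_inference_start() {
    sem_post(busy_count);          // one more busy instance
    gpu_set_high_pstate();         // leave the low-power state before compute
}

void on_inference_end() {
    sem_wait(busy_count);          // one less busy instance
    int value = 0;
    sem_getvalue(busy_count, &value);
    if (value == 0) {
        gpu_set_low_pstate();      // only the last instance to go idle drops to P8
    }
}
```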
-
This can manage multiple GPUs and multiple llama.cpp instances, regardless of how CUDA_VISIBLE_DEVICES is set for the respective instances: https://github.com/crashr/gppm/blob/multiple-llamacpp/gppmd.py It is not ready yet. At the moment the llama.cpp instances need to be launched like this, but this will become more convenient soon: https://github.com/crashr/gppm/blob/multiple-llamacpp/run_instance_1.sh
-
I'm wondering if it makes sense to have nvidia-pstate directly in llama.cpp (enabled only for specific GPUs, e.g. P40/P100)?

nvidia-pstate reduces the idle power consumption (and, as a result, the temperature) of server Pascal GPUs. An undocumented NvAPI function is called for this purpose. This approach works on both Linux and Windows. Theoretically, this works for other NVIDIA GPUs as well, but on those the driver already does a good job of managing performance states. That said, putting a GPU into performance state 8 (without switching back) is a really useful option for power/temperature-constrained setups, since in P8, at least on the P40, it will never overheat even when running completely fanless. So the performance state switcher could also become a new example and provide functionality similar to my CLI tool.
A function call needs to be placed before inference starts and after it finishes. My patch simply calls a "function" before a slot starts processing and after all slots on the server are idle, but I assume there is a common function executed before/after inference, so that this can easily be extended to all examples and applications that use llama.cpp as a library.
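As a rough illustration of where those hooks would sit (a self-contained sketch with placeholder types and function names, not the actual patch or the real server code): raise the performance state when the first slot becomes busy, lower it again once every slot is idle.

```cpp
#include <algorithm>
#include <vector>

struct slot { bool processing = false; };          // stand-in for a server slot

void pstate_before_inference();                    // hypothetical hook
void pstate_after_inference();                     // hypothetical hook

void update_power_state(const std::vector<slot> & slots) {
    static bool low_power = true;                  // GPUs start in the idle state

    const bool busy = std::any_of(slots.begin(), slots.end(),
                                  [](const slot & s) { return s.processing; });

    if (busy && low_power) {
        pstate_before_inference();                 // e.g. leave P8 before compute
        low_power = false;
    } else if (!busy && !low_power) {
        pstate_after_inference();                  // e.g. force P8 again while idle
        low_power = true;
    }
}
```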
I can make a PR that integrates it (without Python or external dependencies; the code can easily be rewritten in C++). But will it be merged?
Related: #8063 (@crashr, you might want to comment on that too.)