
Unexpected performance issue with longer prompts? #938

Closed
catid opened this issue Apr 13, 2023 · 3 comments

Comments


catid commented Apr 13, 2023

Got pretty far through implementing a llama.cpp-based tool that uses the 65B model to do static code analysis, but ran into a wall. The ggml inference engine gets incredibly slow when the past context is long, which is very different from GPU behavior.

The GPU version of my code only gets about 2x slower when there's a long prompt, but the ggml CPU version is more like 100x slower. This makes my idea unworkable on CPU, which makes me sad.

I was expecting it to take about 1 second per token, so maybe 4 seconds to generate a score between 0 and 1 for each function in the C++ code, which would have been fine.

Maybe this is a performance bug in llama_eval()? The main reason I'm coming to this conclusion is that with the ./main chat app I'm observing that it takes time per input token as well as per output token, while the HuggingFace LLaMA library practically doesn't care how long the input is: performance is only 2x worse at most.

Here's my branch: https://github.com/catid/llamanal.cpp/tree/main/examples/analysis

Test code is here: https://github.com/catid/llamanal.cpp/blob/d9f666a39c1a2e82a34e1508ba4c6121cae7a932/examples/analysis/oracle.cpp#L52
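
For context, a per-token timing measurement along these lines could look as follows. This is a minimal sketch, assuming the llama.h API from around this time (llama_init_from_file, llama_tokenize and llama_eval(ctx, tokens, n_tokens, n_past, n_threads)); the model path, prompt and thread count are placeholders, and it is not the code in the linked branch:

```cpp
// Timing sketch: separate prompt-processing time from per-token generation time.
// Assumes the llama.h API of this period; paths and sizes are placeholders.
#include "llama.h"

#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

static double elapsed_s(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const int n_threads = 8; // placeholder

    llama_context_params params = llama_context_default_params();
    params.n_ctx = 2048;

    llama_context * ctx = llama_init_from_file("models/65B/ggml-model-q4_0.bin", params);
    if (!ctx) return 1;

    const std::string prompt = "/* function under analysis goes here */";
    std::vector<llama_token> tokens(params.n_ctx);
    const int n_prompt = llama_tokenize(ctx, prompt.c_str(), tokens.data(), (int) tokens.size(), true);

    // Prompt processing: all input tokens in a single llama_eval call.
    auto t0 = std::chrono::steady_clock::now();
    llama_eval(ctx, tokens.data(), n_prompt, 0, n_threads);
    const double t_prompt = elapsed_s(t0);
    printf("prompt: %d tokens in %.2f s (%.2f s/token)\n", n_prompt, t_prompt, t_prompt / n_prompt);

    // Generation: one token per llama_eval call (the token value is a placeholder;
    // a real run would sample it from the logits each step).
    int n_past = n_prompt;
    llama_token tok = llama_token_bos();
    for (int i = 0; i < 4; ++i) {
        auto t1 = std::chrono::steady_clock::now();
        llama_eval(ctx, &tok, 1, n_past++, n_threads);
        printf("generated token %d in %.2f s\n", i, elapsed_s(t1));
    }

    llama_free(ctx);
    return 0;
}
```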

ggerganov (Owner) commented

Currently, I am not convinced this is a bug in llama.cpp, but there is probably some room for improvement.
The GPU has much higher memory throughput, and for prompt processing the computation is highly parallel, so I expect it to be orders of magnitude faster than the CPU.
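
As a back-of-the-envelope illustration of why batched prompt processing parallelizes so much better than one-token-at-a-time generation (a generic sketch, not ggml code; the hidden size and prompt length below are made-up example numbers):

```cpp
// Weight reuse for one d x d weight matrix: matrix-vector (generation) vs.
// matrix-matrix (whole prompt at once). Generic sketch, not ggml code.
#include <cstdio>

int main() {
    const double d        = 8192;          // hypothetical hidden dimension
    const double n_prompt = 512;           // hypothetical prompt length
    const double weight_bytes = d * d * 2; // fp16 weights read from memory

    // Generation: one token at a time -> matrix-vector product.
    // Every weight byte is read for a single multiply-add, so the work is memory-bound.
    const double flops_gemv = 2.0 * d * d;
    printf("per-token  : %.0f FLOPs, %.2f FLOPs per weight byte\n",
           flops_gemv, flops_gemv / weight_bytes);

    // Prompt: all tokens at once -> matrix-matrix product.
    // The same weights are reused for every prompt position, so the work is
    // compute-bound and maps well onto BLAS or a GPU.
    const double flops_gemm = 2.0 * d * d * n_prompt;
    printf("full prompt: %.0f FLOPs, %.2f FLOPs per weight byte\n",
           flops_gemm, flops_gemm / weight_bytes);
    return 0;
}
```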

Here are a few things to try to improve the performance of llama.cpp for large prompt processing:

- Build with OpenBLAS support, so that the large matrix multiplications during prompt processing go through BLAS.

I hope at some point we will have better GPU support in ggml, but this will probably take some time.

Otherwise, this is a very cool idea and I will be very happy if you succeed in implementing it and making it run efficiently!
I think the quantization accuracy improvements that are on the way might also be useful to your project.

catid (Author) commented Apr 13, 2023

Thanks for taking a look! OpenBLAS helped, but I agree the issue appears to just be a compute bottleneck and this will require GPUs to run.

catid closed this as completed Apr 13, 2023
ggerganov (Owner) commented

@catid

With the recently added cuBLAS support, people are reporting significant speed improvements when running large prompt inference: #1065 (comment) (multiple times faster than before, depending on the GPU you have).

The current approach makes it possible to take advantage of GPUs with low VRAM even for the largest models, since the model weights are passed to the GPU "on demand" instead of loading everything into VRAM.

Maybe you will be interested in giving your idea another try with the latest version of this repo, with cuBLAS enabled.
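
For reference, a rough sketch of the "on demand" pattern in isolation. This is not ggml's actual cuBLAS code path (which operates on quantized/fp16 tensors and reuses buffers); it only shows one weight matrix being copied to VRAM for a single cuBLAS multiplication and freed again, so peak VRAM stays far below the full model size:

```cpp
// On-demand weight offload sketch: stream one fp32 weight matrix to the GPU,
// multiply it against the batch of activations with cuBLAS, copy the result
// back, and free the VRAM again. Illustration only.
#include <cublas_v2.h>
#include <cuda_runtime.h>

#include <vector>

// Computes Y = W * X for a d x d weight matrix and d x n activations
// (column-major storage, as cuBLAS expects).
void matmul_on_demand(cublasHandle_t handle,
                      const std::vector<float> & W,
                      const std::vector<float> & X,
                      std::vector<float> & Y,
                      int d, int n) {
    float *d_W = nullptr, *d_X = nullptr, *d_Y = nullptr;
    cudaMalloc((void **) &d_W, sizeof(float) * d * d);
    cudaMalloc((void **) &d_X, sizeof(float) * d * n);
    cudaMalloc((void **) &d_Y, sizeof(float) * d * n);

    // Weights live in system RAM and are streamed in only for this call.
    cudaMemcpy(d_W, W.data(), sizeof(float) * d * d, cudaMemcpyHostToDevice);
    cudaMemcpy(d_X, X.data(), sizeof(float) * d * n, cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                d, n, d, &alpha, d_W, d, d_X, d, &beta, d_Y, d);

    Y.resize((size_t) d * n);
    cudaMemcpy(Y.data(), d_Y, sizeof(float) * d * n, cudaMemcpyDeviceToHost);

    // Free immediately so peak VRAM stays at one matrix's worth of weights.
    cudaFree(d_W);
    cudaFree(d_X);
    cudaFree(d_Y);
}

int main() {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const int d = 4, n = 3; // tiny sizes just to exercise the path
    std::vector<float> W(d * d, 1.0f), X(d * n, 1.0f), Y;
    matmul_on_demand(handle, W, X, Y, d, n);

    cublasDestroy(handle);
    return 0;
}
```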
