Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA #391

irthomasthomas opened this issue Jan 18, 2024 · 0 comments
Labels
llm-benchmarks: testing and benchmarking large language models
llm-experiments: experiments with large language models
llm-inference-engines: Software to run inference on large language models
New-Label: Choose this option if the existing labels are insufficient to describe the content accurately

Speculative Decoding in Exllama v2 and llama.cpp Comparison

Discussion

We discussed speculative decoding (SD) in a previous thread. For those who are not aware of this feature, it lets an LLM loader use a smaller "draft" model to propose tokens that the larger model then verifies, so the big model can emit several tokens per forward pass. In that thread, someone asked for speculative decoding tests of both Exllama v2 and llama.cpp. Although I generally only run models in GPTQ, AWQ, or exl2 formats, I was interested in doing the exl2 vs. llama.cpp comparison.
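
For anyone who wants the mechanics spelled out, here is a minimal pure-Python sketch of the draft-and-verify loop. The target_model and draft_model callables are toy stand-ins rather than any real library's API, and the acceptance rule shown is the simple greedy variant (keep a drafted token only while it matches the target's own prediction); real implementations use a rejection-sampling rule so the output distribution matches the target model exactly.

```python
def speculative_decode(target_model, draft_model, prompt_ids, max_new_tokens=250, k=5):
    """Toy greedy draft-and-verify loop (illustration only).

    target_model(ids) -> list with the predicted next token for every position in ids
    draft_model(ids)  -> single greedy next-token prediction for ids
    """
    ids = list(prompt_ids)
    generated = 0
    while generated < max_new_tokens:
        # 1. The small draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_model(ids + draft))

        # 2. The large target model runs ONCE over context + draft; that single
        #    forward pass scores every drafted position in parallel.
        preds = target_model(ids + draft)

        # 3. Keep drafted tokens while they match what the target would have
        #    produced, then append one token from the target itself.
        accepted = 0
        for i, tok in enumerate(draft):
            if preds[len(ids) + i - 1] == tok:
                accepted += 1
            else:
                break
        ids += draft[:accepted] + [preds[len(ids) + accepted - 1]]
        generated += accepted + 1
    return ids
```

The gain comes from step 2: when the draft agrees with the target most of the time, each expensive forward pass of the big model yields several tokens instead of one.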

Test Setup

The tests were run on a 2x 4090, 13900K, DDR5 system. The screen captures of the terminal output of both are available below. If someone has experience with making llama.cpp speculative decoding work better, please share.

Exllama v2 Results

Model: Xwin-LM-70B-V0.1-4.0bpw-h6-exl2

Draft Model: TinyLlama-1.1B-1T-OpenOrca-GPTQ

Performance can be highly variable, but it goes from ~20 t/s without SD to 40-50 t/s with SD.
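
As a rough sanity check on that range, the sketch below applies the standard speculative-decoding throughput model. It deliberately ignores the draft model's own cost and sampling overhead, and the acceptance rates and draft length are illustrative assumptions rather than values measured in this run.

```python
# Expected number of tokens produced per expensive target-model forward pass
# when the draft proposes k tokens and each is accepted independently with
# probability alpha: E = (1 - alpha**(k + 1)) / (1 - alpha).
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

no_sd_tps = 23.15  # tokens/s measured without SD (reported below)

for alpha in (0.6, 0.75, 0.9):  # illustrative acceptance rates, not measured
    e = expected_tokens_per_target_pass(alpha, k=5)
    # If the draft were free, throughput would scale roughly with E.
    print(f"alpha={alpha:.2f}  tokens/target pass={e:.2f}  ~{no_sd_tps * e:.0f} t/s upper bound")
```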

No SD

Prompt processed in 0.02 seconds, 4 tokens, 200.61 tokens/second
Response generated in 10.80 seconds, 250 tokens, 23.15 tokens/second

With SD

Prompt processed in 0.03 seconds, 4 tokens, 138.80 tokens/second
Response generated in 5.10 seconds, 250 tokens, 49.05 tokens/second
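
For anyone wanting to reproduce this, a rough sketch of wiring a draft model into ExLlamaV2's streaming generator is below. This is not the exact script behind the numbers above; the model directories are placeholders for the two models named earlier, and the draft_model / draft_cache / num_speculative_tokens parameters plus the streaming calls reflect my assumptions about the ExLlamaV2 API of that period, so check them against the speculative-decoding example shipped with the version you have installed.

```python
# Sketch only: the draft-model parameter names are assumptions -- verify them
# against your installed exllamav2 version.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

def load(model_dir):
    """Load a model with a lazily allocated cache autosplit across GPUs."""
    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)
    model.load_autosplit(cache)
    return model, cache, config

# Placeholder paths for the target and draft models named above.
model, cache, config = load("/models/Xwin-LM-70B-V0.1-4.0bpw-h6-exl2")
draft_model, draft_cache, _ = load("/models/TinyLlama-1.1B-1T-OpenOrca-GPTQ")

tokenizer = ExLlamaV2Tokenizer(config)

# Assumed interface: the streaming generator takes an optional draft model and
# cache to enable speculative decoding.
generator = ExLlamaV2StreamingGenerator(
    model, cache, tokenizer,
    draft_model=draft_model,
    draft_cache=draft_cache,
    num_speculative_tokens=5,
)
generator.set_stop_conditions([tokenizer.eos_token_id])

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

input_ids = tokenizer.encode("Once upon a time")
generator.begin_stream(input_ids, settings)

text = ""
for _ in range(250):
    chunk, eos, _ = generator.stream()
    text += chunk
    if eos:
        break
print(text)
```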

Suggested labels

{ "key": "speculative-decoding", "value": "Technique for using a smaller 'draft' model to help predict tokens for a larger model" }
