exllamav2 integration #349

Merged 1 commit on Sep 26, 2023

Conversation

@SunMarc (Contributor) commented on Sep 25, 2023

What does this PR do?

This PR adds the exllamav2 kernels to auto-gptq. The integration is similar to the exllama kernel. Here's a quick benchmark with llama2-7B using the integration in optimum/transformers. We get a smaller speedup than the benchmark in the exllamav2 repo because here we only replace the Linear layers with quantized layers backed by the exllamav2 kernel. I have confirmed that the tests pass and that we get the same perplexity as with the exllama kernel. For now, we only support the GPTQ format and not the new EXL2 format.
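
As a usage sketch (not part of this PR's diff), selecting the new kernel through auto-gptq could look roughly like the following; the `disable_exllama`/`disable_exllamav2` flag names and the example checkpoint are assumptions on my side rather than the confirmed API of this PR.

```python
# Illustrative only: the flags for choosing the exllamav2 kernel are assumed
# here and should be checked against the final from_quantized signature.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "TheBloke/Llama-2-7B-GPTQ"  # example 4-bit GPTQ checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
    disable_exllama=True,      # assumed flag: turn off the v1 kernel
    disable_exllamav2=False,   # assumed flag: keep the new v2 kernel enabled
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```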

TL;DR: for bs=4, we match the speed of the llama model from the exllamav2 repo and are about 40% faster than the exllama kernel.

For the exllama kernel, we see that we are not compute bound at bs=1 and bs=2, and we are memory/overhead bound at bs=4.

| quantization | act_order | bits | group_size | kernel | num_batches | batch_size | prompt_length | new_tokens | Load time (s) | Per-token latency (ms) | Throughput (tok/s) | Max memory (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gptq | True | 4 | 128 | exllama | 10 | 1 | 512 | 512 | 3.15 | 20.75 | 48.18 | 6293.99 |
| gptq | True | 4 | 128 | exllama | 10 | 2 | 512 | 512 | 3.15 | 21.40 | 93.46 | 7368.83 |
| gptq | True | 4 | 128 | exllama | 10 | 4 | 512 | 512 | 3.15 | 34.95 | 114.46 | 9517.63 |

For the exllamav2 kernel, we see that we are not compute bound at bs=1, bs=2, or bs=4.

| quantization | act_order | bits | group_size | kernel | num_batches | batch_size | prompt_length | new_tokens | Load time (s) | Per-token latency (ms) | Throughput (tok/s) | Max memory (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gptq | True | 4 | 128 | exllamav2 | 10 | 1 | 512 | 512 | 3.12 | 21.03 | 47.54 | 6833.44 |
| gptq | True | 4 | 128 | exllamav2 | 10 | 2 | 512 | 512 | 3.12 | 21.52 | 92.93 | 7908.16 |
| gptq | True | 4 | 128 | exllamav2 | 10 | 4 | 512 | 512 | 3.12 | 23.78 | 168.18 | 10056.95 |

Benchmark using the exllamav2 repo with their optimized llama model:

| quantization | act_order | bits | group_size | kernel | num_batches | batch_size | prompt_length | new_tokens | Load time (s) | Per-token latency (ms) | Throughput (tok/s) | Max memory (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gptq | True | 4 | 128 | exllamav2 | 10 | 1 | 512 | 512 | 3.01 | 8.87 | 112.68 | 6426.15 |
| gptq | True | 4 | 128 | exllamav2 | 10 | 2 | 512 | 512 | 3.01 | 15.66 | 127.70 | 7538.15 |
| gptq | True | 4 | 128 | exllamav2 | 10 | 4 | 512 | 512 | 3.01 | 23.93 | 167.18 | 9762.21 |
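
For reference, a rough way to reproduce per-token latency and throughput numbers of this kind is sketched below. It is not the benchmark script used for the tables above, and the `GPTQConfig(exllama_config={"version": 2})` kernel selection plus the example checkpoint are assumptions about the transformers-side integration.

```python
# Rough per-token latency / throughput measurement, not the exact benchmark
# used above. Kernel selection via GPTQConfig.exllama_config is assumed.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "TheBloke/Llama-2-7B-GPTQ"  # example 4-bit GPTQ checkpoint
batch_size, prompt_length, new_tokens = 4, 512, 512

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda:0",
    quantization_config=GPTQConfig(bits=4, exllama_config={"version": 2}),
)

# Fixed-shape random batch so the run matches the table settings.
prompt_ids = torch.randint(0, tokenizer.vocab_size, (batch_size, prompt_length), device="cuda:0")

torch.cuda.synchronize()
start = time.perf_counter()
model.generate(prompt_ids, max_new_tokens=new_tokens, min_new_tokens=new_tokens, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"{elapsed / new_tokens * 1000:.2f} ms/token, {batch_size * new_tokens / elapsed:.2f} tok/s")
```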

cc @PanQiWei @fxmarty

@PanQiWei (Collaborator) left a comment


Thank you very much for adding exllamav2 support!
