exllamav2 integration #349

SunMarc · 2023-09-25T17:24:00Z

What does this PR do ?

This PR adds the exllamav2 kernels into auto-gptq. The integration is similar to the exllama kernel. Here's a quick benchmark with llama2-7B with the integration in optimum/transformers. We get slower speedup compared to the benchmark in exllamav2 repo because here we only replace the Linear layers with the quantized layer with exllamav2 kernel. I have confirmed that the tests were successful and that we have the same perplexity as exllama kernel. For now, we only support GPTQ format and not the new EXL2 format.

TLDR: for bs=4, we have the same speed as the llama model from exllamav2 repo and 40% faster than exllama kernel.

For exllama kernel, we see that we are not compute bound for bs=1 and bs=2 and memory/overhead bound for bs=4.

quantization	act_order	bits	group_size	kernel	num_batches	batch_size	prompt_length	new_tokens	Load time (s)	Per-token latency (ms)	Throughput (tok/s)	Max memory (MB)
gptq	True	4	128	exllama	10	1	512	512	3.15	20.75	48.18	6293.99
gptq	True	4	128	exllama	10	2	512	512	3.15	21.40	93.46	7368.83
gptq	True	4	128	exllama	10	4	512	512	3.15	34.95	114.46	9517.63

For exllamav2 kernel, we see that we are not compute bound for bs=1 and bs=2 and bs=4.

quantization	act_order	bits	group_size	kernel	num_batches	batch_size	prompt_length	new_tokens	Load time (s)	Per-token latency (ms)	Throughput (tok/s)	Max memory (MB)
gptq	True	4	128	exllamav2	10	1	512	512	3.12	21.03	47.54	6833.44
gptq	True	4	128	exllamav2	10	2	512	512	3.12	21.52	92.93	7908.16
gptq	True	4	128	exllamav2	10	4	512	512	3.12	23.78	168.18	10056.95

Benchmark using exllama2 repo with their optimized llama model:

quantization	act_order	bits	group_size	kernel	num_batches	batch_size	prompt_length	new_tokens	Load time (s)	Per-token latency (ms)	Throughput (tok/s)	Max memory (MB)
gptq	True	4	128	exllamav2	10	1	512	512	3.01	8.87	112.68	6426.15
gptq	True	4	128	exllamav2	10	2	512	512	3.01	15.66	127.70	7538.15
gptq	True	4	128	exllamav2	10	4	512	512	3.01	23.93	167.18	9762.21

cc @PanQiWei @fxmarty

PanQiWei

Thank you very much for adding exllamav2 support!

exllamav2 integration

c912bf3

PanQiWei approved these changes Sep 26, 2023

View reviewed changes

PanQiWei merged commit 50d2e86 into AutoGPTQ:main Sep 26, 2023

This was referenced Sep 27, 2023

add_exllamav2 huggingface/optimum#1419

Merged

add exllamav2 arg huggingface/transformers#26437

Merged

flozi00 mentioned this pull request Oct 17, 2023

Upgrade to exllama v2 huggingface/text-generation-inference#1016

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

exllamav2 integration #349

exllamav2 integration #349

SunMarc commented Sep 25, 2023 •

edited

Loading

PanQiWei left a comment

exllamav2 integration #349

exllamav2 integration #349

Conversation

SunMarc commented Sep 25, 2023 • edited Loading

What does this PR do ?

PanQiWei left a comment

Choose a reason for hiding this comment

SunMarc commented Sep 25, 2023 •

edited

Loading