8bit quantization #3261

rghosh08 · 2024-03-07T21:46:34Z

Does vLLM support 8-bit quantization? We need to use vLLM with a large context window (>1K tokens). We tried AWQ, but the generation quality is not good. Any pointers would be greatly appreciated.

simon-mo · 2024-03-08T06:24:03Z

Try GPTQ? We support 2/3/4/8 bits.
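For reference, a minimal sketch of how a pre-quantized GPTQ checkpoint might be loaded through vLLM's Python API. `quantization` and `max_model_len` are arguments accepted by `vllm.LLM`; the helper names `gptq_engine_kwargs` and `build_llm` are hypothetical, and the sketch assumes the checkpoint was already quantized offline with GPTQ (for example at 8 bits).

```python
def gptq_engine_kwargs(model: str, max_model_len: int) -> dict:
    """Collect keyword arguments for vllm.LLM when loading a GPTQ checkpoint."""
    return {
        "model": model,                  # path or hub ID of a pre-quantized GPTQ checkpoint
        "quantization": "gptq",          # ask vLLM to use its GPTQ kernels
        "max_model_len": max_model_len,  # maximum context length the engine will accept
    }


def build_llm(model: str, max_model_len: int = 4096):
    """Construct the engine; requires vLLM and a supported GPU."""
    # Imported lazily so gptq_engine_kwargs can be used without vLLM installed.
    from vllm import LLM

    return LLM(**gptq_engine_kwargs(model, max_model_len))
```

With vLLM installed, `build_llm("<path-to-gptq-checkpoint>")` would return an engine whose `generate` method can then be called as usual; the checkpoint path is left as a placeholder here.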

andysalerno · 2024-03-08T21:22:54Z

> Try GPT-Q? We support 2/3/4/8 bits.

@simon-mo is it possible to support EETQ, like huggingface/text-generation-inference does?

https://github.com/NetEase-FuXi/EETQ

It's super useful because you don't even need an offline quantization step: you just point it at a normal unquantized model, pass --quantize eetq, and you use half the VRAM and get super fast inference with very little quality impact.

Here's the PR where they added it in TGI:
https://github.com/huggingface/text-generation-inference/pull/1068/files
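
To illustrate why weight-only 8-bit quantization can be applied at load time and roughly halves weight memory relative to fp16, here is a small pure-Python sketch. It assumes symmetric per-row int8 scaling; it is not EETQ's actual code or kernels, and all function names are hypothetical.

```python
def quantize_row(row):
    """Symmetric int8 quantization of one weight row: returns (int values, scale)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float weights from the int values and the row scale."""
    return [v * scale for v in q]


def weight_bytes(num_params, bits):
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits // 8


row = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_row(row)
restored = dequantize_row(q, scale)
# Per-weight rounding error is bounded by about half a quantization step.
worst_error = max(abs(a - b) for a, b in zip(row, restored))
```

Because only the maximum absolute value of each row is needed, no calibration data is required, which is what makes an on-the-fly step possible. For a hypothetical 7B-parameter model, `weight_bytes(7_000_000_000, 16)` is 14 GB of weights, while `weight_bytes(7_000_000_000, 8)` is 7 GB, ignoring the small overhead of the per-row scales.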

shiqingzhangCSU · 2024-03-11T06:26:34Z

> > Try GPT-Q? We support 2/3/4/8 bits.
>
> @simon-mo is it possible to support eetq, like huggingface/text-generation-inference?
>
> https://github.com/NetEase-FuXi/EETQ
>
> It's super useful because you don't even need an offline quantization step, you just point it at a normal unquantized model and pass --quantize eetq and then magically you use half the vram and get super fast inference with very little quality impact.
>
> Here's the PR where they added it in TGI: https://github.com/huggingface/text-generation-inference/pull/1068/files

Good idea. Is it possible to also integrate the W4A16 kernel optimization from TensorRT-LLM?

SidaZh · 2024-03-14T06:55:24Z

That's a good idea. EETQ works out of the box and we'd like to integrate it into vLLM.

dtlzhuangz mentioned this issue Mar 25, 2024

[Misc] feat: add eetq quantization #3614

Closed

hibukipanim mentioned this issue Jul 3, 2024

[Roadmap] vLLM Roadmap Q3 2024 #5805

Open

46 tasks
