8bit quantization #3261
Comments
Try GPTQ? We support 2/3/4/8 bits.

@simon-mo is it possible to support EETQ, like huggingface/text-generation-inference does? https://github.com/NetEase-FuXi/EETQ It's super useful because you don't even need an offline quantization step: you just point it at a normal unquantized model and pass. Here's the PR where they added it in TGI:
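For context on why no offline step is needed: EETQ-style weight-only quantization can convert weights to int8 at load time, keeping activations in higher precision. A minimal sketch of symmetric per-channel int8 weight quantization (illustrative only; this is not EETQ's actual kernels or API):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-output-channel int8 quantization of a weight matrix.

    Returns int8 weights and one float scale per row, so the original
    weights are approximately recovered as q * scale.
    """
    # One scale per output channel (row), so the row's max maps to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)  # stand-in for a weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Per-element error is bounded by half the row scale.
print(np.abs(w - w_hat).max())
```

Because this needs only the weights themselves (no calibration data), it can run on any unquantized checkpoint at load time, which is the property the comment is pointing at.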
Good idea. Is it possible to also integrate the W4A16 kernel optimization from TensorRT-LLM?

That's a good idea. EETQ works out of the box, and we'd like to integrate it into vLLM.
Does vLLM support 8-bit quantization? We need to use vLLM with a large context window (>1K tokens). We tried AWQ, but the generation quality is not good. Any pointers would be greatly appreciated.
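The quality gap described here is consistent with 4-bit formats having much coarser quantization levels than 8-bit ones. An illustrative comparison of plain round-to-nearest error at 4 vs 8 bits (this is naive RTN on random weights, not AWQ's activation-aware scheme, so it only shows the bit-width effect):

```python
import numpy as np

def rtn_error(w, bits):
    """Mean absolute error of symmetric round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1          # 127 for int8, 7 for int4
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return np.abs(w - q * scale).mean()

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 512)).astype(np.float32)
err4 = rtn_error(w, 4)   # 4-bit: only 15 levels per channel
err8 = rtn_error(w, 8)   # 8-bit: 255 levels per channel
print(f"4-bit MAE: {err4:.4f}, 8-bit MAE: {err8:.4f}")
```

The 8-bit error is roughly an order of magnitude smaller, which is why an 8-bit path (GPTQ at 8 bits, or EETQ-style W8A16) is a reasonable thing to try when 4-bit AWQ degrades generation quality.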