Support Weight-Only quantization on CPU device with QBits backend #437
Conversation
Hi @PenghuiCheng and @zhewang1-intc. This is incredibly exciting work. I will attempt to find time soon to properly review this PR and to see how well it works on CPU. I will also try benchmarking the speed on my …
It should only work on x86 CPUs; we tested on the Linux (Ubuntu) platform.
@casper-hansen Hi, we are not sure if we have done everything appropriately, but we look forward to your review. Please let us know if there's anything we can do to improve it 😄
@casper-hansen We would be delighted to see your suggestions for this PR; they will help us better understand your requirements. Looking forward to your review.
> I want to request performance benchmarks by using
Sorry for taking so long. I have been moving apartments, so I have been AFK. I will make sure to prioritize this PR.
Below is the performance benchmark run with examples/benchmark.py. These numbers are based on the master branch of the intel-extension-for-transformers (ITREX) repo; in the latest ITREX code, QBits was updated to the latest version of the BestLa kernel. The new version of ITREX will be released soon, and we will update AutoAWQ with it once it is released.
Note: we ran this benchmark on an INTEL(R) XEON(R) PLATINUM 8592+ with 8-channel 4800 MT/s memory.
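For context, a rough sketch of how such a decode-throughput measurement could be reproduced on CPU follows. This is only an illustration: examples/benchmark.py in the repo is the actual script used for the numbers above, the model path is a placeholder, and the `use_qbits` flag is an assumption based on this PR's description rather than a confirmed API.

```python
# Rough decode-throughput sketch (illustrative only; see examples/benchmark.py
# for the actual benchmark script used above).
import time
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/awq-quantized-model"  # placeholder checkpoint

# `use_qbits` is an assumed flag for the CPU/QBits path introduced by this PR.
model = AutoAWQForCausalLM.from_quantized(model_path, use_qbits=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

start = time.time()
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start

# Count only the newly generated tokens when reporting throughput.
generated = output.shape[1] - inputs["input_ids"].shape[1]
print(f"generated {generated} tokens in {elapsed:.2f}s "
      f"({generated / elapsed:.2f} tokens/s)")
```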
Force-pushed from a19739d to 1a09c77
Benchmarks are looking good for CPU! Thanks for providing them. Can you address the comments I have left?
But I don’t see any new comments 🤔
@casper-hansen Hi, could you list your comments? I only noticed two: 1) whether QBits works on Mac, and 2) the perf data.
Force-pushed from 1e08661 to d76769e
Sorry for the long delay. I have been away the past month, taking time off from open source. I pushed a small refactor to the setup to make it easier to install. I also tested the Llama 3 8B model; however, I think the CPU I selected is not well suited for LLM inference due to its low clock speed and low memory bandwidth.
Here is the CPU I was able to rent:
@casper-hansen Hi, we are preparing a blog post about ITREX accelerating AutoAWQ inference. Could you please tell us which Llama 3 8B AWQ Hugging Face model you tested on the AMD platform? We also want to benchmark Llama 3 8B.
This one should work fine. You have to set exllamav2 kernels to true to use it on AMD.
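As a minimal sketch of what that might look like (the `use_exllama_v2` flag is assumed from AutoAWQ's loading options, and the model path is a placeholder for the checkpoint referenced above, not a confirmed name):

```python
# Sketch: loading an AWQ model with the ExLlamaV2 kernels enabled (e.g. on AMD GPUs).
# The `use_exllama_v2` flag and the model path are assumptions; check the
# AutoAWQ loading options for the exact names.
from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    "path/to/llama-3-8b-awq",  # placeholder for the model referenced above
    use_exllama_v2=True,       # route matmuls through the ExLlamaV2 kernels
)
```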
Based on the suggestion in #390, we have implemented inference of AWQ models on the CPU device. This PR adds support for weight-only quantization on CPU devices and inference with the QBits backend. The QBits backend has a 'bestla' kernel for the CPU GEMM op, and QBits is a module of the intel-extension-for-transformers package.
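For illustration, a minimal sketch of the intended CPU usage follows. The `use_qbits` flag and the model path are assumptions based on this PR's description, not a confirmed public API; intel-extension-for-transformers must be installed for the QBits (BestLa) kernels to be available.

```python
# Minimal sketch of CPU inference through the QBits backend (assumed API from
# this PR; the `use_qbits` flag and checkpoint path are illustrative).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/awq-quantized-model"  # placeholder checkpoint

# Load the pre-quantized weights on CPU; matmuls are dispatched to the
# QBits (BestLa) kernels provided by intel-extension-for-transformers.
model = AutoAWQForCausalLM.from_quantized(model_path, use_qbits=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```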