
Support Weight-Only quantization on CPU device with QBits backend #437

Merged: 10 commits merged into casper-hansen:main on Jun 8, 2024

Conversation

PenghuiCheng
Contributor

@PenghuiCheng commented Apr 10, 2024

Based on the suggestion in #390, we have implemented inference of AWQ models on the CPU device. This PR adds support for Weight-Only quantization on CPU devices and inference with the QBits backend. The QBits backend provides the 'bestla' kernel for the CPU GEMM op, and QBits is a module of the intel-extension-for-transformers package.
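As a rough illustration of what the feature enables (not code from this PR; the backend-selection keyword, here assumed to be `use_qbits=True`, may differ in the merged API), CPU inference with an AWQ checkpoint could look like this:

```python
# Hypothetical usage sketch (not code from this PR): running an AWQ model on
# the CPU through the QBits backend shipped with intel-extension-for-transformers.
# The `use_qbits` keyword is an assumption about the option this PR adds.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "casperhansen/mistral-7b-instruct-v0.1-awq"

model = AutoAWQForCausalLM.from_quantized(
    model_path,
    use_qbits=True,    # assumed flag routing 4-bit GEMM to the QBits/bestla kernels
    device_map="cpu",  # keep all weights on the host CPU
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("CPU-only inference with AWQ is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```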

@PenghuiCheng
Contributor Author

Hi @songhan, this PR is based on the RFC in #390; could you help review it or assign it to a suitable reviewer?

docs/index.md Outdated Show resolved Hide resolved
@casper-hansen
Owner

casper-hansen commented Apr 15, 2024

Hi @PenghuiCheng and @zhewang1-intc. This is incredibly exciting work. I will attempt to find time soon to properly review this PR and to see how well it works on CPU. I will also try benchmarking the speed on my MacBook M2 Pro to see how well it can work for local models. EDIT: On another thought, I am not sure it will work on a Mac - will need to test.

@zhewang1-intc

zhewang1-intc commented Apr 15, 2024

> Hi @PenghuiCheng and @zhewang1-intc. This is incredibly exciting work. I will attempt to find time soon to properly review this PR and to see how well it works on CPU. I will also try benchmarking the speed on my MacBook M2 Pro to see how well it can work for local models. EDIT: On another thought, I am not sure it will work on a Mac - will need to test.

It should only work on x86 CPUs; we tested on a Linux (Ubuntu) platform.
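For illustration, a small pre-check along these lines could guard the QBits path on non-x86 hosts (purely a hypothetical helper, not part of the PR):

```python
# Purely illustrative pre-check (not part of the PR): the QBits/bestla kernels
# target x86 CPUs, so ARM machines such as Apple Silicon should use another path.
import platform

def is_x86_cpu() -> bool:
    return platform.machine().lower() in ("x86_64", "amd64", "i386", "i686")

if not is_x86_cpu():
    raise RuntimeError(
        f"QBits backend requires an x86 CPU; detected '{platform.machine()}'."
    )
```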

@zhewang1-intc

zhewang1-intc commented Apr 26, 2024

@casper-hansen Hi, we are not sure if we have done everything appropriately, but we look forward to your review. Please let us know if there's anything we can do to improve it 😄

@PenghuiCheng
Contributor Author

@casper-hansen We would be delighted to see your suggestions for this PR; they will help us better understand your requirements. Looking forward to your review.

@casper-hansen
Owner

casper-hansen commented Apr 29, 2024

I'd like to request performance benchmarks using examples/benchmark.py. Can you please run the benchmark so we can assess the speed of the implementation? We can use this to set expectations with users.

Sorry for taking so long. I have been moving apartments, so I have been AFK. I will make sure to prioritize this PR.

@PenghuiCheng
Contributor Author

PenghuiCheng commented May 16, 2024

Below are the performance benchmarks with examples/benchmark.py. These numbers are based on the master branch of the intel-extension-for-transformers (ITREX) repo; in the latest ITREX code, QBits was updated to the latest version of the BestLa kernel. The new version of ITREX will be released soon, and we will update AutoAWQ to the new ITREX version once it is released:

| Model | Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (RAM) |
|---|---|---|---|---|---|---|
| casperhansen/mistral-7b-instruct-v0.1-awq | 1 | 64 | 64 | 389.24 | 16.01 | 5.59 GB (0.02%) |
| casperhansen/mistral-7b-instruct-v0.1-awq | 1 | 2048 | 2048 | 1412 | 17.76 | 6.29 GB (0.03%) |
| TheBloke/vicuna-7B-v1.5-AWQ | 1 | 64 | 64 | 346 | 18.13 | 8.18 GB (0.03%) |
| TheBloke/vicuna-7B-v1.5-AWQ | 1 | 2048 | 2048 | 1023.4 | 18.18 | 8.80 GB (0.04%) |
| TheBloke/LLaMA2-13B-Tiefighter-AWQ | 1 | 64 | 64 | 160.24 | 9.87 | 14.65 GB (0.06%) |
| TheBloke/LLaMA2-13B-Tiefighter-AWQ | 1 | 2048 | 2048 | 592.35 | 9.93 | 16.87 GB (0.07%) |
| abhinavkulkarni/mosaicml-mpt-7b-chat-w4-g128-awq | 1 | 64 | 64 | 433.17 | 18.79 | 4.60 GB (0.02%) |
| abhinavkulkarni/mosaicml-mpt-7b-chat-w4-g128-awq | 1 | 2048 | 2048 | 404.25 | 19.91 | 4.75 GB (0.02%) |
| casperhansen/falcon-7b-awq | 1 | 64 | 64 | 303.16 | 14.41 | 5.18 GB (0.02%) |
| casperhansen/falcon-7b-awq | 1 | 2048 | 2048 | 634.57 | 15.55 | 5.80 GB (0.02%) |
| TheBloke/CodeLlama-34B-AWQ | 1 | 64 | 64 | 153.73 | 4.23 | 29.00 GB (0.12%) |
| TheBloke/CodeLlama-34B-AWQ | 1 | 2048 | 2048 | 274.25 | 4.38 | 35.21 GB (0.15%) |
| TheBloke/deepseek-coder-33B-instruct-AWQ | 1 | 64 | 64 | 83.08 | 4.07 | 22.16 GB (0.09%) |
| TheBloke/deepseek-coder-33B-instruct-AWQ | 1 | 2048 | 2048 | 296.04 | 4.33 | 37.05 GB (0.16%) |
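For context on how prefill and decode tokens/s figures like the ones above are typically obtained, here is a rough Python sketch; it is not the exact logic of examples/benchmark.py and assumes a Hugging Face-style causal LM with a KV cache:

```python
# Rough illustration of how prefill/decode tokens/s can be measured; not the
# exact logic of examples/benchmark.py (model/tokenizer loading omitted).
import time
import torch

def measure_throughput(model, input_ids: torch.Tensor, n_decode: int = 64):
    with torch.inference_mode():
        # Prefill: a single forward pass over the whole prompt.
        start = time.perf_counter()
        out = model(input_ids, use_cache=True)
        prefill_s = time.perf_counter() - start

        # Decode: generate tokens one at a time, reusing the KV cache.
        past = out.past_key_values
        next_token = out.logits[:, -1:].argmax(dim=-1)
        start = time.perf_counter()
        for _ in range(n_decode):
            out = model(next_token, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_token = out.logits[:, -1:].argmax(dim=-1)
        decode_s = time.perf_counter() - start

    prefill_tps = input_ids.shape[1] / prefill_s
    decode_tps = n_decode / decode_s
    return prefill_tps, decode_tps
```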

@zhewang1-intc

Note: we ran this benchmark on an INTEL(R) XEON(R) PLATINUM 8592+ with 8-channel 4800 MT/s memory.

README.md Outdated
@casper-hansen
Owner

Benchmarks are looking good for CPU! Thanks for providing them. Can you address the comments I have left?

@zhewang1-intc

zhewang1-intc commented May 17, 2024

> Benchmarks are looking good for CPU! Thanks for providing them. Can you address the comments I have left?

But I don’t see any new comments 🤔

@zhewang1-intc

@casper-hansen Hi, could you list your comments? I only noticed two: (1) whether QBits works on Mac, and (2) perf data.
For (1), we only support the x86 platform and do not support ARM-based chips like the M1; for (2), we have already updated the benchmark results.

@casper-hansen
Owner

casper-hansen commented Jun 8, 2024

Sorry for the long delay. I have been away the past month, taking time off from open-source.

I pushed a small refactor to the setup to make it easier to install. I also tested the Llama 3 8B model; however, I think the CPU I selected is not well suited for LLM inference due to its low clock speed and low memory bandwidth.

| Implementation | Model | Prefill tokens/s | Decode tokens/s |
|---|---|---|---|
| intel extension | Llama 3 8B | 5.17 | 1.32 |
| native PyTorch | Llama 3 8B | never finished | x |

Here is the CPU I was able to rent:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   52 bits physical, 57 bits virtual
CPU(s):                          17
On-line CPU(s) list:             0-16
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       17
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      25
Model:                           160
Model name:                      AMD EPYC 9754 128-Core Processor
Stepping:                        2
CPU MHz:                         2246.622
BogoMIPS:                        4493.24
Virtualization:                  AMD-V
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       1.1 MiB
L1i cache:                       1.1 MiB
L2 cache:                        8.5 MiB
L3 cache:                        272 MiB
NUMA node0 CPU(s):               0-16
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid 
                                 tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy ab
                                 m sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed ad
                                 x smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt nrip_save avx512vbmi umip pk
                                 u avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm arch_capabilities

@casper-hansen merged commit 2627364 into casper-hansen:main Jun 8, 2024
@zhewang1-intc

@casper-hansen Hi, we are preparing a blog post about ITREX accelerating AutoAWQ inference. Could you please tell us which Llama 3 8B AWQ Hugging Face model you tested on the AMD platform? We also want to benchmark Llama 3 8B.

@casper-hansen
Owner

This one should work fine. You have to set the ExLlamaV2 kernels to true to use it on AMD:

https://huggingface.co/casperhansen/llama-3-8b-instruct-awq
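For reference, a hedged sketch of loading that checkpoint with the ExLlamaV2 kernels enabled; treat the exact keyword (`use_exllama_v2=True`) as an assumption about the AutoAWQ option referred to above:

```python
# Hedged sketch: loading the linked checkpoint with the ExLlamaV2 kernels.
# The `use_exllama_v2` keyword is an assumption about the option mentioned above.
from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    "casperhansen/llama-3-8b-instruct-awq",
    use_exllama_v2=True,  # enable the ExLlamaV2 GEMM kernels
)
```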
