
Support Weight-Only quantization on CPU device with QBits backend #437

Merged: 10 commits merged into casper-hansen:main on Jun 8, 2024

Conversation

PenghuiCheng
Contributor

@PenghuiCheng commented Apr 10, 2024

Based on the suggestion in #390, we have implemented inference of AWQ models on the CPU device. This PR adds support for Weight-Only quantization on CPU devices and inference with the QBits backend. The QBits backend provides the 'bestla' kernel for the CPU GEMM op, and QBits is a module of the intel-extension-for-transformers package.
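As a rough illustration of what the feature enables (not code from this PR; the backend-selection keyword, here assumed to be `use_qbits=True`, may differ in the merged API), CPU inference with an AWQ checkpoint could look like this:

```python
# Hypothetical usage sketch (not code from this PR): running an AWQ model on
# the CPU through the QBits backend shipped with intel-extension-for-transformers.
# The `use_qbits` keyword is an assumption about the option this PR adds.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "casperhansen/mistral-7b-instruct-v0.1-awq"

model = AutoAWQForCausalLM.from_quantized(
    model_path,
    use_qbits=True,    # assumed flag routing 4-bit GEMM to the QBits/bestla kernels
    device_map="cpu",  # keep all weights on the host CPU
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("CPU-only inference with AWQ is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```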

@PenghuiCheng
Contributor Author

Hi @songhan, this PR is based on the RFC in #390; could you help review it or assign it to a suitable reviewer?

docs/index.md Outdated Show resolved Hide resolved
@casper-hansen
Owner

casper-hansen commented Apr 15, 2024

Hi @PenghuiCheng and @zhewang1-intc. This is incredibly exciting work. I will attempt to find time soon to properly review this PR and to see how well it works on CPU. I will also try benchmarking the speed on my MacBook M2 Pro to see how well it can work for local models. EDIT: On another thought, I am not sure it will work on a Mac - will need to test.

@zhewang1-intc

zhewang1-intc commented Apr 15, 2024

> Hi @PenghuiCheng and @zhewang1-intc. This is incredibly exciting work. I will attempt to find time soon to properly review this PR and to see how well it works on CPU. I will also try benchmarking the speed on my MacBook M2 Pro to see how well it can work for local models. EDIT: On another thought, I am not sure it will work on a Mac - will need to test.

It should only work on x86 CPUs; we tested on a Linux (Ubuntu) platform.
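For illustration, a small pre-check along these lines could guard the QBits path on non-x86 hosts (purely a hypothetical helper, not part of the PR):

```python
# Purely illustrative pre-check (not part of the PR): the QBits/bestla kernels
# target x86 CPUs, so ARM machines such as Apple Silicon should use another path.
import platform

def is_x86_cpu() -> bool:
    return platform.machine().lower() in ("x86_64", "amd64", "i386", "i686")

if not is_x86_cpu():
    raise RuntimeError(
        f"QBits backend requires an x86 CPU; detected '{platform.machine()}'."
    )
```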

@zhewang1-intc

zhewang1-intc commented Apr 26, 2024

@casper-hansen Hi, we are not sure if we have done everything appropriately, but we look forward to your review. Please let us know if there's anything we can do to improve it 😄

@PenghuiCheng
Contributor Author

@casper-hansen We would be delighted to see your suggestions for this PR; they will help us better understand your requirements. Looking forward to your review.

@casper-hansen
Owner

casper-hansen commented Apr 29, 2024

I'd like to request performance benchmarks using examples/benchmark.py. Can you please run the benchmark so we can assess the speed of the implementation? We can use this to set expectations with users.

Sorry for taking so long. I have been moving apartments, so I have been AFK. I will make sure to prioritize this PR.

@PenghuiCheng
Contributor Author

PenghuiCheng commented May 16, 2024

Below are the performance benchmarks with examples/benchmark.py. These numbers are based on the master branch of the intel-extension-for-transformers (ITREX) repo; in the latest ITREX code, QBits was updated to the latest version of the BestLa kernel. The new version of ITREX will be released soon, and we will update AutoAWQ to the new ITREX version once it is released:

| Model | Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (RAM) |
|---|---|---|---|---|---|---|
| casperhansen/mistral-7b-instruct-v0.1-awq | 1 | 64 | 64 | 389.24 | 16.01 | 5.59 GB (0.02%) |
| casperhansen/mistral-7b-instruct-v0.1-awq | 1 | 2048 | 2048 | 1412 | 17.76 | 6.29 GB (0.03%) |
| TheBloke/vicuna-7B-v1.5-AWQ | 1 | 64 | 64 | 346 | 18.13 | 8.18 GB (0.03%) |
| TheBloke/vicuna-7B-v1.5-AWQ | 1 | 2048 | 2048 | 1023.4 | 18.18 | 8.80 GB (0.04%) |
| TheBloke/LLaMA2-13B-Tiefighter-AWQ | 1 | 64 | 64 | 160.24 | 9.87 | 14.65 GB (0.06%) |
| TheBloke/LLaMA2-13B-Tiefighter-AWQ | 1 | 2048 | 2048 | 592.35 | 9.93 | 16.87 GB (0.07%) |
| abhinavkulkarni/mosaicml-mpt-7b-chat-w4-g128-awq | 1 | 64 | 64 | 433.17 | 18.79 | 4.60 GB (0.02%) |
| abhinavkulkarni/mosaicml-mpt-7b-chat-w4-g128-awq | 1 | 2048 | 2048 | 404.25 | 19.91 | 4.75 GB (0.02%) |
| casperhansen/falcon-7b-awq | 1 | 64 | 64 | 303.16 | 14.41 | 5.18 GB (0.02%) |
| casperhansen/falcon-7b-awq | 1 | 2048 | 2048 | 634.57 | 15.55 | 5.80 GB (0.02%) |
| TheBloke/CodeLlama-34B-AWQ | 1 | 64 | 64 | 153.73 | 4.23 | 29.00 GB (0.12%) |
| TheBloke/CodeLlama-34B-AWQ | 1 | 2048 | 2048 | 274.25 | 4.38 | 35.21 GB (0.15%) |
| TheBloke/deepseek-coder-33B-instruct-AWQ | 1 | 64 | 64 | 83.08 | 4.07 | 22.16 GB (0.09%) |
| TheBloke/deepseek-coder-33B-instruct-AWQ | 1 | 2048 | 2048 | 296.04 | 4.33 | 37.05 GB (0.16%) |
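For context on how prefill and decode tokens/s figures like the ones above are typically obtained, here is a rough Python sketch; it is not the exact logic of examples/benchmark.py and assumes a Hugging Face-style causal LM with a KV cache:

```python
# Rough illustration of how prefill/decode tokens/s can be measured; not the
# exact logic of examples/benchmark.py (model/tokenizer loading omitted).
import time
import torch

def measure_throughput(model, input_ids: torch.Tensor, n_decode: int = 64):
    with torch.inference_mode():
        # Prefill: a single forward pass over the whole prompt.
        start = time.perf_counter()
        out = model(input_ids, use_cache=True)
        prefill_s = time.perf_counter() - start

        # Decode: generate tokens one at a time, reusing the KV cache.
        past = out.past_key_values
        next_token = out.logits[:, -1:].argmax(dim=-1)
        start = time.perf_counter()
        for _ in range(n_decode):
            out = model(next_token, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_token = out.logits[:, -1:].argmax(dim=-1)
        decode_s = time.perf_counter() - start

    prefill_tps = input_ids.shape[1] / prefill_s
    decode_tps = n_decode / decode_s
    return prefill_tps, decode_tps
```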

@zhewang1-intc

Note: we ran this benchmark on an INTEL(R) XEON(R) PLATINUM 8592+ with 8-channel 4800 MT/s memory.

README.md Outdated
@casper-hansen
Owner

Benchmarks are looking good for CPU! Thanks for providing them. Can you address the comments I have left?

@zhewang1-intc

zhewang1-intc commented May 17, 2024

> Benchmarks are looking good for CPU! Thanks for providing them. Can you address the comments I have left?

But I don’t see any new comments 🤔

@zhewang1-intc

@casper-hansen Hi, could you list your comments? I only noticed two: (1) whether QBits works on Mac, and (2) perf data.
For (1), we only support the x86 platform and do not support ARM-based chips like the M1; for (2), we have already updated the benchmark results.

@casper-hansen
Owner

casper-hansen commented Jun 8, 2024

Sorry for the long delay. I have been away the past month, taking time off from open-source.

I pushed a small refactor to the setup to make it easier to install. I also tested the Llama 3 8B model; however, I think the CPU I selected is not well suited for LLM inference due to its low clock speed and low memory bandwidth.

| Implementation | Model | Prefill tokens/s | Decode tokens/s |
|---|---|---|---|
| intel extension | Llama 3 8B | 5.17 | 1.32 |
| native PyTorch | Llama 3 8B | never finished | x |

Here is the CPU I was able to rent:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   52 bits physical, 57 bits virtual
CPU(s):                          17
On-line CPU(s) list:             0-16
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       17
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      25
Model:                           160
Model name:                      AMD EPYC 9754 128-Core Processor
Stepping:                        2
CPU MHz:                         2246.622
BogoMIPS:                        4493.24
Virtualization:                  AMD-V
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       1.1 MiB
L1i cache:                       1.1 MiB
L2 cache:                        8.5 MiB
L3 cache:                        272 MiB
NUMA node0 CPU(s):               0-16
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid 
                                 tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy ab
                                 m sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed ad
                                 x smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt nrip_save avx512vbmi umip pk
                                 u avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm arch_capabilities

@casper-hansen merged commit 2627364 into casper-hansen:main Jun 8, 2024
@zhewang1-intc

@casper-hansen Hi, we are preparing a blog post about ITREX accelerating AutoAWQ inference. Could you please tell us which Llama 3 8B AWQ Hugging Face model you tested on the AMD platform? We also want to benchmark Llama 3 8B.

@casper-hansen
Owner

This one should work fine. You have to set the ExLlamaV2 kernels to true to use it on AMD:

https://huggingface.co/casperhansen/llama-3-8b-instruct-awq
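For reference, a hedged sketch of loading that checkpoint with the ExLlamaV2 kernels enabled; treat the exact keyword (`use_exllama_v2=True`) as an assumption about the AutoAWQ option referred to above:

```python
# Hedged sketch: loading the linked checkpoint with the ExLlamaV2 kernels.
# The `use_exllama_v2` keyword is an assumption about the option mentioned above.
from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    "casperhansen/llama-3-8b-instruct-awq",
    use_exllama_v2=True,  # enable the ExLlamaV2 GEMM kernels
)
```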
