[BesTLA] Improve RTN quantization accuracy of int4 and int3 #172

luoyu-intel · 2024-03-13T08:42:34Z

Type of Change

Improve the quantization accuracy of BesTLA's quantization packweight API.

Introduce auto-fullrange for NBits quantization
Add int3 rounding conversion
Optimize int4 decompression on client CPUs
Remove S4_FULLRANGE, as it's already covered by auto-fullrange
Root-cause and fix the Float4 performance issue on hybrid CPUs (20%+ speedup)
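
The PR description does not spell out how auto-fullrange chooses a range, so the following is only a minimal sketch of one plausible reading, not BesTLA's actual kernel. It assumes per-group round-to-nearest (RTN) quantization where each group is quantized twice, once with the symmetric range [-(2^(n-1)-1), 2^(n-1)-1] and once with the full range [-2^(n-1), 2^(n-1)-1], and the variant with the lower squared reconstruction error is kept. The function names `rtn_quantize`, `squared_error`, and `rtn_quantize_auto` are hypothetical and do not come from the BesTLA sources.

```python
import math


def rtn_quantize(block, bits, full_range):
    """Round-to-nearest quantization of one weight group to signed `bits`-bit ints.

    Returns (quantized_values, scale); dequantized value = q * scale.
    """
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    # Signed value with the largest magnitude in the group.
    peak = max(block, key=abs)
    if peak == 0.0:
        return [0] * len(block), 1.0
    if full_range:
        # Map the peak onto qmin so that all 2**bits levels are usable.
        # The scale may be negative; that is fine for dequantization.
        scale = peak / qmin
    else:
        # Classic symmetric range [-qmax, qmax]; the qmin level stays unused.
        scale = abs(peak) / qmax
        qmin = -qmax
    # floor(v + 0.5) rounds halves upward, avoiding Python's banker's rounding.
    q = [min(qmax, max(qmin, math.floor(x / scale + 0.5))) for x in block]
    return q, scale


def squared_error(block, q, scale):
    """Sum of squared differences between the original and dequantized values."""
    return sum((x - qi * scale) ** 2 for x, qi in zip(block, q))


def rtn_quantize_auto(block, bits):
    """Assumed 'auto' policy: try both ranges and keep the lower-error one."""
    candidates = [rtn_quantize(block, bits, fr) for fr in (False, True)]
    return min(candidates, key=lambda c: squared_error(block, *c))


# Example: quantize a tiny group to int3 and reconstruct it.
weights = [0.8, -0.1, 0.35, -0.62]
q, scale = rtn_quantize_auto(weights, bits=3)
restored = [qi * scale for qi in q]
```

In this sketch the rounding step (rather than truncation) is what keeps the int3 error bounded by half a quantization step per weight, and the error comparison lets groups with a single dominant value use the extra negative level while balanced groups keep the symmetric range.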

luoyu-intel · 2024-03-13T09:18:14Z

Text generation comparison (weight_dtype=int3, group_size=128)
prompt: 'Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. '
This PR:

And every day that same little girl would put on her best dress and pear

model_print_timings:        load time =    73.59 ms
model_print_timings:      sample time =     8.07 ms /    16 runs   (    0.50 ms per token)
model_print_timings: prompt eval time =    73.56 ms /    34 tokens (    2.16 ms per token)
model_print_timings:        eval time =   285.11 ms /    15 runs   (   19.01 ms per token)
model_print_timings:       total time =   370.79 ms
========== eval time log of each prediction ==========
prediction   0, time: 73.56ms
prediction   1, time: 19.65ms
prediction   2, time: 19.07ms
prediction   3, time: 18.98ms
prediction   4, time: 19.01ms
prediction   5, time: 19.08ms

Main:

Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија
model_print_timings:        load time =    73.01 ms
model_print_timings:      sample time =     7.62 ms /    16 runs   (    0.48 ms per token)
model_print_timings: prompt eval time =    72.98 ms /    34 tokens (    2.15 ms per token)
model_print_timings:        eval time =   289.00 ms /    15 runs   (   19.27 ms per token)
model_print_timings:       total time =   373.55 ms
========== eval time log of each prediction ==========
prediction   0, time: 72.98ms
prediction   1, time: 19.92ms
prediction   2, time: 19.21ms
prediction   3, time: 19.20ms
prediction   4, time: 19.23ms
prediction   5, time: 19.45ms

bestla/bestla/kernel_ref.h

luoyu-intel · 2024-03-13T10:04:09Z

@kevinintel @hshen14 INT3 RTN quantization can generate reasonable text now. I added 'int3' to the quantization weight_dtype options in this PR.

hshen14

what's our INT3 GEMM perf vs. llama.cpp INT3 GEMM perf?

bestla/bestla/kernel_ref.h

luoyu-intel · 2024-03-14T01:29:09Z

> what's our INT3 GEMM perf vs. llama.cpp INT3 GEMM perf?

@hshen14 llama.cpp Q3_K_S performance:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. But she was always too scared to leave her house or talk to strangers
llama_print_timings:        load time =     154.97 ms
llama_print_timings:      sample time =       3.68 ms /    16 runs   (    0.23 ms per token,  4345.46 tokens per second)
llama_print_timings: prompt eval time =     382.78 ms /    34 tokens (   11.26 ms per token,    88.82 tokens per second)
llama_print_timings:        eval time =     576.09 ms /    15 runs   (   38.41 ms per token,    26.04 tokens per second)
llama_print_timings:       total time =     968.28 ms /    49 tokens

Eval time per token: about 19 ms (this PR, int3) vs. about 38 ms (llama.cpp Q3_K_S).

luoyu-intel marked this pull request as draft March 13, 2024 08:47

luoyu-intel marked this pull request as ready for review March 13, 2024 09:18

luoyu-intel requested review from zhewang1-intc and DDEle March 13, 2024 09:18

zhewang1-intc approved these changes Mar 13, 2024

bestla/bestla/kernel_ref.h (resolved)

luoyu-intel requested a review from airMeng March 13, 2024 09:56

luoyu-intel changed the title ~~[BesTLA] Improve quantization accuracy of int4 and int3~~ [BesTLA] Improve RTN quantization accuracy of int4 and int3 Mar 13, 2024

hshen14 reviewed Mar 13, 2024

airMeng reviewed Mar 14, 2024

bestla/bestla/kernel_ref.h (resolved)

DDEle approved these changes Mar 14, 2024

luoyu-intel mentioned this pull request Mar 14, 2024

fix nf4 performance in hybrid CPU #120

Closed

VincyZhang added the v1.0a label Mar 15, 2024

luoyu-intel added 15 commits March 15, 2024 16:03

add s4_auto calibration

f4184b2

remove debug code

5ca5070

fix S3 quant error: add rounding and auto quant.

787aefb

clang-format

ffc50b2

add int3 for quant args

3c074c6

fix compile error on GCC8.5

ce73ca6

revert random range

98ae6e3

use AVX512F inst

d9372e5

pass compilation on GCC8.5

b01354a

add avx2 version of s4 decompression

a416ac2

fix compile

4e5821a

remove S4_FULLRANGE

1afa830

fix compile

d81a785

add decompress_kblock_s4_fp to avx2 file

2738a1e

remove warnings

f1904ae

luoyu-intel added 11 commits March 15, 2024 16:03

remove pow usage

dd1ec9f

fix dtype

080020f

fix the e8m0 conversion code

edf6ddd

remove SSE unpack 4bit

64a63cb

remove _mm256_i32gather_ps for poor performance

417b505

for dequant

08d6d56

fix compile

8f9c1a7

fix UT error

01b4679

fix UT err

a2f9814

fix thread dead lock

670b1d3

clang-format

ad6b441

luoyu-intel force-pushed the opt_int4_quant branch from 7f7cb71 to ad6b441 Compare March 15, 2024 08:03

zhewang1-intc and others added 3 commits March 15, 2024 16:22

fix double-quant bug

5899951

fix UT threshold

f5bda82

fix code bug

058a574

luoyu-intel added the ready to merge label Mar 15, 2024

airMeng merged commit a90aea7 into main Mar 18, 2024
12 checks passed

zhewang1-intc deleted the opt_int4_quant branch May 6, 2024 07:24
