metal : add quantized FA support #10149

ggerganov · 2024-11-03T13:24:19Z

Supersedes #9735

Extend the FA kernels to support quantized KV cache.
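For readers unfamiliar with what a quantized KV cache stores, here is a minimal sketch of a q8_0-style block round trip: 32 values share one scale and are kept as 8-bit integers. The function names are hypothetical, the scale is kept as a Python float rather than a half-precision value, and Python's `round` is used for simplicity, so this illustrates the idea and is not the Metal kernel code from this PR.

```python
QK8_0 = 32  # values per q8_0 block


def quantize_q8_0(block):
    """Quantize 32 floats to (scale, int8 list). Illustrative helper, not ggml API."""
    assert len(block) == QK8_0
    amax = max(abs(x) for x in block)
    d = amax / 127.0  # one scale per block (simplification: not rounded to fp16)
    if d == 0.0:
        return 0.0, [0] * QK8_0
    qs = [round(x / d) for x in block]  # each q lands in [-127, 127]
    return d, qs


def dequantize_q8_0(d, qs):
    """Recover approximate floats; an FA kernel would do this on the fly for K and V."""
    return [d * q for q in qs]


block = [(i - 16) / 4 for i in range(QK8_0)]  # sample data: -4.0 .. 3.75
d, qs = quantize_q8_0(block)
restored = dequantize_q8_0(d, qs)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={d:.6f} max_abs_error={max_err:.6f}")
```

Each block then takes 32 bytes of quants plus a 2-byte scale, versus 64 bytes for the same 32 values in f16, which is where the KV cache memory saving comes from.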

TODO:

- try to extend the non-vec FA kernel as well
- separate code movement in different PR to reduce diff

ggml-ci

slaren · 2024-11-04T20:40:59Z

test-backend-ops fails on M3 Max. On master this test was skipped.

FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): ggml/src/ggml-metal.m:3193: GGML_ASSERT(smem <= device.maxThreadgroupMemoryLength) failed

Performance looks good, a bit slower than f16 on prompt processing:

| model | size | params | backend | ngl | type_k | type_v | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 99 | f16 | f16 | 1 | pp512 | 832.59 ± 1.26 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 99 | f16 | f16 | 1 | tg128 | 69.68 ± 0.05 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 99 | q4_0 | q4_0 | 1 | pp512 | 775.79 ± 0.77 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 99 | q4_0 | q4_0 | 1 | tg128 | 69.50 ± 0.08 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 99 | q8_0 | q8_0 | 1 | pp512 | 776.22 ± 0.75 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 99 | q8_0 | q8_0 | 1 | tg128 | 69.56 ± 0.05 |
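To put "a bit slower" into numbers, the mean t/s values from the table can be compared against the f16 baseline. This is a small sketch over the reported means only; it ignores the ± spread, and the helper name is made up for illustration.

```python
# Mean tokens/s copied from the table above.
# pp512 = prompt processing of 512 tokens, tg128 = generation of 128 tokens.
results = {
    ("f16", "pp512"): 832.59,
    ("f16", "tg128"): 69.68,
    ("q4_0", "pp512"): 775.79,
    ("q4_0", "tg128"): 69.50,
    ("q8_0", "pp512"): 776.22,
    ("q8_0", "tg128"): 69.56,
}


def rel_change_pct(kv_type, test):
    """Percent change in t/s relative to the f16 KV cache for the same test."""
    base = results[("f16", test)]
    return 100.0 * (results[(kv_type, test)] / base - 1.0)


for kv_type in ("q4_0", "q8_0"):
    for test in ("pp512", "tg128"):
        print(f"{kv_type} {test}: {rel_change_pct(kv_type, test):+.2f}%")
```

On these figures both quantized types are roughly 7% slower on pp512 and within about 0.3% of f16 on tg128.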

ggerganov · 2024-11-05T06:15:59Z

Thanks for checking. The shared memory assert should be good now - let me know if it still fails.

slaren

test-backend-ops passes now, and perplexity also looks good.



ggerganov added 3 commits November 4, 2024 13:50

metal : add quantized FA (vec) support

6c484f3

ggml-ci

metal : add quantized FA (non-vec) support

e9565cc

metal : fix support check

13b87f2

ggml-ci

ggerganov force-pushed the gg/metal-fa-q branch from 82a7012 to 13b87f2 November 4, 2024 11:51

metal : clean-up

dd0d9ed

ggerganov marked this pull request as ready for review November 4, 2024 12:10

metal : clean-up (cont)

1e12961

ggerganov mentioned this pull request Nov 4, 2024

metal : optimize FA kernels #10171

Merged

2 tasks

metal : fix shared memory calc + reduce smem + comments

d805404

ggerganov force-pushed the gg/metal-fa-q branch from 624485f to d805404 November 5, 2024 06:17

ggerganov added 2 commits November 5, 2024 09:24

metal : float-correctness

73f378d

metal : minor [no ci]

9c13f95

slaren approved these changes Nov 5, 2024

ggerganov merged commit a1eaf6a into master Nov 6, 2024
1 check passed

ggerganov deleted the gg/metal-fa-q branch November 6, 2024 08:24

stefanb mentioned this pull request Nov 9, 2024

Bug: ggml_metal_init error: zero-length arrays are not permitted in C++ float4x4 lo[D16/NW4]; #10208

Closed
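The declaration quoted in that bug title, `float4x4 lo[D16/NW4]`, sizes an array with an integer division, which truncates to zero whenever the numerator is smaller than the denominator. The sketch below reproduces the arithmetic under two assumptions that are not confirmed in this thread: that `D16` is the head size divided by 16 and that `NW4` is a 32-wide SIMD group divided by 4. The rounding-up variant is one common guard, not necessarily the fix adopted upstream.

```python
SIMD_WIDTH = 32  # assumption: threads per simdgroup


def lo_len_floor(head_size):
    """Array length as in `lo[D16/NW4]`, using truncating integer division."""
    d16 = head_size // 16   # assumed meaning of D16
    nw4 = SIMD_WIDTH // 4   # assumed meaning of NW4
    return d16 // nw4


def lo_len_ceil(head_size):
    """Same length rounded up, so it is never zero for a positive head size."""
    d16 = head_size // 16
    nw4 = SIMD_WIDTH // 4
    return (d16 + nw4 - 1) // nw4


for hs in (64, 80, 128, 256):
    print(f"head_size={hs}: floor={lo_len_floor(hs)} ceil={lo_len_ceil(hs)}")
```

Under these assumptions, head sizes below 128 produce a zero-length array with the truncating form, which matches the kind of compile error reported in #10208.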
