metal : add quantized FA support #10149

Merged: 8 commits merged into master from gg/metal-fa-q on Nov 6, 2024

ggerganov (Owner) commented Nov 3, 2024

Supersedes #9735

Extend the FA kernels to support quantized KV cache.

TODO:

  • try to extend the non-vec FA kernel as well
  • separate code movement in different PR to reduce diff
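
For readers unfamiliar with what a quantized KV cache means for an attention kernel, here is a minimal sketch in Python (not the Metal code from this PR). It assumes the ggml `q8_0` layout of 32 values per block with one scale per block; the scale is kept as a Python float rather than fp16, and the function names are hypothetical. It shows the extra dequantization step a kernel performs before the usual Q·K dot product.

```python
QK = 32  # values per block in ggml's q8_0 and q4_0 formats

# Bytes needed to store one block of 32 values in each KV cache type.
BLOCK_BYTES = {
    "f16": QK * 2,        # 32 half-precision values
    "q8_0": 2 + QK,       # fp16 scale + 32 int8 values
    "q4_0": 2 + QK // 2,  # fp16 scale + 32 packed 4-bit values
}


def quantize_q8_0(values):
    """Split `values` into blocks of 32 and return (scale, int8 list) pairs."""
    assert len(values) % QK == 0, "length must be a multiple of the block size"
    blocks = []
    for start in range(0, len(values), QK):
        chunk = values[start:start + QK]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0             # per-block scale (fp16 in ggml, float here)
        inv = 1.0 / d if d else 0.0  # an all-zero block stores zeros
        # Python's round() rounds halves to even; close enough for a sketch.
        blocks.append((d, [int(round(v * inv)) for v in chunk]))
    return blocks


def dequantize_q8_0(blocks):
    """Expand (scale, int8 list) blocks back into floats."""
    return [d * q for d, qs in blocks for q in qs]


def attention_scores(query, quantized_keys, scale):
    """Scaled dot product of one query row with each quantized K row."""
    return [
        scale * sum(a * b for a, b in zip(query, dequantize_q8_0(row)))
        for row in quantized_keys
    ]
```

Per 32 cached values, `q8_0` takes 34 bytes and `q4_0` takes 18 bytes, against 64 bytes for `f16`, which is the memory saving that motivates supporting these types in the FA kernels.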

ggerganov marked this pull request as ready for review November 4, 2024 12:10

slaren (Collaborator) commented Nov 4, 2024

test-backend-ops fails on M3 Max. On master this test was skipped.

FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): ggml/src/ggml-metal.m:3193: GGML_ASSERT(smem <= device.maxThreadgroupMemoryLength) failed
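
The assertion above fires when the threadgroup (shared) memory requested for a kernel launch exceeds what the device reports. The sketch below illustrates that kind of guard in Python. The sizing formula, the function names, the default of 8 simdgroups, and the 32 KiB limit are all illustrative assumptions, not the calculation used in ggml-metal.m.

```python
def smem_bytes(head_size, queries_per_group, simdgroups, elem_size=2):
    """Hypothetical request: one shared tile of query rows plus one
    accumulator tile of the same shape per simdgroup."""
    tile = queries_per_group * head_size
    return tile * (1 + simdgroups) * elem_size


def pick_simdgroups(head_size, queries_per_group, limit_bytes, max_simdgroups=8):
    """Return the largest power-of-two simdgroup count whose request fits
    within `limit_bytes`, or None if even a single simdgroup does not fit."""
    nsg = max_simdgroups
    while nsg >= 1:
        if smem_bytes(head_size, queries_per_group, nsg) <= limit_bytes:
            return nsg
        nsg //= 2
    return None


LIMIT = 32 * 1024  # assumed device limit, for illustration only

print(smem_bytes(128, 8, 8), pick_simdgroups(128, 8, LIMIT))
print(smem_bytes(256, 8, 8), pick_simdgroups(256, 8, LIMIT))
```

Under these assumed numbers a head size of 128 fits with 8 simdgroups, while a head size of 256 overshoots the limit until the simdgroup count is halved. That is the general shape of the failure reported for `hs=256`: either the request is reduced or the case is reported as unsupported.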

Performance looks good, a bit slower than f16 on prompt processing:

| model | size | params | backend | ngl | type_k | type_v | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 99 | f16 | f16 | 1 | pp512 | 832.59 ± 1.26 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 99 | f16 | f16 | 1 | tg128 | 69.68 ± 0.05 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 99 | q4_0 | q4_0 | 1 | pp512 | 775.79 ± 0.77 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 99 | q4_0 | q4_0 | 1 | tg128 | 69.50 ± 0.08 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 99 | q8_0 | q8_0 | 1 | pp512 | 776.22 ± 0.75 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 99 | q8_0 | q8_0 | 1 | tg128 | 69.56 ± 0.05 |
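
To put a number on "a bit slower", the throughput figures above can be compared against the f16 baseline. This is a small sketch that uses only the mean t/s values from the table and ignores the reported ± spread; the helper name is hypothetical.

```python
# Mean throughput (t/s) copied from the benchmark table above.
results = {
    ("f16", "pp512"): 832.59,
    ("f16", "tg128"): 69.68,
    ("q4_0", "pp512"): 775.79,
    ("q4_0", "tg128"): 69.50,
    ("q8_0", "pp512"): 776.22,
    ("q8_0", "tg128"): 69.56,
}


def slowdown_pct(baseline, value):
    """Percentage by which `value` falls below `baseline`."""
    return 100.0 * (baseline - value) / baseline


for kv_type in ("q4_0", "q8_0"):
    for test in ("pp512", "tg128"):
        pct = slowdown_pct(results[("f16", test)], results[(kv_type, test)])
        print(f"{kv_type} {test}: {pct:.1f}% slower than f16")
```

Both quantized types come out roughly 7% slower than f16 on prompt processing (pp512) and well under 1% slower on token generation (tg128).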

ggerganov (Owner, Author) commented

Thanks for checking. The shared memory assert should be good now - let me know if it still fails.

slaren (Collaborator) left a comment

test-backend-ops passes now, and perplexity also looks good.

ggerganov merged commit a1eaf6a into master Nov 6, 2024
1 check passed
ggerganov deleted the gg/metal-fa-q branch November 6, 2024 08:24
apicalshark added a commit to apicalshark/llama.cpp that referenced this pull request Nov 7, 2024
* metal : fix minor string leaks (ggml/1004)

* cmake : make it possible linking ggml as external lib (ggml/1003)

* sync : ggml

* CANN: adjust backend registry refactor. (ggerganov#10158)

remove buffer->iface.get_name that used in cann as it was removed in backend registry refactor PR.

* metal : move dequantize templates to beginning of MSL source (#0)

* metal : simplify f16 and f32 dequant kernels (#0)

* cuda : clear error after changing peer access (ggerganov#10153)

* fix build break on arm64 linux (ggerganov#10166)

This fixes the build break from the recent changes
to move the CPU backend to separate files
ggerganov#10144

* server : clarify /slots endpoint, add is_processing (ggerganov#10162)

    * server : clarify /slots endpoint, add is_processing

    * fix tests

* ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (ggerganov#10167)

* ggml : fix gelu tables initialization (ggerganov#10172)

* Q6_K AVX improvements (ggerganov#10118)

    * q6_k instruction reordering attempt

    * better subtract method

    * should be theoretically faster

      small improvement with shuffle lut, likely because all loads are already done at that stage

    * optimize bit fiddling

    * handle -32 offset separately. bsums exists for a reason!

    * use shift

    * Update ggml-quants.c

    * have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86

* ggml : fix arch check in bf16_to_fp32 (ggerganov#10164)

* llama : add <|tool_call|> formatting to Granite template (ggerganov#10177)

Branch: GraniteToolCallTemplate

Signed-off-by: Gabe Goodhart <[email protected]>

* metal : add quantized FA support (ggerganov#10149)

    * metal : add quantized FA (vec) support

      ggml-ci

    * metal : add quantized FA (non-vec) support

    * metal : fix support check

      ggml-ci

    * metal : clean-up

    * metal : clean-up (cont)

    * metal : fix shared memory calc + reduce smem + comments

    * metal : float-correctness

    * metal : minor [no ci]

* ggml : adjust is_first_call init value (ggerganov#10193)

ggml-ci

* metal : fix from ptr buffer name (ggerganov#10189)

* server : remove hack for extra parallel slot (ggerganov#10187)

ggml-ci

* metal : add BF16 support (ggerganov#8439)

    * ggml : add initial BF16 support

      ggml-ci

    * metal : add mul_mat_id BF16 support

      ggml-ci

    * metal : check for bfloat support on the Metal device

      ggml-ci

    * metal : better var names [no ci]

    * metal : do not build bfloat kernels when not supported

      ggml-ci

    * metal : try to fix BF16 support check

      ggml-ci

    * metal : this should correctly check bfloat support

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Plamen Minev <[email protected]>
Co-authored-by: Yuri Khrustalev <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: leo-pony <[email protected]>
Co-authored-by: Diego Devesa <[email protected]>
Co-authored-by: snadampal <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: Eve <[email protected]>
Co-authored-by: Gabe Goodhart <[email protected]>
apicalshark added a commit to apicalshark/llama.cpp that referenced this pull request Nov 8, 2024
* Merge PR (#10) (#11) (#13)

Merge

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dennyxbox890 <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump requests from 2.31.0 to 2.32.2 in the pip group across 1 directory

Bumps the pip group with 1 update in the / directory: [requests](https://github.com/psf/requests).


Updates `requests` from 2.31.0 to 2.32.2
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](psf/requests@v2.31.0...v2.32.2)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: direct:production
  dependency-group: pip
...

Signed-off-by: dependabot[bot] <[email protected]>

* Temp (#15)

---------

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: dennyxbox890 <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Plamen Minev <[email protected]>
Co-authored-by: Yuri Khrustalev <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: leo-pony <[email protected]>
Co-authored-by: Diego Devesa <[email protected]>
Co-authored-by: snadampal <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: Eve <[email protected]>
Co-authored-by: Gabe Goodhart <[email protected]>