llama : refactor graph build code #3837
Conversation
Force-pushed from 66a54bf to b4ad03b (ggml-ci)
* llama : add llm_build_norm helper function (ggml-ci)
* llama : add llm_build_ffn helper function (#3849) (ggml-ci)
* llama : add llm_build_k_shift helper (ggml-ci)
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper (ggml-ci)
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper (ggml-ci)
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names (ggml-ci)
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper
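The `llm_build_*` commits above factor graph-building code that every model architecture used to repeat inline into shared helpers. As a rough, self-contained sketch of the pattern (toy tensor type and simplified names; not the actual ggml/llama.cpp API), a norm helper in the spirit of `llm_build_norm` centralizes the RMS-norm formula so each architecture's builder just calls it:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Toy stand-in for a tensor: llama.cpp builds ggml_tensor graph nodes,
// this sketch just computes on a vector of floats.
using tensor = std::vector<float>;

// Shared helper in the spirit of llm_build_norm: every architecture's
// graph builder calls this instead of repeating the RMS-norm math inline.
tensor build_rms_norm(const tensor & x, const tensor & weight, float eps = 1e-5f) {
    assert(x.size() == weight.size());
    double sum_sq = 0.0;
    for (float v : x) {
        sum_sq += (double) v * v;
    }
    const float scale = 1.0f / std::sqrt((float)(sum_sq / x.size()) + eps);
    tensor out(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        out[i] = x[i] * scale * weight[i]; // normalize, then apply learned gain
    }
    return out;
}
```

The payoff described in the PR is that adding a new model architecture means composing a handful of such helpers rather than re-deriving each sub-graph by hand.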
Planning to merge this soon. It will be a miracle if I didn't break something, but I think the refactoring should make it a bit easier to add new model arches in the future.
This change broke Falcon.

Could you be more specific? Falcon 7B appears to still work for me on both CPU and CUDA.

The attention norm for Falcon-40B was using the wrong input tensor. This should be fixed with 523e49b.
I can confirm Persimmon is broken.
I'm uploading a Q4_K model to https://huggingface.co/Galunid/persimmon-gguf/tree/main. I tested it with 238657d and it was working; 71e3718 does not work. I don't know if I should create a separate issue for this.
* undoing more semantic renames in ggerganov/llama.cpp#3837
* llama : factor out ggml-alloc from graph build functions (ggml-ci)
* metal : disable kernel load log
* llama : factor out tensor offloading outside the build call (wip) (ggml-ci)
* llama : offload rest of the models (ggml-ci)
* llama : update offload log messages to print node index
* llama : comments
* llama : support offloading result_norm + comments
* llama : factor graph input into a function
* llama : do tensor offload only with CUDA
* llama : fix res_norm offloading
* llama : try to optimize offloading code
* llama : fix non-CUDA build
* llama : try to fix build
* llama : move refact in correct place + optimize graph input
* llama : refactor tensor offloading as callback
* llama : add layer index to all tensor names
* llama : add functional header
* llama : comment (ggml-ci)
* llama : remove obsolete map for layer counting
* llama : add llm_build helper functions (ggerganov#3848)
  * llama : add llm_build_norm helper function (ggml-ci)
  * llama : add llm_build_ffn helper function (ggerganov#3849) (ggml-ci)
  * llama : add llm_build_k_shift helper (ggml-ci)
  * llama : fix offloading after recent changes
  * llama : add llm_build_kv_store helper (ggml-ci)
  * llama : remove obsolete offload names
  * llama : fix llm_build_k_shift to use n_head_kv instead of n_head
  * llama : simplify falcon Q, K, V computation
  * llama : remove obsolete comments in build graphs
  * llama : add llm_build_kqv helper (ggml-ci)
  * llama : minor
  * llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
  * llama : fix input allocation logic
  * llama : update offload functions for KQ tensors
  * llama : normalize tensor names (ggml-ci)
  * llama : enable warning about not offloaded tensors
  * llama : remove extra ; + deduplicate gate_b logic
  * llama : add llm_build_inp_embd helper
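The "refactor tensor offloading as callback" commit above moves device-placement decisions out of the individual graph-build functions. The builder no longer decides where each tensor lives; it hands every tensor it creates to a caller-supplied callback that applies the offload policy in one place. A minimal, self-contained sketch of that shape (toy `node` type and hypothetical names, not the real ggml structures):

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy node: stands in for ggml_tensor; "backend" stands in for CPU/GPU placement.
struct node {
    std::string name;
    std::string backend = "CPU";
};

// Hypothetical callback type: invoked once per tensor created during graph
// build, so the offload policy lives in one function instead of being
// scattered through every architecture's build code.
using offload_cb = std::function<void(node &)>;

// The builder records every node it creates and lets the callback place it.
std::vector<node> build_graph(const offload_cb & cb) {
    std::vector<node> graph;
    for (const char * name : {"inp_embd", "attn_norm-0", "ffn_out-0", "result_norm"}) {
        node n{name};
        cb(n); // placement decided by the callback, not by the builder
        graph.push_back(n);
    }
    return graph;
}
```

With this split, a CUDA build can pass a callback that offloads everything except input tensors, while a CPU-only build passes a no-op, without the build functions themselves changing.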
ref #3382

* `result_norm` when embeddings are not requested
* Offload tensor with token positions `inp_pos` as non-repeating (NR)
* Change offload type of `KQ_mask`, `KQ_shift` and `K_shifted` from KQ to NR. Any reason for these to be KQ?

TODO:
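The list above distinguishes per-tensor offload categories such as non-repeating (NR) versus KQ. As a purely illustrative sketch of what such a classification might look like (hypothetical enum and policy table, not llama.cpp's actual offload logic), encoding the proposed change that `KQ_mask`, `KQ_shift` and `K_shifted` become NR:

```cpp
#include <cassert>
#include <string>

// Hypothetical offload categories mirroring the discussion above:
// NR = non-repeating tensors, KQ = tensors tied to the KQ attention buffers.
enum class offload_type { NR, KQ };

// Illustrative policy only: maps a tensor name to its offload category,
// with KQ_mask / KQ_shift / K_shifted moved from KQ to NR as proposed.
offload_type classify(const std::string & name) {
    if (name == "KQ_mask" || name == "KQ_shift" || name == "K_shifted" ||
        name == "inp_pos") {
        return offload_type::NR; // treated as non-repeating
    }
    return offload_type::KQ; // default for KQ-related tensors in this sketch
}
```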