Add chat template support for llama-cli #8068

Merged
11 commits merged on Jun 25, 2024

Conversation

ngxson
Collaborator

@ngxson ngxson commented Jun 22, 2024

This PR brings the same chat template logic from the server over to main (llama-cli).

Goals

  • Keep modifications minimal by reusing the existing llama_chat_apply_template function
  • Support both auto-detected templates and a custom --chat-template argument
  • Avoid introducing a new list to maintain ==> some past PRs tended to add a separate list of prefixes/postfixes, which duplicates llama_chat_apply_template and thus requires extra maintenance
  • Simplify the implementation in both server & main

How it works

  • Newly added C++ wrapper for llama_chat_apply_template that supports std::string ==> simplifies the code
  • Newly added llama_chat_format_single ==> it evaluates the history twice, once with and once without the added message, then returns the diff (see the sketch below)
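
For reference, a minimal sketch of the diff idea (illustrative only; it assumes the std::string wrapper overload of llama_chat_apply_template and a simple llama_chat_msg role/content struct, as described above):

#include <string>
#include <vector>
#include "common.h" // assumed to declare llama_chat_msg and the std::string wrapper

// sketch of llama_chat_format_single: render the history twice and return
// only the text that the newly added message contributes
static std::string chat_format_single_sketch(
        const llama_model * model,
        const std::string & tmpl,
        const std::vector<llama_chat_msg> & past_msgs,
        const llama_chat_msg & new_msg,
        bool add_ass) {
    // history without the new message, no assistant prompt appended
    std::string fmt_past = llama_chat_apply_template(model, tmpl, past_msgs, false);

    // history including the new message (optionally with the assistant prefix)
    std::vector<llama_chat_msg> with_new = past_msgs;
    with_new.push_back(new_msg);
    std::string fmt_new = llama_chat_apply_template(model, tmpl, with_new, add_ass);

    // the diff is exactly the text that must be fed to the model for this turn
    return fmt_new.substr(fmt_past.size());
}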

Demo

make llama-cli && ./llama-cli -m ../Meta-Llama-3-8B-Instruct-abliterated-v3_q4.gguf -p "You are an AI" -cnv
system

You are an AI


> hi
Hello! I'm a language model AI. It's nice to meet you! Is there something I can help you with or would you like to chat?

> what is your name
I'm an AI, so I don't have a personal name in the classical sense. I'm often referred to as "Assistant" or "AI" by users, but I don't have a specific name like a human would. However, I can be addressed as "AI" or "Assistant" if you'd like!

> who made you
I was created by a team of researcher at Meta AI. They are a group of scientists who specialize in natural language processing and machine learning. They trained me on a massive dataset of text from various sources, including books, articles, and websites, to enable me to understand and generate human-like language.

Fixes #8053, #6391

Replaces #6810


@ngxson ngxson requested a review from ggerganov June 22, 2024 18:39
@github-actions github-actions bot added the testing (Everything test related) and examples labels Jun 22, 2024
@ngxson ngxson added the Review Complexity : Low label (Trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. UI fix) Jun 22, 2024
common/common.cpp (outdated review thread, resolved)
common/common.cpp (outdated review thread, resolved)
examples/main/main.cpp (outdated review thread, resolved)
std::string user_inp = params.conversation
    ? chat_add_and_format("user", buffer)
    : buffer;
// TODO: one inconvenience of the current chat template implementation is that we can't distinguish between user input and special tokens (prefix/postfix)
Owner

When params.conversation == false there is an extra string copy here that should be avoided

Regarding the comment: can you illustrate with an example? I'm not sure what the issue is.

Collaborator Author


An example would be a prompt like this: Which one is correct HTML tag? <s> or <a>?

Models that use <s> as their BOS token will see the prompt as: Which one is correct HTML tag? BOS or <a>?

Leaving special == false would fix that, but it would also break the chat template, since the template now adds special tokens around the user's text. This could be avoided with some more code, but IMO it's not a big deal, assuming that special tokens are unlikely to appear accidentally in the text.
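
To make the trade-off concrete, a hedged illustration (a fragment reusing the llama_context * ctx from main.cpp; it assumes the common-library llama_tokenize helper with add_special/parse_special flags, treat the exact names as indicative):

// parse_special = true : a literal "<s>" typed by the user becomes the BOS
//                         token, so the model sees "... BOS or <a>?"
// parse_special = false: the literal text survives, but the special tokens
//                         that the chat template wraps around the message are
//                         no longer recognized as special
std::string formatted = chat_add_and_format("user", "Which one is correct HTML tag? <s> or <a>?");
std::vector<llama_token> parsed  = llama_tokenize(ctx, formatted, /*add_special=*/false, /*parse_special=*/true);
std::vector<llama_token> literal = llama_tokenize(ctx, formatted, /*add_special=*/false, /*parse_special=*/false);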

Collaborator Author


I added a std::move(buffer) since we no longer use buffer after this line. Is it OK to do so?
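
For context, the resulting line would look roughly like this (a sketch of the change, not the exact patch):

// buffer is not used again after this point, so moving it avoids the extra
// string copy in the non-conversation path flagged above
std::string user_inp = params.conversation
    ? chat_add_and_format("user", std::move(buffer))
    : std::move(buffer);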

Owner

Aha, got it. Yes, for now let's go with the simple solution

examples/main/main.cpp (outdated review thread, resolved)
Co-authored-by: Georgi Gerganov <[email protected]>
@ngxson ngxson added the merge ready label (indicates that this may be ready to merge soon and is just holding out in case of objections) Jun 25, 2024
@mofosyne mofosyne merged commit 48e6b92 into ggerganov:master Jun 25, 2024
63 checks passed
@fairydreaming
Collaborator

It looks like this broke some models; here is the llama-cli output and a brief gdb inspection for DeepSeek-V2-Lite:

./llama-cli --numa distribute -s 42 -t 32 --temp 0.01 -m /mnt/md0/models/deepseek-v2-lite-chat-2.gguf -f ../prompt-deepseek.txt
Log start
main: build = 3248 (f675b20a)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 42
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance
llama_model_loader: loaded meta data with 38 key-value pairs and 377 tensors from /mnt/md0/models/deepseek-v2-lite-chat-2.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.name str              = be9443d5eec410d7045ba7dcbe2e0f189f5dda9e
llama_model_loader: - kv   2:                      deepseek2.block_count u32              = 27
llama_model_loader: - kv   3:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   4:                 deepseek2.embedding_length u32              = 2048
llama_model_loader: - kv   5:              deepseek2.feed_forward_length u32              = 10944
llama_model_loader: - kv   6:             deepseek2.attention.head_count u32              = 16
llama_model_loader: - kv   7:          deepseek2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   9: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  11:                          general.file_type u32              = 1
llama_model_loader: - kv  12:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  13:                       deepseek2.vocab_size u32              = 102400
llama_model_loader: - kv  14:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  15:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  16:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  17:       deepseek2.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  18:                     deepseek2.expert_count u32              = 64
llama_model_loader: - kv  19:              deepseek2.expert_shared_count u32              = 2
llama_model_loader: - kv  20:             deepseek2.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  21:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  22:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  23:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  24: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  25: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = deepseek-llm
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 100000
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 100001
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 100001
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  108 tensors
llama_model_loader: - type  f16:  269 tensors
llm_load_vocab: special tokens cache size = 2400
llm_load_vocab: token to piece cache size = 0.6659 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 102400
llm_load_print_meta: n_merges         = 99757
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 27
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 10944
llm_load_print_meta: n_expert         = 64
llm_load_print_meta: n_expert_used    = 6
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 16B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 15.71 B
llm_load_print_meta: model size       = 29.26 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = be9443d5eec410d7045ba7dcbe2e0f189f5dda9e
llm_load_print_meta: BOS token        = 100000 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 126 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead   = 1
llm_load_print_meta: n_lora_q             = 0
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 1408
llm_load_print_meta: n_expert_shared      = 2
llm_load_print_meta: expert_weights_scale = 1.0
llm_load_print_meta: rope_yarn_log_mul    = 0.0707
llm_load_tensors: ggml ctx size =    0.16 MiB
llm_load_tensors:        CPU buffer size = 29964.48 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 163840
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:        CPU KV buffer size = 43200.00 MiB
llama_new_context_with_model: KV self size  = 43200.00 MiB, K (f16): 25920.00 MiB, V (f16): 17280.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.39 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 5464.01 MiB
llama_new_context_with_model:        CPU compute buffer size =  5464.01 MiB
llama_new_context_with_model: graph nodes  = 1924
llama_new_context_with_model: graph splits = 1
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Aborted (core dumped)
Thread 1 "llama-cli" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737347880896) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737347880896) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737347880896) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737347880896, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7a4f476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7a357f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7e29b9e in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7e3520c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7e35277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7e354d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff7e2c449 in std::__throw_length_error(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00005555556f4867 in std::vector<char, std::allocator<char> >::_M_check_len (this=0x7fffffffb910, __n=18446744073709551522, 
    __s=0x555555899c33 "vector::_M_default_append") at /usr/include/c++/11/bits/stl_vector.h:1759
#11 0x00005555556da3cb in std::vector<char, std::allocator<char> >::_M_default_append (this=0x7fffffffb910, __n=18446744073709551522)
    at /usr/include/c++/11/bits/vector.tcc:634
#12 0x00005555556c588d in std::vector<char, std::allocator<char> >::resize (this=0x7fffffffb910, __new_size=18446744073709551615)
    at /usr/include/c++/11/bits/stl_vector.h:940
#13 0x00005555557c154a in llama_chat_apply_template (model=0x555555a8acf0, tmpl="", msgs=std::vector of length 4, capacity 4 = {...}, 
    add_ass=true) at common/common.cpp:2635
#14 0x00005555557c1b51 in llama_chat_format_example (model=0x555555a8acf0, tmpl="") at common/common.cpp:2664
#15 0x000055555586de70 in main (argc=13, argv=0x7fffffffe0e8) at examples/main/main.cpp:227
(gdb) up
#1  __pthread_kill_internal (signo=6, threadid=140737347880896) at ./nptl/pthread_kill.c:78
78	in ./nptl/pthread_kill.c
(gdb) 
#2  __GI___pthread_kill (threadid=140737347880896, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
89	in ./nptl/pthread_kill.c
(gdb) 
#3  0x00007ffff7a4f476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
26	../sysdeps/posix/raise.c: No such file or directory.
(gdb) 
#4  0x00007ffff7a357f3 in __GI_abort () at ./stdlib/abort.c:79
79	./stdlib/abort.c: No such file or directory.
(gdb) 
#5  0x00007ffff7e29b9e in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) 
#6  0x00007ffff7e3520c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) 
#7  0x00007ffff7e35277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) 
#8  0x00007ffff7e354d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) 
#9  0x00007ffff7e2c449 in std::__throw_length_error(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) 
#10 0x00005555556f4867 in std::vector<char, std::allocator<char> >::_M_check_len (this=0x7fffffffb910, __n=18446744073709551522, 
    __s=0x555555899c33 "vector::_M_default_append") at /usr/include/c++/11/bits/stl_vector.h:1759
1759		  __throw_length_error(__N(__s));
(gdb) 
#11 0x00005555556da3cb in std::vector<char, std::allocator<char> >::_M_default_append (this=0x7fffffffb910, __n=18446744073709551522)
    at /usr/include/c++/11/bits/vector.tcc:634
634			_M_check_len(__n, "vector::_M_default_append");
(gdb) 
#12 0x00005555556c588d in std::vector<char, std::allocator<char> >::resize (this=0x7fffffffb910, __new_size=18446744073709551615)
    at /usr/include/c++/11/bits/stl_vector.h:940
940		  _M_default_append(__new_size - size());
(gdb) 
#13 0x00005555557c154a in llama_chat_apply_template (model=0x555555a8acf0, tmpl="", msgs=std::vector of length 4, capacity 4 = {...}, 
    add_ass=true) at common/common.cpp:2635
2635	        buf.resize(res);
(gdb) 
#14 0x00005555557c1b51 in llama_chat_format_example (model=0x555555a8acf0, tmpl="") at common/common.cpp:2664
2664	    return llama_chat_apply_template(model, tmpl, msgs, true);
(gdb) print tmpl
$1 = ""
(gdb) down
#13 0x00005555557c154a in llama_chat_apply_template (model=0x555555a8acf0, tmpl="", msgs=std::vector of length 4, capacity 4 = {...}, 
    add_ass=true) at common/common.cpp:2635
2635	        buf.resize(res);
(gdb) print res
$2 = -1
(gdb)

@ngxson
Collaborator Author

ngxson commented Jun 27, 2024

@fairydreaming The default behavior should be "if the built-in template is not supported, we use chatml as the fallback"

Turns out that's not the case here (I missed something). I'll need to push a fix for this.
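
For illustration, a hypothetical helper sketching the intended fallback (not the actual patch); it relies on the C API llama_chat_apply_template returning a negative value when the built-in template is not supported, and the required byte count when the buffer is too small:

#include <string>
#include <vector>

#include "llama.h"

static std::string apply_template_or_chatml(
        const llama_model * model,
        const std::vector<llama_chat_message> & chat,
        bool add_ass) {
    std::vector<char> buf(4096);
    const char * tmpl = nullptr; // nullptr = use the template embedded in the GGUF
    int32_t res = llama_chat_apply_template(model, tmpl, chat.data(), chat.size(),
                                            add_ass, buf.data(), (int32_t) buf.size());
    if (res < 0) {
        tmpl = "chatml"; // unsupported built-in template -> fall back to chatml
        res  = llama_chat_apply_template(model, tmpl, chat.data(), chat.size(),
                                         add_ass, buf.data(), (int32_t) buf.size());
    }
    if (res > (int32_t) buf.size()) {
        buf.resize(res); // output was truncated, re-apply with a big enough buffer
        res = llama_chat_apply_template(model, tmpl, chat.data(), chat.size(),
                                        add_ass, buf.data(), (int32_t) buf.size());
    }
    return std::string(buf.data(), res);
}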

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jun 30, 2024
* add chat template support for llama-cli

* add help message

* server: simplify format_chat

* more consistent naming

* improve

* add llama_chat_format_example

* fix server

* code style

* code style

* Update examples/main/main.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
MagnusS0 pushed a commit to MagnusS0/llama.cpp-normistral-tokenizer that referenced this pull request Jul 1, 2024
ggerganov added a commit that referenced this pull request Jul 25, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
mishig25 pushed a commit to huggingface/huggingface.js that referenced this pull request Aug 7, 2024
In this PR, I propose some changes:
- Update binary name to `llama-cli` (for more details, see this PR:
ggerganov/llama.cpp#7809 and this [homebrew
formula](https://github.com/Homebrew/homebrew-core/blob/03cf5d39d8bf27dfabfc90d62c9a3fe19205dc2a/Formula/l/llama.cpp.rb))
- Add method to download llama.cpp via pre-built release
- Split snippet into 3 sections: `title`, `setup` and `command`
- Use `--conversation` mode to start llama.cpp in chat mode (chat
template is now supported, ref:
ggerganov/llama.cpp#8068)

---

Proposal for the UI:

(Note: Maybe the 3 sections title - setup - command can be more
separated visually)


![image](https://github.com/huggingface/huggingface.js/assets/7702203/2bd302f0-88b1-4057-9cd3-3cf4536aae50)
Labels
examples
merge ready - indicates that this may be ready to merge soon and is just holding out in case of objections
Review Complexity : Low - Trivial changes to code that most beginner devs (or those who want a break) can tackle, e.g. UI fix
server
testing - Everything test related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: --chat-template seems to be broken now, no way to truly chat from the llama-cli
4 participants