Implement (properly) different chat templates in main.cpp #6391

ngxson · 2024-03-29T23:07:35Z

Motivation

From the day I added llama_chat_apply_template #5538 , I already started thinking about adding it into main.cpp for replacing the current -cml option. However, it is not as easy as it seems. The main reason is because main.cpp still rely on antiprompt and static prefix/postfix/infix to work with "chat".

The whole reason why antiprompt exist in the first place was because in the early era of LLMs:

We don't have a good way to differentiate different roles (user - assistant)
We don't have a good way to know when to stop the generation

However, a lot of things changed since then: we're now having the notion of "chat template", newer models have special tokens like <|user|> to replace Human:, most models are fine-tuned to stop generation by outputting EOS token...

For that reason, using antiprompt and static prefix/postfix/infix is no longer a viable option to add chat template into main.cpp. That force us to be a bit more creative.

Possible Implementation

The prefix/postfix can be changed dynamically based on message role. For example:
Chatml uses "<|im_start|>" + role + "\n" as prefix (role is dynamic based on current message); <|im_end|>\n is the postfix.
This idea is being implemented in Refactor chat template API #6822
Use llama_token_is_eog() to replace antiprompt. Additionally, for compatibility reason, we can translate EOG token to antiprompt, because some models output the antiprompt as a sequence of multiple tokens (newer models never do this).

Old proposal (outdated)

Possible Implementation

My idea is trying to use llama_chat_apply_template in main.cpp. This will effectively deprecate antiprompt, prompt prefix/postfix and cml options.

Format the chat on-the-go

For now, llama_chat_apply_template produce very "additive" result when a new message is added to the list.

An additive means for example if I have [msg1, msg2], then I get formatted chat msg1_msg2. When I add msg3 to the list, it must add the formatted msg3 to the end of the formatted chat, without touching the existing content, it results in msg1_msg2_msg3 in this example. A wrong result maybe msg1+++msg2_msg3

This is very important. Unlike server.cpp where we clear the KV cache and re-format a new prompt each time, main.cpp add new tokens on top of existing ones, then continues the generation until a condition is met (maybe EOS token or a stop sequence).

So, to use llama_chat_apply_template in main.cpp, a test case must be added to test all chat templates to make sure they are all "additive".

main.cpp can then keep track of a list of messages, re-apply chat template each time and take only the "added" part.

Example:

Messages: [msg1, msg2] ==> Formatted: <user>msg1<assistant>msg2
Messages: [msg1, msg2, msg3] ==> Formatted: <user>msg1<assistant>msg2<user>msg3 ==> Part to evaluate: <user>msg3

Manage stop sequences

llama_chat_apply_template should returns the stop sequence along side with the formatted chat template ==> this is what we need to add.

The text was updated successfully, but these errors were encountered:

ngxson · 2024-03-29T23:09:32Z

Superseded #6378

MoonRide303 · 2024-04-01T10:23:58Z

I am still learning basics of llama.cpp, so maybe I am missing something, but shouldn't there also be separate parameter for main to inject system message in proper place, while using given chat template? Basically something like SYSTEM in Ollama's Modelfile.

teleprint-me · 2024-04-03T17:24:58Z

There are 2 models. The tokenizer and then the language model. The tokenizer manages how the input is managed and split into "tokens", I say "tokens" because they can be anything and represent anything.

The language model utilizes the tokenizer and is "conditioned" on sequencing of the input. So you could simply "inject" a "system prompt", but it wouldn't make a difference unless the language model was "conditioned" with this in mind.

You can do some really interesting things with the "initial prompt" which is typically more evident in a "completion" than a "chat completion". This is really pronounced with the Mistral models because none of them technically have a "system prompt".

hnfong · 2024-04-21T14:47:10Z

It seems to me that the chatml formatting code in main.cpp is mostly good, sans the anti-prompt issue. Instead of trying to reuse llama_chat_apply_template is it possible to just populate the values inp_pfx/inp_sfx/cml_pfx/cml_sfx depending on the template (and you can put the token strings in llama.cpp or even make a template.cpp to hold them all)? The change I proposed above seems simpler than trying to force main.cpp move to a different model where it holds a list of messages, which runs into the problems you mention since its a fundamentally mismatch of what main.cpp is really doing.

I imagine the change is relatively straightforward and I probably can make a patch, but if it's not the direction the project wants to go, I'll probably just hack up something myself so that I can chat with llama3 in peace :D

ngxson · 2024-04-21T15:01:10Z

@hnfong The reason why I propose to reuse llama_chat_apply_template is because I don't want to have 2 different APIs to manage chat templates. If we want to have prefix/postfix system, it's better to refactor the whole chat template API into something easier to maintain. Furthermore, llama_chat_apply_template is designed so that developers who use llama.cpp as a 3rd party library in their project can easily format the chat. So, it's more reasonable to also demonstrate it in main (the same way we're using it on server)

Btw, the main idea here is to rethink about this subject more seriously before actually implement it - this is to prevent introducing patches, which makes it harder for future contributors and maintainers to work on the project.

hnfong · 2024-04-21T16:07:44Z

We might be converging to the same idea, but anyway if you want the code to be reuseable I suppose you can refactor the llama_chat_apply_template to be able to take an existing context/list/thing/whatever and ask it to add one more message? Then the original llama_chat_apply_template can just call this function repeatedly?

FWIW, I'm happy for whatever solution that gets implemented, the existing templating code in main.cpp is really unnecessarily rigid and cumbersome and needs a makeover. (I got my quick patch working for llama3 though, so hopefully that will last [i.e. no conflicts] until this issue is resolved :)

ngxson · 2024-04-21T16:24:26Z

My idea is simply call llama_chat_apply_template twice: with and without the last user message. Then, I can find the diff between 2 output strings and feed it into inference. This approach will require minimal effort to maintain the chat template infrastructure, while using the extract same logic for main and server (remind: server also have the notion of "prompt cache" which works the same way)

hanishkvc · 2024-04-21T16:29:16Z

It seems to me that the chatml formatting code in main.cpp is mostly good, sans the anti-prompt issue. Instead of trying to reuse llama_chat_apply_template is it possible to just populate the values inp_pfx/inp_sfx/cml_pfx/cml_sfx depending on the template (and you can put the token strings in llama.cpp or even make a template.cpp to hold them all)? The change I proposed above seems simpler than trying to force main.cpp move to a different model where it holds a list of messages, which runs into the problems you mention since its a fundamentally mismatch of what main.cpp is really doing.

I imagine the change is relatively straightforward and I probably can make a patch, but if it's not the direction the project wants to go, I'll probably just hack up something myself so that I can chat with llama3 in peace :D

Have a look at #6795, it allows one to chat using the existing chat template mechanism from main. In that PR, if you look at the comments, I have attached a simple patch for adding llama3 template support and with that you can talk to llama3 using main.

ngxson · 2024-04-21T16:35:35Z

#6795 should work with all templates except for templates that does not have support for system prompt (or llama2 with <<SYS>> for system message). That's why in #6810 I propose to evaluate to format the whole chat history instead of one message at a time.

hanishkvc · 2024-04-21T16:39:36Z

My idea is simply call llama_chat_apply_template twice: with and without the last user message. Then, I can find the diff between 2 output strings and feed it into inference. This approach will require minimal effort to maintain the chat template infrastructure, while using the extract same logic for main and server (remind: server also have the notion of "prompt cache" which works the same way)

Not sure why you want to do that, look at that patch I submitted, it uses the existing interactive/chatml flow of main and modifies it so that the existing chat template logic gets called only once per message, as and when a new message is entered by the user. And I have tested that it works properly with llama2 as well as llama3 (with the template patch I have attached as a comment there).

There is one issue, with my patch, which I will look at fixing later today or tomorrow, ie ensuring that the user message doesnt get tokenized wrt special tokens. Given how llama.cpp's chat apply template logic is implemented, both the special tokens (to help identify role to chat-model) and the user message get merged into a single string, which is returned, so that is one thing, I require to handle by extracting the surrounding introduced by chat-apply-template and handling its tokenization seperate from the actual user message

hanishkvc · 2024-04-21T16:43:52Z

#6795 should work with all templates except for templates that does not have support for system prompt (or llama2 with <<SYS>> for system message). That's why in #6810 I propose to evaluate to format the whole chat history instead of one message at a time.

System prompt is a one time message to handle at the begining and for any template standard which doesnt have system role handled in chat-apply-template, it is better to update chat-apply-template to add system role wrapping, rather than trying to look through messages multiple times. Also even in your case, you still will have to handle the missing system prompt wrapping issue. I feel it is better to add the missing system role wrapping in the chat-apply-template directly, so that it can be used by anyone be it server or main or any other logic that may use.

ngxson · 2024-04-21T16:46:00Z

chat template logic gets called only once per message

Take this case for example: We have 2 messages:

system: You are a helpful assistant
user: Hi, who are you?

If you apply chat template one-by-one:

[INST] <<SYS>>\nYou are a helpful assistant\n<</SYS>>\n\n
[INST]Hi, who are you?[/INST]

That produces:

[INST] <<SYS>>\nYou are a helpful assistant\n<</SYS>>\n\n[INST]Hi, who are you?[/INST]

Which is incorrect. The correct formatted chat is:

[INST] <<SYS>>\nYou are a helpful assistant\n<</SYS>>\n\nHi, who are you?[/INST]

Another case that it will fails is llama2 with BOS in-history. The first message in the prompt never has <s>, only subsequent messages have. So, if you apply template for single message, it will never have BOS (that's the case of add_bos_inside_history inside llama_chat_apply_template_internal)

ngxson · 2024-04-21T16:53:56Z

it is better to update chat-apply-template to add system role wrapping, rather than trying to look through messages multiple times

Sorry I don't really get this idea, can you please take an example?

Anyway, I still don't think having to format all messages every time is that bad. If we think in the perspective of server example, it is a normal thing since server is designed to be stateless (it evaluates new chat history for each request).

hanishkvc · 2024-04-21T17:41:29Z

I am more of a casual recent enquirer (not a regular user) of LLMs, so I havent really looked deep into them or for that matter even llama.cpp. But my understanding with minimal experimenation is

In chat mode, one would use the prompt options ie -p and or -f (ie wrt main program) to send a system prompt (which will concatanate -p + -f contents) once and later keep handshaking the individual query/chat messages to the model and the responses got. And in this case the chat-apply-template would get called once for the system-prompt and once for each of the user query/message.

In the above case, I am assuming that all contents of the system-prompt (-p + -f related contents) I mentioned above should be encoded as a single system role message to the model. And let the model differentiate between any system role and sample user/assistant role messages that may have been added towards the end in the system-prompt-file (-f). IE the chat program (main in this case) doesnt bother differentiating between these system and user + assistant role messages in the system prompt.

However if the convention used universally is that any sample user+assistant role messages included in the system prompt need to be identified and tagged seperately from system role message in the system prompt, by the chat program before sending to the model, then

Possiblity-1

one way to handle that will be to include a flag which gets passed to chat-apply-template, which tells it that it is handling a bunch of messages which belongs to system-prompt (ie which will include 1 system role message and optionally user and assistant role messages in it), and in this case the chat-apply-template depending on which handshake-template-standard it is handling decide whether to introduce or not the INST/equivalents in the middle. Again I am assuming that different handshake standards may be handling this differently, so it is for each block within the chat-apply-template to decide how it will handle its case. But either way be it the system prompt or any subsequent user messages, it needs to go through chat-apply-template only once.

Possibility-2

However if all models today have standardized on the handshake high level contours, then the better thing may be to have a bunch of helpers like

chat-handshake-system-prefix, chat-hs-system-suffix
chat-hs-user-prefix, chat-hs-user-suffix,
chat-hs-assistant-prefix, chat-hs-assistant-suffix,
chat-hs-begin-marker, chat-hs-end-marker

which provide the tags/tokens for identifying and demarcating the corresponding role messages, as well as a group of messages (which could even be only a single message).

and inturn have the chat-apply-template built on top of these individual helpers, which inturn depending on whether it is handling

system-prompt with multiple messages in it ie 1 system-role-message and 0 or multiple user+assistant-role-messages or
a single user message

chain the above helpers in a standard way.

In possibility-1, each chat-template-standard block needs to handle things as needed in its block, while in possiblity-2 the individual helpers dont worry about the high level nittygritty, instead chat-apply-template handles high level framing as needed, decoupled from individual-template-standard helpers.

NOTE: However with the limitted testing I have done with llama2 and llama3, and simple system prompts with sample user+assistant role messages included to end of system prompt (the prompts/chat-with-bob.txt or so), the dumb flow which my PR does ie treat the full system-prompt as a single system-role message and inturn the subsequent individual user queries as individual user-role messages, seems to be working as needed.

NOT sure whether I fully understood what you are trying to tell especially given the very limited exploring of chat models that I have done more as a end user till now. But one of the two possibilities which I mentioned is what one needs to take potentially, depending on whether there is a universal standard followed by most models out there or not.

Rather either way, the multitude of helpers based possibility-2 strategy, may be the most flexible and easily reusable for different scenarios one. It can even be adapted for the scenario described for possibility-1.

Sorry for the long response.

hanishkvc · 2024-04-21T17:54:22Z

Now if you are talking about maintaining the context of all the interactions starting from system prompt to all the subsequent user + assistant dialogs/handshakes, wont each of them still be like independent handshake messages, with their own begin-end markers or so inturn concatenated together, or am I missing something fundamental here. IE are the models handshake interface / input-layer so brain dead that they require one to remove inbetween begin-end markers, in this case, as you seem to be indicating or ...

Again as I mentioned I havent really looked much into llms (given that they are still more of a fuzzy+drunk (limited attention inspite of multi-head-attention, position encoding, etal) lossy compressed data pattern logics) and their handshakes, I was trying to use ollama to chat with this new llama3 and given my 8GB machine, the ollama was triggering metal-compute failures, because of some additional memory overhead it seems to be introducing (which if I run llama.cpp directly doesnt seem to create), as well as either way as llama.cpp didnt have llama3 chat template handling the system prompt (which I think ollama uses llama.cpp code to tag wrt handshake template) it was giving a warning telling it doesnt understand the chat template for handing system-prompt message, that is how I end up looking at llama.cpp and inturn added a simple llama3 template in chat-apply-template and inturn modified main to have a simple chat flow using chat-apply-template.

hanishkvc · 2024-04-21T18:47:48Z

Now if you are talking about maintaining the context of all the interactions starting from system prompt to all the subsequent user + assistant dialogs/handshakes, wont each of them still be like independent handshake messages, with their own begin-end markers or so inturn concatenated together, or am I missing something fundamental here. IE are the models handshake interface / input-layer so brain dead that they require one to remove inbetween begin-end markers, in this case, as you seem to be indicating or ...

Again as I mentioned I havent really looked much into llms (given that they are still more of a fuzzy+drunk (limited attention inspite of multi-head-attention, position encoding, etal) lossy compressed data pattern logics) and their handshakes, I was trying to use ollama to chat with this new llama3 and given my 8GB machine, the ollama was triggering metal-compute failures, because of some additional memory overhead it seems to be introducing (which if I run llama.cpp directly doesnt seem to create), as well as either way as llama.cpp didnt have llama3 chat template handling the system prompt (which I think ollama uses llama.cpp code to tag wrt handshake template) it was giving a warning telling it doesnt understand the chat template for handing system-prompt message, that is how I end up looking at llama.cpp and inturn added a simple llama3 template in chat-apply-template and inturn modified main to have a simple chat flow using chat-apply-template.

Also with my PR, the changes which I have done to augument -i, I dont notice it triggering chat-apply-template more than once per user message, and things seem to be fine atleast wrt llama2 and llama3 (with my simple patch for same).

However I havent looked at the server program's flow much. Where If I get correctly, you seem to indicate, that you call the chat-apply-template for all the messages again and again, each time you get a request from the web-browser side client logic (I assume because it sends the raw messages always, as it doesnt get the formatted messages or tokens back (I am assuming))? And here inturn is where I assume you are pondering about the inbetween begin-end markers or some aspect of it.

However as I didnt notice any problem with my simple main-chat flow (unless some existing code in main already takes care of filtering out inbetween begin-end markers or so, I havent really looked too much into main beyond my immidiate need), doesnt, having the begin-end marker wrt each role message seperately, seem to create any problem in a true sense. Or put in a different way, if things are kept simple and if you where to call chat-apply-template seperately for each of the message in the bunch sent from client side, inturn leading to seperate begin-end markers or so wrt each of the role's message immidietly around it, does it create any issue?

hanishkvc · 2024-04-21T19:04:39Z

Also advantage of possibility-2 based kind of flow instead of a all in chat-apply-template, would be that, if there is text in the user's message which mimicks the underlying model's tags for any role or so, ie if it corresponds to a special token; the flow can be structured such that it wont be treated as a special token, when calling tokenizer.

Because one can call chat-hs-user-prefix() and tokenize it with parse_special, call tokenize directly on user message without parse_special flag set, and call chat-hs-user-suffix() and tokenize it with parse_special set. And inturn concatenate the 3 tokenized vectors. Which is what the default interactive mode does, while my PR (chaton) doesnt currently.

ie the way chat-apply-template is setup currently, where tokenization doesnt occur internally to it but instead where one passes the message, gets the tagged message back and then tokenizes it, there could be this issue of user message bringing in special tokens to bear fruit from their text content.

Again I dont know the convention wrt this aspect and current chat programs out there, so mentioning here just in case.

Rather for something like the system-prompt initial file (ie -f arg), treating special token related tags in them has special keeps things simple and also potentially adds flexibility to end user to trigger things the way they want. But for individual user messages later, when chatting, is where, whether special token related tags should be treated has special tokens or not may come into picture

hanishkvc · 2024-04-21T23:58:08Z

Small correction wrt possiblity-2 should be chat-hs-role-prefix and chat-hs-role-suffix, so that new roles can be added in future, using the same calls and infrastructure. Not sure why I suggested seperate chat-hs-user|system|assitant-prefix|suffix before, I was half asleep rather.

Also if chat-apply-template additionaly returns a list of [[startpos1,endpos1], [startpos2,endpos2].....[startposN,endposN]], correpsonding to the user entered messages part, in the returned formatted/tagged string, then chat-apply-template's returned string can be used to ensure that when tokenising, these part corresponding to user messages are tokenized without parse_special

ngxson · 2024-04-22T02:01:15Z

As I understand, your Possibility-2 uses prefix/postfix system which is discussed in #5922 . The idea lead to nowhere so I though that using postfix/prefix was a bad idea.

After a second thought (and also which what you suggested), my idea would be to firstly refactor the chat template system to support the postfix/prefix. This will save a lot of efforts in the future, since chat templates nowadays is just postfix/prefix based (with the exception of llama2 - we can keep using code logic for it)

However I havent looked at the server program's flow much. Where If I get correctly, you seem to indicate, that you call the chat-apply-template for all the messages again and again, each time you get a request from the web-browser side client logic

The server works just like OpenAI API (or any other LLM API out there). A full list of messages will be sent with each request.

Now if you are talking about maintaining the context of all the interactions starting from system prompt to all the subsequent user + assistant dialogs/handshakes, wont each of them still be like independent handshake messages, with their own begin-end markers or so inturn concatenated together,

The jinja template is the main reason why I never consider each message to be independent, since sometimes there're conditions inside the template (i.e. only add something if last message has a specific value,...).

While each message is not independent one another, each pair of message should (the handshake that you mentioned), so a more efficient approach would be to evaluate a combination of (system) and (system+user) each time.

ngxson · 2024-04-22T04:47:42Z

After a second thought (and also which what you suggested), my idea would be to firstly refactor the chat template system to support the postfix/prefix. This will save a lot of efforts in the future, since chat templates nowadays is just postfix/prefix based (with the exception of llama2 - we can keep using code logic for it)

I tried implementing this idea (takes me almost 2 hours but still nothing works). Turns out the idea of breaking chat template API into prefix/postfix is even worse and make the code unreadable. Here is the branch that I'm working on, but I gave up: master...ngxson:llama.cpp:xsn/chat_template_prefix_postfix

See comments below

teleprint-me · 2024-04-22T04:50:03Z

You left a typo on the end of the url, think you meant to share master...ngxson:llama.cpp:xsn/chat_template_prefix_postfix instead of the ?=expand=1.

teleprint-me · 2024-04-22T04:55:57Z

It's actually not that bad.

I'm looking at this one because it's easier to read than looking at the diff.

https://github.com/ngxson/llama.cpp/blob/xsn/chat_template_prefix_postfix/llama.cpp#L17077

I think it's a good start.

Why not use a switch?

https://github.com/ngxson/llama.cpp/blob/xsn/chat_template_prefix_postfix/llama.cpp#L17135

It wouldn't clean it up?

Makes me think of that old joke, "AI is nothing but a bunch of 'if' statements" 😅.

ngxson · 2024-04-22T05:00:45Z

Why not use a switch?

You cannot use switch if inside the if statement you do some logics (for example str_contains). In other words, switch will be compiled into a jump table, but my if statements cannot be compiled into jump table.

Also, switch with case "string" cannot be used in c or cpp, since it will compare the pointer of these strings instead of the string content. It's only possible on higher-level languages like javascript

teleprint-me · 2024-04-22T05:05:09Z

You cannot use switch if inside the if statement you do some logics (for example str_contains). In other words, switch will be compiled into a jump table, but my if statements cannot be compiled into jump table.

Why is this an issue? Strings are pointers to allocated memory addresses. You would be jumping from address to the other. Or does C++ suffer from the same issue as C in a switch where it would need to be a integer or a character? If that's the case, could use a mapping or enum? Implement the logic within each case?

Just some rough ideas and initial thoughts.

ngxson · 2024-04-22T05:08:31Z

Or does C++ suffer from the same issue as C in a switch where it would need to be a integer or a character

Yes, see: https://stackoverflow.com/questions/650162/why-cant-the-switch-statement-be-applied-to-strings

could use a mapping or enum

That's why the function that you're looking on returns an enum. The functions (get postfix/prefix) using that enum have switch statement

teleprint-me · 2024-04-22T05:30:25Z

I would've reversed the implementation. I'd need to actually dig into this a bit more. I've been studying the code base in a small snippets once a week.

I guess to clarify what I originally meant by the mapping is, "Why not map the enums to strings in an array, kind of like what we did for the models?".

e.g.

    enum llama_chat_template {
        LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED  = 0,
        LLAMA_CHAT_TEMPLATE_CHATML         = 1, // Example: teknium/OpenHermes-2.5-Mistral-7B
        LLAMA_CHAT_TEMPLATE_LLAMA2         = 2, // Original llama2 template (no <<SYS>> support)
        // etc...
    }

    static const std::map<llama_chat_template, const char *> llama_chat_template_id {  // I literally came up with id on the fly... grain of salt.
        { LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED,   "unsupported"      },
        { LLAMA_CHAT_TEMPLATE_CHATML,          "chatml"     },
        { LLAMA_CHAT_TEMPLATE_LLAMA2,          "llama2"     },
        // etc...
    }

This could potentially allow you to do something similar to what you did in llama_chat_get_prefix. Ideally allowing you do something similar in the llama_chat_get_template_type function.

I'm not sure why you're using const char * tmpl as a parameter though, so I would need to look at this with fresh eyes. My brain is full of conversion logic at the moment 😅, but I'll do my best.

    // detect template type
    llama_chat_template ttmpl = llama_chat_get_template_type(curr_tmpl.c_str());
    if (ttmpl == LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED) {
        return -1;
    }

I see you're passing the template string as an argument here to set ttmpl within the llama_chat_get_template_type function. This is probably where the map could be used, kind of like a text cipher if you think about it.

Use the enum to get the string which is the id of template. Currently, you have something akin to a reverse/inverse mapping. I'm exhausted right now, but yeah, that's my general line of thought at the moment.

What do you think?

teleprint-me · 2024-04-22T05:49:35Z

Yeah, I would invert it.

    enum llama_chat_template {
        LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED  = 0,
        LLAMA_CHAT_TEMPLATE_CHATML         = 1, // Example: teknium/OpenHermes-2.5-Mistral-7B
        LLAMA_CHAT_TEMPLATE_LLAMA2         = 2, // Original llama2 template (no <<SYS>> support)
        // etc...
    }

    static const std::map<const char *, llama_chat_template> llama_chat_template_id {  // I literally came up with id on the fly... grain of salt.
        { "unsupported", LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED  },
        { "chatml",      LLAMA_CHAT_TEMPLATE_CHATML         },
        { "llama2",      LLAMA_CHAT_TEMPLATE_LLAMA2         },
        // etc...
    }

Path of least resistance.

ngxson · 2024-04-22T05:51:05Z

llama.h is C-style header file, so std::map cannot be placed there. We can place it in llama.cpp if needed.

But anw it is code style and not the real implementation, we will see in later stage. For now, what's more interesting is to make it work.

teleprint-me · 2024-04-22T06:14:01Z

Hm. Could make it an array of 2 element arrays. In other words, an array of pointers to their respective strings.

Each index is mapped accordingly. Same idea. Not as clean or readable as using a map directly, though. Might be better than the current branch implementation.

// Define your enum
enum llama_chat_template {
    LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED  = 0,
    LLAMA_CHAT_TEMPLATE_CHATML         = 1,
    LLAMA_CHAT_TEMPLATE_LLAMA2         = 2,
    // etc...
};

// Define an array of character pointers for the strings
static const char* llama_chat_template_id[NUM_TEMPLATES] = {
    "unsupported",
    "chatml",
    "llama2",
    // etc...
};

Instead of using a data structure to store the mapping between enums and strings, you can directly use an array of character pointers for the strings and map the index of the enum to the corresponding string in the array. This approach is more memory-efficient as it avoids the overhead of storing extra metadata for each mapping.

Here's a brief example:

// Use the enum to map the index to the string
const char* template_name = llama_chat_template_id[LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED]; // "unsupported"

This approach is simpler and more memory-efficient since it directly maps the enum to its corresponding string using the array index. However, it may be less flexible and less readable compared to using a data structure, as you cannot easily modify or extend the mapping without changing the code.

struct llama_chat_template_map {
    char* id;
    enum llama_chat_template template_enum;
}

id could then point to the appropriate string. Here's an example of how to use the struct:

#include <cstring>

// Define your struct
struct llama_chat_template_map {
    char id[32];
    enum llama_chat_template template_enum;
};

// Initialize an array of structures
const size_t num_templates = 3;
static const llama_chat_template_map templates = {
    {"unsupported", LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED},
    {"chatml",      LLAMA_CHAT_TEMPLATE_CHATML},
    {"llama2",      LLAMA_CHAT_TEMPLATE_LLAMA2},
    // etc...
};

// Use the struct in your code
const llama_chat_template_map& curr_tmpl = templates[0];
const char* template_name = curr_tmpl.id;
llama_chat_template ttmpl = curr_tmpl.template_enum;

This approach makes the code more readable as the enum and its corresponding string are directly related within the same structure.

Credit to Mistral-7B-v0.2 for assisting me with this. That's the best I got at the moment. I'm out of gas.

hanishkvc · 2024-04-22T20:58:16Z

I think we should see if we can have a generic chat-apply-template logic, independent of the different handshake-template standards, which inturn is driven by a json file, which provides the required role-prefxes and role-suffixes as well as flags for when system+user message combination is there and whether bos+user-role-tag-prefix should be included inbetween the system and 1st user message or not (a possible flow implemented in the PR) and any others like that...

Below is the PR for a initial go at the same thought

#6834

Idea is to reuse the generic flow as much as possible, as well as allow some amount of controlling of the handshake template insertion using a json without needing to recompile the code, which will also be helpful if a new model's handshake template standard matches the existing generic flow provided.

Sebby37 · 2024-04-23T15:15:21Z

@ngxson

My idea is trying to use llama_chat_apply_template in main.cpp. This will effectively deprecate antiprompt, prompt prefix/postfix and cml options.

I'm a bit confused by your use of deprecate here, would this mean potentially removing support for antiprompt and prefix/postfix in main at some point in the future?

ngxson · 2024-04-23T15:38:17Z

@Sebby37 I wanted to deprecate the usage of antiprompt, prefix, postfix for chat template in particular, but not for general usages, because:

prefix, postfix are fixed value, while it need to be dynamic for chat templates ==> However, since I made a proposal implementation to have dynamic prefix/postfix in Refactor chat template API #6822 , I'll need to re-think that
antiprompt: we no longer need antiprompt, since EOG (end of generaton) token is introduced along with llama 3 model.

@hanishkvc Sorry I was on a flight during last 24h, I will (progressively) continue my work when I recover from jetlag.

kaizau · 2024-04-24T14:28:56Z

@ngxson I tried implementing this idea (takes me almost 2 hours but still nothing works). Turns out the idea of breaking chat template API into prefix/postfix is even worse and make the code unreadable.

I think a variation what you originally proposed in #5922 could handle readability quite well.

Core idea is that if prefix/postfix are constrained to only static strings (no conditional logic, fully context-free), then this doesn't need to be a function so much as just a data structure — mapping template roles to a small handful of strings.

Dynamic cases like llama2 are rare exceptions (is there even another dynamic example?). Instead of incurring the complexity of supporting contextual logic in all templates, I think it'd be simpler to handle such logic upstream, keeping templates themselves as simple as possible.

Ex: llama_chat_get_template_type could distinguish between LLAMA_CHAT_TEMPLATE_LLAMA2 and LLAMA_CHAT_TEMPLATE_LLAMA2_WITH_SYSTEM, which correctly accounts for the differences.

Ultimately, simple string pre/postfixes would also be easier to expose as configuration. So you could potentially even justify NOT handling the edge cases, as long as users can account for them via config flags / server options.

ngxson · 2024-04-24T15:52:25Z

Core idea is that if prefix/postfix are constrained to only static strings (no conditional logic, fully context-free), then this doesn't need to be a function so much as just a data structure — mapping template roles to a small handful of strings.

Yes, eventually we will implement this idea. Because modern chat templates (chatml, gemma, llama3,...) all work that way (i.e. different prefix for different roles), it's already possible to give the user the possibility to control prefix/postfix - just matter of time before we figure out the best way (and future-proof way) to do so.

ngxson · 2024-04-24T16:04:08Z

@kaizau FYI, I updated the description of this issue. The issue with prefix/postfix that I mentioned earlier is no longer correct (after some iterations). I'm now more confident that the idea of having dynamic prefix/postfix will work out.

github-actions · 2024-06-08T01:07:03Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

ngxson added the enhancement New feature or request label Mar 29, 2024

Jeximo mentioned this issue Apr 1, 2024

Add OpenChat, Alpaca, Vicuna chat templates #6397

Merged

araleza mentioned this issue Apr 8, 2024

It's not possible to enter <|END_OF_TURN_TOKEN|> when using cmd-r+ #6551

Closed

EZForever mentioned this issue Apr 12, 2024

server: Use llama_chat_apply_template on /completion endpoint #6624

Closed

4 tasks

kaizau mentioned this issue Apr 17, 2024

Proposal: An alternative to chat templates #6726

Closed

4 tasks

ngxson mentioned this issue Apr 21, 2024

Update main's interactive mode to use the chat handshake templates support already available in llama.cpp (and currently only used by server,...) #6795

Closed

ngxson mentioned this issue Apr 21, 2024

Use llama_chat_apply_template in main (WIP) #6810

Draft

3 tasks

ngxson mentioned this issue Apr 22, 2024

Refactor chat template API #6822

Draft

khimaros mentioned this issue Apr 29, 2024

Add the Command R chat format abetlen/llama-cpp-python#1382

Open

github-actions bot added the stale label May 25, 2024

github-actions bot closed this as completed Jun 8, 2024

ngxson mentioned this issue Jun 22, 2024

Add chat template support for llama-cli #8068

Merged

4 tasks

Implement (properly) different chat templates in main.cpp #6391

Implement (properly) different chat templates in main.cpp #6391

Comments

ngxson commented Mar 29, 2024 • edited Loading

Motivation

Possible Implementation

Possible Implementation

Format the chat on-the-go

Manage stop sequences

ngxson commented Mar 29, 2024

MoonRide303 commented Apr 1, 2024

teleprint-me commented Apr 3, 2024 • edited Loading

hnfong commented Apr 21, 2024 • edited Loading

ngxson commented Apr 21, 2024 • edited Loading

hnfong commented Apr 21, 2024 • edited Loading

ngxson commented Apr 21, 2024 • edited Loading

hanishkvc commented Apr 21, 2024

ngxson commented Apr 21, 2024 • edited Loading

hanishkvc commented Apr 21, 2024

hanishkvc commented Apr 21, 2024

ngxson commented Apr 21, 2024

ngxson commented Apr 21, 2024

hanishkvc commented Apr 21, 2024

hanishkvc commented Apr 21, 2024 • edited Loading

hanishkvc commented Apr 21, 2024

hanishkvc commented Apr 21, 2024 • edited Loading

hanishkvc commented Apr 21, 2024

ngxson commented Apr 22, 2024

ngxson commented Apr 22, 2024 • edited Loading

teleprint-me commented Apr 22, 2024

teleprint-me commented Apr 22, 2024 • edited Loading

ngxson commented Apr 22, 2024 • edited Loading

teleprint-me commented Apr 22, 2024

ngxson commented Apr 22, 2024 • edited Loading

teleprint-me commented Apr 22, 2024 • edited Loading

teleprint-me commented Apr 22, 2024 • edited Loading

ngxson commented Apr 22, 2024

teleprint-me commented Apr 22, 2024 • edited Loading

hanishkvc commented Apr 22, 2024 • edited Loading

Sebby37 commented Apr 23, 2024

ngxson commented Apr 23, 2024

kaizau commented Apr 24, 2024 • edited Loading

ngxson commented Apr 24, 2024

ngxson commented Apr 24, 2024

github-actions bot commented Jun 8, 2024

ngxson commented Mar 29, 2024 •

edited

Loading

teleprint-me commented Apr 3, 2024 •

edited

Loading

hnfong commented Apr 21, 2024 •

edited

Loading

ngxson commented Apr 21, 2024 •

edited

Loading

hnfong commented Apr 21, 2024 •

edited

Loading

ngxson commented Apr 21, 2024 •

edited

Loading

ngxson commented Apr 21, 2024 •

edited

Loading

hanishkvc commented Apr 21, 2024 •

edited

Loading

hanishkvc commented Apr 21, 2024 •

edited

Loading

ngxson commented Apr 22, 2024 •

edited

Loading

teleprint-me commented Apr 22, 2024 •

edited

Loading

ngxson commented Apr 22, 2024 •

edited

Loading

ngxson commented Apr 22, 2024 •

edited

Loading

teleprint-me commented Apr 22, 2024 •

edited

Loading

teleprint-me commented Apr 22, 2024 •

edited

Loading

teleprint-me commented Apr 22, 2024 •

edited

Loading

hanishkvc commented Apr 22, 2024 •

edited

Loading

kaizau commented Apr 24, 2024 •

edited

Loading