Use llama_chat_apply_template in main (WIP) #6810 (Draft)
Resolve #6391
The core idea is to use `llama_chat_apply_template` to apply the template twice: once with and once without the last user message. We then take the diff between the two output strings and feed only that part into inference.

Example:
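A minimal sketch of the idea is shown below. The exact `llama_chat_apply_template` signature is the one from `llama.h` of this period (it takes the model, an optional template string, the message array, an add-assistant flag and an output buffer) and may differ in newer versions; `chat_get_added_part` is the helper proposed in the TODO list further down, not an existing API.

```cpp
#include <algorithm>
#include <string>
#include <vector>

#include "llama.h"

// Format a list of messages with the model's built-in template.
// Passing tmpl = nullptr tells llama_chat_apply_template to read the
// template from the model's metadata.
static std::string apply_chat_template(const llama_model * model,
                                       const std::vector<llama_chat_message> & msgs,
                                       bool add_assistant_prefix) {
    std::vector<char> buf(std::max((size_t) 256, msgs.size() * 1024));
    int32_t n = llama_chat_apply_template(model, nullptr, msgs.data(), msgs.size(),
                                          add_assistant_prefix, buf.data(), (int32_t) buf.size());
    if (n > (int32_t) buf.size()) {
        // buffer was too small: the return value is the required size
        buf.resize(n);
        n = llama_chat_apply_template(model, nullptr, msgs.data(), msgs.size(),
                                      add_assistant_prefix, buf.data(), (int32_t) buf.size());
    }
    return n < 0 ? std::string() : std::string(buf.data(), n);
}

// Proposed helper (name taken from the TODO list below): apply the template to
// the history with and without the last user message, and return only the diff,
// i.e. the part that still has to be fed into inference.
static std::string chat_get_added_part(const llama_model * model,
                                       const std::vector<llama_chat_message> & msgs) {
    if (msgs.empty()) {
        return std::string();
    }
    std::vector<llama_chat_message> prev(msgs.begin(), msgs.end() - 1);
    std::string without_last = apply_chat_template(model, prev, false);
    std::string with_last    = apply_chat_template(model, msgs, true);
    // assumes the previously formatted history is a prefix of the new string,
    // which is what common chat templates produce
    return with_last.substr(without_last.size());
}
```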
This approach requires minimal effort to maintain the chat template infrastructure, while using the exact same logic for `main` and `server` (reminder: `server` also has the notion of a "prompt cache", which works the same way).

Having to re-format the whole chat history on every turn seems inefficient at first glance, but it is needed because it follows the same approach as `server` (which is designed to be stateless). Then, we find the diff between the 2 strings and only the added part is evaluated.
TODO:

- Add `chat_get_added_part` to get the diff part with / without the last user message
- `main` must keep track of the list of messages
- In `main`, deprecate `-cml` (but not remove it) while adding a `--chat-template` argument
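A rough sketch of how the per-turn flow in `main` could look, reusing the `chat_get_added_part` helper sketched earlier (function and variable names here are hypothetical, not part of the actual patch):

```cpp
#include <string>
#include <vector>

#include "llama.h"

// Hypothetical per-turn handler: main keeps the full message history and only
// feeds the newly added, formatted part into the existing evaluation loop.
static std::string on_user_turn(const llama_model * model,
                                std::vector<llama_chat_message> & chat_msgs,
                                const char * user_input) {
    chat_msgs.push_back({ "user", user_input });

    // only the newly added portion of the formatted prompt goes to inference
    std::string added = chat_get_added_part(model, chat_msgs);

    // ... tokenize `added`, run generation, then append the reply, e.g.:
    // chat_msgs.push_back({ "assistant", reply_cstr });
    return added;
}
```

Note that `llama_chat_message` only stores raw `const char *` pointers, so in a real implementation `main` would also need to own the role/content strings for as long as the history is kept.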