SimpleChat Completion Mode flexibility and cleanup, Settings gMe, Optional sliding window #7480

Merged May 26, 2024 · 19 commits. Changes shown from all commits.

Commits
8042cb9
SimpleChat: A placeholder system prompt, Use usage msg in code
hanishkvc May 22, 2024
3c11098
SimpleChat:CompletionMode: Allow control of Role: prefix
hanishkvc May 22, 2024
0dba8f8
SimpleChat:Completion: Avoid Role: prefix; Newline only in between
hanishkvc May 22, 2024
7a0a423
SimpleChat:CompletionMode: Update readme/usage, trim textarea newline
hanishkvc May 22, 2024
fe60655
SimpleChat:SC: Ensure proper clearing/reseting
hanishkvc May 22, 2024
01594da
SimpleChat: Update usage note and readme a bit
hanishkvc May 22, 2024
e2164d6
SimpleChat:Completion: clear any prev chat history at begining
hanishkvc May 22, 2024
40fbbeb
SimpleChat:Try read json early, if available
hanishkvc May 23, 2024
5d84a92
SimpleChat: Rename the half asleep mis-spelled global var
hanishkvc May 23, 2024
073eae6
SimpleChat: Common chat request options from a global object
hanishkvc May 23, 2024
59f74c7
SimpleChat: Update title, usage and readme a bit
hanishkvc May 23, 2024
cbd853e
SimpleChat:ChatRequestOptions: max_tokens
hanishkvc May 23, 2024
4b29736
SimpleChat: Reduce max_tokens to be small but still sufficient
hanishkvc May 23, 2024
f0dd91d
SimpleChat: Consolidate global vars into gMe, Display to user
hanishkvc May 23, 2024
b57aad7
SimpleChat:SlidingWindow: iRecentUserMsgCnt to limit context load
hanishkvc May 23, 2024
11d2d31
SimpleChat: placeholder based usage hint for user-in textarea
hanishkvc May 23, 2024
8f172b9
SimpleChat: Try make user experience better, if possible
hanishkvc May 24, 2024
b3afd6c
SimpleChat:Add n_predict (equiv max_tokens) for llamacpp server
hanishkvc May 24, 2024
6d2f3d9
SimpleChat: Note about trying to keep things simple yet flexible
hanishkvc May 25, 2024
11 changes: 4 additions & 7 deletions examples/server/public_simplechat/index.html
@@ -1,7 +1,7 @@
<!DOCTYPE html>
<html lang="en">
<head>
<title>SimpleChat LlamaCppEtal </title>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="message" content="Save Nature Save Earth" />
@@ -30,20 +30,17 @@
<hr>
<div class="sameline">
<label for="system-in">System</label>
<input type="text" name="system" id="system-in" class="flex-grow"/>
<input type="text" name="system" id="system-in" placeholder="e.g. you are a helpful ai assistant, who provides concise answers" class="flex-grow"/>
</div>

<hr>
<div id="chat-div">
<p> You need to have javascript enabled.</p>
</div>

<hr>
<div class="sameline">
<textarea id="user-in" class="flex-grow" rows="3"></textarea>
<textarea id="user-in" class="flex-grow" rows="3" placeholder="enter your query to the ai model here" ></textarea>
<button id="user-btn">submit</button>
</div>

126 changes: 123 additions & 3 deletions examples/server/public_simplechat/readme.md
@@ -14,11 +14,15 @@ own system prompts.
The UI follows a responsive web design so that the layout can adapt to available display space in a usable
enough manner, in general.

Allows the developer/end-user to control some of the behaviour by updating gMe members from the
browser's devel-tool console.

NOTE: Given that the idea is for basic minimal testing, it doesn't bother with any model context length and
culling of old messages from the chat by default. However, by enabling the sliding window chat logic, a crude
form of old-message culling can be achieved.

NOTE: It doesn't set any parameters other than temperature and max_tokens for now. However, if someone
wants, they can update the js file or the equivalent member in gMe as needed.


## usage
@@ -43,39 +43,155 @@ next run this web front end in examples/server/public_simplechat
### using the front end

Open this simple web front end from your local browser

* http://127.0.0.1:PORT/index.html

Once inside

* Select between chat and completion mode. By default it is set to chat mode.

* In completion mode
  * the logic by default doesn't insert any role-specific "ROLE: " prefix wrt each role's message.
    If the model requires a prefix wrt user role messages, then the end user has to explicitly add
    the needed prefix when they enter their chat message.
    Similarly, if the model requires a prefix to trigger the assistant/ai-model response,
    then the end user needs to enter the same.
    This keeps the logic simple, while still giving the end user the flexibility to manage any
    templating/tagging requirement wrt their messages to the model.
  * the logic doesn't insert a newline at the beginning or end of the generated prompt message.
    However, if the chat being sent to the /completions endpoint has more than one role's message,
    then a newline is inserted when moving from one role's message to the next, so that they can be
    clearly identified/distinguished. (A sketch of this flattening follows this list.)
  * given that the /completions endpoint normally doesn't add any chat templating of its own,
    the above ensures that the end user can create a custom single/multi message combo with any
    tags/special-tokens related chat templating, to test out the model handshake. Or the end user
    can use it just for a normal completion related/based query.

* If you want to provide a system prompt, then ideally enter it first, before entering any user query.
  Normally Completion mode doesn't need a system prompt, while Chat mode can generate better/more
  interesting responses with a suitable system prompt.
  * if chat.add_system_begin is used
    * you can't change the system prompt after it has been submitted once along with a user query.
    * you can't set a system prompt after you have submitted any user query.
  * if chat.add_system_anytime is used
    * one can change the system prompt at any time during the chat, by changing the contents of the
      system prompt box.
    * in turn the updated/changed system prompt will be inserted into the chat session.
    * this allows the subsequent user chatting to be driven by the new system prompt set above.

* Enter your query and either press enter or click on the submit button.
  If you want to insert an enter (\n) as part of your chat/query to the ai model, use shift+enter.

* Wait for the logic to communicate with the server and get the response.
  * the user is not allowed to enter any fresh query during this time.
  * the user input box will be disabled and a working message will be shown in it.

* just refresh the page to reset wrt the chat history and/or system prompt and start afresh.

* Using NewChat one can start independent chat sessions.
  * two independent chat sessions are set up by default.
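
Below is a minimal sketch of the completion-mode flattening described above: a newline only in
between role messages, plus an optional "ROLE: " prefix. The function name and the message shape
are illustrative assumptions, not the exact code in simplechat.js.

```javascript
// Sketch: flatten chat messages into a single /completions prompt string.
// Illustrative only; names and the message shape are assumptions.
function flatten_chat(messages, bInsertStandardRolePrefix) {
    let prompt = "";
    for (let i = 0; i < messages.length; i++) {
        if (i > 0) {
            prompt += "\n"; // newline only in between role messages
        }
        if (bInsertStandardRolePrefix) {
            prompt += `${messages[i].role}: `; // e.g. "user: " / "assistant: "
        }
        prompt += messages[i].content;
    }
    return prompt;
}

// With the prefix disabled (the default), the model sees only the raw
// message texts, so any template tags must come from the end user.
console.log(flatten_chat([
    { role: "system", content: "you are a helpful ai assistant" },
    { role: "user", content: "hello" },
], false));
```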


## Devel note

### Reason behind this

The idea is to be easy enough to use for basic purposes, while also being simple and easily discernible
by developers who may not be from a web frontend background (so in turn may not be familiar with template /
end-use-specific-language-extensions driven flows), so that they can use it to explore/experiment with things.

And given that the idea is also to help developers explore/experiment, some flexibility is provided
to change behaviour easily using the devel-tools/console, for now. And skeletal logic has been
implemented to explore some of the end points and the ideas/implications around them.


### General

Me/gMe consolidates the settings which control the behaviour into one object.
One can see the current settings, as well as change/update them, using the browser's devel-tool/console.

bCompletionFreshChatAlways - whether Completion mode collates the complete/sliding-window history
when communicating with the server, or only sends the latest user query/message.

bCompletionInsertStandardRolePrefix - whether Completion mode inserts a role-related prefix wrt the
messages that get inserted into the prompt field wrt the /completions endpoint.

chatRequestOptions - maintains the list of options/fields to send along with a chat request,
irrespective of whether the /chat/completions or /completions endpoint is used.

If you want to add additional options/fields to send to the server/ai-model, and or modify
existing options' values or remove them, for now you can update this global var using the
browser's development-tools/console.
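
For example, one might tweak these from the devel-tool console along the following lines. gMe and
chatRequestOptions come from simplechat.js; the specific values, and the top_p field, are just
illustrative (whether a given field is honoured depends on the server).

```javascript
// Run in the browser's devel-tools console, after the page has loaded.
gMe.chatRequestOptions.temperature = 0.5;        // change an existing option
gMe.chatRequestOptions.max_tokens = 2048;        // allow longer responses
gMe.chatRequestOptions.top_p = 0.9;              // add a new field (if the server supports it)
delete gMe.chatRequestOptions.frequency_penalty; // remove a field
console.log(gMe.chatRequestOptions);             // verify the current set of options
```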

iRecentUserMsgCnt - a simple-minded SlidingWindow to limit the context window load at the Ai Model end.
This is disabled by default. However, if enabled, then in addition to the latest system message, only
the last/latest iRecentUserMsgCnt user messages after the latest system prompt, and the responses to
them from the ai model, will be sent to the ai-model when querying for a new response. IE if enabled,
only user messages after the latest system message/prompt will be considered.

This specified sliding window user message count also includes the latest user query.
  <0 : Send the entire chat history to the server
   0 : Send only the system message, if any, to the server
  >0 : Send the latest chat history from the latest system prompt, limited to the specified count.
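
The selection logic might look roughly like the following sketch. This is a simplification of the
sliding-window idea; the actual helper in simplechat.js may differ in names and details.

```javascript
// Sketch: pick the messages to send, given iRecentUserMsgCnt.
function recent_chat(messages, iRecentUserMsgCnt) {
    if (iRecentUserMsgCnt < 0) {
        return messages; // send the entire chat history
    }
    // find the latest system message, if any
    let iLastSys = -1;
    for (let i = messages.length - 1; i >= 0; i--) {
        if (messages[i].role === "system") { iLastSys = i; break; }
    }
    const chosen = (iLastSys >= 0) ? [messages[iLastSys]] : [];
    if (iRecentUserMsgCnt === 0) {
        return chosen; // only the system message, if any
    }
    // walk backwards from the latest message, stopping once the
    // window contains iRecentUserMsgCnt user messages
    let userCnt = 0;
    const tail = [];
    for (let i = messages.length - 1; i > iLastSys; i--) {
        tail.unshift(messages[i]);
        if (messages[i].role === "user") {
            userCnt += 1;
            if (userCnt >= iRecentUserMsgCnt) { break; } // window filled
        }
    }
    return chosen.concat(tail);
}
```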


By using gMe's iRecentUserMsgCnt and chatRequestOptions.max_tokens, one can try to control, to some
extent and in a simple crude way, how much the chat history loads the ai-model's context window
wrt each chat response.


Sometimes the browser may be stubborn about caching the files, so your updates to html/css/js may
not be visible. Also remember that just refreshing/reloading the page in the browser, or for that
matter clearing site data, doesn't directly override site caching in all cases. Worst case you may
have to change the port. Or in the browser's dev tools, you may be able to disable caching fully.


The concept of multiple chat sessions with different servers, as well as saving and restoring of
those across browser usage sessions, can be woven around the SimpleChat/MultiChatUI classes and
their instances relatively easily; however, given the current goal of keeping this simple, it has
not been added, for now.


By switching between chat.add_system_begin/anytime, one can control whether the system prompt can
be changed anytime during the conversation, or only at the beginning.
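
The two modes can be contrasted with the following sketch. The function shapes and the chat array
layout here are assumptions for illustration, not the exact code in simplechat.js.

```javascript
// Sketch contrasting the two system-prompt modes named above.
function add_system_begin(xchat, sysPrompt) {
    // honour the system prompt only while the chat is still empty
    if ((xchat.length === 0) && (sysPrompt.trim() !== "")) {
        xchat.push({ role: "system", content: sysPrompt });
    }
    return xchat;
}

function add_system_anytime(xchat, sysPrompt) {
    // (re)insert the system prompt whenever its contents change mid-chat
    const lastSys = [...xchat].reverse().find((m) => m.role === "system");
    if ((sysPrompt.trim() !== "") && (!lastSys || lastSys.content !== sysPrompt)) {
        xchat.push({ role: "system", content: sysPrompt });
    }
    return xchat;
}
```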


read_json_early is to experiment with reading json response data early on, if available, so that
the user can be shown the generated data as and when it is being generated, rather than at the end
when the full data is available.

The server flow doesn't seem to be sending back data early, at least for the kind of request
(including options) that is currently sent.

If json data can be read early in future, as and when the ai model is generating it, then this
helper needs to indirectly update the chat div with the received data, without waiting for the
overall data to be available.
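
For reference, reading a response body incrementally with the standard fetch/ReadableStream APIs
might look like the sketch below. This is not the exact code in simplechat.js, and the on_part
callback is a hypothetical hook for updating the chat div as data arrives.

```javascript
// Sketch: read the http response body incrementally, instead of
// waiting for the full json to be available.
async function read_response_early(response, on_part) {
    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let full = "";
    while (true) {
        const { done, value } = await reader.read();
        if (done) { break; }
        const part = decoder.decode(value, { stream: true });
        full += part;
        on_part(part); // e.g. append the partial text to the chat div
    }
    return full; // the caller can JSON.parse this once complete
}
```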


### Default setup

By default things are set up to try and make the user experience a bit better, if possible.
However, a developer testing the server or an ai-model may want to change these values.

Using iRecentUserMsgCnt, the chat history context sent to the server/ai-model is reduced to just
the system-prompt, the prev-user-request-and-ai-response and the cur-user-request, instead of the
full chat history. This way, if there is any response with garbage/repetition, it doesn't mess
with things beyond the next question/request/query, in some ways.

max_tokens is set to 1024, so that a relatively large previous response doesn't eat up the space
available wrt the next query-response. However, don't forget that the server, when started, should
also be given a model context size of 1k or more, to be on the safe side.

The /completions endpoint of examples/server doesn't take max_tokens; instead it takes the internal
n_predict. For now the same is added here on the client side; maybe later max_tokens can be added
to the /completions endpoint handling code on the server side.

The frequency and presence penalty fields are set to 1.2 in the set of fields sent to the server
along with the user query, so that the model is partly nudged to avoid repeating text in its
response.
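
Putting the above together, the options object sent along with each request might look roughly like
the sketch below. The 1024 and 1.2 values are the defaults described in this readme; the model name
and temperature shown are placeholders, and the authoritative defaults live in simplechat.js.

```javascript
// Sketch of the default chat request options described above.
let exampleChatRequestOptions = {
    model: "gpt-3.5-turbo",   // placeholder; local servers may ignore it
    temperature: 0.7,         // placeholder value
    max_tokens: 1024,         // honoured by /chat/completions
    n_predict: 1024,          // llama.cpp server equivalent, for /completions
    frequency_penalty: 1.2,   // nudge the model away from repeating itself
    presence_penalty: 1.2,
};
```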

An end-user can change these behaviours by editing gMe from the browser's devel-tool/console.


## At the end

Also a thank you to all open source and open model developers, who strive for the common good.
7 changes: 7 additions & 0 deletions examples/server/public_simplechat/simplechat.css
@@ -48,6 +48,13 @@ button {
flex-direction: column;
}

.ul1 {
padding-inline-start: 2vw;
}
.ul2 {
padding-inline-start: 2vw;
}

* {
margin: 0.6vmin;
}