Chat format: Recognize specified language and offloaded lexguessing to every newline #81

Merged 8 commits on Oct 7, 2023


@SinanAkkoyun commented Oct 1, 2023

#71 (comment)

> It also looks like calling guess_lexer on every token is a little slow, so maybe it makes more sense to call it at the end of every line instead?

#71 (comment)

> Sometimes the model specifies the language, use that (and/or incorporate a small prompt to ensure lang specification)

I found some time :) The chat code now detects the language specified after the opening code fence (I like to finetune my models to always do that), and if none is specified, it only runs the lexer guess once per newline instead of on every token.
The PR is nothing too big, but I thought this might help nonetheless.
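
To illustrate the idea, here is a minimal sketch of the per-line fallback, not the code in this PR. `CodeBlockLanguage` and the `guess` callback are hypothetical names: the class reads the rest of the fence line as the language, and only when that is empty does it call `guess` once per completed line.

```python
class CodeBlockLanguage:
    """Tracks the language of one code block while text streams in.

    Hypothetical helper: `guess` is any callable mapping code text to a
    language name. Feed it the text that follows the opening fence.
    """

    def __init__(self, guess):
        self._guess = guess
        self._info = ""        # text on the fence line, e.g. "python"
        self._in_info = True   # still reading the rest of the fence line
        self._code = ""        # code body received so far
        self.language = None
        self.guess_calls = 0

    def feed(self, chunk):
        """Consume one streamed chunk and return the current language."""
        for ch in chunk:
            if self._in_info:
                if ch == "\n":
                    self._in_info = False
                    # Prefer the language the model specified, if any.
                    self.language = self._info.strip().lower() or None
                else:
                    self._info += ch
                continue
            self._code += ch
            if ch == "\n" and not self._info.strip():
                # No language given: re-guess once per completed line,
                # not once per token.
                self.language = self._guess(self._code)
                self.guess_calls += 1
        return self.language
```

With Pygments, `guess` could be something like `lambda code: guess_lexer(code).name`; that wiring is an assumption here, not taken from the diff.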

Here is the prompt I found most effective for Llama2-7B-chat 4.0bpw to specify the language:
-sp "You are a helpful coding assistant. Always answer as helpfully as possible. Specify the language after starting a codeblock like: ```python\nprint('hello')\n"

#71 (comment)

> So that won't really work for code snippets. Also it supports a lot of obscure formats so maybe that could be narrowed down a bit for more accurate results. I'm looking into it.

You said you wanted to improve the lexguesser, but still, if I can somehow help or if you want to spend your time on other problems, please let me know and I'll try to take care of it.

@SinanAkkoyun

I just tested Mistral 7B's coding skills and noticed that the current chat code only recognizes ``` when it arrives as a single chunk/token. Mistral emits it split across chunks, e.g. as `` followed by `. A fix commit is on the way.
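
A rough sketch of one way to handle a split fence, not necessarily what the fix commit does: hold back any trailing backticks until the next chunk arrives, so a fence is detected no matter how the tokenizer splits it. `FenceDetector` and its event tuples are hypothetical names.

```python
class FenceDetector:
    """Finds code fences in streamed text even when split across chunks."""

    FENCE = "```"

    def __init__(self):
        self._pending = ""    # trailing backticks held back from the last chunk
        self.in_code = False  # toggled each time a full fence is seen

    def feed(self, chunk):
        """Return a list of ("text", str) and ("fence", None) events."""
        data = self._pending + chunk
        self._pending = ""
        events = []
        while data:
            idx = data.find(self.FENCE)
            if idx >= 0:
                if idx:
                    events.append(("text", data[:idx]))
                events.append(("fence", None))
                self.in_code = not self.in_code
                data = data[idx + len(self.FENCE):]
                continue
            # No complete fence: keep one or two trailing backticks back,
            # since the next chunk may complete the fence.
            stripped = data.rstrip("`")
            self._pending = data[len(stripped):]
            if stripped:
                events.append(("text", stripped))
            data = ""
        return events

    def flush(self):
        """Emit any held-back backticks at end of stream."""
        tail, self._pending = self._pending, ""
        return [("text", tail)] if tail else []
```

The trade-off is that a lone trailing backtick (inline code) is delayed by one chunk before it is printed.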

@SinanAkkoyun

@turboderp The Mistral 7B chunking issue has been fixed! If you find any bugs with it, please let me know.

@turboderp turboderp merged commit a9f3f17 into turboderp:master Oct 7, 2023
anchortense pushed a commit to anchortense/exllamav2-logit-threshold-samplers that referenced this pull request Oct 21, 2024
Chat format: Recognize specified language and offloaded lexguessing to every newline