fix(server): llama v2 GPTQ #648
Conversation
Great! I've had several people report issues with this model, and lots of people want to try it in TGI.
@fxmarty what GPUs are you running this on? I'm using your change / command line on a machine with 4xA10G and running into the following error during warmup:
FWIW, works with
I tried https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ on 1x A100 80GB or 2x A100 80GB. I don't use the text-generation-inference Docker image nor the default Dockerfile, though, so maybe there's something different there?
LGTM!
`groupsize=1`? That seems odd. Even with the fix I'm not able to get correct output.
Oh, maybe it should be
`groupsize` doesn't actually seem to be used during inference...
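For context, here is a minimal NumPy sketch of group-wise GPTQ dequantization (purely illustrative, not TGI's actual kernel; the function and variable names are mine). It shows why the value matters: with the common `groupsize = -1` convention a single scale/zero pair covers the entire input dimension, whereas `groupsize = 1` would mean one pair per input row:

```python
import numpy as np

def dequantize(qweight, scales, zeros, groupsize):
    """Illustrative group-wise dequantization; not TGI's implementation.

    qweight: (in_features, out_features) integer codes
    scales, zeros: (num_groups, out_features) per-group parameters
    """
    in_features = qweight.shape[0]
    if groupsize == -1:
        # Conventionally, -1 means a single group spans all input rows.
        groupsize = in_features
    out = np.empty(qweight.shape, dtype=np.float32)
    for start in range(0, in_features, groupsize):
        gid = start // groupsize
        rows = slice(start, start + groupsize)
        # Standard affine dequantization: (code - zero) * scale.
        out[rows] = (qweight[rows].astype(np.float32) - zeros[gid]) * scales[gid]
    return out
```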
Hmm, it's working fine for me with the command at the top:
```python
import requests
import json

# Llama 2 chat system prompt.
system_message = "\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."

headers = {
    "Content-Type": "application/json",
}

# Build the Llama 2 chat prompt: the system prompt goes inside <<SYS>> tags,
# and the user message is wrapped in [INST] ... [/INST].
message = "Hey llama!"
input_prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n "
input_prompt = input_prompt + str(message) + " [/INST] "

data = {
    "inputs": input_prompt,
    "parameters": {"max_new_tokens": 256},
}

# POST to the TGI /generate endpoint running locally.
response = requests.post("http://127.0.0.1:8080/generate", headers=headers, data=json.dumps(data))
print(response.text)
```

`print(response.text)` gives correct output for me. Edit: my Dockerfile for reference
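As a side note, a hedged variant of the same request against TGI's streaming endpoint, `/generate_stream`, which returns server-sent events with one JSON payload per token (reusing `headers` and `data` from the snippet above):

```python
# Reuses `headers` and `data` defined in the snippet above.
with requests.post(
    "http://127.0.0.1:8080/generate_stream",
    headers=headers,
    data=json.dumps(data),
    stream=True,
) as r:
    for line in r.iter_lines():
        # Each event line looks like: data:{"token": {"text": "...", ...}, ...}
        if line.startswith(b"data:"):
            payload = json.loads(line[len(b"data:"):])
            print(payload["token"]["text"], end="", flush=True)
```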
This is extremely odd:
- 70B / 4 shards on A10G -> garbage
- 7B / 2 shards on A10G -> correct

(All quantized versions, of course.)
Tentatively merging (the code looks OK, and the bug was present before this change).
Guys, have you tested the performance? It is very slow for this GPTQ model; in my test it took about 3 s per token.
@munger1985 Please open an issue with a reproduction. |
It's not an issue, but did you also find it very slow? What speed do you get, in tokens/s?
As per the title, and as reported in #601 (comment) and https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5.
Test it:
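(The original command is elided here. As a rough sketch, once a server is up you can exercise it with `huggingface_hub`'s `InferenceClient`; the host and port below are assumptions:)

```python
from huggingface_hub import InferenceClient

# Assumes a TGI server is already serving the GPTQ model locally.
client = InferenceClient("http://127.0.0.1:8080")
print(client.text_generation("[INST] Hey llama! [/INST]", max_new_tokens=64))
```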