CodeLlama 34B errors out after 3+ completions #70
Thank you for this! My best guess is that the number of tokens exceeds the cache size. I will have to investigate this.
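To picture that hypothesis: if the fused modules pre-allocate a key/value cache with a fixed number of positions and the write index is never reset between `generate()` calls, the index eventually runs past the buffer after a few completions. A toy sketch of that failure mode (this is an illustration only, not AutoAWQ's actual cache code; the class and sizes are made up):

```python
import torch

class NaiveKVCache:
    """Fixed-size cache whose write index is never reset between calls (hypothetical sketch)."""

    def __init__(self, max_positions: int, hidden: int):
        self.buffer = torch.zeros(max_positions, hidden)
        self.pos = 0  # write index persists across generate() calls

    def append(self, new_states: torch.Tensor) -> None:
        n = new_states.shape[0]
        # Once pos + n runs past the buffer, the slice below is shorter than
        # new_states and the assignment fails with an "expanded size of the
        # tensor" RuntimeError -- i.e. after a few completions, not on the first.
        self.buffer[self.pos : self.pos + n] = new_states
        self.pos += n

cache = NaiveKVCache(max_positions=1024, hidden=8)
for call in range(4):
    try:
        cache.append(torch.randn(512, 8))
        print(f"completion {call}: ok, cache position = {cache.pos}")
    except RuntimeError as exc:
        print(f"completion {call}: cache overflow -> {exc}")
        break
```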
I've seen the same with other models. Thanks for the script, @abacaj. I'm going to run some other models through their paces to see if I can reproduce. Environment: AutoAWQ 0.1.0, Python 3.10, CUDA 11.8, RTX 3090. I can reproduce the error with any model:
Fixed this now in #75. At least I cannot reproduce this error anymore, even when running for 1000 iterations. @abacaj and @gestalt73, I would appreciate it if you could take the time to test out the pull request and see if something else breaks:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "casperhansen/vicuna-7b-v1.5-awq"
max_new_tokens = 1024

# Load model
model = AutoAWQForCausalLM.from_quantized(
    model_name_or_path,
    quant_filename="awq_model_w4_g128.pt",
    fuse_layers=True,
    trust_remote_code=False,
    safetensors=True,
    max_new_tokens=1024,
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)

tokens = tokenizer(
    "# Write a python function to loop to 1000\n\ndef", return_tensors="pt"
).to("cuda")

# Generate output
cumulative_tokens = 0
for i in range(1000):
    if cumulative_tokens > max_new_tokens:
        cumulative_tokens = 0
    generation_output = model.generate(
        **tokens,
        do_sample=True,
        temperature=0.2,
        top_p=0.95,
        top_k=0,
        max_new_tokens=512,
    )
    num_tokens = len(generation_output[0])
    cumulative_tokens += num_tokens
    print(i, num_tokens, cumulative_tokens)
    # print(tokenizer.decode(generation_output[0], skip_special_tokens=True))
```
Hey @casper-hansen, I ran it a bit with TheBloke/Llama-2-7b-Chat-AWQ and things look normal until the first cache clear; then things get weird. It doesn't error out, though. Take a look at the output after the first set of cache-clear messages around line 192. Output is consistent for the first several generations, but after the "resetting cache" message generation starts out fine and then gets strange towards the end of line 209. From there on out it's hit or miss, and I'm also seeing the huge runs of newlines that I would occasionally see in 0.1.0.
I don't see the expanded tensor error anymore. But model generations using
Added fused_true and fused_false samples here; I turned sampling off, so it should be greedy generation: https://gist.github.com/abacaj/aefb5e9dd85a6fc8b54b5b655a9a632e
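For context, a comparison like that can be produced by loading the same checkpoint twice with `fuse_layers` toggled and decoding greedily, so any divergence is not sampling noise. A rough sketch along those lines, reusing the checkpoint, prompt, and loading kwargs from the script above (this is not necessarily the exact script behind the gist):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "casperhansen/vicuna-7b-v1.5-awq"  # same checkpoint as above
prompt = "# Write a python function to loop to 1000\n\ndef"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

for fuse in (True, False):
    model = AutoAWQForCausalLM.from_quantized(
        model_name_or_path,
        quant_filename="awq_model_w4_g128.pt",
        fuse_layers=fuse,
        trust_remote_code=False,
        safetensors=True,
        max_new_tokens=1024,
    )
    # do_sample=False -> greedy decoding, so the two runs should match apart
    # from small numerical differences introduced by the fused kernels.
    output = model.generate(**inputs, do_sample=False, max_new_tokens=256)
    print(f"--- fuse_layers={fuse} ---")
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```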
Thank you all for testing. The fact that the outputs get weird or don't work as expected after resetting the cache is not good enough for me to merge the PR. I will have to explore this further.
I switched up the approach entirely, and we are now rolling over the cache. This seems to produce correct outputs, and we get as close as possible to HF output with the FT modules. They are not meant to be exactly the same outputs, as slight numerical differences will lead to different outputs in some cases; however, they are very close now.
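For readers unfamiliar with the term, "rolling over" a cache generally means evicting the oldest cached positions once the buffer is full instead of erroring out or hard-resetting. A toy illustration of that idea (not the actual AutoAWQ implementation):

```python
import torch

def roll_cache(cache: torch.Tensor, new_states: torch.Tensor) -> torch.Tensor:
    """Append new positions and evict the oldest ones once the buffer is full.

    Toy [seq_len, hidden] example; real fused-attention caches also carry batch
    and head dimensions and typically track indices instead of copying tensors.
    """
    max_len = cache.shape[0]
    combined = torch.cat([cache, new_states], dim=0)
    return combined[-max_len:]  # keep only the most recent max_len positions

cache = torch.zeros(1024, 8)
for _ in range(5):
    cache = roll_cache(cache, torch.randn(512, 8))  # never overflows; old context is evicted
print(cache.shape)  # torch.Size([1024, 8])
```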
I have closed this issue as the main error has been solved. However, it seems there is a problem with the fused modules and the CodeLlama models, although they should already be supported since GQA is implemented.
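As background on the GQA remark: CodeLlama 34B uses grouped-query attention, where several query heads share one key/value head, so a fused attention path has to expand the cached KV heads to line up with the query heads. A schematic of that expansion (shapes are illustrative; this mirrors the common repeat-KV pattern, not AutoAWQ's exact code):

```python
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand [batch, n_kv_heads, seq, head_dim] to [batch, n_kv_heads * n_rep, seq, head_dim]."""
    batch, n_kv_heads, seq, head_dim = kv.shape
    kv = kv[:, :, None, :, :].expand(batch, n_kv_heads, n_rep, seq, head_dim)
    return kv.reshape(batch, n_kv_heads * n_rep, seq, head_dim)

# CodeLlama 34B has 64 query heads but only 8 KV heads, so each KV head serves 8 query heads.
keys = torch.randn(1, 8, 16, 128)       # [batch, kv_heads, seq, head_dim]
print(repeat_kv(keys, n_rep=8).shape)   # torch.Size([1, 64, 16, 128])
```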
Running CodeLlama 34B using the latest AutoAWQ (installed from the repo):
To reproduce: