Bug: quantized gemma 27b output still wrong after tokenizer fix and soft capping #8183
Comments
AI Studio? Are you confusing Gemma with Gemini?
Gemma 2 has been available in AI Studio since yesterday. I live in Italy; I don't know if it's available everywhere.
I'm having the exact same experience. I've been testing Gemma 2 for data extraction: the 9B model gets the answers almost perfect, whereas the 27B model only understands that it has to output JSON and gets literally everything else (including the JSON key names) wrong. It's a night-and-day difference between them. I've tested the full model using Nvidia's NIM service (you get 1,000 requests for signing up) and the 27B model has zero issues with any of the tasks there. I am running a Q8 quant, so the quality loss should be minimal. So I am very confident something is wrong with the quantized 27B model.
I can also confirm that the 9B is less affected by this. I tried the same prompt with it: it outputs the wrong numeric solution, but it was able to repeat the question word for word as requested. The prompt was:
Seems like Google broke something |
Soft capping might be missing, see huggingface/transformers#31698. |
They talk about it in the paper. They say that soft capping was temporarily disabled to make the model compatible with existing implementations of flash attention, and that the performance hit is negligible. Apparently it was not negligible. |
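For context, the soft capping described in the Gemma 2 paper squashes attention and final logits through a tanh so their magnitude never exceeds a fixed cap. A minimal numpy sketch; the cap values are what I understand the Gemma 2 config to use, so treat them as assumptions:

```python
import numpy as np

def soft_cap(logits: np.ndarray, cap: float) -> np.ndarray:
    """Squash logits into (-cap, cap) with a tanh, as described for Gemma 2."""
    return cap * np.tanh(logits / cap)

# Assumed cap values (reported as Gemma 2 config defaults):
ATTN_LOGIT_SOFTCAP = 50.0    # applied to attention scores before the softmax
FINAL_LOGIT_SOFTCAP = 30.0   # applied to the final LM-head logits

scores = np.array([10.0, 80.0, -120.0])
print(soft_cap(scores, ATTN_LOGIT_SOFTCAP))  # large magnitudes saturate near ±50
```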
I didn't notice! I will try it. P.S. |
There is now a PR that fixes the soft capping problem: #8197. Another issue that might be relevant is that Gemma 2 uses sliding window attention instead of global attention in every other layer. If that is missing, the context is effectively limited to 4096 tokens. See the last comment in this issue: #3377. This might also solve the Phi3 issue: #7709
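To make the interleaving concrete, here is a small numpy sketch of alternating sliding-window and global causal masks. The 4096-token window comes from the comment above; which layers are windowed versus global (here even versus odd) is an assumption:

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Global causal mask: token i may attend to every token j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n: int, window: int) -> np.ndarray:
    """Causal mask restricted to the last `window` positions."""
    m = causal_mask(n)
    for i in range(n):
        m[i, : max(0, i - window + 1)] = False
    return m

# Toy sizes for readability; Gemma 2 reportedly uses a 4096-token window.
n, window = 8, 4
masks = [
    sliding_window_mask(n, window) if layer % 2 == 0 else causal_mask(n)
    for layer in range(4)
]
print(masks[0].astype(int))  # banded: each row sees at most `window` past tokens
print(masks[1].astype(int))  # full lower triangle: global causal attention
```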
@0wwafa the simplest command you can run is the following:

EDIT: After #8197 the output improved a lot. The simplest prompt that completely breaks the local model is:

The AI Studio model answers:

The local model starts rambling about fat pigs and then comments on its own answer in Spanish.
The model was requantized from the HF repo version after updating both the HF repo and transformers, and after merging the soft capping PR. Quants used: Q8_0
I have implemented those two features, soft capping and interleaved SWA/full attention, in chatllm.cpp, and the Q8_0-quantized Gemma-2 can solve this fruit problem with greedy sampling (while Q4_1 fails):
I tested your implementation at Q8_0 with my benchmarks and the output exactly matches the reference implementation by Google (to clarify: I mean the Gemma 2 model on AI Studio).
@matteoserva check my quantizations: https://huggingface.co/RobertSinclair
All you need is to go deeper. I would like to report that a self-merged (or self-stacked) Gemma-2 9B (Q8_0) can solve this math problem, too. Here layers 8/9/16/17/24/25/32/33 are repeated (resulting in a 10.8B model):
An even deeper one (
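As an illustration of that kind of self-merge, here is a rough sketch of duplicating decoder layers with transformers. The repo id, the exact duplication recipe, and the layer indexing are assumptions based on the comment above; in practice a dedicated tool such as mergekit handles the bookkeeping:

```python
# Rough sketch only: duplicate selected decoder layers of a loaded model.
# The repo id and layer indices are assumptions; KV-cache/layer-index
# bookkeeping is ignored here.
import copy

import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it")
REPEAT = {8, 9, 16, 17, 24, 25, 32, 33}  # layers to stack twice

new_layers = []
for i, layer in enumerate(model.model.layers):
    new_layers.append(layer)
    if i in REPEAT:
        new_layers.append(copy.deepcopy(layer))  # second copy of this layer

model.model.layers = nn.ModuleList(new_layers)
model.config.num_hidden_layers = len(new_layers)
print(f"{len(new_layers)} decoder layers after the self-merge")
```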
Closing this and continuing in #8240
What happened?
The quantized version of Gemma 27B (Q8_0) still gets the answer wrong even for simple problems.
The version of Gemma on AI Studio answers all my questions correctly.
Here is an example problem that quantized Gemma consistently fails, while the AI Studio Gemma answers it correctly.
The correct answer is 7 or 8.
I also tried asking the model to repeat the question by prepending "Repeat the question and then answer it: ".
The model in llama.cpp fails this simple task, while the model in AI Studio repeats the question word for word.
I noticed that the AI Studio response starts with
Here's how to solve the...
while the response when run in llama.cpp starts with
Here's how to solve this...
So I printed the token probabilities from llama.cpp; this is the output. I would have expected a much higher probability for "the" relative to "this", even after quantization:
Here is the setup:
model version: bartowski gemma-27b-it at Q8_0 after tokenizer fix
llama-server: 3264 (upstream version after merge)
inference parameters: temperature = 0.01, seed = 0
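For reference, a minimal sketch of how per-token probabilities can be requested from llama-server via its /completion endpoint; the n_probs field and the completion_probabilities key reflect my reading of the server API and should be treated as assumptions:

```python
import json
import urllib.request

# Assumes llama-server is running locally on the default port 8080.
payload = {
    "prompt": "Here's how to solve",
    "n_predict": 4,
    "temperature": 0.01,
    "seed": 0,
    "n_probs": 5,   # top-5 candidate probabilities for each generated token
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# Compare candidates such as "the" vs. "this" directly from the returned list.
for tok in result.get("completion_probabilities", []):
    print(tok)
```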
Name and Version
$ ./llama-cli --version
version: 3264 (09a5534f)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output