Bug: quantized gemma 27b output still wrong after tokenizer fix and soft capping #8183
Comments
AI Studio? Are you confusing Gemma with Gemini?
Gemma 2 has been available in AI Studio since yesterday. I live in Italy; I don't know if it's available everywhere.
I'm having the exact same experience. I've been testing Gemma 2 for data extraction: the 9B model gets the answers almost perfect, whereas the 27B model only understands that it has to output JSON and gets literally everything else (including the JSON key names) wrong. It's a night-and-day difference between them. I've tested the full model using Nvidia's NIM service (you get 1,000 requests for signing up) and the 27B model has zero issues with any of the tasks there. I am running a Q8 quant, so the quality loss should be minimal. So I am very confident something is wrong with the quantized 27B model.
I can also confirm that the 9B is less affected by this. I tried the same prompt with it: it outputs the wrong numeric solution, but it was able to repeat the question word for word as requested. The prompt was:
Seems like Google broke something |
Soft capping might be missing, see huggingface/transformers#31698. |
They talk about it in the paper. They say that soft capping was temporarily disabled to make the model compatible with existing implementations of flash attention, and that the performance hit is negligible. Apparently it was not negligible. |
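For context, the soft capping described in the Gemma 2 paper squashes attention and final logits through a tanh so their magnitude never exceeds a fixed cap. A minimal numpy sketch; the cap values are what I understand the Gemma 2 config to use, so treat them as assumptions:

```python
import numpy as np

def soft_cap(logits: np.ndarray, cap: float) -> np.ndarray:
    """Squash logits into (-cap, cap) with a tanh, as described for Gemma 2."""
    return cap * np.tanh(logits / cap)

# Assumed cap values (reported as Gemma 2 config defaults):
ATTN_LOGIT_SOFTCAP = 50.0    # applied to attention scores before the softmax
FINAL_LOGIT_SOFTCAP = 30.0   # applied to the final LM-head logits

scores = np.array([10.0, 80.0, -120.0])
print(soft_cap(scores, ATTN_LOGIT_SOFTCAP))  # large magnitudes saturate near ±50
```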
I didn't notice! I will try it. P.S. |
There is now a PR that fixes the soft capping problem: #8197. Another issue that might be relevant is that Gemma 2 uses sliding window attention instead of global attention in every other layer. If that is missing, the context is effectively limited to 4096 tokens. See the last comment in this issue: #3377. This might also solve the Phi3 issue: #7709
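To make the interleaving concrete, here is a small numpy sketch of alternating sliding-window and global causal masks. The 4096-token window comes from the comment above; which layers are windowed versus global (here even versus odd) is an assumption:

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Global causal mask: token i may attend to every token j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n: int, window: int) -> np.ndarray:
    """Causal mask restricted to the last `window` positions."""
    m = causal_mask(n)
    for i in range(n):
        m[i, : max(0, i - window + 1)] = False
    return m

# Toy sizes for readability; Gemma 2 reportedly uses a 4096-token window.
n, window = 8, 4
masks = [
    sliding_window_mask(n, window) if layer % 2 == 0 else causal_mask(n)
    for layer in range(4)
]
print(masks[0].astype(int))  # banded: each row sees at most `window` past tokens
print(masks[1].astype(int))  # full lower triangle: global causal attention
```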
@0wwafa the simplest command you can run is the following:

EDIT: After #8197 the output improved a lot. The simplest prompt that completely breaks the local model is:

The AI Studio model answers:

The local model starts rambling about fat pigs and then comments on its own answer in Spanish.
The model was requantized from the HF repo version after updating both the HF repo and transformers, and after merging the soft capping PR. Quants used: Q8_0
I have implemented those two features, soft capping and interleaved SWA/full attention, in chatllm.cpp, and the Q8_0-quantized Gemma-2 can solve this fruit problem with greedy sampling (while Q4_1 fails):
I tested your implementation at Q8_0 with my benchmarks and the output exactly matches the reference implementation by Google (to clarify: I mean the Gemma 2 model on AI Studio).
@matteoserva check my quantizations: https://huggingface.co/RobertSinclair
All you need is to go deeper. I would like to report that a self-merged (or self-stacked) Gemma-2 9B (Q8_0) can solve this math problem, too. Here layers 8/9/16/17/24/25/32/33 are repeated (resulting in a 10.8B model):
An even deeper one (
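As an illustration of that kind of self-merge, here is a rough sketch of duplicating decoder layers with transformers. The repo id, the exact duplication recipe, and the layer indexing are assumptions based on the comment above; in practice a dedicated tool such as mergekit handles the bookkeeping:

```python
# Rough sketch only: duplicate selected decoder layers of a loaded model.
# The repo id and layer indices are assumptions; KV-cache/layer-index
# bookkeeping is ignored here.
import copy

import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it")
REPEAT = {8, 9, 16, 17, 24, 25, 32, 33}  # layers to stack twice

new_layers = []
for i, layer in enumerate(model.model.layers):
    new_layers.append(layer)
    if i in REPEAT:
        new_layers.append(copy.deepcopy(layer))  # second copy of this layer

model.model.layers = nn.ModuleList(new_layers)
model.config.num_hidden_layers = len(new_layers)
print(f"{len(new_layers)} decoder layers after the self-merge")
```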
Closing this and continuing in #8240
What happened?
The quantized version of Gemma 27B (Q8_0) still gets the answer wrong even for simple problems.
The version of Gemma on AI Studio answers all my questions correctly.
Here is an example problem that quantized Gemma consistently fails, while the AI Studio Gemma answers it correctly.
The correct answer is 7 or 8.
I also tried asking the model to repeat the question by prepending "Repeat the question and then answer it: ".
The model in llama.cpp fails this simple task, while the model in AI Studio repeats the question word for word.
I noticed that the AI Studio response starts with
Here's how to solve the...
while the response when run in llama.cpp starts with
Here's how to solve this...
So I printed the token probabilities from llama.cpp; this is the output. I would have expected a much higher probability for "the" relative to "this", even after quantization:
Here is the setup:
model version: bartowski gemma-27b-it at Q8_0 after tokenizer fix
llama-server: 3264 (upstream version after merge)
inference parameters: temperature = 0.01, seed = 0
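For reference, a minimal sketch of how per-token probabilities can be requested from llama-server via its /completion endpoint; the n_probs field and the completion_probabilities key reflect my reading of the server API and should be treated as assumptions:

```python
import json
import urllib.request

# Assumes llama-server is running locally on the default port 8080.
payload = {
    "prompt": "Here's how to solve",
    "n_predict": 4,
    "temperature": 0.01,
    "seed": 0,
    "n_probs": 5,   # top-5 candidate probabilities for each generated token
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# Compare candidates such as "the" vs. "this" directly from the returned list.
for tok in result.get("completion_probabilities", []):
    print(tok)
```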
Name and Version
$ ./llama-cli --version
version: 3264 (09a5534f)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output