Some more observations regarding QX_0 quantizations with LLaMAv2 7B #2421
Replies: 2 comments
-
Interesting indeed. Maybe we could figure out a slower quantization mode which somehow measures the weight distribution of each tensor and selects a good enough quantization method for it. This would give a good compromise between quality, speed and model size. Simply calculating the variance of the tensor weights may work, and then experimentally determining which quantization method to use for different variance ranges.
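A minimal sketch of what such a heuristic could look like, assuming the tensor weights are available as dequantized floats. The enum, function names and variance thresholds below are made-up placeholders, not values from any experiment:

```cpp
#include <vector>

// Hypothetical stand-ins for the actual ggml quantization types.
enum class quant_choice { Q4_0, Q5_0, Q6_K };

// Variance of a tensor's weights.
static double weight_variance(const std::vector<float> & w) {
    double mean = 0.0;
    for (float x : w) mean += x;
    mean /= (double) w.size();

    double var = 0.0;
    for (float x : w) var += ((double) x - mean) * ((double) x - mean);
    return var / (double) w.size();
}

// Pick a quantization type from the variance.
// The thresholds are placeholders - they would have to be tuned experimentally.
static quant_choice pick_quant_for_tensor(const std::vector<float> & w) {
    const double var = weight_variance(w);
    if (var > 1e-3) return quant_choice::Q6_K; // unusually spread-out weights -> spend more bits
    if (var > 1e-4) return quant_choice::Q5_0;
    return quant_choice::Q4_0;                 // typical tensor -> default low-bit type
}
```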
-
The
-
In #2276 I reported an unusual behavior of LLaMAv2 7B when using `Q4_0` and `Q5_0` quantizations. In short, the observation is that with short prompts such as `"I believe the meaning of life is"`, after the end of the first sentence the generation switches into some weird mode - for example, often generating text in some other language, starting with a non-capital letter, etc. This only happens if the sentence ends with `.` - i.e. it doesn't occur when it ends with `!` or `?`. I find this weird and it has been bugging me ever since.

Today I noticed in the `quantize` tool output that the tensors in layers 0 and 1 have significantly different weight distributions compared to the other layers. This discrepancy is also observed to a similar extent for LLaMAv1.
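For reference, a rough sketch of how such per-tensor statistics could be computed outside of the `quantize` tool, assuming the weights have already been dequantized to F32 (loading and iterating over the tensors is omitted; the function name is made up):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Print simple distribution statistics (RMS and max |w|) for one tensor,
// e.g. with name = "layers.1.feed_forward.w2".
static void print_tensor_stats(const char * name, const std::vector<float> & w) {
    double sum_sq  = 0.0;
    double max_abs = 0.0;
    for (float x : w) {
        sum_sq  += (double) x * (double) x;
        max_abs  = std::max(max_abs, (double) std::fabs(x));
    }
    const double rms = std::sqrt(sum_sq / (double) w.size());
    printf("%-40s rms = %.6f  max|w| = %.6f\n", name, rms, max_abs);
}
```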
So I decided to increase the quantization accuracy of the tensors in layers `0` and `1` to see what happens. The weird behavior disappeared. I tried doing the same for other layers and it did not help.
Even stranger, after a few more experiments, the only tensor that I have to quantize more accurately (i.e. with either `Q6_K` or `Q8_0`) to fix the crazy texts is `layers.1.feed_forward.w2`.

For example, here are the top token probabilities for the first token of the second sentence using vanilla `Q4_0`:

You can see that these tokens don't make sense for the start of a new sentence.
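As a side note, such a list of top candidates can be produced from the raw logits roughly like this - a generic sketch, not the exact code used here, assuming the logits for the position of interest are already available:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Convert raw logits to probabilities and print the top-k candidate token ids.
static void print_top_candidates(const std::vector<float> & logits, int k) {
    // softmax with max-subtraction for numerical stability
    const float max_logit = *std::max_element(logits.begin(), logits.end());

    std::vector<std::pair<double, int>> exps(logits.size());
    double sum = 0.0;
    for (size_t i = 0; i < logits.size(); ++i) {
        const double e = std::exp((double) logits[i] - max_logit);
        exps[i] = { e, (int) i };
        sum += e;
    }

    k = std::min(k, (int) exps.size());
    std::partial_sort(exps.begin(), exps.begin() + k, exps.end(),
                      [](const std::pair<double, int> & a, const std::pair<double, int> & b) {
                          return a.first > b.first;
                      });

    for (int i = 0; i < k; ++i) {
        printf("token %6d : p = %.4f\n", exps[i].second, exps[i].first / sum);
    }
}
```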
Here are the probabilities with `layers.1.feed_forward.w2` quantized with `Q6_K`, everything else being the same:

This looks much better and is more consistent with the full F16-precision token candidates:
I thought this was an interesting observation worth sharing.
Here is a diff patch if anyone wants to play with this further:
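The patch itself is not reproduced above. As a rough illustration of the idea only - not the actual diff - a per-tensor override during quantization could look something like this, with the enum and function name made up for the sketch:

```cpp
#include <string>

// Hypothetical stand-ins for the actual ggml quantization types.
enum class quant_type { Q4_0, Q6_K, Q8_0 };

// Return the quantization type for a given tensor, bumping the precision of
// the one tensor that was found to matter in these experiments.
static quant_type quant_type_for_tensor(const std::string & tensor_name, quant_type default_type) {
    if (tensor_name == "layers.1.feed_forward.w2") {
        return quant_type::Q6_K; // Q8_0 also fixes the issue, per the observations above
    }
    return default_type;
}
```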