New quantization method for Q4_K and Q5_K #4739
Conversation
Thank you for measuring with KL divergence. I made some charts a while back for 7B and 13B KL divergence and found it to be a much more interpretable metric of how much is actually changing.
As context continues, the model generally gets lower perplexity as it progresses toward the end of the context window on wikitext perplexity evaluations. Can you show the tables with the first 15% for all models so that it's a 1:1 comparison?
I would also like to mention that for average KL divergences in Mistral 7B quants (for my short evaluation of about 500 tokens of wikitext), 3_K_M was ~0.04, 4_K_M was 0.01, while 5_K_M was 0.003. This large gap seems to align with users' subjective perception of 5_K_M being the "best bang for the buck" in terms of diminishing returns compared to 4_K_M. If we adapt the scale for readability / interpretability by multiplying it by 100x, fp16 = ~0 measured scaled change from the original probabilities (because it is the original). Notably, there is still a small average difference between q8_0 and q6_K, but it's extremely small. In your table, phi-2 has much higher divergences than that; a 5_K_S is most similar to a 3_K_M (or likely 3_K_L, which wasn't measured) for a Mistral model in terms of relative quantization loss. This seems to support my belief that dense large models are easier to quantize.
Exactly which distributions are used to calculate KL divergence? Logits at the end of the network?
What do you mean here? I don't see any confidence intervals.
I'm not sure what he's done, but when I measured it, I compared the probability distributions post-softmax (at temperature 1.0, of course). This is what the model was trained to predict, so intuitively it makes sense that the difference between these output distributions is what we want to optimize for when it comes to quantization. Considering my charts for Mistral 7B seem to align roughly with Llama 2 7B in terms of average divergence, I think this is what he's done. Not sure where he derived the margin of error from, but regardless, it's a more comprehensive and significantly better data point for quantization loss. The top-token agreement part of the table is also interesting; being able to say "on average it has the same top token 90% of the time" is more intuitive and easier to understand than any perplexity measurement, considering that lower perplexity is not strictly better. Optimizing for the most similar end distribution (lowest KL divergence) will probably be more coherent than optimizing for greedy sampling, however. As it currently stands, using perplexity to gauge how much the model changed is a very rough, high-error-margin metric that doesn't give you a good way of understanding how the distribution fundamentally changed.
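To make the post-softmax comparison concrete, here is a minimal sketch (not the script used in this thread) of the KL divergence between the fp16 and quantized output distributions at a single token position; the logits below are synthetic:

```python
# Minimal sketch: KL divergence between post-softmax distributions of an fp16
# model and a quantized model at one token position. Logits are made up.
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    # KL(P || Q) = sum_i p_i * ln(p_i / q_i), with epsilon guarding against log(0)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

rng = np.random.default_rng(0)
fp16_logits = rng.normal(size=32000)                               # pretend 32k-token vocabulary
quant_logits = fp16_logits + rng.normal(scale=0.05, size=32000)    # small quantization noise

p = softmax(fp16_logits)    # reference distribution (fp16)
q = softmax(quant_logits)   # quantized distribution
print(f"KL(P||Q) = {kl_divergence(p, q):.6f}")
```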
btw @Ttl:
Is there a way I could set up a script or something that will automatically test and measure KL divergence with different K-quant configurations for a particular model and find an 'optimal' mixture? I would greatly appreciate something like this, and if you could point me to what needs to be done, I would love to assist.
The script used to calculate KL divergence is linked in the post: https://gist.github.com/Ttl/0d51f739dc59254b4b2183e259c97d82. See the docstring for usage instructions. It calculates the KL divergence of the softmaxed output logits for each token. The confidence interval is marked in the table with ±. The perplexity error bounds are from the llama.cpp perplexity program; looking at the source code, I'm not sure what statistical quantity it measures. The KL divergence script is about 10x slower than the perplexity calculation because it's implemented in Python. The main issue with testing multiple models is the evaluation speed. The Mixtral fp16 model is 93 GB while I only have 32 GB of RAM, so it swaps heavily.
Interesting work and analysis - thank you for sharing it!
Apart from the extra work for evaluating the results, do you expect similar gains from using this approach for the rest of the quantizations? I see you've applied the least-squares fit for

Overall I agree that it makes sense to start evaluating the differences against the F16 distribution in more detail. The KL divergence between the logits is a step in this direction. I think another aspect would be to compare the embeddings at different stages of the computation graph (#2783).

I ran a quick test using the new
In my work experience, `value ± value` is always used as the notation for best estimate ± 1 sigma. The notation we use for confidence intervals is to explicitly give the interval bounds. I think you should also generally say how the confidence intervals are calculated. Since they are in this case symmetrical, I assume they are generated from the standard deviation by assuming a normal distribution.
The final perplexity value is calculated as a mean of individual values. If you assume the values are normally distributed, you can calculate an uncertainty for the mean from the standard deviation of the values.
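A rough sketch of that kind of estimate (synthetic per-token negative log-likelihoods, not the actual llama.cpp computation):

```python
# Sketch: uncertainty of a mean under a normality assumption.
# The per-token negative log-likelihoods are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(1)
nll = rng.normal(loc=2.0, scale=0.8, size=10_000)   # fake per-token -log p values

mean_nll = nll.mean()
sem = nll.std(ddof=1) / np.sqrt(len(nll))           # standard error of the mean

ppl = np.exp(mean_nll)
ppl_err = ppl * sem                                  # propagate 1-sigma error through exp()
print(f"PPL = {ppl:.4f} ± {ppl_err:.4f} (1 sigma)")
```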
Generally speaking, I think it would be useful if we had a way to calculate perplexity, KL divergence, etc. directly in llama.cpp; for me the biggest challenge with numerical computations is always ensuring that the results are actually correct (or at least sufficiently precise). Currently you can already set
I did try least squares fitting the scale of
I think most of the time is spent in a loop trying other possible quantization choices. The limit can be decreased to speed it up, but it affects the performance slightly, and I figured it's better to do it well than to do it fast.
The KL divergence confidence bound is calculated with scipy's `bayes_mvs`; it should be equivalent to the normal distribution assumption at this high number of samples.
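For reference, a minimal sketch of that kind of confidence bound (randomly generated per-token KL values, not data from this PR):

```python
# Sketch: 99% interval for the mean of per-token KL divergence values using
# scipy.stats.bayes_mvs, compared against the plain normal approximation.
import numpy as np
from scipy.stats import bayes_mvs

rng = np.random.default_rng(2)
kl_values = rng.gamma(shape=0.5, scale=0.02, size=50_000)  # fake per-token KL divergences

mean_res, _, _ = bayes_mvs(kl_values, alpha=0.99)
lo, hi = mean_res.minmax
print(f"mean KL = {mean_res.statistic:.5f}, 99% interval = ({lo:.5f}, {hi:.5f})")

# Normal approximation for comparison; agrees closely at this sample size.
sem = kl_values.std(ddof=1) / np.sqrt(len(kl_values))
print(f"normal approx: ±{2.576 * sem:.5f}")
```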
The YAML format is not dense enough. I currently save logits in fp32 binary format and the full
Interesting work. A small correction to the concept of perplexity: saying that it just measures the confidence of a model is a bit too simplistic. To see this, consider a "language model" that always predicts a probability of 1 for the exact same token and a probability of zero for all other tokens. This is an extremely confident model (as confident as it gets), and yet its perplexity will be infinite, while its practical utility will be zero. Oh, the KL divergence will be infinite too, so we find that the two are somehow related in this case; perhaps they are closely related in general? Did you try writing down mathematically what PPL looks like when expressed with the two probability distributions we are trying to compare?

Concerning the NP-hardness of the mixed integer least squares problem: we are quantizing blocks, so these are quite small optimization problems for each quantization block, and the problem is readily solvable. Why don't we use it then? Because we do not really know what we want to minimize and, as we learned early on, using the exact solution of the mixed integer least squares problem can lead to disastrous results, especially if no weights are used at all (and you see how in the existing k-quants implementation the weights are sometimes

Concerning the statistical uncertainty that you see in the PPL output: this does not reflect the extremely strong correlation between the logits predicted by different variations of the same language model (
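To make the degenerate always-the-same-token example above concrete, here is a tiny sketch (synthetic tokens, standard PPL definition):

```python
# Sketch of the degenerate "always the same token" model described above.
# With PPL = exp(-mean(log p(correct token))), any test token that differs from
# the fixed prediction gets probability 0, so log(0) = -inf and PPL diverges.
# Token ids and the test sequence are made up.
import numpy as np

fixed_token = 3
test_tokens = np.array([3, 5, 3, 1, 3])        # "correct" next tokens in a test set

# Degenerate model: probability 1 for fixed_token, 0 for everything else.
p_correct = np.where(test_tokens == fixed_token, 1.0, 0.0)

with np.errstate(divide="ignore"):
    log_p = np.log(p_correct)                   # -inf wherever the model is wrong
ppl = np.exp(-log_p.mean())
print(ppl)                                      # inf
```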
Thanks for the very good comments. I see your point with perplexity. It measures how well the model is able to predict the next token in the test set, and it does make sense for quantized models too. However, in text generation, if we are interested in minimizing the difference in generated tokens relative to the unquantized model, then the quantization method with lower KL divergence should give closer results when sampling generated tokens from the output logit distribution at temperature=1 and without other fancy tricks. Top-k, min-p and others complicate it a little, depending on where the differences in the logit distributions are.

Correlation is also a very good point. It doesn't really make sense to compare the reported perplexity or KL divergence uncertainties of different quantizations of the same model.

Here are two plots of phi-2 Q4_K_S quantization perplexity and KL divergence estimation differences as a function of tokens/batches. I have KL divergences for each token, but for perplexity I only have the reported batch output; at this scale it shouldn't make too big a difference. KL divergence converges much more quickly. I can calculate a 99% confidence bound for the KL divergence difference as (-0.004766, -0.004326), so this PR should improve it for that model and that quantization with very high confidence. Perplexity for the same model is worse with this quantization; while the confidence isn't as high and I can't calculate it since I don't have the samples, eyeballing the plot it looks at least moderately likely that this PR makes it slightly worse for this model and this quantization. They are different measures, and while there is a correlation, I guess it makes sense that one quantization method doesn't need to be better on both of them.

Since top-1 token agreement was better with this PR, and according to the earlier argument it should be closer to the unquantized model in text generation, I tested the perplexity calculation on phi-2's own generated text. I first generate the perplexity test data with the fp16 model with:

```bash
#!/usr/bin/bash
for i in {1..100}
do
    echo $i
    ./main -m ../phi-2/ggml-model-f16.gguf -p "$(sed "${i}q;d" wiki.test.raw)" -n -2 -ngl 99 --ignore-eos --top-p 1 --min-p 0 -c 0 >> phi2_test.txt
done
```

The prompt is initialized from lines of wiki.test.raw.

I think that comparing KL divergence makes more sense than perplexity if the goal is to minimize the difference to the unquantized model in text generation. A better measure could take some sampler parameters into account, such as top-k, but I'm not quite sure at the moment how they should be considered.
Good discussion. But let's look at some equations.

KL-divergence:

$$D_{\mathrm{KL}}(P\,\|\,Q) = \sum_i p_i \ln \frac{p_i}{q_i}$$

where $p_i$ are the token probabilities predicted by the base (fp16) model and $q_i$ those predicted by the quantized model at the same position.

Logarithm of the ratio of quantized to base perplexities:

$$\ln \frac{\mathrm{PPL}(Q)}{\mathrm{PPL}(P)} = \frac{1}{N} \sum_{j=1}^{N} \ln \frac{p_j}{q_j}$$

where $p_j$ and $q_j$ are the probabilities that the base and quantized models assign to the observed token at position $j$ of the $N$-token test corpus.

So, basically, both are expectation values of $\ln(p/q)$: the KL divergence averages it over the model's full output distribution at each position, while the log-PPL ratio averages it over the tokens that actually occur in the test set.

Granted, computing KL-divergence is vastly more efficient than PPL to obtain the same statistical uncertainty, by virtue of getting a score for each model token after each token generation as opposed to PPL, which gets a single score per generated token. I wish we would have thought of that back in the day where

I think it would be useful to look at KL-divergence evaluated over the
On current master I'm getting
I don't really understand your equations. The sums are over different things: the KL divergence equation sums over all logits, but perplexity only considers one probability for each token, and the sum is taken over all tokens. I was trying to evaluate the difference the quantization causes to the actually generated tokens in text generation. The easiest case is temperature 0; then the top-1 token is picked, which is listed in the table in the first post. I presented the case for temperature 1 earlier. Evaluating perplexity on self-generated text was an experiment testing it. Mixtral perplexity was calculated with
OK, this explains it. You are using the instruct tuned version, which isn't clear from the table. Instruct tuned models always have higher PPL than their respective base models. |
After doing this for a sufficiently large number of tokens, where some tokens will appear more frequently than others, we have as a result the expectation value of $\ln(p/q)$.
Do you see it now? |
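A quick numerical sketch of that argument (toy distributions, nothing from the actual models): sampling tokens from the reference distribution $p$ and averaging $\ln(p/q)$ over the samples converges to $D_{\mathrm{KL}}(P\,\|\,Q)$.

```python
# Sketch: the sample average of ln(p/q), with tokens drawn from p, converges to
# the KL divergence KL(P||Q). Distributions are random toy data for illustration.
import numpy as np

rng = np.random.default_rng(3)
vocab = 1000
p = rng.dirichlet(np.ones(vocab))            # reference (fp16-like) distribution
q = rng.dirichlet(np.ones(vocab))            # perturbed (quantized-like) distribution

kl_exact = np.sum(p * np.log(p / q))

tokens = rng.choice(vocab, size=200_000, p=p)   # "generated" tokens sampled from p
kl_sampled = np.mean(np.log(p[tokens] / q[tokens]))

print(f"exact KL                = {kl_exact:.4f}")
print(f"sampled ln(p/q) average = {kl_sampled:.4f}")
```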
I see. A very similar derivation is also on the Wikipedia page for perplexity, just without the division.
I did this experiment with Phi-2 Q4_K_S. First take the top-N logits of the fp16 model, then select only those indices from the fp16 and quantized model logits, softmax, and then calculate the KL divergence. The problem with this approach is that the logits are masked based on the fp16 model's top-N, which can differ from the quantized model's top-N logits, but the error should be small as long as N isn't too small and the distributions are similar enough that no large outliers are masked out.
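Roughly what that procedure looks like (a sketch with synthetic logits, not the actual evaluation code):

```python
# Sketch: KL divergence restricted to the top-N logits of the fp16 model.
# Indices are chosen from the fp16 logits, the same indices are taken from the
# quantized logits, both subsets are re-softmaxed, and KL is computed on them.
import numpy as np

def softmax(x):
    z = x - x.max()
    return np.exp(z) / np.exp(z).sum()

rng = np.random.default_rng(4)
fp16_logits = rng.normal(size=32000)
quant_logits = fp16_logits + rng.normal(scale=0.05, size=32000)

N = 40                                              # e.g. a top-k-sized subset
top_idx = np.argsort(fp16_logits)[-N:]              # top-N indices of the fp16 model

p = softmax(fp16_logits[top_idx])                   # renormalized fp16 subset
q = softmax(quant_logits[top_idx])                  # same indices from the quantized model
mass = softmax(fp16_logits)[top_idx].sum()          # fp16 probability mass covered by the subset

kl = np.sum(p * np.log(p / q))
print(f"top-{N} masked KL = {kl:.6f}, covered fp16 probability mass = {mass:.4f}")
```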
The second column shows the summed probabilities of the selected indices from the fp16 model. Most of the probability mass is on the few top values, as would be expected. It doesn't look like there's any convergence of the KL divergence values of the two quantization methods at these values, and the small probabilities don't contribute very much to the KL divergence. I chose 40 as the minimum value as that is the default sampler top-k value.

I think the summary is that there are two ways to define which quantized model is better:

1. The quantized model that achieves lower perplexity on the test set.
2. The quantized model whose output distribution is closest to the unquantized model's (e.g. lower KL divergence, higher top-token agreement).

Normally these two measures would be expected to correlate, but at least in this case it doesn't look like they agree. I'm not sure exactly why that is. It can be argued which one is the better choice, but I'll close this PR since the currently accepted definition is 1 and this PR is worse on that measure.

EDIT: I'm not quite sure about the above after thinking about it further. KL divergence is better for this PR on the
Quantization performance is usually evaluated using perplexity. Perplexity measures how sure the model is of its token choices, in other words how sharp the output logit distribution is. It makes sense for evaluating unquantized models because there isn't really any one true output for the output tokens. However, when measuring how much quantization affects the output, there is a correct output: the unquantized model's output with the same inputs. It therefore makes sense to compare how closely the output logits align with those of the unquantized model, and KL divergence is a very natural choice for measuring how close two distributions are.
This PR changes the quantization for the Q4_K and Q5_K formats; the commit history also includes changes for some other formats, but testing all of them was too much work. The quantization format is unchanged and the generated weights are backwards compatible. It's not a very big change, and especially if perplexity is used to compare models the changes are within the error margin. The KL divergence measurement is better and outside the error margins for all tested models. RMS error is also slightly decreased.
RMS errors
KL divergence is calculated with this script. Reference logits are calculated with the fp16 model on `wiki.test.raw`. Mixtral 8x7B uses only about 15% of the file because running the fp16 model is way too slow on my computer; the other models are evaluated on the full file. The KL divergence error margin is a 99% confidence interval. Perplexity is also calculated on `wiki.test.raw` using the full file. The last column is the fraction of tokens for which the highest-valued token agreed between the quantized and fp16 models; it gives a more easily interpretable measure of how the quantization actually affects the generated tokens.

The script also outputs other statistics, such as the 90%, 95% and 99% quantile KL divergence and how often the fp16 top token is in the quantized model's top-5 and top-10. I didn't try to fill all of them into the table, but this PR also improves those, for example for Llama-2 7B Q4_K_S:
Llama-2 7B stats
For example, `Eval top token in reference top-10 probability` increased from 0.99904 to 0.999611, which means that with 336345 total tokens, the number of tokens where the quantized model's top-1 token was outside the fp16 model's top-10 decreased from 323 to 131 in the test set.

Here are two plots with some other quantization formats as well, for easier visualization:
I don't think this approach is optimal even when considering only the current activation-unaware quantization. Even if just using RMS error as the target, the quantization of the Q_K formats is a mixed integer least squares problem, which is NP-hard, but I didn't want to spend too much time on it and wanted to first get some feedback on whether this makes sense.
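For illustration only, here is a toy sketch of the kind of per-block, activation-unaware objective being discussed: a brute-force search over candidate (scale, min) pairs for one 4-bit block, minimizing RMS reconstruction error. It is not the actual Q4_K/Q5_K code, and the block size and search grid are arbitrary choices.

```python
# Toy sketch: activation-unaware 4-bit block quantization via grid search over
# (scale, min) candidates, minimizing RMS reconstruction error. Illustration of
# the objective only; not the llama.cpp Q4_K implementation.
import numpy as np

def quantize_block_4bit(x, n_scale=32, n_min=32):
    lo, hi = x.min(), x.max()
    best = (None, None, None, np.inf)
    # Candidate grids around the naive (min, (max - min) / 15) solution.
    for d in np.linspace((hi - lo) / 15 * 0.8, (hi - lo) / 15 * 1.2, n_scale):
        if d <= 0:
            continue
        for m in np.linspace(lo - 0.1 * (hi - lo), lo + 0.1 * (hi - lo), n_min):
            q = np.clip(np.round((x - m) / d), 0, 15)       # 4-bit integer codes
            err = np.sqrt(np.mean((m + d * q - x) ** 2))    # RMS reconstruction error
            if err < best[3]:
                best = (d, m, q, err)
    return best

rng = np.random.default_rng(5)
block = rng.normal(size=32).astype(np.float32)              # one block of weights
d, m, q, rms = quantize_block_4bit(block)
print(f"scale={d:.4f} min={m:.4f} rms_error={rms:.5f}")
```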