Root mean square of token probability differences as new quantization quality metric #2875

---
For a specific quantization, do you get different values for different tensors? If so, it seems like this could be a good way to automatically determine stuff like the k-quants strategies that try to put more bits in the more important tensors.
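For illustration, a minimal sketch of what that automation could look like; `quantize_single_tensor` and `measure_rms_p` are hypothetical helpers, not existing llama.cpp APIs:

```python
# Hypothetical sketch: estimate each tensor's sensitivity by quantizing it
# alone and measuring the resulting RMS_p against the full-precision model.
# `quantize_single_tensor` and `measure_rms_p` are assumed helpers, not real APIs.

def rank_tensor_sensitivity(base_model, tensor_names, quant_type, eval_tokens):
    scores = {}
    for name in tensor_names:
        # Quantize only this tensor; keep every other tensor at full precision.
        candidate = quantize_single_tensor(base_model, name, quant_type)
        scores[name] = measure_rms_p(base_model, candidate, eval_tokens)
    # The most sensitive tensors (largest RMS_p) are candidates for more bits.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```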

---
Interesting analysis. In contrast to standard perplexity evaluation of a model, where we have information only about the single "correct" token for each context, here we have more information available in the predicted probabilities across the entire vocab. It makes sense that, if utilized correctly, this information can result in a better (faster and more accurate) way to evaluate the change in model quality due to quantization. In some sense, this metric can be interpreted as "how different are two models?" and probably has other applications beyond evaluating the quality of quantum models. For example, the $\mathrm{RMS}_p$ of fine-tuned models with respect to the base model might be a way to compare the amount of "behavior change" as a result of the fine-tuning.
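As a rough sketch of the computation (my assumptions about shapes and averaging, not a confirmed implementation): given logits from the two models for the same token positions, $\mathrm{RMS}_p$ could be computed like this:

```python
import numpy as np

def softmax(logits):
    # Subtract the per-row max before exponentiating for numerical stability.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rms_p(logits_a, logits_b):
    # logits_a, logits_b: (n_positions, n_vocab) logits from two models
    # evaluated on the same token positions.
    diff = softmax(logits_a) - softmax(logits_b)
    # RMS over all (position, vocab-token) pairs.
    return np.sqrt(np.mean(diff ** 2))
```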

---
Nice to see fewer logits thrown away.

---
I have investigated potential new metrics other than differences in perplexity for judging the quality of a quantization format. The full report can be found here.
The TLDR is that I propose using the root mean square of the differences in token probability ($\mathrm{RMS}_p$) between the quantized and unquantized model as a new metric.
I think this would have the following advantages:
This is what a plot of $\mathrm{RMS}_p$ looks like:
This is the corresponding table:
I chose not to add the uncertainties for perplexity because they would be misleading in this context: both models are evaluated on the same text, so their perplexity errors are very highly correlated, and separate per-model error bars would greatly overstate the uncertainty of the differences between them.
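To illustrate the point numerically (a hedged sketch with illustrative names; perplexity is the exponential of the mean negative log-likelihood, so uncertainties are compared on the log scale):

```python
import numpy as np

def paired_se_of_difference(nll_base, nll_quant):
    # Per-token negative log-likelihoods from both models on the SAME text.
    nll_base = np.asarray(nll_base)
    nll_quant = np.asarray(nll_quant)
    n = len(nll_base)
    # Naive per-model standard errors (what separate error bars would show).
    se_base = nll_base.std(ddof=1) / np.sqrt(n)
    se_quant = nll_quant.std(ddof=1) / np.sqrt(n)
    # Paired standard error of the difference: the correlated part cancels,
    # so se_diff is typically far smaller than se_base or se_quant.
    d = nll_quant - nll_base
    se_diff = d.std(ddof=1) / np.sqrt(n)
    return se_base, se_quant, se_diff
```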
I very much welcome feedback on my idea, particularly from @ggerganov and @ikawrakow, who have spent a lot of time on quantization formats.