
New quantization method for Q4_K and Q5_K #4739

Closed · wants to merge 13 commits

Conversation

@Ttl (Contributor) commented Jan 2, 2024

Quantization performance is usually evaluated using perplexity. Perplexity measures how sure the model is of its token choices, in other words, how sharp the output logit distribution is. It makes sense for evaluating unquantized models because there isn't really any one true output for the output tokens. However, when measuring how much quantization affects the output, there is a correct reference: the unquantized model's output for the same inputs. It therefore makes sense to compare how closely the output logits align with the unquantized model's. KL divergence is a very natural choice for measuring how close two distributions are.
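A minimal sketch of what the per-token measurement looks like, assuming raw fp16 and quantized logit vectors are available for the same position (illustrative only; the actual measurement script is linked further down):

```python
import numpy as np

def kl_divergence(logits_ref, logits_quant):
    """KL(P_fp16 || Q_quant) for a single token position, computed from raw logits."""
    log_p = logits_ref - logits_ref.max()
    log_p -= np.log(np.exp(log_p).sum())          # numerically stable log-softmax
    log_q = logits_quant - logits_quant.max()
    log_q -= np.log(np.exp(log_q).sum())
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))

# averaged over all evaluated positions:
# kl_mean = np.mean([kl_divergence(r, q) for r, q in zip(ref_logits, quant_logits)])
```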

This PR changes the quantization for the Q4_K and Q5_K formats; the commit history also includes changes for some other formats, but testing all of them was too much work. The quantization format itself is unchanged and the generated weights are backwards compatible. It's not a very big change, and if perplexity is used to compare models the differences are within the error margin. The KL divergence measurement is better and outside the error margins for all tested models. RMS error is also slightly decreased.

RMS errors
RMS error (master)
phi-2:
q4_K : rmse 0.00226361, maxerr 0.52862549, 95pct<0.0042, median<0.0016
q5_K : rmse 0.00115071, maxerr 0.28332520, 95pct<0.0022, median<0.0010

Llama-2 7B:
q4_K : rmse 0.00137495, maxerr 0.11077881, 95pct<0.0026, median<0.0010
q5_K : rmse 0.00069610, maxerr 0.05142975, 95pct<0.0014, median<0.0006

Mixtral 8x7B:
q4_K : rmse 0.00082300, maxerr 0.44823837, 95pct<0.0016, median<0.0008
q5_K : rmse 0.00041677, maxerr 0.20952225, 95pct<0.0008, median<0.0004

RMS error (PR)
phi-2:
q4_K : rmse 0.00220827, maxerr 0.54077148, 95pct<0.0042, median<0.0016
q5_K : rmse 0.00109032, maxerr 0.33684158, 95pct<0.0020, median<0.0008

Llama-2 7B:
q4_K : rmse 0.00134057, maxerr 0.31831360, 95pct<0.0026, median<0.0010
q5_K : rmse 0.00065857, maxerr 0.11894464, 95pct<0.0014, median<0.0006

Mixtral 8x7B:
q4_K : rmse 0.00079569, maxerr 0.44116211, 95pct<0.0016, median<0.0006
q5_K : rmse 0.00039095, maxerr 0.19310760, 95pct<0.0008, median<0.0004
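For reference, the rmse/maxerr/95pct/median columns above can be reproduced from an original and dequantized weight tensor roughly as follows (a sketch, not the exact quantize-stats implementation; the tables report the 95th percentile and median as upper bounds):

```python
import numpy as np

def error_stats(original, dequantized):
    """Elementwise quantization-error summary for one tensor (illustrative sketch)."""
    err = np.abs(np.asarray(original, dtype=np.float64) - np.asarray(dequantized, dtype=np.float64))
    return {
        "rmse":   float(np.sqrt(np.mean(err ** 2))),
        "maxerr": float(err.max()),
        "95pct":  float(np.percentile(err, 95)),
        "median": float(np.median(err)),
    }
```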

KL divergence is calculated with this script. Reference logits are calculated with the fp16 model on wiki.test.raw. Mixtral 8x7B uses only about 15% of the file because running the fp16 model is far too slow on my computer; the other models are evaluated on the full file. The KL divergence error margin is a 99% confidence interval. Perplexity is also calculated on the full wiki.test.raw. The last column is the fraction of positions where the highest-probability token of the quantized and fp16 models agreed; it gives a more easily interpretable measure of how the quantization actually affects the generated tokens.

| Model | Quantization | Perplexity | KL divergence to fp16 model | Top-1 token agreement to fp16 model |
|---|---|---|---|---|
| phi-2 (master) | Q4_K_S | 11.0052 ± 0.07823 | 0.0430889 ± 0.000276 | 0.8895 ± 0.001508 |
| phi-2 (PR) | Q4_K_S | 11.0145 ± 0.07824 | 0.0385425 ± 0.000245 | 0.8951 ± 0.001472 |
| Llama-2 7B (master) | Q4_K_S | 5.8852 ± 0.03279 | 0.0246132 ± 0.000419 | 0.9269 ± 0.001156 |
| Llama-2 7B (PR) | Q4_K_S | 5.8930 ± 0.03290 | 0.0226238 ± 0.000415 | 0.929 ± 0.00114 |
| Mixtral-8x7B (master) | Q4_K_S | 4.5136 ± 0.02424 | 0.035074 ± 0.000992 | 0.9191 ± 0.003105 |
| Mixtral-8x7B (PR) | Q4_K_S | 4.5159 ± 0.02435 | 0.0323496 ± 0.000888 | 0.9208 ± 0.003075 |

| Model | Quantization | Perplexity | KL divergence to fp16 model | Top-1 token agreement to fp16 model |
|---|---|---|---|---|
| phi-2 (master) | Q5_K_S | 10.8949 ± 0.07747 | 0.0225299 ± 0.000143 | 0.9193 ± 0.00131 |
| phi-2 (PR) | Q5_K_S | 10.8883 ± 0.07739 | 0.0210207 ± 0.000130 | 0.9215 ± 0.001294 |
| Llama-2 7B (master) | Q5_K_S | 5.8204 ± 0.03250 | 0.0104636 ± 0.000210 | 0.9507 ± 0.0009612 |
| Llama-2 7B (PR) | Q5_K_S | 5.8160 ± 0.03252 | 0.00894639 ± 0.000129 | 0.9518 ± 0.0009511 |
| Mixtral-8x7B (master) | Q5_K_S | 4.4399 ± 0.02378 | 0.0131443 ± 0.000651 | 0.9501 ± 0.00248 |
| Mixtral-8x7B (PR) | Q5_K_S | 4.4440 ± 0.02379 | 0.0125088 ± 0.000457 | 0.9497 ± 0.002488 |

The script also outputs other statistics, such as the 90%, 95% and 99% quantile KL divergence and how often the fp16 top token is in the quantized model's top-5 and top-10. I didn't try to fit all of them in the table, but this PR also improves those; for example, Llama-2 7B Q4_K_S:

Llama-2 7B stats
Model: llama-2-7b-q4_k_s.gguf
Size: 3.59 GiB, (BPW 4.58)
Tokens: 336345
KL-divergence:
mean: 0.0246132, [0.0241945 - 0.025032]
q90: 0.04327, [0.04299 - 0.04354]
q95: 0.06774, [0.06709 - 0.06839]
q99: 0.2054, [0.1996 - 0.2112]
max: 6.866
Reference top token in eval top-n probability:
ref_top1: 0.9269 ± 0.001156
ref_top5: 0.9965 ± 0.0002622
ref_top10: 0.997556 ± 0.0002193
Eval top token in reference top-n probability:
eval_top5: 0.9982 ± 0.0001866
eval_top10: 0.99904 ± 0.0001376

Model: llama-2-7b-q4_k_pr.gguf
Size: 3.59 GiB, (BPW 4.58)
Tokens: 336345
KL-divergence:
mean: 0.0226238, [0.0222092 - 0.0230385]
q90: 0.04112, [0.04085 - 0.04139]
q95: 0.06352, [0.06297 - 0.06408]
q99: 0.1809, [0.1767 - 0.1853]
max: 8.778
Reference top token in eval top-n probability:
ref_top1: 0.929 ± 0.00114
ref_top5: 0.999 ± 0.0001432
ref_top10: 0.999762 ± 6.849e-05
Eval top token in reference top-n probability:
eval_top5: 0.9988 ± 0.0001559
eval_top10: 0.999611 ± 8.764e-05

For example, the eval top token in reference top-10 probability increased from 0.99904 to 0.999611, which means that with 336345 total tokens the number of positions where the quantized model's top-1 token was outside the fp16 model's top-10 decreased from 323 to 131 in the test set.
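As a quick sanity check, those counts follow directly from the reported fractions and the token total:

```python
n_tokens = 336345
print(round((1 - 0.99904)  * n_tokens))  # ~323 tokens outside the fp16 top-10 on master
print(round((1 - 0.999611) * n_tokens))  # ~131 tokens outside the fp16 top-10 with the PR
```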

Here are two plots, including some other quantization formats, for easier visualization:

[Plot: phi2_kl]
[Plot: llama2_kl]

I don't think this approach is optimal even when considering only the current activation-unaware quantization. Even with just RMS error as the target, the quantization of the K-quant formats is a mixed-integer least-squares problem, which is NP-hard, but I didn't want to spend too much time on it and wanted to first get some feedback on whether this makes sense.

@kalomaze (Contributor) commented Jan 2, 2024

Thank you for measuring with KL divergence. I made some charts a while back for 7B and 13B KL divergence and found it to be a much more interpretable metric of how much is actually changing.

@kalomaze (Contributor) commented Jan 2, 2024

> Mixtral 8x7B uses only about 15% of the file because running the fp16 model is far too slow on my computer; the other models are evaluated on the full file.

As context continues, the model generally gets lower perplexity as it progresses toward the end of the context window on wikitext evaluations. Can you show the tables with the first 15% for all models so that it's a 1:1 comparison?

@kalomaze (Contributor) commented Jan 2, 2024

I would also like to mention that for average KL divergences in Mistral 7b quants (for my short evaluation of about 500 tokens of wikitext), 3_K_M was ~0.04, 4_K_M was 0.01, while 5_K_M was 0.003. This large gap seems to align with subjective perceptions from users of 5_K_M being the "best bang for the buck" in terms of diminishing returns compared to 4_K_M.

[Image: Mistral 7B average KL divergence chart]

If we adapt the scale for readability/interpretability by multiplying it by 100x:

fp16 = ~0 measured scaled change from original probabilities (because it's the original)
Q8_0 = ~0.06 avg. measured change from original probabilities
Q6_K = ~0.1 avg. measured change
Q5_K_M = ~0.3 avg. measured change
Q4_K_M = ~1.0 avg. measured change
Q3_K_M = ~3.7 avg. measured change
Q2_K = ~8.2 avg. measured scaled change

Notably, there is still a small average difference between q8_0 and q6_K, but it's extremely small.

In your table, phi-2 has much higher divergences than that; a 5_K_S is most similar to a 3_K_M (or likely a 3_K_L, which wasn't measured) for a Mistral model in terms of relative quantization loss. This seems to support my belief that dense large models are easier to quantize.

@JohannesGaessler (Collaborator) commented:

Exactly which distributions are used to calculate KL divergence? Logits at the end of the network?

> The KL divergence error margin is a 99% confidence interval.

What do you mean here? I don't see any confidence intervals.

@kalomaze (Contributor) commented Jan 2, 2024

> Exactly which distributions are used to calculate KL divergence? Logits at the end of the network?
>
> > The KL divergence error margin is a 99% confidence interval.
>
> What do you mean here? I don't see any confidence intervals.

I'm not sure what he's done, but when I measured it, I compared the probability distributions post-softmax (at temperature 1.0 of course). This is what the model was trained to predict, so intuitively it makes sense that the difference between these output distributions is what we want to optimize for when it comes to quantization.

Considering my charts for Mistral 7b seem to align roughly with Llama2 7b in terms of avg. divergence, I think this is what he's done.

Not sure where he derived the margin of error from, but regardless, it's a more comprehensive and significantly better datapoint for quantization loss. The top-token agreement part of the table is also interesting; being able to say "on average it has the same top token 90% of the time" is more intuitive and easier to understand than any perplexity measurement, considering that lower perplexity is not strictly better. Optimizing for the most similar end distribution (lowest KL divergence) will probably be more coherent than optimizing for greedy sampling, however.

The way it currently stands, using perplexity to gauge how much the model changed is a very rough, high-error margin metric that doesn't give you a good way of understanding how the distribution fundamentally changed.

@kalomaze (Contributor) commented Jan 3, 2024

btw @Ttl:

> testing all of them was too much work

Is there a way I could set up a script or something that would automatically test and measure KL divergence for different K-quant configurations for a particular model and find an 'optimal' mixture? I would greatly appreciate something like this, and if you could point me to what needs to be done, I would love to assist.

@Ttl (Contributor, Author) commented Jan 3, 2024

The script used to calculate KL divergence is linked in the post: https://gist.github.com/Ttl/0d51f739dc59254b4b2183e259c97d82. See the docstring for usage instructions. It calculates the KL divergence of the softmaxed output logits for each token. The confidence interval is marked in the table with ±. The perplexity error bounds are from the llama.cpp perplexity program, and looking at the source code I'm not sure what statistical quantity it measures.

The KL divergence script is about 10x slower than the perplexity calculation because it's implemented in Python. The main issue with testing multiple models is evaluation speed: the Mixtral fp16 model is 93 GB while I only have 32 GB of RAM, so it swaps heavily.

@ggerganov (Owner) commented:

Interesting work and analysis - thank you for sharing it!

> This PR changes the quantization for the Q4_K and Q5_K formats; the commit history also includes changes for some other formats, but testing all of them was too much work.

Apart from the extra work for evaluating the results, do you expect similar gains from using this approach for the rest of the quantizations? I see you've applied the least-squares fit for Q4_1 and Q5_1 at some point, but not for Q4_0 and Q5_0 - do you think it is worth evaluating these as well? I would be interested to see if this helps resolve the issue I described in #2421, where the Q4_0 and Q5_0 quantizations of LLaMA 7B v2 switch to German after the first sentence.

Overall I agree that it makes sense to start evaluating the differences against the F16 distribution in more detail. The KL divergence between the logits is a step in this direction. I think another aspect would be to compare the embeddings at different stages of the computation graph (#2783).

I ran a quick test using the new Q4_K quantization and it takes ~3x longer to perform the 7B quantization compared to master. Although it's not a showstopper, it might be worth trying to improve the performance.

@JohannesGaessler (Collaborator) commented:

> The confidence interval is marked in the table with ±.

In my work experience, value ± value is always used as the notation for best estimate ± 1 sigma. The notation we use for confidence intervals is to explicitly give the interval bounds. I think you should also generally say how the confidence intervals are calculated. Since they are symmetrical in this case, I assume they are generated from the standard deviation by assuming a normal distribution.

> The perplexity error bounds are from the llama.cpp perplexity program, and looking at the source code I'm not sure what statistical quantity it measures.

The final perplexity value is calculated as a mean of individual values. If you assume the values are normally distributed, you can calculate an uncertainty for the mean from the standard deviation of the values.
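A sketch of that calculation given per-token negative log-likelihoods (illustrative; not necessarily the exact formula the perplexity binary uses):

```python
import numpy as np

def perplexity_with_error(nll):
    """nll: per-token negative log-likelihoods, -ln(p_i)."""
    nll = np.asarray(nll, dtype=np.float64)
    mean = nll.mean()
    sem = nll.std(ddof=1) / np.sqrt(len(nll))  # standard error of the mean
    ppl = np.exp(mean)
    return ppl, ppl * sem                      # uncertainty propagated through exp()
```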

@JohannesGaessler (Collaborator) commented:

Generally speaking, I think it would be useful if we had a way to calculate perplexity, KL divergence, etc. directly in llama.cpp; for me the biggest challenge with numerical computations is always ensuring that the results are actually correct (or at least sufficiently precise). Currently you can already set --logdir for the perplexity binary, which will record the logits in a YAML file, but the performance of the Python YAML libraries you would then use to load this data is bad. Ideally we would record reference logits for FP16 to a binary file and then read them back in C++ code, where we can directly compare them against other logits as they are calculated.

@Ttl (Contributor, Author) commented Jan 3, 2024

> Apart from the extra work for evaluating the results, do you expect similar gains from using this approach for the rest of the quantizations?

I did try least-squares fitting the scale of Q4_0, but I don't think it has much benefit; there just aren't enough parameters to fit to improve it. The same goes for least-squares fitting only the super-block scale for Q3_K and Q6_K. Q4_1 and Q5_1 quantization can be improved by a measurable amount by least-squares fitting the min and scale.
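As an illustration of what "least-squares fitting the min and scale" means here: with the integer codes of a block held fixed, the optimal scale d and offset m (in the weighted least-squares sense) have a closed form. This is a hypothetical Python sketch, not the PR's actual C implementation:

```python
import numpy as np

def fit_scale_min(x, q, w=None):
    """Solve min over (d, m) of sum_i w_i * (d*q_i + m - x_i)^2 for a Q4_1-style block."""
    q = np.asarray(q, dtype=np.float64)
    x = np.asarray(x, dtype=np.float64)
    w = np.ones_like(q) if w is None else np.asarray(w, dtype=np.float64)
    sw, swq, swqq = w.sum(), (w * q).sum(), (w * q * q).sum()
    swx, swqx = (w * x).sum(), (w * q * x).sum()
    det = swqq * sw - swq * swq                   # 2x2 normal-equation determinant
    if det == 0.0:
        return 0.0, float(x.mean())
    d = (swqx * sw - swx * swq) / det             # scale
    m = (swqq * swx - swq * swqx) / det           # min/offset
    return float(d), float(m)
```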

> Although it's not a showstopper, it might be worth trying to improve the performance.

I think most of the time is spent in the loop trying other possible quantization choices. The limit can be decreased to speed it up, but that affects the result slightly, and I figured it's better to do it well than to do it fast.

> I think you should also generally say how the confidence intervals are calculated.

The KL divergence confidence bound is calculated with scipy's bayes_mvs; it should be equal to the normal-distribution assumption at this high number of samples.
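For reference, that scipy call looks roughly like this (placeholder data stands in for the per-token KL values produced by the script):

```python
import numpy as np
from scipy import stats

# placeholder: one KL divergence value per evaluated token position
kl_values = np.random.default_rng(0).gamma(shape=1.0, scale=0.02, size=300000)

mean_est, _, _ = stats.bayes_mvs(kl_values, alpha=0.99)
print(mean_est.statistic)  # point estimate of the mean KL divergence
print(mean_est.minmax)     # 99% interval for the mean
```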

> Currently you can already set --logdir for the perplexity binary, which will record the logits in a YAML file

The YAML format is not dense enough. I currently save logits in fp32 binary format, and the full wiki.test.raw phi-2 logits take 42.6 GB on disk with gz compression. A 50k vocab size, 300k tokens and 4 bytes per logit add up quickly. A KL divergence calculation in llama.cpp would be useful; I coded it in Python because it's much quicker to get working.
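A sketch of the kind of dense fp32 binary storage being described (the file name, layout and sizes here are assumptions for illustration, not the script's actual format):

```python
import numpy as np

n_vocab, n_tokens = 51200, 1000   # roughly phi-2-sized vocab; small token count for the example
rng = np.random.default_rng(0)

# write: one float32 row of logits per evaluated token position
with open("ref_logits.bin", "wb") as f:
    for _ in range(n_tokens):
        logits = rng.standard_normal(n_vocab).astype(np.float32)  # placeholder logits
        f.write(logits.tobytes())

# read back lazily, without loading everything into RAM
ref = np.memmap("ref_logits.bin", dtype=np.float32, mode="r", shape=(n_tokens, n_vocab))
```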

@ikawrakow (Contributor) commented Jan 4, 2024

Interesting work.

A small correction to the concept of perplexity: saying it just measures the confidence of a model is a bit too simplistic. To see this, consider a "language model" that always predicts a probability of 1 for the exact same token and probability zero for all other tokens. This is an extremely confident model (as confident as it gets), and yet its perplexity will be infinity, while its practical utility will be zero. KL divergence will be infinity too, so we find that the two are somehow related in this case; perhaps they are closely related in general? Did you try writing down mathematically what PPL looks like when expressed with the two probability distributions we are trying to compare?

Concerning the NP-hardness of the mixed-integer least-squares problem: we are quantizing blocks, so these are quite small optimization problems for each quantization block, and the problem is readily solvable. Why don't we use it then? Because we do not really know what we want to minimize and, as we learned early on, using the exact solution of the mixed-integer least-squares problem can lead to disastrous results, especially if no weights are used at all (and you see how in the existing k-quants implementation the weights are sometimes abs(x) and other times x^2, which is just based on experimentation, a.k.a. numerology). It gets better with an importance matrix used as weights in a weighted RMSE minimization, and we can be more courageous driving towards the minimum, but even then the exact solution is not really the best option.

Concerning the statistical uncertainty that you see in the PPL output: it does not reflect the extremely strong correlation between the logits predicted by different variations of the same language model (fp16 and quantized, or two different quantizations). So if quantization 1 has PPL = X +/- dX and quantization 2 has PPL = Y +/- dY, and we see |X-Y| < dX or dY, this does not mean that the difference between X and Y is "within the margin of error" and not statistically significant. Given that you have the outputs of all predicted logits to compute KL divergence, you can easily look at the statistical uncertainty of the predicted per-token difference, which will be much smaller. Btw, did you look at the uncertainty of your KL divergence estimates?
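A sketch of the paired comparison being suggested, with synthetic placeholder scores standing in for per-token values (e.g. log-likelihoods or KL values) from two quantizations of the same model:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
base = rng.normal(0.0, 1.0, 300000)                  # shared signal -> strong correlation
scores_q1 = base + rng.normal(0.00, 0.05, base.size)
scores_q2 = base + rng.normal(0.01, 0.05, base.size)

diff = scores_q1 - scores_q2                         # paired differences cancel the correlation
sem = diff.std(ddof=1) / np.sqrt(diff.size)          # standard error of the mean difference
lo, hi = stats.norm.interval(0.99, loc=diff.mean(), scale=sem)
print(diff.mean(), (lo, hi))                         # far tighter than comparing X±dX vs Y±dY
```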

@Ttl (Contributor, Author) commented Jan 4, 2024

Thanks for the very good comments. I see your point about perplexity: it measures how well the model is able to predict the next token in the test set, and that makes sense for quantized models too. However, in text generation, if we are interested in minimizing the difference in generated tokens relative to the unquantized model, then the quantization method with lower KL divergence should give closer results when sampling generated tokens from the output logit distribution at temperature 1 and without other fancy tricks. Top-k, min-p and others complicate it a little, depending on where the differences in the logit distributions are.

Correlation is also a very good point. It doesn't really make sense to compare the reported perplexity or KL divergence uncertainties of different quantizations of the same model. Here are two plots of the phi-2 q4_k_s perplexity and KL divergence estimate differences as a function of tokens/batches. I have the KL divergences for each token, but for perplexity I only have the reported per-batch output; at this scale it shouldn't make too big a difference:

[Plot: phi2_ppl_diff]

[Plot: phi2_kl_diff]

KL divergence converges much more quickly. I can calculate a 99% confidence bound for the KL divergence difference as (-0.004766, -0.004326), so this PR should improve it for that model and quantization with very high confidence. Perplexity for the same model is worse with this quantization; the confidence isn't as high and I can't calculate it since I don't have the per-token samples, but eyeballing the plot it looks at least moderately likely that this PR makes it slightly worse for this model and this quantization.

They are different measures, and while there is a correlation, I guess it makes sense that one quantization method doesn't need to be better on both of them. Since top-1 token agreement was better with this PR, and I think the output should be closer to the unquantized model in text generation according to the earlier argument, I tested perplexity calculation on phi-2's own generated text.

I first generate perplexity test data with the fp16 model with:

#!/usr/bin/bash
for i in {1..100}
do
    echo $i
    ./main -m ../phi-2/ggml-model-f16.gguf -p "$(sed "${i}q;d" wiki.test.raw)" -n -2 -ngl 99 --ignore-eos --top-p 1 --min-p 0 -c 0 >> phi2_test.txt
done

The prompt is initialized from lines of wiki.test.raw and the rest of the context is completed by the model. The initial prompt could be generated more intelligently, but this was just a quick test. The model quantized with this PR gets better perplexity on that test set. PR: 3.2497 +/- 0.01839, master: 3.2550 +/- 0.01842.

[Plot: phi2_self_ppl]

I think that comparing KL divergence makes more sense than perplexity if the goal is to minimize the difference from the unquantized model in text generation. A better measure could take some sampler parameters into account, such as top-k, but I'm not quite sure at the moment how they should be considered.

@ikawrakow (Contributor) commented:

@Ttl

Good discussion. But let's look at some equations.

KL divergence:

KL(P || Q) = Sum_i P_i ln(P_i/Q_i)

where P_i and Q_i are the probabilities for token i predicted by the base and quantized models.

Logarithm of the ratio of quantized to base perplexities:

ln( PPL(Quantized)/PPL(Base) ) = Sum_i P'_i ln(P_i/Q_i)

where P'_i is the observed probability for token i in the evaluation dataset, which consists of text written by actual humans.

So, basically, both are expectation values of ln(P_i/Q_i), evaluated over two different probability distribution functions (PDFs). You say that the KL divergence, which uses the token probabilities predicted by a far-from-perfect language model as its PDF, is more useful than PPL, which uses a token PDF derived from text written by humans. I have a hard time understanding how you can be so confident that this is the case.

Granted, computing KL divergence is vastly more efficient than PPL for obtaining the same statistical uncertainty, by virtue of getting a score for every vocabulary token at each evaluated position, as opposed to PPL, which gets a single score per evaluated token. I wish we had thought of that back in the day when llama.cpp did not have GPU support and each PPL evaluation over wiki.test.raw took more than an hour. But today, with PPL ready in 1.5 minutes on a modern GPU, I just fail to see the value added by KL divergence. But I guess that's just me.

I think it would be useful to look at the KL divergence evaluated over the N highest probabilities as predicted by the base model. My expectation is that, as you reduce N from 32,000 to, say, 10,000 -> 1,000 -> 100, the KL divergence of both quantizations will converge to very similar values, and I'm curious to see at what top-N the convergence occurs. As predicted token probabilities often lack sharp peak(s), the KL divergence is heavily influenced by tokens that are highly irrelevant given the current context. Hence, in my opinion, such a "top-N KL divergence" is more useful than the KL divergence evaluated over the entire set of tokens.

ikawrakow mentioned this pull request Jan 5, 2024
@ikawrakow (Contributor) commented:

On current master I'm getting PPL = 4.2523 +/- 0.02190 for Mixtral-8x7B using Q4_K_S. Where did the perplexity of 4.5136 reported in the table come from?

@Ttl (Contributor, Author) commented Jan 5, 2024

I don't really understand your equations; the sums are over different things. The KL divergence equation sums over all logits, but perplexity only considers one probability per token and the sum is taken over all evaluated tokens.

I was trying to evaluate the difference the quantization causes in the actual generated tokens during text generation. The easiest case is temperature 0, where the top-1 token is picked; that is listed in the table in the first post. I presented the case for temperature 1 earlier. The perplexity evaluation on self-generated text was an experiment to test it.

The Mixtral perplexity was calculated with ./perplexity -m ../Mixtral-8x7B-Instruct-v0.1/mixtral-8x7b-g4_k_s.gguf -f wiki.test.raw -ngl 6, and llama.cpp was compiled with CUBLAS. I get slightly different initial output without the -ngl option, but I haven't run a full calculation with it yet. The other quantization is also evaluated with the same parameters, so they should be comparable.

@ikawrakow (Contributor) commented Jan 5, 2024

> The Mixtral perplexity was calculated with ./perplexity -m ../Mixtral-8x7B-Instruct-v0.1/mixtral-8x7b-g4_k_s.gguf -f wiki.test.raw -ngl 6

OK, this explains it. You are using the instruct-tuned version, which isn't clear from the table. Instruct-tuned models always have higher PPL than their respective base models.

@ikawrakow (Contributor) commented Jan 5, 2024

> I don't really understand your equations; the sums are over different things. The KL divergence equation sums over all logits, but perplexity only considers one probability per token and the sum is taken over all evaluated tokens.

PPL(Quantized) = exp( -Sum over evaluated tokens [ ln(Q_i) ] / number of evaluated tokens )
PPL(Base)      = exp( -Sum over evaluated tokens [ ln(P_i) ] / number of evaluated tokens )
ln( PPL(Quantized)/PPL(Base) ) = Sum over evaluated tokens [ ln(P_i/Q_i) ] / number of evaluated tokens

After doing this for a sufficiently large number of tokens, where some tokens appear more frequently than others, the result is the expectation value of ln(P_i/Q_i) over all tokens in the vocabulary, weighted by the naturally occurring frequency of tokens P'_i in human-written text, so

ln( PPL(Quantized)/PPL(Base) ) = Sum over tokens in the vocabulary P'_i ln(P_i/Q_i)

Do you see it now?
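The identity can also be checked numerically on synthetic data; here P_i and Q_i stand for the probabilities the two models assign to the observed token at each position (a toy sketch, not real model output):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.8, size=10000)                            # base model
q = np.clip(p * rng.uniform(0.8, 1.2, size=p.size), 1e-6, 0.999)  # quantized, perturbed

ppl_p = np.exp(-np.mean(np.log(p)))
ppl_q = np.exp(-np.mean(np.log(q)))

print(np.log(ppl_q / ppl_p))   # ln of the perplexity ratio
print(np.mean(np.log(p / q)))  # mean of ln(P_i/Q_i) over evaluated tokens: identical
```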

@Ttl (Contributor, Author) commented Jan 7, 2024

I see. A very similar derivation is also on the Wikipedia page for perplexity, just without the division.

> I think it would be useful to look at the KL divergence evaluated over the N highest probabilities as predicted by the base model.

I did this experiment with phi-2 Q4_K_S. First take the top-N logits of the fp16 model, then select only those indices from the fp16 and quantized model logits, softmax, and then calculate KL divergence (a code sketch of this masking follows the table below). A problem with this approach is that the logits are masked based on the fp16 model's top-N, which can differ from the quantized model's top-N, but the error should be small as long as N isn't too small and the distributions are similar enough that large outliers aren't masked out.

| Top-N | Probability sum | PR | master |
|---|---|---|---|
| All | 100% | 0.038542476 | 0.04308889 |
| 10000 | 99.86% | 0.038461696 | 0.0429979 |
| 1000 | 98.3% | 0.03774189 | 0.042222083 |
| 100 | 93.1% | 0.03527869 | 0.0395158 |
| 40 | 89.1% | 0.03329313 | 0.037291486 |

The second column is the summed probability of the selected indices from the fp16 model. Most of the probability mass is on the few top values, as would be expected. There doesn't appear to be any convergence of the KL divergence values of the two quantization methods at these values, and the contribution of the small probabilities to the KL divergence isn't very large. I chose 40 as the minimum value because that is the default sampler top-k value.
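For reference, a sketch of the masked top-N KL computation described above (illustrative; index selection is based on the fp16 model's top-N, as in the text):

```python
import numpy as np

def topn_kl(logits_ref, logits_quant, n):
    """KL divergence restricted to the fp16 model's top-n token indices (one position)."""
    idx = np.argpartition(logits_ref, -n)[-n:]       # top-n indices of the fp16 model
    lp = logits_ref[idx].astype(np.float64)
    lq = logits_quant[idx].astype(np.float64)
    lp -= lp.max(); lp -= np.log(np.exp(lp).sum())   # log-softmax over the selected subset
    lq -= lq.max(); lq -= np.log(np.exp(lq).sum())
    return float(np.sum(np.exp(lp) * (lp - lq)))
```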

I think the summary is that there are two ways to define which quantized model is better:

  1. The quantized model with output distribution closer to high quality human generated text is better. Evaluated by calculating perplexity on high quality human generated text (wiki.test).
  2. The quantized model with output distribution closer to the unquantized model is better. Can be calculated with KL divergence or perplexity on text generated by unquantized model with T=1.

Normally these two measures would be expected to correlate, but at least in this case they don't seem to agree, and I'm not sure exactly why. It can be argued which one is the better choice, but I'll close this PR since the currently accepted definition is 1 and this PR is worse on that measure.

EDIT: I'm not quite sure about the above after thinking about it further. KL divergence is better for this PR on wiki.test, and the difference is in the largest logits, which would affect the generated tokens. So the output should be closer to the unquantized model in the sense that the distribution of generated logits is closer to it.
