-
I'm surprised that importance matrices trained on 10k/100k/1M tokens barely seem to diverge from each other. While some “overfitting” does seem to occur, it's also significantly less prevalent than one might expect.
-
I think the median / average might not be the smartest measurement. Here's KLD_99 sorted for your q2_K files:
Btw, thank you for helping investigate this, I've been very curious about optimal quantization calibration.
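For reference, KLD_99 here is the 99th percentile of the per-token KL divergence values, i.e. a measure of the worst tokens rather than the typical one. A minimal sketch of computing such a percentile, assuming the per-token values are already in an array (illustrative only, not the actual script used):

```c
#include <stdio.h>
#include <stdlib.h>

// Comparison callback for qsort: ascending order of doubles.
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

// Return the p-quantile (0 <= p <= 1) of n per-token KL divergence values.
// KLD_99 corresponds to p = 0.99. Sorts the array in place.
static double kld_quantile(double *kld, size_t n, double p) {
    qsort(kld, n, sizeof(double), cmp_double);
    size_t idx = (size_t)(p * (double)(n - 1));
    return kld[idx];
}

int main(void) {
    // Toy values standing in for per-token KL divergences of a quantized model.
    double kld[] = {0.01, 0.02, 0.015, 0.8, 0.03, 0.02, 0.05, 0.012, 0.4, 0.02};
    size_t n = sizeof(kld) / sizeof(kld[0]);
    printf("KLD_99 = %f\n", kld_quantile(kld, n, 0.99));
    return 0;
}
```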
-
Working on a script right now that will automatically quantize a bunch of randomized groups of text data and measure KLD_99.
-
@Artefact2 I have a question.
-
group_40.txt

If I use the first 25k tokens of this data:

If I use 40k tokens of Wikitext:

I am measuring KL divergence over 30,000 tokens that use a mix of data (lyrics, conversations, a Wikipedia article or two, etc.), so the sample size should be large enough to rule out any differences there. What especially improves are the harder-to-predict outliers, as noted by KLD_99 and KLD_95. Both are q2_K quantized using the base model of Fett-uccine-7B-GGUF. What doesn't improve over Wikitext is PURELY random data (it's actually a little bit worse); pseudo-random data seems to be optimal, though.
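For clarity, the per-token KL divergence being measured here compares the full-precision model's next-token distribution P against the quantized model's distribution Q over the vocabulary V (the standard definition, written out for reference):

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{t \in V} P(t)\,\log\frac{P(t)}{Q(t)}
```

KLD_95 and KLD_99 are then the 95th and 99th percentiles of these per-token values over the evaluation text, which is why they highlight the hard-to-predict outliers.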
-
This is about 50k pseudo-random tokens.
I recommend using this file for doing imatrix calibration from here on out; imatrix data should generally transfer well across different models.
-
@Artefact2 Nice work!

On using (pseudo-)random data for imatrix generation

If you have a meaningful calibration dataset, I recommend against (pseudo-)random data. A more comprehensive evaluation that does not rely on a single, quite small dataset will tend to favor an imatrix created from textual data. Here is an example:

I'm not using Winogrande and ARC because the test/validation datasets that I have available for these are too small to reveal statistically significant differences. The table summarizes the results:

My take from this data:
-
I am using about 200k tokens randomly sampled from MiniPile, a decent small pretraining dataset, and I'm comparing different context sizes for calibration (q4_K_S, 7b):

2048 ctx calibration, ~200k tokens of pretraining data

4096 ctx calibration, ~200k tokens of pretraining data

8192 ctx calibration, ~200k tokens of pretraining data

512 ctx calibration, ~200k tokens of pretraining data

It would seem that the native context size is the best all around for evaluating the importance matrix: it results in the lowest average divergence as well as the best outlier error reduction.
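In other words, with a fixed ~200k-token budget the number of chunks scales inversely with the calibration context length (assuming the tokens are split evenly), so the runs above compare roughly:

```latex
\frac{200000}{512} \approx 390,\quad
\frac{200000}{2048} \approx 98,\quad
\frac{200000}{4096} \approx 49,\quad
\frac{200000}{8192} \approx 24 \;\text{chunks}
```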
-
groups_merged.txt

Next up, I will be looking into a way to preselect individual pieces of a larger dataset for higher-KL-div / higher-PPL outlier sections, so that the quantization is more robust to outliers (instead of throwing random data at it).
-
@ikawrakow, @Artefact2, considering the benefits of the iMatrix not only in English but also in other languages like my very own French, could one of you guys assemble and share a training file of alternating English / French sequences of text (and, why not, most of the languages broadly supported to some extent by the Llama 2 & Mistral models), allowing an iMatrix to be trained properly so that it benefits all the languages involved? And an eval file for each language, in the fashion of wiki.test.raw?

Ideally, it would work like wiki.train.raw does no matter the ctx chosen (I use 32 and it works quite well, but 128 or 512 are still probably a bit better) and the number of chunks chosen, up to a few thousand per language. For example, if I set 500 chunks on the iMatrix, 250 chunks would be trained in English and 250 chunks in French.

I lack the know-how to do that properly and efficiently while meeting all the aforementioned criteria, but I think it would greatly help a lot of people to be able to make a single iMatrix file and a single series of quants benefiting a maximum number of people, including yours truly.
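For what it's worth, the interleaving itself is simple to script. Here is a rough sketch that alternates fixed-size blocks of an English and a French corpus into one calibration file (file names, block size, and the 50/50 split are placeholder assumptions, and blocks are cut by bytes rather than tokens):

```c
#include <stdio.h>

#define CHUNK_BYTES 2048  // roughly one 512-token chunk of text (assumption)

// Copy up to CHUNK_BYTES from src to dst; returns bytes copied (0 at EOF).
static size_t copy_chunk(FILE *src, FILE *dst) {
    char buf[CHUNK_BYTES];
    size_t n = fread(buf, 1, CHUNK_BYTES, src);
    if (n > 0) fwrite(buf, 1, n, dst);
    return n;
}

// Interleave fixed-size blocks of two corpora so that an imatrix run over
// the result spends about half of its chunks on each language.
int main(int argc, char **argv) {
    if (argc != 4) {
        fprintf(stderr, "usage: %s english.txt french.txt out.txt\n", argv[0]);
        return 1;
    }
    FILE *en = fopen(argv[1], "rb");
    FILE *fr = fopen(argv[2], "rb");
    FILE *out = fopen(argv[3], "wb");
    if (!en || !fr || !out) { perror("fopen"); return 1; }

    // Alternate one block of each language until both inputs are exhausted.
    size_t ne = 1, nf = 1;
    while (ne > 0 || nf > 0) {
        if (ne > 0) ne = copy_chunk(en, out);
        if (nf > 0) nf = copy_chunk(fr, out);
    }
    fclose(en); fclose(fr); fclose(out);
    return 0;
}
```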
-
Is there data that shows the difference between the 20k random dataset and the new pseudo-random data?
-
Also wondering!

Also wondering this. At least in theory, the choice of data could have a big impact. I guess the only way to really test this idea is to create the same model with different data sources for the imatrix and compare ppl? Or is there another way to do this?
-
@ikawrakow Thanks for all the hard work, great job. The link to wiki.train.raw at https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=blog.salesforceairesearch.com is down. Do you know an alternate link? Thanks.
-
Has anyone experimented with adding a small value to the importance matrix weights? If I have understood correctly, the importance matrix weights are an approximate diagonal Hessian. If that is the case, a common way to deal with overfitting is to add a scalar multiple of the identity matrix to the (diagonal or otherwise) Hessian (see: Tikhonov_regularization). The regularized Hessian is then H + λI, where λ ≥ 0 controls the amount of regularization. To determine the optimal value of λ:
From a Bayesian perspective:
It's likely the use of random and semi-random data mentioned in this thread is acting as a "quick and dirty" form of regularisation anyway: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/bishop-tikhonov-nc-95.pdf and IMO it would probably be better to consider doing it in a more principled way, especially considering the calibration dataset is so small and the imatrix computation isn't using the full context nor the correct prompt format, etc. Also, has anyone actually looked into a full Hessian approximation, or at least checked that the off-diagonals are small? If there is any significant multicollinearity among the weights, then using only the diagonal of the Hessian like this could have a serious impact on the model. How feasible is it to compute the full Hessian approximation by summing the outer products of the activations instead of just their squares?
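For illustration, a minimal sketch of what Tikhonov-style smoothing could look like applied to the imatrix values, assuming they are available as a plain float array per tensor (the function name and the choice to scale λ by the mean entry are my own, not anything in llama.cpp):

```c
#include <stddef.h>

// Tikhonov-style regularization of approximate diagonal Hessian entries:
// H_reg = H + lambda*I. Scaling lambda by the mean entry keeps the amount
// of smoothing comparable across tensors of different magnitude.
static void regularize_imatrix(float *diag_hessian, size_t n, float lambda) {
    float mean = 0.0f;
    for (size_t i = 0; i < n; ++i) mean += diag_hessian[i];
    mean /= (float)n;

    // Adding a constant shrinks the relative spread of the importance
    // weights, pulling the quantization back toward the unweighted case
    // as lambda grows; lambda = 0 leaves the imatrix unchanged.
    for (size_t i = 0; i < n; ++i) diag_hessian[i] += lambda * mean;
}
```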
-
lol
-
People put a lot of weight on it being English (and not code or another language), but I think the biggest problem is the data leakage between the calibration set and the test set. If you don't know what I mean by "data leakage", then this might help: https://gwern.net/tank

There are ways we could correct for this bias in the reported drops in PPL, but most are nonparametric and would require quite a lot of extra computation...
-
I was just laughing at Wikipedia being "factual." I know what was meant; it was a half joke. On a serious note though, we probably shouldn't be using Wikipedia (I'm currently using it too). I don't know how much the gross political bias matters for imatrices, but these LLMs being trained on things like Wikipedia and Reddit concerns me.
-
Just trying to wrap my head around how these weights are used:

So the quantization picks the scale d that minimizes the weighted squared error Σ_i w_i (x_i − d·q_i)², where the x_i are the original values and the q_i the integer quants.

So if we solve for d, the weighted version is:

d = Σ_i w_i x_i q_i / Σ_i w_i q_i²

and the unweighted version is just:

d = Σ_i x_i q_i / Σ_i q_i²

So I'm trying to see how the imatrix values actually enter into this. What is special about weighting by qw[j] * sqrtf(sigma2 + x[j]*x[j]) rather than using the imatrix values directly? We can also look at how the other quant types use these weights.
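To make the least-squares step concrete, here is the weighted scale as a small standalone function, assuming the integer quants q_i are already fixed (a simplification: make_qx_quants also searches over the quant values themselves, this only shows how the weights enter the scale):

```c
#include <stddef.h>

// Scale d minimizing sum_i w_i * (x_i - d*q_i)^2 for fixed integer quants q.
// Passing w == NULL gives the unweighted solution (all w_i = 1).
static float weighted_scale(const float *x, const signed char *q, const float *w, size_t n) {
    float num = 0.0f, den = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float wi = w ? w[i] : 1.0f;
        num += wi * x[i] * (float)q[i];        // sum_i w_i x_i q_i
        den += wi * (float)q[i] * (float)q[i]; // sum_i w_i q_i^2
    }
    return den > 0.0f ? num / den : 0.0f;
}
```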
-
Edit: Never mind. Not using the latest llama.cpp, the latest quantizations of Llama 3 8B Instruct, and the right settings was the cause of the hallucination spike I was seeing.
-
Nope, I've looked at this all sorts of ways and can't for the life of me see where the use of sqrtf(sigma2 + xb[j]*xb[j]) as the weighting factor comes from:

```c
static void quantize_row_q4_0_impl(const float * restrict x, block_q4_0 * restrict y, int64_t n_per_row, const float * quant_weights) {
static_assert(QK4_0 == 32, "QK4_0 must be 32");
if (!quant_weights) {
quantize_row_q4_0_reference(x, y, n_per_row);
return;
}
float weight[QK4_0];
int8_t L[QK4_0];
float sum_x2 = 0;
for (int j = 0; j < n_per_row; ++j) sum_x2 += x[j]*x[j];
float sigma2 = sum_x2/n_per_row;
const int64_t nb = n_per_row/QK4_0;
for (int ib = 0; ib < nb; ++ib) {
const float * xb = x + QK4_0 * ib;
const float * qw = quant_weights + QK4_0 * ib;
for (int j = 0; j < QK4_0; ++j) weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j]);
float d = make_qx_quants(QK4_0, 8, xb, L, 1, weight);
y[ib].d = GGML_FP32_TO_FP16(d);
for (int j = 0; j < 16; ++j) {
y[ib].qs[j] = L[j] | (L[j+16] << 4);
}
}
}
```

For this to be consistent:
All I can think is that the sqrtf(sigma2 + x*x) factor was found to work better empirically. Reformulating this to use a proper weighted least-squares objective would at least make the intent explicit. Also, the code in the other quantize_row_*_impl functions isn't consistent about it. Even just looking for the places where the sigma2 weighting has been commented out:

```c
//float sum_x2 = 0;
//for (int j = 0; j < QK_K; ++j) sum_x2 += x[j]*x[j];
//float sigma2 = sum_x2/QK_K;
float max_scale = 0;
float max_abs_scale = 0;
for (int ib = 0; ib < QK_K/16; ++ib) {
float scale;
if (quant_weights) {
const float * qw = quant_weights + QK_K*i + 16*ib;
//for (int j = 0; j < 16; ++j) weights[j] = qw[j] * sqrtf(sigma2 + x[16*ib + j]*x[16*ib + j]);
//scale = make_qx_quants(16, 32, x + 16*ib, L + 16*ib, 1, weights);
scale = make_qx_quants(16, 32, x + 16*ib, L + 16*ib, 1, qw);
} else {
scale = make_qx_quants(16, 32, x + 16*ib, L + 16*ib, 1, NULL);
}
scales[ib] = scale;
```

and so on... It's not as though this code has to be really efficient either. All the sections of code like the above could easily be refactored into a single function instead of repeating it with slight tweaks 15-20 times.
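As a rough illustration of the kind of refactor being suggested, the weight-construction step could live in one shared helper that every quantize_row_*_impl calls, with the sqrt(sigma2 + x^2) variant behind a flag (the name and the flag are hypothetical, not existing llama.cpp API):

```c
#include <math.h>

// Hypothetical shared helper: build the per-value weights used by the
// quantization search in one place instead of once per quant type.
static void make_quant_weights(int n, const float *x, const float *qw,
                               float sigma2, int use_sigma2, float *weight) {
    for (int j = 0; j < n; ++j) {
        if (!qw) {
            // No imatrix: fall back to weighting by the squared magnitude of x.
            weight[j] = x[j] * x[j];
        } else if (use_sigma2) {
            // Variant used by quantize_row_q4_0_impl above.
            weight[j] = qw[j] * sqrtf(sigma2 + x[j] * x[j]);
        } else {
            // Variant that uses the imatrix values directly.
            weight[j] = qw[j];
        }
    }
}
```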
-
The "The Unreasonable Ineffectiveness of the Deeper Layers" paper, as implemented by Charles Goddard in PruneMe, suggests that a lot of the later blocks are doing very little, and a refactored version of the code above would make it easier to experiment with that.
-
Imatrix has been here for a while and I haven't seen many guidelines (or testing at all) on how to use it. Common objections/concerns are overfitting, and generating the imatrix on the "wrong" kind of text. See #5006.

To try and gather some data, I tried three datasets for training/testing and three different numbers of chunks (10K = 20 chunks of 512 tokens, 100K = 200 chunks, 1M = 2000 chunks) used for calculating the imatrix.

- frwiki is part of a raw XML dump of the French Wikipedia. It contains a mix of structured XML data, French text, and wikicode markup.
- mbotf is concatenated text of my Malazan Book of the Fallen books. It's English fiction.
- wiki is the wikitext we all know and seem to use. It contains factual English text.

I used Mistral-7B to calculate the median KL divergence for Q2_K quants generated with all nine possible imatrices on all three test datasets, and no imatrix as a baseline.
Looking forward to your opinions on the results, or about the methodology. For now, I'll keep using wikitext with 100K tokens. Might not always be optimal depending on the model's use case, but it seems unlikely to make things worse.
Raw data: imatrix-tests.zip