-
I'm surprised that importance matrices trained on 10k/100k/1M tokens barely seem to diverge from each other. While some “overfitting” does seem to occur, it's also significantly less prevalent than one might expect.
-
I think the median / average might not be the smartest measurement. Here's KLD_99 sorted for your q2_K files:
Btw, thank you for helping investigate this, I've been very curious about optimal quantization calibration.
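For reference, KLD_99 here is the 99th percentile of the per-token KL divergence values, i.e. a measure of the worst tokens rather than the typical one. A minimal sketch of computing such a percentile, assuming the per-token values are already in an array (illustrative only, not the actual script used):

```c
#include <stdio.h>
#include <stdlib.h>

// Comparison callback for qsort: ascending order of doubles.
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

// Return the p-quantile (0 <= p <= 1) of n per-token KL divergence values.
// KLD_99 corresponds to p = 0.99. Sorts the array in place.
static double kld_quantile(double *kld, size_t n, double p) {
    qsort(kld, n, sizeof(double), cmp_double);
    size_t idx = (size_t)(p * (double)(n - 1));
    return kld[idx];
}

int main(void) {
    // Toy values standing in for per-token KL divergences of a quantized model.
    double kld[] = {0.01, 0.02, 0.015, 0.8, 0.03, 0.02, 0.05, 0.012, 0.4, 0.02};
    size_t n = sizeof(kld) / sizeof(kld[0]);
    printf("KLD_99 = %f\n", kld_quantile(kld, n, 0.99));
    return 0;
}
```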
-
Working on a script right now that will automatically quantize a bunch of randomized groups of text data and measure KLD_99.
-
@Artefact2 I have a question.
-
group_40.txt

If I use the first 25k tokens of this data:

If I use 40k tokens of Wikitext:

I am measuring KL divergence over 30,000 tokens that use a mix of data (lyrics, conversations, a Wikipedia article or two, etc.), so the sample size should be large enough to rule out any differences there. What especially improves are the harder-to-predict outliers, as noted by KLD_99 and KLD_95. Both are q2_K quantized using the base model of Fett-uccine-7B-GGUF. What doesn't improve over Wikitext is PURELY random data (it's actually a little bit worse); pseudo-random data seems to be optimal, though.
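For clarity, the per-token KL divergence being measured here compares the full-precision model's next-token distribution P against the quantized model's distribution Q over the vocabulary V (the standard definition, written out for reference):

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{t \in V} P(t)\,\log\frac{P(t)}{Q(t)}
```

KLD_95 and KLD_99 are then the 95th and 99th percentiles of these per-token values over the evaluation text, which is why they highlight the hard-to-predict outliers.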
-
This is about 50k pseudo-random tokens.
I recommend using this file for doing imatrix calibration from here on out; imatrix data should generally transfer well across different models.
-
@Artefact2 Nice work!

On using (pseudo-)random data for imatrix generation

If you have a meaningful calibration dataset, I recommend against (pseudo-)random data. A more comprehensive evaluation that does not rely on a single, quite small dataset will tend to favor an imatrix created from textual data. Here is an example:

I'm not using Winogrande and ARC because the test/validation datasets that I have available for these are too small to reveal statistically significant differences. The table summarizes the results:

My take from this data:
-
I am using about 200k tokens randomly sampled from MiniPile, a decent small pretraining dataset, and I'm comparing different context sizes for calibration (q4_K_S, 7b):

2048 ctx calibration, ~200k tokens of pretraining data

4096 ctx calibration, ~200k tokens of pretraining data

8192 ctx calibration, ~200k tokens of pretraining data

512 ctx calibration, ~200k tokens of pretraining data

It would seem that the native context size is the best all around for evaluating the importance matrix: it results in the lowest average divergence as well as the best outlier error reduction.
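In other words, with a fixed ~200k-token budget the number of chunks scales inversely with the calibration context length (assuming the tokens are split evenly), so the runs above compare roughly:

```latex
\frac{200000}{512} \approx 390,\quad
\frac{200000}{2048} \approx 98,\quad
\frac{200000}{4096} \approx 49,\quad
\frac{200000}{8192} \approx 24 \;\text{chunks}
```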
-
groups_merged.txt

Next up, I will be looking into a way to preselect individual pieces of a larger dataset for higher-KL-div / higher-PPL outlier sections, so that the quantization is more robust to outliers (instead of throwing random data at it).
-
@ikawrakow, @Artefact2, considering the benefits of the iMatrix not only in English but also in other languages like my very own French, could one of you guys assemble and share a training file of alternating English / French sequences of text (and, why not, most of the languages broadly supported to some extent by the Llama 2 & Mistral models), allowing an iMatrix to be trained properly so that it benefits all the languages involved? And an eval file for each language, in the fashion of wiki.test.raw?

Ideally, it would work like wiki.train.raw does no matter the ctx chosen (I use 32 and it works quite well, but 128 or 512 are still probably a bit better) and the number of chunks chosen, up to a few thousand per language. For example, if I set 500 chunks on the iMatrix, 250 chunks would be trained in English and 250 chunks in French.

I lack the know-how to do that properly and efficiently while meeting all the aforementioned criteria, but I think it would greatly help a lot of people to be able to make a single iMatrix file and a single series of quants benefiting a maximum number of people, including yours truly.
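For what it's worth, the interleaving itself is simple to script. Here is a rough sketch that alternates fixed-size blocks of an English and a French corpus into one calibration file (file names, block size, and the 50/50 split are placeholder assumptions, and blocks are cut by bytes rather than tokens):

```c
#include <stdio.h>

#define CHUNK_BYTES 2048  // roughly one 512-token chunk of text (assumption)

// Copy up to CHUNK_BYTES from src to dst; returns bytes copied (0 at EOF).
static size_t copy_chunk(FILE *src, FILE *dst) {
    char buf[CHUNK_BYTES];
    size_t n = fread(buf, 1, CHUNK_BYTES, src);
    if (n > 0) fwrite(buf, 1, n, dst);
    return n;
}

// Interleave fixed-size blocks of two corpora so that an imatrix run over
// the result spends about half of its chunks on each language.
int main(int argc, char **argv) {
    if (argc != 4) {
        fprintf(stderr, "usage: %s english.txt french.txt out.txt\n", argv[0]);
        return 1;
    }
    FILE *en = fopen(argv[1], "rb");
    FILE *fr = fopen(argv[2], "rb");
    FILE *out = fopen(argv[3], "wb");
    if (!en || !fr || !out) { perror("fopen"); return 1; }

    // Alternate one block of each language until both inputs are exhausted.
    size_t ne = 1, nf = 1;
    while (ne > 0 || nf > 0) {
        if (ne > 0) ne = copy_chunk(en, out);
        if (nf > 0) nf = copy_chunk(fr, out);
    }
    fclose(en); fclose(fr); fclose(out);
    return 0;
}
```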
-
Is there data that shows the difference between the 20k random dataset and the new pseudo-random data?
-
Also wondering!

Also wondering this. At least in theory, the choice of data could have a big impact. I guess the only way to really test this idea is to create the same model with different data sources for the imatrix and compare ppl? Or is there another way to do this?
-
@ikawrakow Thanks for all the hard work, great job. The link to wiki.train.raw at https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=blog.salesforceairesearch.com is down. Do you know an alternate link? Thanks.
-
Has anyone experimented with adding a small value to the importance matrix weights? If I have understood correctly, the importance matrix weights are an approximate diagonal Hessian. If that is the case, a common way to deal with overfitting is to add a scalar multiple of the identity matrix to the (diagonal or otherwise) Hessian (see: Tikhonov_regularization). The regularized Hessian is then H + λI, where λ ≥ 0 controls the amount of regularization. To determine the optimal value of λ:
From a Bayesian perspective:
It's likely the use of random and semi-random data mentioned in this thread is acting as a "quick and dirty" form of regularisation anyway: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/bishop-tikhonov-nc-95.pdf and IMO it would probably be better to consider doing it in a more principled way, especially considering the calibration dataset is so small and the imatrix computation isn't using the full context nor the correct prompt format, etc. Also, has anyone actually looked into a full Hessian approximation, or at least checked that the off-diagonals are small? If there is any significant multicollinearity among the weights, then using only the diagonal of the Hessian like this could have a serious impact on the model. How feasible is it to compute the full Hessian approximation by summing the outer products of the activations instead of just their squares?
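For illustration, a minimal sketch of what Tikhonov-style smoothing could look like applied to the imatrix values, assuming they are available as a plain float array per tensor (the function name and the choice to scale λ by the mean entry are my own, not anything in llama.cpp):

```c
#include <stddef.h>

// Tikhonov-style regularization of approximate diagonal Hessian entries:
// H_reg = H + lambda*I. Scaling lambda by the mean entry keeps the amount
// of smoothing comparable across tensors of different magnitude.
static void regularize_imatrix(float *diag_hessian, size_t n, float lambda) {
    float mean = 0.0f;
    for (size_t i = 0; i < n; ++i) mean += diag_hessian[i];
    mean /= (float)n;

    // Adding a constant shrinks the relative spread of the importance
    // weights, pulling the quantization back toward the unweighted case
    // as lambda grows; lambda = 0 leaves the imatrix unchanged.
    for (size_t i = 0; i < n; ++i) diag_hessian[i] += lambda * mean;
}
```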
-
lol
-
People put a lot of weight on it being English (and not code or another language), but I think the biggest problem is the data leakage between the calibration set and the test set. If you don't know what I mean by "data leakage", then this might help: https://gwern.net/tank

There are ways we could correct for this bias in the reported drops in PPL, but most are nonparametric and would require quite a lot of extra computation...
-
I was just laughing at Wikipedia being "factual." I know what was meant; it was a half joke. On a serious note though, we probably shouldn't be using Wikipedia (I'm currently using it too). I don't know how much the gross political bias matters for imatrices, but these LLMs being trained on things like Wikipedia and Reddit concerns me.
-
Just trying to wrap my head around how these weights are used:

So the quantization picks the scale d that minimizes the weighted squared error Σ_i w_i (x_i − d·q_i)², where the x_i are the original values and the q_i the integer quants.

So if we solve for d, the weighted version is:

d = Σ_i w_i x_i q_i / Σ_i w_i q_i²

and the unweighted version is just:

d = Σ_i x_i q_i / Σ_i q_i²

So I'm trying to see how the imatrix values actually enter into this. What is special about weighting by qw[j] * sqrtf(sigma2 + x[j]*x[j]) rather than using the imatrix values directly? We can also look at how the other quant types use these weights.
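To make the least-squares step concrete, here is the weighted scale as a small standalone function, assuming the integer quants q_i are already fixed (a simplification: make_qx_quants also searches over the quant values themselves, this only shows how the weights enter the scale):

```c
#include <stddef.h>

// Scale d minimizing sum_i w_i * (x_i - d*q_i)^2 for fixed integer quants q.
// Passing w == NULL gives the unweighted solution (all w_i = 1).
static float weighted_scale(const float *x, const signed char *q, const float *w, size_t n) {
    float num = 0.0f, den = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float wi = w ? w[i] : 1.0f;
        num += wi * x[i] * (float)q[i];        // sum_i w_i x_i q_i
        den += wi * (float)q[i] * (float)q[i]; // sum_i w_i q_i^2
    }
    return den > 0.0f ? num / den : 0.0f;
}
```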
-
Edit: Never mind. Not using the latest llama.cpp, the latest quantizations of Llama 3 8B Instruct, and the right settings was the cause of the hallucination spike I was seeing.
-
Nope, I've looked at this all sorts of ways and can't for the life of me see where the use of sqrtf(sigma2 + xb[j]*xb[j]) as the weighting factor comes from:

```c
static void quantize_row_q4_0_impl(const float * restrict x, block_q4_0 * restrict y, int64_t n_per_row, const float * quant_weights) {
static_assert(QK4_0 == 32, "QK4_0 must be 32");
if (!quant_weights) {
quantize_row_q4_0_reference(x, y, n_per_row);
return;
}
float weight[QK4_0];
int8_t L[QK4_0];
float sum_x2 = 0;
for (int j = 0; j < n_per_row; ++j) sum_x2 += x[j]*x[j];
float sigma2 = sum_x2/n_per_row;
const int64_t nb = n_per_row/QK4_0;
for (int ib = 0; ib < nb; ++ib) {
const float * xb = x + QK4_0 * ib;
const float * qw = quant_weights + QK4_0 * ib;
for (int j = 0; j < QK4_0; ++j) weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j]);
float d = make_qx_quants(QK4_0, 8, xb, L, 1, weight);
y[ib].d = GGML_FP32_TO_FP16(d);
for (int j = 0; j < 16; ++j) {
y[ib].qs[j] = L[j] | (L[j+16] << 4);
}
}
}
```

For this to be consistent:
All I can think is that the sqrtf(sigma2 + x*x) factor was found to work better empirically. Reformulating this to use a proper weighted least-squares objective would at least make the intent explicit. Also, the code in the other quantize_row_*_impl functions isn't consistent about it. Even just looking for the places where the sigma2 weighting has been commented out:

```c
//float sum_x2 = 0;
//for (int j = 0; j < QK_K; ++j) sum_x2 += x[j]*x[j];
//float sigma2 = sum_x2/QK_K;
float max_scale = 0;
float max_abs_scale = 0;
for (int ib = 0; ib < QK_K/16; ++ib) {
float scale;
if (quant_weights) {
const float * qw = quant_weights + QK_K*i + 16*ib;
//for (int j = 0; j < 16; ++j) weights[j] = qw[j] * sqrtf(sigma2 + x[16*ib + j]*x[16*ib + j]);
//scale = make_qx_quants(16, 32, x + 16*ib, L + 16*ib, 1, weights);
scale = make_qx_quants(16, 32, x + 16*ib, L + 16*ib, 1, qw);
} else {
scale = make_qx_quants(16, 32, x + 16*ib, L + 16*ib, 1, NULL);
}
scales[ib] = scale;
```

and so on... It's not as though this code has to be really efficient either. All the sections of code like the above could easily be refactored into a single function instead of repeating it with slight tweaks 15-20 times.
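As a rough illustration of the kind of refactor being suggested, the weight-construction step could live in one shared helper that every quantize_row_*_impl calls, with the sqrt(sigma2 + x^2) variant behind a flag (the name and the flag are hypothetical, not existing llama.cpp API):

```c
#include <math.h>

// Hypothetical shared helper: build the per-value weights used by the
// quantization search in one place instead of once per quant type.
static void make_quant_weights(int n, const float *x, const float *qw,
                               float sigma2, int use_sigma2, float *weight) {
    for (int j = 0; j < n; ++j) {
        if (!qw) {
            // No imatrix: fall back to weighting by the squared magnitude of x.
            weight[j] = x[j] * x[j];
        } else if (use_sigma2) {
            // Variant used by quantize_row_q4_0_impl above.
            weight[j] = qw[j] * sqrtf(sigma2 + x[j] * x[j]);
        } else {
            // Variant that uses the imatrix values directly.
            weight[j] = qw[j];
        }
    }
}
```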
-
The "The Unreasonable Ineffectiveness of the Deeper Layers" paper, as implemented by Charles Goddard in PruneMe, suggests that a lot of the later blocks are doing very little, and a refactored version of the code above would make it easier to experiment with that.
-
Imatrix has been here for a while and I haven't seen many guidelines (or testing at all) on how to use it. Common objections/concerns are overfitting, and generating the imatrix on the "wrong" kind of text. See #5006.

To try and gather some data, I tried three datasets for training/testing and three different numbers of chunks (10K = 20 chunks of 512 tokens, 100K = 200 chunks, 1M = 2000 chunks) used for calculating the imatrix.

- frwiki is part of a raw XML dump of the French Wikipedia. It contains a mix of structured XML data, French text, and wikicode markup.
- mbotf is concatenated text of my Malazan Book of the Fallen books. It's English fiction.
- wiki is the wikitext we all know and seem to use. It contains factual English text.

I used Mistral-7B to calculate the median KL divergence for Q2_K quants generated with all nine possible imatrices on all three test datasets, and no imatrix as a baseline.
Looking forward to your opinions on the results, or about the methodology. For now, I'll keep using wikitext with 100K tokens. Might not always be optimal depending on the model's use case, but it seems unlikely to make things worse.
Raw data: imatrix-tests.zip