iq3_xxs: guards for the no-imatrix situation #5334
Merged
`IQ3_XXS` can give a very bad quantization when used without an importance matrix (imatrix), see #5332.

Instead of adding a warning or even disallowing `IQ3_XXS` quantization without an imatrix, this PR prevents a bad outcome by using `Q3_K` for the `attn_v` tensors, and a mix of `Q4_K` and `Q3_K` for the `ffn_down` tensors, when no imatrix has been supplied. This results in a somewhat larger quantized model (e.g., 2.61 GiB vs 2.5 GiB for 7B LLaMAs) but a more reasonable PPL (e.g., `5.4923` for LLaMA-v2-7B at a context of 4096, vs `100+`).
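
The fallback described above can be sketched roughly as follows. This is an illustrative Python model, not the code from the PR: the helper name `pick_quant_type`, its parameters, and the tensor-name matching are hypothetical. The description does not say which layers get `Q4_K` and which get `Q3_K` for `ffn_down`, so the "first eighth of the layers" split below is an assumption. The sketch also simplifies by treating every other tensor, and every tensor when an imatrix is present, as plain `IQ3_XXS`.

```python
def pick_quant_type(tensor_name: str, has_imatrix: bool,
                    i_layer: int = 0, n_layer: int = 32) -> str:
    """Hypothetical helper: choose a quantization type for one tensor
    when the requested file type is IQ3_XXS."""
    if has_imatrix:
        # Simplification: with an imatrix, keep IQ3_XXS everywhere.
        return "IQ3_XXS"
    if "attn_v" in tensor_name:
        # No imatrix: attn_v tensors fall back to Q3_K.
        return "Q3_K"
    if "ffn_down" in tensor_name:
        # No imatrix: ffn_down gets a mix of Q4_K and Q3_K.
        # ASSUMPTION: the first n_layer // 8 layers use Q4_K; the PR text
        # does not specify how the mix is distributed.
        return "Q4_K" if i_layer < n_layer // 8 else "Q3_K"
    # All other tensors are left at IQ3_XXS in this sketch.
    return "IQ3_XXS"


if __name__ == "__main__":
    for name, layer in [("blk.0.attn_v.weight", 0),
                        ("blk.0.ffn_down.weight", 0),
                        ("blk.20.ffn_down.weight", 20),
                        ("blk.0.attn_q.weight", 0)]:
        print(name, "->", pick_quant_type(name, has_imatrix=False, i_layer=layer))
```

The point of the design is that only the guarded tensors change type, which is why the model grows only slightly (2.61 GiB vs 2.5 GiB in the 7B example) instead of moving to a larger quantization for every tensor.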