Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B #2807
Merged
Conversation
I'd assume the same should apply to 34B?
ggerganov approved these changes Aug 26, 2023
llama.cpp (Outdated)

@@ -4678,6 +4682,10 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
                ++n_feed_forward_w2;
            }
        }
        if (n_attention_wv != n_feed_forward_w2 || (uint32_t)n_attention_wv != model.hparams.n_layer) {
            fprintf(stderr, "============ Strange model: n_attention_wv = %d, n_feed_forward_w2 = %d, hparams.n_layer = %d\n",
Use LLAMA_LOG_WARN with __func__ prefix as all other logs
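A minimal sketch of what the suggested change might look like, assuming `LLAMA_LOG_WARN` accepts printf-style arguments like the project's other logging macros; this is not the exact merged code:

```cpp
// Sketch: replace the raw fprintf with the project's warning macro and
// prefix the message with __func__, as the other log messages do.
if (n_attention_wv != n_feed_forward_w2 || (uint32_t) n_attention_wv != model.hparams.n_layer) {
    LLAMA_LOG_WARN("%s: ============ Strange model: n_attention_wv = %d, n_feed_forward_w2 = %d, hparams.n_layer = %d\n",
            __func__, n_attention_wv, n_feed_forward_w2, model.hparams.n_layer);
}
```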
mattgauf added a commit to mattgauf/llama.cpp that referenced this pull request Aug 26, 2023
* master: (773 commits)
  server : add `/detokenize` endpoint (ggerganov#2802)
  convert.py : advanced option (ggerganov#2753)
  llama : use Unicode Escape Sequence to replace encoded characters (ggerganov#2814)
  flake.nix : add rocm support and cleanup (ggerganov#2808)
  llama : move #includes out of _GNU_SOURCE conditional (ggerganov#2817)
  main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (ggerganov#1528)
  llama : use std::abs in llama_sample_tail_free (ggerganov#2800)
  k-quants : remove unnecessary tensor shape restrictions (ggerganov#2811)
  Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggerganov#2807)
  Fix HellaSwag (ggerganov#2805)
  flake : build llama.cpp on Intel with nix (ggerganov#2795)
  Handle null rope scaling value (ggerganov#2793)
  Fix spm whitespaces (ggerganov#2806)
  examples : skip unnecessary external lib in server README.md how-to (ggerganov#2804)
  llama : fix struct decl (ggerganov#2790)
  Faster perplexity computation (ggerganov#2786)
  llama : add llama_beam_search() (ggerganov#2267)
  convert.py : Get rope scale from HuggingFace models (ggerganov#2772)
  llama-bench : add model sizes (ggerganov#2771)
  convert.py : export rope freq_base when converting CodeLlama from an HF model (ggerganov#2773)
  ...
akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggerganov#2807)

* Better perplexity for 2- and 3-bit quantization for the 70B model

* PR comment

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
How long is the context for the perplexity values in the table, @ikawrakow?

512 tokens
In LLaMA-v2-70B, eight heads share the same `K` and `V` attention tensors, and as a result these are 8x smaller than the attention `Q` tensor. The attention `V` tensor is quite important for generation quality, so it is often quantized with more bits when using k_quants. Given this, we can get a nice improvement in perplexity (as a measure of generation quality) with a negligible increase in quantized model size by quantizing the entire attention `V` tensor with 5 bits whenever the k_quants logic has decided to quantize it with 3 or 4 bits. The table shows the PPL change for a subset of the k_quants:
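As a rough illustration of the override described above, here is a sketch of the decision; the helper name and the GQA flag are assumptions made for illustration, not the actual llama.cpp quantization code:

```cpp
#include "ggml.h"

// Illustrative sketch only: bump the attention V tensor to 5-bit k-quants when
// the base k_quants choice was 3 or 4 bits and the model uses grouped-query
// attention (as in LLaMA-v2-70B, where 8 heads share the K/V tensors), since V
// is 8x smaller than Q there and matters a lot for generation quality.
static ggml_type pick_attn_v_type(ggml_type base_type, bool is_gqa_model) {
    if (is_gqa_model && (base_type == GGML_TYPE_Q3_K || base_type == GGML_TYPE_Q4_K)) {
        return GGML_TYPE_Q5_K;  // negligible size increase, noticeably lower perplexity
    }
    return base_type;
}
```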