llama : tool for evaluating quantization results per layer #2783
Comments
I experimented with cutting outliers in the weight distribution to get a better weight spread in the quantization. It seemed to work well to measure the standard deviation of the tensor weights (in each normalization block) and cut all weights that fell outside of about 4 to 5 SDs. I only tested this approach using the same cutting point on all tensors, but I guess the best number of SDs to use as the cutting point will depend on the tensor.
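As a rough illustration of the clipping idea described above (not code from the project), here is a minimal NumPy sketch that clips the weights of each block to ±k standard deviations before quantization; the block size of 32 and the cut-off k are assumptions for the example.

```python
import numpy as np

def clip_outliers_per_block(weights: np.ndarray, block_size: int = 32, k: float = 4.5) -> np.ndarray:
    """Clip weights that fall outside +/- k standard deviations within each block.

    This mirrors the idea of cutting outliers so that the remaining weights
    use the quantization range more evenly. Block size and k are illustrative.
    """
    w = weights.reshape(-1, block_size).astype(np.float32)
    mean = w.mean(axis=1, keepdims=True)
    sd = w.std(axis=1, keepdims=True)
    clipped = np.clip(w, mean - k * sd, mean + k * sd)
    return clipped.reshape(weights.shape)

# Example: a tensor with a few large outliers
rng = np.random.default_rng(0)
t = rng.normal(size=1024).astype(np.float32)
t[::137] *= 20.0  # inject outliers
print(np.abs(t).max(), np.abs(clip_outliers_per_block(t)).max())
```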
Inplace operations would also need to be disabled, otherwise they will overwrite the result of the previous operation. I am not sure it is worth keeping the inplace operations at all; they create other issues, and if memory usage is a concern, ggml-alloc will already make operations inplace automatically when possible.
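To illustrate why inplace operations get in the way here (a generic sketch, not ggml code): when an operation writes its output into its input's buffer, the earlier intermediate result is gone by the time the graph finishes, so it can no longer be dumped for per-node statistics.

```python
import numpy as np

x = np.arange(4, dtype=np.float32)

# Out-of-place: the intermediate result `y` survives and can be inspected later.
y = x * 2.0
z = y + 1.0
print("intermediate still available:", y)

# Inplace: the buffer of `x` is reused, so the earlier result is overwritten
# and cannot be compared between the two graphs afterwards.
np.multiply(x, 2.0, out=x)
np.add(x, 1.0, out=x)
print("only the final value remains:", x)
```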
I'm currently gathering data regarding quantization quality metrics as indicated in #2657. I will write up a proper report in the next few days, but one of my preliminary findings is that quantization mostly adds noise to the logits, which then manifests as higher perplexity due to asymmetry in the calculation. So I think a good method for investigating per-tensor sensitivity to quantization would be to take an unquantized model, add noise to a tensor, and then look at how this changes the perplexity/output logits.
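A small Monte Carlo sketch (an illustration under assumed values, not the referenced report) of why zero-mean noise on the logits still shows up as higher perplexity: the log-sum-exp term in the cross-entropy is convex, so symmetric noise biases the loss upward.

```python
import numpy as np

rng = np.random.default_rng(0)

def nll(logits: np.ndarray, target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[target])

logits = rng.normal(size=32000)         # vocab-sized logit vector (illustrative)
target = int(logits.argmax())

base = nll(logits, target)
noisy = np.mean([nll(logits + rng.normal(scale=0.5, size=logits.shape), target)
                 for _ in range(200)])

# The noisy average sits systematically above the clean value even though the
# added noise has zero mean -- this asymmetry is what raises perplexity.
print(f"clean NLL: {base:.4f}  mean noisy NLL: {noisy:.4f}")
```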
Not stale, though low-prio |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Following up on #2421, I think we should implement some better way to observe at which point of the inference the results start to deviate significantly between the classical and quantum models.
So I'm thinking of adding a simple tool that takes as input 2 ggml exported graphs - one classical and one quantum, of the same model. The tool evals both graphs on the CPU using ggml and prints detailed statistical information about the intermediate F32 results after each graph node. For example, each result node that has been given a name will be compared, and we'll print stuff like min, max, avg, var, etc.
I'm hoping that with such a tool we'll be able to detect which nodes in the computation require more precision in order to keep the quantization differences small enough, and that it will hopefully become an automated way of deciding which tensors require more bits than others.
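As a rough sketch of the kind of per-node comparison such a tool could print (assumed node names, data, and output layout, not an actual implementation), given the intermediate F32 results of the classical and the quantum graph for each named node:

```python
import numpy as np

def compare_node(name: str, classical: np.ndarray, quantum: np.ndarray) -> None:
    """Print summary statistics for a pair of intermediate F32 results."""
    diff = classical.astype(np.float64) - quantum.astype(np.float64)
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    print(f"{name:24s} "
          f"min={quantum.min():+.4f} max={quantum.max():+.4f} "
          f"avg={quantum.mean():+.4f} var={quantum.var():.4f} "
          f"rmse_vs_f32={rmse:.6f}")

# Hypothetical usage with two dicts of node name -> intermediate result
rng = np.random.default_rng(0)
classical_results = {"blk.0.attn_output": rng.normal(size=(32, 4096)).astype(np.float32)}
quantum_results = {k: v + rng.normal(scale=0.01, size=v.shape).astype(np.float32)
                   for k, v in classical_results.items()}

for node_name, ref in classical_results.items():
    compare_node(node_name, ref, quantum_results[node_name])
```

Nodes whose statistics drift the most between the two runs would be the natural candidates for keeping at higher precision.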
cc @slaren I know you had similar ideas - we can discuss here how to add such support.
Currently I think the ggml graph export/import will be fairly trivial to utilize and will require almost no intervention in the existing llama.cpp implementation. The only thing we might have to take into account is to disable the allocator when exporting the graph, so that all results are available in memory after the computation.