
llama : tool for evaluating quantization results per layer #2783

Open
ggerganov opened this issue Aug 25, 2023 · 8 comments
Labels: enhancement (New feature or request), generation quality (Quality of model output)

Comments

@ggerganov (Owner)

Following up on #2421, I think we should implement a better way to observe at which point of the inference the results start to deviate significantly between the classical (full-precision) and quantized models.

So I'm thinking of adding a simple tool that takes as input two exported ggml graphs of the same model - one full-precision and one quantized. The tool evaluates both graphs on the CPU using ggml and prints detailed statistics of the intermediate F32 results after each graph node. For example, each result node that has been given a name will be compared, and we'll print values such as min, max, avg, var, etc.

I'm hoping such a tool will be able to detect which nodes in the computation require more precision in order to keep the quantization differences small enough, and hopefully it will become an automated way of deciding which tensors require more bits than others.
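
For illustration, a minimal sketch of what the per-node statistics could look like (compare_node and node_stats are hypothetical names, assuming the intermediate results of both graphs are F32 buffers of equal length):

```cpp
// Hypothetical sketch of the per-node statistics: given the reference F32
// result and the corresponding result from the quantized graph, compute
// min/max/mean/variance of the quantized output plus the RMSE against the
// reference. Assumes n > 0.
#include <algorithm>
#include <cmath>
#include <cstddef>

struct node_stats {
    float min, max, mean, var; // statistics of the quantized result
    float rmse;                // root-mean-square error vs. the reference
};

static node_stats compare_node(const float * ref, const float * quant, size_t n) {
    node_stats s = { quant[0], quant[0], 0.0f, 0.0f, 0.0f };
    double sum = 0.0, sum_sq = 0.0, err_sq = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s.min = std::min(s.min, quant[i]);
        s.max = std::max(s.max, quant[i]);
        sum    += quant[i];
        sum_sq += (double) quant[i] * quant[i];
        const double d = (double) quant[i] - (double) ref[i];
        err_sq += d * d;
    }
    const double mean = sum / (double) n;
    s.mean = (float) mean;
    s.var  = (float) (sum_sq / (double) n - mean * mean);
    s.rmse = (float) std::sqrt(err_sq / (double) n);
    return s;
}
```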

cc @slaren - I know you had similar ideas; we can discuss here how to add such support.
Currently I think the ggml graph export/import will be fairly trivial to utilize and will require almost no intervention in the existing llama.cpp implementation. The only thing we might have to take into account is to disable the allocator when exporting the graph, so that all results are available in memory after the computation.
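
And a hedged sketch of the comparison loop over the two graphs, assuming both have already been imported and evaluated on the CPU and reusing the hypothetical compare_node() helper from the snippet above (the exact ggml import/compute calls and struct fields vary between ggml versions, so the load/eval part is left out):

```cpp
// Hedged sketch: match nodes by name between the reference and quantized
// graphs and print the comparison statistics. Field names (n_nodes, nodes,
// data, type) follow the ggml structs as of mid-2023 and may differ in newer
// versions.
#include "ggml.h"
#include <cstdio>
#include <cstring>

static void compare_graphs(struct ggml_cgraph * g_ref, struct ggml_cgraph * g_quant) {
    for (int i = 0; i < g_ref->n_nodes; ++i) {
        struct ggml_tensor * a = g_ref->nodes[i];
        const char * name = ggml_get_name(a);
        if (name[0] == '\0') {
            continue; // only named result nodes are compared
        }

        // find the node with the same name in the quantized graph
        struct ggml_tensor * b = nullptr;
        for (int j = 0; j < g_quant->n_nodes; ++j) {
            if (strcmp(ggml_get_name(g_quant->nodes[j]), name) == 0) {
                b = g_quant->nodes[j];
                break;
            }
        }
        if (b == nullptr || a->type != GGML_TYPE_F32 || b->type != GGML_TYPE_F32) {
            continue; // intermediate results are expected to be F32
        }

        const node_stats s = compare_node((const float *) a->data, (const float *) b->data,
                                          (size_t) ggml_nelements(a));
        printf("%-40s min=%10.4f max=%10.4f mean=%10.4f var=%10.4f rmse=%10.4f\n",
               name, s.min, s.max, s.mean, s.var, s.rmse);
    }
}
```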

@ggerganov added the enhancement (New feature or request) and generation quality (Quality of model output) labels on Aug 25, 2023
@klosax (Contributor) commented Aug 25, 2023

> I'm hoping such a tool will be able to detect which nodes in the computation require more precision in order to keep the quantization differences small enough, and hopefully it will become an automated way of deciding which tensors require more bits than others.

I experimented with cutting outliers in the weight distribution to get a better weight spread in the quantization. It seemed to work well to measure the standard deviation of the tensor weights (in each normalization block) and cut all weights that fell outside of about 4 to 5 SDs. I only tested this approach using the same cutting point on all tensors, but I guess the best number of SDs to use as the cutting point will depend on the tensor.
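
For illustration, a minimal sketch of that clipping step, assuming "cutting" means clamping to the block mean ± k·SD (the block size of 32, k = 4.5, and the clip_outliers name are all illustrative):

```cpp
// Hypothetical sketch: within each block, compute the mean and standard
// deviation of the weights and clamp anything beyond k standard deviations
// from the mean.
#include <algorithm>
#include <cmath>
#include <cstddef>

static void clip_outliers(float * w, size_t n, size_t block_size = 32, float k = 4.5f) {
    for (size_t start = 0; start < n; start += block_size) {
        const size_t end = std::min(start + block_size, n);
        const size_t len = end - start;

        double sum = 0.0, sum_sq = 0.0;
        for (size_t i = start; i < end; ++i) {
            sum    += w[i];
            sum_sq += (double) w[i] * w[i];
        }
        const double mean = sum / (double) len;
        const double var  = sum_sq / (double) len - mean * mean;
        const double sd   = std::sqrt(std::max(var, 0.0));

        const float lo = (float) (mean - k * sd);
        const float hi = (float) (mean + k * sd);
        for (size_t i = start; i < end; ++i) {
            w[i] = std::min(std::max(w[i], lo), hi); // clamp outliers into [lo, hi]
        }
    }
}
```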

@slaren (Collaborator) commented Aug 25, 2023

> The only thing we might have to take into account is to disable the allocator when exporting the graph, so that all results are available in memory after the computation.

Inplace operations would also need to be disabled; otherwise they will overwrite the result of the previous operation. I am not sure it is worth keeping the inplace operations at all - they create other issues, and if memory usage is a concern, ggml-alloc will already make operations inplace automatically when possible.

@ggerganov (Owner, Author)

@klosax Sure, we can discuss strategies after we have the tool ready.

@slaren Yes, let's avoid using those for now, and at some point we can also remove them from the ggml API.

@ggerganov moved this to Todo in the "ggml : roadmap" project on Aug 26, 2023
@JohannesGaessler (Collaborator)

I'm currently gathering data regarding quantization quality metrics, as indicated in #2657. I will write up a proper report in the next few days, but one of my preliminary findings is that quantization mostly adds noise to the logits, which then manifests as higher perplexity due to asymmetry in the calculation. So I think a good method for investigating per-tensor sensitivity to quantization would be to take an unquantized model, add noise to a tensor, and then look at how this changes the perplexity/output logits.
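
For illustration, a hedged sketch of such a perturbation, scaling zero-mean Gaussian noise to the tensor's own standard deviation before re-running the perplexity evaluation (add_noise, relative_sigma and the fixed seed are hypothetical names, not existing llama.cpp code):

```cpp
// Hypothetical sketch: perturb a single F32 weight tensor of an otherwise
// unquantized model with Gaussian noise whose sigma is a fraction of the
// tensor's own standard deviation.
#include <cmath>
#include <cstddef>
#include <random>

static void add_noise(float * w, size_t n, float relative_sigma, unsigned seed = 1234) {
    // estimate the tensor's standard deviation
    double sum = 0.0, sum_sq = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum    += w[i];
        sum_sq += (double) w[i] * w[i];
    }
    const double mean = sum / (double) n;
    const double sd   = std::sqrt(std::max(sum_sq / (double) n - mean * mean, 0.0));

    // add zero-mean Gaussian noise with sigma = relative_sigma * sd
    std::mt19937 rng(seed);
    std::normal_distribution<float> noise(0.0f, relative_sigma * (float) sd);
    for (size_t i = 0; i < n; ++i) {
        w[i] += noise(rng);
    }
}
```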

@ggerganov (Owner, Author)

Not stale, though low-prio

Between May 2024 and November 2024, the github-actions bot repeatedly marked the issue as stale and closed it for inactivity, and @ggerganov reopened it and removed the stale label each time (most recently on Nov 5, 2024).