llama : tool for evaluating quantization results per layer #2783
Comments
I experimented with cutting outliers in the weight distribution to get a better weight spread in the quantization. It seemed to work well to measure the standard deviation of the tensor weights (in each normalization block) and cut all weights that fell outside of about 4 to 5 SDs. I only tested this approach using the same cutting point on all tensors, but I guess the best number of SDs to use as the cutting point will depend on the tensor.
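As a rough illustration of the clipping idea described above (not code from the project), here is a minimal NumPy sketch that clips the weights of each block to ±k standard deviations before quantization; the block size of 32 and the cut-off k are assumptions for the example.

```python
import numpy as np

def clip_outliers_per_block(weights: np.ndarray, block_size: int = 32, k: float = 4.5) -> np.ndarray:
    """Clip weights that fall outside +/- k standard deviations within each block.

    This mirrors the idea of cutting outliers so that the remaining weights
    use the quantization range more evenly. Block size and k are illustrative.
    """
    w = weights.reshape(-1, block_size).astype(np.float32)
    mean = w.mean(axis=1, keepdims=True)
    sd = w.std(axis=1, keepdims=True)
    clipped = np.clip(w, mean - k * sd, mean + k * sd)
    return clipped.reshape(weights.shape)

# Example: a tensor with a few large outliers
rng = np.random.default_rng(0)
t = rng.normal(size=1024).astype(np.float32)
t[::137] *= 20.0  # inject outliers
print(np.abs(t).max(), np.abs(clip_outliers_per_block(t)).max())
```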
Inplace operations would also need to be disabled, otherwise they will overwrite the result of the previous operation. I am not sure it is worth keeping the inplace operations at all; they create other issues, and if memory usage is a concern, ggml-alloc will already make operations inplace automatically when possible.
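To illustrate why inplace operations get in the way here (a generic sketch, not ggml code): when an operation writes its output into its input's buffer, the earlier intermediate result is gone by the time the graph finishes, so it can no longer be dumped for per-node statistics.

```python
import numpy as np

x = np.arange(4, dtype=np.float32)

# Out-of-place: the intermediate result `y` survives and can be inspected later.
y = x * 2.0
z = y + 1.0
print("intermediate still available:", y)

# Inplace: the buffer of `x` is reused, so the earlier result is overwritten
# and cannot be compared between the two graphs afterwards.
np.multiply(x, 2.0, out=x)
np.add(x, 1.0, out=x)
print("only the final value remains:", x)
```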
I'm currently gathering data regarding quantization quality metrics as indicated in #2657. I will write up a proper report in the next few days, but one of my preliminary findings is that quantization mostly adds noise to the logits, which then manifests as higher perplexity due to asymmetry in the calculation. So I think a good method for investigating per-tensor sensitivity to quantization would be to take an unquantized model, add noise to a tensor, and then look at how this changes the perplexity/output logits.
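A small Monte Carlo sketch (an illustration under assumed values, not the referenced report) of why zero-mean noise on the logits still shows up as higher perplexity: the log-sum-exp term in the cross-entropy is convex, so symmetric noise biases the loss upward.

```python
import numpy as np

rng = np.random.default_rng(0)

def nll(logits: np.ndarray, target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[target])

logits = rng.normal(size=32000)         # vocab-sized logit vector (illustrative)
target = int(logits.argmax())

base = nll(logits, target)
noisy = np.mean([nll(logits + rng.normal(scale=0.5, size=logits.shape), target)
                 for _ in range(200)])

# The noisy average sits systematically above the clean value even though the
# added noise has zero mean -- this asymmetry is what raises perplexity.
print(f"clean NLL: {base:.4f}  mean noisy NLL: {noisy:.4f}")
```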
Not stale, though low-prio |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Following up on #2421, I think we should implement some better way to observe at which point of the inference the results start to deviate significantly between the classical and quantum models.
So I'm thinking of adding a simple tool that takes as input 2 ggml exported graphs - one classical and one quantum, of the same model. The tool evals both graphs on the CPU using ggml and prints detailed statistical information about the intermediate F32 results after each graph node. For example, each result node that has been given a name will be compared, and we'll print stuff like min, max, avg, var, etc.
I'm hoping that with such a tool we'll be able to detect which nodes in the computation require more precision in order to keep the quantization differences small enough, and that it will hopefully become an automated way of deciding which tensors require more bits than others.
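As a rough sketch of the kind of per-node comparison such a tool could print (assumed node names, data, and output layout, not an actual implementation), given the intermediate F32 results of the classical and the quantum graph for each named node:

```python
import numpy as np

def compare_node(name: str, classical: np.ndarray, quantum: np.ndarray) -> None:
    """Print summary statistics for a pair of intermediate F32 results."""
    diff = classical.astype(np.float64) - quantum.astype(np.float64)
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    print(f"{name:24s} "
          f"min={quantum.min():+.4f} max={quantum.max():+.4f} "
          f"avg={quantum.mean():+.4f} var={quantum.var():.4f} "
          f"rmse_vs_f32={rmse:.6f}")

# Hypothetical usage with two dicts of node name -> intermediate result
rng = np.random.default_rng(0)
classical_results = {"blk.0.attn_output": rng.normal(size=(32, 4096)).astype(np.float32)}
quantum_results = {k: v + rng.normal(scale=0.01, size=v.shape).astype(np.float32)
                   for k, v in classical_results.items()}

for node_name, ref in classical_results.items():
    compare_node(node_name, ref, quantum_results[node_name])
```

Nodes whose statistics drift the most between the two runs would be the natural candidates for keeping at higher precision.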
cc @slaren I know you had similar ideas - we can discuss here how to add such support.
Currently I think the ggml graph export/import will be fairly trivial to utilize and will require almost no intervention in the existing llama.cpp implementation. The only thing we might have to take into account is to disable the allocator when exporting the graph, so that all results are available in memory after the computation.