Guide to choosing quants and engines : r/LocalLLaMA #641

irthomasthomas opened this issue Feb 27, 2024 · 1 comment


Guide to choosing quants and engines : r/LocalLLaMA

DESCRIPTION:
Ever wonder which type of quant to download for the same model, GPTQ or GGUF or exl2? And what app/runtime/inference engine you should use for this quant? Here's my guide.

TLDR:

  • If you have multiple gpus of the same type (3090x2, not 3090+3060), and the model can fit in your vram: Choose AWQ+Aphrodite (4 bit only) > GPTQ+Aphrodite > GGUF+Aphrodite;
  • If you have a single gpu and the model can fit in your vram, or multiple gpus with different vram sizes: Choose exl2+exllamav2 ≈ GPTQ+exllamav2 (4 bit only);
  • If you need to do offloading or your gpu does not support Aphrodite/exllamav2, GGUF+llama.cpp is your only choice.

You want to use a model but cannot fit it in your vram in fp16, so you have to use quantization. When talking about quantization, there are two concepts. First is the format: how the model is quantized, the math behind the method that compresses the model in a lossy way. Second is the engine: how you run such a quantized model. Generally speaking, quantizations of the same format at the same bitrate should have exactly the same quality, but when run on different engines the speed and memory consumption can differ dramatically.

Please note that I primarily use 4-8 bit quants on Linux and never go below 4, so my take on extremely tight quants of <=3 bit might be completely off.

Part I: review of quantization formats.

There are currently four popular quant formats:

  • GPTQ: The old and good one. It is the first "smart" quantization method. It utilizes a calibration dataset to improve quality at the same bitrate. It takes a lot of time and vram+ram to make a GPTQ quant. Usually comes at 3, 4, or 8 bits. It is widely adopted for almost all kinds of models and can be run on many engines.
  • AWQ: An even "smarter" format than GPTQ. In theory it delivers better quality than GPTQ of the same bitrate. Usually comes at 4 bits. The recommended quantization format by vLLM and other mass serving engines.
  • GGUF: A simple quant format that doesn't require calibration, so it's basically round-to-nearest augmented with grouping. Fast and easy to quant but not the "smart" type. Recently imatrix was added to GGUF, which also utilizes a calibration dataset to make it smarter, like GPTQ. GGUFs with imatrix usually have "IQ" in the name: like "name-IQ3_XS" vs the original "name-Q3_XS". However, imatrix is usually applied to tight quants (<= 3 bit) and I don't see many larger GGUF quants made with imatrix.
  • EXL2: The quantization format used by exllamav2. EXL2 is based on the same optimization method as GPTQ. The major advantage of exl2 is that it allows mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight. So you can tailor the bitrate to your vram: you can fit a 34B model in a single 4090 at 4.65 bpw with 4k context, improving quality a bit over 4 bit. But if you want a longer ctx you can lower the bpw to 4.35 or even 3.5 (rough numbers are sketched in the snippet below).
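
Not from the original post, but to make the bpw numbers concrete, here is a minimal sketch of the back-of-the-envelope math for the weights-only VRAM footprint. KV cache and runtime overhead come on top, which is why roughly 18 GiB of weights is already tight on a 24 GB 4090 at 4k context.

```python
def weight_vram_gib(n_params_b: float, bpw: float) -> float:
    """Approximate VRAM taken by the quantized weights alone (GiB)."""
    return n_params_b * 1e9 * bpw / 8 / 1024**3

# A 34B model at a few EXL2 bitrates (weights only, excludes KV cache and overhead).
for bpw in (3.5, 4.0, 4.35, 4.65):
    print(f"{bpw:.2f} bpw -> ~{weight_vram_gib(34, bpw):.1f} GiB")
```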

So in terms of quality at the same bitrate, AWQ > GPTQ = EXL2 > GGUF. I don't know where GGUF imatrix should be placed; I suppose it's at roughly the same level as GPTQ.

Besides, the choice of calibration dataset has a subtle effect on the quality of quants. Quants at lower bitrates have a tendency to overfit on the style of the calibration dataset. Early GPTQs used wikitext, making them slightly more "formal, dispassionate, machine-like". The default calibration dataset of exl2 is carefully picked by its author to contain a broad mix of different types of data. There are often also "-rpcal" flavours of exl2 calibrated on roleplay datasets to enhance the RP experience.

Part II: review of runtime engines.

Different engines support different formats. I tried to make a table:

Comparison of quant formats and engines (the table itself was posted as an image in the original thread and is not reproduced here).
Pre-allocation: The engine pre-allocates the vram needed by activations and the kv cache, effectively reducing vram usage and improving speed, because pytorch handles vram allocation badly. However, pre-allocation means the engine needs to grab as much vram as your model's max ctx length requires at startup, even if you are not using it.

VRAM optimization: Efficient attention implementations like FlashAttention or PagedAttention reduce memory usage, especially at long context.
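
As a rough illustration (my own sketch, assuming a Llama-2-70B-like shape with GQA: 80 layers, 8 KV heads, head_dim 128), the fp16 KV cache that a pre-allocating engine has to reserve grows linearly with the max context length:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: int = 2, batch: int = 1) -> float:
    """Size of the K and V caches (the leading factor 2) across all layers, in GiB."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem * batch / 1024**3

# Llama-2-70B-like shape assumed for the example: 80 layers, 8 KV heads, head_dim 128.
for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} ctx -> ~{kv_cache_gib(80, 8, 128, ctx):.1f} GiB fp16 KV cache")
```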

One notable player here is Aphrodite-engine. At first glance it looks like a replica of vLLM, which sounds less attractive for in-home usage when there are no concurrent requests. However, now that GGUF is supported and exl2 is on the way, it could be a game changer. It supports tensor parallelism out of the box: if you have 2 or more gpus, you can run your (even quantized) model in parallel, and that is much faster than all the other engines, where you can only use your gpus sequentially. I achieved 3x the speed of llama.cpp running miqu on 4x 2080 Ti!
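
For reference, here is a minimal sketch of tensor-parallel serving of a quantized model through the vLLM Python API. Aphrodite is a vLLM fork and exposes a very similar interface, but treat the exact Aphrodite invocation as an assumption; the model name is a placeholder taken from the examples further down.

```python
from vllm import LLM, SamplingParams

# Shard the model across two identical GPUs; "awq" assumes an AWQ-quantized checkpoint.
llm = LLM(
    model="TheBloke/Nous-Hermes-Llama2-AWQ",  # placeholder repo id
    quantization="awq",
    tensor_parallel_size=2,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
out = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(out[0].outputs[0].text)
```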

Some personal notes:

  • If you are loading a 4 bit GPTQ model in Hugging Face transformers or AutoGPTQ, unless you specify otherwise, you will be using the exllama kernel, but not the other optimizations from exllama (see the sketch after this list).
  • 4 bit GPTQ over exllamav2 is the single fastest method without tensor parallel, even slightly faster than exl2 4.0bpw.
  • vLLM only supports 4 bit GPTQ but Aphrodite supports 2,3,4,8 bit GPTQ.
  • Lacking FlashAttention at the moment, llama.cpp is inefficient at prompt processing when the context is large, often taking several seconds or even minutes before it can start generating. The actual generation speed is not bad compared to exllamav2.
  • Even with one gpu, GGUF over Aphrodite can utilize PagedAttention, possibly offering faster prompt processing than llama.cpp.
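
Regarding the first note above, a minimal sketch (assuming a recent transformers version where `GPTQConfig` exposes `use_exllama`; the model id is a placeholder) of loading a 4-bit GPTQ checkpoint so that the exllama kernel is used:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "TheBloke/Phind-CodeLlama-34B-v2-GPTQ"  # placeholder GPTQ repo id
quant_cfg = GPTQConfig(bits=4, use_exllama=True)   # request the exllama kernel explicitly

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_cfg,
)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```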

Update: shing3232 kindly pointed out that you can convert an AWQ model to GGUF and run it in llama.cpp. I never tried that, so I cannot comment on the effectiveness of this approach.

URL: Guide to choosing quants and engines

Suggested labels

{'label-name': 'model-quantization-guide', 'label-description': 'Information on choosing quantization formats for machine learning models.', 'confidence': 65.61}


Related issues

#304: GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga

Similarity score: 0.93

- [ ] [GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga](https://www.reddit.com/r/Oobabooga/comments/178yqmg/gptq_vs_exl2_vs_awq_vs_q4_k_m_model_sizes/)

GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes

Mod Post
| Size (MB) | Model |
|---|---|
| 16560 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.000b |
| 17053 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.125b |
| 17463 | Phind-CodeLlama-34B-v2-AWQ-4bit-128g |
| 17480 | Phind-CodeLlama-34B-v2-GPTQ-4bit-128g-actorder |
| 17548 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.250b |
| 18143 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.400b |
| 19133 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.650b |
| 19284 | phind-codellama-34b-v2.Q4_K_M.gguf |
| 19320 | Phind-CodeLlama-34B-v2-AWQ-4bit-32g |
| 19337 | Phind-CodeLlama-34B-v2-GPTQ-4bit-32g-actorder |
I created all these EXL2 quants to compare them to GPTQ and AWQ. The preliminary result is that EXL2 4.4b seems to outperform GPTQ-4bit-32g while EXL2 4.125b seems to outperform GPTQ-4bit-128g while using less VRAM in both cases.

I couldn't test AWQ yet because my quantization ended up broken, possibly due to this particular model using NTK scaling, so I'll probably have to go through the fun of burning my GPU for 16 hours again to quantize and evaluate another model so that a conclusion can be reached.

Also no idea if Phind-CodeLlama is actually good. WizardCoder-Python might be better.

Suggested labels

"LLM-Quantization"

#389: AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT

Similarity score: 0.89

- [ ] [AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT](https://forum.opennmt.net/t/awq-quantization-support-new-generic-converter-for-all-hf-llama-like-models/5569)

Quantization and Acceleration

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not. Here's an example of the syntax:

```
python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
```

  • TheBloke/Nous-Hermes-Llama2-AWQ: The name of the repository/model on the Hugging Face Hub.
  • output: Specifies the target directory and model name you want to save.
  • format: Optionally, you can save as safetensors.

For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is an AWQ-quantized model, we will convert it to an OpenNMT-py AWQ quantized model.

After converting, you will need a config file to run translate.py or run_mmlu_opennmt.py. Here's an example of the config:

```yaml
transforms: [sentencepiece]

#### Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...
```

When considering your priority:

  • If your priority is small model files that fit in your GPU's VRAM, try AWQ, but it will be slow for large batch sizes.
  • AWQ models are faster than FP16 for batch size 1.

Please read more here: GitHub - casper-hansen/AutoAWQ

Important Note:

  • There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV.
  • The original llm-awq from MIT is not regularly maintained, so we default to AutoAWQ.
  • If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible.

Offline Quantizer Script:

  • We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.

Enjoy!


VS: Fast Inference with vLLM

Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:

  • Batch size 1: 80.5 tokens/second
  • Batch size 60: 98 tokens/second, with GEMV being 20-25% faster.

This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time.
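
A quick sketch of that adjustment (my own illustration with made-up timings, not numbers from the post): subtract the prefill time before dividing the generated token count by the elapsed time.

```python
def decode_throughput(generated_tokens: int, total_s: float, prefill_s: float) -> float:
    """Tokens/second for the decode phase only, excluding step-0 prompt prefill."""
    return generated_tokens / (total_s - prefill_s)

# Hypothetical numbers: 250 tokens in 3.2 s total, of which 0.4 s was prompt prefill.
print(f"{decode_throughput(250, 3.2, 0.4):.1f} tok/s decode-only")
```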

Suggested labels

{ "key": "llm-quantization", "value": "Discussions and tools for handling quantized large language models" }

#431: awq llama quantization

Similarity score: 0.89

- [ ] [awq llama quantization](huggingface.co)

Quantization and Acceleration

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not.

Model Conversion

Here's an example of the syntax for converting a model:

```
tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
```

  • TheBloke/Nous-Hermes-Llama2-AWQ: The name of the repository/model on the Hugging Face Hub.
  • output: Specifies the target directory and model name you want to save.
  • format: Optionally, you can save as safetensors.

For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is an AWQ-quantized model, we will convert it to an OpenNMT-py AWQ quantized model.

Config File

After converting, you will need a config file to run translate.py or run_mmlu_opennmt.py. Here's an example of the config:

```yaml
transforms: [sentencepiece]

# Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...
```

Priority

When considering your priority:

  • If your priority is small model files that fit in your GPU's VRAM, try AWQ, but it will be slow for large batch sizes.
  • AWQ models are faster than FP16 for batch size 1.
  • Read more: GitHub - casper-hansen/AutoAWQ

Important Note

  • There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV.
  • The original llm-awq from MIT is not regularly maintained, so we default to AutoAWQ.
  • If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible.

Offline Quantizer Script

We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.

vLLM Performance

Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:

  • Batch size 1: 80.5 tokens/second
  • Batch size 60: 98 tokens/second, with GEMV being 20-25% faster.
  • This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time.

Suggested labels

null

#504: AutoAWQ 4bit quantization

Similarity score: 0.87

- [ ] [Code search results](https://github.com/casper-hansen/AutoAWQ)

CONTENT:

TITLE: Code search results

DESCRIPTION:

Repository files navigation

README
MIT license

AutoAWQ

| Roadmap | Examples | Issues: Help Wanted |

AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 3x and reduces memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ was created and improved upon from the original work from MIT.

Latest News 🔥

[2023/12] Mixtral, LLaVa, QWen, Baichuan model support.
[2023/11] AutoAWQ inference has been integrated into 🤗 transformers. Now includes CUDA 12.1 wheels.
[2023/10] Mistral (Fused Modules), Bigcode, Turing support, Memory Bug Fix (Saves 2GB VRAM)
[2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
[2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available
[2023/08] PyPi package released and AutoModel class available

Install

Prerequisites

NVIDIA:

Your NVIDIA GPU(s) must be of Compute Capability 7.5. Turing and later architectures are supported.
Your CUDA version must be CUDA 11.8 or later.

AMD:

Your ROCm version must be ROCm 5.6 or later.

Install from PyPi

To install the newest AutoAWQ from PyPi, you need CUDA 12.1 installed.

```
pip install autoawq
```

Build from source

For CUDA 11.8, ROCm 5.6, and ROCm 5.7, you can install wheels from the release page:

```
pip install autoawq@https://github.com/casper-hansen/AutoAWQ/releases/download/v0.2.0/autoawq-0.2.0+cu118-cp310-cp310-linux_x86_64.whl
```

Or from the main branch directly:

```
pip install autoawq@git+https://github.com/casper-hansen/AutoAWQ.git
```

Or by cloning the repository and installing from source:

```
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
```

All three methods will install the latest and correct kernels for your system from AutoAWQ_Kernels.

If your system is not supported (i.e. not on the release page), you can build the kernels yourself by following the instructions in AutoAWQ_Kernels and then install AutoAWQ from source.

Supported models

The detailed support list:

| Models | Sizes |
|---|---|
| LLaMA-2 | 7B/13B/70B |
| LLaMA | 7B/13B/30B/65B |
| Mistral | 7B |
| Vicuna | 7B/13B |
| MPT | 7B/30B |
| Falcon | 7B/40B |
| OPT | 125m/1.3B/2.7B/6.7B/13B/30B |
| Bloom | 560m/3B/7B |
| GPTJ | 6.7B |
| Aquila | 7B |
| Aquila2 | 7B/34B |
| Yi | 6B/34B |
| Qwen | 1.8B/7B/14B/72B |
| BigCode | 1B/7B/15B |
| GPT NeoX | 20B |
| GPT-J | 6B |
| LLaVa | 7B/13B |
| Mixtral | 8x7B |
| Baichuan | 7B/13B |
| QWen | 1.8B/7B/14B/72B |

Usage

Under examples, you can find examples of how to quantize, run inference, and benchmark AutoAWQ models.
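
As a pointer, here is a minimal quantization sketch in the style of the AutoAWQ examples (the model and output paths are placeholders; the quant_config values are the commonly used defaults):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder source model
quant_path = "mistral-7b-instruct-v0.2-awq"         # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the fp16 model, quantize it with AWQ, and save the 4-bit checkpoint.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```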

INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of AWQ: GEMM and GEMV.

Suggested labels

null

#457: I keep running out of memory. What's the biggest model, and most context, I can run on 3060 12gb? With decent speed? : r/LocalLLaMA

Similarity score: 0.87

- [ ] [I keep running out of memory. What's the biggest model, and most context, I can run on 3060 12gb? With decent speed? : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1abihou/i_keep_running_out_of_memory_whats_the_biggest/)

Here's the reformatted text in Markdown format:

# GPU and Model Recommendations

**GPU Only:**

- You can use 7B Models at 8 bpw with 8K context, or maybe up to 12k context.
- If you wish to use 13B models, then you have to use 4bpw and limit yourself to 2K Context.

**GPU + CPU:**

- Use `.gguf` files to offload part of the model to VRAM.
- Check the disk usage when inferencing in the activity monitor app (or whatever it is called in your OS). If the disk usage is 100% (disk is swapping), then it is impossible to fit the model in RAM + VRAM and tokens per second will be very low.
- In that case, reduce context size and reduce bpw. The best models you can probably run now are:
  - OpenChat 3.5 7B at 8bpw (Use the latest version)
  - <https://huggingface.co/vicgalle/solarized-18B-dpo-GGUF> at 4bpw and 4K context.
- If you want to run Nous-Capybara-34b, switch to the 3bpw version and try to offload 35 layers to GPU. If you want to run bigger models, upgrade RAM to 64GB.

**Tip from /u/Working-Flatworm-531:**

- Just do not load kv in VRAM, you can use `ooba` to disable it.
- Also try lower quants, for example Q4_K_S is still good. You still wouldn't be able to run 34B models with good speed, but at least it's something.
- You can also check your BIOS and maybe increase RAM frequency. After that, you'd be able to run ~20B models at ~2t/s at 8k ~ 12k context.

**Recommended List from /u/Working-Flatworm-531:**

- Use Linux.
- Overclock RAM (if possible).
- Overclock CPU (if possible).
- Overclock GPU.
- Don't load kv cache in VRAM, instead load more layers to the VRAM.
- Use smaller quants.
- Use fast interface (didn't try Kobold, use Ooba).
- Check RAM (should be dual channel).

Suggested labels

{ "label-name": "memory-optimization", "description": "Strategies for optimizing memory usage when running models on 3060 12gb GPU.", "confidence": 94.88 }

#391: Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA

Similarity score: 0.86

- [ ] [Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/17h4rqz/speculative_decoding_in_exllama_v2_and_llamacpp/)

Speculative Decoding in Exllama v2 and llama.cpp Comparison

Discussion

We discussed speculative decoding (SD) in a previous thread. For those who are not aware of this feature, it allows LLM loaders to use a smaller "draft" model to help predict tokens for a larger model. In that thread, someone asked for tests of speculative decoding for both Exllama v2 and llama.cpp. Although I generally only run models in GPTQ, AWQ, or exl2 formats, I was interested in doing the exl2 vs. llama.cpp comparison.
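
For intuition, here is a toy sketch of the accept/reject loop behind (greedy) speculative decoding, with stand-in "models" that just read from fixed lookup tables. Real engines verify all k draft tokens in a single batched forward pass of the large model instead of one call per position.

```python
from typing import Callable, List

def speculative_decode(target: Callable[[List[int]], int],
                       draft: Callable[[List[int]], int],
                       prompt: List[int], k: int, n_new: int) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target keeps
    the longest prefix it agrees with, then contributes one token of its own."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target model checks the proposals; accept while they match its own choice.
        for t in proposal:
            expected = target(seq)       # one "expensive" call per position in this toy
            if expected == t:
                seq.append(t)            # accepted draft token
            else:
                seq.append(expected)     # first mismatch: take the target's token, stop
                break
        else:
            seq.append(target(seq))      # all k accepted: target adds a bonus token
    return seq[:len(prompt) + n_new]

# Toy models over a tiny vocabulary: the draft agrees with the target most of the time.
target_table = {0: 1, 1: 2, 2: 3, 3: 4, 4: 0}
draft_table  = {0: 1, 1: 2, 2: 3, 3: 0, 4: 0}   # disagrees after token 3
target = lambda seq: target_table[seq[-1]]
draft  = lambda seq: draft_table[seq[-1]]

print(speculative_decode(target, draft, prompt=[0], k=4, n_new=8))
```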

Test Setup

The tests were run on a 2x 4090, 13900K, DDR5 system. The screen captures of the terminal output of both are available below. If someone has experience with making llama.cpp speculative decoding work better, please share.

Exllama v2 Results

Model: Xwin-LM-70B-V0.1-4.0bpw-h6-exl2

Draft Model: TinyLlama-1.1B-1T-OpenOrca-GPTQ

Performance can be highly variable, but it goes from ~20 t/s without SD to 40-50 t/s with SD.

No SD:

```
Prompt processed in 0.02 seconds, 4 tokens, 200.61 tokens/second
Response generated in 10.80 seconds, 250 tokens, 23.15 tokens/second
```

With SD:

```
Prompt processed in 0.03 seconds, 4 tokens, 138.80 tokens/second
Response generated in 5.10 seconds, 250 tokens, 49.05 tokens/second
```

Suggested labels

{ "key": "speculative-decoding", "value": "Technique for using a smaller 'draft' model to help predict tokens for a larger model" }
