Guide to choosing quants and engines : r/LocalLLaMA #641

irthomasthomas opened this issue Feb 27, 2024 · 1 comment


Guide to choosing quants and engines : r/LocalLLaMA

DESCRIPTION:
Ever wonder which type of quant to download for the same model, GPTQ or GGUF or exl2? And what app/runtime/inference engine you should use for this quant? Here's my guide.

TLDR:

  • If you have multiple gpus of the same type (3090x2, not 3090+3060), and the model can fit in your vram: Choose AWQ+Aphrodite (4 bit only) > GPTQ+Aphrodite > GGUF+Aphrodite;
  • If you have a single gpu and the model can fit in your vram, or multiple gpus with different vram sizes: Choose exl2+exllamav2 ≈ GPTQ+exllamav2 (4 bit only);
  • If you need to do offloading or your gpu does not support Aphrodite/exllamav2, GGUF+llama.cpp is your only choice.

You want to use a model but cannot fit it in your vram in fp16, so you have to use quantization. When talking about quantization, there are two concepts. First is the format: how the model is quantized, the math behind the method that compresses the model in a lossy way. Second is the engine: how you run such a quantized model. Generally speaking, quantizations of the same format at the same bitrate should have exactly the same quality, but when run on different engines the speed and memory consumption can differ dramatically.

Please note that I primarily use 4-8 bit quants on Linux and never go below 4, so my take on extremely tight quants of <=3 bit might be completely off.

Part I: review of quantization formats.

There are currently four popular quant formats:

  • GPTQ: The old and good one. It is the first "smart" quantization method. It utilizes a calibration dataset to improve quality at the same bitrate. It takes a lot of time and vram+ram to make a GPTQ quant. Usually comes at 3, 4, or 8 bits. It is widely adopted for almost all kinds of models and can be run on many engines.
  • AWQ: An even "smarter" format than GPTQ. In theory it delivers better quality than GPTQ of the same bitrate. Usually comes at 4 bits. The recommended quantization format by vLLM and other mass serving engines.
  • GGUF: A simple quant format that doesn't require calibration, so it's basically round-to-nearest augmented with grouping. Fast and easy to quant but not the "smart" type. Recently imatrix was added to GGUF, which also utilizes a calibration dataset to make it smarter, like GPTQ. GGUFs with imatrix usually have "IQ" in the name: like "name-IQ3_XS" vs the original "name-Q3_XS". However, imatrix is usually applied to tight quants (<= 3 bit) and I don't see many larger GGUF quants made with imatrix.
  • EXL2: The quantization format used by exllamav2. EXL2 is based on the same optimization method as GPTQ. The major advantage of exl2 is that it allows mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight. So you can tailor the bitrate to your vram: you can fit a 34B model in a single 4090 at 4.65 bpw with 4k context, improving quality a bit over 4 bit. But if you want a longer ctx you can lower the bpw to 4.35 or even 3.5 (rough numbers are sketched in the snippet below).
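
Not from the original post, but to make the bpw numbers concrete, here is a minimal sketch of the back-of-the-envelope math for the weights-only VRAM footprint. KV cache and runtime overhead come on top, which is why roughly 18 GiB of weights is already tight on a 24 GB 4090 at 4k context.

```python
def weight_vram_gib(n_params_b: float, bpw: float) -> float:
    """Approximate VRAM taken by the quantized weights alone (GiB)."""
    return n_params_b * 1e9 * bpw / 8 / 1024**3

# A 34B model at a few EXL2 bitrates (weights only, excludes KV cache and overhead).
for bpw in (3.5, 4.0, 4.35, 4.65):
    print(f"{bpw:.2f} bpw -> ~{weight_vram_gib(34, bpw):.1f} GiB")
```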

So in terms of quality at the same bitrate, AWQ > GPTQ = EXL2 > GGUF. I don't know where GGUF imatrix should be placed; I suppose it's at roughly the same level as GPTQ.

Besides, the choice of calibration dataset has a subtle effect on the quality of quants. Quants at lower bitrates have a tendency to overfit on the style of the calibration dataset. Early GPTQs used wikitext, making them slightly more "formal, dispassionate, machine-like". The default calibration dataset of exl2 is carefully picked by its author to contain a broad mix of different types of data. There are often also "-rpcal" flavours of exl2 calibrated on roleplay datasets to enhance the RP experience.

Part II: review of runtime engines.

Different engines support different formats. I tried to make a table:

Comparison of quant formats and engines (the table itself was posted as an image in the original thread and is not reproduced here).
Pre-allocation: The engine pre-allocates the vram needed by activations and the kv cache, effectively reducing vram usage and improving speed, because pytorch handles vram allocation badly. However, pre-allocation means the engine needs to grab as much vram as your model's max ctx length requires at startup, even if you are not using it.

VRAM optimization: Efficient attention implementations like FlashAttention or PagedAttention reduce memory usage, especially at long context.
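
As a rough illustration (my own sketch, assuming a Llama-2-70B-like shape with GQA: 80 layers, 8 KV heads, head_dim 128), the fp16 KV cache that a pre-allocating engine has to reserve grows linearly with the max context length:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: int = 2, batch: int = 1) -> float:
    """Size of the K and V caches (the leading factor 2) across all layers, in GiB."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem * batch / 1024**3

# Llama-2-70B-like shape assumed for the example: 80 layers, 8 KV heads, head_dim 128.
for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} ctx -> ~{kv_cache_gib(80, 8, 128, ctx):.1f} GiB fp16 KV cache")
```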

One notable player here is Aphrodite-engine. At first glance it looks like a replica of vLLM, which sounds less attractive for in-home usage when there are no concurrent requests. However, now that GGUF is supported and exl2 is on the way, it could be a game changer. It supports tensor parallelism out of the box: if you have 2 or more gpus, you can run your (even quantized) model in parallel, and that is much faster than all the other engines, where you can only use your gpus sequentially. I achieved 3x the speed of llama.cpp running miqu on 4x 2080 Ti!
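
For reference, here is a minimal sketch of tensor-parallel serving of a quantized model through the vLLM Python API. Aphrodite is a vLLM fork and exposes a very similar interface, but treat the exact Aphrodite invocation as an assumption; the model name is a placeholder taken from the examples further down.

```python
from vllm import LLM, SamplingParams

# Shard the model across two identical GPUs; "awq" assumes an AWQ-quantized checkpoint.
llm = LLM(
    model="TheBloke/Nous-Hermes-Llama2-AWQ",  # placeholder repo id
    quantization="awq",
    tensor_parallel_size=2,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
out = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(out[0].outputs[0].text)
```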

Some personal notes:

  • If you are loading a 4 bit GPTQ model in Hugging Face transformers or AutoGPTQ, unless you specify otherwise, you will be using the exllama kernel, but not the other optimizations from exllama (see the sketch after this list).
  • 4 bit GPTQ over exllamav2 is the single fastest method without tensor parallel, even slightly faster than exl2 4.0bpw.
  • vLLM only supports 4 bit GPTQ but Aphrodite supports 2,3,4,8 bit GPTQ.
  • Lacking FlashAttention at the moment, llama.cpp is inefficient at prompt processing when the context is large, often taking several seconds or even minutes before it can start generating. The actual generation speed is not bad compared to exllamav2.
  • Even with one gpu, GGUF over Aphrodite can utilize PagedAttention, possibly offering faster prompt processing than llama.cpp.
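
Regarding the first note above, a minimal sketch (assuming a recent transformers version where `GPTQConfig` exposes `use_exllama`; the model id is a placeholder) of loading a 4-bit GPTQ checkpoint so that the exllama kernel is used:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "TheBloke/Phind-CodeLlama-34B-v2-GPTQ"  # placeholder GPTQ repo id
quant_cfg = GPTQConfig(bits=4, use_exllama=True)   # request the exllama kernel explicitly

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_cfg,
)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```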

Update: shing3232 kindly pointed out that you can convert an AWQ model to GGUF and run it in llama.cpp. I never tried that, so I cannot comment on the effectiveness of this approach.

URL: Guide to choosing quants and engines

Suggested labels

{'label-name': 'model-quantization-guide', 'label-description': 'Information on choosing quantization formats for machine learning models.', 'confidence': 65.61}


Related issues

#304: GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga

Similarity score: 0.93

- [ ] [GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga](https://www.reddit.com/r/Oobabooga/comments/178yqmg/gptq_vs_exl2_vs_awq_vs_q4_k_m_model_sizes/)

GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes

Mod Post
| Size (MB) | Model |
|---|---|
| 16560 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.000b |
| 17053 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.125b |
| 17463 | Phind-CodeLlama-34B-v2-AWQ-4bit-128g |
| 17480 | Phind-CodeLlama-34B-v2-GPTQ-4bit-128g-actorder |
| 17548 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.250b |
| 18143 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.400b |
| 19133 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.650b |
| 19284 | phind-codellama-34b-v2.Q4_K_M.gguf |
| 19320 | Phind-CodeLlama-34B-v2-AWQ-4bit-32g |
| 19337 | Phind-CodeLlama-34B-v2-GPTQ-4bit-32g-actorder |
I created all these EXL2 quants to compare them to GPTQ and AWQ. The preliminary result is that EXL2 4.4b seems to outperform GPTQ-4bit-32g while EXL2 4.125b seems to outperform GPTQ-4bit-128g while using less VRAM in both cases.

I couldn't test AWQ yet because my quantization ended up broken, possibly due to this particular model using NTK scaling, so I'll probably have to go through the fun of burning my GPU for 16 hours again to quantize and evaluate another model so that a conclusion can be reached.

Also no idea if Phind-CodeLlama is actually good. WizardCoder-Python might be better.

Suggested labels

"LLM-Quantization"

#389: AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT

Similarity score: 0.89

- [ ] [AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT](https://forum.opennmt.net/t/awq-quantization-support-new-generic-converter-for-all-hf-llama-like-models/5569)

Quantization and Acceleration

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not. Here's an example of the syntax:

```
python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
```

  • TheBloke/Nous-Hermes-Llama2-AWQ: The name of the repository/model on the Hugging Face Hub.
  • output: Specifies the target directory and model name you want to save.
  • format: Optionally, you can save as safetensors.

For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is an AWQ-quantized model, we will convert it to an OpenNMT-py AWQ quantized model.

After converting, you will need a config file to run translate.py or run_mmlu_opennmt.py. Here's an example of the config:

```yaml
transforms: [sentencepiece]

#### Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...
```

When considering your priority:

  • If your priority is small model files that fit in your GPU's VRAM, try AWQ, but it will be slow for large batch sizes.
  • AWQ models are faster than FP16 for batch size 1.

Please read more here: GitHub - casper-hansen/AutoAWQ

Important Note:

  • There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV.
  • The original llm-awq from MIT is not regularly maintained, so we default to AutoAWQ.
  • If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible.

Offline Quantizer Script:

  • We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.

Enjoy!


VS: Fast Inference with vLLM

Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:

  • Batch size 1: 80.5 tokens/second
  • Batch size 60: 98 tokens/second, with GEMV being 20-25% faster.

This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time.
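
A quick sketch of that adjustment (my own illustration with made-up timings, not numbers from the post): subtract the prefill time before dividing the generated token count by the elapsed time.

```python
def decode_throughput(generated_tokens: int, total_s: float, prefill_s: float) -> float:
    """Tokens/second for the decode phase only, excluding step-0 prompt prefill."""
    return generated_tokens / (total_s - prefill_s)

# Hypothetical numbers: 250 tokens in 3.2 s total, of which 0.4 s was prompt prefill.
print(f"{decode_throughput(250, 3.2, 0.4):.1f} tok/s decode-only")
```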

Suggested labels

{ "key": "llm-quantization", "value": "Discussions and tools for handling quantized large language models" }

#431: awq llama quantization

Similarity score: 0.89

- [ ] [awq llama quantization](huggingface.co)

Quantization and Acceleration

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not.

Model Conversion

Here's an example of the syntax for converting a model:

```
tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
```

  • TheBloke/Nous-Hermes-Llama2-AWQ: The name of the repository/model on the Hugging Face Hub.
  • output: Specifies the target directory and model name you want to save.
  • format: Optionally, you can save as safetensors.

For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is an AWQ-quantized model, we will convert it to an OpenNMT-py AWQ quantized model.

Config File

After converting, you will need a config file to run translate.py or run_mmlu_opennmt.py. Here's an example of the config:

```yaml
transforms: [sentencepiece]

# Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...
```

Priority

When considering your priority:

  • If your priority is small model files that fit in your GPU's VRAM, try AWQ, but it will be slow for large batch sizes.
  • AWQ models are faster than FP16 for batch size 1.
  • Read more: GitHub - casper-hansen/AutoAWQ

Important Note

  • There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV.
  • The original llm-awq from MIT is not regularly maintained, so we default to AutoAWQ.
  • If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible.

Offline Quantizer Script

We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.

vLLM Performance

Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:

  • Batch size 1: 80.5 tokens/second
  • Batch size 60: 98 tokens/second, with GEMV being 20-25% faster.
  • This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time.

Suggested labels

null

#504: AutoAWQ 4bit quantization

Similarity score: 0.87

- [ ] [Code search results](https://github.com/casper-hansen/AutoAWQ)

CONTENT:

TITLE: Code search results

DESCRIPTION:

Repository files navigation

README
MIT license

AutoAWQ

| Roadmap | Examples | Issues: Help Wanted |

AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 3x and reduces memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ was created and improved upon from the original work from MIT.

Latest News 🔥

[2023/12] Mixtral, LLaVa, QWen, Baichuan model support.
[2023/11] AutoAWQ inference has been integrated into 🤗 transformers. Now includes CUDA 12.1 wheels.
[2023/10] Mistral (Fused Modules), Bigcode, Turing support, Memory Bug Fix (Saves 2GB VRAM)
[2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
[2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available
[2023/08] PyPi package released and AutoModel class available

Install

Prerequisites

NVIDIA:

Your NVIDIA GPU(s) must be of Compute Capability 7.5. Turing and later architectures are supported.
Your CUDA version must be CUDA 11.8 or later.

AMD:

Your ROCm version must be ROCm 5.6 or later.

Install from PyPi

To install the newest AutoAWQ from PyPi, you need CUDA 12.1 installed.

```
pip install autoawq
```

Build from source

For CUDA 11.8, ROCm 5.6, and ROCm 5.7, you can install wheels from the release page:

```
pip install autoawq@https://github.com/casper-hansen/AutoAWQ/releases/download/v0.2.0/autoawq-0.2.0+cu118-cp310-cp310-linux_x86_64.whl
```

Or from the main branch directly:

```
pip install autoawq@git+https://github.com/casper-hansen/AutoAWQ.git
```

Or by cloning the repository and installing from source:

```
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
```

All three methods will install the latest and correct kernels for your system from AutoAWQ_Kernels.

If your system is not supported (i.e. not on the release page), you can build the kernels yourself by following the instructions in AutoAWQ_Kernels and then install AutoAWQ from source.

Supported models

The detailed support list:

| Models | Sizes |
|---|---|
| LLaMA-2 | 7B/13B/70B |
| LLaMA | 7B/13B/30B/65B |
| Mistral | 7B |
| Vicuna | 7B/13B |
| MPT | 7B/30B |
| Falcon | 7B/40B |
| OPT | 125m/1.3B/2.7B/6.7B/13B/30B |
| Bloom | 560m/3B/7B |
| GPTJ | 6.7B |
| Aquila | 7B |
| Aquila2 | 7B/34B |
| Yi | 6B/34B |
| Qwen | 1.8B/7B/14B/72B |
| BigCode | 1B/7B/15B |
| GPT NeoX | 20B |
| GPT-J | 6B |
| LLaVa | 7B/13B |
| Mixtral | 8x7B |
| Baichuan | 7B/13B |
| QWen | 1.8B/7B/14B/72B |

Usage

Under examples, you can find examples of how to quantize, run inference, and benchmark AutoAWQ models.
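
As a pointer, here is a minimal quantization sketch in the style of the AutoAWQ examples (the model and output paths are placeholders; the quant_config values are the commonly used defaults):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder source model
quant_path = "mistral-7b-instruct-v0.2-awq"         # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the fp16 model, quantize it with AWQ, and save the 4-bit checkpoint.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```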

INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of AWQ: GEMM and GEMV.

Suggested labels

null

#457: I keep running out of memory. What's the biggest model, and most context, I can run on 3060 12gb? With decent speed? : r/LocalLLaMA

Similarity score: 0.87

- [ ] [I keep running out of memory. What's the biggest model, and most context, I can run on 3060 12gb? With decent speed? : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1abihou/i_keep_running_out_of_memory_whats_the_biggest/)

Here's the reformatted text in Markdown format:

# GPU and Model Recommendations

**GPU Only:**

- You can use 7B Models at 8 bpw with 8K context, or maybe up to 12k context.
- If you wish to use 13B models, then you have to use 4bpw and limit yourself to 2K Context.

**GPU + CPU:**

- Use `.gguf` files to offload part of the model to VRAM.
- Check the disk usage when inferencing in the activity monitor app (or whatever it is called in your OS). If the disk usage is 100% (disk is swapping), then it is impossible to fit the model in RAM + VRAM and tokens per second will be very low.
- In that case, reduce context size and reduce bpw. The best models you can probably run now are:
  - OpenChat 3.5 7B at 8bpw (Use the latest version)
  - <https://huggingface.co/vicgalle/solarized-18B-dpo-GGUF> at 4bpw and 4K context.
- If you want to run Nous-Capybara-34b, switch to the 3bpw version and try to offload 35 layers to GPU. If you want to run bigger models, upgrade RAM to 64GB.

**Tip from /u/Working-Flatworm-531:**

- Just do not load kv in VRAM, you can use `ooba` to disable it.
- Also try lower quants, for example Q4_K_S is still good. You still wouldn't be able to run 34B models with good speed, but at least it's something.
- You can also check your BIOS and maybe increase RAM frequency. After that, you'd be able to run ~20B models at ~2t/s at 8k ~ 12k context.

**Recommended List from /u/Working-Flatworm-531:**

- Use Linux.
- Overclock RAM (if possible).
- Overclock CPU (if possible).
- Overclock GPU.
- Don't load kv cache in VRAM, instead load more layers to the VRAM.
- Use smaller quants.
- Use fast interface (didn't try Kobold, use Ooba).
- Check RAM (should be dual channel).

Suggested labels

{ "label-name": "memory-optimization", "description": "Strategies for optimizing memory usage when running models on 3060 12gb GPU.", "confidence": 94.88 }

#391: Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA

Similarity score: 0.86

- [ ] [Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/17h4rqz/speculative_decoding_in_exllama_v2_and_llamacpp/)

Speculative Decoding in Exllama v2 and llama.cpp Comparison

Discussion

We discussed speculative decoding (SD) in a previous thread. For those who are not aware of this feature, it allows LLM loaders to use a smaller "draft" model to help predict tokens for a larger model. In that thread, someone asked for tests of speculative decoding for both Exllama v2 and llama.cpp. Although I generally only run models in GPTQ, AWQ, or exl2 formats, I was interested in doing the exl2 vs. llama.cpp comparison.
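
For intuition, here is a toy sketch of the accept/reject loop behind (greedy) speculative decoding, with stand-in "models" that just read from fixed lookup tables. Real engines verify all k draft tokens in a single batched forward pass of the large model instead of one call per position.

```python
from typing import Callable, List

def speculative_decode(target: Callable[[List[int]], int],
                       draft: Callable[[List[int]], int],
                       prompt: List[int], k: int, n_new: int) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target keeps
    the longest prefix it agrees with, then contributes one token of its own."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target model checks the proposals; accept while they match its own choice.
        for t in proposal:
            expected = target(seq)       # one "expensive" call per position in this toy
            if expected == t:
                seq.append(t)            # accepted draft token
            else:
                seq.append(expected)     # first mismatch: take the target's token, stop
                break
        else:
            seq.append(target(seq))      # all k accepted: target adds a bonus token
    return seq[:len(prompt) + n_new]

# Toy models over a tiny vocabulary: the draft agrees with the target most of the time.
target_table = {0: 1, 1: 2, 2: 3, 3: 4, 4: 0}
draft_table  = {0: 1, 1: 2, 2: 3, 3: 0, 4: 0}   # disagrees after token 3
target = lambda seq: target_table[seq[-1]]
draft  = lambda seq: draft_table[seq[-1]]

print(speculative_decode(target, draft, prompt=[0], k=4, n_new=8))
```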

Test Setup

The tests were run on a 2x 4090, 13900K, DDR5 system. The screen captures of the terminal output of both are available below. If someone has experience with making llama.cpp speculative decoding work better, please share.

Exllama v2 Results

Model: Xwin-LM-70B-V0.1-4.0bpw-h6-exl2

Draft Model: TinyLlama-1.1B-1T-OpenOrca-GPTQ

Performance can be highly variable, but it goes from ~20 t/s without SD to 40-50 t/s with SD.

No SD:

```
Prompt processed in 0.02 seconds, 4 tokens, 200.61 tokens/second
Response generated in 10.80 seconds, 250 tokens, 23.15 tokens/second
```

With SD:

```
Prompt processed in 0.03 seconds, 4 tokens, 138.80 tokens/second
Response generated in 5.10 seconds, 250 tokens, 49.05 tokens/second
```

Suggested labels

{ "key": "speculative-decoding", "value": "Technique for using a smaller 'draft' model to help predict tokens for a larger model" }
