Update doc #499

Merged
merged 3 commits on Jun 11, 2024
4 changes: 2 additions & 2 deletions docs/examples.md
@@ -198,7 +198,7 @@ model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=False, use_qbi

You can also load an AWQ model by using AutoModelForCausalLM, just make sure you have AutoAWQ installed.
Note that not all models will have fused modules when loading from transformers.
See more [documentation here](https://huggingface.co/docs/transformers/main/en/quantization#awq).
See more [documentation here](https://huggingface.co/docs/transformers/main/en/quantization/awq).
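
For reference, a minimal sketch of this loading path through transformers (the model ID below is a placeholder, not taken from the original example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder AWQ checkpoint; any AWQ-quantized repository works the same way.
model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# transformers reads the AWQ quantization config from the checkpoint and
# dispatches to the AutoAWQ kernels, so autoawq must be installed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("What is AWQ quantization?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```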

```python
import torch
@@ -327,4 +327,4 @@ generation_output = model.generate(
)

print(processor.decode(generation_output[0], skip_special_tokens=True))
```
```
23 changes: 14 additions & 9 deletions docs/index.md
@@ -15,15 +15,21 @@ Example inference speed (RTX 4090, Ryzen 9 7950X, 64 tokens):
- Install: `pip install autoawq`.
- Your torch version must match the build version, i.e. you cannot use torch 2.0.1 with a wheel that was built with 2.2.0.
- For AMD GPUs, inference runs through ExLlamaV2 kernels without fused layers. Pass the following arguments to load a model on AMD GPUs:
```python
model = AutoAWQForCausalLM.from_quantized(
...,
fuse_layers=False,
use_exllama_v2=True
)
```
- For CPU devices, install intel-extension-for-transformers (ITREX) with `pip install intel-extension-for-transformers`. The latest torch is required, since ITREX is built against the latest torch release (ITREX 1.4 was built with torch 2.2); if you build ITREX from source, make sure the torch versions are consistent. Pass `use_qbits=True` when loading on CPU; fused layers are not yet supported on CPU devices.

```python
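# AMD GPUs: inference runs through the ExLlamaV2 kernels, so fused layers must be disabled.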
model = AutoAWQForCausalLM.from_quantized(
...,
fuse_layers=False,
use_exllama_v2=True
)
```
```python
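# CPU devices: requires intel-extension-for-transformers (QBits); fused layers are not yet supported.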
model = AutoAWQForCausalLM.from_quantized(
...,
fuse_layers=False,
use_qbits=True
)
```

## Supported models

@@ -50,4 +56,3 @@ The detailed support list:
| LLaVa | 7B/13B |
| Mixtral | 8x7B |
| Baichuan | 7B/13B |
| QWen | 1.8B/7B/14B/72B |