
Marlin symmetric quantization and inference #320

Merged 8 commits on Feb 3, 2024
Conversation

@IlyasMoutawwakil (Collaborator) commented Jan 25, 2024

with @casper-hansen 🤗
experimental, still needs cleanup.

@IlyasMoutawwakil (Collaborator, Author)

Perplexity results

Symmetric AWQ Marlin model:

user@hf-dgx-01:/workspace/opt-bench$ python examples/eval.py --model_path vicuna-7b-v1.5-awq-marlin
Perplexity 7.138: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 166/166 [00:54<00:00,  3.04it/s]
Perplexity: 7.138

Zero Point AWQ model:

user@hf-dgx-01:/workspace/opt-bench$ python examples/eval.py --model_path vicuna-7b-v1.5-awq
Perplexity 7.013: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 166/166 [01:37<00:00,  1.71it/s]
Perplexity: 7.013
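For reference, the small perplexity gap between the two models comes from the quantization scheme itself: symmetric quantization (which Marlin requires) drops the per-group zero point that the standard AWQ format stores. A minimal plain-Python sketch of both schemes (illustrative only, not the AutoAWQ implementation; real kernels apply this group-wise with extra clipping details):

```python
def quantize_symmetric(x, bits=4):
    # Symmetric: range centered on zero, no zero point stored (Marlin's format).
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(v) for v in x) / qmax
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in x]
    return q, scale

def quantize_zero_point(x, bits=4):
    # Asymmetric: full [0, 2^bits - 1] range plus a zero point (classic AWQ format).
    qmax = 2 ** bits - 1                            # 15 for 4-bit
    lo, hi = min(x), max(x)
    scale = (hi - lo) / qmax
    zero = round(-lo / scale)
    q = [max(0, min(qmax, round(v / scale) + zero)) for v in x]
    return q, scale, zero

def dequant_sym(q, scale):
    return [v * scale for v in q]

def dequant_zp(q, scale, zero):
    return [(v - zero) * scale for v in q]
```

On real weight distributions, dropping the zero point typically costs a little reconstruction accuracy, which is consistent with the small perplexity regression above (7.138 vs 7.013).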

@IlyasMoutawwakil (Collaborator, Author) commented Jan 29, 2024

Updated with the new Marlin kernel.

Perf Bench

Batch Size = 1

GEMM

GPU: NVIDIA A100-SXM4-80GB
Model: IlyasMoutawwakil/vicuna-7b-v1.5-awq-gemm
Version: GEMM
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)     |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:------------------|
|            1 |               32 |              32 |            277.835 |           93.0104 | 4.55 GB (5.75%)   |
|            1 |               64 |              64 |           1749.38  |           93.7882 | 4.57 GB (5.78%)   |
|            1 |              128 |             128 |           2058.66  |           93.2005 | 4.61 GB (5.82%)   |
|            1 |              256 |             256 |           2349.42  |           93.2399 | 4.67 GB (5.90%)   |
|            1 |              512 |             512 |           2546.7   |           92.5384 | 4.80 GB (6.06%)   |
|            1 |             1024 |            1024 |           5635.29  |           92.893  | 5.06 GB (6.40%)   |
|            1 |             2048 |            2048 |           5343.16  |           76.1673 | 6.15 GB (7.77%)   |
|            1 |             4096 |            4096 |           4555.36  |           77.43   | 11.15 GB (14.08%) |
|            1 |             8192 |            8192 |           3133.58  |           57.5579 | 28.68 GB (36.23%) |

Marlin

GPU: NVIDIA A100-SXM4-80GB
Model: IlyasMoutawwakil/vicuna-7b-v1.5-awq-marlin
Version: Marlin
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)     |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:------------------|
|            1 |               32 |              32 |            277.605 |          102.415  | 4.52 GB (5.71%)   |
|            1 |               64 |              64 |           2707.15  |          103.143  | 4.55 GB (5.74%)   |
|            1 |              128 |             128 |           5128.98  |          102.101  | 4.58 GB (5.78%)   |
|            1 |              256 |             256 |           6342.32  |          104.245  | 4.64 GB (5.86%)   |
|            1 |              512 |             512 |           6664.59  |          103.435  | 4.77 GB (6.03%)   |
|            1 |             1024 |            1024 |           6254.59  |          104.344  | 5.04 GB (6.36%)   |
|            1 |             2048 |            2048 |           5427.11  |          103.412  | 6.13 GB (7.75%)   |
|            1 |             4096 |            4096 |           4450.35  |          105.25   | 11.12 GB (14.05%) |
|            1 |             8192 |            8192 |           2955.77  |           72.1539 | 28.65 GB (36.20%) |

ExllamaV2

GPU: NVIDIA A100-SXM4-80GB
Model: IlyasMoutawwakil/vicuna-7b-v1.5-awq-gemm
Version: ExllamaV2
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)     |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:------------------|
|            1 |               32 |              32 |             291.52 |          126.095  | 4.55 GB (5.75%)   |
|            1 |               64 |              64 |            1623.86 |          126.239  | 4.57 GB (5.78%)   |
|            1 |              128 |             128 |            4237.64 |          125.642  | 4.61 GB (5.82%)   |
|            1 |              256 |             256 |            6727.2  |          125.932  | 4.67 GB (5.90%)   |
|            1 |              512 |             512 |            8173.39 |          125.267  | 4.80 GB (6.06%)   |
|            1 |             1024 |            1024 |            8178.4  |          124.575  | 5.06 GB (6.40%)   |
|            1 |             2048 |            2048 |            6965.1  |          106.812  | 6.34 GB (8.01%)   |
|            1 |             4096 |            4096 |            5054.08 |          109.529  | 11.43 GB (14.44%) |
|            1 |             8192 |            8192 |            3230.48 |           73.2361 | 29.15 GB (36.82%) |

Batch Size = 8

GEMM

GPU: NVIDIA A100-SXM4-80GB
Model: IlyasMoutawwakil/vicuna-7b-v1.5-awq-gemm
Version: GEMM
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)     |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:------------------|
|            8 |               32 |              32 |            1346.31 |            708.00 | 4.66 GB (5.88%)   |
|            8 |               64 |              64 |            2919.15 |            711.08 | 4.79 GB (6.05%)   |
|            8 |              128 |             128 |            5629.30 |            707.50 | 5.04 GB (6.37%)   |
|            8 |              256 |             256 |            7791.61 |            708.44 | 5.56 GB (7.02%)   |
|            8 |              512 |             512 |            8755.70 |            716.32 | 6.63 GB (8.38%)   |
|            8 |             1024 |            1024 |            8376.71 |            659.16 | 10.84 GB (13.69%) |
|            8 |             2048 |            2048 |            6869.21 |            521.65 | 23.00 GB (29.06%) |
|            8 |             4096 |            4096 |            4909.63 |            335.85 | 62.33 GB (78.75%) |
|            8 |             8192 |            8192 |                OOM |               OOM | 74.05 GB (93.56%) |

Marlin

GPU: NVIDIA A100-SXM4-80GB
Model: IlyasMoutawwakil/vicuna-7b-v1.5-awq-marlin
Version: Marlin
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)     |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:------------------|
|            8 |               32 |              32 |            2187.03 |            819.90 | 4.63 GB (5.85%)   |
|            8 |               64 |              64 |            7313.37 |            817.40 | 4.76 GB (6.01%)   |
|            8 |              128 |             128 |            7714.85 |            802.89 | 5.01 GB (6.33%)   |
|            8 |              256 |             256 |            8115.59 |            811.36 | 5.53 GB (6.98%)   |
|            8 |              512 |             512 |            7940.38 |            810.75 | 6.61 GB (8.35%)   |
|            8 |             1024 |            1024 |            7536.50 |            812.91 | 10.81 GB (13.66%) |
|            8 |             2048 |            2048 |            6143.38 |            697.80 | 22.97 GB (29.03%) |
|            8 |             4096 |            4096 |            4438.61 |            401.21 | 62.31 GB (78.72%) |
|            8 |             8192 |            8192 |                OOM |               OOM | 74.03 GB (93.53%) |

ExllamaV2

GPU: NVIDIA A100-SXM4-80GB
Model: IlyasMoutawwakil/vicuna-7b-v1.5-awq-gemm
Version: ExllamaV2
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)     |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:------------------|
|            8 |               32 |              32 |            2170.56 |            855.54 | 4.66 GB (5.88%)   |
|            8 |               64 |              64 |            8809.36 |            855.22 | 4.79 GB (6.05%)   |
|            8 |              128 |             128 |            9950.85 |            843.67 | 5.04 GB (6.37%)   |
|            8 |              256 |             256 |           10298.25 |            805.76 | 5.56 GB (7.02%)   |
|            8 |              512 |             512 |           10324.64 |            737.96 | 6.91 GB (8.73%)   |
|            8 |             1024 |            1024 |            9187.92 |            640.71 | 11.30 GB (14.28%) |
|            8 |             2048 |            2048 |            7291.30 |            508.13 | 23.84 GB (30.12%) |
|            8 |             4096 |            4096 |            5026.98 |            332.14 | 63.93 GB (80.77%) |
|            8 |             8192 |            8192 |                OOM |               OOM | 77.15 GB (97.47%) |
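As a rough way to read the tables, the relative decode-throughput gain of one kernel over another can be computed with a one-liner (the example numbers are copied from the batch-size-1, 2048/2048 decode rows above):

```python
def speedup(new_tps, baseline_tps):
    """Relative throughput gain of new_tps over baseline_tps, as a percentage."""
    return 100.0 * (new_tps / baseline_tps - 1.0)

# Decode tokens/s at batch size 1, prefill/decode 2048/2048, from the tables above.
gemm, marlin, exllamav2 = 76.1673, 103.412, 106.812
print(f"Marlin vs GEMM:    {speedup(marlin, gemm):+.1f}%")
print(f"ExllamaV2 vs GEMM: {speedup(exllamav2, gemm):+.1f}%")
```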

@vince62s

Nice work. Might the GPU architecture impact things?

@IlyasMoutawwakil (Collaborator, Author)

@vince62s I'd say "definitely", given that the kernel contains many PTX assembly blocks and has a hard constraint on architecture. From the kernel's repo, https://github.com/IST-DASLab/marlin:

> NVIDIA GPU with compute capability >= 8.0 (Ampere or Ada, Marlin is not yet optimized for Hopper)
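That constraint can be expressed as a simple guard (a sketch: the `(8, 0)` threshold comes from the quoted README, and in practice the tuple would come from `torch.cuda.get_device_capability()`):

```python
def supports_marlin(capability):
    """True if the (major, minor) compute capability meets Marlin's >= 8.0 requirement."""
    return tuple(capability) >= (8, 0)

# In a CUDA environment one would call, e.g.:
#   import torch
#   supports_marlin(torch.cuda.get_device_capability())
```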

@vince62s

> @vince62s I'd say "definitely" based on the fact that the kernel has many PTX assembly blocks and a hard constraint on architecture, from the kernel's repo https://github.com/IST-DASLab/marlin
>
> NVIDIA GPU with compute capability >= 8.0 (Ampere or Ada, Marlin is not yet optimized for Hopper)

Looking forward to seeing numbers at even higher batch sizes (32/64), which might be reasonable for sequence lengths of 1024/2048 once Marlin is optimized for Hopper.

@casper-hansen (Owner)

Looks good to me! Fixed a small bug with the workspace after the latest update to Marlin. Nice to have a refactor of the Quantizer as well.

@casper-hansen merged commit 34085ed into main on Feb 3, 2024
@IlyasMoutawwakil (Collaborator, Author) commented Feb 4, 2024

@casper-hansen awesome! Apologies for not cleaning up the PR myself 😅 thanks for taking care of it 🙏
By the way, the workspaces can be created the same way as in ExLlamaV2, i.e. one workspace per device instead of one per linear layer.
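The per-device workspace idea can be sketched as a module-level cache keyed by device, so every Marlin linear on a given GPU shares one scratch buffer (illustrative only; `make_buffer` is a stand-in for the real CUDA tensor allocation):

```python
# One shared workspace per device, instead of one per linear layer.
_WORKSPACES = {}

def get_workspace(device, size, make_buffer=lambda n: [0] * n):
    """Return the cached workspace for `device`, reallocating only if it is too small."""
    ws = _WORKSPACES.get(device)
    if ws is None or len(ws) < size:
        ws = make_buffer(size)
        _WORKSPACES[device] = ws
    return ws
```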

@casper-hansen deleted the marlin-support branch on Feb 12, 2024
@jeromeku commented Apr 2, 2024

@IlyasMoutawwakil

Great work adapting Marlin to AWQ.

I'm currently looking to do the same -- that is, adapt optimized inference kernels to different quantization formats.

Roughly, what are the major changes that need to be made to adapt a quantization format in order to use a kernel such as Marlin? Specifically, how do the quantized weights, scales, and zeros need to be preprocessed in order to conform to the required layout for Marlin, AWQ specific GEMV / GEMM, exLlama, etc.?

E.g., starting from 4-bit quantized weights packed as an int in the order [0, 1, 2, 3, 4, 5, 6, 7], what permutations / shuffling / packing needs to be done in order to use each of these kernels?
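For what it's worth, the common first step of all these layouts is the nibble packing itself; the kernels then differ in the permutation applied before packing (and, for Marlin, in an additional tile-level reordering of the weight matrix). A minimal sketch in plain Python (illustrative, not any specific kernel's actual layout):

```python
def pack_int4(vals):
    """Pack eight 4-bit values (0..15) into one 32-bit word, value i in bits [4*i, 4*i+4)."""
    assert len(vals) == 8 and all(0 <= v < 16 for v in vals)
    word = 0
    for i, v in enumerate(vals):
        word |= v << (4 * i)
    return word

def unpack_int4(word):
    """Inverse of pack_int4: recover the eight nibbles in order."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

def pack_permuted(vals, order):
    # A kernel-specific layout amounts to permuting the nibble order before packing,
    # so that a dequant shuffle instruction reads them out efficiently.
    return pack_int4([vals[i] for i in order])
```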
