
Add AQLM support (experimental) #5466

Merged
merged 4 commits into dev from aqlm on Mar 8, 2024

Conversation

@oobabooga (Owner) commented Feb 7, 2024

This method claims better accuracy than QuIP# at 2-bit precision: https://arxiv.org/abs/2401.06118

Not much had to be added, as fortunately the authors integrated it into HF transformers through custom Python code provided with the model checkpoint.

In the latest transformers version, AQLM is fully integrated, so all this PR does is add the aqlm requirement. AQLM models should be loaded with the transformers loader.

Quantized models can be found at: https://huggingface.co/ISTA-DASLab
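For reference, here is a minimal sketch of loading one of these checkpoints through plain transformers, assuming `aqlm` is installed (as added by this PR's requirements) and a recent transformers release; the model id below is illustrative, so check the hub page above for the actual repository names:

```python
# Minimal sketch: load an AQLM-quantized checkpoint with the transformers loader.
# Assumes a recent transformers release with AQLM integration and `aqlm` installed;
# the model id is illustrative -- see https://huggingface.co/ISTA-DASLab for real names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"  # illustrative id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "The AQLM quantization method"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```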

Old description: All that needs to be done is to install the requirements and load the model with `--trust-remote-code`.

I also had to disable 'low_cpu_mem_usage': True.

Example

python server.py --model BlackSamorez_Llama-2-70b-AQLM-2Bit-2x8-hf --trust-remote-code

Perplexity

On a small test that I have been running since the beginning of last year to compare different quantizations (same as in #4803):

| Model | Perplexity |
|---|---|
| llama-2-70b.ggmlv3.q4_K_M.bin | 4.552218437194824 |
| llama-65b.ggmlv3.q4_K_M.bin | 4.906391620635986 |
| BlackSamorez_Llama-2-70b-AQLM-2Bit-1x16-hf | 5.048985958099365 |
| relaxml/Llama-2-70b-E8P-2Bit (QuIP#) | 5.173901081085205 |
| llama-30b.ggmlv3.q4_K_M.bin | 5.215567588806152 |
| BlackSamorez_Llama-2-70b-AQLM-2Bit-2x8-hf | 5.535104751586914 |

The 1x16 variant is probably the best one, but I couldn't evaluate it due to lack of memory.
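For context, here is a rough sketch of the kind of small-sample perplexity measurement described above, assuming a causal LM and tokenizer already loaded through transformers; the sample list, sample count, and context length are placeholders, not the exact script behind the table:

```python
# Rough sketch of a small-sample perplexity measurement (not the exact script
# used for the table above). Assumes `model` and `tokenizer` come from
# transformers and `texts` is a small list of evaluation samples.
import math
import torch

def small_sample_perplexity(model, tokenizer, texts, max_tokens=1200):
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_tokens].to(model.device)
        with torch.no_grad():
            # transformers shifts labels internally and returns the mean token NLL
            loss = model(ids, labels=ids).loss
        n = ids.shape[1] - 1  # labels are shifted by one inside the model
        total_nll += loss.item() * n
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```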

@oobabooga marked this pull request as draft February 7, 2024 20:51
@oobabooga mentioned this pull request Feb 7, 2024
@tsengalb99

Hi @oobabooga, is the blacksamorez aqlm model an official aqlm model (do they have a repo?) or someone's attempt at quantizing with their code? I've been trying to find an officially released aqlm model but haven't been able to, and the aqlm paper is lacking some important details. The numbers in your table seem to indicate aqlm does worse than old quip#, which is contrary to what the aqlm arxiv claims.

On the quip# side, we do have some new and significantly improved 2, 3, and 4 bit models that I'm going to announce later this week. We've also had better 3 and 4 bit models for a while now under "E8PRVQ" on huggingface (iirc you expressed interest in this when we first announced quip#) but I never got around to announcing those.

@oobabooga
Owner Author

@tsengalb99 those models are included in the Google Colab notebook linked in the README for AQLM, so I think they are official:

https://github.com/Vahe1994/AQLM
https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/colab_example.ipynb

My numbers are not enough for a conclusion, as the dataset is small (some 10 samples of 1,200 tokens each). I haven't been able to do a bigger wikitext test so far.

It's exciting to hear that you have released better quantized models and that better ones are to come. I hope to be able to compare everything and find the Pareto frontiers some time this month.

@dalistarh

Hi @oobabooga and @tsengalb99 ,

One of the AQLM authors here.
To clarify things a bit:

  • The models you referenced are indeed "official," and the 1x16 model is indeed the most accurate configuration we have right now.

  • The trade-off between 1x16 and 2x8 is accuracy vs. inference speed. As shown in the paper, 1x16 is currently the most accurate 2-bit model, but in our current implementation it is about on par with FP16 in terms of decoding speed.

  • The 2x8 configuration is much faster to decode (~3x vs FP16), but drops more accuracy.

  • We are still in the process of optimizing quantization hyper-parameters and will update models as we find better configurations. We also plan to release Mixtral and CodeLLaMA2 70B Colab notebooks in the near future.

  • It is not clear to me whether the OOM error you got was because you are trying to quantize the models yourself (which can happen) or somehow at runtime (which would point to a bug, maybe on our side). Please open an issue in our repo if we can help with anything.

  • In any case, I would strongly recommend using a higher number of samples for accurate quantization (this helps both AQLM and QuIP variants).

Cheers,
Dan

@BlackSamorez

Disabling 'low_cpu_mem_usage': True shouldn't be necessary once the new accelerate version is released (this PR fixed the error).
Moreover, we're working on integrating AQLM into the newly added quantizers interface of transformers. Once that's done, trust_remote_code=True won't be necessary anymore.
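A small sketch of what loading should look like once those changes land, assuming the fixed accelerate release and the transformers quantizer integration; the model id is illustrative:

```python
# Sketch under two assumptions: a fixed accelerate release, and AQLM wired into
# the transformers quantizer interface. The model id is illustrative.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf",  # illustrative id
    low_cpu_mem_usage=True,  # previously had to be disabled (see PR description)
    device_map="auto",
    # trust_remote_code is no longer needed once the quantizer integration ships
)
```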

@tsengalb99

@oobabooga btw I just updated the quip-sharp repo with the latest code. The latest models are on HF and preprint is on arxiv as well.

@oobabooga
Owner Author

Thanks @tsengalb99. So, according to your data, AQLM does not surpass QuIP#, or at least not the updated QuIP#.

https://arxiv.org/pdf/2402.04396.pdf

@tsengalb99

Correct. It looks like AQLM also had some updated numbers for ICML vs. what we had in our preprint, but the latest QuIP# should still be better. I think the important thing right now is getting CUDA graphs to work with HF, because without CUDA graphs both methods spend most of their time on kernel launches. Arthur merged his PR, but at least with the way I was using CUDA graphs, the latest transformers still doesn't work. Need to look into this more.
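For context, a hedged sketch of the compile/CUDA-graph path being discussed, assuming a transformers version with static KV-cache support; the exact kwargs and behaviour may differ from what was available at the time of this thread, and the model id is illustrative:

```python
# Hedged sketch of reducing per-token kernel-launch overhead via torch.compile
# and a static KV cache. Assumes a transformers version that supports the
# `cache_implementation` generate argument; model id is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

# Compile the forward pass so decoding is not dominated by kernel launches.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=16, cache_implementation="static")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```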

@ArthurZucker

Main with compile is broken; huggingface/transformers#28937 should fix it!

@oobabooga
Owner Author

I have managed to test BlackSamorez_Llama-2-70b-AQLM-2Bit-1x16-hf now (downloaded 2024-02-07), and confirmed that it has lower perplexity than relaxml_Llama-2-70b-E8P-2Bit (downloaded 2023-12-04) in my small test. The data is in the updated table above.

That's not very informative, but there you go. It at least tells me that the performance of (old?) AQLM is pretty impressive, as old QuIP# was already very good.

@tsengalb99

The updated QuIP# models are under the same model cards on HF so if you get bored you should be able to rerun eval on new QuIP# by just calling the same command since HF will redownload new models.

@oobabooga deleted the branch dev February 17, 2024 21:53
@oobabooga closed this Feb 17, 2024
@oobabooga reopened this Feb 17, 2024
@BlackSamorez

Hi @oobabooga!
I just wanted to let you know that we've updated our fine-tuning setup for AQLM, once again greatly improving performance. The new results can be found in the repo README marked with a cross (they reuse the old model cards). We're currently in the process of applying this fine-tuning to the most popular models we've published so far.

@oobabooga marked this pull request as ready for review March 8, 2024 20:26
@oobabooga (Owner Author) commented Mar 8, 2024

Thanks for the info @BlackSamorez. I still haven't had time to do a thorough perplexity comparison, as there are many new methods now, including llama.cpp with calibration (imatrix), HQQ, EXL2 (updated a few months ago), AQLM, and updated QuIP# (@tsengalb99). Methods with calibration in particular require a preliminary study on what calibration dataset to use.

In any case, since AQLM is now fully integrated with the transformers library, I will merge this PR that just adds the aqlm requirement so that models available at https://huggingface.co/ISTA-DASLab can be loaded.

@oobabooga merged commit 0e6eb7c into dev Mar 8, 2024
@oobabooga deleted the aqlm branch March 8, 2024 20:39
bartowski1182 pushed a commit to bartowski1182/text-generation-webui that referenced this pull request Mar 23, 2024
PoetOnTheRun pushed a commit to PoetOnTheRun/text-generation-webui that referenced this pull request Oct 22, 2024