
Add AQLM support (experimental) #5466

Merged
merged 4 commits into dev from aqlm on Mar 8, 2024

Conversation

@oobabooga (Owner) commented Feb 7, 2024

This method claims better accuracy than QuIP# at 2-bit precision: https://arxiv.org/abs/2401.06118

Not much had to be added, as fortunately the authors integrated it into HF transformers through custom Python code provided with the model checkpoint.

In the latest transformers version, AQLM is fully integrated, so all this PR does is add the aqlm requirement. AQLM models should be loaded with the transformers loader.

Quantized models can be found at: https://huggingface.co/ISTA-DASLab
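For reference, here is a minimal sketch of loading one of these checkpoints through plain transformers, assuming `aqlm` is installed (as added by this PR's requirements) and a recent transformers release; the model id below is illustrative, so check the hub page above for the actual repository names:

```python
# Minimal sketch: load an AQLM-quantized checkpoint with the transformers loader.
# Assumes a recent transformers release with AQLM integration and `aqlm` installed;
# the model id is illustrative -- see https://huggingface.co/ISTA-DASLab for real names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"  # illustrative id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "The AQLM quantization method"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```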

Old description: All that needs to be done is to install the requirements and load the model with `--trust-remote-code`.

I also had to disable 'low_cpu_mem_usage': True.

Example

python server.py --model BlackSamorez_Llama-2-70b-AQLM-2Bit-2x8-hf --trust-remote-code

Perplexity

On a small test that I have been running since the beginning of last year to compare different quantizations (same as in #4803):

| Model | Perplexity |
|---|---|
| llama-2-70b.ggmlv3.q4_K_M.bin | 4.552218437194824 |
| llama-65b.ggmlv3.q4_K_M.bin | 4.906391620635986 |
| BlackSamorez_Llama-2-70b-AQLM-2Bit-1x16-hf | 5.048985958099365 |
| relaxml/Llama-2-70b-E8P-2Bit (QuIP#) | 5.173901081085205 |
| llama-30b.ggmlv3.q4_K_M.bin | 5.215567588806152 |
| BlackSamorez_Llama-2-70b-AQLM-2Bit-2x8-hf | 5.535104751586914 |

The 1x16 variant is probably the best one, but I couldn't evaluate it due to lack of memory.
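For context, here is a rough sketch of the kind of small-sample perplexity measurement described above, assuming a causal LM and tokenizer already loaded through transformers; the sample list, sample count, and context length are placeholders, not the exact script behind the table:

```python
# Rough sketch of a small-sample perplexity measurement (not the exact script
# used for the table above). Assumes `model` and `tokenizer` come from
# transformers and `texts` is a small list of evaluation samples.
import math
import torch

def small_sample_perplexity(model, tokenizer, texts, max_tokens=1200):
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_tokens].to(model.device)
        with torch.no_grad():
            # transformers shifts labels internally and returns the mean token NLL
            loss = model(ids, labels=ids).loss
        n = ids.shape[1] - 1  # labels are shifted by one inside the model
        total_nll += loss.item() * n
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```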

@oobabooga marked this pull request as draft February 7, 2024 20:51
@oobabooga mentioned this pull request Feb 7, 2024
@tsengalb99

Hi @oobabooga, is the blacksamorez aqlm model an official aqlm model (do they have a repo?) or someone's attempt at quantizing with their code? I've been trying to find an officially released aqlm model but haven't been able to, and the aqlm paper is lacking some important details. The numbers in your table seem to indicate aqlm does worse than old quip#, which is contrary to what the aqlm arxiv claims.

On the quip# side, we do have some new and significantly improved 2, 3, and 4 bit models that I'm going to announce later this week. We've also had better 3 and 4 bit models for a while now under "E8PRVQ" on huggingface (iirc you expressed interest in this when we first announced quip#) but I never got around to announcing those.

@oobabooga
Owner Author

@tsengalb99 those models are included in the Google Colab notebook linked in the README for AQLM, so I think they are official:

https://github.com/Vahe1994/AQLM
https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/colab_example.ipynb

My numbers are not enough for a conclusion, as the dataset is small (some 10 samples of 1,200 tokens each). I haven't been able to do a bigger wikitext test so far.

It's exciting to hear that you have released better quantized models and that better ones are to come. I hope to be able to compare everything and find the Pareto frontiers some time this month.

@dalistarh

Hi @oobabooga and @tsengalb99 ,

One of the AQLM authors here.
To clarify things a bit:

  • The models you referenced are indeed "official," and the 1x16 model is indeed the most accurate configuration we have right now.

  • The trade-off between 1x16 and 2x8 is accuracy vs. inference speed. As shown in the paper, 1x16 is currently the most accurate 2-bit model, but in our current implementation it is about on par with FP16 in terms of decoding speed.

  • The 2x8 configuration is much faster to decode (~3x vs FP16), but drops more accuracy.

  • We are still in the process of optimizing quantization hyper-parameters and will update models as we find better configurations. We also plan to release Mixtral and CodeLLaMA2 70B Colab notebooks in the near future.

  • It is not clear to me whether the OOM error you got was because you are trying to quantize the models yourself (which can happen) or somehow at runtime (which would point to a bug, maybe on our side). Please open an issue in our repo if we can help with anything.

  • In any case, I would strongly recommend using a higher number of samples for accurate quantization (this helps both AQLM and QuIP variants).

Cheers,
Dan

@BlackSamorez

Disabling 'low_cpu_mem_usage': True shouldn't be necessary once the new accelerate version is released (this PR fixed the error).
Moreover, we're working on integrating AQLM into the newly added quantizers interface of transformers. Once that's done, trust_remote_code=True won't be necessary anymore.
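A small sketch of what loading should look like once those changes land, assuming the fixed accelerate release and the transformers quantizer integration; the model id is illustrative:

```python
# Sketch under two assumptions: a fixed accelerate release, and AQLM wired into
# the transformers quantizer interface. The model id is illustrative.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf",  # illustrative id
    low_cpu_mem_usage=True,  # previously had to be disabled (see PR description)
    device_map="auto",
    # trust_remote_code is no longer needed once the quantizer integration ships
)
```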

@tsengalb99

@oobabooga btw I just updated the quip-sharp repo with the latest code. The latest models are on HF and preprint is on arxiv as well.

@oobabooga
Owner Author

Thanks @tsengalb99. So, according to your data, AQLM does not surpass QuIP#, or at least not the updated QuIP#.

https://arxiv.org/pdf/2402.04396.pdf

@tsengalb99

Correct. It looks like AQLM also had some updated numbers for ICML vs. what we had in our preprint, but the latest QuIP# should still be better. I think the important thing right now is getting CUDA graphs to work with HF, because without CUDA graphs both methods spend most of their time on kernel launches. Arthur merged his PR, but at least with the way I was using CUDA graphs, the latest transformers still doesn't work. Need to look into this more.
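For context, a hedged sketch of the compile/CUDA-graph path being discussed, assuming a transformers version with static KV-cache support; the exact kwargs and behaviour may differ from what was available at the time of this thread, and the model id is illustrative:

```python
# Hedged sketch of reducing per-token kernel-launch overhead via torch.compile
# and a static KV cache. Assumes a transformers version that supports the
# `cache_implementation` generate argument; model id is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

# Compile the forward pass so decoding is not dominated by kernel launches.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=16, cache_implementation="static")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```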

@ArthurZucker

Main with compile is broken; huggingface/transformers#28937 should fix it!

@oobabooga
Owner Author

I have managed to test BlackSamorez_Llama-2-70b-AQLM-2Bit-1x16-hf now (downloaded 2024-02-07), and confirmed that it has lower perplexity than relaxml_Llama-2-70b-E8P-2Bit (downloaded 2023-12-04) in my small test. The data is in the updated table above.

That's not very informative, but there you go. It at least tells me that the performance of (old?) AQLM is pretty impressive, as old QuIP# was already very good.

@tsengalb99

The updated QuIP# models are under the same model cards on HF so if you get bored you should be able to rerun eval on new QuIP# by just calling the same command since HF will redownload new models.

@oobabooga deleted the branch dev February 17, 2024 21:53
@oobabooga closed this Feb 17, 2024
@oobabooga reopened this Feb 17, 2024
@BlackSamorez

Hi @oobabooga!
I just wanted to let you know that we've updated our fine-tuning setup for AQLM, once again greatly improving performance. The new results can be found in the repo README marked with a cross (they reuse the old model cards). We're currently in the process of applying this fine-tuning to the most popular models we've published so far.

@oobabooga marked this pull request as ready for review March 8, 2024 20:26
@oobabooga (Owner Author) commented Mar 8, 2024

Thanks for the info @BlackSamorez. I still haven't had time to do a thorough perplexity comparison, as there are many new methods now, including llama.cpp with calibration (imatrix), HQQ, EXL2 (updated a few months ago), AQLM, and updated QuIP# (@tsengalb99). Methods with calibration in particular require a preliminary study on what calibration dataset to use.

In any case, since AQLM is now fully integrated with the transformers library, I will merge this PR that just adds the aqlm requirement so that models available at https://huggingface.co/ISTA-DASLab can be loaded.

@oobabooga merged commit 0e6eb7c into dev Mar 8, 2024
@oobabooga deleted the aqlm branch March 8, 2024 20:39
bartowski1182 pushed a commit to bartowski1182/text-generation-webui that referenced this pull request Mar 23, 2024
PoetOnTheRun pushed a commit to PoetOnTheRun/text-generation-webui that referenced this pull request Oct 22, 2024