Mixtral: Mixture of Experts quantization #251
Conversation
Referring to your code, I implemented Mixtral (transformers==4.36.2) in AutoAWQ 0.1.7 as a single-file gist. All the FFN layers of the MoE have …
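For readers following along, here is a minimal sketch of how the per-expert FFN linears of a Mixtral block (w1/w2/w3 in transformers 4.36) could be collected for weight-only quantization. The helper name and grouping are illustrative assumptions, not the gist's or AutoAWQ's actual API.

```python
import torch.nn as nn

def collect_moe_ffn_linears(model):
    """Illustrative helper: gather every expert FFN linear in a Mixtral model.

    Assumes the transformers 4.36 module layout:
    model.model.layers[i].block_sparse_moe.experts[j].{w1, w2, w3}
    """
    ffn_linears = []
    for layer in model.model.layers:
        moe = layer.block_sparse_moe
        for expert in moe.experts:
            # w1/w3 feed the gated activation, w2 projects back down;
            # the router (moe.gate) is typically left in higher precision.
            ffn_linears.extend([expert.w1, expert.w2, expert.w3])
    return [m for m in ffn_linears if isinstance(m, nn.Linear)]
```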
I have a similar implementation and got perplexity results like this:
To my surprise, just using RTN gets very strong performance.
I checked just before this, and I also think that using RTN alone produces better results. Thanks for sharing.
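For context, here is a minimal sketch of round-to-nearest (RTN) weight quantization, i.e. per-group asymmetric rounding with no activation-aware scaling. The bit width and group size are illustrative defaults, not AutoAWQ's exact settings.

```python
import torch

def rtn_quantize(weight: torch.Tensor, n_bits: int = 4, group_size: int = 128):
    """Round-to-nearest, per-group asymmetric quantization of a 2-D weight.

    Returns a dequantized (fake-quantized) tensor so it can be dropped back
    into the model for perplexity checks.
    """
    out_features, in_features = weight.shape
    assert in_features % group_size == 0, "in_features must be divisible by group_size"
    w = weight.reshape(out_features, in_features // group_size, group_size)

    w_max = w.amax(dim=-1, keepdim=True)
    w_min = w.amin(dim=-1, keepdim=True)
    q_max = 2 ** n_bits - 1

    scale = (w_max - w_min).clamp(min=1e-5) / q_max
    zero = (-w_min / scale).round()

    # Round to the nearest quantization level, then map back to floats.
    q = (w / scale + zero).round().clamp(0, q_max)
    return ((q - zero) * scale).reshape(out_features, in_features)
```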
Thanks for a reference implementation. I have been exhausting GPU credits trying to scale this model effectively. There is no specific reason for the current approach other than it worked best in my tests; however, your implementation is better, as is evident from the results. Do you want to raise a PR to merge your changes into this branch/PR so we can merge it into AutoAWQ? I can also do it if you don't mind.
I updated the code with the new layer quantization and got a perplexity of 4.294. What did you do differently from the current implementation?
I checked your commit, and it's fine to use it as is. Feel free to use it.
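Since the thread compares perplexity numbers, here is a rough sketch of a standard sliding-window perplexity evaluation on WikiText-2; the dataset choice, context length, and stride are assumptions and may not match the script used above.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def wikitext2_perplexity(model_id: str, seq_len: int = 2048) -> float:
    # Context length and dataset split are illustrative choices.
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    model.eval()

    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    ids = tok(text, return_tensors="pt").input_ids

    nlls = []
    for start in range(0, ids.size(1) - seq_len, seq_len):
        chunk = ids[:, start : start + seq_len].to(model.device)
        # Labels equal inputs; the model shifts them internally for next-token loss.
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss)

    return torch.exp(torch.stack(nlls).mean()).item()
```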
Did you run an MMLU benchmark on the quantized model? I'm a bit disappointed: I'm getting 60 vs. 71.
Did you evaluate with fused modules?
Well, I'm using OpenNMT-py, but I benchmarked your code speed-wise, and the only two big contributors are FasterTransformer (which I replaced with FlashAttention-2 plus a KV cache doing the same thing) and "your" RMSNorm kernel; those two account for the nice speed. By the way, GEMV works fine for batches > 1; it's just a little slower than GEMM, but it works OK.
Nice, I have been looking to replace the FasterTransformer modules with Flash Attention. The kernels in AutoAWQ are imported from other projects to maximize inference speed and create generalized modules. GEMV is great in many cases, especially for local models!
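To illustrate the kind of replacement mentioned above, here is a rough sketch of causal attention that keeps a KV cache and calls FlashAttention-2. Cache handling by plain concatenation is a simplification for clarity, not OpenNMT-py's or AutoAWQ's actual module.

```python
import torch
from flash_attn import flash_attn_func  # flash-attn >= 2.x

def cached_flash_attention(q, k, v, kv_cache=None):
    """q, k, v: (batch, new_seq_len, n_heads, head_dim), fp16/bf16 on GPU.

    Appends the new keys/values to a simple concatenation-based cache and
    runs causal FlashAttention-2 over the full history. Real implementations
    preallocate the cache instead of concatenating every step.
    """
    if kv_cache is not None:
        past_k, past_v = kv_cache
        k = torch.cat([past_k, k], dim=1)
        v = torch.cat([past_v, v], dim=1)

    # causal=True masks future positions; with a single new query token this
    # reduces to attending over the whole cache.
    out = flash_attn_func(q, k, v, causal=True)
    return out, (k, v)
```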
False alarm: I am getting 67.1 using the right rope theta. By the way, don't forget to make it an option, because @younesbelkada is already tagging this PR :)
Thanks @vince62s, do you mean in the transformers integration for fused modules?
Yes, here: https://github.com/casper-hansen/AutoAWQ/blob/main/awq/modules/fused/attn.py#L224
Ah yes, makes sense. Thanks for the heads-up; I will update once I raise the PR in transformers!
Ah, this was probably the problem I had with perplexity earlier. I forgot to modify everything to support the correct theta value. Thanks for pointing it out @vince62s, I now remember this being a problem :)
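For reference, the fix being discussed is to read the model's rope theta from its config instead of hard-coding the usual 10000 (Mixtral ships with a much larger base). A minimal sketch of the RoPE frequency setup, with the parameter name assumed from the transformers config:

```python
import torch

def build_rope_cache(head_dim: int, max_positions: int, rope_theta: float = 10000.0):
    """Precompute RoPE cos/sin tables.

    rope_theta should come from the model config (e.g. config.rope_theta),
    not a hard-coded 10000; Mixtral uses a larger base frequency.
    """
    inv_freq = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_positions).float()
    freqs = torch.outer(positions, inv_freq)   # (max_positions, head_dim / 2)
    emb = torch.cat([freqs, freqs], dim=-1)    # (max_positions, head_dim)
    return emb.cos(), emb.sin()

# Example: pull theta from the config rather than assuming the default.
# cos, sin = build_rope_cache(head_dim, max_positions,
#                             rope_theta=getattr(config, "rope_theta", 10000.0))
```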
For the sake of completeness, I ran the same MMLU script on the HF model from @casper-hansen.
Please reference the mixtral_quant script as it has special instructions!
Glad to hear it’s performing well on MMLU. Can you share your benchmark script? I’m in the process of adding more evaluation scripts to AutoAWQ. I was thinking of using vLLM for optimized parallel evaluation.
I am using my own adaptation (for OpenNMT-py) of this script: https://github.com/FranxYao/chain-of-thought-hub/tree/main/MMLU, which is close to the original MMLU implementation (slightly different from the lm_eval harness used by the HF leaderboard).
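As a rough illustration of that style of MMLU evaluation (not the linked script itself), one common approach is to build a few-shot prompt per question and score the answer by comparing the logits of the A/B/C/D tokens. The prompt format and the leading-space handling below are assumptions.

```python
import torch

CHOICES = ["A", "B", "C", "D"]

@torch.no_grad()
def score_mmlu_question(model, tokenizer, few_shot_prefix, question, options, answer_idx):
    """Return 1 if the argmax over the A/B/C/D token logits matches the gold answer.

    few_shot_prefix: k worked examples in the same format (assumed, in the
    style of chain-of-thought-hub prompts).
    """
    prompt = few_shot_prefix + question + "\n"
    for letter, option in zip(CHOICES, options):
        prompt += f"{letter}. {option}\n"
    prompt += "Answer:"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    logits = model(**inputs).logits[0, -1]  # next-token logits

    # Compare only the four answer-letter tokens (assuming a leading space).
    choice_ids = [tokenizer(f" {c}", add_special_tokens=False).input_ids[-1] for c in CHOICES]
    pred = torch.stack([logits[i] for i in choice_ids]).argmax().item()
    return int(pred == answer_idx)
```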
BIG NOTE: Pending more perplexity numbers. Looking to see if we can optimize before merging.