Add GPTQ via Transformers. [Basic] #2365
Conversation
Due to some functionality bugs, I refactored model_adapter quite a bit. It could use further refactoring, I think, but this is a start. Please note that this affects almost all adapters, so proper code review plus testing where possible should be employed.
Official documentation recommends it to be off. I am not entirely sure why this happens; I would need to investigate to sort out how and when it can be on.
Relevant notes are here: https://huggingface.co/docs/transformers/main/main_classes/quantization
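The thread doesn't say which setting this refers to; assuming it is the exllama kernel toggle covered on that page (my guess, not confirmed by the discussion), turning it off would look roughly like this sketch:

```python
# Sketch only, assuming the setting in question is the exllama kernel,
# which the HF docs describe as enabled by default for 4-bit GPTQ models.
from transformers import AutoModelForCausalLM, GPTQConfig

# use_exllama=False falls back to the plain CUDA kernels; older
# transformers releases spelled this option `disable_exllama=True`.
quant_config = GPTQConfig(bits=4, use_exllama=False)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ",  # illustrative GPTQ checkpoint
    quantization_config=quant_config,
    device_map="auto",
)
```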
Force-pushed from 14c0818 to e4758da.
Updated from master.
Bump. :)
@digisomni Sorry for the delay. This is a big refactor. Please allow some time for me to review.
Should be good to go now.
@merrymercy it looks like @digisomni did their part to get it working. Can we merge it?
Force-pushed from bc36cb1 to ec73e60.
I've rebased again. I recommend reviewing, testing, and merging this sooner rather than later: as people merge PRs with adapter modifications built on the old paradigm, the waters get muddied, and the more time that elapses since the last rebase, the harder rebasing becomes.
I am closing this refactor in favor of using the vLLM worker. If at some point we need to ship features faster than vLLM does, we can resume maintaining FastChat's native model worker, but for now it seems vLLM is going to be faster to the punch.
@digisomni I am sorry to hear that, but we have very limited bandwidth. You are welcome to try it as well. The default HF worker seems ill-suited for high-performance deployment.
@digisomni Really, try SGLang. I have already moved as much as I can (except Mixtral, which I haven't gotten running yet) to SGLang; it's so much better than vLLM and not even comparable to the original worker.
@surak could you file the issues you hit with the SGLang worker?
This enables support for GPTQ via Transformers, which seems the cleanest and most efficient way to do it.
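For context, loading a pre-quantized GPTQ checkpoint through Transformers looks roughly like this (a minimal sketch; the model id is illustrative, and the `optimum` and `auto-gptq` extras must be installed):

```python
# Minimal sketch: Transformers reads the GPTQ quantization config shipped
# in the model repo and dispatches to the quantized kernels automatically.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```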
Also updated format.sh to allow 'greater than or equal to' tool versions.
Note: Perhaps the old quantization path can then be deprecated, since the maintainer recommends using AutoGPTQ. Alternatively, if there is a compelling reason to keep manual support, the GPTQ module should be upgraded to use AutoGPTQ.
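For comparison, the direct AutoGPTQ path the maintainer points to looks roughly like this (a sketch; the model id is illustrative):

```python
# Sketch of loading the same kind of checkpoint with AutoGPTQ directly,
# i.e. the library the Transformers integration builds on.
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-Chat-GPTQ",  # illustrative checkpoint
    device="cuda:0",
    use_safetensors=True,
)
```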
Closes #2215 #1745 #1671 #2375
Warning: This alters the package requirements for FastChat.