
Add GPTQ via Transformers. [Basic] #2365

Closed · wants to merge 16 commits

Conversation

@digisomni (Contributor) commented Sep 5, 2023

This enables support for GPTQ via Transformers, which seems like the cleanest and most efficient way to do it.

Also updated format.sh to allow 'greater than or equal to' tool versions.

Note: Perhaps the old way of quantizing can then be deprecated, since the maintainer recommends using AutoGPTQ. Alternatively, if there's a compelling reason to also keep manual support, the GPTQ module should be upgraded to use AutoGPTQ.

Closes #2215 #1745 #1671 #2375

Warning: This alters the package requirements for FastChat.
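
For reference, loading an already-quantized GPTQ checkpoint through Transformers looks roughly like the sketch below. This is not the PR's exact code; it assumes `optimum` and `auto-gptq` are installed, and the model id is just an illustrative example:

```python
# Minimal sketch: loading a pre-quantized GPTQ checkpoint via Transformers.
# Assumes the `optimum` and `auto-gptq` packages are installed; the model id
# below is an illustrative example, not something this PR ships.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The GPTQ quantization settings are read from the checkpoint's config.json,
# so no explicit quantization_config is needed for an already-quantized model.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```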

@digisomni digisomni marked this pull request as ready for review September 5, 2023 14:40
@digisomni digisomni marked this pull request as draft September 5, 2023 16:43
@digisomni digisomni marked this pull request as ready for review September 5, 2023 17:43
@digisomni digisomni marked this pull request as ready for review September 6, 2023 09:23
@digisomni (Contributor, Author)

Due to some functionality bugs, I refactored model_adapter quite a bit. It could use further refactoring, I think, but this is a start.

Please note this affects almost all adapters, so proper code review and testing should be employed where possible.

@leonxia1018 (Contributor)

I tried your branch with Exllama enabled but got this error:

[screenshot of the error]

We really need the Exllama kernel, as it is much faster than the default kernel.

@digisomni (Contributor, Author)

The official documentation recommends keeping it off. I'm not entirely sure why this happens; I would need to investigate to sort out how and when it can be enabled.
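
For what it's worth, the kernel choice is a single flag on GPTQConfig in the Transformers integration. A minimal sketch, assuming transformers around 4.33 where the flag was named `disable_exllama` (later releases renamed it to `use_exllama`); the model id is illustrative and this is not the PR's code:

```python
# Sketch only: toggling the exllama kernel when loading a GPTQ checkpoint.
# Assumes transformers ~4.33 (flag named `disable_exllama`); the model id is
# an illustrative example.
from transformers import AutoModelForCausalLM, GPTQConfig

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"

# Passing a GPTQConfig for an already-quantized model only overrides runtime
# options such as the kernel choice; it does not re-quantize the weights.
gptq_config = GPTQConfig(bits=4, disable_exllama=True)  # False enables exllama

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
```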

@digisomni (Contributor, Author)

Updated from master.

@digisomni (Contributor, Author)

Bump. :)

@merrymercy (Member)

@digisomni Sorry for the delay. This is a big refactor, so please allow some time for me to review.
Could you rebase onto main, following the recent updates (#2512, #2559)?

@digisomni (Contributor, Author)

Should be good to go now.

@surak (Collaborator) commented Nov 3, 2023

@merrymercy It looks like @digisomni did their part to get it working. Can we merge it?

@digisomni (Contributor, Author)

I've rebased again. I recommend reviewing, testing, and merging this sooner rather than later: as people merge PRs with their own adapter modifications, it muddies the waters because they're building on the old paradigm. The more time that elapses since the last rebase, the harder the next rebase becomes.

@digisomni (Contributor, Author)

I am closing this refactor in favor of using the vLLM worker. If at some point we need to push out features faster than vLLM, we can continue trying to maintain FastChat's native model worker, but for now it seems vLLM is going to be faster to the punch.

@digisomni digisomni closed this Feb 14, 2024
@merrymercy (Member) commented Feb 14, 2024

@digisomni I am sorry to hear that, but we have very limited bandwidth.
Currently, our focus is on the SGLang worker (#2928).

You are welcome to try it as well. The default HF worker does not seem well suited for high-performance deployment.

@surak (Collaborator) commented Feb 14, 2024

@digisomni Really, try SGLang. I already moved as much as I can to it (except Mixtral, which I haven't gotten running yet), as it's so much better than vLLM and not even comparable to the original worker.

@merrymercy (Member)

@surak Could you submit the issues you hit with the SGLang worker?

Development

Successfully merging this pull request may close these issues.

is there a possibility that fastchat can support llama2-70b quantized with gptq?