feat: Add Bitsandbytes quantization for transformer backend #1775
Comments
good point @fakezeta! do you already have the changes and want to come up with a PR? maybe we can take it from there
You know I'm old, give me some time 😄
That's fine, actually, thanks for the efforts!
fix: Transformer backend error on CUDA #1774 (#1823)
* fixes #1775 and #1774: add BitsAndBytes quantization and fix embeddings on CUDA devices
* manage 4-bit and 8-bit quantization: select the BitsAndBytes options with the `quantization:` parameter in the YAML model definition
* fix compilation errors in non-CUDA environments
…for Openvino and CUDA (#1892)
* fixes #1775 and #1774: add BitsAndBytes quantization and fix embeddings on CUDA devices
* manage 4-bit and 8-bit quantization: select the BitsAndBytes options with the `quantization:` parameter in the YAML model definition
* fix compilation errors in non-CUDA environments
* OpenVINO draft: first draft of the OpenVINO integration in the transformers backend
* first working implementation
* streaming working
* small fix for a regression on CUDA and XPU
* use the pip version of optimum[openvino]
* update backend/python/transformers/transformers_server.py

Signed-off-by: Ettore Di Giacinto <[email protected]>
Co-authored-by: Ettore Di Giacinto <[email protected]>
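The merged PRs above mention selecting the BitsAndBytes options with a `quantization:` parameter in the YAML model definition. A hedged sketch of what such a model definition might look like (the model name and the option values `bnb_4bit`/`bnb_8bit` are illustrative assumptions, not confirmed by this thread):

```yaml
# Illustrative LocalAI model definition (names and values are assumptions)
name: my-model
backend: transformers
parameters:
  model: some-org/some-model   # placeholder model id
quantization: bnb_4bit          # or bnb_8bit, per the "4bit and 8 bit" commit note
f16: true                       # proposed to switch compute_dtype to bfloat16
```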
Is your feature request related to a problem? Please describe.
Quantization is not available for the transformers backend.
Describe the solution you'd like
Add bitsandbytes 4-bit quantization, triggered by the user with the `low_vram` flag in the model definition. Additionally, I propose using the `f16` flag to change the `compute_dtype` to `bfloat16` for better optimization on Nvidia cards.

Describe alternatives you've considered
Additional context
I've implemented this while fixing #1774. Issue opened for tracking.
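The flag-to-quantization mapping proposed above can be sketched as a small pure-Python helper: `low_vram` enables 4-bit loading, and `f16` switches the 4-bit compute dtype to `bfloat16`. The helper name and the use of plain keyword-argument dicts are illustrative assumptions; in the real backend these kwargs would feed something like transformers' `BitsAndBytesConfig`.

```python
# Hypothetical sketch of the proposed flag mapping (not the actual LocalAI code).
# low_vram -> enable bitsandbytes 4-bit loading
# f16      -> use bfloat16 as the 4-bit compute dtype (faster on Ampere+ GPUs)

def bnb_kwargs(low_vram: bool, f16: bool) -> dict:
    """Build BitsAndBytesConfig-style keyword arguments from model-definition flags."""
    if not low_vram:
        return {}  # quantization stays disabled
    kwargs = {"load_in_4bit": True}
    if f16:
        # bfloat16 compute improves 4-bit matmul throughput on recent Nvidia cards
        kwargs["bnb_4bit_compute_dtype"] = "bfloat16"
    return kwargs

print(bnb_kwargs(low_vram=True, f16=True))
```

In the backend, the resulting dict would be passed to the model-loading path only when non-empty, so models without the flag keep their current unquantized behavior.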