Would it be possible to quantize a model (to a lower bpw) that has already been quantized? If so, would the result be lower quality than quantizing from the original model, and are there combinations that would lead to better results than others (e.g. 8bpw -> 4bpw vs. 6bpw -> 4bpw)?
Replies: 1 comment
It's not directly supported, no. It expects every linear layer loaded to have an FP16 weight matrix to work on, and while this could be converted on the fly from quantized weights, that functionality just isn't there at the moment.

I could add it, I suppose, but re-quantizing a model would still be less than ideal. FP16 -> 8bpw -> 4bpw would produce a somewhat degraded model compared to FP16 -> 4bpw, and FP16 -> 6bpw -> 4bpw would most likely be worse still. It would also use a bit more memory than converting from FP16.

But it's definitely doable, in principle, with some tweaks to the code.
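For intuition on why chaining quantization steps compounds error, here's a minimal sketch using plain NumPy and naive round-to-nearest quantization with per-tensor max-abs scaling. This is not exllama's actual quantization scheme (which is group-wise with error correction), only an illustration of the principle: the inner quantize/dequantize round-trip plays the role of the "convert on the fly back to FP16" step described above, and the second rounding can only move each weight further from the final 4-bit grid, never closer.

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest quantization, returned in dequantized FP form
    so the result can be fed back in as the 'FP16 weight matrix' for a second pass."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit, 127 for 8-bit
    scale = np.abs(w).max() / qmax        # per-tensor scale (real schemes use groups)
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return (q * scale).astype(w.dtype)    # dequantize: the "on the fly" FP16 view

rng = np.random.default_rng(0)
w = rng.normal(size=1_000_000).astype(np.float32)  # stand-in for an FP16 weight matrix

paths = {
    "FP16 -> 4bpw":         quantize_rtn(w, 4),
    "FP16 -> 8bpw -> 4bpw": quantize_rtn(quantize_rtn(w, 8), 4),
    "FP16 -> 6bpw -> 4bpw": quantize_rtn(quantize_rtn(w, 6), 4),
}
for name, wq in paths.items():
    print(f"{name:22s} MSE = {np.mean((w - wq) ** 2):.4e}")
```

On a random Gaussian tensor this typically prints the direct FP16 -> 4bpw path with the lowest MSE, the 8bpw intermediate slightly higher, and the 6bpw intermediate higher still, matching the ordering in the reply: the coarser the intermediate grid, the more weights get nudged across a 4-bit rounding boundary on the second pass.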