Replies: 2 comments
-
You'll probably get a better/more detailed explanation from someone else. The simplified version is the parameters are divided into chunks/groups and then there's extra metadata about the scale. This means each chunk can be quantized more accurately. Just forcing everything into 4 bits for example means that the scale has to be able to handle the highest and lowest values. However, parts of the data may have a bunch of values in one range, other parts may have values in a different range. I.E. if one chunk has This is just explaining the idea in a very general/simplified way. Anyway, if you average out how many bits are used including the scale values that occur per chunk, you can end up with half a bit. |
Beta Was this translation helpful? Give feedback.
-
You can't have something that is half a bit, obviously, except in average. It's kind of like saying the average family has 2.1 children. |
Beta Was this translation helpful? Give feedback.
-
q4_0, q5_0, & q8_0 have 0.5 bits/weight less than the _1 quantizations. How does this work, and what does it mean? I remember seeing an explanation somewhere here, but I can't find it anymore.
Beta Was this translation helpful? Give feedback.
All reactions