In the basic GGUF format, are all weights and activations processed as fp16, or are there separate operators for int8 and other formats? For example, operators like convolution, fully connected layers, and the FFN?
Quantization is mostly used to reduce the size of the weights that need to be multiplied (matrix multiplication, `ggml_mul_mat`); the rest of the operations are performed in fp16 or fp32, depending on the case.
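To make that split concrete, here is a minimal sketch against the public ggml C API: the weight matrix is stored quantized (Q4_0 here), the activations stay in F32, and `ggml_mul_mat` produces an F32 result. The tensor sizes, the Q4_0 type, and the zeroed placeholder data are chosen only for illustration, and details may vary between ggml versions.

```c
// Sketch: quantized weights + F32 activations through ggml_mul_mat.
#include <stdio.h>
#include <string.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        .mem_size   = 32 * 1024 * 1024,  // scratch memory for tensors and the graph
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Weight matrix W stored quantized (Q4_0 blocks), as it would be in a GGUF file.
    struct ggml_tensor * W = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, 4096, 4096);
    memset(W->data, 0, ggml_nbytes(W));  // placeholder weight data for the sketch

    // Activation vector x stays in full precision (F32).
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4096);
    float * xd = (float *) x->data;
    for (int i = 0; i < 4096; i++) xd[i] = 1.0f;  // placeholder activations

    // y = W * x : quantized weights, F32 activations, F32 output.
    struct ggml_tensor * y = ggml_mul_mat(ctx, W, x);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);

    // The result tensor is F32 even though the weights were quantized.
    printf("output type: %s, elements: %lld\n",
           ggml_type_name(y->type), (long long) ggml_nelements(y));

    ggml_free(ctx);
    return 0;
}
```

The point of the sketch is the type layout: only the large weight tensor carries a quantized type, while the activation input and the matmul output are regular float tensors, which matches how the other (non-matmul) operations run in fp16/fp32.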