In the basic GGUF format, are all weights and activations processed as fp16, or are there separate operators for int8 and other formats? For example, operators like convolution, fully connected layers, and the FFN?
Quantization is mostly used to reduce the size of the weights that need to be multiplied (matrix multiplication, `ggml_mul_mat`); the rest of the operations are performed in fp16 or fp32, depending on the case.
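To make that split concrete, here is a minimal sketch against the public ggml C API: the weight matrix is stored quantized (Q4_0 here), the activations stay in F32, and `ggml_mul_mat` produces an F32 result. The tensor sizes, the Q4_0 type, and the zeroed placeholder data are chosen only for illustration, and details may vary between ggml versions.

```c
// Sketch: quantized weights + F32 activations through ggml_mul_mat.
#include <stdio.h>
#include <string.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        .mem_size   = 32 * 1024 * 1024,  // scratch memory for tensors and the graph
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Weight matrix W stored quantized (Q4_0 blocks), as it would be in a GGUF file.
    struct ggml_tensor * W = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, 4096, 4096);
    memset(W->data, 0, ggml_nbytes(W));  // placeholder weight data for the sketch

    // Activation vector x stays in full precision (F32).
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4096);
    float * xd = (float *) x->data;
    for (int i = 0; i < 4096; i++) xd[i] = 1.0f;  // placeholder activations

    // y = W * x : quantized weights, F32 activations, F32 output.
    struct ggml_tensor * y = ggml_mul_mat(ctx, W, x);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);

    // The result tensor is F32 even though the weights were quantized.
    printf("output type: %s, elements: %lld\n",
           ggml_type_name(y->type), (long long) ggml_nelements(y));

    ggml_free(ctx);
    return 0;
}
```

The point of the sketch is the type layout: only the large weight tensor carries a quantized type, while the activation input and the matmul output are regular float tensors, which matches how the other (non-matmul) operations run in fp16/fp32.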