This repo contains several quantization methods implemented in C++ and Python on top of MXNet.
There are six quantization methods, as follows:
- `quantization_int8`: follows Google's quantization-aware training (QAT) method, with the quantization scheme additionally aligned to TensorRT, which we deploy to. Paper: https://arxiv.org/abs/1712.05877
- FQN: low-bit QAT for detection tasks. Paper: http://openaccess.thecvf.com/content_CVPR_2019/papers/Li_Fully_Quantized_Network_for_Object_Detection_CVPR_2019_paper.pdf
- PACT: paper: https://arxiv.org/abs/1805.06085 (a minimal sketch of its activation quantizer follows this list)
- DoReFa: paper: https://arxiv.org/abs/1606.06160
- QIL: this method is hard to train; we have not managed to train its quantization parameters successfully. Paper: https://arxiv.org/abs/1808.05779
- GDRQ: the method is easy to implement, but its quantization pipeline is complicated and its results are hard to reproduce. Paper: https://arxiv.org/abs/1908.01477
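
As a concrete reference for what these quantizers compute, here is a minimal NumPy sketch of the PACT activation quantizer (forward pass only). The function name is ours; `alpha` is the learnable clipping threshold and `k` the bit width, following the paper's notation:

```python
import numpy as np

def pact_quantize_act(x, alpha, k):
    """PACT forward pass (sketch): clip to [0, alpha], then uniform k-bit quantize."""
    levels = 2 ** k - 1
    y = np.clip(x, 0.0, alpha)          # clip activations into [0, alpha]
    scale = levels / alpha
    return np.round(y * scale) / scale  # quantize-dequantize to k bits
```

During training, the rounding step is bypassed with a straight-through estimator and `alpha` is learned from the gradient of the clipped region, as described in the paper.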
## Setting

The network for cifar10 is ResNet-20; for imagenet it is ResNet-18.
|  | paper-cifar10 | our cifar10 | paper-imagenet | our imagenet |
| --- | --- | --- | --- | --- |
| fp32 | 0.916 | 0.920 | 0.702 | 0.697 |
| 4bits | 0.913 | 0.918 | 0.692 | 0.694 |
| gap (fp32 - 4bits) | 0.003 | 0.002 | 0.01 | 0.003 |
| 3bits | 0.911 | 0.917 | 0.681 | 0.674 |
| gap (fp32 - 3bits) | 0.005 | 0.003 | 0.021 | 0.023 |
| 2bits | 0.897 | 0.897 | 0.644 | 0.603 |
| gap (fp32 - 2bits) | 0.019 | 0.023 | 0.058 | 0.094 |
Note: our 2-bit result on imagenet is about 4% below the paper's reported result.
1. The network for cifar10 is ResNet-20; for imagenet it is ResNet-18.
2. Unless explicitly stated otherwise, weights and activations use the same number of bits.
3. "google quantize" means `quantization_int8`; "w-google-act-pact" means `quantization_int8` for weights and PACT for activations.
### cifar10
|  | pact-paper | our pact | our google quantize | w-gdrq-act-pact | w-google-act-pact |
| --- | --- | --- | --- | --- | --- |
| fp32 | 0.916 | 0.920 | 0.920 | 0.920 | 0.920 |
| 4bits | 0.913 | 0.918 | 0.914 | 0.917 | 0.919 |
| w3,act4 |  | 0.918 | 0.911 | 0.914 | 0.917 |
| 3bits | 0.911 | 0.917 | 0.891 | 0.918 | 0.916 |
| 2bits | 0.897 | 0.897 | 0.67 | 0.903 | 0.893 |
### imagenet
|  | pact-paper | our pact | our google quantize | our gdrq | w-gdrq-act-pact | w-google-act-pact | w-gdrq-act-pact(ft) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| fp32 | 0.702 | 0.697 | 0.697 | 0.697 |  |  |  |
| 4bits | 0.692 | 0.694 | 0.658 |  |  |  |  |
| w3,act4 |  | 0.685 | 0.639 | 0.689(ft) | 0.691 | 0.669 |  |
| 3bits | 0.681 | 0.674 | 0.536 | 0.674 | 0.644 | 0.682 |  |
| 2bits | 0.644 | 0.603 | 0.042 | 0.608 | 0.412 | 0.632 |  |
On our private deep models, the int8 training results so far match the fp32 training results, and real int8 inference on TensorRT gives nearly the same accuracy as int8 training.
In our practice, the easiest way to train an int8 model is to finetune the fp32 model with `base_int8_lr = base_fp32_lr / 2`, `finetune_epoch = 2`, and the `lr_scheduler` set to `SineScheduler`, where `lr = base_int8_lr * sin(curr_iter / total_iters * pi)`.
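
A minimal sketch of such a scheduler, assuming MXNet's `mx.lr_scheduler.LRScheduler` interface (the class body and the `min_lr` floor below are our illustration, since the sine is zero at both endpoints):

```python
import math
from mxnet import lr_scheduler

class SineScheduler(lr_scheduler.LRScheduler):
    """lr = base_lr * sin(curr_iter / total_iters * pi) (sketch)."""

    def __init__(self, base_lr, total_iters, min_lr=1e-6):
        super(SineScheduler, self).__init__(base_lr=base_lr)
        self.total_iters = total_iters
        self.min_lr = min_lr  # floor, since sin(0) = sin(pi) = 0

    def __call__(self, num_update):
        t = min(num_update, self.total_iters)
        lr = self.base_lr * math.sin(float(t) / self.total_iters * math.pi)
        return max(lr, self.min_lr)

# usage: finetune at half the fp32 base learning rate for 2 epochs
base_fp32_lr = 0.1                 # assumed fp32 setting
base_int8_lr = base_fp32_lr / 2
iters_per_epoch = 5000             # placeholder value
sched = SineScheduler(base_int8_lr, total_iters=2 * iters_per_epoch)
```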
The `quantization_int8` operator requires a `maxall_except_dim` function, which we implemented ourselves in `mshadow`, so replace the original `reduceto1d.h` file with ours:
1. Copy the files in `operator_cxx/contrib` into `mxnet_home/operator/contrib`.
2. Copy `3rdparty/mshadow/mshadow/extension/reduceto1d.h` over `mxnet_home/3rdparty/mshadow/mshadow/extension/reduceto1d.h`.
3. Recompile MXNet.
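
For intuition about what step 2 provides: `maxall_except_dim` reduces a tensor to 1D by taking the maximum over every axis except one. A NumPy sketch of the semantics (our illustration, not the actual mshadow expression template):

```python
import numpy as np

def maxall_except_dim(x, dim):
    """Max-reduce over every axis except `dim`, returning a 1-D array.

    Example: for a conv weight of shape (out_ch, in_ch, kh, kw),
    maxall_except_dim(w, 0) gives one maximum per output channel, the
    kind of reduction per-channel quantization scales are built from.
    """
    axes = tuple(i for i in range(x.ndim) if i != dim)
    return x.max(axis=axes)
```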
We attach quantization nodes by parsing the MXNet symbol file: given your quantization settings, this generates a graph with quantization nodes inserted. The detailed implementation is in `utils/graph_optimize.py`.
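
The core idea can be sketched as below (a simplified illustration only: the operator name `_contrib_quantization_int8`, the `num_bits` attribute, and the whole helper are our assumptions; the real logic, including quantization settings and op coverage, lives in `utils/graph_optimize.py`). An MXNet symbol file is JSON holding a topologically ordered `nodes` list, so new nodes can be spliced in front of an op's inputs:

```python
import json

def attach_quantize_nodes(sym_path, out_path, num_bits=8):
    """Insert a quantize node before each Convolution input (simplified sketch)."""
    with open(sym_path) as f:
        graph = json.load(f)

    new_nodes, remap = [], {}              # remap: old node id -> new node id
    for old_id, node in enumerate(graph["nodes"]):
        # nodes come in topological order, so referenced ids are remapped already
        node["inputs"] = [[remap[i], j, k] for i, j, k in node["inputs"]]
        if node["op"] == "Convolution":
            quantized = []
            for n, inp in enumerate(node["inputs"][:2]):  # data and weight, not bias
                new_nodes.append({
                    "op": "_contrib_quantization_int8",   # assumed operator name
                    "name": "%s_quant%d" % (node["name"], n),
                    "attrs": {"num_bits": str(num_bits)},
                    "inputs": [inp],
                })
                quantized.append([len(new_nodes) - 1, 0, 0])
            node["inputs"] = quantized + node["inputs"][2:]
        new_nodes.append(node)
        remap[old_id] = len(new_nodes) - 1

    graph["nodes"] = new_nodes
    graph["arg_nodes"] = [remap[i] for i in graph["arg_nodes"]]
    graph["heads"] = [[remap[i], j, k] for i, j, k in graph["heads"]]
    graph.pop("node_row_ptr", None)        # stale after insertion; MXNet rebuilds it
    with open(out_path, "w") as f:
        json.dump(graph, f, indent=2)
```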
There is a simple example with the ResNet-18 network; run `python3 utils/graph_optimize.py`. The source graph and the graph with quantization nodes attached are shown below: