
[OPT] Low-bit Quantization #2116

Merged (5 commits), Jan 31, 2019

Conversation

ZihengJiang (Contributor)

Thanks for contributing to TVM! Please refer to the contribution guidelines at https://docs.tvm.ai/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers.

@ajtulloch (Contributor)

This is all assuming a symmetric quantization scheme, correct? Have you considered generalizing this slightly to an asymmetric quantization scheme like the one used in GEMMLOWP, QNNPACK, FBGEMM, NNAPI, etc?

@tqchen added the status: need RFC label on Nov 15, 2018
@tqchen (Member) commented Nov 15, 2018

Since quantization is a major feature, it is better to send an RFC first.

@ZihengJiang (Contributor, Author)

I will propose an RFC next week. Thanks @ajtulloch @tqchen.

@ajtulloch (Contributor)

Has there been an RFC posted btw? This comment probably belongs there.

FWIW I'm a little concerned about some directions this PR is taking, or at least there are some use cases that would be good to see handled, and I don't currently see how they fit in.

For background on my perspective, a standard training flow for quantized models in TF/C2 (at least the frameworks I'm familiar with that implement this) is to:

  1. Implement a model in a standard ML framework, generally using fp16/bfloat16/fp32 compute precision as this has highest throughput on most commonly-used training hardware.
  2. (optionally) insert fake quantization (here, called simulated quantization) nodes at quantization boundaries (i.e. if your backend implements a fused Int8Conv + Int8Relu, you'd insert them after a Conv + Relu block), to simulate the quantization numerics at training time (see the sketch after this list).
  3. Train the model as usual
  4. Implement a graph rewriting pass (i.e. TF's toco, C2's int8_converter, MXNet's quantization, etc) that rewrites the graph to target the int8 operators directly — i.e. remapping subgraphs of e.g. FP32Conv + FP32Relu to be a fused Int8ConvRelu operator. This requires computing output quantization parameters at requantization boundaries, which can be done either by
    • calibration to an example set of activations, via e.g. l-p norm or kl minimization (c2/tf/mxnet/tensorrt)
    • using activation ranges learned during training (c2/tf).
  5. Using this quantized graph, evaluate various metrics to verify the quantization-induced error/loss is acceptable.
  6. Deploy the quantized graph.
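To make step 2 concrete, here is a minimal NumPy sketch of the fake/simulated quantization numerics. An asymmetric uint8 scheme is assumed, and the function name, bit width, and min/max handling are illustrative assumptions rather than the exact numerics of TF, C2, or this PR:

```python
import numpy as np

def fake_quantize(x, num_bits=8, x_min=None, x_max=None):
    """Quantize-then-dequantize in fp32, so training sees the rounding and
    clamping error that real uint8 inference would introduce."""
    x_min = float(np.min(x)) if x_min is None else x_min
    x_max = float(np.max(x)) if x_max is None else x_max
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)   # keep 0.0 exactly representable
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x_max - x_min) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return ((q - zero_point) * scale).astype(np.float32)

x = np.random.randn(4, 8).astype("float32")
x_fq = fake_quantize(x)              # what the layer "sees" during training
print(np.max(np.abs(x - x_fq)))      # bounded by roughly scale / 2
```

In this flow, the scale/zero-point pair used for simulation (or ranges obtained by calibration) is what the rewriting pass in step 4 would reuse as the real quantization parameters of the int8 operators.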

Does this workflow make sense to folks? If not, could folks please elaborate on where we differ?

Given this flow, we'd like to insert TVM into this process. One key use case that I'd like TVM to consider supporting is to allow frameworks to continue to use their existing approaches for Steps 1-5, and involve TVM in Step 6. There are several reasons for this: for example, calibration-based quantization isn't always sufficient, and we'd like to support importing from existing int8 graph IRs like TFLite or C2.

I think requiring TVM to take on Steps 4 and 5 in order to implement quantized models is unnecessarily opinionated, and moves it towards being a fully-fledged framework in its own right (which I thought was not the goal).

I would have thought one natural (and minimalistic) direction for TVM to support quantized models (which isn't precluded by this diff, but I want to see what folks think about this) would be something like:

  1. Implement (in topi) support for int8 ops (i.e. (u)int8 inputs, int32 accumulation, int32 output). This is partially done already by the great work from folks in the community. If we generalize to asymmetric quantization (which IMO is quite important), then it's arguably more natural to represent the inputs/outputs as tuples of (uint8 tensor, float min, float max) or equivalently (uint8 tensor, int32 bias, float scale), and implement operators using this representation (see the sketch after this list).
  2. Add some kind of requantize op in NNVM that performs an int32 -> (u)int8 requantization with the appropriate output float min/float max obtained via calibration or training.
  3. Implement in nnvm frontend an importer for e.g. tflite models (which would mostly involve mapping ops like TFLiteConv into a nnvm::Conv + nnvm::Requantize sequence, and ensuring that TVM/NNVM fuse away sequences of requantize/pointwise/requantize), and demonstrate a) bitwise numerical equivalence, and b) speedups vs tflite's runtime for models like MobileNetV2 or similar.
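A minimal NumPy sketch of points 1-2 under the asymmetric representation suggested above (uint8 payload, zero point, float scale). The names quantize/dequantize/requantize and the concrete scales are illustrative assumptions, not an existing TVM/NNVM API; in practice the output scale/zero point would come from calibration or training:

```python
import numpy as np

def quantize(x, scale, zero_point):
    """fp32 -> uint8 under an asymmetric per-tensor scheme."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.int32) - zero_point).astype(np.float32) * scale

def requantize(acc_i32, acc_scale, out_scale, out_zero_point):
    """int32 accumulator -> uint8 at the output scale: the job of the
    proposed 'requantize' op at the end of an int8 conv/dense block."""
    q = np.round(acc_i32 * (acc_scale / out_scale)) + out_zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

# Toy int8 "matmul": uint8 inputs, int32 accumulation, then requantize.
x_scale, x_zp = 0.03, 128
w_scale, w_zp = 0.03, 128
x = quantize(np.random.randn(16).astype("float32"), x_scale, x_zp)
w = quantize(np.random.randn(16).astype("float32"), w_scale, w_zp)
acc = np.dot(x.astype(np.int32) - x_zp, w.astype(np.int32) - w_zp)   # int32
y = requantize(acc, acc_scale=x_scale * w_scale, out_scale=0.1, out_zero_point=128)
```

With an importer as in point 3, a TFLiteConv would then map onto this conv + requantize pattern, and fusing the requantize into the conv epilogue keeps the int32 accumulator out of memory.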

Concretely, my concerns with this approach (assuming the goal is to be 'the one true way' to execute quantized models in TVM) are that it a) integrates too early in the pipeline, which unnecessarily requires some assumptions, b) these assumptions aren't the most general ones (i.e. it requires symmetric quantization as used by e.g. MKLDNN), which precludes asymmetric quantization as in TF, TFLite, C2, GEMMLOWP, QNNPACK, and channel-wise quantization as in TF/C2, which is very useful for pushing bitwidths lower (see e.g. https://arxiv.org/pdf/1806.08342.pdf), and c) it is less modular than other approaches, which makes it harder to target from existing frameworks that already support quantization.

I don't think our goals are in conflict, I just thought that I should put this on the radar. Happy to send out an RFC (and dedicate engineering effort) to the alternative approach as well if folks are on board?

@tqchen (Member) commented Dec 6, 2018

@ajtulloch an RFC needs to be sent out and we won't merge the PR before the RFC gets discussed, so we can move the discussion there after it gets posted.

@ZihengJiang (Contributor, Author)

Hi @ajtulloch, I have a paper deadline, so I pushed this PR forward in a hurry to get a workable quantization workflow. Let me send out an RFC tomorrow. This PR won't be merged before we have a discussion in the community.

@tqchen (Member) commented Dec 6, 2018

x

@lixiaoquan (Contributor)

Currently, it seems NNVM requires the inputs of an op to have the same data type, but a quantization scheme may produce inputs of different types. Any suggestions about that?

@ajtulloch (Contributor)

@lixiaoquan there's no such requirement today AFAIK, it's user-controlled in the implementation of attr<FInferType>(..) for the relevant NNVM op.

@@ -213,3 +214,16 @@ def select_array(i, j):
return now

return tvm.compute(matrix.shape, select_array, name=name)


@tvm.register_func("print_tensor")
Member:

Sure, maybe we can add it as a util later in a separate PR, but we need documentation for these.


@ZihengJiang changed the title from [WIP] Low-bit Quantization to [OPT] Low-bit Quantization on Dec 18, 2018
@ZihengJiang (Contributor, Author) commented Jan 17, 2019

@liangfu Thanks for catching this outdated test

@@ -124,7 +124,7 @@ def _bind_params_by_name(func, params):
return expr.bind(func, bind_dict)


-def optimize(func, target, params=None):
+def optimize(func, target=None, params=None):
Contributor (Author):

It seems this API changed recently? It breaks some code. @tqchen

@ZihengJiang (Contributor, Author)

Here is an evaluation script: https://gist.github.com/ZihengJiang/bcabe46a712a417a01a6967d4430b6b5

@eqy @vinx13 @liangfu

@tqchen (Member) commented Jan 18, 2019

@antinucleon @hlu1 @anijain2305 please also help take a look when you have time

@eqy (Contributor) commented Jan 18, 2019

@ZihengJiang sorry, this is a basic question, but is there support for mixed quantization levels? It looks like we currently only specify a global weight and activation precision. Since we can already skip the first k conv layers, it seems that this would be a useful generalization.

@eqy (Contributor) left a review comment: typo

@ZihengJiang (Contributor, Author)

@eqy Users can override the rewrite function to implement mixed-precision quantization, but that is not included in this PR.
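Purely as an illustration of the idea (none of this is the PR's API; the table, layer names, and simulate_weight helper are hypothetical), mixed precision just means the simulated bit width is looked up per layer instead of taken from a single global setting:

```python
import numpy as np

# Hypothetical per-layer bit widths that an overridden rewrite/annotate
# function could consult instead of a single global precision.
LAYER_NBITS = {"conv0": 8, "conv1": 8, "conv2": 4, "fc": 8}

def simulate_weight(name, w):
    """Simulate symmetric signed quantization of a weight tensor at the
    bit width chosen for this layer."""
    nbit = LAYER_NBITS.get(name, 8)
    qmax = 2 ** (nbit - 1) - 1
    wmax = float(np.max(np.abs(w)))
    scale = wmax / qmax if wmax > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

w = np.random.randn(64, 64).astype("float32")
for name, nbit in LAYER_NBITS.items():
    err = np.max(np.abs(w - simulate_weight(name, w)))
    print(f"{name}: {nbit}-bit, max quantization error {err:.4f}")
```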

@vinx13 (Member) commented Jan 25, 2019

In ResNet, we use int32 for the residual addition, but I found that saving intermediate int32 results to global memory is much slower. Is it possible to use int8 in this case (we would need to modify the annotation of add)? I'm not sure about the impact on model precision.
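A NumPy sketch of the trade-off (assumed per-tensor scales, not what this PR currently emits): if both residual branches are requantized to a shared int8 scale before the add, the int32 intermediates never have to be written to global memory, at the cost of one extra rounding step whose effect on accuracy would need to be measured:

```python
import numpy as np

def requant_to_int8(acc_i32, acc_scale, out_scale):
    """Rescale an int32 accumulator to int8 at a chosen common scale."""
    q = np.round(acc_i32 * (acc_scale / out_scale))
    return np.clip(q, -128, 127).astype(np.int8)

# Two residual branches with int32 accumulators at (possibly) different scales.
a_i32 = np.random.randint(-5000, 5000, size=64, dtype=np.int32)
b_i32 = np.random.randint(-5000, 5000, size=64, dtype=np.int32)
a_scale, b_scale = 1e-3, 2e-3

# Current behavior described above: add in int32 (real value = q * scale).
ref = a_i32 * a_scale + b_i32 * b_scale

# Alternative: requantize each branch to a shared int8 scale, then add narrowly.
s = 0.1
a8 = requant_to_int8(a_i32, a_scale, s)
b8 = requant_to_int8(b_i32, b_scale, s)
approx = (a8.astype(np.int16) + b8.astype(np.int16)) * s

print(np.max(np.abs(ref - approx)))   # extra error from the early requantization
```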

@ZihengJiang ZihengJiang merged commit 741b6bb into apache:master Jan 31, 2019
@ZihengJiang added the status: accepted label and removed the status: need review and status: need update labels on Feb 2, 2019
libing4752 pushed a commit to libing4752/tvm that referenced this pull request Feb 18, 2019
* [QUANTIZE] Quantization implementation.

* Update.

* Update.

* Update.

* Update.
merrymercy pushed a commit to merrymercy/tvm that referenced this pull request Feb 18, 2019
* [QUANTIZE] Quantization implementation.

* Update.

* Update.

* Update.

* Update.
wweic pushed a commit to neo-ai/tvm that referenced this pull request Feb 20, 2019
* [QUANTIZE] Quantization implementation.

* Update.

* Update.

* Update.

* Update.
@yzhliu mentioned this pull request on Mar 2, 2019
@YiranCdr commented Aug 4, 2020

Hey guys, I'm wondering whether TVM supports any INT16 quantization. If the answer is yes, is it quantization-aware training or post-training quantization? Thanks!
