[Quantization] Any way to simulate asymmetric quantization? #665

Closed
masahi opened this issue Mar 9, 2020 · 2 comments


masahi commented Mar 9, 2020

Hi, from the doc https://intel.github.io/mkl-dnn/dev_guide_attributes_quantization.html, it seems DNNL quantized convolution supports only symmetric quantization. But I have a use case where I want to execute a quantized conv op that comes from PyTorch, and it has some non-zero zero points.

Is there a way to simulate quantized convolution with non-zero zero points using DNNL? Performance is not too important for me right now.

Is

Manual shift -> normal int32 conv -> requantize

a good approach?

@masahi masahi added the question label Mar 9, 2020
@emfomenk

Hi @masahi,

Is manual shift a good approach?

Not if you want to keep using int8 data types. The reason is that once the shift is applied, the data may no longer be representable in int8 (say, the original value was 3 and the zero point was 200 --> after the subtraction the value is -197, which doesn't fit in s8).

If you are fine with emulating int8 computations using floating-point operations, the approach you suggested should work. The only pitfall is that the result might differ slightly if rounding happens during the computations (f32 has only 23 explicit mantissa bits plus an implicit leading bit, so any intermediate integer greater than 2^24 cannot be represented exactly in the f32 data type --> hence rounding).
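
Roughly something like this minimal numpy sketch of that emulation, using a plain matrix multiply to stand in for the convolution's inner products (the function and parameter names are just for illustration, not DNNL API):

```python
import numpy as np

def emulated_asym_qmatmul(x_q, w_q, zp_x, zp_w, scale_x, scale_w, scale_y, zp_y):
    # Shift into a wider type first: an 8-bit value minus a zero point may no
    # longer fit into the original 8-bit type (e.g. 3 - 200 = -197).
    x_shifted = x_q.astype(np.float32) - zp_x
    w_shifted = w_q.astype(np.float32) - zp_w
    # Accumulate in f32; exact as long as intermediate integers stay below 2^24.
    acc = x_shifted @ w_shifted
    # Requantize to the output scale and zero point (assuming a u8 destination).
    y = np.rint(acc * (scale_x * scale_w / scale_y)) + zp_y
    return np.clip(y, 0, 255).astype(np.uint8)
```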

The alternative approach would be to split the computation into two convolutions (assuming that the nontrivial zero point is applied to the source data tensor only):

  1. Compute the int8 convolution with s32 output (w/o any scaling or post-ops), without taking zero points into account.
  2. Compute the int8 convolution with s32 output using a special input -- the broadcasted zero point.
  3. Subtract the second tensor from the first one.
  4. Apply all the (re-)quantization scaling and post-ops.

This is conceptually how the library would implement convolution with non-trivial zero points (see the sketch below). However, this is a much more intrusive approach and is actually quite inefficient (the slowdown is >2x compared to the convolution w/o zero points). Given also how complex the library's API is, I would suggest avoiding this route :)
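
If it helps, here is a small numpy sketch of that decomposition, again with a matrix multiply standing in for the convolution and only the source tensor having a non-zero zero point (names are illustrative, not the actual DNNL API):

```python
import numpy as np

def decomposed_qmatmul(x_q, w_q, zp_x, scale_x, scale_w):
    # Step 1: "convolution" with s32 accumulation, ignoring the zero point.
    acc_main = x_q.astype(np.int32) @ w_q.astype(np.int32)
    # Step 2: the same computation with the broadcasted zero point as input.
    zp_src = np.full(x_q.shape, zp_x, dtype=np.int32)
    acc_zp = zp_src @ w_q.astype(np.int32)
    # Step 3: subtract the second result from the first.
    acc = acc_main - acc_zp
    # Step 4: apply the (re-)quantization scaling (post-ops would go here).
    return acc.astype(np.float32) * (scale_x * scale_w)
```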

Summary:

  1. If performance is not a concern at all, using the implementation from the framework is probably the way to go.
  2. If the previous bullet doesn't work (say, for whatever reason the performance there is awful), a manual shift could be used (but don't forget to change the data type). Whether to use the framework here or DNNL is up to you.
  3. As far as performance is concerned, the only way to go is a proper implementation in the library.

P.S. A nice explanation of how an implementation could handle the zero points efficiently can be found in the gemmlowp docs.

@emfomenk emfomenk self-assigned this Mar 12, 2020

masahi commented Mar 12, 2020

@emfomenk Thanks very much for the detailed answer. My use case is to convert quantized PyTorch models to TVM and run them on more backends, so using the PyTorch implementation is not an option.

The TVM community is developing a mechanism to easily plug external libraries like TensorRT and DNNL into its compilation pipeline. See my PR apache/tvm#4741 for example, where I demonstrate using DNNL's fused conv op from TVM. My next step is to do the same exercise for quantized ops, and for that I need to handle asymmetry. Since this is mostly for demo purposes and having a reliable "ground truth" is more important, I don't care about performance for now.

The gemmlowp approach of decomposing qconv into 4 terms is also how TVM handles asymmetry. See https://github.com/apache/incubator-tvm/blob/0755e4a58897c64d6a7ffc86bab3df45554bac7e/src/relay/qnn/op/convolution.cc#L512-L580
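
For reference, the 4-term decomposition boils down to something like this numpy sketch for a matrix multiply with zero points on both operands (names are illustrative and unrelated to the TVM or DNNL code):

```python
import numpy as np

def four_term_qmatmul(x_q, w_q, zp_x, zp_w):
    x = x_q.astype(np.int32)
    w = w_q.astype(np.int32)
    k = x.shape[1]  # reduction length
    t1 = x @ w                                # term 1: raw integer product
    t2 = zp_w * x.sum(axis=1, keepdims=True)  # term 2: zp_w * row sums of x
    t3 = zp_x * w.sum(axis=0, keepdims=True)  # term 3: zp_x * column sums of w
    t4 = k * zp_x * zp_w                      # term 4: constant
    # (x - zp_x) @ (w - zp_w) == t1 - t2 - t3 + t4
    return t1 - t2 - t3 + t4
```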

Decomposing and executing the decomposed ops with DNNL seems like a good plan, but it would be nicer if the library could handle this automatically. Since both PyTorch and TensorFlow generate non-zero zero points, I think there are good use cases. (Of course, users should be aware that it would be slower than symmetric quantization.)

@masahi masahi closed this as completed Mar 12, 2020