
[RFC] Reading quantized models from TFLite and MxNet - operators API #3252

Closed

anijain2305 opened this issue May 28, 2019 · 2 comments

@anijain2305 (Contributor)

To increase quantization support in TVM, it is necessary to support pre-quantized models, i.e., models that have been quantized in the framework itself (outside of Relay). In this issue, we lay down the high-level API design for some of the quantized operators. A large portion of this comes from the following relevant discussions. Thanks to @jackwish, @FrozenGene and @jnorwood for sharing their experiences with quantization, and also to @shoubhik for helping design this RFC.

Other non-TVM links used to understand quantization:

  • GemmLowP - Doc
  • TFlite reference code

Covered frameworks for now - TFLite and MxNet
Target network for now - Inception V3 from TFLite. (I will create one for MxNet)
Target platforms for now - ARM and Intel (will create a separate issue as the project progresses)


List of required operators - quantize, quantized_conv2d, quantized_relu, quantized_pool2d, quantized_fully_connected, quantized_concat, dequantize


It will be good if we can agree on the Relay ops - their inputs/outputs and attributes. The initial proposal for the quantize, quantized_conv2d, and dequantize ops is as follows (the other quantized_* operators will be along the same lines as quantized_conv2d).

Op quantize

def quantize(data, scale, zero_point, out_dtype):
    """
    Quantize takes the scale and zero_point attributes and quantizes the
    FP32 input data to an int8/uint8 tensor.

    Parameters
    ----------
    data: FP32 tensor
           The input tensor in FP32.
    
    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be int8/uint8

    Returns
    -------
    quantized_data: int8/uint8 tensor
           The quantized tensor.

    """

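As a reference for the intended semantics, here is a minimal NumPy sketch of the affine quantization rule real_value = scale * (quantized_value - zero_point) that this op would implement. The helper name and the use of NumPy are illustrative only, not part of the proposed Relay API.

import numpy as np

def quantize_reference(data, scale, zero_point, out_dtype="uint8"):
    """Quantize an FP32 array to int8/uint8 using scale and zero_point."""
    info = np.iinfo(np.dtype(out_dtype))
    q = np.round(data / scale) + zero_point   # map FP32 onto the integer grid
    q = np.clip(q, info.min, info.max)        # saturate to the out_dtype range
    return q.astype(out_dtype)
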
Key points to discuss

  • The scale and zero_point calculations happen outside the Relay graph, i.e., the framework parsers will have to compute the scale and zero_point if only min and max are provided (reference implementation in TFLite). This can also be thought of as a framework parser utility where we handle min/max and symmetric/asymmetric schemes and generate the scale and zero_point the way each framework handles them; a sketch of such a utility follows this list.
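
A rough sketch of such a parser utility (hypothetical name, asymmetric uint8 scheme similar to TFLite's); the exact clamping and rounding behavior here are assumptions for illustration:

def min_max_to_scale_zero_point(rmin, rmax, qmin=0, qmax=255):
    """Derive (scale, zero_point) for an asymmetric uint8 scheme from a real-valued range."""
    rmin = min(rmin, 0.0)              # the representable range must include real 0.0
    rmax = max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    if scale == 0.0:                   # degenerate all-zero range
        return 1.0, qmin
    zero_point = qmin - rmin / scale   # quantized value that represents real 0.0
    zero_point = int(round(min(max(zero_point, qmin), qmax)))
    return scale, zero_point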

Op quantized_conv2d

def quantized_conv2d(quantized_data, quantized_kernel, 
        input_scale, input_zero_point,
        kernel_scale, kernel_zero_point,
        output_scale, output_zero_point,
        out_dtype,

        # All the old remaining ones from conv2d
        strides=(1, 1),
        padding=(0, 0),
        dilation=(1, 1),
        groups=1,
        channels=None,
        kernel_size=None,
        data_layout="NCHW",
        kernel_layout="OIHW",
        out_layout=""):
    """
    
    quantized_conv2d takes the quantized input data and kernel tensors, along with
    their scale and zero_point attributes, and computes the quantized 2D convolution.
    The scale and zero_point calculations happen outside the relay graph, i.e., the
    framework parsers will have to compute the scale and zero_point if only min and
    max are provided.

    Parameters
    ----------
    quantized_data: int8/uint8 tensor
           The quantized input tensor in int8/uint8.

    quantized_kernel: int8/uint8 tensor
           The quantized kernel tensor in int8/uint8.
    
    input_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_data int8 values back to FP32.

    input_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_data distribution.

    kernel_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_kernel int8 values back to FP32.

    kernel_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_kernel distribution.

    output_scale: FP32 scalar (An attribute of the op)
           The output scale is set during the quantization process using training/calibration.
           The float scalar to scale the quantized_output int8 values back to FP32.

    output_zero_point: Int32 zero point (An attribute of the op)
           The output zero point is set during the quantization process using training/calibration.
           The zero point of the quantized_output distribution.

    out_dtype: String
           The dtype of the quantized_output. Can only be int8/uint8.
           The requantization from int32 to int8/uint8 is a part of the op compute.

    ..... Other attributes are the same as in conv2d.


    Returns
    -------
    quantized_output: int8/uint8 tensor
           The quantized tensor.

    """

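For reference, below is a minimal NumPy sketch of the requantization step that the docstring says is part of the op compute: mapping the int32 accumulator of the convolution into the int8/uint8 output domain. The helper name is hypothetical, a real kernel would use integer multipliers and shifts rather than floating point, and the zero-point correction terms of the convolution itself are omitted.

import numpy as np

def requantize_reference(acc_int32, input_scale, kernel_scale,
                         output_scale, output_zero_point, out_dtype="uint8"):
    """Scale an int32 accumulator into the quantized output domain."""
    # real = input_scale * kernel_scale * acc_int32
    #      = output_scale * (q_out - output_zero_point)
    multiplier = (input_scale * kernel_scale) / output_scale
    q = np.round(acc_int32 * multiplier) + output_zero_point
    info = np.iinfo(np.dtype(out_dtype))
    return np.clip(q, info.min, info.max).astype(out_dtype)
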
Key points to discuss further

  • This op has a set of computations that could ideally be pre-computed, but that is difficult because fold-constant works only across Relay ops and not within a Relay op. This has been discussed in more detail on the discuss forum.
    • First pre-computable - the core computation includes terms that involve only the kernel (Term 2 and Term 4 in the above link) and that will be part of the TVM compute. This is very hard to avoid; we need a fused compute to get the best performance.
    • Second pre-computable - the output scale and zero_point are used to calculate an integer multiplier and shift so that all computations stay in the integer domain. This computation changes for each op (e.g., concat handles it differently from conv), so it is also kept inside the quantized_conv2d op. It could be avoided by changing the API and replacing output_scale with output_multiplier and output_shift, but that seems very specific to TFLite, and one might want to handle output_scale and output_zero_point differently. I am not sure about this part, so please comment. (A sketch of the multiplier/shift decomposition follows this list.)
  • The op already accounts for the requantization portion. As far as I understand, the requantization portion is just a clamp to out_dtype. (The handling of output_multiplier and output_shift mentioned above is for calculating the output quantized tensor, not for requantization.)
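
A sketch of the multiplier/shift decomposition referenced in the second point above, in the spirit of TFLite's QuantizeMultiplier: the float ratio input_scale * kernel_scale / output_scale is expressed as a Q0.31 integer multiplier plus a right shift. The function name and exact rounding are illustrative assumptions.

import math

def float_scale_to_multiplier_shift(real_multiplier):
    """Decompose a positive real multiplier into a Q0.31 multiplier and a right shift."""
    if real_multiplier == 0.0:
        return 0, 0
    mantissa, exponent = math.frexp(real_multiplier)  # real_multiplier = mantissa * 2**exponent
    q31 = int(round(mantissa * (1 << 31)))            # Q0.31 fixed-point mantissa
    if q31 == (1 << 31):                              # mantissa rounded up to 1.0
        q31 //= 2
        exponent += 1
    return q31, -exponent                             # right shift (negative if multiplier >= 1)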

Op dequantize

Dequantization is required while connecting a quantized operator and an FP32 operator. This might be a temporary stage where we do not have a quantized implementation of the second op. Dequantization might also be required at the end of the network to keep the output of the graph in FP32.

def dequantize(quantized_data, scale, zero_point, out_dtype):
    """
    Dequantize takes the scale and zero_point attributes and dequantizes the
    int8/uint8 tensor to an FP32 tensor.

    Parameters
    ----------
    quantized_data: int8/uint8 quantized input tensor
           The input tensor in int8/uint8.
    
    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be float32.

    Returns
    -------
    data: FP32 tensor
           The dequantized tensor.

    """
@anijain2305 (Contributor, Author)

Adding others who might be interested in this @ajtulloch @eqy @ZihengJiang @tqchen

@tqchen (Member) commented May 28, 2019

Consolidated to #2351

tqchen closed this as completed May 28, 2019