
[QNN] Requantize operator #3531
Merged · 37 commits · Aug 8, 2019
Conversation

@anijain2305 (Contributor) commented Jul 10, 2019

Requantize converts one quantized tensor representation to another quantized
representation. The PR has the following implementation features:

Relevant Issue - #2351

  • Requantize operator defined in the qnn namespace - relay.op.qnn.requantize
  • Lowering of requantize to existing Relay operators
  • Integer fixed-point implementation of requantize
    • Two rounding modes - FE_UPWARD (round towards positive infinity) and
      FE_AWAY_FROM_ZERO (std::round behavior)
  • A floating-point implementation as well, which can act as a reference or
    can be used on devices where FP32 computation is not expensive
  • Unit test cases

Credit to TFLite and GemmLowp for providing reference implementations.
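For reference, here is a minimal floating-point sketch of what requantize computes (illustrative NumPy, not the PR's code; the helper name requantize_ref and the sample values are made up for this example):

    import numpy as np

    def requantize_ref(q_input, input_scale, input_zero_point,
                       output_scale, output_zero_point, out_dtype=np.int8):
        # Both representations encode: real_value = scale * (quantized - zero_point)
        real_value = input_scale * (q_input.astype(np.float64) - input_zero_point)
        # Re-express the same real value in the output quantized representation
        q_output = np.round(real_value / output_scale) + output_zero_point
        info = np.iinfo(out_dtype)
        return np.clip(q_output, info.min, info.max).astype(out_dtype)

    # e.g. int32 accumulator values re-expressed as int8
    print(requantize_ref(np.array([0, 100, 200], dtype=np.int32),
                         input_scale=0.5, input_zero_point=0,
                         output_scale=1.0, output_zero_point=0))  # [  0  50 100]

The integer fixed-point path reaches the same result without FP32 arithmetic, which is where the two rounding modes come into play.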


@tqchen (Member) commented Jul 10, 2019

Let us distinguish the qnn namespace from the quantize namespace. Given that the primary lowering phase lives in src/relay/qnn, we can also discuss the Python namespace, perhaps relay.qnn, which is different from the current quantize namespace (automatic quantization in Relay from float models).

@anijain2305 (Contributor, Author)

@tqchen Thanks for the quick comment :)

Can you please elaborate a little bit? Do you mean to change this file - https://github.com/dmlc/tvm/pull/3531/files#diff-dcae58edc1986609a54e3a202ee49b3b

@@ -0,0 +1,37 @@
# Licensed to the Apache Software Foundation (ASF) under one
Contributor Author

@tqchen Does it make sense to move this to newly created directory - python/tvm/relay/op/qnn?

This will ensure the quantize python namespace is not polluted.

@@ -0,0 +1,262 @@
/*
Contributor Author

@tqchen Should I call this file qnn_rewrite.cc to clearly mark the distinction between the quantize (quantization in Relay) and qnn (creating new quantized ops) work?

@anijain2305 (Contributor, Author)

@FrozenGene Please have a look.

}

RELAY_REGISTER_OP("qnn.requantize")
.set_attr<FForwardRewrite>("FQuantizeForwardRewrite", RequantizeForwardRewrite);
Contributor Author

@tqchen Does it make sense to replace FQuantizeForwardRewrite with FQnnForwardRewrite (for differentiating qnn and quantize)?

@tqchen (Member) commented Jul 11, 2019

Given that qnn is a dialect, we could consider giving it a special namespace and folder to live in, e.g.

  • src/relay/qnn
  • src/relay/op/qnn
  • tvm.relay.qnn

We can debate whether it should be relay/qnn/op or relay/op/qnn (cc @jroesch). Perhaps let us open an RFC for the dialect convention.

I have not yet taken a deep look into the PR, but here are a few high-level comments:

  • Header files: for utilities etc., we want to move most of them into the src/ folder instead of include.
    • The idea is that everything in include is public API (which external projects can depend on); everything else should be in src/.
  • There are a few style issues; we mainly use CamelCase for our APIs (as per the Google C++ style).

@anijain2305 (Contributor, Author)

I see. Makes sense. I don't have a strong preference for any of the namespaces that you suggested. My favorite is src/relay/qnn. In this case, we are only dealing with ops, but I can believe that some other dialect might want more than just ops. I am ok with any of the above options. Please let me know, and I can refactor to see how it looks.

Thanks for the high-level comments. I will fix the utils and put them in the src/relay/qnn directory. Will look for CamelCase. (I was hoping that lint would do that for me, but I guess not.)

@anijain2305 (Contributor, Author)

@tqchen I have updated the code. The C++ files are in src/relay/qnn directory. It has its own sub-directories like include, op, pass to serve different purposes. Let me know how it looks.

from . import _make

def requantize(input_data, input_zero_point, input_scale, output_zero_point,
output_scale, out_dtype="int32", use_int_compute=False,
Member

The following relay attr's default value is true. Please unify the default value (we should set it to True).

Contributor Author

Good catch. Thanks :)


def requantize(input_data, input_zero_point, input_scale, output_zero_point,
output_scale, out_dtype="int32", use_int_compute=False,
rounding_mode="FE_UPWARD"):
Member

TFLite requires "round_away_from_zero". Could you explain why we set "FE_UPWARD" as the default value? Because of MXNet's requirement?

Contributor Author

It can be "round_away_from_zero". These two rounding modes provide two points on the performance/accuracy tradeoff curve. FE_UPWARD lowers to fewer operators compared to round_away_from_zero. But I think you are right; we should make "round_away_from_zero" the default. If somebody wants better performance, they can set it to FE_UPWARD using a network-level config.
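To make the difference between the two modes concrete, a small sketch (only ties, i.e. values exactly halfway between two integers, are affected; the helper names are illustrative):

    import math

    def round_fe_upward(x):
        # Ties go toward positive infinity: 2.5 -> 3, -2.5 -> -2
        return math.floor(x + 0.5)

    def round_away_from_zero(x):
        # Ties go away from zero (std::round behavior): 2.5 -> 3, -2.5 -> -3
        return math.copysign(math.floor(abs(x) + 0.5), x)

    for v in [2.5, -2.5, 1.4, -1.4]:
        print(v, round_fe_upward(v), round_away_from_zero(v))

The two differ only on negative ties, and handling that case needs extra operations in the lowered graph, which is why FE_UPWARD lowers to fewer operators.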

// input_tensor, the result in int64 where the decimal point is sitting
// between bits 31 and 30 (from the right, rightmost bit is bit 0).
Expr scalar = MakeConstantScalar(up_idtype, fixed_point_multiplier);
auto multiplied_t = Multiply(tensor, scalar);
Member

Where is the overflow logic? If it overflows, we should return int32_max.

Contributor Author

Will look into this. I remember seeing an overflow check. Thanks for pointing this out.

@anijain2305 (Contributor, Author) commented Jul 11, 2019

@FrozenGene I thought a little more about it. I think we don't need this check. Please let me know if this makes sense.

The relevant portion from TFLite is

  bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();

Here a is the input tensor and b is the fixed_point_multiplier. The fixed_point_multiplier is calculated using frexp.

Now, frexp gives a negative fixed point multiplier only when the input floating point number is negative. In our case, this floating point number is input_scale/output_scale. I think the scales are always positive, so we should never have a negative number. (Let me know if I am wrong.)

Member

Right.
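For context, a sketch of the frexp-based decomposition under discussion (Python's math.frexp mirrors C's frexp; the Q31 convention follows the TFLite/GemmLowp references, and the helper name is illustrative):

    import math

    def get_fixed_point_multiplier_shift(scale_ratio):
        # Decompose: scale_ratio = significand * 2**exponent, 0.5 <= |significand| < 1
        significand, exponent = math.frexp(scale_ratio)
        # Scale the significand into a Q31 fixed-point integer
        fixed_point_multiplier = round(significand * (1 << 31))
        # Rounding can bump the significand up to exactly 2**31; renormalize
        if fixed_point_multiplier == (1 << 31):
            fixed_point_multiplier //= 2
            exponent += 1
        return fixed_point_multiplier, exponent

    # input_scale / output_scale = 0.75 -> multiplier ~= 0.75 * 2**31, shift 0
    print(get_fixed_point_multiplier_shift(0.75))  # (1610612736, 0)

Since scale_ratio = input_scale / output_scale is positive, the significand stays positive and the multiplier can never equal std::numeric_limits<int32_t>::min(), which is the reasoning above for dropping the overflow check.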

src/relay/qnn/pass/quantize_rewrite.cc (review thread resolved)
@zhiics (Member) left a comment

I made a pass on it. It is mainly style stuff.

# specific language governing permissions and limitations
# under the License.
# pylint: disable=wildcard-import
"""Neural network related operators."""
Member

Quantized neural network?

}

inline Expr Full(Expr fill_value,
Array<IndexExpr> shape,
Member

align

}
};


Member

remove one blank line


/*!
* \file tvm/relay/quantize_util.h
* \brief Utility methods needs for quantized ops that can be shared
Member

needed?

RELAY_REGISTER_OP("qnn.requantize")
.describe(R"code(Requantize operator.

FIXME
Member

Fix what?


// 2) Subtract the input_zero_point
auto tensor = input_tensor;
tensor = Cast(tensor, up_idtype);
Member

auto tensor = Cast(input_tensor, up_idtype); ?

auto input_zp = MakeConstantScalar(up_idtype, param->input_zero_point);
tensor = Subtract(tensor, input_zp);
}

Member

Remove blank lines

Expr scalar = MakeConstantScalar(up_idtype, fixed_point_multiplier);
auto multiplied_t = Multiply(tensor, scalar);


Member

remove one blank line


// 4) Find the rounding scalar. This depends on where the final decimal point
// sits. As we will be right shifting the multiplied_t, we need to first
// calculate the totol_right_shift.
Member

total_right_shift
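For orientation, a sketch of the fixed-point multiply-round-shift step these comments refer to (plain-integer Python following the GemmLowp-style Q31 scheme; the helper name and the FE_UPWARD-style rounding scalar are assumptions for this example):

    def fixed_point_requantize_step(value, fixed_point_multiplier, exponent, bits=31):
        # Multiply in wide integers; the decimal point now sits at bit `bits`
        multiplied = value * fixed_point_multiplier
        # Shift right by (bits - exponent) in total to land on an integer result
        total_right_shift = bits - exponent
        # Add the rounding scalar first (FE_UPWARD: add 0.5 ulp, then truncate down)
        rounder = 1 << (total_right_shift - 1)
        return (multiplied + rounder) >> total_right_shift

    # value * 0.5 with multiplier = round(0.5 * 2**31), exponent = 0
    print(fixed_point_requantize_step(101, 1 << 30, 0))  # 51, i.e. 50.5 rounded upward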




if __name__ == "__main__":
Member

Move the xx_test calls under main below directly? run_tests looks unnecessary.

Member

How about separating them into individual test_ functions instead of packing them together into one giant test_requantize?

@tqchen changed the title from "Requantize operator" to "[QNN] Requantize operator" on Jul 11, 2019
return func


def run_tests():
@FrozenGene (Member) commented Jul 12, 2019

Add uint8 test. Like TFLite, whose out dtype is uint8. Let us cover it.

Contributor Author

Good point. Will add it by tonight.

namespace tvm {
namespace relay {

inline bool IsInt8(const DataType& dtype) {
Member

This is certainly too small a function to introduce another level of indirection. Directly remove it and keep things in the original condition.

*/

/*!
* \file tvm/relay/quantize_util.h
Member

wrong file name

|| IsInt16(dtype) || IsUint16(dtype);
}

enum class QuantizeOpType : uint8_t {
Member

Unless there is a need (data compactness), you don't want to use uint8_t as the storage type.

return dtype == Float(32);
}

inline bool IsQuantizedType(const DataType& dtype) {
Member

IsQNNDataType

* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
Member

rename to qnn_lower.cc

from __future__ import absolute_import as _abs
from . import _make

def requantize(input_data, input_zero_point, input_scale, output_zero_point,
Member

Please debate the API naming; in particular, refer to possible existing API names and the choices of options.

@tqchen (Member) commented Jul 12, 2019

I made a few quick reviews. @FrozenGene @zhiics @antinucleon @hlu1 please help check again if you have time.

Let us specifically spend some time discussing the API: in particular, what are the choices of parameter names, order, and options? Do they make sense or not? Are there existing APIs that we can be consistent with, etc.?

.describe("Defines the rounding direction when the value is midway between"
"two representable values. There are two supported modes - FE_UPWARD"
"or FE_AWAY_FROM_ZERO. More context can be found at"
"https://www.gnu.org/software/libc/manual/html_node/Rounding.html");
Member

Suggest explaining the meaning of the two modes here as well, to make it self-explanatory.

Contributor Author

Thanks, added the description.


Parameters
----------
quantized_data : tvm.relay.Expr
Member

doc does not match the arguments in the function

} else if (IsUint32(dtype)) {
return std::numeric_limits<uint32_t>::min();
}
LOG(FATAL) << "Type not supported\n";
Member

Also << dtype to make debugging easier. Also, I think we can remove the "\n".

RELAY_REGISTER_OP("qnn.requantize")
.set_attr<FForwardRewrite>("FQuantizeForwardRewrite", RequantizeForwardRewrite);

TVM_REGISTER_API("relay._qnn.rewrite")
Member

Should this be ported to the pass manager? @zhiics

Member

Talked to @anijain2305 offline; it sounds like this pass is more or less a dialect rewrite that is only performed in the frontend parsers. I think we probably don't want to bring it into the pass manager because 1) it is quite frontend specific, and 2) it only does Expr -> Expr transformation.

We could probably use better naming instead of rewrite here and ir_pass in Python. Since we have pass under the dialect folder, should we have transform as well?

@tqchen thoughts?

Member

Yes, I think we need to use transform

const Array<Expr>& new_args, const NodeRef& ctx) {
CHECK_EQ(new_args.size(), 1);
Expr quantized_data = new_args[0];
const auto* param = ref_call->attrs.as<RequantizeAttrs>();
Member

if param == nullptr ?

return IsFloat32(in_dtype) || IsQuantizedType(in_dtype);
case QuantizeOpType ::Dequantize:
return IsQuantizedType(in_dtype);
case QuantizeOpType ::Requantize:
Member

remove space before ::





@shoubhik (Contributor)

I made a few quick reviews. @FrozenGene @zhiics @antinucleon @hlu1 please help check again if you have time.

Let us specifically spend some time discussing the API: in particular, what are the choices of parameter names, order, and options? Do they make sense or not? Are there existing APIs that we can be consistent with, etc.?

Can you also look at #3512 from an API naming perspective as well?

@anijain2305 (Contributor, Author) commented Jul 13, 2019

Let me write down my initial pass over the API design. Please help me refine it further.

Order of parameters

// All input relay exprs
input1_relay_expr,
input2_relay_expr,
...
...

// Properties for the input relay expr alphabetically (for example zero_point)
input1_prop1,
input1_prop2,
...
input2_prop1,
input2_prop2,

// Properties for the output relay expr alphabetically (for example zero_point)
output1_prop1,
output1_prop2,

// Finally, the properties of the ops alphabetically (like use_bias, use_int_compute)
op_prop1
op_prop2

Parameter names

I could not find a specific pattern in the names. However, I think these are some common rules:

  • The parameter name should be concise.
  • If there are multiple choices for the same param, the most widely used one can be chosen to retain readability.
  • For bool params, a good name should describe the action (e.g., count_include_pad, clip). There are a few examples in the codebase that do not follow this rule.

Current proposal

Adhering to the above rules, the API looks like this:

def requantize(data,
               # Input expr properties
               input_scale,
               input_zero_point,

               # Output expr properties
               output_scale,
               output_zero_point,

               # Op properties
               out_dtype,
               rounding, # _mode seemed unnecessary
               use_int_domain, # compute was somewhat ambiguous               
               ):
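For illustration, a call under this proposal might look like the following (the relay.op.qnn.requantize path, the argument values, and the variable setup are assumptions for this sketch, not the final merged API):

    from tvm import relay

    data = relay.var("data", shape=(1, 16), dtype="int32")
    out = relay.op.qnn.requantize(data,
                                  input_scale=0.5,
                                  input_zero_point=0,
                                  output_scale=0.25,
                                  output_zero_point=128,
                                  out_dtype="uint8",
                                  rounding="FE_UPWARD",
                                  use_int_domain=True)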

@anijain2305 (Contributor, Author)

Thanks for all the comments. I fixed them. Please review once again.
I changed the API as per the above proposal.

@tqchen (Member) commented Jul 16, 2019

Can we also list the related APIs from existing frameworks, e.g. TFLite, so we can get a good reference? Right now I can understand most of the parameters, except use_int_domain, which was a bit unclear to me.

@anijain2305 (Contributor, Author) commented Jul 16, 2019

For TF, the requantize operator looks like this

Arguments:

scope: A Scope object
input_min: The float value that the minimum quantized input value represents.
input_max: The float value that the maximum quantized input value represents.
requested_output_min: The float value that the minimum quantized output value represents.
requested_output_max: The float value that the maximum quantized output value represents.
out_type: The type of the output. Should be a lower bit depth than Tinput.

I think all the above arguments are represented in current API proposal (min/max are represented as scale/zero_point).
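The two parameterizations are interchangeable; here is a sketch of the standard affine conversion from a TF-style min/max range to scale/zero_point (the helper name and the uint8 defaults are illustrative):

    def minmax_to_scale_zero_point(rmin, rmax, qmin=0, qmax=255):
        # The representable real range must include 0.0 so that zero is exact
        rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
        # Affine mapping: real = scale * (quantized - zero_point)
        scale = (rmax - rmin) / (qmax - qmin)
        zero_point = int(round(qmin - rmin / scale))
        return scale, zero_point

    # A uint8 tensor representing reals in [-1.0, 1.0]
    print(minmax_to_scale_zero_point(-1.0, 1.0))  # (0.00784..., 128)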


use_int_domain gives the ability to run the computation in the FP32 domain instead (when set to False). The integer domain needs more instructions, due to correct rounding, compared to the FP32 computation, so there can be a tradeoff for different HW targets.

However, if you think we will never use FP32 computation for requantize, that is also a reasonable answer.

@FrozenGene (Member)

Maybe we don't need use_int_domain. I cannot imagine a HW target for which we would set this to False.

@anijain2305 (Contributor, Author)

Thanks @FrozenGene and @tqchen. I spent some more time thinking about it and realized that it might not be necessary. Removed use_int_domain.

Please review again.

@anijain2305 (Contributor, Author)

@tqchen Please let me know if you got a chance to look at the changes.

By the way, phisiart was added as a reviewer automatically when I was rebasing. Maybe I made a mistake in rebasing. I don't know how to remove the reviewer.

@anijain2305 (Contributor, Author)

@tqchen Can we try to get this merged if everything looks good? It will open up discussion/review for many other PRs.

@tqchen tqchen merged commit a78adbd into apache:master Aug 8, 2019

wweic pushed a commit to wweic/tvm that referenced this pull request Aug 9, 2019
* [Relay] [Quantization] WIP - Common files for the quantization work.

* [Relay] [Quantization] WIP - Prototyping requantize op.

* Requantize operator implementation.

Requantize converts one quantized tensor representation to another quantized
representation. The PR has the following implementation features

- Requantize operator defined in qnn namespace - relay.qnn.requantize
- Lowering of the requantize to existing Relay operators
- Integer fixed point implementation of requantize
    - Two rounding modes - FE_UPWARD (round towards positive infinity) and
    FE_AWAY_FROM_ZERO (std::round behavior)
- Floating point implementation as well, that can act as a reference or can be
used for devices where FP32 computation is not expensive.
- Unit test cases

Relevant Issue - apache#2351

Credit to TFLite and GemmLowp for providing reference implementations.

* Typo and lint fixes.

* Doc fix.

* Uncommenting the lint script (fixing mistake).

* Modifying the unit tests.

* Moving C++ files into src/relay/qnn

* Moving python files to python/tvm/relay/qnn. Some minor fixes.

* Moving the attrs.h inside the include directory.

* Pushing files that I forgot earlier. Changing util location.

* Incorporating comments. API change. Lint fixes.

* Modifying the GetFixedPointMultiplierShift API as per comments.

* Forgot the dialect change.

* Changing rewrite to qnn_lower.

* Renaming Quantize to Qnn for clarity.

* Remove use_int_domain.

* Incorporating review comments.

* Adding API doc for QNN dialect.

* Move the qnn_lower pass to transform namespace.

* Moving from expr to module. Adding namespace in C++.

* Minor sentence rewrites. Added qnn namespace.

* Added the API doc.

* Changing default out_dtype to int8. Adding a test with in/out_dtype as uint8.

* Style fixes. Better error messages.

* Adding documentation.

* More documentation fixes.

* Adding out dtype check for requantize.

* Adding corner case for FP32 to fixed point conversion.

* Adding extra line.

* Documentation fix.

* Adding static inline.

* Incorporating jackwish comment. Removed idtype from requantize lowering.

* Removing Quantize/Dequantize code. Restricting Requantize to (u)int8/int32.

* Style fixes.

* Fix the docs.

* Move to Legalize API.
wweic pushed a commit to neo-ai/tvm that referenced this pull request Sep 6, 2019
@anijain2305 anijain2305 deleted the requantize branch November 13, 2019 00:30