diff --git a/docs/toolchain/quantization/1.1_Introdution_to_Post-training_Quantization.md b/docs/toolchain/quantization/1.1_Introdution_to_Post-training_Quantization.md
index 3585f11..9ea8880 100644
--- a/docs/toolchain/quantization/1.1_Introdution_to_Post-training_Quantization.md
+++ b/docs/toolchain/quantization/1.1_Introdution_to_Post-training_Quantization.md
@@ -6,6 +6,3 @@ Post-training quantization(PTQ) uses a batch of calibration data to calibrate th

Figure 1. PTQ Chart

-
-
-
\ No newline at end of file
diff --git a/docs/toolchain/quantization/1.2_Flow_and_Steps.md b/docs/toolchain/quantization/1.2_Flow_and_Steps.md
index 2d39859..74bbbbd 100644
--- a/docs/toolchain/quantization/1.2_Flow_and_Steps.md
+++ b/docs/toolchain/quantization/1.2_Flow_and_Steps.md
@@ -1,8 +1,5 @@
# 2 Post-training Quantization(PTQ) Flow and Steps
-
-[Workflow Overview](https://doc.kneron.com/docs/#toolchain/manual_1_overview/)
-
-In this section, we introduce Knerex, our Fixed-Point Model Quantization and Generation Tool. Quantization is the step where the floating-point weight are quantized into fixed-point to reduce the size and the calculation complexity. Currently, the quantization method of Knerex is based on Post Train Quantization(PTQ). In the near future, Quantization Aware Training(QAT) will be added on Knerex as an alternative to PTQ. Next, we will explain the principle and steps of Knerex, which guides you in terms of PTQ principles, model preparation, model verification, model quantification and compilation, performance analysis and tuning, precision analysis and tuning, etc.
+The Kneron toolchain delivers the whole flow to prepare a floating-point model, quantize it into a fixed-point model, and compile it into executable binaries. Please check the [Workflow Overview](https://doc.kneron.com/docs/#toolchain/manual_1_overview/) for Kneron toolchain details. Post-training quantization is one of the key procedures in the toolchain workflow. In this section, we introduce Knerex, our Fixed-Point Model Quantization and Generation Tool. Quantization is the step where the floating-point weights are quantized into fixed-point values to reduce model size and computation complexity. Currently, the quantization method of Knerex is based on Post-Training Quantization (PTQ). In the near future, Quantization-Aware Training (QAT) will be added to Knerex as an alternative to PTQ. The rest of this section explains the principles and steps of Knerex, covering PTQ principles, model preparation, model verification, model quantization and compilation, performance analysis and tuning, and precision analysis and tuning.
```
2.1 Introduction to PTQ Flow
@@ -18,33 +15,27 @@ In this section, we introduce Knerex, our Fixed-Point Model Quantization and Gen
Model conversion refers to the process of converting the original floating-point model to a Regularized Piano-ONNX model.
-The original floating-point model (also referred to as a floating-point model in some parts of the text) refers to a model that you have trained using DL frameworks such as TensorFlow/PyTorch, with a calculation precision of float32. Regularized Piano-ONNX model is a model format suitable for running on Kneron chips.
-
-The complete model development process with the Kneron toolchain involves five important stages: floating-point model preparation, model checking and performance evaluation by Kneron End to End Simulator, model transformation, and precision evaluation, as shown in the following figure.
-
-The floating-point model preparation stage is to prepare floating-point model for the model conversion tool. These models are usually obtained based on public DL training frameworks. It is important to note that the models need to be exported in a format supported by the Kneron toolchain. For specific requirements and recommendations, please refer to the "Floating-point Model Preparation" section.
+The original floating-point model (also referred to as a floating-point model in some parts of the text) refers to a model that you have trained using DL frameworks such as TensorFlow/PyTorch, with a calculation precision of float32. The Regularized Piano-ONNX model is a model format suitable for running on Kneron AI accelerator chips.
-The model checking stage is used to ensure that the algorithm model meets the chip's requirements. Kneron provides designated tools to complete this stage of the check. For cases that do not meet the requirements, the verification tool will clearly provide specific operator information that does not meet the requirements, making it easier for you to adjust the model based on the operator constraints. For specific usage, please refer to the "Kneron End to End Simulator" section.
+The complete model development process with the Kneron toolchain involves five important stages: floating-point model preparation, model checking and performance evaluation by the Kneron End to End Simulator, model transformation, and precision evaluation.
-The performance evaluation stage provides a series of tools to evaluate model performance. Before application deployment, you can use these tools to verify if the model's performance meets the application requirements. For cases where the performance falls short of expectations, you can also refer to the performance optimization suggestions provided by Kneron for tuning. For specific evaluation details, please refer to the Kneron End to End Simulator chapter.
+The floating-point model preparation stage prepares the floating-point model for the model conversion tool. These models are usually obtained with public DL training frameworks. It is important to note that the models need to be exported in a format supported by the Kneron toolchain. For specific requirements and recommendations, please refer to the "[Floating-point Model Preparation](https://doc.kneron.com/docs/#toolchain/manual_1_overview/#14-floating-point-model-preparation)" section.
-The model conversion stage converts the floating-point model to a fixed-point model. In order for the model to run efficiently on the Kneron chip, the Kneron conversion tool completes key steps such as model optimization, quantization, and compilation. Kneron's quantization method has been verified through long-term technology and production testing and can guarantee precision loss of less than 0.01 for most typical models. Please refer to the Calibration Data Preparation and Model Quantization and Compilation chapters for specific instructions.
+The model checking stage ensures that the algorithm model meets the chip's requirements. Kneron provides designated tools to complete this check. When a model does not meet the requirements, the verification tool clearly reports the offending operators, making it easier for you to adjust the model based on the operator constraints. For specific usage, please refer to the "[Floating Point Model Check](https://doc.kneron.com/docs/#toolchain/manual_3_onnx/#33-e2e-simulator-check-floating-point)" section.
-The accuracy evaluation stage provides E2E Simulator for evaluating the accuracy of the model. In most cases, the Kneron-converted model can maintain a similar level of accuracy as the original floating-point model. Before deploying the application, you can use the Kneron tools to verify whether the model's accuracy meets expectations. For cases where the accuracy is lower than expected, you can also refer to the performance optimization recommendations provided by Kneron for tuning. Please refer to the Model Accuracy Analysis and Tuning section for specific evaluations.
+The performance evaluation stage provides a series of tools to evaluate model performance. Before application deployment, you can use these tools to verify whether the model's performance meets the application requirements. If the performance falls short of expectations, you can also refer to the performance optimization suggestions provided by Kneron for tuning. For specific evaluation details, please refer to the [Model Performance Evaluation](https://doc.kneron.com/docs/#toolchain/manual_3_onnx/#32-ip-evaluation) section.
-## 2.2 Hardware Supported Operators List
+The model conversion stage converts the floating-point model to a fixed-point model. In order for the model to run efficiently on Kneron chips, the Kneron conversion tool completes key steps such as model optimization, quantization, and compilation. Kneron's quantization method has been verified through long-term technical and production testing and can guarantee a precision loss of less than 0.01 for most typical models. Please refer to [Model Quantization](https://doc.kneron.com/docs/#toolchain/manual_4_bie/) for details.
-https://doc.kneron.com/docs/#toolchain/appendix/operators/#_top
+The accuracy evaluation stage provides the E2E Simulator for evaluating the accuracy of the model. In most cases, the Kneron-converted model can maintain a level of accuracy similar to that of the original floating-point model. Before deploying the application, you can use the Kneron tools to verify whether the model's accuracy meets expectations. If the accuracy is lower than expected, you can also refer to the optimization recommendations provided by Kneron for tuning. Please refer to the [Model Accuracy Analysis and Tuning](https://doc.kneron.com/docs/#toolchain/manual_4_bie/#42-e2e-simulator-check-fixed-point) section for specific evaluations.
-https://gitlab.kneron.tw/SYS/knerex/-/blob/next_gen_activatePerChannel/updater/include/config/Node_Description_v2.xlsm
+## 2.2 [Hardware Supported Operators List](https://doc.kneron.com/docs/#toolchain/appendix/operators/#_top)
-## 2.3 Floating-Point Model Preparation
+## 2.3 [Floating-Point Model Preparation](https://doc.kneron.com/docs/#toolchain/manual_3_onnx/)
-https://doc.kneron.com/docs/#toolchain/manual_3_onnx/
+### 2.3.1 How To Use ONNX Converter
-### 2.3.1 HOW TO USE ONNX CONVERTER:
-
-Converting the original floating-point model to a Regularized Piano-ONNX model. https://doc.kneron.com/docs/#toolchain/appendix/converters/ ONNX_Convertor is an open-source project on Github. If there is any bugs in the ONNX_Convertor project inside the docker, don't hesitate to try git pull under the project folder to get the latest update. And if the problem persists, you can raise an issue there. We also welcome contributions to the project.
+This step converts the original floating-point model to a Regularized Piano-ONNX model. [ONNX_Convertor](https://doc.kneron.com/docs/#toolchain/appendix/converters/) is an open-source project on GitHub. If there are any bugs in the ONNX_Convertor project inside the docker, don't hesitate to try git pull under the project folder to get the latest update. If the problem persists, you can raise an issue there. We also welcome contributions to the project.
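+As a quick illustration of this conversion step, the sketch below exports a PyTorch model to ONNX and runs it through the Kneron ONNX optimizer passes. The model, file names, and opset are placeholders, and the `ktc.onnx_optimizer` helper names follow the converter documentation linked above; please verify them against the toolchain release inside your docker.
+
+```python
+# Illustrative sketch only: export a trained float32 PyTorch model to ONNX, then run the
+# Kneron ONNX optimizer passes so the graph becomes a Regularized Piano-ONNX model.
+import torch
+import torchvision
+import onnx
+import ktc  # provided inside the Kneron toolchain docker
+
+model = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()  # placeholder model
+dummy_input = torch.randn(1, 3, 224, 224)                          # fixed input shape
+torch.onnx.export(model, dummy_input, "mobilenet_v2.onnx", opset_version=11)
+
+m = onnx.load("mobilenet_v2.onnx")
+m = ktc.onnx_optimizer.torch_exported_onnx_flow(m)  # extra pass for PyTorch-exported ONNX
+m = ktc.onnx_optimizer.onnx2onnx_flow(m)            # general optimization/regularization pass
+onnx.save(m, "mobilenet_v2.kneron.onnx")
+```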
The general process for model conversion is as following:
@@ -54,35 +45,33 @@ ONNX exported by Pytorch cannot skip step 1 and directly go into step 2. Please
If you're still confused reading the manual, please try our examples from https://github.com/kneron/ConvertorExamples
-### 2.3.2. IP Evaluation(Model Evaluation)
+### 2.3.2 IP Evaluation (Model Evaluation)
Before we start quantizing the model and try simulating the model, we need to test if the model can be taken by the toolchain structure and estimate the performance. IP evaluator is such a tool which can estimate the performance of your model and check if there is any operator or structure not supported by our toolchain.
-## 2.4 Kneron End to End Simulator
+## 2.4 [Kneron End to End Simulator](https://doc.kneron.com/docs/#toolchain/appendix/app_flow_manual/)
-This project allows users to perform image inference using Kneron's built in simulator. https://doc.kneron.com/docs/#toolchain/appendix/app_flow_manual/
+This project allows users to perform image inference using Kneron's built-in simulator.
## 2.5 Model Quantitation and Compile
-In this stage, you will complete the conversion from a floating-point model to a fixed-point model. After this stage, you will have a model that can run efficiently on the Kneron chip. The ONNX converter tools are used for model conversion, and during the conversion process, important processes such as model optimization and calibration quantization are completed. Calibration requires preparation of calibration data in accordance with the model's pre-processing requirements. You can refer to the "2.4 Kneron End to End Simulator" section to prepare the calibration data in advance. To help you fully understand the model conversion, this section will introduce the use of conversion tools, the interpretation of internal conversion processes, the interpretation of conversion results, and the interpretation of conversion outputs in turn.
-
-### 2.5.1 HOW TO SET UP PARAM VALUES
+In this stage, you will complete the conversion from a floating-point model to a fixed-point model. After this stage, you will have a model that can run efficiently on the Kneron chip. The ONNX converter tools are used for model conversion, and during the conversion process, important steps such as model optimization, calibration, and quantization are completed. Calibration requires preparing calibration data in accordance with the model's pre-processing requirements. You can refer to the [Kneron End to End Simulator](https://doc.kneron.com/docs/#toolchain/appendix/app_flow_manual/) section to prepare the calibration data in advance. To help you fully understand the model conversion, this section introduces, in turn, the use of the conversion tools, the internal conversion process, the conversion results, and the conversion outputs.
-https://docs.google.com/document/d/1ePFzwLggkcpfTEee1nq-6VP9K264iP_4-iTPLFRAIYs/edit?usp=sharing
+### 2.5.1 How To Set Up Param Values
| No. | Parameter name | Parameter Configuration Description | Must/Optional |
| ----: | :----: | :---- | :---- |
| 00 | p_onnx | Parameter Usage: Path to ONNX file.
Range: N/A.
Default value: N/A.
Description: It should have passed through the ONNX converter. | Must | | 01 | np_txt | Parameter Usage: A dictionary of list of images in numpy format.
Range: N/A.
Default value: N/A.
Description: The keys should be the names of input nodes of the model.
e.g., {"input1": [img1, img2]}, here img1/img2 are two images -> preprocess -> numpy 3D array (HWC) | Must | | 02 | platform | Parameter Usage: Choose the platform architecture to generate fix models.
Range: "520" / "720" / "530" / "630".
Default value: N/A.
Description: Correspond to our chip on your board. For example, “520” for the KL520 chip. | Must | -| 03 | optimize | Parameter Usage: Level of optimization. The larger number, the better model performance, but takes longer.
Range: "0" / "1" / "2" / "3" / "4"
Default value: "0"
Description:
* "0": generated quantization fix model.
* "1": bias adjust parallel, no fm cut improve
* "2": bias adjust parallel, w fm cut improve
* "3": bias adjust sequential, no firmware cut improvement. SLOW!
* "4": bias adjust sequential, with firmware cut improvement. SLOW! | Optional | +| 03 | optimize | Parameter Usage: Level of optimization. The larger number, the better model performance, but takes longer.
Range: "0" / "1" / "2" / "3" / "4"
Default value: "0"
Description:
* "0": generated quantization fix model.
* "1": bias adjust parallel, no fm cut improve
* "2": bias adjust parallel, w fm cut improve | Optional | | 04 | datapath_range_method | Parameter Usage: Method to analyze list of images data.
Range: "percentage" / "mmse"
Default value: "percentage"
Description:
* “mmse”: use snr-based-range method.
* “percentage”: use arbitrary percentage. | Optional | | 05 | data_analysis_pct | Parameter Usage: Applicable when datapath_range_method set to "percentage". Intercept the data range.
Range: 0.0 ~ 1.0
Default value: 0.999, set to 1.0 if detection model
Description: It is used to exclude extreme values. For example, the default setting is 0.999. It means 0.1% of absolute maximum value will be removed among all data. | Optional | | 06 | data_analysis_threads | Parameter Usage: Multi Thread setting
Range: 1 ~ number of cpu cores / memory available
Default value: 4
Description: Number of threads to use for data analysis for quantization. | Optional | | 07 | datapath_bitwidth_mode | Parameter Usage: Specify the data flow of the generated fix model in “int8” or “int16”.
Range: "int8" / "int16"
Default value: "int8"
Description: The input/output data flows of most operator nodes set to int8 in default. Through this parameter , the data flows can be adjusted to int16 under operator node constraints. | Optional | | 08 | weight_bitwidth_mode | Parameter Usage: Specify the weight flow of the generated fix model in “int4”, “int8” or “int16”.
Range: "int4" / "int8" / "int16"
Default value: "int8"
Description: The weight flows of most operator nodes set to int8 in default. Through this parameter , the weight flows can be adjusted to int4 or int16 under operator node constraints. | Optional | | 09 | model_in_bitwidth_mode | Parameter Usage: Specify the generated fix model input in “int8” or “int16”.
Range: "int8" / "int16"
Default value: "int8"
Description: The model input set to int8 in default. Through this parameter , the model input can be adjusted to int16 under operator node constraints. | Optional | -| 10 | model_out_bitwidth_mode | Parameter Usage: Specify the generated fix model output in “int8” or “int16”.
Range: "8" / "15"
Default value: "8"
Description: The model output set to int8 in default. Through this parameter , the model output can be adjusted to int16 under operator node constraints. | Optional | +| 10 | model_out_bitwidth_mode | Parameter Usage: Specify the generated fix model output in “int8” or “int16”.
Range: "int8" / "int16"
Default value: "int8"
Description: The model output is set to int8 by default. Through this parameter, the model output can be adjusted to int16 under operator node constraints. | Optional | | 11 | percentile | Parameter Usage: The range to search.
Range: 0.0 ~ 1.0
Default value: 0.001
Description: It is used under “mmse” mode. The larger the value, the larger the search range, the better the performance but the longer the simulation time. | Optional | | 12 | outlier_factor | Parameter Usage: Used under 'mmse' mode.The factor applied on outliers.
Range: 1.0 ~ 2.0 or higher.
Default value: 1.0
Description: For example, if clamping data is sensitive to your model, set outlier_factor to 2 or higher. Higher outlier_factor will reduce outlier removal by increasing range. | Optional | | 13 | quantize_mode | Parameter Usage: Need extra tuning or not.
Range: "default" / "post_sigmoid"
Default value: "default"
Description: If a model's output nodes were ALL sigmoids and had been removed, choose "post_sigmoid" for better performance. | Optional |
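+As a concrete illustration of the table above, the snippet below collects a typical set of parameter values in Python. The parameter names come from the table; the image paths, input node name, preprocessing, and the exact Knerex entry point that consumes these values are placeholders that depend on your model and toolchain version.
+
+```python
+# Hypothetical sketch: build the np_txt calibration dictionary and a parameter set
+# matching section 2.5.1. Adapt preprocessing and names to your own model.
+import numpy as np
+from PIL import Image
+
+def preprocess(path):
+    # Placeholder preprocessing: match whatever your model expects (size, scaling, channel order).
+    img = Image.open(path).convert("RGB").resize((224, 224))
+    return np.asarray(img, dtype=np.float32) / 255.0  # HWC numpy array
+
+calib_images = [preprocess(p) for p in ["calib_0001.jpg", "calib_0002.jpg"]]  # placeholder files
+
+quant_params = {
+    "p_onnx": "mobilenet_v2.kneron.onnx",   # ONNX that already passed the converter
+    "np_txt": {"input1": calib_images},     # keys must be the model's input node names
+    "platform": "720",                      # target chip, e.g. "720" for KL720
+    "optimize": "1",                        # larger value: better accuracy, longer run time
+    "datapath_range_method": "percentage",
+    "data_analysis_pct": 0.999,             # drop the extreme 0.1% of absolute values
+    "data_analysis_threads": 4,
+    "datapath_bitwidth_mode": "int8",
+    "weight_bitwidth_mode": "int8",
+}
+# quant_params is then handed to Knerex's quantization/compilation entry point
+# (see the toolchain manual for the exact call in your release).
+```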
".scaled.onnx" is a must-use model in the accuracy verification process, and the specific usage method can be found in the "Model Accuracy Analysis and Tuning" section. ".scaled.wqbi.onnx" is an optional-use model. Bias adjustment is applied on this model to use calculated quantization infos to improve bias quantization performance. +The production process of ".scaled.onnx" and ".scaled.wqbi.onnx" are the result of Model Quantization stage. These model has completed the calibration and quantization processes, and the accuracy loss after quantization can be viewed here. ".scaled.onnx" is a must-use model in the accuracy verification process. ".scaled.wqbi.onnx" is an optional-use model. Bias adjustment is applied on this model to use calculated quantization infos to improve bias quantization performance. ## 2.6 Precision Analysis for Model Quantitation -### 2.6.1 E2E Simulator Check (Fixed Point) - -https://doc.kneron.com/docs/#toolchain/manual_4_bie/¶ +### 2.6.1 [E2E Simulator Check (Fixed Point)](https://doc.kneron.com/docs/#toolchain/manual_4_bie/) Before going into the next section of compilation, E2E Simulator would help to ensure the quantized model do not lose too much precision. @@ -174,20 +164,7 @@ The usage is almost the same as using onnx. In the code above, inf_results is a As mentioned above, we do not provide any postprocess. In reality, you may want to have your own postprocess function in Python, too. -### 2.6.2 Get Radix Value (Deprecated) - -In the previous versions or for the debug usage, we may need to get the input quantization value manually, which is the radix. Below is the API. - -```python -# [API] -ktc.get_radix(inputs) Get the radix value from the given inputs. -``` - -Args: - -inputs (List): a list of numpy arrays which could be the inputs of the target model. Raises: * ValueError: raise if the input values are out of range - -### 2.6.3 Dynasty Inference Dump on a Single Image +### 2.6.2 Dynasty Inference Dump on a Single Image We provide dynasty inference dump on a single image file by turning on "export_dynasty_dump" in Parameter Configuration. It should dump the dynasty inference result for every operator nodes of fixed-point model. You can manually analysis the results between the float value and dynasty inference result.