NVIDIA TensorRT is a high-performance neural network inference engine for deploying deep learning applications such as image classification, segmentation, and object detection in production, delivering maximum inference throughput and efficiency. TensorRT is billed as the first programmable inference accelerator and can accelerate both existing and future network architectures. It provides forward-inference optimizations such as computation-graph optimization, INT8 quantization, and FP16 reduced-precision arithmetic. TensorRT currently exposes both C++ and Python APIs; this article uses the C++ API to walk through the typical TensorRT workflow.
A TensorRT installation from the precompiled archive TensorRT-7.0.0.11.Ubuntu-18.04.x86_64-gnu.cuda-10.0.cudnn7.6.tar contains the following directories:
├── data
├── doc
├── include # all header files; the interface and documentation of every API can be found here
│ ├── NvCaffeParser.h
│ ├── NvInfer.h
│ ├── NvInferPlugin.h
│ ├── NvInferPluginUtils.h
│ ├── NvInferRuntimeCommon.h
│ ├── NvInferRuntime.h
│ ├── NvInferVersion.h
│ ├── NvOnnxConfig.h
│ ├── NvOnnxParser.h
│ ├── NvOnnxParserRuntime.h
│ ├── NvUffParser.h
│ └── NvUtils.h
├── lib # all shared libraries (.so files)
├── python # Python API samples
└── samples # C++ samples
- NvInferRuntimeCommon.h: defines the basic data structures the TensorRT runtime needs (e.g. Dims, PluginField) and most of the base classes (e.g. class ILogger, class IPluginV2)
- NvInferRuntime.h: builds on the base classes in NvInferRuntimeCommon.h and defines the extended runtime functionality
- NvInfer.h: builds on the classes in NvInferRuntime.h and defines most of the network layers that can be added directly (e.g. class IConvolutionLayer, class IPoolingLayer)
- NvInferPluginUtils.h: defines the basic data structures required by plugin layers
- NvInferPlugin.h: initializes and registers all built-in plugin layers (a call sketch follows this list)
- The remaining headers are, as their names suggest, the parsers for Caffe, ONNX, and UFF, corresponding to networks trained with Caffe, PyTorch, and TensorFlow respectively
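For networks that contain plugin layers, the built-in plugins declared in NvInferPlugin.h must be registered before parsing the model. The following is a minimal sketch (not needed for the UNet example below); registerTrtPlugins is a hypothetical helper name.

#include "NvInfer.h"
#include "NvInferPlugin.h" // declares initLibNvInferPlugins()

// Register all built-in TensorRT plugins with the global plugin registry so
// that the parsers can resolve plugin layers. Any ILogger implementation can
// be passed; the empty string selects the default plugin namespace.
bool registerTrtPlugins(nvinfer1::ILogger& logger)
{
    return initLibNvInferPlugins(&logger, "");
}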
TensorRT usage consists of two phases, build and deployment:
- build
This phase converts the model (from PyTorch/Caffe/TensorFlow to TensorRT). As shown in the figure above, the conversion applies the optimizations described earlier, such as layer fusion and precision calibration. The output of this step is a TensorRT model optimized for a specific GPU platform and network; it can be serialized to disk or kept in memory. The file stored on disk is called a plan file.
- deployment
This phase performs the actual inference, as shown in the figure above. The plan file produced in the previous step is first deserialized to create a runtime engine; input data (for example images from the test set, or images outside the dataset) can then be fed in, and the engine outputs classification vectors or detection results. A sketch of this serialize/deserialize round trip is given right after this list.
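The following is a minimal sketch of serializing a built engine to a plan file and loading it back with the runtime. It is an illustration only, not code from the sample used later; savePlan/loadPlan are hypothetical helpers, and error handling plus IRuntime lifetime management are omitted for brevity.

#include <fstream>
#include <iterator>
#include <string>
#include <vector>
#include "NvInfer.h"

// Build phase: serialize the optimized engine into a plan file on disk.
void savePlan(nvinfer1::ICudaEngine& engine, std::string const& path)
{
    nvinfer1::IHostMemory* plan = engine.serialize();
    std::ofstream file(path, std::ios::binary);
    file.write(static_cast<char const*>(plan->data()), static_cast<std::streamsize>(plan->size()));
    plan->destroy();
}

// Deployment phase: deserialize the plan file back into a runtime engine.
nvinfer1::ICudaEngine* loadPlan(std::string const& path, nvinfer1::ILogger& logger)
{
    std::ifstream file(path, std::ios::binary);
    std::vector<char> blob{std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>()};
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    return runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
}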
As shown in the figure above, the workflow used in this article is PyTorch -> ONNX -> TensorRT: first export the PyTorch model to ONNX, then let TensorRT parse the ONNX model, build a TensorRT engine, and run forward inference.
>> sudo apt-get install libprotobuf-dev protobuf-compiler # protobuf is a prerequisite library
>> git clone --recursive https://github.com/onnx/onnx.git # Pull the ONNX repository from GitHub
>> cd onnx
>> mkdir build && cd build
>> cmake .. # Compile and install ONNX
>> make # Use the ‘-j’ option for parallel jobs, for example, ‘make -j $(nproc)’
>> make install
- Input data preprocessing. NVIDIA provides an official example of brain MRI image segmentation based on a UNet model:
>> wget https://developer.download.nvidia.com/devblogs/speeding-up-unet.7z # Get the ONNX model and test data
>> tar xvf speeding-up-unet.7z # Unpack the model data into the unet folder
>> cd unet
>> python create_network.py # Inside the unet folder, this creates the unet.onnx file
The create_network.py script converts the PyTorch model into an ONNX model (the resulting ONNX model can be visualized with Netron). The script is shown below:
# create_network.py
import torch
from torch.autograd import Variable
import torch.onnx as torch_onnx
import onnx

def main():
    input_shape = (3, 256, 256)
    model_onnx_path = "unet.onnx"
    dummy_input = Variable(torch.randn(1, *input_shape))
    # Load the pretrained brain-segmentation UNet from torch.hub
    model = torch.hub.load('mateuszbuda/brain-segmentation-pytorch', 'unet',
                           in_channels=3, out_channels=1, init_features=32, pretrained=True)
    model.train(False)
    # Graph input/output names; the batch dimension is marked as dynamic
    inputs = ['input.1']
    outputs = ['186']
    dynamic_axes = {'input.1': {0: 'batch'}, '186': {0: 'batch'}}
    out = torch.onnx.export(model, dummy_input, model_onnx_path,
                            input_names=inputs, output_names=outputs,
                            dynamic_axes=dynamic_axes)

if __name__ == '__main__':
    main()
Download all the images from Kaggle, or from Baidu Netdisk with extraction code: v3xf.
Pick any 3 of the downloaded images (ones whose filenames do not contain _mask), then use the prepareData.py script below to preprocess each image, serialize it, and save it:
# prepareData.py
import torch
import argparse
import numpy as np
from torchvision import transforms
from skimage.io import imread
from onnx import numpy_helper
from skimage.exposure import rescale_intensity

def normalize_volume(volume):
    # Clip intensities to the 10th-99th percentile range, then standardize
    p10 = np.percentile(volume, 10)
    p99 = np.percentile(volume, 99)
    volume = rescale_intensity(volume, in_range=(p10, p99))
    m = np.mean(volume, axis=(0, 1, 2))
    s = np.std(volume, axis=(0, 1, 2))
    volume = (volume - m) / s
    return volume

def main(args):
    model = torch.hub.load('mateuszbuda/brain-segmentation-pytorch', 'unet',
                           in_channels=3, out_channels=1, init_features=32, pretrained=True)
    model.train(False)

    filename = args.input_image
    input_image = imread(filename)
    input_image = normalize_volume(input_image)
    input_image = np.asarray(input_image, dtype='float32')

    preprocess = transforms.Compose([
        transforms.ToTensor(),
    ])
    input_tensor = preprocess(input_image)
    input_batch = input_tensor.unsqueeze(0)

    # Serialize the preprocessed input as an ONNX TensorProto (input_0.pb)
    tensor1 = numpy_helper.from_array(input_batch.numpy())
    with open(args.input_tensor, 'wb') as f:
        f.write(tensor1.SerializeToString())

    # Run the PyTorch model to produce the reference output (output_0.pb)
    if torch.cuda.is_available():
        input_batch = input_batch.to('cuda')
        model = model.to('cuda')
    with torch.no_grad():
        output = model(input_batch)
    tensor = numpy_helper.from_array(output[0].cpu().numpy())
    with open(args.output_tensor, 'wb') as f:
        f.write(tensor.SerializeToString())

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_image', type=str)
    parser.add_argument('--input_tensor', type=str, default='input_0.pb')
    parser.add_argument('--output_tensor', type=str, default='output_0.pb')
    args = parser.parse_args()
    main(args)
>> pip install medpy #dependency for utils.py file
>> mkdir test_data_set_0
>> mkdir test_data_set_1
>> mkdir test_data_set_2
>> python prepareData.py --input_image your_image1 --input_tensor test_data_set_0/input_0.pb --output_tensor test_data_set_0/output_0.pb # This creates input_0.pb and output_0.pb
>> python prepareData.py --input_image your_image2 --input_tensor test_data_set_1/input_0.pb --output_tensor test_data_set_1/output_0.pb # This creates input_0.pb and output_0.pb
>> python prepareData.py --input_image your_image3 --input_tensor test_data_set_2/input_0.pb --output_tensor test_data_set_2/output_0.pb # This creates input_0.pb and output_0.pb
At this point the PyTorch-to-ONNX conversion and the input preprocessing are done; the next step is to load the model through the TensorRT API, build an engine, and run inference.
First, download the official TensorRT sample that accompanies the UNet model above and compile it:
>> git clone https://github.com/parallel-forall/code-samples.git
>> cd code-samples/posts/TensorRT-introduction
>> make clean && make # Compile the TensorRT C++ code
When compiling the sample code you may find that NvInfer.h or other TensorRT headers cannot be located; in that case add the TensorRT paths to code-samples/posts/TensorRT-introduction/Makefile:
Change:
CXXFLAGS=-std=c++11 -DONNX_ML=1 -Wall -I$(CUDA_INSTALL_DIR)/include
LDFLAGS=-L$(CUDA_INSTALL_DIR)/lib64 -L$(CUDA_INSTALL_DIR)/lib64/stubs -L/usr/local/lib
to:
CXXFLAGS=-std=c++11 -DONNX_ML=1 -Wall -I$(CUDA_INSTALL_DIR)/include -I/<xxxDir>/TensorRT-7.0.0.11/include
LDFLAGS=-L$(CUDA_INSTALL_DIR)/lib64 -L$(CUDA_INSTALL_DIR)/lib64/stubs -L/usr/local/lib -L/<xxxDir>/TensorRT-7.0.0.11/targets/x86_64-linux-gnu/lib
where <xxxDir> should be replaced with your TensorRT installation path.
Then run the sample directly:
>> cd code-samples/posts/TensorRT-introduction
>> ./simpleOnnx_1 path/to/unet/unet.onnx path/to/unet/test_data_set_0/input_0.pb # The sample application expects output reference values in path/to/unet/test_data_set_0/output_0.pb
After a successful run, an output.pgm file is generated in the source directory; this file is the model's output.
TensorRT offers two ways to deploy a network: 1. use one of the parsers to parse and convert an existing model; 2. build the model layer by layer with the TensorRT API. The first approach is the simplest and quickest and is the one used here; a rough sketch of the second approach is given below for reference.
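The second approach builds the network definition directly through the layer APIs declared in NvInfer.h. The following is only a sketch of a toy network (one convolution plus ReLU) with placeholder weight buffers supplied by the caller; it is unrelated to the UNet example.

#include <cstdint>
#include <vector>
#include "NvInfer.h"

// Build a toy network layer by layer: input -> 3x3 convolution (32 maps) -> ReLU.
// "weights" must hold 32*3*3*3 values and "bias" 32 values, and both buffers
// must stay alive until the engine has been built.
nvinfer1::ICudaEngine* buildByApi(nvinfer1::IBuilder& builder,
                                  std::vector<float> const& weights,
                                  std::vector<float> const& bias)
{
    const auto explicitBatch = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    nvinfer1::INetworkDefinition* network = builder.createNetworkV2(explicitBatch);

    nvinfer1::ITensor* input = network->addInput("input", nvinfer1::DataType::kFLOAT, nvinfer1::Dims4{1, 3, 256, 256});
    nvinfer1::Weights w{nvinfer1::DataType::kFLOAT, weights.data(), static_cast<int64_t>(weights.size())};
    nvinfer1::Weights b{nvinfer1::DataType::kFLOAT, bias.data(), static_cast<int64_t>(bias.size())};

    auto* conv = network->addConvolution(*input, 32, nvinfer1::DimsHW{3, 3}, w, b);
    auto* relu = network->addActivation(*conv->getOutput(0), nvinfer1::ActivationType::kRELU);
    network->markOutput(*relu->getOutput(0));

    nvinfer1::IBuilderConfig* config = builder.createBuilderConfig();
    config->setMaxWorkspaceSize(1 << 30);
    return builder.buildEngineWithConfig(*network, *config);
}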
- Create the CUDA engine:
// Declare the CUDA engine
unique_ptr<ICudaEngine, Destroy<ICudaEngine>> engine{nullptr};
...
// Create the CUDA engine
engine.reset(createCudaEngine(onnxModelPath, batchSize));
The engine is created by the createCudaEngine function, whose code is shown below:
nvinfer1::ICudaEngine* createCudaEngine(string const& onnxModelPath, int batchSize)
{
    unique_ptr<nvinfer1::IBuilder, Destroy<nvinfer1::IBuilder>> builder{nvinfer1::createInferBuilder(gLogger)};
    const auto explicitBatch = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    unique_ptr<nvinfer1::INetworkDefinition, Destroy<nvinfer1::INetworkDefinition>> network{builder->createNetworkV2(explicitBatch)};
    unique_ptr<nvonnxparser::IParser, Destroy<nvonnxparser::IParser>> parser{nvonnxparser::createParser(*network, gLogger)};
    unique_ptr<nvinfer1::IBuilderConfig, Destroy<nvinfer1::IBuilderConfig>> config{builder->createBuilderConfig()};

    // Parse the ONNX model into the network definition
    if (!parser->parseFromFile(onnxModelPath.c_str(), static_cast<int>(ILogger::Severity::kINFO)))
    {
        cout << "ERROR: could not parse input engine." << endl;
        return nullptr;
    }

    builder->setMaxBatchSize(batchSize);
    config->setMaxWorkspaceSize((1 << 30));

    // The inference precision can be set here: if the platform supports it,
    // half precision (FP16) can be requested. mTrtParams holds user
    // configuration and is not shown here.
    if (mTrtParams.FP16)
    {
        if (!builder->platformHasFastFp16())
        {
            gLogWarning << "Platform does not support fast FP16, still using FP32!" << std::endl;
        }
        else
        {
            config->setFlag(nvinfer1::BuilderFlag::kFP16);
        }
    }

    // Optimization profile for the dynamic batch dimension
    auto profile = builder->createOptimizationProfile();
    profile->setDimensions(network->getInput(0)->getName(), OptProfileSelector::kMIN, Dims4{1, 3, 256, 256});
    profile->setDimensions(network->getInput(0)->getName(), OptProfileSelector::kOPT, Dims4{1, 3, 256, 256});
    profile->setDimensions(network->getInput(0)->getName(), OptProfileSelector::kMAX, Dims4{32, 3, 256, 256});
    config->addOptimizationProfile(profile);

    return builder->buildEngineWithConfig(*network, *config);
}
- Create the execution context:
The context stores the intermediate activations produced during inference. It is created as follows:
// Declare the execution context
unique_ptr<IExecutionContext, Destroy<IExecutionContext>> context{nullptr};
...
// Create the execution context
context.reset(engine->createExecutionContext());
- Run forward inference:
The inference code is shown below. The procedure is:
1) copy the input data from the CPU (host) to the GPU (device);
2) run inference on the GPU via context->enqueueV2(bindings, stream, nullptr);
3) copy the output from the GPU back to the CPU.
void launchInference(IExecutionContext* context, cudaStream_t stream, vector<float> const& inputTensor, vector<float>& outputTensor, void** bindings, int batchSize)
{
    int inputId = getBindingInputIndex(context);
    // 1) Host -> device copy of the input
    cudaMemcpyAsync(bindings[inputId], inputTensor.data(), inputTensor.size() * sizeof(float), cudaMemcpyHostToDevice, stream);
    // 2) Asynchronous inference on the GPU
    context->enqueueV2(bindings, stream, nullptr);
    // 3) Device -> host copy of the output
    cudaMemcpyAsync(outputTensor.data(), bindings[1 - inputId], outputTensor.size() * sizeof(float), cudaMemcpyDeviceToHost, stream);
}
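The getBindingInputIndex helper used above is defined in the sample source. A minimal version consistent with this call site (a sketch, not necessarily the sample's exact code) simply asks the engine which of the two bindings is the input:

// Return the binding index of the input tensor, assuming exactly two bindings:
// if binding 0 is the input this yields 0, otherwise 1.
int getBindingInputIndex(nvinfer1::IExecutionContext* context)
{
    return !context->getEngine().bindingIsInput(0);
}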
After launchInference returns, call cudaStreamSynchronize on the stream to make sure the GPU has finished before the results are accessed on the host.
- The main function is as follows:
int main(int argc, char* argv[])
{
    // Declaring the CUDA engine.
    unique_ptr<ICudaEngine, Destroy<ICudaEngine>> engine{nullptr};
    // Declaring the execution context.
    unique_ptr<IExecutionContext, Destroy<IExecutionContext>> context{nullptr};
    vector<float> inputTensor;
    vector<float> outputTensor;
    vector<float> referenceTensor;
    void* bindings[2]{0};
    vector<string> inputFiles;
    CudaStream stream;

    if (argc != 3)
    {
        cout << "usage: " << argv[0] << " <path_to_model.onnx> <path_to_input.pb>" << endl;
        return 1;
    }

    string onnxModelPath(argv[1]);
    inputFiles.push_back(string{argv[2]}); // path of the serialized input tensor
    int batchSize = inputFiles.size();

    // Create the CUDA engine.
    engine.reset(createCudaEngine(onnxModelPath, batchSize));
    if (!engine)
        return 1;

    // Assume the network takes exactly 1 input tensor and outputs 1 tensor.
    assert(engine->getNbBindings() == 2);
    assert(engine->bindingIsInput(0) ^ engine->bindingIsInput(1));

    for (int i = 0; i < engine->getNbBindings(); ++i)
    {
        Dims dims{engine->getBindingDimensions(i)};
        // Element count of the binding; dims.d[0] is the dynamic batch
        // dimension, so the product starts at dims.d + 1 with batchSize.
        size_t size = accumulate(dims.d + 1, dims.d + dims.nbDims, batchSize, multiplies<size_t>());
        // Create a CUDA buffer for the tensor (size already includes batchSize).
        cudaMalloc(&bindings[i], size * sizeof(float));
        // Resize CPU buffers to fit the tensor.
        if (engine->bindingIsInput(i))
            inputTensor.resize(size);
        else
            outputTensor.resize(size);
    }

    // Read the input tensor from the protobuf file.
    if (readTensor(inputFiles, inputTensor) != inputTensor.size())
    {
        cout << "Couldn't read input Tensor" << endl;
        return 1;
    }

    // Create the execution context.
    context.reset(engine->createExecutionContext());
    // With a dynamic batch dimension, the input shape must be set on the context.
    Dims dims_i{engine->getBindingDimensions(0)};
    Dims4 inputDims{batchSize, dims_i.d[1], dims_i.d[2], dims_i.d[3]};
    context->setBindingDimensions(0, inputDims);

    launchInference(context.get(), stream, inputTensor, outputTensor, bindings, batchSize);
    // Wait until the work is finished before reading outputTensor on the host.
    cudaStreamSynchronize(stream);

    Dims dims{engine->getBindingDimensions(1)};
    saveImageAsPGM(outputTensor, dims.d[2], dims.d[3]);

    // Try to read and compare against the reference tensor from the protobuf file.
    vector<string> referenceFiles;
    for (string path : inputFiles)
        referenceFiles.push_back(path.replace(path.rfind("input"), 5, "output"));
    referenceTensor.resize(outputTensor.size());
    if (readTensor(referenceFiles, referenceTensor) != referenceTensor.size())
    {
        cout << "Couldn't read reference Tensor" << endl;
        return 1;
    }

    Dims dims_o{engine->getBindingDimensions(1)};
    int size = batchSize * dims_o.d[2] * dims_o.d[3];
    verifyOutput(outputTensor, referenceTensor, size); // compare against the reference output

    for (void* ptr : bindings)
        cudaFree(ptr); // don't forget to free the allocated device memory
    return 0;
}
For more details, see simpleOnnx_1.cpp in the NVIDIA code-samples repository.
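The helpers readTensor, saveImageAsPGM, and verifyOutput are defined there as well. As an illustration only, a minimal verifyOutput consistent with the call in main above could look like this:

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// Compare the TensorRT output against the PyTorch reference tensor
// element-wise and report the largest absolute difference.
void verifyOutput(std::vector<float> const& output, std::vector<float> const& reference, int size)
{
    float maxError = 0.0f;
    for (int i = 0; i < size; ++i)
        maxError = std::max(maxError, std::fabs(output[i] - reference[i]));
    std::cout << "Max absolute error vs. reference: " << maxError << std::endl;
}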