- Gave a talk at FossUnited Bangalore. Here are the slides.
This repo contains all the code I worked on while learning Triton Inference Server. It has both PyTorch and TensorFlow pipelines, but my main focus is on exploring PyTorch models, so all the advanced features are based on PyTorch.
For starters, you can set up your environment as explained below and then walk through the starter notebooks:

- `inference_notebooks/inference_pytorch.ipynb`
- `inference_notebooks/inference_tensorflow.ipynb`

Later you can explore advanced use cases in `inference_notebooks/advance.ipynb`.
I have also prepared some notes in this README; feel free to explore them too.
IMPORTANT: Check the supported Triton version in the Deep Learning Frameworks support matrix and update the base image in the Dockerfiles accordingly. To get your CUDA version, use `nvcc --version`. For me it was CUDA 11.7, so I used the 22.08 Triton release.
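If you build or trace models locally, it can also help to confirm which CUDA build your PyTorch installation uses before picking a base image. A minimal sketch (assuming `torch` is installed in your environment):

```python
import torch

# CUDA version the local PyTorch build was compiled against
print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
# Whether a GPU is actually visible to PyTorch
print("GPU available:", torch.cuda.is_available())
```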
- Create the image:

  ```bash
  docker build -t triton_cc_pt:0.0.1 -f dockers/Dockerfile.cpu.pt .
  ```

  For TensorFlow:

  ```bash
  docker build -t triton_cc_tf:0.0.1 -f dockers/Dockerfile.cpu.tf .
  ```
- Run the notebook and save the `weights` folder to ensure the default PyTorch model gets loaded.
- Run the Docker container with `tritonserver` in detached mode:

  ```bash
  bash bash_scripts/triton_server_pytorch.sh
  ```

  For TensorFlow:

  ```bash
  bash bash_scripts/triton_server_tensorflow.sh
  ```
- The current base image in the Dockerfile ships with all the backends, which may not be required. Consider customizing it. More info here.
- The ports are for:
  - HTTP: 8000
  - gRPC: 8001
  - metrics: 8002
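  A quick way to confirm the server is reachable on these ports (a minimal sketch; it assumes the container maps the ports to `localhost` and that `tritonclient` and `requests` are installed):

  ```python
  import requests
  import tritonclient.grpc as grpcclient
  import tritonclient.http as httpclient

  # HTTP endpoint (port 8000)
  http_client = httpclient.InferenceServerClient(url="localhost:8000")
  print("HTTP ready:", http_client.is_server_ready())

  # gRPC endpoint (port 8001)
  grpc_client = grpcclient.InferenceServerClient(url="localhost:8001")
  print("gRPC ready:", grpc_client.is_server_ready())

  # Prometheus metrics (port 8002)
  print(requests.get("http://localhost:8002/metrics").text[:200])
  ```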
- We are mounting the local `models` folder to the container's `models` folder so that all the model files we create from within the notebook are mapped automatically into the container, and we can then ping `tritonserver` directly.
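  To confirm that a model you just wrote into the mounted `models` folder is actually visible to the server, you can query the model repository. A hedged sketch (`text_classifier` is a placeholder model name, and depending on the model-control mode the server may need a restart or an explicit load before the model shows up as ready):

  ```python
  import tritonclient.http as httpclient

  client = httpclient.InferenceServerClient(url="localhost:8000")

  # List everything Triton sees in the mounted model repository
  for entry in client.get_model_repository_index():
      print(entry)

  # Check that a specific model (placeholder name) is loaded and ready to serve
  print("ready:", client.is_model_ready("text_classifier"))
  ```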
- Spend some time debugging the `model.py` code. You can add breakpoints and fix issues conveniently: run the server, and when you hit it with `tritonclient`, the breakpoint will trigger (see the sketch after this list).
- Always take care of the input dimensions you specify in the `config.pbtxt`.
- Make sure the first axis of the input and output is dynamic, for the batch size.
- Make sure to use the same input and output names while creating the ONNX model and during client inference (see the export sketch after this list).
- Take care of the dtypes you use when compiling to ONNX and the ones specified in the `config.pbtxt`. For instance, the transformers tokenizer returns dtype `int64`, and if you specify `int32` (preferred) in `config.pbtxt`, inference will fail unless you cast the inputs to match.
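For reference, here is a minimal sketch of what a Python-backend `model.py` looks like and where a breakpoint can go. The input/output names (`INPUT_TEXT`, `OUTPUT`) are placeholders, not the ones used in this repo:

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args is a dict with keys like "model_config", "model_name", "model_version"
        self.model_config = args["model_config"]

    def execute(self, requests):
        responses = []
        for request in requests:
            # import pdb; pdb.set_trace()  # fires when a tritonclient request arrives
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT_TEXT")
            data = input_tensor.as_numpy()

            # ... run preprocessing / the actual model here; we just echo the input back ...
            output_tensor = pb_utils.Tensor("OUTPUT", data)
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        return responses
```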
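And a sketch of how the ONNX export, the `config.pbtxt`, and the client call have to line up on names, batch dimension, and dtypes. The model, names, paths, and shapes below are placeholders, not the exact ones used in this repo:

```python
import os

import numpy as np
import torch
import tritonclient.http as httpclient

# --- Export: fix the input/output names and make axis 0 dynamic for batching ---
model = torch.nn.Linear(128, 2)  # placeholder model
dummy = torch.randn(1, 128)
os.makedirs("models/onnx_classifier/1", exist_ok=True)  # Triton layout: <model>/<version>/
torch.onnx.export(
    model,
    dummy,
    "models/onnx_classifier/1/model.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch_size"}, "logits": {0: "batch_size"}},
)

# config.pbtxt must declare the same names and dtypes; with max_batch_size > 0 the
# leading batch axis is implicit, e.g. input "input" dims: [ 128 ], TYPE_FP32.

# --- Client: use the same names and dtypes as declared in config.pbtxt ---
client = httpclient.InferenceServerClient(url="localhost:8000")
batch = np.random.randn(4, 128).astype(np.float32)  # cast to the declared dtype
# (e.g. tokenizer outputs are int64 -> .astype(np.int32) if config.pbtxt says TYPE_INT32)
infer_input = httpclient.InferInput("input", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)
result = client.infer(
    "onnx_classifier",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("logits")],
)
print(result.as_numpy("logits").shape)  # (4, 2): the batch axis flows through
```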
My personal recommendation is to run this within a Docker container.
- Create a container:

  ```bash
  docker run --gpus all --rm -it -v ${PWD}/models/:/workspace/models/ -v ${PWD}/weights/:/workspace/weights/ -v ${PWD}/inference_notebook/:/workspace/inference_notebook/ --name triton_trtc nvcr.io/nvidia/tensorrt:22.08-py3
  ```
- Run the TensorRT section of the notebook.
- TensorRT does not support every operation and can cause issues. In that case, try upgrading its version, but keep your system's CUDA and Triton versions in mind. If possible, update the CUDA version.
- The FP16 version takes time to compile, so take a break (see the build sketch after this list).
- While using the HTTP client, use `async_infer` and make sure to set `concurrency` when initializing the client (see the client sketch after this list).
- While using the gRPC client, use `async_infer` with a callback, and don't use a context manager with the client. I'm not sure of the reason yet, but I will explore and update here.
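For orientation, here is a rough sketch of what an FP16 TensorRT build boils down to with the TensorRT Python API; the input name, shapes, and file paths are placeholders, and the notebook's actual code may differ:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

# Parse the ONNX model (placeholder path)
with open("weights/model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parsing failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # the FP16 build is the slow part

# A dynamic batch axis needs an optimization profile (placeholder input name/shapes)
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 128), (8, 128), (32, 128))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("weights/model.plan", "wb") as f:
    f.write(engine_bytes)
```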
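And a hedged sketch of both async client patterns mentioned above; the model and tensor names are placeholders:

```python
import time

import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.http as httpclient

batch = np.random.randn(4, 128).astype(np.float32)  # placeholder input

# --- HTTP: set concurrency when creating the client, then fire async_infer ---
http_client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=4)
inp = httpclient.InferInput("input", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)
futures = [
    http_client.async_infer(
        "onnx_classifier",
        inputs=[inp],
        outputs=[httpclient.InferRequestedOutput("logits")],
    )
    for _ in range(4)
]
for fut in futures:
    print("HTTP result:", fut.get_result().as_numpy("logits").shape)

# --- gRPC: async_infer takes a callback; close the client explicitly (no context manager) ---
grpc_client = grpcclient.InferenceServerClient(url="localhost:8001")
g_inp = grpcclient.InferInput("input", list(batch.shape), "FP32")
g_inp.set_data_from_numpy(batch)

def on_complete(result, error):
    # Called from a background thread when the response arrives
    if error is not None:
        print("inference failed:", error)
    else:
        print("gRPC result:", result.as_numpy("logits").shape)

grpc_client.async_infer(
    "onnx_classifier",
    inputs=[g_inp],
    callback=on_complete,
    outputs=[grpcclient.InferRequestedOutput("logits")],
)

time.sleep(2)  # crude wait so the callback fires before the script exits
grpc_client.close()
```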
- Performance analyzer
- Model analyzer
- Metrics
- Stable diffusion pipelines
- Efficient deployment on the cloud (e.g., runpod.io)