INT8 support in dynamo workflows

TL;DR
In line with the Torchscript front end, we should support INT8 precision in the dynamo workflow by setting enabled_precisions={torch.int8} and passing a DataLoaderCalibrator or CacheCalibrator class in the calibrator argument.
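Concretely, the proposed user-facing call might look like the sketch below. The DataLoaderCalibrator arguments mirror the existing Torchscript PTQ API; routing the calibrator through ir="dynamo" is the proposal here, not a shipped API, and the model/data are illustrative stand-ins.

```python
import torch
import torch_tensorrt
from torch.utils.data import DataLoader, TensorDataset

# Toy model and stand-in calibration data; replace with a real model/dataset.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).cuda().eval()
calib_loader = DataLoader(
    TensorDataset(torch.randn(64, 3, 224, 224), torch.zeros(64, dtype=torch.long)),
    batch_size=8,
)

# These calibrator arguments mirror the existing Torchscript PTQ API.
calibrator = torch_tensorrt.ptq.DataLoaderCalibrator(
    calib_loader,
    cache_file="./calibration.cache",
    use_cache=False,
    algo_type=torch_tensorrt.ptq.CalibrationAlgo.ENTROPY_CALIBRATION_2,
    device=torch.device("cuda:0"),
)

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",  # proposed: route INT8 + calibrator through the dynamo front end
    inputs=[torch_tensorrt.Input((8, 3, 224, 224))],
    enabled_precisions={torch.int8},
    calibrator=calibrator,  # proposed dynamo-side argument (this RFC)
)
```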
Goals
The support for INT8 precision has three phases:
Phase 1: Full graph compilation
If the precision is set to INT8, this should imply require_full_compilation=True. We can unify the Torchscript and Dynamo workflows a bit here. torch_tensorrt/py/ptq.py is the main file that holds the DataLoaderCalibrator and CacheCalibrator classes. A prototype implementation can be seen here: https://github.com/pytorch/TensorRT/blob/int8_ptq/py/torch_tensorrt/ptq.py#L74-L104
Once you have the required algo_info and cache file, we build the derivatives of the INT8Calibrator class within the dynamo (dynamo/utils.py) and ts (ts/_compile_spec.py) workflows; a sketch of such a derivative follows below.
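A minimal sketch of what such a derived calibrator could look like on the dynamo side, assuming the standard TensorRT Python calibrator interface (IInt8EntropyCalibrator2); the class name and constructor arguments are illustrative, not the shipped implementation.

```python
import os
import tensorrt as trt

class DataLoaderCalibratorImpl(trt.IInt8EntropyCalibrator2):
    """Hypothetical INT8 calibrator derived from user-provided settings."""

    def __init__(self, dataloader, cache_file, use_cache, device):
        super().__init__()
        self.iterator = iter(dataloader)
        self.cache_file = cache_file
        self.use_cache = use_cache
        self.device = device
        self.batch_size = dataloader.batch_size
        self.current_batch = None  # keep the device buffer alive during calibration

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        # Assumes a single network input; return one device pointer per input name.
        try:
            data, _ = next(self.iterator)
            self.current_batch = data.to(self.device)
            return [self.current_batch.data_ptr()]
        except StopIteration:
            return None  # signals TensorRT that calibration data is exhausted

    def read_calibration_cache(self):
        if self.use_cache and os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```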
Phase 2: QAT
We use the pytorch_quantization toolkit to produce QAT graphs. For dynamo, we can take these graphs and apply torch.export/torch.compile on them, e.g. as sketched below.
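A sketch of this flow using pytorch_quantization's documented quant_modules entry point; resnet18 and the input shape are illustrative, and whether the fake-quantize ops actually survive export is exactly what verification step (a) below checks.

```python
import torch
import torchvision
from pytorch_quantization import quant_modules

quant_modules.initialize()  # monkey-patch nn layers with fake-quantized variants

model = torchvision.models.resnet18(weights=None).cuda().eval()
# ... calibrate amax ranges / fine-tune with fake quantization enabled ...

exported = torch.export.export(
    model, (torch.randn(1, 3, 224, 224, device="cuda"),)
)
print(exported.graph)  # inspect for fake_quantize_* ops (verification step (a))
```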
a) First, we need to verify that these fake-quantize ops actually appear in the graphs produced by dynamo.
b) The work required here would be to add converter support for the torch.fake_quantize_per_tensor_affine and torch.fake_quantize_per_channel_affine ops; a converter sketch follows this list.
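A hypothetical converter sketch for the per-tensor op: fake_quantize_per_tensor_affine(x, scale, zero_point, quant_min, quant_max) lowers to an explicit Q/DQ pair (IQuantizeLayer + IDequantizeLayer) in the TensorRT network. The function signature loosely follows the converter convention, registration is omitted, and symmetric quantization (zero_point == 0, as TensorRT's INT8 Q/DQ requires) is assumed.

```python
import numpy as np
import tensorrt as trt

def fake_quantize_per_tensor_affine_converter(network, target, args, kwargs, name):
    # aten.fake_quantize_per_tensor_affine(x, scale, zero_point, qmin, qmax)
    # lowers to an explicit quantize -> dequantize pair sharing one scale.
    x, scale, zero_point, quant_min, quant_max = args  # x: trt.ITensor
    scale_const = network.add_constant(
        (1,), trt.Weights(np.array([scale], dtype=np.float32))
    )
    quantize = network.add_quantize(x, scale_const.get_output(0))
    dequantize = network.add_dequantize(
        quantize.get_output(0), scale_const.get_output(0)
    )
    return dequantize.get_output(0)
```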
Phase 3: INT8 + Fallback
In the case of PTQ fallback:
a) If a particular op is unsupported, each TRT subgraph would require its own calibrator, and the dataset each subgraph calibrates on would be different.
b) If the op is supported but is forced to fall back via torch_executed_ops, an alternative approach is to explicitly set that op's precision to FP32 using TRT APIs (implying we run this graph with require_full_compilation=True) and proceed with the INT8 calibration (via the PTQ API); see the sketch after this list. We rely on TensorRT to handle the mixed-precision inference of this graph.
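A minimal sketch of option (b) with the TensorRT builder API: keep the whole graph in a single engine, pin the fallback-listed layers to FP32, and calibrate everything else to INT8. The builder/network are assumed to come from the conversion pass, and is_forced_fallback is a hypothetical predicate marking layers that originate from ops in torch_executed_ops.

```python
import tensorrt as trt

def configure_mixed_precision(builder, network, calibrator, is_forced_fallback):
    """Pin fallback-listed layers to FP32 and calibrate the rest to INT8."""
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)
    # Make TensorRT honor the per-layer precisions set below.
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    config.int8_calibrator = calibrator  # PTQ calibrator from Phase 1

    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if is_forced_fallback(layer):  # hypothetical predicate
            layer.precision = trt.float32          # constrain compute precision
            layer.set_output_type(0, trt.float32)  # keep its output in FP32 too
    return builder.build_serialized_network(network, config)
```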
In the case of QAT fallback, we don't have to do anything explicit: the regions that are convertible to INT8 will be converted, and the rest will run in FP32.
Prototype - Phase 1: Medium, Phase 2: Medium

MVP (2.2) - Phase 1: Medium, Phase 2: Medium

Future work / Extensions