v13: fp16 i/o, faster dynamic shapes for TRT backend
- Added support for fp16 I/O format and faster dynamic shapes in the TRT backend.
  - Thanks to @hooke007, @grobalt, @MysteryDove, @chainikdn and many other users on the SVP forum, it became clear that reducing the system bandwidth requirement is crucial to 4K RIFE performance on powerful GPUs (https://github.com/AmusementClub/vs-mlrt/discussions/19). The TRT backend now accepts fp16 clips in addition to fp32, and the output format can be specified via the parameter `output_format` (0 for fp32 and 1 for fp16).
  - As the only portable way to convert fp32 clips to fp16 is via `resize` (`std.Expr` only supports fp16 when the CPU supports the f16c instruction set extension), and to conserve memory bandwidth, you can use the following snippet for RIFE to perform the necessary padding, YUV/RGB conversion and fp16 conversion in one go:

    ```python
    th = (src.height + 31) // 32 * 32  # adjust 32 and 31 to match the specific AI network's input resolution requirement
    tw = (src.width + 31) // 32 * 32   # same
    padded = src.resize.Bicubic(tw, th, format=vs.RGBS if WANT_FP32 else vs.RGBH, matrix_in_s="709", src_width=tw, src_height=th)
    flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend, output_format=1)  # fp16 output
    oh = src.height * (flt.height // th)  # not necessary for RIFE (i.e. oh = src.height), but required for super-resolution upscalers
    ow = src.width * (flt.width // tw)
    res = flt.resize.Bicubic(ow, oh, format=vs.YUV420P8, matrix_s="709", src_width=ow, src_height=oh)
    ```
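The mod-32 rounding in the snippet simply rounds each dimension up to the next multiple of 32. A standalone sketch of that arithmetic (plain Python, independent of VapourSynth; the helper name is ours):

```python
def pad_to_multiple(size: int, mod: int = 32) -> int:
    # round size up to the next multiple of mod,
    # matching (size + mod - 1) // mod * mod in the snippet above
    return (size + mod - 1) // mod * mod

print(pad_to_multiple(1080))  # 1088 (hence the 1920x1088 sizes in the benchmark below)
print(pad_to_multiple(1920))  # 1920, already a multiple of 32
print(pad_to_multiple(720))   # 736
```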
- Faster dynamic shapes, introduced in TensorRT 8.5.1, improve the performance and device memory usage of dynamically shaped engines (#20 (comment) and the following benchmark).
  - Dynamically shaped models can be created by specifying `static_shape=False`, `min_shapes` (minimum size), `opt_shapes` (optimization size) and `max_shapes` (maximum size) in the `TRT` backend.
  - The engine cache placement policy of `vsmlrt.py` is direct-mapped:
    - If dynamic shapes are used, the engine name is determined by `min_shapes`, `opt_shapes` and `max_shapes` (among others).
    - Otherwise, the engine name is determined by `opt_shapes`. `opt_shapes` is usually set to `tilesize` in each specific model's interface if not initialized.
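Direct-mapped here means each parameter combination maps to exactly one cache entry, so an engine is reused only when its key matches exactly. A hypothetical illustration of such a keying scheme (for illustration only; the real engine file names produced by `vsmlrt.py` encode additional build parameters):

```python
def engine_cache_key(opt_shapes, min_shapes=None, max_shapes=None):
    # hypothetical keying scheme for illustration only
    if min_shapes is not None and max_shapes is not None:
        # dynamic engine: all three shape ranges participate in the key
        return f"min{min_shapes}_opt{opt_shapes}_max{max_shapes}"
    # static engine: only opt_shapes matters
    return f"opt{opt_shapes}"

# static builds with the same opt_shapes share one cache slot
print(engine_cache_key((1920, 1088)))
# changing any bound of a dynamic engine yields a new key, i.e. a rebuild
print(engine_cache_key((1920, 1088), (64, 64), (3840, 2176)))
```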
- `workspace` can now be set to `None` for unlimited workspace size. (#21)
- Added flag `force_fp16`. This flag forces fp16 computation during inference, and is disabled by default.
  - It reduces memory usage during engine building and allows more engines to be built, e.g. 1080p RIFE on GPUs with 4 GB memory (we successfully built a dynamically shaped RIFE engine with `min_shapes=(64, 64), opt_shapes=(1920, 1088), max_shapes=(3840, 2176)` on a 4 GB Ampere GPU).
  - It may reduce engine build time.
- Introduced a new simplified backend interface `BackendV2`.
- Disabled tiling support for RIFE due to its incompatible inference design.
### dynamic shapes
In conclusion, dynamic shapes are much more flexible when dealing with different video resolutions (no engine re-compilation is required) and incur almost no performance degradation starting with TensorRT 8.5; only the increased device memory usage is a concern.
### benchmark
- Configuration: NVIDIA A10 (ECC disabled), driver 527.41 (TCC), Windows Server 2022, Python 3.11.1, vapoursynth-classic R57.A7, `Backend.TRT(fp16=True, use_cuda_graph=True, tf32=False, use_cudnn=False)`, `CUDA_MODULE_LOADING=LAZY`
- Statically shaped engines for each model are compiled separately, while the dynamically shaped engine is compiled once with (1) `static_shape=False, min_shapes=(64, 64), opt_shapes=<max-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions>` or (2) `static_shape=False, min_shapes=(64, 64), opt_shapes=<min-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions>`. `opt_shapes` may be lowered for faster engine generation.
- Measurements: fps / device memory (MB)
model | 1 stream static | 1 stream dynamic (1) | 1 stream dynamic (2) | 2 streams static | 2 streams dynamic (1) | 2 streams dynamic (2) |
---|---|---|---|---|---|---|
waifu2x upconv7 1920x1080 | 17.4 / 1992 | 17.5 / 1998 | 17.4 / 2040 | 21.2 / 3694 | 21.2 / 3756 | 20.9 / 3850 |
waifu2x upconv7 1280x720 | 37.2 / 1046 | 38.5 / 1930 | 37.8 / 1976 | 46.3 / 1818 | 48.2 / 3628 | 46.5 / 3722 |
waifu2x upconv7 720x480 | 102.2 / 544 | 104.4 / 1894 | 102.2 / 1940 | 123.1 / 834 | 128.1 / 3556 | 123.4 / 3650 |
dpir color 1920x1080 | 7.3 / 2114 | 7.3 / 2114 | 7.2 / 2116 | 7.5 / 3656 | 7.4 / 3992 | 7.4 / 4002 |
dpir color 1280x720 | 16.4 / 1122 | 16.3 / 2086 | 16.2 / 2092 | 16.7 / 1810 | 16.7 / 3936 | 16.7 / 3946 |
dpir color 720x480 | 41.5 / 604 | 41.5 / 2068 | 41.6 / 2074 | 44.3 / 863 | 44.3 / 3900 | 44.2 / 3910 |
real-esrgan v2 1920x1080 | 12.3 / 1320 | 12.3 / 1320 | 12.3 / 1320 | 13.0 / 2196 | 13.2 / 2402 | 12.9 / 2402 |
real-esrgan v2 1280x720 | 26.9 / 736 | 26.9 / 1256 | 27.0 / 1256 | 29.1 / 1130 | 29.3 / 2274 | 29.2 / 2274 |
real-esrgan v2 720x480 | 73.2 / 422 | 73.2 / 1220 | 72.7 / 1220 | 78.7 / 570 | 78.4 / 2202 | 78.1 / 2202 |
cugan 1920x1080 | 9.4 / 4648 | 9.4 / 4726 | 9.2 / 4618 | 9.8 / 8754 | 10.2 / 9210 | 9.9 / 8996 |
cugan 1280x720 | 20.5 / 2214 | 20.5 / 4662 | 20.0 / 4554 | 21.2 / 4050 | 22.9 / 9082 | 22.4 / 8868 |
cugan 720x480 | 54.8 / 996 | 53.7 / 4626 | 52.9 / 4518 | 57.7 / 1690 | 59.9 / 9019 | 58.9 / 8796 |
rife v4.4 1920x1088 | 92.8 / 590 | 92.2 / 594 | 89.5 / 606 | 178.6 / 920 | 177.6 / 942 | 169.5 / 974 |
rife v4.4 1280x736 | 206.0 / 410 | 199.2 / 534 | 199.1 / 550 | 394.2 / 560 | 377.5 / 822 | 374.0 / 854 |
rife v4.4 736x480 | 497.3 / 316 | 442.2 / 504 | 492.3 / 520 | 903.2 / 376 | 809.3 / 762 | 874.1 / 794 |
*The gap is large for RIFE because of underutilization, and will disappear when using more streams.
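To quantify that footnote from the table's 1-stream fps columns: at 736x480 the dynamic (1) RIFE engine reaches roughly 89% of the static engine's throughput, while at 1920x1088 the two are within about 1% of each other:

```python
# 1-stream fps values copied from the rife v4.4 rows above
static_480, dynamic1_480 = 497.3, 442.2
static_1088, dynamic1_1088 = 92.8, 92.2

ratio_480 = dynamic1_480 / static_480
ratio_1088 = dynamic1_1088 / static_1088
print(f"736x480:   {ratio_480:.1%}")   # ~88.9%
print(f"1920x1088: {ratio_1088:.1%}")  # ~99.4%
```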