
v13: fp16 i/o, faster dynamic shapes for TRT backend

github-actions released this on 14 Jan 10:49
  • Added support for fp16 I/O format and faster dynamic shapes in the TRT backend.

    Because the only portable way to convert fp32 clips to fp16 is via resize (std.Expr supports fp16 only when the CPU supports the F16C instruction set extension), and to conserve memory bandwidth, you can use the following snippet for RIFE to perform the necessary padding, the YUV/RGB conversion and the fp16 conversion in one go:

    # Assumptions: src is the input YUV clip, backend is a configured vsmlrt backend,
    # and WANT_FP32 selects fp32 (vs.RGBS) or fp16 (vs.RGBH) network input.
    import vapoursynth as vs
    import vsmlrt

    th = (src.height + 31) // 32 * 32  # adjust 32 and 31 to match specific AI network input resolution requirements.
    tw = (src.width  + 31) // 32 * 32  # same.
    padded = src.resize.Bicubic(tw, th, format=vs.RGBS if WANT_FP32 else vs.RGBH, matrix_in_s="709", src_width=tw, src_height=th)
    flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend, output_format=1)  # fp16 output
    oh = src.height * (flt.height // th)  # not necessary for RIFE (i.e. oh = src.height), but required for super-resolution upscalers.
    ow = src.width  * (flt.width  // tw)
    res = flt.resize.Bicubic(ow, oh, format=vs.YUV420P8, matrix_s="709", src_width=ow, src_height=oh)
    • Faster dynamic shapes, introduced in TensorRT 8.5.1, improve the performance and device memory usage of dynamically shaped engines (#20 (comment) and the benchmark below).

      Dynamically shaped models can be created by specifying static_shape=False, min_shapes (minimum size), opt_shapes (optimization size) and max_shapes (maximum size) in the TRT backend; a sketch follows the list below.

      The engine cache placement policy of vsmlrt.py is direct-mapped:

      • If dynamic shapes are used, the engine name is determined by min_shapes, opt_shapes and max_shapes (among others).
      • Otherwise, the engine name is determined by opt_shapes.
      • If not initialized, opt_shapes is usually set to the tilesize in each specific model's interface.
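
      For illustration, a dynamically shaped engine could be requested as in the sketch below (the keyword names are those described above; the shape values are the ones quoted in the force_fp16 item further down and are only illustrative; padded is the clip from the earlier snippet):

      backend = vsmlrt.Backend.TRT(
          fp16=True,
          static_shape=False,       # enable dynamic shapes
          min_shapes=(64, 64),      # smallest accepted input
          opt_shapes=(1920, 1088),  # resolution the engine is optimized for
          max_shapes=(3840, 2176),  # largest accepted input
      )
      flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend)
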
    • workspace can now be set to None for an unlimited workspace size (#21); see the sketch after the next item.

    • Added the force_fp16 flag. This flag forces fp16 computation during inference and is disabled by default.

      • It reduces memory usage during engine build and allows more engines to be built, e.g. 1080p rife on GPUs with 4 GB of memory (a dynamically shaped rife engine with min_shapes=(64, 64), opt_shapes=(1920, 1088), max_shapes=(3840, 2176) was successfully built on a 4 GB Ampere GPU).
      • It may reduce engine build time.
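
      A sketch combining the two options above (workspace and force_fp16 are the keyword names described in these items; the remaining arguments follow the earlier dynamic-shape sketch, and the exact interaction between fp16 and force_fp16 is as stated above, i.e. force_fp16 forces fp16 computation during inference):

      backend = vsmlrt.Backend.TRT(
          fp16=True,
          force_fp16=True,   # force fp16 computation during inference
          workspace=None,    # no workspace size limit
          static_shape=False,
          min_shapes=(64, 64),
          opt_shapes=(1920, 1088),
          max_shapes=(3840, 2176),
      )
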
  • Introduced a new simplified backend interface, BackendV2 (a sketch follows this list).

  • Disabled tiling support for rife due to its incompatible inference design.
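
Referring to the BackendV2 item above, here is a hypothetical usage sketch. It assumes BackendV2 exposes per-backend constructors such as BackendV2.TRT that accept the same keyword arguments as the corresponding Backend classes; this exact form is an assumption, not taken from these notes.

    # Hypothetical sketch of the simplified BackendV2 interface (assumed constructor name).
    backend = vsmlrt.BackendV2.TRT(
        fp16=True,
        static_shape=False,
        min_shapes=(64, 64),
        opt_shapes=(1920, 1088),
        max_shapes=(3840, 2176),
    )
    flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend)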

dynamic shapes

In conclusion, dynamic shapes are much more flexible when dealing with different video resolutions (no engine re-compilation is required) and incur almost no performance degradation starting with TensorRT 8.5. The only remaining concern is increased device memory usage.

benchmark

  • configuration: nvidia a10 (ecc disabled), driver 527.41 (tcc), windows server 2022, python 3.11.1, vapoursynth-classic R57.A7, Backend.TRT(fp16=True, use_cuda_graph=True, tf32=False, use_cudnn=False), CUDA_MODULE_LOADING=LAZY

    • Statically shaped engines for each model are compiled separately, while the dynamically shaped engine is compiled once with (1) static_shape=False, min_shapes=(64, 64), opt_shapes=<max-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions> or (2) static_shape=False, min_shapes=(64, 64), opt_shapes=<min-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions>.

      opt_shapes may be lowered for faster engine generation.
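
      For concreteness, the two dynamic configurations could be written as in the sketch below, where (1920, 1080) and (720, 480) stand in for the maximum and minimum benchmarked video dimensions of a given model and the remaining options mirror the configuration above:

      common = dict(fp16=True, use_cuda_graph=True, tf32=False, use_cudnn=False)

      # (1) optimized for the largest benchmarked resolution
      dynamic_1 = vsmlrt.Backend.TRT(static_shape=False, min_shapes=(64, 64),
                                     opt_shapes=(1920, 1080), max_shapes=(1920, 1080), **common)

      # (2) optimized for the smallest benchmarked resolution
      dynamic_2 = vsmlrt.Backend.TRT(static_shape=False, min_shapes=(64, 64),
                                     opt_shapes=(720, 480), max_shapes=(1920, 1080), **common)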

measurements: fps / device memory (MB)

model | 1 stream static | 1 stream dynamic (1) | 1 stream dynamic (2) | 2 streams static | 2 streams dynamic (1) | 2 streams dynamic (2)
waifu2x upconv7 1920x1080 | 17.4 / 1992 | 17.5 / 1998 | 17.4 / 2040 | 21.2 / 3694 | 21.2 / 3756 | 20.9 / 3850
waifu2x upconv7 1280x720 | 37.2 / 1046 | 38.5 / 1930 | 37.8 / 1976 | 46.3 / 1818 | 48.2 / 3628 | 46.5 / 3722
waifu2x upconv7 720x480 | 102.2 / 544 | 104.4 / 1894 | 102.2 / 1940 | 123.1 / 834 | 128.1 / 3556 | 123.4 / 3650
dpir color 1920x1080 | 7.3 / 2114 | 7.3 / 2114 | 7.2 / 2116 | 7.5 / 3656 | 7.4 / 3992 | 7.4 / 4002
dpir color 1280x720 | 16.4 / 1122 | 16.3 / 2086 | 16.2 / 2092 | 16.7 / 1810 | 16.7 / 3936 | 16.7 / 3946
dpir color 720x480 | 41.5 / 604 | 41.5 / 2068 | 41.6 / 2074 | 44.3 / 863 | 44.3 / 3900 | 44.2 / 3910
real-esrgan v2 1920x1080 | 12.3 / 1320 | 12.3 / 1320 | 12.3 / 1320 | 13.0 / 2196 | 13.2 / 2402 | 12.9 / 2402
real-esrgan v2 1280x720 | 26.9 / 736 | 26.9 / 1256 | 27.0 / 1256 | 29.1 / 1130 | 29.3 / 2274 | 29.2 / 2274
real-esrgan v2 720x480 | 73.2 / 422 | 73.2 / 1220 | 72.7 / 1220 | 78.7 / 570 | 78.4 / 2202 | 78.1 / 2202
cugan 1920x1080 | 9.4 / 4648 | 9.4 / 4726 | 9.2 / 4618 | 9.8 / 8754 | 10.2 / 9210 | 9.9 / 8996
cugan 1280x720 | 20.5 / 2214 | 20.5 / 4662 | 20.0 / 4554 | 21.2 / 4050 | 22.9 / 9082 | 22.4 / 8868
cugan 720x480 | 54.8 / 996 | 53.7 / 4626 | 52.9 / 4518 | 57.7 / 1690 | 59.9 / 9019 | 58.9 / 8796
rife v4.4 1920x1088 | 92.8 / 590 | 92.2 / 594 | 89.5 / 606 | 178.6 / 920 | 177.6 / 942 | 169.5 / 974
rife v4.4 1280x736 | 206.0 / 410 | 199.2 / 534 | 199.1 / 550 | 394.2 / 560 | 377.5 / 822 | 374.0 / 854
rife v4.4 736x480 | 497.3 / 316 | 442.2 / 504 | 492.3 / 520 | 903.2 / 376 | 809.3 / 762 | 874.1 / 794

*The gap is larger for rife because of GPU underutilization; it disappears when more streams are used.