
Releases: AmusementClub/vs-mlrt

v14.test: latest TensorRT library

13 Mar 12:20
Pre-release

This is a preview release for TensorRT 8.6.1.

  • It requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs has been dropped.

  • Add parameters builder_optimization_level and max_aux_streams to the TRT backend (see the sketch after this list).

    • builder_optimization_level: "adjust how long TensorRT should spend searching for tactics with potentially better performance" link
    • max_aux_streams: within-inference multi-streaming; "if enabled, TensorRT will run some layers on the auxiliary streams in parallel to the layers running on the main stream, ..., may increase the memory consumption, ..." link
      • It is advised to set max_aux_streams to 0 on heavy models like Waifu2xModel.swin_unet_art to reduce memory usage. Check the benchmark data at the bottom.
  • Following TensorRT 8.6.1, the cuDNN tactic source of the TRT backend is disabled by default; tf32 is also disabled by default in vsmlrt.py.

  • Add parameter short_path to the TRT backend, which shortens the engine path; it is enabled by default on Windows.

  • Model Waifu2xModel.swin_unet_art does not seem to work with builder_optimization_level=5 in the TRT backend before TRT 9.0. Use builder_optimization_level=4 or lower instead.
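
A minimal sketch of how these parameters might be passed (the model choice and parameter values are illustrative; rgbs is an assumed RGBS/RGBH clip):

    import vsmlrt

    # builder_optimization_level=4 avoids the swin_unet_art issue noted above;
    # max_aux_streams=0 reduces device memory usage on heavy models
    backend = vsmlrt.Backend.TRT(
        fp16=True,
        builder_optimization_level=4,
        max_aux_streams=0,
    )
    flt = vsmlrt.Waifu2x(rgbs, noise=-1, scale=2, model=vsmlrt.Waifu2xModel.swin_unet_art, backend=backend)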

Compared to v13.1/v13.2, built-in models show less than a 5% performance improvement, but device memory usage is reduced by 24% on DPIR and by 35% on RealESRGAN.


Version information:

  • The v13.2 release uses TRT 8.5.1 + CUDA 11.8.0, which can run on driver >= 450 and 900 series and later GPUs.
  • The v14.test pre-release uses TRT 8.6.1 + CUDA 12.1.1, which can only run on driver >= 525 and 10 series and later GPUs, with no significant performance improvement measured.
  • vsmlrt.py in both branches can be used interchangeably.

  • Added support for the RIFE v4.7 model ("optimized for anime scenes"), which is also available for previous vs-mlrt releases (simply download the new model file here and update vsmlrt.py). It is more computationally intensive than v4.6. A usage sketch follows this list.

  • This pre-release is now feature complete. Development has switched to the trt-latest branch and the v14.test2 pre-release.
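
A sketch of selecting the new model, assuming the enum name follows v4_6's convention (padded is an assumed RGBH/RGBS clip padded to the network's required modulo, as in the v13 snippet further below):

    import vsmlrt

    # RIFE v4.7 requires the new model file and an updated vsmlrt.py
    flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_7, backend=vsmlrt.Backend.TRT(fp16=True))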

v13.2: latest ort library, DirectML backend

28 May 15:01
  • Added support for a DirectML backend through ONNX Runtime. It is available for all DirectX 12 devices and can be accessed through backend=Backend.ORT_DML() (see the sketch after this list). waifu2x swin_unet models may not be supported on this backend, and RIFE models may be poorly supported.
  • Asset vsmlrt-windows-x64-vk.*.7z is renamed to vsmlrt-windows-x64-generic-gpu.*.7z and includes the OV_CPU, OV_GPU, ORT_CPU, ORT_DML and NCNN_VK backends. The cuda asset continues to include all backends in this release.
  • Update onnxruntime to microsoft/onnxruntime@73584f9.
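
A minimal usage sketch (the model choice and strength value are illustrative; rgbs is an assumed RGBS clip):

    import vsmlrt

    # the DirectML backend runs on any DirectX 12 capable device
    flt = vsmlrt.DPIR(rgbs, strength=5, model=vsmlrt.DPIRModel.drunet_color, backend=vsmlrt.Backend.ORT_DML())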

Note

  • Backend OV_GPU may produce reduced precision output. This is under investigation.

benchmark 1

NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 532.03, Windows 10 21H2 LTSC (19044.1415), Python 3.11.3, vapoursynth-classic R57.A8

Measurements: (1080p, fp16) FPS / Device Memory (MB)

| model | ORT_CUDA | ORT_DML |
| --- | --- | --- |
| dpir | 4.25 / 2573.3 | 7.01 / 2371.0 |
| dpir (2 streams) | 4.58 / 5506.2 | 8.85 / 4643.1 |
| waifu2x upconv7 | 9.10 / 5248.1 | 9.65 / 4503.1 |
| waifu2x upconv7 (2 streams) | 11.15 / 2966.9 | 18.52 / 8911.2 |
| waifu2x cunet / cugan | 4.06 / 7875.7 | 6.36 / 8973.7 |
| waifu2x cunet / cugan (2 streams) | N/A | 9.51 / 17849.1 |
| waifu2x swin_unet_art | N/A | N/A |
| realesrgan | 7.52 / 1901.7 | 8.54 / 1352.4 |
| realesrgan (2 streams) | 11.15 / 2966.9 | 15.58 / 2608.7 |
| rife | 34.30 / 1109.1 | 2.12 / 1417.8 |
| rife (2 streams) | 61.45 / 2051.4 | 4.27 / 2740.9 |

benchmark 2

AMD Radeon Pro V620 MxGPU, 4608 shaders @ 2390 MHz, Adrenalin 21.20.02.13, Windows Server 2019, Python 3.11.3, vapoursynth-classic R57.A8

Measurements: (1080p, fp16) FPS / Device Memory (MB)

| model | NCNN_VK | ORT_DML |
| --- | --- | --- |
| dpir | 1.70 / 3248.4 | 4.75 / 2308.1 |
| dpir (2 streams) | 1.74 / 6099.5 | 4.86 / 4584.6 |
| waifu2x upconv7 | 5.18 / 6872.3 | 14.51 / 4448.5 |
| waifu2x upconv7 (2 streams) | 6.14 / 13701 | 15.98 / 8861.2 |
| waifu2x cunet / cugan (2x2 tiles) | 1.07 / 3159.8 | 5.57 / 2196.7 |
| waifu2x cunet / cugan (2x2 tiles, 2 streams) | 1.07 / 3159.8 | 6.08 / 4357.8 |
| waifu2x swin_unet_art | N/A | N/A |
| realesrgan | 3.86 / 2699.7 | 9.59 / 1290.4 |
| realesrgan (2 streams) | 4.43 / 5355.8 | 10.58 / 2545.3 |
| rife | N/A | 2.68 / 1353.5 |
| rife (2 streams) | N/A | 4.44 / 2673.3 |

v13.1: latest ov & ort library, new models

29 Jan 15:00

The following models can be found in External Models:

  • Add support for waifu2x swin_unet models.
  • Add support for the ensemble configuration of RIFE. (RIFE v2 acceleration is experimental and may result in reduced quality on the TRT backend with fp16 enabled.)

Note

  • Backend OV_GPU may produce reduced precision output. This is under investigation.

Contributed Models

20 Apr 04:26
Pre-release

Please see PR #42 for policy.

v13: fp16 i/o, faster dynamic shapes for TRT backend

14 Jan 10:49
  • Added support for fp16 I/O format and faster dynamic shapes in the TRT backend.

    As the only portable way to convert fp32 clips to fp16 is via resize (std.Expr only supports fp16 when the cpu supports the f16c instruction set extension), and to conserve memory bandwidth, you could use the following snippet for RIFE to perform the necessary padding, YUV/RGB conversion and FP16 conversion in one go:

    import vapoursynth as vs
    import vsmlrt

    # pad to a multiple of 32; adjust 32 and 31 to match the specific AI network's input resolution requirements
    th = (src.height + 31) // 32 * 32
    tw = (src.width  + 31) // 32 * 32
    # padding, YUV -> RGB conversion and fp16 conversion in a single resize call
    padded = src.resize.Bicubic(tw, th, format=vs.RGBS if WANT_FP32 else vs.RGBH, matrix_in_s="709", src_width=tw, src_height=th)
    flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend, output_format=1)  # fp16 output
    # crop the padding back, accounting for the model's scale factor;
    # not necessary for RIFE (i.e. oh = src.height), but required for super-resolution upscalers
    oh = src.height * (flt.height // th)
    ow = src.width  * (flt.width  // tw)
    res = flt.resize.Bicubic(ow, oh, format=vs.YUV420P8, matrix_s="709", src_width=ow, src_height=oh)
    • Faster dynamic shapes, introduced in TensorRT 8.5.1, improve performance and device memory usage of dynamically shaped engines (#20 (comment) and the following benchmark).

      Dynamically shaped models can be created by specifying static_shape=False, min_shapes (minimum size), opt_shapes (optimization size) and max_shapes (maximum size) in the TRT backend; see the sketch after this list.

      The engine cache placement policy of vsmlrt.py is direct-mapped:

      • If dynamic shapes are used, the engine name is determined by min_shapes, opt_shapes and max_shapes (among others).
      • Otherwise, the engine name is determined by opt_shapes.
      • If not initialized, opt_shapes is usually set to the tilesize in each specific model's interface.
    • workspace can now be set to None for unlimited workspace size. (#21)

    • Add flag force_fp16. This flag forces fp16 computation during inference, and is disabled by default.

      • It reduces memory usage during engine build and allows more engines to be built, e.g. 1080p RIFE on GPUs with 4 GB memory (a dynamically shaped RIFE engine with min_shapes=(64, 64), opt_shapes=(1920, 1088), max_shapes=(3840, 2176) was successfully built on a 4 GB Ampere GPU).
      • It may reduce engine build time.
  • Introduce a new simplified backend interface BackendV2.

  • Disable tiling support for rife due to incompatible inference design.
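
As a sketch, a dynamically shaped engine covering 64x64 up to 4K might be configured as follows (the shape values mirror the RIFE example above and are illustrative; padded is an assumed input clip):

    import vsmlrt

    backend = vsmlrt.Backend.TRT(
        fp16=True,
        force_fp16=True,          # reduces memory usage during engine build
        static_shape=False,
        min_shapes=(64, 64),      # smallest supported input
        opt_shapes=(1920, 1088),  # the engine is tuned for this size (and it names the cache entry)
        max_shapes=(3840, 2176),  # largest supported input
    )
    flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend)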

dynamic shapes

In conclusion, dynamic shapes are much more flexible when dealing with different video resolutions (no engine re-compilation is required) and incur almost no performance degradation starting with TensorRT 8.5; the only remaining concern is increased device memory usage.

benchmark

  • Configuration: NVIDIA A10 (ECC disabled), driver 527.41 (TCC), Windows Server 2022, Python 3.11.1, vapoursynth-classic R57.A7, Backend.TRT(fp16=True, use_cuda_graph=True, tf32=False, use_cudnn=False), CUDA_MODULE_LOADING=LAZY

    • Statically shaped engines for each model are compiled separately, while the dynamically shaped engine is compiled once with (1) static_shape=False, min_shapes=(64, 64), opt_shapes=<max-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions> or (2) static_shape=False, min_shapes=(64, 64), opt_shapes=<min-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions>.

      opt_shapes may be lowered for faster engine generation.

measurements: fps / device memory (MB)

| model | 1 stream static | 1 stream dynamic (1) | 1 stream dynamic (2) | 2 streams static | 2 streams dynamic (1) | 2 streams dynamic (2) |
| --- | --- | --- | --- | --- | --- | --- |
| waifu2x upconv7 1920x1080 | 17.4 / 1992 | 17.5 / 1998 | 17.4 / 2040 | 21.2 / 3694 | 21.2 / 3756 | 20.9 / 3850 |
| waifu2x upconv7 1280x720 | 37.2 / 1046 | 38.5 / 1930 | 37.8 / 1976 | 46.3 / 1818 | 48.2 / 3628 | 46.5 / 3722 |
| waifu2x upconv7 720x480 | 102.2 / 544 | 104.4 / 1894 | 102.2 / 1940 | 123.1 / 834 | 128.1 / 3556 | 123.4 / 3650 |
| dpir color 1920x1080 | 7.3 / 2114 | 7.3 / 2114 | 7.2 / 2116 | 7.5 / 3656 | 7.4 / 3992 | 7.4 / 4002 |
| dpir color 1280x720 | 16.4 / 1122 | 16.3 / 2086 | 16.2 / 2092 | 16.7 / 1810 | 16.7 / 3936 | 16.7 / 3946 |
| dpir color 720x480 | 41.5 / 604 | 41.5 / 2068 | 41.6 / 2074 | 44.3 / 863 | 44.3 / 3900 | 44.2 / 3910 |
| real-esrgan v2 1920x1080 | 12.3 / 1320 | 12.3 / 1320 | 12.3 / 1320 | 13.0 / 2196 | 13.2 / 2402 | 12.9 / 2402 |
| real-esrgan v2 1280x720 | 26.9 / 736 | 26.9 / 1256 | 27.0 / 1256 | 29.1 / 1130 | 29.3 / 2274 | 29.2 / 2274 |
| real-esrgan v2 720x480 | 73.2 / 422 | 73.2 / 1220 | 72.7 / 1220 | 78.7 / 570 | 78.4 / 2202 | 78.1 / 2202 |
| cugan 1920x1080 | 9.4 / 4648 | 9.4 / 4726 | 9.2 / 4618 | 9.8 / 8754 | 10.2 / 9210 | 9.9 / 8996 |
| cugan 1280x720 | 20.5 / 2214 | 20.5 / 4662 | 20.0 / 4554 | 21.2 / 4050 | 22.9 / 9082 | 22.4 / 8868 |
| cugan 720x480 | 54.8 / 996 | 53.7 / 4626 | 52.9 / 4518 | 57.7 / 1690 | 59.9 / 9019 | 58.9 / 8796 |
| rife v4.4 1920x1088 | 92.8 / 590 | 92.2 / 594 | 89.5 / 606 | 178.6 / 920 | 177.6 / 942 | 169.5 / 974 |
| rife v4.4 1280x736 | 206.0 / 410 | 199.2 / 534 | 199.1 / 550 | 394.2 / 560 | 377.5 / 822 | 374.0 / 854 |
| rife v4.4 736x480 | 497.3 / 316 | 442.2 / 504 | 492.3 / 520 | 903.2 / 376 | 809.3 / 762 | 874.1 / 794 |

*The gap is large on rife because of underutilization, and will disappear when using more streams.

v12.3.test

09 Jan 15:10
Pre-release

v12.2

23 Nov 09:19
Pre-release

Update vsmlrt.py:

  • Introduce a new release artifact ext-models.v12.2.7z, which comes from External Models and is not bundled into the full binary release packages (i.e. the cpu, cuda and vk packages). Please refer to the External Models release notes for details on how to use those models.

  • Export a new API vsmlrt.inference for inference of custom models.

    import vsmlrt
    output = vsmlrt.inference(clips, "path/to/onnx", backend=vsmlrt.Backend.TRT(fp16=True))

    If you encounter issues like Cannot find input tensor with name "input" in the network inputs! Please make sure the input tensor names are correct., you could use vsmlrt.inference(..., input_name=None) or export the model with its input name set to "input".

  • Fix trt inference of cugan-pro (3x) models. (#15)

External Models

07 Dec 07:20
Pre-release

More models!

In addition to bundled models, vs-mlrt can also be used to run these models:

With more to come.

Also check onnx models provided by the avs-mlrt community.

Usage

If an external model is not supported by the Python wrapper, you can use the generic vsmlrt.inference API to run these models (requires release v12.2 or later).

import vsmlrt
output = vsmlrt.inference(rgbs, "path/to/onnx", backend=vsmlrt.Backend.TRT(fp16=True))

The RIFE model requires auxiliary inputs and should be used via the vsmlrt.RIFE or vsmlrt.RIFEMerge interfaces.

v12.1

16 Nov 10:43
Pre-release

This minor release fixes #9: now if vsort/vstrt fail to load the required CUDA DLLs, they won't crash the entire process.

However, if vs-mlrt is correctly installed, this shouldn't happen. Please report an issue if you can't access the core.trt or core.ort namespaces. A common mistake is forgetting to extract the vsmlrt-cuda.v12.1.7z package alongside the VSORT-Windows-x64.v12.1.7z or VSTRT-Windows-x64.v12.1.7z packages. If in doubt, CUDA users should use the fully bundled release vsmlrt-windows-x64-cuda.v12.1.7z.

Note: we explicitly do not support using both PyTorch and vs-mlrt plugins in the same vpy script, as PyTorch uses its own set of CUDA DLLs which might conflict with the ones vs-mlrt uses. As those DLLs are not explicitly versioned (e.g. nvinfer.dll instead of nvinfer-x.yz.dll), there is nothing we can do.

v12: latest CUDA libraries

01 Nov 10:57

Compared to v11, this release updated CUDA dependencies to CUDA 11.8.0, cuDNN 8.6.0 and TensorRT 8.5.1:

  • Added support for the NVIDIA 40 series GPUs.
  • Added support for RIFE on the trt backend.

Known issues

  • Performance of the OV_CPU and ORT_CUDA(fp16=True) backends for RIFE is lower than expected; this is under investigation. Please consider ORT_CPU or ORT_CUDA(fp16=False) for now (see the sketch after this list).
  • The NCNN_VK backend does not support RIFE.
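
A sketch of the suggested workaround (the model choice is illustrative; rgbs is an assumed RGB clip padded to RIFE's required dimensions):

    import vsmlrt

    # disable fp16 on ORT_CUDA to avoid the RIFE performance issue noted above
    flt = vsmlrt.RIFE(rgbs, model=vsmlrt.RIFEModel.v4_4, backend=vsmlrt.Backend.ORT_CUDA(fp16=False))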

Installation Notes

For some advanced features, vsmlrt.py requires the numpy and onnx packages to be available. You might need to run pip install onnx numpy.

Benchmark

previous benchmark

Configuration: NVIDIA RTX 3090, driver 526.47, Windows Server 2019, VapourSynth R60, Python 3.11.0, 1080p, fp16

Backends: ort-cuda, trt from vs-mlrt v12.

For the trt backend, the engine is created without the CUDA_MODULE_LOADING=LAZY environment variable, but the variable is set during benchmarking to reduce device memory consumption.

Data format: fps / GPU memory usage (MB)

rife(model=44, 1920x1088)

| backend | 1 stream | 2 streams |
| --- | --- | --- |
| ort-cuda | 53.62/1771 | 83.34/2748 |
| trt | 71.30/626 | 107.3/962 |

dpir color

| backend | 1 stream | 2 streams |
| --- | --- | --- |
| ort-cuda | 4.64/3230 | |
| trt | 10.32/1992 | 11.61/3475 |

waifu2x upconv_7

| backend | 1 stream | 2 streams |
| --- | --- | --- |
| ort-cuda | 11.07/5916 | 15.04/10899 |
| trt | 18.38/2092 | 31.64/3848 |

waifu2x cunet

| backend | 1 stream | 2 streams |
| --- | --- | --- |
| ort-cuda | 4.63/8541 | 5.32/16148 |
| trt | 11.44/4771 | 15.59/8972 |

realesrgan v2/v3

| backend | 1 stream | 2 streams |
| --- | --- | --- |
| ort-cuda | 8.84/2283 | 11.10/4202 |
| trt | 14.59/1324 | 21.37/2174 |