Releases: AmusementClub/vs-mlrt

v15.5: latest TensorRT library, CoreML backend

01 Oct 00:12

TRT

  • Upgraded to TensorRT 10.5.0.
  • Volta architecture GPUs (TITAN V, V100) are no longer supported.

ORT

  • Fixed macOS CoreML support for vsort by @yuygfgg in #106.

    This pull request also added the ORT_COREML backend to vsmlrt.py.

General

  • Upgraded to CUDA 12.6.1.

vsmlrt.py

  • Added support for RIFE v4.25 and v4.26 models.

  • Added automatic batch inference support via batch_size option in inference() and flexible_inference(), which may improve device utilization for inference on small inputs using some small models.

    • On the one hand, batching improves utilization by giving each kernel invocation more work and by reducing the quantization inefficiency of kernel tiles in bulk parallelism. It also lowers the average kernel launch and synchronization overhead per unit of work.
    • On the other hand, batching causes cache misses and inserts bubbles in the pipeline, which may degrade performance.
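The tile quantization point can be illustrated with a toy model. This is not vs-mlrt code: it assumes a hypothetical device that executes a fixed number of equally sized work tiles per wave (`tiles_per_wave`), so a partly filled last wave wastes slots. All numbers are made up for illustration.

```python
import math

def wave_utilization(num_tiles: int, tiles_per_wave: int) -> float:
    """Fraction of scheduled tile slots that do useful work.

    Toy model: a kernel with `num_tiles` tiles occupies
    ceil(num_tiles / tiles_per_wave) full-length waves.
    """
    waves = math.ceil(num_tiles / tiles_per_wave)
    return num_tiles / (waves * tiles_per_wave)

# Hypothetical numbers: one small frame yields 150 tiles, the device runs 128 per wave.
tiles_per_frame = 150
tiles_per_wave = 128

single = wave_utilization(tiles_per_frame, tiles_per_wave)       # 150 tiles in 2 waves
batched = wave_utilization(2 * tiles_per_frame, tiles_per_wave)  # 300 tiles in 3 waves

print(f"batch 1 utilization: {single:.3f}")
print(f"batch 2 utilization: {batched:.3f}")
```

In this model the batched launch wastes a smaller share of the last wave. The model ignores the cache and pipeline effects named in the second point, so it says nothing about when batching stops paying off.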

    This feature requires flexible output support starting with vs-mlrt v15 and is inspired by styler00dollar/VSGAN-tensorrt-docker@ac47012.

    Note that not all onnx models are supported.

    • Future RIFE v2 models will be fixed to support batch inference.

    benchmark:

    • NVIDIA GeForce RTX 4090
    • driver 560.94
    • Windows Server 2019
    • python 3.12.6, vapoursynth-classic R57.A10, vs-mlrt v15.4
    • input: 720x480 RGBS
    • backend: TRT(fp16=True, use_cuda_graph=True)

    Measurements: FPS / Device Memory (MB)

    | model | batch 1 | batch 2 |
    |---|---|---|
    | realesrgan compact (stream 1) | 73.01 / 708 | 138.68 / 950 |
    | realesrgan compact (streams 2) | 107.81 / 914 | 263.87 / 1347 |
    | realesrgan compact (streams 3) | 108.30 / 1128 | 348.23 / 1738 |
    | realesrgan ultracompact (stream 1) | 99.43 / 702 | 165.52 / 950 |
    | realesrgan ultracompact (streams 2) | 184.48 / 908 | 302.56 / 1344 |
    | realesrgan ultracompact (streams 3) | 184.69 / 1114 | 458.18 / 1738 |
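
To read the benchmark, it can help to compute what batch 2 buys relative to batch 1. The script below only does arithmetic on three rows copied from the table above; the helper name `batch_gain` is made up for this illustration.

```python
# (FPS, device memory in MB) for batch 1 and batch 2, copied from the table above.
rows = {
    "compact, 1 stream": ((73.01, 708), (138.68, 950)),
    "compact, 3 streams": ((108.30, 1128), (348.23, 1738)),
    "ultracompact, 3 streams": ((184.69, 1114), (458.18, 1738)),
}

def batch_gain(batch1, batch2):
    """Return (FPS speedup, extra device memory in MB) of batch 2 over batch 1."""
    (fps1, mem1), (fps2, mem2) = batch1, batch2
    return fps2 / fps1, mem2 - mem1

for name, (b1, b2) in rows.items():
    speedup, extra_mem = batch_gain(b1, b2)
    print(f"{name}: {speedup:.2f}x FPS for {extra_mem} MB more device memory")
```

These figures come from a single RTX 4090 setup with small 720x480 inputs, so the ratios should not be assumed to carry over to other models or resolutions.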

Full Changelog: v15.4...v15.5

v15.4: latest TensorRT library

07 Sep 01:05

TRT

  • Upgraded to TensorRT 10.4.0.

General

  • Upgraded to CUDA 12.6.0.

vsmlrt.py

  • Added support for Ani4K-v2 model by @srk24 in #105
  • Added support for RIFE v4.23 and v4.24 models.
  • Add max_tactics option to the TRT backend, which can reduce engine build time by limiting the number of tactics to time.
    • By default, TensorRT will determine the number of tactics based on its own heuristic.

Batch Inference (Preview)

The latest vsmlrt.py (not in v15.4) provides experimental support for batch inference via batch_size option in inference() and flexible_inference(), which may improve device utilization for inference on small inputs using some small models.

This feature requires flexible output support starting with vs-mlrt v15 and is inspired by styler00dollar/VSGAN-tensorrt-docker@ac47012.

Note that not all onnx models are supported.

Preliminary benchmark:

  • NVIDIA GeForce RTX 4090
  • driver 560.94
  • Windows Server 2019
  • python 3.12.6, vapoursynth-classic R57.A10
  • input: 720x480 RGBS
  • backend: TRT(fp16=True, use_cuda_graph=True)

Measurements: FPS / Device Memory (MB)

| model | batch 1 | batch 2 |
|---|---|---|
| realesrgan compact (stream 1) | 73.01 / 708 | 138.68 / 950 |
| realesrgan compact (streams 2) | 107.81 / 914 | 263.87 / 1347 |
| realesrgan compact (streams 3) | 108.30 / 1128 | 348.23 / 1738 |
| realesrgan ultracompact (stream 1) | 99.43 / 702 | 165.52 / 950 |
| realesrgan ultracompact (streams 2) | 184.48 / 908 | 302.56 / 1344 |
| realesrgan ultracompact (streams 3) | 184.69 / 1114 | 458.18 / 1738 |

Full Changelog: v15.3...v15.4

v15.3: MIGraphX on Windows

21 Aug 07:23

MIGX

  • Add experimental MIGraphX support on Windows. MIGraphX is AMD's graph optimization engine to accelerate machine learning model inference.

    Supported GPUs:

    • gfx1030: Radeon RX 6950 XT, Radeon RX 6900 XT, Radeon RX 6800 XT, Radeon RX 6800, ...
    • gfx1100: Radeon RX 7900 XTX, Radeon RX 7900 XT, ...
    • gfx1101: Radeon RX 7700 XT, ...
    • gfx1102: Radeon RX 7600

    Relevant archives include:

    The MIGraphX runtime in this release uses HIP 6.1.2 and MIGraphX 2.11 (9cf49f9).

    Note that the Windows support has not been officially announced by AMD.

Known limitation

  • The MIGX backend in the vsmlrt.py wrapper does not support device selection and will always use the default device (device_id=0).

General

vsmlrt.py

  • Added support for RIFE v4.22 (lite) models.

Full Changelog: v15.2...v15.3

v15.2: latest TensorRT library

07 Aug 06:42

TRT

  • Upgraded to TensorRT 10.3.0.

  • Fixed performance regression of RIFE and SAFA models starting with vs-mlrt v14.test4. This version may still be slightly slower than vs-mlrt v14.test3 under some conditions, however.

General

  • Upgraded to CUDA 12.5.1.

vsmlrt.py

  • Added support for RIFE v4.19 ~ v4.21 models.
  • Added support for ArtCNN R8F64 (chroma) models.
    • Deprecated ArtCNN C4F32 models based on developer's request, but compatibility at the vsmlrt.py level will be guaranteed.

Full Changelog: v15.1...v15.2

v15.1: latest TensorRT library

04 Jul 00:53

TRT

  • Upgraded to TensorRT 10.2.0.

  • Add TensorRT release package (vsmlrt-windows-x64-tensorrt). #102

    This package is a strict subset of the CUDA release package, with cuDNN, cuBLAS libraries and support for ORT_CUDA backend removed.

    It supports TRT, OV_*, ORT_CPU, ORT_DML and NCNN_VK backends.

known issue

  • According to the documentation,

    There is an up to 4x performance regression for networks containing "GridSample" ops compared to TensorRT 9.2.

    This affects RIFE and SAFA models.

    vs-mlrt v14.test3 is the latest one that is not affected. This will be fixed in the next release by TensorRT 10.3.0.

General

  • Upgraded to CUDA 12.5.0.

vsmlrt.py

Full Changelog: v15...v15.1

v15: latest TensorRT library

15 Jun 02:11

General

plugins

  • Added parameter flexible_output_prop for flexible output:

    Traditionally, all plugins can only support onnx models with one or three output channels, due to vapoursynth's limitation.

    By using the new flexible output feature, plugins can support onnx models with arbitrary number of output planes.

    from typing import TypedDict
    
    class Output(TypedDict):
        clip: vs.VideoNode
        num_planes: int
    
    prop = "planes" # arbitrary non-empty string
    output = core.ov.Model(src, network_path, flexible_output_prop=prop) # type: Output
    
    clip = output["clip"]
    num_planes = output["num_planes"]
    
    output_planes = [
        clip.std.PropToClip(prop=f"{prop}{i}")
        for i in range(num_planes)
    ] # type: list[vs.VideoNode]

    This feature is supported by all plugins starting with vs-mlrt v15.

vsmlrt.py

  • Added support for RIFE v4.17 models.

  • Added support for ArtCNN models optimised for anime content. The chroma variants are not supported on previous versions of vs-mlrt, because they require the flexible output feature.

  • Added function flexible_inference for flexible output:

    The above sample is simplified as

    output_planes = flexible_inference(src, network_path) # type: list[vs.VideoNode]

TRT

  • Upgraded to TensorRT 10.1.0.

known issue

  • According to the documentation,

    There is an up to 4x performance regression for networks containing "GridSample" ops compared to TensorRT 9.2.

    This affects RIFE and SAFA models.

    vs-mlrt v14.test3 is the latest one that is not affected.

Community contributions

Full Changelog: v14...v15

v14: latest libraries

25 Apr 01:01

Compared to the previous stable (v13.2) release:

General

vsmlrt.py

  • Plugin invocation order in the get_plugin_path() function is sorted to reduce memory consumption.
  • Added support for RIFE v4.7 ~ v4.16 (lite, ensemble) models.
  • Added support for SCUNet models for image denoising.

TRT

plugin and runtime libraries

  • Upgraded to TensorRT 10.0.1.
  • Maxwell and Pascal GPUs are no longer supported. Other backends still support these GPUs.
  • Reduce GPU memory usage for dynamically shaped engines when the actual tile size is smaller than the maximum tile size set during engine building.
  • Reduced engine build time.
  • Added long path support for engines on Windows.
  • cuDNN is no longer a strict runtime dependency.

vsmlrt.py

  • The cuDNN tactic is no longer enabled by default.
  • TF32 acceleration is disabled by default.
  • The maximum workspace is set to None by default, which corresponds to the total memory size of the GPU.
  • Add parameters builder_optimization_level, max_aux_streams, bf16 (#64), custom_env, custom_args, short_path and engine_folder (#90):
    • builder_optimization_level: "adjust how long TensorRT should spend searching for tactics with potentially better performance" (TensorRT documentation)
    • max_aux_streams: Within-inference multi-streaming, "if enabled, TensorRT will run some layers on the auxiliary streams in parallel to the layers running on the main stream, ..., may increase the memory consumption, ..." (TensorRT documentation)
    • bf16: "TensorRT supports the bfloat16 (brain float) floating point format on NVIDIA Ampere and later architectures ... Note that not all layers support bfloat16." (TensorRT documentation)
    • custom_env, custom_args: custom environment variable and arguments for trtexec engine build.
    • short_path: whether to shorten engine name.
      • On Windows, this could be useful in addressing the maximum path length limitation, and is enabled by default.
    • engine_folder: used to specify custom directory for engines.
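
As a rough sketch, the options above could be collected like this. The parameter names come from the list; the values, the folder name, and the assumption that `custom_env` takes a dict and `custom_args` a list of trtexec arguments are illustrative, not recommended settings. The block falls back gracefully when vs-mlrt is not installed.

```python
# Keyword arguments for vsmlrt.Backend.TRT. Names are from the release notes;
# all values below are illustrative assumptions.
trt_options = {
    "fp16": True,
    "builder_optimization_level": 3,   # higher values spend longer searching for tactics
    "max_aux_streams": 0,              # 0 disables within-inference multi-streaming
    "bf16": False,                     # needs Ampere or later; not all layers support it
    "custom_env": {"CUDA_MODULE_LOADING": "LAZY"},  # assumed to be a str -> str mapping
    "custom_args": ["--verbose"],      # assumed to be extra trtexec arguments
    "short_path": True,                # shorten engine names (helps with Windows path limits)
    "engine_folder": "engines",        # hypothetical directory for built engines
}

try:
    from vsmlrt import Backend  # requires VapourSynth and vs-mlrt v14 or later
    backend = Backend.TRT(**trt_options)
except (ImportError, TypeError):
    # vs-mlrt is missing or too old for these parameters.
    backend = None

print(sorted(trt_options))
```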

known issues

  • According to the documentation, there is an up to 4x performance regression for networks containing "GridSample" ops compared to TensorRT 9.2. This affects RIFE and SAFA models.

  • trtexec may report errors like:

    • [E] Error[9]: Skipping tactic 0xded5318b4a444b84 due to exception Cask convolution execution
    • [E] Error[2]: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)

    This issue has been submitted to NVIDIA.

ORT

  • Upgraded to ONNX Runtime v1.18.0.

interface

  • The ORT_* backends now support fp16 I/O. The semantics of the fp16 flag in these backends is as follows:
    • Enabling fp16 will use a built-in quantization that converts a fp32 onnx to a fp16 onnx. If the input video is of half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the output_format option (0 = fp32, 1 = fp16).
    • Disabling fp16 will not use the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
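
The two cases above can be summarized as a small decision function. This is a plain-Python model of the described semantics, not code from vs-mlrt; the function name and the `onnx_io` parameter are hypothetical, and the input format for fp32 video with fp16 enabled is an assumption since the notes only describe the half-precision case.

```python
def ort_fp16_plan(fp16, video_is_fp16, output_format=0, onnx_io=("fp32", "fp32")):
    """Model the fp16 flag of the ORT_* backends.

    onnx_io is the (input, output) format declared by the onnx file itself;
    it only matters when fp16 is disabled.
    """
    if fp16:
        return {
            "converted": True,  # built-in conversion from fp32 onnx to fp16 onnx
            # Assumption: a video that is not fp16 keeps an fp32 input.
            "input": "fp16" if video_is_fp16 else "fp32",
            "output": "fp16" if output_format == 1 else "fp32",
        }

    # No built-in conversion: everything follows the onnx file.
    onnx_input, onnx_output = onnx_io
    video_format = "fp16" if video_is_fp16 else "fp32"
    if video_format != onnx_input:
        raise ValueError(
            f"video is {video_format} but the onnx expects {onnx_input} input"
        )
    return {"converted": False, "input": onnx_input, "output": onnx_output}


print(ort_fp16_plan(fp16=True, video_is_fp16=True, output_format=1))
print(ort_fp16_plan(fp16=False, video_is_fp16=True, onnx_io=("fp16", "fp16")))
```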

CUDA

  • Reduced execution overhead.
  • Added support for TF32 acceleration. This is disabled by default.
  • Added experimental prefer_nhwc flag to reduce the number of layout transformations when using tensor cores. This is disabled by default.

OV

  • Upgraded to OpenVINO 2024.2.0.
  • Added experimental OV_NPU backend for Intel NPUs.

MIGX

  • Added support for MIGraphX backend for AMD GPUs. Currently this backend is Linux only.

Community contributions

  • scripts/vsmlrt.py: update esrgan janai models by @hooke007 in #53
  • scripts/vsmlrt.py: add more esrgan janai models by @hooke007 in #82
  • vsmigx: allow fp16 input & output by @abihf in #86
  • scripts/vsmlrt.py: fix fp16 precision issues of RIFE v2 representations by @charlessuh in #66 (comment)

Benchmark

NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8

1920x1080 RGBS, TRT backend, CUDA graphs enabled, fp16

Measurements: FPS / Device Memory (MB)

| model | 1 stream | 2 streams | 3 streams |
|---|---|---|---|
| dpir color | 10.99 / 1715.172 | 11.62 / 3048.540 | 11.64 / 4381.912 |
| waifu2x upconv_7_{anime_style_art_rgb, photo} | 22.38 / 2016.352 | 32.66 / 3734.880 | 32.54 / 5453.404 |
| waifu2x cunet / cugan | 12.41 / 4359.284 | 15.53 / 8363.392 | 15.47 / 12367.504 |
| waifu2x swin_unet | 3.80 / 7304.332 | 4.06 / 14392.408 | 4.06 / 21276.380 |
| real-esrgan (v2/v3, xsx2) | 16.65 / 955.480 | 22.53 / 1645.904 | 22.49 / 2336.324 |
| scunet color | 4.20 / 2847.708 | 4.33 / 6646.884 | 4.33 / 9792.736 |

Also check benchmarks from previous pre-releases v14.test4 (NVIDIA RTX 2080 Ti/3090/4090 GPUs) and v14.test3 (NVIDIA RTX 4090 and AMD RX 7900 XTX GPUs).


This release uses CUDA 12.4.1, cuDNN 8.9.7, TensorRT 10.0.1, ONNX Runtime v1.18.0, OpenVINO 2024.2.0 and ncnn 20220915 b16f8ca.

Full Changelog: v13.2...v14

v14.test4: latest TensorRT and ONNX Runtime libraries

27 Mar 03:27

This is a preview release for TensorRT 10.0.0, following the v14.test, v14.test2 and v14.test3 releases.

  • The TRT backend no longer supports Maxwell and Pascal GPUs. Other backends still support these GPUs. Same as those releases, the current release requires driver version >= 525.

  • Added support for SwinIR models for image restoration, which are only supported by the TRT backend and the ORT_CPU backend from vs-mlrt v14.test4 or later. SwinIR-M and SwinIR-L models exhibit precision issues with the fp16 implementation; this is under investigation.

  • Added support for SCUNet models for image denoising, which are only supported by the TRT backend and the ORT_CPU backend from vs-mlrt v14.test4 or later.

  • Added engine_folder argument to the TRT backend in vsmlrt.py to specify custom directory for engines.

  • Starting with this pre-release, for dynamically shaped engines, the trt runtime allocates gpu memory based on the actual tile size, whereas in previous releases, the runtime would have to allocate gpu memory based on the maximum tile size set at engine compile time. This feature requires TensorRT 10 or later.

  • The ORT_* backends now support fp16 I/O. The semantics of the fp16 flag is as follows:

    • Enabling fp16 will use a built-in quantization that converts a fp32 onnx to a fp16 onnx. If the input video is of half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the output_format option (0 = fp32, 1 = fp16).
    • Disabling fp16 will not use the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
  • Reduce the overhead of the ORT_CUDA backend.

  • Added support for TF32 acceleration to the ORT_CUDA backend. Disabled by default.

  • Add experimental prefer_nhwc flag to the ORT_CUDA backend to reduce the number of layout transformations when using tensor cores.

  • For production use of the TRT backend, continue to use vsmlrt v13.2. For RIFE and SAFA acceleration on the TRT backend, continue to use any old release.

  • Also check the release notes of the previous pre-releases.


benchmark 1

previous benchmark

  • RTX 4090
    • processor clock @ 2520 MHz
  • Intel Icelake server @ 2100 MHz
  • Driver 551.86
  • Windows 10 21H2 (19044.1415)
  • TensorRT 10.0.0
  • VapourSynth-Classic R57.A8, vapoursynth-plugin v0.96g3

1920x1080 rgbs, CUDA graphs enabled, fp16

Measurements: FPS / Device Memory (MB)

general

| model | 1 stream | 2 streams | 3 streams |
|---|---|---|---|
| dpir gray | 22.05 / 1818.796 | 25.30 / 3111.114 | 25.33 / 4403.488 |
| dpir color | 18.30 / 1851.632 | 25.13 / 3176.808 | 25.17 / 4501.984 |
| waifu2x upconv_7_{anime_style_art_rgb, photo} | 20.45 / 2148.716 | 41.22 / 3867.240 | 61.21 / 5585.764 |
| waifu2x upresnet10 | 17.91 / 1716.588 | 34.53 / 2941.540 | 42.33 / 4166.492 |
| waifu2x cunet / cugan | 13.89 / 4391.292 | 25.74 / 8346.248 | 25.96 / 12301.202 |
| waifu2x swin_unet | 4.62 / 7436.692 | 5.43 / 14426.812 | 5.43 / 21412.840 |
| real-esrgan (v2/v3, xsx2) | 17.06 / 1087.844 | 33.41 / 1778.264 | 38.26 / 2468.684 |
| scunet gray | 5.29 / 3590.320 | 5.40 / 6678.768 | 5.40 / 9767.208 |
| scunet color | 5.13 / 3555.568 | 5.48 / 6611.308 | 5.47 / 9667.048 |
| swinir-s (2x, color) | 1.63 / 15897.048 | N/A | N/A |
| swinir-m* (2x, color, 720p) | 1.05 / 11305.268 | N/A | N/A |
| swinir-l* (4x, color, 720p) | 0.61 / 15391.316 | N/A | N/A |

*: swinir-m and swinir-l exhibit precision issues.

rife

v2, fp16 i/o

| version | 1 stream | 2 streams | 3 streams | 4 streams | 5 streams |
|---|---|---|---|---|---|
| v4.4-v4.5 | 136.92/778.432 | 273.80/1149.204 | 414.80/1522.028 | 553.70/1892.796 | 574.31/2263.568 |
| v4.6 | 136.01/800.960 | 275.26/1192.212 | 411.01/1585.516 | 544.30/1979.764 | 550.01/2368.020 |
| v4.7-v4.9 | 98.20/1302.724 | 195.78/2187.548 | 210.12/3074.420 | 210.45/3957.196 | 210.66/4844.068 |
| v4.10-v4.15 | 84.41/1595.592 | 160.93/2773.280 | 161.96/3953.020 | 162.04/5132.760 | 162.07/6310.448 |
| {v4.12, v4.13, v4.15, v4.16}_lite | 93.39/1333.444 | 187.32/2255.132 | 197.71/3178.872 | 198.01/4098.508 | 197.95/5022.248 |
| v4.14 lite | 81.83/1595.292 | 153.40/2779.424 | 154.19/3963.260 | 154.28/5149.140 | 154.30/6332.980 |
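
Since every extra stream costs device memory, one way to read the RIFE table is to find the stream count after which throughput stops improving. The sketch below applies a simple rule to two rows copied from the table; the function name and the 5% threshold are arbitrary choices for illustration.

```python
def saturation_point(fps_by_streams, min_gain=0.05):
    """Return the smallest stream count after which one more stream
    improves FPS by less than `min_gain` (relative gain).

    fps_by_streams[0] is the FPS with 1 stream, [1] with 2 streams, etc.
    """
    for i in range(1, len(fps_by_streams)):
        gain = fps_by_streams[i] / fps_by_streams[i - 1] - 1.0
        if gain < min_gain:
            return i  # index i - 1 holds the FPS for i streams
    return len(fps_by_streams)

# FPS for 1..5 streams, copied from the table above.
rife_v4_4 = [136.92, 273.80, 414.80, 553.70, 574.31]
rife_v4_10 = [84.41, 160.93, 161.96, 162.04, 162.07]

print("v4.4-v4.5 saturates at", saturation_point(rife_v4_4), "streams")
print("v4.10-v4.15 saturates at", saturation_point(rife_v4_10), "streams")
```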

benchmark 2

previous benchmark

NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8

Measurements: (1080p, fp16) FPS / Device Memory (MB)

| model | ORT_CUDA NCHW | ORT_CUDA NHWC | ORT_DML |
|---|---|---|---|
| dpir color | 4.54 / 2573.3 | 5.98 / 2470.9 | 8.45 / 2364.5 |
| dpir color (2 streams) | 4.66 / 4854.9 | 6.30 / 4680.8 | 9.48 / 4630.9 |
| waifu2x upconv7 | 10.98 / 5432.5 | 3.18 / 3017.8 | 12.48 / 4493.0 |
| waifu2x upconv7 (2 streams) | 14.96 / 10397.1 | 3.25 / 5780.9 | 21.72 / 8891.7 |
| waifu2x cunet / cugan | 4.70 / 7955.6 | 4.49 / 6290.6 | OOM |
| waifu2x cunet / cugan (2 streams) | 5.11 / 15721.9 | 4.78 / 12312.0 | OOM |
| waifu2x swin_unet_art | 2.98 / 23518.5 | 3.05 / 22812.0 | N/A |
| realesrgan | 8.99 / 1647.7 | 11.20 / 1127.5 | 11.99 / 1346.6 |
| realesrgan (2 streams) | 10.69 / 3034.5 | 13.58 / 1994.1 | 17.34 / 2601.6 |
| rife v4.4 (1920x1088) | 61.42 / 1100.9 | 56.02 / 1162.3 | 44.73 / 882.4 |
| rife v4.4 (1920x1088, 2 streams) | 106.48 / 1953.4 | 92.88 / 2071.9 | 68.80 / 1670.7 |
| scunet color | N/A | N/A | N/A |

benchmark 3

NVIDIA GeForce RTX 2080 Ti, 4352 shaders @ 1700 MHz, driver 552.22, Windows 10 LTSC 21H2 (19044.1415), Python 3.11.9, vapoursynth-classic R57.A8

Measurements: (1080p, fp16) FPS / Device Memory (MB)

| model | TRT | ORT_CUDA | ORT_DML | ORT_CUDA NHWC |
|---|---|---|---|---|
| dpir color (1 stream) | 7.08 / 1899 | 3.10 / 2602 | 4.99 / 2341 | 4.26 / 2411 |
| dpir color (2 streams) | 8.06 / 3376 | 3.30 / 5016 | 5.85 / 4619 | 4.74 / 4650 |
| waifu2x upconv7 (1 stream) | 11.47 / 2014 | 7.01 / 4949 | 7.45 / 4501 | 1.59 / 2923 |
| waifu2x upconv7 (2 streams) | 21.44 / 3782 | 10.11 / 9732 | 13.23 / 8940 | 1.77 / 5674 |
| waifu2x cunet / cugan (1 stream) | 7.41 / 4664 | 3.10 / 10067 | OOM | 0.77 / 6188 |
| waifu2x cunet / cugan (2 streams) | 10.92 / 8863 | OOM | OOM | OOM |
| waifu2x swin_unet_art (1 stream) | 2.35 / 7234 | OOM | N/A | OOM |
| waifu2x swin_unet_art (2 streams) | OOM | OOM | N/A | OOM |
| realesrgan (1 stream) | 8.66 / 1268 | 5.33 / 1545 | 6.39 / 1316 | 6.96 / 1033 |
| realesrgan (2 streams) | 13.20 / 2166 | 7.78 / 2932 | 10.22 / 2571 | 10.25 / 1895 |
| rife v4.4 (1920x1088, fp16 i/o, 1 stream) | 64.97 / 609 | 46.60 / 967 | 32.18 / 723 | 48.5... |

v14.test3: latest TensorRT, MIGraphX backend

03 Dec 01:13

This is a preview release for TensorRT 9.2.0, following the v14.test and v14.test2 releases.

  • Same as those releases, it requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.

  • TensorRT 9.2.0 is officially documented as for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only. The Windows build is downloaded from here, and can be used on other GPU models.

  • Users should use the same version of TensorRT as provided (9.2.0) because runtime version checking is disabled in this release.

  • Added support for AnimeJaNai V3 models, contributed by @hooke007 in #82.

  • Added support for RIFE v4.13 ~ v4.16 (lite, ensemble) models, which are also available for previous vs-mlrt releases (simply download the new model file here and update vsmlrt.py).

    • The v4.13 ~ v4.15 models should have the same execution speed as the v4.10 - v4.12 models.
    • The v4.13 lite model, the v4.15 lite model and the v4.16 lite model should all have the same execution speed as the v4.12 lite model, while the v4.14 lite model may run slower.
  • Added support for fractional video frame interpolation in RIFE.

    • Playback in video players should also set video_player=True (#59 (comment)). This change is experimental.
  • Fixed an issue that caused the TRT backend to crash during script reloading (#65). It is also fixed in the latest iteration of the v14.test2 release.

  • RIFE v4.7+ models with v2 representation are not working with dynamic shapes (#72). This has been reported to TensorRT developers.

  • Initial MIGraphX support (experimental) for AMD GPUs.

    • fp16 I/O contributed by @abihf in #86.
    • Multi-stream execution, device selection, hip graphs and dynamic shapes are not explicitly supported for now.

    preliminary benchmark on Radeon RX 7900 XTX [1]:

    • resolution: 1920x1080
    • measurements: fps / device memory (MB)

    | model | fp32 | fp16 |
    |---|---|---|
    | dpir gray | 2.33 / 2829 | 7.29 / 1702 |
    | dpir color | 2.27 / 2861 | 7.03 / 1734 |
    | waifu2x upconv7 | 6.31 / 4540 | 12.90 / 2503 |
    | waifu2x upresnet10 | 6.65 / 3077 | 13.63 / 1775 |
    | waifu2x cunet / cugan | 3.69 / 6711 | 8.36 / 3591 |
    | waifu2x swin_unet [2] | 2.19 / 9791 | 4.53 / 5236 |
    | realesrgan [3] | 5.75 / 1961 | 11.57 / 959 |
    | rife [4] | N/A | N/A |

  • Also check the release notes of the v14.test and v14.test2 releases.


benchmark

  • RTX 4090
    • processor clock @ 2520 MHz
  • Intel Icelake server @ 2100 MHz
  • Driver 551.86
  • Windows 10 21H2 (19044.1415)
  • TensorRT 9.2.0
  • VapourSynth-Classic R57.A8, vapoursynth-plugin v0.96g3

1920x1080 rgbs, CUDA graphs enabled, fp16

Measurements: FPS / Device Memory (MB)

general

| model | 1 stream | 2 streams | 3 streams |
|---|---|---|---|
| dpir gray | 21.93/1757.352 | 25.48/3049.696 | 25.31/4342.044 |
| dpir color | 18.24/1790.184 | 25.11/3115.360 | 25.22/4440.540 |
| waifu2x upconv_7_{anime_style_art_rgb, photo} | 19.58/2148.716 | 39.87/3867.240 | 59.94/5585.768 |
| waifu2x upresnet10 | 17.40/1655.144 | 34.22/2880.096 | 42.78/4105.048 |
| waifu2x cunet / cugan | 13.64/4391.292 | 25.09/8346.248 | 25.19/12301.208 |
| waifu2x swin_unet | 4.62/14989.772 | OOM | OOM |
| real-esrgan (v2/v3, xsx2) | 16.77/1136.996 | 33.99/1876.568 | 41.44/2616.140 |

rife

v2, fp16 i/o

| version | 1 stream | 2 streams | 3 streams | 4 streams | 5 streams |
|---|---|---|---|---|---|
| v4.4-v4.5 | 150.20/622.784 | 301.05/835.860 | 448.90/1053.024 | 615.84/1268.152 | 787.57/1481.224 |
| v4.6 | 147.63/624.832 | 294.53/837.904 | 452.26/1055.072 | 603.63/1270.200 | 764.31/1485.320 |
| v4.7-v4.9 | 132.06/747.712 | 268.63/1075.476 | 403.54/1405.284 | 494.98/1737.152 | 496.41/2064.908 |
| v4.10-v4.15 | 119.09/862.400 | 238.68/1304.852 | 346.98/1749.352 | 349.48/2195.904 | 349.80/2638.356 |
| {v4.12, v4.13, v4.15, v4.16}_lite | 123.72/782.528 | 250.81/1151.252 | 377.27/1522.020 | 403.14/1894.844 | 403.79/2263.568 |
| v4.14 lite | 117.97/839.872 | 234.67/1265.940 | 320.23/1696.104 | 321.88/2124.224 | 321.18/2552.340 |

  • This pre-release uses trt 9.2.0 + cuda 12.3.1 + cudnn 8.9.6, which requires a minimum driver version of 525 and is compatible with 10 series and newer GPUs, with no significant performance improvement measured.
  • vsmlrt.py in all branches can be used interchangeably.
  1. RDNA3, Navi 31, 12288 shaders, processor clock @ 2399 MHz, memory clock @ 1249 MHz, driver 6.0.32831, PCIe 4.0 x16, MIGraphX 2.8.0, ROCm 6.0.2, Linux 6.7.0-060700-generic, VapourSynth-Classic R57.A8

  2. tested on MIGraphX 2.10.0 ROCm/AMDMIGraphX@ecd5adc, requires MIGraphX 2.9.0 ROCm/AMDMIGraphX@2d4a6507c3ad41f9d7ea36de1d7fb257cc788585 and replacing edge padding by reflection padding

  3. tested on MIGraphX 2.10.0 ROCm/AMDMIGraphX@ecd5adc, requires MIGraphX 2.9.0 ROCm/AMDMIGraphX@2d4a6507c3ad41f9d7ea36de1d7fb257cc788585

  4. missing support for GridSample operation

v14.test2: latest TensorRT library

23 Oct 00:58
Pre-release

This is a preview release for TensorRT 9.1.0, following the v14.test release.

  • As with the v14.test release, it requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.

  • TensorRT 9.1.0 is officially documented as for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only on Linux. The Windows build is downloaded from here, and can be used on other GPU models.

    • On Windows, some users have reported crashes when using it in mpv (#65). This problem occurred in an earlier build of this release and is now fixed.
  • Add parameters bf16 (#64), custom_env and custom_args to the TRT backend.

    • fp16 execution of Waifu2xModel.swin_unet_art is more accurate, faster and uses less GPU memory than bf16 execution (benchmark)
  • Device memory usage of model Waifu2xModel.swin_unet_art is reduced compared to TensorRT 9.0.1 on A10G with 1080p input (at 2.66 fps with 7.0GB VRAM usage) with default auxiliary stream heuristic.

    • TensorRT 9.0.1 using 7 auxiliary streams compared to TensorRT 9.1.0 (3 streams) results in significantly more device memory with no performance gain.
    • Setting max_aux_streams=3 lowers device memory usage of TensorRT 9.0.1 to ~8.9GB, and max_aux_streams=0 corresponds to ~7.3GB usage.
    • TensorRT 9.1.0 with max_aux_streams=0 uses ~6.7GB device memory.
  • Users should use the same version of TensorRT as provided (9.1.0) because runtime version checking is disabled in this release.

  • Added support for RIFE v4.8 - v4.12, v4.12 ~ v4.13 lite (ensemble) models, which are also available for previous vs-mlrt releases (simply download the new model file here and update vsmlrt.py). v4.8 and v4.9 models should have the same execution speed as v4.7, while v4.10 - v4.12 models are equally heavy and heavier than previous models. Ensemble models are heavier than their non-ensemble counterparts.

    • Starting from RIFE v4.11, all rife models are temporarily moved here with individual packaging.
  • RIFE models with v2 representation for the TRT backend now have improved accuracy, contributed by @charlessuh (#66 (comment)). This has been backported to master.

    • This improvement may be very slightly inefficient under onnx file renaming. It is advised to keep onnx file name unchanged and change the function call to vsmlrt.RIFE().
    • By default vsmlrt.RIFE() in vsmlrt.py uses v1 representation. The v2 representation is enabled with vsmlrt.RIFE(_implementation=2) function call.
      Sample error message: input: for dimension number 1 in profile 0 does not match network definition (got min=11, opt=11, max=11), expected min=opt=max=7)
    • v2 representation is still considered experimental.
  • Added support for SAFA v0.1 video enhancement model.

    • This model takes arbitrary sized video and uses both spatial and temporal information to improve visual quality.
    • Note that this model is non-deterministic by nature, and existing backends do not support manual seeding.
    • ~17 fps on RTX 4090 with TRT(fp16), 1080p input and non-adaptive mode. Adaptive mode is about 2x slower than non-adaptive mode, uses more memory and does not support cuda graphs execution.
    • This representation is not supported by the NCNN_VK backend for the same issue as RIFE v2 representation.
  • Also check the release note of v14.test release.


  • This pre-release uses trt 9.1.0 + cuda 12.2.2 + cudnn 8.9.5, which can only run on driver >= 525 and 10 series and later gpus, with improved support for self-attentions found in transformer models.
  • vsmlrt.py in all branches can be used interchangeably.

  • TensorRT 9.0.1 is for Large Language Models (LLMs) on A100, A10G, L4, L40, and H100 GPUs only on x86 Linux. Model Waifu2xModel.swin_unet_art is 1.2x faster compared to TensorRT 8.6.1 on A10G with 720p input (at 6.3 fps with 4GB VRAM usage), thanks to multi-head attention fusion (requires fp16).

This pre-release is now feature complete. Development now switches to the v14.test3 pre-release.