
Releases: AmusementClub/vs-mlrt

v14.test: latest TensorRT library

13 Mar 12:20
Pre-release

This is a preview release for TensorRT 8.6.1.

  • It requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs has been dropped.

  • Add parameters builder_optimization_level and max_aux_streams to the TRT backend (see the sketch after this list).

    • builder_optimization_level: "adjust how long TensorRT should spend searching for tactics with potentially better performance" link
    • max_aux_streams: within-inference multi-streaming; "if enabled, TensorRT will run some layers on the auxiliary streams in parallel to the layers running on the main stream, ..., may increase the memory consumption, ..." link
      • It is advised to set max_aux_streams to 0 on heavy models like Waifu2xModel.swin_unet_art to reduce memory usage. Check the benchmark data at the bottom.
  • Following TensorRT 8.6.1, the cuDNN tactic source of the TRT backend is disabled by default; tf32 is also disabled by default in vsmlrt.py.

  • Add parameter short_path to the TRT backend, which shortens the engine path; it is enabled by default on Windows.

  • Model Waifu2xModel.swin_unet_art does not seem to work with builder_optimization_level=5 in the TRT backend before TRT 9.0. Use builder_optimization_level=4 or lower instead.
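
A minimal sketch of how these parameters might be passed (the model choice and parameter values are illustrative; rgbs is an assumed RGBS/RGBH clip):

    import vsmlrt

    # builder_optimization_level=4 avoids the swin_unet_art issue noted above;
    # max_aux_streams=0 reduces device memory usage on heavy models
    backend = vsmlrt.Backend.TRT(
        fp16=True,
        builder_optimization_level=4,
        max_aux_streams=0,
    )
    flt = vsmlrt.Waifu2x(rgbs, noise=-1, scale=2, model=vsmlrt.Waifu2xModel.swin_unet_art, backend=backend)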

Compared to v13.1/v13.2, built-in models show less than a 5% performance improvement, but device memory usage is reduced by 24% on DPIR and by 35% on RealESRGAN.


Version information:

  • The v13.2 release uses TRT 8.5.1 + CUDA 11.8.0, which can run on driver >= 450 and 900 series and later GPUs.
  • The v14.test pre-release uses TRT 8.6.1 + CUDA 12.1.1, which can only run on driver >= 525 and 10 series and later GPUs, with no significant performance improvement measured.
  • vsmlrt.py in both branches can be used interchangeably.

  • Added support for the RIFE v4.7 model ("optimized for anime scenes"), which is also available for previous vs-mlrt releases (simply download the new model file here and update vsmlrt.py). It is more computationally intensive than v4.6. A usage sketch follows this list.

  • This pre-release is now feature complete. Development has switched to the trt-latest branch and the v14.test2 pre-release.
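
A sketch of selecting the new model, assuming the enum name follows v4_6's convention (padded is an assumed RGBH/RGBS clip padded to the network's required modulo, as in the v13 snippet further below):

    import vsmlrt

    # RIFE v4.7 requires the new model file and an updated vsmlrt.py
    flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_7, backend=vsmlrt.Backend.TRT(fp16=True))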

v13.2: latest ort library, DirectML backend

28 May 15:01
  • Added support for a DirectML backend through ONNX Runtime. It is available for all DirectX 12 devices and can be accessed through backend=Backend.ORT_DML() (see the sketch after this list). waifu2x swin_unet models may not be supported on this backend, and RIFE models may be poorly supported.
  • Asset vsmlrt-windows-x64-vk.*.7z is renamed to vsmlrt-windows-x64-generic-gpu.*.7z and includes the OV_CPU, OV_GPU, ORT_CPU, ORT_DML and NCNN_VK backends. The cuda asset continues to include all backends in this release.
  • Update onnxruntime to microsoft/onnxruntime@73584f9.
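
A minimal usage sketch (the model choice and strength value are illustrative; rgbs is an assumed RGBS clip):

    import vsmlrt

    # the DirectML backend runs on any DirectX 12 capable device
    flt = vsmlrt.DPIR(rgbs, strength=5, model=vsmlrt.DPIRModel.drunet_color, backend=vsmlrt.Backend.ORT_DML())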

Note

  • Backend OV_GPU may produce reduced precision output. This is under investigation.

benchmark 1

NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 532.03, Windows 10 21H2 LTSC (19044.1415), Python 3.11.3, vapoursynth-classic R57.A8

Measurements: (1080p, fp16) FPS / Device Memory (MB)

| model | ORT_CUDA | ORT_DML |
| --- | --- | --- |
| dpir | 4.25 / 2573.3 | 7.01 / 2371.0 |
| dpir (2 streams) | 4.58 / 5506.2 | 8.85 / 4643.1 |
| waifu2x upconv7 | 9.10 / 5248.1 | 9.65 / 4503.1 |
| waifu2x upconv7 (2 streams) | 11.15 / 2966.9 | 18.52 / 8911.2 |
| waifu2x cunet / cugan | 4.06 / 7875.7 | 6.36 / 8973.7 |
| waifu2x cunet / cugan (2 streams) | N/A | 9.51 / 17849.1 |
| waifu2x swin_unet_art | N/A | N/A |
| realesrgan | 7.52 / 1901.7 | 8.54 / 1352.4 |
| realesrgan (2 streams) | 11.15 / 2966.9 | 15.58 / 2608.7 |
| rife | 34.30 / 1109.1 | 2.12 / 1417.8 |
| rife (2 streams) | 61.45 / 2051.4 | 4.27 / 2740.9 |

benchmark 2

AMD Radeon Pro V620 MxGPU, 4608 shaders @ 2390 MHz, Adrenalin 21.20.02.13, Windows Server 2019, Python 3.11.3, vapoursynth-classic R57.A8

Measurements: (1080p, fp16) FPS / Device Memory (MB)

| model | NCNN_VK | ORT_DML |
| --- | --- | --- |
| dpir | 1.70 / 3248.4 | 4.75 / 2308.1 |
| dpir (2 streams) | 1.74 / 6099.5 | 4.86 / 4584.6 |
| waifu2x upconv7 | 5.18 / 6872.3 | 14.51 / 4448.5 |
| waifu2x upconv7 (2 streams) | 6.14 / 13701 | 15.98 / 8861.2 |
| waifu2x cunet / cugan (2x2 tiles) | 1.07 / 3159.8 | 5.57 / 2196.7 |
| waifu2x cunet / cugan (2x2 tiles, 2 streams) | 1.07 / 3159.8 | 6.08 / 4357.8 |
| waifu2x swin_unet_art | N/A | N/A |
| realesrgan | 3.86 / 2699.7 | 9.59 / 1290.4 |
| realesrgan (2 streams) | 4.43 / 5355.8 | 10.58 / 2545.3 |
| rife | N/A | 2.68 / 1353.5 |
| rife (2 streams) | N/A | 4.44 / 2673.3 |

v13.1: latest ov & ort library, new models

29 Jan 15:00

The following models can be found in External Models:

  • Add support for waifu2x swin_unet models.
  • Add support for the ensemble configuration of RIFE. (RIFE v2 acceleration is experimental and may result in reduced quality on the TRT backend with fp16 enabled.)

Note

  • Backend OV_GPU may produce reduced precision output. This is under investigation.

Contributed Models

20 Apr 04:26
Pre-release

Please see PR #42 for policy.

v13: fp16 i/o, faster dynamic shapes for TRT backend

14 Jan 10:49
  • Added support for fp16 I/O format and faster dynamic shapes in the TRT backend.

    As the only portable way to convert fp32 clips to fp16 is via resize (std.Expr only supports fp16 when the cpu supports the f16c instruction set extension), and to conserve memory bandwidth, you could use the following snippet for RIFE to perform the necessary padding, YUV/RGB conversion and FP16 conversion in one go:

    import vapoursynth as vs
    import vsmlrt

    # pad to a multiple of 32; adjust 32 and 31 to match the specific AI network's input resolution requirements
    th = (src.height + 31) // 32 * 32
    tw = (src.width  + 31) // 32 * 32
    # padding, YUV -> RGB conversion and fp16 conversion in a single resize call
    padded = src.resize.Bicubic(tw, th, format=vs.RGBS if WANT_FP32 else vs.RGBH, matrix_in_s="709", src_width=tw, src_height=th)
    flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend, output_format=1)  # fp16 output
    # crop the padding back, accounting for the model's scale factor;
    # not necessary for RIFE (i.e. oh = src.height), but required for super-resolution upscalers
    oh = src.height * (flt.height // th)
    ow = src.width  * (flt.width  // tw)
    res = flt.resize.Bicubic(ow, oh, format=vs.YUV420P8, matrix_s="709", src_width=ow, src_height=oh)
    • Faster dynamic shapes, introduced in TensorRT 8.5.1, improve performance and device memory usage of dynamically shaped engines (#20 (comment) and the following benchmark).

      Dynamically shaped models can be created by specifying static_shape=False, min_shapes (minimum size), opt_shapes (optimization size) and max_shapes (maximum size) in the TRT backend; see the sketch after this list.

      The engine cache placement policy of vsmlrt.py is direct-mapped:

      • If dynamic shapes are used, the engine name is determined by min_shapes, opt_shapes and max_shapes (among others).
      • Otherwise, the engine name is determined by opt_shapes.
      • If not initialized, opt_shapes is usually set to the tilesize in each specific model's interface.
    • workspace can now be set to None for unlimited workspace size. (#21)

    • Add flag force_fp16. This flag forces fp16 computation during inference, and is disabled by default.

      • It reduces memory usage during engine build and allows more engines to be built, e.g. 1080p RIFE on GPUs with 4 GB memory (a dynamically shaped RIFE engine with min_shapes=(64, 64), opt_shapes=(1920, 1088), max_shapes=(3840, 2176) was successfully built on a 4 GB Ampere GPU).
      • It may reduce engine build time.
  • Introduce a new simplified backend interface BackendV2.

  • Disable tiling support for rife due to incompatible inference design.
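
As a sketch, a dynamically shaped engine covering 64x64 up to 4K might be configured as follows (the shape values mirror the RIFE example above and are illustrative; padded is an assumed input clip):

    import vsmlrt

    backend = vsmlrt.Backend.TRT(
        fp16=True,
        force_fp16=True,          # reduces memory usage during engine build
        static_shape=False,
        min_shapes=(64, 64),      # smallest supported input
        opt_shapes=(1920, 1088),  # the engine is tuned for this size (and it names the cache entry)
        max_shapes=(3840, 2176),  # largest supported input
    )
    flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend)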

dynamic shapes

In conclusion, dynamic shapes are much more flexible when dealing with different video resolutions (no engine re-compilation is required) and incur almost no performance degradation starting with TensorRT 8.5; the only remaining concern is increased device memory usage.

benchmark

  • Configuration: NVIDIA A10 (ECC disabled), driver 527.41 (TCC), Windows Server 2022, Python 3.11.1, vapoursynth-classic R57.A7, Backend.TRT(fp16=True, use_cuda_graph=True, tf32=False, use_cudnn=False), CUDA_MODULE_LOADING=LAZY

    • Statically shaped engines for each model are compiled separately, while the dynamically shaped engine is compiled once with (1) static_shape=False, min_shapes=(64, 64), opt_shapes=<max-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions> or (2) static_shape=False, min_shapes=(64, 64), opt_shapes=<min-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions>.

      opt_shapes may be lowered for faster engine generation.

measurements: fps / device memory (MB)

| model | 1 stream static | 1 stream dynamic (1) | 1 stream dynamic (2) | 2 streams static | 2 streams dynamic (1) | 2 streams dynamic (2) |
| --- | --- | --- | --- | --- | --- | --- |
| waifu2x upconv7 1920x1080 | 17.4 / 1992 | 17.5 / 1998 | 17.4 / 2040 | 21.2 / 3694 | 21.2 / 3756 | 20.9 / 3850 |
| waifu2x upconv7 1280x720 | 37.2 / 1046 | 38.5 / 1930 | 37.8 / 1976 | 46.3 / 1818 | 48.2 / 3628 | 46.5 / 3722 |
| waifu2x upconv7 720x480 | 102.2 / 544 | 104.4 / 1894 | 102.2 / 1940 | 123.1 / 834 | 128.1 / 3556 | 123.4 / 3650 |
| dpir color 1920x1080 | 7.3 / 2114 | 7.3 / 2114 | 7.2 / 2116 | 7.5 / 3656 | 7.4 / 3992 | 7.4 / 4002 |
| dpir color 1280x720 | 16.4 / 1122 | 16.3 / 2086 | 16.2 / 2092 | 16.7 / 1810 | 16.7 / 3936 | 16.7 / 3946 |
| dpir color 720x480 | 41.5 / 604 | 41.5 / 2068 | 41.6 / 2074 | 44.3 / 863 | 44.3 / 3900 | 44.2 / 3910 |
| real-esrgan v2 1920x1080 | 12.3 / 1320 | 12.3 / 1320 | 12.3 / 1320 | 13.0 / 2196 | 13.2 / 2402 | 12.9 / 2402 |
| real-esrgan v2 1280x720 | 26.9 / 736 | 26.9 / 1256 | 27.0 / 1256 | 29.1 / 1130 | 29.3 / 2274 | 29.2 / 2274 |
| real-esrgan v2 720x480 | 73.2 / 422 | 73.2 / 1220 | 72.7 / 1220 | 78.7 / 570 | 78.4 / 2202 | 78.1 / 2202 |
| cugan 1920x1080 | 9.4 / 4648 | 9.4 / 4726 | 9.2 / 4618 | 9.8 / 8754 | 10.2 / 9210 | 9.9 / 8996 |
| cugan 1280x720 | 20.5 / 2214 | 20.5 / 4662 | 20.0 / 4554 | 21.2 / 4050 | 22.9 / 9082 | 22.4 / 8868 |
| cugan 720x480 | 54.8 / 996 | 53.7 / 4626 | 52.9 / 4518 | 57.7 / 1690 | 59.9 / 9019 | 58.9 / 8796 |
| rife v4.4 1920x1088 | 92.8 / 590 | 92.2 / 594 | 89.5 / 606 | 178.6 / 920 | 177.6 / 942 | 169.5 / 974 |
| rife v4.4 1280x736 | 206.0 / 410 | 199.2 / 534 | 199.1 / 550 | 394.2 / 560 | 377.5 / 822 | 374.0 / 854 |
| rife v4.4 736x480 | 497.3 / 316 | 442.2 / 504 | 492.3 / 520 | 903.2 / 376 | 809.3 / 762 | 874.1 / 794 |

*The gap is large on rife because of underutilization, and will disappear when using more streams.

v12.3.test

09 Jan 15:10
Pre-release

v12.2

23 Nov 09:19
Pre-release

Update vsmlrt.py:

  • Introduce a new release artifact ext-models.v12.2.7z, which comes from External Models and is not bundled into the full binary release packages (i.e. the cpu, cuda and vk packages). Please refer to the External Models release notes for details on how to use those models.

  • Export a new API vsmlrt.inference for inference of custom models.

    import vsmlrt
    output = vsmlrt.inference(clips, "path/to/onnx", backend=vsmlrt.Backend.TRT(fp16=True))

    If you encounter issues like Cannot find input tensor with name "input" in the network inputs! Please make sure the input tensor names are correct., you could use vsmlrt.inference(..., input_name=None) or export the model with its input name set to "input".

  • Fix trt inference of cugan-pro (3x) models. (#15)

External Models

07 Dec 07:20
Pre-release

More models!

In addition to bundled models, vs-mlrt can also be used to run these models:

With more to come.

Also check onnx models provided by the avs-mlrt community.

Usage

If an external model is not supported by the Python wrapper, you can use the generic vsmlrt.inference API to run these models (requires release v12.2 or later).

import vsmlrt
output = vsmlrt.inference(rgbs, "path/to/onnx", backend=vsmlrt.Backend.TRT(fp16=True))

The RIFE model requires auxiliary inputs and should be used via the vsmlrt.RIFE or vsmlrt.RIFEMerge interfaces.

v12.1

16 Nov 10:43
Pre-release

This minor release fixes #9: now if vsort/vstrt fail to load the required CUDA DLLs, they won't crash the entire process.

However, if vs-mlrt is correctly installed, this shouldn't happen. Please report an issue if you can't access the core.trt or core.ort namespaces. A common mistake is forgetting to extract the vsmlrt-cuda.v12.1.7z package alongside the VSORT-Windows-x64.v12.1.7z or VSTRT-Windows-x64.v12.1.7z packages. If in doubt, CUDA users should use the fully bundled release vsmlrt-windows-x64-cuda.v12.1.7z.

Note: we explicitly do not support using both PyTorch and vs-mlrt plugins in the same vpy script, as PyTorch uses its own set of CUDA DLLs which might conflict with the ones vs-mlrt uses. As those DLLs are not explicitly versioned (e.g. nvinfer.dll instead of nvinfer-x.yz.dll), there is nothing we can do.

v12: latest CUDA libraries

01 Nov 10:57

Compared to v11, this release updated CUDA dependencies to CUDA 11.8.0, cuDNN 8.6.0 and TensorRT 8.5.1:

  • Added support for the NVIDIA 40 series GPUs.
  • Added support for RIFE on the trt backend.

Known issues

  • Performance of the OV_CPU and ORT_CUDA(fp16=True) backends for RIFE is lower than expected; this is under investigation. Please consider ORT_CPU or ORT_CUDA(fp16=False) for now (see the sketch after this list).
  • The NCNN_VK backend does not support RIFE.
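
A sketch of the suggested workaround (the model choice is illustrative; rgbs is an assumed RGB clip padded to RIFE's required dimensions):

    import vsmlrt

    # disable fp16 on ORT_CUDA to avoid the RIFE performance issue noted above
    flt = vsmlrt.RIFE(rgbs, model=vsmlrt.RIFEModel.v4_4, backend=vsmlrt.Backend.ORT_CUDA(fp16=False))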

Installation Notes

For some advanced features, vsmlrt.py requires the numpy and onnx packages to be available. You might need to run pip install onnx numpy.

Benchmark

previous benchmark

Configuration: NVIDIA RTX 3090, driver 526.47, Windows Server 2019, VapourSynth R60, Python 3.11.0, 1080p, fp16

Backends: ort-cuda, trt from vs-mlrt v12.

For the trt backend, the engine is created without the CUDA_MODULE_LOADING=LAZY environment variable, but the variable is set during benchmarking to reduce device memory consumption.

Data format: fps / GPU memory usage (MB)

rife(model=44, 1920x1088)

| backend | 1 stream | 2 streams |
| --- | --- | --- |
| ort-cuda | 53.62/1771 | 83.34/2748 |
| trt | 71.30/626 | 107.3/962 |

dpir color

| backend | 1 stream | 2 streams |
| --- | --- | --- |
| ort-cuda | 4.64/3230 | |
| trt | 10.32/1992 | 11.61/3475 |

waifu2x upconv_7

| backend | 1 stream | 2 streams |
| --- | --- | --- |
| ort-cuda | 11.07/5916 | 15.04/10899 |
| trt | 18.38/2092 | 31.64/3848 |

waifu2x cunet

| backend | 1 stream | 2 streams |
| --- | --- | --- |
| ort-cuda | 4.63/8541 | 5.32/16148 |
| trt | 11.44/4771 | 15.59/8972 |

realesrgan v2/v3

| backend | 1 stream | 2 streams |
| --- | --- | --- |
| ort-cuda | 8.84/2283 | 11.10/4202 |
| trt | 14.59/1324 | 21.37/2174 |