nvidia hardware decoder support #1296

Open · bradh wants to merge 44 commits into master

Conversation

@bradh (Contributor, Author) commented Sep 3, 2024

No description provided.

bradh marked this pull request as ready for review September 3, 2024 04:54
@farindk (Contributor) commented Sep 3, 2024

I get this cmake error when trying to compile:

Compiling 'nvdec' as built-in backend
Not compiling 'libsharpyuv'
-- Configuring done

CMake Error at libheif/CMakeLists.txt:97 (add_library):
  Target "heif" links to target "CUDA::cuda_driver" but the target was not
  found.  Perhaps a find_package() call is missing for an IMPORTED target, or
  an ALIAS target is missing?

Where do I get the CMake config script from? I have installed nvidia-cuda-dev.

@farindk (Contributor) commented Sep 3, 2024

It compiles correctly despite the error above. Probably that dependency is not needed.

@bradh (Contributor, Author) commented Sep 3, 2024

The CMake config script comes with CMake, I think. nvidia-cuda-toolkit is probably the package on Ubuntu.

Possibly the dependency is only needed at runtime, and perhaps only for JPEG decoding.
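
For what it's worth, a minimal sketch of where that imported target normally comes from, assuming CMake 3.17 or newer (the FindCUDAToolkit module ships with CMake itself, not with the NVIDIA packages); the exact placement inside libheif/CMakeLists.txt is illustrative only:

```cmake
# Sketch only: CUDA::cuda_driver is defined by CMake's FindCUDAToolkit module
# (CMake >= 3.17). Guarding the link keeps builds without the toolkit working.
find_package(CUDAToolkit)
if(CUDAToolkit_FOUND)
  target_link_libraries(heif PRIVATE CUDA::cuda_driver)
endif()
```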

@farindk (Contributor) commented Sep 3, 2024

I didn't manage to make it work, but that is probably due to my very old laptop (2011), which I could not convince to use the GPU instead of the built-in Intel graphics (cuDeviceGet() failed). I'll have to find a more recent computer.

@farindk (Contributor) commented Sep 5, 2024

We might also need an extended version of does_support_format() in the plugin interface, because some hardware decoders may support only subsets of the standards, like only 8-bit and 10-bit decoding, but not 12-bit.
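
One possible shape for such an extension, as a sketch only; the struct and the function typedef below are hypothetical and not part of the current plugin interface:

```cpp
// Hypothetical extension of the decoder plugin interface: instead of a single
// answer per compression format, the plugin is asked about a concrete
// parameter combination. All names below are illustrative only.
#include <libheif/heif.h>
#include <cstdint>

struct heif_decoder_capability_query
{
  enum heif_compression_format format;
  int bit_depth;          // 8, 10, 12, ...
  int chroma;             // e.g. 420, 422, 444
  uint32_t coded_width;   // coded size of the image or tile to decode
  uint32_t coded_height;
};

// Returns a priority as does_support_format() does, with 0 meaning
// "this particular combination is not supported".
typedef int (*heif_does_support_format2_fn)(const struct heif_decoder_capability_query* query);
```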

@bradh (Contributor, Author) commented Sep 6, 2024

> We might also need an extended version of does_support_format() in the plugin interface, because some hardware decoders may support only subsets of the standards, like only 8-bit and 10-bit decoding, but not 12-bit.

I think we probably need to just be tolerant of a software or hardware decoder that claims to support a format, but can't quite do it after all. I'm thinking of cases where it needs a special codec feature that only shows up once you're down at the NAL unit level. In that case, we'd just try another decoder if there was one, and the user hadn't asked for a specific implementation. So maybe that is enough.

If we do want a does_support_format2(...), NVIDIA reports decoding capabilities in terms of:

  • chroma format
  • bit size
  • max coded height
  • max coded width
  • min coded height
  • min coded width
  • maximum number of macroblocks (which we can derive from the height and width limits)

I'm not sure about other hardware-accelerated implementations. I'll see if I can find out what Intel does.
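
For reference, roughly how those limits are exposed by NVDEC; a sketch only, assuming a CUDA context is already current on the calling thread (check the Video Codec SDK's nvcuvid.h for the exact field list):

```cpp
// Sketch: ask NVDEC whether it can decode HEVC 10-bit 4:2:0 and what the
// coded-size limits are. Error handling is reduced to a simple check.
#include <cuda.h>
#include <nvcuvid.h>
#include <cstdio>
#include <cstring>

bool query_hevc_10bit_caps()
{
  CUVIDDECODECAPS caps;
  std::memset(&caps, 0, sizeof(caps));
  caps.eCodecType      = cudaVideoCodec_HEVC;
  caps.eChromaFormat   = cudaVideoChromaFormat_420;
  caps.nBitDepthMinus8 = 2;  // 10-bit

  if (cuvidGetDecoderCaps(&caps) != CUDA_SUCCESS || !caps.bIsSupported)
    return false;

  std::printf("coded size %ux%u .. %ux%u, max macroblocks %u\n",
              (unsigned)caps.nMinWidth, (unsigned)caps.nMinHeight,
              (unsigned)caps.nMaxWidth, (unsigned)caps.nMaxHeight,
              (unsigned)caps.nMaxMBCount);
  return true;
}
```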

farindk mentioned this pull request Sep 9, 2024
@farindk (Contributor) commented Sep 13, 2024

I could now test the NVIDIA decoder on a GT 1030 with h.265.

When parallel tile decoding is enabled, it stops somewhere in the middle with a "cannot get CUDA context" error. Probably the number of parallel decoders is limited.

Without parallel decoding it works, but I was surprised how slow it is: NVIDIA hardware decoding takes 9.0s versus 0.7s with the libde265 software decoder (both without parallel tile decoding). My guess is that the hardware setup time is what makes it slow.

AVC and JPEG also work, but they are also much slower than the software decoders. I could not test AV1.

What are your experiences?

@farindk (Contributor) commented Sep 13, 2024

The problem is that cuCtxCreate takes 0.065 secs and cuCtxDestroy another 0.035 secs. That makes 0.1 secs for each call to does_support_format or decode_image. If we call these two functions for every tile, this adds up.

The supported formats should be easy to cache. Caching a decoder context is not so easy; that might require a plugin function to release the cached decoder at the end of each image. Or maybe we should keep the cached decoder even longer, in case we are doing a batch conversion of many images; then we'd call the cache cleanup function after a short time delay.
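
One possible way to avoid the repeated cuCtxCreate/cuCtxDestroy cost, as a sketch only (the helper names are hypothetical, and this assumes the plugin can keep process-wide state): retain the device's primary context once and release it from an explicit cleanup hook.

```cpp
// Sketch: cache a CUDA context for the lifetime of the plugin instead of
// creating and destroying one per does_support_format()/decode_image() call.
// cuDevicePrimaryCtxRetain() reference-counts the device's primary context,
// so repeated calls after the first are cheap. Callers still make it current
// with cuCtxPushCurrent()/cuCtxPopCurrent() around the actual decode.
#include <cuda.h>
#include <mutex>

namespace {
  std::mutex g_ctx_mutex;
  CUdevice   g_device = 0;
  CUcontext  g_ctx = nullptr;   // cached primary context
}

// Hypothetical helper: returns the cached context, creating it on first use.
CUcontext nvdec_acquire_context()
{
  std::lock_guard<std::mutex> lock(g_ctx_mutex);
  if (!g_ctx) {
    if (cuInit(0) != CUDA_SUCCESS) return nullptr;
    if (cuDeviceGet(&g_device, 0) != CUDA_SUCCESS) return nullptr;
    if (cuDevicePrimaryCtxRetain(&g_ctx, g_device) != CUDA_SUCCESS)
      g_ctx = nullptr;
  }
  return g_ctx;
}

// Hypothetical cleanup hook, e.g. called at the end of an image or after a
// short idle delay when batch-converting many images.
void nvdec_release_cached_context()
{
  std::lock_guard<std::mutex> lock(g_ctx_mutex);
  if (g_ctx) {
    cuDevicePrimaryCtxRelease(g_device);
    g_ctx = nullptr;
  }
}
```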

@farindk (Contributor) commented Sep 13, 2024

I've measured the actual decoding time (excluding all initialization and conversions) for a tiled h.265 image.

  • nvdec: 0.42 sec
  • libde265: 0.55 sec
  • ffmpeg: 0.40 sec (I think this is the software codec)

That is what we can expect with perfect caching.

@Neoclassic commented Oct 20, 2024

Thanks @bradh, this is functionally working fine for me, as I have a requirement to decode HEIC images on the GPU.
But it is slow for HEIC images with tiles (the 6x8-tile HEICs that iPhones produce), mostly because CUDA needs to be initialized for each of the 48 grid tiles.

static int nvdec_does_support_format(enum heif_compression_format format)
Do we really need to check this every time? Can it be cached? When I hardcoded the method to return 120, the time went down to 8 seconds.
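
A minimal sketch of that kind of caching, assuming the expensive device probe can be factored into a helper; probe_nvdec_support() below is a hypothetical stand-in for whatever the plugin currently does on every call:

```cpp
// Sketch: remember the probe result per compression format so the CUDA device
// is only queried once per process, not once per tile.
#include <libheif/heif.h>
#include <map>
#include <mutex>

// Placeholder for the existing capability check that touches the GPU.
static int probe_nvdec_support(enum heif_compression_format format);

static int nvdec_does_support_format(enum heif_compression_format format)
{
  static std::mutex cache_mutex;
  static std::map<heif_compression_format, int> cache;

  std::lock_guard<std::mutex> lock(cache_mutex);
  auto it = cache.find(format);
  if (it == cache.end())
    it = cache.emplace(format, probe_nvdec_support(format)).first;
  return it->second;  // priority returned to libheif; 0 means "not supported"
}
```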

I also had to export CUDA_DEVICE_MAX_CONNECTIONS=2 (or higher); otherwise the time is around 45 seconds.

time ./examples/heif-dec ~/ffmpeg_test/testfiles/LiveOff.HEIC out.png
[istream] request_range 0 - 1024
[istream] request_range 24 - 3946
[istream] request_range 15119 - 17157
File contains 1 image
[istream] request_range 17157 - 24447
[istream] request_range 24447 - 46783
GPU in use: Tesla T4
[istream] request_range 78025 - 103598
GPU in use: Tesla T4
[istream] request_range 46783 - 78025
GPU in use: Tesla T4
GPU in use: Tesla T4
[istream] request_range 103598 - 126917
[istream] request_range 126917 - 146390
GPU in use: Tesla T4
GPU in use: Tesla T4
[istream] request_range 146390 - 162298
[istream] request_range 162298 - 174934
GPU in use: Tesla T4
GPU in use: Tesla T4
[istream] request_range 174934 - 178765
GPU in use: Tesla T4
[istream] request_range 178765 - 209539
[istream] request_range 209539 - 256166
GPU in use: Tesla T4
GPU in use: Tesla T4
[istream] request_range 256166 - 297666
GPU in use: Tesla T4
[istream] request_range 297666 - 333953
GPU in use: Tesla T4
[istream] request_range 333953 - 350529
[istream] request_range 362678 - 375496
GPU in use: Tesla T4
[istream] request_range 350529 - 362678
GPU in use: Tesla T4
GPU in use: Tesla T4
[istream] request_range 375496 - 380664
GPU in use: Tesla T4
[istream] request_range 380664 - 407373
GPU in use: Tesla T4
[istream] request_range 407373 - 435430
[istream] request_range 435430 - 475399
GPU in use: Tesla T4
[istream] request_range 475399 - 500341
GPU in use: Tesla T4
GPU in use: Tesla T4
[istream] request_range 500341 - 521156
GPU in use: Tesla T4
[istream] request_range 521156 - 534960
GPU in use: Tesla T4
[istream] request_range 534960 - 548606
[istream] request_range 548606 - 554956
GPU in use: Tesla T4
GPU in use: Tesla T4
[istream] request_range 554956 - 579136
GPU in use: Tesla T4
[istream] request_range 579136 - 612623
GPU in use: Tesla T4
[istream] request_range 612623 - 639585
GPU in use: Tesla T4
[istream] request_range 639585 - 672260
GPU in use: Tesla T4
[istream] request_range 672260 - 692049
GPU in use: Tesla T4
[istream] request_range 692049 - 707221
GPU in use: Tesla T4
[istream] request_range 707221 - 718533
GPU in use: Tesla T4
[istream] request_range 718533 - 736781
[istream] request_range 736781 - 764323
GPU in use: Tesla T4
GPU in use: Tesla T4
[istream] request_range 764323 - 787729
GPU in use: Tesla T4
[istream] request_range 787729 - 809625
GPU in use: Tesla T4
[istream] request_range 809625 - 832233
GPU in use: Tesla T4
[istream] request_range 832233 - 852712
GPU in use: Tesla T4
[istream] request_range 852712 - 872658
GPU in use: Tesla T4
[istream] request_range 872658 - 886348
GPU in use: Tesla T4
[istream] request_range 886348 - 904055
GPU in use: Tesla T4
[istream] request_range 904055 - 922677
GPU in use: Tesla T4
[istream] request_range 922677 - 930679
GPU in use: Tesla T4
[istream] request_range 930679 - 937882
[istream] request_range 937882 - 946358
GPU in use: Tesla T4
GPU in use: Tesla T4
[istream] request_range 946358 - 971564
GPU in use: Tesla T4
[istream] request_range 971564 - 990224
GPU in use: Tesla T4
[istream] request_range 990224 - 1003954
GPU in use: Tesla T4
Written to out.png

real 0m17.878s
user 0m5.555s
sys 0m25.650s

@bradh (Contributor, Author) commented Oct 20, 2024

> Mostly because CUDA needs to be initialized for each of the 48 grid tiles.

Yes. This can be optimised, as identified by Dirk last month. That work hasn't been done yet though.

@Neoclassic

@bradh @farindk Are there any other planned (or unplanned) optimizations here, especially with regard to grid-based images? There is a lot to decode there, as well as a lot of back-and-forth between GPU and CPU.

@bradh (Contributor, Author) commented Oct 21, 2024

> @bradh @farindk Are there any other planned (or unplanned) optimizations here, especially with regard to grid-based images? There is a lot to decode there, as well as a lot of back-and-forth between GPU and CPU.

Nothing specific.

There is no guarantee this will even be added to libheif, let alone when or in what form. Please be realistic in your expectations - right now there is no customer requirement for it, and no specific funding.
