Add AVX512 accelerated 1D/3D LUTS #1932
Thanks Mark!
I'd like to clarify the F16C option in relation to this. I guess if AVX512 is supported then we should assume F16C is always supported too, right? I have a few comments below related to this.
Yes, it's surprising the timings are not faster. Maybe it's bound by memory accesses? I've seen cases where rearranging how the LUTs are stored in memory (at the cost of taking more space) resulted in a speed-up, though not sure if that would help here.
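To make the memory-layout idea concrete, here is a minimal sketch of one possible rearrangement: padding each RGB LUT entry to four floats so that every entry starts on a 16-byte boundary and can be fetched with a single 128-bit load. The helper name `pad_rgb_to_rgba` is hypothetical and is not part of this pull request or of OpenColorIO; whether this layout actually helps would need to be measured.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): repack tightly packed RGB LUT
// entries into RGBA-padded entries. This costs 33% more memory, but each
// entry then occupies exactly 16 bytes.
std::vector<float> pad_rgb_to_rgba(const std::vector<float>& rgb)
{
    const std::size_t entries = rgb.size() / 3;   // trailing partial entry is ignored
    std::vector<float> rgba(entries * 4, 0.0f);   // padding channel stays 0
    for (std::size_t i = 0; i < entries; ++i)
    {
        rgba[4 * i + 0] = rgb[3 * i + 0];
        rgba[4 * i + 1] = rgb[3 * i + 1];
        rgba[4 * i + 2] = rgb[3 * i + 2];
    }
    return rgba;
}
```

With this layout, entry `i` lives at float offset `4 * i` instead of `3 * i`, which trades memory for simpler, aligned gathers.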
Yes, the half float conversion instructions are all part of the AVX512F (Foundation) extension. The exact overlap between AVX, AVX2, and F16C support has never been clear to me. I think AVX2 pretty much guarantees F16C, but I think it's best to check it along with those extensions.
I did a bit more perf testing of this with my old lut3d_perf tool. It also turns out that GitHub runners on private repos are different from the public repo ones: the private ones can have AVX512. I was able to test this pull request on Windows with AVX512 by setting up a private fork. I kinda used up all my free minutes for the month doing it, but all the tests pass 😆
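As an illustration of checking each extension explicitly rather than inferring one from another, here is a minimal sketch using the `<cpuid.h>` helpers available in GCC and Clang. The struct `SimdFeatures` and the functions `query_simd_features` and `choose_path` are hypothetical names for this example, not the detection code used in this pull request. A production check would also confirm OS support for the wider registers (OSXSAVE and XGETBV), which is omitted here for brevity.

```cpp
#include <string>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   // GCC/Clang only; MSVC would use __cpuidex instead
#endif

// Hypothetical feature record for this sketch.
struct SimdFeatures
{
    bool f16c    = false;
    bool avx2    = false;
    bool avx512f = false;
};

// Reads the CPUID feature bits. OS-level state checks are intentionally omitted.
SimdFeatures query_simd_features()
{
    SimdFeatures f;
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    {
        f.f16c = ((ecx >> 29) & 1u) != 0;      // leaf 1, ECX bit 29
    }
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
    {
        f.avx2    = ((ebx >> 5) & 1u) != 0;    // leaf 7, EBX bit 5
        f.avx512f = ((ebx >> 16) & 1u) != 0;   // leaf 7, EBX bit 16
    }
#endif
    return f;
}

// Picks a code path. The AVX512 path does not consult the F16C flag because
// the 512-bit half float conversions belong to AVX512F itself; the AVX2 path
// checks F16C separately instead of assuming it.
std::string choose_path(const SimdFeatures& f)
{
    if (f.avx512f)        return "avx512";
    if (f.avx2 && f.f16c) return "avx2+f16c";
    if (f.avx2)           return "avx2";
    return "scalar";
}
```

Calling `choose_path(query_simd_features())` at startup would then select the widest path the host reports.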
LGTM!
Signed-off-by: Mark Reid <[email protected]>
@remia I added your suggestion to all the SIMD tests. I also rebased on top of the current main.
ocioperf.exe --transform tests/data/files/clf/lut1d_32f_example.clf
Line by Line Average, lut dim 65, 3840x2160 image, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
ocioperf.exe --transform tests/data/files/clf/lut3d_preview_tier_test.clf
Line by Line Average, lut dim 33x33x33, 3840x2160 image, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
I've only been able to test on one machine with AVX512. Not exactly the performance gains I was hoping for. I'm still new to the instruction set, so maybe there are some more optimizations we could do. There are quite a few AVX512 extensions. I've limited this implementation to just the AVX512F (Foundation) instructions. That basically means any AVX512-capable CPU should be able to run it.
GitHub Actions used to have more Intel CPUs with AVX512 available. Lately I've been getting only AMD EPYC CPUs without AVX512 for CI. I don't think there is any way to request a specific CPU. This is very frustrating and will make this more difficult to maintain and test.