Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SSE41 version of UpSample #1836

Merged
merged 28 commits into from
Nov 25, 2021
Merged

Add SSE41 version of UpSample #1836

merged 28 commits into from
Nov 25, 2021

Conversation

brianpopow
Copy link
Collaborator

@brianpopow brianpopow commented Nov 17, 2021

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

This PR adds a SSE41 version of UpSample, which is used during decoding of lossy webp images to convert from YUV to RGB.
This is where a significant portion of the time is spend (ca. 30%).

Related to #1786

This is still work in progress. This is still left to do:

  • Implement converting last block.
  • The last column of each block seems to be incorrect. Fixed with 3f43883
  • Add tests
  • Add benchmarks once above issues are fixed

Example output:
output

Output looks correct now:
output

@brianpopow brianpopow changed the title Add SSE41 version of UpSample WIP: Add SSE41 version of UpSample Nov 17, 2021
@brianpopow
Copy link
Collaborator Author

brianpopow commented Nov 18, 2021

Profiling results look pretty good:

master

upsample

PR

upsample_sse

Testimage used:
Calliphora_lossy.zip

@codecov
Copy link

codecov bot commented Nov 18, 2021

Codecov Report

Merging #1836 (8f6e9ba) into master (1713891) will increase coverage by 0%.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #1836    +/-   ##
=======================================
  Coverage      87%     87%            
=======================================
  Files         937     937            
  Lines       48449   48654   +205     
  Branches     6057    6067    +10     
=======================================
+ Hits        42329   42534   +205     
  Misses       5112    5112            
  Partials     1008    1008            
Flag Coverage Δ
unittests 87% <100%> (+<1%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/ImageSharp/Formats/Webp/Lossy/Vp8Decoder.cs 96% <ø> (-1%) ⬇️
.../ImageSharp/Formats/Webp/Lossy/WebpLossyDecoder.cs 97% <100%> (-1%) ⬇️
src/ImageSharp/Formats/Webp/Lossy/YuvConversion.cs 99% <100%> (+<1%) ⬆️
...rc/ImageSharp/Formats/Webp/Lossless/Vp8LEncoder.cs 97% <0%> (-1%) ⬇️
...ageSharp/Formats/Webp/Lossless/PredictorEncoder.cs 97% <0%> (+<1%) ⬆️
...ImageSharp/Formats/Webp/Lossless/Vp8LBitEntropy.cs 100% <0%> (+1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1713891...8f6e9ba. Read the comment docs.

@brianpopow brianpopow changed the title WIP: Add SSE41 version of UpSample Add SSE41 version of UpSample Nov 19, 2021
@brianpopow
Copy link
Collaborator Author

When looking at the profiler measures, the improvements seem huge.
I have started to look into improving the UpScale method, because the profiler was telling me 30% of the time is spent here.
Unfortunately when looking at the benchmark results, its not a big difference. Its barely noticeable.
I really was expecting a big impact on the lossy decoding time for this PR, but its not there.

I am not sure what im doing wrong here or if im reading the profiler output wrong. I am using dotTrace with Line-by-Line sampling, because I thought this is the most accurate, but it seems to be rather misleading.

Comment on lines 184 to 186
UpSample32Pixels(topU.Slice(uvPos), curU.Slice(uvPos), ru);
UpSample32Pixels(topV.Slice(uvPos), curV.Slice(uvPos), rv);
ConvertYuvToBgrSse41(topY, bottomY, topDst, bottomDst, ru, rv, pos, xStep);
Copy link
Member

@antonfirsov antonfirsov Nov 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid slicing and passing spans to these private methods here, and iterate instead using references or pointers. (Pinning is fine with such complex code IMO if it helps keeping readability.)

Comment on lines 752 to 757
// Load the bytes into the *upper* part of 16b words. That's "<< 8", basically.
[MethodImpl(InliningOptions.ShortMethod)]
private static Vector128<byte> LoadHigh(Vector64<byte> src)
{
Vector128<byte> tmp = Unsafe.As<Vector64<byte>, Vector128<byte>>(ref src);
return Sse2.UnpackLow(Vector128<byte>.Zero, tmp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you actually load to Vector128<byte>, then bitshift? Could be much cheaper, not sure what is Unsafe.As<Vector64<byte>, Vector128<byte>>(ref src) being compiled to.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its just a mov, SharpLab, this should be fine, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cant seem to figure out how to do this differently/better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not aware of this, looks much better now, thanks.

Changed with cc5f7af

@JimBobSquarePants
Copy link
Member

I'm happy for this to go in as-is but I'll leave it to @antonfirsov for a final review since he had questions/suggestions

Copy link
Member

@antonfirsov antonfirsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still some space for improvement, but feel free to move on, if you'd better focus on something else.

Comment on lines 752 to 757
// Load the bytes into the *upper* part of 16b words. That's "<< 8", basically.
[MethodImpl(InliningOptions.ShortMethod)]
private static Vector128<byte> LoadHigh(Vector64<byte> src)
{
Vector128<byte> tmp = Unsafe.As<Vector64<byte>, Vector128<byte>>(ref src);
return Sse2.UnpackLow(Vector128<byte>.Zero, tmp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{
YuvToBgrSse41(topY.Slice(curX), ru, rv, topDst.Slice(curX * step));

if (bottomY != null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if can be moved outside of the loop for happier pipelines / branch predictor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the if outside of the loop with 65870b9

@@ -312,6 +586,178 @@ public static void YuvToBgr(int y, int u, int v, Span<byte> bgr)
bgr[0] = (byte)YuvToB(y, u);
}

#if SUPPORTS_RUNTIME_INTRINSICS

private static void ConvertYuvToBgrSse41(Span<byte> topY, Span<byte> bottomY, Span<byte> topDst, Span<byte> bottomDst, Span<byte> ru, Span<byte> rv, int curX, int step)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These spans can be also converted to ref-s.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel a bit more comfortable with using spans here, not sure if its really worth changing to ref's

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main overhead seems to appear here:

ConvertYuv444ToRgbSse41(y.Slice(8), u.Slice(8), v.Slice(8), out Vector128<short> r1, out Vector128<short> g1, out Vector128<short> b1);
ConvertYuv444ToRgbSse41(y.Slice(16), u.Slice(16), v.Slice(16), out Vector128<short> r2, out Vector128<short> g2, out Vector128<short> b2);
ConvertYuv444ToRgbSse41(y.Slice(24), u.Slice(24), v.Slice(24), out Vector128<short> r3, out Vector128<short> g3, out Vector128<short> b3);

You can consider "de-spanning" only in this method and ConvertYuv444ToRgbSse41 replacing .Slice with Unsafe.Add to y/u/v base references.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i am fine with that: 6293f72

// Convert 32 samples of YUV444 to R/G/B
private static void ConvertYuv444ToRgbSse41(Span<byte> y, Span<byte> u, Span<byte> v, out Vector128<short> r, out Vector128<short> g, out Vector128<short> b)
{
Vector128<byte> y0 = LoadHigh(ref MemoryMarshal.GetReference(y));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore me if you already double checked this, but what you really need to make sure now is that if y,u,v have 8 bytes even at the very end, otherwise it's a memory corruption.

The code that allocates these buffers seems to add padding at first glance, but I don't really follow it:

int extraRows = WebpConstants.FilterExtraRows[(int)LoopFilter.Complex]; // assuming worst case: complex filter
int extraY = extraRows * this.CacheYStride;
int extraUv = extraRows / 2 * this.CacheUvStride;
this.YuvBuffer = memoryAllocator.Allocate<byte>((WebpConstants.Bps * 17) + (WebpConstants.Bps * 9) + extraY);
this.CacheY = memoryAllocator.Allocate<byte>((16 * this.CacheYStride) + extraY);
int cacheUvSize = (16 * this.CacheUvStride) + extraUv;
this.CacheU = memoryAllocator.Allocate<byte>(cacheUvSize);
this.CacheV = memoryAllocator.Allocate<byte>(cacheUvSize);

Comment on lines 742 to 744
Vector128<byte> y0 = LoadHigh(ref y);
Vector128<byte> u0 = LoadHigh(ref u);
Vector128<byte> v0 = LoadHigh(ref v);
Copy link
Member

@antonfirsov antonfirsov Nov 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wanted to stop already but can't resist 😄 Pipelined loads may perform better:

Suggested change
Vector128<byte> y0 = LoadHigh(ref y);
Vector128<byte> u0 = LoadHigh(ref u);
Vector128<byte> v0 = LoadHigh(ref v);
// Load the bytes into the *upper* part of 16b words. That's "<< 8", basically.
Vector128<byte> y0 = Unsafe.As<byte, Vector128<byte>>(ref y);
Vector128<byte> u0 = Unsafe.As<byte, Vector128<byte>>(ref u);
Vector128<byte> v0 = Unsafe.As<byte, Vector128<byte>>(ref v);
y0 = Sse2.UnpackLow(Vector128<byte>.Zero, y0);
u0 = Sse2.UnpackLow(Vector128<byte>.Zero, u0);
v0 = Sse2.UnpackLow(Vector128<byte>.Zero, v0);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, makes sense. Changed with 7775c34

Copy link
Member

@antonfirsov antonfirsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patience!

@brianpopow brianpopow merged commit b67a8db into master Nov 25, 2021
@brianpopow brianpopow deleted the bp/upscalesse branch November 25, 2021 13:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants