Add SSE41 version of UpSample #1836

brianpopow · 2021-11-17T20:53:49Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

Description

This PR adds an SSE4.1 version of UpSample, which is used during decoding of lossy WebP images to convert from YUV to RGB.
This is where a significant portion of the decoding time is spent (ca. 30%).

Related to #1786

This is still a work in progress. Still left to do:

Implement converting last block.
~~The last column of each block seems to be incorrect.~~ Fixed with 3f43883
Add tests
Add benchmarks once above issues are fixed

Example output:

Output looks correct now:

brianpopow · 2021-11-18T14:11:42Z

Profiling results look pretty good:

master

PR

Testimage used:
Calliphora_lossy.zip

codecov · 2021-11-18T14:16:30Z

Codecov Report

Merging #1836 (8f6e9ba) into master (1713891) will increase coverage by 0%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #1836    +/-   ##
=======================================
  Coverage      87%     87%            
=======================================
  Files         937     937            
  Lines       48449   48654   +205     
  Branches     6057    6067    +10     
=======================================
+ Hits        42329   42534   +205     
  Misses       5112    5112            
  Partials     1008    1008

Flag	Coverage Δ
unittests	`87% <100%> (+<1%)`	⬆️

Impacted Files	Coverage Δ
src/ImageSharp/Formats/Webp/Lossy/Vp8Decoder.cs	`96% <ø> (-1%)`	⬇️
.../ImageSharp/Formats/Webp/Lossy/WebpLossyDecoder.cs	`97% <100%> (-1%)`	⬇️
src/ImageSharp/Formats/Webp/Lossy/YuvConversion.cs	`99% <100%> (+<1%)`	⬆️
...rc/ImageSharp/Formats/Webp/Lossless/Vp8LEncoder.cs	`97% <0%> (-1%)`	⬇️
...ageSharp/Formats/Webp/Lossless/PredictorEncoder.cs	`97% <0%> (+<1%)`	⬆️
...ImageSharp/Formats/Webp/Lossless/Vp8LBitEntropy.cs	`100% <0%> (+1%)`	⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1713891...8f6e9ba.

brianpopow · 2021-11-19T12:36:22Z

Looking at the profiler measurements, the improvements seem huge.
I started looking into improving the UpSample method because the profiler was telling me that 30% of the time is spent there.
Unfortunately, the benchmark results do not show a big difference; it's barely noticeable.
I was really expecting a big impact on the lossy decoding time from this PR, but it's not there.

I am not sure what I'm doing wrong here, or whether I'm reading the profiler output wrong. I am using dotTrace with line-by-line sampling because I thought it was the most accurate, but it seems to be rather misleading.

antonfirsov · 2021-11-19T12:51:15Z

src/ImageSharp/Formats/Webp/Lossy/YuvConversion.cs

+                UpSample32Pixels(topU.Slice(uvPos), curU.Slice(uvPos), ru);
+                UpSample32Pixels(topV.Slice(uvPos), curV.Slice(uvPos), rv);
+                ConvertYuvToBgrSse41(topY, bottomY, topDst, bottomDst, ru, rv, pos, xStep);


I would avoid slicing and passing spans to these private methods here, and iterate instead using references or pointers. (Pinning is fine with such complex code IMO if it helps keeping readability.)

antonfirsov · 2021-11-19T16:24:03Z

src/ImageSharp/Formats/Webp/Lossy/YuvConversion.cs

+        // Load the bytes into the *upper* part of 16b words. That's "<< 8", basically.
+        [MethodImpl(InliningOptions.ShortMethod)]
+        private static Vector128<byte> LoadHigh(Vector64<byte> src)
+        {
+            Vector128<byte> tmp = Unsafe.As<Vector64<byte>, Vector128<byte>>(ref src);
+            return Sse2.UnpackLow(Vector128<byte>.Zero, tmp);


Can't you actually load to Vector128<byte>, then bitshift? That could be much cheaper; I'm not sure what Unsafe.As<Vector64<byte>, Vector128<byte>>(ref src) is being compiled to.

It's just a mov (see SharpLab), so this should be fine, right?

I can't seem to figure out how to do this differently/better.

I thought this was a vector-level shuffle/shift; now I see what you are doing. What I mean is that if you only unpack from the lower half of tmp, pushing this through Vector64<byte> can be avoided by adding an 8-byte padding to y, u, v:

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCEBXAMy5ilwDcNek1YAlDgDsMAS3wwWASWlQZk3DLCDhjZiwnS5C5RlXrNuFgA0AHEiG1d4qbPksAwhHwAHGQBs+AGU+ADdNGG1aAGYGKVxsXiZSBncGAG8aBiymGIA1GDAMaHJSGwAeYABPDBgAPgYAQQAKWC4GKpqGXCgwAEoM6myhhnzC6FQK6rqGDB9yBgBeBgBVdQSFBtxJmrQRgqKoCY662paYNu6+h2Hs0YOS8uP62e9kpdX43hZNsrvxlG2dV2fygD0BtVOrRmc161xuTAA7AxArgYKQWKtvNgwABrAAyEAA7k0QWCniwAFp8CC7F6kWGZbIAX0ZWVZOT2Y1BpXBDDYZzaxy6PX67KGpJ5bCmzx8ixWay+P2OwP2xR5T0h52FVzF2WISJRaIxkixuIJxIlj2llOptJ8DMGzJoTKAA===

I was not aware of this, looks much better now, thanks.

Changed with cc5f7af
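For readers following the thread, here is a minimal sketch of the idea discussed above: read a full 16 bytes, then interleave the lower 8 with zeros so that each byte lands in the upper half of a 16-bit word. The class name and the demo buffer are hypothetical and not taken from the PR; the sketch assumes the caller has padded the buffer so that 16 bytes are readable.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class LoadHighSketch
{
    // Reads 16 bytes starting at `src`; only the lower 8 bytes end up in the result.
    // Assumption: the caller guarantees that 16 bytes are readable (hence the padding).
    public static Vector128<byte> LoadHigh(ref byte src)
    {
        Vector128<byte> tmp = Unsafe.As<byte, Vector128<byte>>(ref src);

        // Interleaves zero bytes with the lower 8 bytes of tmp: 0, b0, 0, b1, ...
        // Read as little-endian 16-bit words, lane i is (b[i] << 8).
        return Sse2.UnpackLow(Vector128<byte>.Zero, tmp);
    }

    public static void Main()
    {
        if (!Sse2.IsSupported)
        {
            Console.WriteLine("SSE2 is not available on this machine.");
            return;
        }

        // 8 payload bytes followed by 8 bytes of padding.
        byte[] data = { 1, 2, 3, 4, 5, 6, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0 };
        Vector128<ushort> words = LoadHigh(ref data[0]).AsUInt16();

        for (int i = 0; i < 8; i++)
        {
            // Each 16-bit lane holds the source byte shifted left by 8.
            Console.WriteLine($"{data[i]} -> 0x{words.GetElement(i):X4}");
        }
    }
}
```

The upper 8 bytes of the load are discarded by the unpack, which is why padding alone is enough to make the wider read safe.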

JimBobSquarePants · 2021-11-23T15:01:07Z

I'm happy for this to go in as-is but I'll leave it to @antonfirsov for a final review since he had questions/suggestions

antonfirsov

There is still some room for improvement, but feel free to move on if you'd rather focus on something else.

antonfirsov · 2021-11-24T11:16:51Z

src/ImageSharp/Formats/Webp/Lossy/YuvConversion.cs

+        {
+            YuvToBgrSse41(topY.Slice(curX), ru, rv, topDst.Slice(curX * step));
+
+            if (bottomY != null)


This if can be moved outside of the loop for happier pipelines / branch predictor.

Moved the if outside of the loop with 65870b9
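As a rough illustration of hoisting a loop-invariant check, the sketch below tests for the optional bottom row once and then runs one of two branch-free loops. `ConvertBlock`, `ConvertRows`, and the 4-byte block size are hypothetical stand-ins, and an empty span is assumed to mean "no bottom row"; the PR's actual code differs.

```csharp
using System;

public static class LoopUnswitchSketch
{
    // Hypothetical stand-in for the per-block conversion: copies 4 bytes.
    private static void ConvertBlock(ReadOnlySpan<byte> src, Span<byte> dst)
        => src.Slice(0, 4).CopyTo(dst);

    public static void ConvertRows(
        ReadOnlySpan<byte> top,
        ReadOnlySpan<byte> bottom,
        Span<byte> topDst,
        Span<byte> bottomDst)
    {
        // The check does not depend on x, so it is evaluated once, outside the loops.
        if (bottom.IsEmpty)
        {
            for (int x = 0; x + 4 <= top.Length; x += 4)
            {
                ConvertBlock(top.Slice(x), topDst.Slice(x));
            }
        }
        else
        {
            for (int x = 0; x + 4 <= top.Length; x += 4)
            {
                ConvertBlock(top.Slice(x), topDst.Slice(x));
                ConvertBlock(bottom.Slice(x), bottomDst.Slice(x));
            }
        }
    }

    public static void Main()
    {
        byte[] top = { 1, 2, 3, 4, 5, 6, 7, 8 };
        byte[] bottom = { 9, 10, 11, 12, 13, 14, 15, 16 };
        byte[] topDst = new byte[8];
        byte[] bottomDst = new byte[8];

        ConvertRows(top, bottom, topDst, bottomDst);
        Console.WriteLine(string.Join(",", topDst));
        Console.WriteLine(string.Join(",", bottomDst));
    }
}
```

The cost is some duplicated loop code; whether that pays off is best confirmed with a benchmark rather than assumed.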

antonfirsov · 2021-11-24T11:17:09Z

src/ImageSharp/Formats/Webp/Lossy/YuvConversion.cs

@@ -312,6 +586,178 @@ public static void YuvToBgr(int y, int u, int v, Span<byte> bgr)
            bgr[0] = (byte)YuvToB(y, u);
        }

+#if SUPPORTS_RUNTIME_INTRINSICS
+
+        private static void ConvertYuvToBgrSse41(Span<byte> topY, Span<byte> bottomY, Span<byte> topDst, Span<byte> bottomDst, Span<byte> ru, Span<byte> rv, int curX, int step)


These spans can be also converted to ref-s.

I feel a bit more comfortable using spans here; I'm not sure it's really worth changing to refs.

The main overhead seems to appear here:

ImageSharp/src/ImageSharp/Formats/Webp/Lossy/YuvConversion.cs

Lines 620 to 622 in 984a725

ConvertYuv444ToRgbSse41(y.Slice(8), u.Slice(8), v.Slice(8), out Vector128<short> r1, out Vector128<short> g1, out Vector128<short> b1);
ConvertYuv444ToRgbSse41(y.Slice(16), u.Slice(16), v.Slice(16), out Vector128<short> r2, out Vector128<short> g2, out Vector128<short> b2);
ConvertYuv444ToRgbSse41(y.Slice(24), u.Slice(24), v.Slice(24), out Vector128<short> r3, out Vector128<short> g3, out Vector128<short> b3);

You can consider "de-spanning" only in this method and ConvertYuv444ToRgbSse41 replacing .Slice with Unsafe.Add to y/u/v base references.

OK, I am fine with that: 6293f72
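A minimal sketch of the "de-spanning" suggestion: take one base reference from the span and offset it with Unsafe.Add instead of calling .Slice for every block. `Sum8` and `SumBlocks` are hypothetical placeholders for the real conversion kernel, and the single up-front length check is an assumption of this sketch, since Unsafe.Add itself performs no bounds checking.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class DeSpanSketch
{
    // Hypothetical kernel: sums the 8 bytes starting at the given reference.
    private static int Sum8(ref byte start)
    {
        int sum = 0;
        for (int i = 0; i < 8; i++)
        {
            sum += Unsafe.Add(ref start, i);
        }

        return sum;
    }

    public static int[] SumBlocks(ReadOnlySpan<byte> y)
    {
        // Validate once; the offsets below are not bounds-checked.
        if (y.Length < 32)
        {
            throw new ArgumentException("Need at least 32 bytes.", nameof(y));
        }

        ref byte yRef = ref MemoryMarshal.GetReference(y);

        return new[]
        {
            Sum8(ref yRef),
            Sum8(ref Unsafe.Add(ref yRef, 8)),  // instead of y.Slice(8)
            Sum8(ref Unsafe.Add(ref yRef, 16)), // instead of y.Slice(16)
            Sum8(ref Unsafe.Add(ref yRef, 24)), // instead of y.Slice(24)
        };
    }

    public static void Main()
    {
        byte[] data = new byte[32];
        for (int i = 0; i < data.Length; i++)
        {
            data[i] = (byte)i;
        }

        Console.WriteLine(string.Join(",", SumBlocks(data))); // prints "28,92,156,220"
    }
}
```

The trade-off is the one raised in the thread: the per-call slicing overhead goes away, but so does the safety net that spans provide.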

antonfirsov · 2021-11-24T12:30:51Z

src/ImageSharp/Formats/Webp/Lossy/YuvConversion.cs

+        // Convert 32 samples of YUV444 to R/G/B
+        private static void ConvertYuv444ToRgbSse41(Span<byte> y, Span<byte> u, Span<byte> v, out Vector128<short> r, out Vector128<short> g, out Vector128<short> b)
+        {
+            Vector128<byte> y0 = LoadHigh(ref MemoryMarshal.GetReference(y));


Ignore me if you already double-checked this, but what you really need to make sure of now is that y, u, v have 8 extra bytes even at the very end; otherwise it's a memory corruption.

The code that allocates these buffers seems to add padding at first glance, but I don't really follow it:

ImageSharp/src/ImageSharp/Formats/Webp/Lossy/Vp8Decoder.cs

Lines 66 to 73 in 099a676

int extraRows = WebpConstants.FilterExtraRows[(int)LoopFilter.Complex]; // assuming worst case: complex filter
int extraY = extraRows * this.CacheYStride;
int extraUv = extraRows / 2 * this.CacheUvStride;
this.YuvBuffer = memoryAllocator.Allocate<byte>((WebpConstants.Bps * 17) + (WebpConstants.Bps * 9) + extraY);
this.CacheY = memoryAllocator.Allocate<byte>((16 * this.CacheYStride) + extraY);
int cacheUvSize = (16 * this.CacheUvStride) + extraUv;
this.CacheU = memoryAllocator.Allocate<byte>(cacheUvSize);
this.CacheV = memoryAllocator.Allocate<byte>(cacheUvSize);
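One way to make the padding requirement explicit is to check, before the reference-based load, that a full 16 bytes remain in the buffer. The helper below is a hypothetical sketch and not part of ImageSharp; it uses Unsafe.ReadUnaligned rather than the Unsafe.As cast from the PR, and in a hot path the check would more likely be a debug-only guard.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class PaddedLoadSketch
{
    // Loads 16 bytes at `offset`, refusing to read past the end of `src`.
    public static Vector128<byte> Load16Checked(ReadOnlySpan<byte> src, int offset)
    {
        // The vector load touches 16 bytes even if only the lower 8 are used afterwards.
        if ((uint)offset > (uint)src.Length || src.Length - offset < 16)
        {
            throw new ArgumentOutOfRangeException(
                nameof(offset),
                "A 16-byte load at this offset would read past the end of the buffer.");
        }

        ref byte start = ref Unsafe.Add(ref MemoryMarshal.GetReference(src), offset);
        return Unsafe.ReadUnaligned<Vector128<byte>>(ref start);
    }

    public static void Main()
    {
        // 16 payload bytes plus 8 bytes of padding.
        byte[] buffer = new byte[24];
        buffer[8] = 42;

        // Offset 8 is fine: bytes 8..23 are inside the buffer.
        Console.WriteLine(Load16Checked(buffer, 8).GetElement(0));

        try
        {
            // Offset 16 would need bytes 16..31, which do not exist.
            Load16Checked(buffer, 16);
        }
        catch (ArgumentOutOfRangeException)
        {
            Console.WriteLine("Rejected out-of-bounds load.");
        }
    }
}
```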

antonfirsov · 2021-11-25T10:00:40Z

src/ImageSharp/Formats/Webp/Lossy/YuvConversion.cs

+            Vector128<byte> y0 = LoadHigh(ref y);
+            Vector128<byte> u0 = LoadHigh(ref u);
+            Vector128<byte> v0 = LoadHigh(ref v);


Wanted to stop already but can't resist 😄 Pipelined loads may perform better:

Suggested change

-            Vector128<byte> y0 = LoadHigh(ref y);
-            Vector128<byte> u0 = LoadHigh(ref u);
-            Vector128<byte> v0 = LoadHigh(ref v);
+            // Load the bytes into the *upper* part of 16b words. That's "<< 8", basically.
+            Vector128<byte> y0 = Unsafe.As<byte, Vector128<byte>>(ref y);
+            Vector128<byte> u0 = Unsafe.As<byte, Vector128<byte>>(ref u);
+            Vector128<byte> v0 = Unsafe.As<byte, Vector128<byte>>(ref v);
+            y0 = Sse2.UnpackLow(Vector128<byte>.Zero, y0);
+            u0 = Sse2.UnpackLow(Vector128<byte>.Zero, u0);
+            v0 = Sse2.UnpackLow(Vector128<byte>.Zero, v0);

yes, makes sense. Changed with 7775c34

antonfirsov

Thanks for the patience!

brianpopow added 2 commits November 17, 2021 10:58

Move UpSample to YuvConversion class

7191aca

Add SSE41 version of UpSample

59a11bf

brianpopow added area:performance formats:webp labels Nov 17, 2021

brianpopow changed the title ~~Add SSE41 version of UpSample~~ WIP: Add SSE41 version of UpSample Nov 17, 2021

brianpopow and others added 5 commits November 17, 2021 21:59

Merge branch 'master' into bp/upscalesse

e6921e1

Upsample last block

2a03d00

Fix shuffle masks

3f43883

Fix last block

ec18321

Merge branch 'master' into bp/upscalesse

806a2ee

JimBobSquarePants and others added 4 commits November 19, 2021 22:04

Avoid implicit casting

c223d2e

Merge branch 'master' into bp/upscalesse

8985ed6

Add upsample tests

5954924

Avoid allocating uvBuffer on each upscale call

1eb1e82

brianpopow changed the title ~~WIP: Add SSE41 version of UpSample~~ Add SSE41 version of UpSample Nov 19, 2021

Merge branch 'master' into bp/upscalesse

d3a7a4a

antonfirsov reviewed Nov 19, 2021

brianpopow and others added 5 commits November 19, 2021 15:28

Change some methods to be private

c59ae02

Re-grouping the code to do identical operations

c5170f9

Co-authored-by: Anton Firszov <[email protected]>

Add InliningOptions.ShortMethod to LoadHigh

0c05727

Group load uv vectors together

d58dde0

Pass in parameters as ref to UpSample32Pixels

7cf0c32

antonfirsov reviewed Nov 19, 2021

Merge branch 'master' into bp/upscalesse

d10a747

Merge branch 'master' into bp/upscalesse

d6b25e7

antonfirsov approved these changes Nov 24, 2021

brianpopow and others added 3 commits November 24, 2021 12:50

Better version of LoadHigh

cc5f7af

Avoid branching inside loop

65870b9

Merge branch 'master' into bp/upscalesse

984a725

antonfirsov reviewed Nov 24, 2021

brianpopow added 4 commits November 24, 2021 13:57

Use ref parameters in ConvertYuv444ToBgrSse41

6293f72

Fill buffers with default values only in Debug mode

2ca81ae

Allocate clean buffers

cded607

Revert "Allocate clean buffers": the tmp buffers does not need to be …

22537b2

…clean, they will be overwritten anyway This reverts commit cded607.

JimBobSquarePants approved these changes Nov 25, 2021

antonfirsov reviewed Nov 25, 2021

brianpopow and others added 2 commits November 25, 2021 14:18

Group loading y, u, v together

7775c34

Merge branch 'master' into bp/upscalesse

8f6e9ba

antonfirsov approved these changes Nov 25, 2021

brianpopow merged commit b67a8db into master Nov 25, 2021

brianpopow deleted the bp/upscalesse branch November 25, 2021 13:39
