Jpeg encoder optimization via removing second transpose call from FDCT #1958

br3aker · 2022-01-25T13:05:38Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

Description

Closes #1778 for the encoder part, decoder part was implemented long time ago so it can be closed after this PR.

Basically, this removes second matrix transpose call from FDCT-Quantize-Huffman pipeline in the jpeg encoder. Performance gain is not that big but it's still nice:

Current master

// Encoding q=98 | color=YCbCrRatio444
// Elapsed: 11934ms across 1000 iterations
// Average: 11,934ms

PR, roughly 4% faster

// Encoding q=98 | color=YCbCrRatio444
// Elapsed: 11457ms across 1000 iterations
// Average: 11,457ms

Tests were done using Calliphora.jpg from test suite. Quality 98, no chroma subsampling.

This optimization has 'constant' delta independent of output quality so lower qualities gain better performance boost:

Current master

// Encoding q=75 | color=YCbCrRatio444
// Elapsed: 9386ms across 1000 iterations
// Average: 9,386ms

PR, roughly 5% faster

// Encoding q=75 | color=YCbCrRatio444
// Elapsed: 8900ms across 1000 iterations
// Average: 8,9ms

src/ImageSharp/Formats/Jpeg/Components/ZigZag.Intrinsic.cs

Co-authored-by: Günther Foidl <[email protected]>

antonfirsov · 2022-01-25T20:06:49Z

src/ImageSharp/Formats/Jpeg/Components/ZigZag.Intrinsic.cs

+                Vector256<int> crln_01_AB_CD = Avx.LoadVector256(shuffleVectorsPtr + (0 * 32)).AsInt32();
+                Vector256<byte> row01_AB = Avx2.PermuteVar8x32(rowAB.AsInt32(), crln_01_AB_CD).AsByte();


Any chance we can group Loads and PermuteVars along the implementation (without running out of registers) so they are pipelined?

This can bring noticeable speedup if we are lucky. See #1242 (comment)

Cross-lane permutations are not pipelined:

Loads are the main bottlenecks here with latency of 7 which can be pipelined:

Even though shuffles and permutes don't pipeline well, shuffles can be pipelined on the newer architecture:

I'll try to group them without register spilling as well but I don't have such CPU to test performance.

Nah, grouped loads code is a bit slower:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042 Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores .NET Core SDK=6.0.100-preview.3.21202.5 [Host] : .NET Core 3.1.21 (CoreCLR 4.700.21.51404, CoreFX 4.700.21.51508), X64 RyuJIT DefaultJob : .NET Core 3.1.21 (CoreCLR 4.700.21.51404, CoreFX 4.700.21.51508), X64 RyuJIT

Method Mean Error StdDev Code Size

'Avx optimized' 5.666 ns 0.1440 ns 0.2018 ns 478 B

'Avx pipelined' 6.076 ns 0.1516 ns 0.2175 ns 501 B

While loads are grouped at the assembler level, 'pipelined' version saves 6 xmm (yes, xmm, not ymm) registers to the stack while current PR version saves 4 registers to the stack for whatever reason, thus the slowdown.

Cross-lane permutations are not pipelined:

To me Latency = 3, Througput = 1 means that you can pipeline 3 instructions so the overall latency of 3 _mm256_permutevar8x32_ps instructions will be 3 cycles instead of 3 * 3 cycles, which is already a win.

While loads are grouped at the assembler level, 'pipelined' version saves 6 xmm (yes, xmm, not ymm) registers to the stack while current PR version saves 4 registers to the stack for whatever reason, thus the slowdown.

That's super weird, especially the xmm part.

Anyways, this was just an idea, the improvement in the PR is already good enough.

It's the opposite: latency is how many 'cycles' single independent op takes, throughput is how many instructions can be started in one 'cycle', it's a reciprocal value though, so value of 0.5 means 2 instructions per cycle.

Let's take unaligned loads for example, latency of 7 and thoughput 0.5 on newer architecture, let's say we have this code:

... vmovupd ymm0,[rcx] vmovupd ymm1,[rcx+20] vmovupd ymm2,[rcx+40] vmovupd ymm3,[rcx+60] ...

Due to the fancy cpu out-of-order first two vmovupd loads can be issued simultaneously because they are independent. In the best case possible code above would be executed in 8 cycles:

|1 cycle|2 cycle|3 cycle|4 cycle|5 cycle|6 cycle|7 cycle|8 cycle | |vmovupd |use results| |vmovupd |use results| |vmovupd |use results| |vmovupd |use results|

This is a very rough 'view' of what could happen because all these vmovupd must have been decoded and present in instruction queue, there should be no other port competing instructions and other complicated stuff. But general idea is this.

I think you have pipelining confused with parallel dispatch.

A pipelined instruction is one that takes more than one cycle to complete but does not tie up the execution unit completely during that time, so instructions can overlap execution within that latency window. You've correctly illustrated that for vmovupd above, where both pipelining and parallel dispatch apply, because 2 ports can accept that instruction simultaneously and up to 7 instructions can be in various stages of completion within the EU's pipeline.

In the case of vpermd, the latency is 3, meaning it takes 3 cycles from dispatch to completion, and its reciprocal throughput is 1, meaning an instruction can be dispatched and retired every cycle (this one can only be dispatched on a single port). The total execution time for 2 vpermds would be 4 cycles, with both results available on the 5th.

|1 cycle|2 cycle|3 cycle|4 cycle |5 cycle | |vpermd |use results| |vpermd |use results|

So there is potential benefit in unrolling those.

I actually did, thanks for clarifying!
But as I said, grouped loads implementation is slower. Maybe grouped permutations and shuffles would yield better performance with enough manual tweaking and looking at the binary but with 'naive' grouping them in code yields worse/same results:

Method Mean Error StdDev Code Size

'Avx detailed masks' 9.309 ns 0.0382 ns 0.0319 ns 569 B

'Avx optimized' 5.965 ns 0.0696 ns 0.0617 ns 478 B

'Avx pipelined loads' 6.095 ns 0.0346 ns 0.0307 ns 501 B

'Avx pipelined the right way' 6.424 ns 0.0314 ns 0.0278 ns 488 B

Legend:
Detailed masks - first implementation with detailed permutation/shuffle masks
Optimized - reduced permutation/shuffle masks table
Pipelined loads - grouped mask loads
The right way - grouped everything manually to get the best codegen from the 'pipeline loads' implementation

For example, here's the codegen for the first 2 rows calculation:

... vmovupd ymm0,[rcx] vmovupd ymm1,[rcx+20] vmovupd ymm2,[rcx+40] vmovdqu ymm3,ymmword ptr [rax] vmovdqu ymm4,ymmword ptr [rax+20] vmovdqu ymm5,ymmword ptr [rax+40] vmovdqu ymm6,ymmword ptr [rax+60] vmovdqu ymm7,ymmword ptr [rax+80] vpermd ymm8,ymm3,ymm0 vpermd ymm3,ymm3,ymm1 vpermd ymm9,ymm6,ymm2 vpshufb ymm8,ymm8,ymm4 vpshufb ymm3,ymm3,ymm5 vpshufb ymm4,ymm9,ymm7 vpor ymm3,ymm8,ymm3 vpor ymm3,ymm3,ymm4 vmovupd [rdx],ymm3 ...

But yet this produces same performance (if not worse, benchmark is a bit unstable for whatever reason). Maybe I'm doing something wrong, I can provide full code for all of those benchmark versions if you want to play with this code but I'm honestly okay with current performance :)

Yep, that codegen looks good. I'd be happy to look at your variants if you want to share them, but if you're already happy with the perf, I'm in no position to argue ;)

This is something that may be worth revisiting once netcoreapp3.1 is EOL. The constant vector and vector CSE support in net6.0 might help. It's difficult to get optimal code on both runtimes at present.

I'm honestly okay with current performance

As am I 😄

This has been a fantastic discussion btw, I've learned a lot!

I'll leave it here just in case you'd want to take a look:
https://gist.github.com/br3aker/4da983868b8915180747ba1f8274862e

br3aker · 2022-01-26T10:36:53Z

@gfoidl sorry for pinging but this may interest you.

AVX implementation has a bit different asm output and performance on core 3.1 and .net 6.0:

.NET Core 3.1.21 (CoreCLR 4.700.21.51404, CoreFX 4.700.21.51508), X64 RyuJIT

Method	Mean	Error	StdDev	Code Size
'Avx optimized'	5.666 ns	0.1440 ns	0.2018 ns	478 B

.NET Core 6.0.0 (CoreCLR 6.0.21.20104, CoreFX 6.0.21.20104), X64 RyuJIT

Method	Mean	Error	StdDev	Code Size
'Avx optimized'	7.009 ns	0.1292 ns	0.1209 ns	588 B

I'm not even close to understand assembly nuances but .net 6.0 does mask loads a bit different than core 3.1. Take a look at first 2 rows calculation:

core 3.1

sub       rsp,58
vzeroupper
vmovaps   [rsp+40],xmm6
vmovaps   [rsp+30],xmm7
vmovaps   [rsp+20],xmm8
vmovaps   [rsp+10],xmm9
xor       eax,eax
mov       [rsp+8],rax
mov       rax,26ECE25CA20
mov       [rsp+8],rax
vmovupd   ymm0,[rcx]
vmovupd   ymm1,[rcx+20]
vmovupd   ymm2,[rcx+40]
vmovupd   ymm3,[rcx+60]
vmovdqu   ymm4,ymmword ptr [rax]
vpermd    ymm5,ymm4,ymm1
vpermd    ymm4,ymm4,ymm0
vmovdqu   ymm6,ymmword ptr [rax+20]
vpshufb   ymm4,ymm4,ymm6
vmovdqu   ymm6,ymmword ptr [rax+40]
vpshufb   ymm5,ymm5,ymm6
vmovdqu   ymm6,ymmword ptr [rax+60]
vpermd    ymm7,ymm6,ymm2
vmovdqu   ymm8,ymmword ptr [rax+80]
vpshufb   ymm8,ymm7,ymm8
vpor      ymm5,ymm5,ymm8
vpor      ymm4,ymm4,ymm5
...

.net 6.0

sub       rsp,58
vzeroupper
vmovaps   [rsp+40],xmm6
vmovaps   [rsp+30],xmm7
vmovaps   [rsp+20],xmm8
vmovaps   [rsp+10],xmm9
mov       rax,1CDF57ECA60
mov       [rsp+8],rax
vmovupd   ymm0,[rcx]
vmovupd   ymm1,[rcx+20]
vmovupd   ymm2,[rcx+40]
vmovupd   ymm3,[rcx+60]
vmovdqu   ymm4,ymmword ptr [rax]
vpermd    ymm5,ymm4,ymm1
vpermd    ymm4,ymm4,ymm0
mov       rax,1CDF57ECA80
vmovdqu   ymm6,ymmword ptr [rax]
vpshufb   ymm4,ymm4,ymm6
mov       rax,1CDF57ECAA0
vmovdqu   ymm6,ymmword ptr [rax]
vpshufb   ymm5,ymm5,ymm6
mov       rax,1CDF57ECAC0
vmovdqu   ymm6,ymmword ptr [rax]
vpermd    ymm7,ymm6,ymm2
mov       rax,1CDF57ECAE0
vmovdqu   ymm8,ymmword ptr [rax]
vpshufb   ymm8,ymm7,ymm8
vpor      ymm5,ymm5,ymm8
vpor      ymm4,ymm4,ymm5
...

Aren't those multiple mov rax addr slower than using [rax + const] directly?

antonfirsov · 2022-01-26T15:21:06Z

The test failure is result of eager merge of #1960. I will revert bbbf687 later.

gfoidl · 2022-01-26T18:51:56Z

Aren't those multiple mov rax addr slower than using [rax + const] directly?

IMO [rax + const] should be prefered (as you expect too). Code size is also quite smaller in .NET Core 3.1. (by 110 bytes). So it looks like a regression in .NET 6.
When I look with llvm-mca at the posted assembly, the .NET Core 3.1 version has about 400 µOps less to execute -- this reflects in the shown perf-numbers too.

Can we get a minimal repro so that an issue in dotnet/runtime can be filed?

(Don't hesitate to ping me, but in the next days I'm a bit slow to respond (out of office)).

br3aker · 2022-01-26T18:59:59Z

Can we get a minimal repro so that an issue in dotnet/runtime can be filed?

I'll play around with it and ping with results (tomorrow, probably), thanks!

saucecontrol · 2022-01-28T19:03:20Z

Here's a simple repro for that codegen issue, along with "one weird trick" to work around it 😂

Sharplab

br3aker · 2022-01-28T20:35:46Z

Got a bit ill so couldn't work these 2 days.

"one weird trick" to work around it

I've actually tried to do entire thing with 'ref pointers' and got worse performance, didn't check assembly though.

saucecontrol · 2022-01-28T20:41:12Z

The codegen is extremely finicky. Note the Unsafe.Add with 0 offset in that example. Take that out, and it regresses again. We're discussing that over in the #lowlevel channel on csharp discord. You may want to get in on that https://discord.gg/csharp

Hope you feel better!

Dmitry Pentin added 2 commits January 25, 2022 14:58

Fused transpose/zigzag implementation

477716c

Tests fix

219119a

JimBobSquarePants requested a review from antonfirsov January 25, 2022 13:27

JimBobSquarePants added this to the 2.0.0 milestone Jan 25, 2022

JimBobSquarePants added area:performance formats:jpeg labels Jan 25, 2022

gfoidl reviewed Jan 25, 2022

View reviewed changes

src/ImageSharp/Formats/Jpeg/Components/ZigZag.Intrinsic.cs Outdated Show resolved Hide resolved

gfoidl len check removal

69458f4

Co-authored-by: Günther Foidl <[email protected]>

antonfirsov reviewed Jan 25, 2022

View reviewed changes

Dmitry Pentin added 2 commits January 26, 2022 13:44

Fixed compilation error

b6400c2

Merge branch 'master' into dp/jpeg-encoder-zigzag

4ac5cc5

antonfirsov mentioned this pull request Jan 26, 2022

Move RentReturnRelease_SubsequentRentReturnsDifferentHandles out of process #1960

Merged

4 tasks

JimBobSquarePants approved these changes Jan 30, 2022

View reviewed changes

JimBobSquarePants merged commit aff08f7 into SixLabors:master Jan 30, 2022

br3aker deleted the dp/jpeg-encoder-zigzag branch January 31, 2022 12:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Jpeg encoder optimization via removing second transpose call from FDCT #1958

Jpeg encoder optimization via removing second transpose call from FDCT #1958

br3aker commented Jan 25, 2022

antonfirsov Jan 25, 2022 •

edited

Loading

br3aker Jan 26, 2022 •

edited

Loading

br3aker Jan 26, 2022 •

edited

Loading

antonfirsov Jan 26, 2022 •

edited

Loading

br3aker Jan 26, 2022 •

edited

Loading

saucecontrol Jan 28, 2022

br3aker Jan 30, 2022 •

edited

Loading

saucecontrol Jan 30, 2022

JimBobSquarePants Jan 30, 2022

br3aker Jan 31, 2022

br3aker commented Jan 26, 2022 •

edited

Loading

antonfirsov commented Jan 26, 2022

gfoidl commented Jan 26, 2022

br3aker commented Jan 26, 2022

saucecontrol commented Jan 28, 2022 •

edited

Loading

br3aker commented Jan 28, 2022

saucecontrol commented Jan 28, 2022

		Vector256<int> crln_01_AB_CD = Avx.LoadVector256(shuffleVectorsPtr + (0 * 32)).AsInt32();
		Vector256<byte> row01_AB = Avx2.PermuteVar8x32(rowAB.AsInt32(), crln_01_AB_CD).AsByte();

Method	Mean	Error	StdDev	Code Size
'Avx detailed masks'	9.309 ns	0.0382 ns	0.0319 ns	569 B
'Avx optimized'	5.965 ns	0.0696 ns	0.0617 ns	478 B
'Avx pipelined loads'	6.095 ns	0.0346 ns	0.0307 ns	501 B
'Avx pipelined the right way'	6.424 ns	0.0314 ns	0.0278 ns	488 B

Jpeg encoder optimization via removing second transpose call from FDCT #1958

Jpeg encoder optimization via removing second transpose call from FDCT #1958

Conversation

br3aker commented Jan 25, 2022

Prerequisites

Description

antonfirsov Jan 25, 2022 • edited Loading

Choose a reason for hiding this comment

br3aker Jan 26, 2022 • edited Loading

Choose a reason for hiding this comment

br3aker Jan 26, 2022 • edited Loading

Choose a reason for hiding this comment

antonfirsov Jan 26, 2022 • edited Loading

Choose a reason for hiding this comment

br3aker Jan 26, 2022 • edited Loading

Choose a reason for hiding this comment

saucecontrol Jan 28, 2022

Choose a reason for hiding this comment

br3aker Jan 30, 2022 • edited Loading

Choose a reason for hiding this comment

saucecontrol Jan 30, 2022

Choose a reason for hiding this comment

JimBobSquarePants Jan 30, 2022

Choose a reason for hiding this comment

br3aker Jan 31, 2022

Choose a reason for hiding this comment

br3aker commented Jan 26, 2022 • edited Loading

antonfirsov commented Jan 26, 2022

gfoidl commented Jan 26, 2022

br3aker commented Jan 26, 2022

saucecontrol commented Jan 28, 2022 • edited Loading

br3aker commented Jan 28, 2022

saucecontrol commented Jan 28, 2022

antonfirsov Jan 25, 2022 •

edited

Loading

br3aker Jan 26, 2022 •

edited

Loading

br3aker Jan 26, 2022 •

edited

Loading

antonfirsov Jan 26, 2022 •

edited

Loading

br3aker Jan 26, 2022 •

edited

Loading

br3aker Jan 30, 2022 •

edited

Loading

br3aker commented Jan 26, 2022 •

edited

Loading

saucecontrol commented Jan 28, 2022 •

edited

Loading