
Jpeg encoder optimization via removing second transpose call from FDCT #1958

Merged: 5 commits merged into SixLabors:master on Jan 30, 2022

Conversation

@br3aker (Contributor) commented Jan 25, 2022

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Closes #1778 for the encoder part; the decoder part was implemented a long time ago, so the issue can be closed after this PR.

Basically, this removes the second matrix transpose call from the FDCT-Quantize-Huffman pipeline in the JPEG encoder. The performance gain is not huge, but it's still a nice win:

Current master

// Encoding q=98 | color=YCbCrRatio444
// Elapsed: 11934ms across 1000 iterations
// Average: 11,934ms

PR, roughly 4% faster

// Encoding q=98 | color=YCbCrRatio444
// Elapsed: 11457ms across 1000 iterations
// Average: 11,457ms

Tests were done using Calliphora.jpg from the test suite, at quality 98 with no chroma subsampling.

This optimization has a 'constant' delta independent of output quality, so lower quality settings see a relatively larger performance boost:

Current master

// Encoding q=75 | color=YCbCrRatio444
// Elapsed: 9386ms across 1000 iterations
// Average: 9,386ms

PR, roughly 5% faster

// Encoding q=75 | color=YCbCrRatio444
// Elapsed: 8900ms across 1000 iterations
// Average: 8,900ms
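
For readers who want the underlying idea spelled out (fusing the transpose into the zig-zag step so the second transpose disappears from the hot path), here is a minimal scalar sketch. It is illustrative only: the class and method names are invented, and the PR's actual change lives in the SIMD encoding path discussed in the review below.

using System;

internal static class TransposedZigZagSketch
{
    // Standard JPEG zig-zag scan order over an 8x8 block (row-major indices).
    private static readonly byte[] ZigZag =
    {
         0,  1,  8, 16,  9,  2,  3, 10,
        17, 24, 32, 25, 18, 11,  4,  5,
        12, 19, 26, 33, 40, 48, 41, 34,
        27, 20, 13,  6,  7, 14, 21, 28,
        35, 42, 49, 56, 57, 50, 43, 36,
        29, 22, 15, 23, 30, 37, 44, 51,
        58, 59, 52, 45, 38, 31, 39, 46,
        53, 60, 61, 54, 47, 55, 62, 63
    };

    // Swap row and column of every zig-zag index. Reading the *untransposed*
    // block through this table yields the same coefficient order as doing
    // transpose + standard zig-zag, so the explicit transpose can be dropped.
    internal static byte[] BuildTransposedZigZag()
    {
        var transposed = new byte[64];
        for (int i = 0; i < 64; i++)
        {
            int index = ZigZag[i];
            transposed[i] = (byte)(((index % 8) * 8) + (index / 8));
        }

        return transposed;
    }

    // Gather coefficients directly from the FDCT output: no second transpose needed.
    internal static void ApplyZigZag(ReadOnlySpan<short> block, Span<short> destination, byte[] order)
    {
        for (int i = 0; i < 64; i++)
        {
            destination[i] = block[order[i]];
        }
    }
}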

Co-authored-by: Günther Foidl <[email protected]>
Comment on lines +254 to +255
Vector256<int> crln_01_AB_CD = Avx.LoadVector256(shuffleVectorsPtr + (0 * 32)).AsInt32();
Vector256<byte> row01_AB = Avx2.PermuteVar8x32(rowAB.AsInt32(), crln_01_AB_CD).AsByte();

@antonfirsov (Member) commented Jan 25, 2022

Any chance we can group the Loads and PermuteVars throughout the implementation (without running out of registers) so they are pipelined?

This can bring noticeable speedup if we are lucky. See #1242 (comment)
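
For illustration, the kind of grouping being suggested might look roughly like the sketch below. This is only a sketch: the method shape, the rowCD input, and the second mask offset are assumptions, not the PR's actual code.

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class GroupedPermuteSketch
{
    // Issue the independent mask loads first, then the cross-lane permutes,
    // instead of alternating load/permute pairs, so the out-of-order core has
    // more independent work in flight at once.
    internal static void PermuteGrouped(
        byte* shuffleVectorsPtr,
        Vector256<byte> rowAB,
        Vector256<byte> rowCD, // assumed second input row pair
        out Vector256<byte> row01_AB,
        out Vector256<byte> row23_CD)
    {
        Vector256<int> mask01 = Avx.LoadVector256(shuffleVectorsPtr + (0 * 32)).AsInt32();
        Vector256<int> mask23 = Avx.LoadVector256(shuffleVectorsPtr + (1 * 32)).AsInt32();

        row01_AB = Avx2.PermuteVar8x32(rowAB.AsInt32(), mask01).AsByte();
        row23_CD = Avx2.PermuteVar8x32(rowCD.AsInt32(), mask23).AsByte();
    }
}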

@br3aker (Contributor, Author) commented Jan 26, 2022

Cross-lane permutations are not pipelined:
[screenshot: instruction latency/throughput data]

Loads are the main bottleneck here, with a latency of 7, and they can be pipelined:
[screenshot: instruction latency/throughput data]

Even though shuffles and permutes don't pipeline well, shuffles can be pipelined on newer architectures:
[screenshot: instruction latency/throughput data]

I'll try to group them without register spilling as well, but I don't have such a CPU to test the performance on.

@br3aker (Contributor, Author) commented Jan 26, 2022

Nah, the grouped-loads code is a bit slower:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=6.0.100-preview.3.21202.5
  [Host]     : .NET Core 3.1.21 (CoreCLR 4.700.21.51404, CoreFX 4.700.21.51508), X64 RyuJIT
  DefaultJob : .NET Core 3.1.21 (CoreCLR 4.700.21.51404, CoreFX 4.700.21.51508), X64 RyuJIT

| Method          |     Mean |     Error |    StdDev | Code Size |
|---------------- |---------:|----------:|----------:|----------:|
| 'Avx optimized' | 5.666 ns | 0.1440 ns | 0.2018 ns |     478 B |
| 'Avx pipelined' | 6.076 ns | 0.1516 ns | 0.2175 ns |     501 B |

While the loads are grouped at the assembly level, the 'pipelined' version saves 6 xmm (yes, xmm, not ymm) registers to the stack whereas the current PR version saves only 4, for whatever reason; hence the slowdown.

@antonfirsov (Member) commented Jan 26, 2022

> Cross-lane permutations are not pipelined:

To me, Latency = 3, Throughput = 1 means that you can pipeline 3 instructions, so the overall latency of 3 _mm256_permutevar8x32_ps instructions will be 3 cycles instead of 3 * 3 cycles, which is already a win.

> While the loads are grouped at the assembly level, the 'pipelined' version saves 6 xmm (yes, xmm, not ymm) registers to the stack whereas the current PR version saves only 4, for whatever reason; hence the slowdown.

That's super weird, especially the xmm part.

Anyway, this was just an idea; the improvement in the PR is already good enough.

@br3aker (Contributor, Author) commented Jan 26, 2022

It's the opposite: latency is how many cycles a single independent op takes, and throughput is how many instructions can be started in one cycle. It's given as a reciprocal value though, so a value of 0.5 means 2 instructions per cycle.

Let's take unaligned loads as an example: latency of 7 and throughput of 0.5 on newer architectures. Say we have this code:

...
vmovupd   ymm0,[rcx]
vmovupd   ymm1,[rcx+20]
vmovupd   ymm2,[rcx+40]
vmovupd   ymm3,[rcx+60]
...

Thanks to the CPU's fancy out-of-order execution, the first two vmovupd loads can be issued simultaneously because they are independent. In the best possible case the code above would be executed in 8 cycles:

|1 cycle|2 cycle|3 cycle|4 cycle|5 cycle|6 cycle|7 cycle|8 cycle    |9 cycle    |
|vmovupd                                                |use results|           |
|vmovupd                                                |use results|           |
|       |vmovupd                                                    |use results|
|       |vmovupd                                                    |use results|

This is a very rough view of what could happen: all these vmovupd instructions must already have been decoded and be sitting in the instruction queue, there must be no other instructions competing for the ports, and so on. But the general idea is this.

Contributor:

I think you have pipelining confused with parallel dispatch.

A pipelined instruction is one that takes more than one cycle to complete but does not tie up the execution unit completely during that time, so instructions can overlap execution within that latency window. You've correctly illustrated that for vmovupd above, where both pipelining and parallel dispatch apply, because 2 ports can accept that instruction simultaneously and up to 7 instructions can be in various stages of completion within the EU's pipeline.

In the case of vpermd, the latency is 3, meaning it takes 3 cycles from dispatch to completion, and its reciprocal throughput is 1, meaning an instruction can be dispatched and retired every cycle (this one can only be dispatched on a single port). The total execution time for 2 vpermds would be 4 cycles, with both results available on the 5th.

|1 cycle|2 cycle|3 cycle|4 cycle    |5 cycle    |
|vpermd                 |use results|
        |vpermd                     |use results|

So there is potential benefit in unrolling those.

@br3aker (Contributor, Author) commented Jan 30, 2022

I actually did, thanks for clarifying!
But as I said, the grouped-loads implementation is slower. Grouped permutations and shuffles might yield better performance with enough manual tweaking and inspection of the binary, but 'naive' grouping in code yields the same or worse results:

| Method                        |     Mean |     Error |    StdDev | Code Size |
|------------------------------ |---------:|----------:|----------:|----------:|
| 'Avx detailed masks'          | 9.309 ns | 0.0382 ns | 0.0319 ns |     569 B |
| 'Avx optimized'               | 5.965 ns | 0.0696 ns | 0.0617 ns |     478 B |
| 'Avx pipelined loads'         | 6.095 ns | 0.0346 ns | 0.0307 ns |     501 B |
| 'Avx pipelined the right way' | 6.424 ns | 0.0314 ns | 0.0278 ns |     488 B |

Legend:
  • Detailed masks: the first implementation, with detailed permutation/shuffle masks
  • Optimized: reduced permutation/shuffle mask table
  • Pipelined loads: grouped mask loads
  • The right way: everything grouped manually to get the best codegen out of the 'pipelined loads' implementation

For example, here's the codegen for the calculation of the first two rows:

...
vmovupd   ymm0,[rcx]
vmovupd   ymm1,[rcx+20]
vmovupd   ymm2,[rcx+40]
vmovdqu   ymm3,ymmword ptr [rax]
vmovdqu   ymm4,ymmword ptr [rax+20]
vmovdqu   ymm5,ymmword ptr [rax+40]
vmovdqu   ymm6,ymmword ptr [rax+60]
vmovdqu   ymm7,ymmword ptr [rax+80]
vpermd    ymm8,ymm3,ymm0
vpermd    ymm3,ymm3,ymm1
vpermd    ymm9,ymm6,ymm2
vpshufb   ymm8,ymm8,ymm4
vpshufb   ymm3,ymm3,ymm5
vpshufb   ymm4,ymm9,ymm7
vpor      ymm3,ymm8,ymm3
vpor      ymm3,ymm3,ymm4
vmovupd   [rdx],ymm3
...

And yet this produces the same performance (if not worse; the benchmark is a bit unstable for whatever reason). Maybe I'm doing something wrong. I can provide the full code for all of those benchmark versions if you want to play with it, but I'm honestly okay with the current performance :)

Contributor:

Yep, that codegen looks good. I'd be happy to look at your variants if you want to share them, but if you're already happy with the perf, I'm in no position to argue ;)

This is something that may be worth revisiting once netcoreapp3.1 is EOL. The constant vector and vector CSE support in net6.0 might help. It's difficult to get optimal code on both runtimes at present.

Member:

> I'm honestly okay with the current performance

As am I 😄

This has been a fantastic discussion btw, I've learned a lot!

@br3aker (Contributor, Author):

I'll leave it here just in case you'd want to take a look:
https://gist.github.com/br3aker/4da983868b8915180747ba1f8274862e

@br3aker (Contributor, Author) commented Jan 26, 2022

@gfoidl sorry for the ping, but this may interest you.

The AVX implementation has slightly different asm output and performance on .NET Core 3.1 and .NET 6.0:

.NET Core 3.1.21 (CoreCLR 4.700.21.51404, CoreFX 4.700.21.51508), X64 RyuJIT

| Method          |     Mean |     Error |    StdDev | Code Size |
|---------------- |---------:|----------:|----------:|----------:|
| 'Avx optimized' | 5.666 ns | 0.1440 ns | 0.2018 ns |     478 B |

.NET Core 6.0.0 (CoreCLR 6.0.21.20104, CoreFX 6.0.21.20104), X64 RyuJIT

| Method          |     Mean |     Error |    StdDev | Code Size |
|---------------- |---------:|----------:|----------:|----------:|
| 'Avx optimized' | 7.009 ns | 0.1292 ns | 0.1209 ns |     588 B |

I'm not even close to understanding assembly nuances, but .NET 6.0 does the mask loads a bit differently than Core 3.1. Take a look at the calculation of the first two rows:

core 3.1

sub       rsp,58
vzeroupper
vmovaps   [rsp+40],xmm6
vmovaps   [rsp+30],xmm7
vmovaps   [rsp+20],xmm8
vmovaps   [rsp+10],xmm9
xor       eax,eax
mov       [rsp+8],rax
mov       rax,26ECE25CA20
mov       [rsp+8],rax
vmovupd   ymm0,[rcx]
vmovupd   ymm1,[rcx+20]
vmovupd   ymm2,[rcx+40]
vmovupd   ymm3,[rcx+60]
vmovdqu   ymm4,ymmword ptr [rax]
vpermd    ymm5,ymm4,ymm1
vpermd    ymm4,ymm4,ymm0
vmovdqu   ymm6,ymmword ptr [rax+20]
vpshufb   ymm4,ymm4,ymm6
vmovdqu   ymm6,ymmword ptr [rax+40]
vpshufb   ymm5,ymm5,ymm6
vmovdqu   ymm6,ymmword ptr [rax+60]
vpermd    ymm7,ymm6,ymm2
vmovdqu   ymm8,ymmword ptr [rax+80]
vpshufb   ymm8,ymm7,ymm8
vpor      ymm5,ymm5,ymm8
vpor      ymm4,ymm4,ymm5
...

.net 6.0

sub       rsp,58
vzeroupper
vmovaps   [rsp+40],xmm6
vmovaps   [rsp+30],xmm7
vmovaps   [rsp+20],xmm8
vmovaps   [rsp+10],xmm9
mov       rax,1CDF57ECA60
mov       [rsp+8],rax
vmovupd   ymm0,[rcx]
vmovupd   ymm1,[rcx+20]
vmovupd   ymm2,[rcx+40]
vmovupd   ymm3,[rcx+60]
vmovdqu   ymm4,ymmword ptr [rax]
vpermd    ymm5,ymm4,ymm1
vpermd    ymm4,ymm4,ymm0
mov       rax,1CDF57ECA80
vmovdqu   ymm6,ymmword ptr [rax]
vpshufb   ymm4,ymm4,ymm6
mov       rax,1CDF57ECAA0
vmovdqu   ymm6,ymmword ptr [rax]
vpshufb   ymm5,ymm5,ymm6
mov       rax,1CDF57ECAC0
vmovdqu   ymm6,ymmword ptr [rax]
vpermd    ymm7,ymm6,ymm2
mov       rax,1CDF57ECAE0
vmovdqu   ymm8,ymmword ptr [rax]
vpshufb   ymm8,ymm7,ymm8
vpor      ymm5,ymm5,ymm8
vpor      ymm4,ymm4,ymm5
...

Aren't those multiple mov rax, addr instructions slower than using [rax + const] directly?

@antonfirsov (Member):

The test failure is a result of the eager merge of #1960. I will revert bbbf687 later.

@gfoidl (Contributor) commented Jan 26, 2022

> Aren't those multiple mov rax, addr instructions slower than using [rax + const] directly?

IMO [rax + const] should be preferred (as you expect too). Code size is also quite a bit smaller in .NET Core 3.1 (by 110 bytes), so it looks like a regression in .NET 6.
When I look at the posted assembly with llvm-mca, the .NET Core 3.1 version has about 400 µOps less to execute; this is reflected in the shown perf numbers too.

Can we get a minimal repro so that an issue in dotnet/runtime can be filed?

(Don't hesitate to ping me, but over the next few days I'll be a bit slow to respond (out of office).)

@br3aker (Contributor, Author) commented Jan 26, 2022

> Can we get a minimal repro so that an issue in dotnet/runtime can be filed?

I'll play around with it and ping you with the results (tomorrow, probably). Thanks!

@saucecontrol (Contributor) commented Jan 28, 2022

Here's a simple repro for that codegen issue, along with "one weird trick" to work around it 😂

Sharplab

@br3aker (Contributor, Author) commented Jan 28, 2022

Got a bit ill, so I couldn't work these last 2 days.

"one weird trick" to work around it

I actually tried to do the entire thing with 'ref pointers' and got worse performance; I didn't check the assembly though.

@saucecontrol (Contributor):

The codegen is extremely finicky. Note the Unsafe.Add with 0 offset in that example. Take that out, and it regresses again. We're discussing that over in the #lowlevel channel on the C# Discord; you may want to get in on that: https://discord.gg/csharp

Hope you feel better!
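
For context, the kind of workaround being described might look roughly like the sketch below. This is a guess at the shape of the trick, not the actual Sharplab repro: the class name, the mask data, and the load helper are all invented for illustration.

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

internal static class UnsafeAddZeroSketch
{
    // A ReadOnlySpan<byte> over a constant array gets baked into the binary as data.
    // The values here are dummies, only present to make the sketch self-contained.
    private static ReadOnlySpan<byte> Masks => new byte[]
    {
        0, 1, 2, 3, 4, 5, 6, 7,  7, 6, 5, 4, 3, 2, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0,  0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 3, 2, 5, 4, 7, 6,  6, 7, 4, 5, 2, 3, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0,  0, 0, 0, 0, 0, 0, 0, 0,
    };

    internal static Vector256<int> LoadMask(int index)
    {
        // The 'weird trick' reportedly hinges on re-rooting the constant data
        // through Unsafe.Add(..., 0) before the loads; per the discussion above,
        // removing that zero-offset Add brings the codegen regression back.
        // This is only a guess at how the pattern might look in practice.
        ref byte masks = ref Unsafe.Add(ref MemoryMarshal.GetReference(Masks), 0);
        return Unsafe.ReadUnaligned<Vector256<int>>(ref Unsafe.Add(ref masks, index * 32));
    }
}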

@JimBobSquarePants JimBobSquarePants merged commit aff08f7 into SixLabors:master Jan 30, 2022
@br3aker br3aker deleted the dp/jpeg-encoder-zigzag branch January 31, 2022 12:39
Development

Successfully merging this pull request may close these issues.

Jpeg encoder can be further optimized by fusing transpose and zig-zag steps