Jpeg encoder optimization via removing second transpose call from FDCT #1958
Conversation
Co-authored-by: Günther Foidl <[email protected]>
Vector256<int> crln_01_AB_CD = Avx.LoadVector256(shuffleVectorsPtr + (0 * 32)).AsInt32();
Vector256<byte> row01_AB = Avx2.PermuteVar8x32(rowAB.AsInt32(), crln_01_AB_CD).AsByte();
Any chance we can group the Loads and PermuteVars throughout the implementation (without running out of registers) so they are pipelined?
This can bring a noticeable speedup if we are lucky. See #1242 (comment)
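Roughly this shape is what I mean; just a sketch, not tested, and all names here (the class, the method, and the second mask/row) are made up for illustration:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class GroupingSketch
{
    // Sketch of the "group loads, then group permutes" idea: the two mask loads are
    // issued back to back, then the two cross-lane permutes. Each pair is mutually
    // independent, so the instructions get a chance to overlap in the pipeline.
    public static unsafe void PermuteTwoRows(
        byte* shuffleVectorsPtr,
        Vector256<byte> rowAB,
        Vector256<byte> rowCD,
        out Vector256<byte> permutedAB,
        out Vector256<byte> permutedCD)
    {
        // group the mask loads ...
        Vector256<int> mask0 = Avx.LoadVector256(shuffleVectorsPtr + (0 * 32)).AsInt32();
        Vector256<int> mask1 = Avx.LoadVector256(shuffleVectorsPtr + (1 * 32)).AsInt32();

        // ... then group the cross-lane permutes
        permutedAB = Avx2.PermuteVar8x32(rowAB.AsInt32(), mask0).AsByte();
        permutedCD = Avx2.PermuteVar8x32(rowCD.AsInt32(), mask1).AsByte();
    }
}
```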
Cross-lane permutations are not pipelined.
Loads are the main bottleneck here, with a latency of 7, and they can be pipelined.
Even though shuffles and permutes don't pipeline well, shuffles can be pipelined on the newer architectures.
I'll try to group them without register spilling as well, but I don't have such a CPU to test performance on.
Nah, the grouped-loads code is a bit slower:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=6.0.100-preview.3.21202.5
[Host] : .NET Core 3.1.21 (CoreCLR 4.700.21.51404, CoreFX 4.700.21.51508), X64 RyuJIT
DefaultJob : .NET Core 3.1.21 (CoreCLR 4.700.21.51404, CoreFX 4.700.21.51508), X64 RyuJIT
| Method | Mean | Error | StdDev | Code Size |
|---|---|---|---|---|
| 'Avx optimized' | 5.666 ns | 0.1440 ns | 0.2018 ns | 478 B |
| 'Avx pipelined' | 6.076 ns | 0.1516 ns | 0.2175 ns | 501 B |
While the loads are grouped at the assembly level, the 'pipelined' version saves 6 `xmm` (yes, `xmm`, not `ymm`) registers to the stack, while the current PR version saves 4 registers to the stack for whatever reason, hence the slowdown.
> Cross-lane permutations are not pipelined.

To me, Latency = 3, Throughput = 1 means that you can pipeline 3 instructions, so the overall latency of 3 `_mm256_permutevar8x32_ps` instructions will be 3 cycles instead of 3 * 3 cycles, which is already a win.
> While the loads are grouped at the assembly level, the 'pipelined' version saves 6 `xmm` (yes, `xmm`, not `ymm`) registers to the stack, while the current PR version saves 4 registers to the stack for whatever reason, hence the slowdown.
That's super weird, especially the xmm part.
Anyways, this was just an idea, the improvement in the PR is already good enough.
It's the opposite: latency is how many 'cycles' a single independent op takes; throughput is how many instructions can be started per 'cycle' (it's a reciprocal value though, so a value of 0.5 means 2 instructions per cycle).
Let's take unaligned loads for example, with a latency of 7 and throughput of 0.5 on newer architectures. Let's say we have this code:
...
vmovupd ymm0,[rcx]
vmovupd ymm1,[rcx+20]
vmovupd ymm2,[rcx+40]
vmovupd ymm3,[rcx+60]
...
Due to the CPU's fancy out-of-order execution, the first two `vmovupd` loads can be issued simultaneously because they are independent. In the best possible case, the code above would be executed in 8 cycles:
| 1 cycle | 2 cycle | 3 cycle | 4 cycle | 5 cycle | 6 cycle | 7 cycle | 8 cycle |
|---|---|---|---|---|---|---|---|
| vmovupd | | | | | | | use results |
| vmovupd | | | | | | | use results |
| vmovupd | | | | | | | use results |
| vmovupd | | | | | | | use results |
This is a very rough 'view' of what could happen, because all these `vmovupd`s must have already been decoded and be present in the instruction queue, there should be no other instructions competing for the ports, and other complicated stuff. But the general idea is this.
I think you have pipelining confused with parallel dispatch.
A pipelined instruction is one that takes more than one cycle to complete but does not tie up the execution unit completely during that time, so instructions can overlap execution within that latency window. You've correctly illustrated that for `vmovupd` above, where both pipelining and parallel dispatch apply, because 2 ports can accept that instruction simultaneously and up to 7 instructions can be in various stages of completion within the EU's pipeline.
In the case of `vpermd`, the latency is 3, meaning it takes 3 cycles from dispatch to completion, and its reciprocal throughput is 1, meaning an instruction can be dispatched and retired every cycle (this one can only be dispatched on a single port). The total execution time for 2 `vpermd`s would be 4 cycles, with both results available on the 5th.
| 1 cycle | 2 cycle | 3 cycle | 4 cycle | 5 cycle |
|---|---|---|---|---|
| vpermd | | | use results | |
| | vpermd | | | use results |
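As a rough rule of thumb for n independent instructions with latency L and reciprocal throughput T (my paraphrase of the above, ignoring decode and port contention): `completion ≈ L + (n - 1) * T` cycles. For the two `vpermd`s that is 3 + 1 * 1 = 4 cycles, with both results usable on the 5th; for the four `vmovupd`s earlier it gives 7 + 3 * 0.5 ≈ 8.5 cycles, i.e. roughly the 8-cycle estimate above.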
So there is potential benefit in unrolling those.
I actually did, thanks for clarifying!
But as I said, the grouped-loads implementation is slower. Maybe grouping the permutations and shuffles would yield better performance with enough manual tweaking and inspection of the binary, but 'naively' grouping them in code yields the same or worse results:
| Method | Mean | Error | StdDev | Code Size |
|---|---|---|---|---|
| 'Avx detailed masks' | 9.309 ns | 0.0382 ns | 0.0319 ns | 569 B |
| 'Avx optimized' | 5.965 ns | 0.0696 ns | 0.0617 ns | 478 B |
| 'Avx pipelined loads' | 6.095 ns | 0.0346 ns | 0.0307 ns | 501 B |
| 'Avx pipelined the right way' | 6.424 ns | 0.0314 ns | 0.0278 ns | 488 B |
Legend:
- Detailed masks - first implementation with detailed permutation/shuffle masks
- Optimized - reduced permutation/shuffle masks table
- Pipelined loads - grouped mask loads
- The right way - grouped everything manually to get the best codegen from the 'pipelined loads' implementation
For example, here's the codegen for the first 2 rows calculation:
...
vmovupd ymm0,[rcx]
vmovupd ymm1,[rcx+20]
vmovupd ymm2,[rcx+40]
vmovdqu ymm3,ymmword ptr [rax]
vmovdqu ymm4,ymmword ptr [rax+20]
vmovdqu ymm5,ymmword ptr [rax+40]
vmovdqu ymm6,ymmword ptr [rax+60]
vmovdqu ymm7,ymmword ptr [rax+80]
vpermd ymm8,ymm3,ymm0
vpermd ymm3,ymm3,ymm1
vpermd ymm9,ymm6,ymm2
vpshufb ymm8,ymm8,ymm4
vpshufb ymm3,ymm3,ymm5
vpshufb ymm4,ymm9,ymm7
vpor ymm3,ymm8,ymm3
vpor ymm3,ymm3,ymm4
vmovupd [rdx],ymm3
...
And yet this produces the same performance (if not worse; the benchmark is a bit unstable for whatever reason). Maybe I'm doing something wrong. I can provide the full code for all of those benchmark versions if you want to play with it, but I'm honestly okay with the current performance :)
Yep, that codegen looks good. I'd be happy to look at your variants if you want to share them, but if you're already happy with the perf, I'm in no position to argue ;)
This is something that may be worth revisiting once netcoreapp3.1 is EOL. The constant vector and vector CSE support in net6.0 might help. It's difficult to get optimal code on both runtimes at present.
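To illustrate what the constant-vector CSE refers to (a generic sketch, not ImageSharp code; the class, method, and constant are made up):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ConstantVectorCseSketch
{
    // A newer JIT can recognize that both uses of the constant below are identical and
    // keep it in a single register; an older runtime may rebuild the constant, or
    // re-materialize its data address, at every use.
    public static Vector256<int> AddBiasTwice(Vector256<int> x, Vector256<int> y)
    {
        Vector256<int> a = Avx2.Add(x, Vector256.Create(7)); // constant vector
        Vector256<int> b = Avx2.Add(y, Vector256.Create(7)); // same constant again
        return Avx2.Add(a, b);
    }
}
```

That repeated re-materialization is essentially the pattern of repeated `mov rax, <address>` loads that comes up further down in this thread.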
> I'm honestly okay with current performance
As am I 😄
This has been a fantastic discussion btw, I've learned a lot!
I'll leave it here just in case you'd want to take a look:
https://gist.github.com/br3aker/4da983868b8915180747ba1f8274862e
@gfoidl sorry for pinging, but this may interest you. The AVX implementation has somewhat different asm output and performance on .NET Core 3.1.21 (CoreCLR 4.700.21.51404, CoreFX 4.700.21.51508), X64 RyuJIT versus .NET Core 6.0.0 (CoreCLR 6.0.21.20104, CoreFX 6.0.21.20104), X64 RyuJIT.
I'm not even close to understanding assembly nuances, but:
sub rsp,58
vzeroupper
vmovaps [rsp+40],xmm6
vmovaps [rsp+30],xmm7
vmovaps [rsp+20],xmm8
vmovaps [rsp+10],xmm9
xor eax,eax
mov [rsp+8],rax
mov rax,26ECE25CA20
mov [rsp+8],rax
vmovupd ymm0,[rcx]
vmovupd ymm1,[rcx+20]
vmovupd ymm2,[rcx+40]
vmovupd ymm3,[rcx+60]
vmovdqu ymm4,ymmword ptr [rax]
vpermd ymm5,ymm4,ymm1
vpermd ymm4,ymm4,ymm0
vmovdqu ymm6,ymmword ptr [rax+20]
vpshufb ymm4,ymm4,ymm6
vmovdqu ymm6,ymmword ptr [rax+40]
vpshufb ymm5,ymm5,ymm6
vmovdqu ymm6,ymmword ptr [rax+60]
vpermd ymm7,ymm6,ymm2
vmovdqu ymm8,ymmword ptr [rax+80]
vpshufb ymm8,ymm7,ymm8
vpor ymm5,ymm5,ymm8
vpor ymm4,ymm4,ymm5
...
sub rsp,58
vzeroupper
vmovaps [rsp+40],xmm6
vmovaps [rsp+30],xmm7
vmovaps [rsp+20],xmm8
vmovaps [rsp+10],xmm9
mov rax,1CDF57ECA60
mov [rsp+8],rax
vmovupd ymm0,[rcx]
vmovupd ymm1,[rcx+20]
vmovupd ymm2,[rcx+40]
vmovupd ymm3,[rcx+60]
vmovdqu ymm4,ymmword ptr [rax]
vpermd ymm5,ymm4,ymm1
vpermd ymm4,ymm4,ymm0
mov rax,1CDF57ECA80
vmovdqu ymm6,ymmword ptr [rax]
vpshufb ymm4,ymm4,ymm6
mov rax,1CDF57ECAA0
vmovdqu ymm6,ymmword ptr [rax]
vpshufb ymm5,ymm5,ymm6
mov rax,1CDF57ECAC0
vmovdqu ymm6,ymmword ptr [rax]
vpermd ymm7,ymm6,ymm2
mov rax,1CDF57ECAE0
vmovdqu ymm8,ymmword ptr [rax]
vpshufb ymm8,ymm7,ymm8
vpor ymm5,ymm5,ymm8
vpor ymm4,ymm4,ymm5
...

Aren't those multiple `mov rax` address loads in the second listing redundant?
IMO, yes. Can we get a minimal repro so that an issue in dotnet/runtime can be filed? (Don't hesitate to ping me, but in the next few days I'm a bit slow to respond (out of office)).
I'll play around with it and ping you with the results (tomorrow, probably), thanks!
Here's a simple repro for that codegen issue, along with "one weird trick" to work around it 😂
Got a bit ill, so I couldn't work on this for the last 2 days.
I've actually tried to do the entire thing with 'ref pointers' and got worse performance; didn't check the assembly though.
The codegen is extremely finicky. Note the ...

Hope you feel better!
Description
Closes #1778 for the encoder part; the decoder part was implemented a long time ago, so the issue can be closed after this PR.
Basically, this removes the second matrix transpose call from the FDCT-Quantize-Huffman pipeline in the jpeg encoder. The performance gain is not that big, but it's still nice:
Current master
PR, roughly 4% faster
Tests were done using Calliphora.jpg from the test suite. Quality 98, no chroma subsampling.
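As for why the second transpose can simply be dropped (my reading of the change, not wording from the PR): quantization is element-wise, so it commutes with transposition, i.e. `round(Bt[i,j] / Qt[i,j]) == round(B[j,i] / Q[j,i])` for a DCT block B and quantization table Q (with Bt, Qt their transposes). The still-transposed FDCT output can therefore be quantized against a pre-transposed quantization table and reordered with a transposed zig-zag index table, and the extra transpose call becomes unnecessary.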
This optimization has a 'constant' delta independent of output quality, so lower qualities gain a bigger relative performance boost:
Current master
PR, roughly 5% faster
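To make the 'constant delta' point concrete (made-up numbers, purely for illustration): if skipping the transpose saves a fixed ~0.4 ms per image, that is 4% of a 10 ms quality-98 encode but 5% of an 8 ms lower-quality encode, so the relative gain grows as the overall encode time shrinks.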