unroll byref struct copies #86820

Merged
merged 5 commits into from
Jul 18, 2023

markples
Member

@markples markples commented May 26, 2023

If a struct contains a byref, then it is known to live on the stack or in registers (not on the heap), so GC write barriers are not required when copying it. This adds that case to the lower*.cpp files and attempts to make the per-architecture code more similar. I didn't actually factor them into shared code, especially given a few subtle differences such as the call to getUnrollThreshold.
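
A minimal sketch of that reasoning, using hypothetical names (`SlotKind`, `CopyStrategy`, `chooseCopyStrategy`) rather than the JIT's actual types. It models a layout as one entry per pointer-sized slot and ignores details the real lowering code considers, such as the unroll threshold and whether the destination is already known to be on the stack:

```cpp
#include <cassert>

// Hypothetical model: one entry per pointer-sized slot of a struct layout.
enum class SlotKind { NonGC, GCRef, GCByRef };

// Hypothetical result: copy with plain loads/stores, or use write barriers.
enum class CopyStrategy { PlainUnroll, WithWriteBarriers };

CopyStrategy chooseCopyStrategy(const SlotKind* slots, unsigned count)
{
    assert(slots != nullptr || count == 0);

    bool hasGCRef = false;
    for (unsigned i = 0; i < count; i++)
    {
        // A struct holding a byref cannot live on the GC heap,
        // so stores into it never need a write barrier.
        if (slots[i] == SlotKind::GCByRef)
            return CopyStrategy::PlainUnroll;

        if (slots[i] == SlotKind::GCRef)
            hasGCRef = true;
    }

    // Without a byref, object references may be written into the heap,
    // so this simplified model falls back to write barriers.
    return hasGCRef ? CopyStrategy::WithWriteBarriers : CopyStrategy::PlainUnroll;
}
```

Under this model, a `Span<int>`-like layout on a 64-bit target would be `{GCByRef, NonGC}` and would select `PlainUnroll`.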

This partially handles #80086. It improves the code for common cases, but because this strategy is not always used, the correctness issue described there is not completely fixed. The next step is to apply the fix for that issue and see how bad the regressions are; this change will reduce the impact.

Example:

static Span<int> Copy1(Span<int> s) => s;
G_M44162_IG01:  ;; offset=0000H
       vzeroupper 
						;; size=3 bbWeight=1 PerfScore 1.00
G_M44162_IG02:  ;; offset=0003H
       vmovdqu  xmm0, xmmword ptr [rdx]
       vmovdqu  xmmword ptr [rcx], xmm0
						;; size=8 bbWeight=1 PerfScore 6.00
G_M44162_IG03:  ;; offset=000BH
       mov      rax, rcx
						;; size=3 bbWeight=1 PerfScore 0.25
G_M44162_IG04:  ;; offset=000EH
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 15, prolog size 3, PerfScore 9.75, instruction count 5, allocated bytes for code 15 (MethodHash=4d5b537d) for metho

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 26, 2023
@ghost ghost assigned markples May 26, 2023
@ghost

ghost commented May 26, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: markples
Assignees: markples
Labels:

area-CodeGen-coreclr

Milestone: -

@markples markples added this to the 8.0.0 milestone May 26, 2023
@ghost ghost closed this Jun 25, 2023
@ghost

ghost commented Jun 25, 2023

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@markples markples reopened this Jul 15, 2023
@markples
Member Author

Diff results - lots of interference from the negative diffs hitting all PRs right now

linux arm64

Diffs are based on 876,006 contexts (105,646 MinOpts, 770,360 FullOpts).

Overall (-4,320 bytes)
MinOpts (-2,748 bytes)
FullOpts (-1,572 bytes)

linux x64

Diffs are based on 929,778 contexts (110,567 MinOpts, 819,211 FullOpts).

Overall (+1,189 bytes)
MinOpts (-422 bytes)
FullOpts (+1,611 bytes)

windows arm64

Diffs are based on 923,219 contexts (84,344 MinOpts, 838,875 FullOpts).

Overall (-4,164 bytes)
MinOpts (-2,100 bytes)
FullOpts (-2,064 bytes)

windows x64

Diffs are based on 987,198 contexts (109,803 MinOpts, 877,395 FullOpts).

Overall (-4,641 bytes)
MinOpts (-4,438 bytes)
FullOpts (-203 bytes)

linux arm

Diffs are based on 846,987 contexts (105,632 MinOpts, 741,355 FullOpts).

MISSED contexts: 33,754 (3.99%)

Overall (-10,776 bytes)
MinOpts (-7,584 bytes)
FullOpts (-3,192 bytes)

windows x86

Diffs are based on 976,498 contexts (114,341 MinOpts, 862,157 FullOpts).

Overall (+2,116 bytes)
MinOpts (+129 bytes)
FullOpts (+1,987 bytes)

@markples
Member Author

good diff from arm64

-            mov     x14, x19
-            ; byrRegs +[x14]
-            mov     x13, x20
-            ; byrRegs +[x13]
-            bl      CORINFO_HELP_ASSIGN_BYREF
-            ldr     x12, [x13], #0x08
-            str     x12, [x14], #0x08
+						;; size=8 bbWeight=1 PerfScore 6.00
+G_M20461_IG12:        ; bbWeight=1, nogc, extend
+            ldp     x2, x3, [x20]
+            stp     x2, x3, [x19]
+						;; size=8 bbWeight=1 PerfScore 5.00
+G_M20461_IG13:        ; bbWeight=1, isz, extend

@markples
Member Author

similar on x64

-       add      rdi, 56
-       lea      rsi, bword ptr [rbp-70H]
-       ; byrRegs +[rsi]
-       call     CORINFO_HELP_ASSIGN_BYREF
-       movsq    
-       call     CORINFO_HELP_ASSIGN_BYREF
-       movsq    
+						;; size=22 bbWeight=1 PerfScore 6.50
+G_M5232_IG05:        ; bbWeight=1, nogc, extend
+       vmovdqu  ymm0, ymmword ptr [rbp-70H]
+       vmovdqu  ymmword ptr [rdi+38H], ymm0
+						;; size=10 bbWeight=1 PerfScore 6.00

@markples
Member Author

occasional size regression

+3 (+5.77%) : 66460.dasm - System.Text.SpanLineEnumerator:GetEnumerator():System.Text.SpanLineEnumerator:this (Tier0)
@@ -15,31 +15,34 @@
 G_M50205_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
        push     rbp
        sub      rsp, 16
+       vzeroupper 
        lea      rbp, [rsp+10H]
        mov      bword ptr [rbp-08H], rdi
        mov      bword ptr [rbp-10H], rsi
-						;; size=18 bbWeight=1 PerfScore 3.75
+						;; size=21 bbWeight=1 PerfScore 4.75
 G_M50205_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
-       mov      rsi, bword ptr [rbp-08H]
-       ; byrRegs +[rsi]
-       mov      rdi, bword ptr [rbp-10H]
-       ; byrRegs +[rdi]
-       call     CORINFO_HELP_ASSIGN_BYREF
-       movsq    
-       call     CORINFO_HELP_ASSIGN_BYREF
-       movsq    
-       movsq    
-       mov      rax, bword ptr [rbp-10H]
+       mov      rax, bword ptr [rbp-08H]
        ; byrRegs +[rax]
-						;; size=28 bbWeight=1 PerfScore 8.00
-G_M50205_IG03:        ; bbWeight=1, epilog, nogc, extend
+       mov      rcx, bword ptr [rbp-10H]
+       ; byrRegs +[rcx]
+						;; size=8 bbWeight=1 PerfScore 2.00
+G_M50205_IG03:        ; bbWeight=1, nogc, extend
+       vmovdqu  ymm0, ymmword ptr [rax]
+       vmovdqu  ymmword ptr [rcx], ymm0
+       mov      rdx, qword ptr [rax+20H]
+       mov      qword ptr [rcx+20H], rdx
+						;; size=16 bbWeight=1 PerfScore 10.00
+G_M50205_IG04:        ; bbWeight=1, extend
+       mov      rax, bword ptr [rbp-10H]
+						;; size=4 bbWeight=1 PerfScore 1.00
+G_M50205_IG05:        ; bbWeight=1, epilog, nogc, extend
        add      rsp, 16
        pop      rbp
        ret      
 						;; size=6 bbWeight=1 PerfScore 1.75
 ; END METHOD System.Text.SpanLineEnumerator:GetEnumerator():System.Text.SpanLineEnumerator:this
 
-; Total bytes of code 52, prolog size 10, PerfScore 18.70, instruction count 16, allocated bytes for code 52 (MethodHash=ee6c3be2) for method System.Text.SpanLineEnumerator:GetEnumerator():System.Text.SpanLineEnumerator:this (Tier0)
+; Total bytes of code 55, prolog size 13, PerfScore 25.00, instruction count 16, allocated bytes for code 55 (MethodHash=ee6c3be2) for method System.Text.SpanLineEnumerator:GetEnumerator():System.Text.SpanLineEnumerator:this (Tier0)
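
In the diff above, the 40-byte struct that previously took two CORINFO_HELP_ASSIGN_BYREF calls and three movsq instructions is now copied with one 32-byte ymm move plus one 8-byte move. A rough sketch of how such a split could be planned, assuming a greedy largest-first choice of move widths; `planUnrolledCopy` is a hypothetical helper, not the JIT's implementation, which handles details this ignores:

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: split a copy of `size` bytes into move widths,
// widest first. `maxVector` is the widest vector move assumed to be
// available (32 for ymm registers, 16 for xmm registers).
std::vector<unsigned> planUnrolledCopy(unsigned size, unsigned maxVector = 32)
{
    assert(maxVector == 16 || maxVector == 32);

    std::vector<unsigned> chunks;
    for (unsigned width = maxVector; width >= 1; width /= 2)
    {
        // Emit as many moves of this width as still fit.
        while (size >= width)
        {
            chunks.push_back(width);
            size -= width;
        }
    }
    return chunks;
}
```

With these assumptions, `planUnrolledCopy(40)` yields a 32-byte move followed by an 8-byte move, and `planUnrolledCopy(16)` yields a single 16-byte move like the xmm copy in the `Copy1` example.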

@markples markples marked this pull request as ready for review July 17, 2023 19:11
@markples
Member Author

@dotnet/jit-contrib
cc @clamp03 @shushanhf since there are architecture-specific changes

@markples
Member Author

/azp run runtime-coreclr gcstress0x3-gcstress0xc

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@clamp03
Member

clamp03 commented Jul 18, 2023

> @dotnet/jit-contrib cc @clamp03 @shushanhf since there are architecture-specific changes

Thank you! It looks good to me. CC @t-mustafin @alpencolt

Comment on lines 1 to 15
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <!-- Needed for CLRTestEnvironmentVariable -->
    <RequiresProcessIsolation>true</RequiresProcessIsolation>

    <AllowUnsafeBlocks>True</AllowUnsafeBlocks>
    <Optimize>True</Optimize>
  </PropertyGroup>
  <ItemGroup>
    <Compile Include="$(MSBuildProjectName).cs" />

    <CLRTestEnvironmentVariable Include="DOTNET_TieredCompilation" Value="0" />
    <CLRTestEnvironmentVariable Include="DOTNET_JITMinOpts" Value="0" />
  </ItemGroup>
</Project>
Member

Any particular reason to set these environment variables and RequiresProcessIsolation?

Member Author

This looks like a misunderstanding on my part after seeing these set in a number of optimization tests (not as many as I thought, as it turns out). They appear to be set when we do FileCheck asm checks (ideally we could say something like "do FileCheck X if and only if optimized" rather than "always optimize and do FileCheck X"), but I figured normal asm diffs would be sufficient here.

I think I'm still missing something because I'd expect most tests to run without optimizations due to tiering yet most of the asmdiff contexts have optimizations enabled. Perhaps if I dug into the asmdiff setup I'd find some settings for most/all tests?

Member

> I think I'm still missing something because I'd expect most tests to run without optimizations due to tiering yet most of the asmdiff contexts have optimizations enabled. Perhaps if I dug into the asmdiff setup I'd find some settings for most/all tests?

PR runs of runtime always run with tiered compilation disabled. In the rolling builds on main we run all tests both with and without tiered compilation enabled. It's set here:

${{ elseif eq(variables['Build.Reason'], 'PullRequest') }}:
  scenarios:
  - no_tiered_compilation
${{ else }}:
  scenarios:
  - normal
  - no_tiered_compilation

So usually (for tests that do not use FileCheck) we do not set any environment variables and leave it up to these defined scenarios.

Member Author

Ah, thanks! I hadn't found this test snippet yet.

@jakobbotsch
Member

/azp run runtime-coreclr superpmi-diffs, runtime-coreclr superpmi-replay

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

Comment on lines 426 to 429
    unsigned slots = this->GetSlotCount();
    for (unsigned i = 0; i < slots; i++)
    {
        if (this->IsGCByRef(i))
Member

Nit: We usually don't prefix with this-> unless necessary.

@markples
Member Author

cleaner results:

linux arm64

Diffs are based on 1,536,168 contexts (480,629 MinOpts, 1,055,539 FullOpts).

Overall (-5,232 bytes)
MinOpts (-3,260 bytes)
FullOpts (-1,972 bytes)

linux x64

Diffs are based on 1,660,406 contexts (558,006 MinOpts, 1,102,400 FullOpts).

Overall (-1,142 bytes)
MinOpts (-750 bytes)
FullOpts (-392 bytes)

osx arm64

Diffs are based on 1,187,983 contexts (429,171 MinOpts, 758,812 FullOpts).

Overall (-5,732 bytes)
MinOpts (-3,276 bytes)
FullOpts (-2,456 bytes)

windows arm64

Diffs are based on 1,401,929 contexts (440,568 MinOpts, 961,361 FullOpts).

Overall (-4,416 bytes)
MinOpts (-2,580 bytes)
FullOpts (-1,836 bytes)

windows x64

Diffs are based on 1,661,390 contexts (471,241 MinOpts, 1,190,149 FullOpts).

MISSED contexts: 3 (0.00%)

Overall (-8,993 bytes)
MinOpts (-5,772 bytes)
FullOpts (-3,221 bytes)

linux arm

Diffs are based on 1,323,737 contexts (358,680 MinOpts, 965,057 FullOpts).

MISSED contexts: 44,200 (3.34%)

Overall (-13,518 bytes)
MinOpts (-9,530 bytes)
FullOpts (-3,988 bytes)

windows x86

Diffs are based on 1,438,138 contexts (387,507 MinOpts, 1,050,631 FullOpts).

No diffs found.

@markples markples merged commit 8724933 into dotnet:main Jul 18, 2023
@xtqqczze
Contributor

static Span<int> Copy1(Span<int> s) => s;

Extended set of cases: https://csharp.godbolt.org/z/zf8e5qa4G

@markples
Member Author

Thanks @xtqqczze. I think that I have all of these except for CreateSpan3 (quite similar to CreateSpan2 except with a byref instead of a pointer (*) type).

bool ClassLayout::HasGCByRef() const
{
    unsigned slots = GetSlotCount();
    for (unsigned i = 0; i < slots; i++)
Contributor

Is there an intuition that the GCByRef slots are at the beginning, at the end, or scattered? In other words, are we more likely to fast-return if we walk the list backwards?

Member Author

I don't have any intuition or measurements on this. If necessary, then this could be cached similar to whether the layout has any gc pointers, though if I recall correctly we are out of bits at this size.
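
For illustration only, the caching idea could look roughly like the sketch below. `SimpleLayout` and `SlotKind` are hypothetical stand-ins, not the JIT's `ClassLayout`, and this version spends a whole `bool` on the cached answer, which sidesteps the "out of bits" concern rather than solving it:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical model: one entry per pointer-sized slot.
enum class SlotKind { NonGC, GCRef, GCByRef };

// Hypothetical simplified layout that scans once and caches the answer.
class SimpleLayout
{
    std::vector<SlotKind> m_slots;
    bool                  m_hasGCByRef; // computed once in the constructor

    static bool ScanForGCByRef(const std::vector<SlotKind>& slots)
    {
        for (SlotKind kind : slots)
        {
            if (kind == SlotKind::GCByRef)
                return true; // early return on the first byref slot
        }
        return false;
    }

public:
    // m_slots is declared first, so it is initialized before the scan runs.
    explicit SimpleLayout(std::vector<SlotKind> slots)
        : m_slots(std::move(slots)), m_hasGCByRef(ScanForGCByRef(m_slots))
    {
    }

    unsigned GetSlotCount() const
    {
        return static_cast<unsigned>(m_slots.size());
    }

    bool IsGCByRef(unsigned slot) const
    {
        assert(slot < GetSlotCount());
        return m_slots[slot] == SlotKind::GCByRef;
    }

    // Constant-time query; no per-call walk over the slots.
    bool HasGCByRef() const
    {
        return m_hasGCByRef;
    }
};
```

With the result cached, the direction of the scan matters only once per layout instead of on every query.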

@ghost ghost locked as resolved and limited conversation to collaborators Sep 8, 2023