unroll byref struct copies #86820

markples · 2023-05-26T22:32:37Z

If a struct contains a byref, it is known to live on the stack or in registers (not on the heap), so GC write barriers are not required when copying it. This adds that case to lower*.cpp and attempts to make the per-architecture code more similar. I didn't actually factor the shared code out, especially given a few subtle differences such as the call to getUnrollThreshold.

This partially handles #80086. It improves the code for common cases, but because the unrolling strategy is not always used, the correctness issue described there is not completely fixed. The next step is to apply the fix for that and see how bad the regressions are; this change will reduce their impact.

Example:

static Span<int> Copy1(Span<int> s) => s;

G_M44162_IG01:  ;; offset=0000H
       vzeroupper 
						;; size=3 bbWeight=1 PerfScore 1.00
G_M44162_IG02:  ;; offset=0003H
       vmovdqu  xmm0, xmmword ptr [rdx]
       vmovdqu  xmmword ptr [rcx], xmm0
						;; size=8 bbWeight=1 PerfScore 6.00
G_M44162_IG03:  ;; offset=000BH
       mov      rax, rcx
						;; size=3 bbWeight=1 PerfScore 0.25
G_M44162_IG04:  ;; offset=000EH
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 15, prolog size 3, PerfScore 9.75, instruction count 5, allocated bytes for code 15 (MethodHash=4d5b537d) for metho
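
The reasoning in the description (a struct containing a byref cannot live on the heap, so copying it needs no write barrier) can be sketched as a toy decision function. This is a minimal illustration, not the JIT's code: `ToySlot`, `ToyLayout`, `CopyStrategy`, and `ChooseCopyStrategy` are hypothetical names, and the real lowering also consults an unroll threshold (the description mentions getUnrollThreshold), which is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative model only; none of these names are the JIT's real types.
enum class ToySlot { NonGC, GCRef, GCByRef };
enum class CopyStrategy { UnrolledNoBarrier, PerSlotWithBarrier };

struct ToyLayout
{
    std::vector<ToySlot> slots; // one entry per pointer-sized slot

    ToySlot At(std::size_t i) const
    {
        assert(i < slots.size());
        return slots[i];
    }

    // True if any slot holds a GC-tracked pointer (object ref or byref).
    bool HasGCPtr() const
    {
        for (std::size_t i = 0; i < slots.size(); i++)
        {
            if (At(i) != ToySlot::NonGC)
                return true;
        }
        return false;
    }

    // True if any slot holds a byref.
    bool HasGCByRef() const
    {
        for (std::size_t i = 0; i < slots.size(); i++)
        {
            if (At(i) == ToySlot::GCByRef)
                return true;
        }
        return false;
    }
};

// destKnownOffHeap: whatever else is known about the destination
// (assumed input; the real analysis is not modeled here).
CopyStrategy ChooseCopyStrategy(const ToyLayout& layout, bool destKnownOffHeap)
{
    if (!layout.HasGCPtr())
        return CopyStrategy::UnrolledNoBarrier; // nothing for the GC to track
    if (destKnownOffHeap)
        return CopyStrategy::UnrolledNoBarrier; // stack/register destination
    if (layout.HasGCByRef())
        return CopyStrategy::UnrolledNoBarrier; // byref field implies not on heap
    return CopyStrategy::PerSlotWithBarrier;    // may be a heap write
}
```

With this model, a Span<int>-like layout (one byref slot plus one non-GC slot) selects `UnrolledNoBarrier` even when nothing else is known about the destination, while a struct holding an ordinary object reference still selects `PerSlotWithBarrier` unless the destination is separately known to be off the heap.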

ghost · 2023-05-26T22:32:47Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author:	markples
Assignees:	markples
Labels:	`area-CodeGen-coreclr`
Milestone:	-

ghost · 2023-06-25T23:01:44Z

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

markples · 2023-07-17T18:43:26Z

Diff results: there is a lot of interference from the negative diffs hitting all PRs right now.

linux arm64

Diffs are based on 876,006 contexts (105,646 MinOpts, 770,360 FullOpts).

Overall (-4,320 bytes)
MinOpts (-2,748 bytes)
FullOpts (-1,572 bytes)

linux x64

Diffs are based on 929,778 contexts (110,567 MinOpts, 819,211 FullOpts).

Overall (+1,189 bytes)
MinOpts (-422 bytes)
FullOpts (+1,611 bytes)

windows arm64

Diffs are based on 923,219 contexts (84,344 MinOpts, 838,875 FullOpts).

Overall (-4,164 bytes)
MinOpts (-2,100 bytes)
FullOpts (-2,064 bytes)

windows x64

Diffs are based on 987,198 contexts (109,803 MinOpts, 877,395 FullOpts).

Overall (-4,641 bytes)
MinOpts (-4,438 bytes)
FullOpts (-203 bytes)

linux arm

Diffs are based on 846,987 contexts (105,632 MinOpts, 741,355 FullOpts).

MISSED contexts: 33,754 (3.99%)

Overall (-10,776 bytes)
MinOpts (-7,584 bytes)
FullOpts (-3,192 bytes)

windows x86

Diffs are based on 976,498 contexts (114,341 MinOpts, 862,157 FullOpts).

Overall (+2,116 bytes)
MinOpts (+129 bytes)
FullOpts (+1,987 bytes)

markples · 2023-07-17T18:47:37Z

good diff from arm64

-            mov     x14, x19
-            ; byrRegs +[x14]
-            mov     x13, x20
-            ; byrRegs +[x13]
-            bl      CORINFO_HELP_ASSIGN_BYREF
-            ldr     x12, [x13], #0x08
-            str     x12, [x14], #0x08
+						;; size=8 bbWeight=1 PerfScore 6.00
+G_M20461_IG12:        ; bbWeight=1, nogc, extend
+            ldp     x2, x3, [x20]
+            stp     x2, x3, [x19]
+						;; size=8 bbWeight=1 PerfScore 5.00
+G_M20461_IG13:        ; bbWeight=1, isz, extend

markples · 2023-07-17T18:48:20Z

similar on x64

-       add      rdi, 56
-       lea      rsi, bword ptr [rbp-70H]
-       ; byrRegs +[rsi]
-       call     CORINFO_HELP_ASSIGN_BYREF
-       movsq    
-       call     CORINFO_HELP_ASSIGN_BYREF
-       movsq    
+						;; size=22 bbWeight=1 PerfScore 6.50
+G_M5232_IG05:        ; bbWeight=1, nogc, extend
+       vmovdqu  ymm0, ymmword ptr [rbp-70H]
+       vmovdqu  ymmword ptr [rdi+38H], ymm0
+						;; size=10 bbWeight=1 PerfScore 6.00

markples · 2023-07-17T18:49:17Z

occasional size regression

+3 (+5.77%) : 66460.dasm - System.Text.SpanLineEnumerator:GetEnumerator():System.Text.SpanLineEnumerator:this (Tier0)
@@ -15,31 +15,34 @@
 G_M50205_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
        push     rbp
        sub      rsp, 16
+       vzeroupper 
        lea      rbp, [rsp+10H]
        mov      bword ptr [rbp-08H], rdi
        mov      bword ptr [rbp-10H], rsi
-						;; size=18 bbWeight=1 PerfScore 3.75
+						;; size=21 bbWeight=1 PerfScore 4.75
 G_M50205_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
-       mov      rsi, bword ptr [rbp-08H]
-       ; byrRegs +[rsi]
-       mov      rdi, bword ptr [rbp-10H]
-       ; byrRegs +[rdi]
-       call     CORINFO_HELP_ASSIGN_BYREF
-       movsq    
-       call     CORINFO_HELP_ASSIGN_BYREF
-       movsq    
-       movsq    
-       mov      rax, bword ptr [rbp-10H]
+       mov      rax, bword ptr [rbp-08H]
        ; byrRegs +[rax]
-						;; size=28 bbWeight=1 PerfScore 8.00
-G_M50205_IG03:        ; bbWeight=1, epilog, nogc, extend
+       mov      rcx, bword ptr [rbp-10H]
+       ; byrRegs +[rcx]
+						;; size=8 bbWeight=1 PerfScore 2.00
+G_M50205_IG03:        ; bbWeight=1, nogc, extend
+       vmovdqu  ymm0, ymmword ptr [rax]
+       vmovdqu  ymmword ptr [rcx], ymm0
+       mov      rdx, qword ptr [rax+20H]
+       mov      qword ptr [rcx+20H], rdx
+						;; size=16 bbWeight=1 PerfScore 10.00
+G_M50205_IG04:        ; bbWeight=1, extend
+       mov      rax, bword ptr [rbp-10H]
+						;; size=4 bbWeight=1 PerfScore 1.00
+G_M50205_IG05:        ; bbWeight=1, epilog, nogc, extend
        add      rsp, 16
        pop      rbp
        ret      
 						;; size=6 bbWeight=1 PerfScore 1.75
 ; END METHOD System.Text.SpanLineEnumerator:GetEnumerator():System.Text.SpanLineEnumerator:this
 
-; Total bytes of code 52, prolog size 10, PerfScore 18.70, instruction count 16, allocated bytes for code 52 (MethodHash=ee6c3be2) for method System.Text.SpanLineEnumerator:GetEnumerator():System.Text.SpanLineEnumerator:this (Tier0)
+; Total bytes of code 55, prolog size 13, PerfScore 25.00, instruction count 16, allocated bytes for code 55 (MethodHash=ee6c3be2) for method System.Text.SpanLineEnumerator:GetEnumerator():System.Text.SpanLineEnumerator:this (Tier0)

markples · 2023-07-17T19:16:54Z

@dotnet/jit-contrib
cc @clamp03 @shushanhf since there are architecture-specific changes

markples · 2023-07-17T23:19:50Z

/azp run runtime-coreclr gcstress0x3-gcstress0xc

azure-pipelines · 2023-07-17T23:20:08Z

Azure Pipelines successfully started running 1 pipeline(s).

clamp03 · 2023-07-18T06:31:36Z

Thank you! It looks good to me. CC @t-mustafin @alpencolt

jakobbotsch · 2023-07-18T07:24:48Z

src/tests/JIT/opt/Misc/Runtime_80086/Runtime_80086.csproj

+<Project Sdk="Microsoft.NET.Sdk">
+  <PropertyGroup>
+    <!-- Needed for CLRTestEnvironmentVariable -->
+    <RequiresProcessIsolation>true</RequiresProcessIsolation>
+
+    <AllowUnsafeBlocks>True</AllowUnsafeBlocks>
+    <Optimize>True</Optimize>
+  </PropertyGroup>
+  <ItemGroup>
+    <Compile Include="$(MSBuildProjectName).cs" />
+
+    <CLRTestEnvironmentVariable Include="DOTNET_TieredCompilation" Value="0" />
+    <CLRTestEnvironmentVariable Include="DOTNET_JITMinOpts" Value="0" />
+  </ItemGroup>
+</Project>


Any particular reason to set these environment variables and RequiresProcessIsolation?

This might be a misunderstanding on my part after seeing these set in a number of optimization tests (not actually as many as I thought). It looks like they are set when we do FileCheck asm checks (ideally we could say something like "do FileCheck X if and only if optimized" rather than "always optimize and do FileCheck X"), but I figured normal asm diffs would be sufficient here.

I think I'm still missing something because I'd expect most tests to run without optimizations due to tiering yet most of the asmdiff contexts have optimizations enabled. Perhaps if I dug into the asmdiff setup I'd find some settings for most/all tests?

PR runs of runtime always run with tiered compilation disabled. In the rolling builds on main we run all tests both with and without tiered compilation enabled. It's set here:

runtime/eng/pipelines/common/templates/runtimes/run-test-job.yml

Lines 376 to 382 in ac3979a

    ${{ elseif eq(variables['Build.Reason'], 'PullRequest') }}:
      scenarios:
      - no_tiered_compilation
    ${{ else }}:
      scenarios:
      - normal
      - no_tiered_compilation

So usually (for non file check) we will not set any environment variables and leave it up to these defined scenarios.

Ah, thanks! I hadn't found this test snippet yet.

jakobbotsch · 2023-07-18T07:26:09Z

/azp run runtime-coreclr superpmi-diffs, runtime-coreclr superpmi-replay

azure-pipelines · 2023-07-18T07:26:25Z

Azure Pipelines successfully started running 2 pipeline(s).

jakobbotsch · 2023-07-18T09:46:54Z

src/coreclr/jit/layout.cpp

+    unsigned slots = this->GetSlotCount();
+    for (unsigned i = 0; i < slots; i++)
+    {
+        if (this->IsGCByRef(i))


Nit: We usually don't prefix with this unless necessary.

markples · 2023-07-18T16:50:17Z

Cleaner results:

linux arm64

Diffs are based on 1,536,168 contexts (480,629 MinOpts, 1,055,539 FullOpts).

Overall (-5,232 bytes)
MinOpts (-3,260 bytes)
FullOpts (-1,972 bytes)

linux x64

Diffs are based on 1,660,406 contexts (558,006 MinOpts, 1,102,400 FullOpts).

Overall (-1,142 bytes)
MinOpts (-750 bytes)
FullOpts (-392 bytes)

osx arm64

Diffs are based on 1,187,983 contexts (429,171 MinOpts, 758,812 FullOpts).

Overall (-5,732 bytes)
MinOpts (-3,276 bytes)
FullOpts (-2,456 bytes)

windows arm64

Diffs are based on 1,401,929 contexts (440,568 MinOpts, 961,361 FullOpts).

Overall (-4,416 bytes)
MinOpts (-2,580 bytes)
FullOpts (-1,836 bytes)

windows x64

Diffs are based on 1,661,390 contexts (471,241 MinOpts, 1,190,149 FullOpts).

MISSED contexts: 3 (0.00%)

Overall (-8,993 bytes)
MinOpts (-5,772 bytes)
FullOpts (-3,221 bytes)

linux arm

Diffs are based on 1,323,737 contexts (358,680 MinOpts, 965,057 FullOpts).

MISSED contexts: 44,200 (3.34%)

Overall (-13,518 bytes)
MinOpts (-9,530 bytes)
FullOpts (-3,988 bytes)

windows x86

Diffs are based on 1,438,138 contexts (387,507 MinOpts, 1,050,631 FullOpts).

No diffs found.

xtqqczze · 2023-07-18T18:28:36Z

static Span<int> Copy1(Span<int> s) => s;

Extended set of cases: https://csharp.godbolt.org/z/zf8e5qa4G

markples · 2023-07-18T18:47:14Z

Thanks @xtqqczze. I think that I have all of these except for CreateSpan3 (quite similar to CreateSpan2 except with byref instead of * type).

IDisposable · 2023-07-29T04:39:39Z

src/coreclr/jit/layout.cpp

+bool ClassLayout::HasGCByRef() const
+{
+    unsigned slots = GetSlotCount();
+    for (unsigned i = 0; i < slots; i++)


Is there an intuition that the GCByRef slots are at the beginning or at the end or scattered? In other words, are we more likely to fast-return if we walk the list backwards?

I don't have any intuition or measurements on this. If necessary, then this could be cached similar to whether the layout has any gc pointers, though if I recall correctly we are out of bits at this size.
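
The slot walk under discussion, and the caching idea mentioned in the reply, can be sketched with a self-contained toy class. This is an assumption-laden illustration, not the code in layout.cpp: `ToyClassLayout`, `SlotKind`, and `HasGCByRefCached` are hypothetical, the loop body after the `IsGCByRef(i)` check is assumed to return true, and the cached flag here is a plain bool rather than a packed bit.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for per-slot GC kinds.
enum class SlotKind { NonGC, GCRef, GCByRef };

class ToyClassLayout
{
public:
    explicit ToyClassLayout(std::vector<SlotKind> slots)
        : m_slots(std::move(slots)), m_hasGCByRef(HasGCByRef())
    {
    }

    unsigned GetSlotCount() const
    {
        return static_cast<unsigned>(m_slots.size());
    }

    bool IsGCByRef(unsigned slot) const
    {
        assert(slot < GetSlotCount());
        return m_slots[slot] == SlotKind::GCByRef;
    }

    // Forward walk with the same loop shape as the diff; stops at the first byref.
    bool HasGCByRef() const
    {
        unsigned slots = GetSlotCount();
        for (unsigned i = 0; i < slots; i++)
        {
            if (IsGCByRef(i))
            {
                return true; // assumed body: the diff excerpt ends at the check
            }
        }
        return false;
    }

    // Cached alternative: the walk runs once, in the constructor.
    bool HasGCByRefCached() const
    {
        return m_hasGCByRef;
    }

private:
    // Declared in this order so m_slots is initialized before the flag is computed.
    std::vector<SlotKind> m_slots;
    bool                  m_hasGCByRef;
};
```

Either walk direction visits every slot when the layout has no byref at all, so direction only matters for layouts that do contain one; the cached variant trades one extra field per layout for a constant-time query.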

unroll byref struct copies

908703b

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) May 26, 2023

ghost assigned markples May 26, 2023

markples mentioned this pull request May 26, 2023

Possible optimisation for derefencing a span pointer #80086

Closed

markples added this to the 8.0.0 milestone May 26, 2023

ghost closed this Jun 25, 2023

markples added 2 commits July 14, 2023 16:33

HasGCByRef and some xarch/arm consolidation

4a9c52a

loongarch64/riscv64 and more arch consolidation

512e1d5

markples reopened this Jul 15, 2023

markples marked this pull request as ready for review July 17, 2023 19:11

jakobbotsch reviewed Jul 18, 2023

Remove unnecessary envvars from test

7641a70

jakobbotsch reviewed Jul 18, 2023

jakobbotsch approved these changes Jul 18, 2023

Remove this->

4469cc2

markples merged commit 8724933 into dotnet:main Jul 18, 2023

markples mentioned this pull request Jul 18, 2023

Avoid calling GC write barrier for byrefs #89064

Closed

IDisposable reviewed Jul 29, 2023

markples mentioned this pull request Aug 31, 2023

Optimize out write barriers for fields in ref-like structs #9512

Closed

ghost locked as resolved and limited conversation to collaborators Sep 8, 2023
