Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand indirection cell calls in IR instead of doing ad-hoc optimizations in codegen #65388

Closed

Conversation

jakobbotsch
Copy link
Member

When the JIT calls a function that requires an indirection cell argument
it is beneficial to load the call target through the already placed arg.
Previously this was done as an ad-hoc optimization during codegen (with
LSRA making sure to allocate an extra register). Instead of this, expand
it as regular IR and remove the ad-hoc optimization.

This change makes the optimization that were previously done work for
CFG as well and generally seems to have (albeit small) positive diffs.

Fix #65076

When the JIT calls a function that requires an indirection cell argument
it is beneficial to load the call target through the already placed arg.
Previously this was done as an ad-hoc optimization during codegen (with
LSRA making sure to allocate an extra register). Instead of this, expand
it as regular IR and remove the ad-hoc optimization.

This change makes the optimization that were previously done work for
CFG as well and generally seems to have (albeit small) positive diffs.

Fix dotnet#65076
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 15, 2022
@ghost ghost assigned jakobbotsch Feb 15, 2022
@ghost
Copy link

ghost commented Feb 15, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

When the JIT calls a function that requires an indirection cell argument
it is beneficial to load the call target through the already placed arg.
Previously this was done as an ad-hoc optimization during codegen (with
LSRA making sure to allocate an extra register). Instead of this, expand
it as regular IR and remove the ad-hoc optimization.

This change makes the optimization that were previously done work for
CFG as well and generally seems to have (albeit small) positive diffs.

Fix #65076

Author: jakobbotsch
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@jakobbotsch
Copy link
Member Author

Codegen diff for a simple interface call with CFG enabled:

[MethodImpl(MethodImplOptions.NoInlining)]
public static void M(I i)
{
    i.Foo();
}

Arm64

Before

G_M10449_IG02:              ;; offset=0008H
        D280010F          movz    x15, #8
        F2ADACCF          movk    x15, #0x6d66 LSL #16
        F2CFFF2F          movk    x15, #0x7ff9 LSL #32
        F94001EF          ldr     x15, [x15]
        94000000          bl      CORINFO_HELP_VALIDATE_INDIRECT_CALL
        AA0F03E1          mov     x1, x15
        D280010B          movz    x11, #8
        F2ADACCB          movk    x11, #0x6d66 LSL #16
        F2CFFF2B          movk    x11, #0x7ff9 LSL #32
        D63F0020          blr     x1

After

G_M10449_IG02:              ;; offset=0008H
        D2800101          movz    x1, #8
        F2ADACA1          movk    x1, #0x6d65 LSL #16
        F2CFFF21          movk    x1, #0x7ff9 LSL #32
        F940002F          ldr     x15, [x1]
        94000000          bl      CORINFO_HELP_VALIDATE_INDIRECT_CALL
        AA0F03E2          mov     x2, x15
        AA0103EB          mov     x11, x1
        D63F0040          blr     x2

x64

Before

G_M10449_IG02:              ;; offset=0009H
       488B0D3833EDFF       mov      rcx, qword ptr [(reloc 0x7ff96d650008)]
       E8EBAC9F5F           call     CORINFO_HELP_VALIDATE_INDIRECT_CALL
       488BC1               mov      rax, rcx
       488B4C2430           mov      rcx, gword ptr [rsp+30H]
       49BB0800656DF97F0000 mov      r11, 0x7FF96D650008
       FFD0                 call     rax ; I:Foo():this
       90                   nop      

After

G_M10449_IG02:              ;; offset=0009H
       49BA0800646DF97F0000 mov      r10, 0x7FF96D640008
       498B0A               mov      rcx, qword ptr [r10]
       E8E5ACA05F           call     CORINFO_HELP_VALIDATE_INDIRECT_CALL
       488BC1               mov      rax, rcx
       488B4C2430           mov      rcx, gword ptr [rsp+30H]
       4D8BDA               mov      r11, r10
       FFD0                 call     rax ; I:Foo():this
       90                   nop      

(I will need to benchmark this to see if it is better to disable the expansion in this case)

Comment on lines +3399 to +3458
//
assert(genIsValidIntReg(targetAddrReg));
assert(addr != nullptr);

// Non-virtual direct call to known addresses
#ifdef TARGET_ARM
if (!validImmForBL((ssize_t)addr))
{
regNumber tmpReg = call->GetSingleTempReg();
instGen_Set_Reg_To_Imm(EA_HANDLE_CNS_RELOC, tmpReg, (ssize_t)addr);
// clang-format off
genEmitCall(emitter::EC_INDIR_R,
methHnd,
INDEBUG_LDISASM_COMMA(sigInfo)
nullptr, // addr
retSize
MULTIREG_HAS_SECOND_GC_RET_ONLY_ARG(secondRetSize),
NULL,
retSize,
di,
targetAddrReg,
tmpReg,
call->IsFastTailCall());
// clang-format on
}
else
{
// Generate a direct call to a non-virtual user defined or helper method
assert(call->gtCallType == CT_HELPER || call->gtCallType == CT_USER_FUNC);

void* addr = nullptr;
#ifdef FEATURE_READYTORUN
if (call->gtEntryPoint.addr != NULL)
{
assert(call->gtEntryPoint.accessType == IAT_VALUE);
addr = call->gtEntryPoint.addr;
}
else
#endif // FEATURE_READYTORUN
if (call->gtCallType == CT_HELPER)
{
CorInfoHelpFunc helperNum = compiler->eeGetHelperNum(methHnd);
noway_assert(helperNum != CORINFO_HELP_UNDEF);

void* pAddr = nullptr;
addr = compiler->compGetHelperFtn(helperNum, (void**)&pAddr);
assert(pAddr == nullptr);
}
else
{
// Direct call to a non-virtual user function.
addr = call->gtDirectCallAddress;
}

assert(addr != nullptr);

// Non-virtual direct call to known addresses
#ifdef TARGET_ARM
if (!validImmForBL((ssize_t)addr))
{
regNumber tmpReg = call->GetSingleTempReg();
instGen_Set_Reg_To_Imm(EA_HANDLE_CNS_RELOC, tmpReg, (ssize_t)addr);
// clang-format off
genEmitCall(emitter::EC_INDIR_R,
methHnd,
INDEBUG_LDISASM_COMMA(sigInfo)
NULL,
retSize,
di,
tmpReg,
call->IsFastTailCall());
// clang-format on
}
else
#endif // TARGET_ARM
{
// clang-format off
genEmitCall(emitter::EC_FUNC_TOKEN,
methHnd,
INDEBUG_LDISASM_COMMA(sigInfo)
addr,
retSize
MULTIREG_HAS_SECOND_GC_RET_ONLY_ARG(secondRetSize),
di,
REG_NA,
call->IsFastTailCall());
// clang-format on
}
{
// clang-format off
genEmitCall(emitter::EC_FUNC_TOKEN,
methHnd,
INDEBUG_LDISASM_COMMA(sigInfo)
addr,
retSize
MULTIREG_HAS_SECOND_GC_RET_ONLY_ARG(secondRetSize),
di,
REG_NA,
call->IsFastTailCall());
// clang-format on
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff is hard to read but the only change here is that I removed the first if case (with the ad-hoc optimization) and unindented the rest.

@JulieLeeMSFT JulieLeeMSFT requested review from AndyAyersMS and echesakov and removed request for echesakov February 16, 2022 03:15
@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone Feb 16, 2022
- Helpers can use indir cell args so we do not only see CT_USER_FUNC
- Fix an incorrect redefinition
Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Seems like arm is not yet happy though.

src/coreclr/jit/gentree.h Outdated Show resolved Hide resolved
src/coreclr/jit/gentree.h Outdated Show resolved Hide resolved
src/coreclr/jit/valuenum.cpp Outdated Show resolved Hide resolved
The callee-saved register used for the indir cell arg was not being
killed at the call site leaving the arg register marked as busy.
After previous fix I believe this should no longer be necessary.
@jakobbotsch
Copy link
Member Author

The ARM issue seemed to be an LSRA bug: for R2R calls we were not killing the register containing the indirection cell address at the call site which led to that register being marked busy for the remainder of the function. ARM is a bit special there since it uses a callee-preserved register. There was a hack in morph to work around this problem (I assume) that I have removed as well.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Feb 21, 2022

On x64 with CFG it looks like it's better to use the constant in both places:

 ; MyBenchmarks.SimpleIfaceCall.Call()
        push      rax
        xor       eax,eax
        mov       [rsp],rax
        mov       rcx,[rcx+8]
        mov       [rsp],rcx
-       mov       r10,7FF972ED0510
-       mov       rcx,[r10]
+       mov       rcx,[7FF9736E0510]
        call      CORINFO_HELP_VALIDATE_INDIRECT_CALL
        mov       rax,rcx
        mov       rcx,[rsp]
-       mov       r11,r10
+       mov       r11,7FF972EA0510
        cmp       [rcx],ecx
        add       rsp,8
        jmp       rax
-; Total bytes of code 52
+; Total bytes of code 53

we get

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1526 (21H2)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.100-alpha.1.21558.2
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-DBANTR : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-BFNMJK : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

EnvironmentVariables=COMPlus_JitForceControlFlowGuard=1  
Method Job Toolchain Mean Error StdDev Median Ratio RatioSD Code Size
Call Job-DBANTR \Core_Root_base\corerun.exe 1.599 ns 0.0778 ns 0.2271 ns 1.510 ns 1.21 0.23 52 B
Call Job-BFNMJK \Core_Root_diff\corerun.exe 1.332 ns 0.0604 ns 0.1602 ns 1.288 ns 1.00 0.00 53 B

@jakobbotsch
Copy link
Member Author

Not totally sure if this change can replace the old ad-hoc optimization. A few observations:

  1. In large functions with many locals the new indir cell locals may not be tracked/may cause other locals to not be tracked, which lead to regressions. It will probably help to reuse the local.
  2. In tier0/debug codegen the new locals cause extra spilling, which is to be expected. Not sure if this is really an issue (this is the cause of the aspnet regressions, which seem to be in tier0 method contexts when spot checking)
  3. On ARM32 LSRA seems to be bad at dealing with the new locals (and new local when reusing it). Still have not been able to pinpoint one particular issue, it seems the new local leads to globally different register allocation which causes regressions.

@jakobbotsch
Copy link
Member Author

Closing this, don't think we can teach RA/liveness to deal well with the extra locals required for this, and the regressions seem too large to ignore.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JIT: Control-flow guard checked interface calls should not materialize cell addresses twice
3 participants